This page provides an overview of the GGML library (ggml/): its role as the tensor computation engine underlying llama.cpp, its major components, its compute model, and how hardware backends plug into it. This is an orientation document; deeper coverage is in the child pages: for ggml_tensor, ggml_context, ggml_cgraph, and the full op/type system, see GGML Core Architecture.

GGML is a C/C++ library that provides:

- Tensor and operation definitions, with each operation identified by one of the enum ggml_op values.
- A computation graph representation, ggml_cgraph, which is then dispatched to one or more backends for execution.
- A backend abstraction layer (ggml_backend_t, ggml_backend_buffer_type_t, etc.) that each hardware target implements, enabling the same graph to be executed on CPU, CUDA, Metal, Vulkan, and others.
- A graph memory allocator (ggml-alloc.c) that assigns tensor data into backend buffers while minimizing peak memory usage.

GGML is also its own standalone project (the ggml/ directory can be built independently), but within this repository it is consumed as a subproject by the llama.cpp inference library.
```
ggml/
├── include/
│   ├── ggml.h               # Core public API
│   ├── ggml-backend.h       # Backend abstraction API
│   ├── ggml-alloc.h         # Memory allocator API
│   ├── ggml-cpu.h           # CPU backend public header
│   ├── ggml-cuda.h          # CUDA backend public header
│   └── ...                  # Other backend headers
└── src/
    ├── ggml.c               # Core: context, tensor, graph, op dispatch
    ├── ggml.cpp             # C++ wrappers, std::terminate handler
    ├── ggml-alloc.c         # Tensor allocator (tallocr, galloc)
    ├── ggml-backend.cpp     # Backend buffer/device/scheduler impl
    ├── ggml-backend-reg.cpp # Backend registry (static + dynamic)
    ├── ggml-quants.c        # Quantize/dequantize routines
    ├── gguf.cpp             # GGUF file format read/write
    ├── ggml-cpu/            # CPU backend
    ├── ggml-cuda/           # CUDA backend
    ├── ggml-metal/          # Metal backend
    └── ...                  # Other backends
```
Sources: ggml/CMakeLists.txt312-332 ggml/src/CMakeLists.txt192-226
The build produces two core library targets plus one per backend.
Library dependency diagram
| Target | Sources | Role |
|---|---|---|
| ggml-base | ggml.c, ggml-alloc.c, ggml-backend.cpp, ggml-quants.c, gguf.cpp | Core tensors, ops, allocator, backend interface definitions |
| ggml | ggml-backend-reg.cpp, ggml-backend-dl.cpp | Global backend registry; optionally loads backends as shared libraries |
| ggml-cpu | ggml-cpu/ggml-cpu.c, ggml-cpu/ops.cpp, ggml-cpu/vec.cpp, … | CPU execution with SIMD dispatch |
| ggml-cuda | ggml-cuda/*.cu | NVIDIA GPU execution |
| ggml-metal | ggml-metal/ggml-metal.m | Apple Silicon GPU execution |
| ggml-vulkan | ggml-vulkan/ggml-vulkan.cpp | Cross-platform GPU via Vulkan |
| (others) | HIP, SYCL, OpenCL, CANN, RPC, BLAS, … | Additional hardware targets |
Sources: ggml/src/CMakeLists.txt192-226 ggml/src/CMakeLists.txt448-462
GGML uses a define-then-execute model. Creating a tensor operation does not immediately compute anything; it records a node in a graph. Execution is explicit.
Step-by-step compute flow
Key functions in this flow:
| Function | File | Purpose |
|---|---|---|
| ggml_init | ggml/src/ggml.c | Allocates a ggml_context from a caller-supplied memory block |
| ggml_new_tensor_*d | ggml/src/ggml.c | Allocates a ggml_tensor header in the context arena |
| ggml_build_forward_expand | ggml/src/ggml.c | Walks src[] links to topologically sort all nodes into ggml_cgraph.nodes[] |
| ggml_backend_sched_graph_compute | ggml/src/ggml-backend.cpp | Splits graph across backends, allocates buffers, dispatches ops |
| ggml_graph_compute | ggml/src/ggml.c | Single-backend execution without the scheduler |
| ggml_free | ggml/src/ggml.c | Releases the context arena |
Sources: ggml/include/ggml.h31-90 ggml/src/ggml.c1-50
The diagram below maps the principal GGML abstractions to their source locations.
GGML core data model
Sources: ggml/include/ggml.h222-231 ggml/src/ggml.c909-922 ggml/src/ggml.c609-905
Every ggml_tensor stores data in row-major order. The ne[4] field holds element counts per dimension; nb[4] holds byte strides. Non-contiguous views (created by ggml_permute, ggml_transpose, ggml_view_*) share the same data pointer as their source (view_src) but have different nb values.
```
ne[0] = innermost dimension size (fastest changing)
ne[1] = next dimension
ne[2], ne[3] = outer dimensions

Byte offset of element [i0, i1, i2, i3]:
    i0*nb[0] + i1*nb[1] + i2*nb[2] + i3*nb[3]
```
Sources: ggml/include/ggml.h112-148
GGML defines enum ggml_type covering floating-point, integer, and quantized formats. The behavior of each type is described by struct ggml_type_traits, accessed via ggml_get_type_traits(type).
| Type name | ggml_type constant | Block size | Quantized | Notes |
|---|---|---|---|---|
| f32 | GGML_TYPE_F32 | 1 | No | Default compute type |
| f16 | GGML_TYPE_F16 | 1 | No | IEEE half-precision |
| bf16 | GGML_TYPE_BF16 | 1 | No | Brain float |
| q4_0 | GGML_TYPE_Q4_0 | QK4_0 (32) | Yes | 4-bit, symmetric |
| q4_1 | GGML_TYPE_Q4_1 | QK4_1 (32) | Yes | 4-bit, asymmetric |
| q8_0 | GGML_TYPE_Q8_0 | QK8_0 (32) | Yes | 8-bit, used as activation type |
| q4_K | GGML_TYPE_Q4_K | QK_K (256) | Yes | K-quant, super-block |
| q6_K | GGML_TYPE_Q6_K | QK_K (256) | Yes | K-quant, super-block |
| iq2_xs | GGML_TYPE_IQ2_XS | QK_K (256) | Yes | imatrix-dependent |
| iq4_nl | GGML_TYPE_IQ4_NL | QK4_NL (32) | Yes | Non-linear 4-bit |
| i8/i16/i32/i64 | GGML_TYPE_I* | 1 | No | Integer types |
The full type traits table is at ggml/src/ggml.c609-899
For quantization format details, see Quantization in GGML.
GGML defines a four-layer abstraction for hardware. Each layer is represented by an opaque pointer type backed by a vtable (interface struct).
Backend abstraction layers
| Type | Interface struct | Key methods |
|---|---|---|
| ggml_backend_reg_t | ggml_backend_reg_i | get_name, dev_count, dev_get, get_proc_address |
| ggml_backend_dev_t | ggml_backend_dev_i | get_name, get_description, get_memory, backend_init, buffer_type |
| ggml_backend_buffer_type_t | ggml_backend_buffer_type_i | alloc_buffer, get_alignment, get_alloc_size, is_host |
| ggml_backend_buffer_t | ggml_backend_buffer_i | get_base, init_tensor, cpy_tensor, clear |
| ggml_backend_t | ggml_backend_i | get_name, graph_compute, graph_plan_compute, event_* |
Sources: ggml/include/ggml-backend.h1-81 ggml/src/ggml-backend-impl.h1-80 ggml/src/ggml-backend.cpp30-260
At program startup, ggml_backend_registry (a singleton in ggml-backend-reg.cpp) registers all compiled-in backends in priority order. The CPU backend is always registered last so GPU backends take precedence.
ggml/src/ggml-backend-reg.cpp107-160 shows the registration sequence:
```
CUDA → Metal → SYCL → Vulkan → WebGPU → zDNN → VirtGPU → OpenCL → ZenDNN
→ Hexagon → CANN → BLAS → RPC → CPU
```
When GGML_BACKEND_DL is enabled, backends can also be loaded from shared libraries at runtime from a directory specified by GGML_BACKEND_DIR. The ggml_add_backend_library CMake function in ggml/src/CMakeLists.txt247-292 controls whether a backend builds as a MODULE (dynamically loadable) or a static/shared library linked at build time.
GGML has two allocator layers:
- ggml_context arena — A fixed-size bump allocator initialized by ggml_init. It holds ggml_tensor header structs and ggml_cgraph objects. It does not hold tensor data when using backends (the no_alloc flag is set to true).
- ggml_gallocr (graph allocator, ggml-alloc.c) — Given a ggml_cgraph, it performs a liveness analysis to determine when each tensor's data is last used, then assigns data pointers into backend buffers with minimal peak memory. Operations flagged by ggml_op_can_inplace may reuse a source buffer's memory.
Sources: ggml/src/ggml-alloc.c1-90 ggml/include/ggml-alloc.h
When llama.cpp loads model weights onto a GPU but keeps some layers on CPU, the ggml_backend_sched_t scheduler (defined in ggml-backend.cpp) is responsible for:
- Assigning each ggml_cgraph node to a backend.
- Splitting the graph at backend boundaries and allocating buffers for the splits via ggml_gallocr.
- Copying tensors between backends where needed and dispatching each split as a separate graph_compute call.

Scheduler flow
Sources: ggml/src/ggml-backend.cpp260-700
The ggml_add_backend CMake function in ggml/src/CMakeLists.txt294-305 provides a uniform mechanism for enabling a backend via a CMake option (e.g. -DGGML_CUDA=ON). Each backend lives in its own subdirectory under ggml/src/ and calls ggml_add_backend_library to register itself.
The CPU backend supports multiple ISA variants compiled as separate shared libraries via ggml_add_cpu_backend_variant / ggml_add_cpu_backend_variant_impl. At runtime the backend loader selects the highest-scoring variant based on CPU feature detection. This is discussed further in CPU Backend Implementation.
Key CMake options for GGML:
| Option | Default | Effect |
|---|---|---|
| GGML_CPU | ON | Include CPU backend |
| GGML_CUDA | OFF | Include NVIDIA CUDA backend |
| GGML_METAL | ON (macOS) | Include Apple Metal backend |
| GGML_VULKAN | OFF | Include Vulkan backend |
| GGML_BACKEND_DL | OFF | Build backends as loadable shared modules |
| GGML_CPU_ALL_VARIANTS | OFF | Build all ISA variants of CPU backend |
| GGML_NATIVE | ON | Optimize for the current machine |
Sources: ggml/CMakeLists.txt82-270 ggml/src/CMakeLists.txt294-462
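For example, a configuration that enables the CUDA backend and builds backends as loadable modules might look like this (option names taken from the table above; exact flags and defaults may vary by version):

```shell
# Configure with the CUDA backend and dynamically loadable backend modules
cmake -B build -DGGML_CUDA=ON -DGGML_BACKEND_DL=ON

# Build everything: ggml-base, ggml, and each enabled backend target
cmake --build build --config Release
```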
GGML is the foundation that the rest of llama.cpp sits on. The inference engine (described in Core Library Architecture) uses GGML exclusively for:
- Allocating model weights in ggml_backend_buffer_t.
- Building compute graphs via ggml_build_forward_expand.
- Executing those graphs through ggml_backend_sched_t.
- Quantizing and dequantizing weights via ggml-quants.c.

The GGUF file format (gguf.cpp, covered in GGUF File Format) is also part of the ggml-base library target, since it handles the on-disk representation of the same tensor types GGML uses at runtime.