This page provides an overview of the GGML library (ggml/): its role as the tensor computation engine underlying llama.cpp, its major components, its compute model, and how hardware backends plug into it. This is an orientation document; deeper coverage is in the child pages: for ggml_tensor, ggml_context, ggml_cgraph, and the full op/type system, see GGML Core Architecture.

GGML is a C/C++ library that provides:

- Tensor and operation definitions, with each operation identified by one of the enum ggml_op values.
- A computation graph representation, ggml_cgraph, which is then dispatched to one or more backends for execution.
- A backend abstraction layer (ggml_backend_t, ggml_backend_buffer_type_t, etc.) that each hardware target implements, enabling the same graph to be executed on CPU, CUDA, Metal, Vulkan, and others.
- A graph memory allocator (ggml-alloc.c) that assigns tensor data into backend buffers while minimizing peak memory usage.

GGML is also its own standalone project (the ggml/ directory can be built independently), but within this repository it is consumed as a subproject by the llama.cpp inference library.
```
ggml/
├── include/
│   ├── ggml.h               # Core public API
│   ├── ggml-backend.h       # Backend abstraction API
│   ├── ggml-alloc.h         # Memory allocator API
│   ├── ggml-cpu.h           # CPU backend public header
│   ├── ggml-cuda.h          # CUDA backend public header
│   └── ...                  # Other backend headers
└── src/
    ├── ggml.c               # Core: context, tensor, graph, op dispatch
    ├── ggml.cpp             # C++ wrappers, std::terminate handler
    ├── ggml-alloc.c         # Tensor allocator (tallocr, galloc)
    ├── ggml-backend.cpp     # Backend buffer/device/scheduler impl
    ├── ggml-backend-reg.cpp # Backend registry (static + dynamic)
    ├── ggml-quants.c        # Quantize/dequantize routines
    ├── gguf.cpp             # GGUF file format read/write
    ├── ggml-cpu/            # CPU backend
    ├── ggml-cuda/           # CUDA backend
    ├── ggml-metal/          # Metal backend
    └── ...                  # Other backends
```
Sources: ggml/CMakeLists.txt312-332 ggml/src/CMakeLists.txt192-226
The build produces two core library targets plus one per backend.
Library dependency diagram
| Target | Sources | Role |
|---|---|---|
| ggml-base | ggml.c, ggml-alloc.c, ggml-backend.cpp, ggml-quants.c, gguf.cpp | Core tensors, ops, allocator, backend interface definitions |
| ggml | ggml-backend-reg.cpp, ggml-backend-dl.cpp | Global backend registry; optionally loads backends as shared libraries |
| ggml-cpu | ggml-cpu/ggml-cpu.c, ggml-cpu/ops.cpp, ggml-cpu/vec.cpp, … | CPU execution with SIMD dispatch |
| ggml-cuda | ggml-cuda/*.cu | NVIDIA GPU execution |
| ggml-metal | ggml-metal/ggml-metal.m | Apple Silicon GPU execution |
| ggml-vulkan | ggml-vulkan/ggml-vulkan.cpp | Cross-platform GPU via Vulkan |
| (others) | HIP, SYCL, OpenCL, CANN, RPC, BLAS, … | Additional hardware targets |
Sources: ggml/src/CMakeLists.txt192-226 ggml/src/CMakeLists.txt448-462
GGML uses a define-then-execute model. Creating a tensor operation does not immediately compute anything; it records a node in a graph. Execution is explicit.
Step-by-step compute flow
Key functions in this flow:
| Function | File | Purpose |
|---|---|---|
| ggml_init | ggml/src/ggml.c | Allocates a ggml_context from a caller-supplied memory block |
| ggml_new_tensor_*d | ggml/src/ggml.c | Allocates a ggml_tensor header in the context arena |
| ggml_build_forward_expand | ggml/src/ggml.c | Walks src[] links to topologically sort all nodes into ggml_cgraph.nodes[] |
| ggml_backend_sched_graph_compute | ggml/src/ggml-backend.cpp | Splits graph across backends, allocates buffers, dispatches ops |
| ggml_graph_compute | ggml/src/ggml.c | Single-backend execution without the scheduler |
| ggml_free | ggml/src/ggml.c | Releases the context arena |
Sources: ggml/include/ggml.h31-90 ggml/src/ggml.c1-50
The diagram below maps the principal GGML abstractions to their source locations.
GGML core data model
Sources: ggml/include/ggml.h222-231 ggml/src/ggml.c909-922 ggml/src/ggml.c609-905
Every ggml_tensor stores data in row-major order. The ne[4] field holds element counts per dimension; nb[4] holds byte strides. Non-contiguous views (created by ggml_permute, ggml_transpose, ggml_view_*) share the same data pointer as their source (view_src) but have different nb values.
```
ne[0] = innermost dimension size (fastest changing)
ne[1] = next dimension
ne[2], ne[3] = outer dimensions

Byte offset of element [i0, i1, i2, i3]:
    i0*nb[0] + i1*nb[1] + i2*nb[2] + i3*nb[3]
```
Sources: ggml/include/ggml.h112-148
GGML defines enum ggml_type covering floating-point, integer, and quantized formats. The behavior of each type is described by struct ggml_type_traits, accessed via ggml_get_type_traits(type).
| Type name | ggml_type constant | Block size | Quantized | Notes |
|---|---|---|---|---|
| f32 | GGML_TYPE_F32 | 1 | No | Default compute type |
| f16 | GGML_TYPE_F16 | 1 | No | IEEE half-precision |
| bf16 | GGML_TYPE_BF16 | 1 | No | Brain float |
| q4_0 | GGML_TYPE_Q4_0 | QK4_0 (32) | Yes | 4-bit, symmetric |
| q4_1 | GGML_TYPE_Q4_1 | QK4_1 (32) | Yes | 4-bit, asymmetric |
| q8_0 | GGML_TYPE_Q8_0 | QK8_0 (32) | Yes | 8-bit, used as activation type |
| q4_K | GGML_TYPE_Q4_K | QK_K (256) | Yes | K-quant, super-block |
| q6_K | GGML_TYPE_Q6_K | QK_K (256) | Yes | K-quant, super-block |
| iq2_xs | GGML_TYPE_IQ2_XS | QK_K (256) | Yes | imatrix-dependent |
| iq4_nl | GGML_TYPE_IQ4_NL | QK4_NL (32) | Yes | Non-linear 4-bit |
| i8/i16/i32/i64 | GGML_TYPE_I* | 1 | No | Integer types |
The full type traits table is at ggml/src/ggml.c609-899
For quantization format details, see Quantization in GGML.
GGML defines a four-layer abstraction for hardware. Each layer is represented by an opaque pointer type backed by a vtable (interface struct).
Backend abstraction layers
| Type | Interface struct | Key methods |
|---|---|---|
| ggml_backend_reg_t | ggml_backend_reg_i | get_name, dev_count, dev_get, get_proc_address |
| ggml_backend_dev_t | ggml_backend_dev_i | get_name, get_description, get_memory, backend_init, buffer_type |
| ggml_backend_buffer_type_t | ggml_backend_buffer_type_i | alloc_buffer, get_alignment, get_alloc_size, is_host |
| ggml_backend_buffer_t | ggml_backend_buffer_i | get_base, init_tensor, cpy_tensor, clear |
| ggml_backend_t | ggml_backend_i | get_name, graph_compute, graph_plan_compute, event_* |
Sources: ggml/include/ggml-backend.h1-81 ggml/src/ggml-backend-impl.h1-80 ggml/src/ggml-backend.cpp30-260
At program startup, ggml_backend_registry (a singleton in ggml-backend-reg.cpp) registers all compiled-in backends in priority order. The CPU backend is always registered last so GPU backends take precedence.
ggml/src/ggml-backend-reg.cpp107-160 shows the registration sequence:
```
CUDA → Metal → SYCL → Vulkan → WebGPU → zDNN → VirtGPU → OpenCL → ZenDNN
→ Hexagon → CANN → BLAS → RPC → CPU
```
When GGML_BACKEND_DL is enabled, backends can also be loaded from shared libraries at runtime from a directory specified by GGML_BACKEND_DIR. The ggml_add_backend_library CMake function in ggml/src/CMakeLists.txt247-292 controls whether a backend builds as a MODULE (dynamically loadable) or a static/shared library linked at build time.
GGML has two allocator layers:
- ggml_context arena — A fixed-size bump allocator initialized by ggml_init. It holds ggml_tensor header structs and ggml_cgraph objects. It does not hold tensor data when using backends (the no_alloc flag is set to true).
- ggml_gallocr (graph allocator, ggml-alloc.c) — Given a ggml_cgraph, it performs a liveness analysis to determine when each tensor's data is last used, then assigns data pointers into backend buffers with minimal peak memory. Operations flagged by ggml_op_can_inplace may reuse a source buffer's memory.
Sources: ggml/src/ggml-alloc.c1-90 ggml/include/ggml-alloc.h
When llama.cpp loads model weights onto a GPU but keeps some layers on CPU, the ggml_backend_sched_t scheduler (defined in ggml-backend.cpp) is responsible for:
- Assigning each ggml_cgraph node to a backend.
- Splitting the graph at backend boundaries and allocating buffers for the splits via ggml_gallocr.
- Copying tensors between backends where needed and dispatching each split as a separate graph_compute call.

Scheduler flow
Sources: ggml/src/ggml-backend.cpp260-700
The ggml_add_backend CMake function in ggml/src/CMakeLists.txt294-305 provides a uniform mechanism for enabling a backend via a CMake option (e.g. -DGGML_CUDA=ON). Each backend lives in its own subdirectory under ggml/src/ and calls ggml_add_backend_library to register itself.
The CPU backend supports multiple ISA variants compiled as separate shared libraries via ggml_add_cpu_backend_variant / ggml_add_cpu_backend_variant_impl. At runtime the backend loader selects the highest-scoring variant based on CPU feature detection. This is discussed further in CPU Backend Implementation.
Key CMake options for GGML:
| Option | Default | Effect |
|---|---|---|
| GGML_CPU | ON | Include CPU backend |
| GGML_CUDA | OFF | Include NVIDIA CUDA backend |
| GGML_METAL | ON (macOS) | Include Apple Metal backend |
| GGML_VULKAN | OFF | Include Vulkan backend |
| GGML_BACKEND_DL | OFF | Build backends as loadable shared modules |
| GGML_CPU_ALL_VARIANTS | OFF | Build all ISA variants of CPU backend |
| GGML_NATIVE | ON | Optimize for the current machine |
Sources: ggml/CMakeLists.txt82-270 ggml/src/CMakeLists.txt294-462
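For example, a configuration that enables the CUDA backend and builds backends as loadable modules might look like this (option names taken from the table above; exact flags and defaults may vary by version):

```shell
# Configure with the CUDA backend and dynamically loadable backend modules
cmake -B build -DGGML_CUDA=ON -DGGML_BACKEND_DL=ON

# Build everything: ggml-base, ggml, and each enabled backend target
cmake --build build --config Release
```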
GGML is the foundation that the rest of llama.cpp sits on. The inference engine (described in Core Library Architecture) uses GGML exclusively for:
- Allocating model weights in ggml_backend_buffer_t.
- Building compute graphs via ggml_build_forward_expand.
- Executing those graphs through ggml_backend_sched_t.
- Quantizing and dequantizing weights via ggml-quants.c.

The GGUF file format (gguf.cpp, covered in GGUF File Format) is also part of the ggml-base library target, since it handles the on-disk representation of the same tensor types GGML uses at runtime.