This page describes how llama.cpp constructs GGML computation graphs for neural network inference, covering the data structures, input management system, reusable component builders, per-architecture builders, and how the resulting graphs are scheduled across hardware backends.
For context on the GGML tensor operations that the graph nodes represent, see GGML Core Architecture. For how the KV cache provides memory context during graph construction, see Memory Management and KV Cache. For how the scheduler executes the completed graph, see Backend System and Registration.
Every call to llama_decode eventually produces a ggml_cgraph — a directed acyclic graph of tensor operations. Building that graph is the responsibility of the code in src/llama-graph.h, src/llama-graph.cpp, and the per-architecture model files under src/models/.
The graph captures a full forward pass: embedding lookup → transformer blocks (norm → attention → norm → FFN) → output projection → logit gather. Once built, it is handed to ggml_backend_sched which assigns operations to the appropriate hardware backend and executes them.
Sources: src/llama-context.cpp, src/llama-graph.h
The graph building subsystem revolves around four key types.
| Type | File | Purpose |
|---|---|---|
llm_graph_params | src/llama-graph.h | Input parameters: current llama_ubatch, llama_cparams, memory context pointer (mctx), n_outputs |
llm_graph_result | src/llama-graph.h | Holds the built ggml_cgraph and a list of llm_graph_input_i objects; reused across batches when possible |
llm_graph_context | src/llama-graph.h | Base class for all model builders; owns the ggml_context *, provides shared component builders |
llm_graph_input_i | src/llama-graph.h | Abstract interface for uploading batch-specific data into graph tensors |
The llama_context keeps two llm_graph_result instances:
- gf_res_prev — the most recently used graph result
- gf_res_reserve — used during the worst-case reservation pass

Sources: src/llama-graph.h, src/llama-context.h
Graph inputs are tensors that change per batch but whose shapes may stay constant across batches. The llm_graph_input_i interface separates:
- set_input(ubatch) — copies batch data into the tensor (always called)
- can_reuse(params) — returns true if tensor shapes still match, allowing graph reuse

Sources: src/llama-graph.h, src/llama-graph.cpp
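The two-method split can be sketched in plain C++. This is a hypothetical stand-in, not the real llama.cpp interface: ubatch_sketch, params_sketch, and the vector standing in for the GGML tensor are all illustrative.

```cpp
#include <cstdint>
#include <vector>

struct ubatch_sketch  { std::vector<int32_t> tokens; }; // stand-in for llama_ubatch
struct params_sketch  { int64_t n_tokens; };            // stand-in for llm_graph_params

struct graph_input_i {
    virtual ~graph_input_i() = default;
    // always called: copy batch data into the (already allocated) tensor
    virtual void set_input(const ubatch_sketch & ub) = 0;
    // return true if tensor shapes still match -> the graph can be reused
    virtual bool can_reuse(const params_sketch &) { return false; }
};

struct input_embd_sketch : graph_input_i {
    std::vector<int32_t> tensor_data; // stands in for the GGML tensor
    void set_input(const ubatch_sketch & ub) override {
        tensor_data = ub.tokens;      // upload token IDs
    }
    bool can_reuse(const params_sketch & p) override {
        return (int64_t) tensor_data.size() == p.n_tokens; // same shape -> reuse
    }
};
```

The default can_reuse() returning false matches the conservative behaviour described below: inputs that do not implement a reuse check force a rebuild.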
| Class | Provides | Used By |
|---|---|---|
llm_graph_input_embd | Token IDs or raw float embeddings | All models |
llm_graph_input_pos | Position IDs, including 4D M-RoPE layout | Transformer models with RoPE |
llm_graph_input_attn_temp | Per-token attention temperature scale | Llama4 |
llm_graph_input_out_ids | Indices of positions to emit logits for | All models (avoids computing unused positions) |
llm_graph_input_attn_no_cache | Full n_tokens × n_tokens causal + SWA mask | Models run without KV cache |
llm_graph_input_attn_kv | KV slot indices and n_tokens × n_kv mask | Transformer models with KV cache |
llm_graph_input_rs | Recurrent state copy map for SSM models | Mamba, RWKV |
llm_graph_input_cross_embd | Encoder output embeddings | Encoder-decoder (T5, multimodal) |
llm_graph_input_mean | Mean-pooling weights | Embedding models (BERT-style) |
llm_graph_input_cls | Row index of CLS/last token | Embedding models |
llm_graph_input_pos_bucket | Relative position bucket matrix | T5-style relative attention |
The no-cache mask (llm_graph_input_attn_no_cache::set_input) fills the n_tokens × n_tokens tensor with -INFINITY, then sets allowed positions to 0.0f, respecting:

- causality: positions with p0 > p1 are masked when causal_attn = true
- sliding-window limits (llama_hparams::is_masked_swa)

A separate SWA mask tensor (self_kq_mask_swa) is populated when hparams.swa_type != LLAMA_SWA_TYPE_NONE.
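The mask fill can be sketched as a standalone function, assuming a simple fixed-size sliding window; the real implementation writes into a GGML tensor and takes the window shape from llama_hparams::is_masked_swa.

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch: build an n x n attention mask from per-token positions.
// n_swa == 0 means no sliding-window restriction.
std::vector<float> build_mask(const std::vector<int> & pos,
                              bool causal, int n_swa) {
    const size_t n = pos.size();
    std::vector<float> mask(n * n, -INFINITY);       // start fully masked
    for (size_t i = 0; i < n; ++i) {                 // query row (position p1)
        for (size_t j = 0; j < n; ++j) {             // key column (position p0)
            if (causal && pos[j] > pos[i])           continue; // future masked
            if (n_swa > 0 && pos[i] - pos[j] >= n_swa) continue; // outside window
            mask[i*n + j] = 0.0f;                    // allowed position
        }
    }
    return mask;
}
```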
For the KV cache case, the mask is computed by llama_kv_cache_context::set_input_kq_mask, which also writes the K/V slot indices.
llm_graph_context component builders are parameterised by enums that select variants at graph-build time.
Normalization type (llm_norm_type):
| Value | Operation |
|---|---|
LLM_NORM | Standard layer normalization |
LLM_NORM_RMS | RMS normalization (Llama, Mistral, …) |
LLM_NORM_GROUP | Group normalization (WavTokenizer) |
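The LLM_NORM / LLM_NORM_RMS distinction can be illustrated in plain C++ (the real ops are ggml_norm and ggml_rms_norm; these scalar-loop versions are only a sketch, and omit the learned weight/bias applied afterwards):

```cpp
#include <cmath>
#include <vector>

// RMS norm: scale by the root-mean-square only, no mean subtraction
std::vector<float> rms_norm(std::vector<float> x, float eps = 1e-5f) {
    float ss = 0.0f;
    for (float v : x) ss += v * v;
    const float scale = 1.0f / std::sqrt(ss / x.size() + eps);
    for (float & v : x) v *= scale;
    return x;
}

// Standard layer norm: centre on the mean, then scale by the std deviation
std::vector<float> layer_norm(std::vector<float> x, float eps = 1e-5f) {
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= x.size();
    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= x.size();
    const float scale = 1.0f / std::sqrt(var + eps);
    for (float & v : x) v = (v - mean) * scale;
    return x;
}
```

Skipping the mean subtraction is what makes RMS norm cheaper, which is why Llama-family models use it throughout.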
FFN activation (llm_ffn_op_type):
| Value | Activation |
|---|---|
LLM_FFN_SILU | SiLU (Llama 2+) |
LLM_FFN_GELU | GELU (GPT-NeoX, Falcon, …) |
LLM_FFN_RELU | ReLU |
LLM_FFN_RELU_SQR | Squared ReLU |
LLM_FFN_SWIGLU | SwiGLU gate |
LLM_FFN_GEGLU | GeGLU gate |
LLM_FFN_REGLU | ReGLU gate |
LLM_FFN_SWIGLU_OAI_MOE | SwiGLU variant for OpenAI-style MoE |
FFN gate layout (llm_ffn_gate_type):
| Value | Meaning |
|---|---|
LLM_FFN_SEQ | Gate applied sequentially after up projection |
LLM_FFN_PAR | Gate computed in parallel with up projection |
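The two gate layouts can be sketched on scalar "projections" with a SiLU gate (the real code uses matmuls over weight tensors; all names here are illustrative):

```cpp
#include <cmath>

float silu(float x) { return x / (1.0f + std::exp(-x)); }

// LLM_FFN_PAR: gate and up projections see the same input,
// and their outputs are multiplied elementwise (SwiGLU-style)
float ffn_par(float x, float w_gate, float w_up, float w_down) {
    return w_down * (silu(w_gate * x) * (w_up * x));
}

// LLM_FFN_SEQ: the gate is applied to the result of the up projection
float ffn_seq(float x, float w_gate, float w_up, float w_down) {
    return w_down * silu(w_gate * (w_up * x));
}
```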
Graph type (llm_graph_type):
| Value | Use |
|---|---|
LLM_GRAPH_TYPE_DEFAULT | Standard decoder-only forward pass |
LLM_GRAPH_TYPE_ENCODER | Encoder pass (T5, multimodal) |
LLM_GRAPH_TYPE_DECODER | Decoder pass in encoder-decoder models |
llm_graph_context is the base class for all per-architecture builders. It provides shared building-block methods that operate on a ggml_context * and produce ggml_tensor * outputs.
Typical builder signature pattern:
build_norm(cur, weight, bias, norm_type, eps, il)
→ applies LLM_NORM / LLM_NORM_RMS / LLM_NORM_GROUP
build_ffn(cur, up_w, up_b, gate_w, gate_b, down_w, down_b,
act_scales, ffn_op, gate_type, scale, il)
→ produces FFN output tensor
build_attn(inp, hparams, cparams, q, k, v, wo, wo_b, kq_scale, il)
→ handles QKV projection, RoPE, KV cache interaction,
and either ggml_flash_attn_ext or manual KQV matmul
build_inp_embd(hparams, cparams, ubatch)
→ creates llm_graph_input_embd, returns embedding tensor
build_inp_pos(hparams, n_pos_per_embd)
→ creates llm_graph_input_pos, returns position tensor
build_inp_out_ids()
→ creates llm_graph_input_out_ids, returns index tensor
Concrete model builders inherit from llm_graph_context (and sometimes from an intermediate base), call these methods inside their build() implementation, and can add architecture-specific operations directly via the GGML API.
Sources: src/llama-graph.h, src/models/models.h
The following shows the operation sequence for a single transformer block, as built by a Llama-family model:
Sources: src/llama-graph.h, src/llama-graph.cpp
Each supported model architecture is implemented in a file under src/models/. The model implements a concrete subclass of llm_graph_context (sometimes with an intermediate base class for shared logic across related architectures).
Sources: src/models/models.h, src/CMakeLists.txt
For state-space models (Mamba, RWKV), llm_build_mamba_base provides build_mamba_layer() and build_mamba2_layer(). These use GGML ops ggml_ssm_conv and ggml_ssm_scan instead of attention. The recurrent state is managed through llm_graph_input_rs, which carries the s_copy tensor for state propagation across sequences.
MoE models (Mixtral, DeepSeek, Qwen2-MoE, etc.) build an expert routing sub-graph per layer. The typical sequence:
1. ggml_mul_mat(ffn_gate_inp, cur) → logits
2. ggml_top_k(logits, n_expert_used)
3. ggml_mul_mat_id(expert_weights, cur, expert_ids) — a batched gather-matmul that selects rows from a stacked expert weight tensor

Encoder-decoder models (T5, multimodal) use llama_cross to pass encoder outputs to the decoder graph builder. The struct holds v_embd (encoder outputs copied to host) and seq_ids_enc (for masking). The graph input llm_graph_input_cross_embd uploads v_embd into a graph tensor at each decode step.
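The MoE routing sequence can be sketched in plain C++, with each expert FFN stubbed as a scalar multiplier; top_k and moe_mix are hypothetical stand-ins for ggml_top_k and the ggml_mul_mat_id gather-matmul plus weighted sum.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// indices of the k largest routing logits, best first
std::vector<int> top_k(const std::vector<float> & logits, int k) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
        [&](int a, int b) { return logits[a] > logits[b]; });
    idx.resize(k);
    return idx;
}

// route x through the top-k experts, softmax-weighted by their logits
float moe_mix(float x, const std::vector<float> & logits,
              const std::vector<float> & experts, int k) {
    const auto ids = top_k(logits, k);
    float denom = 0.0f;
    for (int i : ids) denom += std::exp(logits[i]); // softmax over selected only
    float out = 0.0f;
    for (int i : ids) out += std::exp(logits[i]) / denom * (experts[i] * x);
    return out;
}
```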
Sources: src/llama-graph.h 59-73, src/llama-graph.cpp 316-324
Before any inference runs, llama_context performs a reservation pass via sched_reserve(). This builds worst-case graphs and uses them to pre-allocate all compute buffers.
The PP (prompt processing) reservation is done first because it has the largest batch size and therefore drives buffer allocation. The scheduler then knows the maximum buffer size needed for any graph, including the smaller TG (token generation) graph.
When flash_attn_type == LLAMA_FLASH_ATTN_TYPE_AUTO, sched_reserve() builds a probe graph, inspects every GGML_OP_FLASH_ATTN_EXT node, and verifies that the scheduler assigned it to the same device as the corresponding KV cache. If any mismatch is found, Flash Attention is disabled.
Rebuilding the GGML graph from scratch on every decode step would be costly. The reuse system avoids this when the graph structure (tensor shapes and op connectivity) does not change.
The llama_context stores gf_res_prev. At the start of each decode:
1. llm_graph_input_i::can_reuse(params) checks are run for every input.
2. If all pass, the previous ggml_cgraph is reused; only set_input() is called to update data.
3. If any check fails (e.g. n_tokens changed, n_kv changed), a new graph is built and the result replaces gf_res_prev.

can_reuse() logic by input type:
| Input Class | Reuse Condition |
|---|---|
llm_graph_input_embd | Same token count (or embedding count) |
llm_graph_input_pos | pos->ne[0] == n_tokens * n_pos_per_embd |
llm_graph_input_out_ids | n_outputs unchanged |
llm_graph_input_attn_kv | n_kv, n_tokens, n_stream all unchanged |
llm_graph_input_rs | n_rs, n_seqs, head, rs_z unchanged |
llm_graph_input_i (default) | Never reused |
Set LLAMA_GRAPH_REUSE_DISABLE=1 to force a graph rebuild on every step (useful for debugging).
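The all-or-nothing reuse decision can be sketched with a reduced parameter set; params_s and the two condition functions are illustrative stand-ins for the per-input-class checks in the table.

```cpp
struct params_s { int n_tokens; int n_kv; };

// per-input reuse conditions (illustrative subset of the table above)
bool embd_can_reuse(const params_s & prev, const params_s & cur) {
    return prev.n_tokens == cur.n_tokens;           // same token count
}
bool attn_kv_can_reuse(const params_s & prev, const params_s & cur) {
    return prev.n_kv == cur.n_kv &&                 // same KV extent
           prev.n_tokens == cur.n_tokens;
}

// every input must agree, otherwise the whole graph is rebuilt
bool graph_can_reuse(const params_s & prev, const params_s & cur) {
    return embd_can_reuse(prev, cur) && attn_kv_can_reuse(prev, cur);
}
```

This is why single-token decoding with a growing KV cache cannot reuse graphs naively: n_kv changes whenever the cache view grows past its padded size.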
Sources: src/llama-graph.cpp, src/llama-context.cpp
Once a ggml_cgraph is built, ggml_backend_sched assigns each operation to a backend based on the buffer types of its weight tensors. Tensors stored in GPU buffers cause their consuming operations to be assigned to the GPU backend; CPU-hosted tensors run on CPU.
The graph may be split into multiple splits — contiguous subsets of nodes that run on a single backend — with data transfers between splits handled automatically.
Key scheduler metrics logged during reservation:
n_splits_pp = ggml_backend_sched_get_n_splits(sched) // for PP graph
n_nodes_pp = ggml_graph_n_nodes(gf)
n_splits_tg = ... // for TG graph
n_nodes_tg = ...
For details on how backends are selected and how multi-GPU tensor splitting works, see Multi-GPU and Distributed Inference and Backend System and Registration.
At model load time, llama_model determines which buffer type to use for each tensor by probing whether the desired operation is supported on the target device. The function weight_buft_supported() builds a small test tensor and calls ggml_backend_dev_supports_op().
The ops probed include GGML_OP_MUL_MAT, GGML_OP_MUL_MAT_ID, GGML_OP_GET_ROWS, GGML_OP_ROPE, GGML_OP_SSM_SCAN, and others. This pre-selection ensures that when the computation graph is scheduled, every tensor is already in a buffer type supported by the backend that will run its operation.
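The pre-selection loop can be sketched as a first-supported-wins scan over candidate buffer types; buft, pick_buft, and the supports_op callback are hypothetical stand-ins for the real weight_buft_supported() / ggml_backend_dev_supports_op() machinery.

```cpp
#include <functional>
#include <string>
#include <vector>

struct buft { std::string name; bool on_gpu; }; // stand-in for a buffer type

std::string pick_buft(const std::vector<buft> & candidates,
                      const std::function<bool(const buft &)> & supports_op) {
    for (const auto & b : candidates) {
        if (supports_op(b)) return b.name;  // first supported candidate wins
    }
    return "cpu";                           // fall back to host memory
}
```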