This page describes how llama.cpp constructs GGML computation graphs for neural network inference, covering the data structures, input management system, reusable component builders, per-architecture builders, and how the resulting graphs are scheduled across hardware backends.
For context on the GGML tensor operations that the graph nodes represent, see GGML Core Architecture. For how the KV cache provides memory context during graph construction, see Memory Management and KV Cache. For how the scheduler executes the completed graph, see Backend System and Registration.
Every call to llama_decode eventually produces a ggml_cgraph — a directed acyclic graph of tensor operations. Building that graph is the responsibility of the code in src/llama-graph.h, src/llama-graph.cpp, and the per-architecture model files under src/models/.
The graph captures a full forward pass: embedding lookup → transformer blocks (norm → attention → norm → FFN) → output projection → logit gather. Once built, it is handed to ggml_backend_sched which assigns operations to the appropriate hardware backend and executes them.
Sources: src/llama-context.cpp, src/llama-graph.h
The graph building subsystem revolves around four key types.
| Type | File | Purpose |
|---|---|---|
llm_graph_params | src/llama-graph.h | Input parameters: current llama_ubatch, llama_cparams, memory context pointer (mctx), n_outputs |
llm_graph_result | src/llama-graph.h | Holds the built ggml_cgraph and a list of llm_graph_input_i objects; reused across batches when possible |
llm_graph_context | src/llama-graph.h | Base class for all model builders; owns the ggml_context *, provides shared component builders |
llm_graph_input_i | src/llama-graph.h | Abstract interface for uploading batch-specific data into graph tensors |
The llama_context keeps two llm_graph_result instances:
- gf_res_prev — the most recently used graph result
- gf_res_reserve — used during the worst-case reservation pass

Sources: src/llama-graph.h, src/llama-context.h
Graph inputs are tensors that change per batch but whose shapes may stay constant across batches. The llm_graph_input_i interface separates:
- set_input(ubatch) — copies batch data into the tensor (always called)
- can_reuse(params) — returns true if tensor shapes still match, allowing graph reuse

Sources: src/llama-graph.h, src/llama-graph.cpp
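The two-method split can be sketched in plain C++. This is a hypothetical stand-in, not the real llama.cpp interface: ubatch_sketch, params_sketch, and the vector standing in for the GGML tensor are all illustrative.

```cpp
#include <cstdint>
#include <vector>

struct ubatch_sketch  { std::vector<int32_t> tokens; }; // stand-in for llama_ubatch
struct params_sketch  { int64_t n_tokens; };            // stand-in for llm_graph_params

struct graph_input_i {
    virtual ~graph_input_i() = default;
    // always called: copy batch data into the (already allocated) tensor
    virtual void set_input(const ubatch_sketch & ub) = 0;
    // return true if tensor shapes still match -> the graph can be reused
    virtual bool can_reuse(const params_sketch &) { return false; }
};

struct input_embd_sketch : graph_input_i {
    std::vector<int32_t> tensor_data; // stands in for the GGML tensor
    void set_input(const ubatch_sketch & ub) override {
        tensor_data = ub.tokens;      // upload token IDs
    }
    bool can_reuse(const params_sketch & p) override {
        return (int64_t) tensor_data.size() == p.n_tokens; // same shape -> reuse
    }
};
```

The default can_reuse() returning false matches the conservative behaviour described below: inputs that do not implement a reuse check force a rebuild.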
| Class | Provides | Used By |
|---|---|---|
llm_graph_input_embd | Token IDs or raw float embeddings | All models |
llm_graph_input_pos | Position IDs, including 4D M-RoPE layout | Transformer models with RoPE |
llm_graph_input_attn_temp | Per-token attention temperature scale | Llama4 |
llm_graph_input_out_ids | Indices of positions to emit logits for | All models (avoids computing unused positions) |
llm_graph_input_attn_no_cache | Full n_tokens × n_tokens causal + SWA mask | Models run without KV cache |
llm_graph_input_attn_kv | KV slot indices and n_tokens × n_kv mask | Transformer models with KV cache |
llm_graph_input_rs | Recurrent state copy map for SSM models | Mamba, RWKV |
llm_graph_input_cross_embd | Encoder output embeddings | Encoder-decoder (T5, multimodal) |
llm_graph_input_mean | Mean-pooling weights | Embedding models (BERT-style) |
llm_graph_input_cls | Row index of CLS/last token | Embedding models |
llm_graph_input_pos_bucket | Relative position bucket matrix | T5-style relative attention |
The no-cache mask (llm_graph_input_attn_no_cache::set_input) fills the n_tokens × n_tokens tensor with -INFINITY, then sets allowed positions to 0.0f, respecting:

- causality: positions with p0 > p1 are masked when causal_attn = true
- sliding-window limits (llama_hparams::is_masked_swa)

A separate SWA mask tensor (self_kq_mask_swa) is populated when hparams.swa_type != LLAMA_SWA_TYPE_NONE.
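The mask fill can be sketched as a standalone function, assuming a simple fixed-size sliding window; the real implementation writes into a GGML tensor and takes the window shape from llama_hparams::is_masked_swa.

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch: build an n x n attention mask from per-token positions.
// n_swa == 0 means no sliding-window restriction.
std::vector<float> build_mask(const std::vector<int> & pos,
                              bool causal, int n_swa) {
    const size_t n = pos.size();
    std::vector<float> mask(n * n, -INFINITY);       // start fully masked
    for (size_t i = 0; i < n; ++i) {                 // query row (position p1)
        for (size_t j = 0; j < n; ++j) {             // key column (position p0)
            if (causal && pos[j] > pos[i])           continue; // future masked
            if (n_swa > 0 && pos[i] - pos[j] >= n_swa) continue; // outside window
            mask[i*n + j] = 0.0f;                    // allowed position
        }
    }
    return mask;
}
```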
For the KV cache case, the mask is computed by llama_kv_cache_context::set_input_kq_mask, which also writes the K/V slot indices.
llm_graph_context component builders are parameterised by enums that select variants at graph-build time.
Normalization type (llm_norm_type):
| Value | Operation |
|---|---|
LLM_NORM | Standard layer normalization |
LLM_NORM_RMS | RMS normalization (Llama, Mistral, …) |
LLM_NORM_GROUP | Group normalization (WavTokenizer) |
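The LLM_NORM / LLM_NORM_RMS distinction can be illustrated in plain C++ (the real ops are ggml_norm and ggml_rms_norm; these scalar-loop versions are only a sketch, and omit the learned weight/bias applied afterwards):

```cpp
#include <cmath>
#include <vector>

// RMS norm: scale by the root-mean-square only, no mean subtraction
std::vector<float> rms_norm(std::vector<float> x, float eps = 1e-5f) {
    float ss = 0.0f;
    for (float v : x) ss += v * v;
    const float scale = 1.0f / std::sqrt(ss / x.size() + eps);
    for (float & v : x) v *= scale;
    return x;
}

// Standard layer norm: centre on the mean, then scale by the std deviation
std::vector<float> layer_norm(std::vector<float> x, float eps = 1e-5f) {
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= x.size();
    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= x.size();
    const float scale = 1.0f / std::sqrt(var + eps);
    for (float & v : x) v = (v - mean) * scale;
    return x;
}
```

Skipping the mean subtraction is what makes RMS norm cheaper, which is why Llama-family models use it throughout.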
FFN activation (llm_ffn_op_type):
| Value | Activation |
|---|---|
LLM_FFN_SILU | SiLU (Llama 2+) |
LLM_FFN_GELU | GELU (GPT-NeoX, Falcon, …) |
LLM_FFN_RELU | ReLU |
LLM_FFN_RELU_SQR | Squared ReLU |
LLM_FFN_SWIGLU | SwiGLU gate |
LLM_FFN_GEGLU | GeGLU gate |
LLM_FFN_REGLU | ReGLU gate |
LLM_FFN_SWIGLU_OAI_MOE | SwiGLU variant for OpenAI-style MoE |
FFN gate layout (llm_ffn_gate_type):
| Value | Meaning |
|---|---|
LLM_FFN_SEQ | Gate applied sequentially after up projection |
LLM_FFN_PAR | Gate computed in parallel with up projection |
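The two gate layouts can be sketched on scalar "projections" with a SiLU gate (the real code uses matmuls over weight tensors; all names here are illustrative):

```cpp
#include <cmath>

float silu(float x) { return x / (1.0f + std::exp(-x)); }

// LLM_FFN_PAR: gate and up projections see the same input,
// and their outputs are multiplied elementwise (SwiGLU-style)
float ffn_par(float x, float w_gate, float w_up, float w_down) {
    return w_down * (silu(w_gate * x) * (w_up * x));
}

// LLM_FFN_SEQ: the gate is applied to the result of the up projection
float ffn_seq(float x, float w_gate, float w_up, float w_down) {
    return w_down * silu(w_gate * (w_up * x));
}
```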
Graph type (llm_graph_type):
| Value | Use |
|---|---|
LLM_GRAPH_TYPE_DEFAULT | Standard decoder-only forward pass |
LLM_GRAPH_TYPE_ENCODER | Encoder pass (T5, multimodal) |
LLM_GRAPH_TYPE_DECODER | Decoder pass in encoder-decoder models |
llm_graph_context is the base class for all per-architecture builders. It provides shared building-block methods that operate on a ggml_context * and produce ggml_tensor * outputs.
Typical builder signature pattern:
build_norm(cur, weight, bias, norm_type, eps, il)
→ applies LLM_NORM / LLM_NORM_RMS / LLM_NORM_GROUP
build_ffn(cur, up_w, up_b, gate_w, gate_b, down_w, down_b,
act_scales, ffn_op, gate_type, scale, il)
→ produces FFN output tensor
build_attn(inp, hparams, cparams, q, k, v, wo, wo_b, kq_scale, il)
→ handles QKV projection, RoPE, KV cache interaction,
and either ggml_flash_attn_ext or manual KQV matmul
build_inp_embd(hparams, cparams, ubatch)
→ creates llm_graph_input_embd, returns embedding tensor
build_inp_pos(hparams, n_pos_per_embd)
→ creates llm_graph_input_pos, returns position tensor
build_inp_out_ids()
→ creates llm_graph_input_out_ids, returns index tensor
Concrete model builders inherit from llm_graph_context (and sometimes from an intermediate base), call these methods inside their build() implementation, and can add architecture-specific operations directly via the GGML API.
Sources: src/llama-graph.h, src/models/models.h
The following shows the operation sequence for a single transformer block, as built by a Llama-family model:
Sources: src/llama-graph.h, src/llama-graph.cpp
Each supported model architecture is implemented in a file under src/models/. The model implements a concrete subclass of llm_graph_context (sometimes with an intermediate base class for shared logic across related architectures).
Sources: src/models/models.h, src/CMakeLists.txt
For state-space models (Mamba, RWKV), llm_build_mamba_base provides build_mamba_layer() and build_mamba2_layer(). These use GGML ops ggml_ssm_conv and ggml_ssm_scan instead of attention. The recurrent state is managed through llm_graph_input_rs, which carries the s_copy tensor for state propagation across sequences.
MoE models (Mixtral, DeepSeek, Qwen2-MoE, etc.) build an expert routing sub-graph per layer. The typical sequence:
1. ggml_mul_mat(ffn_gate_inp, cur) → logits
2. ggml_top_k(logits, n_expert_used)
3. ggml_mul_mat_id(expert_weights, cur, expert_ids) — a batched gather-matmul that selects rows from a stacked expert weight tensor

Encoder-decoder models (T5, multimodal) use llama_cross to pass encoder outputs to the decoder graph builder. The struct holds v_embd (encoder outputs copied to host) and seq_ids_enc (for masking). The graph input llm_graph_input_cross_embd uploads v_embd into a graph tensor at each decode step.
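The MoE routing sequence can be sketched in plain C++, with each expert FFN stubbed as a scalar multiplier; top_k and moe_mix are hypothetical stand-ins for ggml_top_k and the ggml_mul_mat_id gather-matmul plus weighted sum.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// indices of the k largest routing logits, best first
std::vector<int> top_k(const std::vector<float> & logits, int k) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
        [&](int a, int b) { return logits[a] > logits[b]; });
    idx.resize(k);
    return idx;
}

// route x through the top-k experts, softmax-weighted by their logits
float moe_mix(float x, const std::vector<float> & logits,
              const std::vector<float> & experts, int k) {
    const auto ids = top_k(logits, k);
    float denom = 0.0f;
    for (int i : ids) denom += std::exp(logits[i]); // softmax over selected only
    float out = 0.0f;
    for (int i : ids) out += std::exp(logits[i]) / denom * (experts[i] * x);
    return out;
}
```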
Sources: src/llama-graph.h 59-73, src/llama-graph.cpp 316-324
Before any inference runs, llama_context performs a reservation pass via sched_reserve(). This builds worst-case graphs and uses them to pre-allocate all compute buffers.
The PP (prompt processing) reservation is done first because it has the largest batch size and therefore drives buffer allocation. The scheduler then knows the maximum buffer size needed for any graph, including the smaller TG (token generation) graph.
When flash_attn_type == LLAMA_FLASH_ATTN_TYPE_AUTO, sched_reserve() builds a probe graph, inspects every GGML_OP_FLASH_ATTN_EXT node, and verifies that the scheduler assigned it to the same device as the corresponding KV cache. If any mismatch is found, Flash Attention is disabled.
Rebuilding the GGML graph from scratch on every decode step would be costly. The reuse system avoids this when the graph structure (tensor shapes and op connectivity) does not change.
The llama_context stores gf_res_prev. At the start of each decode:
1. llm_graph_input_i::can_reuse(params) checks are run for every input.
2. If all pass, the previous ggml_cgraph is reused; only set_input() is called to update data.
3. If any check fails (e.g. n_tokens changed, n_kv changed), a new graph is built and the result replaces gf_res_prev.

can_reuse() logic by input type:
| Input Class | Reuse Condition |
|---|---|
llm_graph_input_embd | Same token count (or embedding count) |
llm_graph_input_pos | pos->ne[0] == n_tokens * n_pos_per_embd |
llm_graph_input_out_ids | n_outputs unchanged |
llm_graph_input_attn_kv | n_kv, n_tokens, n_stream all unchanged |
llm_graph_input_rs | n_rs, n_seqs, head, rs_z unchanged |
llm_graph_input_i (default) | Never reused |
Set LLAMA_GRAPH_REUSE_DISABLE=1 to force a graph rebuild on every step (useful for debugging).
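The all-or-nothing reuse decision can be sketched with a reduced parameter set; params_s and the two condition functions are illustrative stand-ins for the per-input-class checks in the table.

```cpp
struct params_s { int n_tokens; int n_kv; };

// per-input reuse conditions (illustrative subset of the table above)
bool embd_can_reuse(const params_s & prev, const params_s & cur) {
    return prev.n_tokens == cur.n_tokens;           // same token count
}
bool attn_kv_can_reuse(const params_s & prev, const params_s & cur) {
    return prev.n_kv == cur.n_kv &&                 // same KV extent
           prev.n_tokens == cur.n_tokens;
}

// every input must agree, otherwise the whole graph is rebuilt
bool graph_can_reuse(const params_s & prev, const params_s & cur) {
    return embd_can_reuse(prev, cur) && attn_kv_can_reuse(prev, cur);
}
```

This is why single-token decoding with a growing KV cache cannot reuse graphs naively: n_kv changes whenever the cache view grows past its padded size.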
Sources: src/llama-graph.cpp, src/llama-context.cpp
Once a ggml_cgraph is built, ggml_backend_sched assigns each operation to a backend based on the buffer types of its weight tensors. Tensors stored in GPU buffers cause their consuming operations to be assigned to the GPU backend; CPU-hosted tensors run on CPU.
The graph may be split into multiple splits — contiguous subsets of nodes that run on a single backend — with data transfers between splits handled automatically.
Key scheduler metrics logged during reservation:
n_splits_pp = ggml_backend_sched_get_n_splits(sched) // for PP graph
n_nodes_pp = ggml_graph_n_nodes(gf)
n_splits_tg = ... // for TG graph
n_nodes_tg = ...
For details on how backends are selected and how multi-GPU tensor splitting works, see Multi-GPU and Distributed Inference and Backend System and Registration.
At model load time, llama_model determines which buffer type to use for each tensor by probing whether the desired operation is supported on the target device. The function weight_buft_supported() builds a small test tensor and calls ggml_backend_dev_supports_op().
The ops probed include GGML_OP_MUL_MAT, GGML_OP_MUL_MAT_ID, GGML_OP_GET_ROWS, GGML_OP_ROPE, GGML_OP_SSM_SCAN, and others. This pre-selection ensures that when the computation graph is scheduled, every tensor is already in a buffer type supported by the backend that will run its operation.
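The pre-selection loop can be sketched as a first-supported-wins scan over candidate buffer types; buft, pick_buft, and the supports_op callback are hypothetical stand-ins for the real weight_buft_supported() / ggml_backend_dev_supports_op() machinery.

```cpp
#include <functional>
#include <string>
#include <vector>

struct buft { std::string name; bool on_gpu; }; // stand-in for a buffer type

std::string pick_buft(const std::vector<buft> & candidates,
                      const std::function<bool(const buft &)> & supports_op) {
    for (const auto & b : candidates) {
        if (supports_op(b)) return b.name;  // first supported candidate wins
    }
    return "cpu";                           // fall back to host memory
}
```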