This page documents the llama_context structure, its lifecycle, parameter management, and the overall orchestration of inference operations. The context serves as the runtime state manager for model inference, coordinating between the loaded model, memory systems, computation backends, and batch processing pipelines.
For information about model loading and weight management, see Model Loading and Representation. For batch preparation and splitting, see Batch Processing Pipeline. For memory and KV cache specifics, see Memory Management and KV Cache. For computation graph construction, see Computation Graph Building.
The llama_context is the central runtime object for inference operations. It manages the runtime state around a loaded model: the memory system (including the KV cache), the computation backends and their scheduler, batch processing, and the output buffers for logits and embeddings.
Diagram: llama_context Structure and Dependencies
Sources: src/llama-context.h36-235 src/llama-context.cpp21-358
The context is initialized with llama_context_params, which is converted internally to llama_cparams:
| Parameter | Type | Description |
|---|---|---|
n_ctx | uint32_t | Total context size (0 = use model default) |
n_batch | uint32_t | Logical maximum batch size |
n_ubatch | uint32_t | Physical batch size for processing |
n_seq_max | uint32_t | Maximum number of sequences |
n_threads | int32_t | Threads for generation |
n_threads_batch | int32_t | Threads for batch processing |
rope_scaling_type | enum | RoPE scaling method |
pooling_type | enum | Embedding pooling strategy |
attention_type | enum | Causal vs non-causal attention |
flash_attn_type | enum | Flash attention mode |
type_k, type_v | ggml_type | KV cache quantization types |
samplers | llama_sampler_seq_config* | Backend samplers (optional) |
Sources: include/llama.h327-376 src/llama-context.cpp35-172
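As a self-contained sketch of how these fields travel together, the struct below mirrors the table above. It is not the actual llama_context_params from include/llama.h; the default values and the sanitization rules (falling back to the model's context size, clamping n_batch to at least n_ubatch) are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative mirror of the parameter table; NOT the real llama_context_params.
struct context_params {
    uint32_t n_ctx     = 0;    // 0 = use the model's default context size
    uint32_t n_batch   = 2048; // logical maximum batch size
    uint32_t n_ubatch  = 512;  // physical micro-batch size
    uint32_t n_seq_max = 1;    // maximum number of parallel sequences
    int32_t  n_threads       = 4; // threads for generation
    int32_t  n_threads_batch = 4; // threads for batch processing
};

// Assumed sanity rules for the sketch: resolve the model-default context size
// and never let the logical batch be smaller than the physical micro-batch.
context_params sanitize(context_params p, uint32_t model_n_ctx_train) {
    if (p.n_ctx == 0) {
        p.n_ctx = model_n_ctx_train; // fall back to the model default
    }
    p.n_batch = std::max(p.n_batch, p.n_ubatch);
    return p;
}
```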
Diagram: Context Parameter Processing Flow
Key processing steps in src/llama-context.cpp35-172:
- `n_ctx = GGML_PAD(n_ctx, 256)` for optimal memory alignment
- With a unified KV cache: `n_stream = 1`, `n_ctx_seq = n_ctx`
- With per-sequence streams: `n_stream = n_seq_max`, `n_ctx_seq = n_ctx / n_seq_max`
- Resolve `rope_freq_scale` and `rope_freq_base` from the model or params
- Compute `yarn_attn_factor` based on the scaling parameters

Sources: src/llama-context.cpp35-172
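The padding and stream-split arithmetic above can be sketched as follows. GGML_PAD rounds up to a multiple of the alignment; it is reimplemented here so the example is self-contained, and the `unified` flag is an illustrative stand-in for the context's KV cache mode.

```cpp
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of n (self-contained stand-in for GGML_PAD).
constexpr uint32_t pad_to(uint32_t x, uint32_t n) {
    return ((x + n - 1) / n) * n;
}

struct ctx_layout {
    uint32_t n_ctx;     // total (padded) context size
    uint32_t n_stream;  // number of KV streams
    uint32_t n_ctx_seq; // context visible to each sequence
};

// unified == true: all sequences share one stream over the full context;
// otherwise the context is divided evenly across n_seq_max streams.
ctx_layout split_context(uint32_t n_ctx, uint32_t n_seq_max, bool unified) {
    const uint32_t padded = pad_to(n_ctx, 256);
    if (unified) {
        return { padded, 1, padded };
    }
    return { padded, n_seq_max, padded / n_seq_max };
}
```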
The llama_context constructor src/llama-context.cpp21-358 performs comprehensive initialization:
Diagram: Context Construction Sequence
Sources: src/llama-context.cpp21-358
The context initializes multiple backend types in priority order:
1. Device backends (from `model.devices`): CUDA, Metal, Vulkan, etc.
2. Accelerator (ACCEL) backends
3. CPU backend, initialized last as the host fallback

Diagram: Backend Initialization Structure
For GPU backends, the context also sets up host buffer types for efficient CPU-GPU transfers src/llama-context.cpp289-296
Sources: src/llama-context.cpp211-251
The context automatically detects when pipeline parallelism can be enabled src/llama-context.cpp306-338:
Conditions for pipeline parallelism:
- Multiple devices (`model.n_devices() > 1`)
- Full layer offload (`model.n_gpu_layers() > model.hparams.n_layer`)
- Layer-wise split mode (`LLAMA_SPLIT_MODE_LAYER`)
- KQV offload enabled (`cparams.offload_kqv`)

Sources: src/llama-context.cpp306-338
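The eligibility check described above is a conjunction of the four conditions. A minimal sketch (names mirror the conditions in the text; this is not the library code):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for llama_split_mode.
enum split_mode { SPLIT_MODE_NONE, SPLIT_MODE_LAYER, SPLIT_MODE_ROW };

bool pipeline_parallel_possible(uint32_t n_devices,
                                uint32_t n_gpu_layers,
                                uint32_t n_layer,
                                split_mode mode,
                                bool offload_kqv) {
    return n_devices   > 1        // more than one GPU
        && n_gpu_layers > n_layer // all layers (plus output) offloaded
        && mode == SPLIT_MODE_LAYER
        && offload_kqv;           // KQV ops run on the GPU too
}
```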
The sched_reserve() method src/llama-context.cpp380-439 allocates compute resources:
Diagram: Scheduler Reservation Flow
The reservation process builds a worst-case computation graph to determine maximum buffer requirements, ensuring all subsequent inference operations will fit without reallocation.
Sources: src/llama-context.cpp380-439
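The "worst case" idea can be sketched as sizing for the largest ubatch the context can ever see, so every later graph fits in the reserved buffers. The exact dimensions used by the library may differ; this is an assumed formulation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct ubatch_dims {
    uint32_t n_tokens;
    uint32_t n_seqs;
    uint32_t n_outputs;
};

// Assume every token may need an output and all sequences are active at once.
ubatch_dims worst_case_ubatch(uint32_t n_ubatch, uint32_t n_seq_max) {
    return { n_ubatch, std::min(n_ubatch, n_seq_max), n_ubatch };
}
```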
The primary inference methods are llama_encode() and llama_decode(), which share a common processing pipeline:
Diagram: Decode/Encode Processing Sequence
Sources: src/llama-context.cpp527-634
The process_ubatch() method src/llama-context.cpp441-525 is the core inference kernel:
Diagram: Process Ubatch Internal Flow
Sources: src/llama-context.cpp441-525
Graph reuse is a critical performance optimization. A graph can be reused when:
- All graph inputs report they can be reused (`llm_graph_input_i::can_reuse()`)
- Graph reuse is not disabled (`graph_reuse_disable == false`)

The context tracks the previous ubatch parameters and graph inputs to determine reusability:
| Reuse Check | Implementation |
|---|---|
| Ubatch match | Compare n_tokens, n_seqs, n_outputs, etc. |
| Input reuse | Each input type (embd, pos, mask, etc.) implements can_reuse() |
| Force rebuild | Environment variable LLAMA_GRAPH_REUSE_DISABLE |
Sources: src/llama-context.cpp163-168 src/llama-graph.h82-103
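The reuse decision described above can be sketched as a shape comparison plus an environment override. The struct fields are illustrative, not the actual ubatch representation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Illustrative subset of the ubatch parameters that must match for reuse.
struct ubatch_shape {
    uint32_t n_tokens;
    uint32_t n_seqs;
    uint32_t n_outputs;

    bool operator==(const ubatch_shape & o) const {
        return n_tokens == o.n_tokens && n_seqs == o.n_seqs && n_outputs == o.n_outputs;
    }
};

bool can_reuse_graph(const ubatch_shape & prev, const ubatch_shape & cur,
                     bool inputs_reusable) {
    if (std::getenv("LLAMA_GRAPH_REUSE_DISABLE") != nullptr) {
        return false; // forced rebuild for debugging/benchmarking
    }
    return prev == cur && inputs_reusable;
}
```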
Thread counts can be adjusted at runtime via llama_set_n_threads() src/llama-context.cpp865-878:
This propagates to all backends that support dynamic thread configuration through the set_n_threads_fns registry src/llama-context.cpp241-250
Sources: src/llama-context.cpp865-878 src/llama-context.cpp241-250
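The propagation pattern above can be sketched as a registry of per-backend setter callbacks that a single runtime call fans out to. The registry type here is illustrative, not the library's actual `set_n_threads_fns` structure.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

using set_n_threads_fn = std::function<void(int32_t)>;

struct thread_registry {
    std::vector<set_n_threads_fn> fns;

    // Fan the new thread count out to every backend that registered a setter.
    void set_n_threads(int32_t n) {
        for (auto & fn : fns) {
            fn(n);
        }
    }
};
```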
The context supports dynamic adapter management:
LoRA Adapters src/llama-context.cpp890-936:
- `set_adapter_lora(adapter, scale)`: Apply a LoRA adapter with the given scaling
- `rm_adapter_lora(adapter)`: Remove a specific adapter
- `clear_adapter_lora()`: Remove all adapters

Control Vectors src/llama-context.cpp938-985:
- `apply_adapter_cvec(data, len, n_embd, il_start, il_end)`: Apply a control vector to a range of layers

Sources: src/llama-context.cpp890-985
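The LoRA bookkeeping implied by these methods can be sketched as a per-adapter scale map that set/remove/clear operate on. The string handle stands in for the real adapter pointer purely for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative per-context LoRA state: adapter handle -> applied scale.
struct lora_state {
    std::map<std::string, float> scales;

    void set_adapter(const std::string & name, float scale) { scales[name] = scale; }
    bool rm_adapter(const std::string & name) { return scales.erase(name) > 0; }
    void clear_adapters() { scales.clear(); }
};
```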
The context supports full state serialization for saving/loading inference state:
Diagram: State Serialization Structure
The state can be saved/loaded for the entire context or per-sequence, enabling advanced use cases like speculative decoding and multi-turn conversations.
Sources: src/llama-context.cpp987-1313 examples/save-load-state/save-load-state.cpp68-104
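Callers typically follow a size/serialize/restore pattern: query the state size, copy the state into a caller-owned buffer, and later restore from it. The sketch below uses a toy payload in place of the real context state; the function names echo that pattern but are not the library's API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy stand-in for the serialized context state.
struct toy_state {
    uint32_t n_past;
    float    logit;
};

size_t state_get_size(const toy_state &) { return sizeof(toy_state); }

size_t state_get_data(const toy_state & s, uint8_t * dst) {
    std::memcpy(dst, &s, sizeof(s)); // serialize into the caller's buffer
    return sizeof(s);
}

size_t state_set_data(toy_state & s, const uint8_t * src) {
    std::memcpy(&s, src, sizeof(s)); // restore from a previously saved buffer
    return sizeof(s);
}
```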
The context maintains a unified output buffer for logits and embeddings src/llama-context.cpp1315-1411:
Output data is organized as:
- Logits: `[n_vocab, n_outputs]` - token predictions
- Embeddings: `[n_embd, n_outputs]` - sequence embeddings

The buffer is reallocated only when the required size exceeds the current capacity.
Sources: src/llama-context.cpp1315-1411
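The layout and grow-only policy above can be sketched as fixed-stride row offsets plus a capacity check. Assumes row-major storage with one row per output token; this is illustrative, not the library's buffer code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

// Row i of the logits starts i * n_vocab floats into the logits region.
size_t logits_offset(uint32_t i, uint32_t n_vocab) { return (size_t) i * n_vocab; }

// Row i of the embeddings starts i * n_embd floats into the embeddings region.
size_t embd_offset(uint32_t i, uint32_t n_embd) { return (size_t) i * n_embd; }

// Grow-only reallocation policy: capacity never shrinks.
size_t maybe_grow(size_t current, size_t required) {
    return required > current ? required : current;
}
```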
The context can report memory usage across buffer types via memory_breakdown() src/llama-context.cpp1413-1516:
| Memory Category | Description |
|---|---|
| Model | Weight storage from llama_model |
| Context | KV cache, cells, and persistent state |
| Compute | Temporary computation buffers |
This breakdown is used by llama_params_fit() for automatic memory optimization.
Sources: src/llama-context.cpp1413-1516
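A breakdown like the table above amounts to summing buffer sizes per category. A minimal sketch of that aggregation (the category names follow the table; the structure is illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Accumulate per-buffer byte counts into the reported categories.
struct memory_breakdown {
    std::map<std::string, size_t> categories;

    void add(const std::string & cat, size_t bytes) { categories[cat] += bytes; }

    size_t total() const {
        size_t sum = 0;
        for (const auto & kv : categories) {
            sum += kv.second;
        }
        return sum;
    }
};
```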
The synchronize() method src/llama-context.cpp636-642 waits for all pending backend operations to complete. This is essential before reading logits or embeddings from the output buffers and before saving or inspecting context state.
Sources: src/llama-context.cpp636-642
The context optionally supports backend samplers src/llama-context.cpp63-80, which allow token sampling to run on the backend (e.g. the GPU) as part of the computation graph, avoiding a full logits transfer back to the host.
Backend samplers are specified during context creation and become part of the computation graph before scheduler reservation.
Sources: src/llama-context.cpp63-80 src/llama-context.h207-236
User-facing tools typically use the common_init_from_params() helper common/common.cpp1073-1178:
Diagram: Common Initialization Flow
This pattern handles model loading, context creation, adapter application, and sampler initialization in a consistent way across all tools.
Sources: common/common.cpp1073-1178