This page documents the llama_context structure, its lifecycle, parameter management, and the overall orchestration of inference operations. The context serves as the runtime state manager for model inference, coordinating between the loaded model, memory systems, computation backends, and batch processing pipelines.
For information about model loading and weight management, see Model Loading and Representation. For batch preparation and splitting, see Batch Processing Pipeline. For memory and KV cache specifics, see Memory Management and KV Cache. For computation graph construction, see Computation Graph Building.
The llama_context is the central runtime object for inference operations. It manages the runtime state around a loaded model: the memory system (including the KV cache), the computation backends and their scheduler, batch processing, and the output buffers for logits and embeddings.
Diagram: llama_context Structure and Dependencies
Sources: src/llama-context.h36-235 src/llama-context.cpp21-358
The context is initialized with llama_context_params, which is converted internally to llama_cparams:
| Parameter | Type | Description |
|---|---|---|
n_ctx | uint32_t | Total context size (0 = use model default) |
n_batch | uint32_t | Logical maximum batch size |
n_ubatch | uint32_t | Physical batch size for processing |
n_seq_max | uint32_t | Maximum number of sequences |
n_threads | int32_t | Threads for generation |
n_threads_batch | int32_t | Threads for batch processing |
rope_scaling_type | enum | RoPE scaling method |
pooling_type | enum | Embedding pooling strategy |
attention_type | enum | Causal vs non-causal attention |
flash_attn_type | enum | Flash attention mode |
type_k, type_v | ggml_type | KV cache quantization types |
samplers | llama_sampler_seq_config* | Backend samplers (optional) |
Sources: include/llama.h327-376 src/llama-context.cpp35-172
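As a self-contained sketch of how these fields travel together, the struct below mirrors the table above. It is not the actual llama_context_params from include/llama.h; the default values and the sanitization rules (falling back to the model's context size, clamping n_batch to at least n_ubatch) are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative mirror of the parameter table; NOT the real llama_context_params.
struct context_params {
    uint32_t n_ctx     = 0;    // 0 = use the model's default context size
    uint32_t n_batch   = 2048; // logical maximum batch size
    uint32_t n_ubatch  = 512;  // physical micro-batch size
    uint32_t n_seq_max = 1;    // maximum number of parallel sequences
    int32_t  n_threads       = 4; // threads for generation
    int32_t  n_threads_batch = 4; // threads for batch processing
};

// Assumed sanity rules for the sketch: resolve the model-default context size
// and never let the logical batch be smaller than the physical micro-batch.
context_params sanitize(context_params p, uint32_t model_n_ctx_train) {
    if (p.n_ctx == 0) {
        p.n_ctx = model_n_ctx_train; // fall back to the model default
    }
    p.n_batch = std::max(p.n_batch, p.n_ubatch);
    return p;
}
```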
Diagram: Context Parameter Processing Flow
Key processing steps in src/llama-context.cpp35-172:
- `n_ctx = GGML_PAD(n_ctx, 256)` for optimal memory alignment
- With a unified KV cache: `n_stream = 1`, `n_ctx_seq = n_ctx`
- With per-sequence streams: `n_stream = n_seq_max`, `n_ctx_seq = n_ctx / n_seq_max`
- Resolve `rope_freq_scale` and `rope_freq_base` from the model or params
- Compute `yarn_attn_factor` based on the scaling parameters

Sources: src/llama-context.cpp35-172
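The padding and stream-split arithmetic above can be sketched as follows. GGML_PAD rounds up to a multiple of the alignment; it is reimplemented here so the example is self-contained, and the `unified` flag is an illustrative stand-in for the context's KV cache mode.

```cpp
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of n (self-contained stand-in for GGML_PAD).
constexpr uint32_t pad_to(uint32_t x, uint32_t n) {
    return ((x + n - 1) / n) * n;
}

struct ctx_layout {
    uint32_t n_ctx;     // total (padded) context size
    uint32_t n_stream;  // number of KV streams
    uint32_t n_ctx_seq; // context visible to each sequence
};

// unified == true: all sequences share one stream over the full context;
// otherwise the context is divided evenly across n_seq_max streams.
ctx_layout split_context(uint32_t n_ctx, uint32_t n_seq_max, bool unified) {
    const uint32_t padded = pad_to(n_ctx, 256);
    if (unified) {
        return { padded, 1, padded };
    }
    return { padded, n_seq_max, padded / n_seq_max };
}
```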
The llama_context constructor src/llama-context.cpp21-358 performs comprehensive initialization:
Diagram: Context Construction Sequence
Sources: src/llama-context.cpp21-358
The context initializes multiple backend types in priority order:
1. Device backends (from `model.devices`): CUDA, Metal, Vulkan, etc.
2. Accelerator (ACCEL) backends
3. CPU backend, initialized last as the host fallback

Diagram: Backend Initialization Structure
For GPU backends, the context also sets up host buffer types for efficient CPU-GPU transfers src/llama-context.cpp289-296
Sources: src/llama-context.cpp211-251
The context automatically detects when pipeline parallelism can be enabled src/llama-context.cpp306-338:
Conditions for pipeline parallelism:
- Multiple devices (`model.n_devices() > 1`)
- Full layer offload (`model.n_gpu_layers() > model.hparams.n_layer`)
- Layer-wise split mode (`LLAMA_SPLIT_MODE_LAYER`)
- KQV offload enabled (`cparams.offload_kqv`)

Sources: src/llama-context.cpp306-338
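The eligibility check described above is a conjunction of the four conditions. A minimal sketch (names mirror the conditions in the text; this is not the library code):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for llama_split_mode.
enum split_mode { SPLIT_MODE_NONE, SPLIT_MODE_LAYER, SPLIT_MODE_ROW };

bool pipeline_parallel_possible(uint32_t n_devices,
                                uint32_t n_gpu_layers,
                                uint32_t n_layer,
                                split_mode mode,
                                bool offload_kqv) {
    return n_devices   > 1        // more than one GPU
        && n_gpu_layers > n_layer // all layers (plus output) offloaded
        && mode == SPLIT_MODE_LAYER
        && offload_kqv;           // KQV ops run on the GPU too
}
```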
The sched_reserve() method src/llama-context.cpp380-439 allocates compute resources:
Diagram: Scheduler Reservation Flow
The reservation process builds a worst-case computation graph to determine maximum buffer requirements, ensuring all subsequent inference operations will fit without reallocation.
Sources: src/llama-context.cpp380-439
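The "worst case" idea can be sketched as sizing for the largest ubatch the context can ever see, so every later graph fits in the reserved buffers. The exact dimensions used by the library may differ; this is an assumed formulation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct ubatch_dims {
    uint32_t n_tokens;
    uint32_t n_seqs;
    uint32_t n_outputs;
};

// Assume every token may need an output and all sequences are active at once.
ubatch_dims worst_case_ubatch(uint32_t n_ubatch, uint32_t n_seq_max) {
    return { n_ubatch, std::min(n_ubatch, n_seq_max), n_ubatch };
}
```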
The primary inference methods are llama_encode() and llama_decode(), which share a common processing pipeline:
Diagram: Decode/Encode Processing Sequence
Sources: src/llama-context.cpp527-634
The process_ubatch() method src/llama-context.cpp441-525 is the core inference kernel:
Diagram: Process Ubatch Internal Flow
Sources: src/llama-context.cpp441-525
Graph reuse is a critical performance optimization. A graph can be reused when:
- All graph inputs report they can be reused (`llm_graph_input_i::can_reuse()`)
- Graph reuse is not disabled (`graph_reuse_disable == false`)

The context tracks the previous ubatch parameters and graph inputs to determine reusability:
| Reuse Check | Implementation |
|---|---|
| Ubatch match | Compare n_tokens, n_seqs, n_outputs, etc. |
| Input reuse | Each input type (embd, pos, mask, etc.) implements can_reuse() |
| Force rebuild | Environment variable LLAMA_GRAPH_REUSE_DISABLE |
Sources: src/llama-context.cpp163-168 src/llama-graph.h82-103
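The reuse decision described above can be sketched as a shape comparison plus an environment override. The struct fields are illustrative, not the actual ubatch representation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Illustrative subset of the ubatch parameters that must match for reuse.
struct ubatch_shape {
    uint32_t n_tokens;
    uint32_t n_seqs;
    uint32_t n_outputs;

    bool operator==(const ubatch_shape & o) const {
        return n_tokens == o.n_tokens && n_seqs == o.n_seqs && n_outputs == o.n_outputs;
    }
};

bool can_reuse_graph(const ubatch_shape & prev, const ubatch_shape & cur,
                     bool inputs_reusable) {
    if (std::getenv("LLAMA_GRAPH_REUSE_DISABLE") != nullptr) {
        return false; // forced rebuild for debugging/benchmarking
    }
    return prev == cur && inputs_reusable;
}
```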
Thread counts can be adjusted at runtime via llama_set_n_threads() src/llama-context.cpp865-878:
This propagates to all backends that support dynamic thread configuration through the set_n_threads_fns registry src/llama-context.cpp241-250
Sources: src/llama-context.cpp865-878 src/llama-context.cpp241-250
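The propagation pattern above can be sketched as a registry of per-backend setter callbacks that a single runtime call fans out to. The registry type here is illustrative, not the library's actual `set_n_threads_fns` structure.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

using set_n_threads_fn = std::function<void(int32_t)>;

struct thread_registry {
    std::vector<set_n_threads_fn> fns;

    // Fan the new thread count out to every backend that registered a setter.
    void set_n_threads(int32_t n) {
        for (auto & fn : fns) {
            fn(n);
        }
    }
};
```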
The context supports dynamic adapter management:
LoRA Adapters src/llama-context.cpp890-936:
- `set_adapter_lora(adapter, scale)`: Apply a LoRA adapter with the given scaling
- `rm_adapter_lora(adapter)`: Remove a specific adapter
- `clear_adapter_lora()`: Remove all adapters

Control Vectors src/llama-context.cpp938-985:
- `apply_adapter_cvec(data, len, n_embd, il_start, il_end)`: Apply a control vector to a range of layers

Sources: src/llama-context.cpp890-985
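The LoRA bookkeeping implied by these methods can be sketched as a per-adapter scale map that set/remove/clear operate on. The string handle stands in for the real adapter pointer purely for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative per-context LoRA state: adapter handle -> applied scale.
struct lora_state {
    std::map<std::string, float> scales;

    void set_adapter(const std::string & name, float scale) { scales[name] = scale; }
    bool rm_adapter(const std::string & name) { return scales.erase(name) > 0; }
    void clear_adapters() { scales.clear(); }
};
```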
The context supports full state serialization for saving/loading inference state:
Diagram: State Serialization Structure
The state can be saved/loaded for the entire context or per-sequence, enabling advanced use cases like speculative decoding and multi-turn conversations.
Sources: src/llama-context.cpp987-1313 examples/save-load-state/save-load-state.cpp68-104
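Callers typically follow a size/serialize/restore pattern: query the state size, copy the state into a caller-owned buffer, and later restore from it. The sketch below uses a toy payload in place of the real context state; the function names echo that pattern but are not the library's API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy stand-in for the serialized context state.
struct toy_state {
    uint32_t n_past;
    float    logit;
};

size_t state_get_size(const toy_state &) { return sizeof(toy_state); }

size_t state_get_data(const toy_state & s, uint8_t * dst) {
    std::memcpy(dst, &s, sizeof(s)); // serialize into the caller's buffer
    return sizeof(s);
}

size_t state_set_data(toy_state & s, const uint8_t * src) {
    std::memcpy(&s, src, sizeof(s)); // restore from a previously saved buffer
    return sizeof(s);
}
```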
The context maintains a unified output buffer for logits and embeddings src/llama-context.cpp1315-1411:
Output data is organized as:
- Logits: `[n_vocab, n_outputs]` - token predictions
- Embeddings: `[n_embd, n_outputs]` - sequence embeddings

The buffer is reallocated only when the required size exceeds the current capacity.
Sources: src/llama-context.cpp1315-1411
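The layout and grow-only policy above can be sketched as fixed-stride row offsets plus a capacity check. Assumes row-major storage with one row per output token; this is illustrative, not the library's buffer code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

// Row i of the logits starts i * n_vocab floats into the logits region.
size_t logits_offset(uint32_t i, uint32_t n_vocab) { return (size_t) i * n_vocab; }

// Row i of the embeddings starts i * n_embd floats into the embeddings region.
size_t embd_offset(uint32_t i, uint32_t n_embd) { return (size_t) i * n_embd; }

// Grow-only reallocation policy: capacity never shrinks.
size_t maybe_grow(size_t current, size_t required) {
    return required > current ? required : current;
}
```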
The context can report memory usage across buffer types via memory_breakdown() src/llama-context.cpp1413-1516:
| Memory Category | Description |
|---|---|
| Model | Weight storage from llama_model |
| Context | KV cache, cells, and persistent state |
| Compute | Temporary computation buffers |
This breakdown is used by llama_params_fit() for automatic memory optimization.
Sources: src/llama-context.cpp1413-1516
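A breakdown like the table above amounts to summing buffer sizes per category. A minimal sketch of that aggregation (the category names follow the table; the structure is illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Accumulate per-buffer byte counts into the reported categories.
struct memory_breakdown {
    std::map<std::string, size_t> categories;

    void add(const std::string & cat, size_t bytes) { categories[cat] += bytes; }

    size_t total() const {
        size_t sum = 0;
        for (const auto & kv : categories) {
            sum += kv.second;
        }
        return sum;
    }
};
```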
The synchronize() method src/llama-context.cpp636-642 waits for all pending backend operations to complete. This is essential before reading logits or embeddings from the output buffers and before saving or inspecting context state.
Sources: src/llama-context.cpp636-642
The context optionally supports backend samplers src/llama-context.cpp63-80, which allow token sampling to run on the backend (e.g. the GPU) as part of the computation graph, avoiding a full logits transfer back to the host.
Backend samplers are specified during context creation and become part of the computation graph before scheduler reservation.
Sources: src/llama-context.cpp63-80 src/llama-context.h207-236
User-facing tools typically use the common_init_from_params() helper common/common.cpp1073-1178:
Diagram: Common Initialization Flow
This pattern handles model loading, context creation, adapter application, and sampler initialization in a consistent way across all tools.
Sources: common/common.cpp1073-1178