This page is an overview of advanced capabilities in llama.cpp that go beyond basic model loading and text generation. The features documented here require more involved configuration and are intended for users who need higher throughput, constrained outputs, or expanded input modalities.
The advanced features covered in this section are:
| Feature | What it does | Key entry point |
|---|---|---|
| Speculative Decoding | Accelerates generation using a smaller draft model | examples/speculative/ |
| Grammar & Structured Output | Constrains token generation to valid grammar/JSON | llama_sampler_init_grammar |
| Flash Attention | Fused attention kernels for GPU backends | ggml_cuda_flash_attn_ext |
| Multi-GPU & Distributed | Splits tensors across devices or machines | ggml_backend_cuda_split_buffer_type, RPC backend |
| Multimodal Support | Encodes images and audio as token embeddings | libmtmd / mtmd_context |
Each feature has a dedicated sub-page with full implementation details. The sections below give a factual summary of how each feature fits into the broader system.
Speculative decoding runs two models in tandem: a small, fast draft model that generates candidate tokens, and a larger target model that verifies them in a single parallel forward pass.
System flow:
Diagram: Speculative Decoding Execution Flow
The seq_draft struct (examples/speculative/speculative.cpp:18-30) tracks per-branch state including the draft token list, batch indices, and per-sequence sampler. Multiple parallel draft branches are supported via n_seq_dft.
Token verification is handled by common_sampler_sample_and_accept_n (common/sampling.cpp:521-548). It iterates over draft tokens and stops at the first mismatch.
Vocab compatibility between draft and target models is checked at startup, with a tolerance of SPEC_VOCAB_MAX_SIZE_DIFFERENCE = 128 tokens (examples/speculative/speculative.cpp:15).
For full details, see Speculative Decoding.
Sources: examples/speculative/speculative.cpp:1-550, common/sampling.cpp:521-558, common/sampling.h:67-86
Grammar-constrained sampling restricts the set of valid next tokens at each step. Two mechanisms are available: GBNF grammars (a built-in BNF-like format) and llguidance (an optional Rust-based Lark grammar engine).
Diagram: Grammar Sampler Chain
Initialization in common_sampler_init (common/sampling.cpp:179-348) selects the grammar implementation:
| Condition | Sampler created |
|---|---|
| params.grammar starts with %llguidance | llama_sampler_init_llg (requires LLAMA_LLGUIDANCE=ON) |
| params.grammar_lazy == true | llama_sampler_init_grammar_lazy_patterns |
| params.grammar_lazy == false | llama_sampler_init_grammar |
Grammar triggers (patterns or specific tokens) can activate lazy grammar sampling mid-generation, which is useful for tool-call schemas where grammar enforcement should only begin after a particular token is produced.
The grammar_first flag in common_sampler_sample (common/sampling.cpp:449-519) applies grammar constraints before the sampling chain instead of after, ensuring all output candidates satisfy the grammar — useful but slower.
For full details, see Grammar and Structured Output.
Sources: common/sampling.cpp:179-348, common/sampling.cpp:449-519, common/sampling.h:55-86
Flash Attention computes attention in a numerically stable, memory-efficient fused kernel. In llama.cpp the CUDA backend implements this as ggml_cuda_flash_attn_ext (ggml/src/ggml-cuda/fattn.cu:460-478).
Diagram: CUDA Flash Attention Kernel Selection
The kernel selection function ggml_cuda_get_best_fattn_kernel (ggml/src/ggml-cuda/fattn.cu:280-458) evaluates:
- Head size (K->ne[0]): supports 40, 64, 72, 80, 96, 112, 128, 256, 576
- K/V data types: F16, Q4_0, Q8_0 (more with GGML_CUDA_FA_ALL_QUANTS)
- Compute capability (cc): Volta (cc=700), Turing (cc=750+), Ampere (cc=800+), Ada (cc=890+), Blackwell

The MMA kernel (fattn-mma-f16.cuh) uses fattn_mma_config (ggml/src/ggml-cuda/fattn-mma-f16.cuh:10-24) to tune thread counts, occupancy, and pipeline stages per architecture. Configurations are selected via ggml_cuda_fattn_mma_get_config (ggml/src/ggml-cuda/fattn-mma-f16.cuh:114-126).
Numerical stability is maintained by the constant FATTN_KQ_MAX_OFFSET (ggml/src/ggml-cuda/fattn-common.cuh:19), which shifts the softmax range to avoid overflow in the VKQ accumulators.
For the vision encoder, flash attention is controlled per context via clip_flash_attn_type (tools/mtmd/clip.cpp:159) and applied inside clip_graph::build_attn (tools/mtmd/clip.cpp:593-653).
For full details, see Flash Attention and Optimizations.
Sources: ggml/src/ggml-cuda/fattn.cu:272-482, ggml/src/ggml-cuda/fattn-common.cuh:1-25, ggml/src/ggml-cuda/fattn-mma-f16.cuh:10-141, tools/mtmd/clip.cpp:593-653
llama.cpp supports two approaches for spreading inference across multiple devices:
| Approach | Mechanism | Scope |
|---|---|---|
| Tensor splitting | ggml_backend_cuda_split_buffer_type | Multiple GPUs on one host |
| Pipeline parallelism | Layer assignment across devices | Multiple GPUs on one host |
| RPC backend | Remote ggml backend over TCP | Multiple machines |
The --tensor-split CLI flag controls how tensor data is divided across CUDA devices. The split buffer type allocates contiguous regions of each tensor on different devices according to the specified ratios.
The RPC backend (ggml-rpc) allows remote machines to expose their GPU or CPU as a ggml backend, and the scheduler distributes graph nodes to them the same way it would to local devices.
For full details, see Multi-GPU and Distributed Inference.
Multimodal input is implemented in the libmtmd library (tools/mtmd/), which bridges raw image/audio data to llama.cpp token embeddings.
Diagram: libmtmd Component Relationships
Diagram: Projector Type to Graph Builder Mapping
The mtmd_context struct (tools/mtmd/mtmd.cpp:121-399) holds both a vision clip_ctx * ctx_v and an audio clip_ctx * ctx_a. Each clip_ctx owns a ggml_backend_sched and dispatches the vision/audio encoder graph to GPU or CPU.
Key data structures:
| Struct | Purpose |
|---|---|
| mtmd_bitmap | Raw image (RGBRGB...) or audio data with optional user ID |
| mtmd_image_tokens | Preprocessed image patches as clip_image_f32_batch, with nx/ny patch counts |
| mtmd_audio_tokens | Preprocessed audio frames, token count |
| mtmd_input_chunk | Tagged union of text tokens, image tokens, or audio tokens |
| mtmd_input_chunks | Ordered sequence of chunks for one prompt |
The clip_graph::build_vit method (tools/mtmd/clip.cpp:288-457) implements the shared Vision Transformer forward pass used by most vision encoders. Architecture-specific subclasses override build() for custom patch merging, positional embedding handling, or audio stacking.
The clip_flash_attn_type field in clip_ctx (tools/mtmd/clip.cpp:159) controls whether the vision encoder uses fused attention. It maps from the llama-level llama_flash_attn_type enum via mtmd_get_clip_flash_attn_type (tools/mtmd/mtmd.cpp:95-102).
The public C API is declared in tools/mtmd/mtmd.h and covers:
- mtmd_init / mtmd_free — context lifecycle
- mtmd_tokenize — converts text+media into mtmd_input_chunks
- mtmd_encode_chunk — runs the vision/audio encoder graph
- mtmd_get_input_embd — retrieves the computed float embeddings

For full details, see Multimodal Support.
Sources: tools/mtmd/mtmd.cpp:121-399, tools/mtmd/mtmd.h:1-120, tools/mtmd/clip.cpp:141-221, tools/mtmd/clip.cpp:784-881, tools/mtmd/CMakeLists.txt:1-85, docs/multimodal.md:1-50