This page is an overview of advanced capabilities in llama.cpp that go beyond basic model loading and text generation. The features documented here require more involved configuration and are intended for users who need higher throughput, constrained outputs, or expanded input modalities.
The advanced features covered in this section are:
| Feature | What it does | Key entry point |
|---|---|---|
| Speculative Decoding | Accelerates generation using a smaller draft model | examples/speculative/ |
| Grammar & Structured Output | Constrains token generation to valid grammar/JSON | llama_sampler_init_grammar |
| Flash Attention | Fused attention kernels for GPU backends | ggml_cuda_flash_attn_ext |
| Multi-GPU & Distributed | Splits tensors across devices or machines | ggml_backend_cuda_split_buffer_type, RPC backend |
| Multimodal Support | Encodes images and audio as token embeddings | libmtmd / mtmd_context |
Each feature has a dedicated sub-page with full implementation details. The sections below give a factual summary of how each feature fits into the broader system.
Speculative decoding runs two models in tandem: a small, fast draft model that generates candidate tokens, and a larger target model that verifies them in a single parallel forward pass.
System flow:
Diagram: Speculative Decoding Execution Flow
The seq_draft struct (examples/speculative/speculative.cpp:18-30) tracks per-branch state including the draft token list, batch indices, and per-sequence sampler. Multiple parallel draft branches are supported via n_seq_dft.
Token verification is handled by common_sampler_sample_and_accept_n (common/sampling.cpp:521-548). It iterates over draft tokens and stops at the first mismatch.
Vocab compatibility between draft and target models is checked at startup, with a tolerance of SPEC_VOCAB_MAX_SIZE_DIFFERENCE = 128 tokens (examples/speculative/speculative.cpp:15).
For full details, see Speculative Decoding.
Sources: examples/speculative/speculative.cpp:1-550, common/sampling.cpp:521-558, common/sampling.h:67-86
Grammar-constrained sampling restricts the set of valid next tokens at each step. Two mechanisms are available: GBNF grammars (a built-in BNF-like format) and llguidance (an optional Rust-based Lark grammar engine).
Diagram: Grammar Sampler Chain
Initialization in common_sampler_init (common/sampling.cpp:179-348) selects the grammar implementation:
| Condition | Sampler created |
|---|---|
| params.grammar starts with %llguidance | llama_sampler_init_llg (requires LLAMA_LLGUIDANCE=ON) |
| params.grammar_lazy == true | llama_sampler_init_grammar_lazy_patterns |
| params.grammar_lazy == false | llama_sampler_init_grammar |
Grammar triggers (patterns or specific tokens) can activate lazy grammar sampling mid-generation, which is useful for tool-call schemas where grammar enforcement should only begin after a particular token is produced.
The grammar_first flag in common_sampler_sample (common/sampling.cpp:449-519) applies grammar constraints before the sampling chain instead of after, ensuring all output candidates satisfy the grammar — useful but slower.
For full details, see Grammar and Structured Output.
Sources: common/sampling.cpp:179-348, common/sampling.cpp:449-519, common/sampling.h:55-86
Flash Attention computes attention in a numerically stable, memory-efficient fused kernel. In llama.cpp the CUDA backend implements this as ggml_cuda_flash_attn_ext (ggml/src/ggml-cuda/fattn.cu:460-478).
Diagram: CUDA Flash Attention Kernel Selection
The kernel selection function ggml_cuda_get_best_fattn_kernel (ggml/src/ggml-cuda/fattn.cu:280-458) evaluates:
- Head size (K->ne[0]): supports 40, 64, 72, 80, 96, 112, 128, 256, 576
- K/V data types: F16, Q4_0, Q8_0 (more with GGML_CUDA_FA_ALL_QUANTS)
- Compute capability (cc): Volta (cc=700), Turing (cc=750+), Ampere (cc=800+), Ada (cc=890+), Blackwell

The MMA kernel (fattn-mma-f16.cuh) uses fattn_mma_config (ggml/src/ggml-cuda/fattn-mma-f16.cuh:10-24) to tune thread counts, occupancy, and pipeline stages per architecture. Configurations are selected via ggml_cuda_fattn_mma_get_config (ggml/src/ggml-cuda/fattn-mma-f16.cuh:114-126).
Numerical stability is maintained by the constant FATTN_KQ_MAX_OFFSET (ggml/src/ggml-cuda/fattn-common.cuh:19), which shifts the softmax range to avoid overflow in the VKQ accumulators.
For the vision encoder, flash attention is controlled per context via clip_flash_attn_type (tools/mtmd/clip.cpp:159) and applied inside clip_graph::build_attn (tools/mtmd/clip.cpp:593-653).
For full details, see Flash Attention and Optimizations.
Sources: ggml/src/ggml-cuda/fattn.cu:272-482, ggml/src/ggml-cuda/fattn-common.cuh:1-25, ggml/src/ggml-cuda/fattn-mma-f16.cuh:10-141, tools/mtmd/clip.cpp:593-653
llama.cpp supports two approaches for spreading inference across multiple devices:
| Approach | Mechanism | Scope |
|---|---|---|
| Tensor splitting | ggml_backend_cuda_split_buffer_type | Multiple GPUs on one host |
| Pipeline parallelism | Layer assignment across devices | Multiple GPUs on one host |
| RPC backend | Remote ggml backend over TCP | Multiple machines |
The --tensor-split CLI flag controls how tensor data is divided across CUDA devices. The split buffer type allocates contiguous regions of each tensor on different devices according to the specified ratios.
The RPC backend (ggml-rpc) allows remote machines to expose their GPU or CPU as a ggml backend, and the scheduler distributes graph nodes to them the same way it would to local devices.
For full details, see Multi-GPU and Distributed Inference.
Multimodal input is implemented in the libmtmd library (tools/mtmd/), which bridges raw image/audio data to llama.cpp token embeddings.
Diagram: libmtmd Component Relationships
Diagram: Projector Type to Graph Builder Mapping
The mtmd_context struct (tools/mtmd/mtmd.cpp:121-399) holds both a vision clip_ctx * ctx_v and an audio clip_ctx * ctx_a. Each clip_ctx owns a ggml_backend_sched and dispatches the vision/audio encoder graph to GPU or CPU.
Key data structures:
| Struct | Purpose |
|---|---|
| mtmd_bitmap | Raw image (RGBRGB...) or audio data with optional user ID |
| mtmd_image_tokens | Preprocessed image patches as clip_image_f32_batch, with nx/ny patch counts |
| mtmd_audio_tokens | Preprocessed audio frames, token count |
| mtmd_input_chunk | Tagged union of text tokens, image tokens, or audio tokens |
| mtmd_input_chunks | Ordered sequence of chunks for one prompt |
The clip_graph::build_vit method (tools/mtmd/clip.cpp:288-457) implements the shared Vision Transformer forward pass used by most vision encoders. Architecture-specific subclasses override build() for custom patch merging, positional embedding handling, or audio stacking.
The clip_flash_attn_type field in clip_ctx (tools/mtmd/clip.cpp:159) controls whether the vision encoder uses fused attention. It maps from the llama-level llama_flash_attn_type enum via mtmd_get_clip_flash_attn_type (tools/mtmd/mtmd.cpp:95-102).
The public C API is declared in tools/mtmd/mtmd.h and covers:
- mtmd_init / mtmd_free — context lifecycle
- mtmd_tokenize — converts text+media into mtmd_input_chunks
- mtmd_encode_chunk — runs the vision/audio encoder graph
- mtmd_get_input_embd — retrieves the computed float embeddings

For full details, see Multimodal Support.
Sources: tools/mtmd/mtmd.cpp:121-399, tools/mtmd/mtmd.h:1-120, tools/mtmd/clip.cpp:141-221, tools/mtmd/clip.cpp:784-881, tools/mtmd/CMakeLists.txt:1-85, docs/multimodal.md:1-50