This page documents the speculative decoding feature in llama.cpp: the draft-target model architecture, the verification loop, KV cache state management, and how to configure it in both the standalone example and the HTTP server. For general token sampling (Top-K, Top-P, etc.), see 3.8. For the HTTP server that hosts speculative decoding behind an API, see 6.2.
Standard autoregressive decoding generates exactly one token per target-model forward pass. Speculative decoding reduces latency by using a small, fast draft model to generate a sequence of candidate tokens, then using the target model to verify the entire candidate sequence in a single batched forward pass.
The key insight is that a batched forward pass over N candidate tokens costs only slightly more than a single-token pass, so accepting several draft tokens at once improves throughput without changing the output distribution.
Two verification strategies are supported:
| Strategy | Condition | Residual on rejection |
|---|---|---|
| Greedy | temp == 0: accept if sampled target token == draft token | Discard all remaining draft tokens |
| Stochastic | temp > 0: accept draft token t if r ≤ p_tgt(t) / p_dft(t) | Sample from max(0, p_tgt − p_dft) normalized distribution |
Both strategies are provably equivalent in output distribution to running the target model alone.
Sources: examples/speculative/speculative.cpp236-418
Draft-target model pipeline:
Sources: examples/speculative/speculative.cpp67-175
seq_draft

Defined in examples/speculative/speculative.cpp18-30, with one instance per parallel draft branch:
| Field | Type | Purpose |
|---|---|---|
| active | bool | Whether this sequence is currently being used |
| drafting | bool | Whether this sequence is still generating draft tokens |
| skip | bool | Skips this sequence in the current draft iteration (used after a branch split) |
| i_batch_dft | int | Index of this sequence's last token in the draft batch |
| i_batch_tgt | vector<int> | Positions within batch_tgt for each drafted token |
| tokens | vector<llama_token> | The drafted token sequence |
| dists | vector<vector<llama_token_data>> | Probability distributions saved at each draft step (for stochastic verification) |
| smpl | common_sampler * | Per-branch sampler, cloned from the main sampler at the start of each round |
Sources: examples/speculative/speculative.cpp18-30
common_sampler

The target model uses a single shared common_sampler * (smpl) created with common_sampler_init. Each draft branch has its own common_sampler * obtained via common_sampler_clone. This ensures that sampler state (repetition penalties, grammar, history) evolves correctly along each speculative branch.
Sources: common/sampling.cpp385-394 examples/speculative/speculative.cpp196-203
common_sampler_sample_and_accept_n

Declared in common/sampling.h83-86 and implemented in common/sampling.cpp521-558.
This function encapsulates the inner verification loop for greedy speculative decoding. It:
1. Takes a vector of idxs (logit positions in the target batch) paired with the draft tokens.
2. Calls common_sampler_sample at each position.
3. Accepts each matching token via common_sampler_accept immediately.

Inputs:
- gsmpl - target model sampler
- ctx - target llama_context
- idxs - vector of logit indices, size = draft.size() + 1
- draft - vector of draft tokens
Returns:
accepted tokens (at least 1, at most idxs.size())
The overload without idxs assumes they are [0, 1, 2, ..., draft.size()].
Code flow:
Sources: common/sampling.cpp521-558 common/sampling.h67-86
The full speculative.cpp example implements the more complex stochastic verification itself (using common_sampler_get_candidates and ratio testing), because common_sampler_sample_and_accept_n only covers the greedy path.
High-level state machine per generation round:
Both models evaluate the full prompt before entering the speculative loop:
```
llama_decode(ctx_tgt, batch[0..n_input-2]) // all but last token
llama_decode(ctx_tgt, batch[n_input-1])    // last token alone (for logits)
llama_decode(ctx_dft, batch[0..n_input])   // all tokens
```

Sources: examples/speculative/speculative.cpp172-177
For each of n_draft steps, the draft model samples one token per active sequence (or more, when a branch split occurs) and adds them to batch_tgt and batch_dft. The draft model is evaluated once per step over all active sequences.
Branch splitting: if cur_p->data[f].p > p_draft_split, a new seq_draft entry is created by copying the KV cache and sampler state of the parent branch. This enables the target model to evaluate a tree of possible continuations.
Sources: examples/speculative/speculative.cpp482-588
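The split decision can be sketched as follows. The names p_draft_split and the candidate-index loop mirror speculative.cpp, but the surrounding KV-cache and sampler bookkeeping is elided, so treat this as a simplified model of the logic rather than the actual code.

```cpp
#include <vector>

struct Candidate { int tok; float p; };

// Given the draft model's candidates sorted by descending probability,
// return the candidate indices that each get their own draft branch:
// index 0 always continues the current branch; further candidates spawn
// a split only while their probability exceeds p_draft_split and free
// sequence slots remain.
std::vector<int> pick_branches(const std::vector<Candidate> &cur_p,
                               float p_draft_split, int free_slots) {
    std::vector<int> picked = {0};
    for (int f = 1; f < (int) cur_p.size() && free_slots > 0; ++f) {
        if (cur_p[f].p <= p_draft_split) {
            break; // remaining candidates are even less likely
        }
        picked.push_back(f);
        --free_slots;
    }
    return picked;
}
```

In the real code each split copies the parent branch's KV cache entries (via sequence copy) and clones its sampler, so every branch continues from an identical state.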
The target evaluates batch_tgt, which contains one token per position per branch. All branches are assigned to different seq_id values in the KV cache.
Sources: examples/speculative/speculative.cpp591-608
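The tree layout can be made concrete with a toy batch. BatchEntry below is a simplified stand-in for the fields of llama_batch (token, position, sequence ids); the token values and positions are invented for illustration.

```cpp
#include <vector>

struct BatchEntry {
    int token;
    int pos;                  // absolute position in the sequence
    std::vector<int> seq_ids; // branches whose KV cache this token belongs to
};

// A verification batch for two draft branches that share one drafted token
// and then diverge: the shared token is tagged with both sequence ids, the
// post-split tokens occupy the SAME position but different sequences.
std::vector<BatchEntry> build_tree_batch(int n_past) {
    return {
        {101, n_past + 0, {0, 1}}, // shared drafted token (before the split)
        {202, n_past + 1, {0}},    // branch 0 continuation
        {303, n_past + 1, {1}},    // branch 1 continuation, same position
    };
}
```

Because attention masking is derived from sequence ids, tokens in branch 0 never attend to branch 1's entries and vice versa, which is what lets one forward pass score a whole tree of continuations.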
After the verification loop determines which sequence (s_keep) to keep:
```
llama_memory_seq_keep(mem_dft, s_keep)
llama_memory_seq_cp  (mem_dft, s_keep, 0, -1, -1)
llama_memory_seq_keep(mem_dft, 0)

llama_memory_seq_rm  (mem_tgt, s_keep, n_past_tgt, -1) // trim accepted position
llama_memory_seq_keep(mem_tgt, s_keep)
llama_memory_seq_cp  (mem_tgt, s_keep, 0, -1, -1)
llama_memory_seq_keep(mem_tgt, 0)
```
All other branch sequences are discarded. The accepted sequence becomes sequence 0 in both contexts.
Sources: examples/speculative/speculative.cpp424-435
Before the speculative loop starts, speculative.cpp validates that the two models are compatible:
- The vocab types must match: llama_vocab_type(vocab_tgt) == llama_vocab_type(vocab_dft).
- The vocab sizes must satisfy |n_vocab_tgt - n_vocab_dft| <= SPEC_VOCAB_MAX_SIZE_DIFFERENCE (128).
- All tokens from SPEC_VOCAB_CHECK_START_TOKEN_ID (5) onward must have identical text strings.

Sources: examples/speculative/speculative.cpp104-145
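The size and token-text checks can be sketched on plain string vocabularies (the real code reads token text through the llama_vocab API; the constants mirror speculative.cpp):

```cpp
#include <cstdlib>
#include <string>
#include <vector>

constexpr int SPEC_VOCAB_MAX_SIZE_DIFFERENCE  = 128;
constexpr int SPEC_VOCAB_CHECK_START_TOKEN_ID = 5;

// Return true if a draft vocabulary is close enough to the target's:
// sizes differ by a bounded amount, and token text matches exactly from
// the check-start id onward (earlier ids are special/control tokens that
// are allowed to differ).
bool vocabs_compatible(const std::vector<std::string> &tgt,
                       const std::vector<std::string> &dft) {
    if (std::abs((int) tgt.size() - (int) dft.size()) > SPEC_VOCAB_MAX_SIZE_DIFFERENCE) {
        return false;
    }
    const int n = (int) std::min(tgt.size(), dft.size());
    for (int i = SPEC_VOCAB_CHECK_START_TOKEN_ID; i < n; ++i) {
        if (tgt[i] != dft[i]) {
            return false;
        }
    }
    return true;
}
```

Token ids below the check-start threshold are skipped because BOS/EOS and other control tokens often have model-specific text even in otherwise identical tokenizers.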
llama-server integrates speculative decoding transparently. The draft model is loaded alongside the main model at startup. Slot processing uses the same draft-then-verify loop; the output is identical to non-speculative inference.
Relevant CLI parameters for llama-server:
| Flag | Default | Description |
|---|---|---|
| -md, --model-draft FNAME | none | Path to the draft model GGUF file |
| --draft, --draft-max N | 16 | Maximum tokens to draft per round |
| --draft-min N | 0 | Minimum tokens to draft before verifying |
| --draft-p-min P | 0.75 | Minimum greedy probability to continue drafting |
| -cd, --ctx-size-draft N | 0 (from model) | Context size for the draft model |
| -ngld, --gpu-layers-draft N | auto | GPU layers for the draft model |
| -devd, --device-draft | — | Devices to offload the draft model to |
| -td, --threads-draft N | (same as --threads) | Generation threads for draft |
| -tbd, --threads-batch-draft N | (same as -td) | Batch threads for draft |
| -ctkd, --cache-type-k-draft | f16 | KV cache K type for draft model |
| -ctvd, --cache-type-v-draft | f16 | KV cache V type for draft model |
| --spec-replace TARGET DRAFT | — | String translation when vocabs differ |
| -otd, --override-tensor-draft | — | Per-tensor buffer type overrides for draft |
Sources: tools/server/README.md117-232
Example invocation:
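A representative command line using the flags documented above; the model file names are placeholders, not real artifacts:

```sh
# hypothetical paths: pair a large target model with a small draft model
# from the same family so the vocab compatibility checks pass
llama-server \
  -m  models/target-70b-q4_k_m.gguf \
  -md models/draft-8b-q4_0.gguf \
  --draft-max 16 --draft-p-min 0.75 \
  -ngld 99 \
  --port 8080
```

Draft models pay off most when the target is memory-bandwidth bound: the draft forward passes are cheap, and each accepted token saves one full target pass.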
The draft model can also be loaded from HuggingFace with -hfd, --hf-repo-draft.
Built-in preset combinations (which pre-select a matching main + draft model pair) are also available, e.g. --fim-qwen-7b-spec and --fim-qwen-14b-spec.
Sources: tools/server/README.md104-238
The test suite at tools/server/tests/unit/test_speculative.py verifies:
- Greedy (temperature=0.0, top_k=1) output with a draft model matches the output without a draft model.
- Different (draft_min, draft_max) combinations all produce the same greedy output.

The tests use stories15m_moe (F16 MoE) as the target model and stories15M-q4_0.gguf as the draft model.
Sources: tools/server/tests/unit/test_speculative.py1-117
Sources: examples/speculative/speculative.cpp common/sampling.h common/sampling.cpp
- Under greedy decoding (temp=0), the draft acceptance rate is deterministic and equals the fraction of positions where the draft model predicts the same top token as the target model.
- --draft-p-min controls early stopping of the draft phase: if the draft model's top-token probability drops below this threshold, drafting stops early to avoid spending time on low-confidence tokens that are likely to be rejected.
- Multiple draft sequences (n_seq_dft > 1 in speculative.cpp) improve acceptance rates at the cost of higher KV cache usage by exploring multiple candidate continuations simultaneously.