This page documents the speculative decoding feature in llama.cpp: the draft-target model architecture, the verification loop, KV cache state management, and how to configure it in both the standalone example and the HTTP server. For general token sampling (Top-K, Top-P, etc.), see 3.8. For the HTTP server that hosts speculative decoding behind an API, see 6.2.
Standard autoregressive decoding generates exactly one token per target-model forward pass. Speculative decoding reduces latency by using a small, fast draft model to generate a sequence of candidate tokens, then using the target model to verify the entire candidate sequence in a single batched forward pass.
The key insight is that a batched forward pass over N candidate tokens costs only slightly more than a single-token pass, so accepting several draft tokens at once improves throughput without changing the output distribution.
Two verification strategies are supported:
| Strategy | Condition | Residual on rejection |
|---|---|---|
| Greedy | temp == 0: accept if sampled target token == draft token | Discard all remaining draft tokens |
| Stochastic | temp > 0: accept draft token t if r ≤ p_tgt(t) / p_dft(t) | Sample from max(0, p_tgt − p_dft) normalized distribution |
Both strategies are provably equivalent in output distribution to running the target model alone.
Sources: examples/speculative/speculative.cpp236-418
Draft-target model pipeline:
Sources: examples/speculative/speculative.cpp67-175
seq_draft

Defined in examples/speculative/speculative.cpp18-30, with one instance per parallel draft branch:
| Field | Type | Purpose |
|---|---|---|
| active | bool | Whether this sequence is currently being used |
| drafting | bool | Whether this sequence is still generating draft tokens |
| skip | bool | Skips this sequence in the current draft iteration (used after a branch split) |
| i_batch_dft | int | Index of this sequence's last token in the draft batch |
| i_batch_tgt | vector<int> | Positions within batch_tgt for each drafted token |
| tokens | vector<llama_token> | The drafted token sequence |
| dists | vector<vector<llama_token_data>> | Probability distributions saved at each draft step (for stochastic verification) |
| smpl | common_sampler * | Per-branch sampler, cloned from the main sampler at the start of each round |
Sources: examples/speculative/speculative.cpp18-30
common_sampler

The target model uses a single shared common_sampler * (smpl) created with common_sampler_init. Each draft branch has its own common_sampler * obtained via common_sampler_clone. This ensures that sampler state (repetition penalties, grammar, history) evolves correctly along each speculative branch.
Sources: common/sampling.cpp385-394 examples/speculative/speculative.cpp196-203
common_sampler_sample_and_accept_n

Declared in common/sampling.h83-86 and implemented in common/sampling.cpp521-558.
This function encapsulates the inner verification loop for greedy speculative decoding. It:
1. Takes a vector of idxs (logit positions in the target batch) paired with the draft tokens.
2. Calls common_sampler_sample at each position.
3. Accepts each matching token via common_sampler_accept immediately.

Inputs:
- gsmpl - target model sampler
- ctx - target llama_context
- idxs - vector of logit indices, size = draft.size() + 1
- draft - vector of draft tokens
Returns:
accepted tokens (at least 1, at most idxs.size())
The overload without idxs assumes they are [0, 1, 2, ..., draft.size()].
Code flow:
Sources: common/sampling.cpp521-558 common/sampling.h67-86
The full speculative.cpp example implements the more complex stochastic verification itself (using common_sampler_get_candidates and ratio testing), because common_sampler_sample_and_accept_n only covers the greedy path.
High-level state machine per generation round:
Both models evaluate the full prompt before entering the speculative loop:
```
llama_decode(ctx_tgt, batch[0..n_input-2]) // all but last token
llama_decode(ctx_tgt, batch[n_input-1])    // last token alone (for logits)
llama_decode(ctx_dft, batch[0..n_input])   // all tokens
```

Sources: examples/speculative/speculative.cpp172-177
For each of n_draft steps, the draft model samples one token per active sequence (or more, when a branch split occurs) and adds them to batch_tgt and batch_dft. The draft model is evaluated once per step over all active sequences.
Branch splitting: if cur_p->data[f].p > p_draft_split, a new seq_draft entry is created by copying the KV cache and sampler state of the parent branch. This enables the target model to evaluate a tree of possible continuations.
Sources: examples/speculative/speculative.cpp482-588
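The split decision can be sketched as follows. The names p_draft_split and the candidate-index loop mirror speculative.cpp, but the surrounding KV-cache and sampler bookkeeping is elided, so treat this as a simplified model of the logic rather than the actual code.

```cpp
#include <vector>

struct Candidate { int tok; float p; };

// Given the draft model's candidates sorted by descending probability,
// return the candidate indices that each get their own draft branch:
// index 0 always continues the current branch; further candidates spawn
// a split only while their probability exceeds p_draft_split and free
// sequence slots remain.
std::vector<int> pick_branches(const std::vector<Candidate> &cur_p,
                               float p_draft_split, int free_slots) {
    std::vector<int> picked = {0};
    for (int f = 1; f < (int) cur_p.size() && free_slots > 0; ++f) {
        if (cur_p[f].p <= p_draft_split) {
            break; // remaining candidates are even less likely
        }
        picked.push_back(f);
        --free_slots;
    }
    return picked;
}
```

In the real code each split copies the parent branch's KV cache entries (via sequence copy) and clones its sampler, so every branch continues from an identical state.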
The target evaluates batch_tgt, which contains one token per position per branch. All branches are assigned to different seq_id values in the KV cache.
Sources: examples/speculative/speculative.cpp591-608
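The tree layout can be made concrete with a toy batch. BatchEntry below is a simplified stand-in for the fields of llama_batch (token, position, sequence ids); the token values and positions are invented for illustration.

```cpp
#include <vector>

struct BatchEntry {
    int token;
    int pos;                  // absolute position in the sequence
    std::vector<int> seq_ids; // branches whose KV cache this token belongs to
};

// A verification batch for two draft branches that share one drafted token
// and then diverge: the shared token is tagged with both sequence ids, the
// post-split tokens occupy the SAME position but different sequences.
std::vector<BatchEntry> build_tree_batch(int n_past) {
    return {
        {101, n_past + 0, {0, 1}}, // shared drafted token (before the split)
        {202, n_past + 1, {0}},    // branch 0 continuation
        {303, n_past + 1, {1}},    // branch 1 continuation, same position
    };
}
```

Because attention masking is derived from sequence ids, tokens in branch 0 never attend to branch 1's entries and vice versa, which is what lets one forward pass score a whole tree of continuations.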
After the verification loop determines which sequence (s_keep) to keep:
```
llama_memory_seq_keep(mem_dft, s_keep)
llama_memory_seq_cp  (mem_dft, s_keep, 0, -1, -1)
llama_memory_seq_keep(mem_dft, 0)

llama_memory_seq_rm  (mem_tgt, s_keep, n_past_tgt, -1) // trim accepted position
llama_memory_seq_keep(mem_tgt, s_keep)
llama_memory_seq_cp  (mem_tgt, s_keep, 0, -1, -1)
llama_memory_seq_keep(mem_tgt, 0)
```
All other branch sequences are discarded. The accepted sequence becomes sequence 0 in both contexts.
Sources: examples/speculative/speculative.cpp424-435
Before the speculative loop starts, speculative.cpp validates that the two models are compatible:
- The vocab types must match: llama_vocab_type(vocab_tgt) == llama_vocab_type(vocab_dft).
- The vocab sizes must satisfy |n_vocab_tgt - n_vocab_dft| <= SPEC_VOCAB_MAX_SIZE_DIFFERENCE (128).
- All tokens from SPEC_VOCAB_CHECK_START_TOKEN_ID (5) onward must have identical text strings.

Sources: examples/speculative/speculative.cpp104-145
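The size and token-text checks can be sketched on plain string vocabularies (the real code reads token text through the llama_vocab API; the constants mirror speculative.cpp):

```cpp
#include <cstdlib>
#include <string>
#include <vector>

constexpr int SPEC_VOCAB_MAX_SIZE_DIFFERENCE  = 128;
constexpr int SPEC_VOCAB_CHECK_START_TOKEN_ID = 5;

// Return true if a draft vocabulary is close enough to the target's:
// sizes differ by a bounded amount, and token text matches exactly from
// the check-start id onward (earlier ids are special/control tokens that
// are allowed to differ).
bool vocabs_compatible(const std::vector<std::string> &tgt,
                       const std::vector<std::string> &dft) {
    if (std::abs((int) tgt.size() - (int) dft.size()) > SPEC_VOCAB_MAX_SIZE_DIFFERENCE) {
        return false;
    }
    const int n = (int) std::min(tgt.size(), dft.size());
    for (int i = SPEC_VOCAB_CHECK_START_TOKEN_ID; i < n; ++i) {
        if (tgt[i] != dft[i]) {
            return false;
        }
    }
    return true;
}
```

Token ids below the check-start threshold are skipped because BOS/EOS and other control tokens often have model-specific text even in otherwise identical tokenizers.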
llama-server integrates speculative decoding transparently. The draft model is loaded alongside the main model at startup. Slot processing uses the same draft-then-verify loop; the output is identical to non-speculative inference.
Relevant CLI parameters for llama-server:
| Flag | Default | Description |
|---|---|---|
| -md, --model-draft FNAME | none | Path to the draft model GGUF file |
| --draft, --draft-max N | 16 | Maximum tokens to draft per round |
| --draft-min N | 0 | Minimum tokens to draft before verifying |
| --draft-p-min P | 0.75 | Minimum greedy probability to continue drafting |
| -cd, --ctx-size-draft N | 0 (from model) | Context size for the draft model |
| -ngld, --gpu-layers-draft N | auto | GPU layers for the draft model |
| -devd, --device-draft | — | Devices to offload the draft model to |
| -td, --threads-draft N | (same as --threads) | Generation threads for draft |
| -tbd, --threads-batch-draft N | (same as -td) | Batch threads for draft |
| -ctkd, --cache-type-k-draft | f16 | KV cache K type for draft model |
| -ctvd, --cache-type-v-draft | f16 | KV cache V type for draft model |
| --spec-replace TARGET DRAFT | — | String translation when vocabs differ |
| -otd, --override-tensor-draft | — | Per-tensor buffer type overrides for draft |
Sources: tools/server/README.md117-232
Example invocation:
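A representative command line using the flags documented above; the model file names are placeholders, not real artifacts:

```sh
# hypothetical paths: pair a large target model with a small draft model
# from the same family so the vocab compatibility checks pass
llama-server \
  -m  models/target-70b-q4_k_m.gguf \
  -md models/draft-8b-q4_0.gguf \
  --draft-max 16 --draft-p-min 0.75 \
  -ngld 99 \
  --port 8080
```

Draft models pay off most when the target is memory-bandwidth bound: the draft forward passes are cheap, and each accepted token saves one full target pass.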
The draft model can also be loaded from HuggingFace with -hfd, --hf-repo-draft.
Built-in preset combinations (which pre-select a matching main + draft model pair) are also available, e.g. --fim-qwen-7b-spec and --fim-qwen-14b-spec.
Sources: tools/server/README.md104-238
The test suite at tools/server/tests/unit/test_speculative.py verifies:
- Greedy (temperature=0.0, top_k=1) output with a draft model matches the output without a draft model.
- Different (draft_min, draft_max) combinations all produce the same greedy output.

The tests use stories15m_moe (F16 MoE) as the target model and stories15M-q4_0.gguf as the draft model.
Sources: tools/server/tests/unit/test_speculative.py1-117
Sources: examples/speculative/speculative.cpp common/sampling.h common/sampling.cpp
- Under greedy decoding (temp=0), the draft acceptance rate is deterministic and equals the fraction of positions where the draft model predicts the same top token as the target model.
- --draft-p-min controls early stopping of the draft phase: if the draft model's top-token probability drops below this threshold, drafting stops early to avoid spending time on low-confidence tokens that are likely to be rejected.
- Multiple draft sequences (n_seq_dft > 1 in speculative.cpp) improve acceptance rates at the cost of higher KV cache usage by exploring multiple candidate continuations simultaneously.