This page documents the token sampling and generation system in llama.cpp, which is responsible for selecting the next token during text generation. This includes the sampling pipeline architecture, available sampling algorithms, parameter configuration, and integration with the inference loop.
For information about the overall inference flow, see Inference Context and Orchestration. For computation graph building, see Computation Graph Building. For batch processing of tokens, see Batch Processing Pipeline.
Token sampling is the final step in text generation where the model's output logits (probability scores for each token in the vocabulary) are transformed and used to select the next token. The system uses a sampler chain architecture where multiple sampling algorithms can be composed together, each transforming the logit distribution before the final token selection.
The sampling process occurs after the model has computed logits for the current position. These logits represent the model's "confidence" for each possible next token. Sampling algorithms apply various transformations (filtering, temperature scaling, etc.) to these logits before selecting a token, which can be done greedily (always pick the highest probability) or stochastically (sample according to probabilities).
Sources: README.md322-503 common/common.h179-245
Sampler Chain Architecture
The sampler chain is implemented as a linked sequence of individual samplers. Each sampler in the chain receives a llama_token_data_array (containing token IDs, logits, and probabilities) and modifies it before passing it to the next sampler.
Sources: common/common.h111-125 include/llama.h198-211
| Structure | Purpose | Location |
|---|---|---|
| llama_token_data | Holds a single token's ID, logit, and probability | include/llama.h198-202 |
| llama_token_data_array | Array of token data with metadata (size, selection index, sorted flag) | include/llama.h204-211 |
| llama_sampler | Opaque sampler object | include/llama.h63 |
| common_params_sampling | Configuration parameters for sampling | common/common.h180-245 |
| common_sampler_type | Enum of available sampler types | common/common.h111-125 |
Sources: include/llama.h198-211 common/common.h111-125 common/common.h180-245
Parameter Configuration Flow
| Parameter | Type | Default | Description |
|---|---|---|---|
| seed | uint32_t | LLAMA_DEFAULT_SEED | Random seed for sampling |
| n_prev | int32_t | 64 | Number of previous tokens to remember |
| n_probs | int32_t | 0 | Output top N token probabilities (0 = disabled) |
| min_keep | int32_t | 0 | Minimum tokens to keep after filtering |
| top_k | int32_t | 40 | Top-K filtering (≤0 = disabled) |
| top_p | float | 0.95 | Top-P (nucleus) sampling (1.0 = disabled) |
| min_p | float | 0.05 | Min-P filtering (0.0 = disabled) |
| temp | float | 0.80 | Temperature (≤0.0 = greedy sampling; 0.0 also disables probability output) |
Sources: common/common.h180-212 common/arg.cpp420-428
| Parameter | Type | Default | Description |
|---|---|---|---|
| penalty_last_n | int32_t | 64 | Tokens to penalize (0 = disabled, -1 = context size) |
| penalty_repeat | float | 1.0 | Repetition penalty (1.0 = disabled) |
| penalty_freq | float | 0.0 | Frequency penalty (0.0 = disabled) |
| penalty_present | float | 0.0 | Presence penalty (0.0 = disabled) |
Sources: common/common.h195-198
| Parameter | Type | Default | Description |
|---|---|---|---|
| dry_multiplier | float | 0.0 | DRY penalty multiplier (0.0 = disabled) |
| dry_base | float | 1.75 | Base for exponential penalty |
| dry_allowed_length | int32_t | 2 | Minimum repetition length before penalty |
| dry_penalty_last_n | int32_t | -1 | Context window for DRY (-1 = full context) |
| dry_sequence_breakers | std::vector<std::string> | {"\n", ":", "\"", "*"} | Tokens that break repetition sequences |
Sources: common/common.h199-215
| Parameter | Type | Default | Description |
|---|---|---|---|
| xtc_probability | float | 0.0 | XTC sampling probability (0.0 = disabled) |
| xtc_threshold | float | 0.1 | XTC threshold (>0.5 disables XTC) |
| typ_p | float | 1.0 | Typical P sampling (1.0 = disabled) |
| dynatemp_range | float | 0.0 | Dynamic temperature range (0.0 = disabled) |
| dynatemp_exponent | float | 1.0 | Dynamic temperature exponent |
| top_n_sigma | float | -1.0 | Top-N sigma filtering (-1.0 = disabled) |
| mirostat | int32_t | 0 | Mirostat mode (0 = off, 1 = v1, 2 = v2) |
| mirostat_tau | float | 5.0 | Mirostat target entropy |
| mirostat_eta | float | 0.1 | Mirostat learning rate |
Sources: common/common.h189-208
Sampling Algorithm Pipeline
Repetition Penalties
Purpose: Discourages the model from repeating tokens that have appeared recently in the context.
Types:
- Repeat penalty: penalty_repeat applied to recently seen tokens
- Frequency penalty: penalty that grows with how often a token has appeared
- Presence penalty: flat penalty applied once a token has appeared at all

Configuration:
- penalty_last_n: Number of recent tokens to consider
- penalty_repeat: Multiplier for repeat penalty (>1.0 = penalize, <1.0 = encourage)
- penalty_freq: Frequency penalty strength
- penalty_present: Presence penalty strength

Sources: common/common.h195-198
DRY (Don't Repeat Yourself) Sampling
Purpose: Advanced repetition penalty that penalizes multi-token sequences, not just individual tokens.
Mechanism:
- Scans the recent context for repeated token sequences
- Applies a penalty of multiplier * base^(sequence_length - allowed_length)

Configuration:
- dry_multiplier: Overall penalty strength
- dry_base: Base for exponential growth
- dry_allowed_length: Minimum sequence length before penalty applies
- dry_sequence_breakers: Tokens that break sequences

Sources: common/common.h199-215
Top-K Sampling
Purpose: Limits consideration to the K most probable tokens.
Algorithm:
- Sort tokens by logit in descending order
- Keep only the first K tokens; discard the rest

Configuration:
- top_k: Number of tokens to keep (≤0 = keep all)

Effect: Prevents the model from selecting very unlikely tokens, improving coherence but potentially reducing creativity.
Sources: common/common.h186
Top-P (Nucleus) Sampling
Purpose: Dynamically selects a set of tokens whose cumulative probability exceeds P.
Algorithm:
- Sort tokens by probability in descending order
- Accumulate probabilities until the running sum reaches P
- Keep only the accumulated tokens

Configuration:
- top_p: Cumulative probability threshold (1.0 = disabled)

Effect: Adapts to the confidence of the model's predictions: uses more tokens when the model is uncertain, fewer when confident.
Sources: common/common.h187
Min-P Sampling
Purpose: Filters tokens based on their probability relative to the top token.
Algorithm:
- Find the highest token probability, max_prob
- Compute threshold = min_p * max_prob
- Keep only tokens with probability >= threshold

Configuration:
- min_p: Minimum relative probability (0.0 = disabled)

Effect: More adaptive than fixed thresholding; preserves more tokens when the distribution is flat, fewer when it's peaked.
Sources: common/common.h188
XTC (Exclude Top Choices) Sampling
Purpose: Occasionally excludes high-probability tokens to encourage diversity.
Algorithm:
- With probability xtc_probability, find all tokens whose probability exceeds xtc_threshold
- Remove every such token except the least probable of them

Configuration:
- xtc_probability: How often to exclude top choices
- xtc_threshold: Probability threshold (>0.5 disables XTC, since at most one token can exceed it)

Effect: Adds controlled randomness to prevent the model from being too predictable.
Sources: common/common.h189-190
Temperature Scaling
Purpose: Controls the randomness of the sampling distribution.
Algorithm:
- Each logit is divided by the temperature: logit_scaled = logit / temperature

Configuration:
- temp: Temperature value
  - temp < 1.0: Sharper distribution (more deterministic)
  - temp = 1.0: Unchanged distribution
  - temp > 1.0: Flatter distribution (more random)
  - temp ≤ 0.0: Greedy selection (argmax)

Effect: Primary control for generation randomness vs. determinism.
Sources: common/common.h192
Typical-P Sampling
Purpose: Samples tokens that are "typically" expected based on entropy.
Algorithm:
- Compute the entropy of the token distribution
- Rank tokens by how close their surprisal (-log p) is to that entropy
- Keep tokens until their cumulative probability reaches typ_p

Configuration:
- typ_p: Typical probability mass (1.0 = disabled)

Effect: Balances between too-obvious and too-surprising token choices.
Sources: common/common.h191
Mirostat Sampling
Purpose: Maintains consistent text perplexity (predictability) over time.
Algorithm:
- Measure the surprise of each sampled token against a target entropy (mirostat_tau)
- Use a learning rate (mirostat_eta) to adapt the sampling threshold toward the target

Modes:
- 1: Mirostat v1 (original algorithm)
- 2: Mirostat v2 (simplified variant)

Configuration:
- mirostat: Mode (0 = off, 1 = v1, 2 = v2)
- mirostat_tau: Target entropy
- mirostat_eta: Adaptation rate

Effect: Creates more consistent text quality by preventing the model from becoming too confident or too uncertain.
Sources: common/common.h205-208
Top-N Sigma Sampling
Purpose: Statistical filtering based on standard deviation.
Algorithm:
- Compute the standard deviation of the logits
- Keep only tokens whose logit lies within top_n_sigma standard deviations of the maximum logit

Configuration:
- top_n_sigma: Number of standard deviations (-1.0 = disabled)

Effect: Removes statistical outliers from consideration.
Sources: common/common.h206
Inference Loop Integration
The typical usage pattern for token sampling in an inference loop:
1. Initialize a sampler chain and add the desired samplers
2. Call llama_decode() to compute logits for the current batch
3. Call llama_sampler_sample() to select the next token
4. Feed the sampled token back as the input for the next decode step

Example Code Flow: examples/save-load-state/save-load-state.cpp47-101
Sources: examples/save-load-state/save-load-state.cpp47-101
| Function | Purpose | Location |
|---|---|---|
| llama_sampler_chain_init() | Create a new sampler chain | include/llama.h |
| llama_sampler_chain_add() | Add sampler to chain | include/llama.h |
| llama_sampler_sample() | Sample next token from context | include/llama.h |
| llama_sampler_free() | Free sampler resources | include/llama.h |
| llama_sampler_init_dist() | Create distribution sampler | include/llama.h |
| llama_sampler_init_penalties() | Create penalty sampler | include/llama.h |
| llama_sampler_init_top_k() | Create top-k sampler | include/llama.h |
| llama_sampler_init_top_p() | Create top-p sampler | include/llama.h |
| llama_sampler_init_temp() | Create temperature sampler | include/llama.h |
Sources: include/llama.h198-211 examples/save-load-state/save-load-state.cpp47-52
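The functions in the table compose as follows; this is a hedged fragment, not a complete program (it assumes an already-initialized llama_context named ctx with a decoded batch, and omits all error handling):

```cpp
// Build a chain: top-k -> top-p -> temperature -> final distribution sampler.
llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());

llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.95f, 1));
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.80f));
llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

// idx = -1 samples from the logits of the last token in the batch.
llama_token id = llama_sampler_sample(smpl, ctx, -1);

llama_sampler_free(smpl);
```

Note that the chain must end in a selection sampler (such as llama_sampler_init_dist() or a greedy sampler); the earlier entries only transform the candidate distribution.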
Grammar-Constrained Generation
The sampling system supports constraining output to follow formal grammars specified in GBNF (GGML BNF) format. This enables structured output generation (e.g., JSON, specific formats).
| Type | Description | Use Case |
|---|---|---|
| GBNF File | User-provided grammar file | Custom structured formats |
| JSON Mode | Built-in JSON grammar | API responses, structured data |
| Lazy Grammars | Applied only when triggered | Conditional structure enforcement |
Grammars can be applied conditionally using triggers:
Trigger Types (common_grammar_trigger_type):
- COMMON_GRAMMAR_TRIGGER_TYPE_TOKEN: Activate on specific token
- COMMON_GRAMMAR_TRIGGER_TYPE_WORD: Activate on word match
- COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN: Activate on regex pattern
- COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL: Activate on full text pattern

Configuration:
- grammar: GBNF grammar string
- grammar_lazy: Enable lazy application
- grammar_triggers: Vector of trigger conditions
- preserved_tokens: Tokens to exclude from grammar constraints

Sources: common/common.h139-232 README.md356-367
Logit Bias
Logit bias allows directly manipulating token probabilities before sampling.
Data Structure:
struct llama_logit_bias {
    llama_token token; // Token ID to bias
    float bias;        // Bias value (added to logit)
};
Configuration:
- logit_bias: General logit biases
- logit_bias_eog: Pre-calculated biases for end-of-generation tokens

Effects:
- A positive bias raises a token's logit, making it more likely to be selected
- A negative bias lowers it; a very large negative bias effectively bans the token

Common Uses:
- Banning unwanted tokens or phrases
- Encouraging or discouraging end-of-generation
Sources: common/common.h234-241 include/llama.h198-202
Performance Tracking
The sampling system tracks performance metrics when enabled:
Metrics Available:
- timing_per_token: Per-token timing information
- no_perf: Disable performance tracking (reduces overhead)

These metrics are accessible through llama_perf_context_print() and provide insight into sampling overhead relative to model inference.
Sources: common/common.h210-211 examples/embedding/embedding.cpp404
Default Sampler Ordering
The default sampler order is optimized for typical use:
1. Penalties (modifies logits, cheap)
2. DRY (scans context, moderate cost)
3. Top-N Sigma (statistical filtering)
4. Top-K (reduces array size)
5. Typical-P (entropy-based)
6. Top-P (cumulative probability)
7. Min-P (relative threshold)
8. XTC (probabilistic exclusion)
9. Temperature (final scaling)
Rationale: Cheap logit adjustments run first, filters that shrink the candidate set run before more expensive ones, and temperature scaling runs last. This order minimizes computational work by reducing the candidate set early.
Configuration: Users can customize order via samplers vector in common_params_sampling.
Sources: common/common.h217-227
State Persistence
Sampler state (including RNG state) can be saved and restored as part of context state:
Functions:
- llama_state_get_data(): Serialize context state, including sampler state
- llama_state_set_data(): Restore context state
- llama_state_seq_get_data(): Serialize a specific sequence
- llama_state_seq_set_data(): Restore a specific sequence

This enables:
- Pausing and resuming generation across process restarts
- Reproducible sampling from a saved state
Example Usage: examples/save-load-state/save-load-state.cpp68-188
Sources: examples/save-load-state/save-load-state.cpp68-188 include/llama.h
Backend Sampling
When backend_sampling = true, sampling operations are offloaded to backend implementations (GPU when available), potentially improving performance.
Configuration:
- backend_sampling: Enable backend-accelerated sampling

Sources: common/common.h237
Adaptive Sampling Parameters
Some sampling parameters can adapt dynamically:
Adaptive-P Sampling:
- adaptive_target: Target probability to maintain
- adaptive_decay: EMA decay rate for adaptation

Dynamic Temperature:
- dynatemp_range: Range for temperature variation
- dynatemp_exponent: Controls entropy-to-temperature mapping