This page documents the token sampling and generation system in llama.cpp, which is responsible for selecting the next token during text generation. This includes the sampling pipeline architecture, available sampling algorithms, parameter configuration, and integration with the inference loop.
For information about the overall inference flow, see Inference Context and Orchestration. For computation graph building, see Computation Graph Building. For batch processing of tokens, see Batch Processing Pipeline.
Token sampling is the final step in text generation where the model's output logits (probability scores for each token in the vocabulary) are transformed and used to select the next token. The system uses a sampler chain architecture where multiple sampling algorithms can be composed together, each transforming the logit distribution before the final token selection.
The sampling process occurs after the model has computed logits for the current position. These logits represent the model's "confidence" for each possible next token. Sampling algorithms apply various transformations (filtering, temperature scaling, etc.) to these logits before selecting a token, which can be done greedily (always pick the highest probability) or stochastically (sample according to probabilities).
Sources: README.md322-503 common/common.h179-245
Sampler Chain Architecture
The sampler chain is implemented as a linked sequence of individual samplers. Each sampler in the chain receives a llama_token_data_array (containing token IDs, logits, and probabilities) and modifies it before passing it to the next sampler.
Sources: common/common.h111-125 include/llama.h198-211
| Structure | Purpose | Location |
|---|---|---|
| llama_token_data | Holds a single token's ID, logit, and probability | include/llama.h198-202 |
| llama_token_data_array | Array of token data with metadata (size, selection index, sorted flag) | include/llama.h204-211 |
| llama_sampler | Opaque sampler object | include/llama.h63 |
| common_params_sampling | Configuration parameters for sampling | common/common.h180-245 |
| common_sampler_type | Enum of available sampler types | common/common.h111-125 |
Sources: include/llama.h198-211 common/common.h111-125 common/common.h180-245
Parameter Configuration Flow
| Parameter | Type | Default | Description |
|---|---|---|---|
| seed | uint32_t | LLAMA_DEFAULT_SEED | Random seed for sampling |
| n_prev | int32_t | 64 | Number of previous tokens to remember |
| n_probs | int32_t | 0 | Output top N token probabilities (0 = disabled) |
| min_keep | int32_t | 0 | Minimum tokens to keep after filtering |
| top_k | int32_t | 40 | Top-K filtering (≤0 = disabled) |
| top_p | float | 0.95 | Top-P (nucleus) sampling (1.0 = disabled) |
| min_p | float | 0.05 | Min-P filtering (0.0 = disabled) |
| temp | float | 0.80 | Temperature (≤0.0 = greedy sampling; 0.0 also disables probability output) |
Sources: common/common.h180-212 common/arg.cpp420-428
| Parameter | Type | Default | Description |
|---|---|---|---|
| penalty_last_n | int32_t | 64 | Tokens to penalize (0 = disabled, -1 = context size) |
| penalty_repeat | float | 1.0 | Repetition penalty (1.0 = disabled) |
| penalty_freq | float | 0.0 | Frequency penalty (0.0 = disabled) |
| penalty_present | float | 0.0 | Presence penalty (0.0 = disabled) |
Sources: common/common.h195-198
| Parameter | Type | Default | Description |
|---|---|---|---|
| dry_multiplier | float | 0.0 | DRY penalty multiplier (0.0 = disabled) |
| dry_base | float | 1.75 | Base for exponential penalty |
| dry_allowed_length | int32_t | 2 | Minimum repetition length before penalty |
| dry_penalty_last_n | int32_t | -1 | Context window for DRY (-1 = full context) |
| dry_sequence_breakers | std::vector<std::string> | {"\n", ":", "\"", "*"} | Tokens that break repetition sequences |
Sources: common/common.h199-215
| Parameter | Type | Default | Description |
|---|---|---|---|
| xtc_probability | float | 0.0 | XTC sampling probability (0.0 = disabled) |
| xtc_threshold | float | 0.1 | XTC threshold (>0.5 disables XTC) |
| typ_p | float | 1.0 | Typical P sampling (1.0 = disabled) |
| dynatemp_range | float | 0.0 | Dynamic temperature range (0.0 = disabled) |
| dynatemp_exponent | float | 1.0 | Dynamic temperature exponent |
| top_n_sigma | float | -1.0 | Top-N sigma filtering (-1.0 = disabled) |
| mirostat | int32_t | 0 | Mirostat mode (0 = off, 1 = v1, 2 = v2) |
| mirostat_tau | float | 5.0 | Mirostat target entropy |
| mirostat_eta | float | 0.1 | Mirostat learning rate |
Sources: common/common.h189-208
Sampling Algorithm Pipeline
Repetition Penalties
Purpose: Discourages the model from repeating tokens that have appeared recently in the context.
Types:
- Repeat penalty: penalty_repeat applied to recently seen tokens
- Frequency penalty: penalty that grows with how often a token has appeared
- Presence penalty: flat penalty applied once a token has appeared at all

Configuration:
- penalty_last_n: Number of recent tokens to consider
- penalty_repeat: Multiplier for repeat penalty (>1.0 = penalize, <1.0 = encourage)
- penalty_freq: Frequency penalty strength
- penalty_present: Presence penalty strength

Sources: common/common.h195-198
DRY (Don't Repeat Yourself) Sampling
Purpose: Advanced repetition penalty that penalizes multi-token sequences, not just individual tokens.
Mechanism:
- Scans the recent context for repeated token sequences
- Applies a penalty of multiplier * base^(sequence_length - allowed_length)

Configuration:
- dry_multiplier: Overall penalty strength
- dry_base: Base for exponential growth
- dry_allowed_length: Minimum sequence length before penalty applies
- dry_sequence_breakers: Tokens that break sequences

Sources: common/common.h199-215
Top-K Sampling
Purpose: Limits consideration to the K most probable tokens.
Algorithm:
- Sort tokens by logit in descending order
- Keep only the first K tokens; discard the rest

Configuration:
- top_k: Number of tokens to keep (≤0 = keep all)

Effect: Prevents the model from selecting very unlikely tokens, improving coherence but potentially reducing creativity.
Sources: common/common.h186
Top-P (Nucleus) Sampling
Purpose: Dynamically selects a set of tokens whose cumulative probability exceeds P.
Algorithm:
- Sort tokens by probability in descending order
- Accumulate probabilities until the running sum reaches P
- Keep only the accumulated tokens

Configuration:
- top_p: Cumulative probability threshold (1.0 = disabled)

Effect: Adapts to the confidence of the model's predictions: uses more tokens when the model is uncertain, fewer when confident.
Sources: common/common.h187
Min-P Sampling
Purpose: Filters tokens based on their probability relative to the top token.
Algorithm:
- Find the highest token probability, max_prob
- Compute threshold = min_p * max_prob
- Keep only tokens with probability >= threshold

Configuration:
- min_p: Minimum relative probability (0.0 = disabled)

Effect: More adaptive than fixed thresholding; preserves more tokens when the distribution is flat, fewer when it's peaked.
Sources: common/common.h188
XTC (Exclude Top Choices) Sampling
Purpose: Occasionally excludes high-probability tokens to encourage diversity.
Algorithm:
- With probability xtc_probability, find all tokens whose probability exceeds xtc_threshold
- Remove every such token except the least probable of them

Configuration:
- xtc_probability: How often to exclude top choices
- xtc_threshold: Probability threshold (>0.5 disables XTC, since at most one token can exceed it)

Effect: Adds controlled randomness to prevent the model from being too predictable.
Sources: common/common.h189-190
Temperature Scaling
Purpose: Controls the randomness of the sampling distribution.
Algorithm:
- Each logit is divided by the temperature: logit_scaled = logit / temperature

Configuration:
- temp: Temperature value
  - temp < 1.0: Sharper distribution (more deterministic)
  - temp = 1.0: Unchanged distribution
  - temp > 1.0: Flatter distribution (more random)
  - temp ≤ 0.0: Greedy selection (argmax)

Effect: Primary control for generation randomness vs. determinism.
Sources: common/common.h192
Typical-P Sampling
Purpose: Samples tokens that are "typically" expected based on entropy.
Algorithm:
- Compute the entropy of the token distribution
- Rank tokens by how close their surprisal (-log p) is to that entropy
- Keep tokens until their cumulative probability reaches typ_p

Configuration:
- typ_p: Typical probability mass (1.0 = disabled)

Effect: Balances between too-obvious and too-surprising token choices.
Sources: common/common.h191
Mirostat Sampling
Purpose: Maintains consistent text perplexity (predictability) over time.
Algorithm:
- Measure the surprise of each sampled token against a target entropy (mirostat_tau)
- Use a learning rate (mirostat_eta) to adapt the sampling threshold toward the target

Modes:
- 1: Mirostat v1 (original algorithm)
- 2: Mirostat v2 (simplified variant)

Configuration:
- mirostat: Mode (0 = off, 1 = v1, 2 = v2)
- mirostat_tau: Target entropy
- mirostat_eta: Adaptation rate

Effect: Creates more consistent text quality by preventing the model from becoming too confident or too uncertain.
Sources: common/common.h205-208
Top-N Sigma Sampling
Purpose: Statistical filtering based on standard deviation.
Algorithm:
- Compute the standard deviation of the logits
- Keep only tokens whose logit lies within top_n_sigma standard deviations of the maximum logit

Configuration:
- top_n_sigma: Number of standard deviations (-1.0 = disabled)

Effect: Removes statistical outliers from consideration.
Sources: common/common.h206
Inference Loop Integration
The typical usage pattern for token sampling in an inference loop:
1. Initialize a sampler chain and add the desired samplers
2. Call llama_decode() to compute logits for the current batch
3. Call llama_sampler_sample() to select the next token
4. Feed the sampled token back as the input for the next decode step

Example Code Flow: examples/save-load-state/save-load-state.cpp47-101
Sources: examples/save-load-state/save-load-state.cpp47-101
| Function | Purpose | Location |
|---|---|---|
| llama_sampler_chain_init() | Create a new sampler chain | include/llama.h |
| llama_sampler_chain_add() | Add sampler to chain | include/llama.h |
| llama_sampler_sample() | Sample next token from context | include/llama.h |
| llama_sampler_free() | Free sampler resources | include/llama.h |
| llama_sampler_init_dist() | Create distribution sampler | include/llama.h |
| llama_sampler_init_penalties() | Create penalty sampler | include/llama.h |
| llama_sampler_init_top_k() | Create top-k sampler | include/llama.h |
| llama_sampler_init_top_p() | Create top-p sampler | include/llama.h |
| llama_sampler_init_temp() | Create temperature sampler | include/llama.h |
Sources: include/llama.h198-211 examples/save-load-state/save-load-state.cpp47-52
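The functions in the table compose as follows; this is a hedged fragment, not a complete program (it assumes an already-initialized llama_context named ctx with a decoded batch, and omits all error handling):

```cpp
// Build a chain: top-k -> top-p -> temperature -> final distribution sampler.
llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());

llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.95f, 1));
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.80f));
llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

// idx = -1 samples from the logits of the last token in the batch.
llama_token id = llama_sampler_sample(smpl, ctx, -1);

llama_sampler_free(smpl);
```

Note that the chain must end in a selection sampler (such as llama_sampler_init_dist() or a greedy sampler); the earlier entries only transform the candidate distribution.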
Grammar-Constrained Generation
The sampling system supports constraining output to follow formal grammars specified in GBNF (GGML BNF) format. This enables structured output generation (e.g., JSON, specific formats).
| Type | Description | Use Case |
|---|---|---|
| GBNF File | User-provided grammar file | Custom structured formats |
| JSON Mode | Built-in JSON grammar | API responses, structured data |
| Lazy Grammars | Applied only when triggered | Conditional structure enforcement |
Grammars can be applied conditionally using triggers:
Trigger Types (common_grammar_trigger_type):
- COMMON_GRAMMAR_TRIGGER_TYPE_TOKEN: Activate on specific token
- COMMON_GRAMMAR_TRIGGER_TYPE_WORD: Activate on word match
- COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN: Activate on regex pattern
- COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL: Activate on full text pattern

Configuration:
- grammar: GBNF grammar string
- grammar_lazy: Enable lazy application
- grammar_triggers: Vector of trigger conditions
- preserved_tokens: Tokens to exclude from grammar constraints

Sources: common/common.h139-232 README.md356-367
Logit Bias
Logit bias allows directly manipulating token probabilities before sampling.
Data Structure:
struct llama_logit_bias {
    llama_token token; // Token ID to bias
    float bias;        // Bias value (added to logit)
};
Configuration:
- logit_bias: General logit biases
- logit_bias_eog: Pre-calculated biases for end-of-generation tokens

Effects:
- A positive bias raises a token's logit, making it more likely to be selected
- A negative bias lowers it; a very large negative bias effectively bans the token

Common Uses:
- Banning unwanted tokens or phrases
- Encouraging or discouraging end-of-generation
Sources: common/common.h234-241 include/llama.h198-202
Performance Tracking
The sampling system tracks performance metrics when enabled:
Metrics Available:
- timing_per_token: Per-token timing information
- no_perf: Disable performance tracking (reduces overhead)

These metrics are accessible through llama_perf_context_print() and provide insight into sampling overhead relative to model inference.
Sources: common/common.h210-211 examples/embedding/embedding.cpp404
Default Sampler Ordering
The default sampler order is optimized for typical use:
1. Penalties (modifies logits, cheap)
2. DRY (scans context, moderate cost)
3. Top-N Sigma (statistical filtering)
4. Top-K (reduces array size)
5. Typical-P (entropy-based)
6. Top-P (cumulative probability)
7. Min-P (relative threshold)
8. XTC (probabilistic exclusion)
9. Temperature (final scaling)
Rationale: Cheap logit adjustments run first, filters that shrink the candidate set run before more expensive ones, and temperature scaling runs last. This order minimizes computational work by reducing the candidate set early.
Configuration: Users can customize order via samplers vector in common_params_sampling.
Sources: common/common.h217-227
State Persistence
Sampler state (including RNG state) can be saved and restored as part of context state:
Functions:
- llama_state_get_data(): Serialize context state, including sampler state
- llama_state_set_data(): Restore context state
- llama_state_seq_get_data(): Serialize a specific sequence
- llama_state_seq_set_data(): Restore a specific sequence

This enables:
- Pausing and resuming generation across process restarts
- Reproducible sampling from a saved state
Example Usage: examples/save-load-state/save-load-state.cpp68-188
Sources: examples/save-load-state/save-load-state.cpp68-188 include/llama.h
Backend Sampling
When backend_sampling = true, sampling operations are offloaded to backend implementations (GPU when available), potentially improving performance.
Configuration:
- backend_sampling: Enable backend-accelerated sampling

Sources: common/common.h237
Adaptive Sampling Parameters
Some sampling parameters can adapt dynamically:
Adaptive-P Sampling:
- adaptive_target: Target probability to maintain
- adaptive_decay: EMA decay rate for adaptation

Dynamic Temperature:
- dynatemp_range: Range for temperature variation
- dynatemp_exponent: Controls entropy-to-temperature mapping