This page documents the configuration system that controls text generation behavior in the Transformers library. It covers the GenerationConfig class, GenerationMode enum, and the logic that determines which generation strategy to use based on configuration parameters.
For information about the actual generation algorithms and strategies, see Logits Processing Pipeline. For details on assisted generation and speculative decoding, see Assisted and Speculative Decoding. For cache implementations used during generation, see Cache System.
The generation configuration system provides a unified interface for controlling all aspects of text generation through the GenerationConfig class. This configuration determines which generation mode (greedy search, sampling, beam search, or assisted generation) will be used, controls output length, manipulates token probabilities, and manages caching behavior.
Key Components:
Sources: src/transformers/generation/configuration_utils.py81-638 src/transformers/generation/configuration_utils.py63-79
The GenerationConfig class is defined in src/transformers/generation/configuration_utils.py81-638 and serves as the central configuration object for all generation operations. It is a PushToHubMixin subclass that can be serialized to JSON and saved/loaded from the Hub.
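For illustration, a serialized generation_config.json is a flat JSON object holding the configured fields. A sketch of what one might contain (the values here are made up for illustration, not any model's actual defaults):

```json
{
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "do_sample": true,
  "temperature": 0.7,
  "top_p": 0.9,
  "max_new_tokens": 256,
  "transformers_version": "4.0.0"
}
```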
Sources: src/transformers/generation/configuration_utils.py107-337
The configuration parameters are organized into logical categories:
| Category | Key Parameters | Purpose |
|---|---|---|
| Length Control | max_length, max_new_tokens, min_length, min_new_tokens, early_stopping, max_time, stop_strings | Control the length of generated sequences |
| Strategy Selection | do_sample, num_beams | Determine which generation mode to use |
| Cache Management | use_cache, cache_implementation, cache_config | Configure KV cache behavior |
| Logits Manipulation | temperature, top_k, top_p, min_p, top_h, typical_p, epsilon_cutoff, eta_cutoff, repetition_penalty, encoder_repetition_penalty, length_penalty, no_repeat_ngram_size, bad_words_ids, forced_bos_token_id, forced_eos_token_id | Modify token probability distributions |
| Output Control | output_scores, output_logits, output_attentions, output_hidden_states, return_dict_in_generate, num_return_sequences | Control what information is returned |
| Special Tokens | pad_token_id, bos_token_id, eos_token_id | Define special token IDs |
| Encoder-Decoder | encoder_no_repeat_ngram_size, decoder_start_token_id | Encoder-decoder specific parameters |
| Assisted Generation | num_assistant_tokens, num_assistant_tokens_schedule, assistant_confidence_threshold, prompt_lookup_num_tokens, max_matching_ngram_size, assistant_early_exit, assistant_lookbehind, target_lookbehind | Control speculative/assisted decoding |
| Performance | compile_config (CompileConfig), disable_compile | Control torch.compile behavior for static-cache decoding |
Sources: src/transformers/generation/configuration_utils.py341-437
The GenerationMode enum is defined in src/transformers/generation/configuration_utils.py63-79 and represents the available generation strategies:
CONTRASTIVE_SEARCH, DOLA_GENERATION, GROUP_BEAM_SEARCH, and CONSTRAINED_BEAM_SEARCH are deprecated modes. They are no longer implemented directly in GenerationMixin — their logic has been moved to external Hub repositories and must be loaded via the custom_generate mechanism.
The GENERATION_MODES_MAPPING dict in src/transformers/generation/utils.py132-143 maps each GenerationMode to the private method that implements it:
| GenerationMode | Implementation | Notes |
|---|---|---|
| GREEDY_SEARCH | _sample | Uses do_sample=False |
| SAMPLE | _sample | Uses do_sample=True |
| BEAM_SEARCH | _beam_search | Uses do_sample=False |
| BEAM_SAMPLE | _beam_search | Uses do_sample=True |
| ASSISTED_GENERATION | _assisted_decoding | Requires draft model or prompt lookup |
| DOLA_GENERATION | "transformers-community/dola" | Deprecated, loads from Hub |
| CONTRASTIVE_SEARCH | "transformers-community/contrastive-search" | Deprecated, loads from Hub |
| GROUP_BEAM_SEARCH | "transformers-community/group-beam-search" | Deprecated, loads from Hub |
| CONSTRAINED_BEAM_SEARCH | "transformers-community/constrained-beam-search" | Deprecated, loads from Hub |
Both GREEDY_SEARCH and SAMPLE are dispatched to _sample. The difference is that greedy search is equivalent to sampling with the argmax (i.e., do_sample=False).
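The relationship between the two modes can be sketched in plain Python (no transformers dependency; this is illustrative, not the library's actual code): greedy search takes the argmax of the next-token distribution, while sampling draws from it.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token(logits, do_sample, rng=random):
    """Pick the next token id, mirroring the shared _sample path."""
    probs = softmax(logits)
    if not do_sample:
        # Greedy search: deterministic argmax over the distribution.
        return max(range(len(probs)), key=probs.__getitem__)
    # Sampling: draw one token from the categorical distribution.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [0.1, 2.5, 0.3]
print(next_token(logits, do_sample=False))  # always token 1
```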
Sources: src/transformers/generation/configuration_utils.py63-79 src/transformers/generation/utils.py132-143
The generation mode is determined by calling GenerationConfig.get_generation_mode(), which is also referenced in GenerationMixin.generate(). The selection follows a priority-based algorithm:
The mode selection follows this priority order:
1. assistant_model or prompt_lookup_num_tokens is set → ASSISTED_GENERATION
2. Otherwise, if num_beams > 1:
   - do_sample=True → BEAM_SAMPLE
   - do_sample=False → BEAM_SEARCH
3. Otherwise:
   - do_sample=True → SAMPLE
   - do_sample=False → GREEDY_SEARCH

Sources: src/transformers/generation/configuration_utils.py81-638 src/transformers/generation/utils.py132-143
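The priority order described above can be sketched as a small pure-Python function (a hedged illustration of the logic, not the library's actual get_generation_mode implementation):

```python
def get_generation_mode(num_beams=1, do_sample=False,
                        assistant_model=None, prompt_lookup_num_tokens=None):
    """Resolve a generation mode name from config fields, highest priority first."""
    if assistant_model is not None or prompt_lookup_num_tokens is not None:
        return "ASSISTED_GENERATION"  # assisted generation wins over everything
    if num_beams > 1:
        return "BEAM_SAMPLE" if do_sample else "BEAM_SEARCH"
    return "SAMPLE" if do_sample else "GREEDY_SEARCH"

print(get_generation_mode())                              # GREEDY_SEARCH
print(get_generation_mode(num_beams=4, do_sample=True))   # BEAM_SAMPLE
```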
Once the mode is determined, GenerationMixin.generate() calls the appropriate private implementation:
Dispatch from generate() in GenerationMixin
Sources: src/transformers/generation/utils.py132-143 src/transformers/generation/utils.py369-491
The GenerationConfig can be loaded from multiple sources with a defined precedence order:
Configuration values are resolved in this order (later overrides earlier):
1. Library defaults from _get_default_generation_params() src/transformers/generation/configuration_utils.py1093-1122
2. Legacy generation parameters stored in the model's config.json
3. The model's generation_config.json, if it exists
4. Keyword arguments passed directly to the generate() method

The GenerationConfig class provides several loading methods:
| Method | Purpose | Location |
|---|---|---|
| __init__(**kwargs) | Create config with explicit parameters | src/transformers/generation/configuration_utils.py341-437 |
| from_pretrained(pretrained_model_name) | Load from Hub or local directory | src/transformers/generation/configuration_utils.py439-548 |
| from_model_config(model_config) | Extract from model configuration | src/transformers/generation/configuration_utils.py550-617 |
| update(**kwargs) | Update existing config with new values | src/transformers/generation/configuration_utils.py848-877 |
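The "later overrides earlier" precedence can be sketched with plain dicts standing in for the real config objects (illustrative only; the function name is hypothetical):

```python
def resolve_generation_params(library_defaults, model_config,
                              generation_config_json, generate_kwargs):
    """Merge config layers; later layers override earlier ones."""
    resolved = {}
    for layer in (library_defaults, model_config,
                  generation_config_json, generate_kwargs):
        # Drop None values so an unset key never masks an earlier layer.
        resolved.update({k: v for k, v in layer.items() if v is not None})
    return resolved

params = resolve_generation_params(
    {"max_length": 20, "do_sample": False},  # library defaults
    {"max_length": 512},                     # legacy fields in config.json
    {"do_sample": True, "top_p": 0.9},       # generation_config.json
    {"max_new_tokens": 64},                  # kwargs passed to generate()
)
print(params["do_sample"], params["max_length"])  # True 512
```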
Sources: src/transformers/generation/configuration_utils.py439-617 src/transformers/generation/utils.py369-431
The GenerationConfig validates itself at initialization time through the validate method src/transformers/generation/configuration_utils.py879-1091. Validation includes:
- Incompatible Parameter Combinations: Detect mutually exclusive settings
- Sampling Parameters Without do_sample: Warn if sampling-only parameters are set when do_sample=False
  - temperature, top_k, top_p, min_p, typical_p, epsilon_cutoff, eta_cutoff
- Beam-Only Parameters Without Beam Search: Warn if beam-specific parameters are set when num_beams=1
  - length_penalty, early_stopping
- Output Dependencies: Check that output flags requiring return_dict_in_generate=True are valid
  - output_scores, output_logits, output_attentions, output_hidden_states
- Cache Requirements: Validate that assisted generation uses a rollback-compatible cache (currently only DynamicCache)
The validation uses different severity levels:
| Level | Action | Example |
|---|---|---|
| Error | Raise ValueError | Impossible parameter combinations (e.g., num_return_sequences > num_beams in beam search) |
| Warning | Log warning via logger.warning_once() | Inconsequential but technically valid configs (e.g., setting temperature with do_sample=False) |
| Info | No action | Valid configurations |
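The two active severity levels can be sketched as follows (a hedged illustration of the pattern, not the library's validate method; the checks shown are simplified versions of two examples from the table):

```python
import warnings

def validate(num_beams=1, num_return_sequences=1,
             do_sample=False, temperature=None):
    """Raise on impossible combinations, warn on inconsequential ones."""
    if num_beams > 1 and num_return_sequences > num_beams:
        # Error level: this configuration cannot produce valid output.
        raise ValueError(
            "num_return_sequences must be <= num_beams in beam search")
    if not do_sample and temperature is not None:
        # Warning level: valid, but temperature has no effect here.
        warnings.warn("temperature is ignored when do_sample=False")

validate(do_sample=True, temperature=0.7)  # valid: no output
try:
    validate(num_beams=2, num_return_sequences=5)
except ValueError as e:
    print("error:", e)
```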
Sources: src/transformers/generation/configuration_utils.py879-1091
When parameters are not explicitly set (None), the generation system applies defaults during the generation loop via _get_default_generation_params() src/transformers/generation/configuration_utils.py1093-1122:
| Parameter | Default Value | Condition |
|---|---|---|
| do_sample | False | Always |
| num_beams | 1 | Always |
| use_cache | True | If model supports caching |
| max_length | 20 | If neither max_length nor max_new_tokens is set |
| temperature | 1.0 | If do_sample=True |
| top_k | 50 | If do_sample=True |
| top_p | 1.0 | If do_sample=True |
| renormalize_logits | False | Always |
| output_scores | False | Always |
| output_logits | False | Always |
| output_attentions | False | Always |
| output_hidden_states | False | Always |
| return_dict_in_generate | False | Always |
| num_return_sequences | 1 | Always |
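The fallback behavior in the table above can be sketched like this (function and dict names are illustrative, not the library's): unset (None or missing) fields fall back to fixed defaults, and the sampling-only defaults apply only when do_sample resolves to True.

```python
DEFAULTS = {"do_sample": False, "num_beams": 1, "max_length": 20,
            "renormalize_logits": False, "return_dict_in_generate": False,
            "num_return_sequences": 1}
SAMPLING_DEFAULTS = {"temperature": 1.0, "top_k": 50, "top_p": 1.0}

def apply_defaults(config):
    """Fill in defaults for any field that is missing or None."""
    resolved = dict(config)
    for key, value in DEFAULTS.items():
        if resolved.get(key) is None:
            resolved[key] = value
    if resolved["do_sample"]:
        # Sampling defaults only matter when sampling is enabled.
        for key, value in SAMPLING_DEFAULTS.items():
            if resolved.get(key) is None:
                resolved[key] = value
    return resolved

print(apply_defaults({"do_sample": True})["top_k"])  # 50
print(apply_defaults({})["max_length"])              # 20
```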
Sources: src/transformers/generation/configuration_utils.py1093-1122
The GenerationConfig is tightly integrated with the GenerationMixin class src/transformers/generation/utils.py337-1492:
Throughout generation, the configuration is accessed via:
- Direct attribute access, e.g. generation_config.max_new_tokens
- generation_config.get_generation_mode() (method on GenerationConfig)
- generation_config.update(**kwargs)
- generation_config._get_default_generation_params()

Sources: src/transformers/generation/utils.py369-491 src/transformers/generation/configuration_utils.py81-638
The cache_implementation field in GenerationConfig accepts a string name that maps to a specific Cache subclass. The recognized values are defined in src/transformers/generation/configuration_utils.py45-56:
| String value | Cache class | Notes |
|---|---|---|
"dynamic" | DynamicCache | Default; grows as tokens are generated |
"dynamic_full" | DynamicCache (full history) | Required for assisted generation rollback |
"offloaded" | DynamicCache (CPU offloaded) | Saves GPU memory |
"quantized" | QuantizedCache | Quantizes KV cache to reduce memory |
"static" | StaticCache | Pre-allocated; required for torch.compile |
"offloaded_static" | StaticCache (CPU offloaded) | — |
"sliding_window" | — | Deprecated |
"hybrid" | — | Deprecated |
"hybrid_chunked" | — | Deprecated |
Additional parameters for a specific cache class can be passed via cache_config (a dict).
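The string-to-class dispatch can be sketched as a lookup table (class names come from the table above; the mapping and factory function here are illustrative stand-ins, not the library's actual code):

```python
# Stub classes standing in for the real Cache subclasses.
class DynamicCache: ...
class QuantizedCache: ...
class StaticCache: ...

CACHE_CLASSES = {
    "dynamic": DynamicCache,
    "dynamic_full": DynamicCache,
    "offloaded": DynamicCache,
    "quantized": QuantizedCache,
    "static": StaticCache,
    "offloaded_static": StaticCache,
}

def make_cache(cache_implementation, cache_config=None):
    """Instantiate the cache class named by the config string."""
    try:
        cls = CACHE_CLASSES[cache_implementation]
    except KeyError:
        raise ValueError(
            f"Unknown cache_implementation: {cache_implementation!r}") from None
    return cls(**(cache_config or {}))

print(type(make_cache("static")).__name__)  # StaticCache
```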
Sources: src/transformers/generation/configuration_utils.py45-56
Assisted generation has unique configuration requirements src/transformers/generation/candidate_generator.py103-191:
- cache_implementation="dynamic_full" for rollback support (set automatically)
- A separate GenerationConfig for the assistant model, derived from the main config
- num_assistant_tokens controls speculation depth (how many draft tokens to propose per step)
- num_assistant_tokens_schedule ("heuristic", "heuristic_transient", "constant") controls how that number adapts at runtime
- assistant_confidence_threshold enables dynamic speculation cutoff based on the draft model's confidence

Encoder-decoder models use specific configuration parameters:

- decoder_start_token_id: Initial token ID for the decoder
- encoder_no_repeat_ngram_size: Prevent n-grams from the encoder input from appearing in the decoder output
- decoder_start_token_id can be a list to allow different start tokens per batch element

When using static caches for torch.compile compatibility:

- cache_implementation must be "static" or "offloaded_static"
- max_cache_len or max_length determines the pre-allocated tensor size
- cache_config can pass extra parameters (e.g., max_batch_size)
- compile_config (a CompileConfig instance) controls how generate() invokes torch.compile on the forward pass
- disable_compile=True suppresses automatic compilation even when a static cache is used

Sources: src/transformers/generation/candidate_generator.py103-191 src/transformers/generation/configuration_utils.py45-56 src/transformers/generation/configuration_utils.py330-337
The generation output type is determined by the mode and return_dict_in_generate flag:
| Mode | return_dict_in_generate | Output Type |
|---|---|---|
| GREEDY_SEARCH / SAMPLE | False | torch.LongTensor (just sequences) |
| GREEDY_SEARCH / SAMPLE | True (decoder-only) | GenerateDecoderOnlyOutput |
| GREEDY_SEARCH / SAMPLE | True (encoder-decoder) | GenerateEncoderDecoderOutput |
| BEAM_SEARCH / BEAM_SAMPLE | False | torch.LongTensor (just sequences) |
| BEAM_SEARCH / BEAM_SAMPLE | True (decoder-only) | GenerateBeamDecoderOnlyOutput |
| BEAM_SEARCH / BEAM_SAMPLE | True (encoder-decoder) | GenerateBeamEncoderDecoderOutput |
All output classes defined in src/transformers/generation/utils.py146-329 share common fields:
sequences: Generated token IDsscores (optional): Processed prediction scores per steplogits (optional): Raw prediction logits per steppast_key_values (optional): Final cache stateBeam outputs additionally include:
sequences_scores: Final beam scoresbeam_indices: Beam indices for each generated tokenSources: src/transformers/generation/utils.py146-329
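The mode/flag selection above can be sketched as a small function (class names come from the table; the selection logic here is an illustrative stand-in for the library's dispatch):

```python
def output_class(mode, return_dict_in_generate, is_encoder_decoder):
    """Return the name of the output type for a mode/flag combination."""
    if not return_dict_in_generate:
        return "torch.LongTensor"  # just the sequences tensor
    beam = "Beam" if mode in ("BEAM_SEARCH", "BEAM_SAMPLE") else ""
    arch = "EncoderDecoder" if is_encoder_decoder else "DecoderOnly"
    return f"Generate{beam}{arch}Output"

print(output_class("SAMPLE", True, False))      # GenerateDecoderOnlyOutput
print(output_class("BEAM_SEARCH", True, True))  # GenerateBeamEncoderDecoderOutput
```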