This page introduces the text generation subsystem of the transformers library. It covers the major abstractions, how they relate to each other, and how the generate() call flows through the system. For detailed coverage of specific subsystems, see the child pages:
For the model architecture that generation is built on top of (e.g., decoder-only LLMs), see Model Architectures.
The generation subsystem is the collection of classes and utilities that implement auto-regressive text generation. Any model that can generate sequences (causal LMs, encoder-decoder models, etc.) acquires this capability by inheriting from GenerationMixin, which provides the generate() method and all supporting infrastructure.
The subsystem lives almost entirely in the src/transformers/generation/ directory; the KV cache implementations live in src/transformers/cache_utils.py.
Primary source files:
| File | Responsibility |
|---|---|
| src/transformers/generation/utils.py | GenerationMixin, output dataclasses |
| src/transformers/generation/configuration_utils.py | GenerationConfig, GenerationMode |
| src/transformers/generation/logits_process.py | All LogitsProcessor subclasses |
| src/transformers/generation/stopping_criteria.py | StoppingCriteria subclasses |
| src/transformers/generation/candidate_generator.py | CandidateGenerator (assisted decoding) |
| src/transformers/generation/continuous_batching.py | ContinuousMixin, ContinuousBatchingManager |
| src/transformers/cache_utils.py | Cache, DynamicCache, StaticCache, etc. |
Sources: src/transformers/generation/__init__.py:1-203, src/transformers/generation/utils.py:1-143
The following diagram maps the major abstractions to their code locations.
Diagram: Generation System – Component Map
Sources: src/transformers/generation/utils.py:53-109, src/transformers/generation/configuration_utils.py:63-79, src/transformers/cache_utils.py:27-56, src/transformers/generation/__init__.py:38-108
GenerationMixin Class

GenerationMixin is defined in src/transformers/generation/utils.py:337-364 and is the entry point for all generation functionality. It inherits from ContinuousMixin (from generation/continuous_batching.py), which adds serving-oriented continuous batching support.
Any model that produces sequences should inherit from it. For example, LlamaForCausalLM and BartForConditionalGeneration both inherit from PreTrainedModel, which in turn inherits from GenerationMixin. The class includes the following public-facing methods:
| Method | Purpose |
|---|---|
| generate() | Main entry point; orchestrates the entire decoding loop |
| prepare_inputs_for_generation() | Assembles the model input dict for each step |
| compute_transition_scores() | Computes per-token log probabilities from beam scores |
| adjust_generation_fn() | Loads GenerationConfig and an optional custom generate.py from a Hub repo |
| load_custom_generate() | Fetches and returns a custom generate function from a Hub repo |
The generate() method internally dispatches to one of two private methods: _sample() (for greedy/sampling) or _beam_search() (for beam methods). Deprecated modes (DoLa, contrastive search, group beam search) are dispatched to Hub-hosted community modules, as defined in GENERATION_MODES_MAPPING at src/transformers/generation/utils.py:132-143.
Sources: src/transformers/generation/utils.py:337-491
GenerationConfig

GenerationConfig (src/transformers/generation/configuration_utils.py:81-337) is a flat configuration object that controls all aspects of generation. It can be loaded from a generation_config.json file alongside the model weights via GenerationConfig.from_pretrained(), and saved via save_pretrained().
Parameters are grouped by function:
| Group | Key Parameters |
|---|---|
| Length control | max_new_tokens, min_new_tokens, max_length, stop_strings |
| Strategy selection | do_sample, num_beams |
| Sampling | temperature, top_k, top_p, min_p, typical_p |
| Penalties | repetition_penalty, no_repeat_ngram_size, bad_words_ids |
| Cache | use_cache, cache_implementation, cache_config |
| Outputs | return_dict_in_generate, output_scores, output_logits, output_attentions, output_hidden_states |
| Assisted decoding | num_assistant_tokens, num_assistant_tokens_schedule, prompt_lookup_num_tokens |
| Compilation | compile_config, disable_compile |
GenerationConfig also carries the get_generation_mode() method which examines parameters and returns a GenerationMode enum value. This is the primary mechanism by which generate() selects a decoding strategy.
GenerationMode Enum

GenerationMode.GREEDY_SEARCH → _sample() (do_sample=False, num_beams=1)
GenerationMode.SAMPLE → _sample() (do_sample=True, num_beams=1)
GenerationMode.BEAM_SEARCH → _beam_search() (do_sample=False, num_beams>1)
GenerationMode.BEAM_SAMPLE → _beam_search() (do_sample=True, num_beams>1)
GenerationMode.ASSISTED_GENERATION → _assisted_decoding()
Deprecated modes (DOLA_GENERATION, CONTRASTIVE_SEARCH, GROUP_BEAM_SEARCH, CONSTRAINED_BEAM_SEARCH) resolve to Hub repository identifiers rather than local methods.
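The selection rules above can be sketched in plain Python. This is an illustrative sketch, not the library's actual code; for brevity, assisted generation is triggered here only by an assistant_model argument:

```python
# Illustrative sketch (not the library's code) of how generate() maps
# GenerationConfig-style fields to a GenerationMode and a private method.
def select_generation_mode(do_sample=False, num_beams=1, assistant_model=None):
    if assistant_model is not None:
        return ("ASSISTED_GENERATION", "_assisted_decoding")
    if num_beams == 1:
        return ("SAMPLE", "_sample") if do_sample else ("GREEDY_SEARCH", "_sample")
    return ("BEAM_SAMPLE", "_beam_search") if do_sample else ("BEAM_SEARCH", "_beam_search")

print(select_generation_mode(num_beams=4))  # ('BEAM_SEARCH', '_beam_search')
```

Note that both greedy search and sampling resolve to the same _sample() loop; only the token-selection step inside it differs.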
Sources: src/transformers/generation/configuration_utils.py:63-79, src/transformers/generation/configuration_utils.py:341-380
generate() Call Flow

The following diagram traces execution through generate() from invocation to token output.
Diagram: generate() execution flow
Sources: src/transformers/generation/utils.py:493-591, src/transformers/generation/utils.py:132-143
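The decoding loop that this flow converges on can be sketched in a few lines of pure Python, with a toy model standing in for the real forward pass. The real loop batches tensors, reuses the KV cache, and samples from a softmax over processed logits; this sketch only shows the control flow:

```python
import random

# Toy stand-in for the model forward pass: fixed logits over a 4-token
# vocabulary, where token 3 plays the role of EOS after five steps.
def toy_model(input_ids):
    return [0.1, 0.7, 0.2, 0.0] if len(input_ids) < 5 else [0.0, 0.0, 0.0, 1.0]

def sample_loop(input_ids, max_new_tokens, do_sample=False, eos_token_id=3):
    for _ in range(max_new_tokens):
        logits = toy_model(input_ids)
        if do_sample:
            # Real code samples from a softmax over the processed logits.
            next_token = random.choices(range(len(logits)), weights=logits)[0]
        else:
            next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        input_ids = input_ids + [next_token]
        if next_token == eos_token_id:
            break
    return input_ids

print(sample_loop([0], max_new_tokens=10))  # [0, 1, 1, 1, 1, 3]
```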
When return_dict_in_generate=True, generate() returns a typed ModelOutput subclass. The specific type depends on whether the model is decoder-only or encoder-decoder, and whether beam search was used.
Diagram: Output type selection
| Class | Fields |
|---|---|
| GenerateDecoderOnlyOutput | sequences, scores, logits, attentions, hidden_states, past_key_values |
| GenerateEncoderDecoderOutput | all of the above, plus encoder_attentions, encoder_hidden_states, decoder_attentions, cross_attentions, decoder_hidden_states |
| GenerateBeamDecoderOnlyOutput | sequences, sequences_scores, scores, logits, beam_indices, attentions, hidden_states, past_key_values |
| GenerateBeamEncoderDecoderOutput | all beam fields, plus the encoder/decoder/cross variants |
All four are defined as @dataclass subclasses of ModelOutput in src/transformers/generation/utils.py:147-334.
Sources: src/transformers/generation/utils.py:147-334
At each generation step, raw model logits are passed through a LogitsProcessorList, which applies zero or more LogitsProcessor instances in sequence. Each processor takes (input_ids, scores) and returns modified scores.
LogitsProcessor is an abstract base class defined in src/transformers/generation/logits_process.py:48-55. LogitsProcessorList (a list subclass) invokes each processor in order via its __call__ method at src/transformers/generation/logits_process.py:65-93.
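The protocol can be illustrated with plain Python lists in place of torch tensors. The class names here echo the library's, but the implementations are simplified sketches:

```python
import math

class TemperatureWarper:
    """Divides every score by the temperature, like TemperatureLogitsWarper."""
    def __init__(self, temperature):
        self.temperature = temperature

    def __call__(self, input_ids, scores):
        return [s / self.temperature for s in scores]

class TopKWarper:
    """Masks everything outside the top_k highest scores with -inf."""
    def __init__(self, top_k):
        self.top_k = top_k

    def __call__(self, input_ids, scores):
        cutoff = sorted(scores, reverse=True)[self.top_k - 1]
        return [s if s >= cutoff else -math.inf for s in scores]

class ProcessorList(list):
    """Applies each processor in order, like LogitsProcessorList.__call__."""
    def __call__(self, input_ids, scores):
        for processor in self:
            scores = processor(input_ids, scores)
        return scores

pipeline = ProcessorList([TemperatureWarper(2.0), TopKWarper(2)])
print(pipeline([], [4.0, 2.0, 1.0]))  # [2.0, 1.0, -inf]
```

Because processors compose left to right, their order matters: warping the temperature before top-k filtering can select a different token set than the reverse order.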
The processors built and added by generate() are determined by the active GenerationConfig fields:
| GenerationConfig field | Processor added |
|---|---|
| temperature | TemperatureLogitsWarper |
| top_k | TopKLogitsWarper |
| top_p | TopPLogitsWarper |
| min_p | MinPLogitsWarper |
| top_h | TopHLogitsWarper |
| repetition_penalty | RepetitionPenaltyLogitsProcessor |
| no_repeat_ngram_size | NoRepeatNGramLogitsProcessor |
| bad_words_ids | NoBadWordsLogitsProcessor |
| min_length | MinLengthLogitsProcessor |
| min_new_tokens | MinNewTokensLengthLogitsProcessor |
| forced_bos_token_id | ForcedBOSTokenLogitsProcessor |
| forced_eos_token_id | ForcedEOSTokenLogitsProcessor |
| watermarking_config | WatermarkLogitsProcessor or SynthIDTextWatermarkLogitsProcessor |
| guidance_scale > 1 | UnbatchedClassifierFreeGuidanceLogitsProcessor |
Callers may also pass a custom logits_processor: LogitsProcessorList argument directly to generate(), which is merged with the auto-built list.
For full details see Logits Processing Pipeline.
Sources: src/transformers/generation/logits_process.py:48-100, src/transformers/generation/utils.py:73-100
The KV cache is managed through a class hierarchy rooted at Cache and CacheLayerMixin in src/transformers/cache_utils.py. The concrete implementations differ in memory allocation strategy and torch.compile compatibility:
| Class | Key behaviour | is_compileable |
|---|---|---|
| DynamicCache | Grows by torch.cat each step | No |
| StaticCache | Pre-allocates fixed tensors, writes via index_copy_ | Yes |
| QuantizedCache | Quantizes older entries; keeps a residual buffer in full precision | No |
| EncoderDecoderCache | Wraps two caches for encoder-decoder models | Depends on the wrapped caches |
The cache type to instantiate is selected by cache_implementation in GenerationConfig. Valid string values include "dynamic", "static", "offloaded_static", "quantized", and others defined in ALL_CACHE_IMPLEMENTATIONS at src/transformers/generation/configuration_utils.py:45-56.
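The difference between the two main allocation strategies can be sketched with lists standing in for key/value tensors. These are toy classes, not the library's implementations:

```python
class ToyDynamicCache:
    """Grows on every update, like torch.cat in DynamicCache."""
    def __init__(self):
        self.keys = []

    def update(self, new_key):
        self.keys.append(new_key)
        return self.keys

class ToyStaticCache:
    """Fixed-size buffer written in place, like index_copy_ in StaticCache."""
    def __init__(self, max_len):
        self.keys = [None] * max_len  # shape fixed up front: compile-friendly
        self.pos = 0

    def update(self, new_key):
        self.keys[self.pos] = new_key
        self.pos += 1
        return self.keys[: self.pos]

cache = ToyStaticCache(max_len=4)
cache.update("k0")
print(len(cache.keys))  # 4 - the allocation does not grow as tokens arrive
```

The static variant trades memory (it must reserve max_len slots up front) for stable tensor shapes, which is what makes it compatible with torch.compile.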
For full details see Cache System.
Sources: src/transformers/cache_utils.py:27-500, src/transformers/generation/configuration_utils.py:45-56
The two internal decoding loops are:

_sample() — handles both greedy decoding (do_sample=False) and multinomial sampling (do_sample=True) in a single unified loop that, at each step, either takes the argmax or samples from the processed distribution.

_beam_search() — maintains num_beams candidate sequences per batch item; after max_new_tokens steps or EOS, it selects the highest-scoring beam. It also handles beam sampling when do_sample=True.

When assistant_model or prompt_lookup_num_tokens is passed to generate(), the mode becomes ASSISTED_GENERATION and the _assisted_decoding() loop runs instead. This uses a CandidateGenerator to propose multiple draft tokens cheaply, which are then verified in a single target-model forward pass.
Concrete generators include:
| Class | Strategy |
|---|---|
| AssistedCandidateGenerator | Uses a smaller draft model with the same tokenizer |
| AssistedCandidateGeneratorDifferentTokenizers | Uses a draft model with a different tokenizer (universal assisted generation) |
| PromptLookupCandidateGenerator | Matches n-grams in the prompt to propose continuation tokens |
| EarlyExitCandidateGenerator | Uses intermediate model layers as the draft |
| UniversalSpeculativeDecodingGenerator | Token-space alignment between mismatched vocabularies |
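The prompt-lookup strategy is simple enough to sketch directly. This is a simplified version: the real PromptLookupCandidateGenerator works on tensors and tries several n-gram sizes:

```python
# Simplified prompt lookup: find the latest earlier occurrence of the most
# recent n-gram and propose the tokens that followed it as draft candidates.
def prompt_lookup_candidates(input_ids, ngram_size=2, num_draft=3):
    ngram = input_ids[-ngram_size:]
    for start in range(len(input_ids) - ngram_size - 1, -1, -1):
        if input_ids[start : start + ngram_size] == ngram:
            follow = input_ids[start + ngram_size : start + ngram_size + num_draft]
            if follow:
                return follow
    return []  # no earlier match: no candidates this step

print(prompt_lookup_candidates([5, 6, 7, 8, 5, 6]))  # [7, 8, 5]
```

Because it needs no draft model at all, prompt lookup is essentially free; it pays off on inputs with repeated spans, such as code or summarization prompts.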
For full details see Assisted and Speculative Decoding.
Sources: src/transformers/generation/utils.py:132-143, src/transformers/generation/candidate_generator.py:39-76, src/transformers/generation/candidate_generator.py:78-199
StoppingCriteriaList aggregates StoppingCriteria objects and is called at each step. Generation halts when any criterion returns True. The built-in implementations are:
| Class | Stops when |
|---|---|
| MaxLengthCriteria | Total sequence length ≥ max_length |
| MaxTimeCriteria | Wall-clock time ≥ max_time |
| EosTokenCriteria | Any batch item's last token is an EOS id |
| StopStringCriteria | Decoded output contains a stop string |
| ConfidenceCriteria | The top-token probability falls below a confidence threshold (used internally in assisted decoding) |
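The aggregation logic can be sketched as follows, with simplified pure-Python stand-ins for the library classes:

```python
class MaxLength:
    """Stops once the total sequence length reaches max_length."""
    def __init__(self, max_length):
        self.max_length = max_length

    def __call__(self, input_ids, scores):
        return len(input_ids) >= self.max_length

class EosToken:
    """Stops once the last generated token is the EOS id."""
    def __init__(self, eos_token_id):
        self.eos_token_id = eos_token_id

    def __call__(self, input_ids, scores):
        return bool(input_ids) and input_ids[-1] == self.eos_token_id

class CriteriaList(list):
    """Halts generation when ANY criterion fires, like StoppingCriteriaList."""
    def __call__(self, input_ids, scores):
        return any(criterion(input_ids, scores) for criterion in self)

stop = CriteriaList([MaxLength(8), EosToken(2)])
print(stop([1, 1, 2], None))  # True (EOS reached before max length)
```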
Sources: src/transformers/generation/utils.py:101-109, src/transformers/generation/__init__.py:78-87
Custom generate Functions

A model repository may include a custom_generate/generate.py file. When loaded with trust_remote_code=True, the adjust_generation_fn() method at src/transformers/generation/utils.py:369-430 replaces self.generate with the custom function via functools.partial. This also works cross-model: any model can use any repo's custom generate function by passing custom_generate="org/repo" to generate().
The custom function must accept model as its first argument, and may extend or completely replace the standard generation loop.
Sources: src/transformers/generation/utils.py:369-491, docs/source/en/generation_strategies.md:99-230
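The binding mechanism itself is ordinary functools.partial, which can be illustrated with a toy model. ToyModel and custom_generate here are hypothetical stand-ins, not library classes:

```python
import functools

class ToyModel:
    def generate(self, prompt):
        return prompt + " [standard]"

def custom_generate(model, prompt):
    # A custom function receives the model as its first argument and may
    # extend or completely replace the standard loop; here we just tag output.
    return prompt + " [custom]"

model = ToyModel()
# Bind the model instance so callers keep using model.generate(...) unchanged.
model.generate = functools.partial(custom_generate, model)
print(model.generate("hello"))  # hello [custom]
```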
GenerationMixin also inherits from ContinuousMixin (generation/continuous_batching.py), which adds ContinuousBatchingManager and related classes for server-side use. This enables processing multiple requests of different lengths in a single batched pass with dynamic scheduling (FIFOScheduler, PrefillFirstScheduler).
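The FIFO scheduling idea can be sketched as follows. This is a toy sketch; the real FIFOScheduler also manages KV-cache blocks and distinguishes prefill from decode phases:

```python
from collections import deque

def schedule_step(waiting, active, max_batch_size):
    """Each step, fill the active batch with the oldest waiting requests."""
    while waiting and len(active) < max_batch_size:
        active.append(waiting.popleft())
    return active

waiting = deque(["req1", "req2", "req3"])
active = []
schedule_step(waiting, active, max_batch_size=2)
print(active)  # ['req1', 'req2']
```

Because requests join and leave the batch independently, a long request no longer blocks short ones queued behind it, which is the core benefit over static batching.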
For full details see Continuous Batching.
Sources: src/transformers/generation/utils.py:72, src/transformers/generation/__init__.py:88-94
The generation system sits between PreTrainedModel and user-facing entry points such as the pipeline() API.
Diagram: Generation system in broader library context
Sources: src/transformers/generation/utils.py:337-364, src/transformers/generation/configuration_utils.py:81-100