This page introduces the text generation subsystem of the transformers library. It covers the major abstractions, how they relate to each other, and how the generate() call flows through the system. For detailed coverage of specific subsystems, see the child pages:
For the model architecture that generation is built on top of (e.g., decoder-only LLMs), see Model Architectures.
The generation subsystem is the collection of classes and utilities that implement auto-regressive text generation. Any model that can generate sequences (causal LMs, encoder-decoder models, etc.) acquires this capability by inheriting from GenerationMixin, which provides the generate() method and all supporting infrastructure.
The subsystem lives almost entirely in the src/transformers/generation/ directory; the KV cache implementations live in src/transformers/cache_utils.py.
Primary source files:
| File | Responsibility |
|---|---|
| src/transformers/generation/utils.py | GenerationMixin, output dataclasses |
| src/transformers/generation/configuration_utils.py | GenerationConfig, GenerationMode |
| src/transformers/generation/logits_process.py | All LogitsProcessor subclasses |
| src/transformers/generation/stopping_criteria.py | StoppingCriteria subclasses |
| src/transformers/generation/candidate_generator.py | CandidateGenerator (assisted decoding) |
| src/transformers/generation/continuous_batching.py | ContinuousMixin, ContinuousBatchingManager |
| src/transformers/cache_utils.py | Cache, DynamicCache, StaticCache, etc. |
Sources: src/transformers/generation/__init__.py:1-203, src/transformers/generation/utils.py:1-143
The following diagram maps the major abstractions to their code locations.
Diagram: Generation System – Component Map
Sources: src/transformers/generation/utils.py:53-109, src/transformers/generation/configuration_utils.py:63-79, src/transformers/cache_utils.py:27-56, src/transformers/generation/__init__.py:38-108
GenerationMixin Class

GenerationMixin is defined in src/transformers/generation/utils.py:337-364 and is the entry point for all generation functionality. It inherits from ContinuousMixin (from generation/continuous_batching.py), which adds serving-oriented continuous batching support.
Any model that produces sequences should inherit from it. For example, LlamaForCausalLM and BartForConditionalGeneration both inherit from PreTrainedModel, which in turn inherits from GenerationMixin. The class includes the following public-facing methods:
| Method | Purpose |
|---|---|
| generate() | Main entry point; orchestrates the entire decoding loop |
| prepare_inputs_for_generation() | Assembles the model input dict for each step |
| compute_transition_scores() | Computes per-token log probabilities from beam scores |
| adjust_generation_fn() | Loads GenerationConfig and an optional custom generate.py from a Hub repo |
| load_custom_generate() | Fetches and returns a custom generate function from a Hub repo |
The generate() method internally dispatches to one of two private methods: _sample() (for greedy/sampling) or _beam_search() (for beam methods). Deprecated modes (DoLa, contrastive search, group beam search) are dispatched to Hub-hosted community modules, as defined in GENERATION_MODES_MAPPING at src/transformers/generation/utils.py:132-143.
Sources: src/transformers/generation/utils.py:337-491
GenerationConfig

GenerationConfig (src/transformers/generation/configuration_utils.py:81-337) is a flat configuration object that controls all aspects of generation. It can be loaded from a generation_config.json file alongside the model weights via GenerationConfig.from_pretrained(), and saved via save_pretrained().
Parameters are grouped by function:
| Group | Key Parameters |
|---|---|
| Length control | max_new_tokens, min_new_tokens, max_length, stop_strings |
| Strategy selection | do_sample, num_beams |
| Sampling | temperature, top_k, top_p, min_p, typical_p |
| Penalties | repetition_penalty, no_repeat_ngram_size, bad_words_ids |
| Cache | use_cache, cache_implementation, cache_config |
| Outputs | return_dict_in_generate, output_scores, output_logits, output_attentions, output_hidden_states |
| Assisted decoding | num_assistant_tokens, num_assistant_tokens_schedule, prompt_lookup_num_tokens |
| Compilation | compile_config, disable_compile |
GenerationConfig also carries the get_generation_mode() method which examines parameters and returns a GenerationMode enum value. This is the primary mechanism by which generate() selects a decoding strategy.
GenerationMode Enum

GenerationMode.GREEDY_SEARCH → _sample() (do_sample=False, num_beams=1)
GenerationMode.SAMPLE → _sample() (do_sample=True, num_beams=1)
GenerationMode.BEAM_SEARCH → _beam_search() (do_sample=False, num_beams>1)
GenerationMode.BEAM_SAMPLE → _beam_search() (do_sample=True, num_beams>1)
GenerationMode.ASSISTED_GENERATION → _assisted_decoding()
Deprecated modes (DOLA_GENERATION, CONTRASTIVE_SEARCH, GROUP_BEAM_SEARCH, CONSTRAINED_BEAM_SEARCH) resolve to Hub repository identifiers rather than local methods.
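The selection rules above can be sketched in plain Python. This is an illustrative sketch, not the library's actual code; for brevity, assisted generation is triggered here only by an assistant_model argument:

```python
# Illustrative sketch (not the library's code) of how generate() maps
# GenerationConfig-style fields to a GenerationMode and a private method.
def select_generation_mode(do_sample=False, num_beams=1, assistant_model=None):
    if assistant_model is not None:
        return ("ASSISTED_GENERATION", "_assisted_decoding")
    if num_beams == 1:
        return ("SAMPLE", "_sample") if do_sample else ("GREEDY_SEARCH", "_sample")
    return ("BEAM_SAMPLE", "_beam_search") if do_sample else ("BEAM_SEARCH", "_beam_search")

print(select_generation_mode(num_beams=4))  # ('BEAM_SEARCH', '_beam_search')
```

Note that both greedy search and sampling resolve to the same _sample() loop; only the token-selection step inside it differs.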
Sources: src/transformers/generation/configuration_utils.py:63-79, src/transformers/generation/configuration_utils.py:341-380
generate() Call Flow

The following diagram traces execution through generate() from invocation to token output.
Diagram: generate() execution flow
Sources: src/transformers/generation/utils.py:493-591, src/transformers/generation/utils.py:132-143
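The decoding loop that this flow converges on can be sketched in a few lines of pure Python, with a toy model standing in for the real forward pass. The real loop batches tensors, reuses the KV cache, and samples from a softmax over processed logits; this sketch only shows the control flow:

```python
import random

# Toy stand-in for the model forward pass: fixed logits over a 4-token
# vocabulary, where token 3 plays the role of EOS after five steps.
def toy_model(input_ids):
    return [0.1, 0.7, 0.2, 0.0] if len(input_ids) < 5 else [0.0, 0.0, 0.0, 1.0]

def sample_loop(input_ids, max_new_tokens, do_sample=False, eos_token_id=3):
    for _ in range(max_new_tokens):
        logits = toy_model(input_ids)
        if do_sample:
            # Real code samples from a softmax over the processed logits.
            next_token = random.choices(range(len(logits)), weights=logits)[0]
        else:
            next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        input_ids = input_ids + [next_token]
        if next_token == eos_token_id:
            break
    return input_ids

print(sample_loop([0], max_new_tokens=10))  # [0, 1, 1, 1, 1, 3]
```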
When return_dict_in_generate=True, generate() returns a typed ModelOutput subclass. The specific type depends on whether the model is decoder-only or encoder-decoder, and whether beam search was used.
Diagram: Output type selection
| Class | Fields |
|---|---|
| GenerateDecoderOnlyOutput | sequences, scores, logits, attentions, hidden_states, past_key_values |
| GenerateEncoderDecoderOutput | all of the above, plus encoder_attentions, encoder_hidden_states, decoder_attentions, cross_attentions, decoder_hidden_states |
| GenerateBeamDecoderOnlyOutput | sequences, sequences_scores, scores, logits, beam_indices, attentions, hidden_states, past_key_values |
| GenerateBeamEncoderDecoderOutput | all beam fields, plus the encoder/decoder/cross variants |
All four are defined as @dataclass subclasses of ModelOutput in src/transformers/generation/utils.py:147-334.
Sources: src/transformers/generation/utils.py:147-334
At each generation step, raw model logits are passed through a LogitsProcessorList, which applies zero or more LogitsProcessor instances in sequence. Each processor takes (input_ids, scores) and returns modified scores.
LogitsProcessor is an abstract base class defined in src/transformers/generation/logits_process.py:48-55. LogitsProcessorList (a list subclass) invokes each processor in order via its __call__ method at src/transformers/generation/logits_process.py:65-93.
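The protocol can be illustrated with plain Python lists in place of torch tensors. The class names here echo the library's, but the implementations are simplified sketches:

```python
import math

class TemperatureWarper:
    """Divides every score by the temperature, like TemperatureLogitsWarper."""
    def __init__(self, temperature):
        self.temperature = temperature

    def __call__(self, input_ids, scores):
        return [s / self.temperature for s in scores]

class TopKWarper:
    """Masks everything outside the top_k highest scores with -inf."""
    def __init__(self, top_k):
        self.top_k = top_k

    def __call__(self, input_ids, scores):
        cutoff = sorted(scores, reverse=True)[self.top_k - 1]
        return [s if s >= cutoff else -math.inf for s in scores]

class ProcessorList(list):
    """Applies each processor in order, like LogitsProcessorList.__call__."""
    def __call__(self, input_ids, scores):
        for processor in self:
            scores = processor(input_ids, scores)
        return scores

pipeline = ProcessorList([TemperatureWarper(2.0), TopKWarper(2)])
print(pipeline([], [4.0, 2.0, 1.0]))  # [2.0, 1.0, -inf]
```

Because processors compose left to right, their order matters: warping the temperature before top-k filtering can select a different token set than the reverse order.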
The processors built and added by generate() are determined by the active GenerationConfig fields:
| GenerationConfig field | Processor added |
|---|---|
| temperature | TemperatureLogitsWarper |
| top_k | TopKLogitsWarper |
| top_p | TopPLogitsWarper |
| min_p | MinPLogitsWarper |
| top_h | TopHLogitsWarper |
| repetition_penalty | RepetitionPenaltyLogitsProcessor |
| no_repeat_ngram_size | NoRepeatNGramLogitsProcessor |
| bad_words_ids | NoBadWordsLogitsProcessor |
| min_length | MinLengthLogitsProcessor |
| min_new_tokens | MinNewTokensLengthLogitsProcessor |
| forced_bos_token_id | ForcedBOSTokenLogitsProcessor |
| forced_eos_token_id | ForcedEOSTokenLogitsProcessor |
| watermarking_config | WatermarkLogitsProcessor or SynthIDTextWatermarkLogitsProcessor |
| guidance_scale > 1 | UnbatchedClassifierFreeGuidanceLogitsProcessor |
Callers may also pass a custom logits_processor: LogitsProcessorList argument directly to generate(), which is merged with the auto-built list.
For full details see Logits Processing Pipeline.
Sources: src/transformers/generation/logits_process.py:48-100, src/transformers/generation/utils.py:73-100
The KV cache is managed through a class hierarchy rooted at Cache and CacheLayerMixin in src/transformers/cache_utils.py. The concrete implementations differ in memory allocation strategy and torch.compile compatibility:
| Class | Key behaviour | is_compileable |
|---|---|---|
| DynamicCache | Grows by torch.cat each step | No |
| StaticCache | Pre-allocates fixed tensors, writes via index_copy_ | Yes |
| QuantizedCache | Quantizes older entries; keeps a residual buffer in full precision | No |
| EncoderDecoderCache | Wraps two caches for encoder-decoder models | Depends on the wrapped caches |
The cache type to instantiate is selected by cache_implementation in GenerationConfig. Valid string values include "dynamic", "static", "offloaded_static", "quantized", and others defined in ALL_CACHE_IMPLEMENTATIONS at src/transformers/generation/configuration_utils.py:45-56.
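The difference between the two main allocation strategies can be sketched with lists standing in for key/value tensors. These are toy classes, not the library's implementations:

```python
class ToyDynamicCache:
    """Grows on every update, like torch.cat in DynamicCache."""
    def __init__(self):
        self.keys = []

    def update(self, new_key):
        self.keys.append(new_key)
        return self.keys

class ToyStaticCache:
    """Fixed-size buffer written in place, like index_copy_ in StaticCache."""
    def __init__(self, max_len):
        self.keys = [None] * max_len  # shape fixed up front: compile-friendly
        self.pos = 0

    def update(self, new_key):
        self.keys[self.pos] = new_key
        self.pos += 1
        return self.keys[: self.pos]

cache = ToyStaticCache(max_len=4)
cache.update("k0")
print(len(cache.keys))  # 4 - the allocation does not grow as tokens arrive
```

The static variant trades memory (it must reserve max_len slots up front) for stable tensor shapes, which is what makes it compatible with torch.compile.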
For full details see Cache System.
Sources: src/transformers/cache_utils.py:27-500, src/transformers/generation/configuration_utils.py:45-56
The two internal decoding loops are:

_sample() — handles both greedy decoding (do_sample=False) and multinomial sampling (do_sample=True) in a single unified loop that, at each step, either takes the argmax or samples from the processed distribution.

_beam_search() — maintains num_beams candidate sequences per batch item; after max_new_tokens steps or EOS, it selects the highest-scoring beam. It also handles beam sampling when do_sample=True.

When assistant_model or prompt_lookup_num_tokens is passed to generate(), the mode becomes ASSISTED_GENERATION and the _assisted_decoding() loop runs instead. This uses a CandidateGenerator to propose multiple draft tokens cheaply, which are then verified in a single target-model forward pass.
Concrete generators include:
| Class | Strategy |
|---|---|
| AssistedCandidateGenerator | Uses a smaller draft model with the same tokenizer |
| AssistedCandidateGeneratorDifferentTokenizers | Uses a draft model with a different tokenizer (universal assisted generation) |
| PromptLookupCandidateGenerator | Matches n-grams in the prompt to propose continuation tokens |
| EarlyExitCandidateGenerator | Uses intermediate model layers as the draft |
| UniversalSpeculativeDecodingGenerator | Token-space alignment between mismatched vocabularies |
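The prompt-lookup strategy is simple enough to sketch directly. This is a simplified version: the real PromptLookupCandidateGenerator works on tensors and tries several n-gram sizes:

```python
# Simplified prompt lookup: find the latest earlier occurrence of the most
# recent n-gram and propose the tokens that followed it as draft candidates.
def prompt_lookup_candidates(input_ids, ngram_size=2, num_draft=3):
    ngram = input_ids[-ngram_size:]
    for start in range(len(input_ids) - ngram_size - 1, -1, -1):
        if input_ids[start : start + ngram_size] == ngram:
            follow = input_ids[start + ngram_size : start + ngram_size + num_draft]
            if follow:
                return follow
    return []  # no earlier match: no candidates this step

print(prompt_lookup_candidates([5, 6, 7, 8, 5, 6]))  # [7, 8, 5]
```

Because it needs no draft model at all, prompt lookup is essentially free; it pays off on inputs with repeated spans, such as code or summarization prompts.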
For full details see Assisted and Speculative Decoding.
Sources: src/transformers/generation/utils.py:132-143, src/transformers/generation/candidate_generator.py:39-76, src/transformers/generation/candidate_generator.py:78-199
StoppingCriteriaList aggregates StoppingCriteria objects and is called at each step. Generation halts when any criterion returns True. The built-in implementations are:
| Class | Stops when |
|---|---|
| MaxLengthCriteria | Total sequence length ≥ max_length |
| MaxTimeCriteria | Wall-clock time ≥ max_time |
| EosTokenCriteria | Any batch item's last token is an EOS id |
| StopStringCriteria | Decoded output contains a stop string |
| ConfidenceCriteria | The top-token probability falls below a confidence threshold (used internally in assisted decoding) |
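The aggregation logic can be sketched as follows, with simplified pure-Python stand-ins for the library classes:

```python
class MaxLength:
    """Stops once the total sequence length reaches max_length."""
    def __init__(self, max_length):
        self.max_length = max_length

    def __call__(self, input_ids, scores):
        return len(input_ids) >= self.max_length

class EosToken:
    """Stops once the last generated token is the EOS id."""
    def __init__(self, eos_token_id):
        self.eos_token_id = eos_token_id

    def __call__(self, input_ids, scores):
        return bool(input_ids) and input_ids[-1] == self.eos_token_id

class CriteriaList(list):
    """Halts generation when ANY criterion fires, like StoppingCriteriaList."""
    def __call__(self, input_ids, scores):
        return any(criterion(input_ids, scores) for criterion in self)

stop = CriteriaList([MaxLength(8), EosToken(2)])
print(stop([1, 1, 2], None))  # True (EOS reached before max length)
```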
Sources: src/transformers/generation/utils.py:101-109, src/transformers/generation/__init__.py:78-87
Custom generate Functions

A model repository may include a custom_generate/generate.py file. When loaded with trust_remote_code=True, the adjust_generation_fn() method at src/transformers/generation/utils.py:369-430 replaces self.generate with the custom function via functools.partial. This also works cross-model: any model can use any repo's custom generate function by passing custom_generate="org/repo" to generate().
The custom function must accept model as its first argument, and may extend or completely replace the standard generation loop.
Sources: src/transformers/generation/utils.py:369-491, docs/source/en/generation_strategies.md:99-230
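The binding mechanism itself is ordinary functools.partial, which can be illustrated with a toy model. ToyModel and custom_generate here are hypothetical stand-ins, not library classes:

```python
import functools

class ToyModel:
    def generate(self, prompt):
        return prompt + " [standard]"

def custom_generate(model, prompt):
    # A custom function receives the model as its first argument and may
    # extend or completely replace the standard loop; here we just tag output.
    return prompt + " [custom]"

model = ToyModel()
# Bind the model instance so callers keep using model.generate(...) unchanged.
model.generate = functools.partial(custom_generate, model)
print(model.generate("hello"))  # hello [custom]
```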
GenerationMixin also inherits from ContinuousMixin (generation/continuous_batching.py), which adds ContinuousBatchingManager and related classes for server-side use. This enables processing multiple requests of different lengths in a single batched pass with dynamic scheduling (FIFOScheduler, PrefillFirstScheduler).
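The FIFO scheduling idea can be sketched as follows. This is a toy sketch; the real FIFOScheduler also manages KV-cache blocks and distinguishes prefill from decode phases:

```python
from collections import deque

def schedule_step(waiting, active, max_batch_size):
    """Each step, fill the active batch with the oldest waiting requests."""
    while waiting and len(active) < max_batch_size:
        active.append(waiting.popleft())
    return active

waiting = deque(["req1", "req2", "req3"])
active = []
schedule_step(waiting, active, max_batch_size=2)
print(active)  # ['req1', 'req2']
```

Because requests join and leave the batch independently, a long request no longer blocks short ones queued behind it, which is the core benefit over static batching.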
For full details see Continuous Batching.
Sources: src/transformers/generation/utils.py:72, src/transformers/generation/__init__.py:88-94
The generation system sits between PreTrainedModel and user-facing entry points such as the pipeline() API.
Diagram: Generation system in broader library context
Sources: src/transformers/generation/utils.py:337-364, src/transformers/generation/configuration_utils.py:81-100