This page covers the Whisper encoder-decoder architecture, its extended generation interface (WhisperGenerationMixin), the AutomaticSpeechRecognitionPipeline, and related audio models: Wav2Vec2, SpeechT5, and Bark.
For the base encoder-decoder architectural pattern (cross-attention, shift_tokens_right, Seq2SeqLMOutput) shared with BART and T5, see Encoder-Decoder Models. For the generation infrastructure (logits processors, cache, decoding strategies) that Whisper builds on, see Generation System.
Whisper is an encoder-decoder transformer for multitask speech processing: transcription, translation, language identification, and voice activity detection. The encoder consumes a fixed-length log-mel spectrogram; the decoder generates text conditioned on the encoded audio and a task/language prefix.
Raw audio is preprocessed outside the model by WhisperFeatureExtractor into a log-mel spectrogram of shape (batch, num_mel_bins, 3000) — representing 30 seconds of audio at 100 frames/s. The encoder reduces this to 1500 frames through two convolutional layers before passing it through the transformer stack.
Diagram: WhisperEncoder components (modeling_whisper.py)
Sources: src/transformers/models/whisper/modeling_whisper.py558-690
conv1 maps num_mel_bins → d_model; conv2 downsamples by stride 2. Both use GELU activation. The output length formula is (input_length - 1) // 2 + 1, implemented in _get_feat_extract_output_lengths().
src/transformers/models/whisper/modeling_whisper.py549-555
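The length arithmetic can be checked in isolation. A minimal sketch (not the library code) mirroring the stride-2 downsampling of conv2:

```python
def get_feat_extract_output_lengths(input_length: int) -> int:
    # conv2 uses kernel 3, stride 2, padding 1, so the output length is
    # L_out = (L_in - 1) // 2 + 1; conv1 has stride 1 and preserves length.
    return (input_length - 1) // 2 + 1

# 3000 mel frames (30 s of audio) -> 1500 encoder frames
print(get_feat_extract_output_lengths(3000))
```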
embed_positions holds frozen sinusoidal embeddings for up to max_source_positions positions. The sinusoids() function generates them at initialization:
src/transformers/models/whisper/modeling_whisper.py53-62
embed_positions.requires_grad_(False) is set directly, and _init_weights restores the sinusoidal values via init.copy_() to survive any accidental modifications.
src/transformers/models/whisper/modeling_whisper.py540-548
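A pure-Python sketch of the sinusoid table (the real `sinusoids()` returns a tensor; this version uses nested lists and assumes an even channel count):

```python
import math

def sinusoids(length: int, channels: int, max_timescale: int = 10000):
    """Fixed sinusoidal position table: first half sin, second half cos,
    with geometrically spaced timescales across channels."""
    half = channels // 2
    log_inc = math.log(max_timescale) / (half - 1)
    inv_timescales = [math.exp(-log_inc * i) for i in range(half)]
    table = []
    for pos in range(length):
        scaled = [pos * s for s in inv_timescales]
        table.append([math.sin(t) for t in scaled] + [math.cos(t) for t in scaled])
    return table

emb = sinusoids(1500, 8)  # max_source_positions x d_model (toy sizes)
```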
Each WhisperEncoderLayer applies pre-norm self-attention followed by an FFN (fc1 → activation → fc2). Float16 activations are clamped to finfo(float16).max - 1000 to prevent overflow.
src/transformers/models/whisper/modeling_whisper.py364-421
The encoder does not use an attention mask: input features are always padded or truncated to the fixed 30-second window, so padding and silence are handled implicitly by the model rather than masked out.
Diagram: WhisperDecoder components (modeling_whisper.py)
Sources: src/transformers/models/whisper/modeling_whisper.py693-800
Unlike the encoder, the decoder uses learned positional embeddings (WhisperPositionalEmbedding) indexed by position. Each WhisperDecoderLayer contains:
- self_attn: causal masked self-attention (is_causal=True)
- encoder_attn: cross-attention attending to encoder output
- self_attn_layer_norm, encoder_attn_layer_norm, final_layer_norm
- fc1 (d_model → decoder_ffn_dim) → activation → fc2 (decoder_ffn_dim → d_model)

src/transformers/models/whisper/modeling_whisper.py423-523
WhisperAttention handles both self-attention and cross-attention. Cross-attention is triggered when key_value_states is not None.
For autoregressive decoding, Whisper uses EncoderDecoderCache, which wraps two sub-caches:
- self_attention_cache: growing KV cache for decoder self-attention
- cross_attention_cache: fixed KV cache for encoder-decoder cross-attention

After the first decoder step, the encoder output is fixed. is_updated[layer_idx] tracks whether each layer has already written its cross-attention keys/values. On subsequent steps, the cached values are reused without re-computation.
src/transformers/models/whisper/modeling_whisper.py314-339
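The is_updated bookkeeping can be illustrated with a small sketch (a stand-in class, not the real EncoderDecoderCache API): cross-attention K/V are computed once from the encoder output and then served from the cache on every later step.

```python
class CrossAttentionCache:
    """Sketch: compute cross-attention K/V once per layer, then reuse."""

    def __init__(self, num_layers: int):
        self.kv = [None] * num_layers
        self.is_updated = [False] * num_layers

    def get(self, layer_idx: int, compute_kv):
        # compute_kv runs only on the first decoder step for this layer
        if not self.is_updated[layer_idx]:
            self.kv[layer_idx] = compute_kv()
            self.is_updated[layer_idx] = True
        return self.kv[layer_idx]
```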
| Class | Inherits From | Task |
|---|---|---|
| WhisperModel | WhisperPreTrainedModel | Base encoder-decoder; Seq2SeqModelOutput |
| WhisperForConditionalGeneration | WhisperGenerationMixin, WhisperPreTrainedModel | Transcription/translation; Seq2SeqLMOutput |
| WhisperForAudioClassification | WhisperPreTrainedModel | Encoder + projector + classifier |
| WhisperForCausalLM | WhisperPreTrainedModel, GenerationMixin | Decoder-only; draft model for assisted decoding |
Sources: src/transformers/models/whisper/modeling_whisper.py527-556 tests/models/whisper/test_modeling_whisper.py351-360
WhisperPreTrainedModel declares main_input_name = "input_features" (not input_ids), input_modalities = ("audio", "text"), and sets _supports_flash_attn = True, _supports_sdpa = True, _supports_flex_attn = True, _can_compile_fullgraph = True.
WhisperForAudioClassification supports optional weighted averaging of all encoder hidden states via layer_weights, enabled by config.use_weighted_layer_sum. The weights are initialized to 1 / (num_hidden_layers + 1).
freeze_encoder() on WhisperModel or WhisperForConditionalGeneration sets requires_grad=False on all encoder parameters, useful for decoder-only fine-tuning.
src/transformers/models/whisper/modeling_whisper.py593-596
SpecAugment masking during training is controlled by config.mask_time_prob / config.mask_time_length (time axis) and config.mask_feature_prob / config.mask_feature_length (frequency axis). The _compute_mask_indices() function (shared with Wav2Vec2 and SpeechT5) generates the mask spans.
src/transformers/models/whisper/modeling_whisper.py83-199
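A simplified sketch of span masking in the spirit of _compute_mask_indices() (the real function also handles attention masks, a minimum mask count, and overlap constraints; this version only shows the span-sampling idea):

```python
import random

def compute_mask_indices(seq_len: int, mask_prob: float, mask_length: int, rng=random):
    """Pick span starts so roughly mask_prob of the sequence is covered
    by spans of mask_length (spans may overlap in this sketch)."""
    num_spans = int(mask_prob * seq_len / mask_length + rng.random())
    mask = [False] * seq_len
    for _ in range(num_spans):
        start = rng.randrange(0, seq_len - mask_length + 1)
        for i in range(start, start + mask_length):
            mask[i] = True
    return mask
```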
WhisperGenerationMixin is defined in src/transformers/models/whisper/generation_whisper.py and extends GenerationMixin. Because WhisperForConditionalGeneration inherits it before WhisperPreTrainedModel, its generate() takes precedence over GenerationMixin.generate().
src/transformers/models/whisper/generation_whisper.py240
WhisperGenerationMixin.generate() adds Whisper-specific parameters to the standard generation interface:
src/transformers/models/whisper/generation_whisper.py383-412
| Parameter | Type | Description |
|---|---|---|
| return_timestamps | bool | Enable timestamp token generation |
| task | str | "transcribe" or "translate" |
| language | str \| list[str] | ISO code ("en"), token ("<\|en\|>"), or full name |
| condition_on_prev_tokens | bool | Feed previous segment output as next segment prefix |
| temperature | float \| tuple | Single value or fallback schedule |
| compression_ratio_threshold | float | Zlib compression ratio cutoff (e.g., 1.35) |
| logprob_threshold | float | Average log-prob cutoff per segment (e.g., -1.0) |
| no_speech_threshold | float | Silence detection threshold (e.g., 0.6) |
| num_segment_frames | int | Frames per chunk (default 3000, i.e., 30 s) |
| return_token_timestamps | bool | Attach per-token timestamps via DTW |
| prompt_ids | torch.Tensor | Custom vocabulary context prepended to each segment |
| prompt_condition_type | str | "first-segment" or "all-segments" |
Language can be specified as an ISO code, the token form, a full language name, or a per-batch list. The mapping tables TO_LANGUAGE_CODE and LANGUAGES (in tokenization_whisper.py) handle normalization.
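Assembled as keyword arguments, a robust long-form call might look like the following sketch (values mirror the examples in the table above; loading the model and feature extractor is elided, so the final call is shown commented):

```python
# Whisper-specific kwargs for WhisperGenerationMixin.generate()
generate_kwargs = dict(
    task="transcribe",
    language="en",                                  # ISO code form
    return_timestamps=True,
    condition_on_prev_tokens=True,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),     # fallback schedule
    compression_ratio_threshold=1.35,
    logprob_threshold=-1.0,
    no_speech_threshold=0.6,
)
# outputs = model.generate(input_features, **generate_kwargs)
```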
When audio is longer than 30 seconds, the encoder features are split into num_segment_frames-length chunks and processed iteratively. A short-form path (force_unique_generate_call=True or single-chunk audio) bypasses the loop and calls super().generate() directly.
Diagram: Long-form transcription loop (generation_whisper.py)
Sources: src/transformers/models/whisper/generation_whisper.py383-900 src/transformers/models/whisper/generation_whisper.py126-237
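A deliberately simplified sketch of the chunking loop (the real loop advances the seek position based on the last decoded timestamp and carries condition-on-prev state; this version assumes a fixed stride):

```python
def split_into_segments(total_frames: int, num_segment_frames: int = 3000):
    """Return (start, end) frame ranges covering the full feature length."""
    segments = []
    seek = 0
    while seek < total_frames:
        segments.append((seek, min(seek + num_segment_frames, total_frames)))
        seek += num_segment_frames
    return segments

# 75 s of audio -> two full 30 s chunks plus a 15 s remainder
print(split_into_segments(7500))
```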
Each generated segment is validated before being accepted. If a check fails, the segment is discarded and regenerated using the next temperature value in the fallback schedule.
| Threshold | Failure Condition | Rationale |
|---|---|---|
| compression_ratio_threshold | len(raw_bytes) / len(zlib.compress(raw_bytes)) > threshold | High compression ratio = repetitive output |
| logprob_threshold | mean(log P(token)) < threshold | Low avg log-prob = uncertain output |
| no_speech_threshold | P(no-speech token) > threshold AND logprob < logprob_threshold | Combined silence detection |
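The compression check exploits the fact that degenerate, looping output compresses far better than normal text. A sketch over raw text (the library computes the ratio over token bytes, but the idea is the same):

```python
import zlib

def compression_ratio(text: str) -> float:
    # Raw size over compressed size: repetitive text compresses well,
    # so a high ratio signals looping output.
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

normal = compression_ratio("The quick brown fox jumps over the lazy dog.")
looping = compression_ratio("the the the " * 50)
```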
temperature accepts a tuple such as (0.0, 0.2, 0.4, 0.6, 0.8, 1.0). Generation starts at temperature[0] (greedy or beam) and escalates only on failure.
src/transformers/models/whisper/generation_whisper.py489-503
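The retry logic reduces to a simple loop; a sketch in which the hypothetical generate_fn(t) returns the decoded text plus a flag saying whether all quality checks passed at that temperature:

```python
def generate_with_fallback(generate_fn, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Escalate through the temperature schedule until the checks pass;
    if every temperature fails, the last attempt is accepted."""
    for t in temperatures:
        text, ok = generate_fn(t)
        if ok:
            return text, t
    return text, temperatures[-1]
```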
The logits processors installed per segment:
- WhisperTimeStampLogitsProcessor: enforces valid timestamp token patterns (monotonically increasing, paired)
- SuppressTokensAtBeginLogitsProcessor: suppresses tokens invalid at the start of generation
- SuppressTokensLogitsProcessor: suppresses always-invalid tokens
- WhisperNoSpeechDetection: monitors the no-speech token probability

src/transformers/models/whisper/generation_whisper.py26-33
_extract_token_timestamps() maps each output token to a position in the input audio using the cross-attention weights collected during generation.
Diagram: DTW timestamp pipeline (generation_whisper.py)
Sources: src/transformers/models/whisper/generation_whisper.py241-381
_dynamic_time_warping() implements standard DP DTW. It returns text_indices (output token positions) and time_indices (corresponding input frame positions), from which token-level timestamps are derived.
src/transformers/models/whisper/generation_whisper.py64-115
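The DP recurrence and backtrace can be sketched on a plain cost matrix (token rows, frame columns); this is textbook DTW, not the exact library implementation:

```python
def dynamic_time_warping(cost):
    """Return a monotonic alignment path of (token_idx, frame_idx) pairs
    minimizing the accumulated cost."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # Backtrace from the bottom-right corner
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
        if step == acc[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == acc[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```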
_median_filter() applies a sliding median of width config.median_filter_width (padding with reflection) to smooth cross-attention noise before DTW.
src/transformers/models/whisper/generation_whisper.py43-61
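A list-based sketch of the sliding median with reflect padding (assumes odd width; the real function operates on tensors along the last axis):

```python
def median_filter(values, width: int = 3):
    """Smooth a 1-D sequence with a width-sized median window,
    reflect-padding the boundaries."""
    pad = width // 2
    padded = values[pad:0:-1] + values + values[-2:-pad - 2:-1]
    return [sorted(padded[i:i + width])[pad] for i in range(len(values))]

# A single attention spike is suppressed:
print(median_filter([1, 2, 100, 4, 5], width=3))
```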
generation_config.alignment_heads is a list of [layer_index, head_index] pairs. These heads are empirically selected per model variant for best alignment quality.
WhisperTokenizer in src/transformers/models/whisper/tokenization_whisper.py is a BPE tokenizer backed by the HuggingFace tokenizers library. It inherits from TokenizersBackend and extends it with Whisper's special token vocabulary.
src/transformers/models/whisper/tokenization_whisper.py163-276
LANGUAGES maps 100 ISO language codes to full names. TO_LANGUAGE_CODE provides the reverse lookup plus aliases ("burmese" → "my", "mandarin" → "zh", etc.). TASK_IDS = ["translate", "transcribe"].
src/transformers/models/whisper/tokenization_whisper.py40-160
Whisper generation is controlled by a fixed prefix injected at the start of decoding:
| Position | Token | Example |
|---|---|---|
| 0 | <|startoftranscript|> | Always present |
| 1 | Language token | <|en|> |
| 2 | Task token | <|transcribe|> or <|translate|> |
| 3 | Timestamp control | <|notimestamps|> or omitted |
set_prefix_tokens() constructs this prefix based on self.language, self.task, and self.predict_timestamps.
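The prefix layout in the table above can be expressed as a short sketch (a stand-in helper, not the tokenizer method itself):

```python
def build_prefix(language_token: str, task: str, predict_timestamps: bool):
    """Assemble the decoder prompt: SOT, language, task, and optionally
    the <|notimestamps|> control token."""
    prefix = ["<|startoftranscript|>", language_token, f"<|{task}|>"]
    if not predict_timestamps:
        prefix.append("<|notimestamps|>")
    return prefix
```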
Timestamp tokens have IDs above the special token range and are formatted as <|T.TT|> (e.g., <|1.08|>). Two methods decode these:
- _decode_with_timestamps(): returns a string interleaving decoded text with timestamp annotations; handles segment boundaries and cumulative time offsets. src/transformers/models/whisper/tokenization_whisper.py279-325
- _compute_offsets(): identifies consecutive timestamp token pairs as segment boundaries and returns a list of (start, end) tuples in seconds. src/transformers/models/whisper/tokenization_whisper.py328-420
The default time resolution is 0.02 s/token (50 tokens per second of audio coverage), configurable via time_precision.
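The id-to-seconds conversion is simple arithmetic; a sketch assuming timestamp_begin is the id of <|0.00|> (50364 in the multilingual vocabulary — stated here as an assumption, check the loaded tokenizer):

```python
TIME_PRECISION = 0.02  # seconds per timestamp token step

def timestamp_token_to_seconds(token_id: int, timestamp_begin: int) -> float:
    # Timestamp tokens occupy consecutive ids starting at <|0.00|>
    return (token_id - timestamp_begin) * TIME_PRECISION

# id 54 steps above <|0.00|> corresponds to <|1.08|>
print(timestamp_token_to_seconds(50364 + 54, 50364))
```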
AutomaticSpeechRecognitionPipeline in src/transformers/pipelines/automatic_speech_recognition.py is a ChunkPipeline supporting multiple ASR backends behind a uniform interface. It sets _pipeline_calls_generate = True, _load_feature_extractor = True, and _load_tokenizer = True.
src/transformers/pipelines/automatic_speech_recognition.py112-185
Default generation config: max_new_tokens=256, num_beams=5.
self.type is assigned in __init__ based on the loaded model:
| self.type | Condition | Typical Model |
|---|---|---|
| "seq2seq_whisper" | config.model_type == "whisper" | WhisperForConditionalGeneration |
| "seq2seq" | model in MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES | Speech2TextForConditionalGeneration |
| "ctc_with_lm" | decoder argument passed | Wav2Vec2ForCTC + pyctcdecode |
| "ctc" | default | Wav2Vec2ForCTC, HubertForCTC |
src/transformers/pipelines/automatic_speech_recognition.py196-207
Diagram: AutomaticSpeechRecognitionPipeline flow (automatic_speech_recognition.py)
Sources: src/transformers/pipelines/automatic_speech_recognition.py209-600
preprocess() normalizes inputs before feature extraction:
- str URL: httpx.get() → ffmpeg_read() (requires ffmpeg)
- str path: file open → ffmpeg_read()
- bytes: ffmpeg_read()
- np.ndarray or torch.Tensor: used directly as float32 waveform
- dict: {"sampling_rate": int, "raw": array, "stride": (left, right)}

src/transformers/pipelines/automatic_speech_recognition.py363-420
For long audio, chunk_iter() yields overlapping feature extractor outputs:
|<-------- chunk_len -------->|
|--stride_left--|--content--|--stride_right--|
Each yielded dict includes a stride tuple (chunk_len_samples, stride_left_samples, stride_right_samples). rescale_stride() converts this from audio sample space to token/logit space using the model's downsampling ratio.
src/transformers/pipelines/automatic_speech_recognition.py41-84
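The rescaling is a straight multiplication by the downsampling ratio; a simplified sketch (the library version also validates that strides fit inside the chunk):

```python
def rescale_stride(strides, ratio: float):
    """Map (chunk_len, stride_left, stride_right) tuples from audio-sample
    space into token/logit space."""
    return [
        (int(chunk * ratio), int(left * ratio), int(right * ratio))
        for chunk, left, right in strides
    ]

# 30 s at 16 kHz (480000 samples) mapping onto 1500 encoder frames
print(rescale_stride([(480000, 32000, 32000)], 1500 / 480000))
```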
For CTC models, _find_longest_common_sequence() stitches overlapping chunk outputs by finding the longest token subsequence common to the overlap region.
src/transformers/pipelines/automatic_speech_recognition.py87-109
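A heavily simplified sketch of the stitching idea (the real _find_longest_common_sequence() also scores partial and noisy matches; this version only drops an exactly repeated overlap):

```python
def stitch_chunks(prev, nxt, max_overlap):
    """Merge two token lists by removing the longest prefix of nxt that
    exactly matches a suffix of prev, up to max_overlap tokens."""
    for k in range(min(max_overlap, len(prev), len(nxt)), 0, -1):
        if prev[-k:] == nxt[:k]:
            return prev + nxt[k:]
    return prev + nxt
```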
Note for Whisper: chunk_length_s is experimental with seq2seq_whisper. For audio longer than 30 s, use WhisperGenerationMixin.generate() directly. The pipeline warns about this.
src/transformers/pipelines/automatic_speech_recognition.py284-297
| return_timestamps | Supported Model Types | Mechanism |
|---|---|---|
| "char" | CTC only | CTC frame-level logit alignment |
| "word" | CTC, Whisper | CTC alignment or Whisper DTW |
| True | Whisper only | Timestamp tokens from decoder |
src/transformers/pipelines/automatic_speech_recognition.py229-252
Wav2Vec2 takes raw waveforms (not mel spectrograms) directly. The convolutional feature encoder replaces the mel spectrogram frontend.
Diagram: Wav2Vec2 architecture (modeling_wav2vec2.py)
Sources: src/transformers/models/wav2vec2/modeling_wav2vec2.py1-99
- Wav2Vec2ForCTC applies a linear projection then CTC decode over the frame sequence.
- ctc_with_lm mode wraps a pyctcdecode BeamSearchDecoderCTC.
- SpecAugment masking (_compute_mask_indices()) is shared between Wav2Vec2, Whisper, and SpeechT5 for training augmentation.
src/transformers/models/wav2vec2/modeling_wav2vec2.py101-200
SpeechT5 uses a single transformer backbone for multiple speech-text tasks. Task specificity is provided by modality pre-nets (before the encoder) and post-nets (after the decoder).
Diagram: SpeechT5 task variants (modeling_speecht5.py)
Sources: src/transformers/models/speecht5/modeling_speecht5.py209-400
| Class | Encoder Pre-net | Decoder Post-net |
|---|---|---|
SpeechT5ForSpeechToText | SpeechT5SpeechEncoderPrenet | SpeechT5TextDecoderPostnet |
SpeechT5ForTextToSpeech | SpeechT5TextEncoderPrenet | SpeechT5SpeechDecoderPostnet + SpeechT5HifiGan |
SpeechT5ForSpeechToSpeech | SpeechT5SpeechEncoderPrenet | SpeechT5SpeechDecoderPostnet + SpeechT5HifiGan |
SpeechT5HifiGan is a standalone neural vocoder (mel spectrogram → waveform) that can be used independently via the text-to-audio pipeline.
shift_spectrograms_right() is the spectrogram analog of shift_tokens_right(), handling the reduction factor (subsampling) during teacher-forced spectrogram decoding.
src/transformers/models/speecht5/modeling_speecht5.py68-86
Bark generates audio from text through three autoregressive stages. Each stage uses a separate PreTrainedModel and GenerationConfig subclass.
Diagram: Bark multi-stage pipeline (modeling_bark.py)
Sources: src/transformers/models/bark/modeling_bark.py361-650
BarkCausalModel is the shared base for the semantic and coarse stages. It is a GPT-2-style model with BarkBlock layers (causal BarkSelfAttention + BarkMLP), learned positional embeddings, and separate input/output vocabulary sizes.
BarkFineModel processes all 8 EnCodec codebook tokens simultaneously using bidirectional (is_causal=False) attention, then predicts the fine tokens for each codebook.
Each stage has its own generation config:
- BarkSemanticGenerationConfig
- BarkCoarseGenerationConfig
- BarkFineGenerationConfig

Specialized logits processors:
| Processor | Purpose |
|---|---|
AlternatingCodebooksLogitsProcessor | Enforces alternating codebook token generation in coarse stage |
BarkEosPrioritizerLogitsProcessor | Raises EOS probability near the max length limit |
SuppressTokensLogitsProcessor | Blocks vocabulary entries invalid for the current stage |
src/transformers/models/bark/modeling_bark.py26-30
BarkModel.generate() orchestrates all three stages, accepting a text string and returning a waveform tensor.
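The orchestration reduces to a chain of stage calls; a sketch in which each argument is a stand-in callable for the corresponding sub-model (the real method also threads voice presets and per-stage generation configs through):

```python
def bark_generate(text, semantic, coarse, fine, codec_decode):
    """Text -> semantic tokens -> coarse codebook tokens -> fine codebook
    tokens -> waveform, one stage feeding the next."""
    semantic_tokens = semantic(text)
    coarse_tokens = coarse(semantic_tokens)
    fine_tokens = fine(coarse_tokens)
    return codec_decode(fine_tokens)
```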
src/transformers/audio_utils.py provides pure-NumPy audio DSP utilities used by feature extractors. They are framework-agnostic.
| Function | Purpose |
|---|---|
load_audio() | Load from file/URL via torchcodec (preferred) or librosa |
spectrogram() | Compute STFT-based (mel/log-mel) spectrogram |
mel_filter_bank() | Construct triangular mel filterbank matrix |
window_function() | Generate Hann, Hamming, Blackman, etc. windows |
power_to_db() | Convert power spectrogram to dB scale |
src/transformers/audio_utils.py60-231
WhisperFeatureExtractor uses mel_filter_bank() and spectrogram() to produce the 80-bin or 128-bin log-mel spectrogram expected by WhisperEncoder. The pipeline's preprocess() uses ffmpeg_read() from pipelines/audio_utils.py to decode audio bytes from any ffmpeg-supported format.