This page covers the Whisper encoder-decoder architecture, its extended generation interface (WhisperGenerationMixin), the AutomaticSpeechRecognitionPipeline, and related audio models: Wav2Vec2, SpeechT5, and Bark.
For the base encoder-decoder architectural pattern (cross-attention, shift_tokens_right, Seq2SeqLMOutput) shared with BART and T5, see Encoder-Decoder Models. For the generation infrastructure (logits processors, cache, decoding strategies) that Whisper builds on, see Generation System.
Whisper is an encoder-decoder transformer for multitask speech processing: transcription, translation, language identification, and voice activity detection. The encoder consumes a fixed-length log-mel spectrogram; the decoder generates text conditioned on the encoded audio and a task/language prefix.
Raw audio is preprocessed outside the model by WhisperFeatureExtractor into a log-mel spectrogram of shape (batch, num_mel_bins, 3000) — representing 30 seconds of audio at 100 frames/s. The encoder reduces this to 1500 frames through two convolutional layers before passing it through the transformer stack.
Diagram: WhisperEncoder components (modeling_whisper.py)
Sources: src/transformers/models/whisper/modeling_whisper.py558-690
conv1 maps num_mel_bins → d_model; conv2 downsamples by stride 2. Both use GELU activation. The output length formula is (input_length - 1) // 2 + 1, implemented in _get_feat_extract_output_lengths().
src/transformers/models/whisper/modeling_whisper.py549-555
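The length arithmetic can be checked in isolation. A minimal sketch (not the library code) mirroring the stride-2 downsampling of conv2:

```python
def get_feat_extract_output_lengths(input_length: int) -> int:
    # conv2 uses kernel 3, stride 2, padding 1, so the output length is
    # L_out = (L_in - 1) // 2 + 1; conv1 has stride 1 and preserves length.
    return (input_length - 1) // 2 + 1

# 3000 mel frames (30 s of audio) -> 1500 encoder frames
print(get_feat_extract_output_lengths(3000))
```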
embed_positions holds frozen sinusoidal embeddings for up to max_source_positions positions. The sinusoids() function generates them at initialization:
src/transformers/models/whisper/modeling_whisper.py53-62
embed_positions.requires_grad_(False) is set directly, and _init_weights restores the sinusoidal values via init.copy_() to survive any accidental modifications.
src/transformers/models/whisper/modeling_whisper.py540-548
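A pure-Python sketch of the sinusoid table (the real `sinusoids()` returns a tensor; this version uses nested lists and assumes an even channel count):

```python
import math

def sinusoids(length: int, channels: int, max_timescale: int = 10000):
    """Fixed sinusoidal position table: first half sin, second half cos,
    with geometrically spaced timescales across channels."""
    half = channels // 2
    log_inc = math.log(max_timescale) / (half - 1)
    inv_timescales = [math.exp(-log_inc * i) for i in range(half)]
    table = []
    for pos in range(length):
        scaled = [pos * s for s in inv_timescales]
        table.append([math.sin(t) for t in scaled] + [math.cos(t) for t in scaled])
    return table

emb = sinusoids(1500, 8)  # max_source_positions x d_model (toy sizes)
```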
Each WhisperEncoderLayer applies pre-norm self-attention followed by an FFN (fc1 → activation → fc2). Float16 activations are clamped to finfo(float16).max - 1000 to prevent overflow.
src/transformers/models/whisper/modeling_whisper.py364-421
The encoder does not use an attention mask: input features are always padded or truncated to the fixed 30-second window, so padding and silence are handled implicitly by the model rather than masked out.
Diagram: WhisperDecoder components (modeling_whisper.py)
Sources: src/transformers/models/whisper/modeling_whisper.py693-800
Unlike the encoder, the decoder uses learned positional embeddings (WhisperPositionalEmbedding) indexed by position. Each WhisperDecoderLayer contains:
- self_attn: causal masked self-attention (is_causal=True)
- encoder_attn: cross-attention attending to encoder output
- self_attn_layer_norm, encoder_attn_layer_norm, final_layer_norm
- fc1 (d_model → decoder_ffn_dim) → activation → fc2 (decoder_ffn_dim → d_model)

src/transformers/models/whisper/modeling_whisper.py423-523
WhisperAttention handles both self-attention and cross-attention. Cross-attention is triggered when key_value_states is not None.
For autoregressive decoding, Whisper uses EncoderDecoderCache, which wraps two sub-caches:
- self_attention_cache: growing KV cache for decoder self-attention
- cross_attention_cache: fixed KV cache for encoder-decoder cross-attention

After the first decoder step, the encoder output is fixed. is_updated[layer_idx] tracks whether each layer has already written its cross-attention keys/values. On subsequent steps, the cached values are reused without re-computation.
src/transformers/models/whisper/modeling_whisper.py314-339
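The is_updated bookkeeping can be illustrated with a small sketch (a stand-in class, not the real EncoderDecoderCache API): cross-attention K/V are computed once from the encoder output and then served from the cache on every later step.

```python
class CrossAttentionCache:
    """Sketch: compute cross-attention K/V once per layer, then reuse."""

    def __init__(self, num_layers: int):
        self.kv = [None] * num_layers
        self.is_updated = [False] * num_layers

    def get(self, layer_idx: int, compute_kv):
        # compute_kv runs only on the first decoder step for this layer
        if not self.is_updated[layer_idx]:
            self.kv[layer_idx] = compute_kv()
            self.is_updated[layer_idx] = True
        return self.kv[layer_idx]
```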
| Class | Inherits From | Task |
|---|---|---|
| WhisperModel | WhisperPreTrainedModel | Base encoder-decoder; Seq2SeqModelOutput |
| WhisperForConditionalGeneration | WhisperGenerationMixin, WhisperPreTrainedModel | Transcription/translation; Seq2SeqLMOutput |
| WhisperForAudioClassification | WhisperPreTrainedModel | Encoder + projector + classifier |
| WhisperForCausalLM | WhisperPreTrainedModel, GenerationMixin | Decoder-only; draft model for assisted decoding |
Sources: src/transformers/models/whisper/modeling_whisper.py527-556 tests/models/whisper/test_modeling_whisper.py351-360
WhisperPreTrainedModel declares main_input_name = "input_features" (not input_ids), input_modalities = ("audio", "text"), and sets _supports_flash_attn = True, _supports_sdpa = True, _supports_flex_attn = True, _can_compile_fullgraph = True.
WhisperForAudioClassification supports optional weighted averaging of all encoder hidden states via layer_weights, enabled by config.use_weighted_layer_sum. The weights are initialized to 1 / (num_hidden_layers + 1).
freeze_encoder() on WhisperModel or WhisperForConditionalGeneration sets requires_grad=False on all encoder parameters, useful for decoder-only fine-tuning.
src/transformers/models/whisper/modeling_whisper.py593-596
SpecAugment masking during training is controlled by config.mask_time_prob / config.mask_time_length (time axis) and config.mask_feature_prob / config.mask_feature_length (frequency axis). The _compute_mask_indices() function (shared with Wav2Vec2 and SpeechT5) generates the mask spans.
src/transformers/models/whisper/modeling_whisper.py83-199
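A simplified sketch of span masking in the spirit of _compute_mask_indices() (the real function also handles attention masks, a minimum mask count, and overlap constraints; this version only shows the span-sampling idea):

```python
import random

def compute_mask_indices(seq_len: int, mask_prob: float, mask_length: int, rng=random):
    """Pick span starts so roughly mask_prob of the sequence is covered
    by spans of mask_length (spans may overlap in this sketch)."""
    num_spans = int(mask_prob * seq_len / mask_length + rng.random())
    mask = [False] * seq_len
    for _ in range(num_spans):
        start = rng.randrange(0, seq_len - mask_length + 1)
        for i in range(start, start + mask_length):
            mask[i] = True
    return mask
```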
WhisperGenerationMixin is defined in src/transformers/models/whisper/generation_whisper.py and extends GenerationMixin. Because WhisperForConditionalGeneration inherits it before WhisperPreTrainedModel, its generate() takes precedence over GenerationMixin.generate().
src/transformers/models/whisper/generation_whisper.py240
WhisperGenerationMixin.generate() adds Whisper-specific parameters to the standard generation interface:
src/transformers/models/whisper/generation_whisper.py383-412
| Parameter | Type | Description |
|---|---|---|
| return_timestamps | bool | Enable timestamp token generation |
| task | str | "transcribe" or "translate" |
| language | str \| list[str] | ISO code ("en"), token ("<\|en\|>"), or full name |
| condition_on_prev_tokens | bool | Feed previous segment output as next segment prefix |
| temperature | float \| tuple | Single value or fallback schedule |
| compression_ratio_threshold | float | Zlib compression ratio cutoff (e.g., 1.35) |
| logprob_threshold | float | Average log-prob cutoff per segment (e.g., -1.0) |
| no_speech_threshold | float | Silence detection threshold (e.g., 0.6) |
| num_segment_frames | int | Frames per chunk (default 3000, i.e., 30 s) |
| return_token_timestamps | bool | Attach per-token timestamps via DTW |
| prompt_ids | torch.Tensor | Custom vocabulary context prepended to each segment |
| prompt_condition_type | str | "first-segment" or "all-segments" |
Language can be specified as an ISO code, the token form, a full language name, or a per-batch list. The mapping tables TO_LANGUAGE_CODE and LANGUAGES (in tokenization_whisper.py) handle normalization.
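Assembled as keyword arguments, a robust long-form call might look like the following sketch (values mirror the examples in the table above; loading the model and feature extractor is elided, so the final call is shown commented):

```python
# Whisper-specific kwargs for WhisperGenerationMixin.generate()
generate_kwargs = dict(
    task="transcribe",
    language="en",                                  # ISO code form
    return_timestamps=True,
    condition_on_prev_tokens=True,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),     # fallback schedule
    compression_ratio_threshold=1.35,
    logprob_threshold=-1.0,
    no_speech_threshold=0.6,
)
# outputs = model.generate(input_features, **generate_kwargs)
```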
When audio is longer than 30 seconds, the encoder features are split into num_segment_frames-length chunks and processed iteratively. A short-form path (force_unique_generate_call=True or single-chunk audio) bypasses the loop and calls super().generate() directly.
Diagram: Long-form transcription loop (generation_whisper.py)
Sources: src/transformers/models/whisper/generation_whisper.py383-900 src/transformers/models/whisper/generation_whisper.py126-237
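A deliberately simplified sketch of the chunking loop (the real loop advances the seek position based on the last decoded timestamp and carries condition-on-prev state; this version assumes a fixed stride):

```python
def split_into_segments(total_frames: int, num_segment_frames: int = 3000):
    """Return (start, end) frame ranges covering the full feature length."""
    segments = []
    seek = 0
    while seek < total_frames:
        segments.append((seek, min(seek + num_segment_frames, total_frames)))
        seek += num_segment_frames
    return segments

# 75 s of audio -> two full 30 s chunks plus a 15 s remainder
print(split_into_segments(7500))
```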
Each generated segment is validated before being accepted. If a check fails, the segment is discarded and regenerated using the next temperature value in the fallback schedule.
| Threshold | Failure Condition | Rationale |
|---|---|---|
| compression_ratio_threshold | len(raw_bytes) / len(zlib.compress(raw_bytes)) > threshold | High compression ratio = repetitive output |
| logprob_threshold | mean(log P(token)) < threshold | Low avg log-prob = uncertain output |
| no_speech_threshold | P(no-speech token) > threshold AND logprob < logprob_threshold | Combined silence detection |
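The compression check exploits the fact that degenerate, looping output compresses far better than normal text. A sketch over raw text (the library computes the ratio over token bytes, but the idea is the same):

```python
import zlib

def compression_ratio(text: str) -> float:
    # Raw size over compressed size: repetitive text compresses well,
    # so a high ratio signals looping output.
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

normal = compression_ratio("The quick brown fox jumps over the lazy dog.")
looping = compression_ratio("the the the " * 50)
```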
temperature accepts a tuple such as (0.0, 0.2, 0.4, 0.6, 0.8, 1.0). Generation starts at temperature[0] (greedy or beam) and escalates only on failure.
src/transformers/models/whisper/generation_whisper.py489-503
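The retry logic reduces to a simple loop; a sketch in which the hypothetical generate_fn(t) returns the decoded text plus a flag saying whether all quality checks passed at that temperature:

```python
def generate_with_fallback(generate_fn, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Escalate through the temperature schedule until the checks pass;
    if every temperature fails, the last attempt is accepted."""
    for t in temperatures:
        text, ok = generate_fn(t)
        if ok:
            return text, t
    return text, temperatures[-1]
```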
The logits processors installed per segment:
- WhisperTimeStampLogitsProcessor: enforces valid timestamp token patterns (monotonically increasing, paired)
- SuppressTokensAtBeginLogitsProcessor: suppresses tokens invalid at the start of generation
- SuppressTokensLogitsProcessor: suppresses always-invalid tokens
- WhisperNoSpeechDetection: monitors the no-speech token probability

src/transformers/models/whisper/generation_whisper.py26-33
_extract_token_timestamps() maps each output token to a position in the input audio using the cross-attention weights collected during generation.
Diagram: DTW timestamp pipeline (generation_whisper.py)
Sources: src/transformers/models/whisper/generation_whisper.py241-381
_dynamic_time_warping() implements standard DP DTW. It returns text_indices (output token positions) and time_indices (corresponding input frame positions), from which token-level timestamps are derived.
src/transformers/models/whisper/generation_whisper.py64-115
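The DP recurrence and backtrace can be sketched on a plain cost matrix (token rows, frame columns); this is textbook DTW, not the exact library implementation:

```python
def dynamic_time_warping(cost):
    """Return a monotonic alignment path of (token_idx, frame_idx) pairs
    minimizing the accumulated cost."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # Backtrace from the bottom-right corner
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
        if step == acc[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == acc[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```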
_median_filter() applies a sliding median of width config.median_filter_width (padding with reflection) to smooth cross-attention noise before DTW.
src/transformers/models/whisper/generation_whisper.py43-61
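A list-based sketch of the sliding median with reflect padding (assumes odd width; the real function operates on tensors along the last axis):

```python
def median_filter(values, width: int = 3):
    """Smooth a 1-D sequence with a width-sized median window,
    reflect-padding the boundaries."""
    pad = width // 2
    padded = values[pad:0:-1] + values + values[-2:-pad - 2:-1]
    return [sorted(padded[i:i + width])[pad] for i in range(len(values))]

# A single attention spike is suppressed:
print(median_filter([1, 2, 100, 4, 5], width=3))
```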
generation_config.alignment_heads is a list of [layer_index, head_index] pairs. These heads are empirically selected per model variant for best alignment quality.
WhisperTokenizer in src/transformers/models/whisper/tokenization_whisper.py is a BPE tokenizer backed by the HuggingFace tokenizers library. It inherits from TokenizersBackend and extends it with Whisper's special token vocabulary.
src/transformers/models/whisper/tokenization_whisper.py163-276
LANGUAGES maps 100 ISO language codes to full names. TO_LANGUAGE_CODE provides the reverse lookup plus aliases ("burmese" → "my", "mandarin" → "zh", etc.). TASK_IDS = ["translate", "transcribe"].
src/transformers/models/whisper/tokenization_whisper.py40-160
Whisper generation is controlled by a fixed prefix injected at the start of decoding:
| Position | Token | Example |
|---|---|---|
| 0 | <|startoftranscript|> | Always present |
| 1 | Language token | <|en|> |
| 2 | Task token | <|transcribe|> or <|translate|> |
| 3 | Timestamp control | <|notimestamps|> or omitted |
set_prefix_tokens() constructs this prefix based on self.language, self.task, and self.predict_timestamps.
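The prefix layout in the table above can be expressed as a short sketch (a stand-in helper, not the tokenizer method itself):

```python
def build_prefix(language_token: str, task: str, predict_timestamps: bool):
    """Assemble the decoder prompt: SOT, language, task, and optionally
    the <|notimestamps|> control token."""
    prefix = ["<|startoftranscript|>", language_token, f"<|{task}|>"]
    if not predict_timestamps:
        prefix.append("<|notimestamps|>")
    return prefix
```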
Timestamp tokens have IDs above the special token range and are formatted as <|T.TT|> (e.g., <|1.08|>). Two methods decode these:
- _decode_with_timestamps(): returns a string interleaving decoded text with timestamp annotations; handles segment boundaries and cumulative time offsets. src/transformers/models/whisper/tokenization_whisper.py279-325
- _compute_offsets(): identifies consecutive timestamp token pairs as segment boundaries and returns a list of (start, end) tuples in seconds. src/transformers/models/whisper/tokenization_whisper.py328-420
The default time resolution is 0.02 s/token (50 tokens per second of audio coverage), configurable via time_precision.
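The id-to-seconds conversion is simple arithmetic; a sketch assuming timestamp_begin is the id of <|0.00|> (50364 in the multilingual vocabulary — stated here as an assumption, check the loaded tokenizer):

```python
TIME_PRECISION = 0.02  # seconds per timestamp token step

def timestamp_token_to_seconds(token_id: int, timestamp_begin: int) -> float:
    # Timestamp tokens occupy consecutive ids starting at <|0.00|>
    return (token_id - timestamp_begin) * TIME_PRECISION

# id 54 steps above <|0.00|> corresponds to <|1.08|>
print(timestamp_token_to_seconds(50364 + 54, 50364))
```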
AutomaticSpeechRecognitionPipeline in src/transformers/pipelines/automatic_speech_recognition.py is a ChunkPipeline supporting multiple ASR backends behind a uniform interface. It sets _pipeline_calls_generate = True, _load_feature_extractor = True, and _load_tokenizer = True.
src/transformers/pipelines/automatic_speech_recognition.py112-185
Default generation config: max_new_tokens=256, num_beams=5.
self.type is assigned in __init__ based on the loaded model:
| self.type | Condition | Typical Model |
|---|---|---|
| "seq2seq_whisper" | config.model_type == "whisper" | WhisperForConditionalGeneration |
| "seq2seq" | model in MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES | Speech2TextForConditionalGeneration |
| "ctc_with_lm" | decoder argument passed | Wav2Vec2ForCTC + pyctcdecode |
| "ctc" | default | Wav2Vec2ForCTC, HubertForCTC |
src/transformers/pipelines/automatic_speech_recognition.py196-207
Diagram: AutomaticSpeechRecognitionPipeline flow (automatic_speech_recognition.py)
Sources: src/transformers/pipelines/automatic_speech_recognition.py209-600
preprocess() normalizes inputs before feature extraction:
- str URL: httpx.get() → ffmpeg_read() (requires ffmpeg)
- str path: file open → ffmpeg_read()
- bytes: ffmpeg_read()
- np.ndarray or torch.Tensor: used directly as float32 waveform
- dict: {"sampling_rate": int, "raw": array, "stride": (left, right)}

src/transformers/pipelines/automatic_speech_recognition.py363-420
For long audio, chunk_iter() yields overlapping feature extractor outputs:
|<-------- chunk_len -------->|
|--stride_left--|--content--|--stride_right--|
Each yielded dict includes a stride tuple (chunk_len_samples, stride_left_samples, stride_right_samples). rescale_stride() converts this from audio sample space to token/logit space using the model's downsampling ratio.
src/transformers/pipelines/automatic_speech_recognition.py41-84
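The rescaling is a straight multiplication by the downsampling ratio; a simplified sketch (the library version also validates that strides fit inside the chunk):

```python
def rescale_stride(strides, ratio: float):
    """Map (chunk_len, stride_left, stride_right) tuples from audio-sample
    space into token/logit space."""
    return [
        (int(chunk * ratio), int(left * ratio), int(right * ratio))
        for chunk, left, right in strides
    ]

# 30 s at 16 kHz (480000 samples) mapping onto 1500 encoder frames
print(rescale_stride([(480000, 32000, 32000)], 1500 / 480000))
```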
For CTC models, _find_longest_common_sequence() stitches overlapping chunk outputs by finding the longest token subsequence common to the overlap region.
src/transformers/pipelines/automatic_speech_recognition.py87-109
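A heavily simplified sketch of the stitching idea (the real _find_longest_common_sequence() also scores partial and noisy matches; this version only drops an exactly repeated overlap):

```python
def stitch_chunks(prev, nxt, max_overlap):
    """Merge two token lists by removing the longest prefix of nxt that
    exactly matches a suffix of prev, up to max_overlap tokens."""
    for k in range(min(max_overlap, len(prev), len(nxt)), 0, -1):
        if prev[-k:] == nxt[:k]:
            return prev + nxt[k:]
    return prev + nxt
```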
Note for Whisper: chunk_length_s is experimental with seq2seq_whisper. For audio longer than 30 s, use WhisperGenerationMixin.generate() directly. The pipeline warns about this.
src/transformers/pipelines/automatic_speech_recognition.py284-297
| return_timestamps | Supported Model Types | Mechanism |
|---|---|---|
| "char" | CTC only | CTC frame-level logit alignment |
| "word" | CTC, Whisper | CTC alignment or Whisper DTW |
| True | Whisper only | Timestamp tokens from decoder |
src/transformers/pipelines/automatic_speech_recognition.py229-252
Wav2Vec2 takes raw waveforms (not mel spectrograms) directly. The convolutional feature encoder replaces the mel spectrogram frontend.
Diagram: Wav2Vec2 architecture (modeling_wav2vec2.py)
Sources: src/transformers/models/wav2vec2/modeling_wav2vec2.py1-99
- Wav2Vec2ForCTC applies a linear projection then CTC decode over the frame sequence.
- ctc_with_lm mode wraps a pyctcdecode BeamSearchDecoderCTC.
- SpecAugment masking (_compute_mask_indices()) is shared between Wav2Vec2, Whisper, and SpeechT5 for training augmentation.
src/transformers/models/wav2vec2/modeling_wav2vec2.py101-200
SpeechT5 uses a single transformer backbone for multiple speech-text tasks. Task specificity is provided by modality pre-nets (before the encoder) and post-nets (after the decoder).
Diagram: SpeechT5 task variants (modeling_speecht5.py)
Sources: src/transformers/models/speecht5/modeling_speecht5.py209-400
| Class | Encoder Pre-net | Decoder Post-net |
|---|---|---|
SpeechT5ForSpeechToText | SpeechT5SpeechEncoderPrenet | SpeechT5TextDecoderPostnet |
SpeechT5ForTextToSpeech | SpeechT5TextEncoderPrenet | SpeechT5SpeechDecoderPostnet + SpeechT5HifiGan |
SpeechT5ForSpeechToSpeech | SpeechT5SpeechEncoderPrenet | SpeechT5SpeechDecoderPostnet + SpeechT5HifiGan |
SpeechT5HifiGan is a standalone neural vocoder (mel spectrogram → waveform) that can be used independently via the text-to-audio pipeline.
shift_spectrograms_right() is the spectrogram analog of shift_tokens_right(), handling the reduction factor (subsampling) during teacher-forced spectrogram decoding.
src/transformers/models/speecht5/modeling_speecht5.py68-86
Bark generates audio from text through three autoregressive stages. Each stage uses a separate PreTrainedModel and GenerationConfig subclass.
Diagram: Bark multi-stage pipeline (modeling_bark.py)
Sources: src/transformers/models/bark/modeling_bark.py361-650
BarkCausalModel is the shared base for the semantic and coarse stages. It is a GPT-2-style model with BarkBlock layers (causal BarkSelfAttention + BarkMLP), learned positional embeddings, and separate input/output vocabulary sizes.
BarkFineModel processes all 8 EnCodec codebook tokens simultaneously using bidirectional (is_causal=False) attention, then predicts the fine tokens for each codebook.
Each stage has its own generation config:
- BarkSemanticGenerationConfig
- BarkCoarseGenerationConfig
- BarkFineGenerationConfig

Specialized logits processors:
| Processor | Purpose |
|---|---|
AlternatingCodebooksLogitsProcessor | Enforces alternating codebook token generation in coarse stage |
BarkEosPrioritizerLogitsProcessor | Raises EOS probability near the max length limit |
SuppressTokensLogitsProcessor | Blocks vocabulary entries invalid for the current stage |
src/transformers/models/bark/modeling_bark.py26-30
BarkModel.generate() orchestrates all three stages, accepting a text string and returning a waveform tensor.
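The orchestration reduces to a chain of stage calls; a sketch in which each argument is a stand-in callable for the corresponding sub-model (the real method also threads voice presets and per-stage generation configs through):

```python
def bark_generate(text, semantic, coarse, fine, codec_decode):
    """Text -> semantic tokens -> coarse codebook tokens -> fine codebook
    tokens -> waveform, one stage feeding the next."""
    semantic_tokens = semantic(text)
    coarse_tokens = coarse(semantic_tokens)
    fine_tokens = fine(coarse_tokens)
    return codec_decode(fine_tokens)
```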
src/transformers/audio_utils.py provides pure-NumPy audio DSP utilities used by feature extractors. They are framework-agnostic.
| Function | Purpose |
|---|---|
load_audio() | Load from file/URL via torchcodec (preferred) or librosa |
spectrogram() | Compute STFT-based (mel/log-mel) spectrogram |
mel_filter_bank() | Construct triangular mel filterbank matrix |
window_function() | Generate Hann, Hamming, Blackman, etc. windows |
power_to_db() | Convert power spectrogram to dB scale |
src/transformers/audio_utils.py60-231
WhisperFeatureExtractor uses mel_filter_bank() and spectrogram() to produce the 80-bin or 128-bin log-mel spectrogram expected by WhisperEncoder. The pipeline's preprocess() uses ffmpeg_read() from pipelines/audio_utils.py to decode audio bytes from any ffmpeg-supported format.