This page provides an overview of the major model architecture families supported by the transformers library and the common architectural patterns shared across implementations. The library supports 200+ model architectures, organized under src/transformers/models/, each implementing specific architectural innovations while sharing fundamental building blocks provided by PreTrainedModel in src/transformers/modeling_utils.py.
| Family | Representative Models | Child Page |
|---|---|---|
| Decoder-Only LMs | LLaMA, Mistral, Gemma, Qwen2, Phi, Falcon, GPT-2, GPT-NeoX, Bloom | 5.1 |
| Attention Mechanisms | Eager, Flash Attention 2, SDPA, FlexAttention backends | 5.2 |
| Positional Embeddings | RoPE variants (linear, dynamic, yarn, llama3), sliding window | 5.3 |
| Mixture-of-Experts | Mixtral, Qwen2MoE, Jamba, GraniteHybrid | 5.4 |
| ASR / Speech | Whisper, Bark, SpeechT5, Wav2Vec2 | 5.5 |
| State Space Models | Mamba, Mamba2, Jamba, Bamba, FalconMamba, Zamba | 5.6 |
| Encoder-Decoder | BART, mBART, T5, Pegasus, Marian, M2M-100 | 5.7 |
| Multimodal VLMs | LLaVA, PaliGemma, Qwen2-VL, BLIP-2, Gemma3, MLLama | 5.8 |
All model families share common building blocks (embeddings, attention mechanisms, feed-forward networks, normalization layers) while varying in their specific implementations and combinations.
All model implementations in transformers follow a consistent four-level hierarchy that provides both flexibility and standardization across different architectures.
Sources: src/transformers/models/llama/modeling_llama.py:315-331, src/transformers/models/mistral/modeling_mistral.py:252-268, src/transformers/models/gemma/modeling_gemma.py:308-324, src/transformers/modeling_layers.py:31-34
Modern LLM architectures like Llama consist of hierarchical components that combine into complete models:
Component Hierarchy Diagram
Sources: src/transformers/models/llama/modeling_llama.py:363-378, src/transformers/models/llama/modeling_llama.py:298-340, src/transformers/models/llama/modeling_llama.py:228-295, src/transformers/models/llama/modeling_llama.py:173-186
The transformers library contains implementations across several model families, each following specific architectural patterns while sharing common infrastructure.
Most modern language models follow the decoder-only transformer architecture. The LlamaForCausalLM class in src/transformers/models/llama/modeling_llama.py is the canonical reference implementation for this family. All variants share the same embedding → stacked decoder layers → norm → lm_head structure.
| Model | ForCausalLM Class | Defining Feature |
|---|---|---|
| LLaMA | LlamaForCausalLM | RoPE, RMSNorm, gated SwiGLU MLP |
| Mistral | MistralForCausalLM | Sliding window attention, GQA |
| Gemma | GemmaForCausalLM | RMSNorm variant with weight + 1.0 |
| Qwen2 | Qwen2ForCausalLM | QKV bias, sliding window |
| Phi | PhiForCausalLM | Partial RoPE, LayerNorm |
| GPT-2 | GPT2LMHeadModel | Learned position embeddings, Conv1D |
| GPT-NeoX | GPTNeoXForCausalLM | Parallel attention+MLP, partial RoPE |
| Bloom | BloomForCausalLM | ALiBi positional encoding |
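The shared embedding → stacked decoder layers → norm → lm_head data flow can be sketched in plain Python. This is a toy illustration, not library code: the "decoder layer" is a stand-in function rather than real attention/MLP math, and all sizes are invented.

```python
# Toy decoder-only forward pass showing the shared
# embedding -> stacked layers -> norm -> lm_head structure.
import math

VOCAB = 8
HIDDEN = 4

# Token embedding table: one hidden vector per vocabulary id.
embedding = [[0.1 * (t + d) for d in range(HIDDEN)] for t in range(VOCAB)]

def decoder_layer(h):
    # Stand-in for an attention + MLP block with residual connection.
    return [x + 0.5 * x for x in h]

def rms_norm(h, eps=1e-6):
    scale = 1.0 / math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x * scale for x in h]

def lm_head(h):
    # Project the hidden state back to vocabulary logits (weights tied
    # to the embedding table here, as many LLMs optionally do).
    return [sum(hx * ex for hx, ex in zip(h, embedding[t])) for t in range(VOCAB)]

def forward(token_id, num_layers=2):
    h = embedding[token_id]          # embed
    for _ in range(num_layers):      # stacked decoder layers
        h = decoder_layer(h)
    h = rms_norm(h)                  # final norm
    return lm_head(h)                # output head

logits = forward(3)
print(len(logits))  # one logit per vocabulary entry
```

Every family in the table above follows this skeleton; the variants differ in what happens inside `decoder_layer` (attention flavor, MLP shape, normalization placement).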
For detailed architecture, see Decoder-Only Language Models.
Sources: src/transformers/models/llama/modeling_llama.py:359-560, src/transformers/models/mistral/modeling_mistral.py:1-450, src/transformers/models/gemma/modeling_gemma.py:1-460, src/transformers/models/qwen2/modeling_qwen2.py:1-400, src/transformers/models/auto/modeling_auto.py:577-650
MoE models use sparse expert routing to scale model capacity efficiently. Each token is processed by only a subset of expert networks. The MixtralSparseMoeBlock class in src/transformers/models/mixtral/modeling_mixtral.py is the canonical implementation.
MoE Routing Data Flow
| Model | MoE Class | Num Experts | Top-K |
|---|---|---|---|
| Mixtral | MixtralSparseMoeBlock | 8 | 2 |
| Qwen2MoE | Qwen2MoeSparseMoeBlock | 64 | 4 |
| Jamba | JambaSparseMoeBlock | 16 | 2 |
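The routing step can be sketched in pure Python, assuming the softmax-then-top-k scheme described above: softmax over router logits, keep the top-k experts, and renormalize their weights so they sum to 1 (as Mixtral does). Everything below is a per-token toy, not the batched library implementation.

```python
# Toy top-k expert routing in the style of MixtralSparseMoeBlock.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, top_k=2):
    probs = softmax(router_logits)
    # Indices of the top-k experts by routing probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the selected expert weights sum to 1.
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# One token, router logits over 8 experts (invented values).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
selection = route(logits, top_k=2)
print(selection)  # experts 1 and 4 carry this token
```

The token's output is then the weight-mixed sum of the selected experts' MLP outputs; the other six experts never see the token, which is what makes the compute sparse.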
For detailed MoE internals, see Mixture-of-Experts Architecture.
Sources: src/transformers/models/mixtral/modeling_mixtral.py:61-139, src/transformers/models/qwen2_moe/modeling_qwen2_moe.py:62-146, src/transformers/models/mixtral/modeling_mixtral.py:483-527
Encoder-decoder models (seq2seq) maintain separate encoder and decoder stacks connected through cross-attention. The key models are BART and T5, used for translation, summarization, and conditional generation.
| Model | Config Class | Output Class | Key Feature |
|---|---|---|---|
| BART | BartConfig | BartForConditionalGeneration | Denoising pretraining, learned positions |
| T5 | T5Config | T5ForConditionalGeneration | Relative position bias, no absolute positions |
| Pegasus | PegasusConfig | PegasusForConditionalGeneration | Gap sentence generation objective |
| Marian | MarianConfig | MarianMTModel | Neural machine translation |
| M2M-100 | M2M100Config | M2M100ForConditionalGeneration | Many-to-many multilingual translation |
For details, see Encoder-Decoder Models.
Sources: src/transformers/models/auto/modeling_auto.py:482-575
SSMs are non-attention alternatives to the transformer decoder stack, using structured state-space recurrences. They maintain conv_state and ssm_state instead of a KV cache.
| Model | Key Class | Architecture |
|---|---|---|
| Mamba | MambaForCausalLM | Pure SSM |
| Mamba2 | Mamba2ForCausalLM | SSM with SSD algorithm |
| Jamba | JambaForCausalLM | Hybrid: alternating Mamba + Transformer + MoE |
| FalconMamba | FalconMambaForCausalLM | Pure SSM |
| Zamba | ZambaForCausalLM | Hybrid: Mamba + shared Transformer layer |
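The core recurrence can be illustrated with a scalar toy: a linear state update plus an output projection. Real Mamba layers use per-channel, input-dependent (selective) parameters and a hardware-aware parallel scan, but the recurrent shape is the same, and it is this running state (rather than a growing KV cache) that makes generation memory constant in sequence length.

```python
# Minimal linear state-space recurrence with a scalar state:
#   h_t = a * h_{t-1} + b * x_t
#   y_t = c * h_t
# The parameters a, b, c are invented constants for illustration.
def ssm_scan(xs, a=0.5, b=1.0, c=2.0):
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # recurrent state update (replaces the KV cache)
        ys.append(c * h)    # output projection
    return ys

out = ssm_scan([1.0, 0.0, 0.0])
print(out)  # impulse response decays geometrically: [2.0, 1.0, 0.5]
```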
For SSM internals, see State Space Models.
Sources: src/transformers/models/auto/modeling_auto.py:265-270
Audio models use specialized encoder architectures for processing mel spectrograms or raw waveforms. Whisper is the primary seq2seq ASR model with a CNN-based encoder and a transformer decoder.
| Model | Key Class | Architecture |
|---|---|---|
| Whisper | WhisperForConditionalGeneration | CNN encoder + Transformer decoder |
| Wav2Vec2 | Wav2Vec2ForCTC | Convolutional feature encoder + Transformer |
| SpeechT5 | SpeechT5ForSpeechToText | Shared encoder-decoder for multiple speech tasks |
| Bark | BarkModel | Autoregressive audio generation |
For details, see Whisper and Automatic Speech Recognition.
Sources: src/transformers/models/auto/modeling_auto.py:460-465
Vision-language models (VLMs) combine a vision encoder (e.g., SiglipVisionModel, CLIPVisionModel) with a language decoder through a multimodal projector. Image or video tokens are inserted into the text token sequence before the LLM backbone processes them.
| Model | Config Class | Vision Backbone | LLM Backbone |
|---|---|---|---|
| LLaVA | LlavaConfig | CLIP | Llama/Vicuna |
| PaliGemma | PaliGemmaConfig | SigLIP | Gemma |
| Qwen2-VL | Qwen2VLConfig | Qwen2 ViT | Qwen2 |
| BLIP-2 | Blip2Config | ViT + Q-Former | OPT / T5 |
| Gemma3 | Gemma3Config | SigLIP | Gemma3 |
| MLLama | MllamaConfig | ViT (cross-attn) | Llama3 |
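The placeholder-expansion step can be sketched in pure Python. The `IMAGE_TOKEN` and `IMAGE_SLOT` ids below are invented for illustration; real models use a config-specific image token id, and the expanded positions receive projected vision-encoder embeddings rather than literal token ids.

```python
# Toy sketch of multimodal input merging: each <image> placeholder
# in the text sequence is expanded to num_patches slots before the
# LLM backbone processes the combined sequence.
IMAGE_TOKEN = -1   # hypothetical placeholder id in the tokenized text
IMAGE_SLOT = 100   # hypothetical id marking an image-embedding position

def expand_image_tokens(input_ids, num_patches):
    out = []
    for tok in input_ids:
        if tok == IMAGE_TOKEN:
            # One slot per image patch embedding from the vision encoder.
            out.extend([IMAGE_SLOT] * num_patches)
        else:
            out.append(tok)
    return out

# "describe <image> please" with a 4-patch image
ids = expand_image_tokens([10, IMAGE_TOKEN, 11], num_patches=4)
print(ids)  # [10, 100, 100, 100, 100, 11]
```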
For details, see Multimodal Vision-Language Models.
Sources: src/transformers/models/auto/modeling_auto.py:252-260, src/transformers/models/auto/image_processing_auto.py:63-200
All transformer models in the library share fundamental building blocks, though their specific implementations vary across model families.
The following components appear across most model architectures:
| Component | Purpose | Common Implementations |
|---|---|---|
| Token Embeddings | Convert input tokens to vectors | nn.Embedding (universal) |
| Position Encodings | Encode positional information | LlamaRotaryEmbedding, Qwen2RotaryEmbedding, nn.Embedding (learned) |
| Attention Mechanism | Compute attention between tokens | LlamaAttention, MistralAttention, GemmaAttention |
| Feed-Forward Network | Non-linear transformations | LlamaMLP, GPT2MLP, MixtralSparseMoeBlock |
| Normalization Layers | Stabilize training | LlamaRMSNorm, GemmaRMSNorm, nn.LayerNorm |
| Output Head | Task-specific predictions | lm_head (nn.Linear, maps hidden → vocab) |
Model Component Data Flow (LlamaDecoderLayer)
Sources: src/transformers/models/llama/modeling_llama.py:53-71, src/transformers/models/llama/modeling_llama.py:73-136, src/transformers/models/llama/modeling_llama.py:228-295, src/transformers/models/llama/modeling_llama.py:173-186, src/transformers/models/llama/modeling_llama.py:295-337
All models implement multi-head attention through a pluggable interface (ALL_ATTENTION_FUNCTIONS, defined in src/transformers/modeling_utils.py) that supports multiple optimization backends. The active backend is set via config._attn_implementation.
Attention Backend Selection
The eager_attention_forward function at src/transformers/models/llama/modeling_llama.py:199-221 is the baseline used by all models and handles grouped-query attention (GQA) via repeat_kv.
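A list-based sketch of the idea behind repeat_kv: each key/value head is repeated n_rep times so the (fewer) KV heads line up one-to-one with the query heads. The real function operates on a (batch, num_kv_heads, seq_len, head_dim) tensor and expands along the head axis, but the repetition pattern is the same.

```python
# Toy repeat_kv for grouped-query attention: with 8 query heads and
# 2 KV heads, each KV head must serve n_rep = 8 // 2 = 4 query heads.
def repeat_kv(kv_heads, n_rep):
    out = []
    for head in kv_heads:
        out.extend([head] * n_rep)  # each KV head repeated n_rep times
    return out

kv = ["kv0", "kv1"]
print(repeat_kv(kv, 4))
# ['kv0', 'kv0', 'kv0', 'kv0', 'kv1', 'kv1', 'kv1', 'kv1']
```

GQA thus trades a small quality cost for a KV cache that is num_heads / num_kv_heads times smaller.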
For detailed attention backend mechanics, see Attention Mechanisms.
Sources: src/transformers/models/llama/modeling_llama.py:199-295, src/transformers/models/mistral/modeling_mistral.py:96-200
Transformer models use different strategies to encode positional information. Most modern LLMs use Rotary Position Embeddings (RoPE), while older models use learned or sinusoidal embeddings.
Positional Encoding Comparison Table
| Model Family | Implementation Class | Type | Key Characteristic |
|---|---|---|---|
| Llama, Mistral, Gemma | LlamaRotaryEmbedding | RoPE | Standard, via ROPE_INIT_FUNCTIONS["default"] |
| Qwen2, Qwen2MoE | Qwen2RotaryEmbedding | RoPE | Dynamic scaling support |
| Cohere | CohereRotaryEmbedding | RoPE | repeat_interleave instead of cat for interleaved RoPE |
| GPT-NeoX | GPTNeoXRotaryEmbedding | RoPE | partial_rotary_factor for partial rotation |
| Phi, Phi3 | PhiRotaryEmbedding | RoPE | Partial rotation with configurable factor |
| GPT-2, OPT | wpe (nn.Embedding) | Learned | Traditional learned position embeddings |
RoPE scaling strategies (linear, dynamic, yarn, llama3) are centralized in src/transformers/modeling_rope_utils.py via the ROPE_INIT_FUNCTIONS registry. The apply_rotary_pos_emb function and rotate_half helper are shared across all RoPE models.
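The rotation itself reduces to a 2D rotation applied to each feature pair, with an angle proportional to the token position. This scalar-pair sketch shows that view; the library instead rotates the two halves of the head dimension at once via rotate_half, which is mathematically the same rotation per frequency pair.

```python
# Scalar-pair RoPE sketch: rotate the feature pair (x1, x2) by
# angle = position * theta, mirroring apply_rotary_pos_emb.
import math

def apply_rope(x1, x2, position, theta):
    angle = position * theta
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Standard 2D rotation; rotate_half supplies the (-x2, x1) term.
    return (x1 * cos_a - x2 * sin_a, x2 * cos_a + x1 * sin_a)

# Position 0 leaves the vector unchanged; the rotation (and hence the
# query-key phase difference) grows with relative distance.
print(apply_rope(1.0, 0.0, position=0, theta=0.01))  # (1.0, 0.0)
rotated = apply_rope(1.0, 0.0, position=5, theta=0.01)
```

Because only the angle depends on position, the q·k dot product after rotation depends on the *relative* offset between the two tokens, which is the property RoPE is designed for; scaling strategies like linear or yarn simply reshape how theta varies across pairs and positions.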
For detailed RoPE internals and scaling strategies, see Positional Embeddings.
Sources: src/transformers/models/llama/modeling_llama.py:73-136, src/transformers/models/cohere/modeling_cohere.py:69-131, src/transformers/models/llama/modeling_llama.py:138-168
| Normalization Type | Models | Class | Key Difference |
|---|---|---|---|
| RMSNorm (standard) | Llama, Mistral, Mixtral, Qwen2 | LlamaRMSNorm | No mean subtraction; weight * hidden * rsqrt(var + eps) |
| RMSNorm (Gemma) | Gemma, Gemma2 | GemmaRMSNorm | Weight initialized to zeros, then (1.0 + weight) * normalized |
| LayerNorm | GPT-2, OPT, GPT-J | nn.LayerNorm | Full mean/variance normalization with bias |
| LayerNorm (custom) | Cohere | CohereLayerNorm | Standard LayerNorm with model-specific initialization |
LlamaRMSNorm is defined at src/transformers/models/llama/modeling_llama.py:53-70 and decorated with @use_kernel_forward_from_hub("RMSNorm") to allow hardware-optimized kernel replacement.
GemmaRMSNorm differs from the Llama version by initializing its weight to zeros and adding 1.0 in the forward pass; see src/transformers/models/gemma/modeling_gemma.py:48-65.
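The two weight conventions can be compared in a few lines of pure Python (scalar lists instead of tensors). At their respective initializations (ones for Llama, zeros for Gemma) both reduce to plain RMS normalization, which is the point of Gemma's (1 + weight) formulation.

```python
# Pure-Python RMSNorm showing the Llama vs Gemma weight convention.
import math

def rms_norm(h, weight, eps=1e-6, gemma_style=False):
    # No mean subtraction: divide by root-mean-square of the activations.
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    normed = [x / rms for x in h]
    if gemma_style:
        # Gemma: weight initialized to zeros, applied as (1 + weight).
        return [(1.0 + w) * x for w, x in zip(weight, normed)]
    # Llama: weight initialized to ones, applied directly.
    return [w * x for w, x in zip(weight, normed)]

h = [3.0, 4.0]
llama_out = rms_norm(h, [1.0, 1.0])                  # Llama init
gemma_out = rms_norm(h, [0.0, 0.0], gemma_style=True)  # Gemma init
print(llama_out == gemma_out)  # True at initialization
```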
Models use three main MLP patterns:
| Pattern | Models | Projection Structure | Activation |
|---|---|---|---|
| Gated (SwiGLU) | Llama, Mistral, Gemma, Qwen2, Mixtral | gate_proj, up_proj, down_proj | ACT2FN[config.hidden_act] (typically silu) |
| Traditional | GPT-2, OPT, GPT-J | c_fc, c_proj (Conv1D) | gelu or gelu_new |
| Fused | Phi3, some Phi variants | gate_up_proj (single fused projection) | Chunked after projection |
The gated pattern (LlamaMLP) computes down_proj(act_fn(gate_proj(x)) * up_proj(x)); see src/transformers/models/llama/modeling_llama.py:171-184.
The fused pattern (Phi3MLP) reduces memory bandwidth by projecting to 2 * intermediate_size in a single operation and then splitting the result; see src/transformers/models/phi3/modeling_phi3.py:51-72.
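The gated computation can be shown on scalars, with the three projections replaced by simple multipliers (an illustration of the formula, not of the real matrix shapes):

```python
# Toy gated (SwiGLU) MLP: down_proj(silu(gate_proj(x)) * up_proj(x)),
# with scalar stand-ins for the projection matrices.
import math

def silu(x):
    # silu(x) = x * sigmoid(x), the usual config.hidden_act for this family
    return x / (1.0 + math.exp(-x))

def gated_mlp(x, gate_w=1.0, up_w=1.0, down_w=1.0):
    gate = silu(gate_w * x)   # gating branch through the activation
    up = up_w * x             # linear branch
    return down_w * (gate * up)

print(gated_mlp(2.0))  # silu(2) * 2 ≈ 3.523
```

The gate branch lets the network suppress the linear branch elementwise, which is why this pattern replaced the plain two-projection MLP in most recent LLMs.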
Sources: src/transformers/models/llama/modeling_llama.py:53-70, src/transformers/models/gemma/modeling_gemma.py:48-65, src/transformers/models/llama/modeling_llama.py:171-184, src/transformers/models/phi3/modeling_phi3.py:51-72
Model implementations integrate with the generation system through the GenerationMixin class:
- Models inherit from GenerationMixin, which provides the generate() API
- Forward passes return CausalLMOutputWithPast containing logits and cached states
- Past key/value states are held in a DynamicCache for efficient generation

Models support various training optimizations:

- Gradient checkpointing via the GradientCheckpointingLayer base class
- Tensor- and pipeline-parallelism plans (_tp_plan, _pp_plan)

Each model family includes configuration classes that define architectural parameters:

- Dimensions: hidden_size, num_attention_heads, intermediate_size
- Depth and layout: num_hidden_layers, layer_types (for hybrid models)
- Regularization: attention_dropout, hidden_dropout_prob
- Runtime behavior: use_cache, gradient_checkpointing

Sources: src/transformers/models/llama/modeling_llama.py:413-425, src/transformers/models/llama/modeling_llama.py:268-278, src/transformers/models/mixtral/modeling_mixtral.py:566-581
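The GenerationMixin-driven decode loop can be illustrated with a toy greedy sketch. The `next_token_logits` "model" below is a fake deterministic stand-in, and real generation reuses cached key/values instead of recomputing the whole sequence each step.

```python
# Toy greedy decoding in the spirit of GenerationMixin.generate():
# repeatedly score the growing sequence and append the argmax token.
def next_token_logits(ids):
    # Hypothetical model: always prefers (last_token + 1) mod vocab_size.
    vocab_size = 5
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if t == target else 0.0 for t in range(vocab_size)]

def greedy_generate(prompt_ids, max_new_tokens):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        # Greedy selection; real generate() also supports sampling,
        # beam search, and stopping criteria.
        ids.append(max(range(len(logits)), key=logits.__getitem__))
    return ids

print(greedy_generate([0], 4))  # [0, 1, 2, 3, 4]
```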