This page provides an overview of the major model architecture families supported by the transformers library and the common architectural patterns shared across implementations. The library supports 200+ model architectures, organized under src/transformers/models/, each implementing specific architectural innovations while sharing fundamental building blocks provided by PreTrainedModel in src/transformers/modeling_utils.py.
| Family | Representative Models | Child Page |
|---|---|---|
| Decoder-Only LMs | LLaMA, Mistral, Gemma, Qwen2, Phi, Falcon, GPT-2, GPT-NeoX, Bloom | 5.1 |
| Attention Mechanisms | Eager, Flash Attention 2, SDPA, FlexAttention backends | 5.2 |
| Positional Embeddings | RoPE variants (linear, dynamic, yarn, llama3), sliding window | 5.3 |
| Mixture-of-Experts | Mixtral, Qwen2MoE, Jamba, GraniteHybrid | 5.4 |
| ASR / Speech | Whisper, Bark, SpeechT5, Wav2Vec2 | 5.5 |
| State Space Models | Mamba, Mamba2, Jamba, Bamba, FalconMamba, Zamba | 5.6 |
| Encoder-Decoder | BART, mBART, T5, Pegasus, Marian, M2M-100 | 5.7 |
| Multimodal VLMs | LLaVA, PaliGemma, Qwen2-VL, BLIP-2, Gemma3, MLLama | 5.8 |
All model families share common building blocks (embeddings, attention mechanisms, feed-forward networks, normalization layers) while varying in their specific implementations and combinations.
All model implementations in transformers follow a consistent four-level hierarchy that provides both flexibility and standardization across different architectures.
Sources: src/transformers/models/llama/modeling_llama.py:315-331, src/transformers/models/mistral/modeling_mistral.py:252-268, src/transformers/models/gemma/modeling_gemma.py:308-324, src/transformers/modeling_layers.py:31-34
Modern LLM architectures like Llama consist of hierarchical components that combine into complete models:
Component Hierarchy Diagram
Sources: src/transformers/models/llama/modeling_llama.py:363-378, src/transformers/models/llama/modeling_llama.py:298-340, src/transformers/models/llama/modeling_llama.py:228-295, src/transformers/models/llama/modeling_llama.py:173-186
The transformers library contains implementations across several model families, each following specific architectural patterns while sharing common infrastructure.
Most modern language models follow the decoder-only transformer architecture. The LlamaForCausalLM class in src/transformers/models/llama/modeling_llama.py is the canonical reference implementation for this family. All variants share the same embedding → stacked decoder layers → norm → lm_head structure.
| Model | ForCausalLM Class | Defining Feature |
|---|---|---|
| LLaMA | LlamaForCausalLM | RoPE, RMSNorm, gated SwiGLU MLP |
| Mistral | MistralForCausalLM | Sliding window attention, GQA |
| Gemma | GemmaForCausalLM | RMSNorm variant with weight + 1.0 |
| Qwen2 | Qwen2ForCausalLM | QKV bias, sliding window |
| Phi | PhiForCausalLM | Partial RoPE, LayerNorm |
| GPT-2 | GPT2LMHeadModel | Learned position embeddings, Conv1D |
| GPT-NeoX | GPTNeoXForCausalLM | Parallel attention+MLP, partial RoPE |
| Bloom | BloomForCausalLM | ALiBi positional encoding |
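The shared embedding → stacked decoder layers → norm → lm_head data flow can be sketched in plain Python. This is a toy illustration, not library code: the "decoder layer" is a stand-in function rather than real attention/MLP math, and all sizes are invented.

```python
# Toy decoder-only forward pass showing the shared
# embedding -> stacked layers -> norm -> lm_head structure.
import math

VOCAB = 8
HIDDEN = 4

# Token embedding table: one hidden vector per vocabulary id.
embedding = [[0.1 * (t + d) for d in range(HIDDEN)] for t in range(VOCAB)]

def decoder_layer(h):
    # Stand-in for an attention + MLP block with residual connection.
    return [x + 0.5 * x for x in h]

def rms_norm(h, eps=1e-6):
    scale = 1.0 / math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x * scale for x in h]

def lm_head(h):
    # Project the hidden state back to vocabulary logits (weights tied
    # to the embedding table here, as many LLMs optionally do).
    return [sum(hx * ex for hx, ex in zip(h, embedding[t])) for t in range(VOCAB)]

def forward(token_id, num_layers=2):
    h = embedding[token_id]          # embed
    for _ in range(num_layers):      # stacked decoder layers
        h = decoder_layer(h)
    h = rms_norm(h)                  # final norm
    return lm_head(h)                # output head

logits = forward(3)
print(len(logits))  # one logit per vocabulary entry
```

Every family in the table above follows this skeleton; the variants differ in what happens inside `decoder_layer` (attention flavor, MLP shape, normalization placement).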
For detailed architecture, see Decoder-Only Language Models.
Sources: src/transformers/models/llama/modeling_llama.py:359-560, src/transformers/models/mistral/modeling_mistral.py:1-450, src/transformers/models/gemma/modeling_gemma.py:1-460, src/transformers/models/qwen2/modeling_qwen2.py:1-400, src/transformers/models/auto/modeling_auto.py:577-650
MoE models use sparse expert routing to scale model capacity efficiently. Each token is processed by only a subset of expert networks. The MixtralSparseMoeBlock class in src/transformers/models/mixtral/modeling_mixtral.py is the canonical implementation.
MoE Routing Data Flow
| Model | MoE Class | Num Experts | Top-K |
|---|---|---|---|
| Mixtral | MixtralSparseMoeBlock | 8 | 2 |
| Qwen2MoE | Qwen2MoeSparseMoeBlock | 64 | 4 |
| Jamba | JambaSparseMoeBlock | 16 | 2 |
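The routing step can be sketched in pure Python, assuming the softmax-then-top-k scheme described above: softmax over router logits, keep the top-k experts, and renormalize their weights so they sum to 1 (as Mixtral does). Everything below is a per-token toy, not the batched library implementation.

```python
# Toy top-k expert routing in the style of MixtralSparseMoeBlock.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, top_k=2):
    probs = softmax(router_logits)
    # Indices of the top-k experts by routing probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the selected expert weights sum to 1.
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# One token, router logits over 8 experts (invented values).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
selection = route(logits, top_k=2)
print(selection)  # experts 1 and 4 carry this token
```

The token's output is then the weight-mixed sum of the selected experts' MLP outputs; the other six experts never see the token, which is what makes the compute sparse.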
For detailed MoE internals, see Mixture-of-Experts Architecture.
Sources: src/transformers/models/mixtral/modeling_mixtral.py:61-139, src/transformers/models/qwen2_moe/modeling_qwen2_moe.py:62-146, src/transformers/models/mixtral/modeling_mixtral.py:483-527
Encoder-decoder models (seq2seq) maintain separate encoder and decoder stacks connected through cross-attention. The key models are BART and T5, used for translation, summarization, and conditional generation.
| Model | Config Class | Output Class | Key Feature |
|---|---|---|---|
| BART | BartConfig | BartForConditionalGeneration | Denoising pretraining, learned positions |
| T5 | T5Config | T5ForConditionalGeneration | Relative position bias, no absolute positions |
| Pegasus | PegasusConfig | PegasusForConditionalGeneration | Gap sentence generation objective |
| Marian | MarianConfig | MarianMTModel | Neural machine translation |
| M2M-100 | M2M100Config | M2M100ForConditionalGeneration | Many-to-many multilingual translation |
For details, see Encoder-Decoder Models.
Sources: src/transformers/models/auto/modeling_auto.py:482-575
SSMs are non-attention alternatives to the transformer decoder stack, using structured state-space recurrences. They maintain conv_state and ssm_state instead of a KV cache.
| Model | Key Class | Architecture |
|---|---|---|
| Mamba | MambaForCausalLM | Pure SSM |
| Mamba2 | Mamba2ForCausalLM | SSM with SSD algorithm |
| Jamba | JambaForCausalLM | Hybrid: alternating Mamba + Transformer + MoE |
| FalconMamba | FalconMambaForCausalLM | Pure SSM |
| Zamba | ZambaForCausalLM | Hybrid: Mamba + shared Transformer layer |
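The core recurrence can be illustrated with a scalar toy: a linear state update plus an output projection. Real Mamba layers use per-channel, input-dependent (selective) parameters and a hardware-aware parallel scan, but the recurrent shape is the same, and it is this running state (rather than a growing KV cache) that makes generation memory constant in sequence length.

```python
# Minimal linear state-space recurrence with a scalar state:
#   h_t = a * h_{t-1} + b * x_t
#   y_t = c * h_t
# The parameters a, b, c are invented constants for illustration.
def ssm_scan(xs, a=0.5, b=1.0, c=2.0):
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # recurrent state update (replaces the KV cache)
        ys.append(c * h)    # output projection
    return ys

out = ssm_scan([1.0, 0.0, 0.0])
print(out)  # impulse response decays geometrically: [2.0, 1.0, 0.5]
```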
For SSM internals, see State Space Models.
Sources: src/transformers/models/auto/modeling_auto.py:265-270
Audio models use specialized encoder architectures for processing mel spectrograms or raw waveforms. Whisper is the primary seq2seq ASR model with a CNN-based encoder and a transformer decoder.
| Model | Key Class | Architecture |
|---|---|---|
| Whisper | WhisperForConditionalGeneration | CNN encoder + Transformer decoder |
| Wav2Vec2 | Wav2Vec2ForCTC | Convolutional feature encoder + Transformer |
| SpeechT5 | SpeechT5ForSpeechToText | Shared encoder-decoder for multiple speech tasks |
| Bark | BarkModel | Autoregressive audio generation |
For details, see Whisper and Automatic Speech Recognition.
Sources: src/transformers/models/auto/modeling_auto.py:460-465
Vision-language models (VLMs) combine a vision encoder (e.g., SiglipVisionModel, CLIPVisionModel) with a language decoder through a multimodal projector. Image or video tokens are inserted into the text token sequence before the LLM backbone processes them.
| Model | Config Class | Vision Backbone | LLM Backbone |
|---|---|---|---|
| LLaVA | LlavaConfig | CLIP | Llama/Vicuna |
| PaliGemma | PaliGemmaConfig | SigLIP | Gemma |
| Qwen2-VL | Qwen2VLConfig | Qwen2 ViT | Qwen2 |
| BLIP-2 | Blip2Config | ViT + Q-Former | OPT / T5 |
| Gemma3 | Gemma3Config | SigLIP | Gemma3 |
| MLLama | MllamaConfig | ViT (cross-attn) | Llama3 |
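The placeholder-expansion step can be sketched in pure Python. The `IMAGE_TOKEN` and `IMAGE_SLOT` ids below are invented for illustration; real models use a config-specific image token id, and the expanded positions receive projected vision-encoder embeddings rather than literal token ids.

```python
# Toy sketch of multimodal input merging: each <image> placeholder
# in the text sequence is expanded to num_patches slots before the
# LLM backbone processes the combined sequence.
IMAGE_TOKEN = -1   # hypothetical placeholder id in the tokenized text
IMAGE_SLOT = 100   # hypothetical id marking an image-embedding position

def expand_image_tokens(input_ids, num_patches):
    out = []
    for tok in input_ids:
        if tok == IMAGE_TOKEN:
            # One slot per image patch embedding from the vision encoder.
            out.extend([IMAGE_SLOT] * num_patches)
        else:
            out.append(tok)
    return out

# "describe <image> please" with a 4-patch image
ids = expand_image_tokens([10, IMAGE_TOKEN, 11], num_patches=4)
print(ids)  # [10, 100, 100, 100, 100, 11]
```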
For details, see Multimodal Vision-Language Models.
Sources: src/transformers/models/auto/modeling_auto.py:252-260, src/transformers/models/auto/image_processing_auto.py:63-200
All transformer models in the library share fundamental building blocks, though their specific implementations vary across model families.
The following components appear across most model architectures:
| Component | Purpose | Common Implementations |
|---|---|---|
| Token Embeddings | Convert input tokens to vectors | nn.Embedding (universal) |
| Position Encodings | Encode positional information | LlamaRotaryEmbedding, Qwen2RotaryEmbedding, nn.Embedding (learned) |
| Attention Mechanism | Compute attention between tokens | LlamaAttention, MistralAttention, GemmaAttention |
| Feed-Forward Network | Non-linear transformations | LlamaMLP, GPT2MLP, MixtralSparseMoeBlock |
| Normalization Layers | Stabilize training | LlamaRMSNorm, GemmaRMSNorm, nn.LayerNorm |
| Output Head | Task-specific predictions | lm_head (nn.Linear, maps hidden → vocab) |
Model Component Data Flow (LlamaDecoderLayer)
Sources: src/transformers/models/llama/modeling_llama.py:53-71, src/transformers/models/llama/modeling_llama.py:73-136, src/transformers/models/llama/modeling_llama.py:228-295, src/transformers/models/llama/modeling_llama.py:173-186, src/transformers/models/llama/modeling_llama.py:295-337
All models implement multi-head attention through a pluggable interface (ALL_ATTENTION_FUNCTIONS, defined in src/transformers/modeling_utils.py) that supports multiple optimization backends. The active backend is set via config._attn_implementation.
Attention Backend Selection
The eager_attention_forward function at src/transformers/models/llama/modeling_llama.py:199-221 is the baseline used by all models and handles grouped-query attention (GQA) via repeat_kv.
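A list-based sketch of the idea behind repeat_kv: each key/value head is repeated n_rep times so the (fewer) KV heads line up one-to-one with the query heads. The real function operates on a (batch, num_kv_heads, seq_len, head_dim) tensor and expands along the head axis, but the repetition pattern is the same.

```python
# Toy repeat_kv for grouped-query attention: with 8 query heads and
# 2 KV heads, each KV head must serve n_rep = 8 // 2 = 4 query heads.
def repeat_kv(kv_heads, n_rep):
    out = []
    for head in kv_heads:
        out.extend([head] * n_rep)  # each KV head repeated n_rep times
    return out

kv = ["kv0", "kv1"]
print(repeat_kv(kv, 4))
# ['kv0', 'kv0', 'kv0', 'kv0', 'kv1', 'kv1', 'kv1', 'kv1']
```

GQA thus trades a small quality cost for a KV cache that is num_heads / num_kv_heads times smaller.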
For detailed attention backend mechanics, see Attention Mechanisms.
Sources: src/transformers/models/llama/modeling_llama.py:199-295, src/transformers/models/mistral/modeling_mistral.py:96-200
Transformer models use different strategies to encode positional information. Most modern LLMs use Rotary Position Embeddings (RoPE), while older models use learned or sinusoidal embeddings.
Positional Encoding Comparison Table
| Model Family | Implementation Class | Type | Key Characteristic |
|---|---|---|---|
| Llama, Mistral, Gemma | LlamaRotaryEmbedding | RoPE | Standard, via ROPE_INIT_FUNCTIONS["default"] |
| Qwen2, Qwen2MoE | Qwen2RotaryEmbedding | RoPE | Dynamic scaling support |
| Cohere | CohereRotaryEmbedding | RoPE | repeat_interleave instead of cat for interleaved RoPE |
| GPT-NeoX | GPTNeoXRotaryEmbedding | RoPE | partial_rotary_factor for partial rotation |
| Phi, Phi3 | PhiRotaryEmbedding | RoPE | Partial rotation with configurable factor |
| GPT-2, OPT | wpe (nn.Embedding) | Learned | Traditional learned position embeddings |
RoPE scaling strategies (linear, dynamic, yarn, llama3) are centralized in src/transformers/modeling_rope_utils.py via the ROPE_INIT_FUNCTIONS registry. The apply_rotary_pos_emb function and rotate_half helper are shared across all RoPE models.
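The rotation itself reduces to a 2D rotation applied to each feature pair, with an angle proportional to the token position. This scalar-pair sketch shows that view; the library instead rotates the two halves of the head dimension at once via rotate_half, which is mathematically the same rotation per frequency pair.

```python
# Scalar-pair RoPE sketch: rotate the feature pair (x1, x2) by
# angle = position * theta, mirroring apply_rotary_pos_emb.
import math

def apply_rope(x1, x2, position, theta):
    angle = position * theta
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Standard 2D rotation; rotate_half supplies the (-x2, x1) term.
    return (x1 * cos_a - x2 * sin_a, x2 * cos_a + x1 * sin_a)

# Position 0 leaves the vector unchanged; the rotation (and hence the
# query-key phase difference) grows with relative distance.
print(apply_rope(1.0, 0.0, position=0, theta=0.01))  # (1.0, 0.0)
rotated = apply_rope(1.0, 0.0, position=5, theta=0.01)
```

Because only the angle depends on position, the q·k dot product after rotation depends on the *relative* offset between the two tokens, which is the property RoPE is designed for; scaling strategies like linear or yarn simply reshape how theta varies across pairs and positions.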
For detailed RoPE internals and scaling strategies, see Positional Embeddings.
Sources: src/transformers/models/llama/modeling_llama.py:73-136, src/transformers/models/cohere/modeling_cohere.py:69-131, src/transformers/models/llama/modeling_llama.py:138-168
| Normalization Type | Models | Class | Key Difference |
|---|---|---|---|
| RMSNorm (standard) | Llama, Mistral, Mixtral, Qwen2 | LlamaRMSNorm | No mean subtraction; weight * hidden * rsqrt(var + eps) |
| RMSNorm (Gemma) | Gemma, Gemma2 | GemmaRMSNorm | Weight initialized to zeros, then (1.0 + weight) * normalized |
| LayerNorm | GPT-2, OPT, GPT-J | nn.LayerNorm | Full mean/variance normalization with bias |
| LayerNorm (custom) | Cohere | CohereLayerNorm | Standard LayerNorm with model-specific initialization |
LlamaRMSNorm is defined at src/transformers/models/llama/modeling_llama.py:53-70 and decorated with @use_kernel_forward_from_hub("RMSNorm") to allow hardware-optimized kernel replacement.
GemmaRMSNorm differs from the Llama version by initializing its weight to zeros and adding 1.0 in the forward pass; see src/transformers/models/gemma/modeling_gemma.py:48-65.
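The two weight conventions can be compared in a few lines of pure Python (scalar lists instead of tensors). At their respective initializations (ones for Llama, zeros for Gemma) both reduce to plain RMS normalization, which is the point of Gemma's (1 + weight) formulation.

```python
# Pure-Python RMSNorm showing the Llama vs Gemma weight convention.
import math

def rms_norm(h, weight, eps=1e-6, gemma_style=False):
    # No mean subtraction: divide by root-mean-square of the activations.
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    normed = [x / rms for x in h]
    if gemma_style:
        # Gemma: weight initialized to zeros, applied as (1 + weight).
        return [(1.0 + w) * x for w, x in zip(weight, normed)]
    # Llama: weight initialized to ones, applied directly.
    return [w * x for w, x in zip(weight, normed)]

h = [3.0, 4.0]
llama_out = rms_norm(h, [1.0, 1.0])                  # Llama init
gemma_out = rms_norm(h, [0.0, 0.0], gemma_style=True)  # Gemma init
print(llama_out == gemma_out)  # True at initialization
```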
Models use three main MLP patterns:
| Pattern | Models | Projection Structure | Activation |
|---|---|---|---|
| Gated (SwiGLU) | Llama, Mistral, Gemma, Qwen2, Mixtral | gate_proj, up_proj, down_proj | ACT2FN[config.hidden_act] (typically silu) |
| Traditional | GPT-2, OPT, GPT-J | c_fc, c_proj (Conv1D) | gelu or gelu_new |
| Fused | Phi3, some Phi variants | gate_up_proj (single fused projection) | Chunked after projection |
The gated pattern (LlamaMLP) computes down_proj(act_fn(gate_proj(x)) * up_proj(x)); see src/transformers/models/llama/modeling_llama.py:171-184.
The fused pattern (Phi3MLP) reduces memory bandwidth by projecting to 2 * intermediate_size in a single operation and then splitting the result; see src/transformers/models/phi3/modeling_phi3.py:51-72.
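The gated computation can be shown on scalars, with the three projections replaced by simple multipliers (an illustration of the formula, not of the real matrix shapes):

```python
# Toy gated (SwiGLU) MLP: down_proj(silu(gate_proj(x)) * up_proj(x)),
# with scalar stand-ins for the projection matrices.
import math

def silu(x):
    # silu(x) = x * sigmoid(x), the usual config.hidden_act for this family
    return x / (1.0 + math.exp(-x))

def gated_mlp(x, gate_w=1.0, up_w=1.0, down_w=1.0):
    gate = silu(gate_w * x)   # gating branch through the activation
    up = up_w * x             # linear branch
    return down_w * (gate * up)

print(gated_mlp(2.0))  # silu(2) * 2 ≈ 3.523
```

The gate branch lets the network suppress the linear branch elementwise, which is why this pattern replaced the plain two-projection MLP in most recent LLMs.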
Sources: src/transformers/models/llama/modeling_llama.py:53-70, src/transformers/models/gemma/modeling_gemma.py:48-65, src/transformers/models/llama/modeling_llama.py:171-184, src/transformers/models/phi3/modeling_phi3.py:51-72
Model implementations integrate with the generation system through the GenerationMixin class:
- Models inherit from GenerationMixin, which provides the generate() API
- Forward passes return CausalLMOutputWithPast containing logits and cached states
- Past key/value states are held in a DynamicCache for efficient generation

Models support various training optimizations:

- Gradient checkpointing via the GradientCheckpointingLayer base class
- Tensor- and pipeline-parallelism plans (_tp_plan, _pp_plan)

Each model family includes configuration classes that define architectural parameters:

- Dimensions: hidden_size, num_attention_heads, intermediate_size
- Depth and layout: num_hidden_layers, layer_types (for hybrid models)
- Regularization: attention_dropout, hidden_dropout_prob
- Runtime behavior: use_cache, gradient_checkpointing

Sources: src/transformers/models/llama/modeling_llama.py:413-425, src/transformers/models/llama/modeling_llama.py:268-278, src/transformers/models/mixtral/modeling_mixtral.py:566-581
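The GenerationMixin-driven decode loop can be illustrated with a toy greedy sketch. The `next_token_logits` "model" below is a fake deterministic stand-in, and real generation reuses cached key/values instead of recomputing the whole sequence each step.

```python
# Toy greedy decoding in the spirit of GenerationMixin.generate():
# repeatedly score the growing sequence and append the argmax token.
def next_token_logits(ids):
    # Hypothetical model: always prefers (last_token + 1) mod vocab_size.
    vocab_size = 5
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if t == target else 0.0 for t in range(vocab_size)]

def greedy_generate(prompt_ids, max_new_tokens):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        # Greedy selection; real generate() also supports sampling,
        # beam search, and stopping criteria.
        ids.append(max(range(len(logits)), key=logits.__getitem__))
    return ids

print(greedy_generate([0], 4))  # [0, 1, 2, 3, 4]
```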