This document covers the positional embedding systems used throughout the transformers library to encode sequence position information into transformer models. The primary focus is on Rotary Position Embeddings (RoPE) and its variants, which are the dominant positional encoding mechanism used in modern decoder-only language models.
For information about attention mechanisms that consume these embeddings, see Attention Mechanisms. For details on model architectures that use these embeddings, see Decoder-Only Language Models and Multimodal Architectures.
Positional embeddings enable transformer models to capture the sequential nature of input data. Unlike recurrent architectures, transformers process all tokens in parallel and require explicit position information. This library implements several positional encoding schemes, described in the sections below.
The positional embedding system integrates with the model configuration, attention layers, and caching mechanisms to provide efficient position-aware representations.
Sources: src/transformers/models/llama/modeling_llama.py72-134 src/transformers/modeling_rope_utils.py1-900 src/transformers/models/mistral/modeling_mistral.py380-405
RoPE Architecture Flow: Configuration determines RoPE type and parameters, initialization computes inverse frequencies, forward pass generates cos/sin embeddings, and application rotates query/key states.
The core RoPE implementation follows a consistent pattern across all decoder-only models. Each model has a RotaryEmbedding class (e.g., LlamaRotaryEmbedding, MistralRotaryEmbedding) that inherits common behavior but can be customized per architecture.
Sources: src/transformers/models/llama/modeling_llama.py72-134 src/transformers/modeling_rope_utils.py33-230
The base pattern for all RoPE implementations:
RoPE Class Hierarchy: All model-specific RoPE classes follow the same interface, with customization through rope_init_fn functions.
The __init__ method determines which initialization function to use based on rope_type from the config, then registers inv_freq as a non-persistent buffer:
Key Components:
- inv_freq: the inverse frequency tensor 1.0 / (base ** (torch.arange(0, dim, 2) / dim))
- rope_type: specifies the RoPE variant (default, linear, dynamic, etc.)
- attention_scaling: post-processing scaling factor applied to cos/sin (default 1.0 for most types)
- @dynamic_rope_update: decorator that enables dynamic frequency recomputation for certain RoPE types

Sources: src/transformers/models/llama/modeling_llama.py72-134 src/transformers/models/mistral/modeling_mistral.py269-331 src/transformers/modeling_rope_utils.py232-400
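As a minimal, torch-free sketch of the inv_freq formula above (the function name compute_default_inv_freq is illustrative, not the library's; the actual implementation builds a torch tensor):

```python
def compute_default_inv_freq(dim, base=10000.0):
    """Inverse frequencies for standard RoPE: 1 / base^(2i/dim) for i = 0..dim/2 - 1."""
    return [1.0 / (base ** (2 * i / dim)) for i in range(dim // 2)]

inv_freq = compute_default_inv_freq(dim=8, base=10000.0)
# The first channel pair always rotates at frequency 1.0; later pairs rotate ever more slowly.
```

Each of the dim/2 entries drives one channel pair, which is why the buffer has half the head dimension.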
The forward pass computes rotary embeddings for the current sequence positions:
Steps:
1. Expand inv_freq to match the batch size: [batch, head_dim/2, 1]
2. Reshape position_ids to: [batch, 1, seq_len]
3. freqs = inv_freq @ position_ids → [batch, seq_len, head_dim/2]
4. emb = cat([freqs, freqs], dim=-1) → [batch, seq_len, head_dim]
5. cos = emb.cos() * attention_scaling, sin = emb.sin() * attention_scaling

The computation is forced to float32 precision using maybe_autocast(enabled=False) to ensure numerical stability.
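The same steps can be sketched in plain Python for a single (unbatched) sequence; rope_cos_sin is a hypothetical name, and the library version operates on batched torch tensors:

```python
import math

def rope_cos_sin(inv_freq, position_ids, attention_scaling=1.0):
    """For each position, compute angles pos * inv_freq, duplicate them to span the
    full head_dim (the cat([freqs, freqs]) step), and return (cos, sin) tables."""
    cos, sin = [], []
    for pos in position_ids:
        freqs = [pos * f for f in inv_freq]   # [head_dim/2] angles
        emb = freqs + freqs                   # cat([freqs, freqs], dim=-1)
        cos.append([math.cos(a) * attention_scaling for a in emb])
        sin.append([math.sin(a) * attention_scaling for a in emb])
    return cos, sin

cos, sin = rope_cos_sin([1.0, 0.01], position_ids=[0, 1, 2])
# At position 0 every angle is zero, so cos is all ones and sin all zeros.
```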
Sources: src/transformers/models/llama/modeling_llama.py121-134 src/transformers/models/gemma/modeling_gemma.py133-146
RoPE Application Flow: The rotate_half helper swaps and negates tensor halves, enabling the complex number rotation represented by q*cos + rotate_half(q)*sin.
The rotate_half function implements the core mathematical operation:
- x1 = x[..., :dim//2], x2 = x[..., dim//2:]
- Returns cat([-x2, x1], dim=-1)

This operation, combined with cos/sin multiplication, represents a rotation in the complex plane. The formula q_embed = (q * cos) + (rotate_half(q) * sin) applies RoPE to queries and keys.
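A torch-free sketch of this rotation, checking it against an explicit complex-number rotation (list-based stand-ins for the tensor ops; names are illustrative):

```python
import math

def rotate_half(x):
    """Split-halves rotation helper: [x1, x2] -> [-x2, x1]."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + x1

def apply_rope(q, cos, sin):
    """q_embed = q * cos + rotate_half(q) * sin, elementwise."""
    return [qi * c + ri * s for qi, ri, c, s in zip(q, rotate_half(q), cos, sin)]

# In the split-halves layout, channel i pairs with channel i + dim/2: rotating the
# pair (q[0], q[2]) by angle t equals multiplying q[0] + i*q[2] by e^{it}.
t = 0.3
q = [1.0, 0.0, 2.0, 0.0]
q_rot = apply_rope(q, [math.cos(t)] * 4, [math.sin(t)] * 4)
```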
Variant — Cohere Interleaved Rotation:
CohereRotaryEmbedding uses an interleaved rotation instead of the split-halves approach. Its rotate_half takes every other element (x[..., ::2], x[..., 1::2]) and stacks them, and the embedding computation uses torch.repeat_interleave(freqs, 2, dim=-1) instead of cat([freqs, freqs]). The mathematical result is equivalent, but the channel ordering differs.
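The two angle layouts can be compared with a small list-based sketch (illustrative only): both contain the same angles, but split-halves pairs channel i with i + dim/2, while interleaving pairs adjacent channels 2i and 2i + 1.

```python
freqs = [0.1, 0.2, 0.3]  # head_dim = 6, so three frequencies

# Split-halves layout: cat([freqs, freqs]) duplicates the block
split_halves = freqs + freqs

# Interleaved layout: repeat_interleave(freqs, 2) duplicates each element in place
interleaved = [f for f in freqs for _ in range(2)]
```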
Variant — Partial Rotary Factor: Some architectures (e.g., Phi) apply RoPE only to a fraction of the head dimension. The partial_rotary_factor in rope_parameters (default 1.0) controls how many channels receive rotary encoding: dim = int(head_dim * partial_rotary_factor).
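A minimal sketch of the partial split (split_partial_rotary is a hypothetical helper; in the models the rotated slice is re-concatenated with the untouched remainder after RoPE is applied):

```python
def split_partial_rotary(x, partial_rotary_factor):
    """Split a head vector into the channels that receive RoPE and the pass-through rest."""
    rotary_dim = int(len(x) * partial_rotary_factor)
    return x[:rotary_dim], x[rotary_dim:]

rot, keep = split_partial_rotary(list(range(10)), partial_rotary_factor=0.4)
# rot: the first 4 channels, which get rotated; keep: the last 6, concatenated back unchanged
```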
The apply_rotary_pos_emb function is decorated with @use_kernel_func_from_hub("rotary_pos_emb") to enable optimized kernel implementations from external sources. See page 5.2 for more on kernel dispatch.
Sources: src/transformers/models/llama/modeling_llama.py137-167 src/transformers/models/mistral/modeling_mistral.py52-82 src/transformers/models/cohere/modeling_cohere.py187-192 src/transformers/models/phi/modeling_phi.py74-85
The library supports multiple RoPE scaling strategies to handle sequences longer than the original pre-training context length:
| RoPE Type | Description | Key Parameters | Use Case |
|---|---|---|---|
default | Standard RoPE without scaling | rope_theta | Original context length |
linear | Linearly scales position indices | factor | Simple extension |
dynamic | Dynamic NTK-aware interpolation | factor | Adaptive scaling |
yarn | YaRN (Yet another RoPE extensioN) | factor, attention_factor, beta_fast, beta_slow | High-quality long context |
longrope | Context-length adaptive scaling | short_factor, long_factor, original_max_position_embeddings | Efficient long sequences |
llama3 | Llama 3 scaling variant | factor, low_freq_factor, high_freq_factor, original_max_position_embeddings | Llama 3 models |
Sources: src/transformers/modeling_rope_utils.py232-800
RoPE parameters are stored in config.rope_parameters, a dictionary containing:
- rope_type: string key for the scaling strategy
- rope_theta: base frequency (default 10000.0)

RoPE Initialization Flow: The rope_type from config determines which initialization function is called from ROPE_INIT_FUNCTIONS to compute inv_freq.
Each initialization function follows the signature:
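A sketch of the common contract, assuming each entry in ROPE_INIT_FUNCTIONS returns a pair (inverse frequencies, attention scaling); this toy version reads a plain dict and returns a list, whereas the library functions take a model config plus device and return a torch tensor:

```python
def compute_default_rope_parameters(config, seq_len=None):
    """Illustrative 'default' initializer: unscaled inverse frequencies
    and an attention_scaling of 1.0."""
    base = config["rope_theta"]
    dim = config["head_dim"]
    inv_freq = [1.0 / (base ** (2 * i / dim)) for i in range(dim // 2)]
    attention_scaling = 1.0  # no post-scaling for the "default" type
    return inv_freq, attention_scaling

inv_freq, scaling = compute_default_rope_parameters({"rope_theta": 10000.0, "head_dim": 8})
```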
Sources: src/transformers/models/llama/modeling_llama.py75-86 src/transformers/modeling_rope_utils.py232-270
The @dynamic_rope_update decorator enables RoPE types like "longrope" and "dynamic" to recompute frequencies during forward passes when the sequence length exceeds thresholds:
Mechanism:
forward method of RotaryEmbedding classesseq_len exceeds cached thresholdslongrope_frequency_update)inv_freq and updates the registered bufferLongrope Example:
- Uses short_factor for sequences ≤ original_max_position_embeddings
- Uses long_factor for sequences > original_max_position_embeddings

Sources: src/transformers/modeling_rope_utils.py33-230
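For the "dynamic" type, the recomputation adjusts the effective base as the sequence grows. A hedged sketch of the dynamic NTK base formula (dynamic_ntk_base is an illustrative name; the constants mirror the commonly used NTK-aware rule, not a verbatim copy of the library code):

```python
def dynamic_ntk_base(base, seq_len, original_max_pos, factor, dim):
    """Dynamic NTK: grow the effective base once seq_len exceeds the trained context,
    stretching low frequencies while leaving high frequencies nearly intact."""
    seq_len = max(seq_len, original_max_pos)  # never shrink below the trained context
    return base * ((factor * seq_len / original_max_pos) - (factor - 1)) ** (dim / (dim - 2))

# Within the trained context the base is unchanged:
b_short = dynamic_ntk_base(10000.0, seq_len=2048, original_max_pos=4096, factor=2.0, dim=128)
# Beyond it, the base grows, lowering the inverse frequencies:
b_long = dynamic_ntk_base(10000.0, seq_len=8192, original_max_pos=4096, factor=2.0, dim=128)
```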
Linear scaling multiplies position indices by 1/factor before computing frequencies.
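A minimal sketch of linear (position interpolation) scaling, using list-based stand-ins for the tensors (linear_scaled_angles is an illustrative name):

```python
def linear_scaled_angles(inv_freq, position_ids, factor):
    """Divide positions by `factor` before computing angles, compressing an
    extended range 0..factor*L back into the trained range 0..L."""
    return [[(pos / factor) * f for f in inv_freq] for pos in position_ids]

angles = linear_scaled_angles([1.0, 0.1], position_ids=[0, 8], factor=8.0)
# Position 8 with factor 8 yields the same angles as position 1 unscaled.
```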
YaRN applies more advanced scaling with an attention factor and frequency interpolation:
- beta_fast and beta_slow control the interpolation boundaries
- attention_scaling = sqrt(1 + log(factor) / log(original_max_pos_embeddings))

Llama 3 scaling applies frequency-dependent scaling factors:
- High-frequency components (wavelength < high_freq_wavelen) remain unchanged
- Low-frequency components (wavelength > low_freq_wavelen) are scaled by 1/factor
- Wavelengths in between are smoothly interpolated between the two regimes

Sources: src/transformers/modeling_rope_utils.py400-700
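A hedged sketch of the Llama 3 rule: high frequencies pass through, low frequencies are divided by factor, and the middle band is linearly blended (llama3_scale_inv_freq is an illustrative name; the library applies the same idea to a torch tensor):

```python
import math

def llama3_scale_inv_freq(inv_freq, factor, low_freq_factor, high_freq_factor, original_max_pos):
    """Frequency-dependent Llama 3 scaling applied per inverse-frequency channel."""
    low_freq_wavelen = original_max_pos / low_freq_factor
    high_freq_wavelen = original_max_pos / high_freq_factor
    out = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:      # high frequency: unchanged
            out.append(f)
        elif wavelen > low_freq_wavelen:     # low frequency: scaled down
            out.append(f / factor)
        else:                                # mid band: linear blend of the two regimes
            smooth = (original_max_pos / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            out.append((1 - smooth) * f / factor + smooth * f)
    return out

# One clearly-high and one clearly-low frequency channel:
out = llama3_scale_inv_freq(
    [2 * math.pi / 100, 2 * math.pi / 100000],
    factor=8.0, low_freq_factor=1.0, high_freq_factor=4.0, original_max_pos=8192,
)
```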
Sliding window attention restricts each token to attending only to the most recent sliding_window tokens rather than all previous tokens. This reduces the KV cache pressure and computation for long sequences. It is used in Mistral, Mixtral, Gemma2, Gemma3, and Qwen2-MoE.
Models with an optional sliding window select the mask function based on config:
# From MistralModel.forward / MixtralModel.forward
mask_function = (
    create_causal_mask if config.sliding_window is None
    else create_sliding_window_causal_mask
)
Both functions are imported from masking_utils:
- create_causal_mask: standard lower-triangular causal mask
- create_sliding_window_causal_mask: causal mask restricted to a sliding_window-wide band

Sources: src/transformers/models/mistral/modeling_mistral.py380-390 src/transformers/models/mixtral/modeling_mixtral.py476-484 src/transformers/models/qwen2/modeling_qwen2.py293-301
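The two mask shapes can be sketched as a boolean grid (illustrative helper, not the library's implementation, which builds torch tensors and handles padding):

```python
def sliding_window_causal_mask(seq_len, sliding_window=None):
    """True where query i may attend key j: causal (j <= i) and, if a window is set,
    within the most recent `sliding_window` positions (i - j < sliding_window)."""
    return [
        [j <= i and (sliding_window is None or i - j < sliding_window)
         for j in range(seq_len)]
        for i in range(seq_len)
    ]

full = sliding_window_causal_mask(4)                    # standard lower-triangular mask
windowed = sliding_window_causal_mask(4, sliding_window=2)
# With a window of 2, token 3 can see only keys 2 and 3.
```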
Models with interleaved full-attention and sliding-attention layers (Gemma2, Gemma3, Qwen2-MoE) precompute both mask types and index into them per decoder layer:
Per-layer hybrid masking flow
The attention_type per layer is determined by config.layer_types (a list like ["sliding_attention", "full_attention", "sliding_attention", ...]). The Gemma2DecoderLayer stores this in self.attention_type, and self.sliding_window is set to None for full-attention layers.
Sources: src/transformers/models/gemma2/modeling_gemma2.py430-467 src/transformers/models/gemma2/modeling_gemma2.py305-353 src/transformers/models/qwen2_moe/modeling_qwen2_moe.py248-251
In addition to the mask, Flash Attention and SDPA backends accept sliding_window directly on the attention call to avoid materializing the full mask:
# From MistralAttention.forward
attn_output, attn_weights = attention_interface(
self, query_states, key_states, value_states, attention_mask,
sliding_window=getattr(self.config, "sliding_window", None),
...
)
Position IDs always remain global (absolute token indices) regardless of sliding window. Only the attention mask (or backend parameter) limits which tokens can attend to each other.
Sources: src/transformers/models/mistral/modeling_mistral.py166-178 src/transformers/models/gemma2/modeling_gemma2.py255-299
Gemma3 uses different RoPE base frequencies for full-attention and sliding-attention layers to account for their different effective context lengths. Gemma3RotaryEmbedding maintains a separate inv_freq buffer for each attention layer type, registered as {layer_type}_inv_freq:
# From Gemma3RotaryEmbedding.__init__
for layer_type in self.layer_types:
rope_params = self.config.rope_parameters[layer_type]
# compute inv_freq per layer_type via ROPE_INIT_FUNCTIONS
self.register_buffer(f"{layer_type}_inv_freq", curr_inv_freq, persistent=False)
setattr(self, f"{layer_type}_attention_scaling", curr_attention_scaling)
The Gemma3TextConfig default uses rope_theta=1_000_000.0 for "full_attention" and rope_theta=10_000.0 for "sliding_attention". Both are defined under config.rope_parameters as a dict mapping layer type strings to RopeParameters dicts.
The forward method accepts a layer_type argument to select the appropriate buffers:
# From Gemma3RotaryEmbedding.forward
inv_freq = getattr(self, f"{layer_type}_inv_freq")
attention_scaling = getattr(self, f"{layer_type}_attention_scaling")
Sources: src/transformers/models/gemma3/modeling_gemma3.py152-230 src/transformers/models/gemma3/modular_gemma3.py143-246
Attention Integration Flow: Position embeddings are computed once per forward pass and passed down to all decoder layers, where they're applied to queries and keys before attention computation.
Model-Level Computation (e.g., LlamaModel.forward):
- Derives position_ids from cache_position (current token positions)
- Calls self.rotary_emb(hidden_states, position_ids=position_ids) once
- Stores the result as position_embeddings = (cos, sin)

Layer-Level Passing:
- Passes position_embeddings to each LlamaDecoderLayer
- Each layer forwards them to its self_attn module

Attention-Level Application:
- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
- key_states, value_states = past_key_values.update(key_states, value_states, layer_idx, cache_kwargs={"sin": sin, "cos": cos, "cache_position": cache_position})

Position embeddings interact with the KV cache system:
- The cache kwargs {"sin": sin, "cos": cos, "cache_position": cache_position} are passed to cache updates
- Static caches default their maximum length to max_position_embeddings
- Generation uses cache_position to track absolute positions

Sources: src/transformers/models/llama/modeling_llama.py377-430 src/transformers/models/llama/modeling_llama.py251-292
RopeParameters (defined in modeling_rope_utils.py, imported in config classes via from ...modeling_rope_utils import RopeParameters) is a TypedDict that defines the allowed keys for the rope_parameters dict. It serves as both documentation and type-checking support for config classes.
# Type signature from LlamaConfig and others
rope_parameters: RopeParameters # validated via standardize_rope_params / validate_rope
# Example: standard default RoPE
rope_parameters = {
"rope_type": "default",
"rope_theta": 10000.0,
}
# Linear scaling
rope_parameters = {"rope_type": "linear", "rope_theta": 10000.0, "factor": 8.0}
# YaRN
rope_parameters = {
"rope_type": "yarn", "rope_theta": 10000.0, "factor": 4.0,
"attention_factor": 1.0, "beta_fast": 32, "beta_slow": 1,
}
# LongRoPE
rope_parameters = {
"rope_type": "longrope",
"short_factor": [1.0, ...], "long_factor": [1.0, ...],
"original_max_position_embeddings": 4096,
}
# Llama 3 scaling
rope_parameters = {
"rope_type": "llama3", "rope_theta": 500000.0, "factor": 8.0,
"low_freq_factor": 1.0, "high_freq_factor": 4.0,
"original_max_position_embeddings": 8192,
}
# Partial rotation (Phi)
rope_parameters = {"rope_type": "default", "rope_theta": 10000.0, "partial_rotary_factor": 0.4}
# Per-layer-type (Gemma3)
rope_parameters = {
"full_attention": {"rope_type": "default", "rope_theta": 1000000.0},
"sliding_attention": {"rope_type": "default", "rope_theta": 10000.0},
}
| Model Family | Default rope_type | rope_theta | Notes |
|---|---|---|---|
| Llama 2 | default | 10000.0 | Standard RoPE |
| Llama 3 | llama3 | 500000.0 | High theta + frequency-dependent scaling |
| Mistral | default | 10000.0 | May include sliding window attention |
| Qwen2 | default | 1000000.0 | Very high theta for long context |
| Gemma | default | 10000.0 | Standard implementation |
| Qwen2VL | default | 10000.0 | 3D multimodal RoPE |
LlamaConfig and related configs process rope_parameters through standardize_rope_params() and validate_rope() helpers (defined in PreTrainedConfig via modeling_rope_utils). These validate that required keys are present and fill in defaults. Fields are validated against the RopeParameters TypedDict schema.
For dynamic scaling during inference, @dynamic_rope_update on the RotaryEmbedding.forward method automatically adjusts inv_freq based on the observed sequence length—no additional configuration is needed beyond setting the rope_type to "dynamic" or "longrope".
Sources: src/transformers/models/llama/configuration_llama.py1-200 src/transformers/modeling_rope_utils.py33-130
The transformers library implements a flexible, efficient positional embedding system centered on Rotary Position Embeddings (RoPE):
- RotaryEmbedding classes compute cos/sin embeddings from inverse frequencies
- The rope_parameters dictionary provides fine-grained control over RoPE behavior

Key classes and functions:
- LlamaRotaryEmbedding, MistralRotaryEmbedding, etc. (model-specific implementations)
- ROPE_INIT_FUNCTIONS (registry of initialization strategies)
- apply_rotary_pos_emb (application function)
- rotate_half (complex rotation helper)
- apply_multimodal_rotary_pos_emb (3D variant for vision-language models)
- @dynamic_rope_update (decorator for adaptive frequency updates)

Sources: src/transformers/models/llama/modeling_llama.py72-167 src/transformers/modeling_rope_utils.py1-900 src/transformers/models/qwen2_vl/modeling_qwen2_vl.py109-226