This document covers the positional embedding systems used throughout the transformers library to encode sequence position information into transformer models. The primary focus is on Rotary Position Embeddings (RoPE) and its variants, which are the dominant positional encoding mechanism used in modern decoder-only language models.
For information about attention mechanisms that consume these embeddings, see Attention Mechanisms. For details on model architectures that use these embeddings, see Decoder-Only Language Models and Multimodal Architectures.
Positional embeddings enable transformer models to capture the sequential nature of input data. Unlike recurrent architectures, transformers process all tokens in parallel and require explicit position information. This library implements several positional encoding schemes, described in the sections below.
The positional embedding system integrates with the model configuration, attention layers, and caching mechanisms to provide efficient position-aware representations.
Sources: src/transformers/models/llama/modeling_llama.py72-134 src/transformers/modeling_rope_utils.py1-900 src/transformers/models/mistral/modeling_mistral.py380-405
RoPE Architecture Flow: Configuration determines RoPE type and parameters, initialization computes inverse frequencies, forward pass generates cos/sin embeddings, and application rotates query/key states.
The core RoPE implementation follows a consistent pattern across all decoder-only models. Each model has a RotaryEmbedding class (e.g., LlamaRotaryEmbedding, MistralRotaryEmbedding) that inherits common behavior but can be customized per architecture.
Sources: src/transformers/models/llama/modeling_llama.py72-134 src/transformers/modeling_rope_utils.py33-230
The base pattern for all RoPE implementations:
RoPE Class Hierarchy: All model-specific RoPE classes follow the same interface, with customization through rope_init_fn functions.
The __init__ method determines which initialization function to use based on rope_type from the config, then registers inv_freq as a non-persistent buffer:
Key Components:
- inv_freq: the inverse frequency tensor 1.0 / (base ** (torch.arange(0, dim, 2) / dim))
- rope_type: specifies the RoPE variant (default, linear, dynamic, etc.)
- attention_scaling: post-processing scaling factor applied to cos/sin (default 1.0 for most types)
- @dynamic_rope_update: decorator that enables dynamic frequency recomputation for certain RoPE types

Sources: src/transformers/models/llama/modeling_llama.py72-134 src/transformers/models/mistral/modeling_mistral.py269-331 src/transformers/modeling_rope_utils.py232-400
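As a minimal, torch-free sketch of the inv_freq formula above (the function name compute_default_inv_freq is illustrative, not the library's; the actual implementation builds a torch tensor):

```python
def compute_default_inv_freq(dim, base=10000.0):
    """Inverse frequencies for standard RoPE: 1 / base^(2i/dim) for i = 0..dim/2 - 1."""
    return [1.0 / (base ** (2 * i / dim)) for i in range(dim // 2)]

inv_freq = compute_default_inv_freq(dim=8, base=10000.0)
# The first channel pair always rotates at frequency 1.0; later pairs rotate ever more slowly.
```

Each of the dim/2 entries drives one channel pair, which is why the buffer has half the head dimension.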
The forward pass computes rotary embeddings for the current sequence positions:
Steps:
1. Expand inv_freq to match the batch size: [batch, head_dim/2, 1]
2. Reshape position_ids to: [batch, 1, seq_len]
3. freqs = inv_freq @ position_ids → [batch, seq_len, head_dim/2]
4. emb = cat([freqs, freqs], dim=-1) → [batch, seq_len, head_dim]
5. cos = emb.cos() * attention_scaling, sin = emb.sin() * attention_scaling

The computation is forced to float32 precision using maybe_autocast(enabled=False) to ensure numerical stability.
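The same steps can be sketched in plain Python for a single (unbatched) sequence; rope_cos_sin is a hypothetical name, and the library version operates on batched torch tensors:

```python
import math

def rope_cos_sin(inv_freq, position_ids, attention_scaling=1.0):
    """For each position, compute angles pos * inv_freq, duplicate them to span the
    full head_dim (the cat([freqs, freqs]) step), and return (cos, sin) tables."""
    cos, sin = [], []
    for pos in position_ids:
        freqs = [pos * f for f in inv_freq]   # [head_dim/2] angles
        emb = freqs + freqs                   # cat([freqs, freqs], dim=-1)
        cos.append([math.cos(a) * attention_scaling for a in emb])
        sin.append([math.sin(a) * attention_scaling for a in emb])
    return cos, sin

cos, sin = rope_cos_sin([1.0, 0.01], position_ids=[0, 1, 2])
# At position 0 every angle is zero, so cos is all ones and sin all zeros.
```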
Sources: src/transformers/models/llama/modeling_llama.py121-134 src/transformers/models/gemma/modeling_gemma.py133-146
RoPE Application Flow: The rotate_half helper swaps and negates tensor halves, enabling the complex number rotation represented by q*cos + rotate_half(q)*sin.
The rotate_half function implements the core mathematical operation:
- x1 = x[..., :dim//2], x2 = x[..., dim//2:]
- Returns cat([-x2, x1], dim=-1)

This operation, combined with cos/sin multiplication, represents a rotation in the complex plane. The formula q_embed = (q * cos) + (rotate_half(q) * sin) applies RoPE to queries and keys.
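A torch-free sketch of this rotation, checking it against an explicit complex-number rotation (list-based stand-ins for the tensor ops; names are illustrative):

```python
import math

def rotate_half(x):
    """Split-halves rotation helper: [x1, x2] -> [-x2, x1]."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + x1

def apply_rope(q, cos, sin):
    """q_embed = q * cos + rotate_half(q) * sin, elementwise."""
    return [qi * c + ri * s for qi, ri, c, s in zip(q, rotate_half(q), cos, sin)]

# In the split-halves layout, channel i pairs with channel i + dim/2: rotating the
# pair (q[0], q[2]) by angle t equals multiplying q[0] + i*q[2] by e^{it}.
t = 0.3
q = [1.0, 0.0, 2.0, 0.0]
q_rot = apply_rope(q, [math.cos(t)] * 4, [math.sin(t)] * 4)
```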
Variant — Cohere Interleaved Rotation:
CohereRotaryEmbedding uses an interleaved rotation instead of the split-halves approach. Its rotate_half takes every other element (x[..., ::2], x[..., 1::2]) and stacks them, and the embedding computation uses torch.repeat_interleave(freqs, 2, dim=-1) instead of cat([freqs, freqs]). The mathematical result is equivalent, but the channel ordering differs.
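The two angle layouts can be compared with a small list-based sketch (illustrative only): both contain the same angles, but split-halves pairs channel i with i + dim/2, while interleaving pairs adjacent channels 2i and 2i + 1.

```python
freqs = [0.1, 0.2, 0.3]  # head_dim = 6, so three frequencies

# Split-halves layout: cat([freqs, freqs]) duplicates the block
split_halves = freqs + freqs

# Interleaved layout: repeat_interleave(freqs, 2) duplicates each element in place
interleaved = [f for f in freqs for _ in range(2)]
```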
Variant — Partial Rotary Factor: Some architectures (e.g., Phi) apply RoPE only to a fraction of the head dimension. The partial_rotary_factor in rope_parameters (default 1.0) controls how many channels receive rotary encoding: dim = int(head_dim * partial_rotary_factor).
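A minimal sketch of the partial split (split_partial_rotary is a hypothetical helper; in the models the rotated slice is re-concatenated with the untouched remainder after RoPE is applied):

```python
def split_partial_rotary(x, partial_rotary_factor):
    """Split a head vector into the channels that receive RoPE and the pass-through rest."""
    rotary_dim = int(len(x) * partial_rotary_factor)
    return x[:rotary_dim], x[rotary_dim:]

rot, keep = split_partial_rotary(list(range(10)), partial_rotary_factor=0.4)
# rot: the first 4 channels, which get rotated; keep: the last 6, concatenated back unchanged
```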
The apply_rotary_pos_emb function is decorated with @use_kernel_func_from_hub("rotary_pos_emb") to enable optimized kernel implementations from external sources. See page 5.2 for more on kernel dispatch.
Sources: src/transformers/models/llama/modeling_llama.py137-167 src/transformers/models/mistral/modeling_mistral.py52-82 src/transformers/models/cohere/modeling_cohere.py187-192 src/transformers/models/phi/modeling_phi.py74-85
The library supports multiple RoPE scaling strategies to handle sequences longer than the original pre-training context length:
| RoPE Type | Description | Key Parameters | Use Case |
|---|---|---|---|
default | Standard RoPE without scaling | rope_theta | Original context length |
linear | Linearly scales position indices | factor | Simple extension |
dynamic | Dynamic NTK-aware interpolation | factor | Adaptive scaling |
yarn | YaRN (Yet another RoPE extensioN) | factor, attention_factor, beta_fast, beta_slow | High-quality long context |
longrope | Context-length adaptive scaling | short_factor, long_factor, original_max_position_embeddings | Efficient long sequences |
llama3 | Llama 3 scaling variant | factor, low_freq_factor, high_freq_factor, original_max_position_embeddings | Llama 3 models |
Sources: src/transformers/modeling_rope_utils.py232-800
RoPE parameters are stored in config.rope_parameters, a dictionary containing:
- rope_type: string key for the scaling strategy
- rope_theta: base frequency (default 10000.0)

RoPE Initialization Flow: The rope_type from config determines which initialization function is called from ROPE_INIT_FUNCTIONS to compute inv_freq.
Each initialization function follows the signature:
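A sketch of the common contract, assuming each entry in ROPE_INIT_FUNCTIONS returns a pair (inverse frequencies, attention scaling); this toy version reads a plain dict and returns a list, whereas the library functions take a model config plus device and return a torch tensor:

```python
def compute_default_rope_parameters(config, seq_len=None):
    """Illustrative 'default' initializer: unscaled inverse frequencies
    and an attention_scaling of 1.0."""
    base = config["rope_theta"]
    dim = config["head_dim"]
    inv_freq = [1.0 / (base ** (2 * i / dim)) for i in range(dim // 2)]
    attention_scaling = 1.0  # no post-scaling for the "default" type
    return inv_freq, attention_scaling

inv_freq, scaling = compute_default_rope_parameters({"rope_theta": 10000.0, "head_dim": 8})
```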
Sources: src/transformers/models/llama/modeling_llama.py75-86 src/transformers/modeling_rope_utils.py232-270
The @dynamic_rope_update decorator enables RoPE types like "longrope" and "dynamic" to recompute frequencies during forward passes when the sequence length exceeds thresholds:
Mechanism:
forward method of RotaryEmbedding classesseq_len exceeds cached thresholdslongrope_frequency_update)inv_freq and updates the registered bufferLongrope Example:
- Uses short_factor for sequences ≤ original_max_position_embeddings
- Uses long_factor for sequences > original_max_position_embeddings

Sources: src/transformers/modeling_rope_utils.py33-230
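For the "dynamic" type, the recomputation adjusts the effective base as the sequence grows. A hedged sketch of the dynamic NTK base formula (dynamic_ntk_base is an illustrative name; the constants mirror the commonly used NTK-aware rule, not a verbatim copy of the library code):

```python
def dynamic_ntk_base(base, seq_len, original_max_pos, factor, dim):
    """Dynamic NTK: grow the effective base once seq_len exceeds the trained context,
    stretching low frequencies while leaving high frequencies nearly intact."""
    seq_len = max(seq_len, original_max_pos)  # never shrink below the trained context
    return base * ((factor * seq_len / original_max_pos) - (factor - 1)) ** (dim / (dim - 2))

# Within the trained context the base is unchanged:
b_short = dynamic_ntk_base(10000.0, seq_len=2048, original_max_pos=4096, factor=2.0, dim=128)
# Beyond it, the base grows, lowering the inverse frequencies:
b_long = dynamic_ntk_base(10000.0, seq_len=8192, original_max_pos=4096, factor=2.0, dim=128)
```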
Linear scaling multiplies position indices by 1/factor before computing frequencies.
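A minimal sketch of linear (position interpolation) scaling, using list-based stand-ins for the tensors (linear_scaled_angles is an illustrative name):

```python
def linear_scaled_angles(inv_freq, position_ids, factor):
    """Divide positions by `factor` before computing angles, compressing an
    extended range 0..factor*L back into the trained range 0..L."""
    return [[(pos / factor) * f for f in inv_freq] for pos in position_ids]

angles = linear_scaled_angles([1.0, 0.1], position_ids=[0, 8], factor=8.0)
# Position 8 with factor 8 yields the same angles as position 1 unscaled.
```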
YaRN applies more advanced scaling with an attention factor and frequency interpolation:
- beta_fast and beta_slow control the interpolation boundaries
- attention_scaling = sqrt(1 + log(factor) / log(original_max_pos_embeddings))

Llama 3 scaling applies frequency-dependent scaling factors:
- High-frequency components (wavelength < high_freq_wavelen) remain unchanged
- Low-frequency components (wavelength > low_freq_wavelen) are scaled by 1/factor
- Wavelengths in between are smoothly interpolated between the two regimes

Sources: src/transformers/modeling_rope_utils.py400-700
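A hedged sketch of the Llama 3 rule: high frequencies pass through, low frequencies are divided by factor, and the middle band is linearly blended (llama3_scale_inv_freq is an illustrative name; the library applies the same idea to a torch tensor):

```python
import math

def llama3_scale_inv_freq(inv_freq, factor, low_freq_factor, high_freq_factor, original_max_pos):
    """Frequency-dependent Llama 3 scaling applied per inverse-frequency channel."""
    low_freq_wavelen = original_max_pos / low_freq_factor
    high_freq_wavelen = original_max_pos / high_freq_factor
    out = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:      # high frequency: unchanged
            out.append(f)
        elif wavelen > low_freq_wavelen:     # low frequency: scaled down
            out.append(f / factor)
        else:                                # mid band: linear blend of the two regimes
            smooth = (original_max_pos / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            out.append((1 - smooth) * f / factor + smooth * f)
    return out

# One clearly-high and one clearly-low frequency channel:
out = llama3_scale_inv_freq(
    [2 * math.pi / 100, 2 * math.pi / 100000],
    factor=8.0, low_freq_factor=1.0, high_freq_factor=4.0, original_max_pos=8192,
)
```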
Sliding window attention restricts each token to attending only to the most recent sliding_window tokens rather than all previous tokens. This reduces the KV cache pressure and computation for long sequences. It is used in Mistral, Mixtral, Gemma2, Gemma3, and Qwen2-MoE.
Models with an optional sliding window select the mask function based on config:
# From MistralModel.forward / MixtralModel.forward
mask_function = (
    create_causal_mask if config.sliding_window is None
    else create_sliding_window_causal_mask
)
Both functions are imported from masking_utils:
- create_causal_mask: standard lower-triangular causal mask
- create_sliding_window_causal_mask: causal mask restricted to a sliding_window-wide band

Sources: src/transformers/models/mistral/modeling_mistral.py380-390 src/transformers/models/mixtral/modeling_mixtral.py476-484 src/transformers/models/qwen2/modeling_qwen2.py293-301
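The two mask shapes can be sketched as a boolean grid (illustrative helper, not the library's implementation, which builds torch tensors and handles padding):

```python
def sliding_window_causal_mask(seq_len, sliding_window=None):
    """True where query i may attend key j: causal (j <= i) and, if a window is set,
    within the most recent `sliding_window` positions (i - j < sliding_window)."""
    return [
        [j <= i and (sliding_window is None or i - j < sliding_window)
         for j in range(seq_len)]
        for i in range(seq_len)
    ]

full = sliding_window_causal_mask(4)                    # standard lower-triangular mask
windowed = sliding_window_causal_mask(4, sliding_window=2)
# With a window of 2, token 3 can see only keys 2 and 3.
```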
Models with interleaved full-attention and sliding-attention layers (Gemma2, Gemma3, Qwen2-MoE) precompute both mask types and index into them per decoder layer:
Per-layer hybrid masking flow
The attention_type per layer is determined by config.layer_types (a list like ["sliding_attention", "full_attention", "sliding_attention", ...]). The Gemma2DecoderLayer stores this in self.attention_type, and self.sliding_window is set to None for full-attention layers.
Sources: src/transformers/models/gemma2/modeling_gemma2.py430-467 src/transformers/models/gemma2/modeling_gemma2.py305-353 src/transformers/models/qwen2_moe/modeling_qwen2_moe.py248-251
In addition to the mask, Flash Attention and SDPA backends accept sliding_window directly on the attention call to avoid materializing the full mask:
# From MistralAttention.forward
attn_output, attn_weights = attention_interface(
self, query_states, key_states, value_states, attention_mask,
sliding_window=getattr(self.config, "sliding_window", None),
...
)
Position IDs always remain global (absolute token indices) regardless of sliding window. Only the attention mask (or backend parameter) limits which tokens can attend to each other.
Sources: src/transformers/models/mistral/modeling_mistral.py166-178 src/transformers/models/gemma2/modeling_gemma2.py255-299
Gemma3 uses different RoPE base frequencies for full-attention and sliding-attention layers to account for their different effective context lengths. Gemma3RotaryEmbedding maintains a separate inv_freq buffer for each attention layer type, registered as {layer_type}_inv_freq:
# From Gemma3RotaryEmbedding.__init__
for layer_type in self.layer_types:
rope_params = self.config.rope_parameters[layer_type]
# compute inv_freq per layer_type via ROPE_INIT_FUNCTIONS
self.register_buffer(f"{layer_type}_inv_freq", curr_inv_freq, persistent=False)
setattr(self, f"{layer_type}_attention_scaling", curr_attention_scaling)
The Gemma3TextConfig default uses rope_theta=1_000_000.0 for "full_attention" and rope_theta=10_000.0 for "sliding_attention". Both are defined under config.rope_parameters as a dict mapping layer type strings to RopeParameters dicts.
The forward method accepts a layer_type argument to select the appropriate buffers:
# From Gemma3RotaryEmbedding.forward
inv_freq = getattr(self, f"{layer_type}_inv_freq")
attention_scaling = getattr(self, f"{layer_type}_attention_scaling")
Sources: src/transformers/models/gemma3/modeling_gemma3.py152-230 src/transformers/models/gemma3/modular_gemma3.py143-246
Attention Integration Flow: Position embeddings are computed once per forward pass and passed down to all decoder layers, where they're applied to queries and keys before attention computation.
Model-Level Computation (e.g., LlamaModel.forward):
- Derives position_ids from cache_position (current token positions)
- Calls self.rotary_emb(hidden_states, position_ids=position_ids) once
- Stores the result as position_embeddings = (cos, sin)

Layer-Level Passing:
- Passes position_embeddings to each LlamaDecoderLayer
- Each layer forwards them to its self_attn module

Attention-Level Application:
- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
- key_states, value_states = past_key_values.update(key_states, value_states, layer_idx, cache_kwargs={"sin": sin, "cos": cos, "cache_position": cache_position})

Position embeddings interact with the KV cache system:
- The cache kwargs {"sin": sin, "cos": cos, "cache_position": cache_position} are passed to cache updates
- Static caches default their maximum length to max_position_embeddings
- Generation uses cache_position to track absolute positions

Sources: src/transformers/models/llama/modeling_llama.py377-430 src/transformers/models/llama/modeling_llama.py251-292
RopeParameters (defined in modeling_rope_utils.py, imported in config classes via from ...modeling_rope_utils import RopeParameters) is a TypedDict that defines the allowed keys for the rope_parameters dict. It serves as both documentation and type-checking support for config classes.
# Type signature from LlamaConfig and others
rope_parameters: RopeParameters # validated via standardize_rope_params / validate_rope
# Example: standard default RoPE
rope_parameters = {
"rope_type": "default",
"rope_theta": 10000.0,
}
# Linear scaling
rope_parameters = {"rope_type": "linear", "rope_theta": 10000.0, "factor": 8.0}
# YaRN
rope_parameters = {
"rope_type": "yarn", "rope_theta": 10000.0, "factor": 4.0,
"attention_factor": 1.0, "beta_fast": 32, "beta_slow": 1,
}
# LongRoPE
rope_parameters = {
"rope_type": "longrope",
"short_factor": [1.0, ...], "long_factor": [1.0, ...],
"original_max_position_embeddings": 4096,
}
# Llama 3 scaling
rope_parameters = {
"rope_type": "llama3", "rope_theta": 500000.0, "factor": 8.0,
"low_freq_factor": 1.0, "high_freq_factor": 4.0,
"original_max_position_embeddings": 8192,
}
# Partial rotation (Phi)
rope_parameters = {"rope_type": "default", "rope_theta": 10000.0, "partial_rotary_factor": 0.4}
# Per-layer-type (Gemma3)
rope_parameters = {
"full_attention": {"rope_type": "default", "rope_theta": 1000000.0},
"sliding_attention": {"rope_type": "default", "rope_theta": 10000.0},
}
| Model Family | Default rope_type | rope_theta | Notes |
|---|---|---|---|
| Llama 2 | default | 10000.0 | Standard RoPE |
| Llama 3 | llama3 | 500000.0 | High theta + frequency-dependent scaling |
| Mistral | default | 10000.0 | May include sliding window attention |
| Qwen2 | default | 1000000.0 | Very high theta for long context |
| Gemma | default | 10000.0 | Standard implementation |
| Qwen2VL | default | 10000.0 | 3D multimodal RoPE |
LlamaConfig and related configs process rope_parameters through standardize_rope_params() and validate_rope() helpers (defined in PreTrainedConfig via modeling_rope_utils). These validate that required keys are present and fill in defaults. Fields are validated against the RopeParameters TypedDict schema.
For dynamic scaling during inference, @dynamic_rope_update on the RotaryEmbedding.forward method automatically adjusts inv_freq based on the observed sequence length—no additional configuration is needed beyond setting the rope_type to "dynamic" or "longrope".
Sources: src/transformers/models/llama/configuration_llama.py1-200 src/transformers/modeling_rope_utils.py33-130
The transformers library implements a flexible, efficient positional embedding system centered on Rotary Position Embeddings (RoPE):
- RotaryEmbedding classes compute cos/sin embeddings from inverse frequencies
- The rope_parameters dictionary provides fine-grained control over RoPE behavior

Key classes and functions:
- LlamaRotaryEmbedding, MistralRotaryEmbedding, etc. (model-specific implementations)
- ROPE_INIT_FUNCTIONS (registry of initialization strategies)
- apply_rotary_pos_emb (application function)
- rotate_half (complex rotation helper)
- apply_multimodal_rotary_pos_emb (3D variant for vision-language models)
- @dynamic_rope_update (decorator for adaptive frequency updates)

Sources: src/transformers/models/llama/modeling_llama.py72-167 src/transformers/modeling_rope_utils.py1-900 src/transformers/models/qwen2_vl/modeling_qwen2_vl.py109-226