This document describes the attention mechanism implementation in nanochat's GPT model, focusing on the Flash Attention 3 integration, sliding window patterns, and architectural optimizations. For the broader transformer architecture (MLP blocks, residual connections, model configuration), see GPT Transformer Architecture. For inference-specific KV cache details, see Inference Engine and KV Cache.
The nanochat model implements causal self-attention with several modern enhancements: Flash Attention 3 kernels, per-layer sliding window attention, grouped-query attention (GQA), rotary position embeddings (RoPE), QK normalization, and gated value embeddings.
Sources: nanochat/gpt.py:59-118
Flash Attention 3 provides ~9% throughput improvement over Flash Attention 2 by optimizing memory access patterns and supporting Hopper GPU tensor cores.
The nanochat codebase uses a custom abstraction layer (nanochat.flash_attention) that automatically routes to FA3 on Hopper GPUs or falls back to PyTorch's scaled_dot_product_attention (SDPA) on other hardware.
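A minimal sketch of this routing, assuming the FA3 entry point is importable as flash_attn_interface.flash_attn_func (the exact module path and wrapper name in nanochat's abstraction layer may differ):

```python
# Sketch of the FA3-or-SDPA dispatch described above. The import path and
# wrapper name are assumptions for illustration, not nanochat's actual API.
import torch
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func  # FA3, Hopper GPUs only
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False

def attention(q, k, v, causal=True):
    """q, k, v: (B, T, H, D) -- the FA3-native layout."""
    if HAS_FA3:
        return flash_attn_func(q, k, v, causal=causal)
    # SDPA expects (B, H, T, D), so transpose in and out.
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
        is_causal=causal,
    )
    return out.transpose(1, 2)
```

The caller always works in (B, T, H, D); only the fallback path pays for the transposes.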
Sources: dev/LOG.md:519-560, nanochat/gpt.py:26
During training, the model uses flash_attn.flash_attn_func for causal attention with optional sliding window:
| Parameter | Value | Purpose |
|---|---|---|
| q, k, v | (B, T, H, D) | Native FA3 layout, no transpose |
| causal | True | Enforce autoregressive masking |
| window_size | (left, 0) | Sliding window (if enabled) |
The training-time call site is at nanochat/gpt.py:100.
During inference, the model uses flash_attn.flash_attn_with_kvcache which manages KV cache in-place:
| Parameter | Value | Purpose |
|---|---|---|
| q | (B, T, H, D) | Query for current tokens |
| k_cache, v_cache | (num_layers, B, T_cache, H, D) | Persistent cache |
| k, v | (B, T, H, D) | New keys/values to append |
| cache_seqlens | int32 tensor | Per-batch position tracker |
| causal | True | Autoregressive masking |
| window_size | (left, 0) | Sliding window |
The function modifies k_cache and v_cache in place, automatically appending the new keys/values and returning the attention output (nanochat/gpt.py:104-110).
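The in-place semantics can be emulated in plain PyTorch. The sketch below illustrates the behavior described above; the per-batch loop and the prefill/decode split are simplifications for clarity, not nanochat's code:

```python
# Plain-PyTorch emulation of flash_attn_with_kvcache's in-place cache update
# (an illustrative sketch, not nanochat's implementation).
import torch
import torch.nn.functional as F

def attend_with_kvcache(q, k_cache, v_cache, k, v, cache_seqlens):
    B, T, H, D = q.shape
    outs = []
    for b in range(B):
        pos = int(cache_seqlens[b])
        k_cache[b, pos:pos + T] = k[b]   # in-place append into the cache
        v_cache[b, pos:pos + T] = v[b]
        qb = q[b].transpose(0, 1)                    # (H, T, D) for SDPA
        kb = k_cache[b, :pos + T].transpose(0, 1)    # (H, pos+T, D)
        vb = v_cache[b, :pos + T].transpose(0, 1)
        if pos == 0:
            # Prefill: standard causal mask over the new tokens.
            o = F.scaled_dot_product_attention(qb, kb, vb, is_causal=True)
        else:
            # Single-token decode: the one query may see the whole prefix.
            assert T == 1, "sketch handles prefill or one-token decode only"
            o = F.scaled_dot_product_attention(qb, kb, vb)
        outs.append(o.transpose(0, 1))               # back to (T, H, D)
    cache_seqlens += T                               # advance position trackers
    return torch.stack(outs)
```

The real kernel fuses the append and the attention, and handles ragged batch positions without a Python loop.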
FA3's native (B, T, H, D) layout eliminates the transpose operations required by FA2's (B, H, T, D) layout.
Sources: dev/LOG.md:800-830, nanochat/gpt.py:80-100
Sliding window attention allows each token to attend only to a limited context window, reducing computation for long sequences while preserving full context in selected layers.
The model supports configurable per-layer window patterns via the window_pattern string in GPTConfig:
| Character | Meaning | Window Size |
|---|---|---|
| L | Long | Full context (sequence_len) |
| S | Short | Half context (sequence_len // 2) |
The pattern is tiled across layers, with the final layer always forced to L (full context). For example, with window_pattern="SSSL" and 20 layers, the pattern repeats five times (S S S L × 5), so every fourth layer, including the last, uses full context.
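The tiling rule can be sketched in a few lines (assumed behavior matching the description; the helper name is hypothetical):

```python
# Hypothetical sketch of the pattern tiling described above: repeat the
# pattern string across layers and force the final layer to 'L'.
def layer_window_chars(window_pattern: str, n_layer: int) -> list[str]:
    chars = [window_pattern[i % len(window_pattern)] for i in range(n_layer)]
    chars[-1] = "L"  # final layer always gets full context
    return chars
```

With window_pattern="SSSL" and 20 layers this yields S S S L repeated five times, the last layer already being L.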
The _compute_window_sizes method (nanochat/gpt.py:260-287) converts the pattern string to per-layer (left, right) tuples:
| Window Type | Tuple Value | Meaning |
|---|---|---|
| Full context | (sequence_len, 0) | Attend to all previous tokens |
| Half window | (sequence_len // 2, 0) | Attend to last N/2 tokens |
These tuples are passed directly to Flash Attention 3's window_size parameter. The value (-1, 0) can also be used to indicate unlimited left context.
The attention FLOPs per layer depend on the effective sequence length, which is capped by the window size; the estimate_flops() method (nanochat/gpt.py:292-317) accounts for this.
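As a rough sketch of the windowed accounting (an assumed back-of-the-envelope formula, not the exact estimate_flops() implementation): the two attention matmuls, QKᵀ and attn·V, each cost about 2 · T · W · (n_head · head_dim) FLOPs per layer, where W is the effective context.

```python
# Assumed back-of-the-envelope attention FLOPs per layer: the effective
# context W is capped by the layer's left window.
def attn_matmul_flops(T, window_left, n_head, head_dim):
    W = min(T, window_left)
    return 2 * 2 * T * W * n_head * head_dim  # QK^T plus attn @ V
```

Under this estimate a half-context layer (W = T/2) costs half the attention FLOPs of a full-context layer, and widening the window beyond T has no effect.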
Sources: nanochat/gpt.py:36-39, nanochat/gpt.py:260-287, dev/LOG.md:784-798
GQA reduces memory bandwidth during inference by sharing key/value heads across multiple query heads.
| Parameter | Description | Default | Effect |
|---|---|---|---|
| n_head | Number of query heads | 6 | Full attention resolution |
| n_kv_head | Number of key/value heads | 6 | Must divide n_head |
| head_dim | Dimension per head | n_embd // n_head | Computed |
Standard attention uses n_kv_head = n_head (all heads independent). GQA uses n_kv_head < n_head, where multiple query heads share the same key/value pair.
The projection layers (nanochat/gpt.py:69-71) have different output sizes:

- Query projection: n_head × head_dim output features
- Key projection: n_kv_head × head_dim output features
- Value projection: n_kv_head × head_dim output features

Flash Attention 3 handles GQA automatically when the K and V tensors have fewer heads than Q.
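For illustration, GQA can be reproduced with plain SDPA by repeating each KV head across its query-head group; FA3 performs this grouping internally without materializing the repeats:

```python
# Illustrative GQA via explicit KV-head repetition (FA3 avoids the copy).
import torch
import torch.nn.functional as F

B, T, n_head, n_kv_head, D = 2, 8, 6, 2, 16
q = torch.randn(B, n_head, T, D)            # (B, H, T, D) for SDPA
k = torch.randn(B, n_kv_head, T, D)
v = torch.randn(B, n_kv_head, T, D)

group = n_head // n_kv_head                 # 3 query heads share each KV head
k_rep = k.repeat_interleave(group, dim=1)   # (B, n_head, T, D)
v_rep = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k_rep, v_rep, is_causal=True)
```

Recent PyTorch versions can also skip the explicit repeat via SDPA's enable_gqa flag; the repeat form is shown because it makes the head sharing visible.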
For a model with n_head=12, head_dim=128, sequence_len=2048:
| Configuration | KV Cache per Layer (elements) |
|---|---|
| n_kv_head=12 (standard) | 2 × 12 × 2048 × 128 ≈ 6.3M |
| n_kv_head=4 (GQA 3:1) | 2 × 4 × 2048 × 128 ≈ 2.1M |
| n_kv_head=1 (MQA) | 2 × 1 × 2048 × 128 ≈ 0.5M |

In bf16 (2 bytes per element) these correspond to roughly 12.6 MB, 4.2 MB, and 1.0 MB per layer, respectively.
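A quick sanity check of the element counts in the table; the byte size then depends only on the cache dtype:

```python
# Per-layer KV cache accounting: 2 tensors (K and V), each
# n_kv_head * T_cache * head_dim elements. Bytes = elements * dtype size.
def kv_cache_elements(n_kv_head, seq_len, head_dim):
    return 2 * n_kv_head * seq_len * head_dim

std = kv_cache_elements(12, 2048, 128)  # standard attention
gqa = kv_cache_elements(4, 2048, 128)   # GQA 3:1
mqa = kv_cache_elements(1, 2048, 128)   # multi-query attention
```

The ratios are exactly the head ratios: GQA 3:1 shrinks the cache 3×, MQA 12×.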
Sources: nanochat/gpt.py:33-34, nanochat/gpt.py:63-68
RoPE provides relative position information by applying rotation matrices to query and key vectors.
The model precomputes rotation frequencies during initialization (nanochat/gpt.py:243-258):

- inv_freq = 1.0 / (base ** (channel_range / head_dim))
- freqs = outer(t, inv_freq)
- cos, sin = freqs.cos(), freqs.sin()
- The tables are shaped for broadcasting as (1, seq_len, 1, head_dim/2)

The cache is allocated for sequence_len × 10 to allow dynamic sequence lengths without recomputation.
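The precomputation steps above, sketched with illustrative variable names and toy sizes:

```python
# Sketch of the RoPE frequency precomputation (shapes follow the text;
# sizes here are toy values for illustration).
import torch

head_dim, seq_len, base = 16, 32, 10000.0
channel_range = torch.arange(0, head_dim, 2).float()      # even channel indices
inv_freq = 1.0 / (base ** (channel_range / head_dim))     # (head_dim/2,)
t = torch.arange(seq_len).float()                         # positions
freqs = torch.outer(t, inv_freq)                          # (seq_len, head_dim/2)
cos, sin = freqs.cos(), freqs.sin()
cos = cos[None, :, None, :]                               # (1, seq_len, 1, head_dim/2)
sin = sin[None, :, None, :]
```

The unit axes let the tables broadcast against (B, T, H, D/2) query/key halves without copies.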
The apply_rotary_emb function (nanochat/gpt.py:51-57) rotates query and key vectors.
This applies a 2D rotation to consecutive pairs of dimensions, encoding relative positions through interference patterns.
During inference with a KV cache, the rotary embeddings must be offset to the current cache position (nanochat/gpt.py:396-397).
This ensures new tokens receive embeddings corresponding to their absolute position in the full sequence.
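A hedged sketch of the pairwise rotation and the cache-position offset (nanochat's exact channel pairing and layout may differ):

```python
# Sketch of rotary application: rotate consecutive channel pairs by the
# position-dependent angle. The pairing convention is an assumption.
import torch

def apply_rotary_emb(x, cos, sin):
    # x: (B, T, H, D); cos/sin broadcast as (1, T, 1, D/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]   # consecutive pairs
    y1 = x1 * cos - x2 * sin              # 2D rotation per pair
    y2 = x1 * sin + x2 * cos
    return torch.stack((y1, y2), dim=-1).flatten(-2)

# During cached decoding, slice the tables at the absolute position, e.g.:
# cos_t = cos_cache[:, pos : pos + T]; sin_t = sin_cache[:, pos : pos + T]
```

Because each pair undergoes a pure rotation, vector norms are preserved, which is one reason RoPE composes cleanly with QK normalization.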
Sources: nanochat/gpt.py:51-57, nanochat/gpt.py:182-186, nanochat/gpt.py:243-258
Both queries and keys are normalized using RMSNorm after applying rotary embeddings (nanochat/gpt.py:94). The norm function (nanochat/gpt.py:42-44) is a purely functional RMSNorm with no learnable parameters.
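A minimal parameter-free RMSNorm consistent with that description (a sketch, not the verbatim implementation):

```python
# Purely functional RMSNorm: scale each vector by the reciprocal of its
# root-mean-square over the last dimension. No learnable weight.
import torch

def norm(x, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

q = torch.randn(2, 4, 3, 8)   # (B, T, H, D): normalized over D, per head
qn = norm(q)
```

Applying it over the last dimension of (B, T, H, D) normalizes each head's query/key independently, as the text notes.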
| Issue Without QK Norm | How QK Norm Helps |
|---|---|
| Attention logit magnitude grows with √d | Normalized vectors have unit norm |
| Unstable training with large head_dim | Stable dot products regardless of dimension |
| Need to carefully tune attention scale | Automatic scaling via normalization |
The normalization is applied per-head (last dimension of shape (B, T, H, D)), ensuring each head's queries and keys are independently normalized.
Sources: nanochat/gpt.py:42-44, nanochat/gpt.py:94
Value embeddings add extra capacity by mixing learned token-specific embeddings into the attention values.
The model uses value embeddings at alternating layers, with the last layer always included (nanochat/gpt.py:47-49).
For a 12-layer model, value embeddings are present at layers: 1, 3, 5, 7, 9, 11.
Each value embedding is gated by an input-dependent weight (nanochat/gpt.py:86-89).
The gate network:

- Input: the first 32 channels of the residual stream, x[..., :32]
- Output shape: (B, T, n_kv_head), one gate per KV head
- Range: (0, 2), via 2 * sigmoid(...)

For vocab_size=32768, n_kv_head=6, head_dim=128, each value embedding table contains:

- 32768 × (6 × 128) = 25.2M parameters per table
- 6 × 25.2M = 151M parameters total

This is comparable to the token embedding table itself (25.2M). The large parameter count is justified because value embeddings add capacity at near-zero FLOP cost (just an embedding lookup plus a gated addition).
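A hedged sketch of the gated mix-in (tensor sizes and the exact mixing point are illustrative assumptions, not nanochat's code):

```python
# Sketch of gated value embeddings: a small linear gate reads the first 32
# residual channels and scales a per-token learned embedding that is added
# to the attention values. Sizes here are toy values.
import torch
import torch.nn as nn

B, T, n_embd, n_kv_head, head_dim = 2, 8, 64, 6, 16
gate_proj = nn.Linear(32, n_kv_head)            # reads first 32 channels
x = torch.randn(B, T, n_embd)                   # residual stream
ve = torch.randn(B, T, n_kv_head, head_dim)     # looked up from the ve table

gate = 2 * torch.sigmoid(gate_proj(x[..., :32]))   # (B, T, n_kv_head), in (0, 2)
v = torch.randn(B, T, n_kv_head, head_dim)         # ordinary value projection
v = v + gate.unsqueeze(-1) * ve                    # gated addition, near-zero FLOPs
```

The cost is one tiny matmul plus an elementwise add, which is why the extra 151M parameters barely move the FLOP count.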
Sources: nanochat/gpt.py:47-49, nanochat/gpt.py:73-74, nanochat/gpt.py:86-89, dev/LOG.md:487-495
For non-Hopper GPUs, the system automatically falls back to PyTorch's scaled_dot_product_attention:
| Aspect | Flash Attention 3 | SDPA Fallback |
|---|---|---|
| Layout | (B, T, H, D) native | Transpose to (B, H, T, D) |
| KV cache | In-place update | Manual concatenation |
| Sliding window | Native window_size | Explicit mask tensor |
| Performance | Optimized for H100 | CPU/older GPU compatible |
| Memory | Recompute-optimized | Standard memory usage |
The fallback ensures nanochat can run on any device (CPU, MPS, older CUDA GPUs) but with reduced performance, especially for sliding window attention.
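For sliding windows, the fallback must build the mask tensor explicitly, since SDPA has no window_size parameter. A sketch, using SDPA's boolean attn_mask convention (True = may attend):

```python
# Explicit sliding-window causal mask for the SDPA fallback (a sketch).
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(T, left, device=None):
    i = torch.arange(T, device=device)[:, None]   # query positions
    j = torch.arange(T, device=device)[None, :]   # key positions
    # token i may attend to j when j <= i (causal) and i - j <= left (window)
    return (j <= i) & (i - j <= left)

B, H, T, D = 1, 2, 8, 16
q = torch.randn(B, H, T, D)
mask = sliding_window_causal_mask(T, left=3)
out = F.scaled_dot_product_attention(q, q, q, attn_mask=mask)
```

Materializing this (T, T) mask is exactly the overhead FA3's native window_size parameter avoids.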
Sources: dev/LOG.md:519-560
The attention mechanism integrates into the transformer block's forward pass (nanochat/gpt.py:140-143).
Arguments passed from the main GPT forward (nanochat/gpt.py:403-406):

- ve: value embedding for this layer (if the layer has one)
- cos_sin: rotary embedding tables sliced to the sequence length
- window_size: per-layer sliding window configuration
- kv_cache: inference cache (None during training)

This design keeps the attention layer stateless, with all positional and caching state managed by the calling context.