The GPU Model Runner (GPUModelRunner) is the core execution engine in vLLM's V1 architecture responsible for orchestrating model inference on GPU devices. It coordinates the forward pass, manages KV cache interactions, prepares input tensors, handles attention metadata, and interfaces with sampling mechanisms to generate tokens.
This page focuses on the model runner's architecture and execution flow.
There are two GPUModelRunner implementations in the v1 codebase:
| File | Role |
|---|---|
| vllm/v1/worker/gpu_model_runner.py | Primary runner: large, monolithic, handles all model types and features |
| vllm/v1/worker/gpu/model_runner.py | Modular refactored runner: separates concerns into sub-modules under vllm/v1/worker/gpu/ |
This page covers both, emphasizing the primary runner while noting where the modular runner differs.
Sources: vllm/v1/worker/gpu_model_runner.py1-200 vllm/v1/worker/gpu/model_runner.py1-100
The GPUModelRunner sits between the worker and the actual neural network model, serving as the orchestration layer that transforms scheduled requests into executable model inputs and processes the outputs.
GPUModelRunner in gpu_model_runner.py integrates directly with multiple mixin classes and manages all state in-process.
Component Relationships (Primary Runner)
Sources: vllm/v1/worker/gpu_model_runner.py375-762
The newer vllm/v1/worker/gpu/model_runner.py delegates responsibilities to dedicated sub-modules:
Component Relationships (Modular Runner)
Sources: vllm/v1/worker/gpu/model_runner.py100-225 vllm/v1/worker/gpu/cudagraph_utils.py29-75 vllm/v1/worker/gpu/states.py9-79 vllm/v1/worker/gpu/input_batch.py12-33 vllm/v1/worker/gpu/block_table.py13-65
The GPUModelRunner fulfills the following responsibilities:
| Responsibility | Description | Key Methods |
|---|---|---|
| State Management | Maintains cached request states and persistent batch state | _update_states(), requests: dict[str, CachedRequestState] |
| Input Preparation | Converts scheduler output to model input tensors | _prepare_inputs(), _prepare_input_ids() |
| Attention Metadata | Builds attention metadata for various backends | _prepare_attn_metadata(), initialize_attn_backend() |
| Model Execution | Orchestrates forward pass with proper context | execute_model(), _execute_model_cuda_graph() |
| KV Cache Management | Initializes and binds KV cache to attention layers | initialize_kv_cache(), bind_kv_cache() |
| Token Sampling | Coordinates with sampling layers to generate tokens | sample_tokens(), _run_sampler() |
| CUDA Graphs | Captures and replays CUDA graphs for performance | capture_model(), cudagraph_dispatcher |
| Multi-modal Processing | Handles vision/audio encoder inputs | _process_encoder_inputs(), encoder_cache |
| Speculative Decoding | Supports draft model and verification | drafter, rejection_sampler |
Sources: vllm/v1/worker/gpu_model_runner.py329-716
GPUModelRunner.__init__() in the primary runner initializes a large set of pre-allocated buffers and sub-components at startup:
__init__ Initialization Sequence
Key initialization steps:
- Stores vllm_config vllm/v1/worker/gpu_model_runner.py383-398
- Creates InputBatch for persistent batch state vllm/v1/worker/gpu_model_runner.py557-581
- Creates CudagraphDispatcher for runtime graph dispatch vllm/v1/worker/gpu_model_runner.py697

The modular runner (gpu/model_runner.py) delegates these responsibilities:

- InputBuffers vllm/v1/worker/gpu/input_batch.py12-33
- RequestState vllm/v1/worker/gpu/states.py9-79
- CudaGraphManager vllm/v1/worker/gpu/cudagraph_utils.py29-75
- BlockTables vllm/v1/worker/gpu/block_table.py13-65

Sources: vllm/v1/worker/gpu_model_runner.py375-762 vllm/v1/worker/gpu/model_runner.py100-225
load_model and initialize_kv_cache Sequence
Key methods:
load_model(): Loads model weights using get_model_loader(), detects model architecture features (is_mixture_of_experts, supports_mrope, supports_multimodal_pruning), and initializes EPLB state if expert parallelism is configured.
initialize_kv_cache(kv_cache_config: KVCacheConfig):
- Allocates cache tensors per KVCacheGroupSpec
- Calls initialize_attn_backend() to set up per-group attention backends
- Calls bind_kv_cache() to attach cache tensors to attention layer modules
- Handles cross-layer cache sharing (shared_kv_cache_layers) and EncoderOnlyAttentionSpec

In the modular runner, initialize_kv_cache() additionally creates BlockTables and sets up the KVConnector vllm/v1/worker/gpu/model_runner.py287-326
Sources: vllm/v1/worker/gpu_model_runner.py1226-1369 vllm/v1/worker/gpu/model_runner.py236-326
The execute_model() method is the primary entry point for processing a batch of requests:
Sources: vllm/v1/worker/gpu_model_runner.py1958-2194
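At a high level, the steps execute_model() orchestrates can be sketched in plain Python. This is illustrative only; the helper names (update_states, prepare_inputs, run_model, compute_logits) are hypothetical stand-ins for the real methods, not the actual vLLM API:

```python
# Illustrative sketch of the execute_model() orchestration flow.
# All helper names are hypothetical stand-ins for the real methods.

def execute_model_sketch(runner, scheduler_output):
    # 1. Sync internal request state with the scheduler's decisions
    runner.update_states(scheduler_output)
    if scheduler_output.total_num_scheduled_tokens == 0:
        return None  # nothing scheduled this step
    # 2. Build input tensors and attention metadata
    inputs = runner.prepare_inputs(scheduler_output)
    # 3. Run the forward pass (CUDA graph replay or eager)
    hidden_states = runner.run_model(inputs)
    # 4. Compute logits at the sampling positions
    return runner.compute_logits(hidden_states, inputs)
```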
The input preparation pipeline transforms SchedulerOutput into model-ready tensors:
The _update_states() method synchronizes the model runner's internal state with the scheduler's decisions:
Key data structure: CachedRequestState
Each request has a cached state containing:
- req_id: Unique request identifier
- prompt_token_ids: Original prompt tokens
- block_ids: KV cache block assignments (tuple of lists for multi-group)
- num_computed_tokens: Number of tokens already computed
- output_token_ids: Generated tokens so far
- sampling_params or pooling_params: Sampling/pooling configuration
- mm_features: Multi-modal features if applicable
- lora_request: LoRA adapter if applicable

Sources: vllm/v1/worker/gpu_model_runner.py878-1177 vllm/v1/worker/gpu_input_batch.py29-78
The _prepare_inputs() method builds GPU tensors from the updated batch state:
Input tensor construction:
Input IDs and Positions (lines 2196-2336):

- Copies input_batch.token_ids_cpu to the GPU
- Handles prompt_embeds and is_token_ids cases

Sequence Lengths and Query Start Locations (lines 2337-2400):

- Derives seq_lens from num_computed_tokens
- Builds query_start_loc as cumulative sum of scheduled tokens

Block Tables (lines 2401-2450):

- Populates input_batch.block_table from request states

Attention Metadata (lines 2451-2550):

- Calls _prepare_attn_metadata() for each attention group

Sources: vllm/v1/worker/gpu_model_runner.py2196-2550
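The seq_lens and query_start_loc construction described above can be illustrated with plain Python. This is a simplified sketch of the logic, not the actual tensor code in _prepare_inputs():

```python
from itertools import accumulate

def build_query_start_loc(num_scheduled_tokens: list[int]) -> list[int]:
    """query_start_loc[i] is where request i's tokens begin in the
    flattened token batch; the final entry is the total token count."""
    return [0, *accumulate(num_scheduled_tokens)]

def build_seq_lens(num_computed_tokens: list[int],
                   num_scheduled_tokens: list[int]) -> list[int]:
    """Total visible sequence length per request this step: tokens
    already in the KV cache plus tokens scheduled now."""
    return [c + s for c, s in zip(num_computed_tokens, num_scheduled_tokens)]

# Two prefill requests (5 and 3 new tokens) and one decode request
# (1 token, 12 tokens already cached):
qsl = build_query_start_loc([5, 3, 1])        # [0, 5, 8, 9]
lens = build_seq_lens([0, 0, 12], [5, 3, 1])  # [5, 3, 13]
```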
Attention metadata encapsulates all information needed by attention kernels:
Key parameters passed to metadata builder:
- query_start_loc: Cumulative token counts per request
- seq_lens: Total sequence length per request
- slot_mapping: Mapping from token positions to KV cache slots
- block_table: Physical KV cache block assignments
- num_decode_tokens: Number of decode phase tokens
- num_prefill_tokens: Number of prefill phase tokens

Backend-specific metadata:
Sources: vllm/v1/worker/gpu_model_runner.py2560-2889
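The slot_mapping parameter can be understood with a small sketch: given a request's block table and the cache block size, each logical token position maps to a physical cache slot. This is a per-token simplification; the real builder computes this over batched GPU tensors:

```python
def slot_for_position(block_table: list[int], block_size: int, pos: int) -> int:
    """Map a logical token position to a physical KV cache slot.
    The block table gives the physical block for each logical block."""
    physical_block = block_table[pos // block_size]
    return physical_block * block_size + pos % block_size

# A request whose logical blocks 0, 1, 2 live in physical blocks 7, 3, 9
# with block_size=16: position 17 falls in logical block 1, offset 1.
slot = slot_for_position([7, 3, 9], 16, 17)  # 3 * 16 + 1 = 49
```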
The model runner supports two execution modes:
CUDA Graph Execution (lines 3142-3261):

- Uses CudagraphDispatcher to select the appropriate graph wrapper

Eager Execution (lines 2924-3140):

- Direct model.forward() call with dynamic shapes

Sources: vllm/v1/worker/gpu_model_runner.py2890-3261
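The choice between the two modes can be sketched as a simple dispatch on padded batch size. This is illustrative; the real CudagraphDispatcher keys on richer batch descriptors than a raw token count:

```python
def select_execution_mode(num_tokens: int, captured_sizes: set[int]) -> str:
    """Pick CUDA graph replay when a graph exists for a padded batch
    size, otherwise fall back to eager execution with dynamic shapes."""
    # Pad up to the smallest captured size that fits this batch, if any.
    candidates = [s for s in captured_sizes if s >= num_tokens]
    return f"cudagraph[{min(candidates)}]" if candidates else "eager"

assert select_execution_mode(13, {8, 16, 32}) == "cudagraph[16]"
assert select_execution_mode(40, {8, 16, 32}) == "eager"
```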
The actual model forward pass:
Return value depends on pipeline stage:

- Last pipeline-parallel rank: hidden states used for logit computation and sampling
- Intermediate pipeline-parallel ranks: IntermediateTensors (hidden states) passed to the next rank

Sources: vllm/v1/worker/gpu_model_runner.py3080-3110
The sample_tokens() method coordinates token generation:
Key responsibilities:
Sources: vllm/v1/worker/gpu_model_runner.py3263-3585
Key state transitions:
New Request (lines 901-933):

- Creates CachedRequestState from NewRequestData
- Adds it to the self.requests dictionary
- Adds it to InputBatch via add_request()

Update Active Request (lines 935-1085):

- Updates block_ids from the scheduler
- Advances num_computed_tokens
- Appends new output_token_ids

Remove Finished Request (lines 889-891):

- Removes the entry from self.requests
- Removes it from InputBatch via remove_request()

Sources: vllm/v1/worker/gpu_model_runner.py878-1177
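These transitions amount to straightforward dictionary bookkeeping. A minimal sketch with hypothetical types (not the real CachedRequestState or its fields' full semantics):

```python
from dataclasses import dataclass, field

@dataclass
class _ReqState:  # simplified stand-in for CachedRequestState
    block_ids: list[int]
    num_computed_tokens: int = 0
    output_token_ids: list[int] = field(default_factory=list)

requests: dict[str, _ReqState] = {}

def add_request(req_id: str, block_ids: list[int]) -> None:
    requests[req_id] = _ReqState(block_ids)

def update_request(req_id: str, new_blocks: list[int],
                   num_computed: int, new_tokens: list[int]) -> None:
    st = requests[req_id]
    st.block_ids.extend(new_blocks)   # blocks newly assigned by scheduler
    st.num_computed_tokens = num_computed
    st.output_token_ids.extend(new_tokens)

def remove_request(req_id: str) -> None:
    requests.pop(req_id, None)        # finished or aborted
```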
The primary runner uses CudagraphDispatcher to manage graph capture and replay at runtime.
capture_model() Flow (Primary Runner)
Sources: vllm/v1/worker/gpu_model_runner.py1850-2194
Modular Runner (CudaGraphManager)

The modular runner delegates graph management to CudaGraphManager in vllm/v1/worker/gpu/cudagraph_utils.py.
CudaGraphManager Structure
CudaGraphManager.capture() runs two phases vllm/v1/worker/gpu/cudagraph_utils.py248-292:
- FULL graphs for pure-decode batches where each request contributes exactly 1 + num_speculative_tokens tokens

The two internal capture functions are:

- _capture_full_graph(): Records a full torch.cuda.CUDAGraph, copying outputs into pre-allocated self.hidden_states buffers vllm/v1/worker/gpu/cudagraph_utils.py170-220
- _capture_piecewise_graph(): Triggers piecewise capture via torch.compile's CUDAGraphWrapper using a BatchDescriptor key vllm/v1/worker/gpu/cudagraph_utils.py222-246

At runtime, get_cudagraph_runtime_mode() decides whether to use FULL, PIECEWISE, or NONE based on whether the batch is a uniform decode batch and whether a graph was captured for the token count vllm/v1/worker/gpu/cudagraph_utils.py294-318
Sources: vllm/v1/worker/gpu/cudagraph_utils.py29-391 vllm/v1/worker/gpu/model_runner.py447-491
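The runtime decision described above can be sketched as follows. This simplifies get_cudagraph_runtime_mode(): the uniform-decode check and the captured-size sets are modeled as plain arguments rather than derived from batch state:

```python
def cudagraph_runtime_mode(num_tokens: int,
                           uniform_decode: bool,
                           full_sizes: set[int],
                           piecewise_sizes: set[int]) -> str:
    """FULL graphs are only valid for uniform decode batches; piecewise
    graphs cover general batches; otherwise run without graphs."""
    if uniform_decode and num_tokens in full_sizes:
        return "FULL"
    if num_tokens in piecewise_sizes:
        return "PIECEWISE"
    return "NONE"

assert cudagraph_runtime_mode(8, True, {8, 16}, {8, 16, 32}) == "FULL"
assert cudagraph_runtime_mode(24, False, {8, 16}, {8, 16, 32}) == "NONE"
```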
The model runner integrates with various speculative decoding strategies:
Drafter initialization (lines 450-485):
Draft token usage in forward pass:
Sources: vllm/v1/worker/gpu_model_runner.py446-490 vllm/v1/spec_decode/
For vision-language and audio models, the model runner manages encoder caching:
Encoder caching benefits:
- Reuses encoder outputs instead of recomputing them; the cache is cleared via reset_encoder_cache()

Multi-modal kwargs construction (lines 2970-3060):
The model runner builds multi_modal_kwargs dict containing:
- pixel_values: Image tensors
- audio_values: Audio tensors
- mm_hash: Hash for caching
- mm_processor_kwargs: Processor-specific arguments

Sources: vllm/v1/worker/gpu_model_runner.py2733-2895 vllm/v1/worker/gpu_model_runner.py732-738
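Assembling these kwargs is essentially dict construction keyed by the modalities present. A sketch using the field names listed above (this is not the actual builder code, and build_mm_kwargs is a hypothetical helper):

```python
def build_mm_kwargs(pixel_values=None, audio_values=None,
                    mm_hash=None, **processor_kwargs):
    """Collect only the modalities that are present for this request."""
    kwargs = {
        "pixel_values": pixel_values,
        "audio_values": audio_values,
        "mm_hash": mm_hash,
        "mm_processor_kwargs": processor_kwargs,
    }
    # Drop absent modalities so the model only sees what was provided.
    return {k: v for k, v in kwargs.items() if v not in (None, {})}

kw = build_mm_kwargs(pixel_values="img_tensor", mm_hash="abc123")
```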
Defined in vllm/v1/worker/gpu_input_batch.py, CachedRequestState stores all per-request state needed for incremental token generation in the primary runner:
| Field | Type | Description |
|---|---|---|
| req_id | str | Unique request identifier |
| prompt_token_ids | list[int] \| None | Original prompt tokens (None if using prompt_embeds) |
| mm_features | list[MultiModalFeatureSpec] | Multi-modal feature specifications |
| sampling_params | SamplingParams \| None | Sampling configuration |
| block_ids | tuple[list[int], ...] | KV cache block assignments per cache group |
| num_computed_tokens | int | Number of tokens already KV-cached |
| output_token_ids | list[int] | Tokens generated so far |
| mrope_positions | torch.Tensor \| None | M-RoPE position IDs (Qwen2-VL etc.) |
| xdrope_positions | torch.Tensor \| None | XD-RoPE position IDs (HunYuan-VL etc.) |
| lora_request | LoRARequest \| None | LoRA adapter if applicable |
| prompt_embeds | torch.Tensor \| None | Direct prompt embeddings |
| pooling_params | PoolingParams \| None | Pooling config for embedding models |
Sources: vllm/v1/worker/gpu_input_batch.py29-78
ExecuteModelState is a NamedTuple that carries ephemeral state between execute_model() (which may return None when using async scheduling) and the subsequent sample_tokens() call:
| Field | Type | Description |
|---|---|---|
| scheduler_output | SchedulerOutput | The scheduler output for this step |
| logits | torch.Tensor | Output logits from the model |
| spec_decode_metadata | SpecDecodeMetadata \| None | Spec decode metadata if active |
| hidden_states | torch.Tensor | Full hidden states from last PP rank |
| sample_hidden_states | torch.Tensor | Hidden states at logit positions |
| aux_hidden_states | list[torch.Tensor] \| None | Auxiliary states for EAGLE3 |
| cudagraph_stats | CUDAGraphStat \| None | CUDA graph execution statistics |
Sources: vllm/v1/worker/gpu_model_runner.py359-373
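The handoff pattern between execute_model() and sample_tokens() can be sketched with a NamedTuple. The fields here are simplified stand-ins; the real ExecuteModelState carries the tensors listed in the table above:

```python
from typing import NamedTuple, Optional

class _ExecState(NamedTuple):  # simplified stand-in for ExecuteModelState
    step_id: int
    logits: list[float]

class Runner:
    """With async scheduling, execute_model() stashes its outputs and
    returns None; a later sample_tokens() call consumes the stash."""
    def __init__(self) -> None:
        self.execute_model_state: Optional[_ExecState] = None

    def execute_model(self, step_id: int) -> None:
        self.execute_model_state = _ExecState(step_id, [0.1, 0.7, 0.2])

    def sample_tokens(self) -> int:
        state, self.execute_model_state = self.execute_model_state, None
        return state.logits.index(max(state.logits))  # greedy pick

r = Runner()
r.execute_model(step_id=0)
token = r.sample_tokens()  # greedy argmax over [0.1, 0.7, 0.2] -> 1
```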
The modular runner uses a plain tuple | None for execute_model_state vllm/v1/worker/gpu/model_runner.py222
The InputBatch class (detailed in page 4.3) maintains persistent CPU+GPU state across steps:
- Request ID mappings (req_id_to_index, _req_ids)
- CPU token buffers (token_ids_cpu, token_ids_cpu_tensor)
- Sampling parameter tensors (temperature, top_p, top_k, penalties)
- MultiGroupBlockTable for KV cache block assignments
- SamplingMetadata (rebuilt when batch composition changes)
- LogitsProcessors

Sources: vllm/v1/worker/gpu_input_batch.py81-269
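A key design point of a persistent batch is keeping per-request rows dense across removals. One common approach, shown here as a sketch rather than the exact InputBatch implementation, is to move the last request into the vacated slot:

```python
class DenseBatch:
    """Keeps per-request rows contiguous by swapping the last row into
    the gap when a request is removed (a sketch of the idea only)."""
    def __init__(self) -> None:
        self.req_ids: list[str] = []
        self.req_id_to_index: dict[str, int] = {}

    def add(self, req_id: str) -> None:
        self.req_id_to_index[req_id] = len(self.req_ids)
        self.req_ids.append(req_id)

    def remove(self, req_id: str) -> None:
        idx = self.req_id_to_index.pop(req_id)
        last = self.req_ids.pop()
        if last != req_id:
            # Move the last request into the vacated row; in a real
            # batch the per-request tensors (token ids, sampling
            # params, block tables, ...) would move with it.
            self.req_ids[idx] = last
            self.req_id_to_index[last] = idx
```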
The modular runner separates the pre-allocated GPU tensors into InputBuffers, a lightweight class holding only the fixed-size device buffers needed per step:
| Buffer | Shape | dtype | Description |
|---|---|---|---|
| input_ids | [max_num_tokens] | int32 | Token IDs for current step |
| positions | [max_num_tokens] | int64 | Position IDs |
| query_start_loc | [max_num_reqs + 1] | int32 | Cumulative token counts per request |
| seq_lens | [max_num_reqs] | int32 | Total sequence length per request |
| dcp_local_seq_lens | [max_num_reqs] | int32 | Per-request local seq lens for DCP |
Sources: vllm/v1/worker/gpu/input_batch.py12-33
RequestState in the modular runner replaces the per-request dict in the primary runner with UVA-backed (Unified Virtual Addressing) tensors for memory efficiency:
Key tensors (using UvaBackedTensor or StagedWriteTensor):
- all_token_ids: [max_num_reqs, max_model_len] int32, stored in UVA to save GPU memory
- num_computed_tokens: [max_num_reqs] int32
- last_sampled_tokens: [max_num_reqs, 1] int64, GPU
- draft_tokens: [max_num_reqs, num_speculative_steps] int64, GPU

Sources: vllm/v1/worker/gpu/states.py9-120
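The staged-write pattern behind these tensors can be illustrated without any GPU code. This is a plain-Python sketch of the idea; UvaBackedTensor and StagedWriteTensor are the real classes, and this is not their API:

```python
class StagedWrites:
    """Accumulate scattered per-row writes, then apply them in one batch.
    Mirrors the pattern of staging updates in host memory (pinned/UVA in
    the real runner) instead of issuing many tiny device copies."""
    def __init__(self, num_rows: int) -> None:
        self.data = [0] * num_rows          # host-side backing store
        self.pending: dict[int, int] = {}   # row -> staged value

    def write(self, row: int, value: int) -> None:
        self.pending[row] = value           # cheap CPU write, no copy yet

    def flush(self) -> None:
        for row, value in self.pending.items():
            self.data[row] = value          # one batched apply per step
        self.pending.clear()
```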
| Optimization | Description | Location |
|---|---|---|
| Pinned memory | CPU tensors use pinned memory for fast GPU transfers | lines 361, 584 |
| Buffer reuse | Pre-allocated persistent buffers avoid dynamic allocation | lines 569-651 |
| CUDA graphs | Eliminate kernel launch overhead for common batch sizes | lines 1850-2194 |
| Encoder caching | Reuse vision/audio embeddings across requests | lines 443, 732-738 |
| In-place operations | Update tensors in-place when safe (e.g., logits) | lines 3080-3110 |
| Optimization | Description | Location |
|---|---|---|
| KV cache sharing | Share KV cache across identical attention layers | lines 644-651, 1629-1680 |
| Cascade attention | Skip unnecessary attention computation for common prefixes | line 400 |
| Microbatching (DBO) | Overlap computation and communication in distributed settings | Handled by executor |
| Expert parallelism | Distribute MoE experts across devices | lines 424-429 |
| Compilation | Torch compile integration for kernel fusion | vllm_config.compilation_config |
Sources: Various locations in vllm/v1/worker/gpu_model_runner.py
The model runner performs extensive validation:
- Asserts block_ids match block_table entries
- Checks num_scheduled_tokens matches actual token counts

Sources: vllm/v1/worker/gpu_model_runner.py3586-3615 assertions throughout the file
The worker is responsible for:
Sources: vllm/v1/worker/gpu_worker.py636-698
The attention backend is responsible for:
Sources: vllm/v1/worker/gpu_model_runner.py2560-2889
The sampler is responsible for: