The GPU Model Runner (GPUModelRunner) is the core execution engine in vLLM's V1 architecture responsible for orchestrating model inference on GPU devices. It coordinates the forward pass, manages KV cache interactions, prepares input tensors, handles attention metadata, and interfaces with sampling mechanisms to generate tokens.
This page focuses on the model runner's architecture and execution flow.
There are two GPUModelRunner implementations in the v1 codebase:
| File | Role |
|---|---|
| vllm/v1/worker/gpu_model_runner.py | Primary runner: large, monolithic, handles all model types and features |
| vllm/v1/worker/gpu/model_runner.py | Modular refactored runner: separates concerns into sub-modules under vllm/v1/worker/gpu/ |
This page covers both, emphasizing the primary runner while noting where the modular runner differs.
Sources: vllm/v1/worker/gpu_model_runner.py1-200 vllm/v1/worker/gpu/model_runner.py1-100
The GPUModelRunner sits between the worker and the actual neural network model, serving as the orchestration layer that transforms scheduled requests into executable model inputs and processes the outputs.
GPUModelRunner in gpu_model_runner.py integrates directly with multiple mixin classes and manages all state in-process.
Component Relationships (Primary Runner)
Sources: vllm/v1/worker/gpu_model_runner.py375-762
The newer vllm/v1/worker/gpu/model_runner.py delegates responsibilities to dedicated sub-modules:
Component Relationships (Modular Runner)
Sources: vllm/v1/worker/gpu/model_runner.py100-225 vllm/v1/worker/gpu/cudagraph_utils.py29-75 vllm/v1/worker/gpu/states.py9-79 vllm/v1/worker/gpu/input_batch.py12-33 vllm/v1/worker/gpu/block_table.py13-65
The GPUModelRunner fulfills the following responsibilities:
| Responsibility | Description | Key Methods |
|---|---|---|
| State Management | Maintains cached request states and persistent batch state | _update_states(), requests: dict[str, CachedRequestState] |
| Input Preparation | Converts scheduler output to model input tensors | _prepare_inputs(), _prepare_input_ids() |
| Attention Metadata | Builds attention metadata for various backends | _prepare_attn_metadata(), initialize_attn_backend() |
| Model Execution | Orchestrates forward pass with proper context | execute_model(), _execute_model_cuda_graph() |
| KV Cache Management | Initializes and binds KV cache to attention layers | initialize_kv_cache(), bind_kv_cache() |
| Token Sampling | Coordinates with sampling layers to generate tokens | sample_tokens(), _run_sampler() |
| CUDA Graphs | Captures and replays CUDA graphs for performance | capture_model(), cudagraph_dispatcher |
| Multi-modal Processing | Handles vision/audio encoder inputs | _process_encoder_inputs(), encoder_cache |
| Speculative Decoding | Supports draft model and verification | drafter, rejection_sampler |
Sources: vllm/v1/worker/gpu_model_runner.py329-716
GPUModelRunner.__init__() in the primary runner initializes a large set of pre-allocated buffers and sub-components at startup:
__init__ Initialization Sequence
Key initialization steps:
- Stores vllm_config vllm/v1/worker/gpu_model_runner.py383-398
- Creates InputBatch for persistent batch state vllm/v1/worker/gpu_model_runner.py557-581
- Creates CudagraphDispatcher for runtime graph dispatch vllm/v1/worker/gpu_model_runner.py697

The modular runner (gpu/model_runner.py) delegates these responsibilities:

- InputBuffers vllm/v1/worker/gpu/input_batch.py12-33
- RequestState vllm/v1/worker/gpu/states.py9-79
- CudaGraphManager vllm/v1/worker/gpu/cudagraph_utils.py29-75
- BlockTables vllm/v1/worker/gpu/block_table.py13-65

Sources: vllm/v1/worker/gpu_model_runner.py375-762 vllm/v1/worker/gpu/model_runner.py100-225
load_model and initialize_kv_cache Sequence
Key methods:
load_model(): Loads model weights using get_model_loader(), detects model architecture features (is_mixture_of_experts, supports_mrope, supports_multimodal_pruning), and initializes EPLB state if expert parallelism is configured.
initialize_kv_cache(kv_cache_config: KVCacheConfig):
- Allocates cache tensors per KVCacheGroupSpec
- Calls initialize_attn_backend() to set up per-group attention backends
- Calls bind_kv_cache() to attach cache tensors to attention layer modules
- Handles cross-layer cache sharing (shared_kv_cache_layers) and EncoderOnlyAttentionSpec

In the modular runner, initialize_kv_cache() additionally creates BlockTables and sets up the KVConnector vllm/v1/worker/gpu/model_runner.py287-326
Sources: vllm/v1/worker/gpu_model_runner.py1226-1369 vllm/v1/worker/gpu/model_runner.py236-326
The execute_model() method is the primary entry point for processing a batch of requests:
Sources: vllm/v1/worker/gpu_model_runner.py1958-2194
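At a high level, the steps execute_model() orchestrates can be sketched in plain Python. This is illustrative only; the helper names (update_states, prepare_inputs, run_model, compute_logits) are hypothetical stand-ins for the real methods, not the actual vLLM API:

```python
# Illustrative sketch of the execute_model() orchestration flow.
# All helper names are hypothetical stand-ins for the real methods.

def execute_model_sketch(runner, scheduler_output):
    # 1. Sync internal request state with the scheduler's decisions
    runner.update_states(scheduler_output)
    if scheduler_output.total_num_scheduled_tokens == 0:
        return None  # nothing scheduled this step
    # 2. Build input tensors and attention metadata
    inputs = runner.prepare_inputs(scheduler_output)
    # 3. Run the forward pass (CUDA graph replay or eager)
    hidden_states = runner.run_model(inputs)
    # 4. Compute logits at the sampling positions
    return runner.compute_logits(hidden_states, inputs)
```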
The input preparation pipeline transforms SchedulerOutput into model-ready tensors:
The _update_states() method synchronizes the model runner's internal state with the scheduler's decisions:
Key data structure: CachedRequestState
Each request has a cached state containing:
- req_id: Unique request identifier
- prompt_token_ids: Original prompt tokens
- block_ids: KV cache block assignments (tuple of lists for multi-group)
- num_computed_tokens: Number of tokens already computed
- output_token_ids: Generated tokens so far
- sampling_params or pooling_params: Sampling/pooling configuration
- mm_features: Multi-modal features if applicable
- lora_request: LoRA adapter if applicable

Sources: vllm/v1/worker/gpu_model_runner.py878-1177 vllm/v1/worker/gpu_input_batch.py29-78
The _prepare_inputs() method builds GPU tensors from the updated batch state:
Input tensor construction:
Input IDs and Positions (lines 2196-2336):

- Copies input_batch.token_ids_cpu to the GPU
- Handles prompt_embeds and is_token_ids cases

Sequence Lengths and Query Start Locations (lines 2337-2400):

- Derives seq_lens from num_computed_tokens
- Builds query_start_loc as cumulative sum of scheduled tokens

Block Tables (lines 2401-2450):

- Populates input_batch.block_table from request states

Attention Metadata (lines 2451-2550):

- Calls _prepare_attn_metadata() for each attention group

Sources: vllm/v1/worker/gpu_model_runner.py2196-2550
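The seq_lens and query_start_loc construction described above can be illustrated with plain Python. This is a simplified sketch of the logic, not the actual tensor code in _prepare_inputs():

```python
from itertools import accumulate

def build_query_start_loc(num_scheduled_tokens: list[int]) -> list[int]:
    """query_start_loc[i] is where request i's tokens begin in the
    flattened token batch; the final entry is the total token count."""
    return [0, *accumulate(num_scheduled_tokens)]

def build_seq_lens(num_computed_tokens: list[int],
                   num_scheduled_tokens: list[int]) -> list[int]:
    """Total visible sequence length per request this step: tokens
    already in the KV cache plus tokens scheduled now."""
    return [c + s for c, s in zip(num_computed_tokens, num_scheduled_tokens)]

# Two prefill requests (5 and 3 new tokens) and one decode request
# (1 token, 12 tokens already cached):
qsl = build_query_start_loc([5, 3, 1])        # [0, 5, 8, 9]
lens = build_seq_lens([0, 0, 12], [5, 3, 1])  # [5, 3, 13]
```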
Attention metadata encapsulates all information needed by attention kernels:
Key parameters passed to metadata builder:
- query_start_loc: Cumulative token counts per request
- seq_lens: Total sequence length per request
- slot_mapping: Mapping from token positions to KV cache slots
- block_table: Physical KV cache block assignments
- num_decode_tokens: Number of decode phase tokens
- num_prefill_tokens: Number of prefill phase tokens

Backend-specific metadata:
Sources: vllm/v1/worker/gpu_model_runner.py2560-2889
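The slot_mapping parameter can be understood with a small sketch: given a request's block table and the cache block size, each logical token position maps to a physical cache slot. This is a per-token simplification; the real builder computes this over batched GPU tensors:

```python
def slot_for_position(block_table: list[int], block_size: int, pos: int) -> int:
    """Map a logical token position to a physical KV cache slot.
    The block table gives the physical block for each logical block."""
    physical_block = block_table[pos // block_size]
    return physical_block * block_size + pos % block_size

# A request whose logical blocks 0, 1, 2 live in physical blocks 7, 3, 9
# with block_size=16: position 17 falls in logical block 1, offset 1.
slot = slot_for_position([7, 3, 9], 16, 17)  # 3 * 16 + 1 = 49
```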
The model runner supports two execution modes:
CUDA Graph Execution (lines 3142-3261):

- Uses CudagraphDispatcher to select the appropriate graph wrapper

Eager Execution (lines 2924-3140):

- Direct model.forward() call with dynamic shapes

Sources: vllm/v1/worker/gpu_model_runner.py2890-3261
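The choice between the two modes can be sketched as a simple dispatch on padded batch size. This is illustrative; the real CudagraphDispatcher keys on richer batch descriptors than a raw token count:

```python
def select_execution_mode(num_tokens: int, captured_sizes: set[int]) -> str:
    """Pick CUDA graph replay when a graph exists for a padded batch
    size, otherwise fall back to eager execution with dynamic shapes."""
    # Pad up to the smallest captured size that fits this batch, if any.
    candidates = [s for s in captured_sizes if s >= num_tokens]
    return f"cudagraph[{min(candidates)}]" if candidates else "eager"

assert select_execution_mode(13, {8, 16, 32}) == "cudagraph[16]"
assert select_execution_mode(40, {8, 16, 32}) == "eager"
```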
The actual model forward pass:
Return value depends on pipeline stage:

- Last pipeline-parallel rank: hidden states used for logit computation and sampling
- Intermediate pipeline-parallel ranks: IntermediateTensors (hidden states) passed to the next rank

Sources: vllm/v1/worker/gpu_model_runner.py3080-3110
The sample_tokens() method coordinates token generation:
Key responsibilities:
Sources: vllm/v1/worker/gpu_model_runner.py3263-3585
Key state transitions:
New Request (lines 901-933):

- Creates CachedRequestState from NewRequestData
- Adds it to the self.requests dictionary
- Adds it to InputBatch via add_request()

Update Active Request (lines 935-1085):

- Updates block_ids from the scheduler
- Advances num_computed_tokens
- Appends new output_token_ids

Remove Finished Request (lines 889-891):

- Removes the entry from self.requests
- Removes it from InputBatch via remove_request()

Sources: vllm/v1/worker/gpu_model_runner.py878-1177
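These transitions amount to straightforward dictionary bookkeeping. A minimal sketch with hypothetical types (not the real CachedRequestState or its fields' full semantics):

```python
from dataclasses import dataclass, field

@dataclass
class _ReqState:  # simplified stand-in for CachedRequestState
    block_ids: list[int]
    num_computed_tokens: int = 0
    output_token_ids: list[int] = field(default_factory=list)

requests: dict[str, _ReqState] = {}

def add_request(req_id: str, block_ids: list[int]) -> None:
    requests[req_id] = _ReqState(block_ids)

def update_request(req_id: str, new_blocks: list[int],
                   num_computed: int, new_tokens: list[int]) -> None:
    st = requests[req_id]
    st.block_ids.extend(new_blocks)   # blocks newly assigned by scheduler
    st.num_computed_tokens = num_computed
    st.output_token_ids.extend(new_tokens)

def remove_request(req_id: str) -> None:
    requests.pop(req_id, None)        # finished or aborted
```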
The primary runner uses CudagraphDispatcher to manage graph capture and replay at runtime.
capture_model() Flow (Primary Runner)
Sources: vllm/v1/worker/gpu_model_runner.py1850-2194
Modular Runner (CudaGraphManager)

The modular runner delegates graph management to CudaGraphManager in vllm/v1/worker/gpu/cudagraph_utils.py.
CudaGraphManager Structure
CudaGraphManager.capture() runs two phases vllm/v1/worker/gpu/cudagraph_utils.py248-292:
- FULL graphs for pure-decode batches where each request contributes exactly 1 + num_speculative_tokens tokens

The two internal capture functions are:

- _capture_full_graph(): Records a full torch.cuda.CUDAGraph, copying outputs into pre-allocated self.hidden_states buffers vllm/v1/worker/gpu/cudagraph_utils.py170-220
- _capture_piecewise_graph(): Triggers piecewise capture via torch.compile's CUDAGraphWrapper using a BatchDescriptor key vllm/v1/worker/gpu/cudagraph_utils.py222-246

At runtime, get_cudagraph_runtime_mode() decides whether to use FULL, PIECEWISE, or NONE based on whether the batch is a uniform decode batch and whether a graph was captured for the token count vllm/v1/worker/gpu/cudagraph_utils.py294-318
Sources: vllm/v1/worker/gpu/cudagraph_utils.py29-391 vllm/v1/worker/gpu/model_runner.py447-491
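The runtime decision described above can be sketched as follows. This simplifies get_cudagraph_runtime_mode(): the uniform-decode check and the captured-size sets are modeled as plain arguments rather than derived from batch state:

```python
def cudagraph_runtime_mode(num_tokens: int,
                           uniform_decode: bool,
                           full_sizes: set[int],
                           piecewise_sizes: set[int]) -> str:
    """FULL graphs are only valid for uniform decode batches; piecewise
    graphs cover general batches; otherwise run without graphs."""
    if uniform_decode and num_tokens in full_sizes:
        return "FULL"
    if num_tokens in piecewise_sizes:
        return "PIECEWISE"
    return "NONE"

assert cudagraph_runtime_mode(8, True, {8, 16}, {8, 16, 32}) == "FULL"
assert cudagraph_runtime_mode(24, False, {8, 16}, {8, 16, 32}) == "NONE"
```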
The model runner integrates with various speculative decoding strategies:
Drafter initialization (lines 450-485):
Draft token usage in forward pass:
Sources: vllm/v1/worker/gpu_model_runner.py446-490 vllm/v1/spec_decode/
For vision-language and audio models, the model runner manages encoder caching:
Encoder caching benefits:
- Reuses encoder outputs instead of recomputing them; the cache is cleared via reset_encoder_cache()

Multi-modal kwargs construction (lines 2970-3060):
The model runner builds multi_modal_kwargs dict containing:
- pixel_values: Image tensors
- audio_values: Audio tensors
- mm_hash: Hash for caching
- mm_processor_kwargs: Processor-specific arguments

Sources: vllm/v1/worker/gpu_model_runner.py2733-2895 vllm/v1/worker/gpu_model_runner.py732-738
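Assembling these kwargs is essentially dict construction keyed by the modalities present. A sketch using the field names listed above (this is not the actual builder code, and build_mm_kwargs is a hypothetical helper):

```python
def build_mm_kwargs(pixel_values=None, audio_values=None,
                    mm_hash=None, **processor_kwargs):
    """Collect only the modalities that are present for this request."""
    kwargs = {
        "pixel_values": pixel_values,
        "audio_values": audio_values,
        "mm_hash": mm_hash,
        "mm_processor_kwargs": processor_kwargs,
    }
    # Drop absent modalities so the model only sees what was provided.
    return {k: v for k, v in kwargs.items() if v not in (None, {})}

kw = build_mm_kwargs(pixel_values="img_tensor", mm_hash="abc123")
```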
Defined in vllm/v1/worker/gpu_input_batch.py, CachedRequestState stores all per-request state needed for incremental token generation in the primary runner:
| Field | Type | Description |
|---|---|---|
| req_id | str | Unique request identifier |
| prompt_token_ids | list[int] \| None | Original prompt tokens (None if using prompt_embeds) |
| mm_features | list[MultiModalFeatureSpec] | Multi-modal feature specifications |
| sampling_params | SamplingParams \| None | Sampling configuration |
| block_ids | tuple[list[int], ...] | KV cache block assignments per cache group |
| num_computed_tokens | int | Number of tokens already KV-cached |
| output_token_ids | list[int] | Tokens generated so far |
| mrope_positions | torch.Tensor \| None | M-RoPE position IDs (Qwen2-VL etc.) |
| xdrope_positions | torch.Tensor \| None | XD-RoPE position IDs (HunYuan-VL etc.) |
| lora_request | LoRARequest \| None | LoRA adapter if applicable |
| prompt_embeds | torch.Tensor \| None | Direct prompt embeddings |
| pooling_params | PoolingParams \| None | Pooling config for embedding models |
Sources: vllm/v1/worker/gpu_input_batch.py29-78
ExecuteModelState is a NamedTuple that carries ephemeral state between execute_model() (which may return None when using async scheduling) and the subsequent sample_tokens() call:
| Field | Type | Description |
|---|---|---|
| scheduler_output | SchedulerOutput | The scheduler output for this step |
| logits | torch.Tensor | Output logits from the model |
| spec_decode_metadata | SpecDecodeMetadata \| None | Spec decode metadata if active |
| hidden_states | torch.Tensor | Full hidden states from last PP rank |
| sample_hidden_states | torch.Tensor | Hidden states at logit positions |
| aux_hidden_states | list[torch.Tensor] \| None | Auxiliary states for EAGLE3 |
| cudagraph_stats | CUDAGraphStat \| None | CUDA graph execution statistics |
Sources: vllm/v1/worker/gpu_model_runner.py359-373
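The handoff pattern between execute_model() and sample_tokens() can be sketched with a NamedTuple. The fields here are simplified stand-ins; the real ExecuteModelState carries the tensors listed in the table above:

```python
from typing import NamedTuple, Optional

class _ExecState(NamedTuple):  # simplified stand-in for ExecuteModelState
    step_id: int
    logits: list[float]

class Runner:
    """With async scheduling, execute_model() stashes its outputs and
    returns None; a later sample_tokens() call consumes the stash."""
    def __init__(self) -> None:
        self.execute_model_state: Optional[_ExecState] = None

    def execute_model(self, step_id: int) -> None:
        self.execute_model_state = _ExecState(step_id, [0.1, 0.7, 0.2])

    def sample_tokens(self) -> int:
        state, self.execute_model_state = self.execute_model_state, None
        return state.logits.index(max(state.logits))  # greedy pick

r = Runner()
r.execute_model(step_id=0)
token = r.sample_tokens()  # greedy argmax over [0.1, 0.7, 0.2] -> 1
```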
The modular runner uses a plain tuple | None for execute_model_state vllm/v1/worker/gpu/model_runner.py222
The InputBatch class (detailed in page 4.3) maintains persistent CPU+GPU state across steps:
- Request ID mappings (req_id_to_index, _req_ids)
- CPU token buffers (token_ids_cpu, token_ids_cpu_tensor)
- Sampling parameter tensors (temperature, top_p, top_k, penalties)
- MultiGroupBlockTable for KV cache block assignments
- SamplingMetadata (rebuilt when batch composition changes)
- LogitsProcessors

Sources: vllm/v1/worker/gpu_input_batch.py81-269
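A key design point of a persistent batch is keeping per-request rows dense across removals. One common approach, shown here as a sketch rather than the exact InputBatch implementation, is to move the last request into the vacated slot:

```python
class DenseBatch:
    """Keeps per-request rows contiguous by swapping the last row into
    the gap when a request is removed (a sketch of the idea only)."""
    def __init__(self) -> None:
        self.req_ids: list[str] = []
        self.req_id_to_index: dict[str, int] = {}

    def add(self, req_id: str) -> None:
        self.req_id_to_index[req_id] = len(self.req_ids)
        self.req_ids.append(req_id)

    def remove(self, req_id: str) -> None:
        idx = self.req_id_to_index.pop(req_id)
        last = self.req_ids.pop()
        if last != req_id:
            # Move the last request into the vacated row; in a real
            # batch the per-request tensors (token ids, sampling
            # params, block tables, ...) would move with it.
            self.req_ids[idx] = last
            self.req_id_to_index[last] = idx
```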
The modular runner separates the pre-allocated GPU tensors into InputBuffers, a lightweight class holding only the fixed-size device buffers needed per step:
| Buffer | Shape | dtype | Description |
|---|---|---|---|
| input_ids | [max_num_tokens] | int32 | Token IDs for current step |
| positions | [max_num_tokens] | int64 | Position IDs |
| query_start_loc | [max_num_reqs + 1] | int32 | Cumulative token counts per request |
| seq_lens | [max_num_reqs] | int32 | Total sequence length per request |
| dcp_local_seq_lens | [max_num_reqs] | int32 | Per-request local seq lens for DCP |
Sources: vllm/v1/worker/gpu/input_batch.py12-33
RequestState in the modular runner replaces the per-request dict in the primary runner with UVA-backed (Unified Virtual Addressing) tensors for memory efficiency:
Key tensors (using UvaBackedTensor or StagedWriteTensor):
- all_token_ids: [max_num_reqs, max_model_len] int32, stored in UVA to save GPU memory
- num_computed_tokens: [max_num_reqs] int32
- last_sampled_tokens: [max_num_reqs, 1] int64, GPU
- draft_tokens: [max_num_reqs, num_speculative_steps] int64, GPU

Sources: vllm/v1/worker/gpu/states.py9-120
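The staged-write pattern behind these tensors can be illustrated without any GPU code. This is a plain-Python sketch of the idea; UvaBackedTensor and StagedWriteTensor are the real classes, and this is not their API:

```python
class StagedWrites:
    """Accumulate scattered per-row writes, then apply them in one batch.
    Mirrors the pattern of staging updates in host memory (pinned/UVA in
    the real runner) instead of issuing many tiny device copies."""
    def __init__(self, num_rows: int) -> None:
        self.data = [0] * num_rows          # host-side backing store
        self.pending: dict[int, int] = {}   # row -> staged value

    def write(self, row: int, value: int) -> None:
        self.pending[row] = value           # cheap CPU write, no copy yet

    def flush(self) -> None:
        for row, value in self.pending.items():
            self.data[row] = value          # one batched apply per step
        self.pending.clear()
```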
| Optimization | Description | Location |
|---|---|---|
| Pinned memory | CPU tensors use pinned memory for fast GPU transfers | lines 361, 584 |
| Buffer reuse | Pre-allocated persistent buffers avoid dynamic allocation | lines 569-651 |
| CUDA graphs | Eliminate kernel launch overhead for common batch sizes | lines 1850-2194 |
| Encoder caching | Reuse vision/audio embeddings across requests | lines 443, 732-738 |
| In-place operations | Update tensors in-place when safe (e.g., logits) | lines 3080-3110 |
| Optimization | Description | Location |
|---|---|---|
| KV cache sharing | Share KV cache across identical attention layers | lines 644-651, 1629-1680 |
| Cascade attention | Skip unnecessary attention computation for common prefixes | line 400 |
| Microbatching (DBO) | Overlap computation and communication in distributed settings | Handled by executor |
| Expert parallelism | Distribute MoE experts across devices | lines 424-429 |
| Compilation | Torch compile integration for kernel fusion | vllm_config.compilation_config |
Sources: Various locations in vllm/v1/worker/gpu_model_runner.py
The model runner performs extensive validation:
- Asserts block_ids match block_table entries
- Checks num_scheduled_tokens matches actual token counts

Sources: vllm/v1/worker/gpu_model_runner.py3586-3615 assertions throughout the file
The worker is responsible for:
Sources: vllm/v1/worker/gpu_worker.py636-698
The attention backend is responsible for:
Sources: vllm/v1/worker/gpu_model_runner.py2560-2889
The sampler is responsible for: