This page describes how vLLM loads models onto GPU devices and executes inference. It covers the executor → worker → model runner hierarchy, how requests are batched into GPU tensors, how the model forward pass runs, how tokens are sampled, and how speculative decoding improves throughput.
For in-depth coverage of each component, see the child pages:
- GPUModelRunner, CUDA graph capture, encoder caching, forward pass internals
- MultiprocExecutor, RayDistributedExecutor, worker spawning, RPC communication
- InputBatch, CachedRequestState, block table packing
- Sampler, penalties, top-k/top-p, logprob computation
- RejectionSampler

For the scheduler that produces the SchedulerOutput consumed here, see Scheduler and Resource Allocation. For KV cache block management, see KV Cache Management. For attention computation, see Attention Backends.
Model execution is structured in three nested layers:
| Layer | Primary Class | File | Responsibility |
|---|---|---|---|
| Executor | Executor (ABC) | vllm/v1/executor/abstract.py | RPC dispatch; worker lifecycle orchestration |
| Worker | Worker (extends WorkerBase) | vllm/v1/worker/gpu_worker.py | Owns one GPU; manages weights, KV cache, and execution |
| Model Runner | GPUModelRunner | vllm/v1/worker/gpu_model_runner.py | All PyTorch operations: input prep, forward pass, sampling |
On each inference step, EngineCore calls Executor.execute_model(scheduler_output). The executor fans this call out via RPC to all workers. Each worker delegates to its GPUModelRunner, which returns a ModelRunnerOutput.
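The fan-out pattern described above can be sketched as follows. This is a simplified in-process stand-in, not vLLM's implementation: the class names mirror vLLM's, but the worker here is a plain object rather than a separate process.

```python
# Hypothetical sketch of the executor → worker fan-out. In vLLM the RPC
# crosses process boundaries; here it is a direct method call.

class Worker:
    def __init__(self, rank: int):
        self.rank = rank

    def execute_model(self, scheduler_output: dict) -> dict:
        # A real worker would delegate to its GPUModelRunner here.
        return {"rank": self.rank, "tokens": [scheduler_output["step"]]}

class Executor:
    def __init__(self, num_workers: int):
        self.workers = [Worker(r) for r in range(num_workers)]

    def collective_rpc(self, method: str, *args) -> list:
        # Broadcast the method call to every worker, collect all results.
        return [getattr(w, method)(*args) for w in self.workers]

    def execute_model(self, scheduler_output: dict) -> dict:
        outputs = self.collective_rpc("execute_model", scheduler_output)
        # The engine only needs one output (driver rank's).
        return outputs[0]

executor = Executor(num_workers=2)
out = executor.execute_model({"step": 7})
```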
Diagram: Component Hierarchy with Code Entities
Sources: vllm/v1/executor/abstract.py36-86 vllm/v1/executor/multiproc_executor.py95-225 vllm/v1/worker/gpu_worker.py102-150 vllm/v1/worker/gpu_model_runner.py375-530
Executor.get_class() selects the executor implementation from parallel_config.distributed_executor_backend:
| Backend Value | Executor Class | When Used |
|---|---|---|
"mp" | MultiprocExecutor | Multi-GPU on one or more nodes via Python multiprocessing |
"ray" | RayDistributedExecutor | Multi-node via Ray actors |
"uni" | UniProcExecutor | Single GPU, in-process (no subprocess) |
"external_launcher" | ExecutorWithExternalLauncher | torchrun-style external process management |
All executors expose the same interface:
| Method | Description |
|---|---|
| initialize_from_config(kv_cache_configs) | Allocates KV cache and compiles/warms up the model |
| execute_model(scheduler_output) | Runs one forward pass step |
| sample_tokens(grammar_output) | Separate sampling step used with async scheduling |
| collective_rpc(method, args) | Broadcasts a method call to all workers and collects results |
MultiprocExecutor creates one WorkerProc per GPU, each in its own OS process. SchedulerOutput is serialized once and broadcast via MessageQueue (a shared-memory ring buffer). Each worker sends its ModelRunnerOutput back via a dedicated response MessageQueue. A background thread monitors worker process liveness and triggers a FailureCallback if any worker dies.
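The serialize-once broadcast pattern can be sketched with an in-process ring buffer. vLLM's MessageQueue is backed by shared memory across processes; this stand-in only illustrates the idea that the payload is serialized a single time and every reader deserializes the same bytes.

```python
# Illustrative single-producer ring buffer. The slot count and API are
# made up for this sketch; only the serialize-once idea matches the text.
import pickle

class RingBuffer:
    def __init__(self, slots: int):
        self.buf = [None] * slots
        self.head = 0  # next write position

    def broadcast(self, obj) -> None:
        # Serialize once; every reader sees the same byte string.
        self.buf[self.head % len(self.buf)] = pickle.dumps(obj)
        self.head += 1

    def read(self, idx: int):
        return pickle.loads(self.buf[idx % len(self.buf)])

mq = RingBuffer(slots=4)
mq.broadcast({"req_ids": ["a", "b"]})
# Two "workers" read the same serialized SchedulerOutput.
worker_views = [mq.read(0) for _ in range(2)]
```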
Sources: vllm/v1/executor/abstract.py36-225 vllm/v1/executor/multiproc_executor.py95-466
Worker (extends WorkerBase) manages one GPU device identified by local_rank. It is constructed inside WorkerProc.__init__ after the process is spawned.
Diagram: Worker Initialization Sequence
| Method | File Location | Purpose |
|---|---|---|
| init_device() | gpu_worker.py219-315 | Set CUDA device, init distributed groups, create GPUModelRunner |
| load_model() | gpu_worker.py319-342 | Load weights (supports sleep/wake memory pool) |
| determine_available_memory() | gpu_worker.py350-429 | Profile-run and compute KV cache budget |
| initialize_from_config() | gpu_worker.py462-481 | Allocate KV cache, init KV transfer connector |
| compile_or_warm_up_model() | gpu_worker.py482-608 | Warm up sizes, capture CUDA graphs |
| execute_model() | gpu_worker.py658-759 | Forward pass; handles PP tensor sends/receives |
| sample_tokens() | gpu_worker.py652-656 | Async-mode separate sampling call |
Worker.execute_model() also handles pipeline-parallel communication. Non-first-rank workers receive IntermediateTensors from the previous pipeline stage via get_pp_group().irecv_tensor_dict(). Non-last-rank workers send their output forward. Only the last pipeline stage returns a ModelRunnerOutput.
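The pipeline-parallel control flow just described can be sketched as below. The recv/send callables are stubs standing in for the get_pp_group() tensor operations; the function itself is a simplified stand-in for Worker.execute_model(), not vLLM's code.

```python
# Sketch of pipeline-parallel rank handling: receive from the previous
# stage unless first, send forward unless last, and only the last stage
# returns a final output.

def execute_model_pp(rank, world_size, scheduler_output, recv, send, run_model):
    # Non-first ranks receive intermediate tensors from the previous stage.
    intermediate = recv() if rank > 0 else None
    output = run_model(scheduler_output, intermediate)
    if rank < world_size - 1:
        send(output)   # forward activations to the next stage
        return None    # only the last stage returns a ModelRunnerOutput
    return output
```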
Sources: vllm/v1/worker/gpu_worker.py102-759 vllm/v1/worker/worker_base.py34-100
GPUModelRunner in vllm/v1/worker/gpu_model_runner.py375 performs all computation within a single GPU. It is instantiated once per worker and owns:
nn.Module (set after load_model)kv_caches: list[torch.Tensor] — KV cache tensors allocated after initialize_kv_cacherequests: dict[str, CachedRequestState] — per-request cached stateinput_batch: InputBatch — persistent GPU tensor buffers for the active batchinput_ids, positions, seq_lens, query_start_loc, etc.sampler: Sampler — token sampling moduledrafter — speculative decoding proposer (if configured)rejection_sampler: RejectionSampler — draft token verifier (if speculative decoding)Diagram: Per-Step Execution — Method Call Chain
When async scheduling is enabled, execute_model stores its output in execute_model_state: ExecuteModelState and returns None. The subsequent sample_tokens call retrieves the cached logits, hidden_states, and spec_decode_metadata from that field to complete sampling. This allows the engine to overlap input preparation for the next step with GPU→CPU output copying.
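The two-phase execute/sample split can be sketched as follows. The ExecuteModelState dataclass here is a simplified stand-in for the field of the same name in the text; the placeholder strings take the place of real logits and hidden states.

```python
# Sketch of deferred sampling under async scheduling: execute_model()
# caches its intermediate results and returns None; sample_tokens()
# consumes the cached state later.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ExecuteModelState:
    logits: Any
    hidden_states: Any
    spec_decode_metadata: Any = None

class Runner:
    def __init__(self):
        self.execute_model_state: Optional[ExecuteModelState] = None

    def execute_model(self, scheduler_output) -> None:
        logits = f"logits[{scheduler_output}]"   # placeholder for forward pass
        self.execute_model_state = ExecuteModelState(logits, "hidden")
        return None  # sampling is deferred

    def sample_tokens(self) -> str:
        state, self.execute_model_state = self.execute_model_state, None
        return f"sampled from {state.logits}"
```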
compile_or_warm_up_model() calls GPUModelRunner.capture_model(). This records CUDA graphs for each batch size listed in compilation_config.cudagraph_capture_sizes. During inference, decode-phase batches (all requests generating tokens) execute via graph replay. Prefill-phase batches always run eagerly to handle variable sequence lengths.
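Replaying a captured graph requires padding the decode batch up to the nearest captured size. The helper below sketches that size selection; the capture sizes listed are illustrative, not vLLM's defaults.

```python
# Conceptual sketch of capture-size selection for CUDA graph replay:
# a decode batch is padded up to the smallest captured size that fits,
# and falls back to eager execution if none fits (as prefill always does).
import bisect

cudagraph_capture_sizes = [1, 2, 4, 8, 16, 32]  # illustrative sizes

def pick_graph_size(num_decode_tokens: int):
    i = bisect.bisect_left(cudagraph_capture_sizes, num_decode_tokens)
    if i == len(cudagraph_capture_sizes):
        return None  # batch too large: run eagerly
    return cudagraph_capture_sizes[i]
```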
Sources: vllm/v1/worker/gpu_model_runner.py375-760 vllm/v1/worker/gpu_model_runner.py359-373
InputBatch (vllm/v1/worker/gpu_input_batch.py81) maintains persistent CPU and GPU tensors for all active requests, updated incrementally each step rather than rebuilt from scratch. This avoids re-allocation overhead.
CachedRequestState (vllm/v1/worker/gpu_input_batch.py30) is the per-request state object stored in GPUModelRunner.requests. It carries prompt token IDs, generated token IDs, allocated block IDs, sampling parameters, MRoPE position state, and LoRA request info.
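A simplified dataclass gives the flavor of this state object. The fields below follow the prose above but are a reduced, illustrative subset; the num_tokens property is an assumption about a convenient derived quantity, not a documented API.

```python
# Illustrative, reduced stand-in for CachedRequestState.
from dataclasses import dataclass, field

@dataclass
class CachedRequestState:
    req_id: str
    prompt_token_ids: list
    output_token_ids: list = field(default_factory=list)
    block_ids: list = field(default_factory=list)      # allocated KV blocks
    sampling_params: dict = field(default_factory=dict)

    @property
    def num_tokens(self) -> int:
        # Total tokens this request has produced so far (prompt + generated).
        return len(self.prompt_token_ids) + len(self.output_token_ids)
```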
| Field | Shape / Type | Description |
|---|---|---|
token_ids_cpu | np.ndarray (max_num_reqs, max_model_len) | All token IDs per request (prompt + generated + spec decode) |
is_token_ids | np.ndarray (max_num_reqs, max_model_len) | Mask: True where token IDs exist vs. prompt embeddings |
num_computed_tokens_cpu | np.ndarray (max_num_reqs,) | KV-cached token count per request |
block_table | MultiGroupBlockTable | KV cache physical block index mappings |
temperature, top_p, top_k | torch.Tensor (max_num_reqs,) | Sampling parameter tensors |
frequency_penalties, presence_penalties, repetition_penalties | torch.Tensor (max_num_reqs,) | Token penalty tensors |
generators | dict[int, torch.Generator] | Per-request RNG states |
allowed_token_ids_mask | torch.Tensor (max_num_reqs, vocab_size) | Token allowlist bitmask |
sampling_metadata | SamplingMetadata | Compiled view of all sampling params for Sampler |
logitsprocs | LogitsProcessors | Active logits processor instances |
Key methods on InputBatch:
- add_request(CachedRequestState) — registers a new request and fills its row in all tensors
- remove_request(req_id) — marks a slot as removed
- condense() — compacts the batch by sliding active entries into freed slots
- swap_states(i1, i2) — reorders two request slots (used for attention backend requirements)

Sources: vllm/v1/worker/gpu_input_batch.py29-441 vllm/v1/worker/block_table.py16-200
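The compaction idea behind condense() can be sketched on a plain list, with None marking removed slots. This is a simplified stand-in operating on request IDs rather than the real per-row tensor moves.

```python
# Sketch of batch compaction: slide the rightmost active entries into
# freed slots so that active requests occupy a dense prefix.

def condense(slots: list) -> list:
    empty = [i for i, s in enumerate(slots) if s is None]
    last = len(slots) - 1
    for i in empty:
        # Find the rightmost still-active slot.
        while last > i and slots[last] is None:
            last -= 1
        if last <= i:
            break
        slots[i], slots[last] = slots[last], None
    # Trim trailing empty slots.
    while slots and slots[-1] is None:
        slots.pop()
    return slots
```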
After the model forward pass, Sampler (vllm/v1/sample/sampler.py20) converts raw logits into sampled token IDs. It receives a SamplingMetadata object aggregated from InputBatch.
SamplingMetadata (vllm/v1/sample/metadata.py12) packages the per-request sampling state:
| Field | Description |
|---|---|
temperature | Per-request temperature tensor (None if all greedy) |
all_greedy / all_random | Batch-level fast-path flags |
top_p, top_k | Nucleus sampling and top-k tensors |
output_token_ids | Previously generated tokens (for penalty application) |
prompt_token_ids | Prompt token IDs (for repetition penalty) |
generators | Per-request torch.Generator states |
logitsprocs | LogitsProcessors split into argmax-invariant and non-argmax-invariant |
allowed_token_ids_mask | Allowlist mask (non-None only when some requests use it) |
bad_words_token_ids | Per-request banned token sequences |
max_num_logprobs | Max logprob count across the batch (None if none requested) |
Diagram: Sampler.forward() Processing Steps
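The core sampling steps can be sketched in pure Python: temperature scaling, top-k masking, softmax, then sampling, with a greedy fast path when temperature is zero (the all_greedy flag above). This is a scalar stand-in for the batched GPU Sampler, and the seeded RNG mirrors the per-request generators field only loosely.

```python
# Simplified sampling pipeline sketch (single request, pure Python).
import math
import random

def sample(logits, temperature=1.0, top_k=0, seed=0):
    if temperature == 0.0:
        # all_greedy fast path: plain argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    if top_k > 0:
        # Mask everything below the k-th largest logit.
        kth = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= kth else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Seeded per-request RNG, loosely mirroring the generators dict.
    return random.Random(seed).choices(range(len(probs)), weights=probs)[0]
```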
Sources: vllm/v1/sample/sampler.py20-320 vllm/v1/sample/metadata.py12-50
When speculative_config is set, GPUModelRunner.__init__ initializes a drafter that proposes num_speculative_tokens candidate tokens per request. The target model then verifies all draft positions in one batched forward pass.
Selection logic is in vllm/v1/worker/gpu_model_runner.py491-526:
| speculative_config.method | Drafter Class | Location | Approach |
|---|---|---|---|
"ngram" | NgramProposer | vllm/v1/spec_decode/ngram_proposer.py | Prompt lookup (n-gram matching), no NN |
| (uses draft model) | DraftModelProposer | vllm/v1/spec_decode/draft_model.py | Separate smaller model |
"suffix" | SuffixDecodingProposer | vllm/v1/spec_decode/suffix_decoding.py | Suffix-tree based proposals |
"eagle" / "eagle3" | EagleProposer | vllm/v1/spec_decode/eagle.py | EAGLE auto-regression head |
"medusa" | MedusaProposer | vllm/v1/spec_decode/medusa.py | Multi-head speculation |
RejectionSampler.forward() (vllm/v1/sample/rejection_sampler.py60) verifies draft tokens:
1. Runs the Sampler on target model logits at position k+1
2. Invokes the rejection_sample kernel, which accepts draft tokens probabilistically and samples recovered tokens where drafts are rejected

The output tensor has shape (batch_size, max_spec_len + 1) with PLACEHOLDER_TOKEN_ID (-1) for rejected positions. RejectionSampler.parse_output() strips placeholders to produce variable-length per-request token lists.
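The shape and placeholder conventions above can be sketched for the simplified greedy-verification case (accept a draft token only if the target model's choice agrees), rather than the full probabilistic kernel. The function names here are stand-ins, not vLLM's API.

```python
# Greedy-verification sketch of draft token checking and placeholder
# stripping. A row always has len(draft) + 1 entries, padded with -1.
PLACEHOLDER_TOKEN_ID = -1

def verify_greedy(draft_tokens: list, target_tokens: list) -> list:
    # target_tokens has len(draft_tokens) + 1 entries: the target model's
    # choice at each draft position, plus one bonus token.
    out = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            out.append(t)  # recovered token replaces the rejected draft
            out += [PLACEHOLDER_TOKEN_ID] * (len(draft_tokens) - len(out) + 1)
            return out
        out.append(d)
    out.append(target_tokens[-1])  # all accepted: keep the bonus token
    return out

def parse_output(row: list) -> list:
    # Strip placeholders to get the variable-length accepted sequence.
    return [t for t in row if t != PLACEHOLDER_TOKEN_ID]
```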
Sources: vllm/v1/sample/rejection_sampler.py30-252 vllm/v1/worker/gpu_model_runner.py491-527 vllm/v1/spec_decode/ngram_proposer.py12-100
ModelRunnerOutput (vllm/v1/outputs.py160) is the data structure returned from the worker to the engine after each step.
| Field | Type | Description |
|---|---|---|
req_ids | list[str] | Ordered request IDs in this batch |
req_id_to_index | dict[str, int] | Maps request ID to batch index |
sampled_token_ids | list[list[int]] | Per-request generated tokens (1+ for spec decode) |
| logprobs | LogprobsLists \| None | Top-k logprobs if any request requested them |
| prompt_logprobs_dict | dict[str, LogprobsTensors \| None] | Prompt logprobs per request |
| pooler_output | list[torch.Tensor \| None] \| None | Hidden state embeddings for pooling models |
| kv_connector_output | KVConnectorOutput \| None | State for disaggregated prefill connectors |
| num_nans_in_logits | dict[str, int] \| None | Per-request NaN diagnostic counts |
| Class | Location | Description |
|---|---|---|
LogprobsTensors | outputs.py47 | GPU tensors: (token_ids, logprobs, selected_token_ranks) |
LogprobsLists | outputs.py22 | CPU numpy arrays version for serialization |
SamplerOutput | outputs.py114 | Intermediate: sampled_token_ids + logprobs_tensors (GPU) |
When async scheduling is active, GPUModelRunner wraps its output in AsyncGPUModelRunnerOutput (vllm/v1/worker/gpu_model_runner.py208). The constructor immediately initiates a non-blocking GPU→CPU copy of sampled_token_ids on a dedicated async_output_copy_stream. The caller later calls get_output(), which synchronizes the copy event and returns the finalized ModelRunnerOutput. This overlaps GPU→CPU transfer with the next step's scheduling and input preparation.
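The eager-copy/late-synchronize pattern can be sketched with threads, where a threading.Event stands in for the CUDA copy-completion event and a worker thread stands in for the dedicated copy stream. This is a conceptual stand-in, not the real stream-based implementation.

```python
# Conceptual sketch of AsyncGPUModelRunnerOutput: start the device→host
# copy immediately in the constructor, synchronize only in get_output().
import threading

class AsyncOutput:
    def __init__(self, gpu_tokens, copy_fn):
        self._done = threading.Event()
        self._result = None
        def _copy():
            self._result = copy_fn(gpu_tokens)  # stands in for the D2H copy
            self._done.set()                    # stands in for the CUDA event
        threading.Thread(target=_copy, daemon=True).start()

    def get_output(self):
        self._done.wait()  # synchronize with the copy before returning
        return self._result

out = AsyncOutput([1, 2, 3], copy_fn=list)
# ... the engine schedules and prepares the next step here, overlapped ...
tokens = out.get_output()
```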
Sources: vllm/v1/outputs.py22-251 vllm/v1/worker/gpu_model_runner.py208-274