This page describes how vLLM loads models onto GPU devices and executes inference. It covers the executor → worker → model runner hierarchy, how requests are batched into GPU tensors, how the model forward pass runs, how tokens are sampled, and how speculative decoding improves throughput.
For in-depth coverage of each component, see the child pages:
- GPUModelRunner, CUDA graph capture, encoder caching, forward pass internals
- MultiprocExecutor, RayDistributedExecutor, worker spawning, RPC communication
- InputBatch, CachedRequestState, block table packing
- Sampler, penalties, top-k/top-p, logprob computation
- RejectionSampler

For the scheduler that produces the SchedulerOutput consumed here, see Scheduler and Resource Allocation. For KV cache block management, see KV Cache Management. For attention computation, see Attention Backends.
Model execution is structured in three nested layers:
| Layer | Primary Class | File | Responsibility |
|---|---|---|---|
| Executor | Executor (ABC) | vllm/v1/executor/abstract.py | RPC dispatch; worker lifecycle orchestration |
| Worker | Worker (extends WorkerBase) | vllm/v1/worker/gpu_worker.py | Owns one GPU; manages weights, KV cache, and execution |
| Model Runner | GPUModelRunner | vllm/v1/worker/gpu_model_runner.py | All PyTorch operations: input prep, forward pass, sampling |
On each inference step, EngineCore calls Executor.execute_model(scheduler_output). The executor fans this call out via RPC to all workers. Each worker delegates to its GPUModelRunner, which returns a ModelRunnerOutput.
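The fan-out pattern described above can be sketched as follows. This is a simplified in-process stand-in, not vLLM's implementation: the class names mirror vLLM's, but the worker here is a plain object rather than a separate process.

```python
# Hypothetical sketch of the executor → worker fan-out. In vLLM the RPC
# crosses process boundaries; here it is a direct method call.

class Worker:
    def __init__(self, rank: int):
        self.rank = rank

    def execute_model(self, scheduler_output: dict) -> dict:
        # A real worker would delegate to its GPUModelRunner here.
        return {"rank": self.rank, "tokens": [scheduler_output["step"]]}

class Executor:
    def __init__(self, num_workers: int):
        self.workers = [Worker(r) for r in range(num_workers)]

    def collective_rpc(self, method: str, *args) -> list:
        # Broadcast the method call to every worker, collect all results.
        return [getattr(w, method)(*args) for w in self.workers]

    def execute_model(self, scheduler_output: dict) -> dict:
        outputs = self.collective_rpc("execute_model", scheduler_output)
        # The engine only needs one output (driver rank's).
        return outputs[0]

executor = Executor(num_workers=2)
out = executor.execute_model({"step": 7})
```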
Diagram: Component Hierarchy with Code Entities
Sources: vllm/v1/executor/abstract.py36-86 vllm/v1/executor/multiproc_executor.py95-225 vllm/v1/worker/gpu_worker.py102-150 vllm/v1/worker/gpu_model_runner.py375-530
Executor.get_class() selects the executor implementation from parallel_config.distributed_executor_backend:
| Backend Value | Executor Class | When Used |
|---|---|---|
"mp" | MultiprocExecutor | Multi-GPU on one or more nodes via Python multiprocessing |
"ray" | RayDistributedExecutor | Multi-node via Ray actors |
"uni" | UniProcExecutor | Single GPU, in-process (no subprocess) |
"external_launcher" | ExecutorWithExternalLauncher | torchrun-style external process management |
All executors expose the same interface:
| Method | Description |
|---|---|
| initialize_from_config(kv_cache_configs) | Allocates KV cache and compiles/warms up the model |
| execute_model(scheduler_output) | Runs one forward pass step |
| sample_tokens(grammar_output) | Separate sampling step used with async scheduling |
| collective_rpc(method, args) | Broadcasts a method call to all workers and collects results |
MultiprocExecutor creates one WorkerProc per GPU, each in its own OS process. SchedulerOutput is serialized once and broadcast via MessageQueue (a shared-memory ring buffer). Each worker sends its ModelRunnerOutput back via a dedicated response MessageQueue. A background thread monitors worker process liveness and triggers a FailureCallback if any worker dies.
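The serialize-once broadcast pattern can be sketched with an in-process ring buffer. vLLM's MessageQueue is backed by shared memory across processes; this stand-in only illustrates the idea that the payload is serialized a single time and every reader deserializes the same bytes.

```python
# Illustrative single-producer ring buffer. The slot count and API are
# made up for this sketch; only the serialize-once idea matches the text.
import pickle

class RingBuffer:
    def __init__(self, slots: int):
        self.buf = [None] * slots
        self.head = 0  # next write position

    def broadcast(self, obj) -> None:
        # Serialize once; every reader sees the same byte string.
        self.buf[self.head % len(self.buf)] = pickle.dumps(obj)
        self.head += 1

    def read(self, idx: int):
        return pickle.loads(self.buf[idx % len(self.buf)])

mq = RingBuffer(slots=4)
mq.broadcast({"req_ids": ["a", "b"]})
# Two "workers" read the same serialized SchedulerOutput.
worker_views = [mq.read(0) for _ in range(2)]
```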
Sources: vllm/v1/executor/abstract.py36-225 vllm/v1/executor/multiproc_executor.py95-466
Worker (extends WorkerBase) manages one GPU device identified by local_rank. It is constructed inside WorkerProc.__init__ after the process is spawned.
Diagram: Worker Initialization Sequence
| Method | File Location | Purpose |
|---|---|---|
| init_device() | gpu_worker.py219-315 | Set CUDA device, init distributed groups, create GPUModelRunner |
| load_model() | gpu_worker.py319-342 | Load weights (supports sleep/wake memory pool) |
| determine_available_memory() | gpu_worker.py350-429 | Profile-run and compute KV cache budget |
| initialize_from_config() | gpu_worker.py462-481 | Allocate KV cache, init KV transfer connector |
| compile_or_warm_up_model() | gpu_worker.py482-608 | Warm up sizes, capture CUDA graphs |
| execute_model() | gpu_worker.py658-759 | Forward pass; handles PP tensor sends/receives |
| sample_tokens() | gpu_worker.py652-656 | Async-mode separate sampling call |
Worker.execute_model() also handles pipeline-parallel communication. Non-first-rank workers receive IntermediateTensors from the previous pipeline stage via get_pp_group().irecv_tensor_dict(). Non-last-rank workers send their output forward. Only the last pipeline stage returns a ModelRunnerOutput.
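The pipeline-parallel control flow just described can be sketched as below. The recv/send callables are stubs standing in for the get_pp_group() tensor operations; the function itself is a simplified stand-in for Worker.execute_model(), not vLLM's code.

```python
# Sketch of pipeline-parallel rank handling: receive from the previous
# stage unless first, send forward unless last, and only the last stage
# returns a final output.

def execute_model_pp(rank, world_size, scheduler_output, recv, send, run_model):
    # Non-first ranks receive intermediate tensors from the previous stage.
    intermediate = recv() if rank > 0 else None
    output = run_model(scheduler_output, intermediate)
    if rank < world_size - 1:
        send(output)   # forward activations to the next stage
        return None    # only the last stage returns a ModelRunnerOutput
    return output
```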
Sources: vllm/v1/worker/gpu_worker.py102-759 vllm/v1/worker/worker_base.py34-100
GPUModelRunner in vllm/v1/worker/gpu_model_runner.py375 performs all computation within a single GPU. It is instantiated once per worker and owns:
nn.Module (set after load_model)kv_caches: list[torch.Tensor] — KV cache tensors allocated after initialize_kv_cacherequests: dict[str, CachedRequestState] — per-request cached stateinput_batch: InputBatch — persistent GPU tensor buffers for the active batchinput_ids, positions, seq_lens, query_start_loc, etc.sampler: Sampler — token sampling moduledrafter — speculative decoding proposer (if configured)rejection_sampler: RejectionSampler — draft token verifier (if speculative decoding)Diagram: Per-Step Execution — Method Call Chain
When async scheduling is enabled, execute_model stores its output in execute_model_state: ExecuteModelState and returns None. The subsequent sample_tokens call retrieves the cached logits, hidden_states, and spec_decode_metadata from that field to complete sampling. This allows the engine to overlap input preparation for the next step with GPU→CPU output copying.
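The two-phase execute/sample split can be sketched as follows. The ExecuteModelState dataclass here is a simplified stand-in for the field of the same name in the text; the placeholder strings take the place of real logits and hidden states.

```python
# Sketch of deferred sampling under async scheduling: execute_model()
# caches its intermediate results and returns None; sample_tokens()
# consumes the cached state later.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ExecuteModelState:
    logits: Any
    hidden_states: Any
    spec_decode_metadata: Any = None

class Runner:
    def __init__(self):
        self.execute_model_state: Optional[ExecuteModelState] = None

    def execute_model(self, scheduler_output) -> None:
        logits = f"logits[{scheduler_output}]"   # placeholder for forward pass
        self.execute_model_state = ExecuteModelState(logits, "hidden")
        return None  # sampling is deferred

    def sample_tokens(self) -> str:
        state, self.execute_model_state = self.execute_model_state, None
        return f"sampled from {state.logits}"
```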
compile_or_warm_up_model() calls GPUModelRunner.capture_model(). This records CUDA graphs for each batch size listed in compilation_config.cudagraph_capture_sizes. During inference, decode-phase batches (all requests generating tokens) execute via graph replay. Prefill-phase batches always run eagerly to handle variable sequence lengths.
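Replaying a captured graph requires padding the decode batch up to the nearest captured size. The helper below sketches that size selection; the capture sizes listed are illustrative, not vLLM's defaults.

```python
# Conceptual sketch of capture-size selection for CUDA graph replay:
# a decode batch is padded up to the smallest captured size that fits,
# and falls back to eager execution if none fits (as prefill always does).
import bisect

cudagraph_capture_sizes = [1, 2, 4, 8, 16, 32]  # illustrative sizes

def pick_graph_size(num_decode_tokens: int):
    i = bisect.bisect_left(cudagraph_capture_sizes, num_decode_tokens)
    if i == len(cudagraph_capture_sizes):
        return None  # batch too large: run eagerly
    return cudagraph_capture_sizes[i]
```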
Sources: vllm/v1/worker/gpu_model_runner.py375-760 vllm/v1/worker/gpu_model_runner.py359-373
InputBatch (vllm/v1/worker/gpu_input_batch.py81) maintains persistent CPU and GPU tensors for all active requests, updated incrementally each step rather than rebuilt from scratch. This avoids re-allocation overhead.
CachedRequestState (vllm/v1/worker/gpu_input_batch.py30) is the per-request state object stored in GPUModelRunner.requests. It carries prompt token IDs, generated token IDs, allocated block IDs, sampling parameters, MRoPE position state, and LoRA request info.
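A simplified dataclass gives the flavor of this state object. The fields below follow the prose above but are a reduced, illustrative subset; the num_tokens property is an assumption about a convenient derived quantity, not a documented API.

```python
# Illustrative, reduced stand-in for CachedRequestState.
from dataclasses import dataclass, field

@dataclass
class CachedRequestState:
    req_id: str
    prompt_token_ids: list
    output_token_ids: list = field(default_factory=list)
    block_ids: list = field(default_factory=list)      # allocated KV blocks
    sampling_params: dict = field(default_factory=dict)

    @property
    def num_tokens(self) -> int:
        # Total tokens this request has produced so far (prompt + generated).
        return len(self.prompt_token_ids) + len(self.output_token_ids)
```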
| Field | Shape / Type | Description |
|---|---|---|
token_ids_cpu | np.ndarray (max_num_reqs, max_model_len) | All token IDs per request (prompt + generated + spec decode) |
is_token_ids | np.ndarray (max_num_reqs, max_model_len) | Mask: True where token IDs exist vs. prompt embeddings |
num_computed_tokens_cpu | np.ndarray (max_num_reqs,) | KV-cached token count per request |
block_table | MultiGroupBlockTable | KV cache physical block index mappings |
temperature, top_p, top_k | torch.Tensor (max_num_reqs,) | Sampling parameter tensors |
frequency_penalties, presence_penalties, repetition_penalties | torch.Tensor (max_num_reqs,) | Token penalty tensors |
generators | dict[int, torch.Generator] | Per-request RNG states |
allowed_token_ids_mask | torch.Tensor (max_num_reqs, vocab_size) | Token allowlist bitmask |
sampling_metadata | SamplingMetadata | Compiled view of all sampling params for Sampler |
logitsprocs | LogitsProcessors | Active logits processor instances |
Key methods on InputBatch:
- add_request(CachedRequestState) — registers a new request and fills its row in all tensors
- remove_request(req_id) — marks a slot as removed
- condense() — compacts the batch by sliding active entries into freed slots
- swap_states(i1, i2) — reorders two request slots (used for attention backend requirements)

Sources: vllm/v1/worker/gpu_input_batch.py29-441 vllm/v1/worker/block_table.py16-200
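The compaction idea behind condense() can be sketched on a plain list, with None marking removed slots. This is a simplified stand-in operating on request IDs rather than the real per-row tensor moves.

```python
# Sketch of batch compaction: slide the rightmost active entries into
# freed slots so that active requests occupy a dense prefix.

def condense(slots: list) -> list:
    empty = [i for i, s in enumerate(slots) if s is None]
    last = len(slots) - 1
    for i in empty:
        # Find the rightmost still-active slot.
        while last > i and slots[last] is None:
            last -= 1
        if last <= i:
            break
        slots[i], slots[last] = slots[last], None
    # Trim trailing empty slots.
    while slots and slots[-1] is None:
        slots.pop()
    return slots
```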
After the model forward pass, Sampler (vllm/v1/sample/sampler.py20) converts raw logits into sampled token IDs. It receives a SamplingMetadata object aggregated from InputBatch.
SamplingMetadata (vllm/v1/sample/metadata.py12) packages the per-request sampling state:
| Field | Description |
|---|---|
temperature | Per-request temperature tensor (None if all greedy) |
all_greedy / all_random | Batch-level fast-path flags |
top_p, top_k | Nucleus sampling and top-k tensors |
output_token_ids | Previously generated tokens (for penalty application) |
prompt_token_ids | Prompt token IDs (for repetition penalty) |
generators | Per-request torch.Generator states |
logitsprocs | LogitsProcessors split into argmax-invariant and non-argmax-invariant |
allowed_token_ids_mask | Allowlist mask (non-None only when some requests use it) |
bad_words_token_ids | Per-request banned token sequences |
max_num_logprobs | Max logprob count across the batch (None if none requested) |
Diagram: Sampler.forward() Processing Steps
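The core sampling steps can be sketched in pure Python: temperature scaling, top-k masking, softmax, then sampling, with a greedy fast path when temperature is zero (the all_greedy flag above). This is a scalar stand-in for the batched GPU Sampler, and the seeded RNG mirrors the per-request generators field only loosely.

```python
# Simplified sampling pipeline sketch (single request, pure Python).
import math
import random

def sample(logits, temperature=1.0, top_k=0, seed=0):
    if temperature == 0.0:
        # all_greedy fast path: plain argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    if top_k > 0:
        # Mask everything below the k-th largest logit.
        kth = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= kth else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Seeded per-request RNG, loosely mirroring the generators dict.
    return random.Random(seed).choices(range(len(probs)), weights=probs)[0]
```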
Sources: vllm/v1/sample/sampler.py20-320 vllm/v1/sample/metadata.py12-50
When speculative_config is set, GPUModelRunner.__init__ initializes a drafter that proposes num_speculative_tokens candidate tokens per request. The target model then verifies all draft positions in one batched forward pass.
Selection logic is in vllm/v1/worker/gpu_model_runner.py491-526:
| speculative_config.method | Drafter Class | Location | Approach |
|---|---|---|---|
"ngram" | NgramProposer | vllm/v1/spec_decode/ngram_proposer.py | Prompt lookup (n-gram matching), no NN |
| (uses draft model) | DraftModelProposer | vllm/v1/spec_decode/draft_model.py | Separate smaller model |
"suffix" | SuffixDecodingProposer | vllm/v1/spec_decode/suffix_decoding.py | Suffix-tree based proposals |
"eagle" / "eagle3" | EagleProposer | vllm/v1/spec_decode/eagle.py | EAGLE auto-regression head |
"medusa" | MedusaProposer | vllm/v1/spec_decode/medusa.py | Multi-head speculation |
RejectionSampler.forward() (vllm/v1/sample/rejection_sampler.py60) verifies draft tokens:
1. Runs the Sampler on target model logits at position k+1
2. Invokes the rejection_sample kernel, which accepts draft tokens probabilistically and samples recovered tokens where drafts are rejected

The output tensor has shape (batch_size, max_spec_len + 1) with PLACEHOLDER_TOKEN_ID (-1) for rejected positions. RejectionSampler.parse_output() strips placeholders to produce variable-length per-request token lists.
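The shape and placeholder conventions above can be sketched for the simplified greedy-verification case (accept a draft token only if the target model's choice agrees), rather than the full probabilistic kernel. The function names here are stand-ins, not vLLM's API.

```python
# Greedy-verification sketch of draft token checking and placeholder
# stripping. A row always has len(draft) + 1 entries, padded with -1.
PLACEHOLDER_TOKEN_ID = -1

def verify_greedy(draft_tokens: list, target_tokens: list) -> list:
    # target_tokens has len(draft_tokens) + 1 entries: the target model's
    # choice at each draft position, plus one bonus token.
    out = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            out.append(t)  # recovered token replaces the rejected draft
            out += [PLACEHOLDER_TOKEN_ID] * (len(draft_tokens) - len(out) + 1)
            return out
        out.append(d)
    out.append(target_tokens[-1])  # all accepted: keep the bonus token
    return out

def parse_output(row: list) -> list:
    # Strip placeholders to get the variable-length accepted sequence.
    return [t for t in row if t != PLACEHOLDER_TOKEN_ID]
```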
Sources: vllm/v1/sample/rejection_sampler.py30-252 vllm/v1/worker/gpu_model_runner.py491-527 vllm/v1/spec_decode/ngram_proposer.py12-100
ModelRunnerOutput (vllm/v1/outputs.py160) is the data structure returned from the worker to the engine after each step.
| Field | Type | Description |
|---|---|---|
req_ids | list[str] | Ordered request IDs in this batch |
req_id_to_index | dict[str, int] | Maps request ID to batch index |
sampled_token_ids | list[list[int]] | Per-request generated tokens (1+ for spec decode) |
| logprobs | LogprobsLists \| None | Top-k logprobs if any request requested them |
| prompt_logprobs_dict | dict[str, LogprobsTensors \| None] | Prompt logprobs per request |
| pooler_output | list[torch.Tensor \| None] \| None | Hidden state embeddings for pooling models |
| kv_connector_output | KVConnectorOutput \| None | State for disaggregated prefill connectors |
| num_nans_in_logits | dict[str, int] \| None | Per-request NaN diagnostic counts |
| Class | Location | Description |
|---|---|---|
LogprobsTensors | outputs.py47 | GPU tensors: (token_ids, logprobs, selected_token_ranks) |
LogprobsLists | outputs.py22 | CPU numpy arrays version for serialization |
SamplerOutput | outputs.py114 | Intermediate: sampled_token_ids + logprobs_tensors (GPU) |
When async scheduling is active, GPUModelRunner wraps its output in AsyncGPUModelRunnerOutput (vllm/v1/worker/gpu_model_runner.py208). The constructor immediately initiates a non-blocking GPU→CPU copy of sampled_token_ids on a dedicated async_output_copy_stream. The caller later calls get_output(), which synchronizes the copy event and returns the finalized ModelRunnerOutput. This overlaps GPU→CPU transfer with the next step's scheduling and input preparation.
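The eager-copy/late-synchronize pattern can be sketched with threads, where a threading.Event stands in for the CUDA copy-completion event and a worker thread stands in for the dedicated copy stream. This is a conceptual stand-in, not the real stream-based implementation.

```python
# Conceptual sketch of AsyncGPUModelRunnerOutput: start the device→host
# copy immediately in the constructor, synchronize only in get_output().
import threading

class AsyncOutput:
    def __init__(self, gpu_tokens, copy_fn):
        self._done = threading.Event()
        self._result = None
        def _copy():
            self._result = copy_fn(gpu_tokens)  # stands in for the D2H copy
            self._done.set()                    # stands in for the CUDA event
        threading.Thread(target=_copy, daemon=True).start()

    def get_output(self):
        self._done.wait()  # synchronize with the copy before returning
        return self._result

out = AsyncOutput([1, 2, 3], copy_fn=list)
# ... the engine schedules and prepares the next step here, overlapped ...
tokens = out.get_output()
```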
Sources: vllm/v1/outputs.py22-251 vllm/v1/worker/gpu_model_runner.py208-274