This page documents the two-layer abstraction vLLM uses to manage model execution across hardware: the Executor layer and the Worker layer.
This page focuses on the v1 implementations under vllm/v1/executor/ and vllm/v1/worker/. For how the EngineCore holds and calls the executor, see 3.1. For what the GPUModelRunner does inside a worker, see 4.1. For distributed communication primitives (MessageQueue, ShmRingBuffer), see 9.2.
Two-layer model execution architecture
Sources: vllm/v1/executor/abstract.py36-86 vllm/v1/executor/uniproc_executor.py25-130 vllm/v1/executor/multiproc_executor.py95-225 vllm/v1/executor/ray_executor.py62-98
Executor (vllm/v1/executor/abstract.py36-355) is the abstract base class for all executor implementations. It is instantiated by the engine core and provides a uniform interface regardless of how many GPUs or processes are in use.
Executor.get_class(vllm_config) reads parallel_config.distributed_executor_backend to pick the concrete class:
| distributed_executor_backend | Executor Class | File |
|---|---|---|
| "mp" | MultiprocExecutor | vllm/v1/executor/multiproc_executor.py |
| "ray" | RayDistributedExecutor | vllm/v1/executor/ray_executor.py |
| "uni" | UniProcExecutor | vllm/v1/executor/uniproc_executor.py |
| "external_launcher" | ExecutorWithExternalLauncher | vllm/v1/executor/uniproc_executor.py |
| A type subclassing Executor | That class directly | — |
| A fully-qualified string | Resolved via resolve_obj_by_qualname | — |
Sources: vllm/v1/executor/abstract.py46-86
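The fully-qualified-string case can be sketched as follows. This is an illustrative reimplementation of the qualname-resolution idea, not vLLM's exact code; the example string "collections.OrderedDict" is just a stand-in for a user-supplied executor class path.

```python
import importlib

def resolve_obj_by_qualname(qualname: str):
    """Resolve 'package.module.ClassName' to the named object."""
    module_name, _, attr_name = qualname.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)

# A backend string like "my_pkg.executors.CustomExecutor" would import
# my_pkg.executors and return its CustomExecutor attribute. Using a
# stdlib class here so the sketch is self-contained:
cls = resolve_obj_by_qualname("collections.OrderedDict")
```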
collective_rpc Interface

All executor implementations expose collective_rpc(), the core RPC primitive. It sends a method call (by name or as a serialized callable) with positional and keyword arguments to all workers, then collects a list of results.
collective_rpc(method, timeout, args, kwargs, non_block) -> list[result] | Future[list[result]]
Higher-level methods (execute_model, sample_tokens, initialize_from_config, etc.) are implemented in the base class by delegating to collective_rpc. Subclasses override them only when they need extra behavior (e.g., handling pipeline parallelism).
Sources: vllm/v1/executor/abstract.py141-191 vllm/v1/executor/abstract.py112-126
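The delegation pattern can be sketched with a toy executor. The classes and method bodies below are illustrative stand-ins, not vLLM's real signatures; they only show how a higher-level call reduces to one collective_rpc and how a single worker's non-None output is kept.

```python
class ToyWorker:
    """Stand-in worker: only the designated output rank returns a result."""
    def __init__(self, rank, output_rank):
        self.rank, self.output_rank = rank, output_rank

    def execute_model(self, scheduler_output):
        if self.rank == self.output_rank:
            return f"output-for-{scheduler_output}"
        return None

class ToyExecutor:
    def __init__(self, workers):
        self.workers = workers

    def collective_rpc(self, method, args=(), kwargs=None):
        # Invoke `method` on every worker, gather one result per worker.
        kwargs = kwargs or {}
        return [getattr(w, method)(*args, **kwargs) for w in self.workers]

    def execute_model(self, scheduler_output):
        # Higher-level method implemented purely via collective_rpc.
        outputs = self.collective_rpc("execute_model", args=(scheduler_output,))
        return next(o for o in outputs if o is not None)

executor = ToyExecutor([ToyWorker(r, output_rank=1) for r in range(2)])
result = executor.execute_model("step-0")
```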
| Method | Description |
|---|---|
| _init_executor() | Abstract. Creates and initializes worker processes/actors. |
| initialize_from_config(kv_cache_configs) | Sends KV cache config to workers, then triggers compile_or_warm_up_model. |
| determine_available_memory() | Calls determine_available_memory on all workers; returns list of bytes. |
| get_kv_cache_specs() | Returns KV cache spec dicts from all workers. |
| execute_model(scheduler_output) | Dispatches a forward pass to all workers. |
| sample_tokens(grammar_output) | Dispatches sampling after a deferred execute_model. |
| sleep(level) / wake_up(tags) | Offload/restore GPU memory (sleep mode). |
| check_health() | Abstract. Raises if executor is unhealthy. |
| shutdown() | Terminates all workers. |
Sources: vllm/v1/executor/abstract.py108-354
UniProcExecutor
Sources: vllm/v1/executor/uniproc_executor.py25-129
UniProcExecutor runs a single worker in the same process as the engine core. It is the default for single-GPU deployments (i.e., TP=1, PP=1, no data parallelism requiring separate processes).
- _init_executor() creates one WorkerWrapperBase(rpc_rank=0) and calls init_worker(), init_device(), and load_model() in-process.
- collective_rpc() calls run_method(self.driver_worker, method, args, kwargs) directly; no IPC, no serialization.
- When async_scheduling=True, a ThreadPoolExecutor (one thread) is created to handle AsyncModelRunnerOutput.get_output() off the main thread, allowing the scheduler to overlap with output processing.

ExecutorWithExternalLauncher
Sources: vllm/v1/executor/uniproc_executor.py131-177
A subclass of UniProcExecutor designed for torchrun-compatible launchers. Instead of one executor managing multiple workers, the user launches one engine per GPU. distributed_init_method is set to "env://" and rank/local_rank are read from RANK and LOCAL_RANK. determine_available_memory() performs an all_reduce across all ranks to pick the minimum available memory.
MultiprocExecutor
Sources: vllm/v1/executor/multiproc_executor.py95-466
MultiprocExecutor spawns one worker subprocess per local GPU. It is selected when distributed_executor_backend="mp" and is the standard backend for multi-GPU inference on a single node.
MultiprocExecutor initialization sequence
Sources: vllm/v1/executor/multiproc_executor.py102-225 vllm/v1/executor/multiproc_executor.py603-643 vllm/v1/executor/multiproc_executor.py712-810
Each worker subprocess has two MessageQueue channels (backed by shared memory):
| Queue | Direction | Purpose |
|---|---|---|
| rpc_broadcast_mq | Executor → all workers | Broadcast (method, args, kwargs, output_rank) |
| worker_response_mq | Worker → Executor | Return (status, result) for each call |
The executor's collective_rpc() (vllm/v1/executor/multiproc_executor.py317-389) enqueues a call tuple on rpc_broadcast_mq, then dequeues from the appropriate worker_response_mqs. When non_block=True, it returns a FutureWrapper and defers reading until .result() is called.
Workers run worker_busy_loop() (not fully shown in the file but invoked at vllm/v1/executor/multiproc_executor.py791) which continuously reads from rpc_broadcast_mq and dispatches execute_method().
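The broadcast/response flow can be sketched with threads and plain queues. This is a toy model only: real vLLM workers are subprocesses and the queues are shared-memory MessageQueues, and the dispatch here is a stand-in for execute_method().

```python
import queue
import threading

NUM_WORKERS = 2
broadcast_qs = [queue.Queue() for _ in range(NUM_WORKERS)]  # executor -> worker
response_qs = [queue.Queue() for _ in range(NUM_WORKERS)]   # worker -> executor

def worker_busy_loop(rank):
    # Continuously read broadcast calls, dispatch, report (status, result).
    while True:
        method, args = broadcast_qs[rank].get()
        if method == "shutdown":
            return
        result = (rank, method) + args          # stand-in for execute_method()
        response_qs[rank].put(("SUCCESS", result))

threads = [threading.Thread(target=worker_busy_loop, args=(r,), daemon=True)
           for r in range(NUM_WORKERS)]
for t in threads:
    t.start()

def collective_rpc(method, args=()):
    for q in broadcast_qs:                      # broadcast one call tuple
        q.put((method, args))
    results = []
    for q in response_qs:                       # gather one response per worker
        status, result = q.get()
        assert status == "SUCCESS"
        results.append(result)
    return results

results = collective_rpc("execute_model", ("step-0",))
for q in broadcast_qs:                          # tell workers to exit
    q.put(("shutdown", ()))
for t in threads:
    t.join()
```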
| Class | Purpose |
|---|---|
| UnreadyWorkerProcHandle | Holds proc, rank, ready_pipe, death_writer before READY signal |
| WorkerProcHandle | Holds proc, rank, worker_response_mq, peer_worker_response_mqs, death_writer after ready |
The death pipe (death_reader/death_writer) is a one-way pipe. The parent keeps death_writer open. When the parent process exits, the child gets EOFError on death_reader and shuts down cleanly.
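The underlying OS behavior can be demonstrated in a single process: a reader sees EOF as soon as every copy of the write end is closed, with no explicit message needed. Closing the write end below stands in for the parent process exiting.

```python
import os

# Create a one-way pipe: (read fd, write fd).
death_reader, death_writer = os.pipe()

# Simulate the parent exiting: the only write end is closed.
os.close(death_writer)

# The reader unblocks immediately with EOF (empty bytes).
data = os.read(death_reader, 1)
os.close(death_reader)
```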
A daemon thread (MultiprocWorkerMonitor) watches process sentinels. If any worker exits unexpectedly, it logs an error, calls shutdown(), and fires the registered FailureCallback to notify the engine core.
Sources: vllm/v1/executor/multiproc_executor.py246-283
Only one worker produces the ModelRunnerOutput — the first TP worker of the last PP stage. The executor calculates this as:
output_rank = world_size - tensor_parallel_size * prefill_context_parallel_size
Sources: vllm/v1/executor/multiproc_executor.py451-465
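A worked example of this formula, assuming a single node with 8 GPUs, PP=2, TP=4, and prefill_context_parallel_size=1 (the concrete numbers are illustrative):

```python
# world_size 8 = PP 2 stages x TP 4 workers; ranks 0..3 are PP stage 0,
# ranks 4..7 are the last PP stage.
world_size = 8
tensor_parallel_size = 4
prefill_context_parallel_size = 1

output_rank = world_size - tensor_parallel_size * prefill_context_parallel_size
# output_rank == 4: the first TP worker of the last PP stage.
```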
RayDistributedExecutor
Sources: vllm/v1/executor/ray_executor.py62-643
RayDistributedExecutor distributes workers as Ray actors and uses Ray's Compiled DAG for the execution path.
_init_workers_ray() (vllm/v1/executor/ray_executor.py160-398):

- Reads bundle_indices from the Ray placement group (or from VLLM_RAY_BUNDLE_INDICES).
- Creates RayWorkerWrapper remote actors, one per rank.
- Calls collective_rpc("adjust_rank", ...) to account for re-sorting.
- Sets CUDA_VISIBLE_DEVICES environment variables per node via collective_rpc("update_environment_variables", ...).
- Runs collective_rpc("init_worker", ...), "init_device", and "load_model" on all actors.
- Builds pp_tp_workers[pp_rank][tp_rank] for DAG construction.

On the first execute_model() call, _compiled_ray_dag() builds a Ray CompiledDAG (vllm/v1/executor/ray_executor.py542-635). It chains PP stages so intermediate tensors flow from PP rank 0 to PP rank N−1. The DAG is then called with forward_dag.execute((scheduler_output, grammar_output)).
Ray DAG structure for PP=2, TP=4:
Sources: vllm/v1/executor/ray_executor.py542-635
collective_rpc in Ray

collective_rpc() (vllm/v1/executor/ray_executor.py485-510) uses worker.execute_method.remote(method, *args, **kwargs) on every actor and blocks with ray.get(). Method callables are serialized with cloudpickle.
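cloudpickle is needed because stdlib pickle serializes functions by reference (module plus name), so it cannot ship ad-hoc callables such as lambdas to a remote actor. A small sketch of the limitation:

```python
import pickle

# Stdlib pickle fails on a lambda: it tries to serialize the function by
# reference and the lookup fails. cloudpickle serializes the code object
# itself, which is why it is used for method callables sent to workers.
try:
    pickle.dumps(lambda worker: worker.load_model())
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False
```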
WorkerBase
Sources: vllm/v1/worker/worker_base.py34-176
WorkerBase defines the interface every worker implementation must fulfill. It stores the decomposed VllmConfig fields and rank information.
WorkerBase interface summary
| Method | Description |
|---|---|
| init_device() | Set up the device (CUDA context, distributed process group, model runner). |
| load_model() | Load model weights onto the device. |
| determine_available_memory() | Profile peak memory use; return bytes available for KV cache. |
| get_kv_cache_spec() | Return dict[str, KVCacheSpec] for KV cache planning. |
| initialize_from_config(kv_cache_config) | Allocate KV cache and initialize KV transfer connectors. |
| compile_or_warm_up_model() | Capture CUDA graphs, run warmup iterations. Returns compilation time in seconds. |
| execute_model(scheduler_output) | Run the model forward pass. Returns ModelRunnerOutput or None. |
| sample_tokens(grammar_output) | Complete sampling if execute_model returned None. |
| check_health() | Liveness check. |
| shutdown() | Release resources. |
| add_lora / remove_lora / pin_lora / list_loras | LoRA adapter management. |
| sleep(level) / wake_up(tags) | GPU memory sleep/wake (only in Worker). |
Sources: vllm/v1/worker/worker_base.py34-176
WorkerWrapperBase
Sources: vllm/v1/worker/worker_base.py179-372
WorkerWrapperBase sits between an executor and a WorkerBase instance. Its roles are:
- Lazy initialization: the WorkerBase subclass is created only when init_worker() is called.
- Reads parallel_config.worker_cls (a fully-qualified class name string) and instantiates it via resolve_obj_by_qualname.
- If parallel_config.worker_extension_cls is set, dynamically inserts that class into worker_class.__bases__ to add methods without subclassing.
- Maintains an mm_receiver_cache for reading multimodal features from shared memory before forwarding to the worker.
- execute_method(method, *args, **kwargs) calls run_method(self, ...); __getattr__ transparently delegates any unhandled attribute to self.worker.

Key WorkerWrapperBase attributes:
- rpc_rank — index in executor's worker list
- global_rank — global rank in distributed group
- worker — the actual WorkerBase instance (set after init_worker())
- mm_receiver_cache — optional multimodal feature cache
Sources: vllm/v1/worker/worker_base.py179-372
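The lazy-init plus __getattr__ delegation behavior can be sketched with a toy wrapper. ToyWorker and its load_model return value are made up for the example; only the dispatch shape mirrors the text above.

```python
class ToyWorker:
    """Stand-in for a WorkerBase implementation."""
    def load_model(self):
        return "model loaded"

class WorkerWrapper:
    def __init__(self):
        self.worker = None            # created lazily by init_worker()

    def init_worker(self):
        self.worker = ToyWorker()

    def execute_method(self, method, *args, **kwargs):
        # Resolve on the wrapper first; unknown names fall through to
        # the wrapped worker via __getattr__.
        return getattr(self, method)(*args, **kwargs)

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails, so wrapper
        # attributes (like execute_method) always take precedence.
        return getattr(self.worker, name)

wrapper = WorkerWrapper()
wrapper.init_worker()
result = wrapper.execute_method("load_model")
```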
GPU Worker (Worker)
Sources: vllm/v1/worker/gpu_worker.py102-900
Worker is the concrete WorkerBase implementation for NVIDIA/AMD GPU execution. It coordinates between the distributed environment, the GPUModelRunner, and optional features like sleep mode and weight transfer.
Code entities in the GPU worker layer
Sources: vllm/v1/worker/gpu_worker.py102-150 vllm/v1/worker/worker_base.py34-80 vllm/v1/worker/worker_base.py179-215
Worker initialization sequence
Sources: vllm/v1/worker/gpu_worker.py219-315 vllm/v1/worker/gpu_worker.py319-343 vllm/v1/worker/gpu_worker.py350-429 vllm/v1/worker/gpu_worker.py462-481 vllm/v1/worker/gpu_worker.py482-608
init_device()
Sources: vllm/v1/worker/gpu_worker.py219-315
Key actions:
- Removes NCCL_ASYNC_ERROR_HANDLING from the environment (it conflicts with CUDA graph building).
- Adjusts local_rank to account for the data-parallel layout: local_rank += dp_local_rank * tp_pp_world_size.
- Calls init_worker_distributed_environment() to initialize the NCCL process group.
- Takes a MemorySnapshot after NCCL init (NCCL buffers affect available memory).
- Calls request_memory() to compute the GPU memory to reserve based on gpu_memory_utilization.
- Constructs the GPUModelRunner (V1 or V2 variant based on VLLM_USE_V2_MODEL_RUNNER).

determine_available_memory()
Sources: vllm/v1/worker/gpu_worker.py350-429
Runs model_runner.profile_run() under memory_profiling() context to measure peak activation memory, non-torch memory increases, and post-profile free memory. If kv_cache_memory_bytes is set directly in CacheConfig, the profiling step is skipped and that value is returned directly.
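The resulting arithmetic can be illustrated with made-up numbers. This is an accounting sketch only; the real memory_profiling() context measures these quantities from CUDA, and the exact terms vLLM subtracts may differ.

```python
GiB = 1024 ** 3

# Hypothetical measurements on an 80 GiB GPU.
total_gpu_memory = 80 * GiB
gpu_memory_utilization = 0.9      # fraction of the GPU vLLM may use
weights = 14 * GiB                # model weights loaded before profiling
peak_activation = 6 * GiB         # peak activations seen during profile_run()
non_torch_increase = 1 * GiB      # e.g. NCCL buffers, CUDA context growth

requested = int(total_gpu_memory * gpu_memory_utilization)   # 72 GiB budget
available_for_kv_cache = (
    requested - weights - peak_activation - non_torch_increase
)
# 72 - 14 - 6 - 1 = 51 GiB left for the KV cache.
```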
execute_model() and Pipeline Parallelism
Sources: vllm/v1/worker/gpu_worker.py658-747
For PP > 1:
- Non-first stages call get_pp_group().irecv_tensor_dict() to receive IntermediateTensors from the previous stage as AsyncIntermediateTensors (lazy synchronization via wait_for_comm() on first tensor access).
- Non-last stages send their outputs with get_pp_group().isend_tensor_dict() and store handles in _pp_send_work; these are waited on at the start of the next execute_model() call.
- Only the last PP stage returns a ModelRunnerOutput; others return None.

AsyncIntermediateTensors (vllm/v1/worker/gpu_worker.py70-99) wraps IntermediateTensors and defers handle.wait() calls until .tensors is first accessed, overlapping PP communication with other work.
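The lazy-synchronization idea can be sketched with a toy wrapper and a fake communication handle (both names below are made up; only the defer-until-first-access shape mirrors AsyncIntermediateTensors):

```python
class FakeHandle:
    """Stand-in for an async communication handle with a wait() method."""
    def __init__(self, log):
        self.log = log
    def wait(self):
        self.log.append("waited")

class LazyIntermediateTensors:
    def __init__(self, handles, tensors):
        self._handles = handles
        self._tensors = tensors
        self._synced = False

    @property
    def tensors(self):
        # First access drains all pending handles; later accesses are free.
        if not self._synced:
            for h in self._handles:
                h.wait()
            self._synced = True
        return self._tensors

log = []
lazy = LazyIntermediateTensors([FakeHandle(log)], {"hidden_states": "..."})
before_access = list(log)       # [] : nothing waited on yet
_ = lazy.tensors                # first access triggers wait()
_ = lazy.tensors                # second access does not wait again
```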
Sleep Mode
Sources: vllm/v1/worker/gpu_worker.py154-213
When enable_sleep_mode=True, Worker supports two sleep levels:
| Level | Behavior |
|---|---|
| 1 | Offload model weights to CPU via CuMemAllocator, keep KV cache on GPU. |
| 2 | Offload everything; save non-offloadable buffers to CPU before sleep, restore on wake. |
Weights are allocated inside _maybe_get_memory_pool_context(tag="weights") and KV cache inside tag="kv_cache" so the allocator can track and offload them independently.
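The tagging idea can be sketched with a toy tracker. Everything here (the pools dict, allocate()) is invented for illustration; the real CuMemAllocator tags CUDA memory-pool allocations, not Python objects.

```python
from collections import defaultdict
from contextlib import contextmanager

GiB = 1024 ** 3
pools = defaultdict(list)       # tag -> list of (name, nbytes)
_current_tag = [None]

@contextmanager
def memory_pool_context(tag):
    # Allocations made inside this context are attributed to `tag`,
    # so each group can later be offloaded or freed independently.
    _current_tag[0] = tag
    try:
        yield
    finally:
        _current_tag[0] = None

def allocate(name, nbytes):
    pools[_current_tag[0]].append((name, nbytes))

with memory_pool_context(tag="weights"):
    allocate("model.layers", 14 * GiB)
with memory_pool_context(tag="kv_cache"):
    allocate("kv.blocks", 50 * GiB)
```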
Executor backend selection and worker class resolution
Sources: vllm/v1/executor/abstract.py46-86 vllm/v1/worker/worker_base.py251-314
| Class | File | Role |
|---|---|---|
| Executor | vllm/v1/executor/abstract.py | Abstract base; owns collective_rpc contract |
| UniProcExecutor | vllm/v1/executor/uniproc_executor.py | In-process single-worker execution |
| ExecutorWithExternalLauncher | vllm/v1/executor/uniproc_executor.py | One worker per torchrun-launched engine |
| MultiprocExecutor | vllm/v1/executor/multiproc_executor.py | Multi-GPU via forked subprocesses + MessageQueue |
| RayDistributedExecutor | vllm/v1/executor/ray_executor.py | Multi-GPU/multi-node via Ray actors + compiled DAG |
| WorkerProc | vllm/v1/executor/multiproc_executor.py | Runs one worker in a subprocess; owns busy loop |
| WorkerProcHandle | vllm/v1/executor/multiproc_executor.py | Executor's handle to a ready WorkerProc |
| WorkerBase | vllm/v1/worker/worker_base.py | Abstract worker interface |
| WorkerWrapperBase | vllm/v1/worker/worker_base.py | Lazy init, method dispatch, extension injection |
| Worker | vllm/v1/worker/gpu_worker.py | Concrete GPU worker; drives GPUModelRunner |
| AsyncIntermediateTensors | vllm/v1/worker/gpu_worker.py | Lazy-sync PP intermediate tensors |