This page documents the two-layer abstraction vLLM uses to manage model execution across hardware: the Executor layer and the Worker layer.
This page focuses on the v1 implementations under vllm/v1/executor/ and vllm/v1/worker/. For how the EngineCore holds and calls the executor, see 3.1. For what the GPUModelRunner does inside a worker, see 4.1. For distributed communication primitives (MessageQueue, ShmRingBuffer), see 9.2.
Two-layer model execution architecture
Sources: vllm/v1/executor/abstract.py36-86 vllm/v1/executor/uniproc_executor.py25-130 vllm/v1/executor/multiproc_executor.py95-225 vllm/v1/executor/ray_executor.py62-98
Executor (vllm/v1/executor/abstract.py36-355) is the abstract base class for all executor implementations. It is instantiated by the engine core and provides a uniform interface regardless of how many GPUs or processes are in use.
Executor.get_class(vllm_config) reads parallel_config.distributed_executor_backend to pick the concrete class:
| distributed_executor_backend | Executor Class | File |
|---|---|---|
| "mp" | MultiprocExecutor | vllm/v1/executor/multiproc_executor.py |
| "ray" | RayDistributedExecutor | vllm/v1/executor/ray_executor.py |
| "uni" | UniProcExecutor | vllm/v1/executor/uniproc_executor.py |
| "external_launcher" | ExecutorWithExternalLauncher | vllm/v1/executor/uniproc_executor.py |
| A type subclassing Executor | That class directly | — |
| A fully-qualified string | Resolved via resolve_obj_by_qualname | — |
Sources: vllm/v1/executor/abstract.py46-86
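The fully-qualified-string case can be sketched as follows. This is an illustrative reimplementation of the qualname-resolution idea, not vLLM's exact code; the example string "collections.OrderedDict" is just a stand-in for a user-supplied executor class path.

```python
import importlib

def resolve_obj_by_qualname(qualname: str):
    """Resolve 'package.module.ClassName' to the named object."""
    module_name, _, attr_name = qualname.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)

# A backend string like "my_pkg.executors.CustomExecutor" would import
# my_pkg.executors and return its CustomExecutor attribute. Using a
# stdlib class here so the sketch is self-contained:
cls = resolve_obj_by_qualname("collections.OrderedDict")
```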
collective_rpc Interface

All executor implementations expose collective_rpc(), the core RPC primitive. It sends a method call (by name or as a serialized callable) with positional and keyword arguments to all workers, then collects a list of results.
collective_rpc(method, timeout, args, kwargs, non_block) -> list[result] | Future[list[result]]
Higher-level methods (execute_model, sample_tokens, initialize_from_config, etc.) are implemented in the base class by delegating to collective_rpc. Subclasses override them only when they need extra behavior (e.g., handling pipeline parallelism).
Sources: vllm/v1/executor/abstract.py141-191 vllm/v1/executor/abstract.py112-126
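The delegation pattern can be sketched with a toy executor. The classes and method bodies below are illustrative stand-ins, not vLLM's real signatures; they only show how a higher-level call reduces to one collective_rpc and how a single worker's non-None output is kept.

```python
class ToyWorker:
    """Stand-in worker: only the designated output rank returns a result."""
    def __init__(self, rank, output_rank):
        self.rank, self.output_rank = rank, output_rank

    def execute_model(self, scheduler_output):
        if self.rank == self.output_rank:
            return f"output-for-{scheduler_output}"
        return None

class ToyExecutor:
    def __init__(self, workers):
        self.workers = workers

    def collective_rpc(self, method, args=(), kwargs=None):
        # Invoke `method` on every worker, gather one result per worker.
        kwargs = kwargs or {}
        return [getattr(w, method)(*args, **kwargs) for w in self.workers]

    def execute_model(self, scheduler_output):
        # Higher-level method implemented purely via collective_rpc.
        outputs = self.collective_rpc("execute_model", args=(scheduler_output,))
        return next(o for o in outputs if o is not None)

executor = ToyExecutor([ToyWorker(r, output_rank=1) for r in range(2)])
result = executor.execute_model("step-0")
```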
| Method | Description |
|---|---|
| _init_executor() | Abstract. Creates and initializes worker processes/actors. |
| initialize_from_config(kv_cache_configs) | Sends KV cache config to workers, then triggers compile_or_warm_up_model. |
| determine_available_memory() | Calls determine_available_memory on all workers; returns list of bytes. |
| get_kv_cache_specs() | Returns KV cache spec dicts from all workers. |
| execute_model(scheduler_output) | Dispatches a forward pass to all workers. |
| sample_tokens(grammar_output) | Dispatches sampling after a deferred execute_model. |
| sleep(level) / wake_up(tags) | Offload/restore GPU memory (sleep mode). |
| check_health() | Abstract. Raises if executor is unhealthy. |
| shutdown() | Terminates all workers. |
Sources: vllm/v1/executor/abstract.py108-354
UniProcExecutor
Sources: vllm/v1/executor/uniproc_executor.py25-129
UniProcExecutor runs a single worker in the same process as the engine core. It is the default for single-GPU deployments (i.e., TP=1, PP=1, no data parallelism requiring separate processes).
- _init_executor() creates one WorkerWrapperBase(rpc_rank=0) and calls init_worker(), init_device(), and load_model() in-process.
- collective_rpc() calls run_method(self.driver_worker, method, args, kwargs) directly; no IPC, no serialization.
- When async_scheduling=True, a ThreadPoolExecutor (one thread) is created to handle AsyncModelRunnerOutput.get_output() off the main thread, allowing the scheduler to overlap with output processing.

ExecutorWithExternalLauncher
Sources: vllm/v1/executor/uniproc_executor.py131-177
A subclass of UniProcExecutor designed for torchrun-compatible launchers. Instead of one executor managing multiple workers, the user launches one engine per GPU. distributed_init_method is set to "env://" and rank/local_rank are read from RANK and LOCAL_RANK. determine_available_memory() performs an all_reduce across all ranks to pick the minimum available memory.
MultiprocExecutor
Sources: vllm/v1/executor/multiproc_executor.py95-466
MultiprocExecutor spawns one worker subprocess per local GPU. It is selected when distributed_executor_backend="mp" and is the standard backend for multi-GPU inference on a single node.
MultiprocExecutor initialization sequence
Sources: vllm/v1/executor/multiproc_executor.py102-225 vllm/v1/executor/multiproc_executor.py603-643 vllm/v1/executor/multiproc_executor.py712-810
Each worker subprocess has two MessageQueue channels (backed by shared memory):
| Queue | Direction | Purpose |
|---|---|---|
| rpc_broadcast_mq | Executor → all workers | Broadcast (method, args, kwargs, output_rank) |
| worker_response_mq | Worker → Executor | Return (status, result) for each call |
The executor's collective_rpc() (vllm/v1/executor/multiproc_executor.py317-389) enqueues a call tuple on rpc_broadcast_mq, then dequeues from the appropriate worker_response_mqs. When non_block=True, it returns a FutureWrapper and defers reading until .result() is called.
Workers run worker_busy_loop() (not fully shown in the file but invoked at vllm/v1/executor/multiproc_executor.py791) which continuously reads from rpc_broadcast_mq and dispatches execute_method().
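The broadcast/response flow can be sketched with threads and plain queues. This is a toy model only: real vLLM workers are subprocesses and the queues are shared-memory MessageQueues, and the dispatch here is a stand-in for execute_method().

```python
import queue
import threading

NUM_WORKERS = 2
broadcast_qs = [queue.Queue() for _ in range(NUM_WORKERS)]  # executor -> worker
response_qs = [queue.Queue() for _ in range(NUM_WORKERS)]   # worker -> executor

def worker_busy_loop(rank):
    # Continuously read broadcast calls, dispatch, report (status, result).
    while True:
        method, args = broadcast_qs[rank].get()
        if method == "shutdown":
            return
        result = (rank, method) + args          # stand-in for execute_method()
        response_qs[rank].put(("SUCCESS", result))

threads = [threading.Thread(target=worker_busy_loop, args=(r,), daemon=True)
           for r in range(NUM_WORKERS)]
for t in threads:
    t.start()

def collective_rpc(method, args=()):
    for q in broadcast_qs:                      # broadcast one call tuple
        q.put((method, args))
    results = []
    for q in response_qs:                       # gather one response per worker
        status, result = q.get()
        assert status == "SUCCESS"
        results.append(result)
    return results

results = collective_rpc("execute_model", ("step-0",))
for q in broadcast_qs:                          # tell workers to exit
    q.put(("shutdown", ()))
for t in threads:
    t.join()
```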
| Class | Purpose |
|---|---|
| UnreadyWorkerProcHandle | Holds proc, rank, ready_pipe, death_writer before READY signal |
| WorkerProcHandle | Holds proc, rank, worker_response_mq, peer_worker_response_mqs, death_writer after ready |
The death pipe (death_reader/death_writer) is a one-way pipe. The parent keeps death_writer open. When the parent process exits, the child gets EOFError on death_reader and shuts down cleanly.
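The underlying OS behavior can be demonstrated in a single process: a reader sees EOF as soon as every copy of the write end is closed, with no explicit message needed. Closing the write end below stands in for the parent process exiting.

```python
import os

# Create a one-way pipe: (read fd, write fd).
death_reader, death_writer = os.pipe()

# Simulate the parent exiting: the only write end is closed.
os.close(death_writer)

# The reader unblocks immediately with EOF (empty bytes).
data = os.read(death_reader, 1)
os.close(death_reader)
```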
A daemon thread (MultiprocWorkerMonitor) watches process sentinels. If any worker exits unexpectedly, it logs an error, calls shutdown(), and fires the registered FailureCallback to notify the engine core.
Sources: vllm/v1/executor/multiproc_executor.py246-283
Only one worker produces the ModelRunnerOutput — the first TP worker of the last PP stage. The executor calculates this as:
output_rank = world_size - tensor_parallel_size * prefill_context_parallel_size
Sources: vllm/v1/executor/multiproc_executor.py451-465
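A worked example of this formula, assuming a single node with 8 GPUs, PP=2, TP=4, and prefill_context_parallel_size=1 (the concrete numbers are illustrative):

```python
# world_size 8 = PP 2 stages x TP 4 workers; ranks 0..3 are PP stage 0,
# ranks 4..7 are the last PP stage.
world_size = 8
tensor_parallel_size = 4
prefill_context_parallel_size = 1

output_rank = world_size - tensor_parallel_size * prefill_context_parallel_size
# output_rank == 4: the first TP worker of the last PP stage.
```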
RayDistributedExecutor
Sources: vllm/v1/executor/ray_executor.py62-643
RayDistributedExecutor distributes workers as Ray actors and uses Ray's Compiled DAG for the execution path.
_init_workers_ray() (vllm/v1/executor/ray_executor.py160-398):

- Reads bundle_indices from the Ray placement group (or from VLLM_RAY_BUNDLE_INDICES).
- Creates RayWorkerWrapper remote actors, one per rank.
- Calls collective_rpc("adjust_rank", ...) to account for re-sorting.
- Sets CUDA_VISIBLE_DEVICES environment variables per node via collective_rpc("update_environment_variables", ...).
- Runs collective_rpc("init_worker", ...), "init_device", and "load_model" on all actors.
- Builds pp_tp_workers[pp_rank][tp_rank] for DAG construction.

On the first execute_model() call, _compiled_ray_dag() builds a Ray CompiledDAG (vllm/v1/executor/ray_executor.py542-635). It chains PP stages so intermediate tensors flow from PP rank 0 to PP rank N−1. The DAG is then called with forward_dag.execute((scheduler_output, grammar_output)).
Ray DAG structure for PP=2, TP=4:
Sources: vllm/v1/executor/ray_executor.py542-635
collective_rpc in Ray

collective_rpc() (vllm/v1/executor/ray_executor.py485-510) uses worker.execute_method.remote(method, *args, **kwargs) on every actor and blocks with ray.get(). Method callables are serialized with cloudpickle.
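cloudpickle is needed because stdlib pickle serializes functions by reference (module plus name), so it cannot ship ad-hoc callables such as lambdas to a remote actor. A small sketch of the limitation:

```python
import pickle

# Stdlib pickle fails on a lambda: it tries to serialize the function by
# reference and the lookup fails. cloudpickle serializes the code object
# itself, which is why it is used for method callables sent to workers.
try:
    pickle.dumps(lambda worker: worker.load_model())
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False
```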
WorkerBase
Sources: vllm/v1/worker/worker_base.py34-176
WorkerBase defines the interface every worker implementation must fulfill. It stores the decomposed VllmConfig fields and rank information.
WorkerBase interface summary
| Method | Description |
|---|---|
| init_device() | Set up the device (CUDA context, distributed process group, model runner). |
| load_model() | Load model weights onto the device. |
| determine_available_memory() | Profile peak memory use; return bytes available for KV cache. |
| get_kv_cache_spec() | Return dict[str, KVCacheSpec] for KV cache planning. |
| initialize_from_config(kv_cache_config) | Allocate KV cache and initialize KV transfer connectors. |
| compile_or_warm_up_model() | Capture CUDA graphs, run warmup iterations. Returns compilation time in seconds. |
| execute_model(scheduler_output) | Run the model forward pass. Returns ModelRunnerOutput or None. |
| sample_tokens(grammar_output) | Complete sampling if execute_model returned None. |
| check_health() | Liveness check. |
| shutdown() | Release resources. |
| add_lora / remove_lora / pin_lora / list_loras | LoRA adapter management. |
| sleep(level) / wake_up(tags) | GPU memory sleep/wake (only in Worker). |
Sources: vllm/v1/worker/worker_base.py34-176
WorkerWrapperBase
Sources: vllm/v1/worker/worker_base.py179-372
WorkerWrapperBase sits between an executor and a WorkerBase instance. Its roles are:
- Lazy initialization: the WorkerBase subclass is created only when init_worker() is called.
- Reads parallel_config.worker_cls (a fully-qualified class name string) and instantiates it via resolve_obj_by_qualname.
- If parallel_config.worker_extension_cls is set, dynamically inserts that class into worker_class.__bases__ to add methods without subclassing.
- Maintains an mm_receiver_cache for reading multimodal features from shared memory before forwarding to the worker.
- execute_method(method, *args, **kwargs) calls run_method(self, ...); __getattr__ transparently delegates any unhandled attribute to self.worker.

Key WorkerWrapperBase attributes:
- rpc_rank — index in executor's worker list
- global_rank — global rank in distributed group
- worker — the actual WorkerBase instance (set after init_worker())
- mm_receiver_cache — optional multimodal feature cache
Sources: vllm/v1/worker/worker_base.py179-372
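The lazy-init plus __getattr__ delegation behavior can be sketched with a toy wrapper. ToyWorker and its load_model return value are made up for the example; only the dispatch shape mirrors the text above.

```python
class ToyWorker:
    """Stand-in for a WorkerBase implementation."""
    def load_model(self):
        return "model loaded"

class WorkerWrapper:
    def __init__(self):
        self.worker = None            # created lazily by init_worker()

    def init_worker(self):
        self.worker = ToyWorker()

    def execute_method(self, method, *args, **kwargs):
        # Resolve on the wrapper first; unknown names fall through to
        # the wrapped worker via __getattr__.
        return getattr(self, method)(*args, **kwargs)

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails, so wrapper
        # attributes (like execute_method) always take precedence.
        return getattr(self.worker, name)

wrapper = WorkerWrapper()
wrapper.init_worker()
result = wrapper.execute_method("load_model")
```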
GPU Worker (Worker)
Sources: vllm/v1/worker/gpu_worker.py102-900
Worker is the concrete WorkerBase implementation for NVIDIA/AMD GPU execution. It coordinates between the distributed environment, the GPUModelRunner, and optional features like sleep mode and weight transfer.
Code entities in the GPU worker layer
Sources: vllm/v1/worker/gpu_worker.py102-150 vllm/v1/worker/worker_base.py34-80 vllm/v1/worker/worker_base.py179-215
Worker initialization sequence
Sources: vllm/v1/worker/gpu_worker.py219-315 vllm/v1/worker/gpu_worker.py319-343 vllm/v1/worker/gpu_worker.py350-429 vllm/v1/worker/gpu_worker.py462-481 vllm/v1/worker/gpu_worker.py482-608
init_device()
Sources: vllm/v1/worker/gpu_worker.py219-315
Key actions:
- Removes NCCL_ASYNC_ERROR_HANDLING from the environment (it conflicts with CUDA graph building).
- Adjusts local_rank to account for the data-parallel layout: local_rank += dp_local_rank * tp_pp_world_size.
- Calls init_worker_distributed_environment() to initialize the NCCL process group.
- Takes a MemorySnapshot after NCCL init (NCCL buffers affect available memory).
- Calls request_memory() to compute the GPU memory to reserve based on gpu_memory_utilization.
- Constructs the GPUModelRunner (V1 or V2 variant based on VLLM_USE_V2_MODEL_RUNNER).

determine_available_memory()
Sources: vllm/v1/worker/gpu_worker.py350-429
Runs model_runner.profile_run() under memory_profiling() context to measure peak activation memory, non-torch memory increases, and post-profile free memory. If kv_cache_memory_bytes is set directly in CacheConfig, the profiling step is skipped and that value is returned directly.
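The resulting arithmetic can be illustrated with made-up numbers. This is an accounting sketch only; the real memory_profiling() context measures these quantities from CUDA, and the exact terms vLLM subtracts may differ.

```python
GiB = 1024 ** 3

# Hypothetical measurements on an 80 GiB GPU.
total_gpu_memory = 80 * GiB
gpu_memory_utilization = 0.9      # fraction of the GPU vLLM may use
weights = 14 * GiB                # model weights loaded before profiling
peak_activation = 6 * GiB         # peak activations seen during profile_run()
non_torch_increase = 1 * GiB      # e.g. NCCL buffers, CUDA context growth

requested = int(total_gpu_memory * gpu_memory_utilization)   # 72 GiB budget
available_for_kv_cache = (
    requested - weights - peak_activation - non_torch_increase
)
# 72 - 14 - 6 - 1 = 51 GiB left for the KV cache.
```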
execute_model() and Pipeline Parallelism
Sources: vllm/v1/worker/gpu_worker.py658-747
For PP > 1:
- Non-first stages call get_pp_group().irecv_tensor_dict() to receive IntermediateTensors from the previous stage as AsyncIntermediateTensors (lazy synchronization via wait_for_comm() on first tensor access).
- Non-last stages send their outputs with get_pp_group().isend_tensor_dict() and store handles in _pp_send_work; these are waited on at the start of the next execute_model() call.
- Only the last PP stage returns a ModelRunnerOutput; others return None.

AsyncIntermediateTensors (vllm/v1/worker/gpu_worker.py70-99) wraps IntermediateTensors and defers handle.wait() calls until .tensors is first accessed, overlapping PP communication with other work.
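The lazy-synchronization idea can be sketched with a toy wrapper and a fake communication handle (both names below are made up; only the defer-until-first-access shape mirrors AsyncIntermediateTensors):

```python
class FakeHandle:
    """Stand-in for an async communication handle with a wait() method."""
    def __init__(self, log):
        self.log = log
    def wait(self):
        self.log.append("waited")

class LazyIntermediateTensors:
    def __init__(self, handles, tensors):
        self._handles = handles
        self._tensors = tensors
        self._synced = False

    @property
    def tensors(self):
        # First access drains all pending handles; later accesses are free.
        if not self._synced:
            for h in self._handles:
                h.wait()
            self._synced = True
        return self._tensors

log = []
lazy = LazyIntermediateTensors([FakeHandle(log)], {"hidden_states": "..."})
before_access = list(log)       # [] : nothing waited on yet
_ = lazy.tensors                # first access triggers wait()
_ = lazy.tensors                # second access does not wait again
```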
Sleep Mode
Sources: vllm/v1/worker/gpu_worker.py154-213
When enable_sleep_mode=True, Worker supports two sleep levels:
| Level | Behavior |
|---|---|
| 1 | Offload model weights to CPU via CuMemAllocator, keep KV cache on GPU. |
| 2 | Offload everything; save non-offloadable buffers to CPU before sleep, restore on wake. |
Weights are allocated inside _maybe_get_memory_pool_context(tag="weights") and KV cache inside tag="kv_cache" so the allocator can track and offload them independently.
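The tagging idea can be sketched with a toy tracker. Everything here (the pools dict, allocate()) is invented for illustration; the real CuMemAllocator tags CUDA memory-pool allocations, not Python objects.

```python
from collections import defaultdict
from contextlib import contextmanager

GiB = 1024 ** 3
pools = defaultdict(list)       # tag -> list of (name, nbytes)
_current_tag = [None]

@contextmanager
def memory_pool_context(tag):
    # Allocations made inside this context are attributed to `tag`,
    # so each group can later be offloaded or freed independently.
    _current_tag[0] = tag
    try:
        yield
    finally:
        _current_tag[0] = None

def allocate(name, nbytes):
    pools[_current_tag[0]].append((name, nbytes))

with memory_pool_context(tag="weights"):
    allocate("model.layers", 14 * GiB)
with memory_pool_context(tag="kv_cache"):
    allocate("kv.blocks", 50 * GiB)
```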
Executor backend selection and worker class resolution
Sources: vllm/v1/executor/abstract.py46-86 vllm/v1/worker/worker_base.py251-314
| Class | File | Role |
|---|---|---|
| Executor | vllm/v1/executor/abstract.py | Abstract base; owns collective_rpc contract |
| UniProcExecutor | vllm/v1/executor/uniproc_executor.py | In-process single-worker execution |
| ExecutorWithExternalLauncher | vllm/v1/executor/uniproc_executor.py | One worker per torchrun-launched engine |
| MultiprocExecutor | vllm/v1/executor/multiproc_executor.py | Multi-GPU via forked subprocesses + MessageQueue |
| RayDistributedExecutor | vllm/v1/executor/ray_executor.py | Multi-GPU/multi-node via Ray actors + compiled DAG |
| WorkerProc | vllm/v1/executor/multiproc_executor.py | Runs one worker in a subprocess; owns busy loop |
| WorkerProcHandle | vllm/v1/executor/multiproc_executor.py | Executor's handle to a ready WorkerProc |
| WorkerBase | vllm/v1/worker/worker_base.py | Abstract worker interface |
| WorkerWrapperBase | vllm/v1/worker/worker_base.py | Lazy init, method dispatch, extension injection |
| Worker | vllm/v1/worker/gpu_worker.py | Concrete GPU worker; drives GPUModelRunner |
| AsyncIntermediateTensors | vllm/v1/worker/gpu_worker.py | Lazy-sync PP intermediate tensors |