This document describes the process lifecycle management infrastructure in vLLM's distributed execution system. It covers the creation, initialization, monitoring, and shutdown of engine processes and API server processes. For information about the communication mechanisms used between these processes (ZMQ, ShmRingBuffer, MsgpackEncoder), see page 9.2. For information about KV cache transfer across disaggregated serving instances, see page 9.4.
vLLM uses multiprocessing to manage distributed execution across multiple GPUs and nodes. The engine process management system is responsible for creating, initializing, monitoring, and shutting down these engine and API server processes.
The system is built around a client/server split: the front-end (AsyncLLM or LLMEngine) communicates with one or more background EngineCoreProc instances over ZMQ. The EngineCoreClient abstraction unifies both in-process and multiprocessing modes.
Five main classes handle these responsibilities:
| Class | File | Role |
|---|---|---|
| `EngineCoreClient` | vllm/v1/engine/core_client.py | Abstract client interface |
| `InprocClient` | vllm/v1/engine/core_client.py | In-process `EngineCore` wrapper (`LLMEngine`/offline) |
| `SyncMPClient` / `AsyncMPClient` | vllm/v1/engine/core_client.py | ZMQ-based multiprocess clients (online serving) |
| `CoreEngineProcManager` | vllm/v1/engine/utils.py | Spawns local engine processes |
| `CoreEngineActorManager` | vllm/v1/engine/utils.py | Manages Ray actors across nodes |
Sources: vllm/v1/engine/core_client.py66-128 vllm/v1/engine/utils.py80-230
EngineCoreClient is an abstract base class defining the full API surface for interacting with an engine core. The concrete subclasses differ in how they communicate with EngineCore.
Class hierarchy diagram:
Sources: vllm/v1/engine/core_client.py66-128 vllm/v1/engine/core.py83-230
InprocClient runs EngineCore directly in the calling process. It is used by LLMEngine in non-multiprocessing mode and offline LLM usage.
- Calls `EngineCore.step_fn()` synchronously on `get_output()`
- Calls `EngineCore.preprocess_add_request()` and `EngineCore.add_request()` directly

Sources: vllm/v1/engine/core_client.py271-362
SyncMPClient (for LLMEngine) and AsyncMPClient (for AsyncLLM) communicate with one or more EngineCoreProc instances over ZMQ. The factory method EngineCoreClient.make_async_mp_client() selects the correct subclass:
| Condition | Subclass selected |
|---|---|
| `data_parallel_size == 1` | `AsyncMPClient` |
| `data_parallel_size > 1`, internal LB | `DPLBAsyncMPClient` |
| `data_parallel_size > 1`, external LB | `DPAsyncMPClient` |
Sources: vllm/v1/engine/core_client.py102-127
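The selection in the table above can be sketched as a small decision function. This is illustrative only: `select_client_class` is a hypothetical helper, not vLLM's actual `make_async_mp_client()` signature; only the class names come from the table.

```python
# Sketch of the AsyncMPClient subclass selection described above.
# `select_client_class` is a hypothetical helper for illustration.
def select_client_class(data_parallel_size: int, internal_lb: bool) -> str:
    """Return the client subclass name for the given DP configuration."""
    if data_parallel_size == 1:
        return "AsyncMPClient"
    # DP > 1: the load-balancing mode decides the subclass.
    return "DPLBAsyncMPClient" if internal_lb else "DPAsyncMPClient"
```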
EngineCoreClient selection flow:
Sources: vllm/v1/engine/core_client.py77-127
BackgroundResources is a dataclass used as a finalizer to ensure clean shutdown of ZMQ sockets, background tasks, and engine managers when the client is garbage-collected. It avoids circular references back to the client object.
Sources: vllm/v1/engine/core_client.py364-444
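The finalizer pattern behind `BackgroundResources` can be sketched with the standard library: the resources live in a separate callable object, so the `weakref.finalize` callback holds no reference back to the client and cannot keep it alive. The class names `Resources` and `DemoClient` below are hypothetical stand-ins.

```python
import weakref
from dataclasses import dataclass, field

# Illustrative sketch of the BackgroundResources pattern: the finalizer
# references only the resources object, never the client, avoiding the
# circular reference that would block garbage collection.
@dataclass
class Resources:
    closed: list = field(default_factory=list)  # stands in for ZMQ sockets/tasks

    def __call__(self) -> None:
        # Invoked by weakref.finalize when the owning client is collected.
        self.closed.append("sockets closed")

class DemoClient:
    def __init__(self, resources: Resources):
        self.resources = resources
        # Register cleanup without creating a reference cycle through self.
        self._finalizer = weakref.finalize(self, resources)

resources = Resources()
client = DemoClient(resources)
del client  # the finalizer runs and the resources are released
```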
EngineCore is the inner loop of vLLM's engine. It owns the executor, scheduler, structured output manager, and KV cache. Its step() method schedules and executes one model step.
Key initialization sequence in EngineCore.__init__():
- Creates the executor (`executor_class(vllm_config)`)
- Initializes the KV caches (`_initialize_kv_caches()`)
- Creates the `StructuredOutputManager`
- Creates the `Scheduler`
- Freezes the GC heap (`freeze_gc_heap()`) to reduce GC pause times
- Enables environment-variable caching (`enable_envs_cache()`)

Sources: vllm/v1/engine/core.py83-229
EngineCoreProc is the entry point for a background engine process. It wraps EngineCore with server-side ZMQ communication, allowing the front-end to send requests and receive outputs over sockets. It is the target_fn passed to CoreEngineProcManager.
The EngineCoreProc handles:
- Receiving `EngineCoreRequest` messages from the front-end over ZMQ
- Running the `EngineCore` step loop
- Sending `EngineCoreOutputs` back to the front-end

Sources: vllm/v1/engine/core.py vllm/v1/engine/core_client.py46-47
The CoreEngineProcManager class manages the lifecycle of local engine core processes using Python's multiprocessing library.
Sources: vllm/v1/engine/utils.py80-174
Key responsibilities:
- Creates one process per local data parallel rank, with unique process names like `EngineCore_DP0`, `EngineCore_DP1`, etc. vllm/v1/engine/utils.py119-129
- Registers `shutdown()` to run on garbage collection vllm/v1/engine/utils.py131

The constructor accepts:
| Parameter | Description |
|---|---|
| `target_fn` | Function to execute in each process (typically `EngineCoreProc.run_engine_core`) |
| `local_engine_count` | Number of engine processes to create locally |
| `start_index` | Global starting index for data parallel ranks |
| `local_start_index` | Local starting index (for multi-node setups) |
| `vllm_config` | Configuration object containing all settings |
| `local_client` | Whether the client is colocated with the engines |
| `handshake_address` | ZMQ address for handshaking |
| `executor_class` | Executor class to use for model execution |
| `log_stats` | Whether to enable statistics logging |
Sources: vllm/v1/engine/utils.py86-98
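The spawning behavior can be sketched with the standard library. This is a simplified illustration, not vLLM's implementation: the real manager also passes ZMQ addresses and the `vllm_config` to each process, and selects the multiprocessing start method from configuration (here `"fork"` is used to keep the sketch POSIX-simple); `spawn_engines` and `_engine_main` are hypothetical names.

```python
import multiprocessing as mp

def _engine_main(local_dp_rank: int, queue) -> None:
    # Stand-in for EngineCoreProc.run_engine_core: report rank and process name.
    queue.put((local_dp_rank, mp.current_process().name))

def spawn_engines(local_engine_count: int, local_start_index: int = 0):
    """Start one engine process per local DP rank, named EngineCore_DP<rank>."""
    ctx = mp.get_context("fork")  # assumption; vLLM picks the context from config
    queue = ctx.Queue()
    procs = []
    for i in range(local_engine_count):
        rank = local_start_index + i
        proc = ctx.Process(target=_engine_main,
                           name=f"EngineCore_DP{rank}",
                           args=(rank, queue))
        proc.start()
        procs.append(proc)
    results = sorted(queue.get() for _ in procs)  # blocks until all report
    for proc in procs:
        proc.join()
    return results
```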
The CoreEngineActorManager class manages Ray actors for distributed engine cores across multiple nodes.
Sources: vllm/v1/engine/utils.py227-358
Key differences from CoreEngineProcManager:
- Chooses between `EngineCoreActor` and `DPMoEEngineCoreActor` based on whether the model is MoE and DP > 1 vllm/v1/engine/utils.py254-258

The `create_dp_placement_groups()` method implements three packing strategies:
| Strategy | Description | Use Case |
|---|---|---|
| `strict` | Each DP group on separate nodes (STRICT_PACK) | Default, ensures isolation |
| `fill` | Fill nodes before moving to the next (STRICT_PACK) | Maximize node utilization |
| `span` | Spread DP groups across nodes (PACK) | Multi-node DP groups |
Sources: vllm/v1/engine/utils.py359-449
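The difference between the strategies can be illustrated as a pure-Python node-assignment decision. This is a sketch only: the real `create_dp_placement_groups()` builds Ray placement groups with STRICT_PACK/PACK bundles rather than computing node lists, and `assign_dp_groups` is a hypothetical helper.

```python
# Illustrative sketch of the three packing strategies from the table above.
def assign_dp_groups(strategy: str, num_groups: int, gpus_per_group: int,
                     node_gpus: list[int]) -> list[list[int]]:
    """Return, for each DP group, the list of node indices it occupies."""
    free = list(node_gpus)
    placements = []
    for _ in range(num_groups):
        if strategy == "strict":
            # Each DP group gets its own, previously untouched node.
            node = next(n for n in range(len(free))
                        if free[n] == node_gpus[n] and free[n] >= gpus_per_group)
            free[node] -= gpus_per_group
            placements.append([node])
        elif strategy == "fill":
            # Fill a node before moving on to the next one.
            node = next(n for n in range(len(free))
                        if free[n] >= gpus_per_group)
            free[node] -= gpus_per_group
            placements.append([node])
        else:  # "span": a single group may spread across nodes (PACK)
            need, used = gpus_per_group, []
            for n in range(len(free)):
                take = min(free[n], need)
                if take:
                    free[n] -= take
                    need -= take
                    used.append(n)
                if need == 0:
                    break
            if need:
                raise RuntimeError("not enough free GPUs across nodes")
            placements.append(used)
    return placements
```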
The APIServerProcessManager class manages multiple API server worker processes that handle client requests.
Sources: vllm/v1/utils.py159-225
Key characteristics:
- Spawns `num_servers` worker processes to handle concurrent requests vllm/v1/utils.py197-215
- Uses `multiprocessing.get_context("spawn")` for clean process isolation vllm/v1/utils.py194
- Names processes `ApiServer_0`, `ApiServer_1`, etc. vllm/v1/utils.py211

The `launch_core_engines()` function in vllm/v1/engine/utils.py is the primary entry point that orchestrates engine process creation. It is called by the MP clients (`SyncMPClient`, `AsyncMPClient`) and:
- Allocates ZMQ addresses via `get_engine_zmq_addresses()`
- Creates a `CoreEngineProcManager` (local multiprocessing) or a `CoreEngineActorManager` (Ray)

Sources: vllm/v1/engine/utils.py vllm/v1/engine/core_client.py46-53
The process creation flow varies depending on the execution backend:
Engine Process Initialization Sequence
Sources: vllm/v1/engine/utils.py100-154 vllm/v1/engine/core_client.py102-127
The initialization sequence:
- Calls `start()` on each process vllm/v1/engine/utils.py150

The handshaking protocol establishes communication between engine processes and client processes:
CoreEngine Handshake State Machine
Sources: vllm/v1/engine/utils.py37-51 vllm/v1/engine/utils.py69-78
The CoreEngine class (distinct from EngineCore) tracks handshake state per data parallel rank:
| State | Description |
|---|---|
| `NEW` | Process created but not yet connected |
| `CONNECTED` | Connected to the handshake socket |
| `READY` | Received configuration, ready to process requests |
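The state progression above can be sketched as a small enum with a linear transition table. The enum values mirror the states in the table; the `advance` helper is illustrative, not vLLM's API.

```python
import enum

# Sketch of the per-engine handshake state machine described above.
class CoreEngineState(enum.Enum):
    NEW = enum.auto()        # process created but not yet connected
    CONNECTED = enum.auto()  # connected to the handshake socket
    READY = enum.auto()      # received configuration, ready for requests

_NEXT = {
    CoreEngineState.NEW: CoreEngineState.CONNECTED,
    CoreEngineState.CONNECTED: CoreEngineState.READY,
}

def advance(state: CoreEngineState) -> CoreEngineState:
    """Move to the next handshake state; READY is terminal."""
    return _NEXT.get(state, state)
```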
EngineHandshakeMetadata is the data sent from the client to each engine process during the handshake, specifying ZMQ socket addresses and parallel config:
| Field | Type | Content |
|---|---|---|
| `addresses` | `EngineZmqAddresses` | Input/output ZMQ socket addresses per client, coordinator addresses |
| `parallel_config` | `dict[str, int \| str \| list[int]]` | DP/TP/PP rank configuration |
Sources: vllm/v1/engine/utils.py53-78
EngineZmqAddresses carries the full set of ZMQ socket paths:
| Field | Description |
|---|---|
| `inputs` | Per-client ZMQ input socket addresses (requests) |
| `outputs` | Per-client ZMQ output socket addresses (responses) |
| `coordinator_input` | DP coordinator input address (if applicable) |
| `coordinator_output` | DP coordinator output address (if applicable) |
| `frontend_stats_publish_address` | Stats publish address for the external-LB case |
Sources: vllm/v1/engine/utils.py53-67
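The fields in the table above can be sketched as a dataclass. The field names match the table; the concrete types and the example addresses are assumptions for illustration, not vLLM's exact definitions.

```python
from dataclasses import dataclass, field
from typing import Optional

# Sketch of the ZMQ address bundle described above (types are assumed).
@dataclass
class EngineZmqAddresses:
    inputs: list[str] = field(default_factory=list)    # per-client request sockets
    outputs: list[str] = field(default_factory=list)   # per-client response sockets
    coordinator_input: Optional[str] = None            # DP coordinator, if any
    coordinator_output: Optional[str] = None
    frontend_stats_publish_address: Optional[str] = None

# Hypothetical IPC addresses, just to show the shape of the data.
addrs = EngineZmqAddresses(
    inputs=["ipc:///tmp/engine_in_0"],
    outputs=["ipc:///tmp/engine_out_0"],
)
```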
The wait_for_completion_or_failure() function in vllm/v1/utils.py monitors all processes and detects failures:
Process Monitoring Flow
Sources: vllm/v1/utils.py227-298
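A minimal sketch of this sentinel-based monitoring, using only the standard library: wait on process sentinels with a timeout and raise on the first non-zero exit. `watch` is a hypothetical helper; the real function also polls Ray actors and the engine managers.

```python
import multiprocessing as mp
from multiprocessing import connection

# Sketch of sentinel-based failure detection, simplified from the
# monitoring loop described in this section.
def watch(procs: list, timeout: float = 5.0) -> None:
    """Block until all processes exit; raise if any exits non-zero."""
    sentinels = {p.sentinel: p for p in procs}
    while sentinels:
        # connection.wait returns the sentinels that became ready (or [] on timeout).
        ready = connection.wait(list(sentinels), timeout=timeout)
        for s in ready:
            p = sentinels.pop(s)
            p.join()
            if p.exitcode != 0:
                raise RuntimeError(
                    f"Process {p.name} died with exit code {p.exitcode}")
```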
Key monitoring features:
- Waits on process sentinels via `connection.wait()` with a 5-second timeout vllm/v1/utils.py268
- Checks Ray actors via `ray.wait()` vllm/v1/utils.py281-284
- Raises `RuntimeError` with the process name and exit code on failure vllm/v1/utils.py275-279

The shutdown process follows a graceful termination strategy with a forceful fallback:
Shutdown sequence (shutdown() in vllm/v1/utils.py)
Sources: vllm/v1/utils.py302-320
The shutdown sequence:
- Calls `terminate()` on each process vllm/v1/utils.py304-306
- Falls back to `kill_process_tree()` for any remaining processes vllm/v1/utils.py317-319

This shutdown function is registered as a `weakref.finalize` callback on manager objects to ensure cleanup on garbage collection vllm/v1/engine/utils.py131
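The terminate-then-kill pattern can be sketched as follows. This is a simplified illustration: vLLM's `kill_process_tree()` also walks child processes, which this sketch does not attempt.

```python
import multiprocessing as mp

# Sketch of graceful termination with a forceful fallback, as described above.
def shutdown(procs: list, grace_period: float = 5.0) -> None:
    for p in procs:
        if p.is_alive():
            p.terminate()          # graceful: send SIGTERM first
    for p in procs:
        p.join(timeout=grace_period)
    for p in procs:
        if p.is_alive():           # forceful fallback: SIGKILL stragglers
            p.kill()
            p.join()
```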
For data parallel deployments, each engine process must have exclusive access to its assigned GPUs. The device control system manages this assignment.
The get_device_indices() function computes which physical devices each process should use:
get_device_indices() computation
Sources: vllm/v1/engine/utils.py193-224
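The computation can be sketched as assigning each local DP rank a contiguous block of `world_size` logical device indices, consistent with the worked example that follows. The signature below is a hypothetical simplification; the real `get_device_indices()` also maps logical to physical IDs via the platform layer.

```python
# Sketch of the per-rank device-index computation (simplified signature).
def get_device_indices(world_size: int, local_dp_rank: int) -> list[int]:
    """Contiguous block of logical device indices for one local DP rank."""
    start = local_dp_rank * world_size
    return list(range(start, start + world_size))
```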
Example for world_size=2, local_dp_rank=1, 4 total devices:
- Device indices: `[2, 3]`
- Mapped to physical device IDs via `current_platform.device_id_to_physical_device_id()`
- Environment variable set to `"2,3"`

The `set_device_control_env_var()` context manager temporarily sets device visibility before `Process.start()`:
| Platform | Environment Variable | Purpose |
|---|---|---|
| CUDA | CUDA_VISIBLE_DEVICES | Restrict visible GPUs |
| ROCm | HIP_VISIBLE_DEVICES | Restrict visible AMD GPUs |
| XPU | ZE_AFFINITY_MASK | Restrict visible Intel GPUs |
Sources: vllm/v1/engine/utils.py176-190
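A minimal sketch of such a visibility-setting context manager, using the CUDA variable from the table. The name `device_control_env`, its parameters, and the restore-on-exit behavior are assumptions for illustration, not vLLM's exact implementation.

```python
import os
from contextlib import contextmanager

# Sketch of a device-visibility context manager like set_device_control_env_var.
@contextmanager
def device_control_env(device_indices: list, var: str = "CUDA_VISIBLE_DEVICES"):
    prev = os.environ.get(var)
    # Restrict visibility to the assigned devices, e.g. "2,3".
    os.environ[var] = ",".join(str(i) for i in device_indices)
    try:
        yield
    finally:
        # Restore the previous value (assumption about the implementation).
        if prev is None:
            os.environ.pop(var, None)
        else:
            os.environ[var] = prev
```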
The context manager:
- Computes device indices via `get_device_indices()` and sets the platform's visibility variable before process startup

It is only applied when:
- Data parallelism is in use (`dp_size > 1`); otherwise the worker selects its device with `torch.cuda.set_device()` instead vllm/v1/engine/utils.py136-149

The complete architecture, showing how all components interact:
Sources: vllm/v1/utils.py227-298 vllm/v1/engine/utils.py80-622
Key interactions:
- `AsyncLLM` or `LLMEngine` creates the appropriate process managers based on configuration
- `wait_for_completion_or_failure()` monitors all process sentinels
- `DPCoordinator` manages synchronization between DP ranks (see Data Parallel Coordination)
- `shutdown()` runs on cleanup, either explicitly or via finalizers

This architecture enables vLLM to scale from single-process inference to distributed multi-node deployments while maintaining clean process lifecycle management.