This document describes the process lifecycle management infrastructure in vLLM's distributed execution system. It covers the creation, initialization, monitoring, and shutdown of engine processes and API server processes. For information about the communication mechanisms used between these processes (ZMQ, ShmRingBuffer, MsgpackEncoder), see page 9.2. For information about KV cache transfer across disaggregated serving instances, see page 9.4.
vLLM uses multiprocessing to manage distributed execution across multiple GPUs and nodes. The engine process management system is responsible for creating, initializing, monitoring, and shutting down these engine and API server processes.
The system is built around a client/server split: the front-end (AsyncLLM or LLMEngine) communicates with one or more background EngineCoreProc instances over ZMQ. The EngineCoreClient abstraction unifies both in-process and multiprocessing modes.
Five main classes handle these responsibilities:
| Class | File | Role |
|---|---|---|
| `EngineCoreClient` | vllm/v1/engine/core_client.py | Abstract client interface |
| `InprocClient` | vllm/v1/engine/core_client.py | In-process `EngineCore` wrapper (`LLMEngine`/offline) |
| `SyncMPClient` / `AsyncMPClient` | vllm/v1/engine/core_client.py | ZMQ-based multiprocess clients (online serving) |
| `CoreEngineProcManager` | vllm/v1/engine/utils.py | Spawns local engine processes |
| `CoreEngineActorManager` | vllm/v1/engine/utils.py | Manages Ray actors across nodes |
Sources: vllm/v1/engine/core_client.py66-128 vllm/v1/engine/utils.py80-230
EngineCoreClient is an abstract base class defining the full API surface for interacting with an engine core. The concrete subclasses differ in how they communicate with EngineCore.
Class hierarchy diagram:
Sources: vllm/v1/engine/core_client.py66-128 vllm/v1/engine/core.py83-230
InprocClient runs EngineCore directly in the calling process. It is used by LLMEngine in non-multiprocessing mode and offline LLM usage.
- Calls `EngineCore.step_fn()` synchronously on `get_output()`
- Calls `EngineCore.preprocess_add_request()` and `EngineCore.add_request()` directly

Sources: vllm/v1/engine/core_client.py271-362
SyncMPClient (for LLMEngine) and AsyncMPClient (for AsyncLLM) communicate with one or more EngineCoreProc instances over ZMQ. The factory method EngineCoreClient.make_async_mp_client() selects the correct subclass:
| Condition | Subclass selected |
|---|---|
| `data_parallel_size == 1` | `AsyncMPClient` |
| `data_parallel_size > 1`, internal LB | `DPLBAsyncMPClient` |
| `data_parallel_size > 1`, external LB | `DPAsyncMPClient` |
Sources: vllm/v1/engine/core_client.py102-127
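The selection in the table above can be sketched as a small decision function. This is illustrative only: `select_client_class` is a hypothetical helper, not vLLM's actual `make_async_mp_client()` signature; only the class names come from the table.

```python
# Sketch of the AsyncMPClient subclass selection described above.
# `select_client_class` is a hypothetical helper for illustration.
def select_client_class(data_parallel_size: int, internal_lb: bool) -> str:
    """Return the client subclass name for the given DP configuration."""
    if data_parallel_size == 1:
        return "AsyncMPClient"
    # DP > 1: the load-balancing mode decides the subclass.
    return "DPLBAsyncMPClient" if internal_lb else "DPAsyncMPClient"
```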
EngineCoreClient selection flow:
Sources: vllm/v1/engine/core_client.py77-127
BackgroundResources is a dataclass used as a finalizer to ensure clean shutdown of ZMQ sockets, background tasks, and engine managers when the client is garbage-collected. It avoids circular references back to the client object.
Sources: vllm/v1/engine/core_client.py364-444
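The finalizer pattern behind `BackgroundResources` can be sketched with the standard library: the resources live in a separate callable object, so the `weakref.finalize` callback holds no reference back to the client and cannot keep it alive. The class names `Resources` and `DemoClient` below are hypothetical stand-ins.

```python
import weakref
from dataclasses import dataclass, field

# Illustrative sketch of the BackgroundResources pattern: the finalizer
# references only the resources object, never the client, avoiding the
# circular reference that would block garbage collection.
@dataclass
class Resources:
    closed: list = field(default_factory=list)  # stands in for ZMQ sockets/tasks

    def __call__(self) -> None:
        # Invoked by weakref.finalize when the owning client is collected.
        self.closed.append("sockets closed")

class DemoClient:
    def __init__(self, resources: Resources):
        self.resources = resources
        # Register cleanup without creating a reference cycle through self.
        self._finalizer = weakref.finalize(self, resources)

resources = Resources()
client = DemoClient(resources)
del client  # the finalizer runs and the resources are released
```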
EngineCore is the inner loop of vLLM's engine. It owns the executor, scheduler, structured output manager, and KV cache. Its step() method schedules and executes one model step.
Key initialization sequence in EngineCore.__init__():
- Creates the executor (`executor_class(vllm_config)`)
- Initializes the KV caches (`_initialize_kv_caches()`)
- Creates the `StructuredOutputManager`
- Creates the `Scheduler`
- Freezes the GC heap (`freeze_gc_heap()`) to reduce GC pause times
- Enables environment-variable caching (`enable_envs_cache()`)

Sources: vllm/v1/engine/core.py83-229
EngineCoreProc is the entry point for a background engine process. It wraps EngineCore with server-side ZMQ communication, allowing the front-end to send requests and receive outputs over sockets. It is the target_fn passed to CoreEngineProcManager.
The EngineCoreProc handles:
- Receiving `EngineCoreRequest` messages from the front-end over ZMQ
- Running the `EngineCore` step loop
- Sending `EngineCoreOutputs` back to the front-end

Sources: vllm/v1/engine/core.py vllm/v1/engine/core_client.py46-47
The CoreEngineProcManager class manages the lifecycle of local engine core processes using Python's multiprocessing library.
Sources: vllm/v1/engine/utils.py80-174
Key responsibilities:
- Creates one process per local data parallel rank, with unique process names like `EngineCore_DP0`, `EngineCore_DP1`, etc. vllm/v1/engine/utils.py119-129
- Registers `shutdown()` to run on garbage collection vllm/v1/engine/utils.py131

The constructor accepts:
| Parameter | Description |
|---|---|
| `target_fn` | Function to execute in each process (typically `EngineCoreProc.run_engine_core`) |
| `local_engine_count` | Number of engine processes to create locally |
| `start_index` | Global starting index for data parallel ranks |
| `local_start_index` | Local starting index (for multi-node setups) |
| `vllm_config` | Configuration object containing all settings |
| `local_client` | Whether the client is colocated with the engines |
| `handshake_address` | ZMQ address for handshaking |
| `executor_class` | Executor class to use for model execution |
| `log_stats` | Whether to enable statistics logging |
Sources: vllm/v1/engine/utils.py86-98
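The spawning behavior can be sketched with the standard library. This is a simplified illustration, not vLLM's implementation: the real manager also passes ZMQ addresses and the `vllm_config` to each process, and selects the multiprocessing start method from configuration (here `"fork"` is used to keep the sketch POSIX-simple); `spawn_engines` and `_engine_main` are hypothetical names.

```python
import multiprocessing as mp

def _engine_main(local_dp_rank: int, queue) -> None:
    # Stand-in for EngineCoreProc.run_engine_core: report rank and process name.
    queue.put((local_dp_rank, mp.current_process().name))

def spawn_engines(local_engine_count: int, local_start_index: int = 0):
    """Start one engine process per local DP rank, named EngineCore_DP<rank>."""
    ctx = mp.get_context("fork")  # assumption; vLLM picks the context from config
    queue = ctx.Queue()
    procs = []
    for i in range(local_engine_count):
        rank = local_start_index + i
        proc = ctx.Process(target=_engine_main,
                           name=f"EngineCore_DP{rank}",
                           args=(rank, queue))
        proc.start()
        procs.append(proc)
    results = sorted(queue.get() for _ in procs)  # blocks until all report
    for proc in procs:
        proc.join()
    return results
```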
The CoreEngineActorManager class manages Ray actors for distributed engine cores across multiple nodes.
Sources: vllm/v1/engine/utils.py227-358
Key differences from CoreEngineProcManager:
- Chooses between `EngineCoreActor` and `DPMoEEngineCoreActor` based on whether the model is MoE and DP > 1 vllm/v1/engine/utils.py254-258

The `create_dp_placement_groups()` method implements three packing strategies:
| Strategy | Description | Use Case |
|---|---|---|
| `strict` | Each DP group on separate nodes (STRICT_PACK) | Default, ensures isolation |
| `fill` | Fill nodes before moving to the next (STRICT_PACK) | Maximize node utilization |
| `span` | Spread DP groups across nodes (PACK) | Multi-node DP groups |
Sources: vllm/v1/engine/utils.py359-449
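The difference between the strategies can be illustrated as a pure-Python node-assignment decision. This is a sketch only: the real `create_dp_placement_groups()` builds Ray placement groups with STRICT_PACK/PACK bundles rather than computing node lists, and `assign_dp_groups` is a hypothetical helper.

```python
# Illustrative sketch of the three packing strategies from the table above.
def assign_dp_groups(strategy: str, num_groups: int, gpus_per_group: int,
                     node_gpus: list[int]) -> list[list[int]]:
    """Return, for each DP group, the list of node indices it occupies."""
    free = list(node_gpus)
    placements = []
    for _ in range(num_groups):
        if strategy == "strict":
            # Each DP group gets its own, previously untouched node.
            node = next(n for n in range(len(free))
                        if free[n] == node_gpus[n] and free[n] >= gpus_per_group)
            free[node] -= gpus_per_group
            placements.append([node])
        elif strategy == "fill":
            # Fill a node before moving on to the next one.
            node = next(n for n in range(len(free))
                        if free[n] >= gpus_per_group)
            free[node] -= gpus_per_group
            placements.append([node])
        else:  # "span": a single group may spread across nodes (PACK)
            need, used = gpus_per_group, []
            for n in range(len(free)):
                take = min(free[n], need)
                if take:
                    free[n] -= take
                    need -= take
                    used.append(n)
                if need == 0:
                    break
            if need:
                raise RuntimeError("not enough free GPUs across nodes")
            placements.append(used)
    return placements
```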
The APIServerProcessManager class manages multiple API server worker processes that handle client requests.
Sources: vllm/v1/utils.py159-225
Key characteristics:
- Spawns `num_servers` worker processes to handle concurrent requests vllm/v1/utils.py197-215
- Uses `multiprocessing.get_context("spawn")` for clean process isolation vllm/v1/utils.py194
- Names processes `ApiServer_0`, `ApiServer_1`, etc. vllm/v1/utils.py211

The `launch_core_engines()` function in vllm/v1/engine/utils.py is the primary entry point that orchestrates engine process creation. It is called by the MP clients (`SyncMPClient`, `AsyncMPClient`) and:
- Allocates ZMQ addresses via `get_engine_zmq_addresses()`
- Creates a `CoreEngineProcManager` (local multiprocessing) or a `CoreEngineActorManager` (Ray)

Sources: vllm/v1/engine/utils.py vllm/v1/engine/core_client.py46-53
The process creation flow varies depending on the execution backend:
Engine Process Initialization Sequence
Sources: vllm/v1/engine/utils.py100-154 vllm/v1/engine/core_client.py102-127
The initialization sequence:
- Calls `start()` on each process vllm/v1/engine/utils.py150

The handshaking protocol establishes communication between engine processes and client processes:
CoreEngine Handshake State Machine
Sources: vllm/v1/engine/utils.py37-51 vllm/v1/engine/utils.py69-78
The CoreEngine class (distinct from EngineCore) tracks handshake state per data parallel rank:
| State | Description |
|---|---|
| `NEW` | Process created but not yet connected |
| `CONNECTED` | Connected to the handshake socket |
| `READY` | Received configuration, ready to process requests |
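The state progression above can be sketched as a small enum with a linear transition table. The enum values mirror the states in the table; the `advance` helper is illustrative, not vLLM's API.

```python
import enum

# Sketch of the per-engine handshake state machine described above.
class CoreEngineState(enum.Enum):
    NEW = enum.auto()        # process created but not yet connected
    CONNECTED = enum.auto()  # connected to the handshake socket
    READY = enum.auto()      # received configuration, ready for requests

_NEXT = {
    CoreEngineState.NEW: CoreEngineState.CONNECTED,
    CoreEngineState.CONNECTED: CoreEngineState.READY,
}

def advance(state: CoreEngineState) -> CoreEngineState:
    """Move to the next handshake state; READY is terminal."""
    return _NEXT.get(state, state)
```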
EngineHandshakeMetadata is the data sent from the client to each engine process during the handshake, specifying ZMQ socket addresses and parallel config:
| Field | Type | Content |
|---|---|---|
| `addresses` | `EngineZmqAddresses` | Input/output ZMQ socket addresses per client, coordinator addresses |
| `parallel_config` | `dict[str, int \| str \| list[int]]` | DP/TP/PP rank configuration |
Sources: vllm/v1/engine/utils.py53-78
EngineZmqAddresses carries the full set of ZMQ socket paths:
| Field | Description |
|---|---|
| `inputs` | Per-client ZMQ input socket addresses (requests) |
| `outputs` | Per-client ZMQ output socket addresses (responses) |
| `coordinator_input` | DP coordinator input address (if applicable) |
| `coordinator_output` | DP coordinator output address (if applicable) |
| `frontend_stats_publish_address` | Stats publish address for the external-LB case |
Sources: vllm/v1/engine/utils.py53-67
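The fields in the table above can be sketched as a dataclass. The field names match the table; the concrete types and the example addresses are assumptions for illustration, not vLLM's exact definitions.

```python
from dataclasses import dataclass, field
from typing import Optional

# Sketch of the ZMQ address bundle described above (types are assumed).
@dataclass
class EngineZmqAddresses:
    inputs: list[str] = field(default_factory=list)    # per-client request sockets
    outputs: list[str] = field(default_factory=list)   # per-client response sockets
    coordinator_input: Optional[str] = None            # DP coordinator, if any
    coordinator_output: Optional[str] = None
    frontend_stats_publish_address: Optional[str] = None

# Hypothetical IPC addresses, just to show the shape of the data.
addrs = EngineZmqAddresses(
    inputs=["ipc:///tmp/engine_in_0"],
    outputs=["ipc:///tmp/engine_out_0"],
)
```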
The wait_for_completion_or_failure() function in vllm/v1/utils.py monitors all processes and detects failures:
Process Monitoring Flow
Sources: vllm/v1/utils.py227-298
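A minimal sketch of this sentinel-based monitoring, using only the standard library: wait on process sentinels with a timeout and raise on the first non-zero exit. `watch` is a hypothetical helper; the real function also polls Ray actors and the engine managers.

```python
import multiprocessing as mp
from multiprocessing import connection

# Sketch of sentinel-based failure detection, simplified from the
# monitoring loop described in this section.
def watch(procs: list, timeout: float = 5.0) -> None:
    """Block until all processes exit; raise if any exits non-zero."""
    sentinels = {p.sentinel: p for p in procs}
    while sentinels:
        # connection.wait returns the sentinels that became ready (or [] on timeout).
        ready = connection.wait(list(sentinels), timeout=timeout)
        for s in ready:
            p = sentinels.pop(s)
            p.join()
            if p.exitcode != 0:
                raise RuntimeError(
                    f"Process {p.name} died with exit code {p.exitcode}")
```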
Key monitoring features:
- Waits on process sentinels via `connection.wait()` with a 5-second timeout vllm/v1/utils.py268
- Checks Ray actors via `ray.wait()` vllm/v1/utils.py281-284
- Raises `RuntimeError` with the process name and exit code on failure vllm/v1/utils.py275-279

The shutdown process follows a graceful termination strategy with a forceful fallback:
Shutdown sequence (shutdown() in vllm/v1/utils.py)
Sources: vllm/v1/utils.py302-320
The shutdown sequence:
- Calls `terminate()` on each process vllm/v1/utils.py304-306
- Falls back to `kill_process_tree()` for any remaining processes vllm/v1/utils.py317-319

This shutdown function is registered as a `weakref.finalize` callback on manager objects to ensure cleanup on garbage collection vllm/v1/engine/utils.py131
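The terminate-then-kill pattern can be sketched as follows. This is a simplified illustration: vLLM's `kill_process_tree()` also walks child processes, which this sketch does not attempt.

```python
import multiprocessing as mp

# Sketch of graceful termination with a forceful fallback, as described above.
def shutdown(procs: list, grace_period: float = 5.0) -> None:
    for p in procs:
        if p.is_alive():
            p.terminate()          # graceful: send SIGTERM first
    for p in procs:
        p.join(timeout=grace_period)
    for p in procs:
        if p.is_alive():           # forceful fallback: SIGKILL stragglers
            p.kill()
            p.join()
```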
For data parallel deployments, each engine process must have exclusive access to its assigned GPUs. The device control system manages this assignment.
The get_device_indices() function computes which physical devices each process should use:
get_device_indices() computation
Sources: vllm/v1/engine/utils.py193-224
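The computation can be sketched as assigning each local DP rank a contiguous block of `world_size` logical device indices, consistent with the worked example that follows. The signature below is a hypothetical simplification; the real `get_device_indices()` also maps logical to physical IDs via the platform layer.

```python
# Sketch of the per-rank device-index computation (simplified signature).
def get_device_indices(world_size: int, local_dp_rank: int) -> list[int]:
    """Contiguous block of logical device indices for one local DP rank."""
    start = local_dp_rank * world_size
    return list(range(start, start + world_size))
```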
Example for world_size=2, local_dp_rank=1, 4 total devices:
- Device indices: `[2, 3]`
- Mapped to physical device IDs via `current_platform.device_id_to_physical_device_id()`
- Environment variable set to `"2,3"`

The `set_device_control_env_var()` context manager temporarily sets device visibility before `Process.start()`:
| Platform | Environment Variable | Purpose |
|---|---|---|
| CUDA | CUDA_VISIBLE_DEVICES | Restrict visible GPUs |
| ROCm | HIP_VISIBLE_DEVICES | Restrict visible AMD GPUs |
| XPU | ZE_AFFINITY_MASK | Restrict visible Intel GPUs |
Sources: vllm/v1/engine/utils.py176-190
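A minimal sketch of such a visibility-setting context manager, using the CUDA variable from the table. The name `device_control_env`, its parameters, and the restore-on-exit behavior are assumptions for illustration, not vLLM's exact implementation.

```python
import os
from contextlib import contextmanager

# Sketch of a device-visibility context manager like set_device_control_env_var.
@contextmanager
def device_control_env(device_indices: list, var: str = "CUDA_VISIBLE_DEVICES"):
    prev = os.environ.get(var)
    # Restrict visibility to the assigned devices, e.g. "2,3".
    os.environ[var] = ",".join(str(i) for i in device_indices)
    try:
        yield
    finally:
        # Restore the previous value (assumption about the implementation).
        if prev is None:
            os.environ.pop(var, None)
        else:
            os.environ[var] = prev
```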
The context manager:
- Computes device indices via `get_device_indices()` and sets the platform's visibility variable before process startup

It is only applied when:
- Data parallelism is in use (`dp_size > 1`); otherwise the worker selects its device with `torch.cuda.set_device()` instead vllm/v1/engine/utils.py136-149

The complete architecture, showing how all components interact:
Sources: vllm/v1/utils.py227-298 vllm/v1/engine/utils.py80-622
Key interactions:
- `AsyncLLM` or `LLMEngine` creates the appropriate process managers based on configuration
- `wait_for_completion_or_failure()` monitors all process sentinels
- `DPCoordinator` manages synchronization between DP ranks (see Data Parallel Coordination)
- `shutdown()` runs on cleanup, either explicitly or via finalizers

This architecture enables vLLM to scale from single-process inference to distributed multi-node deployments while maintaining clean process lifecycle management.