This page provides an overview of vLLM's distributed execution system, which enables scaling inference across multiple GPUs and nodes. This includes process/actor management, parallelism strategies, inter-process communication, and coordination mechanisms.
vLLM's distributed execution system orchestrates multiple inference workers across GPUs and nodes to serve large language models efficiently. The system supports multiple deployment modes and parallelism strategies, coordinating execution through a combination of process managers, communication channels, and synchronization primitives.
Sources: vllm/v1/engine/utils.py:80-175 vllm/v1/engine/coordinator.py:22-106 vllm/v1/utils.py:159-225
vLLM supports multiple distributed executor backends, selected via distributed_executor_backend in ParallelConfig:
| Backend | Mode | Use Case |
|---|---|---|
| mp | Multiprocessing | Single-node or multi-node with explicit node configuration |
| ray | Ray actors | Multi-node deployments with dynamic resource management |
| uni | Single process | Debugging or XPU SPMD mode |
| external_launcher | External orchestration | Integration with external launch systems (e.g., torchrun) |
The backend selection logic resides in vllm/config/parallel.py:600-660. For single-node deployments where world_size fits on the available GPUs, mp is preferred. For multi-node deployments, or when Ray is already initialized, ray is used. TPU platforms default to uni when using SPMD mode.
Sources: vllm/config/parallel.py:201-211 vllm/config/parallel.py:600-660
The CoreEngineProcManager class manages background engine processes for local data parallel ranks. It spawns processes using Python's multiprocessing, handles device control via environment variables, and monitors process health.
Key responsibilities:
- Spawning engine processes via get_mp_context().Process()
- Controlling device visibility through CUDA_VISIBLE_DEVICES or platform-specific variables
- Registering cleanup with weakref.finalize()

Sources: vllm/v1/engine/utils.py:80-175
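A minimal sketch of this lifecycle pattern, assuming a hypothetical MiniProcManager; the real CoreEngineProcManager also wires up ZMQ addresses and monitors process health:

```python
import multiprocessing
import weakref

def _engine_main(local_dp_rank: int) -> None:
    """Placeholder for the engine-core busy loop (hypothetical)."""

class MiniProcManager:
    """Toy sketch of CoreEngineProcManager: one child per local DP rank."""

    def __init__(self, local_dp_ranks):
        ctx = multiprocessing.get_context("spawn")  # stands in for get_mp_context()
        self.processes = [
            ctx.Process(target=_engine_main, args=(rank,), daemon=True)
            for rank in local_dp_ranks
        ]
        # weakref.finalize guarantees the children are terminated even if
        # the manager is garbage collected without an explicit close().
        self.finalizer = weakref.finalize(
            self, MiniProcManager._shutdown, self.processes)

    @staticmethod
    def _shutdown(processes):
        for proc in processes:
            if proc.is_alive():
                proc.terminate()
```

Using a static method for the finalizer callback avoids capturing `self`, which would otherwise keep the manager alive and defeat the weakref.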
For Ray-based deployments, CoreEngineActorManager manages EngineCoreActor instances across nodes. It creates placement groups, handles actor lifecycle, and coordinates initialization.
Placement group creation:
- Supports strict, fill, and span packing strategies via VLLM_RAY_DP_PACK_STRATEGY
- Allocates world_size + 1 bundles per DP rank (workers + engine core)

Sources: vllm/v1/engine/utils.py:227-498
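The bundle arithmetic can be illustrated without Ray itself; the resource shapes below ({"GPU": 1} per worker, a CPU-only bundle for the engine core) are assumptions for the sketch, not the exact shapes vLLM requests:

```python
def placement_bundles(world_size: int, dp_size: int):
    """Per DP rank: world_size worker bundles plus one engine-core bundle."""
    # Resource shapes are illustrative; the real code derives them from config.
    per_rank = [{"GPU": 1}] * world_size + [{"CPU": 1}]
    return [list(per_rank) for _ in range(dp_size)]
```

With world_size = 2 and four DP ranks, this yields four groups of three bundles each, matching the world_size + 1 rule above.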
The APIServerProcessManager spawns multiple API server processes to handle client connections, each with its own ZMQ client connecting to engine cores.
Configuration:
- num_servers processes created via the spawn context
- input_address and output_address for ZMQ communication
- stats_update_address for subscribing to coordinator stats

Sources: vllm/v1/utils.py:159-225
The ParallelConfig dataclass vllm/config/parallel.py:94-358 contains all distributed execution parameters:
| Field | Description | Default |
|---|---|---|
| pipeline_parallel_size | Number of pipeline stages | 1 |
| tensor_parallel_size | Number of tensor parallel groups | 1 |
| data_parallel_size | Number of data parallel replicas | 1 |
| enable_expert_parallel | Use expert parallelism for MoE | False |
| world_size | TP × PP × PCP (computed) | - |
| distributed_executor_backend | Backend type (mp/ray/uni/external_launcher) | None (auto) |
Derived properties:
- world_size_across_dp = world_size * data_parallel_size
- use_ubatching = enable_dbo or ubatch_size > 1
- num_ubatches = 2 if enable_dbo else ubatch_size

Sources: vllm/config/parallel.py:94-358
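The derived properties reduce to simple arithmetic, sketched here with a trimmed-down stand-in for ParallelConfig (field names follow the table above; the real class has many more fields):

```python
from dataclasses import dataclass

@dataclass
class MiniParallelConfig:
    """Trimmed sketch of ParallelConfig's derived properties."""
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    data_parallel_size: int = 1
    enable_dbo: bool = False
    ubatch_size: int = 1

    @property
    def world_size(self) -> int:
        # Workers per model replica (TP x PP)
        return self.tensor_parallel_size * self.pipeline_parallel_size

    @property
    def world_size_across_dp(self) -> int:
        # Total workers across all data-parallel replicas
        return self.world_size * self.data_parallel_size

    @property
    def use_ubatching(self) -> bool:
        return self.enable_dbo or self.ubatch_size > 1

    @property
    def num_ubatches(self) -> int:
        return 2 if self.enable_dbo else self.ubatch_size
```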
The EngineArgs class vllm/engine/arg_utils.py:358-603 defines CLI arguments that are transformed into ParallelConfig during VllmConfig initialization. Arguments include:
- --tensor-parallel-size / -tp
- --pipeline-parallel-size / -pp
- --data-parallel-size / -dp
- --enable-expert-parallel / -ep
- --enable-dbo (Dual Batch Overlap)
- --ubatch-size

Sources: vllm/engine/arg_utils.py:358-603 vllm/config/parallel.py:94-358
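A reduced argparse sketch of these flags; defaults are assumed to mirror ParallelConfig's, and the real parser lives in EngineArgs with many more options:

```python
import argparse

# Hypothetical reduced parser mirroring the EngineArgs flags listed above.
parser = argparse.ArgumentParser(description="parallelism flags (sketch)")
parser.add_argument("--tensor-parallel-size", "-tp", type=int, default=1)
parser.add_argument("--pipeline-parallel-size", "-pp", type=int, default=1)
parser.add_argument("--data-parallel-size", "-dp", type=int, default=1)
parser.add_argument("--enable-expert-parallel", "-ep", action="store_true")
parser.add_argument("--enable-dbo", action="store_true")
parser.add_argument("--ubatch-size", type=int, default=1)

# e.g. a 2x2 TP/PP deployment with Dual Batch Overlap enabled:
args = parser.parse_args(["-tp", "2", "-pp", "2", "--enable-dbo"])
```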
When data_parallel_size > 1, a DPCoordinator process orchestrates request waves across DP ranks. A request wave represents a synchronized execution phase where all ranks process their local requests.
Wave lifecycle:
- A wave begins when the coordinator broadcasts START_DP_WAVE to all ranks

Sources: vllm/v1/engine/coordinator.py:22-106 vllm/v1/engine/coordinator.py:113-311
The coordinate_batch_across_dp() function vllm/v1/worker/dp_utils.py:173-241 synchronizes batch execution decisions across DP ranks using NCCL or Gloo AllReduce:
Synchronized information:
- orig_num_tokens_per_ubatch: Original token count (unpadded)
- padded_num_tokens_per_ubatch: Token count after non-DP padding (CUDA graph, TP)
- should_ubatch: Whether to use microbatching
- should_dp_pad: Whether to pad all ranks to the same size
- cudagraph_mode: CUDA graph mode (NONE/PIECEWISE/FULL)

Decision logic:
- Ranks must agree on cudagraph_mode across ranks
- All ranks pad to a common token count when should_dp_pad=True or ubatching is enabled

Sources: vllm/v1/worker/dp_utils.py:103-171 vllm/v1/worker/dp_utils.py:173-241
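The decision flow can be emulated in plain Python, with max() and min() over per-rank values standing in for the NCCL/Gloo AllReduce. The padding granularity and exact threshold semantics here are simplifying assumptions; only the dbo_decode_token_threshold default of 32 comes from the section above:

```python
def coordinate_batch(per_rank_tokens,
                     dbo_decode_token_threshold: int = 32,
                     cudagraph_pad_to: int = 8):
    """Return (padded_tokens, should_ubatch, should_dp_pad) shared by all ranks."""
    # Each rank first applies its non-DP padding (e.g. CUDA graph bucket sizes).
    padded = [(-(-t // cudagraph_pad_to)) * cudagraph_pad_to
              for t in per_rank_tokens]
    # AllReduce(MAX): every rank learns the largest padded batch.
    max_padded = max(padded)
    # Microbatch only if every rank clears the threshold (AllReduce(MIN)-style).
    should_ubatch = min(per_rank_tokens) >= dbo_decode_token_threshold
    # Smaller ranks must pad up to the maximum so collectives line up.
    should_dp_pad = any(p != max_padded for p in padded)
    return max_padded, should_ubatch, should_dp_pad
```

Every rank runs the same reduction over the same values, so all ranks arrive at identical decisions without a central coordinator.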
vLLM uses ZeroMQ for request/response communication between API servers, coordinator, and engine cores:
Address generation:
The get_engine_client_zmq_addr() function vllm/v1/utils.py:143-157 creates socket addresses:
- Same-node (local) endpoints use IPC paths from get_open_zmq_ipc_path()
- Cross-node endpoints use TCP URIs from get_tcp_uri(host, port)

Sources: vllm/v1/utils.py:143-157 vllm/v1/engine/coordinator.py:151-311
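A sketch of the two address forms, with a hypothetical helper standing in for get_engine_client_zmq_addr():

```python
import tempfile
import uuid

def get_zmq_addr(local_only: bool, host: str = "", port: int = 0) -> str:
    """Sketch: IPC path for same-node traffic, TCP URI for cross-node."""
    if local_only:
        # Unix-domain socket path, unique per engine client.
        return f"ipc://{tempfile.gettempdir()}/engine-{uuid.uuid4().hex}.sock"
    # TCP URI for cross-node traffic.
    return f"tcp://{host}:{port}"
```

IPC transports avoid the TCP stack entirely for co-located processes, which is why the local/remote distinction matters for per-token latency.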
For GPU-to-GPU communication, vLLM uses NCCL collectives within process groups:
Key operations:
- AllReduce: Synchronize DP rank decisions vllm/v1/worker/dp_utils.py:38-56
- AllGather/ReduceScatter: All2All for expert parallelism
- Broadcast: Weight distribution during initialization

Process group initialization:
The stateless_init_dp_group() method vllm/config/parallel.py:399-435 creates a DP process group using:
- The Gloo backend when disable_nccl_for_dp_synchronization=True
- Retries when port binding fails with EADDRINUSE errors

Sources: vllm/config/parallel.py:399-435 vllm/v1/worker/dp_utils.py:20-56
When batch sizes exceed thresholds, vLLM can split execution into microbatches (ubatches) to overlap communication with computation. This is configured via:
- enable_dbo: Enables 2-way microbatching (Dual Batch Overlap)
- ubatch_size: Number of microbatches when DBO is disabled
- dbo_decode_token_threshold: Threshold for decode-only batches (default: 32)
- dbo_prefill_token_threshold: Threshold for mixed batches (default: 512)

The UBatchContext class vllm/v1/worker/ubatching.py:20-148 manages synchronization between microbatch threads using:
Synchronization primitives:
- threading.Barrier: Ensures all threads initialize their CUDA contexts
- threading.Event: CPU-level signaling between threads
- torch.Event: GPU stream synchronization

Stream management:
- compute_stream: Main computation stream
- comm_stream: Communication stream for collectives

Sources: vllm/v1/worker/ubatching.py:20-148 vllm/v1/worker/ubatching.py:202-242
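The thread choreography can be shown with CPU primitives alone (GPU events and streams elided). MiniUBatchContext is a toy stand-in that makes two microbatch threads alternate deterministically, the way one microbatch computes while its peer would be communicating:

```python
import threading

class MiniUBatchContext:
    """Toy analogue of UBatchContext for two alternating microbatch threads."""

    def __init__(self, num_ubatches: int = 2):
        # Barrier: both threads must exist before either starts
        # (stands in for CUDA context initialization in vLLM).
        self.ready = threading.Barrier(num_ubatches)
        # Events: CPU-level handoff between peer microbatch threads.
        self.turn = [threading.Event() for _ in range(num_ubatches)]
        self.turn[0].set()  # ubatch 0 runs first
        self.log = []

    def run_ubatch(self, idx: int, steps: int):
        peer = 1 - idx
        self.ready.wait()
        for step in range(steps):
            self.turn[idx].wait()      # wait for our turn
            self.turn[idx].clear()
            self.log.append((idx, step))
            self.turn[peer].set()      # hand off to the other microbatch

ctx = MiniUBatchContext()
threads = [threading.Thread(target=ctx.run_ubatch, args=(i, 3)) for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The strict event handoff makes the interleaving deterministic: ubatch 0 and ubatch 1 alternate step by step, which is the property the real context relies on to overlap compute with communication.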
The UBatchSlice dataclass vllm/v1/worker/ubatch_utils.py:13-28 defines a slice of requests and tokens:
Creation logic vllm/v1/worker/ubatch_utils.py:63-115:
- Token split points are computed as [split_point * i for i in range(1, num_ubatches)]
- np.searchsorted() on cumulative token counts finds the request boundaries

Metadata splitting:
The split_attn_metadata() function vllm/v1/worker/ubatch_utils.py:229-243 creates separate CommonAttentionMetadata for each ubatch by slicing:
- query_start_loc: Adjusted for the token offset
- seq_lens: Per-request sequence lengths
- block_table_tensor: Block tables for the slice's requests
- slot_mapping: Token-to-slot mapping

Sources: vllm/v1/worker/ubatch_utils.py:13-115 vllm/v1/worker/ubatch_utils.py:134-243
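The split-point arithmetic can be sketched with NumPy. Rounding of requests that straddle a split point is simplified here to whole-request boundaries (the real code can split a single request across ubatches):

```python
import numpy as np

def make_ubatch_slices(tokens_per_request, num_ubatches: int = 2):
    """Split a batch into num_ubatches (start, end) request ranges."""
    cum = np.cumsum(tokens_per_request)          # cumulative token counts
    total = int(cum[-1])
    split_point = total // num_ubatches
    # Token split points, as in the creation logic above.
    token_splits = [split_point * i for i in range(1, num_ubatches)]
    # searchsorted maps each token split point to the request containing it.
    req_splits = np.searchsorted(cum, token_splits, side="left")
    bounds = [0, *(int(r) + 1 for r in req_splits), len(tokens_per_request)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_ubatches)]
```

For four equal 4-token requests the 16 tokens split cleanly into two 2-request slices; with an uneven batch like [10, 2, 4], the first (large) request ends up alone in the first slice.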
The UBatchWrapper class vllm/v1/worker/gpu_ubatch_wrapper.py:94-368 wraps model execution to support microbatching with CUDA graphs:
Components:
- SMControlContextManager: Controls SM allocation between compute and communication
- ready_barrier: Synchronizes thread startup
- cudagraphs: Cache of captured graphs per batch size
- comm_stream: Stream for communication operations

Execution flow:
- Synchronize the batch decision via coordinate_batch_across_dp()
- Build UBatchSlice objects for each microbatch
- Run each microbatch thread inside a UBatchContext

Sources: vllm/v1/worker/gpu_ubatch_wrapper.py:94-368
Example results:
- world_size = 4
- world_size = 4 (2 TP × 2 PP)
- world_size = 2, world_size_across_dp = 8; a DPCoordinator is created for wave synchronization

Sources: vllm/engine/arg_utils.py:785-933 vllm/config/parallel.py:446-460
For multiprocessing deployments with data_parallel_size > 1, vLLM manages device visibility using environment variables:
Function: get_device_indices() vllm/v1/engine/utils.py:193-225
- Computes the local device range [local_dp_rank * world_size, (local_dp_rank + 1) * world_size)
- Maps logical to physical device IDs via device_id_to_physical_device_id()
- Sets the platform's visibility variable (CUDA_VISIBLE_DEVICES, ZE_AFFINITY_MASK, etc.)

Example:
world_size = 2, local_dp_rank = 1
Device range: [2, 3]
CUDA_VISIBLE_DEVICES="2,3"
Sources: vllm/v1/engine/utils.py:176-225
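The range computation matches the example above and is easy to sketch; the helper names are illustrative, and the physical-ID remapping step is omitted:

```python
def get_device_indices(world_size: int, local_dp_rank: int):
    """Contiguous logical device range for one local DP rank."""
    first = local_dp_rank * world_size
    return list(range(first, first + world_size))

def visible_devices_env(world_size: int, local_dp_rank: int) -> str:
    # Value assigned to CUDA_VISIBLE_DEVICES (or ZE_AFFINITY_MASK on XPU).
    return ",".join(map(str, get_device_indices(world_size, local_dp_rank)))
```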
| Variable | Purpose | Set By |
|---|---|---|
| CUDA_VISIBLE_DEVICES | Device visibility for CUDA | vLLM during process spawn |
| VLLM_DP_SIZE | Data parallel size | User (for offline SPMD) |
| VLLM_DP_RANK | Data parallel rank | User (for offline SPMD) |
| VLLM_ENABLE_V1_MULTIPROCESSING | Enable V1 multiprocessing | vLLM (disabled for external launcher) |
| VLLM_RAY_DP_PACK_STRATEGY | Ray placement strategy | User (strict/fill/span) |
| VLLM_DBO_COMM_SMS | SMs for communication in DBO | User (default: value from envs) |
Sources: vllm/config/parallel.py:547-599 vllm/v1/worker/gpu_ubatch_wrapper.py:126-157