This page provides an overview of vLLM's distributed execution system, which enables scaling inference across multiple GPUs and nodes. This includes process/actor management, parallelism strategies, inter-process communication, and coordination mechanisms.
vLLM's distributed execution system orchestrates multiple inference workers across GPUs and nodes to serve large language models efficiently. The system supports multiple deployment modes and parallelism strategies, coordinating execution through a combination of process managers, communication channels, and synchronization primitives.
Sources: vllm/v1/engine/utils.py:80-175 vllm/v1/engine/coordinator.py:22-106 vllm/v1/utils.py:159-225
vLLM supports multiple distributed executor backends, selected via distributed_executor_backend in ParallelConfig:
| Backend | Mode | Use Case |
|---|---|---|
| mp | Multiprocessing | Single-node or multi-node with explicit node configuration |
| ray | Ray actors | Multi-node deployments with dynamic resource management |
| uni | Single process | Debugging or XPU SPMD mode |
| external_launcher | External orchestration | Integration with external launch systems (e.g., torchrun) |
The backend selection logic resides in vllm/config/parallel.py:600-660. For single-node deployments where world_size fits on the available GPUs, mp is preferred. For multi-node deployments, or when Ray is already initialized, ray is used. TPU platforms default to uni when using SPMD mode.
Sources: vllm/config/parallel.py:201-211 vllm/config/parallel.py:600-660
The CoreEngineProcManager class manages background engine processes for local data parallel ranks. It spawns processes using Python's multiprocessing, handles device control via environment variables, and monitors process health.
Key responsibilities:
- Spawning engine processes via get_mp_context().Process()
- Controlling device visibility through CUDA_VISIBLE_DEVICES or platform-specific variables
- Registering cleanup with weakref.finalize()

Sources: vllm/v1/engine/utils.py:80-175
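A minimal sketch of this lifecycle pattern, assuming a hypothetical MiniProcManager; the real CoreEngineProcManager also wires up ZMQ addresses and monitors process health:

```python
import multiprocessing
import weakref

def _engine_main(local_dp_rank: int) -> None:
    """Placeholder for the engine-core busy loop (hypothetical)."""

class MiniProcManager:
    """Toy sketch of CoreEngineProcManager: one child per local DP rank."""

    def __init__(self, local_dp_ranks):
        ctx = multiprocessing.get_context("spawn")  # stands in for get_mp_context()
        self.processes = [
            ctx.Process(target=_engine_main, args=(rank,), daemon=True)
            for rank in local_dp_ranks
        ]
        # weakref.finalize guarantees the children are terminated even if
        # the manager is garbage collected without an explicit close().
        self.finalizer = weakref.finalize(
            self, MiniProcManager._shutdown, self.processes)

    @staticmethod
    def _shutdown(processes):
        for proc in processes:
            if proc.is_alive():
                proc.terminate()
```

Using a static method for the finalizer callback avoids capturing `self`, which would otherwise keep the manager alive and defeat the weakref.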
For Ray-based deployments, CoreEngineActorManager manages EngineCoreActor instances across nodes. It creates placement groups, handles actor lifecycle, and coordinates initialization.
Placement group creation:
- Supports strict, fill, and span packing strategies via VLLM_RAY_DP_PACK_STRATEGY
- Allocates world_size + 1 bundles per DP rank (workers + engine core)

Sources: vllm/v1/engine/utils.py:227-498
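The bundle arithmetic can be illustrated without Ray itself; the resource shapes below ({"GPU": 1} per worker, a CPU-only bundle for the engine core) are assumptions for the sketch, not the exact shapes vLLM requests:

```python
def placement_bundles(world_size: int, dp_size: int):
    """Per DP rank: world_size worker bundles plus one engine-core bundle."""
    # Resource shapes are illustrative; the real code derives them from config.
    per_rank = [{"GPU": 1}] * world_size + [{"CPU": 1}]
    return [list(per_rank) for _ in range(dp_size)]
```

With world_size = 2 and four DP ranks, this yields four groups of three bundles each, matching the world_size + 1 rule above.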
The APIServerProcessManager spawns multiple API server processes to handle client connections, each with its own ZMQ client connecting to engine cores.
Configuration:
- num_servers processes created via the spawn context
- input_address and output_address for ZMQ communication
- stats_update_address for subscribing to coordinator stats

Sources: vllm/v1/utils.py:159-225
The ParallelConfig dataclass vllm/config/parallel.py:94-358 contains all distributed execution parameters:
| Field | Description | Default |
|---|---|---|
| pipeline_parallel_size | Number of pipeline stages | 1 |
| tensor_parallel_size | Number of tensor parallel groups | 1 |
| data_parallel_size | Number of data parallel replicas | 1 |
| enable_expert_parallel | Use expert parallelism for MoE | False |
| world_size | TP × PP × PCP (computed) | - |
| distributed_executor_backend | Backend type (mp/ray/uni/external_launcher) | None (auto) |
Derived properties:
- world_size_across_dp = world_size * data_parallel_size
- use_ubatching = enable_dbo or ubatch_size > 1
- num_ubatches = 2 if enable_dbo else ubatch_size

Sources: vllm/config/parallel.py:94-358
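The derived properties reduce to simple arithmetic, sketched here with a trimmed-down stand-in for ParallelConfig (field names follow the table above; the real class has many more fields):

```python
from dataclasses import dataclass

@dataclass
class MiniParallelConfig:
    """Trimmed sketch of ParallelConfig's derived properties."""
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    data_parallel_size: int = 1
    enable_dbo: bool = False
    ubatch_size: int = 1

    @property
    def world_size(self) -> int:
        # Workers per model replica (TP x PP)
        return self.tensor_parallel_size * self.pipeline_parallel_size

    @property
    def world_size_across_dp(self) -> int:
        # Total workers across all data-parallel replicas
        return self.world_size * self.data_parallel_size

    @property
    def use_ubatching(self) -> bool:
        return self.enable_dbo or self.ubatch_size > 1

    @property
    def num_ubatches(self) -> int:
        return 2 if self.enable_dbo else self.ubatch_size
```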
The EngineArgs class vllm/engine/arg_utils.py:358-603 defines CLI arguments that are transformed into ParallelConfig during VllmConfig initialization. Arguments include:
- --tensor-parallel-size / -tp
- --pipeline-parallel-size / -pp
- --data-parallel-size / -dp
- --enable-expert-parallel / -ep
- --enable-dbo (Dual Batch Overlap)
- --ubatch-size

Sources: vllm/engine/arg_utils.py:358-603 vllm/config/parallel.py:94-358
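A reduced argparse sketch of these flags; defaults are assumed to mirror ParallelConfig's, and the real parser lives in EngineArgs with many more options:

```python
import argparse

# Hypothetical reduced parser mirroring the EngineArgs flags listed above.
parser = argparse.ArgumentParser(description="parallelism flags (sketch)")
parser.add_argument("--tensor-parallel-size", "-tp", type=int, default=1)
parser.add_argument("--pipeline-parallel-size", "-pp", type=int, default=1)
parser.add_argument("--data-parallel-size", "-dp", type=int, default=1)
parser.add_argument("--enable-expert-parallel", "-ep", action="store_true")
parser.add_argument("--enable-dbo", action="store_true")
parser.add_argument("--ubatch-size", type=int, default=1)

# e.g. a 2x2 TP/PP deployment with Dual Batch Overlap enabled:
args = parser.parse_args(["-tp", "2", "-pp", "2", "--enable-dbo"])
```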
When data_parallel_size > 1, a DPCoordinator process orchestrates request waves across DP ranks. A request wave represents a synchronized execution phase where all ranks process their local requests.
Wave lifecycle:
- A wave begins when the coordinator broadcasts START_DP_WAVE to all ranks

Sources: vllm/v1/engine/coordinator.py:22-106 vllm/v1/engine/coordinator.py:113-311
The coordinate_batch_across_dp() function vllm/v1/worker/dp_utils.py:173-241 synchronizes batch execution decisions across DP ranks using NCCL or Gloo AllReduce:
Synchronized information:
- orig_num_tokens_per_ubatch: Original token count (unpadded)
- padded_num_tokens_per_ubatch: Token count after non-DP padding (CUDA graph, TP)
- should_ubatch: Whether to use microbatching
- should_dp_pad: Whether to pad all ranks to the same size
- cudagraph_mode: CUDA graph mode (NONE/PIECEWISE/FULL)

Decision logic:
- Ranks must agree on cudagraph_mode across ranks
- All ranks pad to a common token count when should_dp_pad=True or ubatching is enabled

Sources: vllm/v1/worker/dp_utils.py:103-171 vllm/v1/worker/dp_utils.py:173-241
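The decision flow can be emulated in plain Python, with max() and min() over per-rank values standing in for the NCCL/Gloo AllReduce. The padding granularity and exact threshold semantics here are simplifying assumptions; only the dbo_decode_token_threshold default of 32 comes from the section above:

```python
def coordinate_batch(per_rank_tokens,
                     dbo_decode_token_threshold: int = 32,
                     cudagraph_pad_to: int = 8):
    """Return (padded_tokens, should_ubatch, should_dp_pad) shared by all ranks."""
    # Each rank first applies its non-DP padding (e.g. CUDA graph bucket sizes).
    padded = [(-(-t // cudagraph_pad_to)) * cudagraph_pad_to
              for t in per_rank_tokens]
    # AllReduce(MAX): every rank learns the largest padded batch.
    max_padded = max(padded)
    # Microbatch only if every rank clears the threshold (AllReduce(MIN)-style).
    should_ubatch = min(per_rank_tokens) >= dbo_decode_token_threshold
    # Smaller ranks must pad up to the maximum so collectives line up.
    should_dp_pad = any(p != max_padded for p in padded)
    return max_padded, should_ubatch, should_dp_pad
```

Every rank runs the same reduction over the same values, so all ranks arrive at identical decisions without a central coordinator.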
vLLM uses ZeroMQ for request/response communication between API servers, coordinator, and engine cores:
Address generation:
The get_engine_client_zmq_addr() function vllm/v1/utils.py:143-157 creates socket addresses:
- Same-node (local) endpoints use IPC paths from get_open_zmq_ipc_path()
- Cross-node endpoints use TCP URIs from get_tcp_uri(host, port)

Sources: vllm/v1/utils.py:143-157 vllm/v1/engine/coordinator.py:151-311
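A sketch of the two address forms, with a hypothetical helper standing in for get_engine_client_zmq_addr():

```python
import tempfile
import uuid

def get_zmq_addr(local_only: bool, host: str = "", port: int = 0) -> str:
    """Sketch: IPC path for same-node traffic, TCP URI for cross-node."""
    if local_only:
        # Unix-domain socket path, unique per engine client.
        return f"ipc://{tempfile.gettempdir()}/engine-{uuid.uuid4().hex}.sock"
    # TCP URI for cross-node traffic.
    return f"tcp://{host}:{port}"
```

IPC transports avoid the TCP stack entirely for co-located processes, which is why the local/remote distinction matters for per-token latency.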
For GPU-to-GPU communication, vLLM uses NCCL collectives within process groups:
Key operations:
- AllReduce: Synchronize DP rank decisions vllm/v1/worker/dp_utils.py:38-56
- AllGather/ReduceScatter: All2All for expert parallelism
- Broadcast: Weight distribution during initialization

Process group initialization:
The stateless_init_dp_group() method vllm/config/parallel.py:399-435 creates a DP process group using:
- The Gloo backend when disable_nccl_for_dp_synchronization=True
- Retries when port binding fails with EADDRINUSE errors

Sources: vllm/config/parallel.py:399-435 vllm/v1/worker/dp_utils.py:20-56
When batch sizes exceed thresholds, vLLM can split execution into microbatches (ubatches) to overlap communication with computation. This is configured via:
- enable_dbo: Enables 2-way microbatching (Dual Batch Overlap)
- ubatch_size: Number of microbatches when DBO is disabled
- dbo_decode_token_threshold: Threshold for decode-only batches (default: 32)
- dbo_prefill_token_threshold: Threshold for mixed batches (default: 512)

The UBatchContext class vllm/v1/worker/ubatching.py:20-148 manages synchronization between microbatch threads using:
Synchronization primitives:
- threading.Barrier: Ensures all threads initialize their CUDA contexts
- threading.Event: CPU-level signaling between threads
- torch.Event: GPU stream synchronization

Stream management:
- compute_stream: Main computation stream
- comm_stream: Communication stream for collectives

Sources: vllm/v1/worker/ubatching.py:20-148 vllm/v1/worker/ubatching.py:202-242
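The thread choreography can be shown with CPU primitives alone (GPU events and streams elided). MiniUBatchContext is a toy stand-in that makes two microbatch threads alternate deterministically, the way one microbatch computes while its peer would be communicating:

```python
import threading

class MiniUBatchContext:
    """Toy analogue of UBatchContext for two alternating microbatch threads."""

    def __init__(self, num_ubatches: int = 2):
        # Barrier: both threads must exist before either starts
        # (stands in for CUDA context initialization in vLLM).
        self.ready = threading.Barrier(num_ubatches)
        # Events: CPU-level handoff between peer microbatch threads.
        self.turn = [threading.Event() for _ in range(num_ubatches)]
        self.turn[0].set()  # ubatch 0 runs first
        self.log = []

    def run_ubatch(self, idx: int, steps: int):
        peer = 1 - idx
        self.ready.wait()
        for step in range(steps):
            self.turn[idx].wait()      # wait for our turn
            self.turn[idx].clear()
            self.log.append((idx, step))
            self.turn[peer].set()      # hand off to the other microbatch

ctx = MiniUBatchContext()
threads = [threading.Thread(target=ctx.run_ubatch, args=(i, 3)) for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The strict event handoff makes the interleaving deterministic: ubatch 0 and ubatch 1 alternate step by step, which is the property the real context relies on to overlap compute with communication.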
The UBatchSlice dataclass vllm/v1/worker/ubatch_utils.py:13-28 defines a slice of requests and tokens:
Creation logic vllm/v1/worker/ubatch_utils.py:63-115:
- Token split points are computed as [split_point * i for i in range(1, num_ubatches)]
- np.searchsorted() on cumulative token counts finds the request boundaries

Metadata splitting:
The split_attn_metadata() function vllm/v1/worker/ubatch_utils.py:229-243 creates separate CommonAttentionMetadata for each ubatch by slicing:
- query_start_loc: Adjusted for the token offset
- seq_lens: Per-request sequence lengths
- block_table_tensor: Block tables for the slice's requests
- slot_mapping: Token-to-slot mapping

Sources: vllm/v1/worker/ubatch_utils.py:13-115 vllm/v1/worker/ubatch_utils.py:134-243
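The split-point arithmetic can be sketched with NumPy. Rounding of requests that straddle a split point is simplified here to whole-request boundaries (the real code can split a single request across ubatches):

```python
import numpy as np

def make_ubatch_slices(tokens_per_request, num_ubatches: int = 2):
    """Split a batch into num_ubatches (start, end) request ranges."""
    cum = np.cumsum(tokens_per_request)          # cumulative token counts
    total = int(cum[-1])
    split_point = total // num_ubatches
    # Token split points, as in the creation logic above.
    token_splits = [split_point * i for i in range(1, num_ubatches)]
    # searchsorted maps each token split point to the request containing it.
    req_splits = np.searchsorted(cum, token_splits, side="left")
    bounds = [0, *(int(r) + 1 for r in req_splits), len(tokens_per_request)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_ubatches)]
```

For four equal 4-token requests the 16 tokens split cleanly into two 2-request slices; with an uneven batch like [10, 2, 4], the first (large) request ends up alone in the first slice.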
The UBatchWrapper class vllm/v1/worker/gpu_ubatch_wrapper.py:94-368 wraps model execution to support microbatching with CUDA graphs:
Components:
- SMControlContextManager: Controls SM allocation between compute and communication
- ready_barrier: Synchronizes thread startup
- cudagraphs: Cache of captured graphs per batch size
- comm_stream: Stream for communication operations

Execution flow:
- Synchronize the batch decision via coordinate_batch_across_dp()
- Build UBatchSlice objects for each microbatch
- Run each microbatch thread inside a UBatchContext

Sources: vllm/v1/worker/gpu_ubatch_wrapper.py:94-368
Example results:
- world_size = 4
- world_size = 4 (2 TP × 2 PP)
- world_size = 2, world_size_across_dp = 8; a DPCoordinator is created for wave synchronization

Sources: vllm/engine/arg_utils.py:785-933 vllm/config/parallel.py:446-460
For multiprocessing deployments with data_parallel_size > 1, vLLM manages device visibility using environment variables:
Function: get_device_indices() vllm/v1/engine/utils.py:193-225
- Computes the local device range [local_dp_rank * world_size, (local_dp_rank + 1) * world_size)
- Maps logical to physical device IDs via device_id_to_physical_device_id()
- Sets the platform's visibility variable (CUDA_VISIBLE_DEVICES, ZE_AFFINITY_MASK, etc.)

Example:
world_size = 2, local_dp_rank = 1
Device range: [2, 3]
CUDA_VISIBLE_DEVICES="2,3"
Sources: vllm/v1/engine/utils.py:176-225
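The range computation matches the example above and is easy to sketch; the helper names are illustrative, and the physical-ID remapping step is omitted:

```python
def get_device_indices(world_size: int, local_dp_rank: int):
    """Contiguous logical device range for one local DP rank."""
    first = local_dp_rank * world_size
    return list(range(first, first + world_size))

def visible_devices_env(world_size: int, local_dp_rank: int) -> str:
    # Value assigned to CUDA_VISIBLE_DEVICES (or ZE_AFFINITY_MASK on XPU).
    return ",".join(map(str, get_device_indices(world_size, local_dp_rank)))
```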
| Variable | Purpose | Set By |
|---|---|---|
| CUDA_VISIBLE_DEVICES | Device visibility for CUDA | vLLM during process spawn |
| VLLM_DP_SIZE | Data parallel size | User (for offline SPMD) |
| VLLM_DP_RANK | Data parallel rank | User (for offline SPMD) |
| VLLM_ENABLE_V1_MULTIPROCESSING | Enable V1 multiprocessing | vLLM (disabled for external launcher) |
| VLLM_RAY_DP_PACK_STRATEGY | Ray placement strategy | User (strict/fill/span) |
| VLLM_DBO_COMM_SMS | SMs for communication in DBO | User (default: value from envs) |
Sources: vllm/config/parallel.py:547-599 vllm/v1/worker/gpu_ubatch_wrapper.py:126-157