This document describes vLLM's data parallel (DP) coordination system, which manages synchronization and load balancing across multiple data parallel engine ranks. When data_parallel_size > 1, the system uses a centralized coordinator process to orchestrate request distribution and execution synchronization.
For general information about parallelism strategies including tensor, pipeline, and expert parallelism, see Parallelism Strategies. For details on engine process lifecycle management, see Engine Process Management. For microbatching implementation details, see Microbatching and Dual Batch Overlap.
Data parallel coordination in vLLM enables horizontal scaling by running multiple independent engine replicas that each process a different subset of requests. The coordination layer keeps these replicas consistent by synchronizing execution waves, balancing load across ranks, and coordinating per-batch scheduling decisions.
The system operates through three primary components:
- `DPCoordinator`: Manages the coordinator process lifecycle
- `DPCoordinatorProc`: The coordinator process implementation
- `coordinate_batch_across_dp()`: Per-rank batch synchronization

Sources: vllm/v1/engine/coordinator.py1-400 vllm/v1/worker/dp_utils.py1-241
DPCoordinator ZMQ Socket Topology
The coordinator uses three ZMQ sockets:
| Socket | Type | Purpose | Participants |
|---|---|---|---|
| `front_publish_address` | XPUB | Publish stats to API servers | API servers subscribe |
| `back_output_address` | PULL | Receive engine outputs | All DP ranks send |
| `back_publish_address` | XPUB | Broadcast control messages | All DP ranks subscribe |
The DPCoordinator class manages process lifecycle and exposes addresses via get_stats_publish_address() and get_engine_socket_addresses().
Sources: vllm/v1/engine/coordinator.py22-106 vllm/v1/engine/coordinator.py151-198
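The three-socket topology can be sketched with a minimal in-memory stand-in (pure Python for illustration only; vLLM uses real ZMQ XPUB/PULL sockets, and the `CoordinatorBus` name is hypothetical):

```python
from collections import deque


class CoordinatorBus:
    """In-memory stand-in for the coordinator's three ZMQ sockets."""

    def __init__(self) -> None:
        self.back_output = deque()                # PULL-like: DP ranks push outputs here
        self.api_subscribers: list[deque] = []    # XPUB-like front: API servers
        self.rank_subscribers: list[deque] = []   # XPUB-like back: DP ranks

    def subscribe_api(self) -> deque:
        q = deque()
        self.api_subscribers.append(q)
        return q

    def subscribe_rank(self) -> deque:
        q = deque()
        self.rank_subscribers.append(q)
        return q

    def publish_stats(self, msg) -> None:
        # front_publish_address: stats fan-out to every API server.
        for q in self.api_subscribers:
            q.append(msg)

    def publish_control(self, msg) -> None:
        # back_publish_address: control messages fan-out to every DP rank.
        for q in self.rank_subscribers:
            q.append(msg)
```

The key property this models is asymmetry: engine outputs converge on a single PULL-style inbox, while stats and control messages fan out to every subscriber.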
Request waves are the fundamental synchronization primitive for data parallel execution. Each wave represents one cycle in which engines enter the running state, process requests, and then return to the paused state.
Engine State Transitions
The coordinator tracks `current_wave` (an integer counter) and `engines_running` (a boolean). When all engines complete a wave:

1. Each engine sends `EngineCoreOutputs` with `finished_wave=True` to the coordinator
2. The coordinator increments `current_wave` and sets `engines_running=False`

Sources: vllm/v1/engine/coordinator.py36-53 vllm/v1/engine/coordinator.py159-167
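The completion bookkeeping can be sketched as follows (the `note_wave_finished` helper and its `state` dict are hypothetical; the real logic lives in `DPCoordinatorProc`):

```python
def note_wave_finished(state: dict, rank: int, wave: int, dp_size: int) -> bool:
    """Record one rank's finished_wave report; return True when the wave
    has completed across all dp_size ranks."""
    if wave != state["current_wave"]:
        return False  # stale report from an earlier wave; ignore
    state["finished_ranks"].add(rank)
    if len(state["finished_ranks"]) == dp_size:
        # All ranks finished: advance the counter and pause the engines.
        state["current_wave"] += 1
        state["engines_running"] = False
        state["finished_ranks"].clear()
        return True
    return False
```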
A new wave begins when:

Case 1: New Request While Paused

1. The front-end sends a request tagged `request_wave=current_wave+1` to an engine
2. The engine forwards a `START_DP_WAVE` message to the coordinator
3. The coordinator broadcasts `START_DP_WAVE` to all engines

Case 2: Stale Request to Paused Engine

1. An engine receives a request with `request_wave < current_wave` while paused
2. The engine sends `EngineCoreRequestType.START_DP_WAVE` to the coordinator
3. The coordinator broadcasts `START_DP_WAVE` to all engines

The coordinator processes these in `_handle_start_wave()`:
```python
if not engines_running and enable_wave_coordination:
    current_wave += 1
    engines_running = True
    publish_back.send(msgpack.encode(START_DP_WAVE))
```
Sources: vllm/v1/engine/coordinator.py42-52 vllm/v1/engine/coordinator.py246-262
Each DP rank independently schedules requests but must coordinate certain decisions to maintain consistency. The coordinate_batch_across_dp() function synchronizes three aspects:
Batch Synchronization All-Reduce Protocol
The synchronization tensor has shape [5, dp_size]:
| Index | Content | Purpose |
|---|---|---|
| `[0][rank]` | `orig_num_tokens_per_ubatch` | Original token count |
| `[1][rank]` | `padded_num_tokens_per_ubatch` | Token count with padding |
| `[2][rank]` | `1 if should_ubatch else 0` | Microbatching vote |
| `[3][rank]` | `1 if should_dp_pad else 0` | DP padding enabled |
| `[4][rank]` | `cudagraph_mode` enum value | CUDA graph mode |
The gathered results are then post-processed in `_synchronize_dp_ranks()`, which reduces each row to a single decision that every rank applies consistently.
Sources: vllm/v1/worker/dp_utils.py103-240 vllm/v1/worker/dp_utils.py38-101
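Assuming each rank fills only its own column so that a SUM all-reduce behaves as an all-gather of every column, the tensor construction can be sketched as follows (plain lists stand in for torch tensors; the function names are hypothetical):

```python
# Row indices follow the table above.
ORIG_TOKENS, PADDED_TOKENS, SHOULD_UBATCH, SHOULD_DP_PAD, CUDAGRAPH_MODE = range(5)


def build_sync_column(rank: int, dp_size: int, orig_tokens: int,
                      padded_tokens: int, should_ubatch: bool,
                      should_dp_pad: bool, cudagraph_mode: int) -> list[list[int]]:
    """Build the local [5, dp_size] tensor with only this rank's column set."""
    tensor = [[0] * dp_size for _ in range(5)]
    tensor[ORIG_TOKENS][rank] = orig_tokens
    tensor[PADDED_TOKENS][rank] = padded_tokens
    tensor[SHOULD_UBATCH][rank] = int(should_ubatch)
    tensor[SHOULD_DP_PAD][rank] = int(should_dp_pad)
    tensor[CUDAGRAPH_MODE][rank] = cudagraph_mode
    return tensor


def allreduce_sum(tensors: list[list[list[int]]]) -> list[list[int]]:
    """Stand-in for a SUM all-reduce: element-wise sum across ranks, which
    gathers every rank's column since the other columns are zero."""
    dp_size = len(tensors[0][0])
    return [[sum(t[row][col] for t in tensors) for col in range(dp_size)]
            for row in range(5)]
```

A post-processing rule such as "microbatch only if every rank voted for it" then reduces each gathered row to one decision, e.g. `all(gathered[SHOULD_UBATCH])`.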
By default, the all-reduce uses NCCL on GPU. However, ParallelConfig.disable_nccl_for_dp_synchronization can force Gloo on CPU to avoid GPU synchronization points that may hurt performance with async scheduling.
Configuration logic:
The flag defaults to True when async_scheduling=True, otherwise False.
Sources: vllm/v1/worker/dp_utils.py20-36 vllm/config/parallel.py185-190
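The documented default resolution can be expressed as a small helper (the function name is hypothetical; only the rule itself comes from the text above):

```python
from typing import Optional


def resolve_disable_nccl_for_dp_sync(flag: Optional[bool],
                                     async_scheduling: bool) -> bool:
    """Resolve disable_nccl_for_dp_synchronization from its unset default."""
    if flag is not None:
        return flag          # an explicit user setting always wins
    return async_scheduling  # default: True only when async scheduling is on
```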
vLLM supports three data parallel load balancing modes, configured via ParallelConfig:
Mode: data_parallel_external_lb=False, data_parallel_hybrid_lb=False
The front-end AsyncLLM manages all DP ranks (both local and remote) and performs load balancing:
Internal LB Architecture
The coordinator publishes stats updates containing `request_counts` for each rank, which the front-end uses when routing new requests.
Sources: vllm/v1/engine/coordinator.py218-234
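A minimal sketch of routing on the published `request_counts` (the helper is illustrative, not vLLM's actual load-balancing policy):

```python
def pick_least_loaded_rank(request_counts: list[int]) -> int:
    """Route a new request to the DP rank with the fewest in-flight requests.

    Ties break toward the lowest rank index, matching min()'s behavior."""
    return min(range(len(request_counts)), key=request_counts.__getitem__)
```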
Mode: data_parallel_external_lb=True
Each DP rank runs in a separate pod/instance with its own API server. An external load balancer (e.g., Kubernetes Service) distributes requests:
External LB Architecture
The rank is set via the `--data-parallel-rank` CLI flag. In this mode no stats are published, only wave synchronization messages; each API server sends requests to its local engine only.
Sources: vllm/config/parallel.py121-125 vllm/v1/engine/coordinator.py53-56
Mode: data_parallel_hybrid_lb=True
Multiple nodes run API servers, each with multiple local DP ranks. External LB distributes between nodes, internal LB distributes within nodes:
Hybrid LB Architecture
The starting rank is set via the `--data-parallel-start-rank` CLI flag. The coordinator publishes stats for the local ranks within each node's API server.
Sources: vllm/config/parallel.py126-133 vllm/engine/arg_utils.py834-850
The `ParallelConfig.local_engines_only` property determines which engines a client manages.
Sources: vllm/config/parallel.py375-381
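A plausible sketch of the property's logic, assuming that clients in external or hybrid LB mode manage only their node-local engines while the internal LB front-end manages all ranks (the standalone function signature is illustrative, not vLLM's actual code):

```python
def local_engines_only(data_parallel_hybrid_lb: bool,
                       data_parallel_external_lb: bool) -> bool:
    """True when a client should manage only the engines on its own node."""
    # External LB: one rank per API server; hybrid LB: one node's ranks per
    # API server. Only internal LB manages every rank from one front-end.
    return data_parallel_hybrid_lb or data_parallel_external_lb
```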
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data_parallel_size` | int | 1 | Number of DP replicas |
| `data_parallel_rank` | int \| None | None | Explicit rank for external LB |
| `data_parallel_start_rank` | int \| None | None | Starting rank for hybrid LB |
| `data_parallel_size_local` | int \| None | None | Number of local DP replicas |
| `data_parallel_address` | str \| None | None | Coordinator address |
| `data_parallel_rpc_port` | int \| None | None | RPC port for DP messaging |
| `data_parallel_hybrid_lb` | bool | False | Enable hybrid LB mode |
| `data_parallel_external_lb` | bool | False | Enable external LB mode |
| `data_parallel_backend` | DataParallelBackend | "mp" | Backend: "mp" or "ray" |
| `disable_nccl_for_dp_synchronization` | bool \| None | None | Use Gloo instead of NCCL |
Sources: vllm/engine/arg_utils.py406-423 vllm/engine/arg_utils.py827-874
After initialization, `ParallelConfig` contains the resolved values for these parameters.
Sources: vllm/config/parallel.py103-191
The coordinator is created in `CoreEngineProcManager` or `CoreEngineActorManager` when `data_parallel_size > 1`.
The coordinator process runs `DPCoordinatorProc.run_coordinator()`, an event loop that receives `EngineCoreOutputs` messages from the engines, updates wave and load-balancing state, and publishes stats and control messages to subscribers.

Sources: vllm/v1/engine/coordinator.py58-106 vllm/v1/engine/coordinator.py128-150
| Variable | Purpose | Default |
|---|---|---|
| `VLLM_DP_SIZE` | DP size in offline SPMD mode | From config |
| `VLLM_DP_RANK` | DP rank in offline SPMD mode | From config |
| `VLLM_DP_RANK_LOCAL` | Local DP rank in SPMD mode | From config |
| `VLLM_DP_MASTER_IP` | Coordinator IP address | From config |
| `VLLM_DP_MASTER_PORT` | Coordinator port | From config |
| `VLLM_RAY_DP_PACK_STRATEGY` | Ray placement: "strict", "fill", "span" | "strict" |
Sources: vllm/config/parallel.py580-592 vllm/v1/engine/utils.py397-448
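Reading these variables with config-derived fallbacks might look like this (a hypothetical helper; the variable names come from the table above):

```python
import os


def resolve_dp_env(config_dp_size: int, config_dp_rank: int) -> tuple[int, int]:
    """Resolve DP size and rank, preferring the SPMD environment variables
    and falling back to the values derived from ParallelConfig."""
    dp_size = int(os.environ.get("VLLM_DP_SIZE", config_dp_size))
    dp_rank = int(os.environ.get("VLLM_DP_RANK", config_dp_rank))
    return dp_size, dp_rank
```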