This document describes vLLM's data parallel (DP) coordination system, which manages synchronization and load balancing across multiple data parallel engine ranks. When data_parallel_size > 1, the system uses a centralized coordinator process to orchestrate request distribution and execution synchronization.
For general information about parallelism strategies including tensor, pipeline, and expert parallelism, see Parallelism Strategies. For details on engine process lifecycle management, see Engine Process Management. For microbatching implementation details, see Microbatching and Dual Batch Overlap.
Data parallel coordination in vLLM enables horizontal scaling by running multiple independent engine replicas that each process a different subset of requests. The coordination layer keeps these replicas consistent by synchronizing execution waves, balancing load across ranks, and coordinating per-batch scheduling decisions.
The system operates through three primary components:
- `DPCoordinator`: Manages the coordinator process lifecycle
- `DPCoordinatorProc`: The coordinator process implementation
- `coordinate_batch_across_dp()`: Per-rank batch synchronization

Sources: vllm/v1/engine/coordinator.py1-400 vllm/v1/worker/dp_utils.py1-241
DPCoordinator ZMQ Socket Topology
The coordinator uses three ZMQ sockets:
| Socket | Type | Purpose | Participants |
|---|---|---|---|
| `front_publish_address` | XPUB | Publish stats to API servers | API servers subscribe |
| `back_output_address` | PULL | Receive engine outputs | All DP ranks send |
| `back_publish_address` | XPUB | Broadcast control messages | All DP ranks subscribe |
The DPCoordinator class manages process lifecycle and exposes addresses via get_stats_publish_address() and get_engine_socket_addresses().
Sources: vllm/v1/engine/coordinator.py22-106 vllm/v1/engine/coordinator.py151-198
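The three-socket topology can be sketched with a minimal in-memory stand-in (pure Python for illustration only; vLLM uses real ZMQ XPUB/PULL sockets, and the `CoordinatorBus` name is hypothetical):

```python
from collections import deque


class CoordinatorBus:
    """In-memory stand-in for the coordinator's three ZMQ sockets."""

    def __init__(self) -> None:
        self.back_output = deque()                # PULL-like: DP ranks push outputs here
        self.api_subscribers: list[deque] = []    # XPUB-like front: API servers
        self.rank_subscribers: list[deque] = []   # XPUB-like back: DP ranks

    def subscribe_api(self) -> deque:
        q = deque()
        self.api_subscribers.append(q)
        return q

    def subscribe_rank(self) -> deque:
        q = deque()
        self.rank_subscribers.append(q)
        return q

    def publish_stats(self, msg) -> None:
        # front_publish_address: stats fan-out to every API server.
        for q in self.api_subscribers:
            q.append(msg)

    def publish_control(self, msg) -> None:
        # back_publish_address: control messages fan-out to every DP rank.
        for q in self.rank_subscribers:
            q.append(msg)
```

The key property this models is asymmetry: engine outputs converge on a single PULL-style inbox, while stats and control messages fan out to every subscriber.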
Request waves are the fundamental synchronization primitive for data parallel execution. Each wave represents one cycle in which engines enter the running state, process requests, and then return to the paused state.
Engine State Transitions
The coordinator tracks `current_wave` (an integer counter) and `engines_running` (a boolean). When all engines complete a wave:

1. Each engine sends `EngineCoreOutputs` with `finished_wave=True` to the coordinator
2. The coordinator increments `current_wave` and sets `engines_running=False`

Sources: vllm/v1/engine/coordinator.py36-53 vllm/v1/engine/coordinator.py159-167
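The completion bookkeeping can be sketched as follows (the `note_wave_finished` helper and its `state` dict are hypothetical; the real logic lives in `DPCoordinatorProc`):

```python
def note_wave_finished(state: dict, rank: int, wave: int, dp_size: int) -> bool:
    """Record one rank's finished_wave report; return True when the wave
    has completed across all dp_size ranks."""
    if wave != state["current_wave"]:
        return False  # stale report from an earlier wave; ignore
    state["finished_ranks"].add(rank)
    if len(state["finished_ranks"]) == dp_size:
        # All ranks finished: advance the counter and pause the engines.
        state["current_wave"] += 1
        state["engines_running"] = False
        state["finished_ranks"].clear()
        return True
    return False
```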
A new wave begins when:

Case 1: New Request While Paused

1. The front-end sends a request tagged `request_wave=current_wave+1` to an engine
2. The engine forwards a `START_DP_WAVE` message to the coordinator
3. The coordinator broadcasts `START_DP_WAVE` to all engines

Case 2: Stale Request to Paused Engine

1. An engine receives a request with `request_wave < current_wave` while paused
2. The engine sends `EngineCoreRequestType.START_DP_WAVE` to the coordinator
3. The coordinator broadcasts `START_DP_WAVE` to all engines

The coordinator processes these in `_handle_start_wave()`:
```python
if not engines_running and enable_wave_coordination:
    current_wave += 1
    engines_running = True
    publish_back.send(msgpack.encode(START_DP_WAVE))
```
Sources: vllm/v1/engine/coordinator.py42-52 vllm/v1/engine/coordinator.py246-262
Each DP rank independently schedules requests but must coordinate certain decisions to maintain consistency. The coordinate_batch_across_dp() function synchronizes three aspects:
Batch Synchronization All-Reduce Protocol
The synchronization tensor has shape [5, dp_size]:
| Index | Content | Purpose |
|---|---|---|
| `[0][rank]` | `orig_num_tokens_per_ubatch` | Original token count |
| `[1][rank]` | `padded_num_tokens_per_ubatch` | Token count with padding |
| `[2][rank]` | `1 if should_ubatch else 0` | Microbatching vote |
| `[3][rank]` | `1 if should_dp_pad else 0` | DP padding enabled |
| `[4][rank]` | `cudagraph_mode` enum value | CUDA graph mode |
The gathered results are then post-processed in `_synchronize_dp_ranks()`, which reduces each row to a single decision that every rank applies consistently.
Sources: vllm/v1/worker/dp_utils.py103-240 vllm/v1/worker/dp_utils.py38-101
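Assuming each rank fills only its own column so that a SUM all-reduce behaves as an all-gather of every column, the tensor construction can be sketched as follows (plain lists stand in for torch tensors; the function names are hypothetical):

```python
# Row indices follow the table above.
ORIG_TOKENS, PADDED_TOKENS, SHOULD_UBATCH, SHOULD_DP_PAD, CUDAGRAPH_MODE = range(5)


def build_sync_column(rank: int, dp_size: int, orig_tokens: int,
                      padded_tokens: int, should_ubatch: bool,
                      should_dp_pad: bool, cudagraph_mode: int) -> list[list[int]]:
    """Build the local [5, dp_size] tensor with only this rank's column set."""
    tensor = [[0] * dp_size for _ in range(5)]
    tensor[ORIG_TOKENS][rank] = orig_tokens
    tensor[PADDED_TOKENS][rank] = padded_tokens
    tensor[SHOULD_UBATCH][rank] = int(should_ubatch)
    tensor[SHOULD_DP_PAD][rank] = int(should_dp_pad)
    tensor[CUDAGRAPH_MODE][rank] = cudagraph_mode
    return tensor


def allreduce_sum(tensors: list[list[list[int]]]) -> list[list[int]]:
    """Stand-in for a SUM all-reduce: element-wise sum across ranks, which
    gathers every rank's column since the other columns are zero."""
    dp_size = len(tensors[0][0])
    return [[sum(t[row][col] for t in tensors) for col in range(dp_size)]
            for row in range(5)]
```

A post-processing rule such as "microbatch only if every rank voted for it" then reduces each gathered row to one decision, e.g. `all(gathered[SHOULD_UBATCH])`.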
By default, the all-reduce uses NCCL on GPU. However, ParallelConfig.disable_nccl_for_dp_synchronization can force Gloo on CPU to avoid GPU synchronization points that may hurt performance with async scheduling.
Configuration logic:
The flag defaults to True when async_scheduling=True, otherwise False.
Sources: vllm/v1/worker/dp_utils.py20-36 vllm/config/parallel.py185-190
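The documented default resolution can be expressed as a small helper (the function name is hypothetical; only the rule itself comes from the text above):

```python
from typing import Optional


def resolve_disable_nccl_for_dp_sync(flag: Optional[bool],
                                     async_scheduling: bool) -> bool:
    """Resolve disable_nccl_for_dp_synchronization from its unset default."""
    if flag is not None:
        return flag          # an explicit user setting always wins
    return async_scheduling  # default: True only when async scheduling is on
```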
vLLM supports three data parallel load balancing modes, configured via ParallelConfig:
Mode: data_parallel_external_lb=False, data_parallel_hybrid_lb=False
The front-end AsyncLLM manages all DP ranks (both local and remote) and performs load balancing:
Internal LB Architecture
The coordinator publishes stats updates containing `request_counts` for each rank, which the front-end uses when routing new requests.
Sources: vllm/v1/engine/coordinator.py218-234
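A minimal sketch of routing on the published `request_counts` (the helper is illustrative, not vLLM's actual load-balancing policy):

```python
def pick_least_loaded_rank(request_counts: list[int]) -> int:
    """Route a new request to the DP rank with the fewest in-flight requests.

    Ties break toward the lowest rank index, matching min()'s behavior."""
    return min(range(len(request_counts)), key=request_counts.__getitem__)
```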
Mode: data_parallel_external_lb=True
Each DP rank runs in a separate pod/instance with its own API server. An external load balancer (e.g., Kubernetes Service) distributes requests:
External LB Architecture
The rank is set via the `--data-parallel-rank` CLI flag. In this mode no stats are published, only wave synchronization messages; each API server sends requests to its local engine only.
Sources: vllm/config/parallel.py121-125 vllm/v1/engine/coordinator.py53-56
Mode: data_parallel_hybrid_lb=True
Multiple nodes run API servers, each with multiple local DP ranks. External LB distributes between nodes, internal LB distributes within nodes:
Hybrid LB Architecture
The starting rank is set via the `--data-parallel-start-rank` CLI flag. The coordinator publishes stats for the local ranks within each node's API server.
Sources: vllm/config/parallel.py126-133 vllm/engine/arg_utils.py834-850
The `ParallelConfig.local_engines_only` property determines which engines a client manages.
Sources: vllm/config/parallel.py375-381
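A plausible sketch of the property's logic, assuming that clients in external or hybrid LB mode manage only their node-local engines while the internal LB front-end manages all ranks (the standalone function signature is illustrative, not vLLM's actual code):

```python
def local_engines_only(data_parallel_hybrid_lb: bool,
                       data_parallel_external_lb: bool) -> bool:
    """True when a client should manage only the engines on its own node."""
    # External LB: one rank per API server; hybrid LB: one node's ranks per
    # API server. Only internal LB manages every rank from one front-end.
    return data_parallel_hybrid_lb or data_parallel_external_lb
```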
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data_parallel_size` | int | 1 | Number of DP replicas |
| `data_parallel_rank` | int \| None | None | Explicit rank for external LB |
| `data_parallel_start_rank` | int \| None | None | Starting rank for hybrid LB |
| `data_parallel_size_local` | int \| None | None | Number of local DP replicas |
| `data_parallel_address` | str \| None | None | Coordinator address |
| `data_parallel_rpc_port` | int \| None | None | RPC port for DP messaging |
| `data_parallel_hybrid_lb` | bool | False | Enable hybrid LB mode |
| `data_parallel_external_lb` | bool | False | Enable external LB mode |
| `data_parallel_backend` | DataParallelBackend | "mp" | Backend: "mp" or "ray" |
| `disable_nccl_for_dp_synchronization` | bool \| None | None | Use Gloo instead of NCCL |
Sources: vllm/engine/arg_utils.py406-423 vllm/engine/arg_utils.py827-874
After initialization, `ParallelConfig` contains the resolved values for these parameters.
Sources: vllm/config/parallel.py103-191
The coordinator is created in `CoreEngineProcManager` or `CoreEngineActorManager` when `data_parallel_size > 1`.
The coordinator process runs `DPCoordinatorProc.run_coordinator()`, an event loop that receives `EngineCoreOutputs` messages from the engines, updates wave and load-balancing state, and publishes stats and control messages to subscribers.

Sources: vllm/v1/engine/coordinator.py58-106 vllm/v1/engine/coordinator.py128-150
| Variable | Purpose | Default |
|---|---|---|
| `VLLM_DP_SIZE` | DP size in offline SPMD mode | From config |
| `VLLM_DP_RANK` | DP rank in offline SPMD mode | From config |
| `VLLM_DP_RANK_LOCAL` | Local DP rank in SPMD mode | From config |
| `VLLM_DP_MASTER_IP` | Coordinator IP address | From config |
| `VLLM_DP_MASTER_PORT` | Coordinator port | From config |
| `VLLM_RAY_DP_PACK_STRATEGY` | Ray placement: "strict", "fill", "span" | "strict" |
Sources: vllm/config/parallel.py580-592 vllm/v1/engine/utils.py397-448
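Reading these variables with config-derived fallbacks might look like this (a hypothetical helper; the variable names come from the table above):

```python
import os


def resolve_dp_env(config_dp_size: int, config_dp_rank: int) -> tuple[int, int]:
    """Resolve DP size and rank, preferring the SPMD environment variables
    and falling back to the values derived from ParallelConfig."""
    dp_size = int(os.environ.get("VLLM_DP_SIZE", config_dp_size))
    dp_rank = int(os.environ.get("VLLM_DP_RANK", config_dp_rank))
    return dp_size, dp_rank
```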