This page documents the KV cache transfer subsystem used for disaggregated prefilling in vLLM v1. It covers the KVConnectorBase_V1 abstract interface, the NixlConnector implementation (and its NixlConnectorScheduler / NixlConnectorWorker sub-components), the MultiConnector composition wrapper, and KVConnectorFactory. It also covers the OffloadingConnector for single-instance CPU KV offloading, which shares the same interface.
For KV block management within a single instance, see KV Cache Management. For the scheduler that produces SchedulerOutput consumed here, see Scheduler and Resource Allocation. For ZMQ/shared memory communication primitives used by the side channel, see Communication Infrastructure.
Disaggregated prefilling separates the prefill and decode phases of LLM inference onto different vLLM instances. A prefiller (producer) runs the prompt through the model to compute KV cache blocks, then transfers those blocks over the network to a decoder (consumer) that proceeds directly to token generation.
A proxy server orchestrates requests: it first sends the prompt to a prefiller, then forwards the completed request (with KV transfer metadata attached) to a decoder.
Architecture:
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py515-851 docs/features/nixl_connector_usage.md1-115
KVConnectorBase_V1

All connectors implement the abstract class KVConnectorBase_V1 defined in vllm/distributed/kv_transfer/kv_connector/v1/base.py. The interface enforces a strict split between methods that run in the scheduler process and methods that run in worker processes.
Class hierarchy:
Sources: vllm/distributed/kv_transfer/kv_connector/v1/base.py121-608 vllm/distributed/kv_transfer/kv_connector/factory.py142-204
KVConnectorRole is an enum with two values. Each connector is instantiated separately for each role:
| Role | Where It Runs | Responsibilities |
|---|---|---|
KVConnectorRole.SCHEDULER | Scheduler process | Request tracking, metadata assembly, block hold/release decisions |
KVConnectorRole.WORKER | Each GPU worker process | Memory registration, async transfers, transfer status polling |
| Method | Signature | Purpose |
|---|---|---|
get_num_new_matched_tokens | (request, num_computed_tokens) → (int \| None, bool) | Returns count of tokens loadable from external KV cache and whether the load is async
update_state_after_alloc | (request, blocks, num_external_tokens) | Called after block allocation; connector records which blocks need loading/saving |
build_connector_meta | (scheduler_output) → KVConnectorMetadata | Assembles per-step metadata for workers; resets internal pending state |
request_finished | (request, block_ids) → (bool, dict \| None) | Called once when a request completes; returns (delay_free, kv_transfer_params)
set_xfer_handshake_metadata | (metadata: dict[int, KVConnectorHandshakeMetadata]) | Installs worker handshake payloads indexed by TP rank |
| Method | Purpose |
|---|---|
register_kv_caches(kv_caches) | Called at startup with per-layer KV tensors; registers GPU memory with NIXL |
register_cross_layers_kv_cache(kv_cache, attn_backend) | Alternative to above for cross-layer contiguous tensor |
start_load_kv(forward_context) | Initiates async KV transfers (non-blocking) |
wait_for_layer_load(layer_name) | Blocks until a given layer's KV is loaded |
save_kv_layer(layer_name, kv_layer, attn_metadata) | Saves a layer's KV to the connector |
wait_for_save() | Blocks until all in-progress saves complete |
get_finished(finished_req_ids) → (set, set) | Returns (finished_sending, finished_recving) request ID sets |
get_block_ids_with_load_errors() → set[int] | Returns block IDs that failed to load |
| Class | Purpose |
|---|---|
KVConnectorMetadata | Abstract; per-step data from scheduler → workers |
KVConnectorHandshakeMetadata | Abstract; out-of-band worker↔worker initialization data |
Sources: vllm/distributed/kv_transfer/kv_connector/v1/base.py60-608
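The scheduler/worker split can be illustrated with a minimal no-op connector. The sketch below is self-contained: it stubs the role enum and omits the real base class rather than importing vLLM, and the method bodies show only the documented return shapes.

```python
from enum import Enum

class KVConnectorRole(Enum):  # stand-in for vLLM's enum
    SCHEDULER = 0
    WORKER = 1

class NoOpConnector:
    """Sketch of the interface split; the real abstract class lives in
    vllm/distributed/kv_transfer/kv_connector/v1/base.py."""

    def __init__(self, role: KVConnectorRole):
        self.role = role

    # --- Scheduler-side methods (run in the scheduler process) ---
    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        # No external KV available: zero tokens, and the load is not async.
        return 0, False

    def request_finished(self, request, block_ids):
        # (delay_free, kv_transfer_params): free blocks now, no params.
        return False, None

    # --- Worker-side methods (run in each GPU worker process) ---
    def start_load_kv(self, forward_context):
        pass  # nothing to load

    def get_finished(self, finished_req_ids):
        # (finished_sending, finished_recving) request ID sets
        return set(), set()

sched = NoOpConnector(KVConnectorRole.SCHEDULER)
assert sched.get_num_new_matched_tokens(None, 0) == (0, False)
```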
KVConnectorFactory

KVConnectorFactory in vllm/distributed/kv_transfer/kv_connector/factory.py maintains a string-keyed registry of connector classes, loaded lazily via module path.
Registration:
Pre-registered connectors (registered at the bottom of factory.py):
| Name | Module | Primary Use |
|---|---|---|
NixlConnector | vllm.distributed.kv_transfer.kv_connector.v1.nixl_connector | Disaggregated prefilling (RDMA) |
MultiConnector | vllm.distributed.kv_transfer.kv_connector.v1.multi_connector | Compose multiple connectors |
OffloadingConnector | vllm.distributed.kv_transfer.kv_connector.v1.offloading_connector | CPU KV offloading |
P2pNcclConnector | ...v1.p2p.p2p_nccl_connector | NCCL-based disagg transfer |
LMCacheConnectorV1 | ...v1.lmcache_connector | LMCache integration |
LMCacheMPConnector | ...v1.lmcache_mp_connector | LMCache multiprocessing |
MooncakeConnector | ...v1.mooncake.mooncake_connector | Mooncake integration |
MoRIIOConnector | ...v1.moriio.moriio_connector | MoRIIO integration |
ExampleConnector | ...v1.example_connector | Reference implementation |
DecodeBenchConnector | ...v1.decode_bench_connector | Benchmarking |
Connectors are created via KVConnectorFactory.create_connector(vllm_config, role, kv_cache_config). The factory also checks whether the connector supports the Hybrid Memory Allocator (SupportsHMA); if HMA is enabled and the connector does not implement SupportsHMA, creation fails.
Custom connectors can be loaded from external modules by setting kv_connector_module_path in KVTransferConfig.
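A custom connector is selected by name and module path in the KV transfer config. The dict below is an illustrative payload; the connector class name and module path are hypothetical.

```python
import json

# Hypothetical KVTransferConfig payload loading a custom connector from an
# external module; "MyConnector" and "my_pkg.my_connector" are illustrative.
kv_transfer_config = {
    "kv_connector": "MyConnector",
    "kv_role": "kv_both",
    "kv_connector_module_path": "my_pkg.my_connector",
}

# Passed on the command line as:
#   vllm serve <model> --kv-transfer-config '<this dict as JSON>'
print(json.dumps(kv_transfer_config))
```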
Sources: vllm/distributed/kv_transfer/kv_connector/factory.py27-204
NixlConnector

NixlConnector in vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py is the primary connector for production disaggregated prefilling. It uses the NIXL library (nixl._api.nixl_agent on CUDA, rixl._api.nixl_agent on ROCm) for RDMA-style GPU memory transfer.
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py515-560 vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py854-980
NixlConnectorScheduler

Runs in the scheduler process. Tracks request state across scheduling steps.
State fields:
| Field | Type | Description |
|---|---|---|
_reqs_need_recv | dict[ReqId, tuple[Request, list[int]]] | Requests waiting for decoder to initiate NIXL GET |
_reqs_need_send | dict[ReqId, float] | Prefill requests holding blocks pending decoder read; value is expiry time |
_reqs_need_save | dict[ReqId, Request] | Prefill requests that need KV blocks staged to host buffer (when use_host_buffer=True) |
_reqs_in_batch | set[ReqId] | Requests in current scheduling batch (for do_remote_decode) |
build_connector_meta() snapshots these dicts into a NixlConnectorMetadata, clears the dicts, and returns the metadata for workers to consume.
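The snapshot-and-clear pattern described above can be sketched as follows; the classes are simplified stand-ins for NixlConnectorMetadata and the scheduler, not the real implementations.

```python
class NixlConnectorMetadataSketch:
    """Simplified stand-in for NixlConnectorMetadata."""
    def __init__(self, reqs_to_recv, reqs_to_send):
        self.reqs_to_recv = reqs_to_recv
        self.reqs_to_send = reqs_to_send

class SchedulerSketch:
    def __init__(self):
        self._reqs_need_recv = {}  # ReqId -> (request, block_ids)
        self._reqs_need_send = {}  # ReqId -> expiry timestamp

    def build_connector_meta(self):
        # Snapshot the pending state into per-step metadata for workers...
        meta = NixlConnectorMetadataSketch(
            reqs_to_recv=dict(self._reqs_need_recv),
            reqs_to_send=dict(self._reqs_need_send),
        )
        # ...then reset it so the next scheduling step starts clean.
        self._reqs_need_recv.clear()
        self._reqs_need_send.clear()
        return meta

s = SchedulerSketch()
s._reqs_need_recv["req-1"] = (object(), [0, 1, 2])
meta = s.build_connector_meta()
assert meta.reqs_to_recv["req-1"][1] == [0, 1, 2]
assert not s._reqs_need_recv  # pending state was cleared
```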
request_finished(request, block_ids) — key logic:
- do_remote_decode=True and status is FINISHED_LENGTH_CAPPED: adds the request to _reqs_need_send and returns (True, kv_transfer_params), where kv_transfer_params contains remote_block_ids, remote_engine_id, remote_request_id, remote_host, remote_port, and tp_size
- Otherwise: adds the request to reqs_not_processed and clears it from _reqs_need_save

The ZMQ listener (_nixl_handshake_listener) is a daemon thread that starts once set_xfer_handshake_metadata() is called. It serves NixlHandshakePayload to connecting decoder workers, responding to GET_META_MSG queries with the encoded payload for the requested TP rank.
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py515-852
NixlConnectorWorker

Runs in each GPU worker process. Manages the NIXL agent and executes async block transfers.
Initialization sequence:
- Creates the NixlWrapper (NIXL agent) with a UUID-based name
- Configures nixl_agent_config with num_threads (default 4, to avoid UAR exhaustion on Mellanox NICs) and telemetry capture
- Sets nixl_memory_type to "VRAM" (GPU) or "DRAM" (CPU) based on kv_buffer_device

register_kv_caches(kv_caches):
- Calls nixl_wrapper.get_reg_descs() on each KV cache tensor
- Calls nixl_wrapper.register_memory() to pin the memory regions
- Calls nixl_wrapper.prep_xfer_dlist() to build source-side transfer handles (src_xfer_handles_by_block_size)
- Builds xfer_handshake_metadata (NixlHandshakePayload) for the scheduler to serve to remote workers

start_load_kv(connector_metadata) is called before every forward pass. For each request in reqs_to_recv:
- If the handshake with the remote engine is already complete: calls _read_blocks() immediately
- Otherwise: starts _nixl_handshake() in a ThreadPoolExecutor background thread and queues the transfer for after completion

This makes start_load_kv() non-blocking.
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py854-1200
Before blocks can be transferred, the decoder worker must register the prefiller's NIXL agent. This handshake is initiated lazily on first transfer to a given engine.
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py600-636 vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py1060-1160
NixlAgentMetadata and NixlHandshakePayload

NixlAgentMetadata is the data exchanged during the handshake:
| Field | Type | Description |
|---|---|---|
engine_id | str | Unique identifier of the remote engine |
agent_metadata | bytes | Raw NIXL agent descriptor bytes from nixl_wrapper.get_agent_metadata() |
kv_caches_base_addr | list[int] | Base memory addresses of registered KV cache regions |
device_id | int | GPU device index (== TP rank) |
num_blocks | int | Number of KV blocks in the remote cache |
block_lens | list[int] | Per-layer block sizes in bytes |
kv_cache_layout | str | "HND" or "NHD" |
block_size | int | Token block size |
NixlHandshakePayload wraps the above with a compatibility_hash. The hash covers vLLM version, NIXL_CONNECTOR_VERSION, model architecture, attention backend, and cache dtype. Mismatches are detected before attempting to decode NixlAgentMetadata, providing a clean error message.
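The compatibility check can be sketched as a digest over the fields the hash is said to cover. The hashing scheme below (SHA-256 over a joined string) is illustrative; the real construction in nixl_connector.py may differ.

```python
import hashlib

def compatibility_hash(vllm_version, connector_version, model_arch,
                       attn_backend, cache_dtype):
    """Illustrative digest over the fields the compatibility hash covers;
    the actual hashing scheme in nixl_connector.py may differ."""
    payload = "|".join([vllm_version, str(connector_version), model_arch,
                        attn_backend, cache_dtype])
    return hashlib.sha256(payload.encode()).hexdigest()

a = compatibility_hash("0.8.0", 1, "LlamaForCausalLM", "FLASH_ATTN", "auto")
b = compatibility_hash("0.8.0", 1, "LlamaForCausalLM", "FLASH_ATTN", "fp8")
# A dtype mismatch changes the hash, so it is caught before decoding
# NixlAgentMetadata, yielding a clean error instead of a garbled decode.
assert a != b
assert a == compatibility_hash("0.8.0", 1, "LlamaForCausalLM", "FLASH_ATTN", "auto")
```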
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py145-233
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py783-851 vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py1060-1400
NixlConnectorMetadata Fields

The NixlConnectorMetadata object assembled by build_connector_meta() carries:
| Field | Type | Used By |
|---|---|---|
reqs_to_recv | dict[ReqId, ReqMeta] | Worker: initiate NIXL GET for each request |
reqs_to_save | dict[ReqId, ReqMeta] | Worker: stage KV to host buffer (prefill with use_host_buffer=True) |
reqs_to_send | dict[ReqId, float] | Worker: track expiry of held prefill blocks |
reqs_in_batch | set[ReqId] | Worker: know which decode requests are in current batch |
reqs_not_processed | set[ReqId] | Worker: clean up aborted/non-KV requests |
ReqMeta carries:
| Field | Type | Description |
|---|---|---|
local_block_ids | list[int] | Destination block IDs on the decoder |
local_physical_block_ids | list[int] | Physical block IDs (differs when logical block size ≠ kernel block size) |
tp_size | int | TP size from the producing side |
remote | RemoteMeta \| None | Remote connection info (block_ids, host, port, engine_id, request_id)
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py254-299
NixlConnector.get_required_kvcache_layout() returns "HND" for non-MLA models. The default vLLM layout is "NHD" (tokens × heads × dims). In "HND" (heads × tokens × dims), all data for a given head is contiguous, improving NIXL scatter/gather efficiency during block-level transfers.
Layout is enforced via VLLM_KV_CACHE_LAYOUT at startup and determined globally via get_kv_connector_cache_layout() in vllm/distributed/kv_transfer/kv_connector/utils.py29-42
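The layout difference can be seen with a toy block; the tuples below stand in for KV values, and the nested-list transpose is purely illustrative (the real caches are contiguous GPU tensors).

```python
# Toy block: 4 tokens, 2 heads, 3 dims. NHD = [token][head][dim];
# HND = [head][token][dim]. In HND, all data for one head is contiguous,
# which is what makes block-level NIXL scatter/gather cheaper.
T, H, D = 4, 2, 3
nhd = [[[(t, h, d) for d in range(D)] for h in range(H)] for t in range(T)]

# NHD -> HND is a transpose of the token and head axes.
hnd = [[nhd[t][h] for t in range(T)] for h in range(H)]

# Flattened in HND order, head 0's data occupies one contiguous run:
flat_hnd = [x for head_rows in hnd for row in head_rows for x in row]
assert all(x[1] == 0 for x in flat_hnd[: T * D])  # first T*D entries: head 0
assert all(x[1] == 1 for x in flat_hnd[T * D:])   # remainder: head 1
```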
When enable_cross_layers_blocks = "True" in kv_connector_extra_config and the backend is FLASH_ATTN or FLASHINFER with HND layout, NixlConnector.prefer_cross_layer_blocks returns True.
This causes KVConnectorModelRunnerMixin.use_uniform_kv_cache() to return True, and allocate_uniform_kv_caches() allocates a single tensor for all layers. The tensor shape includes a num_layers dimension, so each block slot holds KV data for all layers contiguously. This reduces the NIXL descriptor count from 2 * num_layers to 1 per block, significantly reducing transfer overhead for models with many layers.
register_cross_layers_kv_cache() is called instead of register_kv_caches() in this mode.
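The descriptor savings stated above reduce to simple arithmetic; the helper below is a sketch of that calculation, not a vLLM function.

```python
def descriptors_per_request(num_layers, num_blocks, cross_layer):
    """Per-request NIXL descriptor count, per the text above: per-layer KV
    caches need 2 * num_layers descriptors per block (K and V per layer),
    while a cross-layer contiguous tensor needs 1 per block."""
    per_block = 1 if cross_layer else 2 * num_layers
    return per_block * num_blocks

# e.g. a 32-layer model transferring 16 blocks:
assert descriptors_per_request(32, 16, cross_layer=False) == 1024
assert descriptors_per_request(32, 16, cross_layer=True) == 16
```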
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py302-363 vllm/v1/worker/kv_connector_model_runner_mixin.py127-240
TpKVTopology in vllm/distributed/kv_transfer/kv_connector/utils.py304-472 handles mismatched TP sizes between prefiller and decoder.
TP ratio computation (TpKVTopology.tp_ratio(remote_tp_size)):
- local_tp ≥ remote_tp → positive ratio local_tp / remote_tp; each local rank reads from one remote rank
- remote_tp > local_tp → negative ratio -(remote_tp / local_tp); each local rank reads from |ratio| remote ranks

At handshake time with remote_tp > local_tp, the worker performs |ratio| handshakes and stores src_xfer_handles_by_tp_ratio[-ratio] as a list of handles, one per remote rank.
TpKVTopology.get_target_remote_ranks(remote_tp_size) returns the list of remote TP ranks a given local rank reads from.
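The signed-ratio convention can be sketched in a few lines. The rank-mapping formulas below are illustrative assumptions; the exact mapping in TpKVTopology may differ in detail.

```python
def tp_ratio(local_tp, remote_tp):
    """Signed TP ratio: positive when local_tp >= remote_tp,
    negative when the remote side has more ranks."""
    if local_tp >= remote_tp:
        return local_tp // remote_tp
    return -(remote_tp // local_tp)

def target_remote_ranks(local_rank, local_tp, remote_tp):
    """Which remote TP ranks a local rank reads from, under the simple
    contiguous mapping assumed here (illustrative, not the real code)."""
    ratio = tp_ratio(local_tp, remote_tp)
    if ratio > 0:
        # Several local ranks share one remote rank.
        return [local_rank // ratio]
    # Each local rank reads from |ratio| consecutive remote ranks.
    return [local_rank * -ratio + i for i in range(-ratio)]

assert tp_ratio(4, 2) == 2 and tp_ratio(2, 4) == -2
assert target_remote_ranks(3, 4, 2) == [1]      # local TP 4, remote TP 2
assert target_remote_ranks(1, 2, 4) == [2, 3]   # local TP 2, remote TP 4
```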
Sources: vllm/distributed/kv_transfer/kv_connector/utils.py304-480 vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py980-1060
MultiConnector

MultiConnector in vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py composes an ordered list of KVConnectorBase_V1 instances. Sub-connectors are configured via the connectors key in kv_connector_extra_config.
Load routing:
- get_num_new_matched_tokens() queries all connectors in order; the first with nonzero tokens is recorded in _requests_to_connector[request_id]
- update_state_after_alloc() forwards to the chosen connector with the actual blocks; all others receive empty blocks

Save routing:
- save_kv_layer() and wait_for_save() are forwarded to all sub-connectors
- get_finished() aggregates results; _extra_async_saves tracks requests where multiple connectors are saving asynchronously (all must finish before blocks are freed)

Metadata:
- build_connector_meta() produces MultiKVConnectorMetadata containing a tuple of each sub-connector's metadata
- bind_connector_metadata() distributes each element to the corresponding sub-connector

Sources: vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py105-434
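A composed setup might look like the config below. The shape of each entry under the connectors key (a per-connector dict with kv_connector and kv_role) is an assumption for illustration; check multi_connector.py for the exact schema.

```python
import json

# Hypothetical two-tier setup: NIXL for cross-instance transfer plus CPU
# offloading, composed via MultiConnector. Values are illustrative.
kv_transfer_config = {
    "kv_connector": "MultiConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "connectors": [
            {"kv_connector": "NixlConnector", "kv_role": "kv_both"},
            {"kv_connector": "OffloadingConnector", "kv_role": "kv_both"},
        ]
    },
}

# Passed to vLLM as: vllm serve <model> --kv-transfer-config '<JSON>'
print(json.dumps(kv_transfer_config))
```

Order matters: because load routing takes the first sub-connector reporting matched tokens, listing NixlConnector first prefers remote KV over the local CPU cache in this sketch.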
OffloadingConnector

OffloadingConnector in vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py offloads KV blocks from GPU to CPU within a single instance. It is not used for cross-instance transfer.
Key components:
| Component | Role |
|---|---|
OffloadingConnectorScheduler | Tracks which block hashes are offloaded; responds to get_num_new_matched_tokens() with cache hits |
OffloadingConnectorWorker | Executes GPU↔CPU copies via CUDA streams (CpuGpuOffloadingHandlers) |
CPUOffloadingSpec | Configuration: cpu_bytes_to_use, block_size |
OffloadingManager | Abstract LRU/ARC manager; determines which blocks to evict |
The connector sets prefer_cross_layer_blocks = True, enabling cross-layer contiguous allocation for efficient block copies.
The spec is configured via:
Sources: vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py116-242 vllm/v1/kv_offload/cpu.py21-100
KVConnectorModelRunnerMixin in vllm/v1/worker/kv_connector_model_runner_mixin.py integrates the connector into GPUModelRunner.execute_model().
Per-step lifecycle:
The context manager _get_kv_connector_output() encapsulates this lifecycle. kv_connector_no_forward() is called when there are no requests to run but async KV operations still need polling.
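The per-step lifecycle can be sketched as a context manager over the forward pass. This is a simplified stand-in for _get_kv_connector_output(); the real mixin also binds metadata, collects stats, and handles the no-forward case.

```python
from contextlib import contextmanager

calls = []

class WorkerConnectorSketch:
    """Stand-in worker-side connector that records the call order."""
    def start_load_kv(self, forward_context):
        calls.append("start_load_kv")
    def wait_for_save(self):
        calls.append("wait_for_save")
    def get_finished(self, finished_req_ids):
        calls.append("get_finished")
        return set(), set()

@contextmanager
def get_kv_connector_output(connector, forward_context):
    # Kick off async KV loads before the forward pass...
    connector.start_load_kv(forward_context)
    try:
        yield  # ...the model forward runs inside the with-block...
    finally:
        # ...then drain saves and poll transfer completion afterwards.
        connector.wait_for_save()
        connector.get_finished(set())

conn = WorkerConnectorSketch()
with get_kv_connector_output(conn, forward_context=None):
    calls.append("forward")
assert calls == ["start_load_kv", "forward", "wait_for_save", "get_finished"]
```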
The global connector singleton is managed by kv_transfer_state.py via get_kv_transfer_group() / has_kv_transfer_group() / ensure_kv_transfer_shutdown().
KVOutputAggregator in vllm/distributed/kv_transfer/kv_connector/utils.py45-155 is used by executors to merge KVConnectorOutput objects from all TP workers before returning to the scheduler. It counts down _expected_finished_count per request ID before declaring a transfer complete.
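The countdown idea can be sketched as a single-step aggregation; the real KVOutputAggregator persists _expected_finished_count across steps, so treat this as a simplification.

```python
from collections import Counter

def aggregate_finished(per_worker_finished, world_size):
    """Single-step sketch of the countdown: a request's transfer counts as
    complete only once every TP worker has reported it finished."""
    counts = Counter()
    for finished in per_worker_finished:  # one set of request IDs per worker
        counts.update(finished)
    return {req_id for req_id, n in counts.items() if n == world_size}

# Workers 0 and 1 both finished "a"; only worker 0 has finished "b" so far,
# so "b" is not yet reported to the scheduler.
done = aggregate_finished([{"a", "b"}, {"a"}], world_size=2)
assert done == {"a"}
```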
Sources: vllm/v1/worker/kv_connector_model_runner_mixin.py40-125 vllm/distributed/kv_transfer/kv_connector/utils.py45-155 vllm/distributed/kv_transfer/kv_transfer_state.py1-60
| Class | Location | Purpose |
|---|---|---|
KVConnectorStats | metrics.py | Base dataclass for transfer stats; has aggregate(), reduce(), is_empty() |
NixlKVConnectorStats | nixl_connector.py | NIXL telemetry: duration, bytes, descriptor count per transfer |
MultiKVConnectorStats | multi_connector.py | Dict of per-sub-connector KVConnectorStats |
OffloadingConnectorStats | offloading_connector.py | Per-direction transfer size and time |
KVConnectorPromMetrics | metrics.py | Base class for Prometheus metric registration and observation |
Stats are collected by get_kv_connector_stats() on the worker side and aggregated by KVOutputAggregator before being returned to the scheduler process.
NixlKVConnectorStats records telemetry from nixlXferTelemetry objects returned by nixl_wrapper.get_xfer_telemetry(handle):
| Field | Unit | Description |
|---|---|---|
xferDuration | µs | Time for the NIXL transfer itself |
postDuration | µs | Post-transfer processing time |
totalBytes | bytes | Data transferred |
descCount | count | NIXL descriptor count |
Sources: vllm/distributed/kv_transfer/kv_connector/v1/metrics.py1-200 vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py1400-1500
KVTransferConfig Fields

| Field | Type | Default | Description |
|---|---|---|---|
kv_connector | str | — | Registered connector name (e.g., "NixlConnector") |
kv_role | str | — | "kv_producer", "kv_consumer", or "kv_both" |
kv_buffer_device | str | "cuda" | Transport buffer: "cuda" (VRAM) or "cpu" (DRAM) |
kv_connector_extra_config | dict | {} | Connector-specific settings |
kv_connector_module_path | str | None | Python module path for custom connectors |
engine_id | str | UUID | Unique identifier for this engine |
kv_load_failure_policy | str | "fail" | "fail" or "recompute" on KV load error |
NixlConnector kv_connector_extra_config Keys

| Key | Type | Default | Description |
|---|---|---|---|
backends | list[str] | ["UCX"] | NIXL transport backends (e.g., ["LIBFABRIC"]) |
num_threads | int | 4 | NIXL worker thread count (limits UAR exhaustion) |
enable_cross_layers_blocks | str | "False" | Enable cross-layer contiguous KV tensor |
enable_permute_local_kv | str | "False" | Enable HND↔NHD post-receive permutation (heterogeneous layout) |
| Variable | Default | Description |
|---|---|---|
VLLM_NIXL_SIDE_CHANNEL_PORT | 5600 | ZMQ handshake listener port; port + dp_rank per DP worker |
VLLM_NIXL_SIDE_CHANNEL_HOST | "localhost" | Hostname advertised to decoders for handshake connection |
VLLM_NIXL_ABORT_REQUEST_TIMEOUT | 480 | Seconds before held prefill blocks are force-released |
VLLM_KV_CACHE_LAYOUT | "NHD" | Override KV tensor layout; "HND" recommended for NIXL |
UCX_RCACHE_MAX_UNRELEASED | "1024" (auto-set) | Prevents UCX memory leak on Mellanox NICs |
UCX_NET_DEVICES | — | UCX network device selection (e.g., "all" or "mlx5_0:1") |
UCX_TLS | — | UCX transport layer selection |
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py84-143 docs/features/nixl_connector_usage.md113-132