This page documents the KV cache transfer subsystem used for disaggregated prefilling in vLLM v1. It covers the KVConnectorBase_V1 abstract interface, the NixlConnector implementation (and its NixlConnectorScheduler / NixlConnectorWorker sub-components), the MultiConnector composition wrapper, and KVConnectorFactory. It also covers the OffloadingConnector for single-instance CPU KV offloading, which shares the same interface.
For KV block management within a single instance, see KV Cache Management. For the scheduler that produces SchedulerOutput consumed here, see Scheduler and Resource Allocation. For ZMQ/shared memory communication primitives used by the side channel, see Communication Infrastructure.
Disaggregated prefilling separates the prefill and decode phases of LLM inference onto different vLLM instances. A prefiller (producer) runs the prompt through the model to compute KV cache blocks, then transfers those blocks over the network to a decoder (consumer) that proceeds directly to token generation.
A proxy server orchestrates requests: it first sends the prompt to a prefiller, then forwards the completed request (with KV transfer metadata attached) to a decoder.
Architecture:
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py515-851 docs/features/nixl_connector_usage.md1-115
KVConnectorBase_V1

All connectors implement the abstract class KVConnectorBase_V1 defined in vllm/distributed/kv_transfer/kv_connector/v1/base.py. The interface enforces a strict split between methods that run in the scheduler process and methods that run in worker processes.
Class hierarchy:
Sources: vllm/distributed/kv_transfer/kv_connector/v1/base.py121-608 vllm/distributed/kv_transfer/kv_connector/factory.py142-204
KVConnectorRole is an enum with two values. Each connector is instantiated separately for each role:
| Role | Where It Runs | Responsibilities |
|---|---|---|
KVConnectorRole.SCHEDULER | Scheduler process | Request tracking, metadata assembly, block hold/release decisions |
KVConnectorRole.WORKER | Each GPU worker process | Memory registration, async transfers, transfer status polling |
| Method | Signature | Purpose |
|---|---|---|
get_num_new_matched_tokens | (request, num_computed_tokens) → (int \| None, bool) | Returns count of tokens loadable from external KV cache and whether the load is async
update_state_after_alloc | (request, blocks, num_external_tokens) | Called after block allocation; connector records which blocks need loading/saving |
build_connector_meta | (scheduler_output) → KVConnectorMetadata | Assembles per-step metadata for workers; resets internal pending state |
request_finished | (request, block_ids) → (bool, dict \| None) | Called once when a request completes; returns (delay_free, kv_transfer_params)
set_xfer_handshake_metadata | (metadata: dict[int, KVConnectorHandshakeMetadata]) | Installs worker handshake payloads indexed by TP rank |
| Method | Purpose |
|---|---|
register_kv_caches(kv_caches) | Called at startup with per-layer KV tensors; registers GPU memory with NIXL |
register_cross_layers_kv_cache(kv_cache, attn_backend) | Alternative to above for cross-layer contiguous tensor |
start_load_kv(forward_context) | Initiates async KV transfers (non-blocking) |
wait_for_layer_load(layer_name) | Blocks until a given layer's KV is loaded |
save_kv_layer(layer_name, kv_layer, attn_metadata) | Saves a layer's KV to the connector |
wait_for_save() | Blocks until all in-progress saves complete |
get_finished(finished_req_ids) → (set, set) | Returns (finished_sending, finished_recving) request ID sets |
get_block_ids_with_load_errors() → set[int] | Returns block IDs that failed to load |
| Class | Purpose |
|---|---|
KVConnectorMetadata | Abstract; per-step data from scheduler → workers |
KVConnectorHandshakeMetadata | Abstract; out-of-band worker↔worker initialization data |
Sources: vllm/distributed/kv_transfer/kv_connector/v1/base.py60-608
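The scheduler/worker split can be illustrated with a minimal no-op connector. The sketch below is self-contained: it stubs the role enum and omits the real base class rather than importing vLLM, and the method bodies show only the documented return shapes.

```python
from enum import Enum

class KVConnectorRole(Enum):  # stand-in for vLLM's enum
    SCHEDULER = 0
    WORKER = 1

class NoOpConnector:
    """Sketch of the interface split; the real abstract class lives in
    vllm/distributed/kv_transfer/kv_connector/v1/base.py."""

    def __init__(self, role: KVConnectorRole):
        self.role = role

    # --- Scheduler-side methods (run in the scheduler process) ---
    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        # No external KV available: zero tokens, and the load is not async.
        return 0, False

    def request_finished(self, request, block_ids):
        # (delay_free, kv_transfer_params): free blocks now, no params.
        return False, None

    # --- Worker-side methods (run in each GPU worker process) ---
    def start_load_kv(self, forward_context):
        pass  # nothing to load

    def get_finished(self, finished_req_ids):
        # (finished_sending, finished_recving) request ID sets
        return set(), set()

sched = NoOpConnector(KVConnectorRole.SCHEDULER)
assert sched.get_num_new_matched_tokens(None, 0) == (0, False)
```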
KVConnectorFactory

KVConnectorFactory in vllm/distributed/kv_transfer/kv_connector/factory.py maintains a string-keyed registry of connector classes, loaded lazily via module path.
Registration:
Pre-registered connectors (registered at the bottom of factory.py):
| Name | Module | Primary Use |
|---|---|---|
NixlConnector | vllm.distributed.kv_transfer.kv_connector.v1.nixl_connector | Disaggregated prefilling (RDMA) |
MultiConnector | vllm.distributed.kv_transfer.kv_connector.v1.multi_connector | Compose multiple connectors |
OffloadingConnector | vllm.distributed.kv_transfer.kv_connector.v1.offloading_connector | CPU KV offloading |
P2pNcclConnector | ...v1.p2p.p2p_nccl_connector | NCCL-based disagg transfer |
LMCacheConnectorV1 | ...v1.lmcache_connector | LMCache integration |
LMCacheMPConnector | ...v1.lmcache_mp_connector | LMCache multiprocessing |
MooncakeConnector | ...v1.mooncake.mooncake_connector | Mooncake integration |
MoRIIOConnector | ...v1.moriio.moriio_connector | MoRIIO integration |
ExampleConnector | ...v1.example_connector | Reference implementation |
DecodeBenchConnector | ...v1.decode_bench_connector | Benchmarking |
Connectors are created via KVConnectorFactory.create_connector(vllm_config, role, kv_cache_config). The factory also checks whether the connector supports the Hybrid Memory Allocator (SupportsHMA); if HMA is enabled and the connector does not implement SupportsHMA, creation fails.
Custom connectors can be loaded from external modules by setting kv_connector_module_path in KVTransferConfig.
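A custom connector is selected by name and module path in the KV transfer config. The dict below is an illustrative payload; the connector class name and module path are hypothetical.

```python
import json

# Hypothetical KVTransferConfig payload loading a custom connector from an
# external module; "MyConnector" and "my_pkg.my_connector" are illustrative.
kv_transfer_config = {
    "kv_connector": "MyConnector",
    "kv_role": "kv_both",
    "kv_connector_module_path": "my_pkg.my_connector",
}

# Passed on the command line as:
#   vllm serve <model> --kv-transfer-config '<this dict as JSON>'
print(json.dumps(kv_transfer_config))
```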
Sources: vllm/distributed/kv_transfer/kv_connector/factory.py27-204
NixlConnector

NixlConnector in vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py is the primary connector for production disaggregated prefilling. It uses the NIXL library (nixl._api.nixl_agent on CUDA, rixl._api.nixl_agent on ROCm) for RDMA-style GPU memory transfer.
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py515-560 vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py854-980
NixlConnectorScheduler

Runs in the scheduler process. Tracks request state across scheduling steps.
State fields:
| Field | Type | Description |
|---|---|---|
_reqs_need_recv | dict[ReqId, tuple[Request, list[int]]] | Requests waiting for decoder to initiate NIXL GET |
_reqs_need_send | dict[ReqId, float] | Prefill requests holding blocks pending decoder read; value is expiry time |
_reqs_need_save | dict[ReqId, Request] | Prefill requests that need KV blocks staged to host buffer (when use_host_buffer=True) |
_reqs_in_batch | set[ReqId] | Requests in current scheduling batch (for do_remote_decode) |
build_connector_meta() snapshots these dicts into a NixlConnectorMetadata, clears the dicts, and returns the metadata for workers to consume.
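The snapshot-and-clear pattern described above can be sketched as follows; the classes are simplified stand-ins for NixlConnectorMetadata and the scheduler, not the real implementations.

```python
class NixlConnectorMetadataSketch:
    """Simplified stand-in for NixlConnectorMetadata."""
    def __init__(self, reqs_to_recv, reqs_to_send):
        self.reqs_to_recv = reqs_to_recv
        self.reqs_to_send = reqs_to_send

class SchedulerSketch:
    def __init__(self):
        self._reqs_need_recv = {}  # ReqId -> (request, block_ids)
        self._reqs_need_send = {}  # ReqId -> expiry timestamp

    def build_connector_meta(self):
        # Snapshot the pending state into per-step metadata for workers...
        meta = NixlConnectorMetadataSketch(
            reqs_to_recv=dict(self._reqs_need_recv),
            reqs_to_send=dict(self._reqs_need_send),
        )
        # ...then reset it so the next scheduling step starts clean.
        self._reqs_need_recv.clear()
        self._reqs_need_send.clear()
        return meta

s = SchedulerSketch()
s._reqs_need_recv["req-1"] = (object(), [0, 1, 2])
meta = s.build_connector_meta()
assert meta.reqs_to_recv["req-1"][1] == [0, 1, 2]
assert not s._reqs_need_recv  # pending state was cleared
```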
request_finished(request, block_ids) — key logic:
- do_remote_decode=True and status is FINISHED_LENGTH_CAPPED: adds the request to _reqs_need_send and returns (True, kv_transfer_params), where kv_transfer_params contains remote_block_ids, remote_engine_id, remote_request_id, remote_host, remote_port, and tp_size
- Otherwise: adds the request to reqs_not_processed and clears it from _reqs_need_save

The ZMQ listener (_nixl_handshake_listener) is a daemon thread that starts once set_xfer_handshake_metadata() is called. It serves NixlHandshakePayload to connecting decoder workers, responding to GET_META_MSG queries with the encoded payload for the requested TP rank.
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py515-852
NixlConnectorWorker

Runs in each GPU worker process. Manages the NIXL agent and executes async block transfers.
Initialization sequence:
- Creates the NixlWrapper (NIXL agent) with a UUID-based name
- Configures nixl_agent_config with num_threads (default 4, to avoid UAR exhaustion on Mellanox NICs) and telemetry capture
- Sets nixl_memory_type to "VRAM" (GPU) or "DRAM" (CPU) based on kv_buffer_device

register_kv_caches(kv_caches):
- Calls nixl_wrapper.get_reg_descs() on each KV cache tensor
- Calls nixl_wrapper.register_memory() to pin the memory regions
- Calls nixl_wrapper.prep_xfer_dlist() to build source-side transfer handles (src_xfer_handles_by_block_size)
- Builds xfer_handshake_metadata (NixlHandshakePayload) for the scheduler to serve to remote workers

start_load_kv(connector_metadata) is called before every forward pass. For each request in reqs_to_recv:
- If the handshake with the remote engine is already complete: calls _read_blocks() immediately
- Otherwise: starts _nixl_handshake() in a ThreadPoolExecutor background thread and queues the transfer for after completion

This makes start_load_kv() non-blocking.
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py854-1200
Before blocks can be transferred, the decoder worker must register the prefiller's NIXL agent. This handshake is initiated lazily on first transfer to a given engine.
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py600-636 vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py1060-1160
NixlAgentMetadata and NixlHandshakePayload

NixlAgentMetadata is the data exchanged during the handshake:
| Field | Type | Description |
|---|---|---|
engine_id | str | Unique identifier of the remote engine |
agent_metadata | bytes | Raw NIXL agent descriptor bytes from nixl_wrapper.get_agent_metadata() |
kv_caches_base_addr | list[int] | Base memory addresses of registered KV cache regions |
device_id | int | GPU device index (== TP rank) |
num_blocks | int | Number of KV blocks in the remote cache |
block_lens | list[int] | Per-layer block sizes in bytes |
kv_cache_layout | str | "HND" or "NHD" |
block_size | int | Token block size |
NixlHandshakePayload wraps the above with a compatibility_hash. The hash covers vLLM version, NIXL_CONNECTOR_VERSION, model architecture, attention backend, and cache dtype. Mismatches are detected before attempting to decode NixlAgentMetadata, providing a clean error message.
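The compatibility check can be sketched as a digest over the fields the hash is said to cover. The hashing scheme below (SHA-256 over a joined string) is illustrative; the real construction in nixl_connector.py may differ.

```python
import hashlib

def compatibility_hash(vllm_version, connector_version, model_arch,
                       attn_backend, cache_dtype):
    """Illustrative digest over the fields the compatibility hash covers;
    the actual hashing scheme in nixl_connector.py may differ."""
    payload = "|".join([vllm_version, str(connector_version), model_arch,
                        attn_backend, cache_dtype])
    return hashlib.sha256(payload.encode()).hexdigest()

a = compatibility_hash("0.8.0", 1, "LlamaForCausalLM", "FLASH_ATTN", "auto")
b = compatibility_hash("0.8.0", 1, "LlamaForCausalLM", "FLASH_ATTN", "fp8")
# A dtype mismatch changes the hash, so it is caught before decoding
# NixlAgentMetadata, yielding a clean error instead of a garbled decode.
assert a != b
assert a == compatibility_hash("0.8.0", 1, "LlamaForCausalLM", "FLASH_ATTN", "auto")
```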
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py145-233
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py783-851 vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py1060-1400
NixlConnectorMetadata Fields

The NixlConnectorMetadata object assembled by build_connector_meta() carries:
| Field | Type | Used By |
|---|---|---|
reqs_to_recv | dict[ReqId, ReqMeta] | Worker: initiate NIXL GET for each request |
reqs_to_save | dict[ReqId, ReqMeta] | Worker: stage KV to host buffer (prefill with use_host_buffer=True) |
reqs_to_send | dict[ReqId, float] | Worker: track expiry of held prefill blocks |
reqs_in_batch | set[ReqId] | Worker: know which decode requests are in current batch |
reqs_not_processed | set[ReqId] | Worker: clean up aborted/non-KV requests |
ReqMeta carries:
| Field | Type | Description |
|---|---|---|
local_block_ids | list[int] | Destination block IDs on the decoder |
local_physical_block_ids | list[int] | Physical block IDs (differs when logical block size ≠ kernel block size) |
tp_size | int | TP size from the producing side |
remote | RemoteMeta \| None | Remote connection info (block_ids, host, port, engine_id, request_id)
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py254-299
NixlConnector.get_required_kvcache_layout() returns "HND" for non-MLA models. The default vLLM layout is "NHD" (tokens × heads × dims). In "HND" (heads × tokens × dims), all data for a given head is contiguous, improving NIXL scatter/gather efficiency during block-level transfers.
Layout is enforced via VLLM_KV_CACHE_LAYOUT at startup and determined globally via get_kv_connector_cache_layout() in vllm/distributed/kv_transfer/kv_connector/utils.py29-42
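The layout difference can be seen with a toy block; the tuples below stand in for KV values, and the nested-list transpose is purely illustrative (the real caches are contiguous GPU tensors).

```python
# Toy block: 4 tokens, 2 heads, 3 dims. NHD = [token][head][dim];
# HND = [head][token][dim]. In HND, all data for one head is contiguous,
# which is what makes block-level NIXL scatter/gather cheaper.
T, H, D = 4, 2, 3
nhd = [[[(t, h, d) for d in range(D)] for h in range(H)] for t in range(T)]

# NHD -> HND is a transpose of the token and head axes.
hnd = [[nhd[t][h] for t in range(T)] for h in range(H)]

# Flattened in HND order, head 0's data occupies one contiguous run:
flat_hnd = [x for head_rows in hnd for row in head_rows for x in row]
assert all(x[1] == 0 for x in flat_hnd[: T * D])  # first T*D entries: head 0
assert all(x[1] == 1 for x in flat_hnd[T * D:])   # remainder: head 1
```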
When enable_cross_layers_blocks = "True" in kv_connector_extra_config and the backend is FLASH_ATTN or FLASHINFER with HND layout, NixlConnector.prefer_cross_layer_blocks returns True.
This causes KVConnectorModelRunnerMixin.use_uniform_kv_cache() to return True, and allocate_uniform_kv_caches() allocates a single tensor for all layers. The tensor shape includes a num_layers dimension, so each block slot holds KV data for all layers contiguously. This reduces the NIXL descriptor count from 2 * num_layers to 1 per block, significantly reducing transfer overhead for models with many layers.
register_cross_layers_kv_cache() is called instead of register_kv_caches() in this mode.
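The descriptor savings stated above reduce to simple arithmetic; the helper below is a sketch of that calculation, not a vLLM function.

```python
def descriptors_per_request(num_layers, num_blocks, cross_layer):
    """Per-request NIXL descriptor count, per the text above: per-layer KV
    caches need 2 * num_layers descriptors per block (K and V per layer),
    while a cross-layer contiguous tensor needs 1 per block."""
    per_block = 1 if cross_layer else 2 * num_layers
    return per_block * num_blocks

# e.g. a 32-layer model transferring 16 blocks:
assert descriptors_per_request(32, 16, cross_layer=False) == 1024
assert descriptors_per_request(32, 16, cross_layer=True) == 16
```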
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py302-363 vllm/v1/worker/kv_connector_model_runner_mixin.py127-240
TpKVTopology in vllm/distributed/kv_transfer/kv_connector/utils.py304-472 handles mismatched TP sizes between prefiller and decoder.
TP ratio computation (TpKVTopology.tp_ratio(remote_tp_size)):
- local_tp ≥ remote_tp → positive ratio local_tp / remote_tp; each local rank reads from one remote rank
- remote_tp > local_tp → negative ratio -(remote_tp / local_tp); each local rank reads from |ratio| remote ranks

At handshake time with remote_tp > local_tp, the worker performs |ratio| handshakes and stores src_xfer_handles_by_tp_ratio[-ratio] as a list of handles, one per remote rank.
TpKVTopology.get_target_remote_ranks(remote_tp_size) returns the list of remote TP ranks a given local rank reads from.
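The signed-ratio convention can be sketched in a few lines. The rank-mapping formulas below are illustrative assumptions; the exact mapping in TpKVTopology may differ in detail.

```python
def tp_ratio(local_tp, remote_tp):
    """Signed TP ratio: positive when local_tp >= remote_tp,
    negative when the remote side has more ranks."""
    if local_tp >= remote_tp:
        return local_tp // remote_tp
    return -(remote_tp // local_tp)

def target_remote_ranks(local_rank, local_tp, remote_tp):
    """Which remote TP ranks a local rank reads from, under the simple
    contiguous mapping assumed here (illustrative, not the real code)."""
    ratio = tp_ratio(local_tp, remote_tp)
    if ratio > 0:
        # Several local ranks share one remote rank.
        return [local_rank // ratio]
    # Each local rank reads from |ratio| consecutive remote ranks.
    return [local_rank * -ratio + i for i in range(-ratio)]

assert tp_ratio(4, 2) == 2 and tp_ratio(2, 4) == -2
assert target_remote_ranks(3, 4, 2) == [1]      # local TP 4, remote TP 2
assert target_remote_ranks(1, 2, 4) == [2, 3]   # local TP 2, remote TP 4
```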
Sources: vllm/distributed/kv_transfer/kv_connector/utils.py304-480 vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py980-1060
MultiConnector

MultiConnector in vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py composes an ordered list of KVConnectorBase_V1 instances. Sub-connectors are configured via the connectors key in kv_connector_extra_config.
Load routing:
- get_num_new_matched_tokens() queries all connectors in order; the first with nonzero tokens is recorded in _requests_to_connector[request_id]
- update_state_after_alloc() forwards to the chosen connector with the actual blocks; all others receive empty blocks

Save routing:
- save_kv_layer() and wait_for_save() are forwarded to all sub-connectors
- get_finished() aggregates results; _extra_async_saves tracks requests where multiple connectors are saving asynchronously (all must finish before blocks are freed)

Metadata:
- build_connector_meta() produces MultiKVConnectorMetadata containing a tuple of each sub-connector's metadata
- bind_connector_metadata() distributes each element to the corresponding sub-connector

Sources: vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py105-434
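A composed setup might look like the config below. The shape of each entry under the connectors key (a per-connector dict with kv_connector and kv_role) is an assumption for illustration; check multi_connector.py for the exact schema.

```python
import json

# Hypothetical two-tier setup: NIXL for cross-instance transfer plus CPU
# offloading, composed via MultiConnector. Values are illustrative.
kv_transfer_config = {
    "kv_connector": "MultiConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "connectors": [
            {"kv_connector": "NixlConnector", "kv_role": "kv_both"},
            {"kv_connector": "OffloadingConnector", "kv_role": "kv_both"},
        ]
    },
}

# Passed to vLLM as: vllm serve <model> --kv-transfer-config '<JSON>'
print(json.dumps(kv_transfer_config))
```

Order matters: because load routing takes the first sub-connector reporting matched tokens, listing NixlConnector first prefers remote KV over the local CPU cache in this sketch.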
OffloadingConnector

OffloadingConnector in vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py offloads KV blocks from GPU to CPU within a single instance. It is not used for cross-instance transfer.
Key components:
| Component | Role |
|---|---|
OffloadingConnectorScheduler | Tracks which block hashes are offloaded; responds to get_num_new_matched_tokens() with cache hits |
OffloadingConnectorWorker | Executes GPU↔CPU copies via CUDA streams (CpuGpuOffloadingHandlers) |
CPUOffloadingSpec | Configuration: cpu_bytes_to_use, block_size |
OffloadingManager | Abstract LRU/ARC manager; determines which blocks to evict |
The connector sets prefer_cross_layer_blocks = True, enabling cross-layer contiguous allocation for efficient block copies.
The spec is configured via:
Sources: vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py116-242 vllm/v1/kv_offload/cpu.py21-100
KVConnectorModelRunnerMixin in vllm/v1/worker/kv_connector_model_runner_mixin.py integrates the connector into GPUModelRunner.execute_model().
Per-step lifecycle:
The context manager _get_kv_connector_output() encapsulates this lifecycle. kv_connector_no_forward() is called when there are no requests to run but async KV operations still need polling.
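The per-step lifecycle can be sketched as a context manager over the forward pass. This is a simplified stand-in for _get_kv_connector_output(); the real mixin also binds metadata, collects stats, and handles the no-forward case.

```python
from contextlib import contextmanager

calls = []

class WorkerConnectorSketch:
    """Stand-in worker-side connector that records the call order."""
    def start_load_kv(self, forward_context):
        calls.append("start_load_kv")
    def wait_for_save(self):
        calls.append("wait_for_save")
    def get_finished(self, finished_req_ids):
        calls.append("get_finished")
        return set(), set()

@contextmanager
def get_kv_connector_output(connector, forward_context):
    # Kick off async KV loads before the forward pass...
    connector.start_load_kv(forward_context)
    try:
        yield  # ...the model forward runs inside the with-block...
    finally:
        # ...then drain saves and poll transfer completion afterwards.
        connector.wait_for_save()
        connector.get_finished(set())

conn = WorkerConnectorSketch()
with get_kv_connector_output(conn, forward_context=None):
    calls.append("forward")
assert calls == ["start_load_kv", "forward", "wait_for_save", "get_finished"]
```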
The global connector singleton is managed by kv_transfer_state.py via get_kv_transfer_group() / has_kv_transfer_group() / ensure_kv_transfer_shutdown().
KVOutputAggregator in vllm/distributed/kv_transfer/kv_connector/utils.py45-155 is used by executors to merge KVConnectorOutput objects from all TP workers before returning to the scheduler. It counts down _expected_finished_count per request ID before declaring a transfer complete.
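The countdown idea can be sketched as a single-step aggregation; the real KVOutputAggregator persists _expected_finished_count across steps, so treat this as a simplification.

```python
from collections import Counter

def aggregate_finished(per_worker_finished, world_size):
    """Single-step sketch of the countdown: a request's transfer counts as
    complete only once every TP worker has reported it finished."""
    counts = Counter()
    for finished in per_worker_finished:  # one set of request IDs per worker
        counts.update(finished)
    return {req_id for req_id, n in counts.items() if n == world_size}

# Workers 0 and 1 both finished "a"; only worker 0 has finished "b" so far,
# so "b" is not yet reported to the scheduler.
done = aggregate_finished([{"a", "b"}, {"a"}], world_size=2)
assert done == {"a"}
```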
Sources: vllm/v1/worker/kv_connector_model_runner_mixin.py40-125 vllm/distributed/kv_transfer/kv_connector/utils.py45-155 vllm/distributed/kv_transfer/kv_transfer_state.py1-60
| Class | Location | Purpose |
|---|---|---|
KVConnectorStats | metrics.py | Base dataclass for transfer stats; has aggregate(), reduce(), is_empty() |
NixlKVConnectorStats | nixl_connector.py | NIXL telemetry: duration, bytes, descriptor count per transfer |
MultiKVConnectorStats | multi_connector.py | Dict of per-sub-connector KVConnectorStats |
OffloadingConnectorStats | offloading_connector.py | Per-direction transfer size and time |
KVConnectorPromMetrics | metrics.py | Base class for Prometheus metric registration and observation |
Stats are collected by get_kv_connector_stats() on the worker side and aggregated by KVOutputAggregator before being returned to the scheduler process.
NixlKVConnectorStats records telemetry from nixlXferTelemetry objects returned by nixl_wrapper.get_xfer_telemetry(handle):
| Field | Unit | Description |
|---|---|---|
xferDuration | µs | Time for the NIXL transfer itself |
postDuration | µs | Post-transfer processing time |
totalBytes | bytes | Data transferred |
descCount | count | NIXL descriptor count |
Sources: vllm/distributed/kv_transfer/kv_connector/v1/metrics.py1-200 vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py1400-1500
KVTransferConfig Fields

| Field | Type | Default | Description |
|---|---|---|---|
kv_connector | str | — | Registered connector name (e.g., "NixlConnector") |
kv_role | str | — | "kv_producer", "kv_consumer", or "kv_both" |
kv_buffer_device | str | "cuda" | Transport buffer: "cuda" (VRAM) or "cpu" (DRAM) |
kv_connector_extra_config | dict | {} | Connector-specific settings |
kv_connector_module_path | str | None | Python module path for custom connectors |
engine_id | str | UUID | Unique identifier for this engine |
kv_load_failure_policy | str | "fail" | "fail" or "recompute" on KV load error |
NixlConnector kv_connector_extra_config Keys

| Key | Type | Default | Description |
|---|---|---|---|
backends | list[str] | ["UCX"] | NIXL transport backends (e.g., ["LIBFABRIC"]) |
num_threads | int | 4 | NIXL worker thread count (limits UAR exhaustion) |
enable_cross_layers_blocks | str | "False" | Enable cross-layer contiguous KV tensor |
enable_permute_local_kv | str | "False" | Enable HND↔NHD post-receive permutation (heterogeneous layout) |
| Variable | Default | Description |
|---|---|---|
VLLM_NIXL_SIDE_CHANNEL_PORT | 5600 | ZMQ handshake listener port; port + dp_rank per DP worker |
VLLM_NIXL_SIDE_CHANNEL_HOST | "localhost" | Hostname advertised to decoders for handshake connection |
VLLM_NIXL_ABORT_REQUEST_TIMEOUT | 480 | Seconds before held prefill blocks are force-released |
VLLM_KV_CACHE_LAYOUT | "NHD" | Override KV tensor layout; "HND" recommended for NIXL |
UCX_RCACHE_MAX_UNRELEASED | "1024" (auto-set) | Prevents UCX memory leak on Mellanox NICs |
UCX_NET_DEVICES | — | UCX network device selection (e.g., "all" or "mlx5_0:1") |
UCX_TLS | — | UCX transport layer selection |
Sources: vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py84-143 docs/features/nixl_connector_usage.md113-132