This page covers how vLLM loads, stores, and applies LoRA adapters at inference time. It spans the full stack from the user-facing LoRARequest API, through layer replacement and weight storage, to the Triton kernels that apply adapter weights during the forward pass.
For information about model weight loading more generally, see Weight Loading and Model Initialization. For details on the OpenAI-compatible HTTP server endpoints that handle LoRA serving (including the /v1/load_lora_adapter route), see OpenAI-Compatible API Server. For configuration objects such as VllmConfig, see Configuration Objects.
vLLM implements multi-LoRA serving: a single server process can serve requests that target different LoRA adapters simultaneously, all atop one shared base model. The implementation is based on Punica (Chen et al., 2023), using batched grouped-GEMM Triton kernels to apply per-token LoRA computations without materializing separate per-adapter weights in the model graph.
Key design properties:
- Static slot allocation: weight tensors shaped (max_loras, ...) are allocated for every LoRA-capable layer. Active adapters are copied into slots within these tensors.
- Layer wrapping: each eligible layer is replaced with a LoRA-aware wrapper (a BaseLayerWithLoRA subclass) that knows how to invoke the Punica kernels.
- MoE support: fused Mixture-of-Experts layers (FusedMoE) have dedicated LoRA wrappers that apply adapter weights per-expert, per-token.

LoRARequest

LoRARequest (vllm/lora/request.py) is the user-facing handle for a single adapter. It is attached to a generation request to indicate which adapter to use.
| Field | Type | Description |
|---|---|---|
lora_name | str | Human-readable name |
lora_int_id | int | Globally unique integer ID |
lora_path | str | Local path or HuggingFace repo ID |
base_model_name | str | None | Optional base model assertion |
tensorizer_config_dict | dict | None | For tensorized adapters |
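As an illustration, the fields above can be mirrored in a minimal dataclass sketch. This is not vLLM's class; in real code you would import LoRARequest from vllm.lora.request and attach it to a generate call:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoRARequestSketch:
    """Illustrative stand-in for vllm.lora.request.LoRARequest."""
    lora_name: str                 # human-readable name
    lora_int_id: int               # globally unique integer ID (non-zero; 0 means "no LoRA" in token mappings)
    lora_path: str                 # local path or HuggingFace repo ID
    base_model_name: Optional[str] = None        # optional base model assertion
    tensorizer_config_dict: Optional[dict] = None  # for tensorized adapters


# In real code the handle is passed per request, e.g.:
#   llm.generate(prompts, sampling_params,
#                lora_request=LoRARequest("sql_adapter", 1, "org/sql-lora"))
req = LoRARequestSketch("sql_adapter", 1, "org/sql-lora")
```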
LoRAConfig

LoRAConfig (vllm/config/lora.py) is set at engine startup and governs GPU memory allocation and kernel behavior.
| Field | Default | Description |
|---|---|---|
max_loras | — | Maximum concurrent adapters in GPU memory |
max_lora_rank | — | Maximum rank across all adapters |
lora_dtype | None | Weight dtype (float16, bfloat16) |
max_cpu_loras | None | CPU cache size (for LRU offload) |
fully_sharded_loras | False | Shard LoRA weights across TP ranks |
specialize_active_lora | False | Specialize CUDA graphs per LoRA count |
The max_loras field directly determines how many GPU weight slots are allocated at startup. Every LoRA-capable layer allocates tensors shaped (max_loras, ...) at initialization time.
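A back-of-envelope calculation shows how max_loras and max_lora_rank drive this allocation, using the lora_a_stacked / lora_b_stacked shapes described later on this page. All concrete sizes below are hypothetical:

```python
# Slot memory for one LoRA-capable linear layer (hypothetical sizes).
max_loras, max_lora_rank = 4, 16
hidden_size, output_size = 4096, 4096
bytes_per_elem = 2  # float16 / bfloat16

# Stacked buffers allocated once at startup for this layer:
lora_a_elems = max_loras * 1 * hidden_size * max_lora_rank   # (max_loras, 1, hidden, rank)
lora_b_elems = max_loras * 1 * max_lora_rank * output_size   # (max_loras, 1, rank, output)

total_mib = (lora_a_elems + lora_b_elems) * bytes_per_elem / 2**20
print(f"{total_mib:.1f} MiB per layer")  # 1.0 MiB per layer
```

The buffers exist whether or not any adapter is loaded, which is why max_loras is a hard memory trade-off rather than a soft limit.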
Sources: vllm/config/lora.py vllm/lora/request.py
Component overview: LoRA Adapter Management
Sources: vllm/lora/worker_manager.py vllm/lora/model_manager.py vllm/lora/lora_model.py vllm/lora/punica_wrapper/punica_gpu.py
When LoRA is enabled, the LoRAModelManager walks the model's modules and replaces each eligible layer with a LoRA-aware wrapper. This happens in LoRAModelManager._replace_modules() (vllm/lora/model_manager.py) using two factory functions from vllm/lora/utils.py:
- from_layer(layer, max_loras, lora_config, packed_modules_list, model_config) — iterates _all_lora_classes, calls can_replace_layer() on each, returns the first match, and calls create_lora_weights() on it.
- from_layer_logits_processor(layer, lm_head, ...) — special-cases LogitsProcessor → LogitsProcessorWithLoRA.

Layer class registry (_all_lora_classes)
| Base Layer | LoRA Wrapper |
|---|---|
VocabParallelEmbedding | VocabParallelEmbeddingWithLoRA |
ColumnParallelLinear | ColumnParallelLinearWithLoRA / WithShardedLoRA |
MergedColumnParallelLinear | MergedColumnParallelLinearWithLoRA / WithShardedLoRA / VariableSlice |
QKVParallelLinear | QKVParallelLinearWithLoRA / WithShardedLoRA |
MergedQKVParallelLinear | MergedQKVParallelLinearWithLoRA / WithShardedLoRA |
RowParallelLinear | RowParallelLinearWithLoRA / WithShardedLoRA |
ReplicatedLinear | ReplicatedLinearWithLoRA |
LogitsProcessor | LogitsProcessorWithLoRA |
FusedMoE (packed_modules len=2) | FusedMoEWithLoRA |
FusedMoE (packed_modules len=1) | FusedMoE3DWithLoRA |
Sources: vllm/lora/utils.py75-92 vllm/lora/utils.py103-121
How BaseLayerWithLoRA works
Sources: vllm/lora/layers/fused_moe.py44-59 vllm/lora/layers/logits_processor.py20-50
Each BaseLayerWithLoRA pre-allocates GPU tensors in its create_lora_weights() method. The slot dimension (axis 0) has size max_loras, so all LoRA adapters share one contiguous buffer:
- lora_a_stacked: shape (max_loras, 1, hidden_size, max_lora_rank) — down-projection
- lora_b_stacked: shape (max_loras, 1, max_lora_rank, output_size) — up-projection

The 1 in the second dimension is a padding dimension for kernel compatibility. The weight tensors are initialized to zeros; non-zero entries indicate an active adapter at that slot.
For MoE layers, the tensors add an expert dimension:
- w13_lora_a_stacked: shape (max_loras, num_experts, max_lora_rank, hidden_size)
- w13_lora_b_stacked: shape (max_loras, num_experts, intermediate_size, max_lora_rank)
- w2_lora_a_stacked, w2_lora_b_stacked: similarly shaped for the down-projection.

See vllm/lora/layers/fused_moe.py354-414 for MoE allocation details.
LoRALayerWeights and PackedLoRALayerWeights

LoRALayerWeights (vllm/lora/lora_weights.py) holds the raw lora_a and lora_b tensors loaded from disk for one module in one adapter. PackedLoRALayerWeights packs multiple sublayers (e.g., the gate and up projections of a merged column-parallel linear) together.
When an adapter is activated via BaseLayerWithLoRA.set_lora(index, lora_a, lora_b), the weights are copied into the pre-allocated slot using non-blocking CUDA copies.
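A minimal sketch of this slot-buffer pattern, assuming the shapes described above (this is not vLLM's actual code):

```python
import torch

# Zero-filled stacked buffers, one slot per possible adapter (hypothetical sizes).
max_loras, max_rank, hidden, out = 4, 8, 32, 32
lora_a_stacked = torch.zeros(max_loras, 1, hidden, max_rank)
lora_b_stacked = torch.zeros(max_loras, 1, max_rank, out)


def set_lora(index: int, lora_a: torch.Tensor, lora_b: torch.Tensor) -> None:
    """Copy adapter weights into a pre-allocated slot.

    vLLM issues these as non-blocking CUDA copies; ranks below max_rank
    occupy only a prefix of the buffer, the rest stays zero.
    """
    lora_a_stacked[index, 0, : lora_a.shape[0], : lora_a.shape[1]].copy_(lora_a, non_blocking=True)
    lora_b_stacked[index, 0, : lora_b.shape[0], : lora_b.shape[1]].copy_(lora_b, non_blocking=True)


def reset_lora(index: int) -> None:
    """Deactivate a slot by zeroing it."""
    lora_a_stacked[index].zero_()
    lora_b_stacked[index].zero_()


# Activate a rank-4 adapter in slot 2; slots 0, 1, 3 remain all-zero.
set_lora(2, torch.randn(hidden, 4), torch.randn(4, out))
```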
Sources: vllm/lora/lora_weights.py vllm/lora/layers/fused_moe.py518-588
LoRAModelManager (vllm/lora/model_manager.py) is the central coordinator. It is created once per worker by the create_lora_manager() factory and is responsible for:
- Module replacement: using from_layer() / from_layer_logits_processor() to install LoRA wrappers.
- Slot tracking: lora_index_to_id: list[int | None], mapping each GPU slot (0 to max_loras - 1) to the currently loaded adapter ID.
- Activation: _activate_adapter(id) copies adapter weights into a slot; _deactivate_adapter(id) calls reset_lora() on all layers.
- Mapping updates: set_adapter_mapping(mapping) updates the PunicaWrapperBase with which LoRA each token belongs to.

LRUCacheLoRAModelManager extends LoRAModelManager with an AdapterLRUCache (vllm/lora/model_manager.py48-56) that evicts the least-recently-used adapter when the GPU pool is full.
Sources: vllm/lora/model_manager.py1-100
WorkerLoRAManager (vllm/lora/worker_manager.py) sits between the worker process and the LoRAModelManager. Its responsibilities are:
- create_lora_manager(model, vllm_config) — delegates to the create_lora_manager() factory, stores the resulting manager, and returns the LoRA-ified model.
- _load_adapter(lora_request) — resolves the adapter path via get_adapter_absolute_path() (vllm/lora/utils.py229-281), loads the PEFT config, reads and validates weights, and returns a LoRAModel.
- add_adapter(lora_request) — calls _load_adapter, then passes the result to _adapter_manager.add_adapter().
- remove_adapter(lora_id) — delegates to the model manager.
- set_active_adapters(lora_requests, lora_mapping) — ensures all requested adapters are loaded, then calls _adapter_manager.set_active_adapters(lora_requests, lora_mapping).

LRUCacheWorkerLoRAManager uses LRUCacheLoRAModelManager as its _manager_cls, enabling GPU-slot LRU eviction.
LoRAModelRunnerMixin (vllm/v1/worker/lora_model_runner_mixin.py) is mixed into GPUModelRunner to provide:
- load_lora_model() — creates the manager after model load.
- set_active_loras() — builds LoRAMapping from the InputBatch and calls set_active_adapters.
- add_lora() / remove_lora() / pin_lora() — adapter lifecycle operations.
- maybe_setup_dummy_loras() / maybe_dummy_run_with_lora() — CUDA graph warmup helpers.

Sources: vllm/lora/worker_manager.py vllm/v1/worker/lora_model_runner_mixin.py
Adapter lifecycle flow
Sources: vllm/lora/worker_manager.py87-180 vllm/lora/model_manager.py vllm/v1/worker/lora_model_runner_mixin.py31-90
The Punica wrapper is an abstraction layer between BaseLayerWithLoRA and the actual Triton kernels. Every LoRA-enabled layer holds a reference to the wrapper and calls it during forward().
Sources: vllm/lora/punica_wrapper/punica_base.py vllm/lora/punica_wrapper/punica_gpu.py
LoRAMapping and Token-to-Slot Mapping

Before each forward pass, LoRAModelManager.set_adapter_mapping() is called with a LoRAMapping object, which carries:
- index_mapping: one integer per token — which LoRA ID applies to that token (0 = no LoRA).
- prompt_mapping: one integer per sequence — for logits processor (sampler) indexing.
- is_prefill: flag indicating prefill vs. decode phase.

PunicaWrapperBase._update_base_metadata() calls convert_mapping() (vllm/lora/punica_wrapper/utils.py), which converts these IDs to slot indices stored in token_lora_indices — a GPU tensor consulted by the Triton kernels.
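The ID-to-slot translation can be sketched as follows (illustrative; the real sentinel value and tensor layout are implementation details):

```python
# Sketch of convert_mapping(): per-token adapter IDs -> GPU slot indices.
lora_index_to_id = [7, 9, None, None]   # slot -> adapter ID (example manager state)
index_mapping = [7, 7, 9, 0, 7]         # per-token adapter IDs; 0 = no LoRA

NO_LORA = -1  # hypothetical sentinel for "no adapter"
token_lora_indices = [
    lora_index_to_id.index(tid) if tid != 0 else NO_LORA
    for tid in index_mapping
]
print(token_lora_indices)  # [0, 0, 1, -1, 0]
```

Kernels never see adapter IDs, only slot indices, so activating a different adapter in the same slot requires no kernel-side changes.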
PunicaWrapperGPU additionally calls LoRAKernelMeta.prepare_tensors(token_lora_indices) to compute:
- token_indices_sorted_by_lora_ids — tokens grouped by LoRA
- num_tokens_per_lora — how many tokens each LoRA processes
- lora_token_start_loc — cumulative offsets
- lora_ids — active LoRA IDs

These tensors drive the Triton kernel grid.
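A pure-Python sketch of this grouping (the real version produces GPU tensors):

```python
# Sketch of the LoRAKernelMeta grouping: derive per-LoRA token groups
# from the per-token slot indices (-1 = no LoRA).
token_lora_indices = [0, 0, 1, -1, 0]

# Pair each active token with its slot, then sort so tokens of the same
# LoRA are contiguous.
active = [(slot, tok) for tok, slot in enumerate(token_lora_indices) if slot >= 0]
active.sort()

token_indices_sorted_by_lora_ids = [tok for _, tok in active]
lora_ids = sorted({slot for slot, _ in active})
num_tokens_per_lora = [sum(1 for s, _ in active if s == lid) for lid in lora_ids]

# Cumulative start offset of each LoRA's token group.
lora_token_start_loc = [0]
for n in num_tokens_per_lora:
    lora_token_start_loc.append(lora_token_start_loc[-1] + n)
```

Each Triton program instance then handles one (LoRA, token-block) pair, reading its token range from lora_token_start_loc.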
Sources: vllm/lora/punica_wrapper/punica_gpu.py75-89 vllm/lora/punica_wrapper/punica_base.py124-200
The two core LoRA operations are implemented as Triton kernels in vllm/lora/ops/triton_ops/.
lora_shrink (down-projection)

File: vllm/lora/ops/triton_ops/lora_shrink_op.py
output[slice, token, :rank] += (input[token, :hidden] @ lora_a[slot, :hidden, :rank]) * scale
The kernel _lora_shrink_kernel is launched with a 3-axis grid: (M_blocks × N_blocks × SPLIT_K, num_slices, num_active_loras). Each program block looks up its LoRA ID and token offsets via lora_ids[lora_idx] and token_indices_sorted_by_lora_ids.
lora_expand (up-projection)

File: vllm/lora/ops/triton_ops/lora_expand_op.py
output[token, offset:offset+slice] += intermediate[slice, token, :rank] @ lora_b[slot, :rank, :output]
The kernel _lora_expand_kernel is similarly structured. Both kernels support TMA (Tensor Memory Accelerator) on SM90+ GPUs via supports_tma() and Programmatic Dependent Launch (PDL) via supports_pdl() in vllm/lora/ops/triton_ops/utils.py307-324.
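A dense PyTorch reference of the shrink/expand pair may help fix the semantics (a readability sketch; the Triton kernels instead iterate over the grouped token layout described above, and the slice/offset machinery is omitted here):

```python
import torch

# Hypothetical sizes; -1 in token_lora_indices means "no LoRA".
num_slots, tokens, hidden, rank, out = 4, 5, 32, 8, 32
scale = 1.0
x = torch.randn(tokens, hidden)
lora_a = torch.randn(num_slots, hidden, rank)   # down-projection weights per slot
lora_b = torch.randn(num_slots, rank, out)      # up-projection weights per slot
token_lora_indices = torch.tensor([0, 0, 1, -1, 0])

y = torch.zeros(tokens, out)
for t in range(tokens):
    slot = int(token_lora_indices[t])
    if slot < 0:
        continue  # token has no adapter; base-model output is untouched
    inter = (x[t] @ lora_a[slot]) * scale   # lora_shrink: (hidden,) @ (hidden, rank)
    y[t] += inter @ lora_b[slot]            # lora_expand: (rank,) @ (rank, out)
```

The kernels fuse this over all tokens of each active LoRA at once, so cost scales with the number of tokens, not with max_loras.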
Optimal block sizes are resolved at runtime from:
- A tuned-config folder (environment variable VLLM_TUNED_CONFIG_FOLDER) containing JSON files keyed by (max_loras, num_slices, batch, hidden_size, rank).
- Fallback heuristics in get_lora_op_configs() (vllm/lora/ops/triton_ops/utils.py186-304).

Sources: vllm/lora/ops/triton_ops/lora_shrink_op.py vllm/lora/ops/triton_ops/lora_expand_op.py vllm/lora/ops/triton_ops/utils.py
FusedMoEWithLoRA

Standard FusedMoE layers are replaced by FusedMoEWithLoRA (vllm/lora/layers/fused_moe.py44-619). Expert parallelism (EP) is not supported with LoRA (assert not self.base_layer.use_ep).
The class stores separate A/B weight stacks for the w1/w3 (gate+up) and w2 (down) expert matrices:
- w13_lora_a_stacked: (max_loras, num_experts, max_lora_rank, hidden_size)
- w13_lora_b_stacked: (max_loras, num_experts, intermediate_size, max_lora_rank)
- w2_lora_a_stacked: (max_loras, num_experts, max_lora_rank, intermediate_size)
- w2_lora_b_stacked: (max_loras, num_experts, hidden_size, max_lora_rank)
adapter_enabled: an integer tensor of shape (max_loras + 1,) that flags which slots are active. The MoE kernel checks this tensor to skip inactive adapters.
_inject_lora_into_fused_moe() (vllm/lora/layers/fused_moe.py130-352) wraps three hooks inside the underlying FusedMoEModularKernel:
| Hook | Purpose |
|---|---|
fwd_decorator on m_fused_moe_fn.forward | Captures hidden_states, topk_ids, topk_weights into a shared moe_state_dict |
act_decorator on fused_experts.activation | After the first matmul, calls punica_wrapper.add_lora_fused_moe() for the w13 path |
moe_sum_decorator on fused_experts.moe_sum | After the second matmul, calls punica_wrapper.add_lora_fused_moe() for the w2 path |
MoE LoRA forward flow
Sources: vllm/lora/layers/fused_moe.py130-352
FusedMoE3DWithLoRA

FusedMoE3DWithLoRA (vllm/lora/layers/fused_moe.py622-781) handles models where the w1 and w3 weights are combined into a single 3D tensor rather than stored as two separate tensors. It uses _w13_slices = 1 and packs both gate and up projections into a single w13_lora_b_stacked with a doubled intermediate dimension. It also contains special-case handling for GPT-OSS's interleaved weight order.
The can_replace_layer() classmethod distinguishes the two:
- FusedMoEWithLoRA.can_replace_layer() returns True when len(packed_modules_list) == 2.
- FusedMoE3DWithLoRA.can_replace_layer() returns True when len(packed_modules_list) == 1.

Sources: vllm/lora/layers/fused_moe.py608-619 vllm/lora/layers/fused_moe.py771-781
The fused MoE LoRA Triton kernel lives in vllm/lora/ops/triton_ops/fused_moe_lora_op.py. It is invoked through PunicaWrapperGPU.add_lora_fused_moe() (vllm/lora/punica_wrapper/punica_gpu.py391-459) via:
- _fused_moe_lora_shrink — computes hidden @ lora_a for each (token, expert, LoRA) triple.
- _fused_moe_lora_expand — computes intermediate @ lora_b and accumulates into the output.

Block assignment for tokens-to-experts is computed by moe_lora_align_block_size() (vllm/lora/punica_wrapper/punica_gpu.py325-389), which calls the custom CUDA op ops.moe_lora_align_block_size to produce sorted_token_ids, expert_ids, and num_tokens_post_padded tensors.
Sources: vllm/lora/ops/triton_ops/fused_moe_lora_op.py vllm/lora/punica_wrapper/punica_gpu.py391-459
get_adapter_absolute_path(lora_path) (vllm/lora/utils.py229-281) resolves adapter paths in order:
- Expand a leading ~ to the user's home directory.
- If the expanded path exists locally, return os.path.abspath(...).
- Otherwise, download the adapter from the HuggingFace Hub (or from ModelScope when VLLM_USE_MODELSCOPE is set).

When tensor_parallel_size > 1, BaseLayerWithLoRA subclasses with Sharded in their name apply additional slicing of LoRA weights along the rank or output dimension. For example, ColumnParallelLinearWithShardedLoRA shards lora_b across TP ranks, requiring an all_reduce after the expand step.
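The path-resolution order can be sketched as follows (illustrative; vLLM's version downloads via huggingface_hub rather than raising, and the hub branch here is a placeholder):

```python
import os


def resolve_adapter_path(lora_path: str) -> str:
    """Sketch of get_adapter_absolute_path-style resolution (not vLLM's code)."""
    expanded = os.path.expanduser(lora_path)      # step 1: expand ~
    if os.path.exists(expanded):
        return os.path.abspath(expanded)          # step 2: existing local path wins
    # step 3: treat it as a repo ID and download from the HuggingFace Hub
    # (or ModelScope when VLLM_USE_MODELSCOPE is set); placeholder here.
    raise FileNotFoundError(f"would download {lora_path!r} from the Hub here")


local = resolve_adapter_path(os.getcwd())  # an existing directory resolves locally
```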
For MoE layers with fully_sharded_loras=True:
- w13_lora_a_stacked has its rank dimension sharded: max_lora_rank // tp_size per rank.
- w13_lora_b_stacked has its output dimension sharded: intermediate_size // tp_size per rank.
- w2_lora_a_stacked has its input dimension sharded: intermediate_size // tp_size per rank.
- w2_lora_b_stacked has its output dimension sharded: hidden_size // tp_size per rank.

The _fused_moe_lora kernel handles the collective after shrink via tensor_model_parallel_all_reduce or tensor_model_parallel_all_gather, depending on rank configuration (vllm/lora/ops/triton_ops/fused_moe_lora_op.py810-821).
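The per-rank buffer shapes implied by these rules can be computed directly from the unsharded MoE shapes given earlier; all concrete sizes below are hypothetical:

```python
# Per-TP-rank MoE LoRA buffer shapes with fully_sharded_loras=True
# (hypothetical sizes; follows the sharding rules listed above).
max_loras, num_experts = 4, 8
max_lora_rank, hidden_size, intermediate_size, tp_size = 16, 1024, 4096, 4

w13_lora_a = (max_loras, num_experts, max_lora_rank // tp_size, hidden_size)       # rank sharded
w13_lora_b = (max_loras, num_experts, intermediate_size // tp_size, max_lora_rank)  # output sharded
w2_lora_a = (max_loras, num_experts, max_lora_rank, intermediate_size // tp_size)   # input sharded
w2_lora_b = (max_loras, num_experts, hidden_size // tp_size, max_lora_rank)         # output sharded
```

Because each rank holds only a slice of the rank or output dimension, partial results must be combined with a collective after the shrink step, as noted above.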
Sources: vllm/lora/layers/fused_moe.py464-516 vllm/lora/ops/triton_ops/fused_moe_lora_op.py806-821
The HTTP server exposes /v1/load_lora_adapter and /v1/unload_lora_adapter endpoints (enabled when VLLM_ALLOW_RUNTIME_LORA_UPDATING=True). These endpoints call add_lora() and remove_lora() on the engine, which propagates down to LoRAModelRunnerMixin.add_lora() (vllm/v1/worker/lora_model_runner_mixin.py271-278).
For the HTTP server side, see OpenAI-Compatible API Server.
Sources: vllm/v1/worker/lora_model_runner_mixin.py271-290 docs/features/lora.md105-155