This page covers how vLLM loads, stores, and applies LoRA adapters at inference time. It spans the full stack from the user-facing LoRARequest API, through layer replacement and weight storage, to the Triton kernels that apply adapter weights during the forward pass.
For information about model weight loading more generally, see Weight Loading and Model Initialization. For details on the OpenAI-compatible HTTP server endpoints that handle LoRA serving (including the /v1/load_lora_adapter route), see OpenAI-Compatible API Server. For configuration objects such as VllmConfig, see Configuration Objects.
vLLM implements multi-LoRA serving: a single server process can serve requests that target different LoRA adapters simultaneously, all atop one shared base model. The implementation is based on Punica (Chen et al., 2023), using batched grouped-GEMM Triton kernels to apply per-token LoRA computations without materializing separate per-adapter weights in the model graph.
Key design properties:
- Static slot allocation: weight tensors shaped (max_loras, ...) are allocated for every LoRA-capable layer. Active adapters are copied into slots within these tensors.
- Layer wrapping: each eligible layer is replaced with a LoRA-aware wrapper (a BaseLayerWithLoRA subclass) that knows how to invoke the Punica kernels.
- MoE support: fused Mixture-of-Experts layers (FusedMoE) have dedicated LoRA wrappers that apply adapter weights per-expert, per-token.

LoRARequest

LoRARequest (vllm/lora/request.py) is the user-facing handle for a single adapter. It is attached to a generation request to indicate which adapter to use.
| Field | Type | Description |
|---|---|---|
lora_name | str | Human-readable name |
lora_int_id | int | Globally unique integer ID |
lora_path | str | Local path or HuggingFace repo ID |
base_model_name | str | None | Optional base model assertion |
tensorizer_config_dict | dict | None | For tensorized adapters |
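As an illustration, the fields above can be mirrored in a minimal dataclass sketch. This is not vLLM's class; in real code you would import LoRARequest from vllm.lora.request and attach it to a generate call:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoRARequestSketch:
    """Illustrative stand-in for vllm.lora.request.LoRARequest."""
    lora_name: str                 # human-readable name
    lora_int_id: int               # globally unique integer ID (non-zero; 0 means "no LoRA" in token mappings)
    lora_path: str                 # local path or HuggingFace repo ID
    base_model_name: Optional[str] = None        # optional base model assertion
    tensorizer_config_dict: Optional[dict] = None  # for tensorized adapters


# In real code the handle is passed per request, e.g.:
#   llm.generate(prompts, sampling_params,
#                lora_request=LoRARequest("sql_adapter", 1, "org/sql-lora"))
req = LoRARequestSketch("sql_adapter", 1, "org/sql-lora")
```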
LoRAConfig

LoRAConfig (vllm/config/lora.py) is set at engine startup and governs GPU memory allocation and kernel behavior.
| Field | Default | Description |
|---|---|---|
max_loras | — | Maximum concurrent adapters in GPU memory |
max_lora_rank | — | Maximum rank across all adapters |
lora_dtype | None | Weight dtype (float16, bfloat16) |
max_cpu_loras | None | CPU cache size (for LRU offload) |
fully_sharded_loras | False | Shard LoRA weights across TP ranks |
specialize_active_lora | False | Specialize CUDA graphs per LoRA count |
The max_loras field directly determines how many GPU weight slots are allocated at startup. Every LoRA-capable layer allocates tensors shaped (max_loras, ...) at initialization time.
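A back-of-envelope calculation shows how max_loras and max_lora_rank drive this allocation, using the lora_a_stacked / lora_b_stacked shapes described later on this page. All concrete sizes below are hypothetical:

```python
# Slot memory for one LoRA-capable linear layer (hypothetical sizes).
max_loras, max_lora_rank = 4, 16
hidden_size, output_size = 4096, 4096
bytes_per_elem = 2  # float16 / bfloat16

# Stacked buffers allocated once at startup for this layer:
lora_a_elems = max_loras * 1 * hidden_size * max_lora_rank   # (max_loras, 1, hidden, rank)
lora_b_elems = max_loras * 1 * max_lora_rank * output_size   # (max_loras, 1, rank, output)

total_mib = (lora_a_elems + lora_b_elems) * bytes_per_elem / 2**20
print(f"{total_mib:.1f} MiB per layer")  # 1.0 MiB per layer
```

The buffers exist whether or not any adapter is loaded, which is why max_loras is a hard memory trade-off rather than a soft limit.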
Sources: vllm/config/lora.py vllm/lora/request.py
Component overview: LoRA Adapter Management
Sources: vllm/lora/worker_manager.py vllm/lora/model_manager.py vllm/lora/lora_model.py vllm/lora/punica_wrapper/punica_gpu.py
When LoRA is enabled, the LoRAModelManager walks the model's modules and replaces each eligible layer with a LoRA-aware wrapper. This happens in LoRAModelManager._replace_modules() (vllm/lora/model_manager.py) using two factory functions from vllm/lora/utils.py:
- from_layer(layer, max_loras, lora_config, packed_modules_list, model_config) — iterates _all_lora_classes, calls can_replace_layer() on each, returns the first match, and calls create_lora_weights() on it.
- from_layer_logits_processor(layer, lm_head, ...) — special-cases LogitsProcessor → LogitsProcessorWithLoRA.

Layer class registry (_all_lora_classes)
| Base Layer | LoRA Wrapper |
|---|---|
VocabParallelEmbedding | VocabParallelEmbeddingWithLoRA |
ColumnParallelLinear | ColumnParallelLinearWithLoRA / WithShardedLoRA |
MergedColumnParallelLinear | MergedColumnParallelLinearWithLoRA / WithShardedLoRA / VariableSlice |
QKVParallelLinear | QKVParallelLinearWithLoRA / WithShardedLoRA |
MergedQKVParallelLinear | MergedQKVParallelLinearWithLoRA / WithShardedLoRA |
RowParallelLinear | RowParallelLinearWithLoRA / WithShardedLoRA |
ReplicatedLinear | ReplicatedLinearWithLoRA |
LogitsProcessor | LogitsProcessorWithLoRA |
FusedMoE (packed_modules len=2) | FusedMoEWithLoRA |
FusedMoE (packed_modules len=1) | FusedMoE3DWithLoRA |
Sources: vllm/lora/utils.py75-92 vllm/lora/utils.py103-121
How BaseLayerWithLoRA works
Sources: vllm/lora/layers/fused_moe.py44-59 vllm/lora/layers/logits_processor.py20-50
Each BaseLayerWithLoRA pre-allocates GPU tensors in its create_lora_weights() method. The slot dimension (axis 0) has size max_loras, so all LoRA adapters share one contiguous buffer:
- lora_a_stacked: shape (max_loras, 1, hidden_size, max_lora_rank) — down-projection
- lora_b_stacked: shape (max_loras, 1, max_lora_rank, output_size) — up-projection

The 1 in the second dimension is a padding dimension for kernel compatibility. The weight tensors are initialized to zeros; non-zero entries indicate an active adapter at that slot.
For MoE layers, the tensors add an expert dimension:
- w13_lora_a_stacked: shape (max_loras, num_experts, max_lora_rank, hidden_size)
- w13_lora_b_stacked: shape (max_loras, num_experts, intermediate_size, max_lora_rank)
- w2_lora_a_stacked, w2_lora_b_stacked: similarly shaped for the down-projection.

See vllm/lora/layers/fused_moe.py354-414 for MoE allocation details.
LoRALayerWeights and PackedLoRALayerWeights

LoRALayerWeights (vllm/lora/lora_weights.py) holds the raw lora_a and lora_b tensors loaded from disk for one module in one adapter. PackedLoRALayerWeights packs multiple sublayers (e.g., the gate and up projections of a merged column-parallel linear) together.
When an adapter is activated via BaseLayerWithLoRA.set_lora(index, lora_a, lora_b), the weights are copied into the pre-allocated slot using non-blocking CUDA copies.
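A minimal sketch of this slot-buffer pattern, assuming the shapes described above (this is not vLLM's actual code):

```python
import torch

# Zero-filled stacked buffers, one slot per possible adapter (hypothetical sizes).
max_loras, max_rank, hidden, out = 4, 8, 32, 32
lora_a_stacked = torch.zeros(max_loras, 1, hidden, max_rank)
lora_b_stacked = torch.zeros(max_loras, 1, max_rank, out)


def set_lora(index: int, lora_a: torch.Tensor, lora_b: torch.Tensor) -> None:
    """Copy adapter weights into a pre-allocated slot.

    vLLM issues these as non-blocking CUDA copies; ranks below max_rank
    occupy only a prefix of the buffer, the rest stays zero.
    """
    lora_a_stacked[index, 0, : lora_a.shape[0], : lora_a.shape[1]].copy_(lora_a, non_blocking=True)
    lora_b_stacked[index, 0, : lora_b.shape[0], : lora_b.shape[1]].copy_(lora_b, non_blocking=True)


def reset_lora(index: int) -> None:
    """Deactivate a slot by zeroing it."""
    lora_a_stacked[index].zero_()
    lora_b_stacked[index].zero_()


# Activate a rank-4 adapter in slot 2; slots 0, 1, 3 remain all-zero.
set_lora(2, torch.randn(hidden, 4), torch.randn(4, out))
```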
Sources: vllm/lora/lora_weights.py vllm/lora/layers/fused_moe.py518-588
LoRAModelManager (vllm/lora/model_manager.py) is the central coordinator. It is created once per worker by the create_lora_manager() factory and is responsible for:
- Module replacement: using from_layer() / from_layer_logits_processor() to install LoRA wrappers.
- Slot tracking: lora_index_to_id: list[int | None], mapping each GPU slot (0 to max_loras - 1) to the currently loaded adapter ID.
- Activation: _activate_adapter(id) copies adapter weights into a slot; _deactivate_adapter(id) calls reset_lora() on all layers.
- Mapping updates: set_adapter_mapping(mapping) updates the PunicaWrapperBase with which LoRA each token belongs to.

LRUCacheLoRAModelManager extends LoRAModelManager with an AdapterLRUCache (vllm/lora/model_manager.py48-56) that evicts the least-recently-used adapter when the GPU pool is full.
Sources: vllm/lora/model_manager.py1-100
WorkerLoRAManager (vllm/lora/worker_manager.py) sits between the worker process and the LoRAModelManager. Its responsibilities are:
- create_lora_manager(model, vllm_config) — delegates to the create_lora_manager() factory, stores the resulting manager, and returns the LoRA-ified model.
- _load_adapter(lora_request) — resolves the adapter path via get_adapter_absolute_path() (vllm/lora/utils.py229-281), loads the PEFT config, reads and validates weights, and returns a LoRAModel.
- add_adapter(lora_request) — calls _load_adapter, then passes the result to _adapter_manager.add_adapter().
- remove_adapter(lora_id) — delegates to the model manager.
- set_active_adapters(lora_requests, lora_mapping) — ensures all requested adapters are loaded, then calls _adapter_manager.set_active_adapters(lora_requests, lora_mapping).

LRUCacheWorkerLoRAManager uses LRUCacheLoRAModelManager as its _manager_cls, enabling GPU-slot LRU eviction.
LoRAModelRunnerMixin (vllm/v1/worker/lora_model_runner_mixin.py) is mixed into GPUModelRunner to provide:
- load_lora_model() — creates the manager after model load.
- set_active_loras() — builds LoRAMapping from the InputBatch and calls set_active_adapters.
- add_lora() / remove_lora() / pin_lora() — adapter lifecycle operations.
- maybe_setup_dummy_loras() / maybe_dummy_run_with_lora() — CUDA graph warmup helpers.

Sources: vllm/lora/worker_manager.py vllm/v1/worker/lora_model_runner_mixin.py
Adapter lifecycle flow
Sources: vllm/lora/worker_manager.py87-180 vllm/lora/model_manager.py vllm/v1/worker/lora_model_runner_mixin.py31-90
The Punica wrapper is an abstraction layer between BaseLayerWithLoRA and the actual Triton kernels. Every LoRA-enabled layer holds a reference to the wrapper and calls it during forward().
Sources: vllm/lora/punica_wrapper/punica_base.py vllm/lora/punica_wrapper/punica_gpu.py
LoRAMapping and Token-to-Slot Mapping

Before each forward pass, LoRAModelManager.set_adapter_mapping() is called with a LoRAMapping object, which carries:
- index_mapping: one integer per token — which LoRA ID applies to that token (0 = no LoRA).
- prompt_mapping: one integer per sequence — for logits processor (sampler) indexing.
- is_prefill: flag indicating prefill vs. decode phase.

PunicaWrapperBase._update_base_metadata() calls convert_mapping() (vllm/lora/punica_wrapper/utils.py), which converts these IDs to slot indices stored in token_lora_indices — a GPU tensor consulted by the Triton kernels.
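The ID-to-slot translation can be sketched as follows (illustrative; the real sentinel value and tensor layout are implementation details):

```python
# Sketch of convert_mapping(): per-token adapter IDs -> GPU slot indices.
lora_index_to_id = [7, 9, None, None]   # slot -> adapter ID (example manager state)
index_mapping = [7, 7, 9, 0, 7]         # per-token adapter IDs; 0 = no LoRA

NO_LORA = -1  # hypothetical sentinel for "no adapter"
token_lora_indices = [
    lora_index_to_id.index(tid) if tid != 0 else NO_LORA
    for tid in index_mapping
]
print(token_lora_indices)  # [0, 0, 1, -1, 0]
```

Kernels never see adapter IDs, only slot indices, so activating a different adapter in the same slot requires no kernel-side changes.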
PunicaWrapperGPU additionally calls LoRAKernelMeta.prepare_tensors(token_lora_indices) to compute:
- token_indices_sorted_by_lora_ids — tokens grouped by LoRA
- num_tokens_per_lora — how many tokens each LoRA processes
- lora_token_start_loc — cumulative offsets
- lora_ids — active LoRA IDs

These tensors drive the Triton kernel grid.
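A pure-Python sketch of this grouping (the real version produces GPU tensors):

```python
# Sketch of the LoRAKernelMeta grouping: derive per-LoRA token groups
# from the per-token slot indices (-1 = no LoRA).
token_lora_indices = [0, 0, 1, -1, 0]

# Pair each active token with its slot, then sort so tokens of the same
# LoRA are contiguous.
active = [(slot, tok) for tok, slot in enumerate(token_lora_indices) if slot >= 0]
active.sort()

token_indices_sorted_by_lora_ids = [tok for _, tok in active]
lora_ids = sorted({slot for slot, _ in active})
num_tokens_per_lora = [sum(1 for s, _ in active if s == lid) for lid in lora_ids]

# Cumulative start offset of each LoRA's token group.
lora_token_start_loc = [0]
for n in num_tokens_per_lora:
    lora_token_start_loc.append(lora_token_start_loc[-1] + n)
```

Each Triton program instance then handles one (LoRA, token-block) pair, reading its token range from lora_token_start_loc.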
Sources: vllm/lora/punica_wrapper/punica_gpu.py75-89 vllm/lora/punica_wrapper/punica_base.py124-200
The two core LoRA operations are implemented as Triton kernels in vllm/lora/ops/triton_ops/.
lora_shrink (down-projection)

File: vllm/lora/ops/triton_ops/lora_shrink_op.py
output[slice, token, :rank] += (input[token, :hidden] @ lora_a[slot, :hidden, :rank]) * scale
The kernel _lora_shrink_kernel is launched with a 3-axis grid: (M_blocks × N_blocks × SPLIT_K, num_slices, num_active_loras). Each program block looks up its LoRA ID and token offsets via lora_ids[lora_idx] and token_indices_sorted_by_lora_ids.
lora_expand (up-projection)

File: vllm/lora/ops/triton_ops/lora_expand_op.py
output[token, offset:offset+slice] += intermediate[slice, token, :rank] @ lora_b[slot, :rank, :output]
The kernel _lora_expand_kernel is similarly structured. Both kernels support TMA (Tensor Memory Accelerator) on SM90+ GPUs via supports_tma() and Programmatic Dependent Launch (PDL) via supports_pdl() in vllm/lora/ops/triton_ops/utils.py307-324.
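A dense PyTorch reference of the shrink/expand pair may help fix the semantics (a readability sketch; the Triton kernels instead iterate over the grouped token layout described above, and the slice/offset machinery is omitted here):

```python
import torch

# Hypothetical sizes; -1 in token_lora_indices means "no LoRA".
num_slots, tokens, hidden, rank, out = 4, 5, 32, 8, 32
scale = 1.0
x = torch.randn(tokens, hidden)
lora_a = torch.randn(num_slots, hidden, rank)   # down-projection weights per slot
lora_b = torch.randn(num_slots, rank, out)      # up-projection weights per slot
token_lora_indices = torch.tensor([0, 0, 1, -1, 0])

y = torch.zeros(tokens, out)
for t in range(tokens):
    slot = int(token_lora_indices[t])
    if slot < 0:
        continue  # token has no adapter; base-model output is untouched
    inter = (x[t] @ lora_a[slot]) * scale   # lora_shrink: (hidden,) @ (hidden, rank)
    y[t] += inter @ lora_b[slot]            # lora_expand: (rank,) @ (rank, out)
```

The kernels fuse this over all tokens of each active LoRA at once, so cost scales with the number of tokens, not with max_loras.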
Optimal block sizes are resolved at runtime from:
- A tuned-config folder (environment variable VLLM_TUNED_CONFIG_FOLDER) containing JSON files keyed by (max_loras, num_slices, batch, hidden_size, rank).
- Fallback heuristics in get_lora_op_configs() (vllm/lora/ops/triton_ops/utils.py186-304).

Sources: vllm/lora/ops/triton_ops/lora_shrink_op.py vllm/lora/ops/triton_ops/lora_expand_op.py vllm/lora/ops/triton_ops/utils.py
FusedMoEWithLoRA

Standard FusedMoE layers are replaced by FusedMoEWithLoRA (vllm/lora/layers/fused_moe.py44-619). Expert parallelism (EP) is not supported with LoRA (assert not self.base_layer.use_ep).
The class stores separate A/B weight stacks for the w1/w3 (gate+up) and w2 (down) expert matrices:
- w13_lora_a_stacked: (max_loras, num_experts, max_lora_rank, hidden_size)
- w13_lora_b_stacked: (max_loras, num_experts, intermediate_size, max_lora_rank)
- w2_lora_a_stacked: (max_loras, num_experts, max_lora_rank, intermediate_size)
- w2_lora_b_stacked: (max_loras, num_experts, hidden_size, max_lora_rank)
adapter_enabled: an integer tensor of shape (max_loras + 1,) that flags which slots are active. The MoE kernel checks this tensor to skip inactive adapters.
_inject_lora_into_fused_moe() (vllm/lora/layers/fused_moe.py130-352) wraps three hooks inside the underlying FusedMoEModularKernel:
| Hook | Purpose |
|---|---|
fwd_decorator on m_fused_moe_fn.forward | Captures hidden_states, topk_ids, topk_weights into a shared moe_state_dict |
act_decorator on fused_experts.activation | After the first matmul, calls punica_wrapper.add_lora_fused_moe() for the w13 path |
moe_sum_decorator on fused_experts.moe_sum | After the second matmul, calls punica_wrapper.add_lora_fused_moe() for the w2 path |
MoE LoRA forward flow
Sources: vllm/lora/layers/fused_moe.py130-352
FusedMoE3DWithLoRA

FusedMoE3DWithLoRA (vllm/lora/layers/fused_moe.py622-781) handles models where the w1 and w3 weights are combined into a single 3D tensor rather than stored as two separate tensors. It uses _w13_slices = 1 and packs both gate and up projections into a single w13_lora_b_stacked with a doubled intermediate dimension. It also contains special-case handling for GPT-OSS's interleaved weight order.
The can_replace_layer() classmethod distinguishes the two:
- FusedMoEWithLoRA.can_replace_layer() returns True when len(packed_modules_list) == 2.
- FusedMoE3DWithLoRA.can_replace_layer() returns True when len(packed_modules_list) == 1.

Sources: vllm/lora/layers/fused_moe.py608-619 vllm/lora/layers/fused_moe.py771-781
The fused MoE LoRA Triton kernel lives in vllm/lora/ops/triton_ops/fused_moe_lora_op.py. It is invoked through PunicaWrapperGPU.add_lora_fused_moe() (vllm/lora/punica_wrapper/punica_gpu.py391-459) via:
- _fused_moe_lora_shrink — computes hidden @ lora_a for each (token, expert, LoRA) triple.
- _fused_moe_lora_expand — computes intermediate @ lora_b and accumulates into the output.

Block assignment for tokens-to-experts is computed by moe_lora_align_block_size() (vllm/lora/punica_wrapper/punica_gpu.py325-389), which calls the custom CUDA op ops.moe_lora_align_block_size to produce sorted_token_ids, expert_ids, and num_tokens_post_padded tensors.
Sources: vllm/lora/ops/triton_ops/fused_moe_lora_op.py vllm/lora/punica_wrapper/punica_gpu.py391-459
get_adapter_absolute_path(lora_path) (vllm/lora/utils.py229-281) resolves adapter paths in order:
- Expand a leading ~ to the user's home directory.
- If the expanded path exists locally, return os.path.abspath(...).
- Otherwise, download the adapter from the HuggingFace Hub (or from ModelScope when VLLM_USE_MODELSCOPE is set).

When tensor_parallel_size > 1, BaseLayerWithLoRA subclasses with Sharded in their name apply additional slicing of LoRA weights along the rank or output dimension. For example, ColumnParallelLinearWithShardedLoRA shards lora_b across TP ranks, requiring an all_reduce after the expand step.
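The path-resolution order can be sketched as follows (illustrative; vLLM's version downloads via huggingface_hub rather than raising, and the hub branch here is a placeholder):

```python
import os


def resolve_adapter_path(lora_path: str) -> str:
    """Sketch of get_adapter_absolute_path-style resolution (not vLLM's code)."""
    expanded = os.path.expanduser(lora_path)      # step 1: expand ~
    if os.path.exists(expanded):
        return os.path.abspath(expanded)          # step 2: existing local path wins
    # step 3: treat it as a repo ID and download from the HuggingFace Hub
    # (or ModelScope when VLLM_USE_MODELSCOPE is set); placeholder here.
    raise FileNotFoundError(f"would download {lora_path!r} from the Hub here")


local = resolve_adapter_path(os.getcwd())  # an existing directory resolves locally
```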
For MoE layers with fully_sharded_loras=True:
- w13_lora_a_stacked has its rank dimension sharded: max_lora_rank // tp_size per rank.
- w13_lora_b_stacked has its output dimension sharded: intermediate_size // tp_size per rank.
- w2_lora_a_stacked has its input dimension sharded: intermediate_size // tp_size per rank.
- w2_lora_b_stacked has its output dimension sharded: hidden_size // tp_size per rank.

The _fused_moe_lora kernel handles the collective after shrink via tensor_model_parallel_all_reduce or tensor_model_parallel_all_gather, depending on rank configuration (vllm/lora/ops/triton_ops/fused_moe_lora_op.py810-821).
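The per-rank buffer shapes implied by these rules can be computed directly from the unsharded MoE shapes given earlier; all concrete sizes below are hypothetical:

```python
# Per-TP-rank MoE LoRA buffer shapes with fully_sharded_loras=True
# (hypothetical sizes; follows the sharding rules listed above).
max_loras, num_experts = 4, 8
max_lora_rank, hidden_size, intermediate_size, tp_size = 16, 1024, 4096, 4

w13_lora_a = (max_loras, num_experts, max_lora_rank // tp_size, hidden_size)       # rank sharded
w13_lora_b = (max_loras, num_experts, intermediate_size // tp_size, max_lora_rank)  # output sharded
w2_lora_a = (max_loras, num_experts, max_lora_rank, intermediate_size // tp_size)   # input sharded
w2_lora_b = (max_loras, num_experts, hidden_size // tp_size, max_lora_rank)         # output sharded
```

Because each rank holds only a slice of the rank or output dimension, partial results must be combined with a collective after the shrink step, as noted above.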
Sources: vllm/lora/layers/fused_moe.py464-516 vllm/lora/ops/triton_ops/fused_moe_lora_op.py806-821
The HTTP server exposes /v1/load_lora_adapter and /v1/unload_lora_adapter endpoints (enabled when VLLM_ALLOW_RUNTIME_LORA_UPDATING=True). These endpoints call add_lora() and remove_lora() on the engine, which propagates down to LoRAModelRunnerMixin.add_lora() (vllm/v1/worker/lora_model_runner_mixin.py271-278).
For the HTTP server side, see OpenAI-Compatible API Server.
Sources: vllm/v1/worker/lora_model_runner_mixin.py271-290 docs/features/lora.md105-155