This page covers vLLM's quantization infrastructure and Mixture-of-Experts (MoE) kernel system. It explains the quantization method registry, the FP8 linear and MoE pipelines, the modular MoE kernel abstraction, and how backend selection is performed at runtime.
For the linear layer and normalization implementations that host quantized weights, see Linear Layers and Normalization. For distributed expert parallelism configuration, see Parallelism Strategies. For the general attention backend system, see Attention Backends.
Every quantization scheme implements the QuantizationConfig abstract base class (vllm/model_executor/layers/quantization/base_config.py) and is registered in a central registry. When a model is loaded, the config's get_quant_method is called per layer to return a QuantizeMethodBase that knows how to create_weights, process_weights_after_loading, and apply the kernel.
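The registry-plus-per-layer dispatch can be sketched as follows. The names below (`_REGISTRY`, `register_quant_config`, `DemoFp8Config`, `UnquantizedMethod`) are simplified stand-ins for illustration, not vLLM's actual classes or signatures:

```python
from __future__ import annotations
from abc import ABC, abstractmethod

# Simplified stand-ins illustrating the registry + per-layer dispatch
# pattern; names and signatures here are NOT vLLM's actual API.
_REGISTRY: dict[str, type[QuantizationConfig]] = {}

def register_quant_config(name: str):
    def wrap(cls: type[QuantizationConfig]):
        _REGISTRY[name] = cls
        return cls
    return wrap

class QuantizeMethodBase(ABC):
    @abstractmethod
    def create_weights(self, layer: dict) -> None: ...
    @abstractmethod
    def process_weights_after_loading(self, layer: dict) -> None: ...
    @abstractmethod
    def apply(self, layer: dict, x: list) -> list: ...

class QuantizationConfig(ABC):
    @abstractmethod
    def get_quant_method(self, layer: dict, prefix: str) -> QuantizeMethodBase | None: ...

class UnquantizedMethod(QuantizeMethodBase):
    def create_weights(self, layer): layer["weights"] = "bf16"
    def process_weights_after_loading(self, layer): pass
    def apply(self, layer, x): return x

@register_quant_config("fp8_demo")
class DemoFp8Config(QuantizationConfig):
    ignored_layers = ("lm_head",)
    def get_quant_method(self, layer, prefix):
        # Returning None means this layer is skipped (stays unquantized).
        if any(prefix.startswith(p) for p in self.ignored_layers):
            return None
        return UnquantizedMethod()

# At model load time: look up the config by name, then dispatch per layer.
config = _REGISTRY["fp8_demo"]()
method = config.get_quant_method({}, prefix="model.layers.0.mlp.gate_up_proj")
```

The key property is that the config object is global to the model, while the returned method object is per-layer, so different layers (e.g. those in `ignored_layers`) can receive different treatment.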
Quantization method dispatch diagram:
Sources: vllm/model_executor/layers/quantization/fp8.py184-216 vllm/model_executor/layers/quantization/modelopt.py182-217
| Method | Config Class | Key File |
|---|---|---|
fp8 | Fp8Config | vllm/model_executor/layers/quantization/fp8.py |
modelopt | ModelOptFp8Config / ModelOptNvFp4Config | vllm/model_executor/layers/quantization/modelopt.py |
compressed_tensors | CompressedTensorsConfig | vllm/model_executor/layers/quantization/compressed_tensors/ |
gptq_marlin | GPTQMarlinConfig | vllm/model_executor/layers/quantization/gptq_marlin.py |
awq_marlin | AWQMarlinConfig | vllm/model_executor/layers/quantization/awq_marlin.py |
bitsandbytes | BitsAndBytesConfig | vllm/model_executor/layers/quantization/bitsandbytes.py |
gguf | GGUFConfig | vllm/model_executor/layers/quantization/gguf.py |
mxfp4 | Mxfp4Config | vllm/model_executor/layers/quantization/mxfp4.py |
moe_wna16 | — | vllm/model_executor/layers/quantization/moe_wna16.py |
Sources: vllm/model_executor/layers/quantization/fp8.py109-148 vllm/model_executor/layers/quantization/modelopt.py106-120 vllm/model_executor/layers/quantization/mxfp4.py66-83
Fp8Config (vllm/model_executor/layers/quantization/fp8.py109-237) controls:
- activation_scheme: "static" or "dynamic" — whether activation scales are pre-computed or computed per-token at runtime.
- is_checkpoint_fp8_serialized: True if weights are stored as FP8 in the checkpoint.
- weight_block_size: enables block-wise quantization (e.g. [128, 128]). Requires is_checkpoint_fp8_serialized=True and activation_scheme="dynamic".
- ignored_layers: list of layer name prefixes to skip.

Two linear method classes handle the loading cases:

| Class | Use case |
|---|---|
Fp8LinearMethod | Loads FP8-serialized checkpoint with static weight scale |
Fp8OnlineLinearMethod | Loads BF16/FP16 checkpoint, quantizes weights during loading |
For GPUs without native FP8 hardware support (compute capability below 8.9) or when VLLM_TEST_FORCE_FP8_MARLIN=1, both methods fall back to the Marlin FP8 kernel (apply_fp8_marlin_linear). ROCm and XPU platforms also skip Marlin.
Sources: vllm/model_executor/layers/quantization/fp8.py269-330 vllm/model_executor/layers/quantization/fp8.py526-648
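The difference between the two activation schemes comes down to when the scale is computed: once ahead of time ("static") or from each batch's max ("dynamic"). A pure-Python sketch of per-tensor E4M3 scaling (448.0 is the largest finite value in the float8_e4m3fn format; the helper names are ours, not vLLM's):

```python
# Sketch of per-tensor FP8 (E4M3) scale arithmetic behind the "static"
# (pre-computed) vs "dynamic" (per-batch) activation schemes.
# Pure-Python illustration; vLLM performs this on GPU tensors.
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def compute_fp8_scale(values: list[float]) -> float:
    """Scale such that values / scale fits the FP8 dynamic range."""
    amax = max(abs(v) for v in values)
    return max(amax, 1e-12) / FP8_E4M3_MAX

def quantize(values: list[float], scale: float) -> list[float]:
    # Clamp to the representable range after scaling (rounding omitted).
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]

acts = [0.5, -3.2, 896.0, 1.0]
scale = compute_fp8_scale(acts)  # "dynamic": recomputed per batch
q = quantize(acts, scale)
```

Under the static scheme the same computation runs once during calibration and the resulting scale is stored in the checkpoint.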
For block-wise FP8 (e.g. DeepSeek-V3), W8A8BlockFp8LinearOp (vllm/model_executor/layers/quantization/utils/fp8_utils.py347) dispatches to the appropriate backend:
Block FP8 GEMM backend selection diagram:
Sources: vllm/model_executor/layers/quantization/utils/fp8_utils.py347-420 vllm/utils/deep_gemm.py68-77
The FlashInfer path uses torch.cond() for compile compatibility, selecting DeepGEMM swapAB for M < 32 and standard DeepGEMM for M >= 32 (vllm/model_executor/layers/quantization/utils/fp8_utils.py239-320).
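For block-wise quantization, each [128, 128] weight block carries one float scale, so the scale tensor's shape is the ceil-divided weight shape. A sketch of the arithmetic (function name is ours; the DeepSeek-V3-like dimensions are just an example):

```python
import math

# Sketch: scale-tensor geometry for block-wise FP8 with
# weight_block_size [128, 128]. One float scale covers each 128x128
# weight block, so the scale grid is the ceil-divided weight shape.
def block_scale_shape(n: int, k: int, block=(128, 128)) -> tuple[int, int]:
    bn, bk = block
    return (math.ceil(n / bn), math.ceil(k / bk))

# DeepSeek-V3-like projection: [N=7168, K=2048] -> a 56 x 16 scale grid.
shape = block_scale_shape(7168, 2048)
```

Ceil division matters when a dimension is not a multiple of the block size: the trailing partial block still needs its own scale.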
DeepGEMM supports three scale formats controlled by DeepGemmQuantScaleFMT (vllm/utils/deep_gemm.py27-66):
| Format | Description |
|---|---|
FLOAT32 | Standard float32 scales |
FLOAT32_CEIL_UE8M0 | Float32 scales rounded up to UE8M0 (Hopper) |
UE8M0 | Packed UE8M0 (4 values per int32) (Blackwell) |
Controlled by VLLM_USE_DEEP_GEMM_E8M0 env var.
Sources: vllm/utils/deep_gemm.py27-77
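The FLOAT32_CEIL_UE8M0 behavior can be illustrated with scalar arithmetic: UE8M0 stores only an unsigned 8-bit exponent, i.e. a power-of-two scale, and rounding up (rather than to nearest) guarantees the scaled values never overflow the quantized range. A sketch (function name is ours):

```python
import math

# Sketch of the FLOAT32_CEIL_UE8M0 idea: round a float32 scale UP to the
# nearest power of two, since UE8M0 encodes only an 8-bit exponent.
# Illustrative scalar version; DeepGEMM applies this to scale tensors.
def ceil_to_ue8m0(scale: float) -> float:
    assert scale > 0
    return 2.0 ** math.ceil(math.log2(scale))

s1 = ceil_to_ue8m0(0.3)  # rounds up to 0.5 (2**-1)
s2 = ceil_to_ue8m0(4.0)  # already a power of two, unchanged
```

The packed UE8M0 format listed above then stores four such exponents per int32.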
FusedMoE (vllm/model_executor/layers/fused_moe/layer.py274) is the top-level CustomOp representing a full MoE layer. It owns:
- w13 (gate+up) and w2 (down) weights, in shape [num_experts, N, K]
- FusedMoEConfig — static layer config
- FusedMoEParallelConfig — parallelism config (TP, EP, DP)
- quant_method: FusedMoEMethodBase — the active quantization strategy
- router — top-K router
- runner: DefaultMoERunner — orchestrates execution

FusedMoE internal structure:
Sources: vllm/model_executor/layers/fused_moe/layer.py274-666 vllm/model_executor/layers/fused_moe/config.py152
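The router's job, per token, is top-K selection over expert logits with renormalized routing weights. A minimal pure-Python sketch (the function name is ours; vLLM's routing additionally supports grouped top-K and makes renormalization optional):

```python
import math

# Sketch of top-K softmax routing: per token, pick the K highest-scoring
# experts and renormalize their probabilities to sum to 1.
def topk_route(logits: list[float], k: int) -> list[tuple[int, float]]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]  # (expert_id, routing_weight)

routes = topk_route([1.0, 3.0, 2.0, 0.5], k=2)
```

The returned (expert_id, weight) pairs are what the prepare stage uses to dispatch tokens and the finalize stage uses to weight the combined expert outputs.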
The MoE execution pipeline is decomposed into three independent stages via abstract interfaces in modular_kernel.py (vllm/model_executor/layers/fused_moe/modular_kernel.py):
[Router] → [FusedMoEPrepareAndFinalize.prepare] → [FusedMoEPermuteExpertsUnpermute] → [FusedMoEPrepareAndFinalize.finalize]
| Interface | Role |
|---|---|
FusedMoEPrepareAndFinalize | Quantizes inputs, dispatches tokens (e.g. all2all), and gathers results |
FusedMoEPermuteExpertsUnpermute | Runs the actual expert GEMMs (permute → GEMM → unpermute) |
FusedMoEModularKernel | Combines a prepare/finalize and a permute/unpermute into one callable |
Modular kernel composition:
Sources: vllm/model_executor/layers/fused_moe/modular_kernel.py44-80
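The composition can be sketched schematically with identity stand-ins for the real dispatch/combine and grouped-GEMM logic (class names simplified from the interfaces above):

```python
# Schematic of how a modular kernel composes the two stages into one
# callable; identity stand-ins replace real quantize/dispatch and GEMMs.
class PrepareAndFinalize:
    def prepare(self, tokens):
        # real impl: quantize inputs, dispatch tokens (e.g. all2all in EP)
        return tokens
    def finalize(self, expert_out):
        # real impl: gather/combine results across ranks
        return expert_out

class PermuteExpertsUnpermute:
    def apply(self, tokens):
        # stand-in for permute -> grouped expert GEMMs -> unpermute
        return [2 * t for t in tokens]

class ModularKernel:
    def __init__(self, prepare_finalize, experts):
        self.prepare_finalize = prepare_finalize
        self.experts = experts
    def __call__(self, tokens):
        x = self.prepare_finalize.prepare(tokens)
        y = self.experts.apply(x)
        return self.prepare_finalize.finalize(y)

kernel = ModularKernel(PrepareAndFinalize(), PermuteExpertsUnpermute())
out = kernel([1.0, 2.0])
```

Because the two stages only meet at this interface, any prepare/finalize implementation from the first table below can be paired with any experts implementation from the second.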
| Class | Backend | File |
|---|---|---|
MoEPrepareAndFinalizeNoEP | Single-GPU / TP-only | prepare_finalize.py |
NaiveEPDispatcher | All-reduce + scatter | all2all_utils.py |
DeepEPLLPrepareAndFinalize | DeepEP low-latency all2all | deep_ep_ll.py |
DeepEPHTDispatcher | DeepEP high-throughput all2all | deep_ep_ht.py |
| Class | Backend | Quantization |
|---|---|---|
TritonExperts | Triton fused kernel | FP8, INT8, W4A16, W8A16, unquantized |
CutlassExperts | CUTLASS grouped GEMM | FP8 |
DeepGemmExperts | DeepGEMM contiguous grouped GEMM | FP8 block |
BatchedDeepGemmExperts | DeepGEMM masked grouped GEMM | FP8 block |
MarlinExperts | Marlin kernel | INT4/FP4/FP8 |
Sources: vllm/model_executor/layers/fused_moe/cutlass_moe.py vllm/model_executor/layers/fused_moe/deep_gemm_moe.py vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py
Each quantization config supplies a FusedMoEMethodBase subclass for FusedMoE layers.
MoE quant method class hierarchy:
Sources: vllm/model_executor/layers/quantization/fp8.py650 vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py113-239 vllm/model_executor/layers/quantization/gptq_marlin.py vllm/model_executor/layers/quantization/mxfp4.py250
Fp8MoEMethod (serialized checkpoints) and Fp8OnlineMoEMethod (BF16 checkpoints, online quantize) call select_fp8_moe_backend to pick between:
| Backend enum | Kernel |
|---|---|
Fp8MoeBackend.TRITON | Triton fused_moe_kernel |
Fp8MoeBackend.CUTLASS | run_cutlass_moe_fp8 |
Fp8MoeBackend.DEEP_GEMM | DeepGemmExperts |
Fp8MoeBackend.FLASHINFER_TRTLLM | FlashInfer TensorRT-LLM fused MoE |
Fp8MoeBackend.FLASHINFER_CUTLASS | FlashInfer CUTLASS grouped GEMM |
Sources: vllm/model_executor/layers/quantization/fp8.py650 vllm/model_executor/layers/fused_moe/oracle/fp8.py
CompressedTensorsMoEMethod.get_moe_method (vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py125-239) is a factory that inspects the weight/activation quantization scheme and returns the most appropriate subclass:
| Scheme detected | Method class |
|---|---|
| MXFP4 weights | CompressedTensorsW4A4Mxfp4MoEMethod |
| WNA16 packed | CompressedTensorsWNA16MarlinMoEMethod or CompressedTensorsWNA16MoEMethod |
| NVFP4 | CompressedTensorsW4A4Nvfp4MoEMethod |
| FP8 W8A8 | CompressedTensorsW8A8Fp8MoEMethod |
| INT8 W8A8 | CompressedTensorsW8A8Int8MoEMethod |
| FP8 W4A8 | CompressedTensorsW4A8Fp8MoEMethod |
| INT8 W4A8 | CompressedTensorsW4A8Int8MoEMethod |
Marlin is preferred for WNA16 when check_moe_marlin_supports_layer passes and the platform is not ROCm.
Sources: vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py156-239
Mxfp4Config.get_quant_method (vllm/model_executor/layers/quantization/mxfp4.py211-247) creates Mxfp4MoEMethod, which selects a backend via get_mxfp4_backend:
| Mxfp4Backend enum | Condition |
|---|---|
SM100_FI_MXFP4_MXFP8_CUTLASS | Blackwell + FlashInfer + VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 |
SM100_FI_MXFP4_MXFP8_TRTLLM | Blackwell + FlashInfer + VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 |
SM100_FI_MXFP4_BF16 | Blackwell + FlashInfer |
SM90_FI_MXFP4_BF16 | Hopper + FlashInfer + VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 |
TRITON | SM90/SM100, triton_kernels available |
MARLIN | fallback / VLLM_MXFP4_USE_MARLIN=1 |
CK | ROCm gfx950 + AITER enabled |
Sources: vllm/model_executor/layers/quantization/mxfp4.py108-183
When expert parallelism (EP) is enabled, FusedMoE.__init__ calls determine_expert_map (vllm/model_executor/layers/fused_moe/layer.py67-153) to produce:
- local_num_experts: experts on this rank.
- expert_map: tensor of shape (global_num_experts,) mapping global → local index; -1 for experts not on this rank.
- expert_mask (ROCm AITER only): binary mask of same length.

Two placement strategies are supported via ExpertPlacementStrategy:
| Strategy | Description |
|---|---|
"linear" | Consecutive blocks of experts per rank |
"round_robin" | Interleaved distribution, requires DeepEP low-latency backend |
The modular kernel receives expert_map so GEMM kernels can write zeros for absent experts.
Sources: vllm/model_executor/layers/fused_moe/layer.py67-153 vllm/model_executor/layers/fused_moe/layer.py424-474
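A pure-Python sketch of what the two strategies produce (make_expert_map is an illustrative stand-in for determine_expert_map, and even divisibility of experts by EP size is assumed):

```python
# Sketch of expert-map construction for the two placement strategies;
# illustrative stand-in, not vLLM's actual determine_expert_map.
def make_expert_map(global_num_experts: int, ep_size: int, ep_rank: int,
                    strategy: str = "linear"):
    expert_map = [-1] * global_num_experts  # -1 = expert not on this rank
    if strategy == "linear":
        per_rank = global_num_experts // ep_size  # assumes even division
        start = ep_rank * per_rank
        local_ids = list(range(start, start + per_rank))
    elif strategy == "round_robin":
        local_ids = list(range(ep_rank, global_num_experts, ep_size))
    else:
        raise ValueError(strategy)
    for local, g in enumerate(local_ids):
        expert_map[g] = local  # global id -> local slot
    return len(local_ids), expert_map

# 8 experts over 2 EP ranks, viewed from rank 1:
n_local, emap = make_expert_map(8, 2, 1, "linear")        # experts 4..7 local
n_rr, emap_rr = make_expert_map(8, 2, 1, "round_robin")   # experts 1,3,5,7 local
```

Either way, the -1 entries are exactly the positions for which the GEMM kernels write zeros, as noted above.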
fused_moe_kernel (vllm/model_executor/layers/fused_moe/fused_moe.py314-574) and fused_moe_kernel_gptq_awq (vllm/model_executor/layers/fused_moe/fused_moe.py81-311) are the core Triton kernels. Both implement the standard "sorted token" MoE pattern:
Tokens are sorted by expert assignment so that each program instance processes a BLOCK_SIZE_M chunk of sorted tokens belonging to a single expert. When the mapped expert index is -1 (the expert is not local under EP), zeros are written to the output. The kernel accepts use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, use_int4_w4a16, per_channel_quant, group_k, and group_n compile-time constants for specialization.
Sources: vllm/model_executor/layers/fused_moe/fused_moe.py314-574 vllm/model_executor/layers/fused_moe/fused_moe.py81-311
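The alignment step feeding these kernels can be sketched in pure Python: bucket token indices by expert and pad each bucket to a multiple of BLOCK_SIZE_M so no block straddles two experts (the helper name is ours; -1 marks padding slots):

```python
# Pure-Python sketch of the "sorted token" pattern: sort token indices by
# assigned expert and pad each expert's run to a multiple of BLOCK_SIZE_M,
# so every kernel program works on exactly one expert's tokens.
BLOCK_SIZE_M = 4
PAD = -1

def align_tokens_by_expert(topk_ids: list[int], num_experts: int):
    buckets: list[list[int]] = [[] for _ in range(num_experts)]
    for tok, e in enumerate(topk_ids):
        buckets[e].append(tok)
    sorted_ids, expert_per_block = [], []
    for e, toks in enumerate(buckets):
        if not toks:
            continue
        while len(toks) % BLOCK_SIZE_M:
            toks.append(PAD)  # pad so blocks never straddle experts
        sorted_ids.extend(toks)
        expert_per_block.extend([e] * (len(toks) // BLOCK_SIZE_M))
    return sorted_ids, expert_per_block

# 5 token slots routed to 2 experts: tokens 0,2 -> expert 0; 1,3,4 -> expert 1
ids, experts = align_tokens_by_expert([0, 1, 0, 1, 1], num_experts=2)
```

Each BLOCK_SIZE_M slice of the sorted index list, together with its per-block expert id, is then one unit of work for the Triton grid.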
ModelOptQuantConfigBase (vllm/model_executor/layers/quantization/modelopt.py132-218) is the base for NVIDIA ModelOpt-exported checkpoints. Concrete configs:
| Config | Algorithms |
|---|---|
ModelOptFp8Config | FP8, FP8_PER_CHANNEL_PER_TOKEN, FP8_PB_WO |
ModelOptNvFp4Config | NVFP4 |
ModelOptMxFp8Config | MXFP8 |
The exclude_modules list supports wildcard patterns (fnmatch) and also handles the language_model. prefix convention used by some HF checkpoints (vllm/model_executor/layers/quantization/modelopt.py144-180).
Sources: vllm/model_executor/layers/quantization/modelopt.py106-362
These env vars control quantization and MoE behavior at runtime:
| Variable | Default | Effect |
|---|---|---|
VLLM_USE_DEEP_GEMM | True | Enable DeepGEMM library for FP8 GEMMs |
VLLM_MOE_USE_DEEP_GEMM | True | Enable DeepGEMM for MoE GEMMs |
VLLM_USE_DEEP_GEMM_E8M0 | True | Use UE8M0 scale format with DeepGEMM |
VLLM_USE_DEEP_GEMM_TMA_ALIGNED_SCALES | True | TMA-align scale tensors |
VLLM_DEEP_GEMM_WARMUP | "relax" | Warmup mode: skip, full, or relax |
VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER | True | Allow FlashInfer for block-scale FP8 GEMM |
VLLM_USE_FLASHINFER_MOE_FP8 | False | Use FlashInfer for FP8 MoE |
VLLM_USE_FLASHINFER_MOE_FP4 | False | Use FlashInfer for FP4 MoE |
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 | False | FlashInfer MXFP4/MXFP8 TRT-LLM path |
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS | False | FlashInfer MXFP4/MXFP8 CUTLASS path |
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16 | False | FlashInfer MXFP4 BF16 path (SM90) |
VLLM_FLASHINFER_MOE_BACKEND | "latency" | FlashInfer MoE backend: throughput, latency, masked_gemm |
VLLM_MXFP4_USE_MARLIN | None | Force Marlin backend for MXFP4 |
VLLM_MARLIN_USE_ATOMIC_ADD | False | Use atomic add in Marlin kernels |
VLLM_MARLIN_INPUT_DTYPE | None | Override Marlin input dtype (int8 or fp8) |
VLLM_FUSED_MOE_CHUNK_SIZE | 16384 | Chunk size for fused MoE activation |
VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING | True | Enable activation chunking in MoE |
VLLM_USE_FUSED_MOE_GROUPED_TOPK | True | Use fused grouped top-K routing |
VLLM_USE_TRITON_AWQ | False | Use Triton for AWQ kernels |
VLLM_TEST_FORCE_FP8_MARLIN | False | Force Marlin fallback for FP8 |
VLLM_ROCM_USE_AITER_MOE | True | Use AITER fused MoE on ROCm |
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS | False | Fuse shared experts in AITER MoE |
VLLM_ROCM_MOE_PADDING | True | Enable MoE padding on ROCm |
VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE | 163840 | Max tokens per expert for FP4 MoE |
VLLM_NVFP4_GEMM_BACKEND | None | Override NVFP4 GEMM backend |
Sources: vllm/envs.py56-57 vllm/envs.py95-96 vllm/envs.py101-118 vllm/envs.py145-168 vllm/envs.py153-156
The following diagram shows the full path from a FusedMoE.forward call through quantization dispatch to kernel execution:
Sources: vllm/model_executor/layers/fused_moe/modular_kernel.py44-80 vllm/model_executor/layers/fused_moe/layer.py647-702
The FusedMoeWeightScaleSupported enum (vllm/model_executor/layers/fused_moe/layer.py60-64) captures the scale granularities supported by MoE kernels:
| Value | Meaning |
|---|---|
TENSOR | Single scale per expert weight tensor |
CHANNEL | One scale per output channel |
GROUP | Block/group-wise scale (e.g. every 128 elements) |
BLOCK | 2D block-wise scale |
These are used by quant_method.create_weights to register the appropriate PerTensorScaleParameter, ChannelQuantScaleParameter, or BlockQuantScaleParameter.
Sources: vllm/model_executor/layers/fused_moe/layer.py60-64 vllm/model_executor/layers/quantization/fp8.py332-394
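For one expert weight of shape [N, K], the four granularities imply very different scale-tensor sizes; a sketch of the arithmetic (group and block sizes of 128 assumed for illustration):

```python
import math

# Sketch: number of scale values each granularity implies for one expert
# weight of shape [N, K]. Illustrative arithmetic for the enum above.
def num_scales(n: int, k: int, granularity: str,
               group_size: int = 128, block=(128, 128)) -> int:
    if granularity == "TENSOR":
        return 1
    if granularity == "CHANNEL":
        return n  # one scale per output channel
    if granularity == "GROUP":
        return n * math.ceil(k / group_size)  # groups along the K dim
    if granularity == "BLOCK":
        return math.ceil(n / block[0]) * math.ceil(k / block[1])
    raise ValueError(granularity)

counts = {g: num_scales(512, 256, g)
          for g in ("TENSOR", "CHANNEL", "GROUP", "BLOCK")}
```

These counts are what determine the shapes of the scale parameters that create_weights registers.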