This page documents how quantization is applied to Mixture-of-Experts (MoE) layers in vLLM, and how the runtime selects appropriate compute kernels based on hardware and quantization type. Topics include: FP8, INT8, GPTQ-Marlin, AWQ-Marlin, compressed tensors, MXFP4/NVFP4, and ROCm AITER variants.
For the modular kernel architecture (FusedMoEModularKernel, FusedMoEPrepareAndFinalize, FusedMoEPermuteExpertsUnpermute), see 7.3. For FP8 quantization on linear layers, see 7.2. For the general quantization method registry and lifecycle, see 7.1.
Every FusedMoE layer acquires a FusedMoEMethodBase instance during __init__. This happens via the _get_quant_method closure at vllm/model_executor/layers/fused_moe/layer.py578-593 which calls quant_config.get_quant_method(self, prefix) where self is the FusedMoE layer. If the result is None, it falls back to UnquantizedFusedMoEMethod.
The FusedMoEMethodBase interface defines:
- create_weights — allocates quantized weight parameters
- process_weights_after_loading — reformats weights for the target kernel
- apply — executes the quantized forward pass
- maybe_make_prepare_finalize — returns a FusedMoEPrepareAndFinalize for modular kernels (or None if the method is monolithic)
- supports_internal_mk / is_monolithic — flags for the modular kernel initialization path
- supports_eplb — expert parallelism load balancing support

Diagram: QuantizationConfig to FusedMoEMethodBase Dispatch
Sources: vllm/model_executor/layers/fused_moe/layer.py578-593 vllm/model_executor/layers/quantization/fp8.py184-216 vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py124-240 vllm/model_executor/layers/quantization/mxfp4.py211-243
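The contract above can be sketched as an abstract base class. This is a simplified illustration only; the real FusedMoEMethodBase takes more arguments (layer references, routing tables, etc.), and the class name here is marked as a sketch to avoid confusion with vLLM's actual class.

```python
from abc import ABC, abstractmethod

class FusedMoEMethodBaseSketch(ABC):
    """Simplified stand-in for vLLM's FusedMoEMethodBase contract."""

    # Flags consulted during modular kernel initialization.
    supports_eplb: bool = False
    is_monolithic: bool = False
    supports_internal_mk: bool = False

    @abstractmethod
    def create_weights(self, layer, **extra_weight_attrs):
        """Allocate quantized weight parameters on the layer."""

    def process_weights_after_loading(self, layer):
        """Reformat loaded weights into the target kernel's layout."""

    @abstractmethod
    def apply(self, layer, x, router_logits):
        """Execute the quantized MoE forward pass."""

    def maybe_make_prepare_finalize(self):
        # Monolithic methods return None; modular methods return a
        # FusedMoEPrepareAndFinalize implementation.
        return None
```

A concrete method such as a Marlin-style monolithic implementation would set `is_monolithic = True` and leave `maybe_make_prepare_finalize` returning None.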
| Class | File | Checkpoint type |
|---|---|---|
| Fp8MoEMethod | fp8.py | Serialized FP8 (static weight scales) |
| Fp8OnlineMoEMethod | fp8.py | FP16/BF16 checkpoint; weights quantized during load |
| CompressedTensorsW8A8Fp8MoEMethod | compressed_tensors_moe.py | Compressed-tensors format FP8 |
| ModelOptFp8MoEMethod | modelopt.py | ModelOpt FP8 checkpoint |
Fp8MoEMethod supports three weight scale granularities based on Fp8Config.weight_block_size and activation_scheme:
- Per-tensor (kFp8StaticTensorSym) — single scale per weight matrix
- Per-channel (kFp8StaticChannelSym) — one scale per output channel
- Block-wise (kFp8Static128BlockSym) — e.g., 128×128 blocks (DeepSeek-V3 style)

After weight loading, the oracle in vllm/model_executor/layers/fused_moe/oracle/fp8.py is called via select_fp8_moe_backend and make_fp8_moe_kernel. It returns a Fp8MoeBackend enum value and builds the appropriate FusedMoEPermuteExpertsUnpermute implementation.
Diagram: FP8 Backend Oracle Selection (select_fp8_moe_backend)
Sources: vllm/model_executor/layers/quantization/fp8.py650-900 vllm/model_executor/layers/fused_moe/cutlass_moe.py51-130 vllm/model_executor/layers/fused_moe/deep_gemm_moe.py44-80 vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py1-50 vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py1-60
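The three granularities imply different scale-tensor shapes. The helper below is illustrative only (not a vLLM function): for a hypothetical expert weight of shape (N, K), it computes the shape of the corresponding scale tensor under each scheme.

```python
def fp8_scale_shape(n: int, k: int, granularity: str, block: int = 128):
    """Illustrative scale-tensor shapes for an (N, K) FP8 weight.

    Hypothetical helper; granularity names map to the schemes above.
    """
    if granularity == "tensor":      # kFp8StaticTensorSym
        return (1,)                  # one scalar per weight matrix
    if granularity == "channel":     # kFp8StaticChannelSym
        return (n, 1)                # one scale per output channel
    if granularity == "block":       # kFp8Static128BlockSym
        # ceil-divide each dimension by the block size
        return ((n + block - 1) // block, (k + block - 1) // block)
    raise ValueError(f"unknown granularity: {granularity}")
```

For example, a DeepSeek-V3-style 7168×2048 expert weight with 128×128 blocks carries a 56×16 grid of scales.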
After selecting a backend, convert_to_fp8_moe_kernel_format reshapes and transposes the weight and scale tensors into the layout expected by the chosen kernel. The function is called inside process_weights_after_loading.
Mxfp4Backend in vllm/model_executor/layers/quantization/mxfp4.py enumerates available backends:
| Enum Value | Kernel | Hardware |
|---|---|---|
| SM100_FI_MXFP4_MXFP8_TRTLLM | FlashInfer TensorRT-LLM | Blackwell (SM100) |
| SM100_FI_MXFP4_MXFP8_CUTLASS | FlashInfer CUTLASS | Blackwell (SM100) |
| SM100_FI_MXFP4_BF16 | FlashInfer, BF16 output | Blackwell (SM100) |
| SM90_FI_MXFP4_BF16 | FlashInfer, BF16 output | Hopper (SM90) |
| MARLIN | Marlin GEMM | SM80+ fallback |
| TRITON | OAI Triton kernels (OAITritonExperts) | SM90/SM100 (no FlashInfer) |
| CK | Composable Kernels (AITER) | ROCm GFX950 |
| NONE | Not supported | — |
Sources: vllm/model_executor/layers/quantization/mxfp4.py66-83
get_mxfp4_backend Selection Logic

Diagram: MXFP4 Backend Selection (get_mxfp4_backend)
Sources: vllm/model_executor/layers/quantization/mxfp4.py84-183
| Class | Config | Backend Oracle |
|---|---|---|
| Mxfp4MoEMethod | Mxfp4Config | get_mxfp4_backend |
| CompressedTensorsW4A4Nvfp4MoEMethod | CompressedTensorsConfig | select_nvfp4_moe_backend |
| CompressedTensorsW4A4Mxfp4MoEMethod | CompressedTensorsConfig | Fixed: Marlin |
| ModelOptNvfp4MoEMethod | ModelOptNvfp4Config | select_nvfp4_moe_backend |
Mxfp4MoEMethod.process_weights_after_loading calls _swizzle_mxfp4 for FlashInfer layouts or prepare_moe_fp4_layer_for_marlin for Marlin.
Sources: vllm/model_executor/layers/quantization/mxfp4.py250-500 vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py242-370
GPTQMarlinMoEMethod and AWQMarlinMoEMethod both use fused_marlin_moe from vllm/model_executor/layers/fused_moe/fused_marlin_moe.py for execution.
| Parameter | Description |
|---|---|
| w13_qweight / w2_qweight | Packed INT4 weights |
| w13_scales / w2_scales | Per-group dequantization scales |
| w13_qzeros / w2_qzeros | Zero points |
| w13_g_idx / w2_g_idx | Group indices (activation ordering) |
| w13_workspace / w2_workspace | Marlin workspace buffers |
process_weights_after_loading calls:
- marlin_moe_permute_scales — reorders scales to Marlin layout
- marlin_make_workspace_new — allocates workspace

The GPTQMarlinState enum (values REPACK, READY) tracks whether weights still need repacking.
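The repack-state machine described above can be sketched as a one-way transition. The enum names follow the text; the guard function and its body are a hypothetical illustration of where the repack and scale-permute calls would sit, not vLLM's actual code.

```python
from enum import Enum

class GPTQMarlinState(Enum):
    """Tracks whether Marlin MoE weights still need repacking."""
    REPACK = 0
    READY = 1

def maybe_repack(state: GPTQMarlinState) -> GPTQMarlinState:
    """Hypothetical guard: repack into the Marlin layout exactly once."""
    if state is GPTQMarlinState.REPACK:
        # ... repack qweights and call marlin_moe_permute_scales here ...
        return GPTQMarlinState.READY
    return state  # already READY: no further work
```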
check_moe_marlin_supports_layer from marlin_utils.py gates Marlin use. On ROCm, Marlin MoE is not supported and CompressedTensorsWNA16MoEMethod is chosen instead.
The VLLM_USE_TRITON_AWQ environment variable forces the Triton path even on hardware that supports Marlin.
Sources: vllm/model_executor/layers/quantization/gptq_marlin.py1-400 vllm/model_executor/layers/quantization/awq_marlin.py1-400
CompressedTensorsConfig.get_quant_method delegates to the static factory CompressedTensorsMoEMethod.get_moe_method, which inspects the weight and input quantization descriptors and selects a concrete method.
Diagram: CompressedTensorsMoEMethod.get_moe_method Factory
Sources: vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py124-240
Note: CompressedTensorsWNA16MoEMethod does not support actorder=group/dynamic; this combination raises a ValueError. Activation ordering requires the Marlin path.
The Triton kernels in fused_moe.py are parameterized at JIT compile time using tl.constexpr flags. The invoke_fused_moe_kernel function selects which combination of flags to pass based on the FusedMoEQuantConfig.
fused_moe_kernel (general-purpose)

Sources: vllm/model_executor/layers/fused_moe/fused_moe.py314-575

| constexpr Parameter | Active Quantization Scheme |
|---|---|
| use_fp8_w8a8=True | FP8 W8A8 (per-tensor, per-token, or block) |
| use_int8_w8a8=True | INT8 W8A8 (per-tensor, per-token, or block) |
| use_int8_w8a16=True | INT8 W8A16 (per-output-channel weight scale) |
| per_channel_quant=True | Per-output-channel weight quantization |
| group_k > 0 and group_n > 0 | Block quantization (N×K tile, e.g., 128×128) |
| All flags False | BF16/FP16 unquantized |
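The flag combinations in the table can be sketched as a selection function. This is a hedged illustration of the dispatch logic, not the body of invoke_fused_moe_kernel: the function name and its string-typed dtype arguments are assumptions, while the flag names (use_fp8_w8a8, group_n, group_k, etc.) come from the table above.

```python
def select_kernel_flags(weight_dtype: str, act_dtype: str,
                        block_shape=None, per_channel: bool = False):
    """Illustrative mapping from a quant scheme to constexpr flags.

    block_shape is an assumed (block_n, block_k) tuple; None means
    no block quantization (group_n = group_k = 0).
    """
    return {
        "use_fp8_w8a8": weight_dtype == "fp8" and act_dtype == "fp8",
        "use_int8_w8a8": weight_dtype == "int8" and act_dtype == "int8",
        "use_int8_w8a16": weight_dtype == "int8" and act_dtype == "bf16",
        "per_channel_quant": per_channel,
        "group_n": block_shape[0] if block_shape else 0,
        "group_k": block_shape[1] if block_shape else 0,
    }
```

With all flags false and both group sizes zero, the kernel specializes to the plain BF16/FP16 path.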
fused_moe_kernel_gptq_awq (GPTQ/AWQ)

Sources: vllm/model_executor/layers/fused_moe/fused_moe.py81-312

| constexpr Parameter | Active Quantization Scheme |
|---|---|
| use_int4_w4a16=True | INT4 W4A16 (GPTQ/AWQ INT4 packed) |
| use_int8_w8a16=True | INT8 W8A16 (GPTQ/AWQ INT8) |
| has_zp=True | With zero-point (AWQ asymmetric) |
| has_zp=False | Without zero-point (GPTQ symmetric; substitutes a constant) |
When VLLM_ROCM_USE_AITER_MOE=1 and the platform is ROCm, the AITER library handles MoE computation. The integration lives in vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py.
QuantMethod in rocm_aiter_fused_moe.py maps vLLM quantization types to AITER's internal enum:
| QuantMethod | Integer | Description |
|---|---|---|
| NO | 0 | A16W16 — no quantization |
| PER_TENSOR | 1 | W8A8 per-tensor scale |
| PER_TOKEN | 2 | W8A8/W8A4 per-token scale |
| BLOCK_1X32 | 3 | FP4 block 1×32 |
| BLOCK_1X128 | 4 | W8A8 per-1×128 block |
| BLOCK_128x128 | 5 | W8A8 per-128×128 block |
Sources: vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py30-44
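The table above maps directly onto an integer enum. This sketch mirrors the names and values from the table; it is a standalone illustration, not an import from vLLM or AITER.

```python
from enum import IntEnum

class QuantMethod(IntEnum):
    """Sketch of the vLLM-to-AITER quantization mapping (values per the table)."""
    NO = 0              # A16W16 — no quantization
    PER_TENSOR = 1      # W8A8 per-tensor scale
    PER_TOKEN = 2       # W8A8/W8A4 per-token scale
    BLOCK_1X32 = 3      # FP4 block 1x32
    BLOCK_1X128 = 4     # W8A8 per-1x128 block
    BLOCK_128x128 = 5   # W8A8 per-128x128 block
```

Because it is an IntEnum, the values can be passed straight through to a C-side API expecting plain integers.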
When VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, AITER can fuse shared experts (as in DeepSeek-V3) alongside routed experts in a single kernel call. FusedMoE.__init__ sets self.aiter_fmoe_shared_expert_enabled and self.num_fused_shared_experts accordingly. The init_aiter_topK_meta_data function pre-allocates top-K buffers that cover both routed and shared expert IDs vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py56-106
From vllm/model_executor/layers/fused_moe/layer.py402-407:
AITER MoE only supports gated activations (is_act_and_mul=True, i.e., SiLU/GELU with gate). Non-gated MoE falls back to standard Triton kernels.
Expert masks passed to AITER MoE must contain only 0s and 1s. The expert_mask tensor is constructed by determine_expert_map when return_expert_mask=True vllm/model_executor/layers/fused_moe/layer.py444-454
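The 0/1 constraint is easy to satisfy when each expert-parallel rank owns a contiguous slice of the global experts. The function below is a simplified stand-in for what determine_expert_map(return_expert_mask=True) produces, using plain Python lists instead of tensors; the even-split assumption is for illustration.

```python
def make_expert_mask(num_experts: int, ep_size: int, ep_rank: int):
    """Illustrative binary expert mask: 1 for experts local to ep_rank.

    Assumes num_experts divides evenly across ep_size ranks.
    """
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    end = start + per_rank
    return [1 if start <= e < end else 0 for e in range(num_experts)]
```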
FusedMoEQuantConfig in vllm/model_executor/layers/fused_moe/config.py is the bridge between a quantization method and modular kernel infrastructure. It contains four FusedMoEQuantDesc objects describing activations and weights for both GEMMs:
| Field | Represents |
|---|---|
| a1 | Input activation quantization |
| w1 | Gate/up weight quantization |
| a2 | Intermediate activation quantization |
| w2 | Down weight quantization |
FusedMoEQuantDesc fields:
- dtype — quantized dtype (e.g., torch.float8_e4m3fn, "mxfp4", "nvfp4")
- shape — GroupShape expressing quantization granularity
- scale — scale tensor or PrecisionConfig (GPT OSS Triton)
- alpha_or_gscale — global scales for NVFP4 / W4A8 FP8 per-channel
- zp — zero points for INT4/INT8 asymmetric schemes

Factory functions that build FusedMoEQuantConfig instances:
| Factory | File | Scheme |
|---|---|---|
| fp8_w8a8_moe_quant_config | config.py | FP8 W8A8 |
| int8_w8a8_moe_quant_config | config.py | INT8 W8A8 |
| int8_w8a16_moe_quant_config | config.py | INT8 W8A16 |
| int4_w4a16_moe_quant_config | config.py | INT4 W4A16 |
| mxfp4_mxfp8_moe_quant_config | config.py | MXFP4 weights, MXFP8 compute |
| mxfp4_w4a16_moe_quant_config | config.py | MXFP4 weights, BF16 activations |
| ocp_mx_moe_quant_config | config.py | OCP MX format |
| make_fp8_moe_quant_config | oracle/fp8.py | Oracle-built FP8 config |
| make_nvfp4_moe_quant_config | oracle/nvfp4.py | Oracle-built NVFP4 config |
| make_mxfp4_moe_quant_config | oracle/nvfp4.py | Oracle-built MXFP4 config |
Sources: vllm/model_executor/layers/fused_moe/config.py152-450
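The four-descriptor structure can be captured with dataclasses. This is a simplified mirror of the fields named above, not vLLM's actual FusedMoEQuantConfig/FusedMoEQuantDesc definitions (which carry more validation and helper methods).

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class QuantDescSketch:
    """Simplified FusedMoEQuantDesc: one tensor's quantization recipe."""
    dtype: Any = None            # e.g. torch.float8_e4m3fn, "mxfp4", "nvfp4"
    shape: Any = None            # GroupShape-like granularity descriptor
    scale: Any = None            # scale tensor or PrecisionConfig
    alpha_or_gscale: Any = None  # global scales (NVFP4 / W4A8 FP8)
    zp: Any = None               # zero points for asymmetric schemes

@dataclass
class QuantConfigSketch:
    """Simplified FusedMoEQuantConfig: one descriptor per tensor role."""
    a1: QuantDescSketch  # input activations (GEMM 1)
    w1: QuantDescSketch  # gate/up weights
    a2: QuantDescSketch  # intermediate activations (GEMM 2)
    w2: QuantDescSketch  # down weights
```

An FP8 W8A8 config, for instance, would fill all four descriptors with an FP8 dtype plus the matching scale tensors.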
After all weights are loaded, FusedMoE.maybe_init_modular_kernel is called vllm/model_executor/layers/fused_moe/layer.py677-703:
```
FusedMoE.maybe_init_modular_kernel()
├── if quant_method.supports_internal_mk or is_monolithic: return (no-op)
└── base_quant_method.maybe_make_prepare_finalize(routing_tables)
      → FusedMoEModularMethod.make(layer, base_quant_method, prepare_finalize, ...)
        replaces self.quant_method
```
Methods that set is_monolithic = True handle all aspects of dispatch internally (e.g., GPTQMarlinMoEMethod calls fused_marlin_moe directly from apply). Methods that produce a prepare_finalize object are wrapped by FusedMoEModularMethod, which coordinates with the FusedMoEPrepareAndFinalize and FusedMoEPermuteExpertsUnpermute interfaces described in 7.3.
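The decision above reduces to a small dispatch function. This is a hedged sketch: the real maybe_init_modular_kernel is a method on FusedMoE, and the make_prepare_finalize/wrap callables stand in for maybe_make_prepare_finalize and FusedMoEModularMethod.make respectively.

```python
def maybe_init_modular_kernel(quant_method, make_prepare_finalize, wrap):
    """Sketch of the modular-kernel init decision.

    quant_method: the current FusedMoEMethodBase-like object.
    make_prepare_finalize: stand-in for maybe_make_prepare_finalize().
    wrap: stand-in for FusedMoEModularMethod.make(...).
    """
    # Monolithic or internally-modular methods are left untouched.
    if getattr(quant_method, "supports_internal_mk", False) or \
       getattr(quant_method, "is_monolithic", False):
        return quant_method
    pf = make_prepare_finalize()
    if pf is None:
        return quant_method       # method has no prepare/finalize stage
    return wrap(quant_method, pf)  # replaces layer.quant_method
```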
Additionally, from vllm/model_executor/layers/fused_moe/layer.py622-627 three method classes require the full (pre-sharding) intermediate_size to be passed for activation ordering:
- GPTQMarlinMoEMethod
- CompressedTensorsWNA16MarlinMoEMethod
- CompressedTensorsWNA16MoEMethod

| Variable | Default | Effect |
|---|---|---|
| VLLM_MOE_USE_DEEP_GEMM | True | Enable DeepGEMM for FP8 block-quantized MoE |
| VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER | True | Enable FlashInfer for block-scale FP8 GEMM in MoE |
| VLLM_USE_FLASHINFER_MOE_FP8 | False | Force FlashInfer path for FP8 MoE |
| VLLM_USE_FLASHINFER_MOE_FP4 | False | Force FlashInfer path for FP4 MoE |
| VLLM_USE_FLASHINFER_MOE_FP16 | False | Force FlashInfer path for BF16/FP16 MoE |
| VLLM_USE_FLASHINFER_MOE_INT4 | False | Force FlashInfer path for INT4 MoE |
| VLLM_FLASHINFER_MOE_BACKEND | "latency" | FlashInfer MoE mode: "throughput", "latency", "masked_gemm" |
| VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 | False | Enable FlashInfer TRTLLM MXFP4+MXFP8 path on SM100 |
| VLLM_USE_FLASHINFER_MOE_MXFP4_BF16 | False | Enable FlashInfer MXFP4+BF16 path on SM90 |
| VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS | False | Enable FlashInfer CUTLASS MXFP4+MXFP8 path on SM100 |
| VLLM_MXFP4_USE_MARLIN | None | Force Marlin for MXFP4 (disables FlashInfer/Triton) |
| VLLM_ROCM_USE_AITER_MOE | True | Enable AITER for MoE on ROCm |
| VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS | False | Fuse shared experts into AITER MoE kernel |
| VLLM_ROCM_MOE_PADDING | True | Enable weight padding for ROCm MoE |
| VLLM_USE_TRITON_AWQ | False | Force Triton (bypass Marlin) for AWQ |
| VLLM_MARLIN_USE_ATOMIC_ADD | False | Use atomic add in Marlin MoE kernel |
| VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE | 163840 | Max tokens per expert for FP4 MoE |
| VLLM_FUSED_MOE_CHUNK_SIZE | 16384 | Token chunk size for FusedMoE activation chunking |
| VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING | True | Enable activation chunking (disable if using torch.compile) |
| VLLM_DEEPEPLL_NVFP4_DISPATCH | False | Enable NVFP4 dispatch via DeepEP low-latency path |
Sources: vllm/envs.py95-210
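Boolean flags like those above are typically read with a default-aware parser. The pattern below is illustrative; the exact parsing in vllm/envs.py may accept a different set of truthy strings, and the function name here is an assumption.

```python
import os

def env_flag(name: str, default: bool) -> bool:
    """Illustrative boolean env-flag reader with a default value."""
    val = os.environ.get(name)
    if val is None:
        return default          # unset: fall back to the table's default
    return val.strip().lower() in ("1", "true", "yes")
```

Setting VLLM_USE_TRITON_AWQ=1, for example, would flip that flag from its False default.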