This page documents how quantization is applied to Mixture-of-Experts (MoE) layers in vLLM, and how the runtime selects appropriate compute kernels based on hardware and quantization type. Topics include: FP8, INT8, GPTQ-Marlin, AWQ-Marlin, compressed tensors, MXFP4/NVFP4, and ROCm AITER variants.
For the modular kernel architecture (FusedMoEModularKernel, FusedMoEPrepareAndFinalize, FusedMoEPermuteExpertsUnpermute), see 7.3. For FP8 quantization on linear layers, see 7.2. For the general quantization method registry and lifecycle, see 7.1.
Every FusedMoE layer acquires a FusedMoEMethodBase instance during __init__. This happens via the _get_quant_method closure at vllm/model_executor/layers/fused_moe/layer.py578-593 which calls quant_config.get_quant_method(self, prefix) where self is the FusedMoE layer. If the result is None, it falls back to UnquantizedFusedMoEMethod.
The FusedMoEMethodBase interface defines:
- create_weights — allocates quantized weight parameters
- process_weights_after_loading — reformats weights for the target kernel
- apply — executes the quantized forward pass
- maybe_make_prepare_finalize — returns a FusedMoEPrepareAndFinalize for modular kernels (or None if the method is monolithic)
- supports_internal_mk / is_monolithic — flags for the modular kernel initialization path
- supports_eplb — expert parallelism load balancing support

Diagram: QuantizationConfig to FusedMoEMethodBase Dispatch
Sources: vllm/model_executor/layers/fused_moe/layer.py578-593 vllm/model_executor/layers/quantization/fp8.py184-216 vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py124-240 vllm/model_executor/layers/quantization/mxfp4.py211-243
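The contract above can be sketched as an abstract base class. This is a simplified illustration only; the real FusedMoEMethodBase takes more arguments (layer references, routing tables, etc.), and the class name here is marked as a sketch to avoid confusion with vLLM's actual class.

```python
from abc import ABC, abstractmethod

class FusedMoEMethodBaseSketch(ABC):
    """Simplified stand-in for vLLM's FusedMoEMethodBase contract."""

    # Flags consulted during modular kernel initialization.
    supports_eplb: bool = False
    is_monolithic: bool = False
    supports_internal_mk: bool = False

    @abstractmethod
    def create_weights(self, layer, **extra_weight_attrs):
        """Allocate quantized weight parameters on the layer."""

    def process_weights_after_loading(self, layer):
        """Reformat loaded weights into the target kernel's layout."""

    @abstractmethod
    def apply(self, layer, x, router_logits):
        """Execute the quantized MoE forward pass."""

    def maybe_make_prepare_finalize(self):
        # Monolithic methods return None; modular methods return a
        # FusedMoEPrepareAndFinalize implementation.
        return None
```

A concrete method such as a Marlin-style monolithic implementation would set `is_monolithic = True` and leave `maybe_make_prepare_finalize` returning None.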
| Class | File | Checkpoint type |
|---|---|---|
| Fp8MoEMethod | fp8.py | Serialized FP8 (static weight scales) |
| Fp8OnlineMoEMethod | fp8.py | FP16/BF16 checkpoint; weights quantized during load |
| CompressedTensorsW8A8Fp8MoEMethod | compressed_tensors_moe.py | Compressed-tensors format FP8 |
| ModelOptFp8MoEMethod | modelopt.py | ModelOpt FP8 checkpoint |
Fp8MoEMethod supports three weight scale granularities based on Fp8Config.weight_block_size and activation_scheme:
- Per-tensor (kFp8StaticTensorSym) — single scale per weight matrix
- Per-channel (kFp8StaticChannelSym) — one scale per output channel
- Block-wise (kFp8Static128BlockSym) — e.g., 128×128 blocks (DeepSeek-V3 style)

After weight loading, the oracle in vllm/model_executor/layers/fused_moe/oracle/fp8.py is called via select_fp8_moe_backend and make_fp8_moe_kernel. It returns a Fp8MoeBackend enum value and builds the appropriate FusedMoEPermuteExpertsUnpermute implementation.
Diagram: FP8 Backend Oracle Selection (select_fp8_moe_backend)
Sources: vllm/model_executor/layers/quantization/fp8.py650-900 vllm/model_executor/layers/fused_moe/cutlass_moe.py51-130 vllm/model_executor/layers/fused_moe/deep_gemm_moe.py44-80 vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py1-50 vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py1-60
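The three granularities imply different scale-tensor shapes. The helper below is illustrative only (not a vLLM function): for a hypothetical expert weight of shape (N, K), it computes the shape of the corresponding scale tensor under each scheme.

```python
def fp8_scale_shape(n: int, k: int, granularity: str, block: int = 128):
    """Illustrative scale-tensor shapes for an (N, K) FP8 weight.

    Hypothetical helper; granularity names map to the schemes above.
    """
    if granularity == "tensor":      # kFp8StaticTensorSym
        return (1,)                  # one scalar per weight matrix
    if granularity == "channel":     # kFp8StaticChannelSym
        return (n, 1)                # one scale per output channel
    if granularity == "block":       # kFp8Static128BlockSym
        # ceil-divide each dimension by the block size
        return ((n + block - 1) // block, (k + block - 1) // block)
    raise ValueError(f"unknown granularity: {granularity}")
```

For example, a DeepSeek-V3-style 7168×2048 expert weight with 128×128 blocks carries a 56×16 grid of scales.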
After selecting a backend, convert_to_fp8_moe_kernel_format reshapes and transposes the weight and scale tensors into the layout expected by the chosen kernel. The function is called inside process_weights_after_loading.
Mxfp4Backend in vllm/model_executor/layers/quantization/mxfp4.py enumerates available backends:
| Enum Value | Kernel | Hardware |
|---|---|---|
| SM100_FI_MXFP4_MXFP8_TRTLLM | FlashInfer TensorRT-LLM | Blackwell (SM100) |
| SM100_FI_MXFP4_MXFP8_CUTLASS | FlashInfer CUTLASS | Blackwell (SM100) |
| SM100_FI_MXFP4_BF16 | FlashInfer, BF16 output | Blackwell (SM100) |
| SM90_FI_MXFP4_BF16 | FlashInfer, BF16 output | Hopper (SM90) |
| MARLIN | Marlin GEMM | SM80+ fallback |
| TRITON | OAI Triton kernels (OAITritonExperts) | SM90/SM100 (no FlashInfer) |
| CK | Composable Kernels (AITER) | ROCm GFX950 |
| NONE | Not supported | — |
Sources: vllm/model_executor/layers/quantization/mxfp4.py66-83
get_mxfp4_backend Selection Logic

Diagram: MXFP4 Backend Selection (get_mxfp4_backend)
Sources: vllm/model_executor/layers/quantization/mxfp4.py84-183
| Class | Config | Backend Oracle |
|---|---|---|
| Mxfp4MoEMethod | Mxfp4Config | get_mxfp4_backend |
| CompressedTensorsW4A4Nvfp4MoEMethod | CompressedTensorsConfig | select_nvfp4_moe_backend |
| CompressedTensorsW4A4Mxfp4MoEMethod | CompressedTensorsConfig | Fixed: Marlin |
| ModelOptNvfp4MoEMethod | ModelOptNvfp4Config | select_nvfp4_moe_backend |
Mxfp4MoEMethod.process_weights_after_loading calls _swizzle_mxfp4 for FlashInfer layouts or prepare_moe_fp4_layer_for_marlin for Marlin.
Sources: vllm/model_executor/layers/quantization/mxfp4.py250-500 vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py242-370
GPTQMarlinMoEMethod and AWQMarlinMoEMethod both use fused_marlin_moe from vllm/model_executor/layers/fused_moe/fused_marlin_moe.py for execution.
| Parameter | Description |
|---|---|
| w13_qweight / w2_qweight | Packed INT4 weights |
| w13_scales / w2_scales | Per-group dequantization scales |
| w13_qzeros / w2_qzeros | Zero points |
| w13_g_idx / w2_g_idx | Group indices (activation ordering) |
| w13_workspace / w2_workspace | Marlin workspace buffers |
process_weights_after_loading calls:
- marlin_moe_permute_scales — reorders scales to Marlin layout
- marlin_make_workspace_new — allocates workspace

The GPTQMarlinState enum (values REPACK, READY) tracks whether weights still need repacking.
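The repack-state machine described above can be sketched as a one-way transition. The enum names follow the text; the guard function and its body are a hypothetical illustration of where the repack and scale-permute calls would sit, not vLLM's actual code.

```python
from enum import Enum

class GPTQMarlinState(Enum):
    """Tracks whether Marlin MoE weights still need repacking."""
    REPACK = 0
    READY = 1

def maybe_repack(state: GPTQMarlinState) -> GPTQMarlinState:
    """Hypothetical guard: repack into the Marlin layout exactly once."""
    if state is GPTQMarlinState.REPACK:
        # ... repack qweights and call marlin_moe_permute_scales here ...
        return GPTQMarlinState.READY
    return state  # already READY: no further work
```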
check_moe_marlin_supports_layer from marlin_utils.py gates Marlin use. On ROCm, Marlin MoE is not supported and CompressedTensorsWNA16MoEMethod is chosen instead.
The VLLM_USE_TRITON_AWQ environment variable forces the Triton path even on hardware that supports Marlin.
Sources: vllm/model_executor/layers/quantization/gptq_marlin.py1-400 vllm/model_executor/layers/quantization/awq_marlin.py1-400
CompressedTensorsConfig.get_quant_method delegates to the static factory CompressedTensorsMoEMethod.get_moe_method, which inspects the weight and input quantization descriptors and selects a concrete method.
Diagram: CompressedTensorsMoEMethod.get_moe_method Factory
Sources: vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py124-240
Note: CompressedTensorsWNA16MoEMethod does not support actorder=group/dynamic; this combination raises a ValueError. Activation ordering requires the Marlin path.
The Triton kernels in fused_moe.py are parameterized at JIT compile time using tl.constexpr flags. The invoke_fused_moe_kernel function selects which combination of flags to pass based on the FusedMoEQuantConfig.
fused_moe_kernel (general-purpose)

Sources: vllm/model_executor/layers/fused_moe/fused_moe.py314-575

| constexpr Parameter | Active Quantization Scheme |
|---|---|
| use_fp8_w8a8=True | FP8 W8A8 (per-tensor, per-token, or block) |
| use_int8_w8a8=True | INT8 W8A8 (per-tensor, per-token, or block) |
| use_int8_w8a16=True | INT8 W8A16 (per-output-channel weight scale) |
| per_channel_quant=True | Per-output-channel weight quantization |
| group_k > 0 and group_n > 0 | Block quantization (N×K tile, e.g., 128×128) |
| All flags False | BF16/FP16 unquantized |
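The flag combinations in the table can be sketched as a selection function. This is a hedged illustration of the dispatch logic, not the body of invoke_fused_moe_kernel: the function name and its string-typed dtype arguments are assumptions, while the flag names (use_fp8_w8a8, group_n, group_k, etc.) come from the table above.

```python
def select_kernel_flags(weight_dtype: str, act_dtype: str,
                        block_shape=None, per_channel: bool = False):
    """Illustrative mapping from a quant scheme to constexpr flags.

    block_shape is an assumed (block_n, block_k) tuple; None means
    no block quantization (group_n = group_k = 0).
    """
    return {
        "use_fp8_w8a8": weight_dtype == "fp8" and act_dtype == "fp8",
        "use_int8_w8a8": weight_dtype == "int8" and act_dtype == "int8",
        "use_int8_w8a16": weight_dtype == "int8" and act_dtype == "bf16",
        "per_channel_quant": per_channel,
        "group_n": block_shape[0] if block_shape else 0,
        "group_k": block_shape[1] if block_shape else 0,
    }
```

With all flags false and both group sizes zero, the kernel specializes to the plain BF16/FP16 path.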
fused_moe_kernel_gptq_awq (GPTQ/AWQ)

Sources: vllm/model_executor/layers/fused_moe/fused_moe.py81-312

| constexpr Parameter | Active Quantization Scheme |
|---|---|
| use_int4_w4a16=True | INT4 W4A16 (GPTQ/AWQ INT4 packed) |
| use_int8_w8a16=True | INT8 W8A16 (GPTQ/AWQ INT8) |
| has_zp=True | With zero-point (AWQ asymmetric) |
| has_zp=False | Without zero-point (GPTQ symmetric; substitutes a constant) |
When VLLM_ROCM_USE_AITER_MOE=1 and the platform is ROCm, the AITER library handles MoE computation. The integration lives in vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py.
QuantMethod in rocm_aiter_fused_moe.py maps vLLM quantization types to AITER's internal enum:
| QuantMethod | Integer | Description |
|---|---|---|
| NO | 0 | A16W16 — no quantization |
| PER_TENSOR | 1 | W8A8 per-tensor scale |
| PER_TOKEN | 2 | W8A8/W8A4 per-token scale |
| BLOCK_1X32 | 3 | FP4 block 1×32 |
| BLOCK_1X128 | 4 | W8A8 per-1×128 block |
| BLOCK_128x128 | 5 | W8A8 per-128×128 block |
Sources: vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py30-44
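The table above maps directly onto an integer enum. This sketch mirrors the names and values from the table; it is a standalone illustration, not an import from vLLM or AITER.

```python
from enum import IntEnum

class QuantMethod(IntEnum):
    """Sketch of the vLLM-to-AITER quantization mapping (values per the table)."""
    NO = 0              # A16W16 — no quantization
    PER_TENSOR = 1      # W8A8 per-tensor scale
    PER_TOKEN = 2       # W8A8/W8A4 per-token scale
    BLOCK_1X32 = 3      # FP4 block 1x32
    BLOCK_1X128 = 4     # W8A8 per-1x128 block
    BLOCK_128x128 = 5   # W8A8 per-128x128 block
```

Because it is an IntEnum, the values can be passed straight through to a C-side API expecting plain integers.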
When VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, AITER can fuse shared experts (as in DeepSeek-V3) alongside routed experts in a single kernel call. FusedMoE.__init__ sets self.aiter_fmoe_shared_expert_enabled and self.num_fused_shared_experts accordingly. The init_aiter_topK_meta_data function pre-allocates top-K buffers that cover both routed and shared expert IDs vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py56-106
From vllm/model_executor/layers/fused_moe/layer.py402-407:
AITER MoE only supports gated activations (is_act_and_mul=True, i.e., SiLU/GELU with gate). Non-gated MoE falls back to standard Triton kernels.
Expert masks passed to AITER MoE must contain only 0s and 1s. The expert_mask tensor is constructed by determine_expert_map when return_expert_mask=True vllm/model_executor/layers/fused_moe/layer.py444-454
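The 0/1 constraint is easy to satisfy when each expert-parallel rank owns a contiguous slice of the global experts. The function below is a simplified stand-in for what determine_expert_map(return_expert_mask=True) produces, using plain Python lists instead of tensors; the even-split assumption is for illustration.

```python
def make_expert_mask(num_experts: int, ep_size: int, ep_rank: int):
    """Illustrative binary expert mask: 1 for experts local to ep_rank.

    Assumes num_experts divides evenly across ep_size ranks.
    """
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    end = start + per_rank
    return [1 if start <= e < end else 0 for e in range(num_experts)]
```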
FusedMoEQuantConfig in vllm/model_executor/layers/fused_moe/config.py is the bridge between a quantization method and modular kernel infrastructure. It contains four FusedMoEQuantDesc objects describing activations and weights for both GEMMs:
| Field | Represents |
|---|---|
| a1 | Input activation quantization |
| w1 | Gate/up weight quantization |
| a2 | Intermediate activation quantization |
| w2 | Down weight quantization |
FusedMoEQuantDesc fields:
- dtype — quantized dtype (e.g., torch.float8_e4m3fn, "mxfp4", "nvfp4")
- shape — GroupShape expressing quantization granularity
- scale — scale tensor or PrecisionConfig (GPT OSS Triton)
- alpha_or_gscale — global scales for NVFP4 / W4A8 FP8 per-channel
- zp — zero points for INT4/INT8 asymmetric schemes

Factory functions that build FusedMoEQuantConfig instances:
| Factory | File | Scheme |
|---|---|---|
| fp8_w8a8_moe_quant_config | config.py | FP8 W8A8 |
| int8_w8a8_moe_quant_config | config.py | INT8 W8A8 |
| int8_w8a16_moe_quant_config | config.py | INT8 W8A16 |
| int4_w4a16_moe_quant_config | config.py | INT4 W4A16 |
| mxfp4_mxfp8_moe_quant_config | config.py | MXFP4 weights, MXFP8 compute |
| mxfp4_w4a16_moe_quant_config | config.py | MXFP4 weights, BF16 activations |
| ocp_mx_moe_quant_config | config.py | OCP MX format |
| make_fp8_moe_quant_config | oracle/fp8.py | Oracle-built FP8 config |
| make_nvfp4_moe_quant_config | oracle/nvfp4.py | Oracle-built NVFP4 config |
| make_mxfp4_moe_quant_config | oracle/nvfp4.py | Oracle-built MXFP4 config |
Sources: vllm/model_executor/layers/fused_moe/config.py152-450
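The four-descriptor structure can be captured with dataclasses. This is a simplified mirror of the fields named above, not vLLM's actual FusedMoEQuantConfig/FusedMoEQuantDesc definitions (which carry more validation and helper methods).

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class QuantDescSketch:
    """Simplified FusedMoEQuantDesc: one tensor's quantization recipe."""
    dtype: Any = None            # e.g. torch.float8_e4m3fn, "mxfp4", "nvfp4"
    shape: Any = None            # GroupShape-like granularity descriptor
    scale: Any = None            # scale tensor or PrecisionConfig
    alpha_or_gscale: Any = None  # global scales (NVFP4 / W4A8 FP8)
    zp: Any = None               # zero points for asymmetric schemes

@dataclass
class QuantConfigSketch:
    """Simplified FusedMoEQuantConfig: one descriptor per tensor role."""
    a1: QuantDescSketch  # input activations (GEMM 1)
    w1: QuantDescSketch  # gate/up weights
    a2: QuantDescSketch  # intermediate activations (GEMM 2)
    w2: QuantDescSketch  # down weights
```

An FP8 W8A8 config, for instance, would fill all four descriptors with an FP8 dtype plus the matching scale tensors.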
After all weights are loaded, FusedMoE.maybe_init_modular_kernel is called vllm/model_executor/layers/fused_moe/layer.py677-703:
```
FusedMoE.maybe_init_modular_kernel()
├── if quant_method.supports_internal_mk or is_monolithic: return (no-op)
└── base_quant_method.maybe_make_prepare_finalize(routing_tables)
      → FusedMoEModularMethod.make(layer, base_quant_method, prepare_finalize, ...)
        replaces self.quant_method
```
Methods that set is_monolithic = True handle all aspects of dispatch internally (e.g., GPTQMarlinMoEMethod calls fused_marlin_moe directly from apply). Methods that produce a prepare_finalize object are wrapped by FusedMoEModularMethod, which coordinates with the FusedMoEPrepareAndFinalize and FusedMoEPermuteExpertsUnpermute interfaces described in 7.3.
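The decision above reduces to a small dispatch function. This is a hedged sketch: the real maybe_init_modular_kernel is a method on FusedMoE, and the make_prepare_finalize/wrap callables stand in for maybe_make_prepare_finalize and FusedMoEModularMethod.make respectively.

```python
def maybe_init_modular_kernel(quant_method, make_prepare_finalize, wrap):
    """Sketch of the modular-kernel init decision.

    quant_method: the current FusedMoEMethodBase-like object.
    make_prepare_finalize: stand-in for maybe_make_prepare_finalize().
    wrap: stand-in for FusedMoEModularMethod.make(...).
    """
    # Monolithic or internally-modular methods are left untouched.
    if getattr(quant_method, "supports_internal_mk", False) or \
       getattr(quant_method, "is_monolithic", False):
        return quant_method
    pf = make_prepare_finalize()
    if pf is None:
        return quant_method       # method has no prepare/finalize stage
    return wrap(quant_method, pf)  # replaces layer.quant_method
```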
Additionally, from vllm/model_executor/layers/fused_moe/layer.py622-627 three method classes require the full (pre-sharding) intermediate_size to be passed for activation ordering:
- GPTQMarlinMoEMethod
- CompressedTensorsWNA16MarlinMoEMethod
- CompressedTensorsWNA16MoEMethod

| Variable | Default | Effect |
|---|---|---|
| VLLM_MOE_USE_DEEP_GEMM | True | Enable DeepGEMM for FP8 block-quantized MoE |
| VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER | True | Enable FlashInfer for block-scale FP8 GEMM in MoE |
| VLLM_USE_FLASHINFER_MOE_FP8 | False | Force FlashInfer path for FP8 MoE |
| VLLM_USE_FLASHINFER_MOE_FP4 | False | Force FlashInfer path for FP4 MoE |
| VLLM_USE_FLASHINFER_MOE_FP16 | False | Force FlashInfer path for BF16/FP16 MoE |
| VLLM_USE_FLASHINFER_MOE_INT4 | False | Force FlashInfer path for INT4 MoE |
| VLLM_FLASHINFER_MOE_BACKEND | "latency" | FlashInfer MoE mode: "throughput", "latency", "masked_gemm" |
| VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 | False | Enable FlashInfer TRTLLM MXFP4+MXFP8 path on SM100 |
| VLLM_USE_FLASHINFER_MOE_MXFP4_BF16 | False | Enable FlashInfer MXFP4+BF16 path on SM90 |
| VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS | False | Enable FlashInfer CUTLASS MXFP4+MXFP8 path on SM100 |
| VLLM_MXFP4_USE_MARLIN | None | Force Marlin for MXFP4 (disables FlashInfer/Triton) |
| VLLM_ROCM_USE_AITER_MOE | True | Enable AITER for MoE on ROCm |
| VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS | False | Fuse shared experts into AITER MoE kernel |
| VLLM_ROCM_MOE_PADDING | True | Enable weight padding for ROCm MoE |
| VLLM_USE_TRITON_AWQ | False | Force Triton (bypass Marlin) for AWQ |
| VLLM_MARLIN_USE_ATOMIC_ADD | False | Use atomic add in Marlin MoE kernel |
| VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE | 163840 | Max tokens per expert for FP4 MoE |
| VLLM_FUSED_MOE_CHUNK_SIZE | 16384 | Token chunk size for FusedMoE activation chunking |
| VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING | True | Enable activation chunking (disable if using torch.compile) |
| VLLM_DEEPEPLL_NVFP4_DISPATCH | False | Enable NVFP4 dispatch via DeepEP low-latency path |
Sources: vllm/envs.py95-210
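Boolean flags like those above are typically read with a default-aware parser. The pattern below is illustrative; the exact parsing in vllm/envs.py may accept a different set of truthy strings, and the function name here is an assumption.

```python
import os

def env_flag(name: str, default: bool) -> bool:
    """Illustrative boolean env-flag reader with a default value."""
    val = os.environ.get(name)
    if val is None:
        return default          # unset: fall back to the table's default
    return val.strip().lower() in ("1", "true", "yes")
```

Setting VLLM_USE_TRITON_AWQ=1, for example, would flip that flag from its False default.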