This page covers vLLM's quantization infrastructure and Mixture-of-Experts (MoE) kernel system. It explains the quantization method registry, the FP8 linear and MoE pipelines, the modular MoE kernel abstraction, and how backend selection is performed at runtime.
For the linear layer and normalization implementations that host quantized weights, see Linear Layers and Normalization. For distributed expert parallelism configuration, see Parallelism Strategies. For the general attention backend system, see Attention Backends.
Every quantization scheme implements the QuantizationConfig abstract base class (vllm/model_executor/layers/quantization/base_config.py) and is registered in a central registry. When a model is loaded, the config's get_quant_method is called per layer to return a QuantizeMethodBase that knows how to create_weights, process_weights_after_loading, and apply the kernel.
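The registry-plus-per-layer dispatch can be sketched as follows. The names below (`_REGISTRY`, `register_quant_config`, `DemoFp8Config`, `UnquantizedMethod`) are simplified stand-ins for illustration, not vLLM's actual classes or signatures:

```python
from __future__ import annotations
from abc import ABC, abstractmethod

# Simplified stand-ins illustrating the registry + per-layer dispatch
# pattern; names and signatures here are NOT vLLM's actual API.
_REGISTRY: dict[str, type[QuantizationConfig]] = {}

def register_quant_config(name: str):
    def wrap(cls: type[QuantizationConfig]):
        _REGISTRY[name] = cls
        return cls
    return wrap

class QuantizeMethodBase(ABC):
    @abstractmethod
    def create_weights(self, layer: dict) -> None: ...
    @abstractmethod
    def process_weights_after_loading(self, layer: dict) -> None: ...
    @abstractmethod
    def apply(self, layer: dict, x: list) -> list: ...

class QuantizationConfig(ABC):
    @abstractmethod
    def get_quant_method(self, layer: dict, prefix: str) -> QuantizeMethodBase | None: ...

class UnquantizedMethod(QuantizeMethodBase):
    def create_weights(self, layer): layer["weights"] = "bf16"
    def process_weights_after_loading(self, layer): pass
    def apply(self, layer, x): return x

@register_quant_config("fp8_demo")
class DemoFp8Config(QuantizationConfig):
    ignored_layers = ("lm_head",)
    def get_quant_method(self, layer, prefix):
        # Returning None means this layer is skipped (stays unquantized).
        if any(prefix.startswith(p) for p in self.ignored_layers):
            return None
        return UnquantizedMethod()

# At model load time: look up the config by name, then dispatch per layer.
config = _REGISTRY["fp8_demo"]()
method = config.get_quant_method({}, prefix="model.layers.0.mlp.gate_up_proj")
```

The key property is that the config object is global to the model, while the returned method object is per-layer, so different layers (e.g. those in `ignored_layers`) can receive different treatment.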
Quantization method dispatch diagram:
Sources: vllm/model_executor/layers/quantization/fp8.py184-216 vllm/model_executor/layers/quantization/modelopt.py182-217
| Method | Config Class | Key File |
|---|---|---|
fp8 | Fp8Config | vllm/model_executor/layers/quantization/fp8.py |
modelopt | ModelOptFp8Config / ModelOptNvFp4Config | vllm/model_executor/layers/quantization/modelopt.py |
compressed_tensors | CompressedTensorsConfig | vllm/model_executor/layers/quantization/compressed_tensors/ |
gptq_marlin | GPTQMarlinConfig | vllm/model_executor/layers/quantization/gptq_marlin.py |
awq_marlin | AWQMarlinConfig | vllm/model_executor/layers/quantization/awq_marlin.py |
bitsandbytes | BitsAndBytesConfig | vllm/model_executor/layers/quantization/bitsandbytes.py |
gguf | GGUFConfig | vllm/model_executor/layers/quantization/gguf.py |
mxfp4 | Mxfp4Config | vllm/model_executor/layers/quantization/mxfp4.py |
moe_wna16 | — | vllm/model_executor/layers/quantization/moe_wna16.py |
Sources: vllm/model_executor/layers/quantization/fp8.py109-148 vllm/model_executor/layers/quantization/modelopt.py106-120 vllm/model_executor/layers/quantization/mxfp4.py66-83
Fp8Config (vllm/model_executor/layers/quantization/fp8.py109-237) controls:
- activation_scheme: "static" or "dynamic" — whether activation scales are pre-computed or computed per-token at runtime.
- is_checkpoint_fp8_serialized: True if weights are stored as FP8 in the checkpoint.
- weight_block_size: enables block-wise quantization (e.g. [128, 128]). Requires is_checkpoint_fp8_serialized=True and activation_scheme="dynamic".
- ignored_layers: list of layer name prefixes to skip.

Two linear method classes handle the loading cases:

| Class | Use case |
|---|---|
Fp8LinearMethod | Loads FP8-serialized checkpoint with static weight scale |
Fp8OnlineLinearMethod | Loads BF16/FP16 checkpoint, quantizes weights during loading |
For GPUs without native FP8 hardware support (compute capability below 8.9) or when VLLM_TEST_FORCE_FP8_MARLIN=1, both methods fall back to the Marlin FP8 kernel (apply_fp8_marlin_linear). ROCm and XPU platforms also skip Marlin.
Sources: vllm/model_executor/layers/quantization/fp8.py269-330 vllm/model_executor/layers/quantization/fp8.py526-648
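The difference between the two activation schemes comes down to when the scale is computed: once ahead of time ("static") or from each batch's max ("dynamic"). A pure-Python sketch of per-tensor E4M3 scaling (448.0 is the largest finite value in the float8_e4m3fn format; the helper names are ours, not vLLM's):

```python
# Sketch of per-tensor FP8 (E4M3) scale arithmetic behind the "static"
# (pre-computed) vs "dynamic" (per-batch) activation schemes.
# Pure-Python illustration; vLLM performs this on GPU tensors.
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def compute_fp8_scale(values: list[float]) -> float:
    """Scale such that values / scale fits the FP8 dynamic range."""
    amax = max(abs(v) for v in values)
    return max(amax, 1e-12) / FP8_E4M3_MAX

def quantize(values: list[float], scale: float) -> list[float]:
    # Clamp to the representable range after scaling (rounding omitted).
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]

acts = [0.5, -3.2, 896.0, 1.0]
scale = compute_fp8_scale(acts)  # "dynamic": recomputed per batch
q = quantize(acts, scale)
```

Under the static scheme the same computation runs once during calibration and the resulting scale is stored in the checkpoint.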
For block-wise FP8 (e.g. DeepSeek-V3), W8A8BlockFp8LinearOp (vllm/model_executor/layers/quantization/utils/fp8_utils.py347) dispatches to the appropriate backend:
Block FP8 GEMM backend selection diagram:
Sources: vllm/model_executor/layers/quantization/utils/fp8_utils.py347-420 vllm/utils/deep_gemm.py68-77
The FlashInfer path uses torch.cond() for compile compatibility, selecting DeepGEMM swapAB for M < 32 and standard DeepGEMM for M >= 32 (vllm/model_executor/layers/quantization/utils/fp8_utils.py239-320).
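For block-wise quantization, each [128, 128] weight block carries one float scale, so the scale tensor's shape is the ceil-divided weight shape. A sketch of the arithmetic (function name is ours; the DeepSeek-V3-like dimensions are just an example):

```python
import math

# Sketch: scale-tensor geometry for block-wise FP8 with
# weight_block_size [128, 128]. One float scale covers each 128x128
# weight block, so the scale grid is the ceil-divided weight shape.
def block_scale_shape(n: int, k: int, block=(128, 128)) -> tuple[int, int]:
    bn, bk = block
    return (math.ceil(n / bn), math.ceil(k / bk))

# DeepSeek-V3-like projection: [N=7168, K=2048] -> a 56 x 16 scale grid.
shape = block_scale_shape(7168, 2048)
```

Ceil division matters when a dimension is not a multiple of the block size: the trailing partial block still needs its own scale.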
DeepGEMM supports three scale formats controlled by DeepGemmQuantScaleFMT (vllm/utils/deep_gemm.py27-66):
| Format | Description |
|---|---|
FLOAT32 | Standard float32 scales |
FLOAT32_CEIL_UE8M0 | Float32 scales rounded up to UE8M0 (Hopper) |
UE8M0 | Packed UE8M0 (4 values per int32) (Blackwell) |
Controlled by VLLM_USE_DEEP_GEMM_E8M0 env var.
Sources: vllm/utils/deep_gemm.py27-77
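The FLOAT32_CEIL_UE8M0 behavior can be illustrated with scalar arithmetic: UE8M0 stores only an unsigned 8-bit exponent, i.e. a power-of-two scale, and rounding up (rather than to nearest) guarantees the scaled values never overflow the quantized range. A sketch (function name is ours):

```python
import math

# Sketch of the FLOAT32_CEIL_UE8M0 idea: round a float32 scale UP to the
# nearest power of two, since UE8M0 encodes only an 8-bit exponent.
# Illustrative scalar version; DeepGEMM applies this to scale tensors.
def ceil_to_ue8m0(scale: float) -> float:
    assert scale > 0
    return 2.0 ** math.ceil(math.log2(scale))

s1 = ceil_to_ue8m0(0.3)  # rounds up to 0.5 (2**-1)
s2 = ceil_to_ue8m0(4.0)  # already a power of two, unchanged
```

The packed UE8M0 format listed above then stores four such exponents per int32.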
FusedMoE (vllm/model_executor/layers/fused_moe/layer.py274) is the top-level CustomOp representing a full MoE layer. It owns:
- w13 (gate+up) and w2 (down) weights, in shape [num_experts, N, K]
- FusedMoEConfig — static layer config
- FusedMoEParallelConfig — parallelism config (TP, EP, DP)
- quant_method: FusedMoEMethodBase — the active quantization strategy
- router — top-K router
- runner: DefaultMoERunner — orchestrates execution

FusedMoE internal structure:
Sources: vllm/model_executor/layers/fused_moe/layer.py274-666 vllm/model_executor/layers/fused_moe/config.py152
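The router's job, per token, is top-K selection over expert logits with renormalized routing weights. A minimal pure-Python sketch (the function name is ours; vLLM's routing additionally supports grouped top-K and makes renormalization optional):

```python
import math

# Sketch of top-K softmax routing: per token, pick the K highest-scoring
# experts and renormalize their probabilities to sum to 1.
def topk_route(logits: list[float], k: int) -> list[tuple[int, float]]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]  # (expert_id, routing_weight)

routes = topk_route([1.0, 3.0, 2.0, 0.5], k=2)
```

The returned (expert_id, weight) pairs are what the prepare stage uses to dispatch tokens and the finalize stage uses to weight the combined expert outputs.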
The MoE execution pipeline is decomposed into three independent stages via abstract interfaces in modular_kernel.py (vllm/model_executor/layers/fused_moe/modular_kernel.py):
[Router] → [FusedMoEPrepareAndFinalize.prepare] → [FusedMoEPermuteExpertsUnpermute] → [FusedMoEPrepareAndFinalize.finalize]
| Interface | Role |
|---|---|
FusedMoEPrepareAndFinalize | Quantizes inputs, dispatches tokens (e.g. all2all), and gathers results |
FusedMoEPermuteExpertsUnpermute | Runs the actual expert GEMMs (permute → GEMM → unpermute) |
FusedMoEModularKernel | Combines a prepare/finalize and a permute/unpermute into one callable |
Modular kernel composition:
Sources: vllm/model_executor/layers/fused_moe/modular_kernel.py44-80
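The composition can be sketched schematically with identity stand-ins for the real dispatch/combine and grouped-GEMM logic (class names simplified from the interfaces above):

```python
# Schematic of how a modular kernel composes the two stages into one
# callable; identity stand-ins replace real quantize/dispatch and GEMMs.
class PrepareAndFinalize:
    def prepare(self, tokens):
        # real impl: quantize inputs, dispatch tokens (e.g. all2all in EP)
        return tokens
    def finalize(self, expert_out):
        # real impl: gather/combine results across ranks
        return expert_out

class PermuteExpertsUnpermute:
    def apply(self, tokens):
        # stand-in for permute -> grouped expert GEMMs -> unpermute
        return [2 * t for t in tokens]

class ModularKernel:
    def __init__(self, prepare_finalize, experts):
        self.prepare_finalize = prepare_finalize
        self.experts = experts
    def __call__(self, tokens):
        x = self.prepare_finalize.prepare(tokens)
        y = self.experts.apply(x)
        return self.prepare_finalize.finalize(y)

kernel = ModularKernel(PrepareAndFinalize(), PermuteExpertsUnpermute())
out = kernel([1.0, 2.0])
```

Because the two stages only meet at this interface, any prepare/finalize implementation from the first table below can be paired with any experts implementation from the second.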
| Class | Backend | File |
|---|---|---|
MoEPrepareAndFinalizeNoEP | Single-GPU / TP-only | prepare_finalize.py |
NaiveEPDispatcher | All-reduce + scatter | all2all_utils.py |
DeepEPLLPrepareAndFinalize | DeepEP low-latency all2all | deep_ep_ll.py |
DeepEPHTDispatcher | DeepEP high-throughput all2all | deep_ep_ht.py |
| Class | Backend | Quantization |
|---|---|---|
TritonExperts | Triton fused kernel | FP8, INT8, W4A16, W8A16, unquantized |
CutlassExperts | CUTLASS grouped GEMM | FP8 |
DeepGemmExperts | DeepGEMM contiguous grouped GEMM | FP8 block |
BatchedDeepGemmExperts | DeepGEMM masked grouped GEMM | FP8 block |
MarlinExperts | Marlin kernel | INT4/FP4/FP8 |
Sources: vllm/model_executor/layers/fused_moe/cutlass_moe.py vllm/model_executor/layers/fused_moe/deep_gemm_moe.py vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py
Each quantization config supplies a FusedMoEMethodBase subclass for FusedMoE layers.
MoE quant method class hierarchy:
Sources: vllm/model_executor/layers/quantization/fp8.py650 vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py113-239 vllm/model_executor/layers/quantization/gptq_marlin.py vllm/model_executor/layers/quantization/mxfp4.py250
Fp8MoEMethod (serialized checkpoints) and Fp8OnlineMoEMethod (BF16 checkpoints, online quantize) call select_fp8_moe_backend to pick between:
| Backend enum | Kernel |
|---|---|
Fp8MoeBackend.TRITON | Triton fused_moe_kernel |
Fp8MoeBackend.CUTLASS | run_cutlass_moe_fp8 |
Fp8MoeBackend.DEEP_GEMM | DeepGemmExperts |
Fp8MoeBackend.FLASHINFER_TRTLLM | FlashInfer TensorRT-LLM fused MoE |
Fp8MoeBackend.FLASHINFER_CUTLASS | FlashInfer CUTLASS grouped GEMM |
Sources: vllm/model_executor/layers/quantization/fp8.py650 vllm/model_executor/layers/fused_moe/oracle/fp8.py
CompressedTensorsMoEMethod.get_moe_method (vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py125-239) is a factory that inspects the weight/activation quantization scheme and returns the most appropriate subclass:
| Scheme detected | Method class |
|---|---|
| MXFP4 weights | CompressedTensorsW4A4Mxfp4MoEMethod |
| WNA16 packed | CompressedTensorsWNA16MarlinMoEMethod or CompressedTensorsWNA16MoEMethod |
| NVFP4 | CompressedTensorsW4A4Nvfp4MoEMethod |
| FP8 W8A8 | CompressedTensorsW8A8Fp8MoEMethod |
| INT8 W8A8 | CompressedTensorsW8A8Int8MoEMethod |
| FP8 W4A8 | CompressedTensorsW4A8Fp8MoEMethod |
| INT8 W4A8 | CompressedTensorsW4A8Int8MoEMethod |
Marlin is preferred for WNA16 when check_moe_marlin_supports_layer passes and the platform is not ROCm.
Sources: vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py156-239
Mxfp4Config.get_quant_method (vllm/model_executor/layers/quantization/mxfp4.py211-247) creates Mxfp4MoEMethod, which selects a backend via get_mxfp4_backend:
| Mxfp4Backend enum | Condition |
|---|---|
SM100_FI_MXFP4_MXFP8_CUTLASS | Blackwell + FlashInfer + VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 |
SM100_FI_MXFP4_MXFP8_TRTLLM | Blackwell + FlashInfer + VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 |
SM100_FI_MXFP4_BF16 | Blackwell + FlashInfer |
SM90_FI_MXFP4_BF16 | Hopper + FlashInfer + VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 |
TRITON | SM90/SM100, triton_kernels available |
MARLIN | fallback / VLLM_MXFP4_USE_MARLIN=1 |
CK | ROCm gfx950 + AITER enabled |
Sources: vllm/model_executor/layers/quantization/mxfp4.py108-183
When expert parallelism (EP) is enabled, FusedMoE.__init__ calls determine_expert_map (vllm/model_executor/layers/fused_moe/layer.py67-153) to produce:
- local_num_experts: experts on this rank.
- expert_map: tensor of shape (global_num_experts,) mapping global → local index; -1 for experts not on this rank.
- expert_mask (ROCm AITER only): binary mask of same length.

Two placement strategies are supported via ExpertPlacementStrategy:
| Strategy | Description |
|---|---|
"linear" | Consecutive blocks of experts per rank |
"round_robin" | Interleaved distribution, requires DeepEP low-latency backend |
The modular kernel receives expert_map so GEMM kernels can write zeros for absent experts.
Sources: vllm/model_executor/layers/fused_moe/layer.py67-153 vllm/model_executor/layers/fused_moe/layer.py424-474
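A pure-Python sketch of what the two strategies produce (make_expert_map is an illustrative stand-in for determine_expert_map, and even divisibility of experts by EP size is assumed):

```python
# Sketch of expert-map construction for the two placement strategies;
# illustrative stand-in, not vLLM's actual determine_expert_map.
def make_expert_map(global_num_experts: int, ep_size: int, ep_rank: int,
                    strategy: str = "linear"):
    expert_map = [-1] * global_num_experts  # -1 = expert not on this rank
    if strategy == "linear":
        per_rank = global_num_experts // ep_size  # assumes even division
        start = ep_rank * per_rank
        local_ids = list(range(start, start + per_rank))
    elif strategy == "round_robin":
        local_ids = list(range(ep_rank, global_num_experts, ep_size))
    else:
        raise ValueError(strategy)
    for local, g in enumerate(local_ids):
        expert_map[g] = local  # global id -> local slot
    return len(local_ids), expert_map

# 8 experts over 2 EP ranks, viewed from rank 1:
n_local, emap = make_expert_map(8, 2, 1, "linear")        # experts 4..7 local
n_rr, emap_rr = make_expert_map(8, 2, 1, "round_robin")   # experts 1,3,5,7 local
```

Either way, the -1 entries are exactly the positions for which the GEMM kernels write zeros, as noted above.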
fused_moe_kernel (vllm/model_executor/layers/fused_moe/fused_moe.py314-574) and fused_moe_kernel_gptq_awq (vllm/model_executor/layers/fused_moe/fused_moe.py81-311) are the core Triton kernels. Both implement the standard "sorted token" MoE pattern:
Tokens are sorted by expert assignment so that each program instance processes a BLOCK_SIZE_M chunk of sorted tokens belonging to a single expert. When the mapped expert index is -1 (the expert is not local under EP), zeros are written to the output. The kernel accepts use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, use_int4_w4a16, per_channel_quant, group_k, and group_n compile-time constants for specialization.
Sources: vllm/model_executor/layers/fused_moe/fused_moe.py314-574 vllm/model_executor/layers/fused_moe/fused_moe.py81-311
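The alignment step feeding these kernels can be sketched in pure Python: bucket token indices by expert and pad each bucket to a multiple of BLOCK_SIZE_M so no block straddles two experts (the helper name is ours; -1 marks padding slots):

```python
# Pure-Python sketch of the "sorted token" pattern: sort token indices by
# assigned expert and pad each expert's run to a multiple of BLOCK_SIZE_M,
# so every kernel program works on exactly one expert's tokens.
BLOCK_SIZE_M = 4
PAD = -1

def align_tokens_by_expert(topk_ids: list[int], num_experts: int):
    buckets: list[list[int]] = [[] for _ in range(num_experts)]
    for tok, e in enumerate(topk_ids):
        buckets[e].append(tok)
    sorted_ids, expert_per_block = [], []
    for e, toks in enumerate(buckets):
        if not toks:
            continue
        while len(toks) % BLOCK_SIZE_M:
            toks.append(PAD)  # pad so blocks never straddle experts
        sorted_ids.extend(toks)
        expert_per_block.extend([e] * (len(toks) // BLOCK_SIZE_M))
    return sorted_ids, expert_per_block

# 5 token slots routed to 2 experts: tokens 0,2 -> expert 0; 1,3,4 -> expert 1
ids, experts = align_tokens_by_expert([0, 1, 0, 1, 1], num_experts=2)
```

Each BLOCK_SIZE_M slice of the sorted index list, together with its per-block expert id, is then one unit of work for the Triton grid.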
ModelOptQuantConfigBase (vllm/model_executor/layers/quantization/modelopt.py132-218) is the base for NVIDIA ModelOpt-exported checkpoints. Concrete configs:
| Config | Algorithms |
|---|---|
ModelOptFp8Config | FP8, FP8_PER_CHANNEL_PER_TOKEN, FP8_PB_WO |
ModelOptNvFp4Config | NVFP4 |
ModelOptMxFp8Config | MXFP8 |
The exclude_modules list supports wildcard patterns (fnmatch) and also handles the language_model. prefix convention used by some HF checkpoints (vllm/model_executor/layers/quantization/modelopt.py144-180).
Sources: vllm/model_executor/layers/quantization/modelopt.py106-362
These env vars control quantization and MoE behavior at runtime:
| Variable | Default | Effect |
|---|---|---|
VLLM_USE_DEEP_GEMM | True | Enable DeepGEMM library for FP8 GEMMs |
VLLM_MOE_USE_DEEP_GEMM | True | Enable DeepGEMM for MoE GEMMs |
VLLM_USE_DEEP_GEMM_E8M0 | True | Use UE8M0 scale format with DeepGEMM |
VLLM_USE_DEEP_GEMM_TMA_ALIGNED_SCALES | True | TMA-align scale tensors |
VLLM_DEEP_GEMM_WARMUP | "relax" | Warmup mode: skip, full, or relax |
VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER | True | Allow FlashInfer for block-scale FP8 GEMM |
VLLM_USE_FLASHINFER_MOE_FP8 | False | Use FlashInfer for FP8 MoE |
VLLM_USE_FLASHINFER_MOE_FP4 | False | Use FlashInfer for FP4 MoE |
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 | False | FlashInfer MXFP4/MXFP8 TRT-LLM path |
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS | False | FlashInfer MXFP4/MXFP8 CUTLASS path |
VLLM_USE_FLASHINFER_MOE_MXFP4_BF16 | False | FlashInfer MXFP4 BF16 path (SM90) |
VLLM_FLASHINFER_MOE_BACKEND | "latency" | FlashInfer MoE backend: throughput, latency, masked_gemm |
VLLM_MXFP4_USE_MARLIN | None | Force Marlin backend for MXFP4 |
VLLM_MARLIN_USE_ATOMIC_ADD | False | Use atomic add in Marlin kernels |
VLLM_MARLIN_INPUT_DTYPE | None | Override Marlin input dtype (int8 or fp8) |
VLLM_FUSED_MOE_CHUNK_SIZE | 16384 | Chunk size for fused MoE activation |
VLLM_ENABLE_FUSED_MOE_ACTIVATION_CHUNKING | True | Enable activation chunking in MoE |
VLLM_USE_FUSED_MOE_GROUPED_TOPK | True | Use fused grouped top-K routing |
VLLM_USE_TRITON_AWQ | False | Use Triton for AWQ kernels |
VLLM_TEST_FORCE_FP8_MARLIN | False | Force Marlin fallback for FP8 |
VLLM_ROCM_USE_AITER_MOE | True | Use AITER fused MoE on ROCm |
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS | False | Fuse shared experts in AITER MoE |
VLLM_ROCM_MOE_PADDING | True | Enable MoE padding on ROCm |
VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE | 163840 | Max tokens per expert for FP4 MoE |
VLLM_NVFP4_GEMM_BACKEND | None | Override NVFP4 GEMM backend |
Sources: vllm/envs.py56-57 vllm/envs.py95-96 vllm/envs.py101-118 vllm/envs.py145-168 vllm/envs.py153-156
The following diagram shows the full path from a FusedMoE.forward call through quantization dispatch to kernel execution:
Sources: vllm/model_executor/layers/fused_moe/modular_kernel.py44-80 vllm/model_executor/layers/fused_moe/layer.py647-702
The FusedMoeWeightScaleSupported enum (vllm/model_executor/layers/fused_moe/layer.py60-64) captures the scale granularities supported by MoE kernels:
| Value | Meaning |
|---|---|
TENSOR | Single scale per expert weight tensor |
CHANNEL | One scale per output channel |
GROUP | Block/group-wise scale (e.g. every 128 elements) |
BLOCK | 2D block-wise scale |
These are used by quant_method.create_weights to register the appropriate PerTensorScaleParameter, ChannelQuantScaleParameter, or BlockQuantScaleParameter.
Sources: vllm/model_executor/layers/fused_moe/layer.py60-64 vllm/model_executor/layers/quantization/fp8.py332-394
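For one expert weight of shape [N, K], the four granularities imply very different scale-tensor sizes; a sketch of the arithmetic (group and block sizes of 128 assumed for illustration):

```python
import math

# Sketch: number of scale values each granularity implies for one expert
# weight of shape [N, K]. Illustrative arithmetic for the enum above.
def num_scales(n: int, k: int, granularity: str,
               group_size: int = 128, block=(128, 128)) -> int:
    if granularity == "TENSOR":
        return 1
    if granularity == "CHANNEL":
        return n  # one scale per output channel
    if granularity == "GROUP":
        return n * math.ceil(k / group_size)  # groups along the K dim
    if granularity == "BLOCK":
        return math.ceil(n / block[0]) * math.ceil(k / block[1])
    raise ValueError(granularity)

counts = {g: num_scales(512, 256, g)
          for g in ("TENSOR", "CHANNEL", "GROUP", "BLOCK")}
```

These counts are what determine the shapes of the scale parameters that create_weights registers.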