This page documents the FP8 (8-bit floating point) quantization infrastructure in vLLM, covering per-token-group activation quantization, per-block weight quantization, scale factor formats (including UE8M0), and the multi-backend GEMM dispatch mechanism used for block-quantized linear layers. The central dispatch class is W8A8BlockFp8LinearOp, which selects between CUTLASS, DeepGEMM, Triton, FlashInfer, and ROCm AITER backends.
For how FP8 quantization is applied to MoE expert layers specifically, see 7.3 and 7.4. For the general quantization method registry and how Fp8LinearMethod is wired into the model loading pipeline, see 7.1.
vLLM uses platform-appropriate FP8 dtypes:
| Platform | FP8 dtype | Notes |
|---|---|---|
| CUDA | torch.float8_e4m3fn | Standard FP8, max value 448.0 |
| ROCm | torch.float8_e4m3fnuz | Unsigned-zero variant, max value 224.0 |
The is_fp8 helper in vllm/model_executor/layers/quantization/utils/fp8_utils.py53-56 checks for either dtype. The platform-correct dtype is retrieved via current_platform.fp8_dtype().
Block-FP8 (W8A8) in vLLM uses two distinct quantization granularities for weights and activations:
| Tensor | Granularity | Typical block shape | Scale tensor shape |
|---|---|---|---|
| Weights | Per-block (2D tile) | [128, 128] (N×K) | (num_N_tiles, num_K_tiles) |
| Activations | Per-token-group (1D row group) | [1, 128] (one group per row segment) | (M, K//128) |
This matches the format required by models such as DeepSeek-V3, which uses 128-element groups in both dimensions.
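Under these granularities the scale-tensor shapes follow directly from ceiling division by the block size. A minimal sketch (the helper names are hypothetical, not vLLM APIs):

```python
import math

def weight_scale_shape(n: int, k: int, block_n: int = 128, block_k: int = 128):
    """Per-block weight scales: one scale per (block_n x block_k) tile."""
    return (math.ceil(n / block_n), math.ceil(k / block_k))

def activation_scale_shape(m: int, k: int, group_size: int = 128):
    """Per-token-group activation scales: one scale per group_size-wide row segment."""
    return (m, math.ceil(k / group_size))

# DeepSeek-V3-style 128-element groups; example projection N=4096, K=7168, 16 tokens
print(weight_scale_shape(4096, 7168))    # (32, 56)
print(activation_scale_shape(16, 7168))  # (16, 56)
```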
Three scale factor formats are used, selected by DeepGemmQuantScaleFMT based on hardware and configuration:
DeepGemmQuantScaleFMT enum — vllm/utils/deep_gemm.py27-65
| Enum value | Description | Used when |
|---|---|---|
FLOAT32 | Float32 scale tensor, standard layout | DeepGEMM disabled, or Triton/CUTLASS backends |
FLOAT32_CEIL_UE8M0 | Float32 tensor, values ceiled to powers of 2 | Hopper (SM90), VLLM_USE_DEEP_GEMM_E8M0=1 |
UE8M0 | Int32 tensor, 4 UE8M0 scale values packed per element | Blackwell (SM100+), VLLM_USE_DEEP_GEMM_E8M0=1 |
The oracle decision is cached at startup via DeepGemmQuantScaleFMT.init_oracle_cache(). The helper is_deep_gemm_e8m0_used() vllm/utils/deep_gemm.py81-104 returns True when the UE8M0 path is active.
UE8M0 scale encoding: Each scale value s is stored as 2^ceil(log2(s)) — the next power-of-two ceiling of s. This enables efficient hardware-level dequantization on Blackwell. The conversion appears in the Triton kernel:
```python
y_s = tl.math.exp2(tl.ceil(tl.log2(scale_raw))) if use_ue8m0 else scale_raw
```
vllm/model_executor/layers/quantization/utils/fp8_utils.py649
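The same encoding can be reproduced in plain Python (hypothetical helper name; the real conversion runs inside the Triton kernel):

```python
import math

def ceil_to_ue8m0(s: float) -> float:
    """Round a positive scale up to the next power of two (UE8M0 ceiling).
    Mirrors the exp2(ceil(log2(s))) expression in the Triton kernel."""
    return 2.0 ** math.ceil(math.log2(s))

print(ceil_to_ue8m0(0.3))  # 0.5
print(ceil_to_ue8m0(1.0))  # 1.0
print(ceil_to_ue8m0(3.0))  # 4.0
```

Because the stored scale is never smaller than the true scale, quantized values can only shrink, so the encoding is lossy but safe against FP8 overflow.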
The memory layout of activation scale tensors varies by backend:
Diagram: Scale tensor layout variants for (M=4, K=256, group_size=128) → 2 groups per row
The layout is controlled by parameters to per_token_group_quant_fp8:
- column_major_scales=False → row-major (Triton, default)
- column_major_scales=True, tma_aligned_scales=False → column-major (CUTLASS)
- column_major_scales=True, tma_aligned_scales=True → TMA-aligned (DeepGEMM Hopper)

Sources: vllm/model_executor/layers/quantization/utils/fp8_utils.py857-981
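To make the layout difference concrete, here is a hypothetical index helper showing where scale (row, group) lands in memory under each layout (TMA alignment, which pads strides, is omitted):

```python
def flat_index(row: int, group: int, m: int, g: int, column_major: bool) -> int:
    """Flat memory offset of scale (row, group) in an (M x G) scale grid.
    Row-major: a token's group scales are contiguous.
    Column-major: one group's scales across all tokens are contiguous,
    which is the layout CUTLASS expects."""
    return group * m + row if column_major else row * g + group

M, G = 4, 2  # 4 tokens, K=256 with group_size=128 -> 2 groups per row
print(flat_index(1, 1, M, G, column_major=False))  # 3
print(flat_index(1, 1, M, G, column_major=True))   # 5
```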
QuantFP8
QuantFP8 is a CustomOp that quantizes input activations to FP8. It supports per-tensor, per-token, per-channel, and per-group quantization modes.
Key constructor parameters:
| Parameter | Type | Effect |
|---|---|---|
static | bool | Static (pre-computed) vs. dynamic (online) scaling |
group_shape | GroupShape | PER_TOKEN, PER_TENSOR, or (1, group_size) for block |
column_major_scales | bool | Output scales in column-major layout |
tma_aligned_scales | bool | Apply TMA stride alignment (DeepGEMM Hopper) |
use_ue8m0 | bool | Use power-of-2 scale ceiling encoding |
Diagram: QuantFP8.forward_cuda dispatch logic
Sources: vllm/model_executor/layers/quantization/input_quant_fp8.py83-131
per_token_group_quant_fp8
vllm/model_executor/layers/quantization/utils/fp8_utils.py857-981
Quantizes a 2D+ tensor along the last dimension in groups of group_size. Produces an FP8 tensor of the same shape and a scale tensor whose shape depends on the layout parameter.
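A pure-Python reference of the quantization math (illustrative only; quant_row is a hypothetical name, and the real kernel casts to an FP8 tensor rather than keeping floats):

```python
# Reference for per-token-group quantization over one row.
FP8_E4M3_MAX = 448.0  # max representable value of float8_e4m3fn

def quant_row(row, group_size=128, eps=1e-10):
    """Quantize one row in groups of `group_size`; returns (q_row, scales)."""
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        amax = max(abs(v) for v in group)
        s = max(amax, eps) / FP8_E4M3_MAX  # scale so the group fits the FP8 range
        scales.append(s)
        q.extend(v / s for v in group)     # the real kernel casts to FP8 here
    return q, scales

q, s = quant_row([224.0] * 128 + [448.0] * 128)
print(s)       # [0.5, 1.0]
print(max(q))  # 448.0
```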
Execution path:
- torch.ops._C.per_token_group_fp8_quant (custom C++ kernel)
- _per_token_group_quant_fp8 (row-major scales) or _per_token_group_quant_fp8_colmajor (column-major scales)

per_token_group_quant_fp8_packed_for_deepgemm
vllm/model_executor/layers/quantization/utils/fp8_utils.py984-1055
Produces UE8M0-packed int32 scale tensors with TMA-aligned strides for the Blackwell DeepGEMM path. Requires use_ue8m0=True. Calls torch.ops._C.per_token_group_fp8_quant_packed.
silu_mul_per_token_group_quant_fp8_colmajor
vllm/model_executor/layers/quantization/utils/fp8_utils.py727-790
A fused Triton kernel that performs SiLU activation, elementwise multiply (gate), and per-group FP8 quantization in a single pass. Used inside MoE layers between the gate/up projection and the down projection.
per_block_cast_to_fp8
vllm/utils/deep_gemm.py360-380
Quantizes a 2D weight tensor to FP8 using 2D block-wise scales. This is performed offline at weight loading time.
```python
x_view = x_padded.view(-1, block_m, n_blocks, block_n)  # tile into (block_m x block_n) blocks
x_amax = x_view.abs().float().amax(dim=(1, 3), keepdim=True).clamp(1e-4)  # per-block amax, clamped
sf = x_amax / fp8_max                         # one scale factor per block
sf = _ceil_to_ue8m0(sf) if use_ue8m0 else sf  # optional power-of-2 ceiling
x_scaled = (x_view * (1.0 / sf)).to(fp8_dtype)  # scale and cast to FP8
```
The output scale shape is (M // block_m, N // block_n).
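The same per-block scale computation, sketched in plain Python on a toy 2x2 weight with the 1e-4 amax clamp (block_scales is a hypothetical helper, not the vLLM API):

```python
# Minimal sketch of 2D block-wise weight scaling (illustrative only).
FP8_MAX = 448.0  # float8_e4m3fn max

def block_scales(w, block_m, block_n):
    """One scale per (block_m x block_n) tile of a 2D weight `w` (list of lists)."""
    m, n = len(w), len(w[0])
    scales = []
    for i in range(0, m, block_m):
        row = []
        for j in range(0, n, block_n):
            amax = max(abs(w[r][c])
                       for r in range(i, min(i + block_m, m))
                       for c in range(j, min(j + block_n, n)))
            row.append(max(amax, 1e-4) / FP8_MAX)  # clamp avoids zero scales
        scales.append(row)
    return scales

w = [[448.0, 224.0], [224.0, 112.0]]
print(block_scales(w, 1, 1))  # [[1.0, 0.5], [0.5, 0.25]]
print(block_scales(w, 2, 2))  # [[1.0]]
```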
W8A8BlockFp8LinearOp
W8A8BlockFp8LinearOp is the central class for running block-quantized FP8 matrix multiplications in linear layers.
vllm/model_executor/layers/quantization/utils/fp8_utils.py347-585
Diagram: W8A8BlockFp8LinearOp structure and backend methods
Diagram: apply() dispatch flowchart
Sources: vllm/model_executor/layers/quantization/utils/fp8_utils.py388-585
| Backend | Condition |
|---|---|
| FlashInfer + DeepGEMM | FlashInfer available, DeepGEMM N/K compatible, output bfloat16, SM90+ |
| DeepGEMM only | VLLM_USE_DEEP_GEMM=1, SM90 or SM100+, N%64==0, K%128==0, output bfloat16 |
| CUTLASS | CUTLASS_BLOCK_FP8_SUPPORTED (CUDA, checked at import) |
| ROCm AITER | ROCm platform with AITER enabled |
| Triton | Default fallback on all platforms |
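The selection order in the table can be sketched as a plain-Python decision function (a simplification with hypothetical parameters; the real logic inspects platform and config state inside W8A8BlockFp8LinearOp):

```python
def select_backend(use_deep_gemm: bool, sm: int, n: int, k: int, out_bf16: bool,
                   flashinfer_ok: bool, cutlass_ok: bool, is_rocm_aiter: bool) -> str:
    """Sketch of the backend priority: FlashInfer+DeepGEMM > DeepGEMM >
    CUTLASS > ROCm AITER > Triton fallback."""
    deepgemm_ok = (use_deep_gemm and sm >= 90
                   and n % 64 == 0 and k % 128 == 0 and out_bf16)
    if flashinfer_ok and deepgemm_ok:
        return "flashinfer+deepgemm"
    if deepgemm_ok:
        return "deepgemm"
    if cutlass_ok:
        return "cutlass"
    if is_rocm_aiter:
        return "aiter"
    return "triton"

print(select_backend(True, 90, 4096, 7168, True, False, True, False))   # deepgemm
print(select_backend(False, 80, 4096, 7168, True, False, False, False)) # triton
```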
| Backend | Activation scale layout | Weight scale layout |
|---|---|---|
| CUTLASS | column-major (column_major_scales=True) | row-major (N_tiles, K_tiles) transposed |
| DeepGEMM | TMA-aligned column-major | row-major (N_tiles, K_tiles) |
| Triton | row-major (M, K//group_size) | row-major (N_tiles, K_tiles) |
| AITER | row-major | row-major |
| FlashInfer | handled internally (BF16 input passed) | row-major |
DeepGEMM is an external library (deep_gemm package) providing JIT-compiled FP8 GEMM kernels optimized for Hopper and Blackwell GPUs. vLLM wraps it through lazy initialization in vllm/utils/deep_gemm.py.
vllm/utils/deep_gemm.py126-176
_lazy_init() is called on first use and resolves the following symbols from the deep_gemm package:
| vLLM wrapper | DeepGEMM symbol | Usage |
|---|---|---|
fp8_gemm_nt | deep_gemm.fp8_gemm_nt | Dense FP8 GEMM for linear layers |
m_grouped_fp8_gemm_nt_contiguous | deep_gemm.m_grouped_fp8_gemm_nt_contiguous | MoE grouped GEMM (contiguous layout) |
fp8_m_grouped_gemm_nt_masked | deep_gemm.fp8_m_grouped_gemm_nt_masked | MoE masked grouped GEMM |
transform_sf_into_required_layout | deep_gemm.transform_sf_into_required_layout | Scale factor layout transformation |
fp8_mqa_logits | deep_gemm.fp8_mqa_logits | FP8 multi-query attention logits |
is_deep_gemm_supported()
Returns True when:
- the VLLM_USE_DEEP_GEMM=1 environment variable is set
- the deep_gemm package is installed

should_use_deepgemm_for_fp8_linear()
vllm/utils/deep_gemm.py399-418
Also checks that weight dimensions and output dtype satisfy alignment requirements:
- N % 64 == 0
- K % 128 == 0
- output dtype is torch.bfloat16

DeepGEMM JIT-compiles kernels on first use and caches them at $VLLM_CACHE_ROOT/deep_gemm (controlled by the DG_JIT_CACHE_DIR environment variable, set automatically if not present).
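The alignment portion of that check reduces to simple modular arithmetic (hypothetical helper name):

```python
def deepgemm_shape_ok(n: int, k: int) -> bool:
    """Weight-shape alignment required by the DeepGEMM FP8 linear path:
    N must be a multiple of 64 and K a multiple of 128."""
    return n % 64 == 0 and k % 128 == 0

print(deepgemm_shape_ok(4096, 7168))  # True
print(deepgemm_shape_ok(4096, 100))   # False
```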
The FlashInfer backend (_flashinfer_fp8_blockscale_gemm_impl) handles a special case: it uses a torch.cond branch to select between two sub-kernels based on batch size at runtime, maintaining compatibility with torch.compile.
vllm/model_executor/layers/quantization/utils/fp8_utils.py239-342
Diagram: FlashInfer batch-size dispatch
The torch.cond is required because torch.compile cannot capture input-dependent Python if/else branches. The M < 32 path uses FlashInfer's "swapAB" optimization while the M >= 32 path falls back to the standard DeepGEMM kernel. This split resolves an accuracy regression (88% vs 95% on GSM8K for DeepSeek-V3.1) observed when using the FlashInfer kernel for larger batch sizes.
_w8a8_triton_block_scaled_mm is the fallback Triton kernel for block-quantized GEMM.
vllm/model_executor/layers/quantization/utils/fp8_utils.py1058-1140
It implements a standard blocked matrix multiplication where:
BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K tile the outputa_s from As[row, k_block], b_s from Bs[n_block, k_block]float32bfloat16 or float16Optimal block sizes per (N, K, device) are loaded from JSON config files via get_w8a8_block_fp8_configs():
vllm/model_executor/layers/quantization/utils/fp8_utils.py1143-1161
Config files are named in the pattern:
N={N},K={K},device_name={device_name},dtype=fp8_w8a8,block_shape=[{block_n},{block_k}].json
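A sketch of building that filename (hypothetical helper; the device name shown is only an example):

```python
def block_fp8_config_filename(n: int, k: int, device_name: str,
                              block_n: int, block_k: int) -> str:
    """Build the tuned-config JSON filename following the documented pattern."""
    return (f"N={n},K={k},device_name={device_name},"
            f"dtype=fp8_w8a8,block_shape=[{block_n},{block_k}].json")

print(block_fp8_config_filename(4096, 7168, "NVIDIA_H100_80GB_HBM3", 128, 128))
```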
Each backend's core operation is registered as a torch custom op via direct_register_custom_op, enabling compatibility with torch.compile graph capture. Each registration pairs an implementation function with a fake/meta implementation that returns the correct output shape without computation.
| Custom op name | Implementation | Used by |
|---|---|---|
vllm.w8a8_triton_block_scaled_mm_func | _w8a8_triton_block_scaled_mm_func | Triton backend |
vllm.padded_cutlass | _padded_cutlass | CUTLASS on Hopper (pads M to multiple of 4) |
vllm.fp8_gemm_nt_op | _fp8_gemm_nt_op | DeepGEMM backend |
vllm.triton_per_token_group_quant_fp8 | _triton_per_token_group_quant_fp8_impl | ROCm Triton quantization |
vllm.flashinfer_fp8_blockscale_gemm | _flashinfer_fp8_blockscale_gemm_impl | FlashInfer backend |
Sources: vllm/model_executor/layers/quantization/utils/fp8_utils.py107-342
| Variable | Default | Effect |
|---|---|---|
VLLM_USE_DEEP_GEMM | 0 | Enable DeepGEMM kernels for FP8 linear and MoE |
VLLM_USE_DEEP_GEMM_E8M0 | 0 | Use UE8M0 (power-of-2) scale encoding |
VLLM_USE_DEEP_GEMM_TMA_ALIGNED_SCALES | 0 | Use TMA-aligned scale strides (Hopper) |
These are read in vllm/utils/deep_gemm.py43-58 and vllm/model_executor/layers/quantization/utils/fp8_utils.py381-382
Diagram: End-to-end data flow for a block-FP8 forward pass
Sources: vllm/model_executor/layers/quantization/utils/fp8_utils.py388-420
Tests are in tests/kernels/quantization/test_block_fp8.py and cover:
| Test | What it verifies |
|---|---|
test_per_token_group_quant_fp8 | Correctness of per_token_group_quant_fp8 across group sizes, layout modes, and TMA alignment |
test_w8a8_block_fp8_matmul | Triton block-scaled GEMM vs. reference implementation |
test_w8a8_block_fp8_cutlass_matmul | CUTLASS block-scaled GEMM including non-128-aligned N (e.g., N=576) |
test_w8a8_block_fp8_deep_gemm_matmul | DeepGEMM fp8_gemm_nt vs. reference, skips unsupported shapes |
test_w8a8_block_fp8_flashinfer_matmul | FlashInfer block-scale GEMM vs. reference |
The calc_diff function vllm/utils/deep_gemm.py383-396 computes a cosine-similarity-based global difference metric used in DeepGEMM tests, since per-element tolerances are too strict for Blackwell kernels.