This page documents the FP8 (8-bit floating point) quantization infrastructure in vLLM, covering per-token-group activation quantization, per-block weight quantization, scale factor formats (including UE8M0), and the multi-backend GEMM dispatch mechanism used for block-quantized linear layers. The central dispatch class is W8A8BlockFp8LinearOp, which selects between CUTLASS, DeepGEMM, Triton, FlashInfer, and ROCm AITER backends.
For how FP8 quantization is applied to MoE expert layers specifically, see 7.3 and 7.4. For the general quantization method registry and how Fp8LinearMethod is wired into the model loading pipeline, see 7.1.
vLLM uses platform-appropriate FP8 dtypes:
| Platform | FP8 dtype | Notes |
|---|---|---|
| CUDA | torch.float8_e4m3fn | Standard FP8, max value 448.0 |
| ROCm | torch.float8_e4m3fnuz | Unsigned-zero variant, max value 224.0 |
The is_fp8 helper in vllm/model_executor/layers/quantization/utils/fp8_utils.py53-56 checks for either dtype. The platform-correct dtype is retrieved via current_platform.fp8_dtype().
Block-FP8 (W8A8) in vLLM uses two distinct quantization granularities for weights and activations:
| Tensor | Granularity | Typical block shape | Scale tensor shape |
|---|---|---|---|
| Weights | Per-block (2D tile) | [128, 128] (N×K) | (num_N_tiles, num_K_tiles) |
| Activations | Per-token-group (1D row group) | [1, 128] (one group per row segment) | (M, K//128) |
This matches the format required by models such as DeepSeek-V3, which uses 128-element groups in both dimensions.
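Under these granularities the scale-tensor shapes follow directly from ceiling division by the block size. A minimal sketch (the helper names are hypothetical, not vLLM APIs):

```python
import math

def weight_scale_shape(n: int, k: int, block_n: int = 128, block_k: int = 128):
    """Per-block weight scales: one scale per (block_n x block_k) tile."""
    return (math.ceil(n / block_n), math.ceil(k / block_k))

def activation_scale_shape(m: int, k: int, group_size: int = 128):
    """Per-token-group activation scales: one scale per group_size-wide row segment."""
    return (m, math.ceil(k / group_size))

# DeepSeek-V3-style 128-element groups; example projection N=4096, K=7168, 16 tokens
print(weight_scale_shape(4096, 7168))    # (32, 56)
print(activation_scale_shape(16, 7168))  # (16, 56)
```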
Three scale factor formats are used, selected by DeepGemmQuantScaleFMT based on hardware and configuration:
DeepGemmQuantScaleFMT enum — vllm/utils/deep_gemm.py27-65
| Enum value | Description | Used when |
|---|---|---|
FLOAT32 | Float32 scale tensor, standard layout | DeepGEMM disabled, or Triton/CUTLASS backends |
FLOAT32_CEIL_UE8M0 | Float32 tensor, values ceiled to powers of 2 | Hopper (SM90), VLLM_USE_DEEP_GEMM_E8M0=1 |
UE8M0 | Int32 tensor, 4 UE8M0 scale values packed per element | Blackwell (SM100+), VLLM_USE_DEEP_GEMM_E8M0=1 |
The oracle decision is cached at startup via DeepGemmQuantScaleFMT.init_oracle_cache(). The helper is_deep_gemm_e8m0_used() vllm/utils/deep_gemm.py81-104 returns True when the UE8M0 path is active.
UE8M0 scale encoding: Each scale value s is stored as 2^ceil(log2(s)) — the next power-of-two ceiling of s. This enables efficient hardware-level dequantization on Blackwell. The conversion appears in the Triton kernel:
```python
y_s = tl.math.exp2(tl.ceil(tl.log2(scale_raw))) if use_ue8m0 else scale_raw
```
vllm/model_executor/layers/quantization/utils/fp8_utils.py649
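The same encoding can be reproduced in plain Python (hypothetical helper name; the real conversion runs inside the Triton kernel):

```python
import math

def ceil_to_ue8m0(s: float) -> float:
    """Round a positive scale up to the next power of two (UE8M0 ceiling).
    Mirrors the exp2(ceil(log2(s))) expression in the Triton kernel."""
    return 2.0 ** math.ceil(math.log2(s))

print(ceil_to_ue8m0(0.3))  # 0.5
print(ceil_to_ue8m0(1.0))  # 1.0
print(ceil_to_ue8m0(3.0))  # 4.0
```

Because the stored scale is never smaller than the true scale, quantized values can only shrink, so the encoding is lossy but safe against FP8 overflow.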
The memory layout of activation scale tensors varies by backend:
Diagram: Scale tensor layout variants for (M=4, K=256, group_size=128) → 2 groups per row
The layout is controlled by parameters to per_token_group_quant_fp8:
- column_major_scales=False → row-major (Triton, default)
- column_major_scales=True, tma_aligned_scales=False → column-major (CUTLASS)
- column_major_scales=True, tma_aligned_scales=True → TMA-aligned (DeepGEMM Hopper)

Sources: vllm/model_executor/layers/quantization/utils/fp8_utils.py857-981
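To make the layout difference concrete, here is a hypothetical index helper showing where scale (row, group) lands in memory under each layout (TMA alignment, which pads strides, is omitted):

```python
def flat_index(row: int, group: int, m: int, g: int, column_major: bool) -> int:
    """Flat memory offset of scale (row, group) in an (M x G) scale grid.
    Row-major: a token's group scales are contiguous.
    Column-major: one group's scales across all tokens are contiguous,
    which is the layout CUTLASS expects."""
    return group * m + row if column_major else row * g + group

M, G = 4, 2  # 4 tokens, K=256 with group_size=128 -> 2 groups per row
print(flat_index(1, 1, M, G, column_major=False))  # 3
print(flat_index(1, 1, M, G, column_major=True))   # 5
```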
QuantFP8
QuantFP8 is a CustomOp that quantizes input activations to FP8. It supports per-tensor, per-token, per-channel, and per-group quantization modes.
Key constructor parameters:
| Parameter | Type | Effect |
|---|---|---|
static | bool | Static (pre-computed) vs. dynamic (online) scaling |
group_shape | GroupShape | PER_TOKEN, PER_TENSOR, or (1, group_size) for block |
column_major_scales | bool | Output scales in column-major layout |
tma_aligned_scales | bool | Apply TMA stride alignment (DeepGEMM Hopper) |
use_ue8m0 | bool | Use power-of-2 scale ceiling encoding |
Diagram: QuantFP8.forward_cuda dispatch logic
Sources: vllm/model_executor/layers/quantization/input_quant_fp8.py83-131
per_token_group_quant_fp8
vllm/model_executor/layers/quantization/utils/fp8_utils.py857-981
Quantizes a 2D+ tensor along the last dimension in groups of group_size. Produces an FP8 tensor of the same shape and a scale tensor whose shape depends on the layout parameter.
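A pure-Python reference of the quantization math (illustrative only; quant_row is a hypothetical name, and the real kernel casts to an FP8 tensor rather than keeping floats):

```python
# Reference for per-token-group quantization over one row.
FP8_E4M3_MAX = 448.0  # max representable value of float8_e4m3fn

def quant_row(row, group_size=128, eps=1e-10):
    """Quantize one row in groups of `group_size`; returns (q_row, scales)."""
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        amax = max(abs(v) for v in group)
        s = max(amax, eps) / FP8_E4M3_MAX  # scale so the group fits the FP8 range
        scales.append(s)
        q.extend(v / s for v in group)     # the real kernel casts to FP8 here
    return q, scales

q, s = quant_row([224.0] * 128 + [448.0] * 128)
print(s)       # [0.5, 1.0]
print(max(q))  # 448.0
```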
Execution path:
- torch.ops._C.per_token_group_fp8_quant (custom C++ kernel)
- _per_token_group_quant_fp8 (row-major scales) or _per_token_group_quant_fp8_colmajor (column-major scales)

per_token_group_quant_fp8_packed_for_deepgemm
vllm/model_executor/layers/quantization/utils/fp8_utils.py984-1055
Produces UE8M0-packed int32 scale tensors with TMA-aligned strides for the Blackwell DeepGEMM path. Requires use_ue8m0=True. Calls torch.ops._C.per_token_group_fp8_quant_packed.
silu_mul_per_token_group_quant_fp8_colmajor
vllm/model_executor/layers/quantization/utils/fp8_utils.py727-790
A fused Triton kernel that performs SiLU activation, elementwise multiply (gate), and per-group FP8 quantization in a single pass. Used inside MoE layers between the gate/up projection and the down projection.
per_block_cast_to_fp8
vllm/utils/deep_gemm.py360-380
Quantizes a 2D weight tensor to FP8 using 2D block-wise scales. This is performed offline at weight loading time.
```python
x_view = x_padded.view(-1, block_m, n_blocks, block_n)  # tile into (block_m x block_n) blocks
x_amax = x_view.abs().float().amax(dim=(1, 3), keepdim=True).clamp(1e-4)  # per-block amax, clamped
sf = x_amax / fp8_max                         # one scale factor per block
sf = _ceil_to_ue8m0(sf) if use_ue8m0 else sf  # optional power-of-2 ceiling
x_scaled = (x_view * (1.0 / sf)).to(fp8_dtype)  # scale and cast to FP8
```
The output scale shape is (M // block_m, N // block_n).
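The same per-block scale computation, sketched in plain Python on a toy 2x2 weight with the 1e-4 amax clamp (block_scales is a hypothetical helper, not the vLLM API):

```python
# Minimal sketch of 2D block-wise weight scaling (illustrative only).
FP8_MAX = 448.0  # float8_e4m3fn max

def block_scales(w, block_m, block_n):
    """One scale per (block_m x block_n) tile of a 2D weight `w` (list of lists)."""
    m, n = len(w), len(w[0])
    scales = []
    for i in range(0, m, block_m):
        row = []
        for j in range(0, n, block_n):
            amax = max(abs(w[r][c])
                       for r in range(i, min(i + block_m, m))
                       for c in range(j, min(j + block_n, n)))
            row.append(max(amax, 1e-4) / FP8_MAX)  # clamp avoids zero scales
        scales.append(row)
    return scales

w = [[448.0, 224.0], [224.0, 112.0]]
print(block_scales(w, 1, 1))  # [[1.0, 0.5], [0.5, 0.25]]
print(block_scales(w, 2, 2))  # [[1.0]]
```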
W8A8BlockFp8LinearOp
W8A8BlockFp8LinearOp is the central class for running block-quantized FP8 matrix multiplications in linear layers.
vllm/model_executor/layers/quantization/utils/fp8_utils.py347-585
Diagram: W8A8BlockFp8LinearOp structure and backend methods
Diagram: apply() dispatch flowchart
Sources: vllm/model_executor/layers/quantization/utils/fp8_utils.py388-585
| Backend | Condition |
|---|---|
| FlashInfer + DeepGEMM | FlashInfer available, DeepGEMM N/K compatible, output bfloat16, SM90+ |
| DeepGEMM only | VLLM_USE_DEEP_GEMM=1, SM90 or SM100+, N%64==0, K%128==0, output bfloat16 |
| CUTLASS | CUTLASS_BLOCK_FP8_SUPPORTED (CUDA, checked at import) |
| ROCm AITER | ROCm platform with AITER enabled |
| Triton | Default fallback on all platforms |
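The selection order in the table can be sketched as a plain-Python decision function (a simplification with hypothetical parameters; the real logic inspects platform and config state inside W8A8BlockFp8LinearOp):

```python
def select_backend(use_deep_gemm: bool, sm: int, n: int, k: int, out_bf16: bool,
                   flashinfer_ok: bool, cutlass_ok: bool, is_rocm_aiter: bool) -> str:
    """Sketch of the backend priority: FlashInfer+DeepGEMM > DeepGEMM >
    CUTLASS > ROCm AITER > Triton fallback."""
    deepgemm_ok = (use_deep_gemm and sm >= 90
                   and n % 64 == 0 and k % 128 == 0 and out_bf16)
    if flashinfer_ok and deepgemm_ok:
        return "flashinfer+deepgemm"
    if deepgemm_ok:
        return "deepgemm"
    if cutlass_ok:
        return "cutlass"
    if is_rocm_aiter:
        return "aiter"
    return "triton"

print(select_backend(True, 90, 4096, 7168, True, False, True, False))   # deepgemm
print(select_backend(False, 80, 4096, 7168, True, False, False, False)) # triton
```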
| Backend | Activation scale layout | Weight scale layout |
|---|---|---|
| CUTLASS | column-major (column_major_scales=True) | row-major (N_tiles, K_tiles) transposed |
| DeepGEMM | TMA-aligned column-major | row-major (N_tiles, K_tiles) |
| Triton | row-major (M, K//group_size) | row-major (N_tiles, K_tiles) |
| AITER | row-major | row-major |
| FlashInfer | handled internally (BF16 input passed) | row-major |
DeepGEMM is an external library (deep_gemm package) providing JIT-compiled FP8 GEMM kernels optimized for Hopper and Blackwell GPUs. vLLM wraps it through lazy initialization in vllm/utils/deep_gemm.py.
vllm/utils/deep_gemm.py126-176
_lazy_init() is called on first use and resolves the following symbols from the deep_gemm package:
| vLLM wrapper | DeepGEMM symbol | Usage |
|---|---|---|
fp8_gemm_nt | deep_gemm.fp8_gemm_nt | Dense FP8 GEMM for linear layers |
m_grouped_fp8_gemm_nt_contiguous | deep_gemm.m_grouped_fp8_gemm_nt_contiguous | MoE grouped GEMM (contiguous layout) |
fp8_m_grouped_gemm_nt_masked | deep_gemm.fp8_m_grouped_gemm_nt_masked | MoE masked grouped GEMM |
transform_sf_into_required_layout | deep_gemm.transform_sf_into_required_layout | Scale factor layout transformation |
fp8_mqa_logits | deep_gemm.fp8_mqa_logits | FP8 multi-query attention logits |
is_deep_gemm_supported()
Returns True when:
- the VLLM_USE_DEEP_GEMM=1 environment variable is set
- the deep_gemm package is installed

should_use_deepgemm_for_fp8_linear()
vllm/utils/deep_gemm.py399-418
Also checks that weight dimensions and output dtype satisfy alignment requirements:
- N % 64 == 0
- K % 128 == 0
- output dtype is torch.bfloat16

DeepGEMM JIT-compiles kernels on first use and caches them at $VLLM_CACHE_ROOT/deep_gemm (controlled by the DG_JIT_CACHE_DIR environment variable, set automatically if not present).
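The alignment portion of that check reduces to simple modular arithmetic (hypothetical helper name):

```python
def deepgemm_shape_ok(n: int, k: int) -> bool:
    """Weight-shape alignment required by the DeepGEMM FP8 linear path:
    N must be a multiple of 64 and K a multiple of 128."""
    return n % 64 == 0 and k % 128 == 0

print(deepgemm_shape_ok(4096, 7168))  # True
print(deepgemm_shape_ok(4096, 100))   # False
```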
The FlashInfer backend (_flashinfer_fp8_blockscale_gemm_impl) handles a special case: it uses a torch.cond branch to select between two sub-kernels based on batch size at runtime, maintaining compatibility with torch.compile.
vllm/model_executor/layers/quantization/utils/fp8_utils.py239-342
Diagram: FlashInfer batch-size dispatch
The torch.cond is required because torch.compile cannot capture input-dependent Python if/else branches. The M < 32 path uses FlashInfer's "swapAB" optimization while the M >= 32 path falls back to the standard DeepGEMM kernel. This split resolves an accuracy regression (88% vs 95% on GSM8K for DeepSeek-V3.1) observed when using the FlashInfer kernel for larger batch sizes.
_w8a8_triton_block_scaled_mm is the fallback Triton kernel for block-quantized GEMM.
vllm/model_executor/layers/quantization/utils/fp8_utils.py1058-1140
It implements a standard blocked matrix multiplication where:
BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K tile the outputa_s from As[row, k_block], b_s from Bs[n_block, k_block]float32bfloat16 or float16Optimal block sizes per (N, K, device) are loaded from JSON config files via get_w8a8_block_fp8_configs():
vllm/model_executor/layers/quantization/utils/fp8_utils.py1143-1161
Config files are named in the pattern:
N={N},K={K},device_name={device_name},dtype=fp8_w8a8,block_shape=[{block_n},{block_k}].json
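A sketch of building that filename (hypothetical helper; the device name shown is only an example):

```python
def block_fp8_config_filename(n: int, k: int, device_name: str,
                              block_n: int, block_k: int) -> str:
    """Build the tuned-config JSON filename following the documented pattern."""
    return (f"N={n},K={k},device_name={device_name},"
            f"dtype=fp8_w8a8,block_shape=[{block_n},{block_k}].json")

print(block_fp8_config_filename(4096, 7168, "NVIDIA_H100_80GB_HBM3", 128, 128))
```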
Each backend's core operation is registered as a torch custom op via direct_register_custom_op, enabling compatibility with torch.compile graph capture. Each registration pairs an implementation function with a fake/meta implementation that returns the correct output shape without computation.
| Custom op name | Implementation | Used by |
|---|---|---|
vllm.w8a8_triton_block_scaled_mm_func | _w8a8_triton_block_scaled_mm_func | Triton backend |
vllm.padded_cutlass | _padded_cutlass | CUTLASS on Hopper (pads M to multiple of 4) |
vllm.fp8_gemm_nt_op | _fp8_gemm_nt_op | DeepGEMM backend |
vllm.triton_per_token_group_quant_fp8 | _triton_per_token_group_quant_fp8_impl | ROCm Triton quantization |
vllm.flashinfer_fp8_blockscale_gemm | _flashinfer_fp8_blockscale_gemm_impl | FlashInfer backend |
Sources: vllm/model_executor/layers/quantization/utils/fp8_utils.py107-342
| Variable | Default | Effect |
|---|---|---|
VLLM_USE_DEEP_GEMM | 0 | Enable DeepGEMM kernels for FP8 linear and MoE |
VLLM_USE_DEEP_GEMM_E8M0 | 0 | Use UE8M0 (power-of-2) scale encoding |
VLLM_USE_DEEP_GEMM_TMA_ALIGNED_SCALES | 0 | Use TMA-aligned scale strides (Hopper) |
These are read in vllm/utils/deep_gemm.py43-58 and vllm/model_executor/layers/quantization/utils/fp8_utils.py381-382
Diagram: End-to-end data flow for a block-FP8 forward pass
Sources: vllm/model_executor/layers/quantization/utils/fp8_utils.py388-420
Tests are in tests/kernels/quantization/test_block_fp8.py and cover:
| Test | What it verifies |
|---|---|
test_per_token_group_quant_fp8 | Correctness of per_token_group_quant_fp8 across group sizes, layout modes, and TMA alignment |
test_w8a8_block_fp8_matmul | Triton block-scaled GEMM vs. reference implementation |
test_w8a8_block_fp8_cutlass_matmul | CUTLASS block-scaled GEMM including non-128-aligned N (e.g., N=576) |
test_w8a8_block_fp8_deep_gemm_matmul | DeepGEMM fp8_gemm_nt vs. reference, skips unsupported shapes |
test_w8a8_block_fp8_flashinfer_matmul | FlashInfer block-scale GEMM vs. reference |
The calc_diff function vllm/utils/deep_gemm.py383-396 computes a cosine-similarity-based global difference metric used in DeepGEMM tests, since per-element tolerances are too strict for Blackwell kernels.