This page describes how ncnn represents and executes INT8 inference: the Quantize, Dequantize, and Requantize layer implementations, the scale_data and bias_data conventions, float2int8 conversion utilities, and how SIMD packing layouts interact with INT8 data paths across ARM, x86, MIPS, and LoongArch backends.
For the post-training calibration workflow that produces the scale tables consumed at runtime, see page 8.2. For how convolution layers select between Winograd, im2col-GEMM, and INT8 kernel variants at create_pipeline time, see page 4.1.
INT8 inference in ncnn uses a quantize → compute → dequantize/requantize pattern. Floating-point activations are converted to signed char before a layer, the layer accumulates into int32, and the result is either converted back to float (for a float next-layer) or requantized to INT8 (for a chained INT8 next-layer).
Diagram: INT8 tensor data flow through ncnn layers
Sources: src/layer/quantize.cpp src/layer/requantize.cpp src/layer/arm/dequantize_arm.cpp
Diagram: Quantize / Dequantize / Requantize class hierarchy
Sources: src/layer/quantize.cpp src/layer/arm/quantize_arm.cpp src/layer/x86/quantize_x86.cpp src/layer/mips/quantize_mips.cpp src/layer/arm/dequantize_arm.cpp src/layer/x86/dequantize_x86.cpp src/layer/mips/dequantize_mips.cpp src/layer/loongarch/dequantize_loongarch.cpp src/layer/arm/requantize_arm.cpp src/layer/x86/requantize_x86.cpp src/layer/mips/requantize_mips.cpp src/layer/loongarch/requantize_loongarch.cpp
Converts a float (or bf16/fp16) blob to signed char (INT8).
Parameters (loaded via load_param):
| Param Index | Field | Meaning |
|---|---|---|
| 0 | scale_data_size | 1 = global scale; >1 = per-channel scales |
Weights (loaded via load_model):
| Field | Size |
|---|---|
| scale_data | scale_data_size floats |
Forward operation (scalar path):
output[i] = float2int8(input[i] * scale_data[channel_i])
The scalar float2int8 is defined identically in src/layer/quantize.cpp, src/layer/arm/arm_usability.h, and src/layer/x86/x86_usability.h: round to the nearest integer, then clamp to [-127, 127].
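The per-channel step above can be sketched as a plain scalar loop. `float2int8` follows the definition described here; `quantize_channel` is an illustrative name, not the ncnn entry point:

```cpp
#include <cmath>

// Round to nearest, clamp symmetrically to [-127, 127]; -128 is never produced.
static signed char float2int8(float v)
{
    int i = (int)roundf(v);
    if (i > 127) return 127;
    if (i < -127) return -127;
    return (signed char)i;
}

// Quantize one channel: out[i] = float2int8(in[i] * scale)
static void quantize_channel(const float* in, signed char* out, int size, float scale)
{
    for (int i = 0; i < size; i++)
        out[i] = float2int8(in[i] * scale);
}
```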
Sources: src/layer/quantize.cpp:1-80 src/layer/arm/quantize_arm.cpp:31-96
Converts an int32 blob (accumulator result from INT8 convolution) to a float (or bf16/fp16) blob.
Parameters:
| Param Index | Field | Meaning |
|---|---|---|
| 0 | scale_data_size | 1 = global; >1 = per-channel |
| 1 | bias_data_size | 0 = no bias; 1 = scalar; >1 = per-channel |
Weights:
| Field | Size |
|---|---|
| scale_data | scale_data_size floats |
| bias_data | bias_data_size floats (optional) |
Forward operation:
output[i] = int_value[i] * scale_data[channel_i] + bias_data[channel_i]
When bias_data_size == 0:
output[i] = int_value[i] * scale_data[channel_i]
The bias here is the convolution layer's bias pre-folded into the dequantization step by ncnn2int8. Dequantize supports multiple output types: when opt.use_bf16_storage is set it outputs uint16 (bfloat16); when opt.use_fp16_storage is set it outputs __fp16.
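The two forward variants above reduce to one scalar loop where a null bias pointer models bias_data_size == 0. `dequantize_channel` is an illustrative name, not the ncnn entry point:

```cpp
#include <cmath>

// Scalar sketch of the Dequantize forward pass:
// out[i] = in[i] * scale + bias  (bias omitted when bias_data_size == 0)
static void dequantize_channel(const int* in, float* out, int size,
                               float scale, const float* bias)
{
    const float b = bias ? *bias : 0.f; // null bias models bias_data_size == 0
    for (int i = 0; i < size; i++)
        out[i] = in[i] * scale + b;
}
```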
Sources: src/layer/arm/dequantize_arm.cpp:30-147 src/layer/arm/dequantize_arm.cpp:149-228 src/layer/arm/dequantize_arm.cpp:230-423
Converts an int32 accumulator directly to signed char (INT8), optionally applying a bias and an activation function. This avoids a round-trip through float when chaining INT8 layers.
Parameters:
| Param Index | Field | Meaning |
|---|---|---|
| 0 | scale_in_data_size | Scales the int32 accumulator |
| 1 | scale_out_data_size | Rescales to fit int8 output range |
| 2 | bias_data_size | Optional per-channel bias |
| 3 | activation_type | 0=none, 1=relu, 2=leakyrelu, etc. |
| 4 | activation_params | Slope for leaky relu, etc. |
Weights:
| Field | Size |
|---|---|
| scale_in_data | scale_in_data_size floats |
| scale_out_data | scale_out_data_size floats |
| bias_data | bias_data_size floats (optional) |
Forward operation:
float v = int_value * scale_in + bias
v = activation(v, activation_type, activation_params)
output = float2int8(v * scale_out)
The ARM implementation optimizes the relu activation case by fusing scale_in * scale_out into a single multiply and applying the clamp-to-zero in the same NEON pass (float2int8relu).
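The fusion is valid because relu commutes with multiplication by a positive scale. A scalar sketch of both paths (names are illustrative, not the ncnn internals; the fused form assumes scale_out > 0, which holds since scales are 127/threshold):

```cpp
#include <cmath>

static signed char float2int8(float v)
{
    int i = (int)roundf(v);
    if (i > 127) return 127;
    if (i < -127) return -127;
    return (signed char)i;
}

// General path: dequantize, apply relu, requantize.
static signed char requantize_relu(int v, float scale_in, float scale_out, float bias)
{
    float f = v * scale_in + bias;
    if (f < 0.f) f = 0.f; // relu
    return float2int8(f * scale_out);
}

// Fused path: relu((v*si + b)*so) == relu(v*si + b)*so when so > 0,
// so scale_in*scale_out collapses into a single multiply.
static signed char requantize_relu_fused(int v, float scale_in, float scale_out, float bias)
{
    float f = v * (scale_in * scale_out) + bias * scale_out;
    return f > 0.f ? float2int8(f) : 0;
}
```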
Sources: src/layer/requantize.cpp:1-141 src/layer/arm/requantize_arm.cpp:23-173 src/layer/x86/requantize_x86.cpp:25-175
The values in scale_data are derived from the calibration table file produced by ncnn2table. The formula is:
scale = 127.0 / threshold
where threshold is the activation threshold determined per-blob. For weights it is:
weight_scale[output_channel] = 127.0 / absmax_of_that_channel
The bias_data in Dequantize is not the raw convolution bias. During ncnn2int8, the convolution bias is folded in, pre-scaled so it is expressed in the same units as the int32 accumulator output scaled back to float.
For Requantize, bias_data values are further scaled by scale_out internally in the ARM implementation, allowing the add and multiply to happen in one NEON fused-multiply-add instruction.
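The per-output-channel weight scale derivation reduces to an absmax scan followed by the 127/absmax formula above. `weight_scale_for_channel` is a hypothetical name, not the ncnn2table API:

```cpp
#include <cmath>
#include <algorithm>

// One float scale per output channel: scale = 127 / max(|w|) over that channel.
static float weight_scale_for_channel(const float* w, int count)
{
    float absmax = 0.f;
    for (int i = 0; i < count; i++)
        absmax = std::max(absmax, std::fabs(w[i]));
    return absmax > 0.f ? 127.f / absmax : 1.f; // all-zero channel: fallback scale
}
```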
Scale dimensionality rules:
| scale_data_size | Meaning |
|---|---|
| 1 | Single scalar applied to all channels |
| == channel count | Per-channel; indexed by output channel index |
| == elempack | Per-lane within a packed group (used when elempack > 1) |
Sources: src/layer/arm/dequantize_arm.cpp:30-55 src/layer/arm/requantize_arm.cpp:38-70 tools/quantize/ncnn2table.cpp:338-407
Each architecture defines its own family of float2int8 helpers. They all clamp to [-127, 127] (not -128, preserving signed symmetry for easier scale computation).
Diagram: float2int8 utility functions by file and variant
Sources: src/layer/x86/x86_usability.h:143-377 src/layer/arm/arm_usability.h:7-255
Defined independently in each architecture header and in the base layer files:
```cpp
signed char float2int8(float v)
{
    int int32 = (int)round(v);
    if (int32 > 127) return 127;
    if (int32 < -127) return -127;
    return (signed char)int32;
}
```
src/layer/x86/x86_usability.h:143-149
src/layer/arm/arm_usability.h:7-13
src/layer/requantize.cpp:10-16
On AArch64, vcvtaq_s32_f32 provides hardware round-to-nearest. On ARMv7/AArch32, rounding is simulated by adding ±0.5 before truncation with vcvtq_s32_f32.
src/layer/arm/arm_usability.h:27-49
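The ±0.5 simulation can be modeled in scalar code: adding 0.5 with the sign of the input before truncation reproduces round-half-away-from-zero, matching the hardware rounding of vcvtaq_s32_f32. `round_via_truncate` is an illustrative name:

```cpp
// Scalar model of the ARMv7 rounding simulation: bias by +/-0.5 toward the
// sign of v, then truncate toward zero (what vcvtq_s32_f32 does).
static int round_via_truncate(float v)
{
    return (int)(v + (v >= 0.f ? 0.5f : -0.5f));
}
```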
The SSE path cannot rely on _MM_ROUND_NEAREST being set (it may be in banker's rounding mode), so it simulates round-to-nearest by adding ±0.5 (sign-preserving) before truncation with _mm_cvttps_epi32. The result is then packed through _mm_packs_epi32 → _mm_packs_epi16 and clamped to [-127, 127] using _mm_min_epi16 / _mm_max_epi16.
src/layer/x86/x86_usability.h:287-310
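The two-stage saturating pack plus the final clamp can be modeled lane-wise in scalar code. The helper names below are illustrative; only the saturation semantics mirror the intrinsics:

```cpp
// Scalar model of _mm_packs_epi32: int32 -> int16 with saturation.
static short packs32(int v)
{
    return v > 32767 ? (short)32767 : v < -32768 ? (short)-32768 : (short)v;
}

// Scalar model of _mm_packs_epi16: int16 -> int8 with saturation.
static signed char packs16(short v)
{
    return v > 127 ? (signed char)127 : v < -128 ? (signed char)-128 : (signed char)v;
}

// Pack then clamp to [-127, 127], as the extra min/max pass keeps -128 out.
static signed char pack_clamp(int v)
{
    signed char c = packs16(packs32(v));
    return c < -127 ? (signed char)-127 : c;
}
```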
The elempack field of a Mat records how many elements are grouped into a single memory unit (see page 2.3). INT8 layers must handle pack-format conversions because the input float tensor may be in pack4/pack8 format while the INT8 output target pack may differ.
The ARM Quantize_arm handles three cases explicitly:
| Input elempack | Output elempack | Function called |
|---|---|---|
| 4 | 8 | quantize_pack4to8 – merges two pack4 float slices |
| 4 | 1 | quantize_pack4to1 – scatters pack4 float into four separate int8 channels |
| N | N | quantize – in-place packing preserved |
The output elempack for INT8 is chosen by the forward method based on the total channel count.
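A hedged sketch of that choice, assuming the pack-8 rule keys off divisibility of the total channel count and the Option packing flag (`int8_out_elempack` is an illustrative name, not the ncnn source):

```cpp
// Pick the int8 output elempack: pack to 8 lanes only when the total
// unpacked channel count divides evenly, otherwise fall back to elempack 1.
static int int8_out_elempack(int channels, int elempack, bool use_packing_layout)
{
    if (!use_packing_layout)
        return 1;
    return (channels * elempack) % 8 == 0 ? 8 : 1;
}
```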
For dequantize, since the input is already int32 with some elempack (typically 4 or 8), scale vectors are loaded with matching width.
Sources: src/layer/arm/quantize_arm.cpp:237-249 src/layer/arm/quantize_arm.cpp:98-141 src/layer/arm/quantize_arm.cpp:143-208 src/layer/arm/dequantize_arm.cpp:39-55
Requantize supports inline activation to avoid an extra layer dispatch. The activation_type parameter selects the function applied between scaling and requantization:
| activation_type | Meaning |
|---|---|
| 0 | No activation |
| 1 | ReLU |
| 2 | Leaky ReLU (slope in activation_params[0]) |
| 3 | Clip (min/max in activation_params) |
| 4 | Sigmoid |
| other | Implementation-defined |
On ARM, the relu and leaky-relu cases have dedicated NEON-optimized helpers (float2int8relu, float2int8leakyrelu) that fold the activation into the int8 packing operation. On x86, the dispatch goes through activation_sse, activation_avx, and activation_avx512 from x86_activation.h.
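The scalar dispatch for the table above amounts to a small switch (the `activate` helper name is illustrative; the type codes match the table):

```cpp
#include <cmath>
#include <algorithm>

// Apply the activation selected by activation_type between scaling and
// requantization; params carries slope (leaky relu) or min/max (clip).
static float activate(float v, int type, const float* params)
{
    switch (type) {
    case 1: return v > 0.f ? v : 0.f;                           // relu
    case 2: return v > 0.f ? v : v * params[0];                 // leaky relu
    case 3: return std::min(std::max(v, params[0]), params[1]); // clip
    case 4: return 1.f / (1.f + std::exp(-v));                  // sigmoid
    default: return v;                                          // none
    }
}
```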
Sources: src/layer/requantize.cpp:55-65 src/layer/arm/requantize_arm.cpp:23-173 src/layer/x86/requantize_x86.cpp:104-175
For 3×3 convolutions with dilation=1, stride=1 (eligible for Winograd transforms), the calibration tool uses 6-bit weight quantization (max representable value 31) instead of the normal 7-bit (max value 127).
The 2-bit headroom is needed because the Winograd transform multiplies weights by transformation matrices that can increase their magnitude. All three calibration methods (KL, ACIQ, EQ) apply this rule.
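A hedged sketch of the range choice, with `weight_scale` and the `winograd_eligible` flag as illustrative names rather than the ncnn2table API:

```cpp
// Winograd-eligible 3x3 s1 d1 layers quantize weights against a 6-bit
// range (31) instead of 127, leaving headroom for the magnitude growth
// introduced by the Winograd weight transform.
static float weight_scale(float absmax, bool winograd_eligible)
{
    const float qmax = winograd_eligible ? 31.f : 127.f;
    return absmax > 0.f ? qmax / absmax : 1.f;
}
```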
Sources: tools/quantize/ncnn2table.cpp:332-358 tools/quantize/ncnn2table.cpp:808-835
The Quantize and Dequantize layers participate in ncnn's reduced-precision storage modes. The forward method of each architecture subclass dispatches based on the Option flags:
Diagram: Quantize_arm forward dispatch
For Dequantize_arm, the output type follows the same dispatch: when opt.use_bf16_storage, forward_bf16s is called and it emits uint16 (bfloat16) values using float2bfloat. When opt.use_fp16_storage, it emits __fp16 values.
Sources: src/layer/arm/quantize_arm.cpp:211-229 src/layer/arm/dequantize_arm.cpp:149-168
The table below summarizes the SIMD widths used in the INT8 quantize/dequantize layers per architecture:
| Architecture | Max elempack (float in) | Max elempack (int8 out) | Key instruction |
|---|---|---|---|
| ARM NEON (ARMv7) | 4 | 8 | vcvtq_s32_f32 + simulated round |
| ARM NEON (AArch64) | 4/8 | 8 | vcvtaq_s32_f32 |
| x86 SSE2 | 4 | 4 | _mm_packs_epi32 + _mm_packs_epi16 |
| x86 AVX2 | 8 | 8 | _mm256_cvtepi32_ps / _mm256_loadu_si256 |
| x86 AVX-512 | 16 | 16 | _mm512_cvtepi32_ps / float2int8_avx512 |
| MIPS MSA | 8 | 8 | __msa_ffint_s_w + __msa_sat_s_w |
| LoongArch LSX | 8 | 8 | __lsx_vftint_w_s + saturation |
Sources: src/layer/arm/quantize_arm.cpp:51-96 src/layer/x86/quantize_x86.cpp:70-120 src/layer/mips/quantize_mips.cpp src/layer/mips/requantize_mips.cpp src/layer/loongarch/requantize_loongarch.cpp
The .table file produced by ncnn2table (and consumed by ncnn2int8) has two sections:
layername_param_0 w0 w1 w2 ... wN
layername a0
Lines suffixed with _param_0 carry weight scales: one float per output channel (or per group for depthwise); plain layername lines carry a single activation scale. These map directly to weight_scales and bottom_blob_scales in QuantNet.
To exclude a layer from INT8 (mixed precision), comment out its weight scale line:
#conv1_param_0 156.639840536
Sources: tools/quantize/ncnn2table.cpp:141-184 docs/how-to-use-and-FAQ/quantized-int8-inference.md:115-124
Convolution, ConvolutionDepthWise, and InnerProduct layers carry an int8_scale_term field (param index 8). This is used by ncnn2int8 to record whether a layer has been quantized and what the scale-term type is. It is stored in the .param file alongside the normal convolution parameters.
The value distinguishes:
- 0: FP32 layer, no INT8 path
Sources: tools/quantize/ncnn2table.cpp:1005-1070
For the calibration tooling (ncnn2table, ncnn2int8), see page 8.2. The runtime steps from a user perspective are:
1. Quantize the model with ncnn2int8.
2. Load the resulting .param / .bin files via Net::load_param and Net::load_model.
3. The runtime detects INT8 layers automatically from int8_scale_term.
Sources: docs/how-to-use-and-FAQ/quantized-int8-inference.md:104-124