This page describes how ncnn represents and executes INT8 inference: the Quantize, Dequantize, and Requantize layer implementations, the scale_data and bias_data conventions, float2int8 conversion utilities, and how SIMD packing layouts interact with INT8 data paths across ARM, x86, MIPS, and LoongArch backends.
For the post-training calibration workflow that produces the scale tables consumed at runtime, see page 8.2. For how convolution layers select between Winograd, im2col-GEMM, and INT8 kernel variants at create_pipeline time, see page 4.1.
INT8 inference in ncnn uses a quantize → compute → dequantize/requantize pattern. Floating-point activations are converted to signed char before a layer, the layer accumulates into int32, and the result is either converted back to float (for a float next-layer) or requantized to INT8 (for a chained INT8 next-layer).
Diagram: INT8 tensor data flow through ncnn layers
Sources: src/layer/quantize.cpp src/layer/requantize.cpp src/layer/arm/dequantize_arm.cpp
Diagram: Quantize / Dequantize / Requantize class hierarchy
Sources: src/layer/quantize.cpp src/layer/arm/quantize_arm.cpp src/layer/x86/quantize_x86.cpp src/layer/mips/quantize_mips.cpp src/layer/arm/dequantize_arm.cpp src/layer/x86/dequantize_x86.cpp src/layer/mips/dequantize_mips.cpp src/layer/loongarch/dequantize_loongarch.cpp src/layer/arm/requantize_arm.cpp src/layer/x86/requantize_x86.cpp src/layer/mips/requantize_mips.cpp src/layer/loongarch/requantize_loongarch.cpp
Converts a float (or bf16/fp16) blob to signed char (INT8).
Parameters (loaded via load_param):
| Param Index | Field | Meaning |
|---|---|---|
| 0 | scale_data_size | 1 = global scale; >1 = per-channel scales |
Weights (loaded via load_model):
| Field | Size |
|---|---|
| scale_data | scale_data_size floats |
Forward operation (scalar path):
output[i] = float2int8(input[i] * scale_data[channel_i])
The scalar float2int8 is defined identically in src/layer/quantize.cpp, src/layer/arm/arm_usability.h, and src/layer/x86/x86_usability.h: round to the nearest integer, then clamp to [-127, 127].
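The per-channel step above can be sketched as a plain scalar loop. `float2int8` follows the definition described here; `quantize_channel` is an illustrative name, not the ncnn entry point:

```cpp
#include <cmath>

// Round to nearest, clamp symmetrically to [-127, 127]; -128 is never produced.
static signed char float2int8(float v)
{
    int i = (int)roundf(v);
    if (i > 127) return 127;
    if (i < -127) return -127;
    return (signed char)i;
}

// Quantize one channel: out[i] = float2int8(in[i] * scale)
static void quantize_channel(const float* in, signed char* out, int size, float scale)
{
    for (int i = 0; i < size; i++)
        out[i] = float2int8(in[i] * scale);
}
```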
Sources: src/layer/quantize.cpp:1-80 src/layer/arm/quantize_arm.cpp:31-96
Converts an int32 blob (accumulator result from INT8 convolution) to a float (or bf16/fp16) blob.
Parameters:
| Param Index | Field | Meaning |
|---|---|---|
| 0 | scale_data_size | 1 = global; >1 = per-channel |
| 1 | bias_data_size | 0 = no bias; 1 = scalar; >1 = per-channel |
Weights:
| Field | Size |
|---|---|
| scale_data | scale_data_size floats |
| bias_data | bias_data_size floats (optional) |
Forward operation:
output[i] = int_value[i] * scale_data[channel_i] + bias_data[channel_i]
When bias_data_size == 0:
output[i] = int_value[i] * scale_data[channel_i]
The bias here is the convolution layer's bias pre-folded into the dequantization step by ncnn2int8. Dequantize supports multiple output types: when opt.use_bf16_storage is set it outputs uint16 (bfloat16); when opt.use_fp16_storage is set it outputs __fp16.
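The two forward variants above reduce to one scalar loop where a null bias pointer models bias_data_size == 0. `dequantize_channel` is an illustrative name, not the ncnn entry point:

```cpp
#include <cmath>

// Scalar sketch of the Dequantize forward pass:
// out[i] = in[i] * scale + bias  (bias omitted when bias_data_size == 0)
static void dequantize_channel(const int* in, float* out, int size,
                               float scale, const float* bias)
{
    const float b = bias ? *bias : 0.f; // null bias models bias_data_size == 0
    for (int i = 0; i < size; i++)
        out[i] = in[i] * scale + b;
}
```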
Sources: src/layer/arm/dequantize_arm.cpp:30-147 src/layer/arm/dequantize_arm.cpp:149-228 src/layer/arm/dequantize_arm.cpp:230-423
Converts an int32 accumulator directly to signed char (INT8), optionally applying a bias and an activation function. This avoids a round-trip through float when chaining INT8 layers.
Parameters:
| Param Index | Field | Meaning |
|---|---|---|
| 0 | scale_in_data_size | Scales the int32 accumulator |
| 1 | scale_out_data_size | Rescales to fit int8 output range |
| 2 | bias_data_size | Optional per-channel bias |
| 3 | activation_type | 0=none, 1=relu, 2=leakyrelu, etc. |
| 4 | activation_params | Slope for leaky relu, etc. |
Weights:
| Field | Size |
|---|---|
| scale_in_data | scale_in_data_size floats |
| scale_out_data | scale_out_data_size floats |
| bias_data | bias_data_size floats (optional) |
Forward operation:
float v = int_value * scale_in + bias
v = activation(v, activation_type, activation_params)
output = float2int8(v * scale_out)
The ARM implementation optimizes the relu activation case by fusing scale_in * scale_out into a single multiply and applying the clamp-to-zero in the same NEON pass (float2int8relu).
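The fusion is valid because relu commutes with multiplication by a positive scale. A scalar sketch of both paths (names are illustrative, not the ncnn internals; the fused form assumes scale_out > 0, which holds since scales are 127/threshold):

```cpp
#include <cmath>

static signed char float2int8(float v)
{
    int i = (int)roundf(v);
    if (i > 127) return 127;
    if (i < -127) return -127;
    return (signed char)i;
}

// General path: dequantize, apply relu, requantize.
static signed char requantize_relu(int v, float scale_in, float scale_out, float bias)
{
    float f = v * scale_in + bias;
    if (f < 0.f) f = 0.f; // relu
    return float2int8(f * scale_out);
}

// Fused path: relu((v*si + b)*so) == relu(v*si + b)*so when so > 0,
// so scale_in*scale_out collapses into a single multiply.
static signed char requantize_relu_fused(int v, float scale_in, float scale_out, float bias)
{
    float f = v * (scale_in * scale_out) + bias * scale_out;
    return f > 0.f ? float2int8(f) : 0;
}
```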
Sources: src/layer/requantize.cpp:1-141 src/layer/arm/requantize_arm.cpp:23-173 src/layer/x86/requantize_x86.cpp:25-175
The values in scale_data are derived from the calibration table file produced by ncnn2table. The formula is:
scale = 127.0 / threshold
where threshold is the activation threshold determined per-blob. For weights it is:
weight_scale[output_channel] = 127.0 / absmax_of_that_channel
The bias_data in Dequantize is not the raw convolution bias. During ncnn2int8, the convolution bias is folded in, pre-scaled so it is expressed in the same units as the int32 accumulator output scaled back to float.
For Requantize, bias_data values are further scaled by scale_out internally in the ARM implementation, allowing the add and multiply to happen in one NEON fused-multiply-add instruction.
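The per-output-channel weight scale derivation reduces to an absmax scan followed by the 127/absmax formula above. `weight_scale_for_channel` is a hypothetical name, not the ncnn2table API:

```cpp
#include <cmath>
#include <algorithm>

// One float scale per output channel: scale = 127 / max(|w|) over that channel.
static float weight_scale_for_channel(const float* w, int count)
{
    float absmax = 0.f;
    for (int i = 0; i < count; i++)
        absmax = std::max(absmax, std::fabs(w[i]));
    return absmax > 0.f ? 127.f / absmax : 1.f; // all-zero channel: fallback scale
}
```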
Scale dimensionality rules:
| scale_data_size | Meaning |
|---|---|
| 1 | Single scalar applied to all channels |
| == channel count | Per-channel; indexed by output channel index |
| == elempack | Per-lane within a packed group (used when elempack > 1) |
Sources: src/layer/arm/dequantize_arm.cpp:30-55 src/layer/arm/requantize_arm.cpp:38-70 tools/quantize/ncnn2table.cpp:338-407
Each architecture defines its own family of float2int8 helpers. They all clamp to [-127, 127] (not -128, preserving signed symmetry for easier scale computation).
Diagram: float2int8 utility functions by file and variant
Sources: src/layer/x86/x86_usability.h:143-377 src/layer/arm/arm_usability.h:7-255
Defined independently in each architecture header and in the base layer files:
```cpp
signed char float2int8(float v)
{
    int int32 = (int)round(v);
    if (int32 > 127) return 127;
    if (int32 < -127) return -127;
    return (signed char)int32;
}
```
src/layer/x86/x86_usability.h:143-149
src/layer/arm/arm_usability.h:7-13
src/layer/requantize.cpp:10-16
On AArch64, vcvtaq_s32_f32 provides hardware round-to-nearest. On ARMv7/AArch32, rounding is simulated by adding ±0.5 before truncation with vcvtq_s32_f32.
src/layer/arm/arm_usability.h:27-49
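The ±0.5 simulation can be modeled in scalar code: adding 0.5 with the sign of the input before truncation reproduces round-half-away-from-zero, matching the hardware rounding of vcvtaq_s32_f32. `round_via_truncate` is an illustrative name:

```cpp
// Scalar model of the ARMv7 rounding simulation: bias by +/-0.5 toward the
// sign of v, then truncate toward zero (what vcvtq_s32_f32 does).
static int round_via_truncate(float v)
{
    return (int)(v + (v >= 0.f ? 0.5f : -0.5f));
}
```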
The SSE path cannot rely on _MM_ROUND_NEAREST being set (it may be in banker's rounding mode), so it simulates round-to-nearest by adding ±0.5 (sign-preserving) before truncation with _mm_cvttps_epi32. The result is then packed through _mm_packs_epi32 → _mm_packs_epi16 and clamped to [-127, 127] using _mm_min_epi16 / _mm_max_epi16.
src/layer/x86/x86_usability.h:287-310
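The two-stage saturating pack plus the final clamp can be modeled lane-wise in scalar code. The helper names below are illustrative; only the saturation semantics mirror the intrinsics:

```cpp
// Scalar model of _mm_packs_epi32: int32 -> int16 with saturation.
static short packs32(int v)
{
    return v > 32767 ? (short)32767 : v < -32768 ? (short)-32768 : (short)v;
}

// Scalar model of _mm_packs_epi16: int16 -> int8 with saturation.
static signed char packs16(short v)
{
    return v > 127 ? (signed char)127 : v < -128 ? (signed char)-128 : (signed char)v;
}

// Pack then clamp to [-127, 127], as the extra min/max pass keeps -128 out.
static signed char pack_clamp(int v)
{
    signed char c = packs16(packs32(v));
    return c < -127 ? (signed char)-127 : c;
}
```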
The elempack field of a Mat records how many elements are grouped into a single memory unit (see page 2.3). INT8 layers must handle pack-format conversions because the input float tensor may be in pack4/pack8 format while the INT8 output target pack may differ.
The ARM Quantize_arm handles three cases explicitly:
| Input elempack | Output elempack | Function called |
|---|---|---|
| 4 | 8 | quantize_pack4to8 – merges two pack4 float slices |
| 4 | 1 | quantize_pack4to1 – scatters pack4 float into four separate int8 channels |
| N | N | quantize – in-place packing preserved |
The output elempack for INT8 is chosen by the forward method based on the total channel count.
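A hedged sketch of that choice, assuming the pack-8 rule keys off divisibility of the total channel count and the Option packing flag (`int8_out_elempack` is an illustrative name, not the ncnn source):

```cpp
// Pick the int8 output elempack: pack to 8 lanes only when the total
// unpacked channel count divides evenly, otherwise fall back to elempack 1.
static int int8_out_elempack(int channels, int elempack, bool use_packing_layout)
{
    if (!use_packing_layout)
        return 1;
    return (channels * elempack) % 8 == 0 ? 8 : 1;
}
```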
For dequantize, since the input is already int32 with some elempack (typically 4 or 8), scale vectors are loaded with matching width.
Sources: src/layer/arm/quantize_arm.cpp:237-249 src/layer/arm/quantize_arm.cpp:98-141 src/layer/arm/quantize_arm.cpp:143-208 src/layer/arm/dequantize_arm.cpp:39-55
Requantize supports inline activation to avoid an extra layer dispatch. The activation_type parameter selects the function applied between scaling and requantization:
| activation_type | Meaning |
|---|---|
| 0 | No activation |
| 1 | ReLU |
| 2 | Leaky ReLU (slope in activation_params[0]) |
| 3 | Clip (min/max in activation_params) |
| 4 | Sigmoid |
| other | Implementation-defined |
On ARM, the relu and leaky-relu cases have dedicated NEON-optimized helpers (float2int8relu, float2int8leakyrelu) that fold the activation into the int8 packing operation. On x86, the dispatch goes through activation_sse, activation_avx, and activation_avx512 from x86_activation.h.
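The scalar dispatch for the table above amounts to a small switch (the `activate` helper name is illustrative; the type codes match the table):

```cpp
#include <cmath>
#include <algorithm>

// Apply the activation selected by activation_type between scaling and
// requantization; params carries slope (leaky relu) or min/max (clip).
static float activate(float v, int type, const float* params)
{
    switch (type) {
    case 1: return v > 0.f ? v : 0.f;                           // relu
    case 2: return v > 0.f ? v : v * params[0];                 // leaky relu
    case 3: return std::min(std::max(v, params[0]), params[1]); // clip
    case 4: return 1.f / (1.f + std::exp(-v));                  // sigmoid
    default: return v;                                          // none
    }
}
```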
Sources: src/layer/requantize.cpp:55-65 src/layer/arm/requantize_arm.cpp:23-173 src/layer/x86/requantize_x86.cpp:104-175
For 3×3 convolutions with dilation=1, stride=1 (eligible for Winograd transforms), the calibration tool uses 6-bit weight quantization (max representable value 31) instead of the normal 7-bit (max value 127).
The 2-bit headroom is needed because the Winograd transform multiplies weights by transformation matrices that can increase their magnitude. All three calibration methods (KL, ACIQ, EQ) apply this rule.
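A hedged sketch of the range choice, with `weight_scale` and the `winograd_eligible` flag as illustrative names rather than the ncnn2table API:

```cpp
// Winograd-eligible 3x3 s1 d1 layers quantize weights against a 6-bit
// range (31) instead of 127, leaving headroom for the magnitude growth
// introduced by the Winograd weight transform.
static float weight_scale(float absmax, bool winograd_eligible)
{
    const float qmax = winograd_eligible ? 31.f : 127.f;
    return absmax > 0.f ? qmax / absmax : 1.f;
}
```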
Sources: tools/quantize/ncnn2table.cpp:332-358 tools/quantize/ncnn2table.cpp:808-835
The Quantize and Dequantize layers participate in ncnn's reduced-precision storage modes. The forward method of each architecture subclass dispatches based on the Option flags:
Diagram: Quantize_arm forward dispatch
For Dequantize_arm, the output type follows the same dispatch: when opt.use_bf16_storage, forward_bf16s is called and it emits uint16 (bfloat16) values using float2bfloat. When opt.use_fp16_storage, it emits __fp16 values.
Sources: src/layer/arm/quantize_arm.cpp:211-229 src/layer/arm/dequantize_arm.cpp:149-168
The table below summarizes the SIMD widths used in the INT8 quantize/dequantize layers per architecture:
| Architecture | Max elempack (float in) | Max elempack (int8 out) | Key instruction |
|---|---|---|---|
| ARM NEON (ARMv7) | 4 | 8 | vcvtq_s32_f32 + simulated round |
| ARM NEON (AArch64) | 4/8 | 8 | vcvtaq_s32_f32 |
| x86 SSE2 | 4 | 4 | _mm_packs_epi32 + _mm_packs_epi16 |
| x86 AVX2 | 8 | 8 | _mm256_cvtepi32_ps / _mm256_loadu_si256 |
| x86 AVX-512 | 16 | 16 | _mm512_cvtepi32_ps / float2int8_avx512 |
| MIPS MSA | 8 | 8 | __msa_ffint_s_w + __msa_sat_s_w |
| LoongArch LSX | 8 | 8 | __lsx_vftint_w_s + saturation |
Sources: src/layer/arm/quantize_arm.cpp:51-96 src/layer/x86/quantize_x86.cpp:70-120 src/layer/mips/quantize_mips.cpp src/layer/mips/requantize_mips.cpp src/layer/loongarch/requantize_loongarch.cpp
The .table file produced by ncnn2table (and consumed by ncnn2int8) has two sections:
layername_param_0 w0 w1 w2 ... wN
layername a0
Lines suffixed with _param_0 carry weight scales: one float per output channel (or per group for depthwise); plain layername lines carry a single activation scale. These map directly to weight_scales and bottom_blob_scales in QuantNet.
To exclude a layer from INT8 (mixed precision), comment out its weight scale line:
#conv1_param_0 156.639840536
Sources: tools/quantize/ncnn2table.cpp:141-184 docs/how-to-use-and-FAQ/quantized-int8-inference.md:115-124
Convolution, ConvolutionDepthWise, and InnerProduct layers carry an int8_scale_term field (param index 8). This is used by ncnn2int8 to record whether a layer has been quantized and what the scale-term type is. It is stored in the .param file alongside the normal convolution parameters.
The value distinguishes:
- 0: FP32 layer, no INT8 path
Sources: tools/quantize/ncnn2table.cpp:1005-1070
For the calibration tooling (ncnn2table, ncnn2int8), see page 8.2. The runtime steps from a user perspective are:
1. Quantize the model with ncnn2int8.
2. Load the resulting .param / .bin files via Net::load_param and Net::load_model.
3. The runtime detects INT8 layers automatically from int8_scale_term.
Sources: docs/how-to-use-and-FAQ/quantized-int8-inference.md:104-124