This document describes how ncnn leverages x86 SIMD instruction sets (SSE2, AVX, AVX2, AVX512, VNNI) to accelerate neural network inference on x86 and x86_64 processors. The optimizations apply to CPU layer implementations for convolution, inner product, depthwise convolution, and related operations.
For ARM-specific SIMD optimizations, see ARM NEON Optimizations. For runtime CPU detection mechanisms shared across architectures, see Runtime CPU Detection and Dispatch. For INT8 quantization with VNNI instructions, see INT8 Quantization and Precision Modes.
ncnn's x86 backend supports multiple SIMD instruction set levels through conditional compilation and runtime detection:
| Instruction Set | Register Width | Element Packing | Macro Flag | CPU Requirements |
|---|---|---|---|---|
| SSE2 | 128-bit | pack4 (4×fp32) | __SSE2__ | Pentium 4+ (2001) |
| AVX | 256-bit | pack8 (8×fp32) | __AVX__ | Sandy Bridge+ (2011) |
| AVX2 | 256-bit | pack8 + FMA | __AVX2__ | Haswell+ (2013) |
| AVX512F | 512-bit | pack16 (16×fp32) | __AVX512F__ | Skylake-X+ (2017) |
| AVX-VNNI | 256-bit | INT8 dot products | __AVXVNNI__ | Alder Lake+ (2021) |
The codebase uses preprocessor directives to include SIMD-specific code paths:
Sources: src/layer/x86/convolution_x86.cpp6-17
ncnn uses platform-specific layer subclasses that override base layer implementations:
Diagram: Layer Pipeline Creation and Element Packing Selection
The element packing strategy is determined during create_pipeline():
Element packing groups multiple feature map channels into SIMD registers, enabling parallel computation. A pack4 layout stores 4 consecutive channels in a single 128-bit SSE register, pack8 uses 256-bit AVX registers for 8 channels, and pack16 uses 512-bit AVX-512 registers for 16 channels.
Sources: src/layer/x86/convolution_x86.cpp332-349 src/layer/x86/convolutiondepthwise_x86.cpp66-78
The Convolution_x86 class selects among multiple implementation strategies during pipeline creation:
Diagram: Convolution Implementation Strategy Selection
The selection logic is based on several heuristics:
The selection logic is based on several heuristics:

- `num_input * num_output * kernel_w * kernel_h * dilation_w * dilation_h * stride_w * stride_h * sizeof(float) * 2 > l2_cache_size`: the weight working set overflows the L2 cache, so the im2col+GEMM path is preferred
- `num_input < 16 || num_output < 16`: channel counts too small for Winograd to pay off
- the `test_prefer_winograd23()` and `test_prefer_winograd63()` functions, which encode profiled performance data for choosing between Winograd variants

Sources: src/layer/x86/convolution_x86.cpp269-483 src/layer/x86/convolution_x86.cpp105-267
Winograd convolution reduces arithmetic complexity for 3×3 kernels by transforming the convolution into element-wise multiplications in a different domain:
Diagram: Winograd F(4,3) Convolution Data Flow
The three Winograd variants differ in output tile size:
| Variant | Output Tile Size | Input Tile Size | Transform Matrices | Use Case |
|---|---|---|---|---|
| F(2,3) | 2×2 | 4×4 | 4×4 matrices | Small feature maps (3-14 px) |
| F(4,3) | 4×4 | 6×6 | 6×6 matrices | Medium feature maps (default) |
| F(6,3) | 6×6 | 8×8 | 8×8 matrices | Large feature maps, small channels (≤32) |
The selection between variants uses profiled performance data encoded in lookup tables:
Sources: src/layer/x86/convolution_x86.cpp105-267 src/layer/x86/convolution_x86.cpp353-437
The conv3x3s1_pack8_avx() function implements a hand-optimized 3×3 convolution for packed 8-channel data:
Diagram: conv3x3s1_pack8_avx() Execution Flow
Key optimization techniques:
- `_mm256_fmadd_ps()` for fused multiply-add when AVX2 is available

Specialized kernels exist for multiple configurations:
| Kernel Function | elempack | out_elempack | Kernel Size | Stride | AVX Level |
|---|---|---|---|---|---|
| `conv3x3s1_pack8_avx` | 8 | 8 | 3×3 | 1 | AVX |
| `conv3x3s1_pack1to8_avx` | 1 | 8 | 3×3 | 1 | AVX |
| `conv3x3s2_pack1to8_avx` | 1 | 8 | 3×3 | 2 | AVX |
| `conv3x3s1_pack16to1_avx512` | 16 | 1 | 3×3 | 1 | AVX512F |
| `conv2x2s1_pack8_avx` | 8 | 8 | 2×2 | 1 | AVX |
Sources: src/layer/x86/convolution_x86.cpp700-778 src/layer/x86/convolution_3x3_pack8.h
For larger kernels or when L2 cache is insufficient, ncnn uses the classical im2col + GEMM approach:
Diagram: im2col + GEMM Convolution Pipeline
The GEMM kernels are highly optimized with multiple specializations:
- `gemm_transB_packed_tile()` for SSE2 with 4-channel packing
- `gemm_transB_packed_tile_avx()` for AVX with 8-channel packing
- `gemm_transB_packed_tile_avx512()` for AVX-512 with 16-channel packing

The GEMM implementation uses:

- weight packing performed during `create_pipeline()` to optimize memory access patterns

Sources: src/layer/x86/convolution_x86.cpp453-461 src/layer/x86/convolution_im2col_gemm.h
The InnerProduct_x86 layer implements fully connected operations as matrix-matrix multiplication:
Diagram: InnerProduct Layer Dispatch and Execution
Weights are transformed during pipeline creation to optimize memory access:
This layout allows each output neuron group (pack4/8/16) to have contiguous weight access, enabling efficient SIMD loads.
Sources: src/layer/x86/innerproduct_x86.cpp68-75 src/layer/x86/innerproduct_fp.h
Depthwise convolutions (where group == channels == num_output) have dedicated optimizations:
Diagram: Depthwise Convolution Dispatch Strategy
The depthwise convolution applies a separate 3×3 filter to each channel independently:
Each channel group of 8 is processed independently, with each spatial position computing a dot product of the 3×3 window with the 3×3 kernel weights.
Sources: src/layer/x86/convolutiondepthwise_x86.cpp255-453 src/layer/x86/convolutiondepthwise_3x3_pack8.h
ncnn uses consistent intrinsics patterns across layers:
| Operation | SSE2 (pack4) | AVX (pack8) | AVX-512 (pack16) |
|---|---|---|---|
| Zero initialization | _mm_setzero_ps() | _mm256_setzero_ps() | _mm512_setzero_ps() |
| Load aligned | _mm_load_ps() | _mm256_load_ps() | _mm512_load_ps() |
| Load unaligned | _mm_loadu_ps() | _mm256_loadu_ps() | _mm512_loadu_ps() |
| Store unaligned | _mm_storeu_ps() | _mm256_storeu_ps() | _mm512_storeu_ps() |
| Multiply-add | _mm_add_ps(_mm_mul_ps()) | _mm256_fmadd_ps() | _mm512_fmadd_ps() |
| Broadcast scalar | _mm_set1_ps() | _mm256_broadcast_ss() | _mm512_set1_ps() |
When AVX2 or AVX-512 is available, ncnn uses FMA instructions for efficiency:
FMA reduces instruction count and improves floating-point precision by eliminating intermediate rounding.
Activation functions are implemented with SIMD-specific helper functions:
Sources: src/layer/x86/x86_activation.h src/layer/x86/convolution_x86.cpp700-738
Standard weight layout is (kh × kw × inch × outch). For SIMD execution, weights are transformed to (pb × pa × kh × kw × inch/pa × outch/pb) where:
- `pa` = input element pack size (4, 8, or 16)
- `pb` = output element pack size (4, 8, or 16)

This transformation enables:

- loading `elempack × out_elempack` weights consecutively

Sources: src/layer/x86/convolution_x86.cpp69-103 src/layer/x86/convolution_x86.cpp463-477
Theoretical peak performance (GFLOPS) for a 3 GHz CPU:
| Instruction Set | FP32 Operations/Cycle | Peak GFLOPS (1 core) | Typical Speedup vs Scalar |
|---|---|---|---|
| Scalar | 1 | 3 | 1× |
| SSE2 | 4 | 12 | 3-4× |
| AVX + FMA | 16 | 48 | 12-15× |
| AVX-512 + FMA | 32 | 96 | 25-30× |
Actual speedups depend on:
- channel counts that are not multiples of the pack size (`num_input % elempack != 0`), which force partial fallback paths

| Strategy | Arithmetic Intensity | Memory Traffic | Best For |
|---|---|---|---|
| Direct Convolution | Low-Medium | High (repeated loads) | Small kernels (3×3, 5×5), stride>1 |
| Winograd | High | Low (tile buffering) | 3×3 kernels, stride=1, medium-large feature maps |
| im2col+GEMM | High | Very High (im2col overhead) | Large kernels (≥7×7), large channel counts |
Sources: src/layer/x86/convolution_x86.cpp450-452 src/layer/x86/convolution_x86.cpp105-267