This document describes how ncnn leverages x86 SIMD instruction sets (SSE2, AVX, AVX2, AVX512, VNNI) to accelerate neural network inference on x86 and x86_64 processors. The optimizations apply to CPU layer implementations for convolution, inner product, depthwise convolution, and related operations.
For ARM-specific SIMD optimizations, see ARM NEON Optimizations. For runtime CPU detection mechanisms shared across architectures, see Runtime CPU Detection and Dispatch. For INT8 quantization with VNNI instructions, see INT8 Quantization and Precision Modes.
ncnn's x86 backend supports multiple SIMD instruction set levels through conditional compilation and runtime detection:
| Instruction Set | Register Width | Element Packing | Macro Flag | CPU Requirements |
|---|---|---|---|---|
| SSE2 | 128-bit | pack4 (4×fp32) | __SSE2__ | Pentium 4+ (2001) |
| AVX | 256-bit | pack8 (8×fp32) | __AVX__ | Sandy Bridge+ (2011) |
| AVX2 | 256-bit | pack8 + FMA | __AVX2__ | Haswell+ (2013) |
| AVX512F | 512-bit | pack16 (16×fp32) | __AVX512F__ | Skylake-X+ (2017) |
| AVX-VNNI | 256-bit | INT8 dot products | __AVXVNNI__ | Alder Lake+ (2021) |
The codebase uses preprocessor directives to include SIMD-specific code paths:
Sources: src/layer/x86/convolution_x86.cpp6-17
ncnn uses platform-specific layer subclasses that override base layer implementations:
Diagram: Layer Pipeline Creation and Element Packing Selection
The element packing strategy is determined during create_pipeline():
Element packing groups multiple feature map channels into SIMD registers, enabling parallel computation. A pack4 layout stores 4 consecutive channels in a single 128-bit SSE register, pack8 uses 256-bit AVX registers for 8 channels, and pack16 uses 512-bit AVX-512 registers for 16 channels.
Sources: src/layer/x86/convolution_x86.cpp332-349 src/layer/x86/convolutiondepthwise_x86.cpp66-78
The Convolution_x86 class selects among multiple implementation strategies during pipeline creation:
Diagram: Convolution Implementation Strategy Selection
The selection logic is based on several heuristics:
The selection logic is based on several heuristics:

- `num_input * num_output * kernel_w * kernel_h * dilation_w * dilation_h * stride_w * stride_h * sizeof(float) * 2 > l2_cache_size`: the weight working set overflows the L2 cache, so the im2col+GEMM path is preferred
- `num_input < 16 || num_output < 16`: channel counts too small for Winograd to pay off
- the `test_prefer_winograd23()` and `test_prefer_winograd63()` functions, which encode profiled performance data for choosing between Winograd variants

Sources: src/layer/x86/convolution_x86.cpp269-483 src/layer/x86/convolution_x86.cpp105-267
Winograd convolution reduces arithmetic complexity for 3×3 kernels by transforming the convolution into element-wise multiplications in a different domain:
Diagram: Winograd F(4,3) Convolution Data Flow
The three Winograd variants differ in output tile size:
| Variant | Output Tile Size | Input Tile Size | Transform Matrices | Use Case |
|---|---|---|---|---|
| F(2,3) | 2×2 | 4×4 | 4×4 matrices | Small feature maps (3-14 px) |
| F(4,3) | 4×4 | 6×6 | 6×6 matrices | Medium feature maps (default) |
| F(6,3) | 6×6 | 8×8 | 8×8 matrices | Large feature maps, small channels (≤32) |
The selection between variants uses profiled performance data encoded in lookup tables:
Sources: src/layer/x86/convolution_x86.cpp105-267 src/layer/x86/convolution_x86.cpp353-437
The conv3x3s1_pack8_avx() function implements a hand-optimized 3×3 convolution for packed 8-channel data:
Diagram: conv3x3s1_pack8_avx() Execution Flow
Key optimization techniques:
- `_mm256_fmadd_ps()` for fused multiply-add when AVX2 is available

Specialized kernels exist for multiple configurations:
| Kernel Function | elempack | out_elempack | Kernel Size | Stride | AVX Level |
|---|---|---|---|---|---|
| `conv3x3s1_pack8_avx` | 8 | 8 | 3×3 | 1 | AVX |
| `conv3x3s1_pack1to8_avx` | 1 | 8 | 3×3 | 1 | AVX |
| `conv3x3s2_pack1to8_avx` | 1 | 8 | 3×3 | 2 | AVX |
| `conv3x3s1_pack16to1_avx512` | 16 | 1 | 3×3 | 1 | AVX512F |
| `conv2x2s1_pack8_avx` | 8 | 8 | 2×2 | 1 | AVX |
Sources: src/layer/x86/convolution_x86.cpp700-778 src/layer/x86/convolution_3x3_pack8.h
For larger kernels or when L2 cache is insufficient, ncnn uses the classical im2col + GEMM approach:
Diagram: im2col + GEMM Convolution Pipeline
The GEMM kernels are highly optimized with multiple specializations:
- `gemm_transB_packed_tile()` for SSE2 with 4-channel packing
- `gemm_transB_packed_tile_avx()` for AVX with 8-channel packing
- `gemm_transB_packed_tile_avx512()` for AVX-512 with 16-channel packing

The GEMM implementation uses:

- weight packing performed during `create_pipeline()` to optimize memory access patterns

Sources: src/layer/x86/convolution_x86.cpp453-461 src/layer/x86/convolution_im2col_gemm.h
The InnerProduct_x86 layer implements fully connected operations as matrix-matrix multiplication:
Diagram: InnerProduct Layer Dispatch and Execution
Weights are transformed during pipeline creation to optimize memory access:
This layout allows each output neuron group (pack4/8/16) to have contiguous weight access, enabling efficient SIMD loads.
Sources: src/layer/x86/innerproduct_x86.cpp68-75 src/layer/x86/innerproduct_fp.h
Depthwise convolutions (where group == channels == num_output) have dedicated optimizations:
Diagram: Depthwise Convolution Dispatch Strategy
The depthwise convolution applies a separate 3×3 filter to each channel independently:
Each channel group of 8 is processed independently, with each spatial position computing a dot product of the 3×3 window with the 3×3 kernel weights.
Sources: src/layer/x86/convolutiondepthwise_x86.cpp255-453 src/layer/x86/convolutiondepthwise_3x3_pack8.h
ncnn uses consistent intrinsics patterns across layers:
| Operation | SSE2 (pack4) | AVX (pack8) | AVX-512 (pack16) |
|---|---|---|---|
| Zero initialization | _mm_setzero_ps() | _mm256_setzero_ps() | _mm512_setzero_ps() |
| Load aligned | _mm_load_ps() | _mm256_load_ps() | _mm512_load_ps() |
| Load unaligned | _mm_loadu_ps() | _mm256_loadu_ps() | _mm512_loadu_ps() |
| Store unaligned | _mm_storeu_ps() | _mm256_storeu_ps() | _mm512_storeu_ps() |
| Multiply-add | _mm_add_ps(_mm_mul_ps()) | _mm256_fmadd_ps() | _mm512_fmadd_ps() |
| Broadcast scalar | _mm_set1_ps() | _mm256_broadcast_ss() | _mm512_set1_ps() |
When AVX2 or AVX-512 is available, ncnn uses FMA instructions for efficiency:
FMA reduces instruction count and improves floating-point precision by eliminating intermediate rounding.
Activation functions are implemented with SIMD-specific helper functions:
Sources: src/layer/x86/x86_activation.h src/layer/x86/convolution_x86.cpp700-738
Standard weight layout is (kh × kw × inch × outch). For SIMD execution, weights are transformed to (pb × pa × kh × kw × inch/pa × outch/pb) where:
- `pa` = input element pack size (4, 8, or 16)
- `pb` = output element pack size (4, 8, or 16)

This transformation enables:

- loading `elempack × out_elempack` weights consecutively

Sources: src/layer/x86/convolution_x86.cpp69-103 src/layer/x86/convolution_x86.cpp463-477
Theoretical peak performance (GFLOPS) for a 3 GHz CPU:
| Instruction Set | FP32 Operations/Cycle | Peak GFLOPS (1 core) | Typical Speedup vs Scalar |
|---|---|---|---|
| Scalar | 1 | 3 | 1× |
| SSE2 | 4 | 12 | 3-4× |
| AVX + FMA | 16 | 48 | 12-15× |
| AVX-512 + FMA | 32 | 96 | 25-30× |
Actual speedups depend on:
- channel counts that are not multiples of the pack size (`num_input % elempack != 0`), which force partial fallback paths

| Strategy | Arithmetic Intensity | Memory Traffic | Best For |
|---|---|---|---|
| Direct Convolution | Low-Medium | High (repeated loads) | Small kernels (3×3, 5×5), stride>1 |
| Winograd | High | Low (tile buffering) | 3×3 kernels, stride=1, medium-large feature maps |
| im2col+GEMM | High | Very High (im2col overhead) | Large kernels (≥7×7), large channel counts |
Sources: src/layer/x86/convolution_x86.cpp450-452 src/layer/x86/convolution_x86.cpp105-267