This document describes NCNN's depthwise convolution layer implementations, which provide specialized computation for depthwise separable convolutions commonly used in efficient neural network architectures like MobileNet, EfficientNet, and other mobile-optimized models. For standard convolution operations, see page 4.1. For fully-connected layers, see page 4.3.
Depthwise convolution is a specialized convolution operation where each input channel is convolved independently with its own set of filters, rather than mixing all channels. When group == num_input == num_output, the operation becomes a pure depthwise convolution with significantly reduced computational cost (1/C of standard convolution, where C is the number of channels). When these conditions are not met, NCNN falls back to group convolution by creating separate standard Convolution layers.
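The per-channel independence described above can be sketched as a naive loop (this is an illustrative sketch, not NCNN's actual implementation; padding and dilation are omitted for brevity). Note that the inner accumulation never mixes channels:

```cpp
#include <vector>

// Hypothetical naive depthwise convolution: each input channel is convolved
// only with its own k x k filter -- no cross-channel accumulation.
std::vector<float> depthwise_conv2d(const std::vector<float>& in, int C, int H, int W,
                                    const std::vector<float>& weight, int k, int stride)
{
    const int outH = (H - k) / stride + 1;
    const int outW = (W - k) / stride + 1;
    std::vector<float> out(C * outH * outW, 0.f);
    for (int c = 0; c < C; c++)           // each channel is fully independent
        for (int i = 0; i < outH; i++)
            for (int j = 0; j < outW; j++)
            {
                float sum = 0.f;
                for (int ky = 0; ky < k; ky++)
                    for (int kx = 0; kx < k; kx++)
                        sum += in[(c * H + i * stride + ky) * W + j * stride + kx]
                             * weight[(c * k + ky) * k + kx];
                out[(c * outH + i) * outW + j] = sum;
            }
    return out;
}
```

Dropping the inner loop over input channels is exactly where the 1/C cost reduction comes from.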
NCNN implements depthwise convolution through a three-tier architecture:
- `ConvolutionDepthWise` - Portable C++ reference implementation for all platforms
- `ConvolutionDepthWise_arm` - NEON intrinsics with pack4 support
- `ConvolutionDepthWise_x86` - SSE/AVX/AVX512 with pack4/pack8/pack16 support

Sources: src/layer/convolutiondepthwise.cpp1-271 src/layer/convolutiondepthwise.h1-74 src/layer/arm/convolutiondepthwise_arm.cpp1-681 src/layer/x86/convolutiondepthwise_x86.cpp1-800
Key Distinction: The depthwise condition channels * elempack == group && group == num_output determines whether true depthwise kernels are used or group convolution fallback.
Sources: src/layer/convolutiondepthwise.cpp178-212 src/layer/arm/convolutiondepthwise_arm.cpp344-508
Key Classes:
- `ConvolutionDepthWise::load_param()` - Loads parameters (num_output, kernel, stride, etc.)
- `ConvolutionDepthWise::create_pipeline()` - Checks the depthwise condition and creates group_ops if needed
- `ConvolutionDepthWise::forward()` - Dispatches to optimized kernels or group_ops
- `group_ops` vector - Contains individual Convolution layers when the layer is not true depthwise

Sources: src/layer/convolutiondepthwise.h10-74 src/layer/arm/convolutiondepthwise_arm.h10-47 src/layer/x86/convolutiondepthwise_x86.h10-44
The ConvolutionDepthWise layer accepts the following parameters via load_param():
| Parameter | ParamDict ID | Description | Default |
|---|---|---|---|
| `num_output` | 0 | Number of output channels | 0 |
| `kernel_w` | 1 | Kernel width | 0 |
| `kernel_h` | 11 | Kernel height | kernel_w |
| `dilation_w` | 2 | Dilation width | 1 |
| `dilation_h` | 12 | Dilation height | dilation_w |
| `stride_w` | 3 | Stride width | 1 |
| `stride_h` | 13 | Stride height | stride_w |
| `pad_left` | 4 | Left padding | 0 |
| `pad_right` | 15 | Right padding | pad_left |
| `pad_top` | 14 | Top padding | pad_left |
| `pad_bottom` | 16 | Bottom padding | pad_top |
| `pad_value` | 18 | Padding fill value | 0.f |
| `bias_term` | 5 | Whether bias is present | 0 |
| `weight_data_size` | 6 | Total weight data size | 0 |
| `group` | 7 | Number of groups | 1 |
| `int8_scale_term` | 8 | INT8 quantization mode | 0 |
| `activation_type` | 9 | Fused activation type | 0 |
| `activation_params` | 10 | Activation parameters | Mat() |
| `dynamic_weight` | 19 | Runtime weight input | 0 |
Critical Condition: When channels == group && group == num_output, optimized depthwise kernels are used. Otherwise, create_group_ops() is called to build one standard Convolution layer per group.
Sources: src/layer/convolutiondepthwise.cpp18-63
Key Operations:
- Checks the depthwise condition channels == group == num_output
- Creates group separate Convolution layers when not true depthwise
- Repacks weights from (maxk, channels) to packed format

Sources: src/layer/arm/convolutiondepthwise_arm.cpp57-167 src/layer/x86/convolutiondepthwise_x86.cpp46-137
Execution Paths:
- Optimized depthwise kernels when the depthwise condition holds
- Fallback to the per-group Convolution layers in group_ops

Sources: src/layer/arm/convolutiondepthwise_arm.cpp285-572 src/layer/x86/convolutiondepthwise_x86.cpp255-800
The ARM implementation provides highly optimized assembly kernels for common configurations:
| Function | Configuration | File |
|---|---|---|
| `convdw3x3s1_pack4_neon()` | 3×3, stride=1 | src/layer/arm/convolutiondepthwise_3x3_pack4.h |
| `convdw3x3s2_pack4_neon()` | 3×3, stride=2 | src/layer/arm/convolutiondepthwise_3x3_pack4.h |
| `convdw5x5s1_pack4_neon()` | 5×5, stride=1 | src/layer/arm/convolutiondepthwise_5x5_pack4.h |
| `convdw5x5s2_pack4_neon()` | 5×5, stride=2 | src/layer/arm/convolutiondepthwise_5x5_pack4.h |
| Function | Configuration | File |
|---|---|---|
| `convdw3x3s1_neon()` | 3×3, stride=1 | src/layer/arm/convolutiondepthwise_3x3.h |
| `convdw3x3s2_neon()` | 3×3, stride=2 | src/layer/arm/convolutiondepthwise_3x3.h |
| `convdw5x5s1_neon()` | 5×5, stride=1 | src/layer/arm/convolutiondepthwise_5x5.h |
| `convdw5x5s2_neon()` | 5×5, stride=2 | src/layer/arm/convolutiondepthwise_5x5.h |
| Function | SIMD | Configuration | Header |
|---|---|---|---|
| `convdw3x3s1_pack4_sse()` | SSE2 | pack4, 3×3, stride=1 | convolutiondepthwise_3x3_pack4.h |
| `convdw3x3s2_pack4_sse()` | SSE2 | pack4, 3×3, stride=2 | convolutiondepthwise_3x3_pack4.h |
| `convdw5x5s1_pack4_sse()` | SSE2 | pack4, 5×5, stride=1 | convolutiondepthwise_5x5_pack4.h |
| `convdw5x5s2_pack4_sse()` | SSE2 | pack4, 5×5, stride=2 | convolutiondepthwise_5x5_pack4.h |
| `convdw3x3s1_pack8_avx()` | AVX | pack8, 3×3, stride=1 | convolutiondepthwise_3x3_pack8.h |
| `convdw3x3s2_pack8_avx()` | AVX | pack8, 3×3, stride=2 | convolutiondepthwise_3x3_pack8.h |
| `convdw5x5s1_pack8_avx()` | AVX | pack8, 5×5, stride=1 | convolutiondepthwise_5x5_pack8.h |
| `convdw5x5s2_pack8_avx()` | AVX | pack8, 5×5, stride=2 | convolutiondepthwise_5x5_pack8.h |
| `convdw3x3s1_pack16_avx512()` | AVX512 | pack16, 3×3, stride=1 | convolutiondepthwise_3x3_pack16.h |
| `convdw3x3s2_pack16_avx512()` | AVX512 | pack16, 3×3, stride=2 | convolutiondepthwise_3x3_pack16.h |
| `convdw5x5s1_pack16_avx512()` | AVX512 | pack16, 5×5, stride=1 | convolutiondepthwise_5x5_pack16.h |
| `convdw5x5s2_pack16_avx512()` | AVX512 | pack16, 5×5, stride=2 | convolutiondepthwise_5x5_pack16.h |
The pack size is selected at create_pipeline time based on whether the channel count is divisible by 16, 8, or 4. For pack1, the scalar convdw3x3s1_sse() / convdw3x3s2_sse() kernels from convolutiondepthwise_3x3.h apply.
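That divisibility rule can be sketched as a small selector (illustrative only; the `avx512`/`avx` flags stand in for NCNN's compile-time feature macros, and `select_elempack` is a hypothetical name):

```cpp
// Hypothetical elempack selection: widest SIMD pack whose lane count
// evenly divides the channel count; fall back to scalar (pack1) otherwise.
int select_elempack(int channels, bool avx512, bool avx)
{
    if (avx512 && channels % 16 == 0) return 16;
    if (avx && channels % 8 == 0) return 8;
    if (channels % 4 == 0) return 4;
    return 1;
}
```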
Generic Fallback: When no optimized kernel matches, the code falls back to a generic loop at src/layer/arm/convolutiondepthwise_arm.cpp396-455 (pack4) or creates group_ops for arbitrary configurations.
Sources: src/layer/arm/convolutiondepthwise_arm.cpp349-507 src/layer/x86/convolutiondepthwise_x86.cpp18-32 src/layer/x86/convolutiondepthwise_x86.cpp255-650
The pack4 optimized kernels process 4 channels simultaneously using NEON vector instructions. This is the generic fallback loop used when no hand-coded assembly kernel matches the configuration.
Generic pack4 data flow (used when kernel is not 3×3 or 5×5):
The key fields used in this path are:
- weight_data_tm — contains weights repacked to (maxk, group, 4) layout via convert_packing() during create_pipeline
- space_ofs — precomputed array of flattened input offsets for each kernel position, accounting for dilation
- activation_ps() — inline NEON helper that applies the fused activation element-wise to float32x4_t

The outer loop iterates over groups (g = 0..channels-1), the middle loops over spatial positions (i, j), and the innermost loop accumulates maxk kernel positions. Each iteration of the inner loop uses vld1q_f32 + vmlaq_f32 to process 4 channels at once.
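A scalar sketch of the space_ofs precomputation described above (assuming an input row width `w`; the real loop then reads `input[space_ofs[k]]` for each of the maxk taps, so no stride or dilation arithmetic remains in the hot loop):

```cpp
#include <vector>

// Flatten every kernel tap into a single input offset, folding in dilation
// and the gap between kernel rows in the input image.
std::vector<int> make_space_ofs(int kernel_w, int kernel_h,
                                int dilation_w, int dilation_h, int w)
{
    std::vector<int> space_ofs(kernel_w * kernel_h);
    int p1 = 0;
    int p2 = 0;
    const int gap = w * dilation_h - kernel_w * dilation_w; // jump to next kernel row
    for (int i = 0; i < kernel_h; i++)
    {
        for (int j = 0; j < kernel_w; j++)
        {
            space_ofs[p1++] = p2;
            p2 += dilation_w;
        }
        p2 += gap;
    }
    return space_ofs;
}
```

For a 3×3 kernel with dilation 1 on a width-10 input, this yields offsets {0, 1, 2, 10, 11, 12, 20, 21, 22}.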
Performance: This achieves 4× throughput compared to scalar processing by processing 4 channels per SIMD register. The hand-coded assembly kernels (convdw3x3s1_pack4_neon, etc.) provide additional throughput via instruction-level parallelism and prefetch hints.
Sources: src/layer/arm/convolutiondepthwise_arm.cpp396-456
INT8 quantized depthwise convolution provides 4× memory reduction and faster inference on mobile devices:
INT8 Scale Modes (from src/layer/convolutiondepthwise.cpp82-113):
| Mode | int8_scale_term | Weights | Input | Output |
|---|---|---|---|---|
| Per-group weights | 1 or 101 | group scales | 1 scale (broadcast) | Optional (>100) |
| Per-tensor | 2 or 102 | 1 scale (broadcast) | 1 scale (broadcast) | Optional (>100) |
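The quantization step behind these modes can be sketched as follows (an illustrative sketch with a hypothetical function name; shown with round-to-nearest and clamping to the symmetric int8 range, which is the usual convention — NCNN's actual quantize path lives in its quantize layer):

```cpp
#include <cmath>

// Hypothetical per-value INT8 quantization: q = round(v * scale), clamped
// to [-127, 127]. With mode 1, `scale` is the weight scale of the group
// the value belongs to; with mode 2, one scale is broadcast to all groups.
signed char quantize_int8(float v, float scale)
{
    long q = std::lround(v * scale);
    if (q > 127) q = 127;
    if (q < -127) q = -127;
    return (signed char)q;
}
```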
Pack8 INT8 Kernels: ARM supports pack8 for INT8 to process 8 channels simultaneously using int8x8_t and int16x8_t accumulators:
- pack1: convdw3x3s1_int8_neon(), convdw3x3s2_int8_neon()
- pack8: convdw3x3s1_pack8_int8_neon(), convdw3x3s2_pack8_int8_neon()

Sources: src/layer/convolutiondepthwise.cpp82-141 src/layer/arm/convolutiondepthwise_arm.cpp64-69
When the depthwise condition is not met, NCNN creates individual Convolution layers for each group:
Forward Pass with Group Ops:
Each group processes independently:
- group_ops[g]->forward() is invoked for each group

This fallback ensures correctness for arbitrary group configurations while true depthwise cases use optimized kernels.
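The slice-run-concatenate pattern described above can be sketched like this (illustrative only; `forward_groups` and the `conv_g` callback are hypothetical stand-ins for NCNN's channel_range views and group_ops[g]->forward()):

```cpp
#include <functional>
#include <vector>

// Hypothetical group-convolution fallback: take each group's slice of input
// channels, run that group's own convolution, and stack the outputs.
// `plane` is the per-channel spatial size (H * W).
std::vector<float> forward_groups(
    const std::vector<float>& bottom, int channels, int group, int plane,
    const std::function<std::vector<float>(const std::vector<float>&)>& conv_g)
{
    const int cpg = channels / group; // channels per group
    std::vector<float> top;
    for (int g = 0; g < group; g++)
    {
        // view of cpg channels starting at channel g * cpg
        std::vector<float> slice(bottom.begin() + g * cpg * plane,
                                 bottom.begin() + (g + 1) * cpg * plane);
        std::vector<float> out_g = conv_g(slice);
        top.insert(top.end(), out_g.begin(), out_g.end());
    }
    return top;
}
```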
Sources: src/layer/arm/convolutiondepthwise_arm.cpp169-264 src/layer/x86/convolutiondepthwise_x86.cpp139-234
NCNN supports multiple precision modes for depthwise convolution:
| Mode | Data Type | Storage Size | Conversion | Use Case |
|---|---|---|---|---|
| FP32 | float | 4 bytes | Native | Default, highest accuracy |
| FP16 Storage | unsigned short | 2 bytes | float32_to_float16() at load | ARM82 ASIMDHP, 2× memory reduction |
| FP16 Arithmetic | __fp16 | 2 bytes | Native FP16 compute | ARM82 with use_fp16_arithmetic |
| BF16 Storage | unsigned short | 2 bytes | float32_to_bfloat16() at load | ARM NEON, maintains FP32 range |
| INT8 | signed char | 1 byte | quantize_to_int8() with scales | Mobile inference, 4× reduction |
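The BF16 row of the table relies on a simple property: BF16 is just the top 16 bits of an IEEE-754 float, so the FP32 exponent range is preserved. A minimal sketch (truncating conversion shown for clarity; a production converter like NCNN's also rounds):

```cpp
#include <cstdint>
#include <cstring>

// Keep the sign, full 8-bit exponent, and top 7 mantissa bits.
uint16_t float32_to_bfloat16_trunc(float v)
{
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    return (uint16_t)(bits >> 16);
}

// Widening back is exact: shift into the high half, zero-fill the rest.
float bfloat16_to_float32(uint16_t b)
{
    uint32_t bits = (uint32_t)b << 16;
    float v;
    std::memcpy(&v, &bits, sizeof(v));
    return v;
}
```

Values whose mantissa fits in 7 bits (such as 1.0f or 2.5f) round-trip exactly; others lose low mantissa bits but never range.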
Dispatch Logic: ConvolutionDepthWise_arm::forward at src/layer/arm/convolutiondepthwise_arm.cpp285-309 first checks int8_scale_term and dispatches to forward_int8_arm. Otherwise it checks bottom_blob.elembits():
- elembits == 16 + use_fp16_storage + use_fp16_arithmetic → forward_fp16sa (native FP16 compute, ARM82 ASIMDHP)
- elembits == 16 + use_fp16_storage → forward_fp16s (FP16 storage, FP32 compute)
- elembits == 16 + use_bf16_storage → forward_bf16s
- otherwise → forward (FP32)

BF16 Pack4 Kernels: src/layer/arm/convolutiondepthwise_3x3_pack4_bf16s.h and src/layer/arm/convolutiondepthwise_5x5_pack4_bf16s.h provide BF16-optimized implementations. Weights are stored as unsigned short in weight_data_tm and converted during convolution.
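The dispatch order can be sketched as a small decision function (illustrative only; `pick_path` and the boolean flags are hypothetical stand-ins for the Option fields and the member checks in forward):

```cpp
// Hypothetical precision-path selection, checked in priority order:
// INT8 first, then the 16-bit storage variants, then plain FP32.
enum Path { PATH_INT8, PATH_FP16SA, PATH_FP16S, PATH_BF16S, PATH_FP32 };

Path pick_path(int int8_scale_term, int elembits,
               bool fp16_storage, bool fp16_arith, bool bf16_storage)
{
    if (int8_scale_term) return PATH_INT8;
    if (elembits == 16 && fp16_storage && fp16_arith) return PATH_FP16SA;
    if (elembits == 16 && fp16_storage) return PATH_FP16S;
    if (elembits == 16 && bf16_storage) return PATH_BF16S;
    return PATH_FP32;
}
```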
Sources: src/layer/arm/convolutiondepthwise_arm.cpp285-310 src/layer/arm/convolutiondepthwise_arm.cpp684-900
The inverse operation, depthwise transposed convolution, is implemented in DeconvolutionDepthWise:
- DeconvolutionDepthWise (src/layer/deconvolutiondepthwise.cpp) - portable reference implementation
- DeconvolutionDepthWise_arm (src/layer/arm/deconvolutiondepthwise_arm.cpp) - NEON-optimized implementation
- Falls back to group_ops with Deconvolution layers when not true depthwise

The implementation mirrors depthwise convolution but reverses the spatial transformation, expanding the spatial dimensions rather than reducing them.
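The spatial expansion follows the standard transposed-convolution size formula, the inverse of the convolution size formula (a sketch; output padding/cropping is omitted):

```cpp
// Output extent of a transposed convolution along one axis:
// out = (in - 1) * stride + dilated_kernel_extent - 2 * pad
int deconv_out_size(int in, int kernel, int dilation, int stride, int pad)
{
    const int kernel_extent = dilation * (kernel - 1) + 1;
    return (in - 1) * stride + kernel_extent - 2 * pad;
}
```

As a sanity check, a 3×3 stride-2 convolution maps 9 → 4, and the formula maps 4 back to 9.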
Sources: src/layer/deconvolutiondepthwise.cpp1-351 src/layer/arm/deconvolutiondepthwise_arm.cpp1-1100
Computational Complexity:

- True depthwise: O(H × W × C × k²) multiply-accumulates, versus O(H × W × C_in × C_out × k²) for standard convolution

Memory Access Patterns:

- Weights are repacked once in create_pipeline() to avoid runtime overhead
- The space_ofs array (src/layer/arm/convolutiondepthwise_arm.cpp400-416) enables indexed memory access without stride calculations in the hot loop

Optimization Levels:

- Hand-coded SIMD kernels for 3×3 and 5×5 configurations
- Generic packed SIMD fallback loop for other kernel sizes
- Group convolution fallback via group_ops
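The complexity claim can be checked with a small MAC-count comparison (illustrative helper names; when C_in == C_out == C, the depthwise cost is exactly 1/C of the standard cost):

```cpp
// Multiply-accumulate counts over one output feature map.
long long depthwise_macs(int outH, int outW, int C, int k)
{
    return 1LL * outH * outW * C * k * k;          // one filter per channel
}

long long standard_macs(int outH, int outW, int Cin, int Cout, int k)
{
    return 1LL * outH * outW * Cin * Cout * k * k; // every input feeds every output
}
```

For a 32×32 output with 64 channels and a 3×3 kernel, the depthwise count times 64 equals the standard count.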