This document describes NCNN's depthwise convolution layer implementations, which provide specialized computation for depthwise separable convolutions commonly used in efficient neural network architectures like MobileNet, EfficientNet, and other mobile-optimized models. For standard convolution operations, see page 4.1. For fully-connected layers, see page 4.3.
Depthwise convolution is a specialized convolution operation where each input channel is convolved independently with its own set of filters, rather than mixing all channels. When group == num_input == num_output, the operation becomes a pure depthwise convolution with significantly reduced computational cost (1/C of standard convolution, where C is the number of channels). When these conditions are not met, NCNN falls back to group convolution by creating separate standard Convolution layers.
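The per-channel independence described above can be sketched as a naive loop (this is an illustrative sketch, not NCNN's actual implementation; padding and dilation are omitted for brevity). Note that the inner accumulation never mixes channels:

```cpp
#include <vector>

// Hypothetical naive depthwise convolution: each input channel is convolved
// only with its own k x k filter -- no cross-channel accumulation.
std::vector<float> depthwise_conv2d(const std::vector<float>& in, int C, int H, int W,
                                    const std::vector<float>& weight, int k, int stride)
{
    const int outH = (H - k) / stride + 1;
    const int outW = (W - k) / stride + 1;
    std::vector<float> out(C * outH * outW, 0.f);
    for (int c = 0; c < C; c++)           // each channel is fully independent
        for (int i = 0; i < outH; i++)
            for (int j = 0; j < outW; j++)
            {
                float sum = 0.f;
                for (int ky = 0; ky < k; ky++)
                    for (int kx = 0; kx < k; kx++)
                        sum += in[(c * H + i * stride + ky) * W + j * stride + kx]
                             * weight[(c * k + ky) * k + kx];
                out[(c * outH + i) * outW + j] = sum;
            }
    return out;
}
```

Dropping the inner loop over input channels is exactly where the 1/C cost reduction comes from.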
NCNN implements depthwise convolution through a three-tier architecture:
- `ConvolutionDepthWise` - Portable C++ reference implementation for all platforms
- `ConvolutionDepthWise_arm` - NEON intrinsics with pack4 support
- `ConvolutionDepthWise_x86` - SSE/AVX/AVX512 with pack4/pack8/pack16 support

Sources: src/layer/convolutiondepthwise.cpp1-271 src/layer/convolutiondepthwise.h1-74 src/layer/arm/convolutiondepthwise_arm.cpp1-681 src/layer/x86/convolutiondepthwise_x86.cpp1-800
Key Distinction: The depthwise condition channels * elempack == group && group == num_output determines whether true depthwise kernels are used or group convolution fallback.
Sources: src/layer/convolutiondepthwise.cpp178-212 src/layer/arm/convolutiondepthwise_arm.cpp344-508
Key Classes:
- `ConvolutionDepthWise::load_param()` - Loads parameters (num_output, kernel, stride, etc.)
- `ConvolutionDepthWise::create_pipeline()` - Checks the depthwise condition and creates group_ops if needed
- `ConvolutionDepthWise::forward()` - Dispatches to optimized kernels or group_ops
- `group_ops` vector - Contains individual Convolution layers when the layer is not true depthwise

Sources: src/layer/convolutiondepthwise.h10-74 src/layer/arm/convolutiondepthwise_arm.h10-47 src/layer/x86/convolutiondepthwise_x86.h10-44
The ConvolutionDepthWise layer accepts the following parameters via load_param():
| Parameter | ParamDict ID | Description | Default |
|---|---|---|---|
| `num_output` | 0 | Number of output channels | 0 |
| `kernel_w` | 1 | Kernel width | 0 |
| `kernel_h` | 11 | Kernel height | kernel_w |
| `dilation_w` | 2 | Dilation width | 1 |
| `dilation_h` | 12 | Dilation height | dilation_w |
| `stride_w` | 3 | Stride width | 1 |
| `stride_h` | 13 | Stride height | stride_w |
| `pad_left` | 4 | Left padding | 0 |
| `pad_right` | 15 | Right padding | pad_left |
| `pad_top` | 14 | Top padding | pad_left |
| `pad_bottom` | 16 | Bottom padding | pad_top |
| `pad_value` | 18 | Padding fill value | 0.f |
| `bias_term` | 5 | Whether bias is present | 0 |
| `weight_data_size` | 6 | Total weight data size | 0 |
| `group` | 7 | Number of groups | 1 |
| `int8_scale_term` | 8 | INT8 quantization mode | 0 |
| `activation_type` | 9 | Fused activation type | 0 |
| `activation_params` | 10 | Activation parameters | Mat() |
| `dynamic_weight` | 19 | Runtime weight input | 0 |
Critical Condition: When channels == group && group == num_output, optimized depthwise kernels are used. Otherwise, create_group_ops() is called to build one standard Convolution layer per group.
Sources: src/layer/convolutiondepthwise.cpp18-63
Key Operations:
- Checks the depthwise condition channels == group == num_output
- Creates group separate Convolution layers when not true depthwise
- Repacks weights from (maxk, channels) to packed format

Sources: src/layer/arm/convolutiondepthwise_arm.cpp57-167 src/layer/x86/convolutiondepthwise_x86.cpp46-137
Execution Paths:
- Optimized depthwise kernels when the depthwise condition holds
- Fallback to the per-group Convolution layers in group_ops

Sources: src/layer/arm/convolutiondepthwise_arm.cpp285-572 src/layer/x86/convolutiondepthwise_x86.cpp255-800
The ARM implementation provides highly optimized assembly kernels for common configurations:
| Function | Configuration | File |
|---|---|---|
| `convdw3x3s1_pack4_neon()` | 3×3, stride=1 | src/layer/arm/convolutiondepthwise_3x3_pack4.h |
| `convdw3x3s2_pack4_neon()` | 3×3, stride=2 | src/layer/arm/convolutiondepthwise_3x3_pack4.h |
| `convdw5x5s1_pack4_neon()` | 5×5, stride=1 | src/layer/arm/convolutiondepthwise_5x5_pack4.h |
| `convdw5x5s2_pack4_neon()` | 5×5, stride=2 | src/layer/arm/convolutiondepthwise_5x5_pack4.h |
| Function | Configuration | File |
|---|---|---|
| `convdw3x3s1_neon()` | 3×3, stride=1 | src/layer/arm/convolutiondepthwise_3x3.h |
| `convdw3x3s2_neon()` | 3×3, stride=2 | src/layer/arm/convolutiondepthwise_3x3.h |
| `convdw5x5s1_neon()` | 5×5, stride=1 | src/layer/arm/convolutiondepthwise_5x5.h |
| `convdw5x5s2_neon()` | 5×5, stride=2 | src/layer/arm/convolutiondepthwise_5x5.h |
| Function | SIMD | Configuration | Header |
|---|---|---|---|
| `convdw3x3s1_pack4_sse()` | SSE2 | pack4, 3×3, stride=1 | convolutiondepthwise_3x3_pack4.h |
| `convdw3x3s2_pack4_sse()` | SSE2 | pack4, 3×3, stride=2 | convolutiondepthwise_3x3_pack4.h |
| `convdw5x5s1_pack4_sse()` | SSE2 | pack4, 5×5, stride=1 | convolutiondepthwise_5x5_pack4.h |
| `convdw5x5s2_pack4_sse()` | SSE2 | pack4, 5×5, stride=2 | convolutiondepthwise_5x5_pack4.h |
| `convdw3x3s1_pack8_avx()` | AVX | pack8, 3×3, stride=1 | convolutiondepthwise_3x3_pack8.h |
| `convdw3x3s2_pack8_avx()` | AVX | pack8, 3×3, stride=2 | convolutiondepthwise_3x3_pack8.h |
| `convdw5x5s1_pack8_avx()` | AVX | pack8, 5×5, stride=1 | convolutiondepthwise_5x5_pack8.h |
| `convdw5x5s2_pack8_avx()` | AVX | pack8, 5×5, stride=2 | convolutiondepthwise_5x5_pack8.h |
| `convdw3x3s1_pack16_avx512()` | AVX512 | pack16, 3×3, stride=1 | convolutiondepthwise_3x3_pack16.h |
| `convdw3x3s2_pack16_avx512()` | AVX512 | pack16, 3×3, stride=2 | convolutiondepthwise_3x3_pack16.h |
| `convdw5x5s1_pack16_avx512()` | AVX512 | pack16, 5×5, stride=1 | convolutiondepthwise_5x5_pack16.h |
| `convdw5x5s2_pack16_avx512()` | AVX512 | pack16, 5×5, stride=2 | convolutiondepthwise_5x5_pack16.h |
The pack size is selected at create_pipeline time based on whether the channel count is divisible by 16, 8, or 4. For pack1, the scalar convdw3x3s1_sse() / convdw3x3s2_sse() kernels from convolutiondepthwise_3x3.h apply.
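That divisibility rule can be sketched as a small selector (illustrative only; the `avx512`/`avx` flags stand in for NCNN's compile-time feature macros, and `select_elempack` is a hypothetical name):

```cpp
// Hypothetical elempack selection: widest SIMD pack whose lane count
// evenly divides the channel count; fall back to scalar (pack1) otherwise.
int select_elempack(int channels, bool avx512, bool avx)
{
    if (avx512 && channels % 16 == 0) return 16;
    if (avx && channels % 8 == 0) return 8;
    if (channels % 4 == 0) return 4;
    return 1;
}
```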
Generic Fallback: When no optimized kernel matches, the code falls back to a generic loop at src/layer/arm/convolutiondepthwise_arm.cpp396-455 (pack4) or creates group_ops for arbitrary configurations.
Sources: src/layer/arm/convolutiondepthwise_arm.cpp349-507 src/layer/x86/convolutiondepthwise_x86.cpp18-32 src/layer/x86/convolutiondepthwise_x86.cpp255-650
The pack4 optimized kernels process 4 channels simultaneously using NEON vector instructions. This is the generic fallback loop used when no hand-coded assembly kernel matches the configuration.
Generic pack4 data flow (used when kernel is not 3×3 or 5×5):
The key fields used in this path are:
- weight_data_tm — contains weights repacked to (maxk, group, 4) layout via convert_packing() during create_pipeline
- space_ofs — precomputed array of flattened input offsets for each kernel position, accounting for dilation
- activation_ps() — inline NEON helper that applies the fused activation element-wise to float32x4_t

The outer loop iterates over groups (g = 0..channels-1), the middle loops over spatial positions (i, j), and the innermost loop accumulates maxk kernel positions. Each iteration of the inner loop uses vld1q_f32 + vmlaq_f32 to process 4 channels at once.
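A scalar sketch of the space_ofs precomputation described above (assuming an input row width `w`; the real loop then reads `input[space_ofs[k]]` for each of the maxk taps, so no stride or dilation arithmetic remains in the hot loop):

```cpp
#include <vector>

// Flatten every kernel tap into a single input offset, folding in dilation
// and the gap between kernel rows in the input image.
std::vector<int> make_space_ofs(int kernel_w, int kernel_h,
                                int dilation_w, int dilation_h, int w)
{
    std::vector<int> space_ofs(kernel_w * kernel_h);
    int p1 = 0;
    int p2 = 0;
    const int gap = w * dilation_h - kernel_w * dilation_w; // jump to next kernel row
    for (int i = 0; i < kernel_h; i++)
    {
        for (int j = 0; j < kernel_w; j++)
        {
            space_ofs[p1++] = p2;
            p2 += dilation_w;
        }
        p2 += gap;
    }
    return space_ofs;
}
```

For a 3×3 kernel with dilation 1 on a width-10 input, this yields offsets {0, 1, 2, 10, 11, 12, 20, 21, 22}.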
Performance: This achieves 4× throughput compared to scalar processing by processing 4 channels per SIMD register. The hand-coded assembly kernels (convdw3x3s1_pack4_neon, etc.) provide additional throughput via instruction-level parallelism and prefetch hints.
Sources: src/layer/arm/convolutiondepthwise_arm.cpp396-456
INT8 quantized depthwise convolution provides 4× memory reduction and faster inference on mobile devices:
INT8 Scale Modes (from src/layer/convolutiondepthwise.cpp82-113):
| Mode | int8_scale_term | Weights | Input | Output |
|---|---|---|---|---|
| Per-group weights | 1 or 101 | group scales | 1 scale (broadcast) | Optional (>100) |
| Per-tensor | 2 or 102 | 1 scale (broadcast) | 1 scale (broadcast) | Optional (>100) |
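The quantization step behind these modes can be sketched as follows (an illustrative sketch with a hypothetical function name; shown with round-to-nearest and clamping to the symmetric int8 range, which is the usual convention — NCNN's actual quantize path lives in its quantize layer):

```cpp
#include <cmath>

// Hypothetical per-value INT8 quantization: q = round(v * scale), clamped
// to [-127, 127]. With mode 1, `scale` is the weight scale of the group
// the value belongs to; with mode 2, one scale is broadcast to all groups.
signed char quantize_int8(float v, float scale)
{
    long q = std::lround(v * scale);
    if (q > 127) q = 127;
    if (q < -127) q = -127;
    return (signed char)q;
}
```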
Pack8 INT8 Kernels: ARM supports pack8 for INT8 to process 8 channels simultaneously using int8x8_t and int16x8_t accumulators:
- pack1: convdw3x3s1_int8_neon(), convdw3x3s2_int8_neon()
- pack8: convdw3x3s1_pack8_int8_neon(), convdw3x3s2_pack8_int8_neon()

Sources: src/layer/convolutiondepthwise.cpp82-141 src/layer/arm/convolutiondepthwise_arm.cpp64-69
When the depthwise condition is not met, NCNN creates individual Convolution layers for each group:
Forward Pass with Group Ops:
Each group processes independently:
- group_ops[g]->forward() is invoked for each group

This fallback ensures correctness for arbitrary group configurations while true depthwise cases use optimized kernels.
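The slice-run-concatenate pattern described above can be sketched like this (illustrative only; `forward_groups` and the `conv_g` callback are hypothetical stand-ins for NCNN's channel_range views and group_ops[g]->forward()):

```cpp
#include <functional>
#include <vector>

// Hypothetical group-convolution fallback: take each group's slice of input
// channels, run that group's own convolution, and stack the outputs.
// `plane` is the per-channel spatial size (H * W).
std::vector<float> forward_groups(
    const std::vector<float>& bottom, int channels, int group, int plane,
    const std::function<std::vector<float>(const std::vector<float>&)>& conv_g)
{
    const int cpg = channels / group; // channels per group
    std::vector<float> top;
    for (int g = 0; g < group; g++)
    {
        // view of cpg channels starting at channel g * cpg
        std::vector<float> slice(bottom.begin() + g * cpg * plane,
                                 bottom.begin() + (g + 1) * cpg * plane);
        std::vector<float> out_g = conv_g(slice);
        top.insert(top.end(), out_g.begin(), out_g.end());
    }
    return top;
}
```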
Sources: src/layer/arm/convolutiondepthwise_arm.cpp169-264 src/layer/x86/convolutiondepthwise_x86.cpp139-234
NCNN supports multiple precision modes for depthwise convolution:
| Mode | Data Type | Storage Size | Conversion | Use Case |
|---|---|---|---|---|
| FP32 | float | 4 bytes | Native | Default, highest accuracy |
| FP16 Storage | unsigned short | 2 bytes | float32_to_float16() at load | ARM82 ASIMDHP, 2× memory reduction |
| FP16 Arithmetic | __fp16 | 2 bytes | Native FP16 compute | ARM82 with use_fp16_arithmetic |
| BF16 Storage | unsigned short | 2 bytes | float32_to_bfloat16() at load | ARM NEON, maintains FP32 range |
| INT8 | signed char | 1 byte | quantize_to_int8() with scales | Mobile inference, 4× reduction |
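The BF16 row of the table relies on a simple property: BF16 is just the top 16 bits of an IEEE-754 float, so the FP32 exponent range is preserved. A minimal sketch (truncating conversion shown for clarity; a production converter like NCNN's also rounds):

```cpp
#include <cstdint>
#include <cstring>

// Keep the sign, full 8-bit exponent, and top 7 mantissa bits.
uint16_t float32_to_bfloat16_trunc(float v)
{
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    return (uint16_t)(bits >> 16);
}

// Widening back is exact: shift into the high half, zero-fill the rest.
float bfloat16_to_float32(uint16_t b)
{
    uint32_t bits = (uint32_t)b << 16;
    float v;
    std::memcpy(&v, &bits, sizeof(v));
    return v;
}
```

Values whose mantissa fits in 7 bits (such as 1.0f or 2.5f) round-trip exactly; others lose low mantissa bits but never range.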
Dispatch Logic: ConvolutionDepthWise_arm::forward at src/layer/arm/convolutiondepthwise_arm.cpp285-309 first checks int8_scale_term and dispatches to forward_int8_arm. Otherwise it checks bottom_blob.elembits():
- elembits == 16 + use_fp16_storage + use_fp16_arithmetic → forward_fp16sa (native FP16 compute, ARM82 ASIMDHP)
- elembits == 16 + use_fp16_storage → forward_fp16s (FP16 storage, FP32 compute)
- elembits == 16 + use_bf16_storage → forward_bf16s
- otherwise → forward (FP32)

BF16 Pack4 Kernels: src/layer/arm/convolutiondepthwise_3x3_pack4_bf16s.h and src/layer/arm/convolutiondepthwise_5x5_pack4_bf16s.h provide BF16-optimized implementations. Weights are stored as unsigned short in weight_data_tm and converted during convolution.
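The dispatch order can be sketched as a small decision function (illustrative only; `pick_path` and the boolean flags are hypothetical stand-ins for the Option fields and the member checks in forward):

```cpp
// Hypothetical precision-path selection, checked in priority order:
// INT8 first, then the 16-bit storage variants, then plain FP32.
enum Path { PATH_INT8, PATH_FP16SA, PATH_FP16S, PATH_BF16S, PATH_FP32 };

Path pick_path(int int8_scale_term, int elembits,
               bool fp16_storage, bool fp16_arith, bool bf16_storage)
{
    if (int8_scale_term) return PATH_INT8;
    if (elembits == 16 && fp16_storage && fp16_arith) return PATH_FP16SA;
    if (elembits == 16 && fp16_storage) return PATH_FP16S;
    if (elembits == 16 && bf16_storage) return PATH_BF16S;
    return PATH_FP32;
}
```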
Sources: src/layer/arm/convolutiondepthwise_arm.cpp285-310 src/layer/arm/convolutiondepthwise_arm.cpp684-900
The inverse operation, depthwise transposed convolution, is implemented in DeconvolutionDepthWise:
- DeconvolutionDepthWise (src/layer/deconvolutiondepthwise.cpp) - portable reference implementation
- DeconvolutionDepthWise_arm (src/layer/arm/deconvolutiondepthwise_arm.cpp) - NEON-optimized implementation
- Falls back to group_ops with Deconvolution layers when not true depthwise

The implementation mirrors depthwise convolution but reverses the spatial transformation, expanding the spatial dimensions rather than reducing them.
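The spatial expansion follows the standard transposed-convolution size formula, the inverse of the convolution size formula (a sketch; output padding/cropping is omitted):

```cpp
// Output extent of a transposed convolution along one axis:
// out = (in - 1) * stride + dilated_kernel_extent - 2 * pad
int deconv_out_size(int in, int kernel, int dilation, int stride, int pad)
{
    const int kernel_extent = dilation * (kernel - 1) + 1;
    return (in - 1) * stride + kernel_extent - 2 * pad;
}
```

As a sanity check, a 3×3 stride-2 convolution maps 9 → 4, and the formula maps 4 back to 9.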
Sources: src/layer/deconvolutiondepthwise.cpp1-351 src/layer/arm/deconvolutiondepthwise_arm.cpp1-1100
Computational Complexity:

- True depthwise: O(H × W × C × k²) multiply-accumulates, versus O(H × W × C_in × C_out × k²) for standard convolution

Memory Access Patterns:

- Weights are repacked once in create_pipeline() to avoid runtime overhead
- The space_ofs array (src/layer/arm/convolutiondepthwise_arm.cpp400-416) enables indexed memory access without stride calculations in the hot loop

Optimization Levels:

- Hand-coded SIMD kernels for 3×3 and 5×5 configurations
- Generic packed SIMD fallback loop for other kernel sizes
- Group convolution fallback via group_ops
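The complexity claim can be checked with a small MAC-count comparison (illustrative helper names; when C_in == C_out == C, the depthwise cost is exactly 1/C of the standard cost):

```cpp
// Multiply-accumulate counts over one output feature map.
long long depthwise_macs(int outH, int outW, int C, int k)
{
    return 1LL * outH * outW * C * k * k;          // one filter per channel
}

long long standard_macs(int outH, int outW, int Cin, int Cout, int k)
{
    return 1LL * outH * outW * Cin * Cout * k * k; // every input feeds every output
}
```

For a 32×32 output with 64 channels and a 3×3 kernel, the depthwise count times 64 equals the standard count.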