This page documents ncnn's runtime CPU feature detection and dynamic dispatch system. At startup, ncnn detects which instruction set architecture (ISA) extensions the host processor supports and selects the most optimized layer implementations accordingly. This allows a single ncnn binary to run efficiently across different CPU architectures and feature levels without requiring separate builds for each ISA variant.
For details on specific optimizations for ARM and x86 platforms, see ARM NEON Optimizations and x86 SIMD Optimizations. For quantized inference modes, see INT8 Quantization and Precision Modes.
ncnn's runtime dispatch system operates through three main components:
Diagram: Runtime CPU Dispatch Flow
Sources: src/cpu.cpp:182-250, src/layer.cpp:145-230
Runtime CPU dispatch is controlled by the NCNN_RUNTIME_CPU CMake option, which is enabled by default:
option(NCNN_RUNTIME_CPU "runtime dispatch cpu routines" ON)
When NCNN_RUNTIME_CPU=ON, ncnn compiles multiple ISA-specific versions of performance-critical layers. Each variant is compiled with appropriate compiler flags to enable specific instruction sets.
The build system generates multiple layer source files with different ISA flags. The ncnn_add_arch_opt_layer macro creates ISA-specific variants:
cmake/ncnn_add_layer.cmake:2-46
For x86 platforms, variants are generated for:

- `layer_registry_avx[]`
- `layer_registry_fma[]`
- `layer_registry_avx512[]`
- `layer_registry_avx512_vnni[]`

For ARM platforms:

- `layer_registry_arch[]` (NEON, dotprod, etc.)

The layer source generation is automated through custom CMake commands that duplicate and rename layer implementations:
cmake/ncnn_add_layer.cmake:8-27
Sources: CMakeLists.txt:84, cmake/ncnn_add_layer.cmake:2-80, src/CMakeLists.txt:405-720
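As a rough illustration of the duplication idea, the snippet below sketches a macro in the spirit of ncnn_add_arch_opt_layer. The macro name, arguments, and wrapper contents are simplified stand-ins, not ncnn's exact code: it writes a thin wrapper source that re-includes the base implementation under a renamed class, then compiles only that wrapper with the ISA-specific flags.

```cmake
# Hedged sketch (illustrative names, not ncnn's exact macro):
# a wrapper source renames the class via a #define, re-includes the
# base implementation, and is compiled with ISA-specific flags.
macro(sketch_add_arch_opt_layer class srcfile arch flags)
    set(wrapper ${CMAKE_CURRENT_BINARY_DIR}/${class}_${arch}.cpp)
    file(WRITE ${wrapper}
        "#define ${class} ${class}_${arch}\n"   # e.g. Convolution_x86 -> Convolution_x86_avx
        "#include \"${srcfile}\"\n")            # re-include the base implementation
    set_source_files_properties(${wrapper} PROPERTIES COMPILE_FLAGS "${flags}")
    list(APPEND ncnn_SRCS ${wrapper})
endmacro()

# hypothetical usage:
# sketch_add_arch_opt_layer(Convolution_x86 convolution_x86.cpp avx "-mavx")
```

Because only the wrapper is compiled with `-mavx` (or similar), the generic build of the same translation unit stays free of AVX instructions, which is what makes one binary safe across feature levels.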
ncnn uses different methods to detect CPU capabilities depending on the platform:
| Platform | Method | Key Functions |
|---|---|---|
| Linux/Android | getauxval() + /proc/self/auxv | get_elf_hwcap() |
| Windows (ARM) | ruapu library | ruapu_init(), ruapu_supports() |
| macOS/iOS | sysctlbyname() | get_hw_cpufamily(), get_hw_capability() |
| x86/x86_64 | CPUID instruction | x86_cpuid(), x86_get_xcr0() |
Diagram: Platform-Specific CPU Detection Methods
Sources: src/cpu.cpp:312-519, src/cpu.cpp:521-838
On Linux and Android, ncnn reads hardware capabilities from the ELF auxiliary vector:
The get_elf_hwcap_from_getauxval() function first attempts to use getauxval() (available on Android API level 18+), then falls back to reading /proc/self/auxv:
Hardware capability flags are defined for ARM64:

- `HWCAP_ASIMD`: NEON/ASIMD support
- `HWCAP_ASIMDHP`: Half-precision floating point
- `HWCAP_ASIMDDP`: Dot product instructions
- `HWCAP2_I8MM`: 8-bit integer matrix multiply
- `HWCAP2_BF16`: BFloat16 support
- `HWCAP_SVE`, `HWCAP2_SVE2`: Scalable Vector Extensions

On Windows ARM, ncnn uses the ruapu library for runtime feature detection:
The ruapu library performs safe feature detection by testing instructions and catching illegal instruction exceptions.
On macOS and iOS, ncnn queries CPU features through sysctlbyname():
Key queries include:

- `hw.cpufamily`: CPU family identifier (A-series chip generation)
- `hw.optional.arm.FEAT_FP16`: FP16 arithmetic support
- `hw.optional.arm.FEAT_DotProd`: Dot product support
- `hw.optional.arm.FEAT_BF16`: BFloat16 support

CPU family identifiers distinguish between chip generations:

- `CPUFAMILY_ARM_FIRESTORM_ICESTORM`: A14/M1
- `CPUFAMILY_ARM_AVALANCHE_BLIZZARD`: A15/M2
- `CPUFAMILY_ARM_EVEREST_SAWTOOTH`: A16
- `CPUFAMILY_ARM_DONAN`: M4

On x86 platforms, ncnn uses the CPUID instruction to query processor features:
The detection process combines CPUID feature queries with a read of the XCR0 register (via xgetbv) to confirm that the operating system saves the extended register state. Example AVX2 detection logic:
1. Check that CPUID leaf 0 reports a maximum supported leaf >= 7
2. Check that CPUID leaf 1 reports the AVX, XSAVE, and OSXSAVE bits
3. Check that the XCR0 register has the SSE and AVX states enabled (bits 1-2)
4. Check that CPUID leaf 7, subleaf 0 reports the AVX2 bit
Sources: src/cpu.cpp:312-838, src/cpu.h:44-101
CPU detection results are cached in global static variables, one set of flags per architecture (ARM, x86, and the other supported ISAs). These flags are initialized once at startup and queried during layer creation.
Sources: src/cpu.cpp:182-250
ncnn maintains separate layer registry arrays for different ISA levels. Each array contains function pointers to layer creator functions:
The template generates:

- `layer_registry[]`: Generic implementations
- `layer_registry_arch[]`: Architecture-specific (ARM NEON, etc.)
- `layer_registry_avx[]`: AVX-optimized (x86)
- `layer_registry_fma[]`: FMA-optimized (x86)
- `layer_registry_avx512[]`: AVX512-optimized (x86)
- `layer_registry_avx512_vnni[]`: AVX512-VNNI-optimized (x86)

Each registry entry maps a layer type to its creator function:
For each layer that benefits from ISA-specific optimization, multiple source files are generated:
Example for the Convolution layer:

- `convolution.cpp` - Generic implementation
- `convolution_x86.cpp` - x86 base implementation
- `convolution_x86_avx.cpp` - Generated with AVX flags
- `convolution_x86_avx512.cpp` - Generated with AVX512 flags
- `convolution_arm.cpp` - ARM base implementation
- `convolution_arm_asimdhp.cpp` - Generated with FP16 flags

The generation is performed by ncnn_generate_*_source.cmake scripts that duplicate each base implementation and rename its classes for the target ISA (e.g. Convolution_x86 → Convolution_x86_avx):

cmake/ncnn_add_layer.cmake:2-46
Sources: src/layer_registry.h.in:1-34, cmake/ncnn_add_layer.cmake:2-80
When a layer is created, ncnn selects the most optimized implementation through a multi-level dispatch:
Diagram: Layer Creation and ISA Dispatch
The runtime dispatch logic is implemented in create_layer():
Selection priority (highest to lowest):
For x86 platforms:

1. `NCNN_RUNTIME_CPU && NCNN_AVX512VNNI && cpu_support_x86_avx512_vnni()`
2. `NCNN_RUNTIME_CPU && NCNN_AVX512FP16 && cpu_support_x86_avx512_fp16()`
3. `NCNN_RUNTIME_CPU && NCNN_AVX512BF16 && cpu_support_x86_avx512_bf16()`
4. `NCNN_RUNTIME_CPU && NCNN_AVX512 && cpu_support_x86_avx512()`
5. `NCNN_RUNTIME_CPU && NCNN_AVXVNNI && cpu_support_x86_avx_vnni()`
6. `NCNN_RUNTIME_CPU && NCNN_FMA && cpu_support_x86_fma()`
7. `NCNN_RUNTIME_CPU && NCNN_AVX && cpu_support_x86_avx()`

For ARM platforms:
A variant is selected only if it was compiled into the binary (e.g. NCNN_AVX512 is enabled) and the CPU reports the corresponding feature at runtime (e.g. cpu_support_x86_avx512() returns true).

For the Convolution layer on an AVX512-capable CPU:
1. Build time: multiple variants are compiled
   - `convolution.cpp` → `Convolution_layer_creator` → `layer_registry[typeindex]`
   - `convolution_x86_avx512.cpp` → `Convolution_x86_avx512_layer_creator` → `layer_registry_avx512[typeindex]`
2. Runtime: feature detection
   - `cpu_support_x86_avx512()` returns 1
3. Layer creation: dispatch selects AVX512
   - `create_layer()` finds a non-NULL creator in `layer_registry_avx512[]`
   - calls `Convolution_x86_avx512_layer_creator()`
   - returns a `Convolution_x86_avx512*` instance

If AVX512 is not available, the same lookup falls through the priority chain to the next populated registry (e.g. `layer_registry_fma[]`, then `layer_registry_avx[]`, and finally the generic `layer_registry[]`).
Sources: src/layer.cpp:161-230, src/layer_registry.h.in:1-34
| Feature | Detection Function | Description |
|---|---|---|
| NEON | cpu_support_arm_neon() | Basic SIMD (armv7) or ASIMD (armv8) |
| VFPv4 | cpu_support_arm_vfpv4() | FP16 conversion + FMA |
| ASIMDHP | cpu_support_arm_asimdhp() | FP16 arithmetic (ARMv8.2) |
| ASIMDDP | cpu_support_arm_asimddp() | Dot product (ARMv8.2) |
| ASIMDFHM | cpu_support_arm_asimdfhm() | FP16 FML (ARMv8.2) |
| BF16 | cpu_support_arm_bf16() | BFloat16 (ARMv8.6) |
| I8MM | cpu_support_arm_i8mm() | INT8 matrix multiply (ARMv8.6) |
| SVE | cpu_support_arm_sve() | Scalable Vector Extension (ARMv8.2) |
| SVE2 | cpu_support_arm_sve2() | SVE version 2 |
| Feature | Detection Function | Description |
|---|---|---|
| AVX | cpu_support_x86_avx() | 256-bit vectors |
| FMA | cpu_support_x86_fma() | Fused multiply-add |
| XOP | cpu_support_x86_xop() | AMD extended operations |
| F16C | cpu_support_x86_f16c() | FP16 conversion |
| AVX2 | cpu_support_x86_avx2() | AVX2 + FMA + F16C |
| AVX-VNNI | cpu_support_x86_avx_vnni() | AVX VNNI |
| AVX512 | cpu_support_x86_avx512() | AVX512F/BW/CD/DQ/VL |
| AVX512-VNNI | cpu_support_x86_avx512_vnni() | AVX512 VNNI |
| AVX512-BF16 | cpu_support_x86_avx512_bf16() | AVX512 BFloat16 |
| AVX512-FP16 | cpu_support_x86_avx512_fp16() | AVX512 FP16 |
| Architecture | Features | Detection |
|---|---|---|
| MIPS | MSA, MMI | cpu_support_mips_msa(), cpu_support_loongson_mmi() |
| LoongArch | LSX, LASX | cpu_support_loongarch_lsx(), cpu_support_loongarch_lasx() |
| RISC-V | RVV, ZFH | cpu_support_riscv_v(), cpu_support_riscv_zfh() |
Sources: src/cpu.h:43-115, src/cpu.cpp:839-1419
To build ncnn without runtime dispatch (smaller binary, single ISA):
This compiles only the enabled ISA variants and removes the dispatch overhead. The resulting binary requires a CPU that supports all compiled-in features; on older CPUs it will crash with an illegal-instruction error.
For maximum compatibility, enable NCNN_RUNTIME_CPU=ON (default) and let ncnn automatically select the best implementation at runtime.
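For illustration, the two configurations might look like this (the NCNN_* option names come from ncnn's CMakeLists; exact availability depends on the ncnn version and target toolchain):

```shell
# Default: runtime dispatch ON, one binary serves all ISA levels
cmake -DNCNN_RUNTIME_CPU=ON ..

# Smaller single-ISA build: disable dispatch and pin the baseline,
# e.g. an AVX2-only desktop build
cmake -DNCNN_RUNTIME_CPU=OFF -DNCNN_AVX2=ON -DNCNN_AVX512=OFF ..
```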