This document describes the configuration system in ncnn, centered around the Option class. The Option class provides a unified interface for controlling execution behavior, memory management, precision modes, and hardware acceleration across the entire inference pipeline.
For information about the memory allocators referenced by Option, see Memory Allocators. For details on Vulkan GPU execution configured by Option, see GPU and Vulkan System.
The Option system serves as the central configuration mechanism for ncnn inference, controlling threading, memory management, numeric precision, operator-level optimizations, and Vulkan GPU acceleration.
Sources: src/option.h1-156 src/option.cpp1-79
The Option class is defined in src/option.h17-151 and contains approximately 50 configuration fields grouped by functional category. All fields are public and directly accessible.
Sources: src/option.h17-151
| Field | Type | Default | Description |
|---|---|---|---|
| num_threads | int | Physical big CPU count | Number of OpenMP threads for CPU inference |
| openmp_blocktime | int | 20 (ms) | Time threads busy-wait before sleeping |
| lightmode | bool | true | Enables intermediate blob recycling to reduce memory usage |
The num_threads field defaults to the count returned by get_physical_big_cpu_count() src/option.cpp17, which detects ARM big cores, or total physical cores on other architectures. The openmp_blocktime of 20 ms balances performance (keeping cores active) with power consumption src/option.cpp28.
When lightmode is enabled, intermediate blobs are released after use src/option.cpp12, reducing peak memory consumption at the cost of preventing blob reuse in multi-branch networks.
Sources: src/option.h36-46 src/option.cpp12-28
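As a minimal sketch (assuming the standard ncnn C++ API and hypothetical model file names), the threading and light-mode fields can be set directly on a Net before loading:

```cpp
#include "net.h" // ncnn header; include path depends on your install

int main()
{
    ncnn::Net net;

    // All Option fields are public; set them before load_param()/load_model()
    net.opt.num_threads = 4;       // override the auto-detected big-core count
    net.opt.openmp_blocktime = 20; // ms of busy-wait before worker threads sleep
    net.opt.lightmode = true;      // recycle intermediate blobs after use

    net.load_param("model.param"); // hypothetical model files
    net.load_model("model.bin");
    return 0;
}
```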
| Field | Type | Default | Description |
|---|---|---|---|
| blob_allocator | Allocator* | 0 (null) | CPU memory allocator for intermediate tensors |
| workspace_allocator | Allocator* | 0 (null) | CPU memory allocator for temporary workspace |
| blob_vkallocator | VkAllocator* | 0 (null) | Vulkan allocator for GPU tensors |
| workspace_vkallocator | VkAllocator* | 0 (null) | Vulkan allocator for GPU temporary buffers |
| staging_vkallocator | VkAllocator* | 0 (null) | Vulkan allocator for CPU-GPU data transfer |
| use_local_pool_allocator | bool | true | Use thread-local pool allocators |
| use_weights_in_host_memory | bool | false | Store model weights in host memory (not device) |
| use_mapped_model_loading | bool | false | Use memory-mapped files for model loading |
When allocator pointers are null, the system uses default allocators. The Net class creates local PoolAllocator instances when use_local_pool_allocator is true src/net.cpp69-70.
Sources: src/option.h40-58 src/option.cpp18-59
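A sketch of supplying custom allocators (assuming the standard ncnn API; model file names are hypothetical):

```cpp
#include "net.h"       // ncnn headers; include path depends on your install
#include "allocator.h"

int main()
{
    // Pool allocators reuse freed buffers across inferences.
    // They must outlive the Net/Extractor that uses them.
    ncnn::PoolAllocator blob_pool;
    ncnn::PoolAllocator workspace_pool;

    ncnn::Net net;
    net.opt.blob_allocator = &blob_pool;           // intermediate tensors
    net.opt.workspace_allocator = &workspace_pool; // scratch buffers
    // Leaving these null instead lets ncnn fall back to default allocators.

    net.load_param("model.param"); // hypothetical model files
    net.load_model("model.bin");
    return 0;
}
```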
The Option system provides fine-grained control over numeric precision at three levels: storage format, packing format, and arithmetic operations.
| Precision | Storage Flag | Packed Flag | Arithmetic Flag | Default | Description |
|---|---|---|---|---|---|
| FP16 | use_fp16_storage | use_fp16_packed | use_fp16_arithmetic | true | Half-precision float (ARM FP16, Vulkan) |
| BF16 | use_bf16_storage | use_bf16_packed | N/A | false | Brain float16 format |
| INT8 | use_int8_storage | use_int8_packed | use_int8_arithmetic | storage/packed: true, arithmetic: false | 8-bit integer quantized inference |
Storage flags control the memory format of tensors. Packed flags enable SIMD-optimized memory layouts (e.g., pack4, pack8). Arithmetic flags control whether actual computations use reduced precision.
The system also provides shader-specific precision control:
- use_fp16_uniform (default: true) - Use FP16 for shader push constants src/option.cpp70
- use_int8_uniform (default: true) - Use INT8 for shader push constants src/option.cpp71

For INT8 inference, use_int8_inference must be enabled before loading the network src/option.h77-81, and the model must be quantized using ncnn2int8 (see Post-Training Quantization Tools).
Sources: src/option.h85-146 src/option.cpp32-71 src/net.cpp100-121
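A sketch of configuring the precision flags (assuming the standard ncnn API; the quantized model files are hypothetical):

```cpp
#include "net.h" // ncnn header; include path depends on your install

int main()
{
    ncnn::Net net;

    // Precision flags must be set before load_param()/load_model(),
    // since they influence which kernels and weight formats are chosen
    net.opt.use_fp16_storage = true;
    net.opt.use_fp16_packed = true;
    net.opt.use_fp16_arithmetic = false; // FP16 storage, but FP32 math
    net.opt.use_bf16_storage = false;

    // INT8 requires a model quantized with ncnn2int8
    net.opt.use_int8_inference = true;

    net.load_param("model_int8.param"); // hypothetical quantized model
    net.load_model("model_int8.bin");
    return 0;
}
```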
| Flag | Default | Applies To | Memory Impact | Performance Impact |
|---|---|---|---|---|
| use_winograd_convolution | true | 3x3 stride-1 conv | Higher | 2-3x faster |
| use_winograd23_convolution | true | Winograd F(2,3) | Medium | Fastest for 3x3 |
| use_winograd43_convolution | true | Winograd F(4,3) | Higher | Better for larger tiles |
| use_winograd63_convolution | true | Winograd F(6,3) | Highest | Best for very large inputs |
| use_sgemm_convolution | true | 1x1 stride-1 conv | Medium | 2-3x faster |
| use_packing_layout | true | All operators | Higher | 2-4x faster with SIMD |
| use_a53_a55_optimized_kernel | Auto-detected | ARM Cortex-A53/A55 | None | 10-20% faster on A53/A55 |
Winograd convolution reduces arithmetic complexity for 3x3 kernels by computing larger output tiles with fewer multiplications, at the cost of increased memory usage src/option.h65-69. The three variants F(2,3), F(4,3), and F(6,3) trade off tile size, memory, and performance.
SGEMM convolution uses im2col + GEMM for 1x1 stride-1 convolutions src/option.h71-75.
The use_a53_a55_optimized_kernel flag is automatically set based on is_current_thread_running_on_a53_a55() detection src/option.cpp68 but can be manually overridden.
Important: Changes to use_winograd_convolution, use_sgemm_convolution, and use_packing_layout must be applied before loading the network structure and weights, as they affect how layers are created src/option.h67-102
Sources: src/option.h64-142 src/option.cpp30-68
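A sketch of tuning the convolution flags before loading (assuming the standard ncnn API; model file names are hypothetical):

```cpp
#include "net.h" // ncnn header; include path depends on your install

int main()
{
    ncnn::Net net;

    // These flags affect layer creation, so set them before loading
    net.opt.use_winograd_convolution = true; // 3x3 stride-1 kernels
    net.opt.use_sgemm_convolution = true;    // im2col + GEMM path
    net.opt.use_packing_layout = true;       // pack4/pack8 SIMD layouts

    // Optionally disable the largest Winograd tile to cap memory use
    net.opt.use_winograd63_convolution = false;

    net.load_param("model.param"); // hypothetical model files
    net.load_model("model.bin");
    return 0;
}
```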
| Field | Type | Default | Description |
|---|---|---|---|
| use_vulkan_compute | bool | false | Enable Vulkan GPU execution |
| vulkan_device_index | int | -1 | GPU device index (-1 = default device) |
| pipeline_cache | PipelineCache* | 0 | Vulkan pipeline cache for faster shader compilation |
| use_shader_local_memory | bool | true | Use local/shared memory in shaders |
| use_cooperative_matrix | bool | true | Use tensor core operations (if available) |
| use_subgroup_ops | bool | true | Enable subgroup operations in shaders |
| use_tensor_storage | bool | false | Use tensor storage layout for images |
The vulkan_device_index selects which GPU to use when multiple are available src/option.h105. A value of -1 uses get_default_gpu_index() src/option.cpp46.
use_vulkan_compute is disabled by default with a comment "TODO enable me" src/option.cpp33 indicating it requires explicit opt-in by the user.
The use_cooperative_matrix flag enables Cooperative Matrix operations for optimized GEMM on supported hardware (Tensor Cores on NVIDIA, Matrix Cores on AMD) src/option.cpp62
use_shader_local_memory controls whether shader implementations use local/shared memory for tile caching src/option.cpp61
Sources: src/option.h21-133 src/option.cpp21-62
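A sketch of opting in to Vulkan execution (assuming ncnn was built with NCNN_VULKAN; model file names are hypothetical):

```cpp
#include "net.h" // ncnn headers; include path depends on your install
#include "gpu.h" // Vulkan device helpers

int main()
{
    // Vulkan is an explicit opt-in; check for a device first
    if (ncnn::get_gpu_count() > 0)
    {
        ncnn::Net net;
        net.opt.use_vulkan_compute = true;
        net.set_vulkan_device(0); // or rely on the default device

        net.load_param("model.param"); // hypothetical model files
        net.load_model("model.bin");
    }
    return 0;
}
```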
The flush_denormals field controls CPU floating-point denormal handling to improve performance src/option.h115-121:
0 = DAZ OFF, FTZ OFF (full IEEE 754 compliance)
1 = DAZ ON, FTZ OFF (denormals are zero on input)
2 = DAZ OFF, FTZ ON (flush denormals to zero on output)
3 = DAZ ON, FTZ ON (both enabled - maximum performance)
Default is 3 (both enabled) for maximum performance src/option.cpp54
Sources: src/option.h115-121 src/option.cpp54
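A minimal sketch of selecting a denormal mode (assuming the standard ncnn API):

```cpp
#include "net.h" // ncnn header; include path depends on your install

int main()
{
    ncnn::Net net;
    // 3 = DAZ + FTZ: treat denormal inputs and outputs as zero for speed;
    // use 0 instead if strict IEEE 754 behaviour is required
    net.opt.flush_denormals = 3;
    return 0;
}
```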
The Option object flows from the Net level down through Extractor to individual layer executions, controlling behavior at each stage.
Sources: src/net.cpp27-549
The Net class owns an Option object that applies to the entire network src/net.cpp907-910. The NetPrivate constructor takes a reference to this Option src/net.cpp86-88.
Sources: src/net.cpp27-910
When creating an Extractor, the user can override specific fields: the Extractor maintains its own Option copy, so overrides apply only to that inference session.
Sources: Documentation inferred from typical ncnn usage patterns
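A sketch of per-session overrides via the Extractor API (assuming the standard ncnn API; model files and blob names are hypothetical):

```cpp
#include "net.h" // ncnn header; include path depends on your install

int main()
{
    ncnn::Net net;
    net.load_param("model.param"); // hypothetical model files
    net.load_model("model.bin");

    // Each Extractor carries its own Option copy; these overrides
    // do not affect the Net-level defaults
    ncnn::Extractor ex = net.create_extractor();
    ex.set_num_threads(2);
    ex.set_light_mode(true);

    ncnn::Mat in(224, 224, 3);
    ncnn::Mat out;
    ex.input("data", in);    // blob names are model-specific
    ex.extract("prob", out);
    return 0;
}
```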
Individual layers can disable specific features through a featmask field, which the get_masked_option() function applies src/net.cpp100-121. The masking logic src/net.cpp104-118 is:
| Bit | Feature Masked | Effect |
|---|---|---|
| 0 | use_fp16_arithmetic | Disable FP16 arithmetic for this layer |
| 1 | use_fp16_packed, use_fp16_storage | Disable FP16 storage formats |
| 2 | use_bf16_packed, use_bf16_storage | Disable BF16 formats |
| 3 | use_int8_packed, use_int8_storage, use_int8_arithmetic | Disable INT8 quantization |
| 4 | use_vulkan_compute, use_tensor_storage | Disable Vulkan execution |
| 5 | use_sgemm_convolution | Disable SGEMM convolution |
| 6 | use_winograd_convolution | Disable Winograd convolution |
| 7 | num_threads = 1 | Force single-threaded execution |
This allows layers to opt-out of optimizations that don't apply or cause issues for that specific operation.
Sources: src/net.cpp100-168
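The bit layout above can be sketched as a self-contained function. The Opt struct here is a simplified stand-in for ncnn::Option, and the function mirrors (not reproduces) get_masked_option() in src/net.cpp:

```cpp
#include <cstdint>

// Simplified stand-in for ncnn::Option (defaults match the summary table)
struct Opt
{
    bool use_fp16_arithmetic = true;
    bool use_fp16_packed = true;
    bool use_fp16_storage = true;
    bool use_bf16_packed = false;
    bool use_bf16_storage = false;
    bool use_int8_packed = true;
    bool use_int8_storage = true;
    bool use_int8_arithmetic = false;
    bool use_vulkan_compute = false;
    bool use_tensor_storage = false;
    bool use_sgemm_convolution = true;
    bool use_winograd_convolution = true;
    int num_threads = 4;
};

// Each set bit in featmask forces the corresponding feature off
Opt get_masked_option(Opt opt, uint32_t featmask)
{
    if (featmask & (1u << 0)) opt.use_fp16_arithmetic = false;
    if (featmask & (1u << 1)) { opt.use_fp16_packed = false; opt.use_fp16_storage = false; }
    if (featmask & (1u << 2)) { opt.use_bf16_packed = false; opt.use_bf16_storage = false; }
    if (featmask & (1u << 3)) { opt.use_int8_packed = false; opt.use_int8_storage = false; opt.use_int8_arithmetic = false; }
    if (featmask & (1u << 4)) { opt.use_vulkan_compute = false; opt.use_tensor_storage = false; }
    if (featmask & (1u << 5)) opt.use_sgemm_convolution = false;
    if (featmask & (1u << 6)) opt.use_winograd_convolution = false;
    if (featmask & (1u << 7)) opt.num_threads = 1;
    return opt;
}
```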
The convert_layout() functions src/net.cpp358-549 perform precision and packing conversions between layers based on Option flags, choosing the target storage type and packing level for each blob.
For Vulkan, the conversion occurs on GPU through vkdev->convert_packing() src/net.cpp552-588 which can cast precision on-the-fly.
Sources: src/net.cpp358-588
Vulkan commands like record_upload() and record_download() use Option to determine precision and memory allocation src/command.cpp358-586.
The logic optimizes data transfer based on GPU type src/command.cpp363-467:
- Discrete GPU (type == 0): cast to FP16/BF16 on the CPU before upload to reduce PCIe bandwidth
- Integrated GPU (type != 0): upload FP32 and cast on the GPU, since memory is shared
Sources: src/command.cpp358-586
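The transfer decision above can be sketched as a single predicate. This is a simplified illustration, not the actual logic in record_upload():

```cpp
// Should the tensor be cast to FP16 on the CPU before upload?
bool cast_fp16_on_cpu_before_upload(int gpu_type, bool use_fp16_storage)
{
    // Discrete GPUs (type == 0): casting on the CPU halves the bytes
    // moved across PCIe. Integrated GPUs share memory with the CPU,
    // so upload FP32 as-is and let the GPU do the cast.
    return gpu_type == 0 && use_fp16_storage;
}
```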
| Category | Field | Default | Notes |
|---|---|---|---|
| Threading | num_threads | Physical big CPU count | Auto-detected |
| Threading | openmp_blocktime | 20 ms | Balances perf/power |
| Memory | lightmode | true | Recycle intermediates |
| Memory | use_local_pool_allocator | true | Thread-local pools |
| Precision | use_fp16_packed | true | Enable FP16 |
| Precision | use_fp16_storage | true | Enable FP16 |
| Precision | use_fp16_arithmetic | true | Enable FP16 |
| Precision | use_bf16_storage | false | Disabled by default |
| Precision | use_int8_packed | true | Enable INT8 |
| Precision | use_int8_storage | true | Enable INT8 |
| Precision | use_int8_arithmetic | false | Disabled by default |
| Convolution | use_winograd_convolution | true | All Winograd variants |
| Convolution | use_sgemm_convolution | true | im2col+GEMM |
| Convolution | use_packing_layout | true | SIMD layout |
| Vulkan | use_vulkan_compute | false | Explicit opt-in |
| Vulkan | use_shader_local_memory | true | Tile caching |
| Vulkan | use_cooperative_matrix | true | Tensor cores |
| Vulkan | vulkan_device_index | -1 | Default device |
| CPU | flush_denormals | 3 | DAZ+FTZ enabled |
| ARM | use_a53_a55_optimized_kernel | Auto | Detected at runtime |
Sources: src/option.cpp10-76
Sources: Documentation inferred from typical ncnn usage patterns, based on src/net.cpp and src/option.h