This document describes the configuration system in ncnn, centered around the Option class. The Option class provides a unified interface for controlling execution behavior, memory management, precision modes, and hardware acceleration across the entire inference pipeline.
For information about the memory allocators referenced by Option, see Memory Allocators. For details on Vulkan GPU execution configured by Option, see GPU and Vulkan System.
The Option system serves as the central configuration mechanism for ncnn inference, controlling threading, memory management, numeric precision, operator-level optimizations, and Vulkan GPU acceleration.
Sources: src/option.h1-156 src/option.cpp1-79
The Option class is defined in src/option.h17-151 and contains approximately 50 configuration fields grouped by functional category. All fields are public and directly accessible.
Sources: src/option.h17-151
| Field | Type | Default | Description |
|---|---|---|---|
| num_threads | int | Physical big CPU count | Number of OpenMP threads for CPU inference |
| openmp_blocktime | int | 20 (ms) | Time threads busy-wait before sleeping |
| lightmode | bool | true | Enables intermediate blob recycling to reduce memory usage |
The num_threads field defaults to the count returned by get_physical_big_cpu_count() src/option.cpp17, which detects ARM big cores, or total physical cores on other architectures. The openmp_blocktime of 20 ms balances performance (keeping cores active) with power consumption src/option.cpp28.
When lightmode is enabled, intermediate blobs are released after use src/option.cpp12, reducing peak memory consumption at the cost of preventing blob reuse in multi-branch networks.
Sources: src/option.h36-46 src/option.cpp12-28
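As a minimal sketch (assuming the standard ncnn C++ API and hypothetical model file names), the threading and light-mode fields can be set directly on a Net before loading:

```cpp
#include "net.h" // ncnn header; include path depends on your install

int main()
{
    ncnn::Net net;

    // All Option fields are public; set them before load_param()/load_model()
    net.opt.num_threads = 4;       // override the auto-detected big-core count
    net.opt.openmp_blocktime = 20; // ms of busy-wait before worker threads sleep
    net.opt.lightmode = true;      // recycle intermediate blobs after use

    net.load_param("model.param"); // hypothetical model files
    net.load_model("model.bin");
    return 0;
}
```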
| Field | Type | Default | Description |
|---|---|---|---|
| blob_allocator | Allocator* | 0 (null) | CPU memory allocator for intermediate tensors |
| workspace_allocator | Allocator* | 0 (null) | CPU memory allocator for temporary workspace |
| blob_vkallocator | VkAllocator* | 0 (null) | Vulkan allocator for GPU tensors |
| workspace_vkallocator | VkAllocator* | 0 (null) | Vulkan allocator for GPU temporary buffers |
| staging_vkallocator | VkAllocator* | 0 (null) | Vulkan allocator for CPU-GPU data transfer |
| use_local_pool_allocator | bool | true | Use thread-local pool allocators |
| use_weights_in_host_memory | bool | false | Store model weights in host memory (not device) |
| use_mapped_model_loading | bool | false | Use memory-mapped files for model loading |
When allocator pointers are null, the system uses default allocators. The Net class creates local PoolAllocator instances when use_local_pool_allocator is true src/net.cpp69-70.
Sources: src/option.h40-58 src/option.cpp18-59
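A sketch of supplying custom allocators (assuming the standard ncnn API; model file names are hypothetical):

```cpp
#include "net.h"       // ncnn headers; include path depends on your install
#include "allocator.h"

int main()
{
    // Pool allocators reuse freed buffers across inferences.
    // They must outlive the Net/Extractor that uses them.
    ncnn::PoolAllocator blob_pool;
    ncnn::PoolAllocator workspace_pool;

    ncnn::Net net;
    net.opt.blob_allocator = &blob_pool;           // intermediate tensors
    net.opt.workspace_allocator = &workspace_pool; // scratch buffers
    // Leaving these null instead lets ncnn fall back to default allocators.

    net.load_param("model.param"); // hypothetical model files
    net.load_model("model.bin");
    return 0;
}
```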
The Option system provides fine-grained control over numeric precision at three levels: storage format, packing format, and arithmetic operations.
| Precision | Storage Flag | Packed Flag | Arithmetic Flag | Default | Description |
|---|---|---|---|---|---|
| FP16 | use_fp16_storage | use_fp16_packed | use_fp16_arithmetic | true | Half-precision float (ARM FP16, Vulkan) |
| BF16 | use_bf16_storage | use_bf16_packed | N/A | false | Brain float16 format |
| INT8 | use_int8_storage | use_int8_packed | use_int8_arithmetic | storage/packed: true, arithmetic: false | 8-bit integer quantized inference |
Storage flags control the memory format of tensors. Packed flags enable SIMD-optimized memory layouts (e.g., pack4, pack8). Arithmetic flags control whether actual computations use reduced precision.
The system also provides shader-specific precision control:
- use_fp16_uniform (default: true) - Use FP16 for shader push constants src/option.cpp70
- use_int8_uniform (default: true) - Use INT8 for shader push constants src/option.cpp71

For INT8 inference, use_int8_inference must be enabled before loading the network src/option.h77-81, and the model must be quantized using ncnn2int8 (see Post-Training Quantization Tools).
Sources: src/option.h85-146 src/option.cpp32-71 src/net.cpp100-121
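A sketch of configuring the precision flags (assuming the standard ncnn API; the quantized model files are hypothetical):

```cpp
#include "net.h" // ncnn header; include path depends on your install

int main()
{
    ncnn::Net net;

    // Precision flags must be set before load_param()/load_model(),
    // since they influence which kernels and weight formats are chosen
    net.opt.use_fp16_storage = true;
    net.opt.use_fp16_packed = true;
    net.opt.use_fp16_arithmetic = false; // FP16 storage, but FP32 math
    net.opt.use_bf16_storage = false;

    // INT8 requires a model quantized with ncnn2int8
    net.opt.use_int8_inference = true;

    net.load_param("model_int8.param"); // hypothetical quantized model
    net.load_model("model_int8.bin");
    return 0;
}
```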
| Flag | Default | Applies To | Memory Impact | Performance Impact |
|---|---|---|---|---|
| use_winograd_convolution | true | 3x3 stride-1 conv | Higher | 2-3x faster |
| use_winograd23_convolution | true | Winograd F(2,3) | Medium | Fastest for 3x3 |
| use_winograd43_convolution | true | Winograd F(4,3) | Higher | Better for larger tiles |
| use_winograd63_convolution | true | Winograd F(6,3) | Highest | Best for very large inputs |
| use_sgemm_convolution | true | 1x1 stride-1 conv | Medium | 2-3x faster |
| use_packing_layout | true | All operators | Higher | 2-4x faster with SIMD |
| use_a53_a55_optimized_kernel | Auto-detected | ARM Cortex-A53/A55 | None | 10-20% faster on A53/A55 |
Winograd convolution reduces arithmetic complexity for 3x3 kernels by computing larger output tiles with fewer multiplications, at the cost of increased memory usage src/option.h65-69. The three variants F(2,3), F(4,3), and F(6,3) trade off tile size, memory, and performance.
SGEMM convolution uses im2col + GEMM for 1x1 stride-1 convolutions src/option.h71-75.
The use_a53_a55_optimized_kernel flag is automatically set based on is_current_thread_running_on_a53_a55() detection src/option.cpp68 but can be manually overridden.
Important: Changes to use_winograd_convolution, use_sgemm_convolution, and use_packing_layout must be applied before loading the network structure and weights, as they affect how layers are created src/option.h67-102
Sources: src/option.h64-142 src/option.cpp30-68
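A sketch of tuning the convolution flags before loading (assuming the standard ncnn API; model file names are hypothetical):

```cpp
#include "net.h" // ncnn header; include path depends on your install

int main()
{
    ncnn::Net net;

    // These flags affect layer creation, so set them before loading
    net.opt.use_winograd_convolution = true; // 3x3 stride-1 kernels
    net.opt.use_sgemm_convolution = true;    // im2col + GEMM path
    net.opt.use_packing_layout = true;       // pack4/pack8 SIMD layouts

    // Optionally disable the largest Winograd tile to cap memory use
    net.opt.use_winograd63_convolution = false;

    net.load_param("model.param"); // hypothetical model files
    net.load_model("model.bin");
    return 0;
}
```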
| Field | Type | Default | Description |
|---|---|---|---|
| use_vulkan_compute | bool | false | Enable Vulkan GPU execution |
| vulkan_device_index | int | -1 | GPU device index (-1 = default device) |
| pipeline_cache | PipelineCache* | 0 | Vulkan pipeline cache for faster shader compilation |
| use_shader_local_memory | bool | true | Use local/shared memory in shaders |
| use_cooperative_matrix | bool | true | Use tensor core operations (if available) |
| use_subgroup_ops | bool | true | Enable subgroup operations in shaders |
| use_tensor_storage | bool | false | Use tensor storage layout for images |
The vulkan_device_index selects which GPU to use when multiple are available src/option.h105. A value of -1 uses get_default_gpu_index() src/option.cpp46.
use_vulkan_compute is disabled by default with a comment "TODO enable me" src/option.cpp33 indicating it requires explicit opt-in by the user.
The use_cooperative_matrix flag enables Cooperative Matrix operations for optimized GEMM on supported hardware (Tensor Cores on NVIDIA, Matrix Cores on AMD) src/option.cpp62
use_shader_local_memory controls whether shader implementations use local/shared memory for tile caching src/option.cpp61
Sources: src/option.h21-133 src/option.cpp21-62
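A sketch of opting in to Vulkan execution (assuming ncnn was built with NCNN_VULKAN; model file names are hypothetical):

```cpp
#include "net.h" // ncnn headers; include path depends on your install
#include "gpu.h" // Vulkan device helpers

int main()
{
    // Vulkan is an explicit opt-in; check for a device first
    if (ncnn::get_gpu_count() > 0)
    {
        ncnn::Net net;
        net.opt.use_vulkan_compute = true;
        net.set_vulkan_device(0); // or rely on the default device

        net.load_param("model.param"); // hypothetical model files
        net.load_model("model.bin");
    }
    return 0;
}
```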
The flush_denormals field controls CPU floating-point denormal handling to improve performance src/option.h115-121:
0 = DAZ OFF, FTZ OFF (full IEEE 754 compliance)
1 = DAZ ON, FTZ OFF (denormals are zero on input)
2 = DAZ OFF, FTZ ON (flush denormals to zero on output)
3 = DAZ ON, FTZ ON (both enabled - maximum performance)
Default is 3 (both enabled) for maximum performance src/option.cpp54
Sources: src/option.h115-121 src/option.cpp54
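A minimal sketch of selecting a denormal mode (assuming the standard ncnn API):

```cpp
#include "net.h" // ncnn header; include path depends on your install

int main()
{
    ncnn::Net net;
    // 3 = DAZ + FTZ: treat denormal inputs and outputs as zero for speed;
    // use 0 instead if strict IEEE 754 behaviour is required
    net.opt.flush_denormals = 3;
    return 0;
}
```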
The Option object flows from the Net level down through Extractor to individual layer executions, controlling behavior at each stage.
Sources: src/net.cpp27-549
The Net class owns an Option object that applies to the entire network src/net.cpp907-910. The NetPrivate constructor takes a reference to this Option src/net.cpp86-88.
Sources: src/net.cpp27-910
When creating an Extractor, the user can override specific fields: the Extractor maintains its own Option copy, so overrides apply only to that inference session.
Sources: Documentation inferred from typical ncnn usage patterns
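A sketch of per-session overrides via the Extractor API (assuming the standard ncnn API; model files and blob names are hypothetical):

```cpp
#include "net.h" // ncnn header; include path depends on your install

int main()
{
    ncnn::Net net;
    net.load_param("model.param"); // hypothetical model files
    net.load_model("model.bin");

    // Each Extractor carries its own Option copy; these overrides
    // do not affect the Net-level defaults
    ncnn::Extractor ex = net.create_extractor();
    ex.set_num_threads(2);
    ex.set_light_mode(true);

    ncnn::Mat in(224, 224, 3);
    ncnn::Mat out;
    ex.input("data", in);    // blob names are model-specific
    ex.extract("prob", out);
    return 0;
}
```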
Individual layers can disable specific features through a featmask field, which the get_masked_option() function applies src/net.cpp100-121. The masking logic src/net.cpp104-118 is:
| Bit | Feature Masked | Effect |
|---|---|---|
| 0 | use_fp16_arithmetic | Disable FP16 arithmetic for this layer |
| 1 | use_fp16_packed, use_fp16_storage | Disable FP16 storage formats |
| 2 | use_bf16_packed, use_bf16_storage | Disable BF16 formats |
| 3 | use_int8_packed, use_int8_storage, use_int8_arithmetic | Disable INT8 quantization |
| 4 | use_vulkan_compute, use_tensor_storage | Disable Vulkan execution |
| 5 | use_sgemm_convolution | Disable SGEMM convolution |
| 6 | use_winograd_convolution | Disable Winograd convolution |
| 7 | num_threads = 1 | Force single-threaded execution |
This allows layers to opt-out of optimizations that don't apply or cause issues for that specific operation.
Sources: src/net.cpp100-168
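The bit layout above can be sketched as a self-contained function. The Opt struct here is a simplified stand-in for ncnn::Option, and the function mirrors (not reproduces) get_masked_option() in src/net.cpp:

```cpp
#include <cstdint>

// Simplified stand-in for ncnn::Option (defaults match the summary table)
struct Opt
{
    bool use_fp16_arithmetic = true;
    bool use_fp16_packed = true;
    bool use_fp16_storage = true;
    bool use_bf16_packed = false;
    bool use_bf16_storage = false;
    bool use_int8_packed = true;
    bool use_int8_storage = true;
    bool use_int8_arithmetic = false;
    bool use_vulkan_compute = false;
    bool use_tensor_storage = false;
    bool use_sgemm_convolution = true;
    bool use_winograd_convolution = true;
    int num_threads = 4;
};

// Each set bit in featmask forces the corresponding feature off
Opt get_masked_option(Opt opt, uint32_t featmask)
{
    if (featmask & (1u << 0)) opt.use_fp16_arithmetic = false;
    if (featmask & (1u << 1)) { opt.use_fp16_packed = false; opt.use_fp16_storage = false; }
    if (featmask & (1u << 2)) { opt.use_bf16_packed = false; opt.use_bf16_storage = false; }
    if (featmask & (1u << 3)) { opt.use_int8_packed = false; opt.use_int8_storage = false; opt.use_int8_arithmetic = false; }
    if (featmask & (1u << 4)) { opt.use_vulkan_compute = false; opt.use_tensor_storage = false; }
    if (featmask & (1u << 5)) opt.use_sgemm_convolution = false;
    if (featmask & (1u << 6)) opt.use_winograd_convolution = false;
    if (featmask & (1u << 7)) opt.num_threads = 1;
    return opt;
}
```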
The convert_layout() functions src/net.cpp358-549 perform precision and packing conversions between layers based on Option flags, choosing the target storage type and packing level for each blob.
For Vulkan, the conversion occurs on GPU through vkdev->convert_packing() src/net.cpp552-588 which can cast precision on-the-fly.
Sources: src/net.cpp358-588
Vulkan commands like record_upload() and record_download() use Option to determine precision and memory allocation src/command.cpp358-586.
The logic optimizes data transfer based on GPU type src/command.cpp363-467:
- Discrete GPU (type == 0): cast to FP16/BF16 on the CPU before upload to reduce PCIe bandwidth
- Integrated GPU (type != 0): upload FP32 and cast on the GPU, since memory is shared
Sources: src/command.cpp358-586
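The transfer decision above can be sketched as a single predicate. This is a simplified illustration, not the actual logic in record_upload():

```cpp
// Should the tensor be cast to FP16 on the CPU before upload?
bool cast_fp16_on_cpu_before_upload(int gpu_type, bool use_fp16_storage)
{
    // Discrete GPUs (type == 0): casting on the CPU halves the bytes
    // moved across PCIe. Integrated GPUs share memory with the CPU,
    // so upload FP32 as-is and let the GPU do the cast.
    return gpu_type == 0 && use_fp16_storage;
}
```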
| Category | Field | Default | Notes |
|---|---|---|---|
| Threading | num_threads | Physical big CPU count | Auto-detected |
| Threading | openmp_blocktime | 20 ms | Balances perf/power |
| Memory | lightmode | true | Recycle intermediates |
| Memory | use_local_pool_allocator | true | Thread-local pools |
| Precision | use_fp16_packed | true | Enable FP16 |
| Precision | use_fp16_storage | true | Enable FP16 |
| Precision | use_fp16_arithmetic | true | Enable FP16 |
| Precision | use_bf16_storage | false | Disabled by default |
| Precision | use_int8_packed | true | Enable INT8 |
| Precision | use_int8_storage | true | Enable INT8 |
| Precision | use_int8_arithmetic | false | Disabled by default |
| Convolution | use_winograd_convolution | true | All Winograd variants |
| Convolution | use_sgemm_convolution | true | im2col+GEMM |
| Convolution | use_packing_layout | true | SIMD layout |
| Vulkan | use_vulkan_compute | false | Explicit opt-in |
| Vulkan | use_shader_local_memory | true | Tile caching |
| Vulkan | use_cooperative_matrix | true | Tensor cores |
| Vulkan | vulkan_device_index | -1 | Default device |
| CPU | flush_denormals | 3 | DAZ+FTZ enabled |
| ARM | use_a53_a55_optimized_kernel | Auto | Detected at runtime |
Sources: src/option.cpp10-76
Sources: Documentation inferred from typical ncnn usage patterns, based on src/net.cpp and src/option.h