The Benchmarking System provides tools for measuring neural network inference performance across different hardware platforms and configurations. The primary component is benchncnn, a command-line tool that executes inference on a suite of standard models or custom user models, reporting timing statistics for CPU and GPU execution paths.
For information about building NCNN, see Platform Support and Build System. For details on the runtime execution architecture being benchmarked, see Core Runtime Architecture.
The benchmarking system consists of three main components: the benchncnn command-line tool, the CMake machinery that embeds model definitions at build time, and the RankCards result-ranking tool.
Sources: benchmark/README.md benchmark/benchncnn.cpp benchmark/CMakeLists.txt cmake/ncnn_add_param.cmake
The benchncnn tool measures inference performance by running models with randomly generated weights, eliminating the need to load large binary weight files while still exercising the full inference pipeline.
The tool accepts positional and key-value arguments:
| Parameter | Type | Default | Description |
|---|---|---|---|
| loop_count | int | 4 | Number of inference iterations |
| num_threads | int | max_cpu_count | Thread pool size |
| powersave | int | 0 | CPU core selection (0=all, 1=little, 2=big) |
| gpu_device | int | -1 | GPU device index (-1 for CPU-only) |
| cooling_down | int | 1 | Enable 10-second sleep between tests |
| param | string | - | Path to custom .param file |
| shape | string | - | Input shapes in [w,h,c] format |
Sources: benchmark/benchncnn.cpp:177-182 benchmark/benchncnn.cpp:246-313 benchmark/README.md:23-63
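A usage sketch following the argument order and key=value form described above (the binary path and model file name are illustrative):

```
# all built-in models: 8 timed loops, 4 threads, big cores only, CPU, cooling down on
./benchncnn 8 4 2 -1 1

# a custom model on GPU device 0 with a single [w,h,c] input
./benchncnn 8 4 0 0 1 param=yourmodel.param shape=[224,224,3]
```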
The benchmarking tool uses dedicated memory allocators to ensure consistent performance measurement:
The allocators are initialized at startup and cleared before each benchmark run to ensure fair measurement:
- g_blob_pool_allocator: lock-free allocator for blob data (intermediate tensors)
- g_workspace_pool_allocator: pooled allocator for temporary workspace
- g_blob_vkallocator: Vulkan device memory allocator (GPU mode)
- g_staging_vkallocator: Vulkan staging memory for CPU-GPU transfers (GPU mode)

Sources: benchmark/benchncnn.cpp:42-49 benchmark/benchncnn.cpp:61-69 benchmark/benchncnn.cpp:333-363
The DataReaderFromEmpty class implements the ncnn::DataReader interface to generate zero-initialized model weights at runtime, avoiding disk I/O overhead:
This approach ensures that benchmarks measure pure inference performance without weight loading overhead, and allows benchmarking without storing multi-gigabyte model files.
Sources: benchmark/benchncnn.cpp:24-36 benchmark/benchncnn.cpp:92-93
The benchmark() function orchestrates the performance measurement process:
A key aspect of the execution flow is that each inference iteration is timed by sampling ncnn::get_current_time() before and after the call.
Sources: benchmark/benchncnn.cpp:51-168 benchmark/benchncnn.cpp:338-346
The benchmarking tool includes 35 standard models embedded at compile time:
| Model Category | Models |
|---|---|
| Classification | squeezenet, mobilenet, mobilenet_v2, mobilenet_v3, shufflenet, shufflenet_v2, mnasnet, proxylessnasnet, efficientnet_b0, efficientnetv2_b0, regnety_400m, googlenet, resnet18, alexnet, vgg16, resnet50, vision_transformer |
| Detection | blazeface, squeezenet_ssd, mobilenet_ssd, mobilenet_yolo, mobilenetv2_yolov3, yolov4-tiny, nanodet_m, yolo-fastest-1.1, yolo-fastestv2, FastestDet |
| INT8 Quantized | squeezenet_int8, mobilenet_int8, googlenet_int8, resnet18_int8, vgg16_int8, resnet50_int8, squeezenet_ssd_int8, mobilenet_ssd_int8 |
Model definitions are embedded as C arrays at build time using CMake macros:
The conversion is performed by ncnn_generate_param_header.cmake, which writes each model's .param text into a generated header, benchncnn_param_data.h, as a byte array.
Sources: benchmark/CMakeLists.txt:8-50 cmake/ncnn_add_param.cmake cmake/ncnn_generate_param_header.cmake benchmark/benchncnn.cpp:18
When run without custom model parameters, benchncnn executes all built-in models:
Each built-in benchmark specifies a model name, a fixed input shape, and an ncnn::Option configuration.
Sources: benchmark/benchncnn.cpp:381-460
Users can benchmark their own models by providing a .param file and input shapes:
The parse_shape_list() function parses input specifications:
Supported input dimensions:
- [size] → ncnn::Mat(size)
- [width,height] → ncnn::Mat(width, height)
- [width,height,channels] → ncnn::Mat(width, height, channels)
- [width,height,depth,channels] → ncnn::Mat(width, height, depth, channels)

Sources: benchmark/benchncnn.cpp:184-244 benchmark/benchncnn.cpp:309-318 benchmark/README.md:49-50
The benchmarking system uses two sets of iterations:
| Phase | Variable | Default (CPU) | Default (GPU) | Purpose |
|---|---|---|---|---|
| Warmup | g_warmup_loop_count | 8 | 10 | Cache warming, JIT compilation |
| Timing | g_loop_count | 4 | User-specified | Actual performance measurement |
Sources: benchmark/benchncnn.cpp:38-39 benchmark/benchncnn.cpp:118-132 benchmark/benchncnn.cpp:138-163 benchmark/benchncnn.cpp:338-345
When enabled, the tool sleeps for 10 seconds between model benchmarks to prevent thermal throttling:
This is especially important for mobile/embedded platforms where sustained computation causes CPU/GPU frequency scaling.
Sources: benchmark/benchncnn.cpp:98-102 benchmark/benchncnn.cpp:329
Results are printed to stderr in a tabular format:
squeezenet min = 11.66 max = 11.80 avg = 11.74
squeezenet_int8 min = 12.24 max = 12.39 avg = 12.31
mobilenet min = 19.56 max = 19.73 avg = 19.65
Format specification: %20s min = %7.2f max = %7.2f avg = %7.2f
Sources: benchmark/benchncnn.cpp:167
The ncnn::Option object controls inference behavior during benchmarking:
All optimization features are enabled to benchmark peak performance capabilities.
Sources: benchmark/benchncnn.cpp:354-373
The RankCards tool analyzes benchmark results from README.md and generates performance rankings:
The ranking uses a logarithmically-weighted average to emphasize relative performance on slower models:
ratio = Σ(log(ref_time[i]) × (time[i] / ref_time[i])) / Σ(log(ref_time[i]))
Where:
- ref_time[i]: reference board time for model i
- time[i]: current board time for model i

Lower ratios indicate faster hardware; times are compared against a fixed reference board defined in the tool's source.
Sources: benchmark/RankCards/main.cpp benchmark/RankCards/Rcards.h:38-171 benchmark/RankCards/README.md
Output is written to benchmark/RankCards/README.md with a ranked table of all platforms.
Sources: benchmark/RankCards/main.cpp:22-172 benchmark/RankCards/CMakeLists.txt
benchmark/README.md documents recommended device preparation before benchmarking, such as fixing CPU frequency governors. These settings prevent frequency scaling from affecting benchmark consistency.
Sources: benchmark/README.md:30-88 benchmark/README.md:658-698
The benchmark/README.md file contains extensive benchmark results contributed by the community. Results are organized by platform, with full hardware specifications and benchmark configurations documented.
Sources: benchmark/README.md:94-2000 (extensive results section)