The Benchmarking System provides tools for measuring neural network inference performance across different hardware platforms and configurations. The primary component is benchncnn, a command-line tool that executes inference on a suite of standard models or custom user models, reporting timing statistics for CPU and GPU execution paths.
For information about building NCNN, see Platform Support and Build System. For details on the runtime execution architecture being benchmarked, see Core Runtime Architecture.
The benchmarking system consists of three main components: the benchncnn command-line tool, the CMake machinery that embeds model definitions at build time, and the RankCards result-ranking tool.
Sources: benchmark/README.md benchmark/benchncnn.cpp benchmark/CMakeLists.txt cmake/ncnn_add_param.cmake
The benchncnn tool measures inference performance by running models with randomly generated weights, eliminating the need to load large binary weight files while still exercising the full inference pipeline.
The tool accepts positional and key-value arguments:
| Parameter | Type | Default | Description |
|---|---|---|---|
| loop_count | int | 4 | Number of inference iterations |
| num_threads | int | max_cpu_count | Thread pool size |
| powersave | int | 0 | CPU core selection (0=all, 1=little, 2=big) |
| gpu_device | int | -1 | GPU device index (-1 for CPU-only) |
| cooling_down | int | 1 | Enable 10-second sleep between tests |
| param | string | - | Path to custom .param file |
| shape | string | - | Input shapes in [w,h,c] format |
Sources: benchmark/benchncnn.cpp:177-182 benchmark/benchncnn.cpp:246-313 benchmark/README.md:23-63
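A usage sketch following the argument order and key=value form described above (the binary path and model file name are illustrative):

```
# all built-in models: 8 timed loops, 4 threads, big cores only, CPU, cooling down on
./benchncnn 8 4 2 -1 1

# a custom model on GPU device 0 with a single [w,h,c] input
./benchncnn 8 4 0 0 1 param=yourmodel.param shape=[224,224,3]
```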
The benchmarking tool uses dedicated memory allocators to ensure consistent performance measurement:
The allocators are initialized at startup and cleared before each benchmark run to ensure fair measurement:
- g_blob_pool_allocator: lock-free allocator for blob data (intermediate tensors)
- g_workspace_pool_allocator: pooled allocator for temporary workspace
- g_blob_vkallocator: Vulkan device memory allocator (GPU mode)
- g_staging_vkallocator: Vulkan staging memory for CPU-GPU transfers (GPU mode)

Sources: benchmark/benchncnn.cpp:42-49 benchmark/benchncnn.cpp:61-69 benchmark/benchncnn.cpp:333-363
The DataReaderFromEmpty class implements the ncnn::DataReader interface to generate zero-initialized model weights at runtime, avoiding disk I/O overhead:
This approach ensures that benchmarks measure pure inference performance without weight loading overhead, and allows benchmarking without storing multi-gigabyte model files.
Sources: benchmark/benchncnn.cpp:24-36 benchmark/benchncnn.cpp:92-93
The benchmark() function orchestrates the performance measurement process:
A key aspect of the execution flow is that each inference iteration is timed by sampling ncnn::get_current_time() before and after the call.
Sources: benchmark/benchncnn.cpp:51-168 benchmark/benchncnn.cpp:338-346
The benchmarking tool includes 35 standard models embedded at compile time:
| Model Category | Models |
|---|---|
| Classification | squeezenet, mobilenet, mobilenet_v2, mobilenet_v3, shufflenet, shufflenet_v2, mnasnet, proxylessnasnet, efficientnet_b0, efficientnetv2_b0, regnety_400m, googlenet, resnet18, alexnet, vgg16, resnet50, vision_transformer |
| Detection | blazeface, squeezenet_ssd, mobilenet_ssd, mobilenet_yolo, mobilenetv2_yolov3, yolov4-tiny, nanodet_m, yolo-fastest-1.1, yolo-fastestv2, FastestDet |
| INT8 Quantized | squeezenet_int8, mobilenet_int8, googlenet_int8, resnet18_int8, vgg16_int8, resnet50_int8, squeezenet_ssd_int8, mobilenet_ssd_int8 |
Model definitions are embedded as C arrays at build time using CMake macros:
The conversion is performed by ncnn_generate_param_header.cmake, which writes each model's .param text into a generated header, benchncnn_param_data.h, as a byte array.
Sources: benchmark/CMakeLists.txt:8-50 cmake/ncnn_add_param.cmake cmake/ncnn_generate_param_header.cmake benchmark/benchncnn.cpp:18
When run without custom model parameters, benchncnn executes all built-in models:
Each built-in benchmark specifies a model name, a fixed input shape, and an ncnn::Option configuration.
Sources: benchmark/benchncnn.cpp:381-460
Users can benchmark their own models by providing a .param file and input shapes:
The parse_shape_list() function parses input specifications:
Supported input dimensions:
- [size] → ncnn::Mat(size)
- [width,height] → ncnn::Mat(width, height)
- [width,height,channels] → ncnn::Mat(width, height, channels)
- [width,height,depth,channels] → ncnn::Mat(width, height, depth, channels)

Sources: benchmark/benchncnn.cpp:184-244 benchmark/benchncnn.cpp:309-318 benchmark/README.md:49-50
The benchmarking system uses two sets of iterations:
| Phase | Variable | Default (CPU) | Default (GPU) | Purpose |
|---|---|---|---|---|
| Warmup | g_warmup_loop_count | 8 | 10 | Cache warming, JIT compilation |
| Timing | g_loop_count | 4 | User-specified | Actual performance measurement |
Sources: benchmark/benchncnn.cpp:38-39 benchmark/benchncnn.cpp:118-132 benchmark/benchncnn.cpp:138-163 benchmark/benchncnn.cpp:338-345
When enabled, the tool sleeps for 10 seconds between model benchmarks to prevent thermal throttling:
This is especially important for mobile/embedded platforms where sustained computation causes CPU/GPU frequency scaling.
Sources: benchmark/benchncnn.cpp:98-102 benchmark/benchncnn.cpp:329
Results are printed to stderr in a tabular format:
squeezenet min = 11.66 max = 11.80 avg = 11.74
squeezenet_int8 min = 12.24 max = 12.39 avg = 12.31
mobilenet min = 19.56 max = 19.73 avg = 19.65
Format specification: %20s min = %7.2f max = %7.2f avg = %7.2f
Sources: benchmark/benchncnn.cpp:167
The ncnn::Option object controls inference behavior during benchmarking:
All optimization features are enabled to benchmark peak performance capabilities.
Sources: benchmark/benchncnn.cpp:354-373
The RankCards tool analyzes benchmark results from README.md and generates performance rankings:
The ranking uses a logarithmically-weighted average to emphasize relative performance on slower models:
ratio = Σ(log(ref_time[i]) × (time[i] / ref_time[i])) / Σ(log(ref_time[i]))
Where:
- ref_time[i]: reference board time for model i
- time[i]: current board time for model i

Lower ratios indicate faster hardware; times are compared against a fixed reference board defined in the tool's source.
Sources: benchmark/RankCards/main.cpp benchmark/RankCards/Rcards.h:38-171 benchmark/RankCards/README.md
Output is written to benchmark/RankCards/README.md with a ranked table of all platforms.
Sources: benchmark/RankCards/main.cpp:22-172 benchmark/RankCards/CMakeLists.txt
benchmark/README.md documents recommended device preparation before benchmarking, such as fixing CPU frequency governors. These settings prevent frequency scaling from affecting benchmark consistency.
Sources: benchmark/README.md:30-88 benchmark/README.md:658-698
The benchmark/README.md file contains extensive benchmark results contributed by the community. Results are organized by platform, with full hardware specifications and benchmark configurations documented.
Sources: benchmark/README.md:94-2000 (extensive results section)