This page describes the toolchain for converting trained models from external frameworks into ncnn's native format and for subsequently optimizing those models before deployment. It covers the available converter tools, the ncnn model file format, optimization passes in ncnnoptimize, INT8 quantization via ncnn2int8, and binary embedding via ncnn2mem.
For runtime details on how the resulting .param and .bin files are loaded and executed, see Core Runtime Architecture and Network Loading and Inference Pipeline. For the internal PNNX IR and its detailed pass pipeline, see PNNX PyTorch Converter Architecture and PNNX Intermediate Representation. For the post-training quantization tools in depth, see Post-Training Quantization Tools.
ncnn models consist of two files:
| File | Content |
|---|---|
| .param | Text file describing the layer graph — layer types, names, input/output blob names, and per-layer parameters |
| .bin | Binary file containing all weight data in row-major order |
Every .param file begins with the magic number 7767517, followed by a line containing the total layer count and blob count. Each subsequent line describes one layer:
```
7767517
<layer_count> <blob_count>
<LayerType> <LayerName> <bottom_count> <top_count> <bottoms...> <tops...> [param_id=value ...]
```
This format is written by all converters and consumed by ncnn::Net::load_param at runtime.
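For concreteness, a minimal three-layer model in this format might look as follows (layer names, blob names, and parameter values are invented for illustration; Convolution params: 0=num_output, 1=kernel size, 5=bias_term, 6=weight_data_size):

```
7767517
3 3
Input            data     0 1 data 0=224 1=224 2=3
Convolution      conv1    1 1 data conv1 0=16 1=3 5=1 6=432
ReLU             relu1    1 1 conv1 relu1
```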
Sources: tools/caffe/caffe2ncnn.cpp:99-100 tools/mxnet/mxnet2ncnn.cpp:981-982 tools/onnx/onnx2ncnn.cpp:1-20 tools/modelwriter.h:208-244
Conversion pipeline diagram
Sources: tools/pnnx/src/main.cpp:226-400 tools/onnx/onnx2ncnn.cpp:1-43 tools/caffe/caffe2ncnn.cpp:65-100 tools/mxnet/mxnet2ncnn.cpp:959-984 tools/ncnnoptimize.cpp:36-78 tools/quantize/ncnn2int8.cpp:108-131 tools/ncnn2mem.cpp:153-230
PNNX is the primary recommended tool for converting PyTorch models. Its entry point is tools/pnnx/src/main.cpp. It accepts a TorchScript .pt file or an ONNX file and emits both PNNX-format intermediate files and ncnn-format output files.
```
pnnx model.pt inputshape=[1,3,224,224]
pnnx model.onnx inputshape=[1,3,224,224]
```
Key arguments parsed by main():
| Argument | Default | Purpose |
|---|---|---|
| pnnxparam | model.pnnx.param | PNNX IR parameter file |
| pnnxbin | model.pnnx.bin | PNNX IR weight file |
| ncnnparam | model.ncnn.param | ncnn parameter output |
| ncnnbin | model.ncnn.bin | ncnn weight output |
| fp16 | 1 | Save weights in FP16 |
| optlevel | 2 | Optimization pass depth (0–5) |
| inputshape | — | Required for shape inference, e.g. [1,3,224,224] |
| device | cpu | cpu or gpu for tracing |
| moduleop | — | Comma-separated module class names to keep opaque |
PNNX internal pipeline diagram
Sources: tools/pnnx/src/main.cpp:23-36 tools/pnnx/src/main.cpp:226-400 tools/pnnx/src/save_ncnn.cpp:74-90
The PNNX intermediate representation is defined in tools/pnnx/src/ir.h and implemented in tools/pnnx/src/ir.cpp.
PNNX IR class diagram
Sources: tools/pnnx/src/ir.h:1-200 tools/pnnx/src/ir.cpp:520-791
save_ncnn() in tools/pnnx/src/save_ncnn.cpp iterates the Graph::ops list after pass_ncnn has remapped PNNX operators to ncnn layer types. It writes the .param file in ncnn's text format and the .bin file with optionally FP16-converted weight data.
Sources: tools/pnnx/src/save_ncnn.cpp:74-120
Located at tools/caffe/caffe2ncnn.cpp. Reads a Caffe .prototxt (text proto, via read_proto_from_text) and .caffemodel (binary proto, via read_proto_from_binary). Maps Caffe layer types to ncnn layer types, writing the .param and .bin files directly.
Notable type mappings applied by main():
| Caffe type | ncnn type |
|---|---|
| Convolution (group != 1) | ConvolutionDepthWise |
| Deconvolution (group != 1) | DeconvolutionDepthWise |
| MemoryData | Input |
| ReLU6 | Clip |
| Silence | Noop |
| BN | Scale |
Where a blob is consumed by more than one layer, caffe2ncnn inserts synthetic Split layers and suffixed blob names (_splitncnn_N) to make the DAG explicit.
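For example, if the blob conv1 fed two downstream layers, the converter would emit a fragment like this (layer names and parameters hypothetical):

```
Split            splitncnn_0  1 2 conv1 conv1_splitncnn_0 conv1_splitncnn_1
Convolution      branch_a     1 1 conv1_splitncnn_0 branch_a 0=16 1=3 6=2304
Pooling          branch_b     1 1 conv1_splitncnn_1 branch_b 0=0 1=2 2=2
```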
Sources: tools/caffe/caffe2ncnn.cpp:65-270
Located at tools/onnx/onnx2ncnn.cpp. Parses an ONNX binary protobuf via read_proto_from_binary into an onnx::ModelProto. Before writing ncnn output it runs several graph-level fusion passes over the mutable ONNX graph:
| Pass function | What it fuses |
|---|---|
| fuse_weight_reshape | Absorbs Reshape of constant weights into weight tensors directly |
| fuse_weight_transpose | Absorbs Transpose(perm=[1,0]) of 2-D weight tensors |
| fuse_shufflechannel | Reshape – Transpose – Reshape → ShuffleChannel |
| fuse_shufflechannel_split | ShuffleChannel(reverse) – Gather – Gather → Split |
| fuse_hardswish | Add(+3) – Clip(0,6) – Mul – Div(/6) → HardSwish |
| fuse_hardsigmoid | Add(+3) – Clip(0,6) – Div(/6) → HardSigmoid |
| fuse_swish | Sigmoid – Mul → Swish |
| fuse_batchnorm1d_squeeze_unsqueeze | Unsqueeze – BN – Squeeze → BatchNormalization |
| fuse_rewrite_gather | Single-index Gather → Crop |
Nodes reduced by these passes are marked noop_reducedncnn and excluded from the final output.
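To see why a pattern rewrite like fuse_hardswish is exact, compare the subgraph's arithmetic with the single fused op. This is a standalone sketch (not ncnn code); the fused form shown is one algebraically equivalent parameterization:

```cpp
#include <algorithm>
#include <cmath>

// The subgraph matched by fuse_hardswish: Add(+3) - Clip(0,6) - Mul - Div(/6)
float hardswish_subgraph(float x)
{
    float t = x + 3.f;                   // Add(+3)
    t = std::min(std::max(t, 0.f), 6.f); // Clip(0,6)
    return x * t / 6.f;                  // Mul, Div(/6)
}

// One fused form: x * clamp(x/6 + 0.5, 0, 1).
// Since clip(x+3, 0, 6)/6 == clamp(x/6 + 0.5, 0, 1), a single op suffices.
float hardswish_fused(float x)
{
    float t = std::min(std::max(x / 6.f + 0.5f, 0.f), 1.f);
    return x * t;
}
```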
Sources: tools/onnx/onnx2ncnn.cpp:362-1200
Located at tools/mxnet/mxnet2ncnn.cpp. Reads an MXNet JSON symbol file via read_mxnet_json (custom line-by-line parser into MXNetNode objects) and an MXNet binary parameter file via read_mxnet_param. MXNet parameters are read as raw NDArray binary records identified by a magic number (0xF993FAC9 or 0xF993FAC8).
Similar pre-processing fusion passes exist:
| Pass | What it fuses |
|---|---|
| fuse_shufflechannel | Reshape – SwapAxis – Reshape → ShuffleChannel |
| fuse_hardsigmoid_hardswish | _plus_scalar(+3) – clip(0,6) – _div_scalar(/6) → HardSigmoid or HardSwish |
Sources: tools/mxnet/mxnet2ncnn.cpp:342-960
ncnnoptimize (implemented in tools/ncnnoptimize.cpp) loads an existing ncnn model using the NetOptimize class (which extends ModelWriter, itself extending ncnn::Net), applies a sequence of structural fusion and elimination passes, then saves the result to new .param and .bin files.
ModelWriter (defined in tools/modelwriter.h) is a subclass of ncnn::Net that exposes the internal layers and blobs vectors mutably, adds storage_type (FP16 vs FP32), cutstart/cutend for graph section extraction, and provides serialization via save(parampath, binpath).
Sources: tools/modelwriter.h:208-244
NetOptimize adds fusion and elimination methods on top of ModelWriter:
Fusion passes — merge consecutive layers into a single parameterized layer:
| Method | Effect |
|---|---|
| fuse_batchnorm_scale | BatchNorm – Scale → BatchNorm |
| fuse_convolution_batchnorm | Convolution – BatchNorm → Convolution (BN params folded into conv weights) |
| fuse_convolution_mul | Convolution – BinaryOp(Mul, MemoryData) → Convolution |
| fuse_convolution_add | Convolution – BinaryOp(Add, MemoryData) → Convolution |
| fuse_convolutiondepthwise_batchnorm | Same as above for depthwise conv |
| fuse_deconvolution_batchnorm | Same for transposed conv |
| fuse_innerproduct_batchnorm | InnerProduct – BatchNorm → InnerProduct |
| fuse_innerproduct_add | InnerProduct – BinaryOp(Add, MemoryData) → InnerProduct |
| fuse_innerproduct_dropout | InnerProduct – Dropout → InnerProduct |
| fuse_convolution_activation | Appends ReLU/Clip/Sigmoid activation into convolution activation_type |
| fuse_memorydata_binaryop | Folds constant MemoryData scalars into BinaryOp |
| fuse_binaryop_eltwise | BinaryOp(Add) + MemoryData → Eltwise |
Elimination passes — remove no-op layers:
| Method | Effect |
|---|---|
| eliminate_dropout | Removes all Dropout layers |
| eliminate_pooling1x1 | Removes 1×1 stride-1 average pools with no padding |
| eliminate_noop | Removes Noop layers |
| eliminate_split | Removes Split layers with a single consumer |
| eliminate_orphaned_memorydata | Removes MemoryData not consumed by any layer |
| eliminate_flatten_after_global_pooling | Removes redundant Flatten after global pool |
| eliminate_reshape_after_global_pooling | Same for Reshape |
| eliminate_flatten_after_innerproduct | Removes Flatten after InnerProduct |
| eliminate_reshape_before_binaryop | Removes rank-expanding Reshape before BinaryOp |
Replace passes — substitute one layer type for a more efficient equivalent:
| Method | Effect |
|---|---|
| replace_reduction_with_global_pooling | Reduction(mean, all axes) → Pooling(global_avg) |
| replace_prelu_with_leaky_relu | PReLU with single slope → ReLU(negative_slope) |
| replace_convolution_with_innerproduct_after_global_pooling | 1×1 Convolution after global pool → InnerProduct |
| replace_convolution_with_innerproduct_after_innerproduct | 1×1 Convolution after InnerProduct → InnerProduct |
Sources: tools/ncnnoptimize.cpp:36-78 tools/ncnnoptimize.cpp:85-2500
The fuse_convolution_batchnorm pass is representative of how weight-level arithmetic is used during fusion. Given BatchNorm parameters slope, mean, var, bias, and epsilon:
```
b[i] = slope[i] / sqrt(var[i] + eps)
a[i] = bias[i] - slope[i] * mean[i] / sqrt(var[i] + eps)
```

The convolution weights for output channel i are multiplied by b[i], and the convolution bias for channel i is updated to conv_bias[i] * b[i] + a[i]. The BatchNorm layer is then marked ncnnfused so it is excluded on save.
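The fold can be sketched as standalone arithmetic (this is not the actual NetOptimize code; layout and names are simplified, with weights stored contiguously per output channel):

```cpp
#include <cmath>
#include <vector>

// Fold BatchNorm (slope, mean, var, bn_bias, eps) into the preceding
// convolution's weights and bias, one output channel at a time:
//   b = slope / sqrt(var + eps)
//   a = bn_bias - slope * mean / sqrt(var + eps)
//   w_conv *= b;  conv_bias = conv_bias * b + a
void fold_bn_into_conv(std::vector<float>& weights,   // [out_ch * weights_per_channel]
                       std::vector<float>& conv_bias, // [out_ch]
                       const std::vector<float>& slope,
                       const std::vector<float>& mean,
                       const std::vector<float>& var,
                       const std::vector<float>& bn_bias,
                       float eps)
{
    const size_t out_ch = conv_bias.size();
    const size_t per_ch = weights.size() / out_ch;
    for (size_t i = 0; i < out_ch; i++)
    {
        float b = slope[i] / std::sqrt(var[i] + eps);
        float a = bn_bias[i] - slope[i] * mean[i] / std::sqrt(var[i] + eps);
        for (size_t k = 0; k < per_ch; k++)
            weights[i * per_ch + k] *= b;
        conv_bias[i] = conv_bias[i] * b + a;
    }
}
```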
Sources: tools/ncnnoptimize.cpp:146-227
ModelWriter::storage_type controls whether weights are written as float32 (type 0) or float16 (type 1) in fwrite_weight_tag_data. The quantize tag prepended to each weight blob in the .bin file encodes the storage type so that ModelBin on the runtime side knows how to interpret it.
Sources: tools/modelwriter.h:220-242 src/modelbin.cpp:1-80
```
ncnnoptimize model.param model.bin model_opt.param model_opt.bin [storage_type]
```
storage_type: 0 = FP32, 1 = FP16.
INT8 quantization is a two-step process:
Step 1: Run ncnn2table (see Post-Training Quantization Tools) against a calibration dataset to produce a .table file containing per-layer activation scales and per-channel weight scales.
Step 2: Run ncnn2int8 with the original model and the .table file to produce an INT8 model.
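A typical invocation of the two steps looks like the following (preprocessing values are illustrative; argument names follow the upstream quantization guide and may differ between ncnn versions):

```
./ncnn2table model.param model.bin imagelist.txt model.table mean=[104,117,123] norm=[0.017,0.017,0.017] shape=[224,224,3] pixel=BGR method=kl
./ncnn2int8 model.param model.bin model-int8.param model-int8.bin model.table
```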
The NetQuantize class in tools/quantize/ncnn2int8.cpp extends ModelWriter and holds two maps read from the scale table file:
- blob_int8scale_table — maps layer name → ncnn::Mat of activation scales
- weight_int8scale_table — maps <layername>_param_0 → ncnn::Mat of per-output-channel weight scales

Quantization methods:
| Method | Target layer |
|---|---|
| quantize_convolution | Convolution — weights quantized with ncnn::quantize_to_int8 |
| quantize_convolutiondepthwise | ConvolutionDepthWise — per-group quantization |
| quantize_innerproduct | InnerProduct |
| quantize_rnn | RNN — per-output-row scale derived from weight abs-max |
| quantize_lstm | LSTM |
| quantize_gru | GRU |
| quantize_embed | Embed |
| quantize_gemm | Gemm |
| quantize_multiheadattention | MultiHeadAttention |
| quantize_sdpa | SDPA |
| fuse_requantize | Fuses Dequantize – Quantize pairs into Requantize |
After quantization, the layer's int8_scale_term is set (e.g. 2 for per-channel), and weight_data_int8_scales / bottom_blob_int8_scales are stored in the layer parameters. These are serialized into the .bin file so the runtime's Quantize and Requantize layers can apply them.
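The core weight quantization can be sketched as symmetric round-to-nearest with saturation (a simplified standalone model of what ncnn::quantize_to_int8 does, not the actual implementation):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize one value: q = clamp(round(v * scale), -127, 127).
int8_t quantize_to_int8_scalar(float v, float scale)
{
    int q = (int)std::lroundf(v * scale);
    if (q > 127) q = 127;
    if (q < -127) q = -127;
    return (int8_t)q;
}

// A typical per-channel weight scale is 127 / absmax over the channel's weights,
// so the largest-magnitude weight maps to +/-127.
float weight_scale(const float* w, int n)
{
    float absmax = 0.f;
    for (int i = 0; i < n; i++)
        absmax = std::max(absmax, std::fabs(w[i]));
    return absmax > 0.f ? 127.f / absmax : 1.f;
}
```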
Quantization workflow diagram
Sources: tools/quantize/ncnn2int8.cpp:37-106 tools/quantize/ncnn2int8.cpp:108-131 tools/quantize/ncnn2int8.cpp:138-194
ncnn2mem (in tools/ncnn2mem.cpp) converts a model's .param and .bin files into formats that can be compiled directly into a binary, eliminating filesystem access at runtime.
It produces:
| Output file | Content |
|---|---|
| model.param.bin | Binary-encoded version of the text .param file |
| model.id.h | C++ header with a namespace of integer constants for each layer and blob name |
The text .param is re-parsed by dump_param() and re-serialized as binary data. sanitize_name() converts blob/layer names into valid C++ identifiers. The generated model.param.bin is loaded at runtime by passing its address to Net::load_param_bin via a DataReaderFromMemory.
The .bin file itself is used directly as a byte array — the application either mmaps it or stores it in a const unsigned char[].
ncnn2mem output diagram
The model.id.h header lets the application refer to blobs by name at compile time rather than via string lookup. For example:
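A hypothetical sketch (assuming input/output blobs named data and output, and that the application has compiled model.param.bin and model.bin into byte arrays, e.g. via xxd -i; the array names below are invented):

```cpp
#include "net.h"        // ncnn
#include "model.id.h"   // generated ids, e.g. model_param_id::BLOB_data

extern const unsigned char model_param_bin[]; // contents of model.param.bin
extern const unsigned char model_bin[];       // contents of model.bin

int run(const ncnn::Mat& in, ncnn::Mat& out)
{
    ncnn::Net net;
    net.load_param(model_param_bin); // binary param parsed from memory
    net.load_model(model_bin);       // weights read in place

    ncnn::Extractor ex = net.create_extractor();
    ex.input(model_param_id::BLOB_data, in); // integer id, no string lookup
    return ex.extract(model_param_id::BLOB_output, out);
}
```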
Sources: tools/ncnn2mem.cpp:153-230 tools/ncnn2mem.cpp:31-51
Both converters and the runtime use the ModelBin / DataReader abstraction for reading weights. ModelBin (in src/modelbin.h and src/modelbin.cpp) provides load(w, type) methods that interpret the quantization tag at the start of each weight blob. DataReader (in src/datareader.h, src/datareader.cpp) abstracts over file I/O (DataReaderFromStdio), memory pointer (DataReaderFromMemory), and other sources.
ModelWriter in tools/modelwriter.h writes weights with fwrite_weight_tag_data, which prepends the storage-type tag before the raw float data.
Sources: src/modelbin.h:1-50 src/modelbin.cpp:1-80 src/datareader.h:1-40 tools/modelwriter.h:238-244
| Use case | Recommended path |
|---|---|
| PyTorch model | pnnx → ncnnoptimize |
| ONNX model | pnnx (ONNX input) or onnx2ncnn → ncnnoptimize |
| Caffe model | caffe2ncnn → ncnnoptimize |
| MXNet model | mxnet2ncnn → ncnnoptimize |
| Mobile deployment, size matters | add ncnnoptimize with storage_type=1 (FP16) |
| Highest throughput on ARM/x86 | add ncnn2table + ncnn2int8 for INT8 |
| No filesystem on device | add ncnn2mem to embed the model |
Sources: tools/pnnx/src/main.cpp:201-224 tools/ncnnoptimize.cpp:36-78 tools/quantize/ncnn2int8.cpp:108-131 tools/ncnn2mem.cpp:153-170