This document covers CPU inference optimization strategies in PaddleOCR, including configuration of MKL-DNN acceleration, threading parameters, and build-time optimization options. These optimizations apply to both Python and C++ inference deployments when running on CPU hardware.
For GPU optimization and TensorRT acceleration, see NVIDIA GPU and TensorRT. For general inference configuration, see Python Inference System.
The CPU optimization system in PaddleOCR consists of three main components: runtime configuration parameters, Intel MKL-DNN (oneDNN) acceleration library integration, and threading/parallelization controls. These optimizations are configured through the predictor creation process.
Sources: tools/infer/utility.py177-435
The CPU optimization parameters are defined in the argument parser and used during predictor creation:
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_mkldnn | bool | None | Enable Intel MKL-DNN acceleration. When None, automatically determined based on model type |
| cpu_threads | int | 10 | Number of threads for CPU math library operations |
| ir_optim | bool | True | Enable inference IR optimizations |
| precision | str | "fp32" | Precision mode: "fp32" or "fp16" (with MKL-DNN, "fp16" selects BFloat16) |
Key Implementation Points:
The enable_mkldnn parameter in tools/infer/utility.py133 controls whether MKL-DNN acceleration is enabled. When set to None (default), the system makes automatic decisions based on the model architecture.
The cpu_threads parameter in tools/infer/utility.py134 specifies the number of threads used by the CPU math library. This directly affects parallel computation during inference.
Sources: tools/infer/utility.py38-169
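As an illustration, the CPU-related flags can be sketched with a minimal stand-alone parser. This is a simplified sketch of only the four parameters above, not the full parser in tools/infer/utility.py:

```python
import argparse


def str2bool(v):
    # PaddleOCR-style boolean flag parsing ("True"/"false"/"1" strings)
    return str(v).lower() in ("true", "yes", "t", "y", "1")


def init_cpu_args():
    # Trimmed-down sketch: the real parser defines many more options.
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_mkldnn", type=str2bool, default=None)
    parser.add_argument("--cpu_threads", type=int, default=10)
    parser.add_argument("--ir_optim", type=str2bool, default=True)
    parser.add_argument("--precision", type=str, default="fp32")
    return parser.parse_args([])  # defaults only, for illustration


args = init_cpu_args()
```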
MKL-DNN (now called oneDNN) is Intel's open-source performance library for deep learning applications. It provides highly optimized implementations of common DNN operations specifically tuned for Intel architectures.
Implementation Details:
The MKL-DNN configuration occurs in tools/infer/utility.py386-402. The key steps are:
- If args.enable_mkldnn is True, config.enable_mkldnn() is called at tools/infer/utility.py391
- config.set_mkldnn_cache_capacity(10) at tools/infer/utility.py390 prevents memory leaks by caching kernels for at most 10 different input shapes
- If BF16 precision is requested, config.enable_mkldnn_bfloat16() is called at tools/infer/utility.py393
- If args.enable_mkldnn is False, config.disable_mkldnn() at tools/infer/utility.py396 explicitly disables MKL-DNN

Sources: tools/infer/utility.py386-402
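The configuration flow can be sketched as follows. A stub stands in for paddle.inference.Config so the snippet is self-contained; the method names mirror the real API, but this is an illustrative sketch, not the actual implementation:

```python
class StubConfig:
    """Records method calls; stands in for paddle.inference.Config."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any config method just records its name and arguments.
        return lambda *args: self.calls.append((name,) + args)


def configure_mkldnn(config, enable_mkldnn, precision="fp32"):
    # Simplified sketch of the flow in tools/infer/utility.py.
    if enable_mkldnn:
        config.set_mkldnn_cache_capacity(10)  # bound the kernel/shape cache
        config.enable_mkldnn()
        if precision == "fp16":
            # With MKL-DNN, "fp16" selects BFloat16 kernels
            config.enable_mkldnn_bfloat16()
    elif enable_mkldnn is False:
        config.disable_mkldnn()  # explicit opt-out


config = StubConfig()
configure_mkldnn(config, enable_mkldnn=True, precision="fp16")
```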
The cache capacity setting is critical for preventing memory leaks:
This caches optimized kernels for up to 10 different input shapes. Without this limit, the cache could grow indefinitely as new shapes are encountered, leading to memory exhaustion in long-running inference scenarios.
Sources: tools/infer/utility.py389-391
The number of CPU threads directly impacts parallel computation performance:
Configuration in Code:
The thread count is configured in tools/infer/utility.py398-402:
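A minimal sketch of that call, with a stub standing in for paddle.inference.Config so the snippet is self-contained:

```python
class StubConfig:
    # Minimal stand-in for paddle.inference.Config
    def set_cpu_math_library_num_threads(self, n):
        self.cpu_threads = n


config = StubConfig()
cpu_threads = 10  # the documented default
config.set_cpu_math_library_num_threads(cpu_threads)
```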
Thread Count Guidelines:
Sources: tools/infer/utility.py398-402
For C++ inference, CPU optimizations are configured at compile time through CMake options:
CMakeLists.txt Configuration:
The primary CPU optimization options are defined in deploy/cpp_infer/CMakeLists.txt9:
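A hedged sketch of what such option declarations look like; the descriptions here are paraphrased, and the authoritative definitions live in deploy/cpp_infer/CMakeLists.txt:

```cmake
# Sketch only: option names from this page, descriptions paraphrased
option(WITH_MKL "Link against Intel MKL (OFF links OpenBLAS instead)" ON)
option(WITH_GPU "Build with GPU support" OFF)
```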
Compiler Flags for CPU:
Linux/Unix compiler flags are set in deploy/cpp_infer/CMakeLists.txt89-92, and Windows flags in deploy/cpp_infer/CMakeLists.txt72-78.
Sources: deploy/cpp_infer/CMakeLists.txt1-244
The choice between Intel MKL and OpenBLAS affects performance characteristics:
| Aspect | Intel MKL | OpenBLAS |
|---|---|---|
| Performance on Intel CPUs | Highly optimized | Good general performance |
| Performance on AMD CPUs | May underperform | Better cross-vendor support |
| License | Intel Simplified Software License | BSD License |
| Binary Size | Larger | Smaller |
| Tuning | Auto-tuned for Intel architectures | Generic optimizations |
The MKL library is linked in deploy/cpp_infer/CMakeLists.txt136-158 when WITH_MKL=ON.
Sources: deploy/cpp_infer/CMakeLists.txt136-158
The provided build script, deploy/cpp_infer/tools/build.sh1-23, demonstrates a CPU-only configuration.
Sources: deploy/cpp_infer/tools/build.sh1-23
Key Optimization Points:
- config.disable_gpu() ensures the CPU execution path is used

Sources: tools/infer/utility.py386-435
Example 1: Text Detection with MKL-DNN
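A hedged invocation sketch; the model and image directories are placeholders, and the flags are those documented above:

```bash
python3 tools/infer/predict_det.py \
    --det_model_dir="./inference/det_db/" \
    --image_dir="./doc/imgs/" \
    --use_gpu=False \
    --enable_mkldnn=True \
    --cpu_threads=8
```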
Example 2: Text Recognition with Custom Thread Count
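A hedged invocation sketch with a custom thread count; paths are placeholders:

```bash
python3 tools/infer/predict_rec.py \
    --rec_model_dir="./inference/rec_crnn/" \
    --image_dir="./doc/imgs_words/" \
    --use_gpu=False \
    --enable_mkldnn=True \
    --cpu_threads=4
```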
Sources: tools/infer/predict_det.py36-150 tools/infer/predict_rec.py39-203
Example: CPU-Only Build with MKL
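A hedged sketch of such a build, mirroring the flow of deploy/cpp_infer/tools/build.sh; the Paddle inference library path is a placeholder:

```bash
mkdir -p build && cd build
cmake .. \
    -DPADDLE_LIB=/path/to/paddle_inference \
    -DWITH_MKL=ON \
    -DWITH_GPU=OFF
make -j"$(nproc)"
```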
Sources: deploy/cpp_infer/tools/build.sh1-23
Decision Matrix:
| Scenario | Recommended cpu_threads | Rationale |
|---|---|---|
| Single image inference on 4-core CPU | 4 | Match physical core count |
| Batch inference on 8-core CPU | 8 | Maximize parallelism |
| Multi-process deployment on 16-core CPU | 4-8 per process | Leave cores for process scheduling |
| Real-time inference on embedded CPU | 2-4 | Balance latency vs power |
Thread Scaling Behavior:
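Beyond the physical core count, extra threads add scheduling overhead rather than throughput. A small helper illustrating that clamping rule (an illustrative heuristic, not PaddleOCR code):

```python
import os


def effective_threads(requested):
    # Threads beyond the available core count add scheduling overhead
    # rather than throughput, so clamp the request to the core count.
    available = os.cpu_count() or 1
    return min(requested, available)
```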
Sources: tools/infer/utility.py398-402
When to Enable MKL-DNN:
✓ Recommended:

- Intel CPUs, where the oneDNN kernels are specifically tuned
- Long-running services with a stable, bounded set of input shapes

✗ Not Recommended:

- AMD CPUs, where MKL-DNN may underperform (see the MKL vs OpenBLAS comparison above)
- Workloads with highly dynamic input shapes, which churn the kernel cache
Precision Selection:
| Precision | MKL-DNN Config | Performance | Accuracy | Use Case |
|---|---|---|---|---|
| FP32 | enable_mkldnn() | Baseline | Full precision | Default |
| BF16 | enable_mkldnn_bfloat16() | ~2x faster | Minimal loss | Intel CPUs with BF16 support |
Sources: tools/infer/utility.py387-396
The memory optimizer enabled in tools/infer/utility.py410 (config.enable_memory_optim()) reuses tensor memory across operators, reducing peak memory consumption during inference.
Best Practice: Always enable unless debugging memory-related issues.
Sources: tools/infer/utility.py410
Certain IR optimization passes are incompatible with specific models or CPU execution modes. PaddleOCR explicitly removes problematic passes:
Pass Deletion Rationale:
- conv_transpose_eltwiseadd_bn_fuse_pass: may cause numerical issues on CPU
- matmul_transpose_reshape_fuse_pass: incompatible with dynamic shapes
- fc_fuse_pass: causes errors in table recognition models

Sources: tools/infer/utility.py412-421
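The deletions reduce to config.delete_pass calls. A self-contained sketch, with a stub in place of paddle.inference.Config; note that the real code applies some deletions only for specific models:

```python
class StubConfig:
    # Minimal stand-in for paddle.inference.Config
    def __init__(self):
        self.deleted = []

    def delete_pass(self, name):
        # Remove an IR optimization pass from the predictor's pipeline
        self.deleted.append(name)


PROBLEM_PASSES = [
    "conv_transpose_eltwiseadd_bn_fuse_pass",
    "matmul_transpose_reshape_fuse_pass",
    "fc_fuse_pass",
]

config = StubConfig()
for p in PROBLEM_PASSES:
    config.delete_pass(p)
```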
Modern PaddlePaddle versions support an improved IR (Intermediate Representation) and executor:
Benefits:
Compatibility: Feature-gated to ensure compatibility with older PaddlePaddle versions.
Sources: tools/infer/utility.py404-407
Issue 1: Low CPU utilization
Symptom: CPU usage stays below 50% during inference
Solution: Increase cpu_threads parameter to match CPU core count:
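For example (image path is a placeholder; nproc reports the logical core count):

```bash
python3 tools/infer/predict_system.py \
    --image_dir="./doc/imgs/" \
    --use_gpu=False \
    --cpu_threads="$(nproc)"
```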
Issue 2: Memory leak during continuous inference
Symptom: Memory usage grows over time with varying input shapes
Cause: MKL-DNN cache grows unbounded
Solution: The MKL-DNN cache capacity is automatically set to 10 in tools/infer/utility.py390. If issues persist, consider disabling MKL-DNN for dynamic-shape workloads.
Issue 3: Slower performance on AMD CPUs
Symptom: Inference is slower than expected on AMD processors
Solution: Disable MKL-DNN and use OpenBLAS:
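For example (flags as documented above; the image path is a placeholder):

```bash
python3 tools/infer/predict_system.py \
    --image_dir="./doc/imgs/" \
    --enable_mkldnn=False
```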
For C++ builds:
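A hedged sketch using the CMake options described earlier: switch the MKL option off and rebuild, which links OpenBLAS instead.

```bash
cmake .. -DWITH_MKL=OFF -DWITH_GPU=OFF
make -j"$(nproc)"
```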
Issue 4: First inference much slower than subsequent runs
Symptom: Initial prediction takes 2-10x longer than later predictions
Cause: MKL-DNN kernel compilation and optimization on first run
Solution: Use warmup iterations in tools/infer/predict_system.py202-205:
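A hedged warmup sketch: the real code in predict_system.py runs dummy images through the initialized OCR system; here predict_fn stands in for that call so the snippet is self-contained:

```python
import numpy as np


def warmup(predict_fn, iterations=2, shape=(640, 640, 3)):
    # Push a few dummy images through the pipeline so MKL-DNN compiles
    # and caches its kernels before the first real request arrives.
    dummy = np.random.uniform(0, 255, shape).astype(np.uint8)
    for _ in range(iterations):
        predict_fn(dummy)


shapes_seen = []
warmup(lambda img: shapes_seen.append(img.shape))
```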
Sources: tools/infer/utility.py389-391 tools/infer/predict_system.py202-205
CPU optimization in PaddleOCR provides substantial performance improvements through three primary mechanisms:
Intel MKL-DNN/oneDNN: Highly optimized kernels for Intel CPUs, controlled by enable_mkldnn flag with cache capacity management to prevent memory leaks
Threading Configuration: Adjustable CPU thread count via cpu_threads parameter (default 10), allowing optimization for different hardware and workload characteristics
Build-time Optimizations: CMake options (WITH_MKL, WITH_GPU=OFF) and compiler flags (-O3, OpenMP) for C++ deployment
Key Recommendations:
- Set cpu_threads to the physical core count for optimal throughput

Sources: tools/infer/utility.py177-435 deploy/cpp_infer/CMakeLists.txt1-244