High-Performance Inference (HPI) in PaddleOCR refers to a set of optimization techniques and configurations that significantly accelerate model inference while maintaining accuracy. This includes hardware-accelerated inference backends (TensorRT, MKLDNN), reduced precision computation (FP16, INT8), and optimized model formats (ONNX). This document covers the configuration and usage of these optimization techniques.
For information about basic Python inference workflows, see page 5.1 (Python Inference System). For C++ deployment with Paddle Inference API, see page 5.3 (C++ Inference and Build System). For production service deployment patterns, see page 5.4 (Service Deployment). AMP training configuration (to produce FP16-ready models) is covered in page 4.4 (Training Loop and Optimization).
PaddleOCR supports two primary inference modes that can be selected based on performance requirements:
| Mode | Description | Use Case | Configuration |
|---|---|---|---|
| Standard Mode | FP32 precision with basic optimizations | Accuracy-critical applications, baseline performance | Default configuration |
| High-Performance Mode | Accelerated backends with reduced precision | Speed-critical applications, production deployment | use_tensorrt=True, precision="fp16", enable_mkldnn=True |
The high-performance mode can achieve 2-4x speedup for GPU inference and 1.5-2x speedup for CPU inference compared to standard mode.
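As a sketch, the two modes can be expressed as flag presets. The helper below is hypothetical and not part of PaddleOCR; only the flag names come from tools/infer/utility.py:

```python
# Hypothetical helper mirroring the mode table above. The flag names
# (use_tensorrt, precision, enable_mkldnn) match tools/infer/utility.py;
# the chooser itself is illustrative only.
def inference_flags(mode: str) -> dict:
    flags = {"use_tensorrt": False, "precision": "fp32", "enable_mkldnn": False}
    if mode == "high_performance":
        flags.update(use_tensorrt=True, precision="fp16", enable_mkldnn=True)
    elif mode != "standard":
        raise ValueError(f"unknown inference mode: {mode!r}")
    return flags
```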
Sources: tools/infer/utility.py177-436
TensorRT is NVIDIA's high-performance deep learning inference optimizer and runtime. PaddleOCR integrates TensorRT to achieve significant GPU inference speedup through kernel fusion, precision calibration, and dynamic tensor memory management.
The TensorRT backend is configured through the create_predictor function with the following key parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_tensorrt | bool | False | Enable TensorRT engine |
| precision | str | "fp32" | Precision mode: "fp32", "fp16", "int8" |
| min_subgraph_size | int | 15 | Minimum operators in TensorRT subgraph |
| max_batch_size | int | 10 | Maximum batch size for optimization |
| gpu_mem | int | 500 | GPU memory allocation in MB |
Sources: tools/infer/utility.py282-346 tools/infer/utility.py438-514
For models with variable input dimensions (common in OCR detection and recognition), TensorRT requires dynamic shape configuration. This is specified in the inference.yml file.
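An illustrative entry might look like the following. The exact schema varies by PaddleOCR/PaddleX version, and the input name `x` and the nesting shown here are assumptions, not a verified layout:

```yaml
# Illustrative sketch only -- check your model's inference.yml for the real schema
Hpi:
  backend_configs:
    paddle_infer:
      trt_dynamic_shapes:
        x:
        - [1, 3, 32, 32]      # minimum shape
        - [1, 3, 736, 736]    # optimal shape
        - [8, 3, 1280, 1280]  # maximum shape
```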
The TensorRT conversion function validates and processes these shapes before applying them to the inference configuration.
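As an illustrative sketch, the validation amounts to checking that each input has min/optimal/max shapes of equal rank with min ≤ opt ≤ max per dimension. The helper below is hypothetical, not PaddleOCR's actual function:

```python
# Hypothetical sketch of dynamic-shape validation: each input maps to a
# (min, opt, max) triple; ranks must match and each dimension must satisfy
# min <= opt <= max.
def validate_dynamic_shapes(shapes: dict) -> None:
    for name, (mn, opt, mx) in shapes.items():
        if not (len(mn) == len(opt) == len(mx)):
            raise ValueError(f"{name}: min/opt/max shapes must have the same rank")
        for lo, mid, hi in zip(mn, opt, mx):
            if not (lo <= mid <= hi):
                raise ValueError(f"{name}: expected min <= opt <= max per dimension")

# A detection-style input with height/width ranges passes validation:
validate_dynamic_shapes({"x": ([1, 3, 32, 32], [1, 3, 736, 736], [8, 3, 1280, 1280])})
```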
TensorRT supports multiple precision modes that trade accuracy for speed:
| Precision | Relative Speed | Accuracy Impact | Hardware Requirement |
|---|---|---|---|
| FP32 | 1.0x (baseline) | None | Any NVIDIA GPU |
| FP16 | 2-3x | Minimal (<0.5%) | Compute Capability ≥ 7.0 |
| INT8 | 3-4x | Low (1-2%) | Compute Capability ≥ 7.0 + calibration |
Implementation: tools/infer/utility.py265-273
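A hypothetical helper reflecting the hardware requirements in the table above (the function and its name are illustrative, not part of PaddleOCR):

```python
# Hypothetical: which TensorRT precision modes a GPU can use, keyed off CUDA
# compute capability, per the table above. INT8 additionally needs calibration.
def supported_trt_precisions(compute_capability: float) -> list:
    modes = ["fp32"]  # FP32 works on any NVIDIA GPU
    if compute_capability >= 7.0:
        modes += ["fp16", "int8"]
    return modes
```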
TensorRT acceleration for text detection is enabled by passing the corresponding flags to the detection predictor.
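A hedged example invocation (model and image paths are placeholders; the flag names match tools/infer/utility.py):

```bash
# Illustrative: run the DB text detector with TensorRT in FP16
python3 tools/infer/predict_det.py \
    --image_dir ./doc/imgs/ \
    --det_model_dir ./inference/det_db/ \
    --use_gpu True \
    --use_tensorrt True \
    --precision fp16
```

On the first run, TensorRT builds optimized engines for the observed shapes, so initial latency is higher than subsequent runs.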
Sources: tools/infer/utility.py52-53 tools/infer/utility.py55 tools/infer/utility.py72
MKLDNN (Intel Math Kernel Library for Deep Neural Networks), now called oneDNN, provides optimized primitive operations for CPU inference. It includes vectorized operations, cache optimization, and automatic thread management.
Sources: tools/infer/utility.py386-402
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_mkldnn | bool | None | Enable MKLDNN acceleration (None = auto-detect) |
| cpu_threads | int | 10 | Number of CPU threads for inference |
| precision | str | "fp32" | Precision mode: "fp32" or "fp16" (BFloat16) |
MKLDNN uses a cache to store optimized operator kernels for different input shapes. The cache capacity is set to 10 to prevent unbounded memory growth when processing variable-sized inputs (common in OCR).
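A fragment showing the corresponding Paddle Inference calls (assumes PaddlePaddle is installed and `config` is a `paddle.inference.Config`; not runnable standalone):

```python
# Fragment: CPU setup mirroring the MKLDNN parameters above
config.disable_gpu()
config.set_cpu_math_library_num_threads(10)  # cpu_threads
config.enable_mkldnn()
config.set_mkldnn_cache_capacity(10)         # bound the shape-keyed kernel cache
```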
Implementation: tools/infer/utility.py389-391
Sources: tools/infer/utility.py133-134
ONNX Runtime provides cross-platform inference with support for multiple execution providers. PaddleOCR models can be exported to ONNX format for deployment scenarios requiring non-Paddle backends.
| Aspect | ONNX Runtime | Paddle Inference |
|---|---|---|
| Performance | Good (with optimization) | Excellent (native) |
| Platform Support | Broader (Windows/Linux/Mac) | Linux/Windows focus |
| Hardware Support | CPU, CUDA, DirectML, CoreML | CPU, CUDA, XPU, NPU, MLU |
| Model Format | .onnx | .pdmodel/.pdiparams |
| Optimization | Limited to ONNX ops | Full Paddle optimization |
Sources: tools/infer/utility.py200-238
| Parameter | Type | Description |
|---|---|---|
| use_onnx | bool | Enable ONNX Runtime backend |
| onnx_providers | list | Execution providers (e.g., ["CUDAExecutionProvider"]) |
| onnx_sess_options | list | Session options for ONNX Runtime |
PaddleOCR models must be exported to ONNX format before they can be run with ONNX Runtime.
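For example, using the paddle2onnx converter (paths are illustrative):

```bash
# Illustrative: convert an exported detection model to ONNX
paddle2onnx --model_dir ./inference/det_db \
    --model_filename inference.pdmodel \
    --params_filename inference.pdiparams \
    --save_file ./inference/det_db.onnx \
    --opset_version 11
```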
The ONNX inference path bypasses Paddle Inference entirely: the Paddle predictor is replaced by an ONNX Runtime inference session, configured with the execution providers and session options listed above.
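A minimal fragment of that path (assumes onnxruntime is installed and `input_blob` holds a preprocessed image batch; the input name `"x"` is model-specific):

```python
# Fragment: ONNX Runtime session in place of a Paddle predictor.
# providers / sess_options correspond to onnx_providers / onnx_sess_options.
import onnxruntime as ort

sess = ort.InferenceSession(
    "inference.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
outputs = sess.run(None, {"x": input_blob})
```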
Sources: tools/infer/utility.py200-238 tools/infer/predict_det.py260-263
The following table summarizes all high-performance inference configuration parameters available in PaddleOCR:
| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| use_gpu | bool | True | Enable GPU acceleration | tools/infer/utility.py41 |
| gpu_id | int | 0 | GPU device ID | tools/infer/utility.py57 |
| gpu_mem | int | 500 | GPU memory allocation (MB) | tools/infer/utility.py56 |
| ir_optim | bool | True | Enable inference IR optimization | tools/infer/utility.py52 |
| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| use_tensorrt | bool | False | Enable TensorRT engine | tools/infer/utility.py53 |
| precision | str | "fp32" | Precision mode (fp32/fp16/int8) | tools/infer/utility.py55 |
| min_subgraph_size | int | 15 | Min operators in TensorRT subgraph | tools/infer/utility.py54 |
| max_batch_size | int | 10 | Maximum batch size for TensorRT | tools/infer/utility.py72 |
| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| enable_mkldnn | bool | None | Enable MKLDNN/oneDNN | tools/infer/utility.py133 |
| cpu_threads | int | 10 | Number of CPU threads | tools/infer/utility.py134 |
| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| use_onnx | bool | False | Enable ONNX Runtime backend | tools/infer/utility.py157 |
| onnx_providers | list | False | ONNX execution providers | tools/infer/utility.py158 |
| onnx_sess_options | list | False | ONNX session options | tools/infer/utility.py159 |
| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| use_xpu | bool | False | Enable Kunlunxin XPU | tools/infer/utility.py42 |
| use_npu | bool | False | Enable Ascend NPU | tools/infer/utility.py43 |
| use_mlu | bool | False | Enable Cambricon MLU | tools/infer/utility.py44 |
| use_metax_gpu | bool | False | Enable MetaX GPU | tools/infer/utility.py45 |
| use_gcu | bool | False | Enable Enflame GCU | tools/infer/utility.py46-51 |
Sources: tools/infer/utility.py38-159
The create_predictor function in utility.py constructs an inference configuration object that encapsulates all optimization settings.
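A fragment approximating what it assembles for the high-performance GPU path (assumes a PaddlePaddle GPU build and exported model files; parameter values mirror the defaults documented above):

```python
# Fragment: roughly the GPU + TensorRT configuration create_predictor builds
from paddle.inference import Config, PrecisionType, create_predictor

config = Config("inference.pdmodel", "inference.pdiparams")
config.enable_use_gpu(500, 0)  # gpu_mem in MB, gpu_id
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=10,
    min_subgraph_size=15,
    precision_mode=PrecisionType.Half,  # precision="fp16"
    use_static=False,
    use_calib_mode=False,
)
config.enable_memory_optim()
predictor = create_predictor(config)
```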
Several inference passes are explicitly deleted to ensure compatibility with specific OCR models:
| Pass | Reason for Deletion | Affected Models |
|---|---|---|
| conv_transpose_eltwiseadd_bn_fuse_pass | Causes numerical instability | Most detection models |
| matmul_transpose_reshape_fuse_pass | Incompatible with dynamic shapes | All models |
| gpu_cpu_map_matmul_v2_to_matmul_pass | Breaks SRN architecture | SRN recognition model |
| simplify_with_basic_ops_pass | Incompatible with RE module | Relation extraction models |
| fc_fuse_pass | Not supported for table models | Table recognition models |
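Passes are removed through the inference config's delete_pass method; a fragment (assumes `config` is a `paddle.inference.Config`):

```python
# Fragment: drop fusion passes that are incompatible with specific OCR models
config.delete_pass("fc_fuse_pass")
config.delete_pass("matmul_transpose_reshape_fuse_pass")
```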
Implementation: tools/infer/utility.py409-421
Sources: tools/infer/utility.py263-435
The following guidance describes the typical relative impact of each optimization. Actual numbers vary by hardware and model.
| Configuration | Typical Speedup vs. FP32 Baseline | Notes |
|---|---|---|
| TensorRT FP32 | ~1.1–1.3x | Graph optimization only |
| TensorRT FP16 | ~2–4x | Recognition models benefit most |
| TensorRT INT8 | ~3–5x | Requires calibration data |
| Configuration | Typical Speedup vs. Baseline | Notes |
|---|---|---|
| MKLDNN enabled | ~1.5–4x | Larger gains on smaller (mobile) models |
| MKLDNN + BFloat16 | ~2–5x | Requires Intel CPU with BF16 support |
| Increased cpu_threads | ~1.2–2x | Diminishing returns beyond physical cores |
MKLDNN speedup is typically larger for mobile/lightweight models (e.g., PP-OCRv5_mobile_rec) than for server-size models due to better vectorization fit.
PaddleOCR supports multiple hardware accelerators beyond NVIDIA GPUs through Paddle's custom device mechanism.
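A fragment illustrating the device-selection calls (assumes a PaddlePaddle build with the matching device plugin and `config` as a `paddle.inference.Config`):

```python
# Fragment: selecting a non-NVIDIA accelerator on the inference config
config.enable_xpu()                    # Kunlunxin XPU
# or, via the custom-device mechanism:
config.enable_custom_device("npu", 0)  # Ascend NPU, device id 0
```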
Enflame GCU requires additional pass configuration for optimal performance:
Key Steps:
```python
import paddle_custom_device.gcu.passes as gcu_passes

gcu_passes.setUp()
gcu_passes.set_exp_enable_mixed_precision_ops(config)
```

Implementation: tools/infer/utility.py368-384
Sources: tools/infer/utility.py347-384
| Scenario | Recommended Configuration | Expected Performance |
|---|---|---|
| Production GPU (High Accuracy) | use_gpu=True, use_tensorrt=True, precision="fp16" | 2-3x speedup with <0.5% accuracy loss |
| Production GPU (Max Speed) | use_gpu=True, use_tensorrt=True, precision="int8" | 3-4x speedup with 1-2% accuracy loss |
| Production CPU | use_gpu=False, enable_mkldnn=True, cpu_threads=10 | 1.5-2x speedup on mobile models |
| Cross-Platform Deployment | use_onnx=True, onnx_providers=["CUDAExecutionProvider"] | Portable but slightly slower than native |
| Development/Debug | Default settings (FP32, no optimizations) | Baseline performance, easiest debugging |
- Ensure trt_dynamic_shapes in inference.yml covers your actual input size range
- Reduce gpu_mem or max_batch_size if encountering OOM errors

Sources: tools/infer/utility.py177-436 docs/version3.x/pipeline_usage/OCR.md132-201