High-Performance Inference (HPI) in PaddleOCR refers to a set of optimization techniques and configurations that significantly accelerate model inference while maintaining accuracy. This includes hardware-accelerated inference backends (TensorRT, MKLDNN), reduced precision computation (FP16, INT8), and optimized model formats (ONNX). This document covers the configuration and usage of these optimization techniques.
For information about basic Python inference workflows, see page 5.1 (Python Inference System). For C++ deployment with Paddle Inference API, see page 5.3 (C++ Inference and Build System). For production service deployment patterns, see page 5.4 (Service Deployment). AMP training configuration (to produce FP16-ready models) is covered in page 4.4 (Training Loop and Optimization).
PaddleOCR supports two primary inference modes that can be selected based on performance requirements:
| Mode | Description | Use Case | Configuration |
|---|---|---|---|
| Standard Mode | FP32 precision with basic optimizations | Accuracy-critical applications, baseline performance | Default configuration |
| High-Performance Mode | Accelerated backends with reduced precision | Speed-critical applications, production deployment | use_tensorrt=True, precision="fp16", enable_mkldnn=True |
The high-performance mode can achieve 2-4x speedup for GPU inference and 1.5-2x speedup for CPU inference compared to standard mode.
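As a sketch, the two modes can be expressed as flag presets. The helper below is hypothetical and not part of PaddleOCR; only the flag names come from tools/infer/utility.py:

```python
# Hypothetical helper mirroring the mode table above. The flag names
# (use_tensorrt, precision, enable_mkldnn) match tools/infer/utility.py;
# the chooser itself is illustrative only.
def inference_flags(mode: str) -> dict:
    flags = {"use_tensorrt": False, "precision": "fp32", "enable_mkldnn": False}
    if mode == "high_performance":
        flags.update(use_tensorrt=True, precision="fp16", enable_mkldnn=True)
    elif mode != "standard":
        raise ValueError(f"unknown inference mode: {mode!r}")
    return flags
```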
Sources: tools/infer/utility.py177-436
TensorRT is NVIDIA's high-performance deep learning inference optimizer and runtime. PaddleOCR integrates TensorRT to achieve significant GPU inference speedup through kernel fusion, precision calibration, and dynamic tensor memory management.
The TensorRT backend is configured through the create_predictor function with the following key parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_tensorrt | bool | False | Enable TensorRT engine |
| precision | str | "fp32" | Precision mode: "fp32", "fp16", "int8" |
| min_subgraph_size | int | 15 | Minimum operators in TensorRT subgraph |
| max_batch_size | int | 10 | Maximum batch size for optimization |
| gpu_mem | int | 500 | GPU memory allocation in MB |
Sources: tools/infer/utility.py282-346 tools/infer/utility.py438-514
For models with variable input dimensions (common in OCR detection and recognition), TensorRT requires dynamic shape configuration. This is specified in the inference.yml file.
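An illustrative entry might look like the following. The exact schema varies by PaddleOCR/PaddleX version, and the input name `x` and the nesting shown here are assumptions, not a verified layout:

```yaml
# Illustrative sketch only -- check your model's inference.yml for the real schema
Hpi:
  backend_configs:
    paddle_infer:
      trt_dynamic_shapes:
        x:
        - [1, 3, 32, 32]      # minimum shape
        - [1, 3, 736, 736]    # optimal shape
        - [8, 3, 1280, 1280]  # maximum shape
```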
The TensorRT conversion function validates and processes these shapes before applying them to the inference configuration.
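As an illustrative sketch, the validation amounts to checking that each input has min/optimal/max shapes of equal rank with min ≤ opt ≤ max per dimension. The helper below is hypothetical, not PaddleOCR's actual function:

```python
# Hypothetical sketch of dynamic-shape validation: each input maps to a
# (min, opt, max) triple; ranks must match and each dimension must satisfy
# min <= opt <= max.
def validate_dynamic_shapes(shapes: dict) -> None:
    for name, (mn, opt, mx) in shapes.items():
        if not (len(mn) == len(opt) == len(mx)):
            raise ValueError(f"{name}: min/opt/max shapes must have the same rank")
        for lo, mid, hi in zip(mn, opt, mx):
            if not (lo <= mid <= hi):
                raise ValueError(f"{name}: expected min <= opt <= max per dimension")

# A detection-style input with height/width ranges passes validation:
validate_dynamic_shapes({"x": ([1, 3, 32, 32], [1, 3, 736, 736], [8, 3, 1280, 1280])})
```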
TensorRT supports multiple precision modes that trade accuracy for speed:
| Precision | Relative Speed | Accuracy Impact | Hardware Requirement |
|---|---|---|---|
| FP32 | 1.0x (baseline) | None | Any NVIDIA GPU |
| FP16 | 2-3x | Minimal (<0.5%) | Compute Capability ≥ 7.0 |
| INT8 | 3-4x | Low (1-2%) | Compute Capability ≥ 7.0 + calibration |
Implementation: tools/infer/utility.py265-273
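A hypothetical helper reflecting the hardware requirements in the table above (the function and its name are illustrative, not part of PaddleOCR):

```python
# Hypothetical: which TensorRT precision modes a GPU can use, keyed off CUDA
# compute capability, per the table above. INT8 additionally needs calibration.
def supported_trt_precisions(compute_capability: float) -> list:
    modes = ["fp32"]  # FP32 works on any NVIDIA GPU
    if compute_capability >= 7.0:
        modes += ["fp16", "int8"]
    return modes
```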
TensorRT acceleration for text detection is enabled by passing the corresponding flags to the detection predictor.
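A hedged example invocation (model and image paths are placeholders; the flag names match tools/infer/utility.py):

```bash
# Illustrative: run the DB text detector with TensorRT in FP16
python3 tools/infer/predict_det.py \
    --image_dir ./doc/imgs/ \
    --det_model_dir ./inference/det_db/ \
    --use_gpu True \
    --use_tensorrt True \
    --precision fp16
```

On the first run, TensorRT builds optimized engines for the observed shapes, so initial latency is higher than subsequent runs.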
Sources: tools/infer/utility.py52-53 tools/infer/utility.py55 tools/infer/utility.py72
MKLDNN (Intel Math Kernel Library for Deep Neural Networks), now called oneDNN, provides optimized primitive operations for CPU inference. It includes vectorized operations, cache optimization, and automatic thread management.
Sources: tools/infer/utility.py386-402
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_mkldnn | bool | None | Enable MKLDNN acceleration (None = auto-detect) |
| cpu_threads | int | 10 | Number of CPU threads for inference |
| precision | str | "fp32" | Precision mode: "fp32" or "fp16" (BFloat16) |
MKLDNN uses a cache to store optimized operator kernels for different input shapes. The cache capacity is set to 10 to prevent unbounded memory growth when processing variable-sized inputs (common in OCR).
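A fragment showing the corresponding Paddle Inference calls (assumes PaddlePaddle is installed and `config` is a `paddle.inference.Config`; not runnable standalone):

```python
# Fragment: CPU setup mirroring the MKLDNN parameters above
config.disable_gpu()
config.set_cpu_math_library_num_threads(10)  # cpu_threads
config.enable_mkldnn()
config.set_mkldnn_cache_capacity(10)         # bound the shape-keyed kernel cache
```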
Implementation: tools/infer/utility.py389-391
Sources: tools/infer/utility.py133-134
ONNX Runtime provides cross-platform inference with support for multiple execution providers. PaddleOCR models can be exported to ONNX format for deployment scenarios requiring non-Paddle backends.
| Aspect | ONNX Runtime | Paddle Inference |
|---|---|---|
| Performance | Good (with optimization) | Excellent (native) |
| Platform Support | Broader (Windows/Linux/Mac) | Linux/Windows focus |
| Hardware Support | CPU, CUDA, DirectML, CoreML | CPU, CUDA, XPU, NPU, MLU |
| Model Format | .onnx | .pdmodel/.pdiparams |
| Optimization | Limited to ONNX ops | Full Paddle optimization |
Sources: tools/infer/utility.py200-238
| Parameter | Type | Description |
|---|---|---|
| use_onnx | bool | Enable ONNX Runtime backend |
| onnx_providers | list | Execution providers (e.g., ["CUDAExecutionProvider"]) |
| onnx_sess_options | list | Session options for ONNX Runtime |
PaddleOCR models must be exported to ONNX format before they can be run with ONNX Runtime.
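For example, using the paddle2onnx converter (paths are illustrative):

```bash
# Illustrative: convert an exported detection model to ONNX
paddle2onnx --model_dir ./inference/det_db \
    --model_filename inference.pdmodel \
    --params_filename inference.pdiparams \
    --save_file ./inference/det_db.onnx \
    --opset_version 11
```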
The ONNX inference path bypasses Paddle Inference entirely: the Paddle predictor is replaced by an ONNX Runtime inference session, configured with the execution providers and session options listed above.
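A minimal fragment of that path (assumes onnxruntime is installed and `input_blob` holds a preprocessed image batch; the input name `"x"` is model-specific):

```python
# Fragment: ONNX Runtime session in place of a Paddle predictor.
# providers / sess_options correspond to onnx_providers / onnx_sess_options.
import onnxruntime as ort

sess = ort.InferenceSession(
    "inference.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
outputs = sess.run(None, {"x": input_blob})
```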
Sources: tools/infer/utility.py200-238 tools/infer/predict_det.py260-263
The following table summarizes all high-performance inference configuration parameters available in PaddleOCR:
| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| use_gpu | bool | True | Enable GPU acceleration | tools/infer/utility.py41 |
| gpu_id | int | 0 | GPU device ID | tools/infer/utility.py57 |
| gpu_mem | int | 500 | GPU memory allocation (MB) | tools/infer/utility.py56 |
| ir_optim | bool | True | Enable inference IR optimization | tools/infer/utility.py52 |
| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| use_tensorrt | bool | False | Enable TensorRT engine | tools/infer/utility.py53 |
| precision | str | "fp32" | Precision mode (fp32/fp16/int8) | tools/infer/utility.py55 |
| min_subgraph_size | int | 15 | Min operators in TensorRT subgraph | tools/infer/utility.py54 |
| max_batch_size | int | 10 | Maximum batch size for TensorRT | tools/infer/utility.py72 |
| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| enable_mkldnn | bool | None | Enable MKLDNN/oneDNN | tools/infer/utility.py133 |
| cpu_threads | int | 10 | Number of CPU threads | tools/infer/utility.py134 |
| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| use_onnx | bool | False | Enable ONNX Runtime backend | tools/infer/utility.py157 |
| onnx_providers | list | False | ONNX execution providers | tools/infer/utility.py158 |
| onnx_sess_options | list | False | ONNX session options | tools/infer/utility.py159 |
| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| use_xpu | bool | False | Enable Kunlunxin XPU | tools/infer/utility.py42 |
| use_npu | bool | False | Enable Ascend NPU | tools/infer/utility.py43 |
| use_mlu | bool | False | Enable Cambricon MLU | tools/infer/utility.py44 |
| use_metax_gpu | bool | False | Enable MetaX GPU | tools/infer/utility.py45 |
| use_gcu | bool | False | Enable Enflame GCU | tools/infer/utility.py46-51 |
Sources: tools/infer/utility.py38-159
The create_predictor function in utility.py constructs an inference configuration object that encapsulates all optimization settings.
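A fragment approximating what it assembles for the high-performance GPU path (assumes a PaddlePaddle GPU build and exported model files; parameter values mirror the defaults documented above):

```python
# Fragment: roughly the GPU + TensorRT configuration create_predictor builds
from paddle.inference import Config, PrecisionType, create_predictor

config = Config("inference.pdmodel", "inference.pdiparams")
config.enable_use_gpu(500, 0)  # gpu_mem in MB, gpu_id
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=10,
    min_subgraph_size=15,
    precision_mode=PrecisionType.Half,  # precision="fp16"
    use_static=False,
    use_calib_mode=False,
)
config.enable_memory_optim()
predictor = create_predictor(config)
```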
Several inference passes are explicitly deleted to ensure compatibility with specific OCR models:
| Pass | Reason for Deletion | Affected Models |
|---|---|---|
| conv_transpose_eltwiseadd_bn_fuse_pass | Causes numerical instability | Most detection models |
| matmul_transpose_reshape_fuse_pass | Incompatible with dynamic shapes | All models |
| gpu_cpu_map_matmul_v2_to_matmul_pass | Breaks SRN architecture | SRN recognition model |
| simplify_with_basic_ops_pass | Incompatible with RE module | Relation extraction models |
| fc_fuse_pass | Not supported for table models | Table recognition models |
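Passes are removed through the inference config's delete_pass method; a fragment (assumes `config` is a `paddle.inference.Config`):

```python
# Fragment: drop fusion passes that are incompatible with specific OCR models
config.delete_pass("fc_fuse_pass")
config.delete_pass("matmul_transpose_reshape_fuse_pass")
```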
Implementation: tools/infer/utility.py409-421
Sources: tools/infer/utility.py263-435
The following guidance describes the typical relative impact of each optimization. Actual numbers vary by hardware and model.
| Configuration | Typical Speedup vs. FP32 Baseline | Notes |
|---|---|---|
| TensorRT FP32 | ~1.1–1.3x | Graph optimization only |
| TensorRT FP16 | ~2–4x | Recognition models benefit most |
| TensorRT INT8 | ~3–5x | Requires calibration data |
| Configuration | Typical Speedup vs. Baseline | Notes |
|---|---|---|
| MKLDNN enabled | ~1.5–4x | Larger gains on smaller (mobile) models |
| MKLDNN + BFloat16 | ~2–5x | Requires Intel CPU with BF16 support |
| Increased cpu_threads | ~1.2–2x | Diminishing returns beyond physical cores |
MKLDNN speedup is typically larger for mobile/lightweight models (e.g., PP-OCRv5_mobile_rec) than for server-size models due to better vectorization fit.
PaddleOCR supports multiple hardware accelerators beyond NVIDIA GPUs through Paddle's custom device mechanism.
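A fragment illustrating the device-selection calls (assumes a PaddlePaddle build with the matching device plugin and `config` as a `paddle.inference.Config`):

```python
# Fragment: selecting a non-NVIDIA accelerator on the inference config
config.enable_xpu()                    # Kunlunxin XPU
# or, via the custom-device mechanism:
config.enable_custom_device("npu", 0)  # Ascend NPU, device id 0
```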
Enflame GCU requires additional pass configuration for optimal performance:
Key Steps:
```python
import paddle_custom_device.gcu.passes as gcu_passes

gcu_passes.setUp()
gcu_passes.set_exp_enable_mixed_precision_ops(config)
```

Implementation: tools/infer/utility.py368-384
Sources: tools/infer/utility.py347-384
| Scenario | Recommended Configuration | Expected Performance |
|---|---|---|
| Production GPU (High Accuracy) | use_gpu=True, use_tensorrt=True, precision="fp16" | 2-3x speedup with <0.5% accuracy loss |
| Production GPU (Max Speed) | use_gpu=True, use_tensorrt=True, precision="int8" | 3-4x speedup with 1-2% accuracy loss |
| Production CPU | use_gpu=False, enable_mkldnn=True, cpu_threads=10 | 1.5-2x speedup on mobile models |
| Cross-Platform Deployment | use_onnx=True, onnx_providers=["CUDAExecutionProvider"] | Portable but slightly slower than native |
| Development/Debug | Default settings (FP32, no optimizations) | Baseline performance, easiest debugging |
- Ensure trt_dynamic_shapes in inference.yml covers your actual input size range
- Reduce gpu_mem or max_batch_size if encountering OOM errors

Sources: tools/infer/utility.py177-436 docs/version3.x/pipeline_usage/OCR.md132-201