This document describes the deployment and inference capabilities of PaddleOCR, covering how models are loaded, executed, and optimized for production environments. The inference system supports multiple deployment options ranging from simple Python scripts to high-performance C++ executables and containerized services.
Scope: This page focuses on model inference (prediction) after models have been trained and exported. For information about model training, see page 4. For pipeline-level usage patterns, see page 2. For specific hardware platform configurations, see page 7.
PaddleOCR supports several deployment paths, as summarized in deploy/README.md:
| Option | Reference Page | Key Entry Point |
|---|---|---|
| Python Inference | 5.1 | tools/infer/predict_system.py |
| High-Performance Inference (TensorRT, ONNX) | 5.2 | tools/infer/utility.py:create_predictor |
| C++ Inference | 5.3 | deploy/cpp_infer/ |
| Service Deployment (PD Serving) | 5.4 | deploy/pdserving/ |
| Parallel / Multi-Device | 5.5 | tools/infer/predict_system.py:main |
| Mobile / Edge (Paddle-Lite) | 5.6 | deploy/lite/ |
Sources: deploy/README.md:1-31
The Python inference system provides the foundation for running PaddleOCR models through a unified API. The system consists of predictor classes for individual tasks (detection, recognition, classification) and orchestration classes that combine these components into complete pipelines.
Sources: tools/infer/utility.py:177-435, tools/infer/predict_det.py:36-150, tools/infer/predict_rec.py:39-181, tools/infer/predict_system.py:48-61
The create_predictor() function in tools/infer/utility.py:177-435 is the central factory for creating inference engines. Among other setup tasks, it handles locating and loading the model files in both the legacy and PIR formats (model.pdmodel/inference.json and model.pdiparams/inference.pdiparams).

Sources: tools/infer/utility.py:177-435, tools/infer/predict_det.py:145-150
The TextDetector class (tools/infer/predict_det.py:36-221) implements text detection with the following workflow:
| Step | Method | Description |
|---|---|---|
| Preprocessing | preprocess_op | Resize image with DetResizeForTest, normalize, convert to CHW format |
| Inference | predictor.run() | Execute model on GPU/CPU to generate probability maps |
| Postprocessing | postprocess_op | Extract text boxes from probability maps using DBPostProcess |
| Filtering | filter_tag_det_res() | Remove invalid boxes (too small, out of bounds) |
Key parameters controlled via args:

- det_limit_side_len: Maximum image side length (default: 960)
- det_db_thresh: Binarization threshold (default: 0.3)
- det_db_box_thresh: Box confidence threshold (default: 0.6)
- det_db_unclip_ratio: Text region expansion ratio (default: 1.5)

Sources: tools/infer/predict_det.py:36-306, tools/infer/utility.py:54-80
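The filtering step in the workflow table above can be sketched in a few lines. The function name and the 3-pixel minimum side length here are illustrative assumptions mirroring the described behavior (clip boxes to image bounds, drop tiny boxes), not the exact PaddleOCR implementation:

```python
import numpy as np

def filter_det_boxes(dt_boxes, img_height, img_width, min_side=3):
    """Sketch of the detection filtering step: clip each quad's corner
    points to the image bounds, then drop boxes whose shortest side is
    smaller than min_side pixels."""
    kept = []
    for box in dt_boxes:  # box: (4, 2) array of corner points
        box = np.array(box, dtype=np.float32)
        box[:, 0] = np.clip(box[:, 0], 0, img_width - 1)   # x coordinates
        box[:, 1] = np.clip(box[:, 1], 0, img_height - 1)  # y coordinates
        w = np.linalg.norm(box[0] - box[1])                # top edge length
        h = np.linalg.norm(box[0] - box[3])                # left edge length
        if min(w, h) >= min_side:
            kept.append(box)
    return np.array(kept)
```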
The TextRecognizer class (tools/infer/predict_rec.py:39-625) handles text recognition with algorithm-specific preprocessing:
Sources: tools/infer/predict_rec.py:205-625, ppocr/postprocess/rec_postprocess.py
The TextSystem class (tools/infer/predict_system.py:48-158) orchestrates the complete OCR pipeline. Its __call__ method (tools/infer/predict_system.py:76-157) runs the detector, then crops each text region, optionally runs the angle classifier, and finally runs the recognizer. Low-confidence results are filtered by drop_score.
The system applies sorted_boxes() (tools/infer/predict_system.py:160-182) to order detected boxes from top-to-bottom, left-to-right before recognition. Multi-process execution is exposed via --use_mp / --total_process_num flags, which spawn subprocess workers with subprocess.Popen (tools/infer/predict_system.py:310-325).
Sources: tools/infer/predict_system.py:48-158, tools/infer/predict_system.py:160-182, tools/infer/predict_system.py:310-325, tools/infer/utility.py:149-151
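The box-ordering behavior can be sketched as a standalone function. The name sort_text_boxes and the 10-pixel same-line tolerance are stand-ins chosen to mirror the behavior described above:

```python
def sort_text_boxes(dt_boxes, y_tol=10):
    """Order detected boxes top-to-bottom, then left-to-right.
    Boxes whose top-left corners lie within y_tol pixels of each other
    are treated as belonging to the same text line."""
    # Primary sort: top-left corner, y first then x.
    boxes = sorted(dt_boxes, key=lambda b: (b[0][1], b[0][0]))
    for i in range(len(boxes) - 1):
        # Bubble neighbours on the same line back into left-to-right order.
        for j in range(i, -1, -1):
            if abs(boxes[j + 1][0][1] - boxes[j][0][1]) < y_tol and \
               boxes[j + 1][0][0] < boxes[j][0][0]:
                boxes[j], boxes[j + 1] = boxes[j + 1], boxes[j]
            else:
                break
    return boxes
```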
High-performance inference (enable_hpi option) leverages hardware-accelerated backends to reduce latency and increase throughput.
| Backend | Supported Hardware | Precision | Key Configuration |
|---|---|---|---|
| Paddle Inference | CPU, GPU, NPU, XPU, DCU | FP32, FP16 | ir_optim=True |
| TensorRT | NVIDIA GPU | FP32, FP16, INT8 | use_tensorrt=True, precision="fp16" |
| MKL-DNN (oneDNN) | Intel CPU | FP32, BF16 | enable_mkldnn=True |
| ONNX Runtime | CPU, GPU | FP32, FP16 | use_onnx=True |
TensorRT acceleration is enabled via tools/infer/utility.py:282-346. The system supports two modes:
1. Legacy Mode (.pdmodel format):
   - Collects dynamic shape information into {mode}_trt_dynamic_shape.txt
   - Calls config.enable_tensorrt_engine() with workspace and precision settings
   - Applies config.enable_tuned_tensorrt_dynamic_shape() to load the tuned shapes
2. PIR Mode (.json format):
   - Reads model metadata from inference.yml
   - Converts the model with paddle.tensorrt.export.convert()
   - Caches the converted engine in the {model_dir}/.cache/trt/ directory
Sources: tools/infer/utility.py:282-346, tools/infer/utility.py:438-514
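A minimal configuration sketch of the legacy-mode setup using Paddle Inference's Python API. The model paths, GPU memory size, workspace size, and subgraph threshold are illustrative values, not PaddleOCR's exact defaults:

```python
# Sketch: enabling TensorRT in legacy (.pdmodel) mode via Paddle Inference.
# All values below are illustrative; PaddleOCR derives them from CLI args.
from paddle.inference import Config, PrecisionType

config = Config("inference.pdmodel", "inference.pdiparams")
config.enable_use_gpu(500, 0)            # 500 MB initial GPU memory, device 0
config.enable_tensorrt_engine(
    workspace_size=1 << 30,              # 1 GB TensorRT workspace
    max_batch_size=1,
    min_subgraph_size=15,                # skip TRT for tiny subgraphs
    precision_mode=PrecisionType.Half,   # FP16, per precision="fp16"
    use_static=True,                     # cache serialized engines on disk
    use_calib_mode=False,
)
# Load previously collected dynamic shapes, e.g. det_trt_dynamic_shape.txt;
# the resulting config is then passed to paddle.inference.create_predictor().
config.enable_tuned_tensorrt_dynamic_shape("det_trt_dynamic_shape.txt", True)
```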
For CPU inference, MKL-DNN (now called oneDNN) provides significant acceleration. The cache capacity setting bounds the oneDNN kernel cache, preventing unbounded memory growth when processing images of varying sizes. Thread count (cpu_threads) controls parallelism, with a default of 10.
Sources: tools/infer/utility.py:387-402
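The oneDNN settings above map to a small amount of Paddle Inference configuration. This is a sketch with placeholder model paths, not PaddleOCR's exact setup code:

```python
# Sketch: CPU inference with oneDNN (MKL-DNN) enabled via Paddle Inference.
from paddle.inference import Config

config = Config("inference.pdmodel", "inference.pdiparams")
config.disable_gpu()                         # force CPU execution
config.enable_mkldnn()                       # enable oneDNN kernels
config.set_mkldnn_cache_capacity(10)         # bound cache for varying input shapes
config.set_cpu_math_library_num_threads(10)  # matches the cpu_threads default
```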
Inference precision is controlled by the precision argument, which accepts fp32, fp16, or int8.
Sources: tools/infer/utility.py:265-273, tools/infer/utility.py:392-393
ONNX Runtime support is implemented in tools/infer/utility.py:200-238.
Models must be converted to ONNX format first using Paddle2ONNX (see deployment documentation).
Sources: tools/infer/utility.py:200-238
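A conversion command might look like the following; the model directory, output path, and opset version are placeholders:

```shell
# Sketch: converting an exported detection model to ONNX with Paddle2ONNX.
paddle2onnx --model_dir ./inference/det \
    --model_filename inference.pdmodel \
    --params_filename inference.pdiparams \
    --save_file ./inference/det/model.onnx \
    --opset_version 11
```

The resulting .onnx file is then passed to the Python tools with --use_onnx=True.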
C++ deployment provides standalone executables with minimal dependencies for production environments. The implementation is located in deploy/cpp_infer/. See page 5.3 for full details including the CMake build system and C++ API.
The C++ inference workflow mirrors the Python implementation but uses native PaddlePaddle C++ APIs to build a paddle::AnalysisConfig and create predictors.

Key build options (in deploy/cpp_infer/CMakeLists.txt):
| CMake Flag | Purpose |
|---|---|
| WITH_GPU | Enable CUDA acceleration |
| WITH_MKL | Use Intel MKL for CPU BLAS |
| WITH_STATIC_LIB | Static linking for standalone executable |
| USE_TENSORRT | Enable TensorRT optimization |
Sources: deploy/README.md:25-26
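A typical configure-and-build invocation using these flags might look like this; the paths are placeholders, and the PADDLE_LIB variable (pointing at an extracted Paddle Inference C++ package) is an assumption about the build script's inputs:

```shell
# Sketch: building the C++ inference demo with the CMake flags above.
cd deploy/cpp_infer && mkdir -p build && cd build
cmake .. \
    -DWITH_GPU=ON \
    -DWITH_MKL=ON \
    -DWITH_STATIC_LIB=OFF \
    -DUSE_TENSORRT=OFF \
    -DPADDLE_LIB=/path/to/paddle_inference
make -j"$(nproc)"
```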
PaddleOCR provides service-oriented deployment via deploy/pdserving/. Both Python and C++ serving backends are supported, enabling remote HTTP/gRPC inference. See page 5.4 for full setup and usage details.
Deployment path: deploy/pdserving/
The PD Serving integration allows PaddleOCR inference to be exposed as a network service. Clients send images over HTTP/gRPC, and the server returns detection bounding boxes, recognition text, and confidence scores.
Sources: deploy/README.md:23-27
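The client side reduces to payload construction plus an HTTP POST. This sketch shows only the payload construction; the endpoint URL and the key/value field names in the comment are assumptions based on Paddle Serving's pipeline web service convention, not verified against deploy/pdserving/:

```python
import base64
import json

def build_ocr_request(image_bytes):
    """Build the JSON body a serving web client would POST: the raw image
    is base64-encoded and wrapped in a key/value payload (field names are
    an assumption following Paddle Serving's convention)."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps({"key": ["image"], "value": [b64]})

# Usage (assumed endpoint):
# requests.post("http://127.0.0.1:9998/ocr/prediction",
#               data=build_ocr_request(open("test.jpg", "rb").read()))
```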
PaddleOCR supports parallel inference to increase throughput. See page 5.5 for full details.
tools/infer/predict_system.py exposes multi-process execution via CLI flags declared in tools/infer/utility.py:
| Flag | Type | Default | Description |
|---|---|---|---|
| --use_mp | bool | False | Enable multi-process mode |
| --total_process_num | int | 1 | Total number of worker processes |
| --process_id | int | 0 | ID of this process (set automatically) |
When --use_mp=True, the main entry point in predict_system.py spawns total_process_num child processes using subprocess.Popen, each receiving a unique --process_id. Each worker then processes a disjoint slice of the input file list via image_file_list[process_id::total_process_num] (tools/infer/predict_system.py:187).
Sources: tools/infer/predict_system.py:185-188, tools/infer/predict_system.py:310-325, tools/infer/utility.py:149-151
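The slicing scheme is easy to verify in isolation. This standalone sketch mimics how each worker selects its share of the file list, and shows that the shards are disjoint and cover every file:

```python
def partition_files(image_file_list, total_process_num):
    """Replicate the per-worker slicing: worker pid processes
    image_file_list[pid::total_process_num]."""
    return [image_file_list[pid::total_process_num]
            for pid in range(total_process_num)]

files = [f"img_{i}.jpg" for i in range(7)]
shards = partition_files(files, 3)
# Worker 0 gets img_0, img_3, img_6; worker 1 gets img_1, img_4;
# worker 2 gets img_2, img_5 -- disjoint shards covering all inputs.
```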
PaddleOCR supports on-device deployment via Paddle Lite for Android and iOS platforms.
Paddle Lite is a lightweight inference engine optimized for mobile and embedded devices:
| Feature | Mobile Optimization |
|---|---|
| Model Size | Quantization (INT8) reduces size by 4x |
| Inference Speed | ARM NEON and GPU acceleration |
| Memory Usage | Memory reuse and operator fusion |
| Power Efficiency | Dynamic frequency scaling support |
Android deployment workflow: convert the exported inference model with paddle_lite_opt to generate the .nb format used by Paddle Lite, then bundle the optimized model into the Android application.
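A model optimization command might look like the following; the paths and the valid_targets value are placeholders to adjust for your model and device:

```shell
# Sketch: optimizing an exported model into Paddle Lite's .nb format.
paddle_lite_opt \
    --model_file=./inference/det/inference.pdmodel \
    --param_file=./inference/det/inference.pdiparams \
    --optimize_out_type=naive_buffer \
    --optimize_out=./output/det_opt \
    --valid_targets=arm
```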
An Android demo application is available in the repository.

Sources: README.md:183
iOS deployment follows a similar pattern, reusing the same paddle_lite_opt-optimized .nb models with the Paddle Lite iOS library.

Sources: README.md
Performance on edge devices is typically improved through the techniques summarized in the table above: INT8 quantization, ARM NEON / GPU acceleration, and memory reuse via operator fusion. Mobile models are also kept deliberately small; as noted above, INT8 quantization alone reduces model size by roughly 4x.

Sources: README.md:20-21, README.md:183
This deployment and inference documentation covers the full spectrum from Python scripts to production services and mobile applications. For specific hardware configurations, see page 7. For pipeline usage patterns, see page 2.