This document describes the deployment and inference capabilities of PaddleOCR, covering how models are loaded, executed, and optimized for production environments. The inference system supports multiple deployment options ranging from simple Python scripts to high-performance C++ executables and containerized services.
Scope: This page focuses on model inference (prediction) after models have been trained and exported. For information about model training, see page 4. For pipeline-level usage patterns, see page 2. For specific hardware platform configurations, see page 7.
PaddleOCR supports several deployment paths, as summarized in deploy/README.md:
| Option | Reference Page | Key Entry Point |
|---|---|---|
| Python Inference | 5.1 | tools/infer/predict_system.py |
| High-Performance Inference (TensorRT, ONNX) | 5.2 | tools/infer/utility.py:create_predictor |
| C++ Inference | 5.3 | deploy/cpp_infer/ |
| Service Deployment (PD Serving) | 5.4 | deploy/pdserving/ |
| Parallel / Multi-Device | 5.5 | tools/infer/predict_system.py:main |
| Mobile / Edge (Paddle-Lite) | 5.6 | deploy/lite/ |
Sources: deploy/README.md:1-31
The Python inference system provides the foundation for running PaddleOCR models through a unified API. The system consists of predictor classes for individual tasks (detection, recognition, classification) and orchestration classes that combine these components into complete pipelines.
Sources: tools/infer/utility.py:177-435, tools/infer/predict_det.py:36-150, tools/infer/predict_rec.py:39-181, tools/infer/predict_system.py:48-61
The create_predictor() function in tools/infer/utility.py:177-435 is the central factory for creating inference engines. Among other setup tasks, it handles locating and loading the model files in both the legacy and PIR formats (model.pdmodel/inference.json and model.pdiparams/inference.pdiparams).

Sources: tools/infer/utility.py:177-435, tools/infer/predict_det.py:145-150
The TextDetector class (tools/infer/predict_det.py:36-221) implements text detection with the following workflow:
| Step | Method | Description |
|---|---|---|
| Preprocessing | preprocess_op | Resize image with DetResizeForTest, normalize, convert to CHW format |
| Inference | predictor.run() | Execute model on GPU/CPU to generate probability maps |
| Postprocessing | postprocess_op | Extract text boxes from probability maps using DBPostProcess |
| Filtering | filter_tag_det_res() | Remove invalid boxes (too small, out of bounds) |
Key parameters controlled via args:

- det_limit_side_len: Maximum image side length (default: 960)
- det_db_thresh: Binarization threshold (default: 0.3)
- det_db_box_thresh: Box confidence threshold (default: 0.6)
- det_db_unclip_ratio: Text region expansion ratio (default: 1.5)

Sources: tools/infer/predict_det.py:36-306, tools/infer/utility.py:54-80
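The filtering step in the workflow table above can be sketched in a few lines. The function name and the 3-pixel minimum side length here are illustrative assumptions mirroring the described behavior (clip boxes to image bounds, drop tiny boxes), not the exact PaddleOCR implementation:

```python
import numpy as np

def filter_det_boxes(dt_boxes, img_height, img_width, min_side=3):
    """Sketch of the detection filtering step: clip each quad's corner
    points to the image bounds, then drop boxes whose shortest side is
    smaller than min_side pixels."""
    kept = []
    for box in dt_boxes:  # box: (4, 2) array of corner points
        box = np.array(box, dtype=np.float32)
        box[:, 0] = np.clip(box[:, 0], 0, img_width - 1)   # x coordinates
        box[:, 1] = np.clip(box[:, 1], 0, img_height - 1)  # y coordinates
        w = np.linalg.norm(box[0] - box[1])                # top edge length
        h = np.linalg.norm(box[0] - box[3])                # left edge length
        if min(w, h) >= min_side:
            kept.append(box)
    return np.array(kept)
```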
The TextRecognizer class (tools/infer/predict_rec.py:39-625) handles text recognition with algorithm-specific preprocessing:
Sources: tools/infer/predict_rec.py:205-625, ppocr/postprocess/rec_postprocess.py
The TextSystem class (tools/infer/predict_system.py:48-158) orchestrates the complete OCR pipeline. Its __call__ method (tools/infer/predict_system.py:76-157) runs the detector, then crops each text region, optionally runs the angle classifier, and finally runs the recognizer. Low-confidence results are filtered by drop_score.
The system applies sorted_boxes() (tools/infer/predict_system.py:160-182) to order detected boxes from top-to-bottom, left-to-right before recognition. Multi-process execution is exposed via --use_mp / --total_process_num flags, which spawn subprocess workers with subprocess.Popen (tools/infer/predict_system.py:310-325).
Sources: tools/infer/predict_system.py:48-158, tools/infer/predict_system.py:160-182, tools/infer/predict_system.py:310-325, tools/infer/utility.py:149-151
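The box-ordering behavior can be sketched as a standalone function. The name sort_text_boxes and the 10-pixel same-line tolerance are stand-ins chosen to mirror the behavior described above:

```python
def sort_text_boxes(dt_boxes, y_tol=10):
    """Order detected boxes top-to-bottom, then left-to-right.
    Boxes whose top-left corners lie within y_tol pixels of each other
    are treated as belonging to the same text line."""
    # Primary sort: top-left corner, y first then x.
    boxes = sorted(dt_boxes, key=lambda b: (b[0][1], b[0][0]))
    for i in range(len(boxes) - 1):
        # Bubble neighbours on the same line back into left-to-right order.
        for j in range(i, -1, -1):
            if abs(boxes[j + 1][0][1] - boxes[j][0][1]) < y_tol and \
               boxes[j + 1][0][0] < boxes[j][0][0]:
                boxes[j], boxes[j + 1] = boxes[j + 1], boxes[j]
            else:
                break
    return boxes
```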
High-performance inference (enable_hpi option) leverages hardware-accelerated backends to reduce latency and increase throughput.
| Backend | Supported Hardware | Precision | Key Configuration |
|---|---|---|---|
| Paddle Inference | CPU, GPU, NPU, XPU, DCU | FP32, FP16 | ir_optim=True |
| TensorRT | NVIDIA GPU | FP32, FP16, INT8 | use_tensorrt=True, precision="fp16" |
| MKL-DNN (oneDNN) | Intel CPU | FP32, BF16 | enable_mkldnn=True |
| ONNX Runtime | CPU, GPU | FP32, FP16 | use_onnx=True |
TensorRT acceleration is enabled via tools/infer/utility.py:282-346. The system supports two modes:
1. Legacy Mode (.pdmodel format):
   - Collects dynamic shape information into {mode}_trt_dynamic_shape.txt
   - Calls config.enable_tensorrt_engine() with workspace and precision settings
   - Applies config.enable_tuned_tensorrt_dynamic_shape() to load the tuned shapes
2. PIR Mode (.json format):
   - Reads model metadata from inference.yml
   - Converts the model with paddle.tensorrt.export.convert()
   - Caches the converted engine in the {model_dir}/.cache/trt/ directory
Sources: tools/infer/utility.py:282-346, tools/infer/utility.py:438-514
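A minimal configuration sketch of the legacy-mode setup using Paddle Inference's Python API. The model paths, GPU memory size, workspace size, and subgraph threshold are illustrative values, not PaddleOCR's exact defaults:

```python
# Sketch: enabling TensorRT in legacy (.pdmodel) mode via Paddle Inference.
# All values below are illustrative; PaddleOCR derives them from CLI args.
from paddle.inference import Config, PrecisionType

config = Config("inference.pdmodel", "inference.pdiparams")
config.enable_use_gpu(500, 0)            # 500 MB initial GPU memory, device 0
config.enable_tensorrt_engine(
    workspace_size=1 << 30,              # 1 GB TensorRT workspace
    max_batch_size=1,
    min_subgraph_size=15,                # skip TRT for tiny subgraphs
    precision_mode=PrecisionType.Half,   # FP16, per precision="fp16"
    use_static=True,                     # cache serialized engines on disk
    use_calib_mode=False,
)
# Load previously collected dynamic shapes, e.g. det_trt_dynamic_shape.txt;
# the resulting config is then passed to paddle.inference.create_predictor().
config.enable_tuned_tensorrt_dynamic_shape("det_trt_dynamic_shape.txt", True)
```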
For CPU inference, MKL-DNN (now called oneDNN) provides significant acceleration. The cache capacity setting bounds the oneDNN kernel cache, preventing unbounded memory growth when processing images of varying sizes. Thread count (cpu_threads) controls parallelism, with a default of 10.
Sources: tools/infer/utility.py:387-402
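The oneDNN settings above map to a small amount of Paddle Inference configuration. This is a sketch with placeholder model paths, not PaddleOCR's exact setup code:

```python
# Sketch: CPU inference with oneDNN (MKL-DNN) enabled via Paddle Inference.
from paddle.inference import Config

config = Config("inference.pdmodel", "inference.pdiparams")
config.disable_gpu()                         # force CPU execution
config.enable_mkldnn()                       # enable oneDNN kernels
config.set_mkldnn_cache_capacity(10)         # bound cache for varying input shapes
config.set_cpu_math_library_num_threads(10)  # matches the cpu_threads default
```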
Inference precision is controlled by the precision argument, which accepts fp32, fp16, or int8.
Sources: tools/infer/utility.py:265-273, tools/infer/utility.py:392-393
ONNX Runtime support is implemented in tools/infer/utility.py:200-238.
Models must be converted to ONNX format first using Paddle2ONNX (see deployment documentation).
Sources: tools/infer/utility.py:200-238
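A conversion command might look like the following; the model directory, output path, and opset version are placeholders:

```shell
# Sketch: converting an exported detection model to ONNX with Paddle2ONNX.
paddle2onnx --model_dir ./inference/det \
    --model_filename inference.pdmodel \
    --params_filename inference.pdiparams \
    --save_file ./inference/det/model.onnx \
    --opset_version 11
```

The resulting .onnx file is then passed to the Python tools with --use_onnx=True.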
C++ deployment provides standalone executables with minimal dependencies for production environments. The implementation is located in deploy/cpp_infer/. See page 5.3 for full details including the CMake build system and C++ API.
The C++ inference workflow mirrors the Python implementation but uses native PaddlePaddle C++ APIs to build a paddle::AnalysisConfig and create predictors.

Key build options (in deploy/cpp_infer/CMakeLists.txt):
| CMake Flag | Purpose |
|---|---|
| WITH_GPU | Enable CUDA acceleration |
| WITH_MKL | Use Intel MKL for CPU BLAS |
| WITH_STATIC_LIB | Static linking for standalone executable |
| USE_TENSORRT | Enable TensorRT optimization |
Sources: deploy/README.md:25-26
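A typical configure-and-build invocation using these flags might look like this; the paths are placeholders, and the PADDLE_LIB variable (pointing at an extracted Paddle Inference C++ package) is an assumption about the build script's inputs:

```shell
# Sketch: building the C++ inference demo with the CMake flags above.
cd deploy/cpp_infer && mkdir -p build && cd build
cmake .. \
    -DWITH_GPU=ON \
    -DWITH_MKL=ON \
    -DWITH_STATIC_LIB=OFF \
    -DUSE_TENSORRT=OFF \
    -DPADDLE_LIB=/path/to/paddle_inference
make -j"$(nproc)"
```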
PaddleOCR provides service-oriented deployment via deploy/pdserving/. Both Python and C++ serving backends are supported, enabling remote HTTP/gRPC inference. See page 5.4 for full setup and usage details.
Deployment path: deploy/pdserving/
The PD Serving integration allows PaddleOCR inference to be exposed as a network service. Clients send images over HTTP/gRPC, and the server returns detection bounding boxes, recognition text, and confidence scores.
Sources: deploy/README.md:23-27
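The client side reduces to payload construction plus an HTTP POST. This sketch shows only the payload construction; the endpoint URL and the key/value field names in the comment are assumptions based on Paddle Serving's pipeline web service convention, not verified against deploy/pdserving/:

```python
import base64
import json

def build_ocr_request(image_bytes):
    """Build the JSON body a serving web client would POST: the raw image
    is base64-encoded and wrapped in a key/value payload (field names are
    an assumption following Paddle Serving's convention)."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps({"key": ["image"], "value": [b64]})

# Usage (assumed endpoint):
# requests.post("http://127.0.0.1:9998/ocr/prediction",
#               data=build_ocr_request(open("test.jpg", "rb").read()))
```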
PaddleOCR supports parallel inference to increase throughput. See page 5.5 for full details.
tools/infer/predict_system.py exposes multi-process execution via CLI flags declared in tools/infer/utility.py:
| Flag | Type | Default | Description |
|---|---|---|---|
| --use_mp | bool | False | Enable multi-process mode |
| --total_process_num | int | 1 | Total number of worker processes |
| --process_id | int | 0 | ID of this process (set automatically) |
When --use_mp=True, the main entry point in predict_system.py spawns total_process_num child processes using subprocess.Popen, each receiving a unique --process_id. Each worker then processes a disjoint slice of the input file list via image_file_list[process_id::total_process_num] (tools/infer/predict_system.py:187).
Sources: tools/infer/predict_system.py:185-188, tools/infer/predict_system.py:310-325, tools/infer/utility.py:149-151
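The slicing scheme is easy to verify in isolation. This standalone sketch mimics how each worker selects its share of the file list, and shows that the shards are disjoint and cover every file:

```python
def partition_files(image_file_list, total_process_num):
    """Replicate the per-worker slicing: worker pid processes
    image_file_list[pid::total_process_num]."""
    return [image_file_list[pid::total_process_num]
            for pid in range(total_process_num)]

files = [f"img_{i}.jpg" for i in range(7)]
shards = partition_files(files, 3)
# Worker 0 gets img_0, img_3, img_6; worker 1 gets img_1, img_4;
# worker 2 gets img_2, img_5 -- disjoint shards covering all inputs.
```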
PaddleOCR supports on-device deployment via Paddle Lite for Android and iOS platforms.
Paddle Lite is a lightweight inference engine optimized for mobile and embedded devices:
| Feature | Mobile Optimization |
|---|---|
| Model Size | Quantization (INT8) reduces size by 4x |
| Inference Speed | ARM NEON and GPU acceleration |
| Memory Usage | Memory reuse and operator fusion |
| Power Efficiency | Dynamic frequency scaling support |
Android deployment workflow: convert the exported inference model with paddle_lite_opt to generate the .nb format used by Paddle Lite, then bundle the optimized model into the Android application.
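A model optimization command might look like the following; the paths and the valid_targets value are placeholders to adjust for your model and device:

```shell
# Sketch: optimizing an exported model into Paddle Lite's .nb format.
paddle_lite_opt \
    --model_file=./inference/det/inference.pdmodel \
    --param_file=./inference/det/inference.pdiparams \
    --optimize_out_type=naive_buffer \
    --optimize_out=./output/det_opt \
    --valid_targets=arm
```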
An Android demo application is available in the repository.

Sources: README.md:183
iOS deployment follows a similar pattern, reusing the same paddle_lite_opt-optimized .nb models with the Paddle Lite iOS library.

Sources: README.md
Performance on edge devices is typically improved through the techniques summarized in the table above: INT8 quantization, ARM NEON / GPU acceleration, and memory reuse via operator fusion. Mobile models are also kept deliberately small; as noted above, INT8 quantization alone reduces model size by roughly 4x.

Sources: README.md:20-21, README.md:183
This deployment and inference documentation covers the full spectrum from Python scripts to production services and mobile applications. For specific hardware configurations, see page 7. For pipeline usage patterns, see page 2.