This document describes parallel inference strategies available in PaddleOCR to improve throughput when processing multiple inputs or large documents. It covers three complementary approaches: (1) built-in multi-device support for automatic parallelization across GPUs/accelerators, (2) internal queue-based asynchronous processing for pipeline stages, and (3) custom multi-process parallelism patterns for advanced use cases.
For general inference interfaces, see Python Inference System. For performance optimization techniques like TensorRT and ONNX, see High-Performance Inference. For service-based deployment patterns, see Service Deployment.
PaddleOCR pipelines support specifying multiple inference devices simultaneously using comma-separated device identifiers. When multiple devices are specified, the pipeline creates one instance of the underlying model on each device and distributes inputs across them for parallel processing.
Supported device format:
device="<device_type>:<id1>,<id2>,<id3>,..."
Examples:
device="gpu:0,1,2,3" - Use 4 NVIDIA GPUsdevice="npu:0,1" - Use 2 Huawei Ascend NPUsdevice="xpu:0,1,2" - Use 3 Kunlunxin XPUsDiagram: Multi-Device Pipeline Distribution
During pipeline initialization, one complete pipeline instance (including all loaded models) is created on each specified device. The dispatcher distributes incoming inputs across instances, and results are aggregated in the order they were submitted.
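For example, a minimal sketch of multi-device usage (the file names are illustrative, and the constructor and result-object calls assume the standard pipeline interface):

```python
from paddleocr import PaddleOCR

# One pipeline instance is created on each of the four listed GPUs.
pipeline = PaddleOCR(device="gpu:0,1,2,3")

# Passing a list lets the dispatcher spread inputs across all instances;
# results are aggregated back in submission order.
results = pipeline.predict(["page_1.png", "page_2.png", "page_3.png", "page_4.png"])
for res in results:
    res.print()
```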
Sources: docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md5-31
The following table shows multi-device support status for major pipelines:
| Pipeline | Multi-Device Support | Relevant Parameter |
|---|---|---|
| PaddleOCR (PP-OCRv5) | ✅ Yes | device |
| PPStructureV3 | ✅ Yes | device |
| PaddleOCRVL | ✅ Yes | device |
| PPChatOCRv4 | ✅ Yes | device |
| DocPreprocessor | ✅ Yes | device |
| FormulaRecognition | ✅ Yes | device |
| SealRecognition | ✅ Yes | device |
| TableRecognitionV2 | ✅ Yes | device |
To verify whether a specific pipeline supports multi-device inference, refer to the pipeline's usage documentation.
Sources: docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md1-31
use_queues Parameter

Several PaddleOCR pipelines (notably PaddleOCRVL) implement internal queue-based asynchronous processing to improve throughput when processing multi-page documents or directories with many files. This is controlled by the use_queues parameter.
Parameter specification:
- Type: bool
- Default: True for the PaddleOCRVL pipeline

Usage example:
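A minimal sketch, assuming use_queues is passed to predict() (the input file name is illustrative):

```python
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL()

# Enable queue-based asynchronous processing for a multi-page PDF.
# (It defaults to True, so it is shown explicitly here only for clarity.)
output = pipeline.predict("report.pdf", use_queues=True)
for res in output:
    res.print()
```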
Diagram: Queue-Based Asynchronous Pipeline Architecture
The queue-based approach decouples processing stages, allowing them to run concurrently. While the layout detection model processes page N, the VLM can simultaneously process page N-1, and the PDF renderer can prepare page N+1. This pipeline parallelism is particularly efficient for large multi-page documents.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md537-542 docs/version3.x/pipeline_usage/PaddleOCR-VL.md508-513
Recommended scenarios:

- Multi-page PDF documents
- Directories containing many input files

Performance characteristics: the speedup grows with the number of inputs, since more pages allow more overlap between stages; single images see little or no benefit.

Example comparison:
| Scenario | use_queues=False | use_queues=True | Speedup |
|---|---|---|---|
| Single image | 100% (baseline) | ~100% | 1.0x |
| 10-page PDF | 100% (baseline) | ~130-150% | 1.3-1.5x |
| 100-page PDF | 100% (baseline) | ~180-220% | 1.8-2.2x |
| Directory (50 images) | 100% (baseline) | ~150-200% | 1.5-2.0x |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md537-542 docs/version3.x/pipeline_usage/PaddleOCR-VL.md508-513
For advanced use cases requiring fine-grained control over parallelism, users can implement custom multi-process patterns wrapping PaddleOCR pipeline APIs. This approach is useful when the built-in multi-device support doesn't match your specific requirements.
The following example demonstrates a common pattern: a task queue feeding multiple worker processes, each running a pipeline instance on a dedicated device.
Diagram: Multi-Process Task Queue Architecture
This pattern uses Python's multiprocessing.Manager to create a shared queue. The main process enqueues all input file paths, and worker processes compete to dequeue tasks, process them with their pipeline instance, and save results. Each worker runs on a dedicated device to avoid resource contention.
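A condensed sketch of the pattern (the names, paths, and save call are illustrative rather than the exact script from the documentation):

```python
import multiprocessing as mp
from pathlib import Path

from paddleocr import PaddleOCR


def worker(device, task_queue, output_dir):
    # Each worker owns one pipeline instance pinned to a single device.
    pipeline = PaddleOCR(device=device)
    while True:
        path = task_queue.get()
        if path is None:  # sentinel: no more work
            break
        for res in pipeline.predict(path):
            res.save_to_json(save_path=output_dir)


if __name__ == "__main__":
    devices = ["gpu:0", "gpu:1"]
    output_dir = "./output"
    Path(output_dir).mkdir(exist_ok=True)

    # A Manager-backed queue can be shared safely across processes.
    manager = mp.Manager()
    task_queue = manager.Queue()

    # Enqueue every input, then one sentinel per worker for clean shutdown.
    for img in Path("./inputs").glob("*.png"):
        task_queue.put(str(img))
    for _ in devices:
        task_queue.put(None)

    workers = [
        mp.Process(target=worker, args=(dev, task_queue, output_dir))
        for dev in devices
    ]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```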
Sources: docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md33-166
The complete implementation is provided in the documentation. Key components:
- Pipeline Loading - docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md49-54
- Worker Function - docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md56-86
- Main Process - docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md89-166
- Manager().Queue() for task distribution

The example provides a CLI for easy usage:
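For example, with the script saved as mp_infer.py (the file name is hypothetical; see the linked documentation for the full script):

`python mp_infer.py --pipeline PaddleOCRVL --input_dir ./inputs --device gpu:0,1 --output_dir ./output --instances_per_device 1 --batch_size 1 --input_glob_pattern "*.pdf"`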
Parameters:
- --pipeline: PaddleOCR pipeline class name (e.g., DocPreprocessor, PaddleOCRVL)
- --input_dir: Directory containing input files
- --device: Comma-separated device list
- --output_dir: Where to save results
- --instances_per_device: Number of pipeline instances per device (default: 1)
- --batch_size: Inference batch size per instance (default: 1)
- --input_glob_pattern: Pattern to match input files (default: *)

Sources: docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md89-166
The example supports running multiple pipeline instances on the same device by adjusting --instances_per_device. This can improve GPU utilization when the model doesn't fully saturate the device.
Device assignment logic:
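One plausible scheme, shown as a standalone sketch (the documented script's exact logic may differ): worker i runs on devices[i % len(devices)], so instances alternate across the device list.

```python
devices = ["gpu:0", "gpu:1"]
instances_per_device = 2
total_workers = len(devices) * instances_per_device

# Round-robin: worker i is pinned to devices[i % len(devices)].
assignments = [devices[i % len(devices)] for i in range(total_workers)]
print(assignments)  # ['gpu:0', 'gpu:1', 'gpu:0', 'gpu:1']
```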
When to use multiple instances per device:

- The model does not fully saturate the GPU (utilization stays low during inference)
- There is spare device memory to hold additional pipeline instances

Trade-offs:

- Each additional instance consumes more device memory, so too many instances can cause out-of-memory errors
- Instances on the same device compete for compute, so throughput gains taper off as the device saturates
Sources: docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md110-166
| Requirement | Recommended Approach | Rationale |
|---|---|---|
| Simple multi-GPU inference | Built-in multi-device | Easiest to use, automatic load balancing |
| Multi-page PDF processing | use_queues=True | Pipeline parallelism between stages |
| Large directory of images | Built-in multi-device + batch input | Efficient for homogeneous workloads |
| Fine-grained control | Custom multi-process | Maximum flexibility |
| Heterogeneous devices | Custom multi-process | Can assign different pipeline configs per device |
| Service deployment | Built-in multi-device + queues | Combines throughput benefits |
The three approaches can be combined for maximum performance: built-in multi-device support provides device-level parallelism, use_queues=True adds stage-level parallelism within each instance, and custom multi-processing adds process-level control on top. This three-level approach maximizes both device utilization and stage overlap; see the sketch below.
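A minimal sketch combining the first two levels (assuming PaddleOCRVL accepts a comma-separated device list and a use_queues argument to predict(); the file name is illustrative):

```python
from paddleocr import PaddleOCRVL

# Device-level parallelism: one pipeline instance per listed GPU.
pipeline = PaddleOCRVL(device="gpu:0,1")

# Stage-level parallelism: queue-based processing overlaps pipeline
# stages across the pages of the PDF.
output = pipeline.predict("large_report.pdf", use_queues=True)
for res in output:
    res.print()
```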
Sources: docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md1-166 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md537-542
For built-in multi-device inference, batching is handled by the pipeline itself: inputs passed as a list are distributed across the device instances automatically.

For custom multi-process parallelism:
- The batch_size parameter in the worker function controls per-worker batching
- Inputs are grouped into batches before being passed to predict() calls
- For PDF inputs, batch_size means processing multiple PDF pages together
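A minimal sketch of per-worker batching (the chunked helper is illustrative; it assumes predict() accepts a list of inputs):

```python
from pathlib import Path

from paddleocr import PaddleOCR


def chunked(items, size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


pipeline = PaddleOCR(device="gpu:0")
image_paths = [str(p) for p in Path("./inputs").glob("*.png")]

for batch in chunked(image_paths, size=4):
    # One predict() call processes the whole batch.
    for res in pipeline.predict(batch):
        res.print()
```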
Recommended starting points:

- Multi-device inference: start with one pipeline instance per device
- Queue-based processing: leave use_queues at its default (True) for multi-page inputs
- Multi-process parallelism: tune --instances_per_device carefully to avoid OOM

Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md537-542 docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md33-166
For service deployment, device specification can be provided via YAML configuration instead of programmatically:
This configuration is loaded when initializing pipelines in service mode.
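A sketch of what such a configuration might contain (the key names below are hypothetical, not the official schema; consult the serving documentation for the exact format):

```python
# Hypothetical example: the YAML keys are illustrative only.
# Requires PyYAML (pip install pyyaml).
import yaml

config_text = """
pipeline_name: PaddleOCR-VL
device: gpu:0,1
"""

config = yaml.safe_load(config_text)
print(config["device"])  # -> gpu:0,1
```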
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md616-641 docs/version3.x/pipeline_usage/PaddleOCR-VL.md615-620
Different accelerators have specific device naming conventions:
| Hardware | Device String Format | Example |
|---|---|---|
| NVIDIA GPU | gpu:<id> | gpu:0,1,2,3 |
| Kunlunxin XPU | xpu:<id> | xpu:0,1 |
| Huawei Ascend NPU | npu:<id> | npu:0,1 |
| Cambricon MLU | mlu:<id> | mlu:0,1 |
| Hygon DCU | dcu:<id> | dcu:0,1 |
| MetaX GPU | metax_gpu:<id> | metax_gpu:0,1 |
| Iluvatar GPU | iluvatar_gpu:<id> | iluvatar_gpu:0,1 |
| CPU | cpu | cpu (no multi-device) |
For hardware-specific setup instructions, see the relevant usage tutorial (e.g., docs/version3.x/pipeline_usage/PaddleOCR-VL-Kunlunxin-XPU.en.md).
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md580-595 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md1-316 docs/version3.x/pipeline_usage/PaddleOCR-VL-Iluvatar-GPU.en.md1-242 docs/version3.x/pipeline_usage/PaddleOCR-VL-MetaX-GPU.en.md1-242 docs/version3.x/pipeline_usage/PaddleOCR-VL-Huawei-Ascend-NPU.en.md1-242 docs/version3.x/pipeline_usage/PaddleOCR-VL-Apple-Silicon.en.md1-105