This document describes parallel inference strategies available in PaddleOCR to improve throughput when processing multiple inputs or large documents. It covers three complementary approaches: (1) built-in multi-device support for automatic parallelization across GPUs/accelerators, (2) internal queue-based asynchronous processing for pipeline stages, and (3) custom multi-process parallelism patterns for advanced use cases.
For general inference interfaces, see Python Inference System. For performance optimization techniques like TensorRT and ONNX, see High-Performance Inference. For service-based deployment patterns, see Service Deployment.
PaddleOCR pipelines support specifying multiple inference devices simultaneously using comma-separated device identifiers. When multiple devices are specified, the pipeline creates one instance of the underlying model on each device and distributes inputs across them for parallel processing.
Supported device format:
device="<device_type>:<id1>,<id2>,<id3>,..."
Examples:
device="gpu:0,1,2,3" - Use 4 NVIDIA GPUsdevice="npu:0,1" - Use 2 Huawei Ascend NPUsdevice="xpu:0,1,2" - Use 3 Kunlunxin XPUsDiagram: Multi-Device Pipeline Distribution
During pipeline initialization, one complete pipeline instance (including all loaded models) is created on each specified device. The dispatcher distributes incoming inputs across instances, and results are aggregated in the order they were submitted.
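For example, a minimal sketch of multi-device usage (the file names are illustrative, and the constructor and result-object calls assume the standard pipeline interface):

```python
from paddleocr import PaddleOCR

# One pipeline instance is created on each of the four listed GPUs.
pipeline = PaddleOCR(device="gpu:0,1,2,3")

# Passing a list lets the dispatcher spread inputs across all instances;
# results are aggregated back in submission order.
results = pipeline.predict(["page_1.png", "page_2.png", "page_3.png", "page_4.png"])
for res in results:
    res.print()
```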
Sources: docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md5-31
The following table shows multi-device support status for major pipelines:
| Pipeline | Multi-Device Support | Relevant Parameter |
|---|---|---|
| PaddleOCR (PP-OCRv5) | ✅ Yes | device |
| PPStructureV3 | ✅ Yes | device |
| PaddleOCRVL | ✅ Yes | device |
| PPChatOCRv4 | ✅ Yes | device |
| DocPreprocessor | ✅ Yes | device |
| FormulaRecognition | ✅ Yes | device |
| SealRecognition | ✅ Yes | device |
| TableRecognitionV2 | ✅ Yes | device |
To verify whether a specific pipeline supports multi-device inference, refer to the pipeline's usage documentation.
Sources: docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md1-31
use_queues Parameter

Several PaddleOCR pipelines (notably PaddleOCRVL) implement internal queue-based asynchronous processing to improve throughput when processing multi-page documents or directories with many files. This is controlled by the use_queues parameter.
Parameter specification:
- Type: bool
- Default: True for the PaddleOCRVL pipeline

Usage example:
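A minimal sketch, assuming use_queues is passed to predict() (the input file name is illustrative):

```python
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL()

# Enable queue-based asynchronous processing for a multi-page PDF.
# (It defaults to True, so it is shown explicitly here only for clarity.)
output = pipeline.predict("report.pdf", use_queues=True)
for res in output:
    res.print()
```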
Diagram: Queue-Based Asynchronous Pipeline Architecture
The queue-based approach decouples processing stages, allowing them to run concurrently. While the layout detection model processes page N, the VLM can simultaneously process page N-1, and the PDF renderer can prepare page N+1. This pipeline parallelism is particularly efficient for large multi-page documents.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md537-542 docs/version3.x/pipeline_usage/PaddleOCR-VL.md508-513
Recommended scenarios:

- Multi-page PDF documents
- Directories containing many input files

Performance characteristics: the speedup grows with the number of inputs, since more pages allow more overlap between stages; single images see little or no benefit.

Example comparison:
| Scenario | use_queues=False | use_queues=True | Speedup |
|---|---|---|---|
| Single image | 100% (baseline) | ~100% | 1.0x |
| 10-page PDF | 100% (baseline) | ~130-150% | 1.3-1.5x |
| 100-page PDF | 100% (baseline) | ~180-220% | 1.8-2.2x |
| Directory (50 images) | 100% (baseline) | ~150-200% | 1.5-2.0x |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md537-542 docs/version3.x/pipeline_usage/PaddleOCR-VL.md508-513
For advanced use cases requiring fine-grained control over parallelism, users can implement custom multi-process patterns wrapping PaddleOCR pipeline APIs. This approach is useful when the built-in multi-device support doesn't match your specific requirements.
The following example demonstrates a common pattern: a task queue feeding multiple worker processes, each running a pipeline instance on a dedicated device.
Diagram: Multi-Process Task Queue Architecture
This pattern uses Python's multiprocessing.Manager to create a shared queue. The main process enqueues all input file paths, and worker processes compete to dequeue tasks, process them with their pipeline instance, and save results. Each worker runs on a dedicated device to avoid resource contention.
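A condensed sketch of the pattern (the names, paths, and save call are illustrative rather than the exact script from the documentation):

```python
import multiprocessing as mp
from pathlib import Path

from paddleocr import PaddleOCR


def worker(device, task_queue, output_dir):
    # Each worker owns one pipeline instance pinned to a single device.
    pipeline = PaddleOCR(device=device)
    while True:
        path = task_queue.get()
        if path is None:  # sentinel: no more work
            break
        for res in pipeline.predict(path):
            res.save_to_json(save_path=output_dir)


if __name__ == "__main__":
    devices = ["gpu:0", "gpu:1"]
    output_dir = "./output"
    Path(output_dir).mkdir(exist_ok=True)

    # A Manager-backed queue can be shared safely across processes.
    manager = mp.Manager()
    task_queue = manager.Queue()

    # Enqueue every input, then one sentinel per worker for clean shutdown.
    for img in Path("./inputs").glob("*.png"):
        task_queue.put(str(img))
    for _ in devices:
        task_queue.put(None)

    workers = [
        mp.Process(target=worker, args=(dev, task_queue, output_dir))
        for dev in devices
    ]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```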
Sources: docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md33-166
The complete implementation is provided in the documentation. Key components:
- Pipeline Loading - docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md49-54
- Worker Function - docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md56-86
- Main Process - docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md89-166
- Manager().Queue() for task distribution

The example provides a CLI for easy usage:
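For example, with the script saved as mp_infer.py (the file name is hypothetical; see the linked documentation for the full script):

`python mp_infer.py --pipeline PaddleOCRVL --input_dir ./inputs --device gpu:0,1 --output_dir ./output --instances_per_device 1 --batch_size 1 --input_glob_pattern "*.pdf"`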
Parameters:
- --pipeline: PaddleOCR pipeline class name (e.g., DocPreprocessor, PaddleOCRVL)
- --input_dir: Directory containing input files
- --device: Comma-separated device list
- --output_dir: Where to save results
- --instances_per_device: Number of pipeline instances per device (default: 1)
- --batch_size: Inference batch size per instance (default: 1)
- --input_glob_pattern: Pattern to match input files (default: *)

Sources: docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md89-166
The example supports running multiple pipeline instances on the same device by adjusting --instances_per_device. This can improve GPU utilization when the model doesn't fully saturate the device.
Device assignment logic:
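One plausible scheme, shown as a standalone sketch (the documented script's exact logic may differ): worker i runs on devices[i % len(devices)], so instances alternate across the device list.

```python
devices = ["gpu:0", "gpu:1"]
instances_per_device = 2
total_workers = len(devices) * instances_per_device

# Round-robin: worker i is pinned to devices[i % len(devices)].
assignments = [devices[i % len(devices)] for i in range(total_workers)]
print(assignments)  # ['gpu:0', 'gpu:1', 'gpu:0', 'gpu:1']
```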
When to use multiple instances per device:

- The model does not fully saturate the GPU (utilization stays low during inference)
- There is spare device memory to hold additional pipeline instances

Trade-offs:

- Each additional instance consumes more device memory, so too many instances can cause out-of-memory errors
- Instances on the same device compete for compute, so throughput gains taper off as the device saturates
Sources: docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md110-166
| Requirement | Recommended Approach | Rationale |
|---|---|---|
| Simple multi-GPU inference | Built-in multi-device | Easiest to use, automatic load balancing |
| Multi-page PDF processing | use_queues=True | Pipeline parallelism between stages |
| Large directory of images | Built-in multi-device + batch input | Efficient for homogeneous workloads |
| Fine-grained control | Custom multi-process | Maximum flexibility |
| Heterogeneous devices | Custom multi-process | Can assign different pipeline configs per device |
| Service deployment | Built-in multi-device + queues | Combines throughput benefits |
The three approaches can be combined for maximum performance: built-in multi-device support provides device-level parallelism, use_queues=True adds stage-level parallelism within each instance, and custom multi-processing adds process-level control on top. This three-level approach maximizes both device utilization and stage overlap; see the sketch below.
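A minimal sketch combining the first two levels (assuming PaddleOCRVL accepts a comma-separated device list and a use_queues argument to predict(); the file name is illustrative):

```python
from paddleocr import PaddleOCRVL

# Device-level parallelism: one pipeline instance per listed GPU.
pipeline = PaddleOCRVL(device="gpu:0,1")

# Stage-level parallelism: queue-based processing overlaps pipeline
# stages across the pages of the PDF.
output = pipeline.predict("large_report.pdf", use_queues=True)
for res in output:
    res.print()
```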
Sources: docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md1-166 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md537-542
For built-in multi-device inference, batching is handled by the pipeline itself: inputs passed as a list are distributed across the device instances automatically.

For custom multi-process parallelism:
- The batch_size parameter in the worker function controls per-worker batching
- Inputs are grouped into batches before being passed to predict() calls
- For PDF inputs, batch_size means processing multiple PDF pages together
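A minimal sketch of per-worker batching (the chunked helper is illustrative; it assumes predict() accepts a list of inputs):

```python
from pathlib import Path

from paddleocr import PaddleOCR


def chunked(items, size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


pipeline = PaddleOCR(device="gpu:0")
image_paths = [str(p) for p in Path("./inputs").glob("*.png")]

for batch in chunked(image_paths, size=4):
    # One predict() call processes the whole batch.
    for res in pipeline.predict(batch):
        res.print()
```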
Recommended starting points:

- Multi-device inference: start with one pipeline instance per device
- Queue-based processing: leave use_queues at its default (True) for multi-page inputs
- Multi-process parallelism: tune --instances_per_device carefully to avoid OOM

Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md537-542 docs/version3.x/pipeline_usage/instructions/parallel_inference.en.md33-166
For service deployment, device specification can be provided via YAML configuration instead of programmatically:
This configuration is loaded when initializing pipelines in service mode.
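A sketch of what such a configuration might contain (the key names below are hypothetical, not the official schema; consult the serving documentation for the exact format):

```python
# Hypothetical example: the YAML keys are illustrative only.
# Requires PyYAML (pip install pyyaml).
import yaml

config_text = """
pipeline_name: PaddleOCR-VL
device: gpu:0,1
"""

config = yaml.safe_load(config_text)
print(config["device"])  # -> gpu:0,1
```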
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md616-641 docs/version3.x/pipeline_usage/PaddleOCR-VL.md615-620
Different accelerators have specific device naming conventions:
| Hardware | Device String Format | Example |
|---|---|---|
| NVIDIA GPU | gpu:<id> | gpu:0,1,2,3 |
| Kunlunxin XPU | xpu:<id> | xpu:0,1 |
| Huawei Ascend NPU | npu:<id> | npu:0,1 |
| Cambricon MLU | mlu:<id> | mlu:0,1 |
| Hygon DCU | dcu:<id> | dcu:0,1 |
| MetaX GPU | metax_gpu:<id> | metax_gpu:0,1 |
| Iluvatar GPU | iluvatar_gpu:<id> | iluvatar_gpu:0,1 |
| CPU | cpu | cpu (no multi-device) |
For hardware-specific setup instructions, see the relevant usage tutorial (e.g., docs/version3.x/pipeline_usage/PaddleOCR-VL-Kunlunxin-XPU.en.md).
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md580-595 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md1-316 docs/version3.x/pipeline_usage/PaddleOCR-VL-Iluvatar-GPU.en.md1-242 docs/version3.x/pipeline_usage/PaddleOCR-VL-MetaX-GPU.en.md1-242 docs/version3.x/pipeline_usage/PaddleOCR-VL-Huawei-Ascend-NPU.en.md1-242 docs/version3.x/pipeline_usage/PaddleOCR-VL-Apple-Silicon.en.md1-105