This page covers the available inference modes, hardware requirements, and acceleration backend options for PaddleOCR-VL. It focuses on the runtime execution layer: how to choose an inference method, what hardware it requires, and how to configure it for production use.
For the overall architecture of the PaddleOCR-VL pipeline — including its models, configuration parameters, and pipeline versions — see 2.2.1. For service deployment (Docker Compose, API endpoints, and client invocation), see 2.2.3.
PaddleOCR-VL decomposes inference into two independent parts: layout detection and VLM-based recognition.
The naming convention for inference methods follows the format `<layout detection backend> + <VLM backend>`. The baseline mode, in which both components run natively on PaddlePaddle, is referred to simply as PaddlePaddle.
The six supported inference methods are shown below with their hardware compatibility.
Inference Method × Hardware Support Matrix
| Inference Method | NVIDIA GPU | Kunlunxin XPU | Hygon DCU | MetaX GPU | Iluvatar GPU | Huawei Ascend NPU | x64 CPU | Apple Silicon |
|---|---|---|---|---|---|---|---|---|
| PaddlePaddle | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ |
| PaddlePaddle + vLLM | ✅ | 🚧 | ✅ | 🚧 | 🚧 | ✅ | — | — |
| PaddlePaddle + SGLang | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | — | — |
| PaddlePaddle + FastDeploy | ✅ | ✅ | 🚧 | ✅ | ✅ | 🚧 | — | — |
| PaddlePaddle + MLX-VLM | — | — | — | — | — | — | — | ✅ |
| PaddlePaddle + llama.cpp | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | ✅ | 🚧 |
✅ = supported, 🚧 = not yet supported, — = not applicable
Note: Huawei Ascend NPU does not support the baseline PaddlePaddle inference method; it must use the vLLM backend.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md45-131
When using NVIDIA GPUs, the minimum Compute Capability (CC) and CUDA version vary by backend.
| Backend | Min. Compute Capability | Min. CUDA Version |
|---|---|---|
| PaddlePaddle | CC ≥ 7.0 | CUDA ≥ 11.8 |
| vLLM | CC ≥ 8.0 | CUDA ≥ 12.6 |
| SGLang | 8.0 ≤ CC < 12.0 | CUDA ≥ 12.6 |
| FastDeploy | 8.0 ≤ CC < 12.0 | CUDA ≥ 12.6 |
Blackwell-architecture GPUs (CC 12.0) are not covered by these minimums and require dedicated builds (Docker image tag `latest-nvidia-gpu-sm120`).
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md133-141 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md1-19
The two-tier separation between the pipeline and the VLM inference service is a core design point. When a production acceleration framework is used, the VLM component runs as a standalone network service (default port 8118) that the pipeline queries over HTTP.
Two-tier Inference Architecture
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md129-131
The `PaddleOCRVL` class from the `paddleocr` package uses PaddlePaddle inference for both layout detection and the VLM by default. This mode is the simplest to set up but is not optimized for production throughput.
CLI usage:
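A minimal sketch, assuming the PaddleOCR 3.x CLI conventions; verify the subcommand name (`doc_parser` here) and flags against your installed version:

```shell
# Run the full PaddleOCR-VL pipeline on one input using native PaddlePaddle inference
paddleocr doc_parser -i document.png --device gpu:0
```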
Python API:
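A minimal sketch of the default (native PaddlePaddle) mode; the result-saving method names follow the usual PaddleOCR 3.x pipeline convention and should be checked against your installed version:

```python
from paddleocr import PaddleOCRVL

# Both layout detection and the VLM run natively on PaddlePaddle by default
pipeline = PaddleOCRVL(device="gpu:0")

output = pipeline.predict("document.png")
for res in output:
    res.print()                               # dump structured results to stdout
    res.save_to_json(save_path="output")      # per-page JSON
    res.save_to_markdown(save_path="output")  # reconstructed Markdown
```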
The `device` parameter accepts: `cpu`, `gpu:N`, `npu:N`, `xpu:N`, `dcu:N`, `metax_gpu:N`, `iluvatar_gpu:N`.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md236-266 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md659-688
For production use, it is strongly recommended to run the VLM as a dedicated inference service using one of the supported acceleration frameworks. The `paddleocr genai_server` CLI command starts this service.
The `genai_server` subcommand parameters are:
| Parameter | Description |
|---|---|
| `--model_name` | Model identifier (e.g., `PaddleOCR-VL-1.5-0.9B`) |
| `--model_dir` | Local directory for model weights |
| `--host` | Server bind address |
| `--port` | Server listen port |
| `--backend` | Backend name: `vllm`, `sglang`, or `fastdeploy` |
| `--backend_config` | Path to a YAML file with backend-specific parameters |
Install the relevant backend dependencies using `paddleocr install_genai_server_deps`:
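For example (the argument mirrors the `--backend` names above):

```shell
# Install server-side dependencies for the chosen backend
paddleocr install_genai_server_deps vllm
```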
Important: Because vLLM and SGLang may have dependency conflicts with the PaddlePaddle framework, install them in a separate virtual environment from the one used for PaddleOCR itself.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md1-1000 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md127-173
Once the VLM inference service is running, tell the `PaddleOCRVL` pipeline to use it via `vl_rec_backend` and `vl_rec_server_url`.
CLI:
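A sketch, assuming the server from the previous section is listening on port 8118; the subcommand name, the `vllm-server` backend value, and the `/v1` path suffix are assumptions to verify against your installed version:

```shell
paddleocr doc_parser -i document.png \
    --vl_rec_backend vllm-server \
    --vl_rec_server_url http://127.0.0.1:8118/v1
```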
Python API:
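A sketch of the equivalent Python configuration; the `vllm-server` backend value and the `/v1` path suffix are assumptions:

```python
from paddleocr import PaddleOCRVL

# Point the pipeline at the standalone VLM service; layout detection
# still runs locally with PaddlePaddle.
pipeline = PaddleOCRVL(
    vl_rec_backend="vllm-server",                  # use the value matching your server backend
    vl_rec_server_url="http://127.0.0.1:8118/v1",
)
output = pipeline.predict("document.png")
```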
For MLX-VLM on Apple Silicon, also specify the model identifier:
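A sketch for Apple Silicon; both the backend value and the parameter name carrying the model identifier are assumptions to check against the Apple Silicon guide:

```python
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL(
    vl_rec_backend="mlx-vlm",                  # assumed backend value
    vl_rec_model_name="PaddleOCR-VL-1.5-0.9B", # assumed parameter name for the model identifier
)
```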
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL-Apple-Silicon.en.md60-84 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md1-1000
vLLM is the recommended backend for production on NVIDIA GPUs. It requires FlashAttention; if CUDA compilation tools are unavailable, prebuilt wheels are available at an external repository.
Start via Docker image (NVIDIA GPU, standard architecture):
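A sketch combining the registry prefix, image name, and tag listed in the Docker-image section; the exact entrypoint/command and port mapping should be verified against the image documentation:

```shell
# Start the vLLM-backed VLM server on the default port 8118
docker run --rm --gpus all --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --backend vllm --host 0.0.0.0 --port 8118
```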
For Blackwell-architecture GPUs (RTX 50 series), use the tag `latest-nvidia-gpu-sm120`.
SGLang is supported on NVIDIA GPUs with CC in the range [8.0, 12.0). It is not yet supported on other hardware.

```shell
paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --backend sglang --port 8118
```

### FastDeploy (NVIDIA GPU, MetaX GPU, Iluvatar GPU, Kunlunxin XPU)

FastDeploy is the recommended backend for non-NVIDIA accelerators that support it.

Start via Docker image (Iluvatar GPU example).

Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL-Iluvatar-GPU.en.md76-116 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md80-173
Diagram: PaddleOCRVL constructor inference parameters
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md268-644
Notable parameters for inference tuning:
| Parameter | Type | Notes |
|---|---|---|
| `vl_rec_backend` | str | Selects the VLM backend; unset uses native PaddlePaddle |
| `vl_rec_server_url` | str | URL of the external inference service |
| `vl_rec_max_concurrency` | int | Max concurrent requests when using a service backend |
| `device` | str | Compute device; defaults to GPU 0 if available, else CPU |
| `use_tensorrt` | bool | Enables the TensorRT subgraph engine in Paddle Inference |
| `precision` | str | `fp32` or `fp16` |
| `enable_mkldnn` | bool | Enables MKL-DNN for CPU inference |
| `mkldnn_cache_capacity` | int | MKL-DNN cache size |
| `cpu_threads` | int | Threads for CPU inference |
| `enable_hpi` | bool | Enables high-performance inference mode |
| `use_queues` | bool | Enables the async pipeline with internal queues (default `True`) |
| `min_pixels` / `max_pixels` | int | Controls the image resolution sent to the VLM |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md580-644
`use_queues` and Async Execution
When `use_queues=True` (the default), data loading (e.g., PDF page rendering), layout detection, and VLM inference execute in separate threads, passing data through queues. This is especially effective when processing multi-page PDFs or directories containing many files.
Disable it only if you need deterministic sequential execution or are debugging:
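For example:

```python
from paddleocr import PaddleOCRVL

# Sequential execution: each stage finishes a page before the next stage starts
pipeline = PaddleOCRVL(use_queues=False)
```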
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md537-543
Each hardware target has a corresponding set of Docker images. Image tags follow the pattern `latest-<hardware-suffix>` (online) or `latest-<hardware-suffix>-offline` (offline bundle).
| Hardware | Base Image Tag Suffix | VLM Server Tag Suffix |
|---|---|---|
| NVIDIA GPU (standard) | nvidia-gpu | nvidia-gpu (vLLM/SGLang server) |
| NVIDIA GPU (Blackwell) | nvidia-gpu-sm120 | nvidia-gpu-sm120 |
| MetaX GPU | metax-gpu | metax-gpu (FastDeploy server) |
| Iluvatar GPU | iluvatar-gpu | iluvatar-gpu (FastDeploy server) |
| Huawei Ascend NPU | huawei-npu | huawei-npu (vLLM server) |
| Apple Silicon | (manual install only) | — |
Registry prefix: `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/`
The pipeline image is `paddleocr-vl`; the VLM server images are `paddleocr-genai-vllm-server` and `paddleocr-genai-fastdeploy-server`. To pin a specific PaddleOCR version, replace `latest` with `paddleocr<major>.<minor>` (e.g., `paddleocr3.3`).
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md169-202 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md33-51 docs/version3.x/pipeline_usage/PaddleOCR-VL-Iluvatar-GPU.en.md22-46 docs/version3.x/pipeline_usage/PaddleOCR-VL-Huawei-Ascend-NPU.en.md22-44
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md45-131 docs/version3.x/pipeline_usage/PaddleOCR-VL-Huawei-Ascend-NPU.en.md69-72