This page covers the available inference modes, hardware requirements, and acceleration backend options for PaddleOCR-VL. It focuses on the runtime execution layer: how to choose an inference method, what hardware it requires, and how to configure it for production use.
For the overall architecture of the PaddleOCR-VL pipeline — including its models, configuration parameters, and pipeline versions — see 2.2.1. For service deployment (Docker Compose, API endpoints, and client invocation), see 2.2.3.
PaddleOCR-VL decomposes inference into two independent parts: layout detection and VLM-based recognition.
The naming convention for inference methods follows the format `<layout detection backend> + <VLM backend>`. The baseline mode, in which both components run natively on PaddlePaddle, is referred to simply as PaddlePaddle.
The six supported inference methods are shown below with their hardware compatibility.
Inference Method × Hardware Support Matrix
| Inference Method | NVIDIA GPU | Kunlunxin XPU | Hygon DCU | MetaX GPU | Iluvatar GPU | Huawei Ascend NPU | x64 CPU | Apple Silicon |
|---|---|---|---|---|---|---|---|---|
| PaddlePaddle | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ |
| PaddlePaddle + vLLM | ✅ | 🚧 | ✅ | 🚧 | 🚧 | ✅ | — | — |
| PaddlePaddle + SGLang | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | — | — |
| PaddlePaddle + FastDeploy | ✅ | ✅ | 🚧 | ✅ | ✅ | 🚧 | — | — |
| PaddlePaddle + MLX-VLM | — | — | — | — | — | — | — | ✅ |
| PaddlePaddle + llama.cpp | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | ✅ | 🚧 |
✅ = supported, 🚧 = not yet supported, — = not applicable
Note: Huawei Ascend NPU does not support the baseline PaddlePaddle inference method; it must use the vLLM backend.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md45-131
When using NVIDIA GPUs, the minimum Compute Capability (CC) and CUDA version vary by backend.
| Backend | Min. Compute Capability | Min. CUDA Version |
|---|---|---|
| PaddlePaddle | CC ≥ 7.0 | CUDA ≥ 11.8 |
| vLLM | CC ≥ 8.0 | CUDA ≥ 12.6 |
| SGLang | 8.0 ≤ CC < 12.0 | CUDA ≥ 12.6 |
| FastDeploy | 8.0 ≤ CC < 12.0 | CUDA ≥ 12.6 |
Blackwell-architecture GPUs (CC 12.0) are not covered by these minimums and require dedicated builds (Docker image tag `latest-nvidia-gpu-sm120`).
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md133-141 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md1-19
The two-tier separation between the pipeline and the VLM inference service is a core design point. When a production acceleration framework is used, the VLM component runs as a standalone network service (default port 8118) that the pipeline queries over HTTP.
Two-tier Inference Architecture
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md129-131
The `PaddleOCRVL` class from the `paddleocr` package uses PaddlePaddle inference for both layout detection and the VLM by default. This mode is the simplest to set up but is not optimized for production throughput.
CLI usage:
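A minimal sketch, assuming the PaddleOCR 3.x CLI conventions; verify the subcommand name (`doc_parser` here) and flags against your installed version:

```shell
# Run the full PaddleOCR-VL pipeline on one input using native PaddlePaddle inference
paddleocr doc_parser -i document.png --device gpu:0
```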
Python API:
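A minimal sketch of the default (native PaddlePaddle) mode; the result-saving method names follow the usual PaddleOCR 3.x pipeline convention and should be checked against your installed version:

```python
from paddleocr import PaddleOCRVL

# Both layout detection and the VLM run natively on PaddlePaddle by default
pipeline = PaddleOCRVL(device="gpu:0")

output = pipeline.predict("document.png")
for res in output:
    res.print()                               # dump structured results to stdout
    res.save_to_json(save_path="output")      # per-page JSON
    res.save_to_markdown(save_path="output")  # reconstructed Markdown
```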
The `device` parameter accepts: `cpu`, `gpu:N`, `npu:N`, `xpu:N`, `dcu:N`, `metax_gpu:N`, `iluvatar_gpu:N`.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md236-266 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md659-688
For production use, it is strongly recommended to run the VLM as a dedicated inference service using one of the supported acceleration frameworks. The `paddleocr genai_server` CLI command starts this service.
The `genai_server` subcommand parameters are:
| Parameter | Description |
|---|---|
| `--model_name` | Model identifier (e.g., `PaddleOCR-VL-1.5-0.9B`) |
| `--model_dir` | Local directory for model weights |
| `--host` | Server bind address |
| `--port` | Server listen port |
| `--backend` | Backend name: `vllm`, `sglang`, or `fastdeploy` |
| `--backend_config` | Path to a YAML file with backend-specific parameters |
Install the relevant backend dependencies using `paddleocr install_genai_server_deps`:
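For example (the argument mirrors the `--backend` names above):

```shell
# Install server-side dependencies for the chosen backend
paddleocr install_genai_server_deps vllm
```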
Important: Because vLLM and SGLang may have dependency conflicts with the PaddlePaddle framework, install them in a separate virtual environment from the one used for PaddleOCR itself.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md1-1000 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md127-173
Once the VLM inference service is running, tell the `PaddleOCRVL` pipeline to use it via `vl_rec_backend` and `vl_rec_server_url`.
CLI:
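A sketch, assuming the server from the previous section is listening on port 8118; the subcommand name, the `vllm-server` backend value, and the `/v1` path suffix are assumptions to verify against your installed version:

```shell
paddleocr doc_parser -i document.png \
    --vl_rec_backend vllm-server \
    --vl_rec_server_url http://127.0.0.1:8118/v1
```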
Python API:
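A sketch of the equivalent Python configuration; the `vllm-server` backend value and the `/v1` path suffix are assumptions:

```python
from paddleocr import PaddleOCRVL

# Point the pipeline at the standalone VLM service; layout detection
# still runs locally with PaddlePaddle.
pipeline = PaddleOCRVL(
    vl_rec_backend="vllm-server",                  # use the value matching your server backend
    vl_rec_server_url="http://127.0.0.1:8118/v1",
)
output = pipeline.predict("document.png")
```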
For MLX-VLM on Apple Silicon, also specify the model identifier:
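A sketch for Apple Silicon; both the backend value and the parameter name carrying the model identifier are assumptions to check against the Apple Silicon guide:

```python
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL(
    vl_rec_backend="mlx-vlm",                  # assumed backend value
    vl_rec_model_name="PaddleOCR-VL-1.5-0.9B", # assumed parameter name for the model identifier
)
```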
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL-Apple-Silicon.en.md60-84 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md1-1000
vLLM is the recommended backend for production on NVIDIA GPUs. It requires FlashAttention; if CUDA compilation tools are unavailable, prebuilt wheels are available at an external repository.
Start via Docker image (NVIDIA GPU, standard architecture):
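A sketch combining the registry prefix, image name, and tag listed in the Docker-image section; the exact entrypoint/command and port mapping should be verified against the image documentation:

```shell
# Start the vLLM-backed VLM server on the default port 8118
docker run --rm --gpus all --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --backend vllm --host 0.0.0.0 --port 8118
```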
For Blackwell-architecture GPUs (RTX 50 series), use the tag `latest-nvidia-gpu-sm120`.
SGLang is supported on NVIDIA GPUs with CC in the range [8.0, 12.0). It is not yet supported on other hardware.

```shell
paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --backend sglang --port 8118
```

### FastDeploy (NVIDIA GPU, MetaX GPU, Iluvatar GPU, Kunlunxin XPU)

FastDeploy is the recommended backend for non-NVIDIA accelerators that support it.

Start via Docker image (Iluvatar GPU example).

Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL-Iluvatar-GPU.en.md76-116 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md80-173
Diagram: PaddleOCRVL constructor inference parameters
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md268-644
Notable parameters for inference tuning:
| Parameter | Type | Notes |
|---|---|---|
| `vl_rec_backend` | str | Selects the VLM backend; unset uses native PaddlePaddle |
| `vl_rec_server_url` | str | URL of the external inference service |
| `vl_rec_max_concurrency` | int | Max concurrent requests when using a service backend |
| `device` | str | Compute device; defaults to GPU 0 if available, else CPU |
| `use_tensorrt` | bool | Enables the TensorRT subgraph engine in Paddle Inference |
| `precision` | str | `fp32` or `fp16` |
| `enable_mkldnn` | bool | Enables MKL-DNN for CPU inference |
| `mkldnn_cache_capacity` | int | MKL-DNN cache size |
| `cpu_threads` | int | Threads for CPU inference |
| `enable_hpi` | bool | Enables high-performance inference mode |
| `use_queues` | bool | Enables the async pipeline with internal queues (default `True`) |
| `min_pixels` / `max_pixels` | int | Controls the image resolution sent to the VLM |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md580-644
`use_queues` and Async Execution
When `use_queues=True` (the default), data loading (e.g., PDF page rendering), layout detection, and VLM inference execute in separate threads, passing data through queues. This is especially effective when processing multi-page PDFs or directories containing many files.
Disable it only if you need deterministic sequential execution or are debugging:
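For example:

```python
from paddleocr import PaddleOCRVL

# Sequential execution: each stage finishes a page before the next stage starts
pipeline = PaddleOCRVL(use_queues=False)
```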
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md537-543
Each hardware target has a corresponding set of Docker images. Image tags follow the pattern `latest-<hardware-suffix>` (online) or `latest-<hardware-suffix>-offline` (offline bundle).
| Hardware | Base Image Tag Suffix | VLM Server Tag Suffix |
|---|---|---|
| NVIDIA GPU (standard) | nvidia-gpu | nvidia-gpu (vLLM/SGLang server) |
| NVIDIA GPU (Blackwell) | nvidia-gpu-sm120 | nvidia-gpu-sm120 |
| MetaX GPU | metax-gpu | metax-gpu (FastDeploy server) |
| Iluvatar GPU | iluvatar-gpu | iluvatar-gpu (FastDeploy server) |
| Huawei Ascend NPU | huawei-npu | huawei-npu (vLLM server) |
| Apple Silicon | (manual install only) | — |
Registry prefix: `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/`
The pipeline image is `paddleocr-vl`; the VLM server images are `paddleocr-genai-vllm-server` and `paddleocr-genai-fastdeploy-server`. To pin a specific PaddleOCR version, replace `latest` with `paddleocr<major>.<minor>` (e.g., `paddleocr3.3`).
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md169-202 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md33-51 docs/version3.x/pipeline_usage/PaddleOCR-VL-Iluvatar-GPU.en.md22-46 docs/version3.x/pipeline_usage/PaddleOCR-VL-Huawei-Ascend-NPU.en.md22-44
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md45-131 docs/version3.x/pipeline_usage/PaddleOCR-VL-Huawei-Ascend-NPU.en.md69-72