This page covers the purpose, capabilities, architecture overview, hardware support, and basic usage of PaddleOCR-VL. It is a parent document; implementation details for specific concerns are covered in child pages: architecture and configuration in 2.2.1, inference acceleration in 2.2.2, and service deployment in 2.2.3. For the PP-OCRv5 traditional OCR pipeline (text detection + recognition without a VLM), see 2.1.
PaddleOCR-VL is a document parsing pipeline built around a compact Vision-Language Model (VLM). Its core model, PaddleOCR-VL-0.9B, combines a NaViT-style dynamic-resolution vision encoder with the lightweight ERNIE-4.5-0.3B language model.
The model is under continuous development. As of January 2026, PaddleOCR-VL-1.5 is the current default (pipeline_version="v1.5"). It reaches 94.5% accuracy on OmniDocBench v1.5 and adds quad/polygon bounding-box output and seal recognition.
The pipeline wraps the VLM with optional preprocessing stages (document orientation correction, image unwarping) and a layout detection module that segments the page before handing regions to the VLM.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:1-14
Figure 1: PaddleOCR-VL doc_parser pipeline stages
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:22-40, docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:236-267
When use_layout_detection=True (default), the layout detection model partitions the page into blocks (text, table, image, formula, etc.), and the VLM processes each block individually. This increases accuracy on complex documents.
When use_layout_detection=False, the full page image is passed directly to the VLM with a task-specific prompt selected by prompt_label.
The use_queues=True setting (default) enables asynchronous threading: PDF rendering, layout detection, and VLM inference run in separate threads connected by queues, improving throughput on large PDFs or batches.
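The pattern can be sketched with stdlib threads and queues; placeholder string processing stands in for real PDF rendering, layout detection, and VLM inference:

```python
# Minimal sketch of the use_queues=True pattern: each stage runs in its own
# thread and hands work items downstream via a FIFO queue, so a later page
# can be rendered while an earlier one is still being parsed.
import queue
import threading

SENTINEL = None  # signals end of the stream to downstream stages

def stage(in_q, out_q, fn):
    """Pull items from in_q, transform them, push to out_q until SENTINEL."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(fn(item))

render_q, layout_q, result_q = queue.Queue(), queue.Queue(), queue.Queue()

threads = [
    threading.Thread(target=stage, args=(render_q, layout_q, lambda p: f"{p}:rendered")),
    threading.Thread(target=stage, args=(layout_q, result_q, lambda p: f"{p}:parsed")),
]
for t in threads:
    t.start()

# Feed the pages, then close the stream.
for page in ["page1", "page2", "page3"]:
    render_q.put(page)
render_q.put(SENTINEL)

results = []
while True:
    item = result_q.get()
    if item is SENTINEL:
        break
    results.append(item)

for t in threads:
    t.join()

print(results)  # -> ['page1:rendered:parsed', 'page2:rendered:parsed', 'page3:rendered:parsed']
```

With one worker per stage and FIFO queues, output order matches input order; the real pipeline applies the same idea to its rendering, layout, and VLM stages.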
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:536-543
The primary class is PaddleOCRVL, imported from the paddleocr package.
For multi-page PDFs, use restructure_pages() to merge cross-page tables, rebuild heading levels, or concatenate all pages:
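A minimal end-to-end sketch of the above. predict() and the per-result save helpers follow the documented pattern; the exact restructure_pages() signature shown here is an assumption, so check the API reference for the precise form:

```python
# Hypothetical usage sketch of the PaddleOCRVL pipeline class.
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL()  # defaults to pipeline_version="v1.5"
results = list(pipeline.predict("report.pdf"))

for res in results:
    res.save_to_json("output/")      # per-page structured results
    res.save_to_markdown("output/")  # per-page Markdown

# Merge cross-page tables / rebuild headings / concatenate all pages.
# The argument shape is assumed for illustration.
merged = pipeline.restructure_pages([res.markdown for res in results])
```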
The CLI entry point is:
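A hedged sketch of the invocation (doc_parser is the pipeline's CLI name per the figure above; the flag spellings mirror the Python parameters and are assumptions here, so verify against `paddleocr doc_parser --help`):

```shell
# Parse a document with the PaddleOCR-VL pipeline from the command line.
paddleocr doc_parser -i ./demo.pdf \
    --pipeline_version v1.5 \
    --use_doc_orientation_classify False \
    --save_path ./output/
```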
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:659-706
Figure 2: Key code entities and configuration parameters of PaddleOCRVL
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:268-643
PaddleOCR-VL separates inference into two concerns: the layout detection model always uses PaddlePaddle, while the VLM component can use one of six backends. The notation PaddlePaddle + vLLM means layout detection runs in PaddlePaddle and VLM inference runs in vLLM.
| Inference Method | NVIDIA GPU | Kunlunxin XPU | Hygon DCU | MetaX GPU | Iluvatar GPU | Huawei Ascend NPU | x64 CPU | Apple Silicon |
|---|---|---|---|---|---|---|---|---|
| PaddlePaddle | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ |
| PaddlePaddle + vLLM | ✅ | 🚧 | ✅ | 🚧 | 🚧 | ✅ | — | — |
| PaddlePaddle + SGLang | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | — | — |
| PaddlePaddle + FastDeploy | ✅ | ✅ | 🚧 | ✅ | ✅ | 🚧 | — | — |
| PaddlePaddle + MLX-VLM | — | — | — | — | — | — | — | ✅ |
| PaddlePaddle + llama.cpp | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | ✅ | 🚧 |
✅ = supported · 🚧 = in progress · — = not applicable
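Selecting the VLM backend can be sketched as below. The layout detection model stays on PaddlePaddle either way; the vl_rec_backend string and URL path shown are assumptions patterned on the backend names above:

```python
from paddleocr import PaddleOCRVL

# Delegate the VLM step to a running vLLM server; layout detection
# still runs locally in PaddlePaddle. Backend string and URL shape
# are assumptions for illustration.
pipeline = PaddleOCRVL(
    vl_rec_backend="vllm-server",
    vl_rec_server_url="http://127.0.0.1:8118/v1",
)
```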
NVIDIA GPU compute capability requirements:
| Backend | Min Compute Capability | Min CUDA |
|---|---|---|
| PaddlePaddle | CC ≥ 7.0 | 11.8 |
| vLLM | CC ≥ 8.0 | 12.6 |
| SGLang | 8.0 ≤ CC < 12.0 | 12.6 |
| FastDeploy | 8.0 ≤ CC < 12.0 | 12.6 |
NVIDIA Blackwell-architecture GPUs (RTX 50xx series, CC 12.x) require the separate latest-nvidia-gpu-sm120 Docker image; see docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md
vLLM, SGLang, and FastDeploy cannot run natively on Windows — use the provided Docker images.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:41-157
Each hardware platform has a dedicated tutorial that mirrors the general flow but uses platform-specific Docker images and install commands.
| Hardware | Key Accelerator Backend | Docker Image Tag Suffix |
|---|---|---|
| NVIDIA GPU (standard) | vLLM or SGLang | latest-nvidia-gpu |
| NVIDIA Blackwell (RTX 50xx) | vLLM | latest-nvidia-gpu-sm120 |
| Kunlunxin XPU | FastDeploy | latest-kunlunxin-xpu |
| Hygon DCU | vLLM | Platform-specific |
| MetaX GPU | FastDeploy | latest-metax-gpu |
| Iluvatar GPU | FastDeploy | latest-iluvatar-gpu |
| Huawei Ascend NPU | vLLM | latest-huawei-npu |
| Apple Silicon | MLX-VLM | N/A (manual install only) |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:143-157
The pipeline_version parameter selects which model weights are used:
| Value | Model | Notes |
|---|---|---|
| "v1.5" (default) | PaddleOCR-VL-1.5-0.9B | Current default; supports quad/poly bounding boxes, seal recognition |
| "v1" | PaddleOCR-VL-0.9B | Original release |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:296-303
| Parameter | Type | Default | Purpose |
|---|---|---|---|
| pipeline_version | str | "v1.5" | Selects v1 or v1.5 model weights |
| use_doc_orientation_classify | bool | False | Enable document rotation correction |
| use_doc_unwarping | bool | False | Enable image dewarping |
| use_layout_detection | bool | True | Enable layout segmentation before the VLM |
| use_chart_recognition | bool | False | Enable chart/figure parsing |
| use_seal_recognition | bool | False | Enable seal/stamp recognition |
| use_queues | bool | True | Enable asynchronous multi-threaded processing |
| layout_shape_mode | str | "auto" | Bounding-box shape: rect, quad, poly, or auto |
| vl_rec_backend | str | None | VLM inference backend override |
| vl_rec_server_url | str | None | URL of a remote VLM inference service |
| device | str | None | Target device: cpu, gpu:0, xpu:0, etc. |
| prompt_label | str | None | Task prompt when use_layout_detection=False |
| merge_layout_blocks | bool | True | Merge cross-column layout boxes |
| format_block_content | bool | False | Format block content as Markdown |
For the full parameter table including sampling parameters (temperature, top_p, repetition_penalty, min_pixels, max_pixels) and model path overrides, see 2.2.1.
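Putting several of these parameters together in a constructor call (a sketch; the table's defaults apply for anything omitted):

```python
from paddleocr import PaddleOCRVL

# Enable the optional preprocessing and recognition stages for scanned
# or photographed documents; parameter names are from the table above.
pipeline = PaddleOCRVL(
    pipeline_version="v1.5",            # v1.5 weights (the default)
    use_doc_orientation_classify=True,  # correct rotated pages
    use_doc_unwarping=True,             # dewarp photographed pages
    use_chart_recognition=True,         # parse charts as well
    device="gpu:0",
)
```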
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:268-643
When deployed as a service, PaddleOCR-VL runs as two cooperating containers:
Figure 3: Service deployment containers and their roles
The paddleocr-vlm-server container is started with paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --backend vllm --port 8118. This distinction matters: the VLM inference service handles only the VLM step; the pipeline service (api) orchestrates preprocessing, layout detection, and VLM calls.
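The server command from the source, reformatted for readability:

```shell
# Start the standalone VLM inference service (handles only the VLM step;
# the pipeline service orchestrates everything else and calls into it).
paddleocr genai_server \
    --model_name PaddleOCR-VL-1.5-0.9B \
    --backend vllm \
    --port 8118
```

The pipeline service then reaches this server via vl_rec_server_url (e.g. http://127.0.0.1:8118/v1; the exact URL path is an assumption).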
For full deployment instructions including Docker Compose configuration and manual deployment, see 2.2.3.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md:183-218, docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:30-39
| Page | Contents |
|---|---|
| 2.2.1 Architecture and Configuration | Detailed parameter tables, pipeline YAML configuration, layout shape modes, block merging, and per-version model details |
| 2.2.2 Inference and Acceleration | Starting genai_server, configuring vLLM/SGLang/FastDeploy/MLX-VLM backends, performance tuning, and install_genai_server_deps CLI command |
| 2.2.3 Service Deployment | Docker Compose deployment, manual deployment, client-side invocation examples, and pipeline configuration adjustment |