This page covers the purpose, capabilities, architecture overview, hardware support, and basic usage of PaddleOCR-VL. It is a parent document; implementation details for specific concerns are covered in child pages: architecture and configuration in 2.2.1, inference acceleration in 2.2.2, and service deployment in 2.2.3. For the PP-OCRv5 traditional OCR pipeline (text detection + recognition without a VLM), see 2.1.
PaddleOCR-VL is a document parsing pipeline built around a compact Vision-Language Model (VLM). Its core model, PaddleOCR-VL-0.9B, combines a NaViT-style dynamic-resolution vision encoder with the lightweight ERNIE-4.5-0.3B language model.
The model is under continuous development. As of January 2026, PaddleOCR-VL-1.5 is the current default (pipeline_version="v1.5"). It reaches 94.5% accuracy on OmniDocBench v1.5 and adds quad/polygon bounding-box output and seal recognition.
The pipeline wraps the VLM with optional preprocessing stages (document orientation correction, image unwarping) and a layout detection module that segments the page before handing regions to the VLM.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:1-14
Figure 1: PaddleOCR-VL doc_parser pipeline stages
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:22-40, docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:236-267
When use_layout_detection=True (default), the layout detection model partitions the page into blocks (text, table, image, formula, etc.), and the VLM processes each block individually. This increases accuracy on complex documents.
When use_layout_detection=False, the full page image is passed directly to the VLM with a task-specific prompt selected by prompt_label.
The use_queues=True setting (default) enables asynchronous threading: PDF rendering, layout detection, and VLM inference run in separate threads connected by queues, improving throughput on large PDFs or batches.
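The pattern can be sketched with stdlib threads and queues; placeholder string processing stands in for real PDF rendering, layout detection, and VLM inference:

```python
# Minimal sketch of the use_queues=True pattern: each stage runs in its own
# thread and hands work items downstream via a FIFO queue, so a later page
# can be rendered while an earlier one is still being parsed.
import queue
import threading

SENTINEL = None  # signals end of the stream to downstream stages

def stage(in_q, out_q, fn):
    """Pull items from in_q, transform them, push to out_q until SENTINEL."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(fn(item))

render_q, layout_q, result_q = queue.Queue(), queue.Queue(), queue.Queue()

threads = [
    threading.Thread(target=stage, args=(render_q, layout_q, lambda p: f"{p}:rendered")),
    threading.Thread(target=stage, args=(layout_q, result_q, lambda p: f"{p}:parsed")),
]
for t in threads:
    t.start()

# Feed the pages, then close the stream.
for page in ["page1", "page2", "page3"]:
    render_q.put(page)
render_q.put(SENTINEL)

results = []
while True:
    item = result_q.get()
    if item is SENTINEL:
        break
    results.append(item)

for t in threads:
    t.join()

print(results)  # -> ['page1:rendered:parsed', 'page2:rendered:parsed', 'page3:rendered:parsed']
```

With one worker per stage and FIFO queues, output order matches input order; the real pipeline applies the same idea to its rendering, layout, and VLM stages.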
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:536-543
The primary class is PaddleOCRVL, imported from the paddleocr package.
For multi-page PDFs, use restructure_pages() to merge cross-page tables, rebuild heading levels, or concatenate all pages:
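A minimal end-to-end sketch of the above. predict() and the per-result save helpers follow the documented pattern; the exact restructure_pages() signature shown here is an assumption, so check the API reference for the precise form:

```python
# Hypothetical usage sketch of the PaddleOCRVL pipeline class.
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL()  # defaults to pipeline_version="v1.5"
results = list(pipeline.predict("report.pdf"))

for res in results:
    res.save_to_json("output/")      # per-page structured results
    res.save_to_markdown("output/")  # per-page Markdown

# Merge cross-page tables / rebuild headings / concatenate all pages.
# The argument shape is assumed for illustration.
merged = pipeline.restructure_pages([res.markdown for res in results])
```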
The CLI entry point is:
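A hedged sketch of the invocation (doc_parser is the pipeline's CLI name per the figure above; the flag spellings mirror the Python parameters and are assumptions here, so verify against `paddleocr doc_parser --help`):

```shell
# Parse a document with the PaddleOCR-VL pipeline from the command line.
paddleocr doc_parser -i ./demo.pdf \
    --pipeline_version v1.5 \
    --use_doc_orientation_classify False \
    --save_path ./output/
```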
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:659-706
Figure 2: Key code entities and configuration parameters of PaddleOCRVL
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:268-643
PaddleOCR-VL separates inference into two concerns: the layout detection model always uses PaddlePaddle, while the VLM component can use one of six backends. The notation PaddlePaddle + vLLM means layout detection runs in PaddlePaddle and VLM inference runs in vLLM.
| Inference Method | NVIDIA GPU | Kunlunxin XPU | Hygon DCU | MetaX GPU | Iluvatar GPU | Huawei Ascend NPU | x64 CPU | Apple Silicon |
|---|---|---|---|---|---|---|---|---|
| PaddlePaddle | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ |
| PaddlePaddle + vLLM | ✅ | 🚧 | ✅ | 🚧 | 🚧 | ✅ | — | — |
| PaddlePaddle + SGLang | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | — | — |
| PaddlePaddle + FastDeploy | ✅ | ✅ | 🚧 | ✅ | ✅ | 🚧 | — | — |
| PaddlePaddle + MLX-VLM | — | — | — | — | — | — | — | ✅ |
| PaddlePaddle + llama.cpp | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | ✅ | 🚧 |
✅ = supported · 🚧 = in progress · — = not applicable
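Selecting the VLM backend can be sketched as below. The layout detection model stays on PaddlePaddle either way; the vl_rec_backend string and URL path shown are assumptions patterned on the backend names above:

```python
from paddleocr import PaddleOCRVL

# Delegate the VLM step to a running vLLM server; layout detection
# still runs locally in PaddlePaddle. Backend string and URL shape
# are assumptions for illustration.
pipeline = PaddleOCRVL(
    vl_rec_backend="vllm-server",
    vl_rec_server_url="http://127.0.0.1:8118/v1",
)
```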
NVIDIA GPU compute capability requirements:
| Backend | Min Compute Capability | Min CUDA |
|---|---|---|
| PaddlePaddle | CC ≥ 7.0 | 11.8 |
| vLLM | CC ≥ 8.0 | 12.6 |
| SGLang | 8.0 ≤ CC < 12.0 | 12.6 |
| FastDeploy | 8.0 ≤ CC < 12.0 | 12.6 |
NVIDIA Blackwell-architecture GPUs (RTX 50xx series, CC 12.x) require the separate latest-nvidia-gpu-sm120 Docker image; see docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md
vLLM, SGLang, and FastDeploy cannot run natively on Windows — use the provided Docker images.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:41-157
Each hardware platform has a dedicated tutorial that mirrors the general flow but uses platform-specific Docker images and install commands.
| Hardware | Key Accelerator Backend | Docker Image Tag Suffix |
|---|---|---|
| NVIDIA GPU (standard) | vLLM or SGLang | latest-nvidia-gpu |
| NVIDIA Blackwell (RTX 50xx) | vLLM | latest-nvidia-gpu-sm120 |
| Kunlunxin XPU | FastDeploy | latest-kunlunxin-xpu |
| Hygon DCU | vLLM | Platform-specific |
| MetaX GPU | FastDeploy | latest-metax-gpu |
| Iluvatar GPU | FastDeploy | latest-iluvatar-gpu |
| Huawei Ascend NPU | vLLM | latest-huawei-npu |
| Apple Silicon | MLX-VLM | N/A (manual install only) |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:143-157
The pipeline_version parameter selects which model weights are used:
| Value | Model | Notes |
|---|---|---|
| "v1.5" (default) | PaddleOCR-VL-1.5-0.9B | Current default; supports quad/poly bounding boxes, seal recognition |
| "v1" | PaddleOCR-VL-0.9B | Original release |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:296-303
| Parameter | Type | Default | Purpose |
|---|---|---|---|
| pipeline_version | str | "v1.5" | Selects v1 or v1.5 model weights |
| use_doc_orientation_classify | bool | False | Enable document rotation correction |
| use_doc_unwarping | bool | False | Enable image dewarping |
| use_layout_detection | bool | True | Enable layout segmentation before the VLM |
| use_chart_recognition | bool | False | Enable chart/figure parsing |
| use_seal_recognition | bool | False | Enable seal/stamp recognition |
| use_queues | bool | True | Enable asynchronous multi-threaded processing |
| layout_shape_mode | str | "auto" | Bounding-box shape: rect, quad, poly, or auto |
| vl_rec_backend | str | None | VLM inference backend override |
| vl_rec_server_url | str | None | URL of a remote VLM inference service |
| device | str | None | Target device: cpu, gpu:0, xpu:0, etc. |
| prompt_label | str | None | Task prompt when use_layout_detection=False |
| merge_layout_blocks | bool | True | Merge cross-column layout boxes |
| format_block_content | bool | False | Format block content as Markdown |
For the full parameter table including sampling parameters (temperature, top_p, repetition_penalty, min_pixels, max_pixels) and model path overrides, see 2.2.1.
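Putting several of these parameters together in a constructor call (a sketch; the table's defaults apply for anything omitted):

```python
from paddleocr import PaddleOCRVL

# Enable the optional preprocessing and recognition stages for scanned
# or photographed documents; parameter names are from the table above.
pipeline = PaddleOCRVL(
    pipeline_version="v1.5",            # v1.5 weights (the default)
    use_doc_orientation_classify=True,  # correct rotated pages
    use_doc_unwarping=True,             # dewarp photographed pages
    use_chart_recognition=True,         # parse charts as well
    device="gpu:0",
)
```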
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:268-643
When deployed as a service, PaddleOCR-VL runs as two cooperating containers:
Figure 3: Service deployment containers and their roles
The paddleocr-vlm-server container is started with paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --backend vllm --port 8118. This distinction matters: the VLM inference service handles only the VLM step; the pipeline service (api) orchestrates preprocessing, layout detection, and VLM calls.
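The server command from the source, reformatted for readability:

```shell
# Start the standalone VLM inference service (handles only the VLM step;
# the pipeline service orchestrates everything else and calls into it).
paddleocr genai_server \
    --model_name PaddleOCR-VL-1.5-0.9B \
    --backend vllm \
    --port 8118
```

The pipeline service then reaches this server via vl_rec_server_url (e.g. http://127.0.0.1:8118/v1; the exact URL path is an assumption).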
For full deployment instructions including Docker Compose configuration and manual deployment, see 2.2.3.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md:183-218, docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md:30-39
| Page | Contents |
|---|---|
| 2.2.1 Architecture and Configuration | Detailed parameter tables, pipeline YAML configuration, layout shape modes, block merging, and per-version model details |
| 2.2.2 Inference and Acceleration | Starting genai_server, configuring vLLM/SGLang/FastDeploy/MLX-VLM backends, performance tuning, and install_genai_server_deps CLI command |
| 2.2.3 Service Deployment | Docker Compose deployment, manual deployment, client-side invocation examples, and pipeline configuration adjustment |