This document explains the internal architecture of the PaddleOCR-VL pipeline, including its core Vision-Language Model components (NaViT encoder and ERNIE-4.5-0.3B language model), configuration parameters, pipeline initialization process, and inference backend options.
For usage instructions and deployment guides, see 2.2 PaddleOCR-VL Vision-Language Model. For inference acceleration methods, see 2.2.2 PaddleOCR-VL Inference and Acceleration. For service deployment, see 2.2.3 PaddleOCR-VL Service Deployment.
PaddleOCR-VL is an advanced document parsing pipeline specifically designed for multilingual document element recognition. It combines a Vision-Language Model (VLM) with layout analysis to achieve state-of-the-art performance on document understanding tasks.
Key Characteristics:
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md1-14
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md7-14 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md280-643
The core of PaddleOCR-VL is a Vision-Language Model consisting of two main components:
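- NaViT encoder: the vision component; input resolution is bounded by the min_pixels and max_pixels parameters
- ERNIE-4.5-0.3B language model: the text-generation component; decoding is controlled by repetition_penalty, temperature, and top_p

Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md7-9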
Layout Detection Configuration:
| Parameter | Description | Type | Default |
|---|---|---|---|
| layout_detection_model_name | Name of the layout detection model | str | Default model |
| layout_detection_model_dir | Directory path for custom model | str | Official model downloaded |
| layout_threshold | Score threshold (0-1) | float | 0.5 |
| layout_nms | Enable NMS post-processing | bool | Default value |
| layout_unclip_ratio | Box expansion coefficient | float | Default value |
| layout_merge_bboxes_mode | Box merging mode: large, small, union | str | Default value |
| layout_shape_mode | Geometric representation: rect, quad, poly, auto | str | "auto" |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md305-356 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md504-535
Preprocessing Module Configuration:
| Parameter | Description | Type | Default |
|---|---|---|---|
| use_doc_orientation_classify | Enable document orientation classification | bool | False |
| doc_orientation_classify_model_name | Model name | str | Default model |
| doc_orientation_classify_model_dir | Model directory | str | Official model |
| use_doc_unwarping | Enable document unwarping | bool | False |
| doc_unwarping_model_name | Model name | str | Default model |
| doc_unwarping_model_dir | Model directory | str | Official model |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md404-450
The PaddleOCRVL class serves as the main entry point for the pipeline:
Key Initialization Parameters:
| Parameter | Description | Type | Default |
|---|---|---|---|
| pipeline_version | Pipeline version: "v1" or "v1.5" | str | "v1.5" |
| device | Inference device: cpu, gpu:N, xpu:N, npu:N, etc. | str | Auto-detect |
| vl_rec_backend | VLM inference backend | str | "paddlepaddle" |
| vl_rec_model_name | VLM model name | str | Default model |
| vl_rec_model_dir | VLM model directory | str | Official model |
| use_layout_detection | Enable layout detection | bool | True |
| use_queues | Enable internal queues for async processing | bool | True |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md661-688 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md725-754
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md9-14 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md296-303
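A minimal sketch of using the entry point, assuming the `paddleocr` package is installed and exposes `PaddleOCRVL` with a `predict` method whose results can be saved as JSON and Markdown. The helper names `init_kwargs` and `parse_document` are hypothetical, and only a subset of the parameters from the table is passed.

```python
def init_kwargs(device=None, use_layout_detection=True):
    """Collect constructor arguments named in the table above.

    Parameters left as None are omitted so the pipeline's own defaults apply.
    """
    kwargs = {"use_layout_detection": use_layout_detection}
    if device is not None:
        kwargs["device"] = device
    return kwargs


def parse_document(path, save_dir="output", **kwargs):
    """Run the pipeline on one file and save each page result (hypothetical helper)."""
    # Imported lazily so the module can be loaded without paddleocr installed.
    from paddleocr import PaddleOCRVL

    pipeline = PaddleOCRVL(**kwargs)
    results = []
    for res in pipeline.predict(path):
        res.save_to_json(save_path=save_dir)
        res.save_to_markdown(save_path=save_dir)
        results.append(res)
    return results
```

A call such as `parse_document("sample.pdf", **init_kwargs(device="gpu:0"))` would then run the full pipeline; the file name is a placeholder.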
PaddleOCR-VL supports multiple inference backends for the VLM component, allowing optimization for different hardware platforms and performance requirements.
| Backend | NVIDIA GPU | Kunlunxin XPU | Hygon DCU | MetaX GPU | Iluvatar GPU | Ascend NPU | x64 CPU | Apple Silicon |
|---|---|---|---|---|---|---|---|---|
| PaddlePaddle | ✅ (CC≥7.0, CUDA≥11.8) | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ |
| vLLM | ✅ (CC≥8.0, CUDA≥12.6) | 🚧 | ✅ | 🚧 | 🚧 | ✅ | - | - |
| SGLang | ✅ (8.0≤CC<12.0, CUDA≥12.6) | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | - | - |
| FastDeploy | ✅ (8.0≤CC<12.0, CUDA≥12.6) | ✅ | 🚧 | ✅ | ✅ | 🚧 | - | - |
| MLX-VLM | - | - | - | - | - | - | - | ✅ |
| llama.cpp | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | ✅ | 🚧 |
Legend: ✅ Supported | 🚧 Under development | - Not supported
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md45-142
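The support matrix above can be encoded as a small lookup table when choosing a backend programmatically. This sketch only transcribes the table (without the compute-capability and CUDA version constraints); the key names and the helpers `status` and `backends_for` are hypothetical and not part of PaddleOCR.

```python
SUPPORTED, IN_PROGRESS, UNSUPPORTED = "supported", "under development", "not supported"

# Column order follows the table above.
HARDWARE = [
    "nvidia_gpu", "kunlunxin_xpu", "hygon_dcu", "metax_gpu",
    "iluvatar_gpu", "ascend_npu", "x64_cpu", "apple_silicon",
]

# One character per column: Y = supported, W = under development, N = not supported.
_ROWS = {
    "paddlepaddle": "YYYYYWYY",
    "vllm":         "YWYWWYNN",
    "sglang":       "YWWWWWNN",
    "fastdeploy":   "YYWYYWNN",
    "mlx-vlm":      "NNNNNNNY",
    "llama.cpp":    "YWWWWWYW",
}
_CODES = {"Y": SUPPORTED, "W": IN_PROGRESS, "N": UNSUPPORTED}


def status(backend, hardware):
    """Return the support status of a backend on a hardware platform."""
    return _CODES[_ROWS[backend][HARDWARE.index(hardware)]]


def backends_for(hardware):
    """List the backends marked as supported on the given hardware."""
    return [name for name in _ROWS if status(name, hardware) == SUPPORTED]
```

For example, `backends_for("apple_silicon")` yields the PaddlePaddle and MLX-VLM rows, matching the last column of the table.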
PaddlePaddle Backend (Default):
vLLM/SGLang/FastDeploy Backend (Server Mode):
MLX-VLM Backend (Apple Silicon):
Backend Configuration Parameters:
| Parameter | Description | Type | Example |
|---|---|---|---|
| vl_rec_backend | Backend name | str | "paddlepaddle", "vllm", "sglang", "fastdeploy", "mlx-vlm-server" |
| vl_rec_server_url | Server URL (for server-based backends) | str | "http://localhost:8118" |
| vl_rec_max_concurrency | Max concurrent requests | int | 10 |
| vl_rec_api_model_name | Model name for server | str | "PaddleOCR-VL-1.5-0.9B" |
| vl_rec_api_key | API key (if required) | str | API key string |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md373-402 docs/version3.x/pipeline_usage/PaddleOCR-VL-Apple-Silicon.en.md62-84
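For a server-based backend, the parameters above travel together. The following sketch assembles and sanity-checks them before they are handed to the pipeline. The backend strings are copied from the Example column and should be checked against the installed version; `server_backend_kwargs` is a hypothetical helper, not a PaddleOCR function.

```python
from urllib.parse import urlparse

# Server-based backend names as listed in the Example column (assumed to be exact).
SERVER_BACKENDS = {"vllm", "sglang", "fastdeploy", "mlx-vlm-server"}


def server_backend_kwargs(backend, server_url, max_concurrency=None,
                          api_model_name=None, api_key=None):
    """Build keyword arguments for a server-based VLM backend."""
    if backend not in SERVER_BACKENDS:
        raise ValueError(f"{backend!r} is not a server-based backend")
    parsed = urlparse(server_url)
    if parsed.scheme not in {"http", "https"} or not parsed.netloc:
        raise ValueError(f"invalid server URL: {server_url!r}")
    if max_concurrency is not None and max_concurrency < 1:
        raise ValueError("vl_rec_max_concurrency must be at least 1")

    kwargs = {"vl_rec_backend": backend, "vl_rec_server_url": server_url}
    # Optional settings are only included when explicitly provided.
    if max_concurrency is not None:
        kwargs["vl_rec_max_concurrency"] = max_concurrency
    if api_model_name is not None:
        kwargs["vl_rec_api_model_name"] = api_model_name
    if api_key is not None:
        kwargs["vl_rec_api_key"] = api_key
    return kwargs
```

Validating the URL up front gives an immediate error for a malformed address rather than a connection failure during the first request.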
VLM Model Configuration:
| Parameter | Description | Type | Purpose |
|---|---|---|---|
| vl_rec_model_name | Name of the VLM model | str | Specifies which pre-trained model to use |
| vl_rec_model_dir | Path to model directory | str | Custom model path (overrides model_name) |
Sampling Parameters:
| Parameter | Description | Type | Range | Purpose |
|---|---|---|---|---|
| repetition_penalty | Repetition penalty factor | float | >0 | Reduces repetitive text generation |
| temperature | Sampling temperature | float | >0 | Controls randomness (lower = more deterministic) |
| top_p | Nucleus sampling parameter | float | 0-1 | Cumulative probability threshold |
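
To make the effect of the three sampling parameters concrete, here is a sketch of the textbook formulation: a repetition penalty applied to already-generated tokens, temperature scaling, then nucleus (top-p) filtering. This is an illustration under those assumptions and is not confirmed to be how any PaddleOCR-VL backend implements decoding; `next_token_distribution` is a hypothetical name.

```python
import math


def next_token_distribution(logits, generated, repetition_penalty=1.0,
                            temperature=1.0, top_p=1.0):
    """Return {token_id: probability} after penalty, temperature, and top-p."""
    adjusted = list(logits)
    # Penalise tokens that were already generated: shrink positive logits,
    # push negative logits further down.
    for tok in set(generated):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty

    # Temperature-scaled softmax (max subtracted for numerical stability).
    scaled = [x / temperature for x in adjusted]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus filtering: keep the most probable tokens until their
    # cumulative probability reaches top_p, then renormalise.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}
```

Lowering `temperature` concentrates probability on the top token, lowering `top_p` discards the low-probability tail, and raising `repetition_penalty` above 1 makes previously emitted tokens less likely.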
Image Processing Parameters:
| Parameter | Description | Type | Purpose |
|---|---|---|---|
| min_pixels | Minimum pixel count for image preprocessing | int | Lower bound for image resizing |
| max_pixels | Maximum pixel count for image preprocessing | int | Upper bound for image resizing |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md358-378 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md544-578
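One way to read min_pixels and max_pixels is as a budget on the total pixel count (width times height) of the image passed to the encoder. The sketch below rescales an image into that budget while preserving aspect ratio. It is an assumption about the general technique, not the pipeline's actual preprocessing, which may additionally round dimensions to patch-size multiples; `fit_to_pixel_budget` is a hypothetical helper.

```python
import math


def fit_to_pixel_budget(width, height, min_pixels, max_pixels):
    """Return (width, height) rescaled so the area lies within the pixel budget."""
    area = width * height
    if area > max_pixels:
        # Shrink: scale both sides by sqrt(max_pixels / area).
        scale = math.sqrt(max_pixels / area)
        return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))
    if area < min_pixels:
        # Enlarge: scale both sides by sqrt(min_pixels / area).
        scale = math.sqrt(min_pixels / area)
        return math.ceil(width * scale), math.ceil(height * scale)
    # Already inside the budget: leave unchanged.
    return width, height
```

Scaling by the square root of the area ratio is what keeps the aspect ratio fixed: both sides change by the same factor, so the area changes by that factor squared.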
Module Toggle Parameters:
| Parameter | Description | Type | Default | Impact |
|---|---|---|---|---|
| use_layout_detection | Enable layout detection module | bool | True | Processes regions separately vs. full image |
| use_doc_orientation_classify | Enable orientation classification | bool | False | Corrects document rotation before processing |
| use_doc_unwarping | Enable document unwarping | bool | False | Corrects curved/warped documents |
| use_chart_recognition | Enable chart recognition | bool | False | Extracts chart content |
| use_seal_recognition | Enable seal recognition (v1.5) | bool | False | Recognizes seal/stamp elements |
| use_ocr_for_image_block | OCR text within image blocks | bool | False | Extracts text from images |
Processing Behavior Parameters:
| Parameter | Description | Type | Default | Purpose |
|---|---|---|---|---|
| use_queues | Enable internal async queues | bool | True | Improves efficiency for multi-page PDFs |
| format_block_content | Format block content as Markdown | bool | False | Controls output formatting |
| merge_layout_blocks | Merge cross-column layout blocks | bool | True | Handles multi-column layouts |
| markdown_ignore_labels | Layout labels to ignore in Markdown | list[str] | ['number','footnote','header',...] | Filters certain elements from output |
Layout Shape Mode:
| Value | Description | Use Case |
|---|---|---|
| "rect" | Axis-aligned bounding boxes (x1, y1, x2, y2) | Standard horizontal layouts |
| "quad" | Arbitrary quadrilateral (4 vertices) | Skewed or perspective-distorted regions |
| "poly" | Closed contour (multiple points) | Irregular/curved layout elements |
| "auto" | Automatic selection | System chooses based on complexity |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md430-580 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md504-542
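The relationship between the shape representations can be illustrated with two small conversions: a rect expands to a four-vertex quad, and any quad or poly collapses to its enclosing rect. The vertex order (clockwise from top-left) and both helper names are assumptions for illustration and are not taken from the pipeline's output format.

```python
def rect_to_quad(x1, y1, x2, y2):
    """Expand an axis-aligned box into four vertices, clockwise from top-left."""
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]


def points_to_rect(points):
    """Collapse a quad or polygon into its axis-aligned bounding box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))
```

Going from rect to quad loses nothing, while going from quad or poly to rect discards skew and contour detail, which is why the richer modes are suggested for distorted or irregular regions.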
Device Specification Format:
| Device Type | Format | Example |
|---|---|---|
| CPU | "cpu" | device="cpu" |
| NVIDIA GPU | "gpu:N" | device="gpu:0" |
| Kunlunxin XPU | "xpu:N" | device="xpu:0" |
| Huawei Ascend NPU | "npu:N" | device="npu:0" |
| Cambricon MLU | "mlu:N" | device="mlu:0" |
| Hygon DCU | "dcu:N" | device="dcu:0" |
| MetaX GPU | "metax_gpu:N" | device="metax_gpu:0" |
| Iluvatar GPU | "iluvatar_gpu:N" | device="iluvatar_gpu:0" |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md552-595
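The device strings in the table follow a `type:index` pattern. A minimal parser for that pattern might look like the sketch below; `parse_device` is a hypothetical helper, and treating a missing index as `None` is an assumption about how to represent it.

```python
# Device type prefixes taken from the table above.
KNOWN_DEVICE_TYPES = {
    "cpu", "gpu", "xpu", "npu", "mlu", "dcu", "metax_gpu", "iluvatar_gpu",
}


def parse_device(spec):
    """Split a device string such as "gpu:0" into (type, index)."""
    kind, sep, index = spec.partition(":")
    if kind not in KNOWN_DEVICE_TYPES:
        raise ValueError(f"unknown device type: {kind!r}")
    if not sep:
        # No index given, e.g. "cpu".
        return kind, None
    if not index.isdigit():
        raise ValueError(f"device index must be a non-negative integer: {index!r}")
    return kind, int(index)
```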
Inference Performance Configuration:
| Parameter | Description | Type | Default | Hardware Support |
|---|---|---|---|---|
| enable_hpi | Enable high-performance inference | bool | False | GPU |
| use_tensorrt | Enable TensorRT sub-graph engine | bool | False | NVIDIA GPU (TensorRT 8.6.1.6+) |
| precision | Computational precision | str | "fp32" | "fp32", "fp16", "int8" |
| enable_mkldnn | Enable MKL-DNN acceleration | bool | False | CPU (x64) |
| mkldnn_cache_capacity | MKL-DNN cache capacity | int | Default | CPU |
| cpu_threads | Number of CPU threads | int | Default | CPU |
TensorRT Configuration Notes:
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md597-635
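Because several of these options only apply to particular hardware, a pre-flight check can catch mismatched combinations before the pipeline is built. The sketch below encodes the Hardware Support column as validation rules. Raising an error is a choice made here for illustration; the library itself may ignore or handle such combinations differently, and `performance_kwargs` is a hypothetical helper.

```python
def performance_kwargs(device, enable_hpi=False, use_tensorrt=False,
                       precision="fp32", enable_mkldnn=False, cpu_threads=None):
    """Validate performance options against the device type and return them."""
    kind = device.split(":", 1)[0]
    if precision not in {"fp32", "fp16", "int8"}:
        raise ValueError(f"unsupported precision: {precision!r}")
    if use_tensorrt and kind != "gpu":
        raise ValueError("use_tensorrt requires an NVIDIA GPU device")
    if enable_mkldnn and kind != "cpu":
        raise ValueError("enable_mkldnn only applies to CPU inference")
    if cpu_threads is not None and cpu_threads < 1:
        raise ValueError("cpu_threads must be at least 1")

    kwargs = {
        "device": device,
        "enable_hpi": enable_hpi,
        "use_tensorrt": use_tensorrt,
        "precision": precision,
        "enable_mkldnn": enable_mkldnn,
    }
    # Leave cpu_threads unset so the library default applies.
    if cpu_threads is not None:
        kwargs["cpu_threads"] = cpu_threads
    return kwargs
```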
PaddleOCR-VL uses a hierarchical configuration system that can be overridden via YAML files:
Configuration File Parameter:
| Parameter | Description | Type | Purpose |
|---|---|---|---|
| paddlex_config | Path to PaddleX pipeline configuration file | str | Allows advanced configuration overrides |
The paddlex_config parameter allows advanced users to pass a full PaddleX pipeline YAML file to override any pipeline setting not exposed directly through the PaddleOCRVL constructor arguments.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md637-642
When layout detection is disabled, the VLM processes the entire image with a configurable prompt:
Prompt Configuration:
| Parameter | Description | Type | Default | Condition |
|---|---|---|---|---|
| prompt_label | Prompt type for VL model | str | Default | Only effective when use_layout_detection=False |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md544-548
Here's a comprehensive example showing all major configuration parameters:
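A minimal sketch of such a configuration, using only parameter names and defaults from the tables in this document. Whether each key is accepted by the `PaddleOCRVL` constructor or by its `predict` call should be checked against the installed version. The helper `build_config` is hypothetical, and `"ocr"` in the usage note is an illustrative `prompt_label` value.

```python
def build_config(use_layout_detection=True, prompt_label=None, device="gpu:0"):
    """Assemble a flat configuration dictionary from the documented parameters."""
    config = {
        # Module toggles
        "use_layout_detection": use_layout_detection,
        "use_doc_orientation_classify": False,
        "use_doc_unwarping": False,
        "use_chart_recognition": False,
        # Layout detection
        "layout_threshold": 0.5,
        "layout_shape_mode": "auto",
        # Output handling
        "format_block_content": False,
        "merge_layout_blocks": True,
        # Device selection
        "device": device,
    }
    if prompt_label is not None:
        # prompt_label is only effective when layout detection is disabled.
        if use_layout_detection:
            raise ValueError(
                "prompt_label only takes effect when use_layout_detection=False"
            )
        config["prompt_label"] = prompt_label
    return config
```

For whole-image recognition, `build_config(use_layout_detection=False, prompt_label="ocr")` produces a configuration in which the prompt is actually used; with layout detection enabled the helper rejects the combination instead of silently ignoring it.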
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md725-754 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md280-643
PaddleOCR-VL provides a flexible, configurable pipeline for multilingual document understanding.
The configuration system allows users to balance accuracy, speed, and resource consumption based on their specific requirements.
Sources: All citations above