PP-ChatOCRv4 (also referred to as PP-ChatOCRv4-doc) is an intelligent document analysis pipeline that combines Large Language Models (LLMs), Multimodal LLMs (MLLMs), and OCR technology to extract structured information from complex documents. It addresses challenges such as layout analysis, rare characters, multi-page PDFs, tables, formulas, and seal text recognition by integrating ERNIE Bot (ERNIE 4.5) with document parsing capabilities. The pipeline enables question-answering interactions with documents.
Installation note: PP-ChatOCRv4 requires the `ie` optional dependency, installed via the extras syntax (e.g. `pip install "paddleocr[ie]"`). See `pyproject.toml` for the full dependency set: `ie = ["paddlex[ie]>=3.4.0,<3.5.0"]`.
For basic text recognition without LLM integration, see page 2.1 (PP-OCRv5 Universal Text Recognition). For document parsing without intelligent extraction, see page 2.3 (PP-StructureV3 Document Parsing). For multilingual document parsing using vision-language models, see page 2.2 (PaddleOCR-VL Vision-Language Model).
PP-ChatOCRv4 operates as a three-layer architecture that processes documents through parsing, analysis, and intelligent extraction stages.
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md1-25 docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md1-24 pyproject.toml57-61
Module-to-code mapping for PP-ChatOCRv4:
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md15-26 docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md13-24 pyproject.toml58-61
The document parsing layer leverages PP-StructureV3 capabilities to convert raw documents into structured representations. This layer consists of nine configurable modules that can be enabled or disabled based on requirements.
The layout detection module identifies document structure using PP-DocLayout_plus-L, which recognizes 20 common layout categories including document titles, paragraph titles, text blocks, tables, formulas, images, seals, and charts.
| Model | mAP(0.5) | GPU Time (ms) | CPU Time (ms) | Size (MB) | Categories |
|---|---|---|---|---|---|
| PP-DocLayout_plus-L | 83.2% | 53.03 / 17.23 | 634.62 / 378.32 | 126.01 | 20 layout types |
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md86-112 docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md87-111
The OCR subsystem comprises text detection and recognition components:
Text Detection: Uses PP-OCRv5_server_det (83.8% Hmean, 89.55ms GPU inference)
Text Recognition: Uses PP-OCRv5_server_rec (86.38% accuracy, 8.46ms GPU inference)
These models support five text types: Simplified Chinese, Traditional Chinese, English, Japanese, and Pinyin in a single model.
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md424-458 docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md426-460
Table Structure Recognition: SLANeXt_wired and SLANeXt_wireless models handle wired and wireless tables separately (69.65% accuracy, 85.92ms GPU inference).
Formula Recognition: Multiple models supported including LatexOCR (76.9% BLEU) and UniMERNet_base (74.7% CDM, 78.7% ExpRate).
Seal Text Detection: PP-OCRv4_server_seal_det model specialized for curved seal text (98.0% detection Hmean).
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md608-664 docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md608-664
The LLM integration layer connects document parsing results with ERNIE 4.5 to enable intelligent information extraction and question-answering.
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md866-975 docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md866-975
The extraction layer processes LLM responses to provide precise answers to user queries about document content.
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md866-1047 docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md866-1047
The pipeline executes modules in a specific sequence to optimize processing efficiency:
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md15-26
PP-ChatOCRv4 uses a hierarchical configuration system inherited from PaddleX. Key options:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `layout_model_name` | str | `"PP-DocLayout_plus-L"` | Layout detection model |
| `text_detection_model_name` | str | `"PP-OCRv5_server_det"` | Text detection model |
| `text_recognition_model_name` | str | `"PP-OCRv5_server_rec"` | Text recognition model |
| `table_structure_model_name` | str | `"SLANeXt_wired"` | Table structure model |
| `formula_recognition_model_name` | str | `"LatexOCR"` | Formula recognition model |
| `use_doc_orientation_classify` | bool | `False` | Enable document orientation classification |
| `use_doc_unwarping` | bool | `False` | Enable document unwarping |
| `use_textline_orientation` | bool | `False` | Enable text-line orientation classification |
| `use_seal_text_detection` | bool | `False` | Enable seal text detection |
| `use_table_recognition` | bool | `True` | Enable table recognition |
| `use_formula_recognition` | bool | `True` | Enable formula recognition |
| `llm_name` | str | `"ernie-4.5-8k-preview"` | LLM model selection |
| `api_type` | str | `"qianfan"` | API service type |
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md866-975 docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md866-975
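The hierarchical override behavior can be sketched in plain Python. This is a toy illustration only (the real merging lives inside PaddleX's configuration system); the default values come from the parameter table above, but `resolve_config` is a hypothetical helper, not a PaddleX API.

```python
# Toy sketch of hierarchical configuration: built-in defaults overlaid by
# user-supplied keyword arguments. The merge logic is illustrative, not
# PaddleX's actual implementation.
DEFAULTS = {
    "layout_model_name": "PP-DocLayout_plus-L",
    "text_detection_model_name": "PP-OCRv5_server_det",
    "text_recognition_model_name": "PP-OCRv5_server_rec",
    "use_table_recognition": True,
    "use_seal_text_detection": False,
    "llm_name": "ernie-4.5-8k-preview",
    "api_type": "qianfan",
}

def resolve_config(**overrides):
    """Return the effective configuration: defaults overlaid by user overrides."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}

cfg = resolve_config(use_seal_text_detection=True)
```

Unrecognized keys are rejected eagerly here, which mirrors the usual behavior of a validated configuration layer: a typo in a parameter name fails at initialization rather than being silently ignored.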
Configuration flow through the PaddleX/PaddleOCR layers:
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md803-865 docs/version3.x/paddleocr_and_paddlex.en.md1-40
The PP-ChatOCRv4 pipeline is accessed through the PaddleX pipeline system. The PaddleX registration name is PP-ChatOCRv4. The pipeline wrapper class is located in paddleocr/_pipelines/pp_chatocrv4.py.
Key initialization parameters and an example invocation pattern:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `layout_model_name` | str | `"PP-DocLayout_plus-L"` | Layout detection model |
| `text_detection_model_name` | str | `"PP-OCRv5_server_det"` | Text detection model |
| `text_recognition_model_name` | str | `"PP-OCRv5_server_rec"` | Text recognition model |
| `use_table_recognition` | bool | `True` | Enable table recognition |
| `use_formula_recognition` | bool | `True` | Enable formula recognition |
| `use_seal_text_detection` | bool | `False` | Enable seal text detection |
| `llm_name` | str | `"ernie-4.5-8k-preview"` | LLM model name |
| `api_type` | str | `"qianfan"` | API service type |
| `api_key` | str | — | Qianfan API key |
| `secret_key` | str | — | Qianfan secret key |
Refer to docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md for full parameter lists and usage examples.
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md976-1047 docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md976-1047
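A minimal invocation sketch follows. The class name `PPChatOCRv4Doc` is assumed from the wrapper module noted above (`paddleocr/_pipelines/pp_chatocrv4.py`); actually running it requires the `ie` extra installed and valid Qianfan credentials, so the import is kept inside an uncalled function.

```python
# Hypothetical invocation pattern for the Python API; parameter names and
# defaults follow the table above.
def build_init_kwargs(api_key: str, secret_key: str) -> dict:
    """Collect initialization keyword arguments for the pipeline wrapper."""
    return {
        "text_detection_model_name": "PP-OCRv5_server_det",
        "text_recognition_model_name": "PP-OCRv5_server_rec",
        "use_table_recognition": True,
        "llm_name": "ernie-4.5-8k-preview",
        "api_type": "qianfan",
        "api_key": api_key,
        "secret_key": secret_key,
    }

def run_example():
    # Not executed here: needs paddleocr[ie] installed plus network access.
    from paddleocr import PPChatOCRv4Doc  # assumed export name
    return PPChatOCRv4Doc(**build_init_kwargs("<API_KEY>", "<SECRET_KEY>"))
```

Credentials are passed at construction time; keeping them out of source control (e.g. read from environment variables) is the usual practice.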
The paddleocr CLI entry point (defined in pyproject.toml) exposes the pipeline as pp_chatocrv4:
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md803-865 pyproject.toml54-55
| Method | Signature | Description |
|---|---|---|
| `predict()` | `predict(input, query, vector_path=None, ...)` | Main prediction method combining OCR and LLM |
| `save_vector()` | `save_vector(vector_path)` | Persist document embeddings to disk |
| `load_vector()` | `load_vector(vector_path)` | Load previously persisted embeddings |
| `save_visual_info_list()` | `save_visual_info_list(path)` | Save visual bounding-box/layout information |
| `load_visual_info_list()` | `load_visual_info_list(path)` | Load previously saved visual information |
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md866-1047
PP-ChatOCRv4 implements a vector database system for efficient document retrieval and context management when processing queries.
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md866-1047
The vector database enables persistent storage of document embeddings for repeated queries without reprocessing:
- First-time processing: parse the document, build the embeddings, and persist them to `vector_path` with `save_vector()`.
- Subsequent queries: restore the embeddings with `load_vector(vector_path)` and answer queries without reparsing the document.

Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md976-1047
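The save/load round trip can be illustrated with a toy store. This is a stand-in only: the real pipeline computes LLM embeddings, whereas this sketch fakes an "embedding" as a chunk-length vector so the persistence workflow can be shown without any model.

```python
import json
import os
import tempfile

class VectorStoreSketch:
    """Toy stand-in for the pipeline's vector persistence (illustrative only)."""

    def __init__(self):
        self.vectors = {}

    def build(self, chunks):
        # Fake embedding: one number per chunk (the real pipeline uses an LLM).
        self.vectors = {chunk: [float(len(chunk))] for chunk in chunks}

    def save_vector(self, path):
        # First-time processing: persist embeddings to disk.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.vectors, f)

    def load_vector(self, path):
        # Subsequent queries: restore embeddings, skipping document parsing.
        with open(path, encoding="utf-8") as f:
            self.vectors = json.load(f)

store = VectorStoreSketch()
store.build(["Invoice No. 2024-001", "Total: $1,234.56"])
path = os.path.join(tempfile.mkdtemp(), "doc.vec.json")
store.save_vector(path)

fresh = VectorStoreSketch()
fresh.load_vector(path)
assert fresh.vectors == store.vectors  # round trip preserves the embeddings
```

The point of the pattern is amortization: embedding construction is the expensive step, so persisting it once makes every later query cheap.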
PP-ChatOCRv4 connects to ERNIE Bot services through the Qianfan API:
| Parameter | Required | Description |
|---|---|---|
| `api_type` | Yes | API service type (`"qianfan"`) |
| `api_key` | Yes | Qianfan API key |
| `secret_key` | Yes | Qianfan secret key |
| `llm_name` | No | Model name (default: `"ernie-4.5-8k-preview"`) |
| `llm_params` | No | Additional LLM parameters |
The authentication mechanism uses the API key and secret key to obtain access tokens for making LLM API calls.
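A sketch of that credential-to-token exchange is below. The endpoint URL and parameter names follow Baidu AI Cloud's common OAuth-style token flow and are assumptions here; no network call is made, only the request URL is constructed.

```python
from urllib.parse import urlencode

# Assumed token-exchange endpoint (OAuth 2.0 client_credentials style).
TOKEN_ENDPOINT = "https://aip.baidubce.com/oauth/2.0/token"

def build_token_request(api_key: str, secret_key: str) -> str:
    """Return the URL that would exchange api_key/secret_key for an access token."""
    query = urlencode({
        "grant_type": "client_credentials",
        "client_id": api_key,         # Qianfan API key
        "client_secret": secret_key,  # Qianfan secret key
    })
    return f"{TOKEN_ENDPOINT}?{query}"

url = build_token_request("my-api-key", "my-secret-key")
```

The returned access token (not shown) would then be attached to subsequent LLM API calls.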
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md866-975
PP-ChatOCRv4 supports multiple ERNIE model variants:
- `ernie-4.5-8k-preview`: default model with an 8K context window
- `ernie-4.5-turbo-8k`: faster variant for lower latency
- `ernie-4.5-128k-preview`: extended context for long documents
- `ernie-3.5`: previous-generation model

Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md866-975
The pipeline constructs prompts by combining the parsed document content (OCR text, tables, and formulas), context retrieved from the vector database, and the user's query.
This structured prompt enables ERNIE 4.5 to understand document structure and provide accurate, contextually relevant answers.
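The assembly step can be sketched as a simple template. The template wording here is a hypothetical example, not the pipeline's actual prompt; it only shows how the parsed content, retrieved context, and query are combined into one LLM input.

```python
# Illustrative prompt assembly; the template text is invented for this sketch.
def build_prompt(ocr_text: str, table_markdown: str,
                 retrieved: list, query: str) -> str:
    context = "\n".join(retrieved)
    return (
        "You are extracting information from a document.\n"
        f"Document text:\n{ocr_text}\n"
        f"Tables:\n{table_markdown}\n"
        f"Relevant context:\n{context}\n"
        f"Question: {query}\n"
        "Answer using only the document content."
    )

prompt = build_prompt(
    ocr_text="Invoice No. 2024-001  Total: $1,234.56",
    table_markdown="| Item | Price |\n| Widget | $1,234.56 |",
    retrieved=["Invoice issued 2024-03-01"],
    query="What is the invoice total?",
)
```

Constraining the answer to the supplied document content is what keeps the LLM's response grounded rather than hallucinated.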
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md1-25
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md15-26 docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md13-24
The table below shows inference times for core models in PP-ChatOCRv4:
| Module | Model | GPU Time (ms) | CPU Time (ms) | Accuracy |
|---|---|---|---|---|
| Layout Detection | PP-DocLayout_plus-L | 53.03 / 17.23 | 634.62 / 378.32 | 83.2% mAP |
| Text Detection | PP-OCRv5_server_det | 89.55 / 70.19 | 383.15 / 383.15 | 83.8% Hmean |
| Text Recognition | PP-OCRv5_server_rec | 8.46 / 2.36 | 31.21 / 31.21 | 86.38% Avg |
| Table Structure | SLANeXt_wired | 85.92 / 85.92 | - / 501.66 | 69.65% |
| Formula | LatexOCR | 358.34 / 358.34 | - / 1620.76 | 76.9% BLEU |
Times shown as [Standard Mode / High-Performance Mode]
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md86-664
Key improvements in PP-ChatOCRv4 over prior versions:
- Layout detection: `PP-DocLayout_plus-L` (20 layout categories vs. fewer in older models)
- OCR: `PP-OCRv5_server_det` / `PP-OCRv5_server_rec` (PP-OCRv5 is ~13% better than PP-OCRv4 on varied scenes)
- Table structure: `SLANeXt_wired` / `SLANeXt_wireless` (separate wired and wireless table models)
- Vector database persistence (`save_vector` / `load_vector`) for efficient document chunking and retrieval

Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md1-9 docs/version3.x/pipeline_usage/PP-ChatOCRv4.en.md1-9
PP-ChatOCRv4 can be deployed on various hardware configurations:
Minimum (CPU-only):
Recommended (GPU):
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md1048-1227
Choose models based on deployment constraints:
- High accuracy: use server models (`PP-OCRv5_server`, `PP-DocLayout_plus-L`)
- Fast inference: use mobile models (`PP-OCRv5_mobile`, `PP-DocLayout-S`)
- Memory constrained: disable optional modules (formula recognition, seal detection)
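That guidance can be turned into a small selection helper. The model names come from the tables on this page; the profile-mapping function itself is a hypothetical convenience, not part of the PaddleOCR API.

```python
# Hypothetical helper mapping deployment constraints to model choices.
PROFILES = {
    "accuracy": {  # server models: slower, more accurate
        "text_detection_model_name": "PP-OCRv5_server_det",
        "text_recognition_model_name": "PP-OCRv5_server_rec",
        "layout_model_name": "PP-DocLayout_plus-L",
    },
    "speed": {  # mobile models: faster, lighter
        "text_detection_model_name": "PP-OCRv5_mobile_det",
        "text_recognition_model_name": "PP-OCRv5_mobile_rec",
        "layout_model_name": "PP-DocLayout-S",
    },
}

def select_models(profile: str, memory_constrained: bool = False) -> dict:
    """Pick model names for a profile; optionally drop optional modules."""
    config = dict(PROFILES[profile])
    if memory_constrained:
        # Formula and seal modules are optional and can be disabled to save memory.
        config.update(use_formula_recognition=False,
                      use_seal_text_detection=False)
    return config

speed_cfg = select_models("speed", memory_constrained=True)
```

The resulting dictionary can be passed straight through as pipeline initialization keyword arguments.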
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md86-664
PP-ChatOCRv4 supports multiple deployment modes:
- High-performance inference: `enable_hpi=True` with TensorRT/ONNX backends

The pipeline integrates with PaddleX for production deployment; refer to docs/version3.x/paddleocr_and_paddlex.en.md for the PaddleOCR–PaddleX relationship and pipeline registration names.
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md1-9 docs/version3.x/paddleocr_and_paddlex.en.md1-40
| Aspect | PP-ChatOCRv4 | PP-StructureV3 |
|---|---|---|
| Primary Goal | Intelligent Q&A + Extraction | Document Parsing |
| LLM Integration | Yes (ERNIE 4.5) | No |
| Vector Database | Yes | No |
| Query Support | Yes | No |
| Output Format | Answers + JSON/Markdown | JSON/Markdown only |
| Use Case | Information extraction from docs | Document structure conversion |
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md1-9 docs/version3.x/pipeline_usage/PP-StructureV3.md1-9
| Aspect | PP-ChatOCRv4 | PaddleOCR-VL |
|---|---|---|
| Architecture | Pipeline-based | Vision-Language Model |
| Language Support | Via translation | Native 109+ languages |
| Customization | Modular components | End-to-end VLM |
| Training | Per-module fine-tuning | Unified VLM training |
| LLM Dependency | External (ERNIE) | Integrated (ERNIE-4.5-0.3B) |
Sources: docs/version3.x/pipeline_usage/PP-ChatOCRv4.md1-9 README.md61-66