This document describes RAGFlow's vision processing subsystem, which extracts structured information from document images using deep learning models. The system performs three primary tasks: text detection and recognition (OCR), layout classification, and table structure analysis. These capabilities enable the document parser to understand document structure and extract text accurately from PDFs and images.
This page focuses on the vision models and their inference pipeline. For document parsing workflows that use these vision components, see Document Parsing Strategies. For the complete chunking pipeline, see Chunking Methods.
The vision processing system operates in three sequential stages during PDF parsing:
Diagram 1: Vision Processing Pipeline in PDF Parsing
The three stages are invoked sequentially by the RAGFlowPdfParser class methods: __images__() for OCR, _layouts_rec() for layout recognition, and _table_transformer_job() for table structure analysis.
Sources: deepdoc/parser/pdf_parser.py55-106 deepdoc/parser/pdf_parser.py548-580 deepdoc/parser/pdf_parser.py650-657
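The sequential three-stage flow can be sketched as follows. The method names come from the source; the class and its stub bodies are illustrative only, not the real RAGFlowPdfParser implementation:

```python
# Illustrative sketch of the three-stage pipeline order. The real
# RAGFlowPdfParser carries far more state; the bodies here are stubs.
class PdfVisionPipeline:
    def __init__(self):
        self.boxes = []  # OCR text boxes accumulated per page

    def __images__(self, pages):
        # Stage 1: OCR - detect and recognize text on each page image
        for page in pages:
            self.boxes.append([{"text": t} for t in page])

    def _layouts_rec(self):
        # Stage 2: tag each OCR box with a layout type
        for page_boxes in self.boxes:
            for box in page_boxes:
                box["layout_type"] = "text"

    def _table_transformer_job(self):
        # Stage 3: analyze table regions (no-op in this sketch)
        pass

pipe = PdfVisionPipeline()
pipe.__images__([["hello", "world"]])
pipe._layouts_rec()
pipe._table_transformer_job()
```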
The OCR system in deepdoc/vision/ocr.py implements a two-stage pipeline: detection identifies text regions, and recognition converts pixel regions to text strings.
Diagram 2: OCR Two-Stage Pipeline
Sources: deepdoc/vision/ocr.py542-586 deepdoc/vision/ocr.py139-177 deepdoc/vision/ocr.py369-414 deepdoc/vision/ocr.py420-540 deepdoc/vision/ocr.py590-644 deepdoc/vision/ocr.py714-757
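The detect-then-recognize handoff can be sketched as below. DetectorStub and RecognizerStub are hypothetical stand-ins for TextDetector and TextRecognizer, whose real implementations run ONNX models:

```python
import numpy as np

class DetectorStub:
    def __call__(self, image):
        h, w = image.shape[:2]
        # pretend we found one text region covering the top strip
        return [np.array([[0, 0], [w, 0], [w, 10], [0, 10]])]

class RecognizerStub:
    def __call__(self, crops):
        return [("sample text", 0.98) for _ in crops]

def ocr(image, detector, recognizer):
    boxes = detector(image)                    # stage 1: where is the text?
    crops = [image[:10, :, :] for _ in boxes]  # crop each region (simplified)
    texts = recognizer(crops)                  # stage 2: what does it say?
    return list(zip(boxes, texts))

result = ocr(np.zeros((100, 200, 3), dtype=np.uint8),
             DetectorStub(), RecognizerStub())
```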
Models are loaded via the load_model() function at deepdoc/vision/ocr.py71-137 which configures ONNX Runtime for CPU or CUDA execution:
| Configuration | Environment Variable | Default Value | Description |
|---|---|---|---|
| Inference Device | torch.cuda.is_available() | Auto-detect | Use CUDA if available, else CPU |
| GPU Memory Limit | OCR_GPU_MEM_LIMIT_MB | 2048 MB | Maximum GPU memory per model |
| Arena Strategy | OCR_ARENA_EXTEND_STRATEGY | kNextPowerOfTwo | GPU memory allocation strategy |
| Intra-op Threads | OCR_INTRA_OP_NUM_THREADS | 2 | Threads for single operation |
| Inter-op Threads | OCR_INTER_OP_NUM_THREADS | 2 | Threads across operations |
| Arena Shrinkage | OCR_GPUMEM_ARENA_SHRINKAGE | 0 (disabled) | Release GPU memory after inference |
Table 1: OCR Model Configuration Parameters
The load_model() function implements a global model cache using the loaded_models dictionary to avoid reloading models.
The function creates an ort.SessionOptions object with CPU memory arena disabled and sequential execution mode, then instantiates an ort.InferenceSession with either CUDAExecutionProvider or CPUExecutionProvider based on GPU availability.
Sources: deepdoc/vision/ocr.py71-137
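A minimal sketch of the caching pattern follows. The real function also handles model download and CUDA provider options; `_create_session` here is a hypothetical stand-in for the ort.InferenceSession construction:

```python
loaded_models = {}  # global cache: (model_name, device_id) -> session

def _create_session(model_name, device_id):
    # stand-in for ort.InferenceSession(path, sess_options, providers=[...])
    return {"model": model_name, "device": device_id}

def load_model(model_name, device_id=0):
    key = (model_name, device_id)  # device id in the key supports multi-GPU
    if key not in loaded_models:
        loaded_models[key] = _create_session(model_name, device_id)
    return loaded_models[key]

a = load_model("det", 0)
b = load_model("det", 0)  # cache hit: same session object returned
c = load_model("det", 1)  # different device: new session created
```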
For documents with many pages, RAGFlow supports parallel OCR across multiple GPUs using asyncio.Semaphore for device coordination:
Diagram 3: Multi-GPU OCR Coordination
The settings.PARALLEL_DEVICES environment variable controls the number of GPU devices. Each device gets a dedicated TextDetector and TextRecognizer instance. The PDF parser's parallel_limiter list contains one semaphore per device to ensure exclusive access during inference.
Sources: deepdoc/parser/pdf_parser.py70-73 deepdoc/vision/ocr.py562-585 deepdoc/vision/ocr.py669-688 deepdoc/vision/ocr.py702-712
The LayoutRecognizer class classifies document regions into 11 layout types using a DETR-based (DEtection TRansformer) object detection model.
Sources: deepdoc/vision/layout_recognizer.py34-46
Diagram 4: Layout Recognition and Box Tagging Pipeline
Sources: deepdoc/vision/layout_recognizer.py63-158 deepdoc/vision/recognizer.py55-62 deepdoc/vision/recognizer.py135-176 deepdoc/vision/recognizer.py267-281 deepdoc/vision/recognizer.py415-437
For each OCR box, the system determines its layout type by finding overlapping layout regions:
The tagger calls find_overlapped_with_threshold() with a 40% overlap requirement; boxes that fall outside every layout region are assigned an empty layout_type = "".

Sources: deepdoc/vision/layout_recognizer.py99-146
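The tagging logic can be sketched as follows; this is a simplified stand-in for find_overlapped_with_threshold(), using the 40% threshold from the source and illustrative box dictionaries:

```python
def overlap_ratio(box, region):
    # fraction of the OCR box's area covered by the layout region
    x0 = max(box["x0"], region["x0"]); x1 = min(box["x1"], region["x1"])
    y0 = max(box["top"], region["top"]); y1 = min(box["bottom"], region["bottom"])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = (box["x1"] - box["x0"]) * (box["bottom"] - box["top"])
    return inter / area if area else 0.0

def tag_layout(box, regions, threshold=0.4):
    for region in regions:
        if overlap_ratio(box, region) >= threshold:
            box["layout_type"] = region["type"]
            return box
    box["layout_type"] = ""  # no region overlaps enough
    return box

box = {"x0": 0, "x1": 10, "top": 0, "bottom": 10}
regions = [{"x0": 0, "x1": 100, "top": 0, "bottom": 100, "type": "text"}]
tagged = tag_layout(box, regions)
```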
RAGFlow supports specialized layout models for different document types via the model_speciess attribute:
| Document Type | Model Domain | Parser Class |
|---|---|---|
| General | "layout" | RAGFlowPdfParser |
| Academic Papers | "layout.paper" | Pdf in rag/app/paper.py |
| Legal Documents | "layout.laws" | Pdf in rag/app/laws.py |
| Manual/Structured | "layout.manual" | Pdf in rag/app/manual.py |
The domain is set via the model_speciess attribute before calling super().__init__() in the parser constructor.
Sources: deepdoc/parser/pdf_parser.py78-82 rag/app/paper.py32-34 rag/app/laws.py97-99 rag/app/manual.py35-36
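The subclass pattern can be sketched as below. The model_speciess attribute name is as spelled in the source; ParserBase and its model-file logic are hypothetical stand-ins for RAGFlowPdfParser's constructor:

```python
class ParserBase:
    def __init__(self):
        # base constructor reads the domain set by the subclass, defaulting
        # to the general "layout" model
        domain = getattr(self, "model_speciess", "layout")
        self.layout_model_file = f"{domain}.onnx"

class PaperPdf(ParserBase):
    def __init__(self):
        self.model_speciess = "layout.paper"  # set before super().__init__()
        super().__init__()

paper = PaperPdf()
general = ParserBase()
```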
The TableStructureRecognizer detects six types of table components within table regions.
Sources: deepdoc/vision/table_structure_recognizer.py31-38
Diagram 5: Table Structure Recognition with Auto-Rotation
Sources: deepdoc/parser/pdf_parser.py291-437 deepdoc/parser/pdf_parser.py200-289 deepdoc/parser/pdf_parser.py438-583 deepdoc/vision/table_structure_recognizer.py54-111
RAGFlow includes automatic table rotation correction. When enabled (auto_rotate=True), the system scores each candidate orientation with combined_score = avg_confidence * (1 + 0.1 * min(regions, 50) / 50), keeps the best-scoring rotation, and maps the detected coordinates back to the original page via _map_rotated_point(). This ensures tables printed in landscape orientation or upside-down are correctly processed.
Sources: deepdoc/parser/pdf_parser.py200-289 deepdoc/parser/pdf_parser.py438-583
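The scoring formula above weights average confidence up by at most 10% as the number of detected regions approaches 50. A sketch with illustrative candidate data:

```python
def combined_score(avg_confidence, regions):
    # formula from the source: confidence boosted by region count, capped at 50
    return avg_confidence * (1 + 0.1 * min(regions, 50) / 50)

# pick the rotation whose TSR result scores best (values are illustrative)
candidates = {
    0:   combined_score(0.60, 5),    # original orientation, few regions
    90:  combined_score(0.85, 40),   # rotated crop yields a confident parse
    180: combined_score(0.30, 2),
}
best_rotation = max(candidates, key=candidates.get)
```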
After TSR inference, OCR boxes within table regions are tagged with row, column, and header information using helper functions from the Recognizer class:
These tags enable the table construction logic in TableStructureRecognizer.construct_table() to correctly assemble HTML tables with proper row/column structure.
Sources: deepdoc/parser/pdf_parser.py407-437 deepdoc/vision/recognizer.py253-264 deepdoc/vision/recognizer.py267-281 deepdoc/vision/table_structure_recognizer.py152-393
The Recognizer class in deepdoc/vision/recognizer.py provides the foundation for all vision models:
Diagram 6: Recognizer Base Class Architecture
Sources: deepdoc/vision/recognizer.py31-53 deepdoc/vision/recognizer.py55-62 deepdoc/vision/recognizer.py114-132 deepdoc/vision/recognizer.py135-176 deepdoc/vision/recognizer.py178-215 deepdoc/vision/recognizer.py253-264 deepdoc/vision/recognizer.py267-281 deepdoc/vision/recognizer.py283-312 deepdoc/vision/recognizer.py314-407 deepdoc/vision/recognizer.py415-437
The vision system uses a configurable operator pipeline for image preprocessing. Operators are dynamically instantiated from configuration dictionaries:
Common operators include: NormalizeImage, ResizeImage, ToCHWImage, Pad, DetResizeForTest.
Sources: deepdoc/vision/ocr.py49-69
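The dynamic instantiation pattern can be sketched as follows; the operator classes here are simplified stand-ins for the real preprocessing operators:

```python
import numpy as np

class NormalizeImage:
    def __init__(self, scale=1.0 / 255):
        self.scale = scale
    def __call__(self, img):
        return img.astype("float32") * self.scale

class ToCHWImage:
    def __call__(self, img):
        return img.transpose(2, 0, 1)  # HWC -> CHW for ONNX input

OPS = {"NormalizeImage": NormalizeImage, "ToCHWImage": ToCHWImage}

def create_operators(op_configs):
    # each config dict holds one operator name mapped to its parameters
    ops = []
    for cfg in op_configs:
        (name, params), = cfg.items()
        ops.append(OPS[name](**(params or {})))
    return ops

pipeline = create_operators([{"NormalizeImage": {"scale": 1.0 / 255}},
                             {"ToCHWImage": None}])
img = np.ones((4, 6, 3), dtype=np.uint8)
for op in pipeline:
    img = op(img)
```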
The recognizers process images in batches for efficiency:
| Model | Default Batch Size | Configuration |
|---|---|---|
| Layout Recognition | 16 | batch_size parameter in __call__() |
| Text Recognition | 16 | self.rec_batch_num in TextRecognizer |
| Table Structure | Variable | All table crops from one page |
For batch processing, images are padded to the same dimensions before stacking into a single numpy array for ONNX inference.
Sources: deepdoc/vision/layout_recognizer.py63 deepdoc/vision/ocr.py142 deepdoc/vision/recognizer.py204-225
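The pad-and-stack step can be sketched as below, assuming HWC crops of varying sizes:

```python
import numpy as np

def pad_and_stack(images):
    # zero-pad every crop to the max height/width in the batch, then stack
    h = max(img.shape[0] for img in images)
    w = max(img.shape[1] for img in images)
    batch = np.zeros((len(images), h, w, 3), dtype=images[0].dtype)
    for i, img in enumerate(images):
        batch[i, :img.shape[0], :img.shape[1]] = img  # pad right/bottom
    return batch

crops = [np.ones((10, 20, 3), dtype=np.float32),
         np.ones((12, 16, 3), dtype=np.float32)]
batch = pad_and_stack(crops)  # single array ready for ONNX inference
```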
The RAGFlowPdfParser orchestrates the vision pipeline through three key methods:
Diagram 7: PDF Parser Vision Integration
Sources: deepdoc/parser/pdf_parser.py56-106 deepdoc/parser/pdf_parser.py585-649 deepdoc/parser/pdf_parser.py650-657 deepdoc/parser/pdf_parser.py291-437 deepdoc/parser/pdf_parser.py742-778 deepdoc/parser/pdf_parser.py999-1169
Different chunking methods in RAGFlow utilize vision processing differently:
| Chunking Method | Uses OCR | Uses Layout | Uses Table Structure | Parser Class |
|---|---|---|---|---|
| Naive (General) | ✓ | ✓ | ✓ | rag.app.naive.Pdf |
| Book | ✓ | ✓ | ✓ | rag.app.book.Pdf |
| Paper | ✓ | ✓ (domain-specific) | ✓ | rag.app.paper.Pdf |
| Manual | ✓ | ✓ (domain-specific) | ✓ | rag.app.manual.Pdf |
| Laws | ✓ | ✓ (domain-specific) | ✗ | rag.app.laws.Pdf |
| Q&A | ✓ | ✓ | ✓ | rag.app.qa.Pdf |
| Presentation | ✓ | ✓ | ✓ | rag.app.presentation.Pdf |
| One (entire file) | ✓ | ✓ | ✓ | rag.app.one.Pdf |
Table 2: Vision Processing Usage by Chunking Method
Sources: rag/app/naive.py544-581 rag/app/book.py32-58 rag/app/paper.py31-141 rag/app/manual.py33-68 rag/app/laws.py96-118 rag/app/qa.py79-189 rag/app/presentation.py34-114 rag/app/one.py30-56
After vision processing completes, each text box carries its recognized text together with the position and layout tags assigned in the earlier stages. These boxes flow into the text merging phase (_text_merge()), where they are concatenated according to document structure logic.
Sources: deepdoc/parser/pdf_parser.py595-649 deepdoc/parser/pdf_parser.py650-657 deepdoc/parser/pdf_parser.py407-437
RAGFlow supports Huawei Ascend NPUs as an alternative to ONNX/CUDA:
The AscendLayoutRecognizer class provides the same interface but uses Ascend CANN runtime for inference.
Sources: deepdoc/parser/pdf_parser.py74-88
Layout recognition can be delegated to a remote TensorRT-DLA server.
Sources: deepdoc/vision/layout_recognizer.py58-72
Models are loaded once and cached globally in the loaded_models dictionary. The cache key includes the device ID to support multi-GPU configurations.
Sources: deepdoc/vision/ocr.py71-80
GPU memory is controlled through ONNX Runtime provider options:
- `OCR_GPU_MEM_LIMIT_MB` caps GPU memory per model
- `OCR_ARENA_EXTEND_STRATEGY` controls allocation: `kNextPowerOfTwo` (default) or `kSameAsRequested`
- Set `OCR_GPUMEM_ARENA_SHRINKAGE=1` to release memory after inference

Sources: deepdoc/vision/ocr.py106-126
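These variables feed the CUDAExecutionProvider options; a sketch of assembling them (the option keys follow ONNX Runtime's CUDA provider options, and the dict would be passed as `providers=[("CUDAExecutionProvider", opts)]`):

```python
import os

def cuda_provider_options(device_id=0):
    limit_mb = int(os.environ.get("OCR_GPU_MEM_LIMIT_MB", "2048"))
    return {
        "device_id": device_id,
        "gpu_mem_limit": limit_mb * 1024 * 1024,  # ONNX Runtime wants bytes
        "arena_extend_strategy": os.environ.get(
            "OCR_ARENA_EXTEND_STRATEGY", "kNextPowerOfTwo"),
    }

opts = cuda_provider_options()
```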
To prevent CPU oversubscription in multi-worker deployments, the intra-op and inter-op thread counts default to 2 (see Table 1). Lower thread counts prevent thrashing when multiple document workers run concurrently.
Sources: deepdoc/vision/ocr.py99-101
Vision models are stored in rag/res/deepdoc/:
rag/res/deepdoc/
├── det.onnx # Text detection model
├── rec.onnx # Text recognition model
├── ocr.res # Character dictionary (6623 chars)
├── layout.onnx # Layout recognition (general)
├── layout.paper.onnx # Layout recognition (papers)
├── layout.laws.onnx # Layout recognition (legal docs)
├── layout.manual.onnx # Layout recognition (manuals)
├── tsr.onnx # Table structure recognition
└── updown_concat_xgb.model # Text merge classifier (XGBoost)
Sources: deepdoc/vision/ocr.py71-83 deepdoc/vision/layout_recognizer.py48-54 deepdoc/vision/table_structure_recognizer.py41-52 deepdoc/parser/pdf_parser.py101-105
If models are missing, RAGFlow automatically downloads them from HuggingFace.
Sources: deepdoc/vision/layout_recognizer.py48-54 deepdoc/vision/table_structure_recognizer.py41-52
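The download-on-missing pattern can be sketched as follows; `fetch_from_huggingface` is a hypothetical stand-in for the real HuggingFace download call, faked here so the sketch runs offline:

```python
import os
import tempfile

def fetch_from_huggingface(dest):
    # stand-in for the real model download; writes an empty placeholder file
    os.makedirs(dest, exist_ok=True)
    open(os.path.join(dest, "layout.onnx"), "wb").close()

def ensure_model(model_dir, filename="layout.onnx"):
    # download the model repository only if the file is not already present
    path = os.path.join(model_dir, filename)
    if not os.path.exists(path):
        fetch_from_huggingface(model_dir)
    return path

model_dir = tempfile.mkdtemp()
path = ensure_model(model_dir)
```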
RAGFlow's vision processing system combines OCR, layout recognition, and table structure analysis to extract structured information from document images. The modular design allows each component to be used independently or as part of the full document processing pipeline, supporting various document types and use cases.