This document describes RAGFlow's vision processing subsystem, which extracts structured information from document images using deep learning models. The system performs three primary tasks: text detection and recognition (OCR), layout classification, and table structure analysis. These capabilities enable the document parser to understand document structure and extract text accurately from PDFs and images.
This page focuses on the vision models and their inference pipeline. For document parsing workflows that use these vision components, see Document Parsing Strategies. For the complete chunking pipeline, see Chunking Methods.
The vision processing system operates in three sequential stages during PDF parsing:
Diagram 1: Vision Processing Pipeline in PDF Parsing
The three stages are invoked sequentially by the RAGFlowPdfParser class methods: __images__() for OCR, _layouts_rec() for layout recognition, and _table_transformer_job() for table structure analysis.
Sources: deepdoc/parser/pdf_parser.py55-106 deepdoc/parser/pdf_parser.py548-580 deepdoc/parser/pdf_parser.py650-657
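The sequential three-stage flow can be sketched as follows. The method names come from the source; the class and its stub bodies are illustrative only, not the real RAGFlowPdfParser implementation:

```python
# Illustrative sketch of the three-stage pipeline order. The real
# RAGFlowPdfParser carries far more state; the bodies here are stubs.
class PdfVisionPipeline:
    def __init__(self):
        self.boxes = []  # OCR text boxes accumulated per page

    def __images__(self, pages):
        # Stage 1: OCR - detect and recognize text on each page image
        for page in pages:
            self.boxes.append([{"text": t} for t in page])

    def _layouts_rec(self):
        # Stage 2: tag each OCR box with a layout type
        for page_boxes in self.boxes:
            for box in page_boxes:
                box["layout_type"] = "text"

    def _table_transformer_job(self):
        # Stage 3: analyze table regions (no-op in this sketch)
        pass

pipe = PdfVisionPipeline()
pipe.__images__([["hello", "world"]])
pipe._layouts_rec()
pipe._table_transformer_job()
```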
The OCR system in deepdoc/vision/ocr.py implements a two-stage pipeline: detection identifies text regions, and recognition converts pixel regions to text strings.
Diagram 2: OCR Two-Stage Pipeline
Sources: deepdoc/vision/ocr.py542-586 deepdoc/vision/ocr.py139-177 deepdoc/vision/ocr.py369-414 deepdoc/vision/ocr.py420-540 deepdoc/vision/ocr.py590-644 deepdoc/vision/ocr.py714-757
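The detect-then-recognize handoff can be sketched as below. DetectorStub and RecognizerStub are hypothetical stand-ins for TextDetector and TextRecognizer, whose real implementations run ONNX models:

```python
import numpy as np

class DetectorStub:
    def __call__(self, image):
        h, w = image.shape[:2]
        # pretend we found one text region covering the top strip
        return [np.array([[0, 0], [w, 0], [w, 10], [0, 10]])]

class RecognizerStub:
    def __call__(self, crops):
        return [("sample text", 0.98) for _ in crops]

def ocr(image, detector, recognizer):
    boxes = detector(image)                    # stage 1: where is the text?
    crops = [image[:10, :, :] for _ in boxes]  # crop each region (simplified)
    texts = recognizer(crops)                  # stage 2: what does it say?
    return list(zip(boxes, texts))

result = ocr(np.zeros((100, 200, 3), dtype=np.uint8),
             DetectorStub(), RecognizerStub())
```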
Models are loaded via the load_model() function at deepdoc/vision/ocr.py71-137 which configures ONNX Runtime for CPU or CUDA execution:
| Configuration | Environment Variable | Default Value | Description |
|---|---|---|---|
| Inference Device | torch.cuda.is_available() | Auto-detect | Use CUDA if available, else CPU |
| GPU Memory Limit | OCR_GPU_MEM_LIMIT_MB | 2048 MB | Maximum GPU memory per model |
| Arena Strategy | OCR_ARENA_EXTEND_STRATEGY | kNextPowerOfTwo | GPU memory allocation strategy |
| Intra-op Threads | OCR_INTRA_OP_NUM_THREADS | 2 | Threads for single operation |
| Inter-op Threads | OCR_INTER_OP_NUM_THREADS | 2 | Threads across operations |
| Arena Shrinkage | OCR_GPUMEM_ARENA_SHRINKAGE | 0 (disabled) | Release GPU memory after inference |
Table 1: OCR Model Configuration Parameters
The load_model() function implements a global model cache using the loaded_models dictionary to avoid reloading models.
The function creates an ort.SessionOptions object with CPU memory arena disabled and sequential execution mode, then instantiates an ort.InferenceSession with either CUDAExecutionProvider or CPUExecutionProvider based on GPU availability.
Sources: deepdoc/vision/ocr.py71-137
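A minimal sketch of the caching pattern follows. The real function also handles model download and CUDA provider options; `_create_session` here is a hypothetical stand-in for the ort.InferenceSession construction:

```python
loaded_models = {}  # global cache: (model_name, device_id) -> session

def _create_session(model_name, device_id):
    # stand-in for ort.InferenceSession(path, sess_options, providers=[...])
    return {"model": model_name, "device": device_id}

def load_model(model_name, device_id=0):
    key = (model_name, device_id)  # device id in the key supports multi-GPU
    if key not in loaded_models:
        loaded_models[key] = _create_session(model_name, device_id)
    return loaded_models[key]

a = load_model("det", 0)
b = load_model("det", 0)  # cache hit: same session object returned
c = load_model("det", 1)  # different device: new session created
```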
For documents with many pages, RAGFlow supports parallel OCR across multiple GPUs using asyncio.Semaphore for device coordination:
Diagram 3: Multi-GPU OCR Coordination
The settings.PARALLEL_DEVICES environment variable controls the number of GPU devices. Each device gets a dedicated TextDetector and TextRecognizer instance. The PDF parser's parallel_limiter list contains one semaphore per device to ensure exclusive access during inference.
Sources: deepdoc/parser/pdf_parser.py70-73 deepdoc/vision/ocr.py562-585 deepdoc/vision/ocr.py669-688 deepdoc/vision/ocr.py702-712
The LayoutRecognizer class classifies document regions into 11 layout types using a DETR-based (DEtection TRansformer) object detection model.
Sources: deepdoc/vision/layout_recognizer.py34-46
Diagram 4: Layout Recognition and Box Tagging Pipeline
Sources: deepdoc/vision/layout_recognizer.py63-158 deepdoc/vision/recognizer.py55-62 deepdoc/vision/recognizer.py135-176 deepdoc/vision/recognizer.py267-281 deepdoc/vision/recognizer.py415-437
For each OCR box, the system determines its layout type by finding overlapping layout regions:
The tagger calls find_overlapped_with_threshold() with a 40% overlap requirement; boxes that fall outside every layout region are assigned an empty layout_type = "".

Sources: deepdoc/vision/layout_recognizer.py99-146
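The tagging logic can be sketched as follows; this is a simplified stand-in for find_overlapped_with_threshold(), using the 40% threshold from the source and illustrative box dictionaries:

```python
def overlap_ratio(box, region):
    # fraction of the OCR box's area covered by the layout region
    x0 = max(box["x0"], region["x0"]); x1 = min(box["x1"], region["x1"])
    y0 = max(box["top"], region["top"]); y1 = min(box["bottom"], region["bottom"])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = (box["x1"] - box["x0"]) * (box["bottom"] - box["top"])
    return inter / area if area else 0.0

def tag_layout(box, regions, threshold=0.4):
    for region in regions:
        if overlap_ratio(box, region) >= threshold:
            box["layout_type"] = region["type"]
            return box
    box["layout_type"] = ""  # no region overlaps enough
    return box

box = {"x0": 0, "x1": 10, "top": 0, "bottom": 10}
regions = [{"x0": 0, "x1": 100, "top": 0, "bottom": 100, "type": "text"}]
tagged = tag_layout(box, regions)
```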
RAGFlow supports specialized layout models for different document types via the model_speciess attribute:
| Document Type | Model Domain | Parser Class |
|---|---|---|
| General | "layout" | RAGFlowPdfParser |
| Academic Papers | "layout.paper" | Pdf in rag/app/paper.py |
| Legal Documents | "layout.laws" | Pdf in rag/app/laws.py |
| Manual/Structured | "layout.manual" | Pdf in rag/app/manual.py |
The domain is set via the model_speciess attribute before calling super().__init__() in the parser constructor.
Sources: deepdoc/parser/pdf_parser.py78-82 rag/app/paper.py32-34 rag/app/laws.py97-99 rag/app/manual.py35-36
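The subclass pattern can be sketched as below. The model_speciess attribute name is as spelled in the source; ParserBase and its model-file logic are hypothetical stand-ins for RAGFlowPdfParser's constructor:

```python
class ParserBase:
    def __init__(self):
        # base constructor reads the domain set by the subclass, defaulting
        # to the general "layout" model
        domain = getattr(self, "model_speciess", "layout")
        self.layout_model_file = f"{domain}.onnx"

class PaperPdf(ParserBase):
    def __init__(self):
        self.model_speciess = "layout.paper"  # set before super().__init__()
        super().__init__()

paper = PaperPdf()
general = ParserBase()
```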
The TableStructureRecognizer detects six types of table components within table regions.
Sources: deepdoc/vision/table_structure_recognizer.py31-38
Diagram 5: Table Structure Recognition with Auto-Rotation
Sources: deepdoc/parser/pdf_parser.py291-437 deepdoc/parser/pdf_parser.py200-289 deepdoc/parser/pdf_parser.py438-583 deepdoc/vision/table_structure_recognizer.py54-111
RAGFlow includes automatic table rotation correction. When enabled (auto_rotate=True), the system scores each candidate orientation with combined_score = avg_confidence * (1 + 0.1 * min(regions, 50) / 50), keeps the best-scoring rotation, and maps the detected coordinates back to the original page via _map_rotated_point(). This ensures tables printed in landscape orientation or upside-down are correctly processed.
Sources: deepdoc/parser/pdf_parser.py200-289 deepdoc/parser/pdf_parser.py438-583
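The scoring formula above weights average confidence up by at most 10% as the number of detected regions approaches 50. A sketch with illustrative candidate data:

```python
def combined_score(avg_confidence, regions):
    # formula from the source: confidence boosted by region count, capped at 50
    return avg_confidence * (1 + 0.1 * min(regions, 50) / 50)

# pick the rotation whose TSR result scores best (values are illustrative)
candidates = {
    0:   combined_score(0.60, 5),    # original orientation, few regions
    90:  combined_score(0.85, 40),   # rotated crop yields a confident parse
    180: combined_score(0.30, 2),
}
best_rotation = max(candidates, key=candidates.get)
```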
After TSR inference, OCR boxes within table regions are tagged with row, column, and header information using helper functions from the Recognizer class:
These tags enable the table construction logic in TableStructureRecognizer.construct_table() to correctly assemble HTML tables with proper row/column structure.
Sources: deepdoc/parser/pdf_parser.py407-437 deepdoc/vision/recognizer.py253-264 deepdoc/vision/recognizer.py267-281 deepdoc/vision/table_structure_recognizer.py152-393
The Recognizer class in deepdoc/vision/recognizer.py provides the foundation for all vision models:
Diagram 6: Recognizer Base Class Architecture
Sources: deepdoc/vision/recognizer.py31-53 deepdoc/vision/recognizer.py55-62 deepdoc/vision/recognizer.py114-132 deepdoc/vision/recognizer.py135-176 deepdoc/vision/recognizer.py178-215 deepdoc/vision/recognizer.py253-264 deepdoc/vision/recognizer.py267-281 deepdoc/vision/recognizer.py283-312 deepdoc/vision/recognizer.py314-407 deepdoc/vision/recognizer.py415-437
The vision system uses a configurable operator pipeline for image preprocessing. Operators are dynamically instantiated from configuration dictionaries:
Common operators include: NormalizeImage, ResizeImage, ToCHWImage, Pad, DetResizeForTest.
Sources: deepdoc/vision/ocr.py49-69
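The dynamic instantiation pattern can be sketched as follows; the operator classes here are simplified stand-ins for the real preprocessing operators:

```python
import numpy as np

class NormalizeImage:
    def __init__(self, scale=1.0 / 255):
        self.scale = scale
    def __call__(self, img):
        return img.astype("float32") * self.scale

class ToCHWImage:
    def __call__(self, img):
        return img.transpose(2, 0, 1)  # HWC -> CHW for ONNX input

OPS = {"NormalizeImage": NormalizeImage, "ToCHWImage": ToCHWImage}

def create_operators(op_configs):
    # each config dict holds one operator name mapped to its parameters
    ops = []
    for cfg in op_configs:
        (name, params), = cfg.items()
        ops.append(OPS[name](**(params or {})))
    return ops

pipeline = create_operators([{"NormalizeImage": {"scale": 1.0 / 255}},
                             {"ToCHWImage": None}])
img = np.ones((4, 6, 3), dtype=np.uint8)
for op in pipeline:
    img = op(img)
```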
The recognizers process images in batches for efficiency:
| Model | Default Batch Size | Configuration |
|---|---|---|
| Layout Recognition | 16 | batch_size parameter in __call__() |
| Text Recognition | 16 | self.rec_batch_num in TextRecognizer |
| Table Structure | Variable | All table crops from one page |
For batch processing, images are padded to the same dimensions before stacking into a single numpy array for ONNX inference.
Sources: deepdoc/vision/layout_recognizer.py63 deepdoc/vision/ocr.py142 deepdoc/vision/recognizer.py204-225
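The pad-and-stack step can be sketched as below, assuming HWC crops of varying sizes:

```python
import numpy as np

def pad_and_stack(images):
    # zero-pad every crop to the max height/width in the batch, then stack
    h = max(img.shape[0] for img in images)
    w = max(img.shape[1] for img in images)
    batch = np.zeros((len(images), h, w, 3), dtype=images[0].dtype)
    for i, img in enumerate(images):
        batch[i, :img.shape[0], :img.shape[1]] = img  # pad right/bottom
    return batch

crops = [np.ones((10, 20, 3), dtype=np.float32),
         np.ones((12, 16, 3), dtype=np.float32)]
batch = pad_and_stack(crops)  # single array ready for ONNX inference
```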
The RAGFlowPdfParser orchestrates the vision pipeline through three key methods:
Diagram 7: PDF Parser Vision Integration
Sources: deepdoc/parser/pdf_parser.py56-106 deepdoc/parser/pdf_parser.py585-649 deepdoc/parser/pdf_parser.py650-657 deepdoc/parser/pdf_parser.py291-437 deepdoc/parser/pdf_parser.py742-778 deepdoc/parser/pdf_parser.py999-1169
Different chunking methods in RAGFlow utilize vision processing differently:
| Chunking Method | Uses OCR | Uses Layout | Uses Table Structure | Parser Class |
|---|---|---|---|---|
| Naive (General) | ✓ | ✓ | ✓ | rag.app.naive.Pdf |
| Book | ✓ | ✓ | ✓ | rag.app.book.Pdf |
| Paper | ✓ | ✓ (domain-specific) | ✓ | rag.app.paper.Pdf |
| Manual | ✓ | ✓ (domain-specific) | ✓ | rag.app.manual.Pdf |
| Laws | ✓ | ✓ (domain-specific) | ✗ | rag.app.laws.Pdf |
| Q&A | ✓ | ✓ | ✓ | rag.app.qa.Pdf |
| Presentation | ✓ | ✓ | ✓ | rag.app.presentation.Pdf |
| One (entire file) | ✓ | ✓ | ✓ | rag.app.one.Pdf |
Table 2: Vision Processing Usage by Chunking Method
Sources: rag/app/naive.py544-581 rag/app/book.py32-58 rag/app/paper.py31-141 rag/app/manual.py33-68 rag/app/laws.py96-118 rag/app/qa.py79-189 rag/app/presentation.py34-114 rag/app/one.py30-56
After vision processing completes, each text box carries its recognized text together with the position and layout tags assigned in the earlier stages. These boxes flow into the text merging phase (_text_merge()), where they are concatenated according to document structure logic.
Sources: deepdoc/parser/pdf_parser.py595-649 deepdoc/parser/pdf_parser.py650-657 deepdoc/parser/pdf_parser.py407-437
RAGFlow supports Huawei Ascend NPUs as an alternative to ONNX/CUDA:
The AscendLayoutRecognizer class provides the same interface but uses Ascend CANN runtime for inference.
Sources: deepdoc/parser/pdf_parser.py74-88
Layout recognition can be delegated to a remote TensorRT-DLA server.
Sources: deepdoc/vision/layout_recognizer.py58-72
Models are loaded once and cached globally in the loaded_models dictionary. The cache key includes the device ID to support multi-GPU configurations.
Sources: deepdoc/vision/ocr.py71-80
GPU memory is controlled through ONNX Runtime provider options:
- `OCR_GPU_MEM_LIMIT_MB` caps GPU memory per model
- `OCR_ARENA_EXTEND_STRATEGY` controls allocation: `kNextPowerOfTwo` (default) or `kSameAsRequested`
- Set `OCR_GPUMEM_ARENA_SHRINKAGE=1` to release memory after inference

Sources: deepdoc/vision/ocr.py106-126
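These variables feed the CUDAExecutionProvider options; a sketch of assembling them (the option keys follow ONNX Runtime's CUDA provider options, and the dict would be passed as `providers=[("CUDAExecutionProvider", opts)]`):

```python
import os

def cuda_provider_options(device_id=0):
    limit_mb = int(os.environ.get("OCR_GPU_MEM_LIMIT_MB", "2048"))
    return {
        "device_id": device_id,
        "gpu_mem_limit": limit_mb * 1024 * 1024,  # ONNX Runtime wants bytes
        "arena_extend_strategy": os.environ.get(
            "OCR_ARENA_EXTEND_STRATEGY", "kNextPowerOfTwo"),
    }

opts = cuda_provider_options()
```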
To prevent CPU oversubscription in multi-worker deployments, the intra-op and inter-op thread counts default to 2 (see Table 1). Lower thread counts prevent thrashing when multiple document workers run concurrently.
Sources: deepdoc/vision/ocr.py99-101
Vision models are stored in rag/res/deepdoc/:
rag/res/deepdoc/
├── det.onnx # Text detection model
├── rec.onnx # Text recognition model
├── ocr.res # Character dictionary (6623 chars)
├── layout.onnx # Layout recognition (general)
├── layout.paper.onnx # Layout recognition (papers)
├── layout.laws.onnx # Layout recognition (legal docs)
├── layout.manual.onnx # Layout recognition (manuals)
├── tsr.onnx # Table structure recognition
└── updown_concat_xgb.model # Text merge classifier (XGBoost)
Sources: deepdoc/vision/ocr.py71-83 deepdoc/vision/layout_recognizer.py48-54 deepdoc/vision/table_structure_recognizer.py41-52 deepdoc/parser/pdf_parser.py101-105
If models are missing, RAGFlow automatically downloads them from HuggingFace.
Sources: deepdoc/vision/layout_recognizer.py48-54 deepdoc/vision/table_structure_recognizer.py41-52
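The download-on-missing pattern can be sketched as follows; `fetch_from_huggingface` is a hypothetical stand-in for the real HuggingFace download call, faked here so the sketch runs offline:

```python
import os
import tempfile

def fetch_from_huggingface(dest):
    # stand-in for the real model download; writes an empty placeholder file
    os.makedirs(dest, exist_ok=True)
    open(os.path.join(dest, "layout.onnx"), "wb").close()

def ensure_model(model_dir, filename="layout.onnx"):
    # download the model repository only if the file is not already present
    path = os.path.join(model_dir, filename)
    if not os.path.exists(path):
        fetch_from_huggingface(model_dir)
    return path

model_dir = tempfile.mkdtemp()
path = ensure_model(model_dir)
```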
RAGFlow's vision processing system combines OCR, layout recognition, and table structure analysis to extract structured information from document images. The modular design allows each component to be used independently or as part of the full document processing pipeline, supporting various document types and use cases.