Document Parsing Strategies

Purpose and Scope

This document explains RAGFlow's parser strategy selection system, which controls how documents are converted from raw formats (PDF, DOCX, etc.) into structured text and metadata. The system supports multiple parsing engines optimized for different document types and quality levels. This page focuses on parser selection and the different parsing strategies available. For information about what happens after parsing (chunking into segments), see Chunking Methods. For the vision processing components (OCR, layout recognition), see Vision Processing: OCR and Layout Recognition.

Sources: rag/app/naive.py1-241 deepdoc/parser/pdf_parser.py1-100

Parser Selection Architecture

RAGFlow selects a parser through the layout_recognize configuration parameter, which names the parsing engine that processes a document. The selection happens in the chunk function: the raw value is normalized by normalize_layout_recognizer and then dispatched to the matching parser.

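Parser Selection Flow

In outline: the layout_recognize value is normalized by normalize_layout_recognizer, the normalized name selects an entry in the PARSERS dictionary, and the selected by_* function returns (sections, tables, pdf_parser).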

Sources: rag/app/naive.py222-229 rag/app/naive.py826-850 common/parser_config_utils.py

PARSERS Dictionary Structure

The parser registry is implemented as a simple dictionary mapping strategy names to parser functions:

Key	Function	Description
`"deepdoc"`	`by_deepdoc`	Vision-based parsing using DeepDoc models (OCR + Layout)
`"mineru"`	`by_mineru`	LLM-based OCR parsing via MinerU framework
`"docling"`	`by_docling`	IBM Docling parser (external library)
`"tcadp"`	`by_tcadp`	Tencent Cloud ADP parser (cloud API)
`"paddleocr"`	`by_paddleocr`	PaddleOCR-based parsing
`"plaintext"`	`by_plaintext`	Simple text extraction without vision models

All parser functions share a common signature, defined in the source ranges cited below.
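The exact signature is not reproduced on this page. As a minimal sketch of the registry pattern described above, assuming a hypothetical call shape (filename, optional binary, keyword options) and stand-in function bodies, dispatch could look like this. The `dispatch` helper and the fallback to plaintext for unknown names are illustrative assumptions, not RAGFlow's code.

```python
from typing import Any, Callable, Optional, Tuple

# Hypothetical result shape, matching the (sections, tables, pdf_parser)
# tuple described on this page.
ParseResult = Tuple[Optional[list], Optional[list], Any]
ParserFn = Callable[..., ParseResult]


def by_deepdoc(filename: str, binary: Optional[bytes] = None, **kwargs: Any) -> ParseResult:
    # Stand-in body: the real function runs OCR and layout models.
    return [("deepdoc text", "")], [], None


def by_plaintext(filename: str, binary: Optional[bytes] = None, **kwargs: Any) -> ParseResult:
    # Stand-in body: the real function extracts text without vision models.
    return [("plain text", "")], [], None


# Registry keyed by lowercase strategy name, as in the table above
# (only two entries are shown to keep the sketch short).
PARSERS: dict[str, ParserFn] = {
    "deepdoc": by_deepdoc,
    "plaintext": by_plaintext,
}


def dispatch(strategy: str, filename: str, **kwargs: Any) -> ParseResult:
    """Look up a parser by name; unknown names fall back to plaintext (an assumption)."""
    parser = PARSERS.get(strategy.strip().lower(), by_plaintext)
    return parser(filename, **kwargs)
```

Keeping every parser behind the same call shape is what lets the caller treat the registry as a plain lookup table.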

Sources: rag/app/naive.py222-229 rag/app/naive.py58-220

Parser Strategy Implementations

DeepDoc Parser (Vision-Based)

DeepDoc is the default high-quality parser that uses ONNX-based vision models for OCR, layout recognition, and table structure detection.

Key Components:

Pdf class (rag/app/naive.py544-581): inherits from PdfParser (deepdoc/parser/pdf_parser.py55-106)
OCR stage (deepdoc/parser/pdf_parser.py585-649): text detection and recognition using self.ocr
Layout recognition (deepdoc/parser/pdf_parser.py650-657): classifies regions as text, title, table, figure, etc.
Table analysis (deepdoc/parser/pdf_parser.py291-437): identifies table structure (rows, columns, cells)
Text merging (deepdoc/parser/pdf_parser.py742-786): assembles recognized boxes into coherent text blocks

Sources: rag/app/naive.py58-71 deepdoc/parser/pdf_parser.py55-657 deepdoc/vision/ocr.py139-327 deepdoc/vision/layout_recognizer.py33-157

MinerU Parser (LLM-Based OCR)

MinerU leverages large language models for OCR, providing better accuracy for complex layouts at the cost of requiring LLM API access.

Configuration Flow:

Tenant lookup (rag/app/naive.py87-96): queries TenantLLMService for configured MinerU models
LLM initialization (rag/app/naive.py103-104): creates an LLMBundle with LLMType.OCR
Parsing (rag/app/naive.py105-113): calls the parse_pdf() method on the LLM model
Parse methods (rag/app/naive.py109): supports "raw" and other parsing methods

Key Parameters:

mineru_llm_name: Name of the MinerU LLM model to use
parse_method: Parsing strategy (typically "raw")
tenant_id: Required for accessing tenant-specific LLM configurations

Sources: rag/app/naive.py73-120 api/db/services/tenant_llm_service.py

Docling Parser (IBM)

Docling is a parser backed by an external library. RAGFlow checks that the library is installed before using it.

Environment Variables:

MINERU_OUTPUT_DIR: Directory for intermediate outputs (defaults to "")
MINERU_DELETE_OUTPUT: Whether to delete outputs after processing (defaults to 1)

Sources: rag/app/naive.py122-139 deepdoc/parser/docling_parser.py

TCADP Parser (Tencent Cloud)

TCADP uses Tencent Cloud's Document Analysis Platform API for parsing.

Configuration:

Requires Tencent Cloud API credentials
TCADP_OUTPUT_DIR: Output directory for intermediate files
file_type: Can be "PDF", "XLSX", or "CSV"

Sources: rag/app/naive.py141-150 deepdoc/parser/tcadp_parser.py

PaddleOCR Parser

PaddleOCR provides an alternative OCR engine. It is configured similarly to MinerU but uses the PaddlePaddle framework.

Sources: rag/app/naive.py152-200 api/db/services/tenant_llm_service.py

PlainText Parser (Fallback)

By default, the plaintext parser extracts text without vision models. When a vision model name is supplied as the layout recognizer, it uses that model for layout recognition instead.

Parser selection logic (rag/app/naive.py202-219):

If layout_recognizer is empty or "Plain Text": Use PlainParser (pure text extraction)
Otherwise: Use VisionParser with the specified vision model from LLMBundle
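The two-way choice above can be sketched as a small function. This is an illustration only: `pick_plaintext_backend` is a hypothetical helper that returns the backend's name as a string, so the sketch does not depend on the real PlainParser or VisionParser classes.

```python
from typing import Optional


def pick_plaintext_backend(layout_recognizer: Optional[str]) -> str:
    """Return which backend the plaintext strategy would use (hypothetical helper)."""
    # Empty value or the literal "Plain Text": pure text extraction.
    if not layout_recognizer or layout_recognizer == "Plain Text":
        return "PlainParser"
    # Any other value is treated as the name of a vision model.
    return "VisionParser"
```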

Sources: rag/app/naive.py202-220 deepdoc/parser/pdf_parser.py

Configuration and Runtime Selection

Configuration Parameter: `layout_recognize`

The layout_recognize parameter is the primary control for parser selection. It accepts:

Value	Behavior
`"DeepDOC"`	Uses DeepDoc vision-based parser (default)
`"MinerU"`	Uses MinerU LLM-based OCR
`"Docling"`	Uses IBM Docling parser
`"TCADP Parser"`	Uses Tencent Cloud ADP
`"PaddleOCR"`	Uses PaddleOCR framework
`"Plain Text"`	Simple text extraction without vision
`True`	Boolean compatibility: equivalent to `"DeepDOC"`
`False`	Boolean compatibility: equivalent to `"Plain Text"`
Vision model name	Uses `PlainText` parser with vision model for layout

Sources: rag/app/naive.py826-850 common/parser_config_utils.py

Normalization Process

The normalization step (common/parser_config_utils.py) does the following:

Converts boolean True/False to "DeepDOC"/"Plain Text"
Parses string format "ParserName/ModelName" into separate components
Returns tuple (layout_recognizer, parser_model_name)
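A minimal sketch of the three steps above, under stated assumptions: the function name `normalize_layout_recognizer_sketch` is hypothetical, the "/" separator is taken from the description on this page, and treating a missing value as "DeepDOC" follows the "(default)" note in the table above. The real implementation in common/parser_config_utils.py may differ.

```python
from typing import Optional, Tuple, Union


def normalize_layout_recognizer_sketch(
    raw: Union[bool, str, None],
) -> Tuple[str, Optional[str]]:
    """Return (layout_recognizer, parser_model_name) for a raw config value."""
    # Step 1: boolean compatibility values map to named strategies.
    if isinstance(raw, bool):
        return ("DeepDOC" if raw else "Plain Text"), None

    # Assumption: a missing or blank value falls back to the default strategy.
    if raw is None or not str(raw).strip():
        return "DeepDOC", None

    # Step 2: split "ParserName/ModelName" into its two components.
    text = str(raw).strip()
    if "/" in text:
        parser_name, model_name = text.split("/", 1)
        return parser_name.strip(), (model_name.strip() or None)

    # Step 3: a plain strategy name carries no model name.
    return text, None
```

The boolean check comes first so that legacy True/False configurations never reach the string handling.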

Sources: common/parser_config_utils.py rag/app/naive.py827

Integration with Chunk Function

The parser selection integrates into the main chunking pipeline.

Key Code Locations:

File type detection: rag/app/naive.py826-896
Parser invocation: rag/app/naive.py839-850
Output validation: rag/app/naive.py852-853

Sources: rag/app/naive.py737-896

Parser Output Format

All parsers return a consistent tuple format: (sections, tables, pdf_parser)

Sections Format

The position tag format is: @@{page}\t{left}\t{right}\t{top}\t{bottom}##
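As an illustration of the tag format, the following sketch extracts position tags from a text string and returns the text with the tags removed. The `Position` type and `split_position_tags` function are hypothetical names introduced here. The page field is kept as a string because this page does not state its exact type.

```python
import re
from typing import List, NamedTuple, Tuple


class Position(NamedTuple):
    page: str
    left: float
    right: float
    top: float
    bottom: float


# Matches @@{page}\t{left}\t{right}\t{top}\t{bottom}##
TAG_RE = re.compile(r"@@([^\t#]+)\t([\d.]+)\t([\d.]+)\t([\d.]+)\t([\d.]+)##")


def split_position_tags(text: str) -> Tuple[str, List[Position]]:
    """Return (text without tags, list of parsed positions)."""
    positions = [
        Position(m.group(1), *(float(m.group(i)) for i in range(2, 6)))
        for m in TAG_RE.finditer(text)
    ]
    return TAG_RE.sub("", text), positions
```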

Tables Format

Sources: rag/app/naive.py58-71 deepdoc/parser/pdf_parser.py568-581

Post-Parser Processing

After parser selection and execution, several post-processing steps may occur:

Token Number Override

For certain parsers that return pre-chunked content, the token number is set to 0 to prevent re-chunking. This occurs at rag/app/naive.py858-859.
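A minimal sketch of such an override, with loud assumptions: this page does not say which parsers return pre-chunked content, so the set is passed in by the caller, and the `chunk_token_num` key name and the `override_token_num` helper are hypothetical.

```python
from typing import Any, Dict, Iterable


def override_token_num(
    parser_config: Dict[str, Any],
    parser_name: str,
    pre_chunked: Iterable[str],
) -> Dict[str, Any]:
    """Return a copy of parser_config with the token number zeroed when needed."""
    updated = dict(parser_config)
    # Pre-chunked output should not be split again, so the token budget is set to 0.
    if parser_name in set(pre_chunked):
        updated["chunk_token_num"] = 0  # hypothetical key name
    return updated
```

Returning a copy rather than mutating the input keeps the caller's configuration unchanged, which is a design choice of this sketch rather than a claim about the original code.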

Vision Figure Enhancement

Tables and figures can be enhanced with vision model descriptions.

Sources: rag/app/naive.py64-70 deepdoc/parser/figure_parser.py

Error Handling and Fallbacks

Each parser implements its own error handling strategy:

Parser	Error Behavior
MinerU	Returns `(None, None, None)` if tenant_id is missing or the LLM is unavailable
Docling	Returns `(None, None, parser)` if the library is not installed
TCADP	Returns `(None, None, parser)` if the API credentials are invalid
PaddleOCR	Returns `(None, None, None)` if the tenant configuration is missing
PlainText	Always succeeds (fallback parser)

The chunk() function checks for None results and returns an empty list.
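A minimal sketch of that check, assuming failure is signaled by both sections and tables being None and that each section is a (text, position tag) pair. `chunk_sketch` is a hypothetical stand-in, not the real chunk() function.

```python
from typing import Any, List, Optional, Tuple

# Hypothetical result shape: (sections, tables, pdf_parser).
ParseResult = Tuple[Optional[list], Optional[list], Any]


def chunk_sketch(parse_result: ParseResult) -> List[str]:
    sections, tables, _pdf_parser = parse_result
    # A parser signals failure by returning None for both sections and tables;
    # the third element may or may not be None, so it is not inspected.
    if sections is None and tables is None:
        return []
    # Stand-in for the real chunking step: keep the text of each section.
    return [text for text, _tag in (sections or [])]
```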

Sources: rag/app/naive.py87-119 rag/app/naive.py126-128 rag/app/naive.py144-146 rag/app/naive.py852-853

Summary Table: Parser Comparison

Parser	Type	Requires LLM	External Dependency	Best For
DeepDoc	Vision-based	No	ONNX models	General documents, high quality
MinerU	LLM-based OCR	Yes	LLM API access	Complex layouts, low-quality scans
Docling	Library-based	No	Docling library	IBM ecosystem integration
TCADP	Cloud API	No	Tencent Cloud API	Cloud-based processing
PaddleOCR	Vision-based	Yes (tenant)	PaddlePaddle	Alternative to DeepDoc
PlainText	Text extraction	Optional	None	Simple documents, fallback

Sources: rag/app/naive.py222-229 rag/app/naive.py58-220
