This document explains RAGFlow's parser strategy selection system, which controls how documents are converted from raw formats (PDF, DOCX, etc.) into structured text and metadata. The system supports multiple parsing engines optimized for different document types and quality levels. This page focuses on parser selection and the different parsing strategies available. For information about what happens after parsing (chunking into segments), see Chunking Methods. For the vision processing components (OCR, layout recognition), see Vision Processing: OCR and Layout Recognition.
Sources: rag/app/naive.py1-241 deepdoc/parser/pdf_parser.py1-100
RAGFlow's parser selection mechanism is controlled by the layout_recognize configuration parameter, which determines which parsing engine processes a document. The selection happens in the chunk function and is normalized through normalize_layout_recognizer before being dispatched to the appropriate parser.
Sources: rag/app/naive.py222-229 rag/app/naive.py826-850 common/parser_config_utils.py
The parser registry is implemented as a simple dictionary mapping strategy names to parser functions:
| Key | Function | Description |
|---|---|---|
| "deepdoc" | by_deepdoc | Vision-based parsing using DeepDoc models (OCR + Layout) |
| "mineru" | by_mineru | LLM-based OCR parsing via MinerU framework |
| "docling" | by_docling | IBM Docling parser (external library) |
| "tcadp" | by_tcadp | Tencent Cloud ADP parser (cloud API) |
| "paddleocr" | by_paddleocr | PaddleOCR-based parsing |
| "plaintext" | by_plaintext | Simple text extraction without vision models |
All parser functions share a common signature.
Sources: rag/app/naive.py222-229 rag/app/naive.py58-220
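The registry-and-dispatch pattern described above can be sketched roughly as follows. This is an illustration only, not RAGFlow's code: the stub bodies, the simplified `(filename, **kwargs)` signature, the `PARSERS` and `select_parser` names, and the fallback to `by_plaintext` for unknown keys are all assumptions. Only the keys and function names come from the table.

```python
from typing import Any, Callable, Dict, List, Tuple

# (sections, tables, pdf_parser); the element types are an assumption.
ParseResult = Tuple[List[Tuple[str, str]], list, Any]


def _make_stub(key: str) -> Callable[..., ParseResult]:
    """Build a placeholder parser; the real parsers do far more work."""
    def parser(filename: str, **kwargs: Any) -> ParseResult:
        sections = [(f"parsed {filename} with {key}", "")]
        return sections, [], None
    parser.__name__ = f"by_{key}"
    return parser


by_deepdoc = _make_stub("deepdoc")
by_mineru = _make_stub("mineru")
by_docling = _make_stub("docling")
by_tcadp = _make_stub("tcadp")
by_paddleocr = _make_stub("paddleocr")
by_plaintext = _make_stub("plaintext")

# Strategy name -> parser function, mirroring the table above.
PARSERS: Dict[str, Callable[..., ParseResult]] = {
    "deepdoc": by_deepdoc,
    "mineru": by_mineru,
    "docling": by_docling,
    "tcadp": by_tcadp,
    "paddleocr": by_paddleocr,
    "plaintext": by_plaintext,
}


def select_parser(key: str) -> Callable[..., ParseResult]:
    # Assumption: unknown keys fall back to the plaintext parser.
    return PARSERS.get(key, by_plaintext)
```

Because every entry has the same call shape, the caller can invoke `select_parser(key)(filename, ...)` without branching on the strategy name.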
DeepDoc is the default high-quality parser that uses ONNX-based vision models for OCR, layout recognition, and table structure detection.
Key Components:
- Pdf class rag/app/naive.py544-581: Inherits from PdfParser deepdoc/parser/pdf_parser.py55-106
- self.ocr

Sources: rag/app/naive.py58-71 deepdoc/parser/pdf_parser.py55-657 deepdoc/vision/ocr.py139-327 deepdoc/vision/layout_recognizer.py33-157
MinerU leverages large language models for OCR, providing better accuracy for complex layouts at the cost of requiring LLM API access.
Configuration Flow:
1. Queries TenantLLMService for configured MinerU models
2. Creates an LLMBundle with LLMType.OCR
3. Calls the parse_pdf() method on the LLM model
4. Supports "raw" or other parsing methods

Key Parameters:

- mineru_llm_name: Name of the MinerU LLM model to use
- parse_method: Parsing strategy (typically "raw")
- tenant_id: Required for accessing tenant-specific LLM configurations

Sources: rag/app/naive.py73-120 api/db/services/tenant_llm_service.py
Docling is an external library-based parser that requires installation verification.
Environment Variables:
- MINERU_OUTPUT_DIR: Directory for intermediate outputs (defaults to "")
- MINERU_DELETE_OUTPUT: Whether to delete outputs after processing (defaults to 1)

Sources: rag/app/naive.py122-139 deepdoc/parser/docling_parser.py
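Reading these two variables with their documented defaults might look like the sketch below. The helper name `read_output_settings` and the optional `env` argument are hypothetical, and treating MINERU_DELETE_OUTPUT as an integer flag (0 or 1) is an assumption based on its default of 1.

```python
import os
from typing import Mapping, Optional, Tuple


def read_output_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[str, bool]:
    """Return (output_dir, delete_output) using the documented defaults."""
    env = os.environ if env is None else env
    # Defaults to "" when the variable is unset.
    output_dir = env.get("MINERU_OUTPUT_DIR", "")
    # Defaults to 1; assumed to be an integer flag where 0 means "keep outputs".
    delete_output = bool(int(env.get("MINERU_DELETE_OUTPUT", 1)))
    return output_dir, delete_output
```

Passing a plain dictionary as `env` makes the behaviour easy to check without touching the real process environment.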
TCADP uses Tencent Cloud's Document Analysis Platform API for parsing.
Configuration:
- TCADP_OUTPUT_DIR: Output directory for intermediate files
- file_type: Can be "PDF", "XLSX", or "CSV"

Sources: rag/app/naive.py141-150 deepdoc/parser/tcadp_parser.py
PaddleOCR provides an alternative OCR engine, similar to MinerU but using the PaddlePaddle framework.
Sources: rag/app/naive.py152-200 api/db/services/tenant_llm_service.py
The plaintext parser provides simple text extraction without vision models. It can optionally use a vision model for layout recognition.
Parser Selection Logic rag/app/naive.py202-219:
- layout_recognizer is empty or "Plain Text": Use PlainParser (pure text extraction)
- Otherwise: Use VisionParser with the specified vision model from LLMBundle

Sources: rag/app/naive.py202-220 deepdoc/parser/pdf_parser.py
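The branch described above can be illustrated with a small stand-alone function. It is a sketch under stated assumptions: `choose_plaintext_parser` is a hypothetical name, and it returns class names as strings rather than instantiating RAGFlow's PlainParser or VisionParser, so it runs without the project installed.

```python
from typing import Optional, Tuple


def choose_plaintext_parser(layout_recognizer: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (parser class name, vision model name or None)."""
    # Empty or "Plain Text": pure text extraction, no vision model involved.
    if not layout_recognizer or layout_recognizer == "Plain Text":
        return "PlainParser", None
    # Anything else is treated as the name of a vision model.
    return "VisionParser", layout_recognizer
```

For example, `choose_plaintext_parser("")` selects PlainParser, while passing a model name selects VisionParser together with that name.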
The layout_recognize parameter is the primary control for parser selection. It accepts:
| Value | Behavior |
|---|---|
| "DeepDOC" | Uses DeepDoc vision-based parser (default) |
| "MinerU" | Uses MinerU LLM-based OCR |
| "Docling" | Uses IBM Docling parser |
| "TCADP Parser" | Uses Tencent Cloud ADP |
| "PaddleOCR" | Uses PaddleOCR framework |
| "Plain Text" | Simple text extraction without vision |
| True | Boolean compatibility: equivalent to "DeepDOC" |
| False | Boolean compatibility: equivalent to "Plain Text" |
| Vision model name | Uses PlainText parser with vision model for layout |
Sources: rag/app/naive.py826-850 common/parser_config_utils.py
The normalization step (common/parser_config_utils.py):

- Converts True/False to "DeepDOC"/"Plain Text"
- Splits "ParserName/ModelName" into separate components
- Returns (layout_recognizer, parser_model_name)

Sources: common/parser_config_utils.py rag/app/naive.py827
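A minimal sketch of the normalization behaviour listed above follows. It is not the code in common/parser_config_utils.py: the function name `normalize_layout_recognizer_sketch` is hypothetical, and splitting on the first "/" simply follows the "ParserName/ModelName" description, so a vision model name that itself contains "/" would be split too.

```python
from typing import Any, Optional, Tuple


def normalize_layout_recognizer_sketch(raw: Any) -> Tuple[Any, Optional[str]]:
    """Return (layout_recognizer, parser_model_name)."""
    # Boolean compatibility: True -> "DeepDOC", False -> "Plain Text".
    if isinstance(raw, bool):
        return ("DeepDOC" if raw else "Plain Text"), None
    # "ParserName/ModelName" -> separate components (split on the first "/").
    if isinstance(raw, str) and "/" in raw:
        parser_name, model_name = raw.split("/", 1)
        return parser_name, model_name
    # Plain parser names pass through with no model name attached.
    return raw, None
```

Returning a two-element tuple in every branch lets the caller unpack the result unconditionally.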
The parser selection integrates into the main chunking pipeline, implemented in the chunk function.
Sources: rag/app/naive.py737-896
All parsers return a consistent tuple format: (sections, tables, pdf_parser)
The position tag format is: @@{page}\t{left}\t{right}\t{top}\t{bottom}##
Sources: rag/app/naive.py58-71 deepdoc/parser/pdf_parser.py568-581
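To make the position tag format concrete, the sketch below extracts and strips tags of the form `@@{page}\t{left}\t{right}\t{top}\t{bottom}##` with a regular expression. The `Position` type and both helper names are hypothetical, the coordinates are assumed to be plain decimal numbers, and the page field is kept as a string because the format above does not say whether it is always a single integer.

```python
import re
from typing import List, NamedTuple


class Position(NamedTuple):
    page: str      # kept as text; the exact page syntax is not specified here
    left: float
    right: float
    top: float
    bottom: float


# @@{page}\t{left}\t{right}\t{top}\t{bottom}##
TAG_RE = re.compile(r"@@([^\t]+)\t([0-9.]+)\t([0-9.]+)\t([0-9.]+)\t([0-9.]+)##")


def extract_positions(text: str) -> List[Position]:
    """Collect every position tag found in the text, in order."""
    return [
        Position(m.group(1), *(float(m.group(i)) for i in range(2, 6)))
        for m in TAG_RE.finditer(text)
    ]


def strip_tags(text: str) -> str:
    """Remove position tags, leaving only the readable text."""
    return TAG_RE.sub("", text)
```

A caller could use `extract_positions` to recover the bounding boxes for highlighting and `strip_tags` to obtain clean text for display.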
After parser selection and execution, several post-processing steps may occur:
For certain parsers that return pre-chunked content, the token number is set to 0 to prevent re-chunking. This occurs at rag/app/naive.py858-859.
Tables and figures can be enhanced with vision model descriptions.
Sources: rag/app/naive.py64-70 deepdoc/parser/figure_parser.py
Each parser implements its own error handling strategy:
| Parser | Error Behavior |
|---|---|
| MinerU | Returns (None, None, None) if tenant_id missing or LLM unavailable |
| Docling | Returns (None, None, parser) if library not installed |
| TCADP | Returns (None, None, parser) if API credentials invalid |
| PaddleOCR | Returns (None, None, None) if tenant configuration missing |
| PlainText | Always succeeds (fallback parser) |
The chunk() function checks for None results and returns an empty list.
Sources: rag/app/naive.py87-119 rag/app/naive.py126-128 rag/app/naive.py144-146 rag/app/naive.py852-853
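The guard described above might be sketched as follows. This is an illustration, not the code at rag/app/naive.py852-853: `chunk_sketch` and the two toy parsers are hypothetical, the exact condition used by chunk() is not shown in this page, and sections are assumed to be (text, position tag) pairs.

```python
from typing import Any, Callable, List, Tuple

ParseResult = Tuple[Any, Any, Any]  # (sections, tables, pdf_parser)


def chunk_sketch(parse: Callable[[str], ParseResult], filename: str) -> List[str]:
    sections, tables, _pdf_parser = parse(filename)
    # A failed parser signals the error by returning None results.
    if sections is None and tables is None:
        return []
    # Assumption: each section is a (text, position_tag) pair.
    return [text for text, _tag in sections]


def failing_parser(filename: str) -> ParseResult:
    # Mirrors the "(None, None, None)" error behaviour from the table above.
    return None, None, None


def working_parser(filename: str) -> ParseResult:
    return [("first paragraph", ""), ("second paragraph", "")], [], None
```

With this guard, a missing dependency or configuration yields an empty result instead of an exception further down the pipeline.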
| Parser | Type | Requires LLM | External Dependency | Best For |
|---|---|---|---|---|
| DeepDoc | Vision-based | No | ONNX models | General documents, high quality |
| MinerU | LLM-based OCR | Yes | LLM API access | Complex layouts, low-quality scans |
| Docling | Library-based | No | Docling library | IBM ecosystem integration |
| TCADP | Cloud API | No | Tencent Cloud API | Cloud-based processing |
| PaddleOCR | Vision-based | Yes (tenant) | PaddlePaddle | Alternative to DeepDoc |
| PlainText | Text extraction | Optional | None | Simple documents, fallback |