Hybrid Backend

Purpose and Scope

The Hybrid Backend is MinerU's default and recommended processing backend. It combines Vision-Language Model (VLM) capabilities with specialized computer vision models to achieve high accuracy across multiple languages. For each document it chooses between two processing modes, VLM-OCR mode and hybrid mode, based on the OCR setting, the document language, and the inline formula setting.

This page covers the hybrid backend's architecture, decision logic, processing modes, and integration points. For information about:

Pure VLM processing without specialized models, see VLM Backend
Traditional multi-model pipeline processing, see Pipeline Backend
Overall backend selection and orchestration, see Core Orchestration

Sources: mineru/backend/hybrid/hybrid_analyze.py1-527

Architecture Overview

Processing Flow

Figure 1: Hybrid Backend Decision Tree and Processing Flow

Sources: mineru/backend/hybrid/hybrid_analyze.py384-454 mineru/backend/hybrid/hybrid_analyze.py456-526

Decision Logic

The `_should_enable_vlm_ocr` Function

The core decision function determines whether to use VLM-only processing or hybrid mode:

Figure 2: VLM-OCR Decision Logic

The function evaluates three primary conditions:

| Condition | Requirement | Rationale |
| --- | --- | --- |
| `ocr_enable` | Must be `True` | Document is image-based and needs OCR |
| `language` | Must be `"ch"` or `"en"` | VLM OCR is currently optimized for Chinese and English |
| `inline_formula_enable` | Must be `True` | Formula processing requested |

Environment Variable Overrides:

MINERU_FORCE_VLM_OCR_ENABLE: When set to "1", "true", or "yes", forces VLM-OCR mode regardless of the conditions above
MINERU_HYBRID_FORCE_PIPELINE_ENABLE: When set to "1", "true", or "yes", forces hybrid mode regardless of the conditions above

Sources: mineru/backend/hybrid/hybrid_analyze.py369-381
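
The decision described above can be sketched as follows. This is a minimal illustration, not the code in hybrid_analyze.py: the function name, the `env` parameter (added so the logic can be exercised without touching the real environment), and the precedence between the two overrides when both are set are assumptions.

```python
import os
from typing import Mapping, Optional

_TRUTHY = ("1", "true", "yes")


def should_enable_vlm_ocr(
    ocr_enable: bool,
    language: str,
    inline_formula_enable: bool,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True for VLM-OCR mode, False for hybrid mode."""
    env = os.environ if env is None else env

    def flag(name: str) -> bool:
        return env.get(name, "0").lower() in _TRUTHY

    # Assumption: the force-VLM override wins if both overrides are set.
    if flag("MINERU_FORCE_VLM_OCR_ENABLE"):
        return True
    if flag("MINERU_HYBRID_FORCE_PIPELINE_ENABLE"):
        return False
    # Without overrides, all three conditions from the table must hold.
    return ocr_enable and language in ("ch", "en") and inline_formula_enable
```

For example, an English scanned document with inline formulas enabled selects VLM-OCR mode, while the same document with language "latin" falls back to hybrid mode.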

VLM-OCR Mode

When VLM-OCR Mode is Active

VLM-OCR mode is triggered when all conditions are met (or forced via environment variable). In this mode, the VLM performs complete end-to-end processing:

Figure 3: VLM-OCR Mode Data Flow

In this mode:

VLM extracts layout, text, and formulas in a single pass
No specialized formula models (MFD/MFR) are invoked
No PaddleOCR detection is performed
inline_formula_list and ocr_res_list are set to empty lists
hybrid_pipeline_model is set to None

Sources: mineru/backend/hybrid/hybrid_analyze.py416-420 mineru/backend/hybrid/hybrid_analyze.py488-492

Hybrid Mode

Overview

Hybrid mode leverages VLM for layout detection while delegating specialized tasks to dedicated models:

Figure 4: Hybrid Mode Processing Pipeline

Sources: mineru/backend/hybrid/hybrid_analyze.py422-435 mineru/backend/hybrid/hybrid_analyze.py494-507

Image Masking

The mask_image_regions function masks image, table, and existing equation regions before formula detection, so that the formula detection model does not report formulas inside them.

Sources: mineru/backend/hybrid/hybrid_analyze.py169-189
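
The masking idea can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name `mask_regions`, the `(x0, y0, x1, y1)` pixel box format, the white fill value, and the use of nested lists in place of the image arrays a real pipeline would operate on.

```python
from typing import List, Sequence

Image = List[List[int]]  # grayscale image as rows of pixel values


def mask_regions(image: Image, boxes: Sequence[Sequence[int]], fill: int = 255) -> Image:
    """Return a copy of `image` with every (x0, y0, x1, y1) box filled."""
    height = len(image)
    width = len(image[0]) if height else 0
    masked = [row[:] for row in image]  # leave the input untouched
    for x0, y0, x1, y1 in boxes:
        # Clamp to the image so out-of-range boxes cannot raise.
        x0, x1 = max(0, x0), min(width, x1)
        y0, y1 = max(0, y0), min(height, y1)
        for y in range(y0, y1):
            for x in range(x0, x1):
                masked[y][x] = fill
    return masked


page = [[0] * 6 for _ in range(4)]
masked = mask_regions(page, [(1, 1, 4, 3)])  # hide one table-like region
```

Because the masked copy is a separate image, the original page remains available for the later OCR steps, at the cost of the extra memory noted under Performance Characteristics.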

Formula Detection and Recognition

Formula processing uses the MFD (Mathematical Formula Detection) and MFR (Mathematical Formula Recognition) models:

Figure 5: Formula Processing Flow

The MFD results are then converted to a format compatible with OCR processing.

Sources: mineru/backend/hybrid/hybrid_analyze.py226-250

OCR Detection

The ocr_det function handles text detection with two modes:

Non-Batch Mode

Processes each text block individually:

Figure 6: Non-Batch OCR Detection Flow

Sources: mineru/backend/hybrid/hybrid_analyze.py52-86

Batch Mode

Groups images by resolution for efficient batch processing:

Figure 7: Batch OCR Detection Flow

Key optimization: Images are grouped by resolution and padded to common dimensions to maximize GPU utilization.

Sources: mineru/backend/hybrid/hybrid_analyze.py87-167
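
A rough sketch of the grouping step: each image size is rounded up to a common padded size, and images that share a padded size can be stacked into one batch. The stride of 64 pixels and all function names are hypothetical; the source only states that images are grouped by resolution and padded to common dimensions.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Size = Tuple[int, int]  # (height, width)


def round_up(value: int, stride: int) -> int:
    """Round `value` up to the next multiple of `stride`."""
    return -(-value // stride) * stride


def group_by_padded_size(sizes: List[Size], stride: int = 64) -> Dict[Size, List[int]]:
    """Map each padded (height, width) to the indices of images that fit it."""
    groups: Dict[Size, List[int]] = defaultdict(list)
    for index, (height, width) in enumerate(sizes):
        key = (round_up(height, stride), round_up(width, stride))
        groups[key].append(index)
    return dict(groups)


def iter_batches(indices: List[int], batch_size: int) -> Iterator[List[int]]:
    """Split one group into chunks of at most `batch_size` images."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


groups = group_by_padded_size([(30, 100), (60, 120), (70, 300)])
```

In this example the first two crops share the padded size (64, 128) and can run in one batch, while the third falls into its own (128, 320) group.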

OCR Recognition

If _ocr_enable is True, the cropped text regions are passed through OCR recognition.

Results whose score is below OcrConfidence.min_confidence are treated as low-confidence and marked with category_id = 16.

Sources: mineru/backend/hybrid/hybrid_analyze.py263-300
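
The low-confidence rule can be sketched as below. The span dictionaries, the `(text, score)` result shape, and the threshold value 0.5 are assumptions; only the category id 16 and the comparison against a minimum confidence come from the description above.

```python
from typing import Dict, List, Tuple

LOW_CONFIDENCE_CATEGORY_ID = 16  # value stated in the text above
MIN_CONFIDENCE = 0.5  # hypothetical stand-in for OcrConfidence.min_confidence


def apply_recognition(
    spans: List[Dict],
    results: List[Tuple[str, float]],
    min_confidence: float = MIN_CONFIDENCE,
) -> List[Dict]:
    """Attach (text, score) results to spans and flag low-confidence ones."""
    updated = []
    for span, (text, score) in zip(spans, results):
        span = {**span, "text": text, "score": score}
        if score < min_confidence:
            span["category_id"] = LOW_CONFIDENCE_CATEGORY_ID
        updated.append(span)
    return updated


spans = [{"bbox": [0, 0, 10, 5]}, {"bbox": [0, 6, 10, 11]}]
recognized = apply_recognition(spans, [("Hello", 0.98), ("w0r1d", 0.21)])
```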

Coordinate Normalization

The _normalize_bbox function converts polygon coordinates to normalized bounding boxes.

Sources: mineru/backend/hybrid/hybrid_analyze.py191-200 mineru/backend/hybrid/hybrid_analyze.py303-321
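
As an illustration of this kind of conversion, the sketch below takes a flat polygon, computes its axis-aligned bounding box, and scales it to the range [0, 1] by the page size. The signature, the flat `[x1, y1, x2, y2, ...]` input layout, and the clamping are assumptions and may differ from the actual _normalize_bbox.

```python
from typing import List, Sequence


def normalize_bbox(poly: Sequence[float], page_width: float, page_height: float) -> List[float]:
    """Convert a flat polygon [x1, y1, x2, y2, ...] to [x0, y0, x1, y1] in [0, 1]."""
    xs, ys = poly[0::2], poly[1::2]

    def clamp(value: float) -> float:
        return min(1.0, max(0.0, value))

    return [
        clamp(min(xs) / page_width),
        clamp(min(ys) / page_height),
        clamp(max(xs) / page_width),
        clamp(max(ys) / page_height),
    ]


box = normalize_bbox([10, 20, 110, 20, 110, 60, 10, 60], page_width=200, page_height=100)
```

Normalized coordinates keep detections comparable across models that see the page at different pixel resolutions.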

Batch Ratio Configuration

Dynamic Batch Sizing

The get_batch_ratio function determines batch sizes based on GPU memory:

| GPU Memory | Batch Ratio | Formula Batch | OCR Det Batch |
| --- | --- | --- | --- |
| ≥ 32 GB | 16 | 256 | 256 |
| ≥ 16 GB | 8 | 128 | 128 |
| ≥ 12 GB | 4 | 64 | 64 |
| ≥ 8 GB | 2 | 32 | 32 |
| < 8 GB | 1 | 16 | 16 |

Formula batch size: batch_ratio * MFR_BASE_BATCH_SIZE (where MFR_BASE_BATCH_SIZE = 16)
OCR detection batch size: batch_ratio * OCR_DET_BASE_BATCH_SIZE (where OCR_DET_BASE_BATCH_SIZE = 16)
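
The thresholds and the two formulas above can be written out as a short sketch. The function names are illustrative and are not taken from get_batch_ratio itself; the thresholds and base sizes are the ones listed above.

```python
from typing import Tuple

MFR_BASE_BATCH_SIZE = 16
OCR_DET_BASE_BATCH_SIZE = 16


def batch_ratio_for_vram(vram_gb: float) -> int:
    """Pick the batch ratio for a given amount of GPU memory in GB."""
    for threshold, ratio in ((32, 16), (16, 8), (12, 4), (8, 2)):
        if vram_gb >= threshold:
            return ratio
    return 1


def batch_sizes(vram_gb: float) -> Tuple[int, int]:
    """Return (formula batch size, OCR detection batch size)."""
    ratio = batch_ratio_for_vram(vram_gb)
    return ratio * MFR_BASE_BATCH_SIZE, ratio * OCR_DET_BASE_BATCH_SIZE
```

A 24 GB GPU, for instance, falls into the "≥ 16 GB" row: a batch ratio of 8 and batch sizes of 128 for both formula recognition and OCR detection.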

Environment Variable Override

When MINERU_HYBRID_BATCH_RATIO is set, its value is used directly instead of auto-detection. Setting it explicitly is recommended for client-server deployments, where GPU memory is shared by multiple concurrent clients.

Client-Server Deployment Guidelines:

| Single Client VRAM | Recommended MINERU_HYBRID_BATCH_RATIO |
| --- | --- |
| ≤ 2.5 GB | 1 |
| ≤ 3 GB | 2 |
| ≤ 4.5 GB | 4 |
| ≤ 6 GB | 8 |

Note: When deploying multiple concurrent clients, keep one additional client's worth of VRAM free as headroom.

Sources: mineru/backend/hybrid/hybrid_analyze.py323-366
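
One way to picture the override is a resolver that consults the environment variable first and only then falls back to auto-detection. This is a sketch under assumptions: the function name, the `auto_detect` callback, and the fallback on invalid or non-positive values are hypothetical and may not match how get_batch_ratio treats bad input.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_batch_ratio(
    auto_detect: Callable[[], int],
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Prefer MINERU_HYBRID_BATCH_RATIO; otherwise use the auto-detected ratio."""
    env = os.environ if env is None else env
    raw = env.get("MINERU_HYBRID_BATCH_RATIO")
    if raw is not None:
        try:
            value = int(raw)
            if value > 0:
                return value
        except ValueError:
            pass  # Assumption: an unparsable value falls back to auto-detection.
    return auto_detect()
```

On a shared 24 GB server, exporting MINERU_HYBRID_BATCH_RATIO=2 would cap each client at the "≤ 3 GB" row of the table instead of the auto-detected ratio for the full card.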

Model Singleton Management

HybridModelSingleton

The HybridModelSingleton class manages hybrid-specific models with lazy initialization:

Figure 8: HybridModelSingleton Architecture

The singleton ensures models are loaded only once per configuration, reducing memory overhead and initialization time.

Sources: Referenced in mineru/backend/hybrid/hybrid_analyze.py220-224
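
The general pattern, a single shared registry that builds each model configuration on first use, can be sketched as follows. The class name, the `get_model` signature, and the `(language, formula_enable)` key are hypothetical and are not the interface of HybridModelSingleton.

```python
from typing import Any, Callable, Dict, List, Tuple


class LazyModelRegistry:
    """One shared instance that builds each model configuration at most once."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}  # type: Dict[Tuple, Any]
        return cls._instance

    def get_model(self, key: Tuple, factory: Callable[[], Any]) -> Any:
        if key not in self._models:
            self._models[key] = factory()  # loaded on first request only
        return self._models[key]


load_calls: List[Tuple] = []


def load_stub() -> object:
    """Stand-in for an expensive model load; records each invocation."""
    load_calls.append(("en", True))
    return object()


first = LazyModelRegistry().get_model(("en", True), load_stub)
second = LazyModelRegistry().get_model(("en", True), load_stub)
```

Both lookups return the same object and the stub loader runs once, which is the behavior that saves memory and start-up time when many pages share a configuration.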

Middle JSON Generation

The `result_to_middle_json` Function

This function orchestrates the transformation of model outputs into the standardized middle.json format:

Figure 9: Middle JSON Generation Pipeline

Post-Processing Steps

Post-OCR Recognition: For remaining text spans with np_img field, performs OCR recognition
Cross-Page Table Merging: Detects and merges tables split across pages
LLM-Aided Title Optimization: Assigns hierarchical title levels using LLM (if configured)

Sources: mineru/backend/hybrid/hybrid_model_output_to_middle_json.py135-212

Entry Points

Synchronous: `doc_analyze`

Figure 10: Synchronous doc_analyze Flow

Sources: mineru/backend/hybrid/hybrid_analyze.py384-454

Asynchronous: `aio_doc_analyze`

Mirrors doc_analyze but uses async operations:

The VLM step calls await predictor.aio_batch_two_step_extract() instead of predictor.batch_two_step_extract()
Formula and OCR processing remain synchronous (no async implementations are available)

Sources: mineru/backend/hybrid/hybrid_analyze.py456-526
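
The relationship between the two entry points can be sketched with a stub predictor. Only the two method names come from the description above; the stub class, the argument shapes, and the `post_process` placeholder are assumptions for illustration.

```python
import asyncio
from typing import List


class StubPredictor:
    """Hypothetical stand-in for the VLM predictor."""

    def batch_two_step_extract(self, images: List[str]) -> List[dict]:
        return [{"image": name, "blocks": []} for name in images]

    async def aio_batch_two_step_extract(self, images: List[str]) -> List[dict]:
        await asyncio.sleep(0)  # yield to the event loop, as real inference I/O would
        return self.batch_two_step_extract(images)


def post_process(results: List[dict]) -> List[dict]:
    """Placeholder for the synchronous formula/OCR stage."""
    return [{**result, "post_processed": True} for result in results]


def analyze(predictor: StubPredictor, images: List[str]) -> List[dict]:
    return post_process(predictor.batch_two_step_extract(images))


async def aio_analyze(predictor: StubPredictor, images: List[str]) -> List[dict]:
    results = await predictor.aio_batch_two_step_extract(images)
    return post_process(results)  # plain call: this stage blocks the event loop


sync_out = analyze(StubPredictor(), ["page0", "page1"])
async_out = asyncio.run(aio_analyze(StubPredictor(), ["page0", "page1"]))
```

Since only the VLM step is awaited, a long formula or OCR stage still occupies the event loop while it runs, which is worth keeping in mind when serving many concurrent requests.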

Orchestration Layer Integration

Backend Selection in `do_parse`

The orchestration layer in common.py handles backend routing:

Figure 11: Orchestration Layer Backend Selection

Key differences from VLM backend:

MINERU_VLM_FORMULA_ENABLE is always set to "true" (formulas handled by specialized models)
MINERU_VLM_TABLE_ENABLE is user-configurable via table_enable parameter
inline_formula_enable parameter controls inline formula processing in hybrid mode

Sources: mineru/cli/common.py465-483 mineru/cli/common.py538-555
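
The environment settings listed above can be summarized in a small sketch. The helper names are hypothetical, and deriving the engine name by stripping a "hybrid-" prefix from strings such as "hybrid-auto-engine" is an assumption about how common.py routes the backend.

```python
from typing import Dict, Optional


def hybrid_env_settings(table_enable: bool) -> Dict[str, str]:
    """Environment values the orchestration layer applies for the hybrid backend."""
    return {
        "MINERU_VLM_FORMULA_ENABLE": "true",  # always on for hybrid
        "MINERU_VLM_TABLE_ENABLE": str(table_enable).lower(),
    }


def hybrid_engine(backend: str) -> Optional[str]:
    """Return the engine part of a hybrid backend string, or None otherwise."""
    prefix = "hybrid-"
    if backend.startswith(prefix):
        return backend[len(prefix):]
    return None
```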

Configuration

Environment Variables

| Variable | Values | Default | Effect |
| --- | --- | --- | --- |
| `MINERU_FORCE_VLM_OCR_ENABLE` | `"1"`, `"true"`, `"yes"` | Not set | Forces VLM-OCR mode |
| `MINERU_HYBRID_FORCE_PIPELINE_ENABLE` | `"1"`, `"true"`, `"yes"` | Not set | Forces hybrid mode |
| `MINERU_HYBRID_BATCH_RATIO` | Integer | Auto-detect | Overrides batch ratio calculation |
| `MINERU_VLM_TABLE_ENABLE` | `"true"`, `"false"` | `"true"` | Enables table recognition |
| `MINERU_VLM_FORMULA_ENABLE` | `"true"`, `"false"` | `"true"` | Always true in hybrid backend |

Function Parameters

parse_method values:

"auto": Classify document automatically (text-based vs image-based)
"txt": Assume text-based PDF, disable OCR
"ocr": Force OCR processing

language values: See Multi-Language Support for complete list (e.g., "ch", "en", "latin", "arabic")

Sources: mineru/backend/hybrid/hybrid_analyze.py384-395
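
How the three parse_method values translate into an OCR decision can be sketched as below. The function name and the `is_image_based` callback, which stands in for the automatic document classification, are hypothetical.

```python
from typing import Callable


def resolve_ocr_enable(parse_method: str, is_image_based: Callable[[], bool]) -> bool:
    """Map parse_method to an OCR on/off decision."""
    if parse_method == "ocr":
        return True  # forced, regardless of document type
    if parse_method == "txt":
        return False  # trust the embedded text layer
    if parse_method == "auto":
        return is_image_based()  # classify only when needed
    raise ValueError(f"unknown parse_method: {parse_method!r}")
```

The resulting flag is the `ocr_enable` input to the decision logic: a scanned document under "auto" yields True, which is one of the three conditions for VLM-OCR mode.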

Performance Characteristics

VLM-OCR Mode

Advantages:

Single-pass processing (faster for supported languages)
No coordinate alignment issues between models
Simpler pipeline with fewer model invocations

Limitations:

Chinese/English only
Requires GPU for VLM inference
Less extensible (no specialized formula models)

Hybrid Mode

Advantages:

Supports all OCR languages (15+ language families)
Specialized models for formulas (higher accuracy)
Batch processing optimizations for GPU efficiency
Extensible architecture

Overhead:

Multiple model invocations (VLM + MFD + MFR + OCR)
Coordinate conversion between model spaces
Additional memory for masked images

Benchmark Comparison

From system overview:

Pipeline Backend: 82+ accuracy, CPU-friendly
VLM Backend: 90+ accuracy, GPU-required, Chinese/English only
Hybrid Backend (Default): 90+ accuracy, GPU-required, multi-language

Sources: Referenced in Diagram 2 of system overview

Error Handling

Common Issues

Async/Sync Mismatch:
- vllm-async-engine cannot be used in do_parse (sync mode)
- vllm-engine cannot be used in aio_do_parse (async mode)
- Use auto-engine for automatic selection

Model Loading Failures:
- HybridModelSingleton initializes models lazily, so loading errors surface on first use
- Check the MINERU_MODEL_SOURCE environment variable
- Verify model paths in configuration

GPU Memory Exhaustion:
- Set MINERU_HYBRID_BATCH_RATIO to a lower value
- Monitor GPU memory with get_vram(device)
- Consider splitting large documents

Sources: mineru/cli/common.py468-470 mineru/cli/common.py541-542
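
The async/sync rule can be expressed as a small guard. This is a sketch: the function name, the exception type, and the messages are assumptions, and only the two forbidden combinations come from the list above.

```python
def check_engine(engine: str, is_async: bool) -> None:
    """Raise if the engine does not match the calling mode."""
    if engine == "vllm-async-engine" and not is_async:
        raise ValueError("vllm-async-engine requires the async entry point (aio_do_parse)")
    if engine == "vllm-engine" and is_async:
        raise ValueError("vllm-engine requires the sync entry point (do_parse)")
    # Any other engine, including auto-engine, is accepted in both modes here.
```

Running such a check before any model is loaded turns a late runtime failure into an immediate, readable error.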

Integration Examples

CLI Usage

Sources: mineru/cli/client.py60-69

FastAPI Usage

The API automatically:

Maps backend="hybrid-auto-engine" to hybrid processing
Creates output directory: {output_dir}/{pdf_name}/hybrid_{parse_method}/
Returns middle.json, Markdown, and optional debug outputs

Sources: mineru/cli/fast_api.py154-162 mineru/cli/fast_api.py286-289
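
The output directory pattern above can be reproduced with pathlib. The helper name is hypothetical, and PurePosixPath is used only so the example prints the same path on every platform.

```python
from pathlib import PurePosixPath


def hybrid_output_dir(output_dir: str, pdf_name: str, parse_method: str) -> PurePosixPath:
    """Build {output_dir}/{pdf_name}/hybrid_{parse_method}."""
    return PurePosixPath(output_dir) / pdf_name / f"hybrid_{parse_method}"


result_dir = hybrid_output_dir("output", "report", "auto")
```

For a file named report.pdf parsed with parse_method "auto", results would be expected under output/report/hybrid_auto.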

Gradio UI Integration

The Gradio interface exposes hybrid backend selection.

Dynamic UI Updates:

VLM backends hide OCR language selector
Hybrid backends show OCR language selector
Formula label changes based on backend:
- VLM: "Enable display formula recognition"
- Hybrid: "Enable inline formula recognition"

Sources: mineru/cli/gradio_app.py411-415 mineru/cli/gradio_app.py326-344

Summary

The Hybrid Backend achieves high accuracy by:

Intelligent Decision Logic: Automatically chooses between VLM-only and hybrid processing based on document language and characteristics
Specialized Model Integration: Leverages MFD/MFR for formula processing and PaddleOCR for multi-language text extraction
Batch Processing Optimizations: Groups operations by resolution and scales batch sizes dynamically based on GPU memory
Unified Output Format: Generates standardized middle.json compatible with both VLM and pipeline backends

Default Choice: The hybrid backend is MinerU's recommended option for production use, balancing accuracy, language support, and performance.

Sources: mineru/backend/hybrid/hybrid_analyze.py1-527 mineru/backend/hybrid/hybrid_model_output_to_middle_json.py1-212
