This document covers specialized features and utilities in MinerU that enhance document processing capabilities beyond basic parsing. These features include multi-language OCR support, span refinement, cross-page table merging, LLM-aided optimizations, PDF-to-image conversion, and configuration management.
For core document parsing workflows, see System Architecture. For backend-specific processing details, see Pipeline Backend, VLM Backend, and Hybrid Backend.
MinerU supports 109 languages through PaddleOCR integration and automatic language detection using fasttext models. Language configuration affects OCR accuracy and text processing behavior.
The system organizes languages into predefined groups for OCR model selection:
Primary Language Sets:
| Language Code | Supported Languages |
|---|---|
| ch | Chinese, English, Chinese Traditional |
| ch_lite | Chinese, English, Chinese Traditional, Japanese |
| ch_server | Chinese, English, Chinese Traditional, Japanese |
| en | English |
| korean | Korean, English |
| japan | Chinese, English, Chinese Traditional, Japanese |
| chinese_cht | Chinese, English, Chinese Traditional, Japanese |
| ta | Tamil, English |
| te | Telugu, English |
| ka | Kannada |
| el | Greek, English |
| th | Thai, English |
Extended Language Sets:
| Language Code | Region | Languages |
|---|---|---|
| latin | European/Latin America | French, German, Italian, Spanish, Portuguese, Dutch, Norwegian, Polish, Swedish, Turkish, Romanian, Finnish, etc. (40+ languages) |
| arabic | Middle East/Central Asia | Arabic, Persian, Uyghur, Urdu, Pashto, Kurdish, Sindhi, Balochi, English |
| east_slavic | Eastern Europe | Russian, Belarusian, Ukrainian, English |
| cyrillic | Cyrillic Script | Russian, Bulgarian, Mongolian, Kazakh, Kyrgyz, Tajik, Macedonian, Serbian (Cyrillic), etc. (30+ languages) |
| devanagari | South Asia | Hindi, Marathi, Nepali, Sanskrit, English |
Sources: mineru/cli/gradio_app.py149-169 mineru/cli/client.py76-84 mineru/cli/fast_api.py134-152
Sources: mineru/backend/vlm/vlm_middle_json_mkcontent.py25-91 mineru/backend/pipeline/pipeline_middle_json_mkcontent.py106-179 mineru/utils/language.py
The system handles full-width and half-width character conversion:
Implementation:
- full_to_half_exclude_marks(): Converts only full-width letters (FF21-FF3A, FF41-FF5A) and numbers (FF10-FF19)
- full_to_half(): Converts all full-width characters (FF01-FF5E), including punctuation

Sources: mineru/utils/char_utils.py18-55 mineru/utils/span_pre_proc.py110-114
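The two conversions can be sketched in a few lines. This is a minimal illustration built only from the code point ranges listed above, not the code in mineru/utils/char_utils.py; the constant FULLWIDTH_OFFSET and the helper _is_fullwidth_letter_or_digit are names introduced here for the example.

```python
# Illustrative sketch based on the ranges quoted above; not MinerU's own code.
FULLWIDTH_OFFSET = 0xFEE0  # U+FF01..U+FF5E map onto ASCII U+0021..U+007E


def full_to_half(text: str) -> str:
    """Convert every full-width character in U+FF01..U+FF5E, punctuation included."""
    return "".join(
        chr(ord(ch) - FULLWIDTH_OFFSET) if 0xFF01 <= ord(ch) <= 0xFF5E else ch
        for ch in text
    )


def _is_fullwidth_letter_or_digit(code: int) -> bool:
    # Hypothetical helper: upper-case letters, lower-case letters, digits.
    return (
        0xFF21 <= code <= 0xFF3A
        or 0xFF41 <= code <= 0xFF5A
        or 0xFF10 <= code <= 0xFF19
    )


def full_to_half_exclude_marks(text: str) -> str:
    """Convert only full-width letters and digits; leave punctuation untouched."""
    return "".join(
        chr(ord(ch) - FULLWIDTH_OFFSET)
        if _is_fullwidth_letter_or_digit(ord(ch))
        else ch
        for ch in text
    )
```

With the input "ＡＢＣ１２３！", full_to_half() yields "ABC123!" while full_to_half_exclude_marks() yields "ABC123！", keeping the full-width exclamation mark.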
CJK Languages (Chinese, Japanese, Korean):
Western Languages:
Sources: mineru/backend/vlm/vlm_middle_json_mkcontent.py58-90 mineru/backend/pipeline/pipeline_middle_json_mkcontent.py143-175
Span processing involves filtering, overlap removal, confidence-based selection, and text extraction from PDF layers.
Sources: mineru/backend/pipeline/model_json_to_middle_json.py128-168 mineru/utils/span_pre_proc.py18-108
Strategy 1: Confidence-Based (IOU > 0.9)
Strategy 2: Size-Based (Ratio > 0.65)
Sources: mineru/utils/span_pre_proc.py60-107
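A rough sketch of how the two strategies could fit together is shown below. It assumes spans are dictionaries with a "bbox" of (x0, y0, x1, y1) and a "score", and it assumes the size-based ratio means intersection area divided by the smaller span's area; both are assumptions made for illustration, and the function names are hypothetical rather than taken from span_pre_proc.py.

```python
from typing import Dict, List, Sequence

BBox = Sequence[float]  # (x0, y0, x1, y1)


def area(b: BBox) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: BBox, b: BBox) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def iou(a: BBox, b: BBox) -> float:
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def remove_overlapping_spans(
    spans: List[Dict], iou_thr: float = 0.9, ratio_thr: float = 0.65
) -> List[Dict]:
    """Drop one span from each heavily overlapping pair (hypothetical helper)."""
    dropped = set()
    for i in range(len(spans)):
        for j in range(i + 1, len(spans)):
            if i in dropped or j in dropped:
                continue
            a, b = spans[i], spans[j]
            if iou(a["bbox"], b["bbox"]) > iou_thr:
                # Strategy 1: near-identical boxes, keep the higher score.
                dropped.add(i if a["score"] < b["score"] else j)
                continue
            smaller = min(area(a["bbox"]), area(b["bbox"]))
            inter = intersection(a["bbox"], b["bbox"])
            if smaller > 0 and inter / smaller > ratio_thr:
                # Strategy 2 (assumed ratio definition): keep the larger box.
                dropped.add(i if area(a["bbox"]) < area(b["bbox"]) else j)
    return [s for k, s in enumerate(spans) if k not in dropped]
```

The pairwise loop is quadratic in the number of spans, which is usually acceptable for a single page but is a simplification chosen for readability.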
The OcrConfidence class defines minimum confidence thresholds:
| Confidence Level | Value | Usage |
|---|---|---|
| min_confidence | 0.3-0.5 | Minimum acceptable OCR score |
| High confidence | > 0.8 | Reliable text extraction |
| Low confidence | < min | Marked as category_id=16 (low score text) |
Special Cases:
Sources: mineru/backend/hybrid/hybrid_analyze.py283-299 mineru/utils/ocr_utils.py
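The thresholds in the table can be read as a simple three-way classification. The sketch below is an illustration only: the default min_confidence of 0.5 is one value picked from the 0.3-0.5 range above, and confidence_bucket and tag_low_score_spans are hypothetical names, not members of OcrConfidence.

```python
from typing import Dict, List

LOW_SCORE_TEXT_CATEGORY_ID = 16  # category_id given above for low-score text
HIGH_CONFIDENCE = 0.8


def confidence_bucket(score: float, min_confidence: float = 0.5) -> str:
    """Classify an OCR score as 'low', 'acceptable', or 'high'."""
    if score < min_confidence:
        return "low"
    if score > HIGH_CONFIDENCE:
        return "high"
    return "acceptable"


def tag_low_score_spans(spans: List[Dict], min_confidence: float = 0.5) -> List[Dict]:
    """Return copies of the spans, marking low-score ones with category_id 16."""
    tagged = []
    for span in spans:
        out = dict(span)
        if confidence_bucket(span["score"], min_confidence) == "low":
            out["category_id"] = LOW_SCORE_TEXT_CATEGORY_ID
        tagged.append(out)
    return tagged
```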
Implementation Details:
- pdftext library for PDF text layer extraction

Sources: mineru/utils/span_pre_proc.py124-180 mineru/utils/pdf_text_tool.py9-40
MinerU detects and merges tables split across pages using header matching, column structure analysis, and caption markers.
Sources: mineru/utils/table_merge.py287-354 mineru/backend/utils.py
End-of-Caption Markers:
"(续)", "(续表)", "(续上表)", "(continued)", "(cont.)", "(cont'd)", "(…continued)", "续表"
Inline Caption Markers:
"(continued)"
These markers trigger relaxed merging rules, allowing up to 1 footnote in the previous table.
Sources: mineru/utils/table_merge.py12-26
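Marker detection of this kind might look like the following sketch. The marker strings are copied from the lists above; the function name is hypothetical, and the case-insensitive comparison is an assumption, since the text does not say how table_merge.py normalizes captions.

```python
# Marker strings copied from the lists above.
END_OF_CAPTION_MARKERS = (
    "(续)", "(续表)", "(续上表)", "(continued)",
    "(cont.)", "(cont'd)", "(…continued)", "续表",
)
INLINE_CAPTION_MARKERS = ("(continued)",)


def is_continuation_caption(caption: str) -> bool:
    """Return True if the caption signals a table continued from a previous page."""
    text = caption.strip().lower()  # assumption: matching ignores case
    if text.endswith(END_OF_CAPTION_MARKERS):
        return True
    return any(marker in text for marker in INLINE_CAPTION_MARKERS)
```

For example, "Table 3 (continued)" and "Table 3 (Continued): results" both match, while "Table 4: Results" does not.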
Implementation:
- calculate_table_total_columns(): Handles colspan and rowspan by building an occupation matrix
- calculate_row_effective_columns(): Returns per-row effective columns, taking rowspan into account
- calculate_row_columns(): Sums colspan values (ignores rowspan)
- calculate_visual_columns(): Counts actual td/th cells (ignores both colspan and rowspan)

Sources: mineru/utils/table_merge.py28-168
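The difference between these four counts is easiest to see on a small table. The sketch below uses only the standard library and hypothetical function names; it does not reproduce the signatures in table_merge.py, and it does not handle nested tables.

```python
from html.parser import HTMLParser
from typing import List, Tuple

Cell = Tuple[int, int]  # (colspan, rowspan)


class _TableRows(HTMLParser):
    """Collect (colspan, rowspan) for every td/th, grouped by tr."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[Cell]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self.rows[-1].append(
                (int(a.get("colspan") or 1), int(a.get("rowspan") or 1))
            )


def parse_rows(html: str) -> List[List[Cell]]:
    parser = _TableRows()
    parser.feed(html)
    return parser.rows


def visual_columns(row: List[Cell]) -> int:
    return len(row)  # number of td/th cells, spans ignored


def row_columns(row: List[Cell]) -> int:
    return sum(colspan for colspan, _ in row)  # rowspan ignored


def effective_columns(rows: List[List[Cell]]) -> List[int]:
    """Per-row width after placing every cell on an occupation grid."""
    occupied = set()
    widths = []
    for r, row in enumerate(rows):
        col = 0
        for colspan, rowspan in row:
            while (r, col) in occupied:  # skip slots taken by earlier rowspans
                col += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, col + dc))
            col += colspan
        cols_in_row = [c for (rr, c) in occupied if rr == r]
        widths.append(max(cols_in_row) + 1 if cols_in_row else 0)
    return widths


def total_columns(rows: List[List[Cell]]) -> int:
    return max(effective_columns(rows), default=0)
```

For a two-row table whose first row holds a rowspan=2 cell followed by a colspan=2 cell, and whose second row holds two plain cells, the visual counts are 2 and 2, the colspan sums are 3 and 2, and the effective width of both rows is 3.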
Strict vs Visual Matching:
Sources: mineru/utils/table_merge.py170-284
colspan Adjustment Logic:
Sources: mineru/utils/table_merge.py471-567 mineru/utils/table_merge.py419-469
MinerU optionally uses LLM models to optimize title hierarchy levels in parsed documents.
Sources: mineru/backend/pipeline/model_json_to_middle_json.py237-247 mineru/backend/hybrid/hybrid_model_output_to_middle_json.py22-106
For VLM Backend (no line information):
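1. Crop the title image with get_crop_img()
2. Run OCR detection only (ocr_model.ocr(title_img, rec=False))
3. Compute the average box height: avg_height = mean([box[2][1] - box[0][1] for box in ocr_det_res])
4. Store the scale-corrected value: title_block['line_avg_height'] = round(avg_height/scale)

For Pipeline Backend (has line information):

1. Read the lines from title_block['lines']
2. Compute the average line height: avg_height = sum(line['bbox'][3] - line['bbox'][1] for line in lines) / len(lines)
3. Store the value: title_block['line_avg_height'] = round(avg_height)

Fallback: If no lines, use block height: bbox[3] - bbox[1]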
Sources: mineru/backend/hybrid/hybrid_model_output_to_middle_json.py75-105
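The arithmetic in both branches can be isolated from the OCR and cropping calls. The sketch below is a minimal illustration with hypothetical function names; it assumes detection boxes are four-point quadrilaterals ordered top-left, top-right, bottom-right, bottom-left, which is what the box[2][1] - box[0][1] expression implies, and it does not cover an empty detection result because the text above does not describe that case.

```python
from statistics import mean
from typing import Dict, List, Sequence


def line_avg_height_from_lines(title_block: Dict) -> int:
    """Pipeline-style branch: average the line bbox heights, else use block height."""
    lines = title_block.get("lines") or []
    if lines:
        avg_height = sum(
            line["bbox"][3] - line["bbox"][1] for line in lines
        ) / len(lines)
    else:
        bbox = title_block["bbox"]
        avg_height = bbox[3] - bbox[1]  # fallback: height of the whole block
    return round(avg_height)


def line_avg_height_from_detection(
    ocr_det_res: List[Sequence[Sequence[float]]], scale: float
) -> int:
    """VLM-style branch: average detected box heights, then undo the crop scale."""
    avg_height = mean(box[2][1] - box[0][1] for box in ocr_det_res)
    return round(avg_height / scale)
```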
The get_title_level() function extracts the title level from a block:
Levels are used in Markdown generation:
- Level 1: # Title
- Level 2: ## Title
- Level N: # * N + Title

Sources: mineru/backend/vlm/vlm_middle_json_mkcontent.py106-107 mineru/backend/pipeline/pipeline_middle_json_mkcontent.py21-22
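The mapping from level to Markdown heading is a one-line string operation. The helper below is a hypothetical illustration; treating a level below 1 as plain text is an assumption, not documented behavior.

```python
def render_heading(title: str, level: int) -> str:
    """Prefix the title with `level` hash marks, as in the list above."""
    if level < 1:
        # Assumption: a block without a usable level is emitted as plain text.
        return title
    return f"{'#' * level} {title}"
```

For instance, render_heading("Methods", 2) gives "## Methods".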
MinerU converts PDF pages to images using pypdfium2 with multiprocessing support and timeout controls.
Sources: mineru/utils/pdf_image_tools.py mineru/cli/common.py54-91
Error Handling:
Sources: mineru/cli/common.py54-82 mineru/utils/pdf_page_id.py
Currently, all backends use PIL image format (ImageType.PIL). The BASE64 option exists for potential future use cases.
Sources: mineru/utils/enum_class.py115-117 mineru/backend/pipeline/pipeline_analyze.py103 mineru/backend/vlm/vlm_analyze.py mineru/backend/hybrid/hybrid_analyze.py402
MinerU supports configuration through mineru.json files and environment variables with precedence rules.
Sources: mineru/utils/cli_parser.py4-38 mineru/cli/client.py154-182
| Variable | Purpose | Values | Default |
|---|---|---|---|
| MINERU_DEVICE_MODE | Device selection | cpu, cuda, cuda:0, npu, npu:0, mps | Auto-detect |
| MINERU_VIRTUAL_VRAM_SIZE | GPU memory limit (GB) | Integer | Auto-detect |
| MINERU_MODEL_SOURCE | Model repository | huggingface, modelscope, local | huggingface |
| MINERU_LOG_LEVEL | Logging verbosity | DEBUG, INFO, WARNING, ERROR | INFO |
| MINERU_MIN_BATCH_INFERENCE_SIZE | Batch size threshold | Integer | 384 |
| MINERU_HYBRID_BATCH_RATIO | Hybrid backend batch ratio | Integer | Auto-calculated |
| MINERU_VLM_FORMULA_ENABLE | VLM formula recognition | true, false | true |
| MINERU_VLM_TABLE_ENABLE | VLM table recognition | true, false | true |
| MINERU_FORCE_VLM_OCR_ENABLE | Force VLM OCR mode | 0, 1, true, false | false |
| MINERU_HYBRID_FORCE_PIPELINE_ENABLE | Force pipeline OCR | 0, 1, true, false | false |
| MINERU_API_MAX_CONCURRENT_REQUESTS | FastAPI concurrency limit | Integer | 0 (unlimited) |
| MINERU_API_ENABLE_FASTAPI_DOCS | Enable API docs | 0, 1, true, false | true |
| MINERU_DONOT_CLEAN_MEM | Disable memory cleanup | Any value | Not set |
| MINERU_LMDEPLOY_DEVICE | LMDeploy device override | maca, corex, etc. | Not set |
| TOKENIZERS_PARALLELISM | Tokenizer parallel mode | true, false | false |
| PYTORCH_ENABLE_MPS_FALLBACK | MPS fallback | 0, 1 | 1 |
| NO_ALBUMENTATIONS_UPDATE | Disable albumentations check | 0, 1 | 1 |
Sources: mineru/cli/common.py22-30 mineru/backend/pipeline/pipeline_analyze.py81 mineru/backend/hybrid/hybrid_analyze.py25-26 mineru/backend/hybrid/hybrid_analyze.py340-347 mineru/backend/hybrid/hybrid_analyze.py369-381 mineru/cli/fast_api.py52-75
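Several of these variables accept 0, 1, true, or false, and others are plain integers. A small reader for both kinds could look like the sketch below. The helpers env_flag and env_int are hypothetical, and falling back to the default on an unrecognized value is an assumption; MinerU's own parsing may differ.

```python
import os
from typing import Mapping, Optional

_TRUE_VALUES = {"1", "true"}
_FALSE_VALUES = {"0", "false"}


def env_flag(
    name: str, default: bool, env: Optional[Mapping[str, str]] = None
) -> bool:
    """Read a boolean variable that accepts 0, 1, true, or false."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    return default  # assumption: unrecognized values fall back to the default


def env_int(
    name: str, default: int, env: Optional[Mapping[str, str]] = None
) -> int:
    """Read an integer variable, falling back to the default if unset or invalid."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None:
        return default
    try:
        return int(raw.strip())
    except ValueError:
        return default
```

Passing an explicit env mapping keeps the helpers easy to test; in normal use the argument is omitted and os.environ is read, for example env_int("MINERU_MIN_BATCH_INFERENCE_SIZE", 384).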
Configuration Readers:
- get_device(): Returns device mode (cpu/cuda/npu/mps)
- get_vram(): Returns available VRAM in GB
- get_formula_enable(): Returns formula processing flag
- get_table_enable(): Returns table processing flag
- get_llm_aided_config(): Returns LLM optimization settings
- get_latex_delimiter_config(): Returns LaTeX delimiter configuration

Sources: mineru/utils/config_reader.py mineru/backend/vlm/vlm_middle_json_mkcontent.py10-22
Recommended Settings for Client-Server Architecture:
| Client VRAM | MINERU_HYBRID_BATCH_RATIO |
|---|---|
| ≤ 6 GB | 8 |
| ≤ 4.5 GB | 4 |
| ≤ 3 GB | 2 |
| ≤ 2.5 GB | 1 |
Note: These values include headroom for multi-client deployments. Reserve the equivalent of one client's VRAM as overall redundancy.
Sources: mineru/backend/hybrid/hybrid_analyze.py323-366
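The recommendation table can be expressed as a lookup that picks the tightest matching row. This is a hypothetical helper for choosing a MINERU_HYBRID_BATCH_RATIO value, not part of MinerU; it returns None above 6 GB because the table gives no recommendation there, in which case the variable can be left unset so the ratio is auto-calculated.

```python
from typing import Optional

# (upper VRAM limit in GB, recommended ratio), copied from the table above
# and sorted from the tightest limit upward.
RECOMMENDED_RATIOS = ((2.5, 1), (3.0, 2), (4.5, 4), (6.0, 8))


def recommended_hybrid_batch_ratio(client_vram_gb: float) -> Optional[int]:
    """Return the ratio for the first row whose VRAM limit covers the client."""
    for limit_gb, ratio in RECOMMENDED_RATIOS:
        if client_vram_gb <= limit_gb:
            return ratio
    return None  # no recommendation in the table for larger clients
```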
Output control flags in do_parse() and aio_do_parse():
| Flag | Default | Output File | Purpose |
|---|---|---|---|
| f_draw_layout_bbox | True | *_layout.pdf | Colored bounding boxes for layout blocks |
| f_draw_span_bbox | True | *_span.pdf | Bounding boxes for text spans |
| f_dump_md | True | *.md | Markdown output |
| f_dump_middle_json | True | *_middle.json | Intermediate JSON format |
| f_dump_model_output | True | *_model.json | Raw model predictions |
| f_dump_orig_pdf | True | *_origin.pdf | Original PDF copy |
| f_dump_content_list | True | *_content_list.json | Flat content structure |
| f_make_md_mode | MakeMode.MM_MD | N/A | Markdown generation mode |
MakeMode Options:
- MM_MD: Multimodal Markdown (includes images/tables)
- NLP_MD: NLP-focused Markdown (text-only)
- CONTENT_LIST: Flat JSON list
- CONTENT_LIST_V2: Enhanced JSON structure

Sources: mineru/cli/common.py94-169 mineru/utils/enum_class.py86-90 mineru/utils/draw_bbox.py120-258
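To see which files a given combination of flags would produce, the flag table can be turned into a small lookup. The function expected_outputs is a hypothetical helper written for this illustration; it does not call do_parse(), and it assumes the * in each pattern stands for the input file's stem.

```python
from typing import List

# Flag name -> output suffix, copied from the flag table above.
OUTPUT_SUFFIXES = {
    "f_draw_layout_bbox": "_layout.pdf",
    "f_draw_span_bbox": "_span.pdf",
    "f_dump_md": ".md",
    "f_dump_middle_json": "_middle.json",
    "f_dump_model_output": "_model.json",
    "f_dump_orig_pdf": "_origin.pdf",
    "f_dump_content_list": "_content_list.json",
}


def expected_outputs(stem: str, **flags: bool) -> List[str]:
    """List the output file names for a stem; every flag defaults to True."""
    unknown = set(flags) - set(OUTPUT_SUFFIXES)
    if unknown:
        raise TypeError(f"unknown flags: {sorted(unknown)}")
    return [
        stem + suffix
        for flag, suffix in OUTPUT_SUFFIXES.items()
        if flags.get(flag, True)
    ]
```

Calling expected_outputs("report", f_draw_span_bbox=False, f_dump_orig_pdf=False) lists the five remaining files, from report_layout.pdf to report_content_list.json.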