This document provides a comprehensive overview of the MinerU system, a high-accuracy document parsing tool that converts PDFs and images into machine-readable formats such as Markdown and JSON. MinerU was developed during the pre-training process of the InternLM language model to solve symbol conversion issues in scientific literature.
This overview covers the system's core capabilities, architecture, processing backends, user interfaces, model management, output formats, and deployment options.
For installation instructions, see Installation and Setup. For detailed usage patterns, see Command Line Interface and Python API Integration. For backend-specific details, see Pipeline Backend, VLM Backend, and Hybrid Backend.
Sources: README.md89-96 README_zh-CN.md87-93 docs/zh/index.md42-46
MinerU is a tool that converts unstructured documents (primarily PDFs) into structured, machine-readable formats. It extracts text, tables, formulas, and images while preserving semantic coherence and document structure. The system processes documents through specialized backends that balance accuracy and computational requirements.
The project is distributed as the mineru Python package and supports Python 3.10-3.13 across Windows, Linux, and macOS platforms. The core parsing functionality operates in CPU-only environments, with optional GPU/NPU acceleration for higher-accuracy backends.
Key Design Goals:
Sources: README.md89-96 pyproject.toml6-18 README.md142-212
MinerU provides comprehensive document parsing capabilities organized into the following functional areas:
| Capability | Description | Technical Implementation |
|---|---|---|
| Layout Analysis | Detects and classifies document regions (text, images, tables, formulas) | DocLayout-YOLO model for region detection |
| OCR Processing | Extracts text from scanned documents in 109 languages | PaddleOCR detection and recognition models |
| Formula Recognition | Converts mathematical expressions to LaTeX | MFD (detection) + MFR/UniMERNet (recognition) |
| Table Extraction | Recognizes table structure and converts to HTML | UNet + RapidTable for wired/wireless tables |
| Reading Order | Determines semantic flow across complex layouts | LayoutLMv3 model or XY-cut algorithm |
| Semantic Cleanup | Removes headers, footers, page numbers, footnotes | Rule-based post-processing |
Input file types are detected with magika.

MinerU generates multiple output formats to serve different downstream applications:
- `auto/*.md` - Multimodal Markdown with embedded images (default)
- `auto_nlp/*.md` - NLP-focused Markdown with text-only content
- `middle.json` - Rich intermediate format with full metadata and hierarchical structure
- `content_list.json` - Flat list of content blocks sorted by reading order
- `layout.pdf` - Visual overlay of detected layout regions
- `spans.pdf` - Character-level span visualization
- `model.json` - Raw model outputs for debugging

Sources: README.md97-110 README_zh-CN.md99-111 docs/zh/index.md50-63
The following diagram illustrates the high-level architecture of MinerU, showing how user interfaces route requests through processing backends to generate structured output:
User Layer: Four entry points expose MinerU functionality via CLI (mineru), web UI (mineru-gradio), REST API (mineru-api), and OpenAI-compatible server (mineru-openai-server).
Orchestration Layer: The do_parse and aio_do_parse functions in mineru/cli/common.py route requests to appropriate backends after PDF preparation using pypdfium2.
Processing Backends: Three backend implementations offer different accuracy/performance tradeoffs:
- `pipeline` - CPU-compatible, uses specialized models sequentially
- `vlm-*` - Vision-Language Model backends requiring GPU/NPU
- `hybrid-*` - Combines VLM with pipeline components

Model Inference: The ModelSingleton pattern caches model instances and manages loading from HuggingFace, ModelScope, or local storage.
Hardware Abstraction: Device detection functions select optimal hardware (CUDA, NPU, MPS, or CPU) and configure inference engines accordingly.
Output Generation: All backends converge to middle.json format, which feeds into Markdown and JSON generators.
Sources: Diagram 1 from provided diagrams, README.md142-212 pyproject.toml111-118
MinerU provides three processing backends that balance accuracy, hardware requirements, and flexibility. The backend selection occurs in mineru/cli/common.py via the --backend or -b CLI parameter.
| Backend | Accuracy Score¹ | Min VRAM | CPU Support | Best For |
|---|---|---|---|---|
| pipeline | 82+ | 6GB | ✅ Yes | CPU environments, batch processing |
| vlm-auto-engine | 90+ | 8GB | ❌ No | High accuracy, GPU-accelerated |
| hybrid-auto-engine | 90+ | 10GB | ❌ No | Best accuracy with multi-language OCR |
| vlm-http-client | 90+ | 3GB | ✅ Yes² | Remote inference servers |
| hybrid-http-client | 90+ | 3GB | ✅ Yes² | Remote inference with OCR fallback |
¹ OmniDocBench (v1.5) End-to-End Evaluation Overall score
² Uses remote GPU, local device only needs CPU
Pipeline Backend (`pipeline`): sequential processing through specialized models (layout detection, OCR, formula recognition, table recognition, reading order).

VLM Backends (`vlm-*`): end-to-end document parsing with a Vision-Language Model.

Hybrid Backends (`hybrid-*`): combine VLM parsing with pipeline components.
Sources: README.md142-212 README_zh-CN.md141-212 Diagram 2 from provided diagrams
MinerU exposes four primary user interfaces, each implemented as a console script entry point defined in pyproject.toml111-118:
Command Line Client (`mineru`): the primary CLI tool for document parsing.
Entry Point: mineru.cli.client:main
Basic Usage:
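A minimal invocation can be sketched in Python, using only the flags documented under Key Parameters below; `build_mineru_cmd` is an illustrative helper, not part of the package:

```python
import subprocess

def build_mineru_cmd(path, output, backend="pipeline", source="huggingface"):
    """Assemble a `mineru` CLI invocation from the documented flags."""
    return [
        "mineru",
        "-p", path,          # input file or directory
        "-o", output,        # output directory
        "-b", backend,       # pipeline / vlm-* / hybrid-*
        "--source", source,  # huggingface / modelscope / local
    ]

cmd = build_mineru_cmd("paper.pdf", "./output")
# subprocess.run(cmd, check=True)  # uncomment once mineru is installed
```

The same command works equally well typed directly into a shell; the helper simply makes the flag set explicit.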
Key Parameters:
- `-p, --path`: Input file or directory path
- `-o, --output`: Output directory
- `-b, --backend`: Backend selection (`pipeline`, `vlm-*`, `hybrid-*`)
- `--source`: Model source (`huggingface`, `modelscope`, `local`)

See Command Line Interface for complete documentation.
Gradio Web UI (`mineru-gradio`): a web-based interface for interactive document parsing.
Entry Point: mineru.cli.gradio_app:main
Deployment:
Features:
Implementation: mineru/cli/gradio_app.py
See Gradio Web UI for detailed documentation.
FastAPI Service (`mineru-api`): a RESTful API server for programmatic access.
Entry Point: mineru.cli.fast_api:main
Deployment:
API Endpoints:
- `POST /parse` - Upload and parse a document
- `GET /status/{job_id}` - Check parsing status
- `GET /result/{job_id}` - Retrieve parsed results

Implementation: mineru/cli/fast_api.py
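A polling client against these routes might look like the following sketch; the base URL and the `status`/`done` fields of the JSON responses are assumptions about the deployment and payload shape, not guaranteed by the API:

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed default; match your deployment

def endpoint(path, **kw):
    """Build a full URL for one of the documented routes."""
    return BASE_URL + path.format(**kw)

def wait_for_result(job_id, interval=2.0):
    """Poll GET /status/{job_id} until done, then fetch GET /result/{job_id}.
    The 'status' == 'done' check is an assumed response shape."""
    while True:
        with urllib.request.urlopen(endpoint("/status/{job_id}", job_id=job_id)) as r:
            if json.load(r).get("status") == "done":
                break
        time.sleep(interval)
    with urllib.request.urlopen(endpoint("/result/{job_id}", job_id=job_id)) as r:
        return json.load(r)
```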
See FastAPI Service for API documentation.
OpenAI-Compatible Server (`mineru-openai-server`): a vLLM/LMDeploy-based inference server for lightweight HTTP clients.
Entry Points:
- `mineru.cli.vlm_server:vllm_server` → `mineru-vllm-server`
- `mineru.cli.vlm_server:lmdeploy_server` → `mineru-lmdeploy-server`
- `mineru.cli.vlm_server:openai_server` → `mineru-openai-server`

Deployment:
Client Usage:
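A client invocation against a remote server could look like this sketch. Every flag except `-u` comes from the Key Parameters section; `-u` (the remote server URL) is an assumption to verify against `mineru --help`:

```python
# -b selects the CPU-only client backend; -u (assumed flag) points it at the
# remote mineru-openai-server instance.
client_cmd = [
    "mineru",
    "-p", "doc.pdf",
    "-o", "out/",
    "-b", "vlm-http-client",
    "-u", "http://gpu-server:30000",
]
```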
This architecture enables GPU resource consolidation on servers while distributing lightweight CPU-only clients to edge nodes.
See OpenAI-Compatible Server and HTTP Client for details.
Sources: pyproject.toml111-118 README.md246-256 docs/zh/usage/index.md1-42
MinerU abstracts hardware acceleration through device detection and configuration functions, enabling deployment across diverse hardware platforms.
| Platform | Requirements | Backends | Inference Engines |
|---|---|---|---|
| NVIDIA GPU | Compute Capability 7.0+ (Volta+), CUDA 12.8+ | All | transformers, vLLM, LMDeploy |
| Apple Silicon | macOS 14.0+, Metal Performance Shaders | All | transformers, MLX |
| Ascend NPU | CANN 8.0+, Ascend 910B | vlm, hybrid | vLLM, LMDeploy |
| 11 Other NPUs | Vendor-specific drivers | vlm, hybrid | vLLM or LMDeploy |
| CPU Only | 16GB+ RAM | pipeline only | transformers (CPU) |
Hardware detection and configuration occurs through utility functions:
- `get_device()` - Returns the optimal device (`cuda`/`npu`/`mps`/`cpu`)
- `get_vram()` - Returns available VRAM in MB

These functions inform backend selection and inference engine configuration.
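A best-effort reimplementation of that detection order can be sketched as follows; this is illustrative only, and the `torch_npu` probe is an assumption about how Ascend devices are detected:

```python
def detect_device() -> str:
    """Probe hardware in the documented preference order: cuda > npu > mps > cpu."""
    try:
        import torch
    except ImportError:
        return "cpu"  # core pipeline parsing still works CPU-only
    if torch.cuda.is_available():
        return "cuda"
    try:
        import torch_npu  # Ascend plugin, present only on NPU hosts (assumed probe)
        if torch_npu.npu.is_available():
            return "npu"
    except ImportError:
        pass
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```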
MinerU supports 11 Chinese domestic NPU platforms through vendor-specific Docker images and device configurations:
Full Support (Official + Vendor):
Partial/Community Support:
Each platform provides Docker deployment with platform-specific:
- Device mappings (`/dev/card*`)
- Environment variables (`DEVICE`, `VRAM`)

See Hardware Acceleration for detailed configuration.
Sources: README.md179-212 README_zh-CN.md48-61 docs/zh/usage/index.md11-24
MinerU implements a comprehensive model management system with caching, lazy loading, and multi-source support.
MinerU can load models from three sources, controlled via the `--source` CLI parameter or the `MINERU_MODEL_SOURCE` environment variable:

- `huggingface` - HuggingFace Hub (default)
- `modelscope` - ModelScope Hub
- `local` - Models downloaded with the `mineru-models-download` tool

Configuration priority (highest to lowest):

1. `MINERU_MODEL_SOURCE` environment variable
2. `--source` CLI parameter
3. `mineru.json` configuration file
4. `huggingface` default

MinerU manages five categories of models:
| Category | Models | Backend Usage |
|---|---|---|
| Layout Detection | DocLayout-YOLO | Pipeline, Hybrid |
| OCR | PaddleOCR (detection + recognition) | Pipeline, Hybrid |
| Formula Recognition | MFD, MFR, UniMERNet | Pipeline, Hybrid |
| Table Recognition | UNet, RapidTable | Pipeline |
| VLM | Qwen2-VL (2B/7B) | VLM, Hybrid |
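The source-selection priority described above can be sketched as a small resolver; `resolve_model_source` and the `model_source` config key are illustrative names, not MinerU's actual API:

```python
import json
import os

def resolve_model_source(cli_source=None, config_path="~/mineru.json"):
    """Apply the documented priority: env var > CLI flag > config file > default."""
    env = os.environ.get("MINERU_MODEL_SOURCE")
    if env:
        return env
    if cli_source:
        return cli_source
    path = os.path.expanduser(config_path)
    if os.path.isfile(path):
        with open(path) as f:
            source = json.load(f).get("model_source")  # hypothetical key name
        if source:
            return source
    return "huggingface"
```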
Model instances are cached using the Singleton pattern to minimize memory usage and initialization overhead:
Key Features:
Implementation Pattern:
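A minimal sketch of the pattern (not MinerU's actual class body) might look like:

```python
class ModelSingleton:
    """One shared instance; each model is loaded lazily and cached by key."""
    _instance = None
    _models = {}

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_model(self, key, loader):
        """Run `loader` at most once per key and cache the result."""
        if key not in self._models:
            self._models[key] = loader()
        return self._models[key]
```

Repeated `get_model` calls with the same key return the cached object, so each model is initialized exactly once per process.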
See Model Management and Singleton Pattern for implementation details.
The mineru-models-download CLI provides interactive model acquisition:
Entry Point: mineru.cli.models_download:download_models
Usage:
Behavior:
- Downloads models to `~/.cache/mineru/` by default
- Writes configuration to `~/mineru.json`

Configuration File Structure (`mineru.json`):
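For illustration only, a configuration of this general shape could be generated as follows; the key names are assumptions, not the published schema, so compare against the `mineru.json` the download tool actually writes:

```python
import json

# Illustrative shape only: both keys below are hypothetical.
example_config = {
    "model_source": "huggingface",    # hypothetical default-source key
    "models_dir": "~/.cache/mineru/", # hypothetical local-model path key
}
print(json.dumps(example_config, indent=2))
```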
See Model Download and Configuration for complete documentation.
Sources: docs/zh/usage/model_source.md docs/en/usage/model_source.md pyproject.toml116 Diagram 5 from provided diagrams
All processing backends converge to a standardized middle.json intermediate format, which then feeds into multiple output generators. This architecture decouples parsing from rendering.
The middle.json file contains hierarchical document structure with rich metadata:
Top-Level Schema:
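A heavily abbreviated skeleton of the hierarchy is sketched below; the field names reflect MinerU's documented schema but should be verified against the `middle.json` your installed version actually produces:

```python
# Sketch only: real files carry many more fields per block and per line.
middle_json = {
    "pdf_info": [                 # one entry per page
        {
            "page_idx": 0,
            "page_size": [612.0, 792.0],
            "para_blocks": [      # blocks in reading order, typed by BlockType
                {"type": "title", "bbox": [90, 72, 520, 110], "lines": []},
                {"type": "text",  "bbox": [90, 120, 520, 300], "lines": []},
            ],
        }
    ],
    "_backend": "pipeline",       # which backend produced the file
    "_version_name": "2.x",
}
```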
Block Types (BlockType enum):
- `text` - Paragraph text
- `title` - Heading with hierarchy level
- `image` - Image with optional caption
- `table` - Table with HTML representation
- `equation` - Display equation (LaTeX)
- `inline_equation` - Inline formula (LaTeX)
- `list_item` - List item

See middle.json Intermediate Format for complete schema.
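For downstream code, the block types listed above can be mirrored locally; this is a sketch, and the real enum ships inside the mineru package:

```python
from enum import Enum

class BlockType(str, Enum):
    """Local mirror of the documented block type strings."""
    TEXT = "text"
    TITLE = "title"
    IMAGE = "image"
    TABLE = "table"
    EQUATION = "equation"
    INLINE_EQUATION = "inline_equation"
    LIST_ITEM = "list_item"

# Raw strings from middle.json round-trip cleanly through the enum:
assert BlockType("table") is BlockType.TABLE
```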
Two Markdown variants are generated via union_make functions:
Multimodal Markdown (`auto/*.md`): the default output for multimodal applications, with extracted images embedded by reference.
Use Cases: RAG systems, multimodal LLM training, document QA
NLP Markdown (`auto_nlp/*.md`): a text-only variant for NLP tasks; images are replaced with placeholders such as `[Image: caption text]`.

Use Cases: Text summarization, keyword extraction, translation
Generator Implementation: mineru/backend/*/mkcontent.py modules contain union_make functions
See Markdown Generation for rendering details.
Flat list of content blocks sorted by reading order:
Schema:
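A sketch of how a consumer might use the format; the entries below are hypothetical and abbreviated (real entries carry additional fields such as bounding boxes and image paths):

```python
# Hypothetical minimal content_list entries.
content_list = [
    {"type": "title", "text": "1 Introduction", "page_idx": 0},
    {"type": "text", "text": "MinerU converts PDFs into structured formats.", "page_idx": 0},
    {"type": "equation", "text": "E = mc^2", "page_idx": 1},
]

# The list is already sorted by reading order, so consumers simply iterate:
plain_text = "\n".join(
    b["text"] for b in content_list if b["type"] in ("text", "title")
)
```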
Use Cases: Sequential document processing, pipeline feeding, custom rendering
See content_list.json Structure for format specification.
Three PDF visualizations aid debugging and quality inspection:
Generation Control: Enabled via CLI flags or configuration
See Debug and Visualization Outputs for examples.
Sources: README.md107-110 README_zh-CN.md107-111 Diagram 2 and Diagram 6 from provided diagrams
MinerU supports three primary deployment patterns: standalone installation, Docker containerization, and client-server architecture.
Local installation via pip/uv for direct Python usage:
Installation:
Optional Dependencies (pyproject.toml46-103):
- `[vlm]` - transformers + torch for VLM backends
- `[vllm]` - vLLM inference engine (Linux only)
- `[lmdeploy]` - LMDeploy inference engine (Windows/Linux)
- `[mlx]` - MLX inference engine (macOS only)
- `[pipeline]` - Pipeline backend models
- `[api]` - FastAPI server dependencies
- `[gradio]` - Gradio web UI dependencies
- `[core]` - All of the above except inference engines
- `[all]` - Complete installation with platform-specific engines

See Installation and Setup for detailed instructions.
Pre-built Docker images support rapid deployment and hardware compatibility:
Image Variants:
- `registry.hub.docker.com/opendatalab/mineru`

Docker Compose Profiles:
Platform-Specific Images:
Acceleration card-specific Dockerfiles in docker/ provide:
- Device mappings (`/dev/card*`, `/dev/nvidia*`)

See Docker Deployment and Platform-Specific Docker Images.
Distributed deployment separates GPU resources (server) from processing requests (clients):
Server Setup:
Lightweight Client:
Architecture Benefits:
Deployment Pattern:
See Client-Server Architecture for configuration details.
Sources: README.md217-242 pyproject.toml46-103 docs/zh/quick_start/index.md107-134 Diagram 3 from provided diagrams
The following table summarizes hardware and software requirements across backends:
| Backend | OS Support | CPU Support | Min VRAM | Min RAM | Disk Space | Python |
|---|---|---|---|---|---|---|
| pipeline | Linux²/Windows/macOS | ✅ Yes | 6GB (GPU) | 16GB | 20GB | 3.10-3.13³ |
| hybrid-auto-engine | Linux²/Windows/macOS | ❌ No | 10GB | 16GB | 20GB | 3.10-3.13³ |
| vlm-auto-engine | Linux²/Windows/macOS | ❌ No | 8GB | 16GB | 20GB | 3.10-3.13³ |
| hybrid-http-client | Linux²/Windows/macOS | ✅ Yes | 3GB (GPU)¹ | 8GB | 2GB | 3.10-3.13³ |
| vlm-http-client | Linux²/Windows/macOS | ✅ Yes | 3GB (GPU)¹ | 8GB | 2GB | 3.10-3.13³ |
¹ VRAM required on remote server only
² Linux distributions from 2019 or later
³ Windows limited to Python 3.10-3.12 due to ray dependency
GPU Requirements:
Accuracy Metrics (OmniDocBench v1.5):
Sources: README.md142-212 README_zh-CN.md141-212 docs/zh/quick_start/index.md29-100
MinerU implements automated release pipelines via GitHub Actions workflows defined in .github/workflows/:
Release Workflow (`python-package.yml`): triggered by Git tags matching the `*released` pattern.
Stages:
Authentication:
- `RELEASE_TOKEN` GitHub secret for repository access
- `PYPI_TOKEN` GitHub secret for PyPI publication

Test Workflow (`cli-test.yml`): runs on every push to master/dev branches.
Documentation Workflow (`mkdocs.yml`): automatically deploys documentation to GitHub Pages via the `gh-pages` branch.

See CI/CD Pipeline and Workflows and Version Management and Release Process for implementation details.
Sources: Diagram 4 from provided diagrams, pyproject.toml120-121
Complete example demonstrating basic usage:
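One way to wrap the CLI from Python, assuming only the flags from the Key Parameters section and the output layout listed under Output Files; `parse_document` and `expected_markdown_path` are illustrative helpers, not mineru APIs:

```python
import subprocess
from pathlib import Path

def expected_markdown_path(pdf, out_dir="output"):
    """Where the multimodal Markdown lands, per the documented output layout."""
    return Path(out_dir) / "auto" / (Path(pdf).stem + ".md")

def parse_document(pdf, out_dir="output"):
    """Invoke the CLI with the documented flags and return the Markdown path."""
    subprocess.run(
        ["mineru", "-p", pdf, "-o", out_dir, "-b", "pipeline"],
        check=True,
    )
    return expected_markdown_path(pdf, out_dir)
```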
Output Files:
- `output/auto/document.md` - Multimodal Markdown
- `output/auto_nlp/document.md` - Text-only Markdown
- `output/document/middle.json` - Intermediate format
- `output/document/content_list.json` - Flat content list
- `output/document/images/` - Extracted images

For detailed usage patterns, see Quick Start Examples.
Sources: README.md246-256 docs/zh/quick_start/index.md136-151