This document provides a comprehensive overview of the MinerU system, a high-accuracy document parsing tool that converts PDFs and images into machine-readable formats such as Markdown and JSON. MinerU was developed during the pre-training process of the InternLM language model to solve symbol conversion issues in scientific literature.
This overview covers the system's core capabilities, architecture, processing backends, user interfaces, model management, output formats, and deployment options.
For installation instructions, see Installation and Setup. For detailed usage patterns, see Command Line Interface and Python API Integration. For backend-specific details, see Pipeline Backend, VLM Backend, and Hybrid Backend.
Sources: README.md89-96 README_zh-CN.md87-93 docs/zh/index.md42-46
MinerU is a tool that converts unstructured documents (primarily PDFs) into structured, machine-readable formats. It extracts text, tables, formulas, and images while preserving semantic coherence and document structure. The system processes documents through specialized backends that balance accuracy and computational requirements.
The project is distributed as the mineru Python package and supports Python 3.10-3.13 across Windows, Linux, and macOS platforms. The core parsing functionality operates in CPU-only environments, with optional GPU/NPU acceleration for higher-accuracy backends.
Key Design Goals:
Sources: README.md89-96 pyproject.toml6-18 README.md142-212
MinerU provides comprehensive document parsing capabilities organized into the following functional areas:
| Capability | Description | Technical Implementation |
|---|---|---|
| Layout Analysis | Detects and classifies document regions (text, images, tables, formulas) | DocLayout-YOLO model for region detection |
| OCR Processing | Extracts text from scanned documents in 109 languages | PaddleOCR detection and recognition models |
| Formula Recognition | Converts mathematical expressions to LaTeX | MFD (detection) + MFR/UniMERNet (recognition) |
| Table Extraction | Recognizes table structure and converts to HTML | UNet + RapidTable for wired/wireless tables |
| Reading Order | Determines semantic flow across complex layouts | LayoutLMv3 model or XY-cut algorithm |
| Semantic Cleanup | Removes headers, footers, page numbers, footnotes | Rule-based post-processing |
Input file types are detected with magika.

MinerU generates multiple output formats to serve different downstream applications:
- `auto/*.md` - Multimodal Markdown with embedded images (default)
- `auto_nlp/*.md` - NLP-focused Markdown with text-only content
- `middle.json` - Rich intermediate format with full metadata and hierarchical structure
- `content_list.json` - Flat list of content blocks sorted by reading order
- `layout.pdf` - Visual overlay of detected layout regions
- `spans.pdf` - Character-level span visualization
- `model.json` - Raw model outputs for debugging

Sources: README.md97-110 README_zh-CN.md99-111 docs/zh/index.md50-63
The following diagram illustrates the high-level architecture of MinerU, showing how user interfaces route requests through processing backends to generate structured output:
User Layer: Four entry points expose MinerU functionality via CLI (mineru), web UI (mineru-gradio), REST API (mineru-api), and OpenAI-compatible server (mineru-openai-server).
Orchestration Layer: The do_parse and aio_do_parse functions in mineru/cli/common.py route requests to appropriate backends after PDF preparation using pypdfium2.
Processing Backends: Three backend implementations offer different accuracy/performance tradeoffs:
- `pipeline` - CPU-compatible, uses specialized models sequentially
- `vlm-*` - Vision-Language Model backends requiring GPU/NPU
- `hybrid-*` - Combines VLM with pipeline components

Model Inference: The ModelSingleton pattern caches model instances and manages loading from HuggingFace, ModelScope, or local storage.
Hardware Abstraction: Device detection functions select optimal hardware (CUDA, NPU, MPS, or CPU) and configure inference engines accordingly.
Output Generation: All backends converge to middle.json format, which feeds into Markdown and JSON generators.
Sources: Diagram 1 from provided diagrams, README.md142-212 pyproject.toml111-118
MinerU provides three processing backends that balance accuracy, hardware requirements, and flexibility. The backend selection occurs in mineru/cli/common.py via the --backend or -b CLI parameter.
| Backend | Accuracy Score¹ | Min VRAM | CPU Support | Best For |
|---|---|---|---|---|
| pipeline | 82+ | 6GB | ✅ Yes | CPU environments, batch processing |
| vlm-auto-engine | 90+ | 8GB | ❌ No | High accuracy, GPU-accelerated |
| hybrid-auto-engine | 90+ | 10GB | ❌ No | Best accuracy with multi-language OCR |
| vlm-http-client | 90+ | 3GB | ✅ Yes² | Remote inference servers |
| hybrid-http-client | 90+ | 3GB | ✅ Yes² | Remote inference with OCR fallback |
¹ OmniDocBench (v1.5) End-to-End Evaluation Overall score
² Uses remote GPU, local device only needs CPU
Pipeline Backend (`pipeline`): sequential processing through specialized models (layout detection, OCR, formula recognition, table recognition, reading order).

VLM Backends (`vlm-*`): end-to-end document parsing with a Vision-Language Model.

Hybrid Backends (`hybrid-*`): combine VLM parsing with pipeline components.
Sources: README.md142-212 README_zh-CN.md141-212 Diagram 2 from provided diagrams
MinerU exposes four primary user interfaces, each implemented as a console script entry point defined in pyproject.toml111-118:
Command Line Client (`mineru`): the primary CLI tool for document parsing.
Entry Point: mineru.cli.client:main
Basic Usage:
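A minimal invocation can be sketched in Python, using only the flags documented under Key Parameters below; `build_mineru_cmd` is an illustrative helper, not part of the package:

```python
import subprocess

def build_mineru_cmd(path, output, backend="pipeline", source="huggingface"):
    """Assemble a `mineru` CLI invocation from the documented flags."""
    return [
        "mineru",
        "-p", path,          # input file or directory
        "-o", output,        # output directory
        "-b", backend,       # pipeline / vlm-* / hybrid-*
        "--source", source,  # huggingface / modelscope / local
    ]

cmd = build_mineru_cmd("paper.pdf", "./output")
# subprocess.run(cmd, check=True)  # uncomment once mineru is installed
```

The same command works equally well typed directly into a shell; the helper simply makes the flag set explicit.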
Key Parameters:
- `-p, --path`: Input file or directory path
- `-o, --output`: Output directory
- `-b, --backend`: Backend selection (`pipeline`, `vlm-*`, `hybrid-*`)
- `--source`: Model source (`huggingface`, `modelscope`, `local`)

See Command Line Interface for complete documentation.
Gradio Web UI (`mineru-gradio`): a web-based interface for interactive document parsing.
Entry Point: mineru.cli.gradio_app:main
Deployment:
Features:
Implementation: mineru/cli/gradio_app.py
See Gradio Web UI for detailed documentation.
FastAPI Service (`mineru-api`): a RESTful API server for programmatic access.
Entry Point: mineru.cli.fast_api:main
Deployment:
API Endpoints:
- `POST /parse` - Upload and parse a document
- `GET /status/{job_id}` - Check parsing status
- `GET /result/{job_id}` - Retrieve parsed results

Implementation: mineru/cli/fast_api.py
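A polling client against these routes might look like the following sketch; the base URL and the `status`/`done` fields of the JSON responses are assumptions about the deployment and payload shape, not guaranteed by the API:

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed default; match your deployment

def endpoint(path, **kw):
    """Build a full URL for one of the documented routes."""
    return BASE_URL + path.format(**kw)

def wait_for_result(job_id, interval=2.0):
    """Poll GET /status/{job_id} until done, then fetch GET /result/{job_id}.
    The 'status' == 'done' check is an assumed response shape."""
    while True:
        with urllib.request.urlopen(endpoint("/status/{job_id}", job_id=job_id)) as r:
            if json.load(r).get("status") == "done":
                break
        time.sleep(interval)
    with urllib.request.urlopen(endpoint("/result/{job_id}", job_id=job_id)) as r:
        return json.load(r)
```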
See FastAPI Service for API documentation.
OpenAI-Compatible Server (`mineru-openai-server`): a vLLM/LMDeploy-based inference server for lightweight HTTP clients.
Entry Points:
- `mineru.cli.vlm_server:vllm_server` → `mineru-vllm-server`
- `mineru.cli.vlm_server:lmdeploy_server` → `mineru-lmdeploy-server`
- `mineru.cli.vlm_server:openai_server` → `mineru-openai-server`

Deployment:
Client Usage:
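A client invocation against a remote server could look like this sketch. Every flag except `-u` comes from the Key Parameters section; `-u` (the remote server URL) is an assumption to verify against `mineru --help`:

```python
# -b selects the CPU-only client backend; -u (assumed flag) points it at the
# remote mineru-openai-server instance.
client_cmd = [
    "mineru",
    "-p", "doc.pdf",
    "-o", "out/",
    "-b", "vlm-http-client",
    "-u", "http://gpu-server:30000",
]
```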
This architecture enables GPU resource consolidation on servers while distributing lightweight CPU-only clients to edge nodes.
See OpenAI-Compatible Server and HTTP Client for details.
Sources: pyproject.toml111-118 README.md246-256 docs/zh/usage/index.md1-42
MinerU abstracts hardware acceleration through device detection and configuration functions, enabling deployment across diverse hardware platforms.
| Platform | Requirements | Backends | Inference Engines |
|---|---|---|---|
| NVIDIA GPU | Compute Capability 7.0+ (Volta+), CUDA 12.8+ | All | transformers, vLLM, LMDeploy |
| Apple Silicon | macOS 14.0+, Metal Performance Shaders | All | transformers, MLX |
| Ascend NPU | CANN 8.0+, Ascend 910B | vlm, hybrid | vLLM, LMDeploy |
| 11 Other NPUs | Vendor-specific drivers | vlm, hybrid | vLLM or LMDeploy |
| CPU Only | 16GB+ RAM | pipeline only | transformers (CPU) |
Hardware detection and configuration occurs through utility functions:
- `get_device()` - Returns the optimal device (`cuda`/`npu`/`mps`/`cpu`)
- `get_vram()` - Returns available VRAM in MB

These functions inform backend selection and inference engine configuration.
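A best-effort reimplementation of that detection order can be sketched as follows; this is illustrative only, and the `torch_npu` probe is an assumption about how Ascend devices are detected:

```python
def detect_device() -> str:
    """Probe hardware in the documented preference order: cuda > npu > mps > cpu."""
    try:
        import torch
    except ImportError:
        return "cpu"  # core pipeline parsing still works CPU-only
    if torch.cuda.is_available():
        return "cuda"
    try:
        import torch_npu  # Ascend plugin, present only on NPU hosts (assumed probe)
        if torch_npu.npu.is_available():
            return "npu"
    except ImportError:
        pass
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```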
MinerU supports 11 Chinese domestic NPU platforms through vendor-specific Docker images and device configurations:
Full Support (Official + Vendor):
Partial/Community Support:
Each platform provides Docker deployment with platform-specific:
- Device mappings (`/dev/card*`)
- Environment variables (`DEVICE`, `VRAM`)

See Hardware Acceleration for detailed configuration.
Sources: README.md179-212 README_zh-CN.md48-61 docs/zh/usage/index.md11-24
MinerU implements a comprehensive model management system with caching, lazy loading, and multi-source support.
MinerU can load models from three sources, controlled via the `--source` CLI parameter or the `MINERU_MODEL_SOURCE` environment variable:

- `huggingface` - HuggingFace Hub (default)
- `modelscope` - ModelScope Hub
- `local` - Models downloaded with the `mineru-models-download` tool

Configuration priority (highest to lowest):

1. `MINERU_MODEL_SOURCE` environment variable
2. `--source` CLI parameter
3. `mineru.json` configuration file
4. `huggingface` default

MinerU manages five categories of models:
| Category | Models | Backend Usage |
|---|---|---|
| Layout Detection | DocLayout-YOLO | Pipeline, Hybrid |
| OCR | PaddleOCR (detection + recognition) | Pipeline, Hybrid |
| Formula Recognition | MFD, MFR, UniMERNet | Pipeline, Hybrid |
| Table Recognition | UNet, RapidTable | Pipeline |
| VLM | Qwen2-VL (2B/7B) | VLM, Hybrid |
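The source-selection priority described above can be sketched as a small resolver; `resolve_model_source` and the `model_source` config key are illustrative names, not MinerU's actual API:

```python
import json
import os

def resolve_model_source(cli_source=None, config_path="~/mineru.json"):
    """Apply the documented priority: env var > CLI flag > config file > default."""
    env = os.environ.get("MINERU_MODEL_SOURCE")
    if env:
        return env
    if cli_source:
        return cli_source
    path = os.path.expanduser(config_path)
    if os.path.isfile(path):
        with open(path) as f:
            source = json.load(f).get("model_source")  # hypothetical key name
        if source:
            return source
    return "huggingface"
```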
Model instances are cached using the Singleton pattern to minimize memory usage and initialization overhead:
Key Features:
Implementation Pattern:
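A minimal sketch of the pattern (not MinerU's actual class body) might look like:

```python
class ModelSingleton:
    """One shared instance; each model is loaded lazily and cached by key."""
    _instance = None
    _models = {}

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_model(self, key, loader):
        """Run `loader` at most once per key and cache the result."""
        if key not in self._models:
            self._models[key] = loader()
        return self._models[key]
```

Repeated `get_model` calls with the same key return the cached object, so each model is initialized exactly once per process.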
See Model Management and Singleton Pattern for implementation details.
The mineru-models-download CLI provides interactive model acquisition:
Entry Point: mineru.cli.models_download:download_models
Usage:
Behavior:
- Downloads models to `~/.cache/mineru/` by default
- Writes configuration to `~/mineru.json`

Configuration File Structure (`mineru.json`):
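For illustration only, a configuration of this general shape could be generated as follows; the key names are assumptions, not the published schema, so compare against the `mineru.json` the download tool actually writes:

```python
import json

# Illustrative shape only: both keys below are hypothetical.
example_config = {
    "model_source": "huggingface",    # hypothetical default-source key
    "models_dir": "~/.cache/mineru/", # hypothetical local-model path key
}
print(json.dumps(example_config, indent=2))
```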
See Model Download and Configuration for complete documentation.
Sources: docs/zh/usage/model_source.md docs/en/usage/model_source.md pyproject.toml116 Diagram 5 from provided diagrams
All processing backends converge to a standardized middle.json intermediate format, which then feeds into multiple output generators. This architecture decouples parsing from rendering.
The middle.json file contains hierarchical document structure with rich metadata:
Top-Level Schema:
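A heavily abbreviated skeleton of the hierarchy is sketched below; the field names reflect MinerU's documented schema but should be verified against the `middle.json` your installed version actually produces:

```python
# Sketch only: real files carry many more fields per block and per line.
middle_json = {
    "pdf_info": [                 # one entry per page
        {
            "page_idx": 0,
            "page_size": [612.0, 792.0],
            "para_blocks": [      # blocks in reading order, typed by BlockType
                {"type": "title", "bbox": [90, 72, 520, 110], "lines": []},
                {"type": "text",  "bbox": [90, 120, 520, 300], "lines": []},
            ],
        }
    ],
    "_backend": "pipeline",       # which backend produced the file
    "_version_name": "2.x",
}
```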
Block Types (BlockType enum):
- `text` - Paragraph text
- `title` - Heading with hierarchy level
- `image` - Image with optional caption
- `table` - Table with HTML representation
- `equation` - Display equation (LaTeX)
- `inline_equation` - Inline formula (LaTeX)
- `list_item` - List item

See middle.json Intermediate Format for complete schema.
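For downstream code, the block types listed above can be mirrored locally; this is a sketch, and the real enum ships inside the mineru package:

```python
from enum import Enum

class BlockType(str, Enum):
    """Local mirror of the documented block type strings."""
    TEXT = "text"
    TITLE = "title"
    IMAGE = "image"
    TABLE = "table"
    EQUATION = "equation"
    INLINE_EQUATION = "inline_equation"
    LIST_ITEM = "list_item"

# Raw strings from middle.json round-trip cleanly through the enum:
assert BlockType("table") is BlockType.TABLE
```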
Two Markdown variants are generated via union_make functions:
Multimodal Markdown (`auto/*.md`): the default output for multimodal applications, with extracted images embedded by reference.
Use Cases: RAG systems, multimodal LLM training, document QA
NLP Markdown (`auto_nlp/*.md`): a text-only variant for NLP tasks; images are replaced with placeholders such as `[Image: caption text]`.

Use Cases: Text summarization, keyword extraction, translation
Generator Implementation: mineru/backend/*/mkcontent.py modules contain union_make functions
See Markdown Generation for rendering details.
Flat list of content blocks sorted by reading order:
Schema:
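A sketch of how a consumer might use the format; the entries below are hypothetical and abbreviated (real entries carry additional fields such as bounding boxes and image paths):

```python
# Hypothetical minimal content_list entries.
content_list = [
    {"type": "title", "text": "1 Introduction", "page_idx": 0},
    {"type": "text", "text": "MinerU converts PDFs into structured formats.", "page_idx": 0},
    {"type": "equation", "text": "E = mc^2", "page_idx": 1},
]

# The list is already sorted by reading order, so consumers simply iterate:
plain_text = "\n".join(
    b["text"] for b in content_list if b["type"] in ("text", "title")
)
```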
Use Cases: Sequential document processing, pipeline feeding, custom rendering
See content_list.json Structure for format specification.
Three PDF visualizations aid debugging and quality inspection:
Generation Control: Enabled via CLI flags or configuration
See Debug and Visualization Outputs for examples.
Sources: README.md107-110 README_zh-CN.md107-111 Diagram 2 and Diagram 6 from provided diagrams
MinerU supports three primary deployment patterns: standalone installation, Docker containerization, and client-server architecture.
Local installation via pip/uv for direct Python usage:
Installation:
Optional Dependencies (pyproject.toml46-103):
- `[vlm]` - transformers + torch for VLM backends
- `[vllm]` - vLLM inference engine (Linux only)
- `[lmdeploy]` - LMDeploy inference engine (Windows/Linux)
- `[mlx]` - MLX inference engine (macOS only)
- `[pipeline]` - Pipeline backend models
- `[api]` - FastAPI server dependencies
- `[gradio]` - Gradio web UI dependencies
- `[core]` - All of the above except inference engines
- `[all]` - Complete installation with platform-specific engines

See Installation and Setup for detailed instructions.
Pre-built Docker images support rapid deployment and hardware compatibility:
Image Variants:
- `registry.hub.docker.com/opendatalab/mineru`

Docker Compose Profiles:
Platform-Specific Images:
Acceleration card-specific Dockerfiles in docker/ provide:
- Device mappings (`/dev/card*`, `/dev/nvidia*`)

See Docker Deployment and Platform-Specific Docker Images.
Distributed deployment separates GPU resources (server) from processing requests (clients):
Server Setup:
Lightweight Client:
Architecture Benefits:
Deployment Pattern:
See Client-Server Architecture for configuration details.
Sources: README.md217-242 pyproject.toml46-103 docs/zh/quick_start/index.md107-134 Diagram 3 from provided diagrams
The following table summarizes hardware and software requirements across backends:
| Backend | OS Support | CPU Support | Min VRAM | Min RAM | Disk Space | Python |
|---|---|---|---|---|---|---|
| pipeline | Linux²/Windows/macOS | ✅ Yes | 6GB (GPU) | 16GB | 20GB | 3.10-3.13³ |
| hybrid-auto-engine | Linux²/Windows/macOS | ❌ No | 10GB | 16GB | 20GB | 3.10-3.13³ |
| vlm-auto-engine | Linux²/Windows/macOS | ❌ No | 8GB | 16GB | 20GB | 3.10-3.13³ |
| hybrid-http-client | Linux²/Windows/macOS | ✅ Yes | 3GB (GPU)¹ | 8GB | 2GB | 3.10-3.13³ |
| vlm-http-client | Linux²/Windows/macOS | ✅ Yes | 3GB (GPU)¹ | 8GB | 2GB | 3.10-3.13³ |
¹ VRAM required on remote server only
² Linux distributions from 2019 or later
³ Windows limited to Python 3.10-3.12 due to ray dependency
GPU Requirements:
Accuracy Metrics (OmniDocBench v1.5):
Sources: README.md142-212 README_zh-CN.md141-212 docs/zh/quick_start/index.md29-100
MinerU implements automated release pipelines via GitHub Actions workflows defined in .github/workflows/:
Release Workflow (`python-package.yml`): triggered by Git tags matching the `*released` pattern.
Stages:
Authentication:
- `RELEASE_TOKEN` GitHub secret for repository access
- `PYPI_TOKEN` GitHub secret for PyPI publication

Test Workflow (`cli-test.yml`): runs on every push to master/dev branches.
Documentation Workflow (`mkdocs.yml`): automatically deploys documentation to GitHub Pages via the `gh-pages` branch.

See CI/CD Pipeline and Workflows and Version Management and Release Process for implementation details.
Sources: Diagram 4 from provided diagrams, pyproject.toml120-121
Complete example demonstrating basic usage:
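One way to wrap the CLI from Python, assuming only the flags from the Key Parameters section and the output layout listed under Output Files; `parse_document` and `expected_markdown_path` are illustrative helpers, not mineru APIs:

```python
import subprocess
from pathlib import Path

def expected_markdown_path(pdf, out_dir="output"):
    """Where the multimodal Markdown lands, per the documented output layout."""
    return Path(out_dir) / "auto" / (Path(pdf).stem + ".md")

def parse_document(pdf, out_dir="output"):
    """Invoke the CLI with the documented flags and return the Markdown path."""
    subprocess.run(
        ["mineru", "-p", pdf, "-o", out_dir, "-b", "pipeline"],
        check=True,
    )
    return expected_markdown_path(pdf, out_dir)
```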
Output Files:
- `output/auto/document.md` - Multimodal Markdown
- `output/auto_nlp/document.md` - Text-only Markdown
- `output/document/middle.json` - Intermediate format
- `output/document/content_list.json` - Flat content list
- `output/document/images/` - Extracted images

For detailed usage patterns, see Quick Start Examples.
Sources: README.md246-256 docs/zh/quick_start/index.md136-151