The VLM Backend is a high-accuracy document parsing backend that leverages Vision-Language Models (VLMs) for end-to-end document understanding. Unlike the traditional Pipeline Backend which uses multiple specialized models in sequence, the VLM Backend employs a unified vision-language model (primarily Qwen2-VL) to directly extract document structure and content from page images.
This page documents the VLM Backend's architecture, model management, inference engine selection, and processing flow. For information about the traditional multi-model approach, see Pipeline Backend. For the hybrid approach that combines VLM with pipeline components, see Hybrid Backend. For overall system orchestration, see Core Orchestration.
The VLM Backend follows a streamlined architecture where document pages are converted to images and processed through a vision-language model in a two-step extraction process. The backend supports multiple inference engines and hardware accelerators, with automatic engine selection based on available resources.
Sources: mineru/backend/vlm/vlm_analyze.py:1-273
The VLM Backend implements a singleton pattern for efficient model caching. The ModelSingleton class ensures that models are loaded only once per unique configuration, avoiding redundant initialization costs.
Sources: mineru/backend/vlm/vlm_analyze.py:23-219
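The caching idea can be sketched in a few lines. This is a minimal illustration of the pattern, not MinerU's implementation: a process-wide singleton holds a dictionary keyed by the full configuration, so two calls with identical arguments share one model object while any differing argument triggers a fresh load.

```python
class ModelCache:
    """Minimal sketch of per-configuration model caching.

    Class and method names mirror the idea of ModelSingleton.get_model();
    the real logic lives in mineru/backend/vlm/vlm_analyze.py.
    """
    _instance = None

    def __new__(cls):
        # Classic singleton: one shared cache per process.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get_model(self, backend, model_path=None, server_url=None, **kwargs):
        # Key on every configuration detail so distinct configs
        # get distinct model instances.
        key = (backend, model_path, server_url, tuple(sorted(kwargs.items())))
        if key not in self._models:
            self._models[key] = self._load(backend, model_path, server_url, kwargs)
        return self._models[key]

    def _load(self, backend, model_path, server_url, kwargs):
        # Placeholder for real engine initialization.
        return {"backend": backend, "path": model_path}
```

Repeated calls with the same configuration return the cached object, which is what makes batch workflows cheap after the first page.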
The ModelSingleton.get_model() method accepts the following parameters:
| Parameter | Type | Purpose |
|---|---|---|
| backend | str | Inference engine selection: "transformers", "vllm-engine", "vllm-async-engine", "lmdeploy-engine", "mlx-engine", "http-client" |
| model_path | str \| None | Local model directory (auto-downloaded if None) |
| server_url | str \| None | Remote server URL for the http-client backend |
| **kwargs | dict | Engine-specific configuration parameters |
Special kwargs extracted before engine initialization:
- `batch_size`: Batch processing size for the transformers backend (default: auto-calculated based on VRAM)
- `max_concurrency`: Maximum concurrent requests for the http-client backend (default: 100)
- `http_timeout`: Request timeout for the http-client backend (default: 600s)
- `server_headers`: Custom HTTP headers for the http-client backend
- `max_retries`: Retry attempts for failed HTTP requests (default: 3)
- `retry_backoff_factor`: Exponential backoff multiplier (default: 0.5)

Sources: mineru/backend/vlm/vlm_analyze.py:32-56
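To make the retry parameters concrete, here is an illustrative backoff schedule. The exact delay formula used by the http-client backend is an assumption; this sketch follows the common urllib3-style rule, where the delay before retry *n* is `backoff_factor * 2**n`:

```python
def backoff_delays(max_retries=3, backoff_factor=0.5):
    """Illustrative exponential backoff schedule (helper name is
    hypothetical; the real retry logic may use a different formula).

    With the defaults above (max_retries=3, backoff_factor=0.5),
    retries wait 0.5s, 1.0s, then 2.0s.
    """
    return [backoff_factor * (2 ** attempt) for attempt in range(max_retries)]
```

Doubling delays like this spreads retries out so a briefly overloaded inference server is not hammered with immediate re-requests.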
The VLM Backend supports six inference engines, each optimized for different hardware and use cases:
Standard HuggingFace transformers implementation with manual batch processing.
Batch size determination: when no explicit `batch_size` is supplied, `set_default_batch_size()` derives a default from the detected VRAM.
Sources: mineru/backend/vlm/vlm_analyze.py:59-84, mineru/backend/vlm/utils.py:92-108
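The shape of such a VRAM-tiered default can be sketched as follows. The tier boundaries and batch sizes here are illustrative assumptions, not the values in `set_default_batch_size()` (see mineru/backend/vlm/utils.py for the real thresholds):

```python
def default_batch_size(vram_gb):
    """Hypothetical VRAM-to-batch-size tiers; thresholds are
    placeholders, not MinerU's actual values."""
    if vram_gb >= 24:
        return 8   # large cards: bigger batches amortize per-call overhead
    if vram_gb >= 16:
        return 4
    if vram_gb >= 8:
        return 2
    return 1       # small cards: avoid out-of-memory errors
```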
MinerU supports two vLLM modes:
Default kwargs:
- `gpu_memory_utilization`: Auto-set (0.7 for vLLM >= 0.11.0 with VRAM <= 8GB, else 0.5)
- `model`: `model_path`
- `logits_processors`: `[MinerULogitsProcessor]` if compute capability >= 8.0 and vLLM >= 0.10.1

Sources: mineru/backend/vlm/vlm_analyze.py:98-122, mineru/backend/vlm/utils.py:82-89, mineru/backend/vlm/utils.py:11-56
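The `gpu_memory_utilization` rule stated above can be written out directly. The helper name is hypothetical; only the rule itself comes from the source:

```python
def default_gpu_memory_utilization(vllm_version, vram_gb):
    """Sketch of the auto-selection rule described above:
    0.7 for vLLM >= 0.11.0 on cards with <= 8 GB of VRAM, else 0.5.
    (Helper name is illustrative, not MinerU's API.)
    """
    major, minor, *_ = (int(p) for p in vllm_version.split("."))
    if (major, minor) >= (0, 11) and vram_gb <= 8:
        return 0.7
    return 0.5
```

Reserving a larger fraction on small cards leaves vLLM enough room for its KV cache, while larger cards keep headroom for other processes.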
Notable differences:
- `compilation_config` must be a `CompilationConfig` object (auto-converted from a dict or JSON string)

Sources: mineru/backend/vlm/vlm_analyze.py:123-154
Optimized for diverse hardware including CUDA, Ascend NPUs, and other domestic (Chinese-market) accelerators.
Configuration:
- `cache_max_entry_count`: Default 0.5
- `MINERU_LMDEPLOY_DEVICE` env var or `lmdeploy_device` kwarg
- `MINERU_LMDEPLOY_BACKEND` env var or `lmdeploy_backend` kwarg

Sources: mineru/backend/vlm/vlm_analyze.py:156-201, mineru/backend/vlm/utils.py:59-79
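Device resolution from either source can be sketched as below. The precedence order (explicit kwarg wins over the environment variable) is an assumption; the real resolution lives in mineru/backend/vlm/utils.py:

```python
import os

def resolve_lmdeploy_device(lmdeploy_device=None):
    """Illustrative resolution of the LMDeploy device (function name is
    hypothetical): prefer the explicit kwarg, then the
    MINERU_LMDEPLOY_DEVICE environment variable, then "cuda".
    """
    if lmdeploy_device is not None:
        return lmdeploy_device
    return os.environ.get("MINERU_LMDEPLOY_DEVICE", "cuda")
```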
Exclusive to Apple Silicon (macOS 13.5+):
Sources: mineru/backend/vlm/vlm_analyze.py:85-93
Lightweight client for remote inference:
Sources: mineru/backend/vlm/vlm_analyze.py:57-58, mineru/backend/vlm/vlm_analyze.py:202-216
The VLM Backend automatically adapts configuration parameters based on detected hardware. The mod_kwargs_by_device_type() function modifies vLLM kwargs for specific accelerators:
| Device Type | Environment Variable | Configuration Applied |
|---|---|---|
| corex (IluvatarCorex) | MINERU_VLLM_DEVICE=corex | compilation_config: FULL_DECODE_ONLY mode |
| kxpu (Kunlunxin) | MINERU_VLLM_DEVICE=kxpu | compilation_config: splitting_ops list, block_size=128, dtype="float16", distributed_executor_backend="mp" |
| Default (CUDA/others) | Not set | No modifications |
Configuration injection logic:
- server mode (vLLM OpenAI server): injects command-line arguments
- sync_engine / async_engine modes: injects kwargs via `_get_device_config()`

Sources: mineru/backend/vlm/utils.py:111-233
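A simplified version of the kwarg-injection path, using only the kxpu values from the table above, might look like this. The function name echoes `mod_kwargs_by_device_type()`, but the body is a sketch, not the real implementation:

```python
import os

def mod_kwargs_by_device_type(kwargs):
    """Sketch of per-accelerator kwarg injection based on the table above;
    the real logic (including the corex compilation_config and the
    splitting_ops list) lives in mineru/backend/vlm/utils.py."""
    device = os.environ.get("MINERU_VLLM_DEVICE")
    if device == "kxpu":
        # Kunlunxin values from the table; setdefault lets users override.
        kwargs.setdefault("block_size", 128)
        kwargs.setdefault("dtype", "float16")
        kwargs.setdefault("distributed_executor_backend", "mp")
    # "corex" would set a FULL_DECODE_ONLY compilation_config;
    # default CUDA is left untouched.
    return kwargs
```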
The VLM Backend conditionally enables MinerULogitsProcessor for improved token generation:
Enabling conditions:
- `VLLM_USE_V1` != "0"

Sources: mineru/backend/vlm/utils.py:11-56
The VLM Backend provides both synchronous and asynchronous entry points for document analysis:
Sources: mineru/backend/vlm/vlm_analyze.py:222-246
Identical flow to doc_analyze() but uses async/await for I/O-bound operations:
Sources: mineru/backend/vlm/vlm_analyze.py:249-272
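The sync/async pairing can be illustrated with a toy coroutine. Names and structure here are illustrative, not MinerU's API; the point is that the async entry point awaits at I/O boundaries so other coroutines can run between pages, while the sync entry point simply drives the same flow to completion:

```python
import asyncio

async def aio_analyze(pages):
    """Toy stand-in for an async entry point like aio_doc_analyze()."""
    results = []
    for page in pages:
        # In the real flow this would be an awaited network or engine call;
        # sleep(0) just yields control to the event loop.
        await asyncio.sleep(0)
        results.append(f"parsed:{page}")
    return results

def analyze(pages):
    """Synchronous wrapper driving the same coroutine to completion."""
    return asyncio.run(aio_analyze(pages))
```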
Image loading is performed by load_images_from_pdf() with the following characteristics:
Output structure:
Performance logging:
Sources: mineru/backend/vlm/vlm_analyze.py:234-243, mineru/backend/vlm/vlm_analyze.py:261-270
The batch_two_step_extract() method (provided by MinerUClient from mineru_vl_utils) performs VLM inference in two passes:
This two-step approach improves accuracy by separating layout understanding from content recognition.
Sources: mineru/backend/vlm/vlm_analyze.py:241, mineru/backend/vlm/vlm_analyze.py:268
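Conceptually, the two passes compose like this. The function names and block structure are illustrative stand-ins, not the `MinerUClient` API:

```python
def two_step_extract(page_image, detect_layout, recognize):
    """Conceptual sketch of the two-pass scheme (names are hypothetical):
    pass 1 asks the VLM for layout blocks on the whole page, pass 2 asks
    it to transcribe the content of each detected block."""
    blocks = detect_layout(page_image)  # step 1: layout + block types
    for block in blocks:
        # step 2: content recognition, scoped to one block at a time
        block["content"] = recognize(page_image, block["bbox"], block["type"])
    return blocks
```

Separating the passes means the content-recognition prompt never has to reason about page geometry, which is the accuracy benefit described above.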
After VLM inference, results are transformed into the standardized middle JSON format by result_to_middle_json(). This function is defined in mineru/backend/vlm/model_output_to_middle_json.py and performs:
- saving extracted images via the supplied image_writer

The resulting middle JSON contains:
- `pdf_info`: Document metadata (page count, dimensions)
- `para_blocks`: Ordered list of content blocks
- `discarded_blocks`: Filtered or invalid blocks

Sources: mineru/backend/vlm/vlm_analyze.py:10, mineru/backend/vlm/vlm_analyze.py:245, mineru/backend/vlm/vlm_analyze.py:271
The VLM Backend inherits block sorting functionality from the shared block sorting module. For detailed information on reading order detection, see Post-Processing and Block Sorting.
After middle JSON generation, blocks are sorted using:
Sorting strategy:
Sources: mineru/utils/block_sort.py:15-37, mineru/utils/block_sort.py:57-134
The block sorting module uses its own ModelSingleton class (distinct from VLM's ModelSingleton) to cache the LayoutLMv3 reading order model:
Model initialization:
- Model weights resolved via auto_download_and_get_model_root_path()
- Target device selected via get_device()

Sources: mineru/utils/block_sort.py:234-246, mineru/utils/block_sort.py:179-231
| Variable | Purpose | Default |
|---|---|---|
| MINERU_LMDEPLOY_DEVICE | LMDeploy device type (cuda/ascend/maca/camb) | "cuda" |
| MINERU_LMDEPLOY_BACKEND | LMDeploy backend (pytorch/turbomind) | Auto-selected |
| MINERU_VLLM_DEVICE | vLLM device type for special configs | Not set |
| MINERU_DEVICE_MODE | Override device detection | Auto-detected |
| MINERU_VIRTUAL_VRAM_SIZE | Override VRAM detection (GB) | Auto-detected |
| VLLM_USE_V1 | Enable vLLM v1 features | "1" |
| OMP_NUM_THREADS | OpenMP thread count | "1" (vLLM/LMDeploy) |
| TM_LOG_LEVEL | LMDeploy TurboMind log level | "ERROR" |
Sources: mineru/backend/vlm/vlm_analyze.py:95-96, mineru/backend/vlm/vlm_analyze.py:164-196, mineru/backend/vlm/utils.py:35-39, mineru/backend/vlm/utils.py:182, mineru/utils/config_reader.py:76-107, mineru/utils/model_utils.py:450-486
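As a usage illustration, the variables from the table are plain process environment variables set before the backend initializes. The specific values below are examples, not recommendations:

```python
import os

# Illustrative pre-run configuration; variable names come from the
# table above, the values are example choices.
os.environ["MINERU_LMDEPLOY_DEVICE"] = "ascend"  # target an Ascend NPU via LMDeploy
os.environ["MINERU_VIRTUAL_VRAM_SIZE"] = "16"    # override VRAM detection to 16 GB
```

Setting these in the shell (`export MINERU_LMDEPLOY_DEVICE=ascend`) before launching works equally well, as long as it happens before model initialization.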
| Use Case | Recommended Backend | Rationale |
|---|---|---|
| Single document, GPU available | transformers | Simple, no extra dependencies |
| Batch processing, NVIDIA GPU | vllm-engine | High throughput, GPU memory efficient |
| Async/concurrent requests | vllm-async-engine | Better concurrency handling |
| Ascend/METAX/T-Head NPUs | lmdeploy-engine | Optimized for domestic accelerators |
| Apple Silicon (M1/M2/M3) | mlx-engine | Native Metal acceleration |
| Client-server deployment | http-client | No local model, lightweight client |
Sources: Based on analysis from mineru/backend/vlm/vlm_analyze.py:59-201
Sources: Inferred from mineru/backend/vlm/vlm_analyze.py:222-272
VRAM optimization strategies:
- vLLM: tune the `gpu_memory_utilization` kwarg
- LMDeploy: `cache_max_entry_count=0.5` by default

Sources: mineru/backend/vlm/utils.py:82-108, mineru/backend/vlm/vlm_analyze.py:161-162
Factors affecting processing speed:
- aio_doc_analyze() enables concurrent document processing

Logged metrics:
- Image conversion: {time}s, {images/s} images/s
- VLM inference: {time}s, {pages/s} page/s

Sources: mineru/backend/vlm/vlm_analyze.py:237-243, mineru/backend/vlm/vlm_analyze.py:264-270
Acceleration card compatibility:
For detailed hardware setup instructions, see Hardware Acceleration.
Sources: docs/zh/usage/acceleration_cards/Tecorigin.md:49-115