The VLM Backend is a high-accuracy document parsing backend that leverages Vision-Language Models (VLMs) for end-to-end document understanding. Unlike the traditional Pipeline Backend which uses multiple specialized models in sequence, the VLM Backend employs a unified vision-language model (primarily Qwen2-VL) to directly extract document structure and content from page images.
This page documents the VLM Backend's architecture, model management, inference engine selection, and processing flow. For information about the traditional multi-model approach, see Pipeline Backend. For the hybrid approach that combines VLM with pipeline components, see Hybrid Backend. For overall system orchestration, see Core Orchestration.
The VLM Backend follows a streamlined architecture where document pages are converted to images and processed through a vision-language model in a two-step extraction process. The backend supports multiple inference engines and hardware accelerators, with automatic engine selection based on available resources.
Sources: mineru/backend/vlm/vlm_analyze.py:1-273
The VLM Backend implements a singleton pattern for efficient model caching. The ModelSingleton class ensures that models are loaded only once per unique configuration, avoiding redundant initialization costs.
Sources: mineru/backend/vlm/vlm_analyze.py:23-219
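The caching idea can be sketched in a few lines. This is a minimal illustration of the pattern, not MinerU's implementation: a process-wide singleton holds a dictionary keyed by the full configuration, so two calls with identical arguments share one model object while any differing argument triggers a fresh load.

```python
class ModelCache:
    """Minimal sketch of per-configuration model caching.

    Class and method names mirror the idea of ModelSingleton.get_model();
    the real logic lives in mineru/backend/vlm/vlm_analyze.py.
    """
    _instance = None

    def __new__(cls):
        # Classic singleton: one shared cache per process.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get_model(self, backend, model_path=None, server_url=None, **kwargs):
        # Key on every configuration detail so distinct configs
        # get distinct model instances.
        key = (backend, model_path, server_url, tuple(sorted(kwargs.items())))
        if key not in self._models:
            self._models[key] = self._load(backend, model_path, server_url, kwargs)
        return self._models[key]

    def _load(self, backend, model_path, server_url, kwargs):
        # Placeholder for real engine initialization.
        return {"backend": backend, "path": model_path}
```

Repeated calls with the same configuration return the cached object, which is what makes batch workflows cheap after the first page.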
The ModelSingleton.get_model() method accepts the following parameters:
| Parameter | Type | Purpose |
|---|---|---|
| backend | str | Inference engine selection: "transformers", "vllm-engine", "vllm-async-engine", "lmdeploy-engine", "mlx-engine", "http-client" |
| model_path | str \| None | Local model directory (auto-downloaded if None) |
| server_url | str \| None | Remote server URL for the http-client backend |
| **kwargs | dict | Engine-specific configuration parameters |
Special kwargs extracted before engine initialization:
- `batch_size`: Batch processing size for the transformers backend (default: auto-calculated based on VRAM)
- `max_concurrency`: Maximum concurrent requests for the http-client backend (default: 100)
- `http_timeout`: Request timeout for the http-client backend (default: 600s)
- `server_headers`: Custom HTTP headers for the http-client backend
- `max_retries`: Retry attempts for failed HTTP requests (default: 3)
- `retry_backoff_factor`: Exponential backoff multiplier (default: 0.5)

Sources: mineru/backend/vlm/vlm_analyze.py:32-56
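To make the retry parameters concrete, here is an illustrative backoff schedule. The exact delay formula used by the http-client backend is an assumption; this sketch follows the common urllib3-style rule, where the delay before retry *n* is `backoff_factor * 2**n`:

```python
def backoff_delays(max_retries=3, backoff_factor=0.5):
    """Illustrative exponential backoff schedule (helper name is
    hypothetical; the real retry logic may use a different formula).

    With the defaults above (max_retries=3, backoff_factor=0.5),
    retries wait 0.5s, 1.0s, then 2.0s.
    """
    return [backoff_factor * (2 ** attempt) for attempt in range(max_retries)]
```

Doubling delays like this spreads retries out so a briefly overloaded inference server is not hammered with immediate re-requests.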
The VLM Backend supports six inference engines, each optimized for different hardware and use cases:
Standard HuggingFace transformers implementation with manual batch processing.
Batch size determination: when no explicit `batch_size` is supplied, `set_default_batch_size()` derives a default from the detected VRAM.
Sources: mineru/backend/vlm/vlm_analyze.py:59-84, mineru/backend/vlm/utils.py:92-108
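The shape of such a VRAM-tiered default can be sketched as follows. The tier boundaries and batch sizes here are illustrative assumptions, not the values in `set_default_batch_size()` (see mineru/backend/vlm/utils.py for the real thresholds):

```python
def default_batch_size(vram_gb):
    """Hypothetical VRAM-to-batch-size tiers; thresholds are
    placeholders, not MinerU's actual values."""
    if vram_gb >= 24:
        return 8   # large cards: bigger batches amortize per-call overhead
    if vram_gb >= 16:
        return 4
    if vram_gb >= 8:
        return 2
    return 1       # small cards: avoid out-of-memory errors
```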
MinerU supports two vLLM modes:
Default kwargs:
- `gpu_memory_utilization`: Auto-set (0.7 for vLLM >= 0.11.0 with VRAM <= 8GB, else 0.5)
- `model`: `model_path`
- `logits_processors`: `[MinerULogitsProcessor]` if compute capability >= 8.0 and vLLM >= 0.10.1

Sources: mineru/backend/vlm/vlm_analyze.py:98-122, mineru/backend/vlm/utils.py:82-89, mineru/backend/vlm/utils.py:11-56
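The `gpu_memory_utilization` rule stated above can be written out directly. The helper name is hypothetical; only the rule itself comes from the source:

```python
def default_gpu_memory_utilization(vllm_version, vram_gb):
    """Sketch of the auto-selection rule described above:
    0.7 for vLLM >= 0.11.0 on cards with <= 8 GB of VRAM, else 0.5.
    (Helper name is illustrative, not MinerU's API.)
    """
    major, minor, *_ = (int(p) for p in vllm_version.split("."))
    if (major, minor) >= (0, 11) and vram_gb <= 8:
        return 0.7
    return 0.5
```

Reserving a larger fraction on small cards leaves vLLM enough room for its KV cache, while larger cards keep headroom for other processes.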
Notable differences:
- `compilation_config` must be a `CompilationConfig` object (auto-converted from a dict or JSON string)

Sources: mineru/backend/vlm/vlm_analyze.py:123-154
Optimized for diverse hardware including CUDA, Ascend NPUs, and other domestic (Chinese-market) accelerators.
Configuration:
- `cache_max_entry_count`: Default 0.5
- `MINERU_LMDEPLOY_DEVICE` env var or `lmdeploy_device` kwarg
- `MINERU_LMDEPLOY_BACKEND` env var or `lmdeploy_backend` kwarg

Sources: mineru/backend/vlm/vlm_analyze.py:156-201, mineru/backend/vlm/utils.py:59-79
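Device resolution from either source can be sketched as below. The precedence order (explicit kwarg wins over the environment variable) is an assumption; the real resolution lives in mineru/backend/vlm/utils.py:

```python
import os

def resolve_lmdeploy_device(lmdeploy_device=None):
    """Illustrative resolution of the LMDeploy device (function name is
    hypothetical): prefer the explicit kwarg, then the
    MINERU_LMDEPLOY_DEVICE environment variable, then "cuda".
    """
    if lmdeploy_device is not None:
        return lmdeploy_device
    return os.environ.get("MINERU_LMDEPLOY_DEVICE", "cuda")
```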
Exclusive to Apple Silicon (macOS 13.5+):
Sources: mineru/backend/vlm/vlm_analyze.py:85-93
Lightweight client for remote inference:
Sources: mineru/backend/vlm/vlm_analyze.py:57-58, mineru/backend/vlm/vlm_analyze.py:202-216
The VLM Backend automatically adapts configuration parameters based on detected hardware. The mod_kwargs_by_device_type() function modifies vLLM kwargs for specific accelerators:
| Device Type | Environment Variable | Configuration Applied |
|---|---|---|
| corex (IluvatarCorex) | MINERU_VLLM_DEVICE=corex | compilation_config: FULL_DECODE_ONLY mode |
| kxpu (Kunlunxin) | MINERU_VLLM_DEVICE=kxpu | compilation_config: splitting_ops list, block_size=128, dtype="float16", distributed_executor_backend="mp" |
| Default (CUDA/others) | Not set | No modifications |
Configuration injection logic:
- server mode (vLLM OpenAI server): injects command-line arguments
- sync_engine / async_engine modes: injects kwargs via `_get_device_config()`

Sources: mineru/backend/vlm/utils.py:111-233
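A simplified version of the kwarg-injection path, using only the kxpu values from the table above, might look like this. The function name echoes `mod_kwargs_by_device_type()`, but the body is a sketch, not the real implementation:

```python
import os

def mod_kwargs_by_device_type(kwargs):
    """Sketch of per-accelerator kwarg injection based on the table above;
    the real logic (including the corex compilation_config and the
    splitting_ops list) lives in mineru/backend/vlm/utils.py."""
    device = os.environ.get("MINERU_VLLM_DEVICE")
    if device == "kxpu":
        # Kunlunxin values from the table; setdefault lets users override.
        kwargs.setdefault("block_size", 128)
        kwargs.setdefault("dtype", "float16")
        kwargs.setdefault("distributed_executor_backend", "mp")
    # "corex" would set a FULL_DECODE_ONLY compilation_config;
    # default CUDA is left untouched.
    return kwargs
```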
The VLM Backend conditionally enables MinerULogitsProcessor for improved token generation:
Enabling conditions:
- `VLLM_USE_V1` != "0"

Sources: mineru/backend/vlm/utils.py:11-56
The VLM Backend provides both synchronous and asynchronous entry points for document analysis:
Sources: mineru/backend/vlm/vlm_analyze.py:222-246
Identical flow to doc_analyze() but uses async/await for I/O-bound operations:
Sources: mineru/backend/vlm/vlm_analyze.py:249-272
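The sync/async pairing can be illustrated with a toy coroutine. Names and structure here are illustrative, not MinerU's API; the point is that the async entry point awaits at I/O boundaries so other coroutines can run between pages, while the sync entry point simply drives the same flow to completion:

```python
import asyncio

async def aio_analyze(pages):
    """Toy stand-in for an async entry point like aio_doc_analyze()."""
    results = []
    for page in pages:
        # In the real flow this would be an awaited network or engine call;
        # sleep(0) just yields control to the event loop.
        await asyncio.sleep(0)
        results.append(f"parsed:{page}")
    return results

def analyze(pages):
    """Synchronous wrapper driving the same coroutine to completion."""
    return asyncio.run(aio_analyze(pages))
```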
Image loading is performed by load_images_from_pdf() with the following characteristics:
Output structure:
Performance logging:
Sources: mineru/backend/vlm/vlm_analyze.py:234-243, mineru/backend/vlm/vlm_analyze.py:261-270
The batch_two_step_extract() method (provided by MinerUClient from mineru_vl_utils) performs VLM inference in two passes:
This two-step approach improves accuracy by separating layout understanding from content recognition.
Sources: mineru/backend/vlm/vlm_analyze.py:241, mineru/backend/vlm/vlm_analyze.py:268
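Conceptually, the two passes compose like this. The function names and block structure are illustrative stand-ins, not the `MinerUClient` API:

```python
def two_step_extract(page_image, detect_layout, recognize):
    """Conceptual sketch of the two-pass scheme (names are hypothetical):
    pass 1 asks the VLM for layout blocks on the whole page, pass 2 asks
    it to transcribe the content of each detected block."""
    blocks = detect_layout(page_image)  # step 1: layout + block types
    for block in blocks:
        # step 2: content recognition, scoped to one block at a time
        block["content"] = recognize(page_image, block["bbox"], block["type"])
    return blocks
```

Separating the passes means the content-recognition prompt never has to reason about page geometry, which is the accuracy benefit described above.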
After VLM inference, results are transformed into the standardized middle JSON format by result_to_middle_json(). This function is defined in mineru/backend/vlm/model_output_to_middle_json.py and performs:
- saving extracted images via the supplied image_writer

The resulting middle JSON contains:
- `pdf_info`: Document metadata (page count, dimensions)
- `para_blocks`: Ordered list of content blocks
- `discarded_blocks`: Filtered or invalid blocks

Sources: mineru/backend/vlm/vlm_analyze.py:10, mineru/backend/vlm/vlm_analyze.py:245, mineru/backend/vlm/vlm_analyze.py:271
The VLM Backend inherits block sorting functionality from the shared block sorting module. For detailed information on reading order detection, see Post-Processing and Block Sorting.
After middle JSON generation, blocks are sorted using:
Sorting strategy:
Sources: mineru/utils/block_sort.py:15-37, mineru/utils/block_sort.py:57-134
The block sorting module uses its own ModelSingleton class (distinct from VLM's ModelSingleton) to cache the LayoutLMv3 reading order model:
Model initialization:
- Model weights resolved via auto_download_and_get_model_root_path()
- Target device selected via get_device()

Sources: mineru/utils/block_sort.py:234-246, mineru/utils/block_sort.py:179-231
| Variable | Purpose | Default |
|---|---|---|
| MINERU_LMDEPLOY_DEVICE | LMDeploy device type (cuda/ascend/maca/camb) | "cuda" |
| MINERU_LMDEPLOY_BACKEND | LMDeploy backend (pytorch/turbomind) | Auto-selected |
| MINERU_VLLM_DEVICE | vLLM device type for special configs | Not set |
| MINERU_DEVICE_MODE | Override device detection | Auto-detected |
| MINERU_VIRTUAL_VRAM_SIZE | Override VRAM detection (GB) | Auto-detected |
| VLLM_USE_V1 | Enable vLLM v1 features | "1" |
| OMP_NUM_THREADS | OpenMP thread count | "1" (vLLM/LMDeploy) |
| TM_LOG_LEVEL | LMDeploy TurboMind log level | "ERROR" |
Sources: mineru/backend/vlm/vlm_analyze.py:95-96, mineru/backend/vlm/vlm_analyze.py:164-196, mineru/backend/vlm/utils.py:35-39, mineru/backend/vlm/utils.py:182, mineru/utils/config_reader.py:76-107, mineru/utils/model_utils.py:450-486
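As a usage illustration, the variables from the table are plain process environment variables set before the backend initializes. The specific values below are examples, not recommendations:

```python
import os

# Illustrative pre-run configuration; variable names come from the
# table above, the values are example choices.
os.environ["MINERU_LMDEPLOY_DEVICE"] = "ascend"  # target an Ascend NPU via LMDeploy
os.environ["MINERU_VIRTUAL_VRAM_SIZE"] = "16"    # override VRAM detection to 16 GB
```

Setting these in the shell (`export MINERU_LMDEPLOY_DEVICE=ascend`) before launching works equally well, as long as it happens before model initialization.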
| Use Case | Recommended Backend | Rationale |
|---|---|---|
| Single document, GPU available | transformers | Simple, no extra dependencies |
| Batch processing, NVIDIA GPU | vllm-engine | High throughput, GPU memory efficient |
| Async/concurrent requests | vllm-async-engine | Better concurrency handling |
| Ascend/METAX/T-Head NPUs | lmdeploy-engine | Optimized for domestic accelerators |
| Apple Silicon (M1/M2/M3) | mlx-engine | Native Metal acceleration |
| Client-server deployment | http-client | No local model, lightweight client |
Sources: Based on analysis from mineru/backend/vlm/vlm_analyze.py:59-201
Sources: Inferred from mineru/backend/vlm/vlm_analyze.py:222-272
VRAM optimization strategies:
- vLLM: tune the `gpu_memory_utilization` kwarg
- LMDeploy: `cache_max_entry_count=0.5` by default

Sources: mineru/backend/vlm/utils.py:82-108, mineru/backend/vlm/vlm_analyze.py:161-162
Factors affecting processing speed:
- aio_doc_analyze() enables concurrent document processing

Logged metrics:
- Image conversion: {time}s, {images/s} images/s
- VLM inference: {time}s, {pages/s} page/s

Sources: mineru/backend/vlm/vlm_analyze.py:237-243, mineru/backend/vlm/vlm_analyze.py:264-270
Acceleration card compatibility:
For detailed hardware setup instructions, see Hardware Acceleration.
Sources: docs/zh/usage/acceleration_cards/Tecorigin.md:49-115