This document describes MinerU's OpenAI-compatible server infrastructure and HTTP client backends, which enable distributed document parsing by separating GPU-intensive VLM inference from lightweight client operations. The server (mineru-openai-server) exposes VLM models through an OpenAI-compatible API, while client backends (vlm-http-client, hybrid-http-client) connect to remote inference servers over HTTP.
For general backend selection and routing logic, see Core Orchestration. For local VLM inference without networking, see VLM Backend. For hybrid processing details, see Hybrid Backend.
The OpenAI-compatible server and HTTP client system provides a client-server architecture for document parsing in which GPU-intensive VLM inference runs on a central server while lightweight clients submit documents over HTTP.
Sources: mineru/cli/common.py240-305 mineru/cli/client.py60-70 pyproject.toml113-115 docker/compose.yaml2-29
MinerU provides three server command entry points defined in pyproject.toml113-115:
| Command | Function | Engine | Description |
|---|---|---|---|
| mineru-openai-server | mineru.cli.vlm_server:openai_server | Auto-selected | Automatically chooses between vLLM and LMDeploy |
| mineru-vllm-server | mineru.cli.vlm_server:vllm_server | vLLM | Explicitly uses the vLLM engine |
| mineru-lmdeploy-server | mineru.cli.vlm_server:lmdeploy_server | LMDeploy | Explicitly uses the LMDeploy engine |
The openai_server function is the recommended entry point as it auto-detects the best available engine based on the environment.
Basic usage:
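A minimal invocation might look like the following; the listen address and port are illustrative, not required values:

```shell
# Start the OpenAI-compatible VLM server; the engine (vLLM or LMDeploy)
# is auto-detected. Host and port are example values.
mineru-openai-server --host 0.0.0.0 --port 30000
```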
With vLLM-specific parameters:
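For example, vLLM engine flags can be appended directly; the values below are illustrative, not recommendations:

```shell
# vLLM-specific flags are passed through to the underlying engine.
mineru-vllm-server --host 0.0.0.0 --port 30000 \
  --gpu-memory-utilization 0.5 \
  --data-parallel-size 2
```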
All vLLM and LMDeploy parameters can be passed through to the underlying engine. See Advanced CLI Parameters for engine-specific parameter tuning.
Sources: docs/zh/quick_start/docker_deployment.md56-66 docker/compose.yaml10-16
Environment variables:
- MINERU_MODEL_SOURCE: Set to local, huggingface, or modelscope to control the model loading source
- MINERU_DEVICE_MODE: Override automatic device detection (e.g., cuda, cuda:0, npu)

Sources: docker/compose.yaml9-16 mineru/cli/client.py169-181
The Docker images based on vllm/vllm-openai include the server pre-configured:
Start with Docker Compose:
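A sketch of the compose invocation; the profile name "openai-server" is an assumption, so check the profiles actually defined in docker/compose.yaml:

```shell
# Bring up the server profile defined in docker/compose.yaml.
docker compose -f docker/compose.yaml --profile openai-server up -d
```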
Sources: docker/compose.yaml2-29 docs/zh/quick_start/docker_deployment.md56-66
The backend routing logic in mineru/cli/common.py414-484 determines whether to use local or remote inference:
In _process_vlm and _process_hybrid functions mineru/cli/common.py283-358:
Sources: mineru/cli/common.py414-484 mineru/cli/common.py267-305
The vlm-http-client backend connects to a remote VLM inference server for high-accuracy parsing of Chinese and English documents.
Usage from CLI:
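A minimal sketch, assuming a server already running at the URL shown (the address is a placeholder):

```shell
# Parse a PDF against a remote VLM server via the OpenAI-compatible API.
mineru -p demo.pdf -o ./output -b vlm-http-client -u http://<server-ip>:30000
```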
Usage from Python:
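A hedged sketch of the Python path, assuming the do_parse() and read_fn() helpers in mineru/cli/common.py; keyword names may differ across versions:

```python
# Sketch only: argument names follow mineru/cli/common.py as of this
# writing and are not guaranteed stable across MinerU versions.
from pathlib import Path
from mineru.cli.common import do_parse, read_fn

pdf_path = Path("demo.pdf")
pdf_bytes = read_fn(pdf_path)  # reads the file into bytes

do_parse(
    output_dir="./output",
    pdf_file_names=[pdf_path.stem],
    pdf_bytes_list=[pdf_bytes],
    p_lang_list=["ch"],
    backend="vlm-http-client",
    server_url="http://<server-ip>:30000",  # placeholder address
)
```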
Client requirements:
Sources: mineru/cli/client.py59-94 README.md162 docs/zh/quick_start/docker_deployment.md62-66
The hybrid-http-client backend combines remote VLM inference with local lightweight processing:
This allows for high accuracy while requiring less GPU memory on the server side (3GB VRAM vs 8GB for pure VLM).
Usage from CLI:
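The hybrid client is invoked the same way as the pure VLM client, only with a different backend name (server address is a placeholder):

```shell
# VLM inference happens on the server; OCR and formula recognition run locally.
mineru -p demo.pdf -o ./output -b hybrid-http-client -u http://<server-ip>:30000
```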
Client requirements:
The hybrid backend decision logic is in mineru/backend/hybrid/hybrid_analyze.py which determines whether to use VLM OCR or fall back to PaddleOCR for multi-language support.
Sources: mineru/cli/common.py308-358 mineru/cli/client.py68 README.md161
Key steps:
1. Client preparation: PDF bytes are prepared by _prepare_pdf_bytes(), which handles page range extraction using convert_pdf_bytes_to_bytes_by_pypdfium2() mineru/cli/common.py54-82
2. Backend routing: The backend string is checked for a "client" suffix to determine whether server_url should be used mineru/cli/common.py286-287
3. Model initialization: ModelSingleton.get_model() creates an HTTP client when the backend is "http-client" mineru/backend/vlm/vlm_analyze.py
4. Batch inference: Images are sent to the server in batches, with the server handling VLM inference
5. Response processing: Server responses are parsed into MinerU's internal format and converted to middle JSON
Sources: mineru/cli/common.py267-305 mineru/cli/common.py54-82 mineru/backend/hybrid/hybrid_analyze.py1-50
Advantages:
Configuration:
- --data-parallel-size for multi-GPU parallelism docker/compose.yaml15
- server_url pointing to the central server

Sources: docs/zh/quick_start/docker_deployment.md56-84 docker/compose.yaml10-29
The docker/compose.yaml1-73 file defines three service profiles:
Services can be configured to either use local VLM inference or connect to the openai-server as clients.
Sources: docker/compose.yaml1-73 docs/zh/quick_start/docker_deployment.md42-84
For clients with limited GPU resources (e.g., 6-10GB VRAM), the hybrid-http-client backend offloads VLM inference to a server while performing OCR and formula recognition locally:
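Sketched end to end, assuming the server is reachable at gpu-server:30000 (hostname and port are placeholders):

```shell
# On the GPU server:
mineru-openai-server --host 0.0.0.0 --port 30000

# On the low-VRAM client: VLM inference is offloaded to the server,
# while OCR and formula recognition run locally.
mineru -p demo.pdf -o ./output -b hybrid-http-client -u http://gpu-server:30000
```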
This pattern is optimal when:
Sources: mineru/backend/hybrid/hybrid_analyze.py1-50 README.md161
Priority order (highest to lowest):
- -u / --url CLI option
- server_url=... parameter in the Python API

Example in Python API:
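A hypothetical sketch of passing server_url explicitly through the Python API, assuming the do_parse() helper in mineru/cli/common.py; keyword names may differ across versions:

```python
# Sketch only: an explicit server_url keyword applies to this call,
# regardless of any URL configured elsewhere.
from mineru.cli.common import do_parse

do_parse(
    output_dir="./output",
    pdf_file_names=["demo"],
    pdf_bytes_list=[open("demo.pdf", "rb").read()],
    p_lang_list=["ch"],
    backend="vlm-http-client",
    server_url="http://127.0.0.1:30000",
)
```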
Sources: mineru/cli/client.py85-94 mineru/cli/common.py486-558
The Gradio interface mineru/cli/gradio_app.py199-257 includes an --enable-http-client flag to show HTTP client backend options:
UI elements dynamically show/hide based on backend selection mineru/cli/gradio_app.py356-369:
- When vlm-http-client or hybrid-http-client is selected, a "Server URL" input field appears
- The field is labeled via i18n("server_url") with info text about OpenAI-compatible servers

Sources: mineru/cli/gradio_app.py199-257 mineru/cli/gradio_app.py356-369
The FastAPI server mineru/cli/fast_api.py125-199 accepts backend and server_url form parameters:
Example API call:
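An illustrative call with curl; the endpoint path, form field names, and port are assumptions to be checked against mineru/cli/fast_api.py:

```shell
# Submit a PDF to the FastAPI server, routing inference to a remote
# VLM server. All addresses are placeholders.
curl -X POST http://127.0.0.1:8000/file_parse \
  -F "files=@demo.pdf" \
  -F "backend=vlm-http-client" \
  -F "server_url=http://gpu-server:30000"
```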
Sources: mineru/cli/fast_api.py125-199 mineru/cli/fast_api.py173-176
Server-side optimization:
For multi-GPU servers, use data parallelism to increase throughput:
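For instance (GPU count is illustrative):

```shell
# Shard requests across two GPUs; this is a vLLM engine flag and is
# only effective on the server side.
mineru-vllm-server --port 30000 --data-parallel-size 2
```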
For single GPU with limited VRAM, reduce KV cache size:
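For example (the fraction shown is illustrative, not a recommendation):

```shell
# Cap vLLM's GPU memory fraction so the KV cache fits a smaller card.
mineru-vllm-server --port 30000 --gpu-memory-utilization 0.5
```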
Client-side optimization:
For hybrid-http-client, adjust batch sizes for local processing:
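A hypothetical sketch; the environment variable name MINERU_MIN_BATCH_INFERENCE_SIZE and its value are assumptions and should be verified against the MinerU documentation:

```shell
# Assumed knob for local batching on the client; verify the variable
# name before relying on it.
export MINERU_MIN_BATCH_INFERENCE_SIZE=128
mineru -p demo.pdf -o ./output -b hybrid-http-client -u http://gpu-server:30000
```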
Note that inference engine parameters (like --gpu-memory-utilization) are only effective on the server side, not on HTTP clients.
Sources: docker/compose.yaml15-16 docs/zh/usage/advanced_cli_parameters.md1-35
Key environment variables for server and client:
| Variable | Scope | Purpose | Default |
|---|---|---|---|
| MINERU_MODEL_SOURCE | Server | Model download source (huggingface/modelscope/local) | huggingface |
| MINERU_DEVICE_MODE | Server | Force device type (cuda/cuda:0/npu) | auto-detected |
| MINERU_LOG_LEVEL | Both | Logging verbosity (DEBUG/INFO/WARNING) | INFO |
| MINERU_VLM_FORMULA_ENABLE | Server | Enable formula recognition in VLM | True |
| MINERU_VLM_TABLE_ENABLE | Server | Enable table recognition in VLM | True |
These are set in the backend processing functions mineru/cli/common.py456-457 mineru/cli/common.py475-476 before calling the VLM/hybrid analyze functions.
Sources: mineru/cli/common.py456-476 docker/compose.yaml9-10 mineru/cli/client.py169-181