This page documents how to deploy PaddleOCR-VL as a network-accessible API service, covering the two deployment methods (Docker Compose and manual), service architecture, client invocation patterns, and pipeline configuration tuning.
For background on PaddleOCR-VL's internal architecture and configuration parameters, see 2.2.1. For setting up inference acceleration frameworks (vLLM, SGLang, FastDeploy) that act as the VLM backend for the service, see 2.2.2. For general service deployment patterns shared across PaddleOCR pipelines, see 5.4.
PaddleOCR-VL service deployment uses a two-tier architecture. The outer layer is the pipeline service (the PaddleOCR-VL API), which handles request routing, layout detection, preprocessing, and result aggregation. The inner layer is the VLM inference service, which performs the heavy Vision-Language Model inference and is typically backed by an acceleration framework.
Two-Tier Service Architecture
The PaddleOCRVL class in paddleocr/_pipelines/paddleocr_vl.py37-96 connects the two tiers via the vl_rec_server_url parameter. The pipeline sends image crops to the VLM service using the backend specified by vl_rec_backend.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md183-217 paddleocr/_pipelines/paddleocr_vl.py27-96
Not all hardware variants support Docker Compose deployment. Manual deployment is available on any hardware that supports at least one inference backend.
| Hardware | Docker Compose | Manual Deployment | VLM Backend |
|---|---|---|---|
| NVIDIA GPU (standard) | ✅ | ✅ | vLLM, SGLang |
| NVIDIA Blackwell (sm120) | ✅ | ✅ | vLLM, SGLang |
| MetaX GPU | ✅ | ✅ | FastDeploy |
| Iluvatar GPU | ✅ | ✅ | FastDeploy |
| Hygon DCU | ✅ | ✅ | vLLM |
| Huawei Ascend NPU | ✅ | ✅ | vLLM |
| Apple Silicon | ❌ | ✅ | MLX-VLM |
| x64 CPU | ❌ | ✅ | llama.cpp |
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL-Apple-Silicon.en.md90-92 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md183-193 docs/version3.x/pipeline_usage/PaddleOCR-VL-Iluvatar-GPU.en.md1-20
Docker Compose is the recommended deployment method for hardware that supports it. It starts two containers automatically — one for the VLM inference service and one for the pipeline API — with no manual dependency installation required after pulling the images.
Docker Compose Startup Flow
Download compose.yaml and .env from the appropriate accelerator directory under deploy/paddleocr_vl_docker/accelerators/. For example, for standard NVIDIA GPU use nvidia-gpu/, for Blackwell use nvidia-gpu-sm120/.
In the directory containing both files, run:
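A minimal startup sequence using standard Docker Compose commands (the container names come from the downloaded compose.yaml):

```shell
docker compose up -d    # start the VLM and API containers in the background
docker compose logs -f  # follow the logs until startup completes
```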
The server will listen on port 8080 by default. A successful start produces:
```
paddleocr-vl-api | INFO: Application startup complete.
paddleocr-vl-api | INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
```
Environment Variables (.env File)

| Variable | Description |
|---|---|
| API_IMAGE_TAG_SUFFIX | Image tag suffix for the pipeline service container |
| VLM_BACKEND | VLM inference backend (vllm, sglang, fastdeploy) |
| VLM_IMAGE_TAG_SUFFIX | Image tag suffix for the VLM inference service container |
Common Customizations (compose.yaml)

Change the API service port — edit the ports mapping of the paddleocr-vl-api service:
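For example, to expose the API on host port 9090 instead of 8080 (a sketch; the service name comes from the official compose file):

```yaml
services:
  paddleocr-vl-api:
    ports:
      - "9090:8080"  # host:container — only the host side needs to change
```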
Specify which GPU to use — edit the device_ids in deploy.resources.reservations.devices for both containers:
The same change must be applied to the paddleocr-vlm-server block.
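A sketch of the relevant fragment, to be repeated under both the paddleocr-vl-api and paddleocr-vlm-server services:

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          capabilities: [gpu]
          device_ids: ["1"]  # pin this container to GPU 1
```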
Supply a custom VLM backend config — create a YAML config file (see 2.2.2 for VLM backend parameters), then mount it and reference it:
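A sketch of the mount; how the config path is handed to the backend (environment variable vs. command-line argument, and its exact name) depends on the image and is an assumption here:

```yaml
services:
  paddleocr-vlm-server:
    volumes:
      - ./vlm_config.yaml:/config/vlm_config.yaml  # mount the custom backend config
    environment:
      - VLM_CONFIG_PATH=/config/vlm_config.yaml    # hypothetical variable name
```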
To deploy in an environment without internet access:
1. On a machine with internet access, pull the images referenced in compose.yaml.
2. Export each image with docker save <image> -o <file>.tar.
3. Transfer the .tar files to the offline machine.
4. Import each image with docker load -i <file>.tar.
5. Run docker compose up normally.

Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md193-297 docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md32-35
Manual deployment requires a running environment (either the official Docker image from section 1.1, or a manually installed environment from section 1.2 of the usage tutorial). It also requires a VLM inference service already running (see 2.2.2).
The pipeline service is started using paddleocr serving:
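A sketch of the startup command. The flag names mirror the pipeline's Python parameter names and are assumptions — consult `paddleocr serving --help` for the exact interface:

```shell
# Assumed flags mirroring the pipeline's Python parameters
paddleocr serving \
  --pipeline PaddleOCR-VL \
  --host 0.0.0.0 \
  --port 8080 \
  --vl_rec_backend vllm-server \
  --vl_rec_server_url http://127.0.0.1:8118/v1
```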
This starts a FastAPI-based HTTP server that exposes the PaddleOCR-VL pipeline as a REST API. The vl_rec_server_url points at an already-running VLM inference service.
Connection Between Pipeline and VLM Service
The PaddleOCRVL.__init__ method in paddleocr/_pipelines/paddleocr_vl.py38-88 accepts all the service connection parameters, which are forwarded to the underlying PaddleX pipeline.
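The same connection can be made from Python. A minimal sketch — the server URL is a placeholder, and `save_to_markdown` is assumed to behave as in other PaddleOCR 3.x pipeline result objects:

```python
# Connection settings for a remote vLLM-backed PaddleOCR-VL pipeline.
# The URL is a placeholder for your own VLM inference service.
VL_KWARGS = {
    "vl_rec_backend": "vllm-server",
    "vl_rec_server_url": "http://127.0.0.1:8118/v1",
}

def make_pipeline():
    # Imported lazily so this module can be loaded without paddleocr installed.
    from paddleocr import PaddleOCRVL
    return PaddleOCRVL(**VL_KWARGS)

if __name__ == "__main__":
    pipeline = make_pipeline()
    for res in pipeline.predict("doc_page.png"):
        res.save_to_markdown("output/")
```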
Supported vl_rec_backend Values

| Backend Identifier | Description |
|---|---|
| native | Local PaddlePaddle inference (no separate service) |
| vllm-server | Remote vLLM inference service |
| sglang-server | Remote SGLang inference service |
| fastdeploy-server | Remote FastDeploy inference service |
| mlx-vlm-server | Remote MLX-VLM inference service (Apple Silicon) |
| llama-cpp-server | Remote llama.cpp inference service |
These values are defined in _SUPPORTED_VL_BACKENDS at paddleocr/_pipelines/paddleocr_vl.py27-34
Sources: paddleocr/_pipelines/paddleocr_vl.py27-96 docs/version3.x/pipeline_usage/PaddleOCR-VL-Apple-Silicon.en.md90-96
Once the service is running on port 8080, it can be called via HTTP from any client. The service accepts image files, PDF files, and URLs.
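A minimal Python client sketch. The endpoint path and the request schema (base64-encoded `file` plus a numeric `fileType`) follow the convention used by other PaddleX serving pipelines and are assumptions here — check the API documentation of your running service:

```python
import base64
import json
import urllib.request

# Assumed endpoint path; verify against your running service.
SERVER_URL = "http://localhost:8080/layout-parsing"

def build_payload(data: bytes, file_type: int = 1) -> dict:
    """Package a file for the service (assumed convention: 1 = image, 0 = PDF)."""
    return {"file": base64.b64encode(data).decode("ascii"), "fileType": file_type}

def parse_document(path: str, file_type: int = 1) -> dict:
    with open(path, "rb") as f:
        payload = build_payload(f.read(), file_type)
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```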
The service response is a JSON object matching the structured output of the PaddleOCRVL.predict() call. Key fields include:
| Field | Description |
|---|---|
| input_path | Path or URL of the input |
| page_index | Page number (for PDFs) |
| model_settings | Active pipeline flags |
| layout_det_res | Layout detection bounding boxes |
| markdown | Extracted content in Markdown format |
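To illustrate the fields above, a small sketch that walks one response page; the exact nesting of the JSON is an assumption based on the documented field names:

```python
# A mock page object using the documented field names; real responses may nest
# these fields differently.
sample_page = {
    "input_path": "report.pdf",
    "page_index": 0,
    "model_settings": {"use_layout_detection": True},
    "layout_det_res": {"boxes": [{"label": "text", "coordinate": [10, 10, 200, 40]}]},
    "markdown": {"text": "# Heading\n\nBody text."},
}

def summarize(page: dict) -> str:
    n_boxes = len(page["layout_det_res"]["boxes"])
    return f"{page['input_path']} page {page['page_index']}: {n_boxes} layout box(es)"

print(summarize(sample_page))  # report.pdf page 0: 1 layout box(es)
```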
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md647-656 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md303-305
The pipeline service supports runtime configuration via a PaddleX YAML configuration file. This controls model paths, preprocessing toggles, and other pipeline-level settings.
| Parameter | Default | Purpose |
|---|---|---|
| use_doc_orientation_classify | False | Enable document orientation detection |
| use_doc_unwarping | False | Enable geometric correction |
| use_layout_detection | True | Enable layout region detection |
| use_chart_recognition | False | Enable chart parsing |
| use_seal_recognition | False | Enable stamp/seal recognition |
| use_ocr_for_image_block | False | Run OCR on image regions |
| layout_threshold | 0.5 | Confidence threshold for layout boxes |
| merge_layout_blocks | True | Merge cross-column detection boxes |
The exported YAML can be passed to the service at startup using the --paddlex_config parameter.
Mount the config file and supply it as an environment variable in compose.yaml:
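A sketch of the compose fragment; the environment variable name is an assumption — check the image documentation for the exact name it reads:

```yaml
services:
  paddleocr-vl-api:
    volumes:
      - ./PaddleOCR-VL.yaml:/config/PaddleOCR-VL.yaml   # mount the exported config
    environment:
      - PIPELINE_CONFIG_PATH=/config/PaddleOCR-VL.yaml  # hypothetical variable name
```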
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md268-644 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md293-297 paddleocr/_pipelines/paddleocr_vl.py38-88
The PaddleOCR-VL service can act as a backend for the MCP (Model Context Protocol) server, enabling LLM agents to call document parsing via the self-hosted mode.
The PipelineHandler base class in mcp_server/paddleocr_mcp/pipelines.py130-244 handles both local and service-based execution. When ppocr_source is set to self_hosted, the MCP server forwards requests to the PaddleOCR-VL HTTP service via server_url.
MCP configuration for self-hosted PaddleOCR-VL:
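A sketch of an MCP client configuration; the launcher command and the environment-variable names follow the ppocr_source and server_url settings described above, but should be verified against the MCP server documentation:

```json
{
  "mcpServers": {
    "paddleocr-vl": {
      "command": "paddleocr_mcp",
      "env": {
        "PADDLEOCR_MCP_PIPELINE": "PaddleOCR-VL",
        "PADDLEOCR_MCP_PPOCR_SOURCE": "self_hosted",
        "PADDLEOCR_MCP_SERVER_URL": "http://localhost:8080"
      }
    }
  }
}
```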
For MCP server setup and additional operating modes, see 3.4.
Sources: docs/version3.x/deployment/mcp_server.en.md298-326 mcp_server/paddleocr_mcp/pipelines.py130-180 mcp_server/paddleocr_mcp/__main__.py34-46
Each hardware variant has its own Docker image tag. The table below lists the relevant image names and compose file paths.
| Hardware | Pipeline Image Tag | VLM Image Tag | Compose Path |
|---|---|---|---|
| NVIDIA GPU (standard) | latest-nvidia-gpu | latest-nvidia-gpu (vllm/sglang) | deploy/paddleocr_vl_docker/accelerators/nvidia-gpu/ |
| NVIDIA Blackwell (sm120) | latest-nvidia-gpu-sm120 | latest-nvidia-gpu-sm120 | deploy/paddleocr_vl_docker/accelerators/nvidia-gpu-sm120/ |
| MetaX GPU | latest-metax-gpu | — (fastdeploy) | deploy/paddleocr_vl_docker/accelerators/metax-gpu/ |
| Iluvatar GPU | latest-iluvatar-gpu | — (fastdeploy) | deploy/paddleocr_vl_docker/accelerators/iluvatar-gpu/ |
| Huawei Ascend NPU | latest-huawei-ascend-npu | — (vllm) | hardware-specific compose |
| Apple Silicon | Not available | mlx-vlm-server (manual) | manual only |
To use a specific PaddleOCR version instead of latest, substitute the tag prefix: paddleocr<major>.<minor>, e.g., paddleocr3.3.
Sources: docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md143-157 docs/version3.x/pipeline_usage/PaddleOCR-VL-NVIDIA-Blackwell.en.md40-52 docs/version3.x/pipeline_usage/PaddleOCR-VL-Iluvatar-GPU.en.md33-47 docs/version3.x/pipeline_usage/PaddleOCR-VL-MetaX-GPU.en.md20-43 docs/version3.x/pipeline_usage/PaddleOCR-VL-Apple-Silicon.en.md90-92