This page documents how PaddleOCR enables NVIDIA GPU acceleration and TensorRT optimization during inference. It covers the runtime inference path (Python and C++), the relevant configuration arguments, and the two distinct TensorRT integration strategies used by the codebase.
For general high-performance inference options including ONNX Runtime, see High-Performance Inference. For alternative hardware accelerators (XPU, NPU, DCU), see Alternative Accelerators. For CPU-only optimization, see CPU Optimization. For C++ deployment specifics, see C++ Inference and Build System.
PaddleOCR GPU and TensorRT support is implemented inside the create_predictor function in tools/infer/utility.py. Every inference component — TextDetector, TextRecognizer, TextClassifier — calls create_predictor to obtain a Paddle Inference predictor, so GPU and TensorRT settings apply uniformly across all pipeline components.
GPU acceleration is enabled by calling config.enable_use_gpu() on the Paddle Inference Config object. TensorRT is an optional layer on top of GPU inference that replaces supported subgraphs with TRT engines at runtime.
All GPU and TensorRT knobs are expressed as command-line arguments declared in init_args() in tools/infer/utility.py.
| Argument | Type | Default | Purpose |
|---|---|---|---|
| `--use_gpu` | bool | `True` | Enable NVIDIA GPU execution |
| `--gpu_id` | int | `0` | Which GPU device to use |
| `--gpu_mem` | int | `500` | Initial GPU memory allocation (MB) |
| `--use_tensorrt` | bool | `False` | Enable TensorRT engine optimization |
| `--precision` | str | `"fp32"` | Precision: `fp32`, `fp16`, or `int8` |
| `--min_subgraph_size` | int | `15` | Minimum op count for a TRT subgraph |
| `--max_batch_size` | int | `10` | Maximum batch size for TRT engines |
Sources: tools/infer/utility.py38-57
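The table above can be expressed as a simplified, self-contained sketch of the `argparse` declarations in `init_args()` (the `str2bool` helper shown here is an assumption modeled on PaddleOCR's CLI conventions, not the exact implementation):

```python
import argparse

def str2bool(v):
    # Hypothetical simplified helper: PaddleOCR accepts true/false strings for bool flags.
    return str(v).lower() in ("true", "t", "1", "yes")

def init_gpu_args():
    # GPU/TensorRT arguments with the defaults documented in the table above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--use_gpu", type=str2bool, default=True)
    parser.add_argument("--gpu_id", type=int, default=0)
    parser.add_argument("--gpu_mem", type=int, default=500)
    parser.add_argument("--use_tensorrt", type=str2bool, default=False)
    parser.add_argument("--precision", type=str, default="fp32")
    parser.add_argument("--min_subgraph_size", type=int, default=15)
    parser.add_argument("--max_batch_size", type=int, default=10)
    return parser

args = init_gpu_args().parse_args(["--use_tensorrt=true", "--precision=fp16"])
print(args.use_tensorrt, args.precision, args.gpu_mem)  # True fp16 500
```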
The function get_infer_gpuid() in tools/infer/utility.py determines which physical GPU is used. It reads the CUDA_VISIBLE_DEVICES environment variable (or HIP_VISIBLE_DEVICES for ROCm builds) and returns the first listed GPU ID.
If nvidia-smi cannot detect a GPU (e.g., on Jetson devices without nvidia-smi), a warning is emitted but execution continues.
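The device lookup can be sketched as a small pure-Python function (the fallback to device 0 when neither variable is set is an assumption about the default behavior):

```python
import os

def get_infer_gpuid_sketch(env=None):
    """Simplified sketch of get_infer_gpuid(): return the first GPU id
    listed in CUDA_VISIBLE_DEVICES (or HIP_VISIBLE_DEVICES on ROCm builds)."""
    env = env if env is not None else os.environ
    devices = env.get("CUDA_VISIBLE_DEVICES") or env.get("HIP_VISIBLE_DEVICES")
    if not devices:
        return 0  # assumption: default to device 0 when neither variable is set
    return int(devices.split(",")[0])

print(get_infer_gpuid_sketch({"CUDA_VISIBLE_DEVICES": "2,3"}))  # 2
```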
Diagram: create_predictor execution path for GPU/TRT
Sources: tools/infer/utility.py240-435
When `--use_gpu=True` and `--use_tensorrt=False`, create_predictor calls `config.enable_use_gpu(args.gpu_mem, args.gpu_id)` on the Config object.
No subgraph compilation occurs. The Paddle Inference runtime executes operators natively on CUDA.
Additional memory and IR optimizations are always applied regardless of device:

- `config.enable_memory_optim()` — enables memory reuse
- `config.disable_glog_info()` — suppresses verbose logs
- `config.switch_ir_optim(True)` — enables IR graph optimization passes (certain fusion passes are removed, e.g. `conv_transpose_eltwiseadd_bn_fuse_pass`, `matmul_transpose_reshape_fuse_pass`)

Sources: tools/infer/utility.py275-281 tools/infer/utility.py409-421
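Assuming `paddlepaddle-gpu` is installed, the GPU-only path amounts to roughly the following Config calls (a sketch, not the exact implementation; argument values are the CLI defaults and the model file names are placeholders):

```python
from paddle.inference import Config  # requires paddlepaddle-gpu

config = Config("inference.pdmodel", "inference.pdiparams")
config.enable_use_gpu(500, 0)   # gpu_mem in MB, gpu_id
config.enable_memory_optim()    # memory reuse
config.disable_glog_info()      # suppress verbose logs
config.switch_ir_optim(True)    # IR graph optimization passes
```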
There are two distinct TensorRT code paths depending on whether the model uses the legacy .pdmodel format or the newer PIR .json format.
.pdmodel Format

Called when the model directory contains inference.pdmodel (not .json).
```python
config.enable_tensorrt_engine(
    workspace_size=1 << 30,                   # 1 GB workspace
    precision_mode=precision,                 # Float32 / Half / Int8
    max_batch_size=args.max_batch_size,
    min_subgraph_size=args.min_subgraph_size,
    use_calib_mode=False,
)
```
Dynamic shape handling:
The shape range file is named `{mode}_trt_dynamic_shape.txt` (e.g., `det_trt_dynamic_shape.txt`) and lives in the model directory.

- If the file does not exist, `config.collect_shape_range_info(trt_shape_f)` is called, causing the first inference run to profile input shapes.
- If the file exists, `config.enable_tuned_tensorrt_dynamic_shape(trt_shape_f, True)` loads the profiled ranges.

.json Format

Called when the model directory contains inference.json.
Dynamic shape configuration is not collected at runtime. Instead, it must be declared in inference.yml under a `trt_dynamic_shapes` entry.
If trt_dynamic_shapes is absent, a RuntimeError is raised.
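That validation can be sketched as follows (this assumes a flat key layout; the actual location of `trt_dynamic_shapes` inside inference.yml may be nested differently):

```python
def require_trt_dynamic_shapes(cfg: dict):
    """Sketch of the PIR-format check: raise if the parsed inference.yml
    does not declare trt_dynamic_shapes (hypothetical flat layout)."""
    shapes = cfg.get("trt_dynamic_shapes")
    if shapes is None:
        raise RuntimeError(
            "trt_dynamic_shapes must be defined in inference.yml "
            "for PIR models with TensorRT"
        )
    return shapes
```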
The converted TRT model is cached to {model_dir}/.cache/trt/ and reused on subsequent runs.
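Because reuse is decided by directory existence alone, the check reduces to something like this sketch (function name is illustrative; see the troubleshooting table below for the stale-cache implication):

```python
from pathlib import Path

def trt_engine_is_cached(model_dir: str) -> bool:
    """Existence-only reuse check: a stale cache under .cache/trt/
    survives a model update until it is deleted manually."""
    return (Path(model_dir) / ".cache" / "trt").exists()
```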
Sources: tools/infer/utility.py282-323
_convert_trt Function

The _convert_trt function performs the actual PIR-format TRT model conversion.
Diagram: _convert_trt internals
Sources: tools/infer/utility.py438-514
Key implementation notes:

- Input dtypes are determined by calling `predictor.get_input_handle(name).type()` and converting via `_pd_dtype_to_np_dtype()`.
- If `dynamic_shape_input_data` is provided for a given input, actual data arrays are used for min/opt/max; otherwise, `np.ones(shape)` is used.
- The conversion is driven by `paddle.tensorrt.export.Input` and `paddle.tensorrt.export.TensorRTConfig`.

Sources: tools/infer/utility.py438-531
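The min/opt/max input-data fallback described above can be sketched as follows (function and parameter names are illustrative, not the real signatures):

```python
import numpy as np

def build_trt_input_data(shape, dtype=np.float32, user_data=None):
    """If user-supplied dynamic-shape data exists for an input, use it;
    otherwise fall back to a ones tensor of the requested shape."""
    if user_data is not None:
        return np.asarray(user_data, dtype=dtype)
    return np.ones(shape, dtype=dtype)
```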
Precision is resolved from args.precision into a Paddle Inference PrecisionType enum:
| `--precision` value | Effective `PrecisionType` | Condition |
|---|---|---|
| `"fp32"` | `Float32` | Default |
| `"fp16"` | `Half` | Only applies when `--use_tensorrt=True` |
| `"int8"` | `Int8` | Applied regardless of TensorRT |
Note: `fp16` precision without TensorRT has no effect — the `if args.precision == "fp16" and args.use_tensorrt` condition means `Half` is only selected when both flags are active.
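The table and note reduce to a small decision function (string return values stand in for the `PrecisionType` enum members):

```python
def resolve_precision(precision: str, use_tensorrt: bool) -> str:
    """Sketch of the precision resolution: Half requires both fp16 and
    TensorRT; Int8 applies regardless; everything else is Float32."""
    if precision == "fp16" and use_tensorrt:
        return "Half"
    if precision == "int8":
        return "Int8"
    return "Float32"
```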
When --use_onnx=True, create_predictor bypasses Paddle Inference entirely and creates an ONNX Runtime session. If --use_gpu=True, the CUDAExecutionProvider is used.
Custom providers can also be specified via --onnx_providers.
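Provider selection for this path can be sketched as below (that the `--onnx_providers` override takes precedence over the GPU default is an assumption):

```python
def select_onnx_providers(use_gpu: bool, custom=None):
    """Sketch: honor a custom provider list if given, otherwise prefer
    CUDA with a CPU fallback when the GPU flag is set."""
    if custom:
        return list(custom)
    if use_gpu:
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]
```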
Each inference component independently calls create_predictor. The mode argument determines which model directory argument is used:
Diagram: Component → create_predictor → mode routing
Sources: tools/infer/utility.py177-196 tools/infer/predict_det.py143-151 tools/infer/predict_rec.py176-182 tools/infer/predict_cls.py58-65
The returned predictor, input_tensor, and output_tensors are stored as instance attributes and used in each component's inference loop via input_tensor.copy_from_cpu(), predictor.run(), and output_tensor.copy_to_cpu().
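The loop pattern can be illustrated with a stub standing in for the real Paddle Inference predictor (the stub simply echoes its input; only the method names match the real API):

```python
import numpy as np

class _StubTensor:
    """Mimics the copy_from_cpu / copy_to_cpu tensor handles (stand-in only)."""
    def __init__(self):
        self._buf = None

    def copy_from_cpu(self, arr):
        self._buf = np.asarray(arr)

    def copy_to_cpu(self):
        return self._buf

class _StubPredictor:
    """Stand-in for a Paddle Inference predictor; run() is an identity model."""
    def __init__(self):
        self.input_tensor = _StubTensor()
        self.output_tensor = _StubTensor()

    def run(self):
        self.output_tensor._buf = self.input_tensor._buf

predictor = _StubPredictor()
input_tensor = predictor.input_tensor
output_tensor = predictor.output_tensor

# The per-batch pattern used by each component's inference loop:
batch = np.zeros((1, 3, 32, 32), dtype=np.float32)
input_tensor.copy_from_cpu(batch)     # host -> device copy
predictor.run()                       # execute the (possibly TRT-optimized) graph
result = output_tensor.copy_to_cpu()  # device -> host copy
print(result.shape)  # (1, 3, 32, 32)
```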
The C++ inference build at deploy/cpp_infer/CMakeLists.txt supports GPU via the WITH_GPU CMake option.
deploy/cpp_infer/CMakeLists.txt97-107
When WITH_GPU=ON:
CUDA_LIB (path to CUDA library directory) is requiredCUDNN_LIB (path to cuDNN library directory) is required on Linux-DWITH_GPU preprocessor definition is addeddeploy/cpp_infer/CMakeLists.txt210-220
The build script template at deploy/cpp_infer/tools/build.sh shows the variables to set:
deploy/cpp_infer/tools/build.sh1-22
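A hypothetical configure-and-build invocation following that template (all paths are placeholders to adjust for your system; only `WITH_GPU`, `CUDA_LIB`, and `CUDNN_LIB` are documented above — the remaining variables are assumptions based on typical Paddle Inference C++ builds):

```shell
OPENCV_DIR=/path/to/opencv
LIB_DIR=/path/to/paddle_inference
CUDA_LIB_DIR=/usr/local/cuda/lib64
CUDNN_LIB_DIR=/usr/lib/x86_64-linux-gnu

cmake .. \
    -DPADDLE_LIB=${LIB_DIR} \
    -DOPENCV_DIR=${OPENCV_DIR} \
    -DWITH_GPU=ON \
    -DCUDA_LIB=${CUDA_LIB_DIR} \
    -DCUDNN_LIB=${CUDNN_LIB_DIR}
make -j
```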
When --benchmark=True, each inference component (TextDetector, TextRecognizer) creates an AutoLogger instance that records preprocess_time, inference_time, and postprocess_time using the auto_log library.
The GPU ID passed to AutoLogger is obtained from get_infer_gpuid(). A --warmup flag causes a random dummy input to be run through the detector/system before timed measurements begin.
Sources: tools/infer/predict_det.py162-180 tools/infer/predict_rec.py184-202 tools/infer/predict_system.py202-205
| Issue | Cause | Resolution |
|---|---|---|
| `RuntimeError: trt_dynamic_shapes must be defined` | PIR model used with TRT but inference.yml missing config | Add `trt_dynamic_shapes` to inference.yml |
| Shape file not found on first run | Normal — first run profiles shapes | Run once without TRT to generate shape file, or pre-provide shape file |
| `fp16` has no effect without TRT | Precision mode only activates for TensorRT paths | Also set `--use_tensorrt=True` |
| GPU not found warning | nvidia-smi unavailable (Jetson, etc.) | Safe to ignore on Jetson; GPU is still used if `--use_gpu=True` |
| TRT cache not invalidated on model update | `.cache/trt/` is checked by file existence only | Manually delete the `.cache/trt/` directory when the model changes |
Sources: tools/infer/utility.py277-344