This page documents how PaddleOCR enables NVIDIA GPU acceleration and TensorRT optimization during inference. It covers the runtime inference path (Python and C++), the relevant configuration arguments, and the two distinct TensorRT integration strategies used by the codebase.
For general high-performance inference options including ONNX Runtime, see High-Performance Inference. For alternative hardware accelerators (XPU, NPU, DCU), see Alternative Accelerators. For CPU-only optimization, see CPU Optimization. For C++ deployment specifics, see C++ Inference and Build System.
PaddleOCR GPU and TensorRT support is implemented inside the create_predictor function in tools/infer/utility.py. Every inference component — TextDetector, TextRecognizer, TextClassifier — calls create_predictor to obtain a Paddle Inference predictor, so GPU and TensorRT settings apply uniformly across all pipeline components.
GPU acceleration is enabled by calling config.enable_use_gpu() on the Paddle Inference Config object. TensorRT is an optional layer on top of GPU inference that replaces supported subgraphs with TRT engines at runtime.
All GPU and TensorRT knobs are expressed as command-line arguments declared in init_args() in tools/infer/utility.py.
| Argument | Type | Default | Purpose |
|---|---|---|---|
| `--use_gpu` | bool | `True` | Enable NVIDIA GPU execution |
| `--gpu_id` | int | `0` | Which GPU device to use |
| `--gpu_mem` | int | `500` | Initial GPU memory allocation (MB) |
| `--use_tensorrt` | bool | `False` | Enable TensorRT engine optimization |
| `--precision` | str | `"fp32"` | Precision: `fp32`, `fp16`, or `int8` |
| `--min_subgraph_size` | int | `15` | Minimum op count for a TRT subgraph |
| `--max_batch_size` | int | `10` | Maximum batch size for TRT engines |
Sources: tools/infer/utility.py38-57
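The table above can be expressed as a simplified, self-contained sketch of the `argparse` declarations in `init_args()` (the `str2bool` helper shown here is an assumption modeled on PaddleOCR's CLI conventions, not the exact implementation):

```python
import argparse

def str2bool(v):
    # Hypothetical simplified helper: PaddleOCR accepts true/false strings for bool flags.
    return str(v).lower() in ("true", "t", "1", "yes")

def init_gpu_args():
    # GPU/TensorRT arguments with the defaults documented in the table above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--use_gpu", type=str2bool, default=True)
    parser.add_argument("--gpu_id", type=int, default=0)
    parser.add_argument("--gpu_mem", type=int, default=500)
    parser.add_argument("--use_tensorrt", type=str2bool, default=False)
    parser.add_argument("--precision", type=str, default="fp32")
    parser.add_argument("--min_subgraph_size", type=int, default=15)
    parser.add_argument("--max_batch_size", type=int, default=10)
    return parser

args = init_gpu_args().parse_args(["--use_tensorrt=true", "--precision=fp16"])
print(args.use_tensorrt, args.precision, args.gpu_mem)  # True fp16 500
```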
The function get_infer_gpuid() in tools/infer/utility.py determines which physical GPU is used. It reads the CUDA_VISIBLE_DEVICES environment variable (or HIP_VISIBLE_DEVICES for ROCm builds) and returns the first listed GPU ID.
If nvidia-smi cannot detect a GPU (e.g., on Jetson devices without nvidia-smi), a warning is emitted but execution continues.
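The device lookup can be sketched as a small pure-Python function (the fallback to device 0 when neither variable is set is an assumption about the default behavior):

```python
import os

def get_infer_gpuid_sketch(env=None):
    """Simplified sketch of get_infer_gpuid(): return the first GPU id
    listed in CUDA_VISIBLE_DEVICES (or HIP_VISIBLE_DEVICES on ROCm builds)."""
    env = env if env is not None else os.environ
    devices = env.get("CUDA_VISIBLE_DEVICES") or env.get("HIP_VISIBLE_DEVICES")
    if not devices:
        return 0  # assumption: default to device 0 when neither variable is set
    return int(devices.split(",")[0])

print(get_infer_gpuid_sketch({"CUDA_VISIBLE_DEVICES": "2,3"}))  # 2
```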
Diagram: create_predictor execution path for GPU/TRT
Sources: tools/infer/utility.py240-435
When `--use_gpu=True` and `--use_tensorrt=False`, create_predictor calls `config.enable_use_gpu(args.gpu_mem, args.gpu_id)` on the Config object.
No subgraph compilation occurs. The Paddle Inference runtime executes operators natively on CUDA.
Additional memory and IR optimizations are always applied regardless of device:

- `config.enable_memory_optim()` — enables memory reuse
- `config.disable_glog_info()` — suppresses verbose logs
- `config.switch_ir_optim(True)` — enables IR graph optimization passes (certain fusion passes are removed, e.g. `conv_transpose_eltwiseadd_bn_fuse_pass`, `matmul_transpose_reshape_fuse_pass`)

Sources: tools/infer/utility.py275-281 tools/infer/utility.py409-421
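Assuming `paddlepaddle-gpu` is installed, the GPU-only path amounts to roughly the following Config calls (a sketch, not the exact implementation; argument values are the CLI defaults and the model file names are placeholders):

```python
from paddle.inference import Config  # requires paddlepaddle-gpu

config = Config("inference.pdmodel", "inference.pdiparams")
config.enable_use_gpu(500, 0)   # gpu_mem in MB, gpu_id
config.enable_memory_optim()    # memory reuse
config.disable_glog_info()      # suppress verbose logs
config.switch_ir_optim(True)    # IR graph optimization passes
```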
There are two distinct TensorRT code paths depending on whether the model uses the legacy .pdmodel format or the newer PIR .json format.
.pdmodel Format

Called when the model directory contains inference.pdmodel (not .json).
```python
config.enable_tensorrt_engine(
    workspace_size=1 << 30,                   # 1 GB workspace
    precision_mode=precision,                 # Float32 / Half / Int8
    max_batch_size=args.max_batch_size,
    min_subgraph_size=args.min_subgraph_size,
    use_calib_mode=False,
)
```
Dynamic shape handling:
The shape range file is named `{mode}_trt_dynamic_shape.txt` (e.g., `det_trt_dynamic_shape.txt`) and lives in the model directory.

- If the file does not exist, `config.collect_shape_range_info(trt_shape_f)` is called, causing the first inference run to profile input shapes.
- If the file exists, `config.enable_tuned_tensorrt_dynamic_shape(trt_shape_f, True)` loads the profiled ranges.

.json Format

Called when the model directory contains inference.json.
Dynamic shape configuration is not collected at runtime. Instead, it must be declared in inference.yml under a `trt_dynamic_shapes` entry.
If trt_dynamic_shapes is absent, a RuntimeError is raised.
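That validation can be sketched as follows (this assumes a flat key layout; the actual location of `trt_dynamic_shapes` inside inference.yml may be nested differently):

```python
def require_trt_dynamic_shapes(cfg: dict):
    """Sketch of the PIR-format check: raise if the parsed inference.yml
    does not declare trt_dynamic_shapes (hypothetical flat layout)."""
    shapes = cfg.get("trt_dynamic_shapes")
    if shapes is None:
        raise RuntimeError(
            "trt_dynamic_shapes must be defined in inference.yml "
            "for PIR models with TensorRT"
        )
    return shapes
```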
The converted TRT model is cached to {model_dir}/.cache/trt/ and reused on subsequent runs.
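Because reuse is decided by directory existence alone, the check reduces to something like this sketch (function name is illustrative; see the troubleshooting table below for the stale-cache implication):

```python
from pathlib import Path

def trt_engine_is_cached(model_dir: str) -> bool:
    """Existence-only reuse check: a stale cache under .cache/trt/
    survives a model update until it is deleted manually."""
    return (Path(model_dir) / ".cache" / "trt").exists()
```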
Sources: tools/infer/utility.py282-323
_convert_trt Function

The _convert_trt function performs the actual PIR-format TRT model conversion.
Diagram: _convert_trt internals
Sources: tools/infer/utility.py438-514
Key implementation notes:

- Input dtypes are determined by calling `predictor.get_input_handle(name).type()` and converting via `_pd_dtype_to_np_dtype()`.
- If `dynamic_shape_input_data` is provided for a given input, actual data arrays are used for min/opt/max; otherwise, `np.ones(shape)` is used.
- The conversion is driven by `paddle.tensorrt.export.Input` and `paddle.tensorrt.export.TensorRTConfig`.

Sources: tools/infer/utility.py438-531
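The min/opt/max input-data fallback described above can be sketched as follows (function and parameter names are illustrative, not the real signatures):

```python
import numpy as np

def build_trt_input_data(shape, dtype=np.float32, user_data=None):
    """If user-supplied dynamic-shape data exists for an input, use it;
    otherwise fall back to a ones tensor of the requested shape."""
    if user_data is not None:
        return np.asarray(user_data, dtype=dtype)
    return np.ones(shape, dtype=dtype)
```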
Precision is resolved from args.precision into a Paddle Inference PrecisionType enum:
| `--precision` value | Effective `PrecisionType` | Condition |
|---|---|---|
| `"fp32"` | `Float32` | Default |
| `"fp16"` | `Half` | Only applies when `--use_tensorrt=True` |
| `"int8"` | `Int8` | Applied regardless of TensorRT |
Note: `fp16` precision without TensorRT has no effect — the `if args.precision == "fp16" and args.use_tensorrt` condition means `Half` is only selected when both flags are active.
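The table and note reduce to a small decision function (string return values stand in for the `PrecisionType` enum members):

```python
def resolve_precision(precision: str, use_tensorrt: bool) -> str:
    """Sketch of the precision resolution: Half requires both fp16 and
    TensorRT; Int8 applies regardless; everything else is Float32."""
    if precision == "fp16" and use_tensorrt:
        return "Half"
    if precision == "int8":
        return "Int8"
    return "Float32"
```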
When --use_onnx=True, create_predictor bypasses Paddle Inference entirely and creates an ONNX Runtime session. If --use_gpu=True, the CUDAExecutionProvider is used.
Custom providers can also be specified via --onnx_providers.
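Provider selection for this path can be sketched as below (that the `--onnx_providers` override takes precedence over the GPU default is an assumption):

```python
def select_onnx_providers(use_gpu: bool, custom=None):
    """Sketch: honor a custom provider list if given, otherwise prefer
    CUDA with a CPU fallback when the GPU flag is set."""
    if custom:
        return list(custom)
    if use_gpu:
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]
```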
Each inference component independently calls create_predictor. The mode argument determines which model directory argument is used:
Diagram: Component → create_predictor → mode routing
Sources: tools/infer/utility.py177-196 tools/infer/predict_det.py143-151 tools/infer/predict_rec.py176-182 tools/infer/predict_cls.py58-65
The returned predictor, input_tensor, and output_tensors are stored as instance attributes and used in each component's inference loop via input_tensor.copy_from_cpu(), predictor.run(), and output_tensor.copy_to_cpu().
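The loop pattern can be illustrated with a stub standing in for the real Paddle Inference predictor (the stub simply echoes its input; only the method names match the real API):

```python
import numpy as np

class _StubTensor:
    """Mimics the copy_from_cpu / copy_to_cpu tensor handles (stand-in only)."""
    def __init__(self):
        self._buf = None

    def copy_from_cpu(self, arr):
        self._buf = np.asarray(arr)

    def copy_to_cpu(self):
        return self._buf

class _StubPredictor:
    """Stand-in for a Paddle Inference predictor; run() is an identity model."""
    def __init__(self):
        self.input_tensor = _StubTensor()
        self.output_tensor = _StubTensor()

    def run(self):
        self.output_tensor._buf = self.input_tensor._buf

predictor = _StubPredictor()
input_tensor = predictor.input_tensor
output_tensor = predictor.output_tensor

# The per-batch pattern used by each component's inference loop:
batch = np.zeros((1, 3, 32, 32), dtype=np.float32)
input_tensor.copy_from_cpu(batch)     # host -> device copy
predictor.run()                       # execute the (possibly TRT-optimized) graph
result = output_tensor.copy_to_cpu()  # device -> host copy
print(result.shape)  # (1, 3, 32, 32)
```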
The C++ inference build at deploy/cpp_infer/CMakeLists.txt supports GPU via the WITH_GPU CMake option.
deploy/cpp_infer/CMakeLists.txt97-107
When WITH_GPU=ON:
CUDA_LIB (path to CUDA library directory) is requiredCUDNN_LIB (path to cuDNN library directory) is required on Linux-DWITH_GPU preprocessor definition is addeddeploy/cpp_infer/CMakeLists.txt210-220
The build script template at deploy/cpp_infer/tools/build.sh shows the variables to set:
deploy/cpp_infer/tools/build.sh1-22
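A hypothetical configure-and-build invocation following that template (all paths are placeholders to adjust for your system; only `WITH_GPU`, `CUDA_LIB`, and `CUDNN_LIB` are documented above — the remaining variables are assumptions based on typical Paddle Inference C++ builds):

```shell
OPENCV_DIR=/path/to/opencv
LIB_DIR=/path/to/paddle_inference
CUDA_LIB_DIR=/usr/local/cuda/lib64
CUDNN_LIB_DIR=/usr/lib/x86_64-linux-gnu

cmake .. \
    -DPADDLE_LIB=${LIB_DIR} \
    -DOPENCV_DIR=${OPENCV_DIR} \
    -DWITH_GPU=ON \
    -DCUDA_LIB=${CUDA_LIB_DIR} \
    -DCUDNN_LIB=${CUDNN_LIB_DIR}
make -j
```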
When --benchmark=True, each inference component (TextDetector, TextRecognizer) creates an AutoLogger instance that records preprocess_time, inference_time, and postprocess_time using the auto_log library.
The GPU ID passed to AutoLogger is obtained from get_infer_gpuid(). A --warmup flag causes a random dummy input to be run through the detector/system before timed measurements begin.
Sources: tools/infer/predict_det.py162-180 tools/infer/predict_rec.py184-202 tools/infer/predict_system.py202-205
| Issue | Cause | Resolution |
|---|---|---|
| `RuntimeError: trt_dynamic_shapes must be defined` | PIR model used with TRT but inference.yml missing config | Add `trt_dynamic_shapes` to inference.yml |
| Shape file not found on first run | Normal — first run profiles shapes | Run once without TRT to generate shape file, or pre-provide shape file |
| `fp16` has no effect without TRT | Precision mode only activates for TensorRT paths | Also set `--use_tensorrt=True` |
| GPU not found warning | nvidia-smi unavailable (Jetson, etc.) | Safe to ignore on Jetson; GPU is still used if `--use_gpu=True` |
| TRT cache not invalidated on model update | `.cache/trt/` is checked by file existence only | Manually delete the `.cache/trt/` directory when the model changes |
Sources: tools/infer/utility.py277-344