Model configuration files are YAML files that define all aspects of model training, evaluation, and inference in PaddleOCR. They provide a declarative way to specify model architecture, training hyperparameters, data processing pipelines, and evaluation metrics without modifying code.
For information about inference configuration and deployment settings, see page 5.2. For training orchestration that uses these configurations, see page 4.1. For how model architectures are constructed from these configs, see page 4.2. For data processing pipelines driven by these configs, see page 4.3.
Configuration files follow a hierarchical structure with eight top-level sections: Global, Architecture, Loss, Optimizer, Train, Eval, PostProcess, and Metric.
Diagram: YAML Config File — Top-Level Sections and Subsections
Sources: configs/table/table_mv3.yml:1-130, configs/rec/rec_icdar15_train.yml:1-99
The Global section controls training execution, device settings, logging, checkpoint management, and task-specific settings shared across all pipeline stages.
| Parameter | Type | Description | Example |
|---|---|---|---|
| use_gpu | bool | Enable GPU training | true |
| epoch_num | int | Total training epochs | 72, 400 |
| log_smooth_window | int | Smoothing window for logged metrics | 20 |
| print_batch_step | int | Log interval in iterations | 5, 10 |
| save_model_dir | str | Checkpoint output directory | ./output/rec/ic15/ |
| save_epoch_step | int | Checkpoint frequency in epochs | 3, 400 |
| eval_batch_step | list[int] | Evaluation schedule [start_iter, interval] | [0, 2000] |
| cal_metric_during_train | bool | Compute metrics during training | True |
| pretrained_model | str | Path to pretrained weights (optional) | |
| checkpoints | str | Path to resume training from checkpoint | |
| save_inference_dir | str | Directory for exported inference model | ./ |
| use_visualdl | bool | Enable VisualDL logging | False |
| infer_img | str | Sample image path for inference mode | doc/imgs_words_en/word_10.png |
| character_dict_path | str | Path to character dictionary file | ppocr/utils/en_dict.txt |
| max_text_length | int | Maximum sequence length for recognition | 25, 500 |
| infer_mode | bool | Run in inference-only mode | False |
| use_space_char | bool | Include space character in recognition | False |
| save_res_path | str | Path to write inference results | ./output/rec/predicts_ic15.txt |
| amp_custom_black_list | list | Ops excluded from AMP | ['matmul_v2'] |
Example Global Configuration:
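The sketch below assembles the parameters from the table into a Global block in the style of rec_icdar15_train.yml; treat it as illustrative rather than a verbatim copy of the shipped config:

```yaml
Global:
  use_gpu: true
  epoch_num: 72
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/rec/ic15/
  save_epoch_step: 3
  eval_batch_step: [0, 2000]
  cal_metric_during_train: True
  pretrained_model:
  checkpoints:
  save_inference_dir: ./
  use_visualdl: False
  infer_img: doc/imgs_words_en/word_10.png
  character_dict_path: ppocr/utils/en_dict.txt
  max_text_length: 25
  infer_mode: False
  use_space_char: False
  save_res_path: ./output/rec/predicts_ic15.txt
```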
The &max_text_length YAML anchor syntax (as used in table_mv3.yml) creates a reusable value that can be referenced elsewhere in the same file using *max_text_length, reducing duplication when multiple sections need the same parameter.
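A minimal sketch of the anchor pattern (section layout abbreviated; table_mv3.yml applies the same idea across more sections):

```yaml
# Anchor defined once in Global ...
Global:
  max_text_length: &max_text_length 500

# ... and referenced elsewhere in the same file
Architecture:
  Head:
    name: TableAttentionHead
    max_text_length: *max_text_length  # resolves to 500
```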
Sources: configs/rec/rec_icdar15_train.yml:1-21, configs/table/table_mv3.yml:1-23
The Architecture section defines the neural network structure using a modular backbone-head pattern:
Diagram: Model Architecture Components
Architecture Configuration Structure:
| Field | Description | Example Values |
|---|---|---|
| model_type | Task category | det, rec, table, cls |
| algorithm | Specific algorithm | DB, SVTR_LCNet, TableAttn |
| Backbone.name | Backbone network | MobileNetV3, ResNet, PPLCNet |
| Backbone.* | Backbone-specific params | scale, model_name, etc. |
| Head.name | Head network | DBHead, CTCHead, TableAttentionHead |
| Head.out_channels | Output dimension (for recognition) | Character dictionary size |
| Head.* | Head-specific params | hidden_size, max_text_length, etc. |
Example Architecture Configuration:
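An illustrative Architecture block for the table model, assembled from the fields in the table above (parameter values are representative and may not match table_mv3.yml exactly):

```yaml
Architecture:
  model_type: table
  algorithm: TableAttn
  Backbone:
    name: MobileNetV3
    scale: 1.0
    model_name: large
  Head:
    name: TableAttentionHead
    hidden_size: 256
    max_text_length: 500
```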
The out_channels parameter for recognition heads is automatically calculated from the character dictionary during training initialization.
Sources: configs/table/table_mv3.yml:36-49, tools/train.py:73-138, ppocr/modeling/backbones/__init__.py:18-152, ppocr/modeling/heads/__init__.py:18-152
The Loss section specifies the loss function and its hyperparameters:
| Loss Type | Use Case | Key Parameters |
|---|---|---|
| DBLoss | Text detection (DB algorithm) | alpha, beta, ohem_ratio |
| CTCLoss | Text recognition (CTC decoder) | - |
| AttentionLoss | Text recognition (attention decoder) | - |
| TableAttentionLoss | Table structure recognition | structure_weight, loc_weight |
| MultiLoss | Multi-head models | loss_config_list |
| CombinedLoss | Multiple loss combination | loss_config_list |
Example Loss Configuration:
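An illustrative single-loss block for table structure recognition (the weights are representative values, not necessarily those in table_mv3.yml):

```yaml
Loss:
  name: TableAttentionLoss
  structure_weight: 100.0
  loc_weight: 2.0
```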
For multi-loss scenarios:
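A hedged sketch of the loss_config_list pattern; the SARLoss entry is assumed here only to show a second head alongside CTC:

```yaml
Loss:
  name: MultiLoss
  loss_config_list:
    - CTCLoss:
    - SARLoss:
```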
Sources: configs/table/table_mv3.yml:50-53, ppocr/losses/__init__.py:77-127
The Optimizer section configures the optimization algorithm and learning rate schedule:
Optimizer Configuration:
| Optimizer Type | Common Parameters |
|---|---|
| Adam | beta1, beta2 |
| SGD | momentum |
| AdamW | beta1, beta2, weight_decay |

Learning Rate Schedules:

| Schedule Type | Parameters | Behavior |
|---|---|---|
| Cosine | learning_rate, warmup_epoch | Cosine annealing with warmup |
| Linear | learning_rate, epochs, end_lr | Linear decay |
| Piecewise | learning_rate, decay_epochs, gamma | Step-wise decay at specified epochs |
| MultiStepDecay | learning_rate, milestones, gamma | Decay at milestone epochs |
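Putting the optimizer and schedule tables together, an Optimizer block might look like this (the lr sub-block nesting follows the referenced configs; values are illustrative):

```yaml
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
    warmup_epoch: 5
```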
Sources: configs/table/table_mv3.yml:25-34, tools/train.py:156-162
The Train and Eval sections define data loading and preprocessing pipelines. Both follow the same structure: a dataset block (class name, data paths, transforms list) and a loader block (batch size, workers, shuffle).
Data pipeline flow through build_dataloader():
Sources: configs/rec/rec_icdar15_train.yml:59-99, configs/table/table_mv3.yml:63-130, ppocr/data/imaug/__init__.py
Common Dataset Classes:
| Class | Format | Typical Use |
|---|---|---|
| SimpleDataSet | Text file with image path + label per line | Detection, recognition (flat files) |
| LMDBDataSet | LMDB binary database | Recognition (large-scale datasets) |
| PubTabDataSet | JSONL annotations | Table structure recognition |
Common Transform Operations:
| Transform | Purpose | Key Parameters |
|---|---|---|
| DecodeImage | Load image from bytes/file | img_mode, channel_first |
| DetLabelEncode | Encode detection polygon labels | — |
| CTCLabelEncode | Encode recognition labels (CTC) | Uses character_dict_path from Global |
| TableLabelEncode | Encode table structure labels | max_text_length, learn_empty_box, loc_reg_num |
| RecResizeImg | Resize image for recognition | image_shape: [C, H, W] |
| ResizeTableImage | Resize table image to max side | max_len |
| PaddingTableImage | Pad to fixed size | size: [H, W] |
| NormalizeImage | Normalize pixel values | scale, mean, std, order |
| ToCHWImage | Transpose HWC → CHW | — |
| KeepKeys | Select fields returned by dataloader | keep_keys list |
Loader Parameters:
| Parameter | Train | Eval | Description |
|---|---|---|---|
| shuffle | True | False | Shuffle dataset order |
| batch_size_per_card | e.g. 256 | e.g. 256 | Batch size per GPU |
| drop_last | True | False | Drop incomplete final batch |
| num_workers | e.g. 8 | e.g. 4 | Parallel data loading workers |
| use_shared_memory | False | False | Use shared memory for workers |
Recognition Train section example (SimpleDataSet, CTC pipeline):
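A condensed sketch of such a Train section, modeled on rec_icdar15_train.yml (paths and values are illustrative):

```yaml
Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/ic15_data/
    label_file_list: ["./train_data/ic15_data/rec_gt_train.txt"]
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode:        # uses character_dict_path from Global
      - RecResizeImg:
          image_shape: [3, 32, 100]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length']
  loader:
    shuffle: True
    batch_size_per_card: 256
    drop_last: True
    num_workers: 8
```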
The Eval section uses identical structure but omits data augmentation transforms and sets shuffle: False and drop_last: False.
Sources: configs/rec/rec_icdar15_train.yml:59-99, configs/rec/rec_mv3_none_bilstm_ctc.yml:59-96, configs/table/table_mv3.yml:63-130, ppocr/data/imaug/__init__.py, ppocr/data/imaug/label_ops.py
The PostProcess section configures output decoding, converting raw model predictions into structured results:
PostProcess Classes by Task:
| Task | PostProcess Class | Purpose |
|---|---|---|
| Detection | DBPostProcess | Decode binary segmentation maps to text boxes |
| Recognition (CTC) | CTCLabelDecode | Decode CTC predictions to text |
| Recognition (Attention) | AttnLabelDecode | Decode attention predictions to text |
| Table | TableLabelDecode | Decode table structure and cell locations |
| Classification | ClsPostProcess | Decode classification logits |
The character_dict_path from the Global section is automatically passed to recognition post-processors.
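A minimal PostProcess block for CTC recognition looks like this; note that no dictionary path is needed here because it is injected from Global:

```yaml
PostProcess:
  name: CTCLabelDecode
```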
Sources: configs/table/table_mv3.yml:55-56, ppocr/postprocess/__init__.py:66-117
The Metric section defines evaluation metrics:
Metric Types:
| Task | Metric Class | Main Indicator | Description |
|---|---|---|---|
| Detection | DetMetric | hmean | F-score (harmonic mean of precision/recall) |
| Recognition | RecMetric | acc | Character-level accuracy |
| Table | TableMetric | acc | Structure accuracy (optional bbox IoU) |
| Classification | ClsMetric | acc | Classification accuracy |
The main_indicator field specifies which metric to use for model selection during training.
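For example, a recognition Metric block selecting the best checkpoint by accuracy:

```yaml
Metric:
  name: RecMetric
  main_indicator: acc
```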
Sources: configs/table/table_mv3.yml:58-61, tools/program.py:259-260
The configuration system supports YAML file loading and command-line overrides via dot-notation keys.
Configuration loading pipeline:
Sources: tools/program.py tools/train.py
Loading Process:
1. The YAML file is specified with the -c flag and loaded by load_config() into a Python dict.
2. Command-line overrides are specified with repeated -o flags, using dot-notation for nested keys.
3. merge_config() applies the overrides by traversing the config hierarchy with dot-separated keys and overwriting values.
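The dot-notation traversal can be sketched as follows; this is a simplified stand-in for illustration, not the actual PaddleOCR implementation:

```python
def merge_config(config: dict, opts: dict) -> dict:
    """Apply -o style overrides; keys may use dot-notation for nesting."""
    for key, value in opts.items():
        if "." not in key:
            config[key] = value  # top-level key: overwrite directly
            continue
        node = config
        parts = key.split(".")
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # walk (or create) the hierarchy
        node[parts[-1]] = value
    return config

# Example: equivalent of -o Global.epoch_num=500 Train.loader.batch_size_per_card=32
cfg = {"Global": {"use_gpu": True, "epoch_num": 72}}
merge_config(cfg, {"Global.epoch_num": 500, "Train.loader.batch_size_per_card": 32})
```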
Command-line Override Syntax:
| Override Pattern | Effect |
|---|---|
| -o Global.use_gpu=false | Disable GPU |
| -o Global.epoch_num=500 | Set epoch count |
| -o Architecture.Backbone.scale=0.5 | Nested update |
| -o Train.loader.batch_size_per_card=32 | Batch size override |
| -o Global.pretrained_model=./output/best | Load pretrained weights |
merge_config() handles both top-level keys and dot-notation nested keys by traversing the config dictionary hierarchy.
Sources: tools/program.py tools/train.py
During training initialization (tools/train.py), the configuration dict is programmatically modified based on model type and task requirements before the model is built. This is especially important for recognition models where out_channels must match the character dictionary size.
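As an illustration of the CTC case, the channel count can be sketched like this; the helper name is hypothetical and the real logic lives in tools/train.py:

```python
def ctc_out_channels(dict_chars, use_space_char=False):
    """Illustrative only: CTC head output size =
    dictionary characters (+ optional space) + 1 blank token."""
    num = len(dict_chars)
    if use_space_char:
        num += 1  # Global.use_space_char adds the space character
    return num + 1  # CTC blank token

# A 10-digit dictionary yields 11 output channels under these assumptions.
channels = ctc_out_channels(list("0123456789"))
```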
Runtime config adjustment flow:
Sources: tools/train.py:74-138, tools/eval.py:46-81
Automatic Adjustments for Recognition Models:
Character Dictionary Processing:
- character_dict_path from Global is read by the PostProcess class
- out_channels = len(character) (total including special tokens)

Algorithm-Specific Token Counting:
- CTC (CTCLabelDecode): char_num includes the CTC blank token
- char_num - 2 for the base prediction, adds ignore_index
- char_num - 3 (blank + SOS + EOS excluded from head output)

Multi-Head Models:
- out_channels_list: a dict mapping each decoder name to its char_num, e.g. {"CTCLabelDecode": 6625, "SARLabelDecode": 6627}
- Written to config["Architecture"]["Head"]["out_channels_list"]

Sources: tools/train.py:74-138, tools/eval.py:46-81
Example: Recognition Config (rec_icdar15_train.yml)

A minimal CRNN recognition config with CTC decoder, MobileNetV3 backbone, and flat file dataset:
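An abridged sketch of such a file (the Neck sub-block and exact values are reconstructed from the CRNN architecture and may differ from the shipped config; Train/Eval sections omitted):

```yaml
Global:
  use_gpu: true
  epoch_num: 72
  character_dict_path: ppocr/utils/en_dict.txt
  max_text_length: 25

Architecture:
  model_type: rec
  algorithm: CRNN
  Backbone:
    name: MobileNetV3
    scale: 0.5
    model_name: large
  Neck:
    name: SequenceEncoder
    encoder_type: rnn
    hidden_size: 96
  Head:
    name: CTCHead   # out_channels filled in from the character dictionary

Loss:
  name: CTCLoss

PostProcess:
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc
```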
Sources: configs/rec/rec_icdar15_train.yml:1-99
Example: Table Config (table_mv3.yml)

A table structure recognition config using YAML anchors to share parameters across sections:
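An abridged sketch of the anchor-sharing pattern in that file (values and section contents are illustrative):

```yaml
Global:
  max_text_length: &max_text_length 500

Architecture:
  model_type: table
  algorithm: TableAttn
  Backbone:
    name: MobileNetV3
  Head:
    name: TableAttentionHead
    max_text_length: *max_text_length   # shared via anchor

Train:
  dataset:
    name: PubTabDataSet
    transforms:
      - TableLabelEncode:
          max_text_length: *max_text_length   # same value, no duplication
```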
Sources: configs/table/table_mv3.yml:1-130
1. YAML Anchors and References:
- Use &anchor_name to define reusable values
- Reference them with *anchor_name to avoid duplication

2. Command-line Overrides for Experimentation:

- Prefer repeated -o flags with dot-notation keys over editing the YAML file for quick experiments
3. Validation Configuration:
- Keep Eval transforms minimal (no augmentation)
- Set loader.shuffle=False for reproducible validation
- Enable cal_metric_during_train to monitor convergence

4. Character Dictionary Management:
- Set character_dict_path once in the Global section for consistency
- out_channels is auto-calculated from the dictionary

5. AMP Configuration:
- Use amp_custom_black_list to avoid numerical instability
- Example: ['matmul_v2', 'elementwise_add'] for tables
- Use amp_level: "O2" for maximum speed with careful black-listing

Sources: configs/table/table_mv3.yml:1-107, tools/train.py:171-211
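A sketch of these AMP settings in YAML; the use_amp switch name is an assumption based on how tools/train.py is commonly invoked:

```yaml
Global:
  use_amp: True    # assumed switch name enabling mixed precision
  amp_level: O2
  amp_custom_black_list: ['matmul_v2', 'elementwise_add']
```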