This page documents the data processing and augmentation pipeline used during model training in PaddleOCR. The system transforms raw image-label pairs into training-ready tensor formats through a configurable sequence of operations including label encoding, image augmentation, resizing, and normalization. This pipeline is essential for preparing data for training detection, recognition, classification, and other OCR tasks.
For information about the training orchestration that uses these data pipelines, see Training Infrastructure and Orchestration. For details on model architectures that consume this processed data, see Model Architecture Components. For post-training evaluation and export, see Model Evaluation and Export.
The data processing system follows a pipeline architecture where raw data passes through a sequence of transform operators. Each operator modifies the data dictionary and passes it to the next stage. If any operator returns None, the entire sample is discarded.
Diagram: High-Level Data Processing Flow
Sources: ppocr/data/imaug/__init__.py68-96 ppocr/data/imaug/label_ops.py1-1327 ppocr/data/imaug/rec_img_aug.py1-900
PaddleOCR provides multiple dataset classes for different data formats and task types. All datasets inherit from paddle.io.Dataset and integrate with the transform pipeline.
Diagram: Dataset Class Hierarchy
Sources: ppocr/data/__init__.py ppocr/data/simple_dataset.py ppocr/data/lmdb_dataset.py ppocr/data/pubtab_dataset.py
| Dataset Class | Use Case | Data Format | Key Features |
|---|---|---|---|
| `SimpleDataSet` | General OCR tasks | Text file with `image_path\tlabel` | Most common, supports image directories |
| `LMDBDataSet` | Large-scale training | LMDB key-value store | Fast random access, memory-efficient |
| `PubTabDataSet` | Table recognition | Custom format with structure | Handles table structure + cell content |
| `PGDataSet` | End-to-end OCR | JSON with polygons | Text detection + recognition |
| `MultiScaleDataSet` | Detection training | Multiple image scales | On-the-fly multi-scale augmentation |
Each dataset class implements:
- `__init__()`: Initialize with config, data paths, and transform operators
- `__getitem__(idx)`: Load a sample, apply transforms, return processed data
- `__len__()`: Return the dataset size

Sources: ppocr/data/simple_dataset.py ppocr/data/lmdb_dataset.py ppocr/data/pubtab_dataset.py
The pipeline is configured via YAML files where each transform is specified as a list entry. The create_operators() function instantiates operators from config, and transform() applies them sequentially:
Diagram: Transform Pipeline Configuration
Sources: ppocr/data/imaug/__init__.py79-96 tools/train.py54-62
The create_operators() function at ppocr/data/imaug/__init__.py79-96 dynamically instantiates operators using eval():
The transform() function at ppocr/data/imaug/__init__.py68-76 applies operators sequentially, short-circuiting on None:
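The behavior of both functions can be condensed into the following sketch (simplified from the source; the real `create_operators()` also validates that the config is a list of single-key dicts). The `AddKey` class is a toy operator added here for illustration only:

```python
def transform(data, ops=None):
    """Apply each operator in order; a None return discards the sample."""
    if ops is None:
        ops = []
    for op in ops:
        data = op(data)
        if data is None:
            return None
    return data


def create_operators(op_param_list, global_config=None):
    """Instantiate operators from a list of single-key config dicts."""
    ops = []
    for operator in op_param_list:
        op_name = list(operator)[0]
        param = {} if operator[op_name] is None else dict(operator[op_name])
        if global_config is not None:
            param.update(global_config)
        ops.append(eval(op_name)(**param))  # resolve class by name, as in PaddleOCR
    return ops


class AddKey:
    """Toy operator used to exercise the pipeline (not a PaddleOCR op)."""
    def __init__(self, val=1, **kwargs):
        self.val = val

    def __call__(self, data):
        data["x"] = data.get("x", 0) + self.val
        return data
```

An operator that returns `None` short-circuits the whole chain, which is how invalid samples get dropped before batching.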
Sources: ppocr/data/imaug/__init__.py68-96
Before augmentation and resizing, images must be decoded from file format and prepared for processing. PaddleOCR provides foundational operators for these tasks.
The DecodeImage operator loads image data from bytes or file paths and converts to numpy arrays.
Implementation: ppocr/data/imaug/operators.py
Key Parameters:
- `img_mode`: Color space - `BGR` (default), `RGB`, or `GRAY`
- `channel_first`: If True, output shape is (C, H, W); if False (default), (H, W, C)
- `ignore_orientation`: Whether to ignore EXIF orientation tags

Process:
1. Read `data['image']` (bytes or file path)
2. Decode with `cv2.imdecode()` or PIL
3. Convert the color space if needed (`cv2.cvtColor()`)
4. Transpose to channel-first layout when `channel_first=True`
5. Store the result back in `data['image']` as a numpy array

Example Usage in Config:
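A representative fragment, modeled on the transform lists in PaddleOCR's shipped YAML configs:

```yaml
- DecodeImage:
    img_mode: BGR
    channel_first: false
```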
Sources: ppocr/data/imaug/operators.py
The NormalizeImage operator normalizes pixel values to standardized ranges for neural network input.
Implementation: ppocr/data/imaug/operators.py
Key Parameters:
- `scale`: Division factor (e.g., `1./255.` converts [0, 255] to [0, 1])
- `mean`: Mean values for each channel (e.g., [0.485, 0.456, 0.406] for ImageNet)
- `std`: Standard deviation for each channel (e.g., [0.229, 0.224, 0.225])
- `order`: Channel order - `hwc` (height, width, channel) or `chw`

Normalization Formula:
normalized_image = (image * scale - mean) / std
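As a quick arithmetic check, applying the formula with ImageNet statistics maps a fully saturated channel (255) to roughly +2.25 and a zero channel to roughly -2.04:

```python
import numpy as np

# ImageNet-style per-channel normalization applied to one pixel
scale = 1.0 / 255.0
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

pixel = np.array([255.0, 0.0, 128.0])        # raw channel intensities
normalized = (pixel * scale - mean) / std    # the formula above

# channel 0: (1.0 - 0.485) / 0.229 ≈ 2.25; channel 1: (0 - 0.456) / 0.224 ≈ -2.04
```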
Common Configurations:
| Use Case | scale | mean | std | Result Range |
|---|---|---|---|---|
| Standard ImageNet | 1./255. | [0.485, 0.456, 0.406] | [0.229, 0.224, 0.225] | Approximately [-2, 2] |
| Simple [0,1] | 1./255. | [0, 0, 0] | [1, 1, 1] | [0, 1] |
| Centered [-1,1] | 1./255. | [0.5, 0.5, 0.5] | [0.5, 0.5, 0.5] | [-1, 1] |
Example Usage in Config:
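A representative fragment using the standard ImageNet values (modeled on the shipped PaddleOCR configs):

```yaml
- NormalizeImage:
    scale: 1./255.
    mean: [0.485, 0.456, 0.406]
    std: [0.229, 0.224, 0.225]
    order: hwc
```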
Sources: ppocr/data/imaug/operators.py
The ToCHWImage operator transposes image arrays from (H, W, C) to (C, H, W) format, which is required by PaddlePaddle models.
Implementation: ppocr/data/imaug/operators.py
Process:
This operator is typically placed after NormalizeImage and before batch collation.
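The transpose itself is a one-liner; `ToCHWImage` is equivalent to the following on `data['image']`:

```python
import numpy as np

# HWC -> CHW transpose, as performed by ToCHWImage
img_hwc = np.zeros((32, 100, 3), dtype=np.float32)  # (H, W, C)
img_chw = img_hwc.transpose((2, 0, 1))              # (C, H, W)
```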
Sources: ppocr/data/imaug/operators.py
Label encoding converts raw text annotations into numerical indices that models can process. The system provides specialized encoders for different model architectures and task types.
Diagram: Label Encoder Class Hierarchy
Sources: ppocr/data/imaug/label_ops.py101-167
| Encoder Class | Algorithm | Special Tokens | Max Length Handling | Output Format |
|---|---|---|---|---|
| `CTCLabelEncode` | CTC-based models | blank (index 0) | Pads with 0 | `label`: indices array, `length`: text length |
| `AttnLabelEncode` | Attention models | sos, eos | Rejects if ≥ max_length | `label`: [sos] + text + [eos] + padding |
| `SRNLabelEncode` | SRN | sos, eos | Rejects if > max_length | `label`: text + [eos] + padding |
| `SEEDLabelEncode` | SEED | eos, padding, unknown | Rejects if ≥ max_length | `label`: text + [eos] + [padding]* |
| `RFLLabelEncode` | RFL | sos, eos | Rejects if ≥ max_length | `label`: [sos] + text + [eos], `cnt_label`: char counts |
| `ParseQLabelEncode` | ParseQ | [B], [E], [P] | Configured via max_text_length | Filters out EOS during decode |
| `SARLabelEncode` | SAR | <BOS/EOS>, <UKN>, <PAD> | Via TableLabelEncode parent | Special token indices |
Sources: ppocr/data/imaug/label_ops.py169-630
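A minimal sketch of the CTC-style encoding from the table above, with a hypothetical three-character dictionary (the real `CTCLabelEncode` loads its character dictionary from a file and reserves index 0 for the CTC blank):

```python
# Hypothetical dictionary; index 0 is reserved for the CTC blank
char_to_idx = {"a": 1, "b": 2, "c": 3}
max_text_length = 6

def ctc_encode(text):
    """Map text to indices and pad with 0 (blank); None discards the sample."""
    if len(text) == 0 or len(text) > max_text_length:
        return None                      # length validation
    indices = [char_to_idx[ch] for ch in text if ch in char_to_idx]
    if not indices:
        return None                      # every character was unknown
    length = len(indices)
    indices = indices + [0] * (max_text_length - length)  # pad with blank
    return {"label": indices, "length": length}
```

Samples that are too long, empty, or made entirely of out-of-dictionary characters yield `None` and are dropped by the pipeline.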
Diagram: Label Encoding Pipeline
Sources: ppocr/data/imaug/label_ops.py104-193
Detection Label Encoder (DetLabelEncode): Parses JSON annotations containing polygon coordinates and transcriptions. Marks samples with * or ### as ignore tags.
Diagram: Detection Label Encoding
Sources: ppocr/data/imaug/label_ops.py49-99
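The parsing step can be sketched as follows (simplified; the real `DetLabelEncode` also expands polygons to a uniform point count and validates box shapes):

```python
import json
import numpy as np

def det_label_encode(label_str):
    """Parse a detection label: a JSON list of {'transcription', 'points'} dicts.

    Entries whose transcription is '*' or '###' are kept but flagged as
    ignored, mirroring the ignore-tag behavior described above.
    """
    annotations = json.loads(label_str)
    boxes, texts, ignore_tags = [], [], []
    for ann in annotations:
        txt = ann["transcription"]
        boxes.append(np.array(ann["points"], dtype=np.float32))
        texts.append(txt)
        ignore_tags.append(txt in ("*", "###"))
    return {"polys": boxes, "texts": texts, "ignore_tags": ignore_tags}

label = ('[{"transcription": "hello", "points": [[0,0],[50,0],[50,20],[0,20]]},'
         ' {"transcription": "###", "points": [[60,0],[90,0],[90,20],[60,20]]}]')
result = det_label_encode(label)
```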
Table Label Encoder (TableLabelEncode): Encodes table structure as HTML-like tokens (<td>, <tr>, etc.) with bounding box coordinates for each cell.
Sources: ppocr/data/imaug/label_ops.py633-743
KIE Label Encoder (KieLabelEncode): For Key Information Extraction, computes spatial relations between text boxes and encodes both text content and geometric relationships.
Sources: ppocr/data/imaug/label_ops.py269-444
Table recognition requires specialized operators to handle structure annotations and cell-level processing.
The ResizeTableImage operator resizes table images while maintaining aspect ratio and updating bounding box coordinates.
Implementation: ppocr/data/imaug/table_ops.py
Key Parameters:
- `max_len`: Maximum dimension (width or height)
- `resize_bboxes`: Whether to rescale bounding box coordinates (default True)
- `infer_mode`: If True, only resizes the image without bbox updates

Process:
1. Compute `scale = max_len / max(height, width)`
2. Resize with `cv2.resize(img, None, fx=scale, fy=scale)`
3. Rescale bounding boxes by the same factor when `resize_bboxes=True`

Example Usage:
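A dependency-free sketch of the scaling step (hypothetical helper; `488` mirrors the `max_len` used in common PaddleOCR table configs, and numpy stands in for `cv2.resize` so the example runs without OpenCV):

```python
import numpy as np

def resize_table(img, bboxes, max_len=488):
    """Scale so the longer side equals max_len; rescale bboxes to match."""
    h, w = img.shape[:2]
    scale = max_len / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # stand-in for cv2.resize(img, None, fx=scale, fy=scale)
    resized = np.zeros((new_h, new_w) + img.shape[2:], dtype=img.dtype)
    return resized, bboxes * scale, scale

img = np.zeros((244, 122, 3), dtype=np.uint8)
bboxes = np.array([[10.0, 20.0, 60.0, 80.0]])
resized, scaled_boxes, scale = resize_table(img, bboxes)
```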
Sources: ppocr/data/imaug/table_ops.py
The GenTableMask operator generates mask tensors for table structure learning, used by table recognition models.
Implementation: ppocr/data/imaug/table_ops.py
Process:
- Generates a `bbox_masks` tensor indicating valid/invalid bounding boxes
- Generates a `structure_mask` tensor for structure token validity

Sources: ppocr/data/imaug/table_ops.py
Image augmentation improves model generalization by applying realistic transformations to training images. PaddleOCR provides algorithm-specific augmentation pipelines optimized for different model architectures.
Diagram: Augmentation Operator Classes
Sources: ppocr/data/imaug/rec_img_aug.py35-272
RecAug combines Text Image Augmentation (TIA) techniques with basic data augmentation. It is the default augmentation for most recognition models.
TIA Transformations (applied with tia_prob=0.4):
- `tia_distort()`: Applies elastic distortion (3-6 control points)
- `tia_stretch()`: Non-uniform stretching (3-6 control points)
- `tia_perspective()`: Perspective transformation

Base Augmentations (each applied with probability 0.4):
- `get_crop()` removes random edges (requires image ≥ 20×20)
- `hsv_aug()` adjusts hue/saturation/value
- `jitter()` adjusts brightness
- `add_gaussian_noise()` adds Gaussian noise

Sources: ppocr/data/imaug/rec_img_aug.py35-114
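The probability gating follows a simple pattern of independent coin flips, sketched below (`rec_aug_sketch` is a hypothetical illustration; the string names stand in for the actual distortion functions in rec_img_aug.py):

```python
import random

def rec_aug_sketch(img, tia_prob=0.4, aug_prob=0.4, rng=None):
    """Gate each augmentation on an independent coin flip (sketch)."""
    rng = rng or random.Random()
    applied = []
    if rng.random() < tia_prob:
        applied.append("tia")        # stands in for distort/stretch/perspective
    for name in ("crop", "hsv", "jitter", "noise"):
        if rng.random() < aug_prob:
            applied.append(name)     # stands in for get_crop/hsv_aug/etc.
    return img, applied
```

Because each flip is independent, any subset of augmentations can fire on a given sample.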
ABINetRecAug uses CVGeometry, CVDeterioration, and CVColorJitter from abinet_aug.py:
Sources: ppocr/data/imaug/rec_img_aug.py116-146
SVTRRecAug uses SVTRGeometry and SVTRDeterioration:
- Exposes an `aug_type` parameter for geometry variations

Sources: ppocr/data/imaug/rec_img_aug.py196-232
ParseQRecAug uses ParseQDeterioration:
- Uses `lam=20` and `radius=2.0` parameters

Sources: ppocr/data/imaug/rec_img_aug.py234-272
RecConAug concatenates multiple text samples horizontally to simulate continuous text scenarios. This is useful for training models on scene text where multiple words appear consecutively.
Configuration:
- `prob`: Probability of applying concatenation (default 0.5)
- `max_text_length`: Maximum total text length after concatenation (default 25)
- `max_wh_ratio`: Maximum width-to-height ratio (default derived from image_shape)

Process:
1. Pull candidate samples from `ext_data` (extra samples provided by the dataloader)
2. Accept a candidate only while the combined text length ≤ `max_text_length` and the combined aspect ratio ≤ `max_wh_ratio`
3. Concatenate the images horizontally and append labels: `data['label'] += ext_data['label']`

Sources: ppocr/data/imaug/rec_img_aug.py148-194
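The acceptance check can be sketched as below (`can_concat` is a hypothetical helper; the `8.0` ratio limit is illustrative, not a value taken from the PaddleOCR source):

```python
def can_concat(cur_label, ext_label, cur_w, ext_w, height,
               max_text_length=25, max_wh_ratio=8.0):
    """Accept an extra sample only if combined length and ratio fit."""
    if len(cur_label) + len(ext_label) > max_text_length:
        return False                      # combined text too long
    if (cur_w + ext_w) / height > max_wh_ratio:
        return False                      # combined image too wide
    return True
```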
After augmentation, images must be resized to fixed dimensions and normalized for model input. Different algorithms require specialized resize strategies.
Diagram: Resize Operator Variants
Sources: ppocr/data/imaug/rec_img_aug.py285-640
The most common resize operation maintains aspect ratio with optional padding:
Diagram: RecResizeImg Processing Flow
Key Functions:
- `resize_norm_img(img, image_shape, padding=True)`: Resizes the image while preserving aspect ratio. If the width exceeds the target, resizes to the target width; otherwise resizes to the calculated width and pads with zeros when `padding=True`.
- `resize_norm_img_chinese(img, image_shape)`: Special handling for images with Chinese characters; uses a different normalization.

Sources: ppocr/data/imaug/rec_img_aug.py285-312
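The shape logic of `resize_norm_img` can be sketched as follows (`resize_norm_sketch` is a simplified stand-in; the real function also normalizes pixel values and uses cv2 for the actual interpolation):

```python
import math
import numpy as np

def resize_norm_sketch(img, image_shape=(3, 32, 320), padding=True):
    """Aspect-preserving resize to (imgC, imgH, imgW) with right-side zero padding."""
    imgC, imgH, imgW = image_shape
    h, w = img.shape[:2]
    ratio = w / float(h)
    if math.ceil(imgH * ratio) > imgW:
        resized_w = imgW                           # too wide: clamp to target width
    else:
        resized_w = int(math.ceil(imgH * ratio))   # preserve aspect ratio
    # stand-in for cv2.resize(img, (resized_w, imgH))
    resized = np.zeros((imgH, resized_w, imgC), dtype=np.float32)
    padded = np.zeros((imgH, imgW, imgC), dtype=np.float32)
    padded[:, :resized_w, :] = resized             # zero padding on the right
    valid_ratio = resized_w / float(imgW)          # fraction of width with content
    return (padded if padding else resized), valid_ratio

img = np.zeros((64, 256, 3), dtype=np.uint8)       # H=64, W=256 -> ratio 4.0
out, valid_ratio = resize_norm_sketch(img)
```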
SAR Resize (SARRecResizeImg):
- Uses `resize_norm_img_sar()`, which computes resize and padding shapes
- Outputs `resized_shape`, `pad_shape`, and `valid_ratio`
- Accounts for the `width_downsample_ratio` in SAR's architecture

Sources: ppocr/data/imaug/rec_img_aug.py403-420
SRN Resize (SRNRecResizeImg):
- Uses `resize_norm_img_srn()` for SRN-specific normalization
- Generates additional inputs via `srn_other_inputs()`:
  - `encoder_word_pos`: Position encoding for the encoder
  - `gsrm_word_pos`: GSRM word positions
  - `gsrm_slf_attn_bias1/2`: Self-attention biases

Sources: ppocr/data/imaug/rec_img_aug.py379-401
SPIN Resize (SPINRecResizeImg):
- Normalizes with `(img - mean) / std`

Sources: ppocr/data/imaug/rec_img_aug.py442-487
RobustScanner Resize (RobustScannerRecResizeImg):
- Adds a `word_positons` array: `np.array(range(0, max_text_length))`

Sources: ppocr/data/imaug/rec_img_aug.py572-640
| Resize Class | Grayscale | Padding | Aspect Ratio | Special Outputs | Interpolation |
|---|---|---|---|---|---|
| `RecResizeImg` | No | Optional | Preserved | valid_ratio | LINEAR |
| `SARRecResizeImg` | No | Yes | Preserved | resized_shape, pad_shape, valid_ratio | LINEAR |
| `SRNRecResizeImg` | No | No | SRN-specific | encoder_word_pos, gsrm_word_pos, attention biases | LINEAR |
| `ABINetRecResizeImg` | No | Yes | Preserved | valid_ratio | CUBIC |
| `SVTRRecResizeImg` | No | Optional | Preserved | valid_ratio | LINEAR |
| `RFLRecResizeImg` | Yes | Optional | Preserved | valid_ratio | Configurable |
| `RobustScannerRecResizeImg` | No | Yes | Preserved | resized_shape, pad_shape, valid_ratio, word_positons | LINEAR |
| `GrayRecResizeImg` | Yes | Optional | Varies | None | PIL/OpenCV |
| `PRENResizeImg` | No | No | Not preserved | None | LINEAR |
| `SPINRecResizeImg` | Yes | No | Not preserved | None | Configurable |
Sources: ppocr/data/imaug/rec_img_aug.py285-640
The transform pipeline is implemented through two key functions in ppocr/data/imaug/__init__.py:
Each operator implements the __call__(self, data) method which:
1. Receives the data dictionary (e.g., `{"image": np.ndarray, "label": str}`)
2. Modifies or adds keys
3. Returns the updated dictionary, or `None` to discard the sample

Diagram: Operator Call Chain
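A minimal custom operator following this protocol might look like the following (`ToGray` is a hypothetical example, not a PaddleOCR operator; channel averaging stands in for real preprocessing work):

```python
import numpy as np

class ToGray:
    """Minimal custom operator implementing the __call__(self, data) protocol."""

    def __init__(self, **kwargs):
        pass  # operators receive their YAML parameters as keyword arguments

    def __call__(self, data):
        img = data["image"]
        if img is None:
            return None          # returning None discards the sample
        data["image"] = img.mean(axis=2, keepdims=True)  # (H, W, C) -> (H, W, 1)
        return data
```

Registering such a class in ppocr/data/imaug/__init__.py makes it available to the config-driven pipeline.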
Sources: ppocr/data/imaug/__init__.py68-96 ppocr/data/imaug/operators.py ppocr/data/imaug/label_ops.py153-193 ppocr/data/imaug/rec_img_aug.py52-66
A typical YAML configuration for recognition training:
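A representative fragment, modeled on the recognition configs shipped with PaddleOCR (the paths and `image_shape` here are placeholders):

```yaml
Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/
    label_file_list: ["./train_data/train_list.txt"]
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - CTCLabelEncode:
      - RecResizeImg:
          image_shape: [3, 32, 320]
      - KeepKeys:
          keep_keys: ["image", "label", "length"]
```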
The execution flow during training:
Sources: ppocr/data/imaug/__init__.py68-96 tools/train.py54-62 tools/program.py319-383
The data processing pipeline integrates with the training system through build_dataloader():
Diagram: Data Pipeline Integration
Sources: tools/train.py54-62 tools/program.py191-649 ppocr/data/__init__.py ppocr/data/simple_dataset.py ppocr/data/imaug/__init__.py68-96
The build_dataloader() function creates the data pipeline:
Key Steps:
1. Read the dataset config from `config['Train']['dataset']`
2. Select the dataset class from the `name` field:
   - `SimpleDataSet`: For standard image-label pairs
   - `LMDBDataSet`: For LMDB database format
   - `PubTabDataSet`: For table recognition
   - `PGDataSet`: For end-to-end OCR
3. Build the transform pipeline with `create_operators(transforms, global_config)`
4. Wrap the dataset in a `paddle.io.DataLoader` with batching, shuffling, and worker processes (`num_workers`)

Implementation: ppocr/data/__init__.py
Usage:
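A self-contained sketch of the dispatch step (stand-in classes replace the real datasets and `paddle.io.DataLoader` so the example runs without PaddlePaddle; `build_dataloader_sketch` and `SUPPORTED` are hypothetical names):

```python
class SimpleDataSet:
    """Stand-in for the real dataset class."""
    def __init__(self, config):
        self.samples = config.get("samples", [])

    def __len__(self):
        return len(self.samples)

class LMDBDataSet(SimpleDataSet):
    pass

# registry mirroring the name-field dispatch described above
SUPPORTED = {"SimpleDataSet": SimpleDataSet, "LMDBDataSet": LMDBDataSet}

def build_dataloader_sketch(config, mode):
    """Pick the dataset class from the 'name' field, as build_dataloader does."""
    ds_cfg = config[mode]["dataset"]
    name = ds_cfg["name"]
    if name not in SUPPORTED:
        raise ValueError(f"unsupported dataset: {name}")
    dataset = SUPPORTED[name](ds_cfg)
    # the real function wraps this in paddle.io.DataLoader with a batch sampler
    return dataset

cfg = {"Train": {"dataset": {"name": "SimpleDataSet", "samples": [1, 2, 3]}}}
loader = build_dataloader_sketch(cfg, "Train")
```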
Sources: ppocr/data/__init__.py tools/train.py54-62
The training loop in tools/program.py::train() receives batches where structure varies by algorithm:
Recognition Models: each batch is a list whose first entry is the image tensor; the remaining entries vary by algorithm.
Algorithm-Specific Batch Structures:
| Algorithm | batch[0] | batch[1] | batch[2+] |
|---|---|---|---|
| CTC/CRNN | images | label indices | length |
| Attention | images | label sequences (with sos/eos) | length |
| SRN | images | labels | encoder_word_pos, gsrm_word_pos, gsrm_slf_attn_bias1, gsrm_slf_attn_bias2 |
| SAR | images | labels | valid_ratio |
| RobustScanner | images | labels | valid_ratio, word_positons |
| Table | images | structure, bboxes, bbox_masks | (varies) |
| KIE | Entire data dict | (no separation) | N/A |
The training loop handles these variations at tools/program.py326-383.
Sources: tools/program.py326-383
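The per-algorithm dispatch can be pictured as the following sketch (`unpack_batch` is a hypothetical illustration of the table above; the real loop at tools/program.py also handles AMP and KIE's dict-style batches):

```python
def unpack_batch(algorithm, batch):
    """Route batch entries to named tensors per the table above (sketch)."""
    images = batch[0]
    if algorithm in ("CRNN", "CTC", "Attention"):
        return {"images": images, "labels": batch[1], "lengths": batch[2]}
    if algorithm == "SRN":
        extras = ("encoder_word_pos", "gsrm_word_pos",
                  "gsrm_slf_attn_bias1", "gsrm_slf_attn_bias2")
        return {"images": images, "labels": batch[1],
                **dict(zip(extras, batch[2:]))}
    if algorithm in ("SAR", "RobustScanner"):
        return {"images": images, "labels": batch[1], "valid_ratio": batch[2]}
    raise ValueError(f"no unpack rule for {algorithm}")

out = unpack_batch("SAR", ["imgs", "lbls", 0.8])
```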
The pipeline includes multiple validation points:
Length Validation: Label encoders reject samples whose text exceeds `max_text_length`.
Sources: ppocr/data/imaug/label_ops.py153-154
Character Validation: Unknown characters are silently skipped during encoding.
Sources: ppocr/data/imaug/label_ops.py158-163
Image Size Validation: Some augmentations require minimum image dimensions.
Sources: ppocr/data/imaug/rec_img_aug.py58-59
Empty Result Handling: If label encoding produces an empty result, the sample is discarded.
Sources: ppocr/data/imaug/label_ops.py164-165
The PaddleOCR data processing and augmentation pipeline provides a flexible, extensible system for preparing OCR training data:
- Label encoders (`CTCLabelEncode`, `AttnLabelEncode`, etc.) convert text to indices with appropriate special tokens
- Algorithm-specific augmentations (`RecAug`, `ABINetRecAug`, `SVTRRecAug`) improve model robustness
- Config-driven pipelines built with `create_operators()` and `transform()` enable easy experimentation

The modular design allows adding new operators by implementing the `__call__(self, data)` interface and registering them in ppocr/data/imaug/__init__.py.
Sources: ppocr/data/imaug/__init__.py1-96 ppocr/data/imaug/label_ops.py1-1327 ppocr/data/imaug/rec_img_aug.py1-900 tools/program.py191-649