This page documents the data processing and augmentation pipeline used during model training in PaddleOCR. The system transforms raw image-label pairs into training-ready tensor formats through a configurable sequence of operations including label encoding, image augmentation, resizing, and normalization. This pipeline is essential for preparing data for training detection, recognition, classification, and other OCR tasks.
For information about the training orchestration that uses these data pipelines, see Training Infrastructure and Orchestration. For details on model architectures that consume this processed data, see Model Architecture Components. For post-training evaluation and export, see Model Evaluation and Export.
The data processing system follows a pipeline architecture where raw data passes through a sequence of transform operators. Each operator modifies the data dictionary and passes it to the next stage. If any operator returns None, the entire sample is discarded.
Diagram: High-Level Data Processing Flow
Sources: ppocr/data/imaug/__init__.py68-96 ppocr/data/imaug/label_ops.py1-1327 ppocr/data/imaug/rec_img_aug.py1-900
PaddleOCR provides multiple dataset classes for different data formats and task types. All datasets inherit from paddle.io.Dataset and integrate with the transform pipeline.
Diagram: Dataset Class Hierarchy
Sources: ppocr/data/__init__.py ppocr/data/simple_dataset.py ppocr/data/lmdb_dataset.py ppocr/data/pubtab_dataset.py
| Dataset Class | Use Case | Data Format | Key Features |
|---|---|---|---|
| `SimpleDataSet` | General OCR tasks | Text file with `image_path\tlabel` | Most common, supports image directories |
| `LMDBDataSet` | Large-scale training | LMDB key-value store | Fast random access, memory-efficient |
| `PubTabDataSet` | Table recognition | Custom format with structure | Handles table structure + cell content |
| `PGDataSet` | End-to-end OCR | JSON with polygons | Text detection + recognition |
| `MultiScaleDataSet` | Detection training | Multiple image scales | On-the-fly multi-scale augmentation |
Each dataset class implements:
- `__init__()`: Initialize with config, data paths, and transform operators
- `__getitem__(idx)`: Load a sample, apply transforms, return processed data
- `__len__()`: Return the dataset size

Sources: ppocr/data/simple_dataset.py ppocr/data/lmdb_dataset.py ppocr/data/pubtab_dataset.py
The pipeline is configured via YAML files where each transform is specified as a list entry. The create_operators() function instantiates operators from config, and transform() applies them sequentially:
Diagram: Transform Pipeline Configuration
Sources: ppocr/data/imaug/__init__.py79-96 tools/train.py54-62
The create_operators() function at ppocr/data/imaug/__init__.py79-96 dynamically instantiates operators using eval():
The transform() function at ppocr/data/imaug/__init__.py68-76 applies operators sequentially, short-circuiting on None:
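The behavior of both functions can be condensed into the following sketch (simplified from the source; the real `create_operators()` also validates that the config is a list of single-key dicts). The `AddKey` class is a toy operator added here for illustration only:

```python
def transform(data, ops=None):
    """Apply each operator in order; a None return discards the sample."""
    if ops is None:
        ops = []
    for op in ops:
        data = op(data)
        if data is None:
            return None
    return data


def create_operators(op_param_list, global_config=None):
    """Instantiate operators from a list of single-key config dicts."""
    ops = []
    for operator in op_param_list:
        op_name = list(operator)[0]
        param = {} if operator[op_name] is None else dict(operator[op_name])
        if global_config is not None:
            param.update(global_config)
        ops.append(eval(op_name)(**param))  # resolve class by name, as in PaddleOCR
    return ops


class AddKey:
    """Toy operator used to exercise the pipeline (not a PaddleOCR op)."""
    def __init__(self, val=1, **kwargs):
        self.val = val

    def __call__(self, data):
        data["x"] = data.get("x", 0) + self.val
        return data
```

An operator that returns `None` short-circuits the whole chain, which is how invalid samples get dropped before batching.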
Sources: ppocr/data/imaug/__init__.py68-96
Before augmentation and resizing, images must be decoded from file format and prepared for processing. PaddleOCR provides foundational operators for these tasks.
The DecodeImage operator loads image data from bytes or file paths and converts to numpy arrays.
Implementation: ppocr/data/imaug/operators.py
Key Parameters:
- `img_mode`: Color space - `BGR` (default), `RGB`, or `GRAY`
- `channel_first`: If True, output shape is (C, H, W); if False (default), (H, W, C)
- `ignore_orientation`: Whether to ignore EXIF orientation tags

Process:
1. Read `data['image']` (bytes or file path)
2. Decode with `cv2.imdecode()` or PIL
3. Convert the color space if needed (`cv2.cvtColor()`)
4. Transpose to channel-first layout when `channel_first=True`
5. Store the result back in `data['image']` as a numpy array

Example Usage in Config:
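A representative fragment, modeled on the transform lists in PaddleOCR's shipped YAML configs:

```yaml
- DecodeImage:
    img_mode: BGR
    channel_first: false
```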
Sources: ppocr/data/imaug/operators.py
The NormalizeImage operator normalizes pixel values to standardized ranges for neural network input.
Implementation: ppocr/data/imaug/operators.py
Key Parameters:
- `scale`: Division factor (e.g., `1./255.` converts [0, 255] to [0, 1])
- `mean`: Mean values for each channel (e.g., [0.485, 0.456, 0.406] for ImageNet)
- `std`: Standard deviation for each channel (e.g., [0.229, 0.224, 0.225])
- `order`: Channel order - `hwc` (height, width, channel) or `chw`

Normalization Formula:
normalized_image = (image * scale - mean) / std
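As a quick arithmetic check, applying the formula with ImageNet statistics maps a fully saturated channel (255) to roughly +2.25 and a zero channel to roughly -2.04:

```python
import numpy as np

# ImageNet-style per-channel normalization applied to one pixel
scale = 1.0 / 255.0
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

pixel = np.array([255.0, 0.0, 128.0])        # raw channel intensities
normalized = (pixel * scale - mean) / std    # the formula above

# channel 0: (1.0 - 0.485) / 0.229 ≈ 2.25; channel 1: (0 - 0.456) / 0.224 ≈ -2.04
```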
Common Configurations:
| Use Case | scale | mean | std | Result Range |
|---|---|---|---|---|
| Standard ImageNet | 1./255. | [0.485, 0.456, 0.406] | [0.229, 0.224, 0.225] | Approximately [-2, 2] |
| Simple [0,1] | 1./255. | [0, 0, 0] | [1, 1, 1] | [0, 1] |
| Centered [-1,1] | 1./255. | [0.5, 0.5, 0.5] | [0.5, 0.5, 0.5] | [-1, 1] |
Example Usage in Config:
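A representative fragment using the standard ImageNet values (modeled on the shipped PaddleOCR configs):

```yaml
- NormalizeImage:
    scale: 1./255.
    mean: [0.485, 0.456, 0.406]
    std: [0.229, 0.224, 0.225]
    order: hwc
```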
Sources: ppocr/data/imaug/operators.py
The ToCHWImage operator transposes image arrays from (H, W, C) to (C, H, W) format, which is required by PaddlePaddle models.
Implementation: ppocr/data/imaug/operators.py
Process:
This operator is typically placed after NormalizeImage and before batch collation.
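The transpose itself is a one-liner; `ToCHWImage` is equivalent to the following on `data['image']`:

```python
import numpy as np

# HWC -> CHW transpose, as performed by ToCHWImage
img_hwc = np.zeros((32, 100, 3), dtype=np.float32)  # (H, W, C)
img_chw = img_hwc.transpose((2, 0, 1))              # (C, H, W)
```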
Sources: ppocr/data/imaug/operators.py
Label encoding converts raw text annotations into numerical indices that models can process. The system provides specialized encoders for different model architectures and task types.
Diagram: Label Encoder Class Hierarchy
Sources: ppocr/data/imaug/label_ops.py101-167
| Encoder Class | Algorithm | Special Tokens | Max Length Handling | Output Format |
|---|---|---|---|---|
| `CTCLabelEncode` | CTC-based models | blank (index 0) | Pads with 0 | `label`: indices array, `length`: text length |
| `AttnLabelEncode` | Attention models | sos, eos | Rejects if ≥ max_length | `label`: [sos] + text + [eos] + padding |
| `SRNLabelEncode` | SRN | sos, eos | Rejects if > max_length | `label`: text + [eos] + padding |
| `SEEDLabelEncode` | SEED | eos, padding, unknown | Rejects if ≥ max_length | `label`: text + [eos] + [padding]* |
| `RFLLabelEncode` | RFL | sos, eos | Rejects if ≥ max_length | `label`: [sos] + text + [eos], `cnt_label`: char counts |
| `ParseQLabelEncode` | ParseQ | [B], [E], [P] | Configured via max_text_length | Filters out EOS during decode |
| `SARLabelEncode` | SAR | <BOS/EOS>, <UKN>, <PAD> | Via TableLabelEncode parent | Special token indices |
Sources: ppocr/data/imaug/label_ops.py169-630
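A minimal sketch of the CTC-style encoding from the table above, with a hypothetical three-character dictionary (the real `CTCLabelEncode` loads its character dictionary from a file and reserves index 0 for the CTC blank):

```python
# Hypothetical dictionary; index 0 is reserved for the CTC blank
char_to_idx = {"a": 1, "b": 2, "c": 3}
max_text_length = 6

def ctc_encode(text):
    """Map text to indices and pad with 0 (blank); None discards the sample."""
    if len(text) == 0 or len(text) > max_text_length:
        return None                      # length validation
    indices = [char_to_idx[ch] for ch in text if ch in char_to_idx]
    if not indices:
        return None                      # every character was unknown
    length = len(indices)
    indices = indices + [0] * (max_text_length - length)  # pad with blank
    return {"label": indices, "length": length}
```

Samples that are too long, empty, or made entirely of out-of-dictionary characters yield `None` and are dropped by the pipeline.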
Diagram: Label Encoding Pipeline
Sources: ppocr/data/imaug/label_ops.py104-193
Detection Label Encoder (DetLabelEncode): Parses JSON annotations containing polygon coordinates and transcriptions. Marks samples with * or ### as ignore tags.
Diagram: Detection Label Encoding
Sources: ppocr/data/imaug/label_ops.py49-99
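The parsing step can be sketched as follows (simplified; the real `DetLabelEncode` also expands polygons to a uniform point count and validates box shapes):

```python
import json
import numpy as np

def det_label_encode(label_str):
    """Parse a detection label: a JSON list of {'transcription', 'points'} dicts.

    Entries whose transcription is '*' or '###' are kept but flagged as
    ignored, mirroring the ignore-tag behavior described above.
    """
    annotations = json.loads(label_str)
    boxes, texts, ignore_tags = [], [], []
    for ann in annotations:
        txt = ann["transcription"]
        boxes.append(np.array(ann["points"], dtype=np.float32))
        texts.append(txt)
        ignore_tags.append(txt in ("*", "###"))
    return {"polys": boxes, "texts": texts, "ignore_tags": ignore_tags}

label = ('[{"transcription": "hello", "points": [[0,0],[50,0],[50,20],[0,20]]},'
         ' {"transcription": "###", "points": [[60,0],[90,0],[90,20],[60,20]]}]')
result = det_label_encode(label)
```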
Table Label Encoder (TableLabelEncode): Encodes table structure as HTML-like tokens (<td>, <tr>, etc.) with bounding box coordinates for each cell.
Sources: ppocr/data/imaug/label_ops.py633-743
KIE Label Encoder (KieLabelEncode): For Key Information Extraction, computes spatial relations between text boxes and encodes both text content and geometric relationships.
Sources: ppocr/data/imaug/label_ops.py269-444
Table recognition requires specialized operators to handle structure annotations and cell-level processing.
The ResizeTableImage operator resizes table images while maintaining aspect ratio and updating bounding box coordinates.
Implementation: ppocr/data/imaug/table_ops.py
Key Parameters:
- `max_len`: Maximum dimension (width or height)
- `resize_bboxes`: Whether to rescale bounding box coordinates (default True)
- `infer_mode`: If True, only resizes the image without bbox updates

Process:
1. Compute `scale = max_len / max(height, width)`
2. Resize with `cv2.resize(img, None, fx=scale, fy=scale)`
3. Rescale bounding boxes by the same factor when `resize_bboxes=True`

Example Usage:
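A dependency-free sketch of the scaling step (hypothetical helper; `488` mirrors the `max_len` used in common PaddleOCR table configs, and numpy stands in for `cv2.resize` so the example runs without OpenCV):

```python
import numpy as np

def resize_table(img, bboxes, max_len=488):
    """Scale so the longer side equals max_len; rescale bboxes to match."""
    h, w = img.shape[:2]
    scale = max_len / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # stand-in for cv2.resize(img, None, fx=scale, fy=scale)
    resized = np.zeros((new_h, new_w) + img.shape[2:], dtype=img.dtype)
    return resized, bboxes * scale, scale

img = np.zeros((244, 122, 3), dtype=np.uint8)
bboxes = np.array([[10.0, 20.0, 60.0, 80.0]])
resized, scaled_boxes, scale = resize_table(img, bboxes)
```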
Sources: ppocr/data/imaug/table_ops.py
The GenTableMask operator generates mask tensors for table structure learning, used by table recognition models.
Implementation: ppocr/data/imaug/table_ops.py
Process:
- Generates a `bbox_masks` tensor indicating valid/invalid bounding boxes
- Generates a `structure_mask` tensor for structure token validity

Sources: ppocr/data/imaug/table_ops.py
Image augmentation improves model generalization by applying realistic transformations to training images. PaddleOCR provides algorithm-specific augmentation pipelines optimized for different model architectures.
Diagram: Augmentation Operator Classes
Sources: ppocr/data/imaug/rec_img_aug.py35-272
RecAug combines Text Image Augmentation (TIA) techniques with basic data augmentation. It is the default augmentation for most recognition models.
TIA Transformations (applied with tia_prob=0.4):
- `tia_distort()`: Applies elastic distortion (3-6 control points)
- `tia_stretch()`: Non-uniform stretching (3-6 control points)
- `tia_perspective()`: Perspective transformation

Base Augmentations (each applied with probability 0.4):
- `get_crop()` removes random edges (requires image ≥ 20×20)
- `hsv_aug()` adjusts hue/saturation/value
- `jitter()` adjusts brightness
- `add_gaussian_noise()` adds Gaussian noise

Sources: ppocr/data/imaug/rec_img_aug.py35-114
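The probability gating follows a simple pattern of independent coin flips, sketched below (`rec_aug_sketch` is a hypothetical illustration; the string names stand in for the actual distortion functions in rec_img_aug.py):

```python
import random

def rec_aug_sketch(img, tia_prob=0.4, aug_prob=0.4, rng=None):
    """Gate each augmentation on an independent coin flip (sketch)."""
    rng = rng or random.Random()
    applied = []
    if rng.random() < tia_prob:
        applied.append("tia")        # stands in for distort/stretch/perspective
    for name in ("crop", "hsv", "jitter", "noise"):
        if rng.random() < aug_prob:
            applied.append(name)     # stands in for get_crop/hsv_aug/etc.
    return img, applied
```

Because each flip is independent, any subset of augmentations can fire on a given sample.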
ABINetRecAug uses CVGeometry, CVDeterioration, and CVColorJitter from abinet_aug.py:
Sources: ppocr/data/imaug/rec_img_aug.py116-146
SVTRRecAug uses SVTRGeometry and SVTRDeterioration:
- Exposes an `aug_type` parameter for geometry variations

Sources: ppocr/data/imaug/rec_img_aug.py196-232
ParseQRecAug uses ParseQDeterioration:
- Uses `lam=20` and `radius=2.0` parameters

Sources: ppocr/data/imaug/rec_img_aug.py234-272
RecConAug concatenates multiple text samples horizontally to simulate continuous text scenarios. This is useful for training models on scene text where multiple words appear consecutively.
Configuration:
- `prob`: Probability of applying concatenation (default 0.5)
- `max_text_length`: Maximum total text length after concatenation (default 25)
- `max_wh_ratio`: Maximum width-to-height ratio (default derived from image_shape)

Process:
1. Pull candidate samples from `ext_data` (extra samples provided by the dataloader)
2. Accept a candidate only while the combined text length ≤ `max_text_length` and the combined aspect ratio ≤ `max_wh_ratio`
3. Concatenate the images horizontally and append labels: `data['label'] += ext_data['label']`

Sources: ppocr/data/imaug/rec_img_aug.py148-194
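The acceptance check can be sketched as below (`can_concat` is a hypothetical helper; the `8.0` ratio limit is illustrative, not a value taken from the PaddleOCR source):

```python
def can_concat(cur_label, ext_label, cur_w, ext_w, height,
               max_text_length=25, max_wh_ratio=8.0):
    """Accept an extra sample only if combined length and ratio fit."""
    if len(cur_label) + len(ext_label) > max_text_length:
        return False                      # combined text too long
    if (cur_w + ext_w) / height > max_wh_ratio:
        return False                      # combined image too wide
    return True
```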
After augmentation, images must be resized to fixed dimensions and normalized for model input. Different algorithms require specialized resize strategies.
Diagram: Resize Operator Variants
Sources: ppocr/data/imaug/rec_img_aug.py285-640
The most common resize operation maintains aspect ratio with optional padding:
Diagram: RecResizeImg Processing Flow
Key Functions:
- `resize_norm_img(img, image_shape, padding=True)`: Resizes the image while preserving aspect ratio. If the width exceeds the target, resizes to the target width; otherwise resizes to the calculated width and pads with zeros when `padding=True`.
- `resize_norm_img_chinese(img, image_shape)`: Special handling for images with Chinese characters; uses a different normalization.

Sources: ppocr/data/imaug/rec_img_aug.py285-312
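The shape logic of `resize_norm_img` can be sketched as follows (`resize_norm_sketch` is a simplified stand-in; the real function also normalizes pixel values and uses cv2 for the actual interpolation):

```python
import math
import numpy as np

def resize_norm_sketch(img, image_shape=(3, 32, 320), padding=True):
    """Aspect-preserving resize to (imgC, imgH, imgW) with right-side zero padding."""
    imgC, imgH, imgW = image_shape
    h, w = img.shape[:2]
    ratio = w / float(h)
    if math.ceil(imgH * ratio) > imgW:
        resized_w = imgW                           # too wide: clamp to target width
    else:
        resized_w = int(math.ceil(imgH * ratio))   # preserve aspect ratio
    # stand-in for cv2.resize(img, (resized_w, imgH))
    resized = np.zeros((imgH, resized_w, imgC), dtype=np.float32)
    padded = np.zeros((imgH, imgW, imgC), dtype=np.float32)
    padded[:, :resized_w, :] = resized             # zero padding on the right
    valid_ratio = resized_w / float(imgW)          # fraction of width with content
    return (padded if padding else resized), valid_ratio

img = np.zeros((64, 256, 3), dtype=np.uint8)       # H=64, W=256 -> ratio 4.0
out, valid_ratio = resize_norm_sketch(img)
```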
SAR Resize (SARRecResizeImg):
- Uses `resize_norm_img_sar()`, which computes resize and padding shapes
- Outputs `resized_shape`, `pad_shape`, and `valid_ratio`
- Accounts for the `width_downsample_ratio` in SAR's architecture

Sources: ppocr/data/imaug/rec_img_aug.py403-420
SRN Resize (SRNRecResizeImg):
- Uses `resize_norm_img_srn()` for SRN-specific normalization
- Generates additional inputs via `srn_other_inputs()`:
  - `encoder_word_pos`: Position encoding for the encoder
  - `gsrm_word_pos`: GSRM word positions
  - `gsrm_slf_attn_bias1/2`: Self-attention biases

Sources: ppocr/data/imaug/rec_img_aug.py379-401
SPIN Resize (SPINRecResizeImg):
- Normalizes with `(img - mean) / std`

Sources: ppocr/data/imaug/rec_img_aug.py442-487
RobustScanner Resize (RobustScannerRecResizeImg):
- Adds a `word_positons` array: `np.array(range(0, max_text_length))`

Sources: ppocr/data/imaug/rec_img_aug.py572-640
| Resize Class | Grayscale | Padding | Aspect Ratio | Special Outputs | Interpolation |
|---|---|---|---|---|---|
| `RecResizeImg` | No | Optional | Preserved | valid_ratio | LINEAR |
| `SARRecResizeImg` | No | Yes | Preserved | resized_shape, pad_shape, valid_ratio | LINEAR |
| `SRNRecResizeImg` | No | No | SRN-specific | encoder_word_pos, gsrm_word_pos, attention biases | LINEAR |
| `ABINetRecResizeImg` | No | Yes | Preserved | valid_ratio | CUBIC |
| `SVTRRecResizeImg` | No | Optional | Preserved | valid_ratio | LINEAR |
| `RFLRecResizeImg` | Yes | Optional | Preserved | valid_ratio | Configurable |
| `RobustScannerRecResizeImg` | No | Yes | Preserved | resized_shape, pad_shape, valid_ratio, word_positons | LINEAR |
| `GrayRecResizeImg` | Yes | Optional | Varies | None | PIL/OpenCV |
| `PRENResizeImg` | No | No | Not preserved | None | LINEAR |
| `SPINRecResizeImg` | Yes | No | Not preserved | None | Configurable |
Sources: ppocr/data/imaug/rec_img_aug.py285-640
The transform pipeline is implemented through two key functions in ppocr/data/imaug/__init__.py:
Each operator implements the __call__(self, data) method which:
1. Receives the data dictionary (e.g., `{"image": np.ndarray, "label": str}`)
2. Modifies or adds keys
3. Returns the updated dictionary, or `None` to discard the sample

Diagram: Operator Call Chain
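A minimal custom operator following this protocol might look like the following (`ToGray` is a hypothetical example, not a PaddleOCR operator; channel averaging stands in for real preprocessing work):

```python
import numpy as np

class ToGray:
    """Minimal custom operator implementing the __call__(self, data) protocol."""

    def __init__(self, **kwargs):
        pass  # operators receive their YAML parameters as keyword arguments

    def __call__(self, data):
        img = data["image"]
        if img is None:
            return None          # returning None discards the sample
        data["image"] = img.mean(axis=2, keepdims=True)  # (H, W, C) -> (H, W, 1)
        return data
```

Registering such a class in ppocr/data/imaug/__init__.py makes it available to the config-driven pipeline.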
Sources: ppocr/data/imaug/__init__.py68-96 ppocr/data/imaug/operators.py ppocr/data/imaug/label_ops.py153-193 ppocr/data/imaug/rec_img_aug.py52-66
A typical YAML configuration for recognition training:
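A representative fragment, modeled on the recognition configs shipped with PaddleOCR (the paths and `image_shape` here are placeholders):

```yaml
Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/
    label_file_list: ["./train_data/train_list.txt"]
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - CTCLabelEncode:
      - RecResizeImg:
          image_shape: [3, 32, 320]
      - KeepKeys:
          keep_keys: ["image", "label", "length"]
```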
The execution flow during training:
Sources: ppocr/data/imaug/__init__.py68-96 tools/train.py54-62 tools/program.py319-383
The data processing pipeline integrates with the training system through build_dataloader():
Diagram: Data Pipeline Integration
Sources: tools/train.py54-62 tools/program.py191-649 ppocr/data/__init__.py ppocr/data/simple_dataset.py ppocr/data/imaug/__init__.py68-96
The build_dataloader() function creates the data pipeline:
Key Steps:
1. Read the dataset config from `config['Train']['dataset']`
2. Select the dataset class from the `name` field:
   - `SimpleDataSet`: For standard image-label pairs
   - `LMDBDataSet`: For LMDB database format
   - `PubTabDataSet`: For table recognition
   - `PGDataSet`: For end-to-end OCR
3. Build the transform pipeline with `create_operators(transforms, global_config)`
4. Wrap the dataset in a `paddle.io.DataLoader` with batching, shuffling, and worker processes (`num_workers`)

Implementation: ppocr/data/__init__.py
Usage:
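A self-contained sketch of the dispatch step (stand-in classes replace the real datasets and `paddle.io.DataLoader` so the example runs without PaddlePaddle; `build_dataloader_sketch` and `SUPPORTED` are hypothetical names):

```python
class SimpleDataSet:
    """Stand-in for the real dataset class."""
    def __init__(self, config):
        self.samples = config.get("samples", [])

    def __len__(self):
        return len(self.samples)

class LMDBDataSet(SimpleDataSet):
    pass

# registry mirroring the name-field dispatch described above
SUPPORTED = {"SimpleDataSet": SimpleDataSet, "LMDBDataSet": LMDBDataSet}

def build_dataloader_sketch(config, mode):
    """Pick the dataset class from the 'name' field, as build_dataloader does."""
    ds_cfg = config[mode]["dataset"]
    name = ds_cfg["name"]
    if name not in SUPPORTED:
        raise ValueError(f"unsupported dataset: {name}")
    dataset = SUPPORTED[name](ds_cfg)
    # the real function wraps this in paddle.io.DataLoader with a batch sampler
    return dataset

cfg = {"Train": {"dataset": {"name": "SimpleDataSet", "samples": [1, 2, 3]}}}
loader = build_dataloader_sketch(cfg, "Train")
```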
Sources: ppocr/data/__init__.py tools/train.py54-62
The training loop in tools/program.py::train() receives batches where structure varies by algorithm:
Recognition Models: each batch is a list whose first entry is the image tensor; the remaining entries vary by algorithm.
Algorithm-Specific Batch Structures:
| Algorithm | batch[0] | batch[1] | batch[2+] |
|---|---|---|---|
| CTC/CRNN | images | label indices | length |
| Attention | images | label sequences (with sos/eos) | length |
| SRN | images | labels | encoder_word_pos, gsrm_word_pos, gsrm_slf_attn_bias1, gsrm_slf_attn_bias2 |
| SAR | images | labels | valid_ratio |
| RobustScanner | images | labels | valid_ratio, word_positons |
| Table | images | structure, bboxes, bbox_masks | (varies) |
| KIE | Entire data dict | (no separation) | N/A |
The training loop handles these variations at tools/program.py326-383.
Sources: tools/program.py326-383
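The per-algorithm dispatch can be pictured as the following sketch (`unpack_batch` is a hypothetical illustration of the table above; the real loop at tools/program.py also handles AMP and KIE's dict-style batches):

```python
def unpack_batch(algorithm, batch):
    """Route batch entries to named tensors per the table above (sketch)."""
    images = batch[0]
    if algorithm in ("CRNN", "CTC", "Attention"):
        return {"images": images, "labels": batch[1], "lengths": batch[2]}
    if algorithm == "SRN":
        extras = ("encoder_word_pos", "gsrm_word_pos",
                  "gsrm_slf_attn_bias1", "gsrm_slf_attn_bias2")
        return {"images": images, "labels": batch[1],
                **dict(zip(extras, batch[2:]))}
    if algorithm in ("SAR", "RobustScanner"):
        return {"images": images, "labels": batch[1], "valid_ratio": batch[2]}
    raise ValueError(f"no unpack rule for {algorithm}")

out = unpack_batch("SAR", ["imgs", "lbls", 0.8])
```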
The pipeline includes multiple validation points:
Length Validation: Label encoders reject samples whose text exceeds `max_text_length`.
Sources: ppocr/data/imaug/label_ops.py153-154
Character Validation: Unknown characters are silently skipped during encoding.
Sources: ppocr/data/imaug/label_ops.py158-163
Image Size Validation: Some augmentations require minimum image dimensions.
Sources: ppocr/data/imaug/rec_img_aug.py58-59
Empty Result Handling: If label encoding produces an empty result, the sample is discarded.
Sources: ppocr/data/imaug/label_ops.py164-165
The PaddleOCR data processing and augmentation pipeline provides a flexible, extensible system for preparing OCR training data:
- Label encoders (`CTCLabelEncode`, `AttnLabelEncode`, etc.) convert text to indices with appropriate special tokens
- Algorithm-specific augmentations (`RecAug`, `ABINetRecAug`, `SVTRRecAug`) improve model robustness
- Config-driven pipelines built with `create_operators()` and `transform()` enable easy experimentation

The modular design allows adding new operators by implementing the `__call__(self, data)` interface and registering them in ppocr/data/imaug/__init__.py.
Sources: ppocr/data/imaug/__init__.py1-96 ppocr/data/imaug/label_ops.py1-1327 ppocr/data/imaug/rec_img_aug.py1-900 tools/program.py191-649