Knowledge distillation is a model compression technique used in PaddleOCR to transfer knowledge from a large, accurate "teacher" model to a smaller, more efficient "student" model. This page documents the distillation infrastructure, including the DistillationModel architecture, specialized loss functions, and training strategies for both text detection and recognition tasks.
For information about the base model architecture used within distillation, see Model Architecture Building. For loss functions used in standard (non-distillation) training, see Loss Functions and Metrics.
Knowledge distillation in PaddleOCR uses a multi-model wrapper architecture that simultaneously processes inputs through teacher and student models, then computes both task-specific losses and distillation losses to guide student learning.
Distillation Training Flow
Sources: ppocr/modeling/architectures/distillation_model.py1-61 ppocr/losses/combined_loss.py1-85
The DistillationModel class wraps multiple BaseModel instances to enable simultaneous teacher-student training. It manages model initialization, parameter freezing, and forward pass orchestration.
Class Structure
Key Implementation Details:
Model Initialization ppocr/modeling/architectures/distillation_model.py29-54:
config["Models"] dictionaryBaseModel for each entry (typically "Teacher" and "Student")pretrained parameter is specifiedfreeze_params=True (typically for teacher)model_list with corresponding names in model_name_listParameter Freezing ppocr/modeling/architectures/distillation_model.py50-52:
Forward Pass ppocr/modeling/architectures/distillation_model.py56-60:
model_list{"Teacher": teacher_output, "Student": student_output}Sources: ppocr/modeling/architectures/distillation_model.py1-61
The distillation training system uses a combination of task-specific losses (to ensure the student learns the actual task) and distillation losses (to transfer knowledge from teacher to student).
Task losses are standard loss functions wrapped for distillation scenarios. They compute loss only on student predictions against ground truth labels.
| Loss Class | Purpose | Key Parameter |
|---|---|---|
| DistillationCTCLoss | CTC loss for recognition | model_name_list - which models to compute loss for |
| DistillationDBLoss | DB loss for detection | model_name_list |
| DistillationSARLoss | SAR loss for recognition | model_name_list |
| DistillationNRTRLoss | NRTR loss for recognition | model_name_list |
Example: DistillationCTCLoss ppocr/losses/distillation_loss.py613-637
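The wrapper pattern can be illustrated with a simplified sketch. The function and loss-key names here (`distillation_task_loss`, `loss_ctc_<name>_<idx>`) are illustrative assumptions, not the exact identifiers from `distillation_loss.py`; a toy mean-absolute-error stands in for the real CTC loss.

```python
# Sketch of the task-loss wrapper pattern: apply a base loss only to the
# models named in model_name_list (typically just the student).
def distillation_task_loss(base_loss, predicts, batch, model_name_list):
    loss_dict = {}
    for idx, name in enumerate(model_name_list):
        # predicts maps model name -> that model's raw output
        loss = base_loss(predicts[name], batch)
        loss_dict[f"loss_ctc_{name}_{idx}"] = loss   # key naming assumed
    return loss_dict

# Toy base loss standing in for CTC: mean absolute error vs. the label
mae = lambda pred, label: sum(abs(p - l) for p, l in zip(pred, label)) / len(pred)

losses = distillation_task_loss(
    mae,
    predicts={"Teacher": [1.0, 2.0], "Student": [0.5, 1.5]},
    batch=[1.0, 2.0],
    model_name_list=["Student"],   # ground-truth loss only on the student
)
# losses == {"loss_ctc_Student_0": 0.5}
```

The teacher's predictions are present in `predicts` but untouched here; only the distillation losses (next section) compare teacher and student.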
Sources: ppocr/losses/distillation_loss.py613-637 ppocr/losses/distillation_loss.py640-668
Distillation losses transfer knowledge by matching teacher and student outputs. Three main types are supported: DML, KL Divergence, and DKD.
Distillation Loss Comparison
Sources: ppocr/losses/basic_loss.py95-134 ppocr/losses/basic_loss.py173-197 ppocr/losses/basic_loss.py199-248
Implementation: DistillationDMLLoss ppocr/losses/distillation_loss.py45-141
Key Features:
- Supports `softmax` or `sigmoid` activation on the outputs before computing the loss

Base DML Loss ppocr/losses/basic_loss.py95-134:
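The base DML loss averages the two directed KL divergences, making the loss symmetric in its two inputs. A small numeric sketch (pure Python; the real loss operates on Paddle tensors):

```python
import math

def kl(p, q):
    # KL(p || q) for discrete probability distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dml_loss(out1, out2):
    # Deep Mutual Learning: average of the two directed KL divergences,
    # i.e. 0.5 * [KL(out1||out2) + KL(out2||out1)]
    return 0.5 * (kl(out1, out2) + kl(out2, out1))

p = [0.7, 0.3]   # e.g. softmax output of model 1
q = [0.6, 0.4]   # e.g. softmax output of model 2
assert dml_loss(p, q) == dml_loss(q, p)   # symmetric by construction
```

Because neither model is treated as the fixed reference, DML also suits mutual-learning setups where both models update, not just frozen-teacher distillation.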
Computes the symmetric divergence `0.5 * [KL(out1||out2) + KL(out2||out1)]`.

Implementation: DistillationKLDivLoss ppocr/losses/distillation_loss.py143-246
Key Features:
Base KLDiv Loss ppocr/losses/basic_loss.py173-197:
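A common knowledge-distillation formulation, assumed here as a sketch of the idea rather than the exact PaddleOCR code: soften both distributions with a temperature T, take KL(teacher || student), and scale by T² to keep gradient magnitudes comparable across temperatures.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with the temperature, then compute
    # KL(teacher || student), scaled by T^2 (Hinton-style distillation).
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

loss = kd_kl_loss([2.0, 0.5, 0.1], [2.2, 0.4, 0.0], temperature=2.0)
assert loss >= 0.0   # KL divergence is non-negative
```

Higher temperatures expose more of the teacher's "dark knowledge" in the non-argmax classes, which is precisely what the student cannot learn from one-hot labels alone.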
Implementation: DistillationDKDLoss ppocr/losses/distillation_loss.py248-357
Key Features:
Computes the decoupled loss `α * TCKD + β * NCKD`, separating target-class from non-target-class knowledge transfer.

Base DKD Loss ppocr/losses/basic_loss.py199-248:
- `temperature`: Softening factor (default 1.0)
- `alpha`: Weight for TCKD (default 1.0)
- `beta`: Weight for NCKD (default 1.0)

Sources: ppocr/losses/distillation_loss.py45-141 ppocr/losses/distillation_loss.py143-246 ppocr/losses/distillation_loss.py248-357 ppocr/losses/basic_loss.py95-248
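The decoupled formulation can be sketched for a single sample. This is a simplified illustration of the TCKD/NCKD split, not the batched tensor implementation in `basic_loss.py`: TCKD is a binary KL over the target vs. non-target probability mass, and NCKD is a KL over the non-target classes after renormalization.

```python
import math

def softmax(logits, t=1.0):
    exps = [math.exp(z / t) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def binary_kl(pt, ps):
    # KL between the two-point distributions [p, 1-p]
    return pt * math.log(pt / ps) + (1 - pt) * math.log((1 - pt) / (1 - ps))

def dkd_loss(s_logits, t_logits, target, alpha=1.0, beta=1.0, temperature=1.0):
    ps = softmax(s_logits, temperature)
    pt = softmax(t_logits, temperature)
    # TCKD: match the teacher's target / non-target probability split
    tckd = binary_kl(pt[target], ps[target])
    # NCKD: KL over the non-target classes, each renormalized to sum to 1
    nt_s = [p / (1 - ps[target]) for i, p in enumerate(ps) if i != target]
    nt_t = [p / (1 - pt[target]) for i, p in enumerate(pt) if i != target]
    nckd = sum(p * math.log(p / q) for p, q in zip(nt_t, nt_s))
    return alpha * tckd + beta * nckd

loss = dkd_loss([2.0, 1.0, 0.1], [3.0, 0.5, 0.2], target=0)
assert loss >= 0.0
```

Decoupling lets `beta` up-weight NCKD, which classical KD implicitly suppresses when the teacher is confident on the target class.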
The CombinedLoss class aggregates multiple loss functions with individual weights, enabling flexible multi-objective optimization during distillation training.
Architecture:
Implementation ppocr/losses/combined_loss.py43-84:
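The aggregation pattern can be sketched as follows. Each configured loss returns a dict of named scalars; entries are scaled by their weight and summed into a total under the `"loss"` key. The loss-key names in the example are illustrative assumptions.

```python
# Sketch of CombinedLoss-style aggregation: weight each sub-loss's named
# outputs and accumulate a single total under the "loss" key.
def combined_loss(loss_fns, weights, predicts, batch):
    out = {}
    total = 0.0
    for fn, w in zip(loss_fns, weights):
        for name, value in fn(predicts, batch).items():
            weighted = w * value
            out[name] = weighted   # keep per-loss values for logging
            total += weighted
    out["loss"] = total            # single scalar used for backprop
    return out

# Two toy sub-losses with fixed values (names illustrative)
task = lambda predicts, batch: {"loss_ctc_Student": 0.8}
distill = lambda predicts, batch: {"loss_dml_Student_Teacher": 0.4}

result = combined_loss([task, distill], [1.0, 0.5], predicts=None, batch=None)
# result["loss"] == 0.8 + 0.5 * 0.4 == 1.0
```

Keeping the weighted sub-losses alongside the total makes per-objective training curves easy to monitor.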
Sources: ppocr/losses/combined_loss.py1-85
Distillation training is configured through YAML files that specify the model architecture, loss composition, and training parameters.
Example Configuration Structure:
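An abbreviated, illustrative sketch of what such a YAML file can look like (section and key names follow the parameters discussed in this page; paths, algorithm choice, and the omitted Backbone/Neck/Head and optimizer sections are placeholders, not a runnable config):

```yaml
# Illustrative distillation config sketch (abbreviated)
Architecture:
  model_type: rec
  algorithm: Distillation
  Models:
    Teacher:
      freeze_params: true                    # teacher stays fixed
      pretrained: ./pretrain_models/teacher  # placeholder path
      # ... full Backbone/Neck/Head definition ...
    Student:
      freeze_params: false
      pretrained:
      # ... full Backbone/Neck/Head definition ...

Loss:
  name: CombinedLoss
  loss_config_list:
    - DistillationCTCLoss:
        weight: 1.0
        model_name_list: ["Student"]   # task loss on the student only
        key: head_out
    - DistillationDMLLoss:
        weight: 1.0
        model_name_pairs: [["Student", "Teacher"]]
        key: head_out
```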
Key Configuration Parameters:
| Parameter | Location | Purpose |
|---|---|---|
| freeze_params | Architecture.Models.<name> | Whether to freeze model parameters (True for teacher) |
| pretrained | Architecture.Models.<name> | Path to pretrained weights to load |
| model_name_list | Loss.*.model_name_list | Which models to apply task loss to (typically ["Student"]) |
| model_name_pairs | Loss.*.model_name_pairs | Model pairs for distillation loss (e.g., [["Student", "Teacher"]]) |
| key | Loss.*.key | Output key to extract from model predictions |
| weight | Loss.*.weight | Weight for this loss in CombinedLoss aggregation |
Sources: ppocr/modeling/architectures/distillation_model.py29-54 ppocr/losses/combined_loss.py43-84
The distillation system uses specialized model loading logic to support pretrained teacher models and selective parameter freezing.
Model Loading Flow:
Implementation Details:
Pretrained Model Loading ppocr/modeling/architectures/distillation_model.py47-49:
The load_pretrained_params function ppocr/utils/save_load.py172-212:
- Loads the `.pdparams` file from the specified path

Parameter Freezing ppocr/modeling/architectures/distillation_model.py50-52:
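The freezing step follows a common PaddlePaddle idiom: iterate over the module's parameters and mark each non-trainable. A self-contained sketch with stand-in classes (the real code operates on a Paddle `Layer`):

```python
# Sketch of the freeze_params=True branch with stand-in classes.
class FakeParam:
    def __init__(self):
        self.trainable = True   # Paddle parameters expose a trainable flag

class FakeModel:
    def __init__(self, n_params):
        self._params = [FakeParam() for _ in range(n_params)]

    def parameters(self):
        return self._params

def freeze(model):
    # Mark every parameter non-trainable so the optimizer skips it and
    # the teacher stays fixed throughout distillation training.
    for param in model.parameters():
        param.trainable = False

teacher = FakeModel(3)
freeze(teacher)
assert all(not p.trainable for p in teacher.parameters())
```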
Checkpoint Saving ppocr/utils/save_load.py214-281:
Saves optimizer states (`.pdopt`), model parameters (`.pdparams`), and training states (`.states`).

Sources: ppocr/modeling/architectures/distillation_model.py29-54 ppocr/utils/save_load.py172-212 ppocr/utils/save_load.py214-281
Detection models using distillation require specialized post-processing to handle outputs from multiple models.
DistillationDBPostProcess ppocr/postprocess/db_postprocess.py259-290:
Key Features:
- Selects which models' outputs to post-process via the `model_name` parameter (typically `["student"]`)
- Delegates to `DBPostProcess` for the actual box extraction ppocr/postprocess/db_postprocess.py275-283

Implementation ppocr/postprocess/db_postprocess.py259-290:
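The select-and-delegate pattern can be sketched as follows. The function name and the callable standing in for `DBPostProcess` are illustrative, not the actual class interface:

```python
# Sketch of the DistillationDBPostProcess pattern: pick the named
# sub-model outputs out of the distillation dict and delegate each to
# the standard DB post-processor (stood in here by a plain callable).
def distillation_db_postprocess(outs_dict, shape_list, base_postprocess,
                                model_name=("Student",)):
    results = {}
    for name in model_name:
        # outs_dict maps model name -> that model's prediction maps
        results[name] = base_postprocess(outs_dict[name], shape_list)
    return results

fake_db = lambda pred, shapes: {"boxes": [pred]}  # stand-in for DBPostProcess
out = distillation_db_postprocess(
    {"Student": "student_maps", "Teacher": "teacher_maps"},
    shape_list=None,
    base_postprocess=fake_db,
    model_name=["Student"],
)
# out == {"Student": {"boxes": ["student_maps"]}}
```

At inference time only the student's boxes are needed, which is why the default selection excludes the teacher.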
Sources: ppocr/postprocess/db_postprocess.py259-290
The DistillationMetric class wraps standard metrics to evaluate multiple models during distillation training.
Architecture:
Key Methods ppocr/metrics/distillation_metric.py1-73:
Initialization ppocr/metrics/distillation_metric.py27-33:
- `key`: Model name to use as primary (typically "Student")
- `base_metric_name`: Base metric class (e.g., "RecMetric", "DetMetric")
- `main_indicator`: Primary metric to track (e.g., "acc")

Lazy Initialization ppocr/metrics/distillation_metric.py35-42:
Metric Aggregation ppocr/metrics/distillation_metric.py52-68:
Example Output:
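An illustrative aggregated result (all values hypothetical, and the exact key-naming scheme is an assumption): metrics from the primary model selected by `key` drive the main indicator, while other models' metrics are namespaced with the model name.

```python
# Hypothetical aggregated metrics for a recognition distillation run.
metrics = {
    "acc": 0.942,             # Student accuracy (main indicator)
    "norm_edit_dis": 0.987,   # Student normalized edit distance
    "Teacher_acc": 0.965,     # teacher metrics, prefixed by model name
    "Teacher_norm_edit_dis": 0.991,
}
assert "acc" in metrics and "Teacher_acc" in metrics
```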
Sources: ppocr/metrics/distillation_metric.py1-73
Common distillation training patterns in PaddleOCR:
Detection Distillation (DB Model):
- `DistillationDBLoss` (weight: 1.0) - Task loss on student
- `DistillationDMLLoss` (weight: 1.0) - DML on shrink_maps and threshold_maps

Recognition Distillation (CTC Model):
- `DistillationCTCLoss` (weight: 1.0) - CTC loss on student
- `DistillationDMLLoss` (weight: 1.0, use_log: true) - DML on CTC logits
- `DistillationKLDivLoss` - Additional knowledge transfer

Multi-head Recognition:
- Set `multi_head=True` in the loss configuration

Sources: ppocr/losses/distillation_loss.py45-141 ppocr/losses/distillation_loss.py613-637