Knowledge distillation is a model compression technique used in PaddleOCR to transfer knowledge from a large, accurate "teacher" model to a smaller, more efficient "student" model. This page documents the distillation infrastructure, including the DistillationModel architecture, specialized loss functions, and training strategies for both text detection and recognition tasks.
For information about the base model architecture used within distillation, see Model Architecture Building. For loss functions used in standard (non-distillation) training, see Loss Functions and Metrics.
Knowledge distillation in PaddleOCR uses a multi-model wrapper architecture that simultaneously processes inputs through teacher and student models, then computes both task-specific losses and distillation losses to guide student learning.
Distillation Training Flow
Sources: ppocr/modeling/architectures/distillation_model.py1-61 ppocr/losses/combined_loss.py1-85
The DistillationModel class wraps multiple BaseModel instances to enable simultaneous teacher-student training. It manages model initialization, parameter freezing, and forward pass orchestration.
Class Structure
Key Implementation Details:
Model Initialization ppocr/modeling/architectures/distillation_model.py29-54:
config["Models"] dictionaryBaseModel for each entry (typically "Teacher" and "Student")pretrained parameter is specifiedfreeze_params=True (typically for teacher)model_list with corresponding names in model_name_listParameter Freezing ppocr/modeling/architectures/distillation_model.py50-52:
Forward Pass ppocr/modeling/architectures/distillation_model.py56-60:
model_list{"Teacher": teacher_output, "Student": student_output}Sources: ppocr/modeling/architectures/distillation_model.py1-61
The distillation training system uses a combination of task-specific losses (to ensure the student learns the actual task) and distillation losses (to transfer knowledge from teacher to student).
Task losses are standard loss functions wrapped for distillation scenarios. They compute loss only on student predictions against ground truth labels.
| Loss Class | Purpose | Key Parameter |
|---|---|---|
| DistillationCTCLoss | CTC loss for recognition | model_name_list - which models to compute loss for |
| DistillationDBLoss | DB loss for detection | model_name_list |
| DistillationSARLoss | SAR loss for recognition | model_name_list |
| DistillationNRTRLoss | NRTR loss for recognition | model_name_list |
Example: DistillationCTCLoss ppocr/losses/distillation_loss.py613-637
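The wrapper pattern can be illustrated with a simplified sketch. The function and loss-key names here (`distillation_task_loss`, `loss_ctc_<name>_<idx>`) are illustrative assumptions, not the exact identifiers from `distillation_loss.py`; a toy mean-absolute-error stands in for the real CTC loss.

```python
# Sketch of the task-loss wrapper pattern: apply a base loss only to the
# models named in model_name_list (typically just the student).
def distillation_task_loss(base_loss, predicts, batch, model_name_list):
    loss_dict = {}
    for idx, name in enumerate(model_name_list):
        # predicts maps model name -> that model's raw output
        loss = base_loss(predicts[name], batch)
        loss_dict[f"loss_ctc_{name}_{idx}"] = loss   # key naming assumed
    return loss_dict

# Toy base loss standing in for CTC: mean absolute error vs. the label
mae = lambda pred, label: sum(abs(p - l) for p, l in zip(pred, label)) / len(pred)

losses = distillation_task_loss(
    mae,
    predicts={"Teacher": [1.0, 2.0], "Student": [0.5, 1.5]},
    batch=[1.0, 2.0],
    model_name_list=["Student"],   # ground-truth loss only on the student
)
# losses == {"loss_ctc_Student_0": 0.5}
```

The teacher's predictions are present in `predicts` but untouched here; only the distillation losses (next section) compare teacher and student.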
Sources: ppocr/losses/distillation_loss.py613-637 ppocr/losses/distillation_loss.py640-668
Distillation losses transfer knowledge by matching teacher and student outputs. Three main types are supported: DML, KL Divergence, and DKD.
Distillation Loss Comparison
Sources: ppocr/losses/basic_loss.py95-134 ppocr/losses/basic_loss.py173-197 ppocr/losses/basic_loss.py199-248
Implementation: DistillationDMLLoss ppocr/losses/distillation_loss.py45-141
Key Features:
- Supports `softmax` or `sigmoid` activation on the outputs before computing the loss

Base DML Loss ppocr/losses/basic_loss.py95-134:
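The base DML loss averages the two directed KL divergences, making the loss symmetric in its two inputs. A small numeric sketch (pure Python; the real loss operates on Paddle tensors):

```python
import math

def kl(p, q):
    # KL(p || q) for discrete probability distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dml_loss(out1, out2):
    # Deep Mutual Learning: average of the two directed KL divergences,
    # i.e. 0.5 * [KL(out1||out2) + KL(out2||out1)]
    return 0.5 * (kl(out1, out2) + kl(out2, out1))

p = [0.7, 0.3]   # e.g. softmax output of model 1
q = [0.6, 0.4]   # e.g. softmax output of model 2
assert dml_loss(p, q) == dml_loss(q, p)   # symmetric by construction
```

Because neither model is treated as the fixed reference, DML also suits mutual-learning setups where both models update, not just frozen-teacher distillation.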
Computes the symmetric divergence `0.5 * [KL(out1||out2) + KL(out2||out1)]`.

Implementation: DistillationKLDivLoss ppocr/losses/distillation_loss.py143-246
Key Features:
Base KLDiv Loss ppocr/losses/basic_loss.py173-197:
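A common knowledge-distillation formulation, assumed here as a sketch of the idea rather than the exact PaddleOCR code: soften both distributions with a temperature T, take KL(teacher || student), and scale by T² to keep gradient magnitudes comparable across temperatures.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with the temperature, then compute
    # KL(teacher || student), scaled by T^2 (Hinton-style distillation).
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

loss = kd_kl_loss([2.0, 0.5, 0.1], [2.2, 0.4, 0.0], temperature=2.0)
assert loss >= 0.0   # KL divergence is non-negative
```

Higher temperatures expose more of the teacher's "dark knowledge" in the non-argmax classes, which is precisely what the student cannot learn from one-hot labels alone.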
Implementation: DistillationDKDLoss ppocr/losses/distillation_loss.py248-357
Key Features:
Computes the decoupled loss `α * TCKD + β * NCKD`, separating target-class from non-target-class knowledge transfer.

Base DKD Loss ppocr/losses/basic_loss.py199-248:
- `temperature`: Softening factor (default 1.0)
- `alpha`: Weight for TCKD (default 1.0)
- `beta`: Weight for NCKD (default 1.0)

Sources: ppocr/losses/distillation_loss.py45-141 ppocr/losses/distillation_loss.py143-246 ppocr/losses/distillation_loss.py248-357 ppocr/losses/basic_loss.py95-248
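The decoupled formulation can be sketched for a single sample. This is a simplified illustration of the TCKD/NCKD split, not the batched tensor implementation in `basic_loss.py`: TCKD is a binary KL over the target vs. non-target probability mass, and NCKD is a KL over the non-target classes after renormalization.

```python
import math

def softmax(logits, t=1.0):
    exps = [math.exp(z / t) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def binary_kl(pt, ps):
    # KL between the two-point distributions [p, 1-p]
    return pt * math.log(pt / ps) + (1 - pt) * math.log((1 - pt) / (1 - ps))

def dkd_loss(s_logits, t_logits, target, alpha=1.0, beta=1.0, temperature=1.0):
    ps = softmax(s_logits, temperature)
    pt = softmax(t_logits, temperature)
    # TCKD: match the teacher's target / non-target probability split
    tckd = binary_kl(pt[target], ps[target])
    # NCKD: KL over the non-target classes, each renormalized to sum to 1
    nt_s = [p / (1 - ps[target]) for i, p in enumerate(ps) if i != target]
    nt_t = [p / (1 - pt[target]) for i, p in enumerate(pt) if i != target]
    nckd = sum(p * math.log(p / q) for p, q in zip(nt_t, nt_s))
    return alpha * tckd + beta * nckd

loss = dkd_loss([2.0, 1.0, 0.1], [3.0, 0.5, 0.2], target=0)
assert loss >= 0.0
```

Decoupling lets `beta` up-weight NCKD, which classical KD implicitly suppresses when the teacher is confident on the target class.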
The CombinedLoss class aggregates multiple loss functions with individual weights, enabling flexible multi-objective optimization during distillation training.
Architecture:
Implementation ppocr/losses/combined_loss.py43-84:
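The aggregation pattern can be sketched as follows. Each configured loss returns a dict of named scalars; entries are scaled by their weight and summed into a total under the `"loss"` key. The loss-key names in the example are illustrative assumptions.

```python
# Sketch of CombinedLoss-style aggregation: weight each sub-loss's named
# outputs and accumulate a single total under the "loss" key.
def combined_loss(loss_fns, weights, predicts, batch):
    out = {}
    total = 0.0
    for fn, w in zip(loss_fns, weights):
        for name, value in fn(predicts, batch).items():
            weighted = w * value
            out[name] = weighted   # keep per-loss values for logging
            total += weighted
    out["loss"] = total            # single scalar used for backprop
    return out

# Two toy sub-losses with fixed values (names illustrative)
task = lambda predicts, batch: {"loss_ctc_Student": 0.8}
distill = lambda predicts, batch: {"loss_dml_Student_Teacher": 0.4}

result = combined_loss([task, distill], [1.0, 0.5], predicts=None, batch=None)
# result["loss"] == 0.8 + 0.5 * 0.4 == 1.0
```

Keeping the weighted sub-losses alongside the total makes per-objective training curves easy to monitor.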
Sources: ppocr/losses/combined_loss.py1-85
Distillation training is configured through YAML files that specify the model architecture, loss composition, and training parameters.
Example Configuration Structure:
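An abbreviated, illustrative sketch of what such a YAML file can look like (section and key names follow the parameters discussed in this page; paths, algorithm choice, and the omitted Backbone/Neck/Head and optimizer sections are placeholders, not a runnable config):

```yaml
# Illustrative distillation config sketch (abbreviated)
Architecture:
  model_type: rec
  algorithm: Distillation
  Models:
    Teacher:
      freeze_params: true                    # teacher stays fixed
      pretrained: ./pretrain_models/teacher  # placeholder path
      # ... full Backbone/Neck/Head definition ...
    Student:
      freeze_params: false
      pretrained:
      # ... full Backbone/Neck/Head definition ...

Loss:
  name: CombinedLoss
  loss_config_list:
    - DistillationCTCLoss:
        weight: 1.0
        model_name_list: ["Student"]   # task loss on the student only
        key: head_out
    - DistillationDMLLoss:
        weight: 1.0
        model_name_pairs: [["Student", "Teacher"]]
        key: head_out
```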
Key Configuration Parameters:
| Parameter | Location | Purpose |
|---|---|---|
| freeze_params | Architecture.Models.<name> | Whether to freeze model parameters (True for teacher) |
| pretrained | Architecture.Models.<name> | Path to pretrained weights to load |
| model_name_list | Loss.*.model_name_list | Which models to apply task loss to (typically ["Student"]) |
| model_name_pairs | Loss.*.model_name_pairs | Model pairs for distillation loss (e.g., [["Student", "Teacher"]]) |
| key | Loss.*.key | Output key to extract from model predictions |
| weight | Loss.*.weight | Weight for this loss in CombinedLoss aggregation |
Sources: ppocr/modeling/architectures/distillation_model.py29-54 ppocr/losses/combined_loss.py43-84
The distillation system uses specialized model loading logic to support pretrained teacher models and selective parameter freezing.
Model Loading Flow:
Implementation Details:
Pretrained Model Loading ppocr/modeling/architectures/distillation_model.py47-49:
The load_pretrained_params function ppocr/utils/save_load.py172-212:
- Loads the `.pdparams` file from the specified path

Parameter Freezing ppocr/modeling/architectures/distillation_model.py50-52:
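The freezing step follows a common PaddlePaddle idiom: iterate over the module's parameters and mark each non-trainable. A self-contained sketch with stand-in classes (the real code operates on a Paddle `Layer`):

```python
# Sketch of the freeze_params=True branch with stand-in classes.
class FakeParam:
    def __init__(self):
        self.trainable = True   # Paddle parameters expose a trainable flag

class FakeModel:
    def __init__(self, n_params):
        self._params = [FakeParam() for _ in range(n_params)]

    def parameters(self):
        return self._params

def freeze(model):
    # Mark every parameter non-trainable so the optimizer skips it and
    # the teacher stays fixed throughout distillation training.
    for param in model.parameters():
        param.trainable = False

teacher = FakeModel(3)
freeze(teacher)
assert all(not p.trainable for p in teacher.parameters())
```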
Checkpoint Saving ppocr/utils/save_load.py214-281:
Saves optimizer states (`.pdopt`), model parameters (`.pdparams`), and training states (`.states`).

Sources: ppocr/modeling/architectures/distillation_model.py29-54 ppocr/utils/save_load.py172-212 ppocr/utils/save_load.py214-281
Detection models using distillation require specialized post-processing to handle outputs from multiple models.
DistillationDBPostProcess ppocr/postprocess/db_postprocess.py259-290:
Key Features:
- Selects which models' outputs to post-process via the `model_name` parameter (typically `["student"]`)
- Delegates to `DBPostProcess` for the actual box extraction ppocr/postprocess/db_postprocess.py275-283

Implementation ppocr/postprocess/db_postprocess.py259-290:
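The select-and-delegate pattern can be sketched as follows. The function name and the callable standing in for `DBPostProcess` are illustrative, not the actual class interface:

```python
# Sketch of the DistillationDBPostProcess pattern: pick the named
# sub-model outputs out of the distillation dict and delegate each to
# the standard DB post-processor (stood in here by a plain callable).
def distillation_db_postprocess(outs_dict, shape_list, base_postprocess,
                                model_name=("Student",)):
    results = {}
    for name in model_name:
        # outs_dict maps model name -> that model's prediction maps
        results[name] = base_postprocess(outs_dict[name], shape_list)
    return results

fake_db = lambda pred, shapes: {"boxes": [pred]}  # stand-in for DBPostProcess
out = distillation_db_postprocess(
    {"Student": "student_maps", "Teacher": "teacher_maps"},
    shape_list=None,
    base_postprocess=fake_db,
    model_name=["Student"],
)
# out == {"Student": {"boxes": ["student_maps"]}}
```

At inference time only the student's boxes are needed, which is why the default selection excludes the teacher.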
Sources: ppocr/postprocess/db_postprocess.py259-290
The DistillationMetric class wraps standard metrics to evaluate multiple models during distillation training.
Architecture:
Key Methods ppocr/metrics/distillation_metric.py1-73:
Initialization ppocr/metrics/distillation_metric.py27-33:
- `key`: Model name to use as primary (typically "Student")
- `base_metric_name`: Base metric class (e.g., "RecMetric", "DetMetric")
- `main_indicator`: Primary metric to track (e.g., "acc")

Lazy Initialization ppocr/metrics/distillation_metric.py35-42:
Metric Aggregation ppocr/metrics/distillation_metric.py52-68:
Example Output:
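An illustrative aggregated result (all values hypothetical, and the exact key-naming scheme is an assumption): metrics from the primary model selected by `key` drive the main indicator, while other models' metrics are namespaced with the model name.

```python
# Hypothetical aggregated metrics for a recognition distillation run.
metrics = {
    "acc": 0.942,             # Student accuracy (main indicator)
    "norm_edit_dis": 0.987,   # Student normalized edit distance
    "Teacher_acc": 0.965,     # teacher metrics, prefixed by model name
    "Teacher_norm_edit_dis": 0.991,
}
assert "acc" in metrics and "Teacher_acc" in metrics
```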
Sources: ppocr/metrics/distillation_metric.py1-73
Common distillation training patterns in PaddleOCR:
Detection Distillation (DB Model):
- `DistillationDBLoss` (weight: 1.0) - Task loss on student
- `DistillationDMLLoss` (weight: 1.0) - DML on shrink_maps and threshold_maps

Recognition Distillation (CTC Model):
- `DistillationCTCLoss` (weight: 1.0) - CTC loss on student
- `DistillationDMLLoss` (weight: 1.0, use_log: true) - DML on CTC logits
- `DistillationKLDivLoss` - Additional knowledge transfer

Multi-head Recognition:
- Set `multi_head=True` in the loss configuration

Sources: ppocr/losses/distillation_loss.py45-141 ppocr/losses/distillation_loss.py613-637