This page documents the post-processing and enrichment stages that occur after the core pipeline stages (preprocessing, OCR, layout detection, table structure) have completed. These stages refine the extracted structure and add semantic annotations to enhance the output document.
For information about the core pipeline stages that precede these steps, see Standard PDF Pipeline. For details on the threaded execution model, see Threaded Pipeline Architecture.
The post-processing and enrichment sequence is as follows. After the core pipeline stages complete (preprocess, OCR, layout, table structure), the StandardPdfPipeline performs layout postprocessing to clean up overlapping clusters, then applies reading order prediction, and finally constructs the DoclingDocument. The enrichment phase operates on the fully assembled document, sequentially applying picture classification, picture description, chart extraction, and optional code/formula detection models. Each enrichment model processes elements in batches via the BasePipeline._enrich_document() orchestrator.
Sources: docling/pipeline/base_pipeline.py102-124 docling/pipeline/standard_pdf_pipeline.py
The ReadingOrderModel reorders document elements according to natural reading flow, merges related elements, and associates captions and footnotes with their parent elements.
Key Operations:
| Operation | Description | Code Reference |
|---|---|---|
| `predict_reading_order()` | Sorts elements spatially (top-to-bottom, left-to-right within pages) | docling/models/stages/reading_order/readingorder_model.py410-412 |
| `predict_to_captions()` | Associates captions with figures, tables, code blocks | docling/models/stages/reading_order/readingorder_model.py413-415 |
| `predict_to_footnotes()` | Associates footnotes with parent elements | docling/models/stages/reading_order/readingorder_model.py416-418 |
| `predict_merges()` | Identifies text elements that should be merged (e.g., hyphenation) | docling/models/stages/reading_order/readingorder_model.py419-421 |
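As an illustration of the spatial sorting step only, the following sketch shows a naive top-to-bottom, left-to-right ordering within pages. Docling's actual `predict_reading_order()` is a learned model, not a plain sort; this is just the baseline intuition.

```python
# Illustrative sketch, NOT docling's implementation: order elements by page,
# then top-to-bottom, then left-to-right (top-left coordinate origin).
def naive_reading_order(elements):
    """elements: dicts with 'page', 'x', 'y' keys."""
    return sorted(elements, key=lambda e: (e["page"], e["y"], e["x"]))
```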
The model handles several special cases during document construction:
- Creation of `GroupLabel.LIST` containers (docling/models/stages/reading_order/readingorder_model.py351-359)
- Elements routed to `ContentLayer.FURNITURE` instead of `BODY` (docling/models/stages/reading_order/readingorder_model.py374-383)

Sources: docling/models/stages/reading_order/readingorder_model.py42-431
Layout postprocessing refines layout predictions by resolving overlapping clusters, mapping text cells to clusters, and merging hierarchical elements. The LayoutPostprocessor uses spatial indexing (R-tree and interval trees) for efficient overlap detection.
The postprocessor uses three complementary indexes for efficient spatial queries:
Key Classes:
| Class | Purpose | Code Reference |
|---|---|---|
| `SpatialClusterIndex` | Maintains R-tree + interval trees for clusters | docling/utils/layout_postprocessor.py52-108 |
| `IntervalTree` | 1D interval overlap queries using binary search | docling/utils/layout_postprocessor.py124-155 |
| `UnionFind` | Groups overlapping clusters efficiently | docling/utils/layout_postprocessor.py19-49 |
Sources: docling/utils/layout_postprocessor.py52-155
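For readers unfamiliar with the grouping structure, here is a minimal union-find of the kind the postprocessor's `UnionFind` class implements (this sketch uses path halving; the real class may differ in detail):

```python
# Minimal union-find: near-constant-time grouping of overlapping clusters.
class UnionFind:
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
```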
The overlap resolution algorithm uses type-specific thresholds to determine which clusters to merge or remove:
Type-Specific Parameters:
| Cluster Type | Area Threshold | Confidence Threshold | Code Reference |
|---|---|---|---|
| Regular (TEXT, SECTION_HEADER, etc.) | 1.3 | 0.05 | docling/utils/layout_postprocessor.py162 |
| Picture (PICTURE) | 2.0 | 0.3 | docling/utils/layout_postprocessor.py163 |
| Wrapper (FORM, TABLE, KEY_VALUE_REGION) | 2.0 | 0.2 | docling/utils/layout_postprocessor.py164 |
Sources: docling/utils/layout_postprocessor.py157-298
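The decision rule below is a hedged sketch of how such thresholds could be applied; the exact logic in `LayoutPostprocessor` may differ. The threshold values mirror the table above, but the `absorb_smaller` rule itself is an illustration, not docling's code.

```python
# Type-specific thresholds from the table above.
OVERLAP_PARAMS = {
    "regular": {"area_threshold": 1.3, "conf_threshold": 0.05},
    "picture": {"area_threshold": 2.0, "conf_threshold": 0.3},
    "wrapper": {"area_threshold": 2.0, "conf_threshold": 0.2},
}

def absorb_smaller(big_area, small_area, small_conf, kind="regular"):
    """Illustrative rule: drop the smaller overlapping cluster when the
    bigger one dominates it in area, or when the smaller cluster's
    confidence is below the type's floor."""
    p = OVERLAP_PARAMS[kind]
    return big_area >= p["area_threshold"] * small_area or small_conf < p["conf_threshold"]
```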
After overlap resolution, the postprocessor maps text cells (word/line-level) to their containing clusters:
Mapping Algorithm:
A cell is mapped to a cluster when `cluster.bbox.contains(cell.bbox)` holds; remaining special cases are handled in the code referenced below.
Sources: docling/utils/layout_postprocessor.py300-400
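A stand-in sketch of the containment test makes the mapping concrete. The `BBox` type here is illustrative (top-left origin, so `t <= b`), not docling's actual bounding-box class:

```python
from dataclasses import dataclass

@dataclass
class BBox:  # illustrative stand-in for docling's bounding box
    l: float
    t: float
    r: float
    b: float

    def contains(self, other: "BBox") -> bool:
        return (self.l <= other.l and self.t <= other.t
                and self.r >= other.r and self.b >= other.b)

def map_cells(cells, clusters):
    """Return {cell_index: cluster_index} for fully contained cells."""
    mapping = {}
    for ci, cell in enumerate(cells):
        for ki, cluster in enumerate(clusters):
            if cluster.contains(cell):
                mapping[ci] = ki
                break
    return mapping
```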
The final postprocessing step merges hierarchically related clusters (e.g., FORM contains child clusters):
Wrapper Types:
- `DocItemLabel.FORM`
- `DocItemLabel.KEY_VALUE_REGION`
- `DocItemLabel.TABLE`
- `DocItemLabel.DOCUMENT_INDEX`

These clusters act as containers that group related child clusters. During hierarchy merging, child clusters are moved into the children list of their parent wrapper, and the wrapper's bounding box is expanded to fully contain all children.
Sources: docling/utils/layout_postprocessor.py402-500
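The bounding-box expansion step amounts to a coordinate-wise union, sketched here with `(l, t, r, b)` tuples and a top-left origin:

```python
# Wrapper bbox expansion: the parent's box becomes the union of its own box
# and all child boxes.
def union_bbox(boxes):
    ls, ts, rs, bs = zip(*boxes)
    return (min(ls), min(ts), max(rs), max(bs))
```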
The enrichment pipeline processes the assembled DoclingDocument to add semantic annotations. Each enrichment model implements the GenericEnrichmentModel interface and processes elements in batches.
Core Methods:
| Method | Purpose | Returns |
|---|---|---|
| `is_processable()` | Filters elements that this model can process | `bool` |
| `prepare_element()` | Prepares an element for batch processing (may crop images) | `Optional[EnrichElementT]` |
| `__call__()` | Processes a batch of prepared elements | `Iterable[NodeItem]` |
Sources: docling/models/base_model.py150-231
The BasePipeline._enrich_document() method orchestrates enrichment execution:
The pipeline processes elements sequentially through each model, allowing later models to benefit from annotations added by earlier ones.
Sources: docling/pipeline/base_pipeline.py100-122
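The orchestration loop can be sketched as follows. This is a hedged illustration of an `_enrich_document()`-style driver built on the three-method contract above, with a trivial stand-in model; it is not docling's actual code.

```python
from itertools import islice

def enrich(doc, elements, model, batch_size=4):
    """Filter with is_processable(), prepare each element, then feed
    fixed-size batches to the model and yield enriched items."""
    prepared = (
        p
        for el in elements
        if model.is_processable(doc, el)
        if (p := model.prepare_element(doc, el)) is not None
    )
    it = iter(prepared)
    while batch := list(islice(it, batch_size)):
        yield from model(doc, batch)

class UppercaseModel:  # trivial stand-in enrichment model
    def is_processable(self, doc, el):
        return isinstance(el, str)

    def prepare_element(self, doc, el):
        return el

    def __call__(self, doc, batch):
        return (el.upper() for el in batch)
```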
The CodeFormulaModel is an optional enrichment model that identifies and annotates code blocks and mathematical formulas. Unlike other enrichment models, it requires access to the PDF backend for text extraction and is therefore only available in PDF processing pipelines.
Limitations:
- Active only when `do_code_enrichment` or `do_formula_enrichment` is enabled
- Cannot process `DoclingDocument` objects without the original PDF

Configuration:
Model Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `enabled` | `bool` | - | Computed from `do_code_enrichment` or `do_formula_enrichment` |
| `do_code_enrichment` | `bool` | `True` | Detect code blocks |
| `do_formula_enrichment` | `bool` | `True` | Detect mathematical formulas |
| `accelerator_options` | `AcceleratorOptions` | - | Device and threading config |
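Enabling these flags from Python looks like the following (standard docling configuration; option names as documented above):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

opts = PdfPipelineOptions()
opts.do_code_enrichment = True
opts.do_formula_enrichment = True  # either flag forces keep_backend = True

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
```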
Integration Note:
When either enrichment flag is enabled, the pipeline sets self.keep_backend = True to preserve PDF backends through the enrichment phase. However, this conflicts with the memory optimization strategy in StandardPdfPipeline, which clears backends immediately after page assembly.
Sources: docling/datamodel/pipeline_options.py635-648 docling/cli/main.py480-487
The DocumentPictureClassifier categorizes pictures into 26 classes (bar chart, line chart, geographical map, etc.) using an engine-based inference architecture that supports multiple backends (Transformers, ONNX Runtime).
Engine Creation:
The classifier uses a factory pattern to create the appropriate inference engine for the configured backend.
Inference Flow:
- Picture crops are batched into an `ImageClassificationEngineInput` list
- `engine.predict_batch(input_batch)` returns `List[ImageClassificationEngineOutput]`

Configuration:
| Option | Default | Description |
|---|---|---|
| `repo_id` | `"ds4sd/docling-models"` | HuggingFace model repository |
| `revision` | (pinned) | Specific model commit |
| `repo_cache_folder` | Model-specific | Local cache directory |
| `model_spec` | Preset-based | Model architecture specification |
| `engine_options` | `BaseImageClassificationEngineOptions` | Backend-specific config |
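Enabling the classifier typically looks like the following (standard `PdfPipelineOptions` flags; the classifier needs cropped picture images, which the `generate_picture_images` flag provides):

```python
from docling.datamodel.pipeline_options import PdfPipelineOptions

opts = PdfPipelineOptions()
opts.do_picture_classification = True
opts.generate_picture_images = True  # provide cropped picture images
opts.images_scale = 2.0              # higher-resolution crops
```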
Output Format: predictions are attached to each picture's classification metadata; the top result is accessible via `meta.classification.get_main_prediction()`.
Available Classes:
The model predicts 26 picture types including: abstract_painting, bar_chart, line_chart, pie_chart, scatter_plot, flowchart, diagram, organizational_chart, geographical_map, photograph_nature, photograph_people, etc.
Sources: docling/models/stages/picture_classifier/document_picture_classifier.py64-170 docling/models/inference_engines/image_classification/__init__.py
The ChartExtractionModelGraniteVision converts bar charts, pie charts, and line charts into tabular CSV format using the Granite Vision 3.3 2B model. This enrichment runs after picture classification to filter processable charts.
Supported Chart Types:

- `bar_chart`
- `pie_chart`
- `line_chart`

Only pictures with `meta.classification.get_main_prediction().class_name` in this list are processed.
Model Configuration:
| Parameter | Value |
|---|---|
| Model | ibm-granite/granite-vision-3.3-2b-chart2csv-preview |
| Revision | 6e1fbaae4604ecc85f4f371416d82154ca49ad67 (pinned) |
| Prompt | "Convert the information in this chart into a data table in CSV format." |
| Device | CPU or CUDA |
Integration:
Chart extraction is configured via PdfPipelineOptions:
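A sketch of what that configuration might look like. The exact option name is an assumption here (written as `do_chart_extraction`) and should be verified against your docling version's `PdfPipelineOptions`:

```python
from docling.datamodel.pipeline_options import PdfPipelineOptions

opts = PdfPipelineOptions()
opts.do_picture_classification = True  # chart extraction filters on predicted classes
opts.do_chart_extraction = True        # hypothetical flag name; verify in your version
```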
Sources: docling/models/stages/chart_extraction/granite_vision.py36-280 docling/pipeline/base_pipeline.py149-183
Picture description models generate natural language captions for images using Vision-Language Models (VLMs). Docling supports both inline VLMs (run locally) and API-based VLMs (remote services).
Picture description models support sophisticated filtering to control which pictures receive captions:
| Filter Option | Type | Default | Description |
|---|---|---|---|
| `picture_area_threshold` | `float` | `0.0` | Minimum page-area fraction (0.0 = no threshold) |
| `classification_allow` | `Optional[List[PictureClassificationLabel]]` | `None` | Whitelist of allowed classes |
| `classification_deny` | `Optional[List[PictureClassificationLabel]]` | `None` | Blacklist of denied classes |
| `classification_min_confidence` | `float` | `0.0` | Minimum classification confidence |
Filter Evaluation Logic:
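The evaluation can be sketched as a conjunction of the four checks in the table above. This is stand-in logic, not docling's exact code; in particular, the interaction between confidence and allow/deny lists is an assumption here.

```python
# Illustrative filter evaluation: a picture is described only if it passes
# every configured check.
def passes_filters(area_fraction, class_name, class_conf,
                   picture_area_threshold=0.0,
                   classification_allow=None,
                   classification_deny=None,
                   classification_min_confidence=0.0):
    if area_fraction < picture_area_threshold:
        return False
    if class_conf < classification_min_confidence:
        return False
    if classification_allow is not None and class_name not in classification_allow:
        return False
    if classification_deny is not None and class_name in classification_deny:
        return False
    return True
```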
Example Configuration:
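One possible configuration, using docling's bundled SmolVLM preset; the filter attributes follow the table above and may vary by docling version:

```python
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    smolvlm_picture_description,
)

opts = PdfPipelineOptions()
opts.do_picture_description = True
opts.picture_description_options = smolvlm_picture_description
opts.picture_description_options.picture_area_threshold = 0.05  # skip tiny images
```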
Sources: docling/models/picture_description_base_model.py49-104
Inline VLMs: run locally; see Inline VLM Models for details.
API-based VLMs: call a remote service; see API-Based VLM Models for details.
Output Format:
Picture descriptions are stored in two locations:
- In the picture's `annotations` list: `PictureDescriptionData(text=..., provenance=...)`
- In the picture's metadata (`meta.description`): `DescriptionMetaField(text=..., created_by=...)`

Sources: docs/examples/pictures_description_api.py40-183 docs/examples/pictures_description.ipynb
Developers can create custom enrichment models by extending BaseEnrichmentModel or BaseItemAndImageEnrichmentModel.
This example demonstrates the minimal structure for a custom enrichment model:
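The sketch below shows the minimal three-method structure. Stand-in types are used so the example is self-contained; in docling you would subclass `BaseEnrichmentModel` and operate on `DoclingDocument` / `NodeItem` instead.

```python
from dataclasses import dataclass, field
from typing import Iterable, List

@dataclass
class Node:  # illustrative stand-in for docling's NodeItem
    kind: str
    annotations: List[str] = field(default_factory=list)

class DummyPictureAnnotator:
    """Follows the enrichment contract: is_processable / prepare_element / __call__."""

    def is_processable(self, doc, element: Node) -> bool:
        return element.kind == "picture"

    def prepare_element(self, doc, element: Node):
        return element  # a real model might crop an image here

    def __call__(self, doc, element_batch: Iterable[Node]) -> Iterable[Node]:
        for el in element_batch:
            el.annotations.append("dummy-annotation")  # real inference goes here
            yield el
```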
Integrate into Pipeline:
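One common integration pattern, following docs/examples/develop_picture_enrichment.py, is to subclass the pipeline and append the model to its enrichment stage. The `enrichment_pipe` attribute is a hedged sketch; verify it against your docling version.

```python
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

class CustomPdfPipeline(StandardPdfPipeline):
    def __init__(self, pipeline_options):
        super().__init__(pipeline_options)
        # MyEnrichmentModel is a placeholder for your own
        # BaseEnrichmentModel subclass.
        self.enrichment_pipe.append(MyEnrichmentModel())
```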
Sources: docs/examples/develop_picture_enrichment.py1-128
Enrichment models can also process previously converted documents.
Key Points:
- Image-based models require the `BaseItemAndImageEnrichmentModel` interface
- The `prepare_element()` helper crops images with `expansion_factor` and `images_scale`
- Annotations are written back into the `DoclingDocument`

Sources: docs/examples/enrich_doclingdocument.py1-154
For models that need to process cropped images from the document, extend BaseItemAndImageEnrichmentModel:
The base class automatically handles:
- Cropping the element's image, expanded by `expansion_factor` to include surrounding context
- Scaling by `images_scale` for higher resolution

Sources: docling/models/base_model.py179-231
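The expansion math amounts to growing the item's bounding box by a fraction of its size on each side before cropping. This is a stand-in illustration of the arithmetic, not docling's code:

```python
# Grow an (l, t, r, b) bbox by expansion_factor on each side.
def expand_bbox(l, t, r, b, expansion_factor=0.1):
    dx = (r - l) * expansion_factor
    dy = (b - t) * expansion_factor
    return (l - dx, t - dy, r + dx, b + dy)
```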