DoclingDocument Data Model

This page documents the DoclingDocument data model, which serves as Docling's unified representation for structured documents. This is the core output schema produced by all conversion pipelines, regardless of input format.

For information about how documents flow through the conversion process, see Document Conversion Flow. For configuring pipeline behavior, see Configuration and Pipeline Options.

Overview

The DoclingDocument schema represents documents as a hierarchical structure with rich semantic annotations, spatial provenance, and content organization. Every successful document conversion results in a DoclingDocument instance that preserves:

Document structure: Hierarchical organization of content items
Semantic labels: Classification of content (headings, paragraphs, tables, figures, etc.)
Spatial provenance: Page numbers, bounding boxes, and character spans linking back to source
Content separation: Distinction between main body content and document furniture (headers/footers)
Internal references: Cross-references between related content items

The schema is defined in the docling-core package and carries an explicit version so that consumers can check compatibility. Docling documents can be serialized to JSON, YAML, Markdown, HTML, or the legacy DocTags format.

Sources: docling/datamodel/document.py:28-36, tests/data/groundtruth/docling_v2/2203.01017v2.json:1-16

Schema Structure

A DoclingDocument consists of:

Metadata: Schema version, document name, and origin information
Hierarchical containers: body and furniture as root-level GroupItem objects
Flat collections: All content items stored in type-specific arrays (texts, tables, pictures, groups)
Reference system: Hierarchical structure uses JSON references ($ref) to items in the flat collections

This hybrid structure enables both efficient access to items by type and preservation of document hierarchy.
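
The layout described above can be sketched with plain Python dictionaries. This is an illustrative sketch, not output produced by Docling: all field values are invented, the `schema_name`, `version`, and `name` keys are assumptions about how the metadata is spelled, and `count_by_type` is a hypothetical helper.

```python
import json

# Hypothetical minimal document: the hierarchy lives in "body"/"furniture",
# the actual items live in flat, type-specific lists.
doc = {
    "schema_name": "DoclingDocument",  # assumed key name
    "version": "1.0.0",                # assumed key name and value
    "name": "example",
    "origin": {
        "filename": "example.pdf",
        "mimetype": "application/pdf",
        "binary_hash": 1234567890,
    },
    "furniture": {"self_ref": "#/furniture", "children": []},
    "body": {
        "self_ref": "#/body",
        "children": [{"$ref": "#/texts/0"}, {"$ref": "#/tables/0"}],
    },
    "groups": [],
    "texts": [
        {
            "self_ref": "#/texts/0",
            "parent": {"$ref": "#/body"},
            "children": [],
            "label": "section_header",
            "text": "Introduction",
            "level": 1,
        }
    ],
    "pictures": [],
    "tables": [
        {"self_ref": "#/tables/0", "parent": {"$ref": "#/body"}, "children": [], "label": "table"}
    ],
}


def count_by_type(document):
    """Count items per flat collection (type-specific access)."""
    return {key: len(document[key]) for key in ("texts", "tables", "pictures", "groups")}


# References are plain strings, so the structure survives a JSON round trip.
round_tripped = json.loads(json.dumps(doc))
```

Because the hierarchy only stores `$ref` strings, the same item is never duplicated between the tree and the flat lists.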

Sources: tests/data/groundtruth/docling_v2/2203.01017v2.json:1-51, docling/datamodel/document.py:28-36

Document Origin

The origin field captures metadata about the source document:

Field	Type	Description
`filename`	`str`	Original filename of the document
`mimetype`	`str`	MIME type of the source format
`binary_hash`	`int`	Stable hash of the document content

The binary hash enables deduplication and change detection across document versions.
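
To illustrate how an integer content hash supports deduplication, here is a minimal sketch. The hash function is an assumption chosen for the example (the first 8 bytes of SHA-256); this page does not state which algorithm Docling uses, and both `stable_int_hash` and `deduplicate` are hypothetical helpers.

```python
import hashlib


def stable_int_hash(data: bytes) -> int:
    """Hypothetical stand-in for binary_hash: first 8 bytes of SHA-256 as an unsigned int."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def deduplicate(files):
    """Keep the first filename seen for each distinct content hash."""
    seen = {}
    for filename, content in files:
        seen.setdefault(stable_int_hash(content), filename)
    return seen


files = [("a.pdf", b"same bytes"), ("b.pdf", b"same bytes"), ("c.pdf", b"other bytes")]
unique = deduplicate(files)
```

Two files with identical bytes map to the same key regardless of filename, which is also what makes change detection possible: a changed hash means changed content.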

Sources: tests/data/groundtruth/docling_v2/2203.01017v2.json:5-9, docling_core.types.doc.document

Content Layers: Body vs Furniture

Documents are organized into two distinct content layers:

Body Layer

The body contains the main semantic content of the document:

Section headers
Paragraphs and text
Tables and figures
Lists and code blocks
Formulas

Furniture Layer

The furniture contains document elements that are structurally separate from the main content:

Page headers
Page footers
Footnotes
Captions (may also appear in body depending on context)

Both body and furniture are GroupItem instances that act as hierarchical containers. They reference child items through the $ref system.
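
A short sketch of separating the two layers, using plain dictionaries shaped like exported items. The sample items are invented, `items_in_layer` is a hypothetical helper, and treating a missing `content_layer` as `"body"` is an assumption made for the example.

```python
def items_in_layer(texts, layer):
    """Return the text of every item assigned to the given content layer."""
    # Assumption: items without an explicit content_layer belong to the body.
    return [item["text"] for item in texts if item.get("content_layer", "body") == layer]


texts = [
    {"label": "page_header", "content_layer": "furniture", "text": "Journal of Examples"},
    {"label": "section_header", "content_layer": "body", "text": "1 Introduction"},
    {"label": "text", "content_layer": "body", "text": "Tables are everywhere."},
    {"label": "page_footer", "content_layer": "furniture", "text": "Page 1"},
]

body_text = items_in_layer(texts, "body")
furniture_text = items_in_layer(texts, "furniture")
```

Filtering on the layer is what lets a downstream consumer, for example a text indexer, skip running headers and footers.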

Sources: tests/data/groundtruth/docling_v2/2203.01017v2.json:10-51, docling_core.types.doc.document

Content Item Types

All content items inherit from DocItem and share common attributes:

Attribute	Type	Description
`self_ref`	`str`	JSON reference to this item (e.g., `#/texts/0`)
`parent`	`dict`	Reference to parent item (e.g., `{"$ref": "#/body"}`)
`children`	`list`	References to child items
`content_layer`	`str`	Either `"body"` or `"furniture"`
`label`	`DocItemLabel`	Semantic label (e.g., `"text"`, `"section_header"`)
`prov`	`list[ProvenanceItem]`	Provenance information (location in source)

TextItem

Represents textual content with optional formatting:

Common label values for TextItem:

text - Regular paragraph text
section_header - Section/subsection headers
title - Document title
caption - Figure/table captions
footnote - Footnote text
page_header / page_footer - Header/footer text
list_item - List item text

For section headers, an additional level attribute indicates heading depth (1-6).
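
The following sketch shows what text items with the common attributes might look like as dictionaries, and how the `label` can drive simple grouping. The item contents are invented and `group_text_by_label` is a hypothetical helper, not part of docling-core.

```python
from collections import defaultdict

# Invented sample items carrying the shared attributes plus a "text" field.
texts = [
    {"self_ref": "#/texts/0", "parent": {"$ref": "#/body"}, "children": [],
     "content_layer": "body", "label": "title", "prov": [], "text": "A Study of Tables"},
    {"self_ref": "#/texts/1", "parent": {"$ref": "#/body"}, "children": [],
     "content_layer": "body", "label": "section_header", "prov": [],
     "text": "1 Introduction", "level": 1},
    {"self_ref": "#/texts/2", "parent": {"$ref": "#/body"}, "children": [],
     "content_layer": "body", "label": "text", "prov": [], "text": "Tables are common."},
    {"self_ref": "#/texts/3", "parent": {"$ref": "#/body"}, "children": [],
     "content_layer": "body", "label": "text", "prov": [], "text": "They are hard to parse."},
]


def group_text_by_label(items):
    """Collect item text per semantic label, preserving document order."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["label"]].append(item["text"])
    return dict(grouped)


by_label = group_text_by_label(texts)
```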

Sources: tests/data/groundtruth/docling_v2/2305.03393v1-pg9.json:109-162, docling_core.types.doc

TableItem

Represents tabular data with structured cell information:

The data field contains:

table_cells: Array of TableCell objects with text, position, spans, and header flags
num_rows / num_cols: Table dimensions
grid: Row-by-column matrix of cells derived from table_cells

Cells support row/column spanning and can be marked as header cells via the column_header and row_header boolean flags.
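
To make the relationship between a flat cell list and a two-dimensional grid concrete, here is a minimal sketch that expands spanning cells. The offset field names (`start_row_offset_idx` and so on) and the header flag names are assumptions about the exported cell layout and should be checked against your docling-core version; `make_cell` and `build_grid` are hypothetical helpers.

```python
def make_cell(text, row, col, row_span=1, col_span=1, column_header=False):
    """Hypothetical helper that builds one cell dict with half-open offsets."""
    return {
        "text": text,
        "row_span": row_span,
        "col_span": col_span,
        "start_row_offset_idx": row,
        "end_row_offset_idx": row + row_span,
        "start_col_offset_idx": col,
        "end_col_offset_idx": col + col_span,
        "column_header": column_header,
        "row_header": False,
    }


table_data = {
    "num_rows": 2,
    "num_cols": 2,
    "table_cells": [
        make_cell("Results", 0, 0, col_span=2, column_header=True),  # spans both columns
        make_cell("A", 1, 0),
        make_cell("B", 1, 1),
    ],
}


def build_grid(data):
    """Expand the flat cell list into a num_rows x num_cols grid of cell text."""
    grid = [["" for _ in range(data["num_cols"])] for _ in range(data["num_rows"])]
    for cell in data["table_cells"]:
        for r in range(cell["start_row_offset_idx"], cell["end_row_offset_idx"]):
            for c in range(cell["start_col_offset_idx"], cell["end_col_offset_idx"]):
                grid[r][c] = cell["text"]  # a spanning cell fills every slot it covers
    return grid


grid = build_grid(table_data)
```

Repeating a spanning cell's text in every covered slot is one possible convention; a consumer that needs to distinguish merged cells would keep a reference to the cell instead of its text.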

Sources: tests/data/groundtruth/docling_v2/2305.03393v1-pg9.json:190-428, docling_core.types.doc

PictureItem

Represents figures, diagrams, charts, and other visual content:

Key fields:

image: Embedded or referenced image data
captions: References to associated caption TextItems
meta.classification: Optional picture classification (chart, diagram, photo, etc.)
meta.tabular_chart: For charts, extracted tabular data

Picture classification labels include:

Chart-Plot - Line/scatter plots
Chart-Bar - Bar charts
Chart-Pie - Pie charts
Diagram-Flowchart - Flowcharts
Diagram-Illustration - Illustrations
Photo - Photographs

Sources: tests/data/groundtruth/docling_v2/picture_classification.json:27-103, docling_core.types.doc

SectionHeaderItem

A specialized TextItem for document structure:

Section headers can have children representing the content that falls under that section. The level attribute indicates heading hierarchy (1 = top-level, 6 = deepest).
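
As an illustration of how the `level` attribute encodes hierarchy, the sketch below nests a flat sequence of headers into a tree with a stack. This derives nesting from levels alone and is not a description of how Docling itself assigns children; the sample headers are invented and `nest_headers` is a hypothetical helper.

```python
def nest_headers(headers):
    """Turn a flat, ordered list of headers into a tree based on their level."""
    root = {"text": None, "level": 0, "children": []}
    stack = [root]  # stack[-1] is the nearest open ancestor
    for header in headers:
        node = {"text": header["text"], "level": header["level"], "children": []}
        # Close every section that is at the same depth or deeper.
        while stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


headers = [
    {"text": "Introduction", "level": 1},
    {"text": "Background", "level": 2},
    {"text": "Motivation", "level": 2},
    {"text": "Methods", "level": 1},
]

outline = nest_headers(headers)
```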

Sources: docling_core.types.doc, tests/data/groundtruth/docling_v2/2203.01017v2.json:143-162

ListItem

Represents items in ordered or unordered lists:

ListItems are typically children of a ListGroup (see Groups section below).

Sources: docling_core.types.doc.document, tests/data/groundtruth/docling_v2/multi_page.json:42-57

Groups and Nested Structures

GroupItem

GroupItem is a container for organizing related content items hierarchically. It acts as a parent to other items without being content itself.

Common group types:

ListGroup: Container for list items (ordered or unordered)
KeyValueArea: Region containing key-value pairs (unstructured)
Form: Structured form with labeled fields
Generic GroupItem: Custom hierarchical organization

Example list structure:
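
A minimal sketch of a list group and its items as plain dictionaries. The values are invented, the `enumerated` and `marker` fields are assumptions about how list items are exported, and `render_list` is a hypothetical helper.

```python
doc = {
    "groups": [
        {
            "self_ref": "#/groups/0",
            "parent": {"$ref": "#/body"},
            "children": [{"$ref": "#/texts/0"}, {"$ref": "#/texts/1"}],
            "label": "list",
        }
    ],
    "texts": [
        {"self_ref": "#/texts/0", "parent": {"$ref": "#/groups/0"}, "children": [],
         "label": "list_item", "text": "First point", "enumerated": False, "marker": "-"},
        {"self_ref": "#/texts/1", "parent": {"$ref": "#/groups/0"}, "children": [],
         "label": "list_item", "text": "Second point", "enumerated": False, "marker": "-"},
    ],
}


def render_list(document, group):
    """Render the children of a list group as marker-prefixed lines."""
    lines = []
    for child in group["children"]:
        # "#/texts/0" splits into ["#", "texts", "0"].
        _, collection, index = child["$ref"].split("/")
        item = document[collection][int(index)]
        lines.append(f"{item['marker']} {item['text']}")
    return lines


rendered = render_list(doc, doc["groups"][0])
```

The group holds only references, while each list item points back to the group through its `parent` field.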

Groups enable preservation of document structure beyond simple linear ordering. Nested groups support complex hierarchies like multi-level lists or form sections.

Sources: docling_core.types.doc.document, tests/data/groundtruth/docling_v2/multi_page.json:42-80

Provenance Tracking

Every content item includes provenance information linking it back to its location in the source document:

ProvenanceItem Structure

Key aspects:

Multiple provenance entries: Items spanning multiple pages or columns have multiple ProvenanceItem entries
Bounding boxes: Coordinates in PDF points (1/72 inch), with the origin at either the top-left or the bottom-left corner of the page
Character spans: Zero-indexed ranges into the source text stream
Page numbering: Pages are 1-indexed (first page = 1)
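
The sketch below shows one provenance entry as a dictionary and converts its bounding box from a bottom-left to a top-left origin, which is the usual step before drawing an overlay on a rendered page image. The field names (`page_no`, `bbox` with `l`, `t`, `r`, `b`, `coord_origin`, and `charspan`) are assumptions about the exported layout, the values are invented, and `to_top_left` is a hypothetical helper.

```python
prov = {
    "page_no": 1,  # pages are 1-indexed
    "bbox": {"l": 72.0, "t": 720.0, "r": 540.0, "b": 700.0, "coord_origin": "BOTTOMLEFT"},
    "charspan": [0, 27],  # zero-indexed character range
}


def to_top_left(bbox, page_height):
    """Return the box expressed with the origin at the top-left corner."""
    if bbox["coord_origin"] == "TOPLEFT":
        return dict(bbox)
    # Flip the vertical axis: distance from the bottom becomes distance from the top.
    return {
        "l": bbox["l"],
        "t": page_height - bbox["t"],
        "r": bbox["r"],
        "b": page_height - bbox["b"],
        "coord_origin": "TOPLEFT",
    }


# 792 points is the height of a US Letter page (11 inches * 72 points).
converted = to_top_left(prov["bbox"], page_height=792.0)
span_length = prov["charspan"][1] - prov["charspan"][0]
```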

Provenance enables:

Visual overlay of detected items on source document
Navigation from output back to source location
Extraction of source images for specific regions
Verification of conversion accuracy

Sources: tests/data/groundtruth/docling_v2/2305.03393v1-pg9.json:62-76, docling_core.types.doc.base

Reference System

DoclingDocument uses a hybrid storage model combining hierarchical structure with flat collections:

JSON References

References use JSON Pointer syntax in URI-fragment form (a leading # followed by the path):

#/texts/0 - First item in texts array
#/tables/5 - Sixth item in tables array
#/pictures/2 - Third item in pictures array
#/groups/0 - First item in groups array
#/body - The body root node
#/furniture - The furniture root node
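
Resolving such a reference against a document loaded as a dictionary takes only a few lines. This is a minimal sketch under the assumption that references never contain escaped characters; `resolve_ref` is a hypothetical helper, and the sample document is invented.

```python
def resolve_ref(document, ref):
    """Follow a '#/collection/index' style reference inside the document dict."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref!r}")
    node = document
    for part in ref[2:].split("/"):
        # Lists are indexed by position, dicts by key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


doc = {
    "body": {"self_ref": "#/body", "children": [{"$ref": "#/texts/0"}, {"$ref": "#/texts/1"}]},
    "texts": [
        {"self_ref": "#/texts/0", "text": "Introduction"},
        {"self_ref": "#/texts/1", "text": "Tables are everywhere."},
    ],
}

first_child = resolve_ref(doc, doc["body"]["children"][0]["$ref"])
body_node = resolve_ref(doc, "#/body")
```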

Benefits of This Design

Efficient access: Direct array indexing by item type
Preserved hierarchy: Parent-child relationships maintained through references
Deduplication: Items referenced multiple times stored once
Serialization: Clean JSON structure without circular references
Traversal: Both depth-first (follow children) and type-specific (iterate arrays) supported
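
The two traversal styles can be contrasted in a short sketch: a depth-first walk that follows `children` references from the body, and a flat iteration over one collection. The sample document is invented and `iter_depth_first` is a hypothetical helper.

```python
doc = {
    "body": {
        "self_ref": "#/body",
        "children": [{"$ref": "#/texts/0"}, {"$ref": "#/groups/0"}],
    },
    "groups": [
        {"self_ref": "#/groups/0", "children": [{"$ref": "#/texts/2"}]},
    ],
    "texts": [
        {"self_ref": "#/texts/0", "children": [{"$ref": "#/texts/1"}]},
        {"self_ref": "#/texts/1", "children": []},
        {"self_ref": "#/texts/2", "children": []},
    ],
}


def iter_depth_first(document, node, depth=0):
    """Yield (depth, node) pairs in hierarchical order by following child references."""
    yield depth, node
    for child in node.get("children", []):
        _, collection, index = child["$ref"].split("/")
        yield from iter_depth_first(document, document[collection][int(index)], depth + 1)


# Hierarchical order: follows the tree rooted at the body.
tree_order = [(depth, node["self_ref"]) for depth, node in iter_depth_first(doc, doc["body"])]

# Type-specific order: plain iteration over one flat collection.
text_refs = [item["self_ref"] for item in doc["texts"]]
```

The tree walk visits groups and texts interleaved as they are nested, while the flat iteration sees every text item without touching the hierarchy at all.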

Sources: tests/data/groundtruth/docling_v2/2203.01017v2.json:19-51, docling_core.types.doc.document

Internal Representation

Within Docling, two classes surround the DoclingDocument during conversion:

ConversionResult

Wrapper returned by document conversion:

Location: docling/datamodel/document.py:464-525

Page

Internal page-level data structure during processing:

The Page class bridges raw PDF data, ML predictions, and final assembled content. It is part of the ConversionResult.pages list but not exposed in the final DoclingDocument JSON export.

Location: docling/datamodel/base_models.py:300-325

Key Type Definitions

Location: docling/datamodel/document.py:28-36, docling_core.types.doc

Serialization and Export

The DoclingDocument can be exported to multiple formats:

Method	Output Format	Use Case
`save_as_json()`	JSON	Machine-readable, preserves full structure
`save_as_yaml()`	YAML	Human-readable, preserves full structure
`save_as_markdown()`	Markdown	Text-focused, lightweight
`save_as_html()`	HTML	Web display, preserves formatting
`save_as_doctags()`	DocTags	Legacy format for specific tools
`export_to_dict()`	Python dict	Programmatic access

Export methods that emit images support the following image export modes:

EMBEDDED: Base64-encoded images inline
REFERENCED: External image files with references
PLACEHOLDER: Text placeholders for image positions

Sources: docling_core.types.doc.document

Version Compatibility

The schema is versioned independently from Docling:

Schema version follows semantic versioning:

Major: Breaking changes to structure
Minor: Backward-compatible additions
Patch: Bug fixes and clarifications

Docling includes the schema version in exported documents, enabling version-specific parsing and migration utilities.
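
A consumer could use the embedded version to decide whether it can parse a document. The sketch below shows one possible policy under semantic versioning: same major version, and a minor version no newer than what the reader supports. The policy, the version strings, and both helper functions are assumptions for illustration, not behaviour defined by Docling.

```python
def parse_version(version):
    """Split 'major.minor.patch' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(document_version, supported_version):
    """Hypothetical policy: same major, and document minor not newer than the reader's."""
    doc_major, doc_minor, _ = parse_version(document_version)
    sup_major, sup_minor, _ = parse_version(supported_version)
    return doc_major == sup_major and doc_minor <= sup_minor


older_minor = is_compatible("1.2.5", "1.3.0")   # older minor, same major
newer_minor = is_compatible("1.4.0", "1.3.0")   # document has additions the reader may not know
other_major = is_compatible("2.0.0", "1.3.0")   # breaking change
```

Patch differences are ignored because, under semantic versioning, they do not change the structure.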

Sources: tests/data/groundtruth/docling_v2/2203.01017v2.json:2-3

