This page documents the DoclingDocument data model, which serves as Docling's unified representation for structured documents. This is the core output schema produced by all conversion pipelines, regardless of input format.
For information about how documents flow through the conversion process, see Document Conversion Flow. For configuring pipeline behavior, see Configuration and Pipeline Options.
The DoclingDocument schema represents documents as a hierarchical structure with rich semantic annotations, spatial provenance, and content organization. Every successful document conversion results in a DoclingDocument instance that preserves:
The schema is defined in the docling-core package and is versioned so that consumers can check compatibility. Docling documents can be serialized to JSON, YAML, Markdown, HTML, or the legacy DocTags format.
Sources: docling/datamodel/document.py28-36 tests/data/groundtruth/docling_v2/2203.01017v2.json1-16
A DoclingDocument consists of:
- body and furniture as root-level GroupItem objects
- Flat collections of items by type (texts, tables, pictures, groups)
- References ($ref) to items in the flat collections

This hybrid structure enables both efficient access to items by type and preservation of document hierarchy.
Sources: tests/data/groundtruth/docling_v2/2203.01017v2.json1-51 docling/datamodel/document.py28-36
The origin field captures metadata about the source document:
| Field | Type | Description |
|---|---|---|
| filename | str | Original filename of the document |
| mimetype | str | MIME type of the source format |
| binary_hash | int | Stable hash of the document content |
The binary hash enables deduplication and change detection across document versions.
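The following is a minimal sketch of how a stable content hash can support deduplication. The hashing scheme (SHA-256 truncated to 64 bits) and the helper names `stable_binary_hash` and `deduplicate` are assumptions made for illustration; the source does not state which algorithm Docling uses.

```python
import hashlib


def stable_binary_hash(data: bytes) -> int:
    """Return a deterministic 64-bit integer derived from the raw bytes.

    Assumption: SHA-256 truncated to 8 bytes, not necessarily Docling's scheme.
    """
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big")


def deduplicate(documents: dict[str, bytes]) -> dict[int, str]:
    """Keep the first filename seen for each distinct content hash."""
    seen: dict[int, str] = {}
    for filename, data in documents.items():
        seen.setdefault(stable_binary_hash(data), filename)
    return seen


# Hypothetical inputs: two files share content, one has changed.
docs = {
    "report.pdf": b"%PDF-1.7 example bytes",
    "report_copy.pdf": b"%PDF-1.7 example bytes",  # same content, new name
    "report_v2.pdf": b"%PDF-1.7 revised bytes",    # changed content
}
unique = deduplicate(docs)
```

Because the hash depends only on the bytes, a renamed copy maps to the same key, while any change to the content produces a different key.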
Sources: tests/data/groundtruth/docling_v2/2203.01017v2.json5-9 docling_core.types.doc.document
Documents are organized into two distinct content layers:
The body contains the main semantic content of the document:
The furniture contains document elements that are structurally separate from the main content:
Both body and furniture are GroupItem instances that act as hierarchical containers. They reference child items through the $ref system.
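As a rough illustration of the two layers, the sketch below separates items by their content_layer value. The item dictionaries are hypothetical and heavily trimmed; only the attribute names `self_ref`, `content_layer`, `label`, and `text` are taken from this page.

```python
from collections import defaultdict

# Hypothetical, trimmed items in the shape of an exported document.
texts = [
    {"self_ref": "#/texts/0", "content_layer": "furniture",
     "label": "page_header", "text": "Journal of Examples"},
    {"self_ref": "#/texts/1", "content_layer": "body",
     "label": "title", "text": "A Study"},
    {"self_ref": "#/texts/2", "content_layer": "body",
     "label": "text", "text": "Main paragraph."},
    {"self_ref": "#/texts/3", "content_layer": "furniture",
     "label": "page_footer", "text": "Page 1"},
]


def split_by_layer(items):
    """Group item texts by content layer, preserving the input order."""
    layers = defaultdict(list)
    for item in items:
        layers[item["content_layer"]].append(item["text"])
    return dict(layers)


layers = split_by_layer(texts)
```

Filtering on the layer in this way is one option for dropping headers and footers before passing text to downstream processing.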
Sources: tests/data/groundtruth/docling_v2/2203.01017v2.json10-51 docling_core.types.doc.document
All content items inherit from DocItem and share common attributes:
| Attribute | Type | Description |
|---|---|---|
| self_ref | str | JSON reference to this item (e.g., #/texts/0) |
| parent | dict | Reference to parent item (e.g., {"$ref": "#/body"}) |
| children | list | References to child items |
| content_layer | str | Either "body" or "furniture" |
| label | DocItemLabel | Semantic label (e.g., "text", "section_header") |
| prov | list[ProvenanceItem] | Provenance information (location in source) |
Represents textual content with optional formatting:
Common label values for TextItem:
- text: Regular paragraph text
- section_header: Section/subsection headers
- title: Document title
- caption: Figure/table captions
- footnote: Footnote text
- page_header / page_footer: Header/footer text
- list_item: List item text

For section headers, an additional level attribute indicates heading depth (1-6).
Sources: tests/data/groundtruth/docling_v2/2305.03393v1-pg9.json109-162 docling_core.types.doc
Represents tabular data with structured cell information:
The data field contains:
- table_cells: Array of TableCell objects with text, position, spans, and header flags
- num_rows / num_cols: Table dimensions
- grid: Optional serialized representation (HTML, CSV, etc.)

Cells support row/column spanning and can be marked as header cells via col_header and row_header boolean flags.
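To make the cell model concrete, here is a minimal sketch that expands a flat list of cells with spans into a rectangular grid. The keys `start_row`, `start_col`, `row_span`, and `col_span` are hypothetical names chosen for this example; the actual field names in the schema may differ.

```python
def build_grid(table_data: dict) -> list[list[str]]:
    """Expand a flat cell list into a num_rows x num_cols grid of strings.

    A spanning cell has its text repeated in every position it covers.
    """
    grid = [["" for _ in range(table_data["num_cols"])]
            for _ in range(table_data["num_rows"])]
    for cell in table_data["table_cells"]:
        row_end = cell["start_row"] + cell.get("row_span", 1)
        col_end = cell["start_col"] + cell.get("col_span", 1)
        for r in range(cell["start_row"], row_end):
            for c in range(cell["start_col"], col_end):
                grid[r][c] = cell["text"]
    return grid


# Hypothetical 2 x 3 table whose "Score" header spans two columns.
table_data = {
    "num_rows": 2,
    "num_cols": 3,
    "table_cells": [
        {"text": "Model", "start_row": 0, "start_col": 0, "col_header": True},
        {"text": "Score", "start_row": 0, "start_col": 1, "col_span": 2,
         "col_header": True},
        {"text": "A", "start_row": 1, "start_col": 0},
        {"text": "0.9", "start_row": 1, "start_col": 1},
        {"text": "0.8", "start_row": 1, "start_col": 2},
    ],
}
grid = build_grid(table_data)
```

Repeating the text of a spanning cell is one possible convention; leaving the covered positions empty is an equally valid choice, depending on the target format.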
Sources: tests/data/groundtruth/docling_v2/2305.03393v1-pg9.json190-428 docling_core.types.doc
Represents figures, diagrams, charts, and other visual content:
Key fields:
- image: Embedded or referenced image data
- captions: References to associated caption TextItems
- meta.classification: Optional picture classification (chart, diagram, photo, etc.)
- meta.tabular_chart: For charts, extracted tabular data

Picture classification labels include:
- Chart-Plot: Line/scatter plots
- Chart-Bar: Bar charts
- Chart-Pie: Pie charts
- Diagram-Flowchart: Flowcharts
- Diagram-Illustration: Illustrations
- Photo: Photographs

Sources: tests/data/groundtruth/docling_v2/picture_classification.json27-103 docling_core.types.doc
A specialized TextItem for document structure:
Section headers can have children representing the content that falls under that section. The level attribute indicates heading hierarchy (1 = top-level, 6 = deepest).
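As a sketch of how the level attribute can be used, the following builds an indented outline from section_header items. The sample items and the helper name `heading_outline` are hypothetical; defaulting a missing level to 1 is also an assumption.

```python
def heading_outline(texts):
    """Return section header texts indented by two spaces per heading level."""
    lines = []
    for item in texts:
        if item.get("label") == "section_header":
            level = item.get("level", 1)  # assumption: missing level means top-level
            lines.append("  " * (level - 1) + item["text"])
    return lines


# Hypothetical, trimmed items in document order.
texts = [
    {"label": "title", "text": "A Study"},
    {"label": "section_header", "level": 1, "text": "1 Introduction"},
    {"label": "text", "text": "Some paragraph."},
    {"label": "section_header", "level": 2, "text": "1.1 Background"},
    {"label": "section_header", "level": 1, "text": "2 Methods"},
]
outline = heading_outline(texts)
```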
Sources: docling_core.types.doc tests/data/groundtruth/docling_v2/2203.01017v2.json143-162
Represents items in ordered or unordered lists:
ListItems are typically children of a ListGroup (see Groups section below).
Sources: docling_core.types.doc.document tests/data/groundtruth/docling_v2/multi_page.json42-57
GroupItem is a container for organizing related content items hierarchically. It acts as a parent to other items without being content itself.
Common group types:
Example list structure:
Groups enable preservation of document structure beyond simple linear ordering. Nested groups support complex hierarchies like multi-level lists or form sections.
Sources: docling_core.types.doc.document tests/data/groundtruth/docling_v2/multi_page.json42-80
Every content item includes provenance information linking it back to its location in the source document:
Key aspects:
- ProvenanceItem entries

Provenance enables:
Sources: tests/data/groundtruth/docling_v2/2305.03393v1-pg9.json62-76 docling_core.types.doc.base
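As an illustration of the provenance model described above, the sketch below indexes items by the pages they appear on. The keys `page_no` and `bbox` inside each provenance entry, and the bounding-box values, are assumptions made for this example; only the `prov` list itself is described on this page.

```python
from collections import defaultdict

# Hypothetical items; an item that continues across a page break
# carries one provenance entry per page.
items = [
    {"self_ref": "#/texts/0", "text": "Introduction",
     "prov": [{"page_no": 1, "bbox": {"l": 72, "t": 700, "r": 300, "b": 680}}]},
    {"self_ref": "#/texts/1", "text": "A paragraph that continues overleaf.",
     "prov": [{"page_no": 1, "bbox": {"l": 72, "t": 120, "r": 520, "b": 60}},
              {"page_no": 2, "bbox": {"l": 72, "t": 760, "r": 520, "b": 720}}]},
    {"self_ref": "#/tables/0", "text": "",
     "prov": [{"page_no": 2, "bbox": {"l": 72, "t": 700, "r": 520, "b": 400}}]},
]


def items_by_page(items):
    """Map each page number to the self_ref of every item located on it."""
    pages = defaultdict(list)
    for item in items:
        for prov in item.get("prov", []):
            pages[prov["page_no"]].append(item["self_ref"])
    return dict(pages)


pages = items_by_page(items)
```

An index of this kind makes it straightforward to highlight or re-extract the content of a single page.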
DoclingDocument uses a hybrid storage model combining hierarchical structure with flat collections:
References use JSON Pointer syntax:
- #/texts/0: First item in texts array
- #/tables/5: Sixth item in tables array
- #/pictures/2: Third item in pictures array
- #/groups/0: First item in groups array
- #/body: The body root node
- #/furniture: The furniture root node

Sources: tests/data/groundtruth/docling_v2/2203.01017v2.json19-51 docling_core.types.doc.document
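The reference scheme above can be sketched with plain dictionaries: a small resolver follows a reference into the flat collections, and a recursive walk recovers the hierarchy from body. The helpers `resolve_ref` and `walk` and the sample document are hypothetical, and the resolver ignores the escaping rules of full JSON Pointer syntax.

```python
def resolve_ref(doc: dict, ref: str):
    """Resolve a reference such as '#/texts/0' or '#/body' against a document dict."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref!r}")
    node = doc
    for part in ref[2:].split("/"):
        # List segments are numeric indices; dict segments are keys.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def walk(doc: dict, ref: str = "#/body", depth: int = 0):
    """Yield (depth, item) pairs depth-first for everything below `ref`."""
    for child in resolve_ref(doc, ref).get("children", []):
        yield depth, resolve_ref(doc, child["$ref"])
        yield from walk(doc, child["$ref"], depth + 1)


# Hypothetical minimal document: a title followed by a two-item list group.
doc = {
    "body": {"self_ref": "#/body",
             "children": [{"$ref": "#/texts/0"}, {"$ref": "#/groups/0"}]},
    "furniture": {"self_ref": "#/furniture", "children": []},
    "groups": [{"self_ref": "#/groups/0", "label": "list",
                "children": [{"$ref": "#/texts/1"}, {"$ref": "#/texts/2"}]}],
    "texts": [
        {"self_ref": "#/texts/0", "label": "title", "text": "Title", "children": []},
        {"self_ref": "#/texts/1", "label": "list_item", "text": "First", "children": []},
        {"self_ref": "#/texts/2", "label": "list_item", "text": "Second", "children": []},
    ],
    "tables": [],
    "pictures": [],
}
order = [(depth, item["self_ref"]) for depth, item in walk(doc)]
```

Type-based access remains a direct lookup (for example `doc["texts"]`), while the walk recovers nesting without duplicating any item.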
The DoclingDocument is implemented across two main classes:
Wrapper returned by document conversion:
Location: docling/datamodel/document.py464-525
Internal page-level data structure during processing:
The Page class bridges raw PDF data, ML predictions, and final assembled content. It is part of the ConversionResult.pages list but not exposed in the final DoclingDocument JSON export.
Location: docling/datamodel/base_models.py300-325
Location: docling/datamodel/document.py28-36 docling_core.types.doc
The DoclingDocument can be exported to multiple formats:
| Method | Output Format | Use Case |
|---|---|---|
| save_as_json() | JSON | Machine-readable, preserves full structure |
| save_as_yaml() | YAML | Human-readable, preserves full structure |
| save_as_markdown() | Markdown | Text-focused, lightweight |
| save_as_html() | HTML | Web display, preserves formatting |
| save_as_doctags() | DocTags | Legacy format for specific tools |
| export_to_dict() | Python dict | Programmatic access |
All export methods support image export modes:
- EMBEDDED: Base64-encoded images inline
- REFERENCED: External image files with references
- PLACEHOLDER: Text placeholders for image positions

Sources: docling_core.types.doc.document
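To show what the three image modes mean in practice, here is a standalone sketch that renders one picture as Markdown under each mode. The `ImageMode` enum and `render_picture_markdown` function are hypothetical stand-ins written for this example, not the library's API, and the placeholder string is an assumption.

```python
import base64
from enum import Enum


class ImageMode(Enum):
    """Hypothetical stand-in for the three modes listed above."""
    EMBEDDED = "embedded"
    REFERENCED = "referenced"
    PLACEHOLDER = "placeholder"


def render_picture_markdown(image_bytes: bytes, mode: ImageMode,
                            path: str = "image_0.png") -> str:
    """Render a PNG picture as a Markdown fragment for the given mode."""
    if mode is ImageMode.EMBEDDED:
        # Inline the bytes as a base64 data URI.
        payload = base64.b64encode(image_bytes).decode("ascii")
        return f"![Image](data:image/png;base64,{payload})"
    if mode is ImageMode.REFERENCED:
        # Point at an external file that is written separately.
        return f"![Image]({path})"
    # Leave only a marker where the picture was.
    return "<!-- image -->"


embedded = render_picture_markdown(b"abc", ImageMode.EMBEDDED)
referenced = render_picture_markdown(b"abc", ImageMode.REFERENCED)
placeholder = render_picture_markdown(b"abc", ImageMode.PLACEHOLDER)
```

Embedding keeps the output self-contained at the cost of file size, referencing keeps the text small but requires shipping the image files alongside it, and placeholders suit text-only downstream use.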
The schema is versioned independently from Docling:
Schema version follows semantic versioning:
Docling includes the schema version in exported documents, enabling version-specific parsing and migration utilities.
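A minimal sketch of version-specific parsing follows. It assumes that the exported document carries its schema version under a `version` key as a `MAJOR.MINOR.PATCH` string, and that a reader accepts documents with the same major version and an equal or older minor version. Both the key name and the compatibility policy are assumptions for illustration.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(doc: dict, supported: str) -> bool:
    """Check a document's schema version against the version a reader supports.

    Assumed policy: same major version, and the document's minor version
    is not newer than the reader's.
    """
    doc_major, doc_minor, _ = parse_version(doc["version"])
    sup_major, sup_minor, _ = parse_version(supported)
    return doc_major == sup_major and doc_minor <= sup_minor


# Hypothetical exported documents, trimmed to the fields used here.
older = {"schema_name": "DoclingDocument", "version": "1.3.0"}
newer_minor = {"schema_name": "DoclingDocument", "version": "1.6.0"}
newer_major = {"schema_name": "DoclingDocument", "version": "2.0.0"}
```

A reader could run such a check before parsing and route incompatible documents to a migration step instead of failing midway.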
Sources: tests/data/groundtruth/docling_v2/2203.01017v2.json2-3