Document Backends

Purpose and Scope

Document backends are responsible for parsing and extracting raw content from various input document formats. Each backend implements format-specific logic to read files and produce structured data that can be processed by Docling's pipelines. Backends operate at the lowest level of the conversion stack, providing a uniform interface for different document formats.

This page covers the backend architecture, backend types, and their implementations. For information about how backends are used within processing pipelines, see Processing Pipelines. For configuration options passed to backends, see Configuration and Pipeline Options.

Sources: docling/backend/abstract_backend.py:1-85, docling/document_converter.py:18-43

Backend Architecture

Docling backends follow a three-tier class hierarchy:

AbstractDocumentBackend

The base class for all document backends. It defines the minimal interface that every backend must implement:

is_valid(): Validates whether the document was successfully loaded
supports_pagination(): Indicates if the format has page-level structure
supported_formats(): Returns the set of InputFormat values this backend handles
unload(): Releases resources after processing

All backends receive three constructor arguments:

in_doc: The InputDocument containing metadata
path_or_stream: The actual file path or byte stream
options: Format-specific configuration (subclass of BaseBackendOptions)

Sources: docling/backend/abstract_backend.py:19-51

PaginatedDocumentBackend

Extends AbstractDocumentBackend for formats with page-based structure (primarily PDFs and images). These backends must implement page_count() and work with pipelines that process pages individually. Paginated backends do not directly produce a DoclingDocument; instead, they provide page-level data that pipelines transform incrementally.

Sources: docling/backend/abstract_backend.py:54-63

DeclarativeDocumentBackend

Extends AbstractDocumentBackend for formats that can be directly converted to DoclingDocument without multi-stage processing. These backends implement a convert() method that returns a fully-formed DoclingDocument. Office documents (DOCX, XLSX, PPTX), web formats (HTML, Markdown), and structured formats (CSV, LaTeX) use this approach.

Sources: docling/backend/abstract_backend.py:66-84
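
The three tiers described above can be sketched with Python's abc module. This is a simplified illustration, not Docling's actual code: the class names ending in Sketch, the PlainTextBackend toy class, the use of classmethods, and the dict standing in for DoclingDocument are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from io import BytesIO
from pathlib import Path
from typing import Optional, Set, Union


class AbstractBackendSketch(ABC):
    """Hypothetical stand-in for the base backend interface."""

    def __init__(self, in_doc: object,
                 path_or_stream: Union[Path, BytesIO],
                 options: Optional[dict] = None):
        self.in_doc = in_doc
        self.path_or_stream: Optional[Union[Path, BytesIO]] = path_or_stream
        self.options = options or {}

    @abstractmethod
    def is_valid(self) -> bool: ...

    @classmethod
    @abstractmethod
    def supports_pagination(cls) -> bool: ...

    @classmethod
    @abstractmethod
    def supported_formats(cls) -> Set[str]: ...

    def unload(self) -> None:
        # Close an in-memory stream if we were given one, then drop the reference.
        if isinstance(self.path_or_stream, BytesIO):
            self.path_or_stream.close()
        self.path_or_stream = None


class PaginatedBackendSketch(AbstractBackendSketch):
    """Adds the page-level contract; pipelines build the document page by page."""

    @abstractmethod
    def page_count(self) -> int: ...


class DeclarativeBackendSketch(AbstractBackendSketch):
    """Adds convert(), which returns the whole document in one call."""

    @abstractmethod
    def convert(self) -> dict: ...


class PlainTextBackend(DeclarativeBackendSketch):
    """Toy declarative backend: one text item per non-empty line."""

    def is_valid(self) -> bool:
        return self.path_or_stream is not None

    @classmethod
    def supports_pagination(cls) -> bool:
        return False

    @classmethod
    def supported_formats(cls) -> Set[str]:
        return {"txt"}

    def convert(self) -> dict:
        if isinstance(self.path_or_stream, Path):
            data = self.path_or_stream.read_text(encoding="utf-8")
        else:
            data = self.path_or_stream.getvalue().decode("utf-8")
        return {"texts": [line for line in data.splitlines() if line.strip()]}
```

Because the base classes are abstract, instantiating one directly raises TypeError; only a subclass that implements every abstract method, such as the toy PlainTextBackend, can be constructed.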

PDF Backend Architecture

PDF backends use a two-tier architecture: a document-level backend and a page-level backend. This design enables efficient page-by-page processing in threaded pipelines.

PdfPageBackend Interface

The PdfPageBackend abstract class defines the page-level operations needed by PDF processing pipelines:

Method	Purpose
`is_valid()`	Check if page was successfully loaded
`get_text_in_rect(bbox)`	Extract text within a bounding box
`get_text_cells()`	Return all text cells on the page
`get_segmented_page()`	Return structured page data (if available)
`get_bitmap_rects(scale)`	Get bounding boxes of embedded bitmaps
`get_page_image(scale, cropbox)`	Render page as PIL Image
`get_size()`	Return page dimensions
`unload()`	Release page resources

Sources: docling/backend/pdf_backend.py:24-81

DoclingParseDocumentBackend

Uses the docling-parse library (which wraps QPDF) to extract detailed text structure including words, lines, and character-level information. This backend provides the most comprehensive text extraction.

Key features:

Lazy page parsing: Pages are parsed on-demand when accessed
Configurable text granularity via DecodePageConfig (characters, words, lines)
Converts coordinate systems to top-left origin for consistency
Thread-safe through page unloading mechanism
Supports password-protected PDFs

Sources: docling/backend/docling_parse_backend.py:26-251, docling/backend/docling_parse_backend.py:202-251

Implementation Details:

The document backend initializes both pypdfium2 (for rendering) and docling-parse (for text extraction):

DoclingParseDocumentBackend.__init__:
  - Creates pypdfium2.PdfDocument for page rendering
  - Creates DoclingPdfParser and loads PdfDocument
  - Stores both for page backend creation

The page backend defers parsing until needed:

DoclingParsePageBackend._ensure_parsed():
  - Creates DecodePageConfig with desired granularity
  - Calls dp_doc.get_page() to parse page structure
  - Converts all TextCell coordinates to top-left origin
  - Caches result in self._dpage

Sources: docling/backend/docling_parse_backend.py:54-86
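
The deferred-parsing pattern and the coordinate flip can be illustrated without docling-parse. In the sketch below, Cell, LazyPage, and to_top_left are hypothetical names, and the parse callable stands in for the real parser call. It assumes the parser reports boxes with a bottom-left origin (y grows upward).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Cell:
    text: str
    left: float
    top: float
    right: float
    bottom: float


def to_top_left(cell: Cell, page_height: float) -> Cell:
    """Flip a bottom-left-origin box so that y is measured down from the page top."""
    return Cell(cell.text, cell.left, page_height - cell.top,
                cell.right, page_height - cell.bottom)


class LazyPage:
    """Parses on first access, caches the result, and can drop it again."""

    def __init__(self, parse: Callable[[], List[Cell]], page_height: float):
        self._parse = parse
        self._page_height = page_height
        self._cells: Optional[List[Cell]] = None

    def _ensure_parsed(self) -> List[Cell]:
        if self._cells is None:
            # The expensive call happens here, at most once until unload().
            self._cells = [to_top_left(c, self._page_height) for c in self._parse()]
        return self._cells

    def get_text_cells(self) -> List[Cell]:
        return self._ensure_parsed()

    def unload(self) -> None:
        self._cells = None
```

Constructing a LazyPage costs nothing; the parser runs only when get_text_cells() is first called, and a later unload() releases the cached cells.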

PyPdfiumDocumentBackend

Uses the pypdfium2 library (which wraps PDFium) for text extraction. This backend is lighter weight but provides less detailed text structure.

Key features:

Uses PDFium's text page API for extraction
Implements cell merging heuristics to combine fragmented text
Thread-safe through pypdfium2_lock for library calls
Supports password-protected PDFs
Compatible with pypdfium2 versions 4.x and 5.x

Implementation Details:

Text extraction occurs in _compute_text_cells():

PyPdfiumPageBackend._compute_text_cells():
  1. Get PdfTextPage from pypdfium2
  2. Iterate text rectangles via count_rects()/get_rect()
  3. Extract text for each rectangle
  4. Merge horizontally adjacent cells using heuristics
  5. Convert coordinates to top-left origin

The merging heuristics group cells in rows, then merge cells with small horizontal gaps and aligned baselines to reconstruct words and lines fragmented by PDFium.

Sources: docling/backend/pypdfium2_backend.py:108-272
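
A minimal sketch of this kind of row-wise merging follows. It is not the backend's actual heuristic: the Box type, the function names, and the max_gap and baseline_tol thresholds are assumptions chosen for illustration, and coordinates are taken to be top-left origin.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    text: str
    left: float
    top: float
    right: float
    bottom: float


def group_rows(cells: List[Box], baseline_tol: float) -> List[List[Box]]:
    """Group cells whose bottom edges lie within baseline_tol of the row's first cell."""
    rows: List[List[Box]] = []
    for cell in sorted(cells, key=lambda c: c.bottom):
        if rows and abs(cell.bottom - rows[-1][0].bottom) <= baseline_tol:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    return rows


def merge_cells(cells: List[Box], max_gap: float = 2.0,
                baseline_tol: float = 1.0) -> List[Box]:
    """Merge left-to-right neighbours in each row when the horizontal gap is small."""
    merged: List[Box] = []
    for row in group_rows(cells, baseline_tol):
        row.sort(key=lambda c: c.left)
        current = row[0]
        for nxt in row[1:]:
            if nxt.left - current.right <= max_gap:
                # Close enough: extend the current box and concatenate the text.
                current = Box(current.text + nxt.text,
                              current.left, min(current.top, nxt.top),
                              max(current.right, nxt.right),
                              max(current.bottom, nxt.bottom))
            else:
                merged.append(current)
                current = nxt
        merged.append(current)
    return merged
```

For instance, fragments "Hel" and "lo" separated by half a unit on the same baseline become one "Hello" cell, while a fragment 20 units further right stays separate.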

Thread Safety Considerations

Both PDF backends implement thread safety for concurrent page processing:

pypdfium2_lock: A global lock protects all pypdfium2 library calls since PDFium is not thread-safe
Page unloading: Pages can be unloaded after processing to free memory in threaded pipelines
Separate page instances: Each page gets its own PdfPageBackend instance to avoid shared state

The StandardPdfPipeline uses these features to process multiple pages concurrently across multiple documents.

Sources: docling/backend/docling_parse_backend.py:193-199, docling/backend/pypdfium2_backend.py:273-278, docling/utils/locks.py
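
The global-lock approach can be demonstrated with the standard library alone. Here library_lock is a hypothetical stand-in for pypdfium2_lock, and UnsafeRenderer is a made-up class that imitates a library which must not be entered by two threads at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the global pypdfium2_lock.
library_lock = threading.Lock()


class UnsafeRenderer:
    """Pretend non-thread-safe library; records how many calls overlap."""

    def __init__(self) -> None:
        self.active = 0
        self.max_active = 0

    def render(self, page_no: int) -> str:
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.001)  # simulate work inside the library
        self.active -= 1
        return f"image-{page_no}"


def render_page(renderer: UnsafeRenderer, page_no: int) -> str:
    # Every call into the library goes through the same lock.
    with library_lock:
        return renderer.render(page_no)


renderer = UnsafeRenderer()
with ThreadPoolExecutor(max_workers=4) as pool:
    images = list(pool.map(lambda n: render_page(renderer, n), range(8)))
```

With the lock in place, max_active never exceeds 1 even though four worker threads are running. The trade-off is that library calls are serialized, so any speed-up has to come from work done outside the lock.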

Declarative Backends

Declarative backends implement the convert() method to produce a complete DoclingDocument without pipeline processing. These backends parse the entire document structure at once.

Office Document Backends

Office backends parse Microsoft Office Open XML formats using dedicated Python libraries.

MsWordDocumentBackend

Parses DOCX files using python-docx. Key features:

Paragraph processing: Extracts text, styles, and formatting runs
List detection: Identifies numbered and bulleted lists with proper nesting
Table extraction: Converts Word tables to TableItem with TableCell structure supporting merged cells
Image extraction: Extracts embedded images and grouped pictures
Header/footer handling: Separates page headers and footers into document furniture
Comment extraction: Optionally includes document comments as annotations

Table Structure: Uses _parse_table() to iterate Word table cells and build TableData with proper row/column spans.

Sources: docling/backend/msword_backend.py, docling/document_converter.py:97-100

MsExcelDocumentBackend

Parses XLSX files using openpyxl. Key features:

Sheet iteration: Processes each worksheet as a separate document section
Efficient bounds detection: Determines actual data range to avoid processing empty cells
Table detection: Uses flood-fill algorithm to identify contiguous data regions as tables
Cell type handling: Resolves formulas to values, handles merged cells
Singleton cell filtering: Treats isolated single cells as TextItem rather than tables

Table Detection Algorithm:

_detect_tables_with_flood_fill():
  1. Create boolean matrix of non-empty cells
  2. Flood-fill from each unvisited cell to find contiguous regions
  3. Convert regions to bounding boxes
  4. Create TableItem for each region with >= 2 cells

Sources: docling/backend/msexcel_backend.py, docling/document_converter.py:92-94
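
The four steps above can be sketched on a plain list-of-lists grid. This is an illustrative version, not the backend's code: detect_regions is a hypothetical name, and 4-connectivity (up, down, left, right) and the treatment of None and empty strings as empty cells are assumptions.

```python
from collections import deque
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (min_row, min_col, max_row, max_col), inclusive


def detect_regions(grid: List[list]) -> Tuple[List[Region], List[Tuple[int, int]]]:
    """Return (bounding boxes of regions with >= 2 cells, positions of singleton cells)."""
    n_rows = len(grid)
    n_cols = max((len(row) for row in grid), default=0)
    # Step 1: boolean matrix of non-empty cells (short rows are padded as empty).
    filled = [[c < len(row) and row[c] not in (None, "") for c in range(n_cols)]
              for row in grid]
    seen = [[False] * n_cols for _ in range(n_rows)]
    tables: List[Region] = []
    singles: List[Tuple[int, int]] = []

    for r in range(n_rows):
        for c in range(n_cols):
            if not filled[r][c] or seen[r][c]:
                continue
            # Step 2: breadth-first flood fill from this unvisited cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            cells = []
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < n_rows and 0 <= nc < n_cols
                            and filled[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Steps 3 and 4: bounding box for multi-cell regions, singletons kept apart.
            if len(cells) >= 2:
                rows = [p[0] for p in cells]
                cols = [p[1] for p in cells]
                tables.append((min(rows), min(cols), max(rows), max(cols)))
            else:
                singles.append(cells[0])
    return tables, singles
```

On a sheet with a lone title cell and two separate data blocks, the function reports two table regions and one singleton, which mirrors the TableItem versus TextItem split described above.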

MsPowerpointDocumentBackend

Parses PPTX files using python-pptx. Key features:

Slide processing: Each slide becomes a page in the document
Shape hierarchy: Processes text boxes, tables, pictures, and grouped shapes
List handling: Detects bullet points and numbered lists with proper nesting under headers
Table extraction: Converts PowerPoint tables to TableItem structures
Notes extraction: Includes speaker notes as additional text items

List Processing: Groups list items under section headers based on indentation and ordering, ensuring proper hierarchical structure.

Sources: docling/backend/mspowerpoint_backend.py, docling/document_converter.py:102-104

Web and Markup Backends

HTMLDocumentBackend

Parses HTML files using BeautifulSoup4. Key features:

Tag-to-item mapping: Maps HTML elements to DocItem types (h1-h6 → SectionHeaderItem, p → TextItem, etc.)
Rich table support: Preserves table structure with proper handling of thead/tbody/tfoot, th elements as headers, and rowspan/colspan
Hierarchical parsing: Maintains document structure with proper nesting of sections and lists
Image handling: Configurable modes for embedded images (embedded base64, referenced files, or external URLs)
Block element handling: Special logic for handling block-level elements within paragraphs

Table Parsing: The _parse_table() method implements full HTML table semantics including:

Header detection (th elements, thead section)
Cell spanning (rowspan, colspan attributes)
Rich content in cells (nested paragraphs, lists, images)

Sources: docling/backend/html_backend.py, docling/document_converter.py:118-121
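
Cell spanning is the part of table parsing that most often goes wrong, so a small sketch may help. It shows one common way to place rowspan/colspan cells on a grid; it is not the actual _parse_table() implementation. The place_cells function and the (text, rowspan, colspan) tuples, used here in place of parsed HTML elements, are assumptions made for the example.

```python
from typing import Dict, List, Tuple

CellSpec = Tuple[str, int, int]  # (text, rowspan, colspan)


def place_cells(rows: List[List[CellSpec]]) -> Dict[Tuple[int, int], str]:
    """Map every (row, col) grid slot to the text of the cell covering it."""
    occupied: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in occupied:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied[(r + dr, c + dc)] = text
            c += colspan
    return occupied
```

A header cell with rowspan 2 in the first column pushes the second row's first cell into column 1, which is why the column index has to be resolved against the slots already taken rather than counted per row.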

MarkdownDocumentBackend

Parses Markdown files using marko. Key features:

AST traversal: Walks the marko abstract syntax tree to build document structure
HTML delegation: Detects embedded HTML blocks and delegates to HTMLDocumentBackend
Setext heading support: Handles both ATX (#) and Setext (underline) style headers
List processing: Converts Markdown lists (ordered and unordered) to proper list groups
Table support: Parses GFM-style pipe tables
Escaped character handling: Properly processes Markdown escape sequences

HTML Fallback: When Markdown contains HTML blocks, the backend creates a temporary HTMLDocumentBackend instance to parse those sections, then merges the results.

Sources: docling/backend/md_backend.py, docling/document_converter.py:107-110

Specialized Format Backends

ImageDocumentBackend

Wraps single images as documents for processing with the standard PDF pipeline. The image is treated as a single-page document where OCR and layout detection can be applied.

Usage: Configured via ImageFormatOption in DocumentConverter to use StandardPdfPipeline.

Sources: docling/backend/image_backend.py, docling/document_converter.py:134-136

LatexDocumentBackend

Parses LaTeX source files using pylatexenc for mathematical content and custom logic for document structure. Key features:

Environment detection: Identifies LaTeX environments (sections, lists, equations, figures, tables)
Math rendering: Preserves mathematical formulas as FormulaItem
Citation tracking: Extracts bibliography references
Preamble parsing: Processes document metadata and packages

Sources: docling/backend/latex_backend.py, docling/document_converter.py:150-155

CsvDocumentBackend

Parses CSV files using pandas. The entire CSV is treated as a single table in the document.

Configuration: Supports custom delimiter via CsvBackendOptions.

Sources: docling/backend/csv_backend.py, docling/document_converter.py:87-90
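
The "whole file as one table" idea with a configurable delimiter can be sketched as follows. To keep the example dependency-free it uses the standard-library csv module rather than pandas, and csv_to_table with its dict result is a hypothetical stand-in for the backend's table construction.

```python
import csv
from io import StringIO
from typing import Dict, List


def csv_to_table(text: str, delimiter: str = ",") -> Dict[str, object]:
    """Read all rows into one rectangular grid, padding short rows with empty strings."""
    rows: List[List[str]] = list(csv.reader(StringIO(text), delimiter=delimiter))
    num_cols = max((len(row) for row in rows), default=0)
    grid = [row + [""] * (num_cols - len(row)) for row in rows]
    return {"num_rows": len(grid), "num_cols": num_cols, "grid": grid}
```

Calling csv_to_table(data, delimiter=";") handles semicolon-separated exports; padding ragged rows keeps the grid rectangular so it can be treated as a single table.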

WebVTTDocumentBackend

Parses WebVTT (Web Video Text Tracks) subtitle files. Each cue is extracted as a TextItem with timestamp information preserved in provenance.

Sources: docling/backend/webvtt_backend.py, docling/document_converter.py:178-180
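
Cue extraction with timestamps can be illustrated by a small parser. This sketch handles only basic cues (optional identifier line, timing line, text lines) and ignores cue settings, styling, and notes. The parse_cues function and its tuple output are assumptions, not the backend's API.

```python
import re
from typing import List, Optional, Tuple

# Matches "hh:mm:ss.ttt --> hh:mm:ss.ttt"; the hours part is optional.
CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _seconds(hours: Optional[str], minutes: str, seconds: str, millis: str) -> float:
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_cues(vtt: str) -> List[Tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, text) for each cue block."""
    cues: List[Tuple[float, float, str]] = []
    # Cue blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", vtt.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIMING.match(line.strip())
            if match:
                groups = match.groups()
                text = " ".join(lines[i + 1:]).strip()
                cues.append((_seconds(*groups[:4]), _seconds(*groups[4:]), text))
                break
    return cues
```

Blocks without a timing line, such as the WEBVTT header, are skipped; the start and end times returned for each cue are the kind of information that the backend stores as provenance.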

NoOpBackend

A placeholder backend used for formats where the pipeline handles all processing (e.g., audio files processed by AsrPipeline). The backend does no parsing itself.

Sources: docling/backend/noop_backend.py, docling/document_converter.py:145-147

Backend Selection and Configuration

The DocumentConverter class manages backend selection and initialization through the format options system.

Format-to-Backend Mapping

Each InputFormat maps to a default FormatOption that specifies the backend class:

InputFormat	Default Backend	Pipeline
`PDF`	`DoclingParseDocumentBackend`	`StandardPdfPipeline`
`DOCX`	`MsWordDocumentBackend`	`SimplePipeline`
`XLSX`	`MsExcelDocumentBackend`	`SimplePipeline`
`PPTX`	`MsPowerpointDocumentBackend`	`SimplePipeline`
`HTML`	`HTMLDocumentBackend`	`SimplePipeline`
`MD`	`MarkdownDocumentBackend`	`SimplePipeline`
`IMAGE`	`ImageDocumentBackend`	`StandardPdfPipeline`
`CSV`	`CsvDocumentBackend`	`SimplePipeline`
`LATEX`	`LatexDocumentBackend`	`SimplePipeline`
`AUDIO`	`NoOpBackend`	`AsrPipeline`

Sources: docling/document_converter.py:158-186

Backend Options

Backend-specific configuration is passed through BackendOptions subclasses:

PdfBackendOptions:

password: SecretStr for encrypted PDFs
Configures both DoclingParseDocumentBackend and PyPdfiumDocumentBackend

HTMLBackendOptions:

image_handling_mode: Controls how images are processed (embedded, referenced, or external)

MarkdownBackendOptions:

Inherits HTMLBackendOptions for embedded HTML handling

LatexBackendOptions:

Configuration for LaTeX parsing behavior

Sources: docling/datamodel/backend_options.py, docling/backend/abstract_backend.py:8-12

Custom Backend Configuration

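Users can override the default backend and its options for any format by supplying a custom FormatOption for that InputFormat to DocumentConverter. Formats without an override keep their defaults.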

Sources: docling/document_converter.py:209-257
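
The override behaviour can be modelled in a few lines. The sketch below deliberately avoids importing Docling: FormatOptionSketch, DEFAULTS, and resolve_format_options are hypothetical names, backends and pipelines are represented by their class names as strings, and only three rows of the mapping table are reproduced.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FormatOptionSketch:
    """Hypothetical stand-in for a format option: backend, pipeline, backend options."""
    backend: str
    pipeline: str
    backend_options: Optional[dict] = None


# A subset of the default mapping from the table above.
DEFAULTS: Dict[str, FormatOptionSketch] = {
    "pdf": FormatOptionSketch("DoclingParseDocumentBackend", "StandardPdfPipeline"),
    "docx": FormatOptionSketch("MsWordDocumentBackend", "SimplePipeline"),
    "csv": FormatOptionSketch("CsvDocumentBackend", "SimplePipeline"),
}


def resolve_format_options(
    overrides: Optional[Dict[str, FormatOptionSketch]] = None,
) -> Dict[str, FormatOptionSketch]:
    """Start from the defaults and replace only the formats the caller overrides."""
    resolved = dict(DEFAULTS)  # copy, so the defaults themselves stay untouched
    resolved.update(overrides or {})
    return resolved


# Swap the PDF backend and pass a password; other formats keep their defaults.
options = resolve_format_options({
    "pdf": FormatOptionSketch("PyPdfiumDocumentBackend", "StandardPdfPipeline",
                              {"password": "example-password"}),
})
```

The design point is that overrides are per format: replacing the PDF entry leaves DOCX and CSV on their default backends.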

Backend Lifecycle

The lifecycle of a backend instance follows this pattern:

Initialization

InputDocument creates backend instance with document metadata
Backend validates the document (checks format, encryption, corruption)
InputDocument stores backend in _backend attribute
For paginated backends, page_count() is called to determine document length

Sources: docling/datamodel/document.py:137-225

Processing

Paginated backends: Pipeline calls load_page() for each page, processes it, then calls unload() on that page's backend
Declarative backends: Pipeline calls convert() once to get full DoclingDocument

Sources: docling/pipeline/standard_pdf_pipeline.py, docling/pipeline/simple_pipeline.py

Cleanup

The unload() method releases resources:

Closes file handles or BytesIO streams
Frees page caches
Releases third-party library resources (e.g., pypdfium2 PdfDocument)

unload() is called automatically after pipeline processing completes or when an error occurs.

Sources: docling/backend/abstract_backend.py:42-46, docling/backend/docling_parse_backend.py:193-199
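
The guarantee that cleanup runs on both success and failure is typically achieved with try/finally. The sketch below illustrates that pattern with a toy backend; CountingBackend and run_pipeline are hypothetical names and do not reflect how Docling's pipelines are actually structured.

```python
from typing import Callable, List


class CountingBackend:
    """Toy paginated backend that remembers whether unload() was called."""

    def __init__(self, pages: List[str]):
        self._pages = list(pages)
        self.unloaded = False

    def is_valid(self) -> bool:
        return not self.unloaded and len(self._pages) > 0

    def page_count(self) -> int:
        return len(self._pages)

    def load_page(self, page_no: int) -> str:
        return self._pages[page_no]

    def unload(self) -> None:
        self._pages = []
        self.unloaded = True


def run_pipeline(backend: CountingBackend, process: Callable[[str], str]) -> List[str]:
    """Process every page; release the backend whether or not processing succeeds."""
    try:
        if not backend.is_valid():
            raise ValueError("backend could not load the document")
        return [process(backend.load_page(i)) for i in range(backend.page_count())]
    finally:
        backend.unload()
```

If process raises partway through, the exception still propagates to the caller, but the finally clause has already released the backend's resources.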

Deprecated Backends

The following backend classes are deprecated and will be removed:

DoclingParseV2DocumentBackend: Removed in v2.74.0, use DoclingParseDocumentBackend
DoclingParseV4DocumentBackend: Removed in v2.74.0, use DoclingParseDocumentBackend

These were intermediate versions in the evolution of docling-parse. The current DoclingParseDocumentBackend uses docling-parse v5+, which consolidates all previous versions.

Sources: docling/backend/docling_parse_v2_backend.py, docling/backend/docling_parse_v4_backend.py, CHANGELOG.md:5
