Document backends are responsible for parsing and extracting raw content from various input document formats. Each backend implements format-specific logic to read files and produce structured data that can be processed by Docling's pipelines. Backends operate at the lowest level of the conversion stack, providing a uniform interface for different document formats.
This page covers the backend architecture, backend types, and their implementations. For information about how backends are used within processing pipelines, see Processing Pipelines. For configuration options passed to backends, see Configuration and Pipeline Options.
Sources: docling/backend/abstract_backend.py1-85 docling/document_converter.py18-43
Docling backends follow a three-tier class hierarchy:
AbstractDocumentBackend is the base class for all document backends. It defines the minimal interface that every backend must implement:
- is_valid(): Validates whether the document was successfully loaded
- supports_pagination(): Indicates if the format has page-level structure
- supported_formats(): Returns the set of InputFormat values this backend handles
- unload(): Releases resources after processing
All backends receive three constructor arguments:
- in_doc: The InputDocument containing metadata
- path_or_stream: The actual file path or byte stream
- options: Format-specific configuration (subclass of BaseBackendOptions)
Sources: docling/backend/abstract_backend.py19-51
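The interface described above can be pictured with a minimal sketch. This is a simplified stand-in, not Docling's source: the method names and the three constructor arguments follow the text, while `ToyTextBackend`, the string format names, the plain dict used for `options`, and the choice of classmethods are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from io import BytesIO
from pathlib import Path
from typing import Optional, Set, Union


class AbstractDocumentBackend(ABC):
    """Simplified model of the base interface (not the real Docling class)."""

    def __init__(
        self,
        in_doc: object,
        path_or_stream: Union[Path, BytesIO],
        options: Optional[dict] = None,
    ) -> None:
        self.in_doc = in_doc                  # metadata holder (InputDocument in Docling)
        self.path_or_stream = path_or_stream  # file path or in-memory byte stream
        self.options = options or {}          # stands in for a BaseBackendOptions subclass

    @abstractmethod
    def is_valid(self) -> bool: ...

    @classmethod
    @abstractmethod
    def supports_pagination(cls) -> bool: ...

    @classmethod
    @abstractmethod
    def supported_formats(cls) -> Set[str]: ...

    def unload(self) -> None:
        # Close an in-memory stream if we hold one, then drop the reference.
        if isinstance(self.path_or_stream, BytesIO):
            self.path_or_stream.close()
        self.path_or_stream = None


class ToyTextBackend(AbstractDocumentBackend):
    """Hypothetical backend for plain-text input, for illustration only."""

    def is_valid(self) -> bool:
        return self.path_or_stream is not None

    @classmethod
    def supports_pagination(cls) -> bool:
        return False

    @classmethod
    def supported_formats(cls) -> Set[str]:
        return {"txt"}


backend = ToyTextBackend(in_doc=None, path_or_stream=BytesIO(b"hello"))
was_valid = backend.is_valid()
backend.unload()
```

After `unload()` the toy backend no longer holds its stream, so `is_valid()` reports `False`; a real backend may define validity differently.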
PaginatedDocumentBackend extends AbstractDocumentBackend for formats with page-based structure (primarily PDFs and images). These backends must implement page_count() and work with pipelines that process pages individually. Paginated backends do not directly produce a DoclingDocument; instead, they provide page-level data that pipelines transform incrementally.
Sources: docling/backend/abstract_backend.py54-63
DeclarativeDocumentBackend extends AbstractDocumentBackend for formats that can be converted directly to a DoclingDocument without multi-stage processing. These backends implement a convert() method that returns a fully formed DoclingDocument. Office documents (DOCX, XLSX, PPTX), web formats (HTML, Markdown), and structured formats (CSV, LaTeX) use this approach.
Sources: docling/backend/abstract_backend.py66-84
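The split between paginated and declarative backends can be sketched as follows. The class names here (`AbstractBackend`, `PaginatedBackend`, `DeclarativeBackend`, `ToyPdf`, `ToyMarkdown`) and the `describe` helper are hypothetical simplifications, and a plain dict stands in for DoclingDocument.

```python
from abc import ABC, abstractmethod
from typing import List


class AbstractBackend(ABC):
    @classmethod
    @abstractmethod
    def supports_pagination(cls) -> bool: ...


class PaginatedBackend(AbstractBackend):
    """Tier for page-based formats: exposes pages, not a finished document."""

    @classmethod
    def supports_pagination(cls) -> bool:
        return True

    @abstractmethod
    def page_count(self) -> int: ...


class DeclarativeBackend(AbstractBackend):
    """Tier for formats that are converted in a single step."""

    @classmethod
    def supports_pagination(cls) -> bool:
        return False

    @abstractmethod
    def convert(self) -> dict: ...


class ToyPdf(PaginatedBackend):
    def __init__(self, pages: List[str]) -> None:
        self.pages = pages

    def page_count(self) -> int:
        return len(self.pages)


class ToyMarkdown(DeclarativeBackend):
    def __init__(self, text: str) -> None:
        self.text = text

    def convert(self) -> dict:
        # A dict stands in for DoclingDocument in this sketch.
        return {"paragraphs": [p for p in self.text.split("\n\n") if p]}


def describe(backend: AbstractBackend) -> str:
    """Dispatch on the tier, the way a caller might choose a processing path."""
    if backend.supports_pagination():
        return f"paginated, {backend.page_count()} page(s)"
    return f"declarative, {len(backend.convert()['paragraphs'])} paragraph(s)"
```

The point of the design is that a caller only needs `supports_pagination()` to decide whether to iterate over pages or to request the whole document at once.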
PDF backends use a two-tier architecture: a document-level backend and a page-level backend. This design enables efficient page-by-page processing in threaded pipelines.
The PdfPageBackend abstract class defines the page-level operations needed by PDF processing pipelines:
| Method | Purpose |
|---|---|
| is_valid() | Check if page was successfully loaded |
| get_text_in_rect(bbox) | Extract text within a bounding box |
| get_text_cells() | Return all text cells on the page |
| get_segmented_page() | Return structured page data (if available) |
| get_bitmap_rects(scale) | Get bounding boxes of embedded bitmaps |
| get_page_image(scale, cropbox) | Render page as PIL Image |
| get_size() | Return page dimensions |
| unload() | Release page resources |
Sources: docling/backend/pdf_backend.py24-81
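To make get_text_in_rect(bbox) concrete, here is a small sketch of one way such a lookup could work over a list of text cells. The `BBox` and `Cell` types, the strict-overlap test, and the reading-order sort are assumptions for illustration; the actual backends may delegate this to their underlying PDF library.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BBox:
    # Top-left origin, so t < b for a non-empty box.
    l: float
    t: float
    r: float
    b: float

    def intersects(self, other: "BBox") -> bool:
        return (
            self.l < other.r and other.l < self.r
            and self.t < other.b and other.t < self.b
        )


@dataclass(frozen=True)
class Cell:
    text: str
    bbox: BBox


def get_text_in_rect(cells: List[Cell], bbox: BBox) -> str:
    """Join the text of every cell that overlaps bbox, in reading order."""
    hits = [c for c in cells if c.bbox.intersects(bbox)]
    hits.sort(key=lambda c: (c.bbox.t, c.bbox.l))
    return " ".join(c.text for c in hits)


cells = [
    Cell("Hello", BBox(10, 10, 50, 20)),
    Cell("world", BBox(55, 10, 95, 20)),
    Cell("footer", BBox(10, 700, 60, 710)),
]
header_text = get_text_in_rect(cells, BBox(0, 0, 100, 30))
```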
DoclingParseDocumentBackend uses the docling-parse library (which wraps QPDF) to extract detailed text structure, including words, lines, and character-level information. This backend provides the most comprehensive text extraction.
Key features:
- Configurable text granularity via DecodePageConfig (characters, words, lines)
Sources: docling/backend/docling_parse_backend.py26-251 docling/backend/docling_parse_backend.py202-251
Implementation Details:
The document backend initializes both pypdfium2 (for rendering) and docling-parse (for text extraction):
DoclingParseDocumentBackend.__init__:
- Creates pypdfium2.PdfDocument for page rendering
- Creates DoclingPdfParser and loads PdfDocument
- Stores both for page backend creation
The page backend defers parsing until needed:
DoclingParsePageBackend._ensure_parsed():
- Creates DecodePageConfig with desired granularity
- Calls dp_doc.get_page() to parse page structure
- Converts all TextCell coordinates to top-left origin
- Caches result in self._dpage
Sources: docling/backend/docling_parse_backend.py54-86
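The deferred-parsing pattern outlined above (parse on first access, convert coordinates, cache the result) can be sketched in isolation. `LazyPage`, `to_top_left`, the tuple-based rectangle, and the `parse_fn` callback are hypothetical; the sketch does not call docling-parse and only assumes the parser reports rectangles with a bottom-left origin.

```python
from typing import Callable, List, Optional, Tuple

# (left, bottom, right, top) in a bottom-left-origin coordinate system.
Rect = Tuple[float, float, float, float]


def to_top_left(rect: Rect, page_height: float) -> Rect:
    """Flip a bottom-left-origin rectangle to (left, top, right, bottom), top-left origin."""
    left, bottom, right, top = rect
    return (left, page_height - top, right, page_height - bottom)


class LazyPage:
    def __init__(self, parse_fn: Callable[[], List[Rect]], page_height: float) -> None:
        self._parse_fn = parse_fn          # expensive parse, deferred until needed
        self._page_height = page_height
        self._cells: Optional[List[Rect]] = None
        self.parse_calls = 0               # only here to make the caching visible

    def _ensure_parsed(self) -> List[Rect]:
        if self._cells is None:
            self.parse_calls += 1
            raw = self._parse_fn()
            self._cells = [to_top_left(r, self._page_height) for r in raw]
        return self._cells

    def get_text_cells(self) -> List[Rect]:
        return self._ensure_parsed()


page = LazyPage(lambda: [(10.0, 700.0, 100.0, 720.0)], page_height=792.0)
calls_before = page.parse_calls
first = page.get_text_cells()
second = page.get_text_cells()
```

Constructing the page costs nothing; the parse runs once on the first accessor call, and later calls return the cached, already-converted cells.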
PyPdfiumDocumentBackend uses the pypdfium2 library (which wraps PDFium) for text extraction. This backend is lighter weight but provides less detailed text structure.
Key features:
- Thread safety via pypdfium2_lock for library calls
Implementation Details:
Text extraction occurs in _compute_text_cells():
PyPdfiumPageBackend._compute_text_cells():
1. Get PdfTextPage from pypdfium2
2. Iterate text rectangles via count_rects()/get_rect()
3. Extract text for each rectangle
4. Merge horizontally adjacent cells using heuristics
5. Convert coordinates to top-left origin
The merging heuristics group cells in rows, then merge cells with small horizontal gaps and aligned baselines to reconstruct words and lines fragmented by PDFium.
Sources: docling/backend/pypdfium2_backend.py108-272
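The merging step can be illustrated with a small sketch: group cells into rows by baseline, then join neighbours whose horizontal gap is small. The `Cell` type, the helper names, and the thresholds `max_gap` and `baseline_tol` are hypothetical values chosen for the example, not the heuristics used in pypdfium2_backend.py.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    text: str
    l: float
    t: float
    r: float
    b: float  # top-left origin; b approximates the baseline


def group_rows(cells: List[Cell], baseline_tol: float) -> List[List[Cell]]:
    """Put cells whose bottoms lie within baseline_tol of the row's first cell together."""
    rows: List[List[Cell]] = []
    for cell in sorted(cells, key=lambda c: c.b):
        if rows and abs(rows[-1][0].b - cell.b) <= baseline_tol:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    return rows


def merge_cells(cells: List[Cell], max_gap: float = 2.0, baseline_tol: float = 1.0) -> List[Cell]:
    merged: List[Cell] = []
    for row in group_rows(cells, baseline_tol):
        row.sort(key=lambda c: c.l)
        current = row[0]
        for nxt in row[1:]:
            if nxt.l - current.r <= max_gap:
                # Small gap: treat as fragments of the same run and join them.
                current = Cell(
                    current.text + nxt.text,
                    current.l,
                    min(current.t, nxt.t),
                    max(current.r, nxt.r),
                    max(current.b, nxt.b),
                )
            else:
                merged.append(current)
                current = nxt
        merged.append(current)
    return merged


fragments = [
    Cell("wor", 10, 10, 30, 20),
    Cell("ld", 31, 10, 45, 20.5),
    Cell("next", 80, 10, 110, 20),
    Cell("below", 10, 40, 50, 50),
]
result = merge_cells(fragments)
```

Here "wor" and "ld" sit one unit apart on nearly the same baseline and are joined, while "next" is too far to the right and "below" is on another row.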
Both PDF backends implement thread safety for concurrent page processing:
- Each page gets its own PdfPageBackend instance to avoid shared state
The StandardPdfPipeline uses these features to process multiple pages concurrently across multiple documents.
Sources: docling/backend/docling_parse_backend.py193-199 docling/backend/pypdfium2_backend.py273-278 docling/utils/locks.py
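The two ideas above, a shared lock around library calls and one page backend per page, can be sketched with the standard library alone. `FakeLibrary`, `PageBackend`, `render_all`, and `library_lock` are hypothetical stand-ins; `library_lock` plays the role that pypdfium2_lock has in the text.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

library_lock = threading.Lock()  # stand-in for pypdfium2_lock


class FakeLibrary:
    """Stand-in for a native library that is not safe to call concurrently."""

    def __init__(self) -> None:
        self.calls = 0

    def render(self, page_no: int) -> str:
        self.calls += 1  # shared mutable state, so callers must hold the lock
        return f"bitmap-{page_no}"


class PageBackend:
    """One instance per page, so no page-level state is shared between threads."""

    def __init__(self, lib: FakeLibrary, page_no: int) -> None:
        self._lib = lib
        self._page_no = page_no

    def get_page_image(self) -> str:
        with library_lock:  # serialize every call into the library
            return self._lib.render(self._page_no)


def render_all(lib: FakeLibrary, n_pages: int, workers: int = 4) -> List[str]:
    pages = [PageBackend(lib, i) for i in range(n_pages)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: p.get_page_image(), pages))


lib = FakeLibrary()
images = render_all(lib, 50)
```

Only the library call is serialized; work done on the returned data outside the `with` block can still run in parallel.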
Declarative backends implement the convert() method to produce a complete DoclingDocument without pipeline processing. These backends parse the entire document structure at once.
Office backends parse Microsoft Office Open XML formats using dedicated Python libraries.
MsWordDocumentBackend parses DOCX files using python-docx. Key features:
- Tables converted to TableItem with TableCell structure supporting merged cells
- furniture
Table Structure: Uses _parse_table() to iterate Word table cells and build TableData with proper row/column spans.
Sources: docling/backend/msword_backend.py docling/document_converter.py97-100
MsExcelDocumentBackend parses XLSX files using openpyxl. Key features:
- Single-cell regions emitted as TextItem rather than tables
Table Detection Algorithm:
_detect_tables_with_flood_fill():
1. Create boolean matrix of non-empty cells
2. Flood-fill from each unvisited cell to find contiguous regions
3. Convert regions to bounding boxes
4. Create TableItem for each region with >= 2 cells
Sources: docling/backend/msexcel_backend.py docling/document_converter.py92-94
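The four steps listed above can be sketched on a plain grid of cell values. This is an illustrative implementation under assumptions: 4-connectivity between cells, `None` or an empty string meaning "empty", and the names `detect_tables`, `Grid`, and `Box` are all hypothetical rather than taken from msexcel_backend.py.

```python
from collections import deque
from typing import List, Optional, Tuple

Grid = List[List[Optional[str]]]
Box = Tuple[int, int, int, int]  # (min_row, min_col, max_row, max_col), inclusive


def detect_tables(grid: Grid, min_cells: int = 2) -> List[Box]:
    n_rows = len(grid)
    n_cols = max((len(row) for row in grid), default=0)
    # Step 1: boolean matrix of non-empty cells (ragged rows count as empty).
    filled = [
        [c < len(row) and row[c] not in (None, "") for c in range(n_cols)]
        for row in grid
    ]
    seen = [[False] * n_cols for _ in range(n_rows)]
    boxes: List[Box] = []
    for r in range(n_rows):
        for c in range(n_cols):
            if not filled[r][c] or seen[r][c]:
                continue
            # Step 2: flood-fill (breadth-first) from this unvisited cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            region = []
            while queue:
                cr, cc = queue.popleft()
                region.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (
                        0 <= nr < n_rows and 0 <= nc < n_cols
                        and filled[nr][nc] and not seen[nr][nc]
                    ):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Steps 3 and 4: keep regions with enough cells, as bounding boxes.
            if len(region) >= min_cells:
                rows = [p[0] for p in region]
                cols = [p[1] for p in region]
                boxes.append((min(rows), min(cols), max(rows), max(cols)))
    return boxes


sheet: Grid = [
    ["a", "b", None, None],
    ["1", "2", None, "x"],
    [None, None, None, None],
    [None, None, "k", "v"],
]
tables = detect_tables(sheet)
```

In the sample sheet, the 2x2 block and the "k", "v" pair are reported as tables, while the isolated "x" falls below the two-cell threshold.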
MsPowerpointDocumentBackend parses PPTX files using python-pptx. Key features:
- Tables converted to TableItem structures
List Processing: Groups list items under section headers based on indentation and ordering, ensuring proper hierarchical structure.
Sources: docling/backend/mspowerpoint_backend.py docling/document_converter.py102-104
HTMLDocumentBackend parses HTML files using BeautifulSoup4.
Table Parsing: The _parse_table() method implements full HTML table semantics.
Sources: docling/backend/html_backend.py docling/document_converter.py118-121
MarkdownDocumentBackend parses Markdown files using marko. Key features:
- Embedded HTML delegated to HTMLDocumentBackend
HTML Fallback: When Markdown contains HTML blocks, the backend creates a temporary HTMLDocumentBackend instance to parse those sections, then merges the results.
Sources: docling/backend/md_backend.py docling/document_converter.py107-110
ImageDocumentBackend wraps single images as documents for processing with the standard PDF pipeline. The image is treated as a single-page document to which OCR and layout detection can be applied.
Usage: Configured via ImageFormatOption in DocumentConverter to use StandardPdfPipeline.
Sources: docling/backend/image_backend.py docling/document_converter.py134-136
LatexDocumentBackend parses LaTeX source files using pylatexenc for mathematical content and custom logic for document structure. Key features:
- Mathematical content emitted as FormulaItem
Sources: docling/backend/latex_backend.py docling/document_converter.py150-155
CsvDocumentBackend parses CSV files using pandas. The entire CSV is treated as a single table in the document.
Configuration: Supports custom delimiter via CsvBackendOptions.
Sources: docling/backend/csv_backend.py docling/document_converter.py87-90
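As a rough picture of "one CSV, one table" with a configurable delimiter, the sketch below uses the standard library `csv` module instead of pandas, purely to stay dependency-free. The `csv_to_table` function and the dict it returns are hypothetical and do not mirror Docling's TableItem.

```python
import csv
from io import StringIO
from typing import List


def csv_to_table(text: str, delimiter: str = ",") -> dict:
    """Read the whole CSV into one rectangular grid."""
    rows: List[List[str]] = list(csv.reader(StringIO(text), delimiter=delimiter))
    num_cols = max((len(r) for r in rows), default=0)
    # Pad short rows so every row has the same number of columns.
    grid = [r + [""] * (num_cols - len(r)) for r in rows]
    return {"num_rows": len(grid), "num_cols": num_cols, "grid": grid}


table = csv_to_table("name;qty\napple;3\npear\n", delimiter=";")
```

Passing `delimiter=";"` here corresponds to the kind of setting the text says CsvBackendOptions exposes.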
The WebVTT backend parses WebVTT (Web Video Text Tracks) subtitle files. Each cue is extracted as a TextItem, with timestamp information preserved in provenance.
Sources: docling/backend/webvtt_backend.py docling/document_converter.py178-180
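Cue extraction can be sketched with a small regular-expression parser. This is a simplified reading of the WebVTT format (header, NOTE blocks, optional cue identifiers, cue settings ignored) and not the parser used in webvtt_backend.py; `parse_cues` and `CUE_TIMING` are hypothetical names.

```python
import re
from typing import List, Tuple

# Matches "hh:mm:ss.ttt --> hh:mm:ss.ttt" with the hours part optional.
CUE_TIMING = re.compile(
    r"(?P<start>(?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+"
    r"(?P<end>(?:\d{2,}:)?\d{2}:\d{2}\.\d{3})"
)


def parse_cues(vtt_text: str) -> List[Tuple[str, str, str]]:
    """Return (start, end, text) for every cue; other blocks are skipped."""
    cues: List[Tuple[str, str, str]] = []
    blocks = re.split(r"\n\s*\n", vtt_text.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = CUE_TIMING.match(line)
            if match:
                # Everything after the timing line is the cue payload.
                text = " ".join(part.strip() for part in lines[i + 1:] if part.strip())
                cues.append((match["start"], match["end"], text))
                break
    return cues


sample = (
    "WEBVTT\n\n"
    "1\n00:00:01.000 --> 00:00:04.000\nHello there.\n\n"
    "NOTE a comment\n\n"
    "00:05.500 --> 00:07.000 align:start\nSecond cue\nspans two lines.\n"
)
cues = parse_cues(sample)
```

Keeping the start and end times alongside the text is what allows a backend to carry timestamps into each item's provenance.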
NoOpBackend is a placeholder backend used for formats where the pipeline handles all processing (e.g., audio files processed by AsrPipeline). The backend does no parsing itself.
Sources: docling/backend/noop_backend.py docling/document_converter.py145-147
The DocumentConverter class manages backend selection and initialization through the format options system.
Each InputFormat maps to a default FormatOption that specifies the backend class:
| InputFormat | Default Backend | Pipeline |
|---|---|---|
| PDF | DoclingParseDocumentBackend | StandardPdfPipeline |
| DOCX | MsWordDocumentBackend | SimplePipeline |
| XLSX | MsExcelDocumentBackend | SimplePipeline |
| PPTX | MsPowerpointDocumentBackend | SimplePipeline |
| HTML | HTMLDocumentBackend | SimplePipeline |
| MD | MarkdownDocumentBackend | SimplePipeline |
| IMAGE | ImageDocumentBackend | StandardPdfPipeline |
| CSV | CsvDocumentBackend | SimplePipeline |
| LATEX | LatexDocumentBackend | SimplePipeline |
| AUDIO | NoOpBackend | AsrPipeline |
Sources: docling/document_converter.py158-186
Backend-specific configuration is passed through BackendOptions subclasses:
PdfBackendOptions:
- password: SecretStr for encrypted PDFs
- Used by DoclingParseDocumentBackend and PyPdfiumDocumentBackend
HTMLBackendOptions:
- image_handling_mode: Controls how images are processed (embedded, referenced, or external)
MarkdownBackendOptions:
- Relies on HTMLBackendOptions for embedded HTML handling
LatexBackendOptions:
Sources: docling/datamodel/backend_options.py docling/backend/abstract_backend.py8-12
Users can override default backends and options:
Sources: docling/document_converter.py209-257
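The override mechanism can be modelled as a default mapping that user-supplied entries replace per format. The sketch below is a toy model, not the DocumentConverter API: `FormatOption` here is a hypothetical dataclass, backends and pipelines are represented by their names as strings, and `resolve_options` is an invented helper.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FormatOption:
    backend: str                            # backend class name (string stand-in)
    pipeline: str                           # pipeline class name (string stand-in)
    backend_options: Optional[dict] = None  # stand-in for a BackendOptions subclass


# A few defaults, taken from the mapping table in the text.
DEFAULTS: Dict[str, FormatOption] = {
    "pdf": FormatOption("DoclingParseDocumentBackend", "StandardPdfPipeline"),
    "docx": FormatOption("MsWordDocumentBackend", "SimplePipeline"),
    "csv": FormatOption("CsvDocumentBackend", "SimplePipeline"),
}


def resolve_options(overrides: Optional[Dict[str, FormatOption]] = None) -> Dict[str, FormatOption]:
    """Start from the defaults and let user entries replace them per format."""
    resolved = dict(DEFAULTS)
    resolved.update(overrides or {})
    return resolved


custom = resolve_options(
    {
        "pdf": FormatOption(
            "PyPdfiumDocumentBackend",
            "StandardPdfPipeline",
            backend_options={"password": "secret"},
        )
    }
)
```

Formats without an override keep their default entry, so switching the PDF backend leaves DOCX and CSV handling untouched.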
The lifecycle of a backend instance follows this pattern:
1. InputDocument creates backend instance with document metadata
2. InputDocument stores backend in _backend attribute
3. page_count() is called to determine document length
Sources: docling/datamodel/document.py137-225
- Paginated backends: the pipeline calls load_page() for each page, processes it, then calls the page backend's unload() method
- Declarative backends: the pipeline calls convert() once to get the full DoclingDocument
Sources: docling/pipeline/standard_pdf_pipeline.py docling/pipeline/simple_pipeline.py
The unload() method releases resources:
- BytesIO streams
This is called automatically after pipeline processing completes or on error.
Sources: docling/backend/abstract_backend.py42-46 docling/backend/docling_parse_backend.py193-199
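The lifecycle for a paginated backend, including cleanup on error, can be sketched with nested try/finally blocks. `ToyPage`, `ToyPaginatedBackend`, and `run_paginated` are hypothetical; the log list exists only to make the order of calls visible.

```python
from typing import Callable, List


class ToyPage:
    def __init__(self, number: int, log: List[str]) -> None:
        self.number = number
        self._log = log

    def unload(self) -> None:
        self._log.append(f"page {self.number} unloaded")


class ToyPaginatedBackend:
    def __init__(self, n_pages: int, log: List[str]) -> None:
        self._n_pages = n_pages
        self._log = log

    def page_count(self) -> int:
        return self._n_pages

    def load_page(self, number: int) -> ToyPage:
        self._log.append(f"page {number} loaded")
        return ToyPage(number, self._log)

    def unload(self) -> None:
        self._log.append("backend unloaded")


def run_paginated(backend: ToyPaginatedBackend, process: Callable[[ToyPage], int]) -> List[int]:
    results: List[int] = []
    try:
        for number in range(backend.page_count()):
            page = backend.load_page(number)
            try:
                results.append(process(page))
            finally:
                page.unload()   # release page resources even if processing fails
    finally:
        backend.unload()        # release document resources on success or error
    return results


log: List[str] = []
results = run_paginated(ToyPaginatedBackend(2, log), lambda page: page.number * 10)
```

Because both releases sit in `finally` clauses, an exception raised while processing a page still unloads that page and then the backend before propagating.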
The following backend classes are deprecated and will be removed:
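- The docling-parse v2 backend (docling/backend/docling_parse_v2_backend.py): superseded by DoclingParseDocumentBackend
- The docling-parse v4 backend (docling/backend/docling_parse_v4_backend.py): superseded by DoclingParseDocumentBackend
These were intermediate versions during the evolution of docling-parse. The current DoclingParseDocumentBackend uses docling-parse v5+, which consolidates all previous versions.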
Sources: docling/backend/docling_parse_v2_backend.py docling/backend/docling_parse_v4_backend.py CHANGELOG.md5