DocumentConverter API

Purpose and Scope

This page documents the DocumentConverter class, which is the primary Python API for converting documents in Docling. It covers initialization, configuration with format options, conversion methods (convert(), convert_all(), convert_string()), and the structure of conversion results.

For information about configuring pipeline options (OCR, table structure, VLM settings), see Pipeline Configuration API. For practical code examples, see Usage Examples.

DocumentConverter Class

The DocumentConverter class docling/document_converter.py189-661 orchestrates document conversion by routing input documents to appropriate processing pipelines based on their format. It manages pipeline initialization, caching, and execution.

Initialization

The constructor accepts two parameters to control which document formats are processed and how they are configured:

Parameters:

allowed_formats: List of InputFormat enum values specifying which document types to accept. If None, all supported formats are allowed docling/document_converter.py221-223
format_options: Dictionary mapping InputFormat to FormatOption instances that specify the pipeline class, backend, and pipeline-specific options for each format docling/document_converter.py250-257

Attributes:

allowed_formats: The list of allowed input formats docling/document_converter.py221-223
format_to_options: Mapping of formats to their FormatOption configurations docling/document_converter.py250-257
initialized_pipelines: Cache of initialized pipelines keyed by (pipeline_class, options_hash) docling/document_converter.py258-260
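
The way the two constructor parameters combine can be sketched with stand-in types. This is only an illustration under stated assumptions: `Fmt`, `DEFAULTS`, and `resolve_options` are hypothetical names, plain strings stand in for FormatOption instances, and the fallback to a default for formats without an override is assumed rather than taken from the Docling source.

```python
from enum import Enum
from typing import Optional


class Fmt(Enum):  # hypothetical stand-in for InputFormat
    PDF = "pdf"
    DOCX = "docx"
    MD = "md"


# Hypothetical defaults; strings stand in for FormatOption instances.
DEFAULTS = {fmt: f"default-{fmt.value}" for fmt in Fmt}


def resolve_options(
    allowed_formats: Optional[list] = None,
    format_options: Optional[dict] = None,
) -> dict:
    """Return one option per allowed format, preferring user overrides."""
    # None means "accept every supported format".
    allowed = list(Fmt) if allowed_formats is None else allowed_formats
    overrides = format_options or {}
    return {fmt: overrides.get(fmt, DEFAULTS[fmt]) for fmt in allowed}


everything = resolve_options()
pdf_only = resolve_options([Fmt.PDF], {Fmt.PDF: "custom-pdf"})
```

Here `everything` covers all stand-in formats with their defaults, while `pdf_only` contains a single entry carrying the override.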

Sources: docling/document_converter.py189-261

Format Options Hierarchy

Each input format is associated with a FormatOption that specifies the processing pipeline and document backend to use. The format options form a class hierarchy.

Key FormatOption Fields docling/document_converter.py75-84:

pipeline_cls: The BasePipeline subclass to use for processing
backend: The AbstractDocumentBackend subclass to use for parsing
pipeline_options: Optional PipelineOptions instance for pipeline configuration
backend_options: Optional BackendOptions instance for backend configuration

Default Format Options:

The system provides default options for each format via _get_default_option() docling/document_converter.py158-186. Common configurations include:

Input Format	Pipeline	Backend
`InputFormat.PDF`	`StandardPdfPipeline`	`DoclingParseDocumentBackend`
`InputFormat.DOCX`	`SimplePipeline`	`MsWordDocumentBackend`
`InputFormat.XLSX`	`SimplePipeline`	`MsExcelDocumentBackend`
`InputFormat.PPTX`	`SimplePipeline`	`MsPowerpointDocumentBackend`
`InputFormat.HTML`	`SimplePipeline`	`HTMLDocumentBackend`
`InputFormat.MD`	`SimplePipeline`	`MarkdownDocumentBackend`
`InputFormat.IMAGE`	`StandardPdfPipeline`	`ImageDocumentBackend`
`InputFormat.AUDIO`	`AsrPipeline`	`NoOpBackend`

Sources: docling/document_converter.py75-186 docling/datamodel/base_models.py37-44

Pipeline Caching

The DocumentConverter caches initialized pipelines to avoid redundant model loading. Pipelines are keyed by a tuple of (pipeline_class, options_hash), where options_hash is an MD5 hash of the serialized pipeline options docling/document_converter.py267-272.

Cache Key Generation docling/document_converter.py267-272:

Serialize pipeline options using pipeline_options.model_dump()
Convert to string representation
Compute MD5 hash (with usedforsecurity=False)
Use (pipeline_class, hash) as cache key
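
The four steps above can be sketched with the standard library. This is a minimal sketch, not the converter's own code: `cache_key` and `DemoPipeline` are hypothetical names, and a plain dict stands in for the output of pipeline_options.model_dump(). The option names in the dicts are illustrative only.

```python
import hashlib


def cache_key(pipeline_class: type, options: dict) -> tuple:
    """Build a (pipeline_class, options_hash) key from serialized options."""
    # `options` stands in for the dict returned by pipeline_options.model_dump().
    serialized = str(options)
    digest = hashlib.md5(
        serialized.encode("utf-8"), usedforsecurity=False
    ).hexdigest()
    return (pipeline_class, digest)


class DemoPipeline:  # hypothetical pipeline class
    pass


key_a = cache_key(DemoPipeline, {"do_ocr": True, "do_table_structure": True})
key_b = cache_key(DemoPipeline, {"do_ocr": True, "do_table_structure": True})
key_c = cache_key(DemoPipeline, {"do_ocr": False, "do_table_structure": True})
```

Identical options yield the same key (`key_a == key_b`), while changing any option value yields a different one (`key_c`), so a differently configured pipeline is never served from the cache.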

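Because initialized_pipelines is an attribute of the converter, the cache is scoped to a single DocumentConverter instance: formats configured with the same pipeline class and identical options share one initialized pipeline, avoiding duplicate model loading.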

Sources: docling/document_converter.py262-292

Conversion Methods

convert()

Converts a single document from a file path, URL, or DocumentStream:

Parameters:

source: Input document as file path (Path), URL (str), or DocumentStream
headers: HTTP headers for URL sources (dict of string key-value pairs)
raises_on_error: If True, raises ConversionError on failure; if False, errors are captured in ConversionResult.errors
max_num_pages: Maximum pages to accept (documents exceeding this are skipped)
max_file_size: Maximum file size in bytes
page_range: Tuple of (start_page, end_page) for selective conversion (1-indexed)

Returns:

ConversionResult containing the converted DoclingDocument and metadata

Raises:

ConversionError: If conversion fails and raises_on_error=True
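
To illustrate the 1-indexed page_range parameter, the sketch below selects page numbers from a document of known length. `select_pages` is a hypothetical helper, and treating the end of the range as inclusive is an assumption, not something stated above.

```python
import sys


def select_pages(page_count: int, page_range: tuple) -> list:
    """Return the 1-indexed page numbers that fall inside page_range."""
    start, end = page_range
    # Assumption: both ends of the range are inclusive.
    return [page for page in range(1, page_count + 1) if start <= page <= end]


middle = select_pages(10, (2, 4))
whole = select_pages(3, (1, sys.maxsize))  # a wide upper bound keeps every page
```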

Sources: docling/document_converter.py294-336

convert_all()

Converts multiple documents in batch, yielding results as they complete:

Parameters:

source: Iterable of document sources (file paths, URLs, or DocumentStream objects)
headers: Single set of HTTP headers applied to all URL sources
Other parameters: same as convert()

Returns:

Iterator yielding ConversionResult instances as documents are processed

Behavior:

Documents are processed in order
Results are yielded incrementally (useful for progress tracking)
If raises_on_error=False, failed documents yield ConversionResult with status=FAILURE
If raises_on_error=True, first failure stops iteration and raises exception
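
The behavior listed above can be modeled with a plain generator. This is a sketch under assumptions rather than the converter's implementation: `Status`, `DemoConversionError`, `convert_all_sketch`, and `fake_convert` are hypothetical stand-ins, and a ValueError plays the role of a failed conversion.

```python
from enum import Enum
from typing import Callable, Iterable, Iterator


class Status(Enum):  # hypothetical stand-in for ConversionStatus
    SUCCESS = "success"
    FAILURE = "failure"


class DemoConversionError(RuntimeError):  # stand-in for ConversionError
    pass


def convert_all_sketch(
    sources: Iterable[str],
    convert_one: Callable[[str], str],
    raises_on_error: bool = True,
) -> Iterator[tuple]:
    """Yield (source, status, payload) lazily, in input order."""
    for source in sources:
        try:
            payload = convert_one(source)
        except ValueError as exc:
            if raises_on_error:
                raise DemoConversionError(f"failed: {source}") from exc
            yield (source, Status.FAILURE, str(exc))
            continue
        yield (source, Status.SUCCESS, payload)


def fake_convert(source: str) -> str:
    if source.endswith(".bad"):
        raise ValueError("unsupported input")
    return source.upper()


inputs = ["a.md", "b.bad", "c.md"]
lenient = list(convert_all_sketch(inputs, fake_convert, raises_on_error=False))
statuses = [status for _, status, _ in lenient]

strict_seen = []
try:
    for source, _, _ in convert_all_sketch(inputs, fake_convert):
        strict_seen.append(source)
except DemoConversionError:
    pass  # iteration stopped at the first failing input
```

With `raises_on_error=False` all three inputs produce a result; with the default, only the inputs before the first failure are seen.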

Sources: docling/document_converter.py338-400

convert_string()

Converts document content provided as a string (supports Markdown and HTML):

Parameters:

content: Document content as string
format: Must be InputFormat.MD or InputFormat.HTML
name: Filename for the document (default: timestamp-based name). The appropriate file extension is appended if it is missing.

Returns:

ConversionResult containing the converted document

Raises:

ValueError: If format is neither InputFormat.MD nor InputFormat.HTML

Implementation Detail: Wraps the string content in a DocumentStream with a BytesIO buffer and passes it to the main conversion pipeline docling/document_converter.py430-458.
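
The wrapping step can be approximated with the standard library. In this sketch `wrap_string` and `EXTENSIONS` are hypothetical names, short strings stand in for the InputFormat members, a (name, stream) tuple stands in for DocumentStream, and the exact shape of the timestamp-based default name is an assumption.

```python
import time
from io import BytesIO
from typing import Optional

EXTENSIONS = {"md": ".md", "html": ".html"}  # format label -> file suffix


def wrap_string(content: str, fmt: str, name: Optional[str] = None) -> tuple:
    """Return (name, stream) for string content, or raise ValueError."""
    if fmt not in EXTENSIONS:
        raise ValueError(f"format must be one of {sorted(EXTENSIONS)}, got {fmt!r}")
    if name is None:
        name = f"doc_{int(time.time())}"  # hypothetical timestamp-based default
    suffix = EXTENSIONS[fmt]
    if not name.endswith(suffix):
        name += suffix  # append the extension only when it is missing
    return name, BytesIO(content.encode("utf-8"))


name, stream = wrap_string("# Title\n\nBody text.", "md", name="notes")
```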

Sources: docling/document_converter.py402-458

Conversion Flow Architecture

Figure: Document flow through the converter, from input to output.

Sources: docling/document_converter.py461-661 docling/datamodel/document.py441-528

ConversionResult Structure

A ConversionResult contains the converted document and comprehensive metadata about the conversion process.

ConversionResult Class

Key Fields:

Field	Type	Description
`input`	`InputDocument`	Metadata about the input document
`status`	`ConversionStatus`	Overall conversion status
`errors`	`list[ErrorItem]`	Errors encountered during conversion
`document`	`DoclingDocument`	The converted document (empty if failed)
`pages`	`list[Page]`	Page-level data including predictions and images
`timings`	`dict[str, ProfilingItem]`	Performance metrics per pipeline stage
`confidence`	`ConfidenceReport`	Quality scores for OCR, layout, tables
`assembled`	`AssembledUnit`	Assembled page elements

Sources: docling/datamodel/document.py417-420 docling/datamodel/document.py242-414

InputDocument Class

The InputDocument class represents metadata about the source document before conversion:

Key Fields docling/datamodel/document.py111-225:

Field	Type	Description
`file`	`PurePath`	Path representation of the input file
`document_hash`	`str`	Stable hash of file content or stream
`valid`	`bool`	Whether the document passed validation
`format`	`InputFormat`	Detected or specified format
`filesize`	`Optional[int]`	Size in bytes (if known)
`page_count`	`int`	Number of pages (for paginated formats)
`limits`	`DocumentLimits`	Size and page constraints
`backend_options`	`Optional[BackendOptions]`	Backend-specific configuration

Validation Logic:

File size is checked against limits.max_file_size docling/datamodel/document.py159-160
Page count is checked against limits.max_num_pages for paginated backends docling/datamodel/document.py189-192
Invalid documents have valid=False but can still be returned in results
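
The validation checks above can be sketched as follows. `Limits` and `is_valid` are hypothetical stand-ins for DocumentLimits and the validation performed by InputDocument, and treating a value exactly equal to a limit as valid is an assumption.

```python
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:  # hypothetical stand-in for DocumentLimits
    max_num_pages: int = sys.maxsize
    max_file_size: int = sys.maxsize


def is_valid(
    filesize: Optional[int],
    page_count: int,
    limits: Limits,
    paginated: bool = True,
) -> bool:
    """Apply the size check, then the page check for paginated inputs."""
    if filesize is not None and filesize > limits.max_file_size:
        return False
    if paginated and page_count > limits.max_num_pages:
        return False
    return True


limits = Limits(max_num_pages=10, max_file_size=1_000_000)
small_ok = is_valid(filesize=500_000, page_count=10, limits=limits)
too_long = is_valid(filesize=500_000, page_count=11, limits=limits)
too_big = is_valid(filesize=2_000_000, page_count=1, limits=limits)
```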

Sources: docling/datamodel/document.py111-225

ConversionStatus Enumeration

The ConversionStatus enum docling/datamodel/base_models.py46-52 indicates the outcome of conversion:

Status	Meaning
`PENDING`	Conversion not yet started
`STARTED`	Conversion in progress
`FAILURE`	Complete failure (no output produced)
`SUCCESS`	Complete success (all pages converted)
`PARTIAL_SUCCESS`	Some pages converted, some failed
`SKIPPED`	Document skipped (e.g., exceeded limits)
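
A caller typically branches on the status before touching the document. The sketch below mirrors the members in the table with a hypothetical `Status` enum (the string values are placeholders) and two hypothetical helpers; it illustrates one possible triage policy, not a rule prescribed by the library.

```python
from enum import Enum


class Status(Enum):  # hypothetical mirror of the ConversionStatus members
    PENDING = "pending"
    STARTED = "started"
    FAILURE = "failure"
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    SKIPPED = "skipped"


def has_usable_output(status: Status) -> bool:
    """True when at least some pages were converted."""
    return status in (Status.SUCCESS, Status.PARTIAL_SUCCESS)


def needs_review(status: Status) -> bool:
    """True when the caller should inspect the recorded errors."""
    return status in (Status.FAILURE, Status.PARTIAL_SUCCESS)
```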

Sources: docling/datamodel/base_models.py46-52

ErrorItem Structure

Each error encountered during conversion is recorded as an ErrorItem docling/datamodel/base_models.py182-186:

Field	Type	Description
`component_type`	`DoclingComponentType`	Which component failed (backend, model, pipeline, etc.)
`module_name`	`str`	Python module where error occurred
`error_message`	`str`	Error description

Component Types docling/datamodel/base_models.py167-172:

DOCUMENT_BACKEND: Backend parser errors
MODEL: ML model inference errors
DOC_ASSEMBLER: Document assembly errors
PIPELINE: Pipeline orchestration errors
USER_INPUT: Invalid user input
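
When a batch produces many errors, grouping them by component type gives a quick overview. In this sketch `DemoError` is a hypothetical dataclass carrying the three ErrorItem fields, strings stand in for DoclingComponentType members, and the module names and messages are invented sample data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DemoError:  # hypothetical stand-in with the three ErrorItem fields
    component_type: str
    module_name: str
    error_message: str


def summarize(errors: list) -> Counter:
    """Count errors per component type."""
    return Counter(error.component_type for error in errors)


errors = [
    DemoError("DOCUMENT_BACKEND", "demo.backend", "could not parse page 3"),
    DemoError("MODEL", "demo.layout", "inference failed"),
    DemoError("DOCUMENT_BACKEND", "demo.backend", "could not parse page 7"),
]
summary = summarize(errors)
```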

Sources: docling/datamodel/base_models.py167-186

Persistence and Serialization

Saving ConversionResult

The ConversionResult can be saved to a ZIP archive containing structured JSON:

Archive Structure:

timestamp.json: Save timestamp (ISO format)
version.json: Docling version information
status.json: Conversion status
errors.json: Error items
pages.json: Page metadata
timings.json: Performance metrics
confidence.json: Quality scores
document.json: DoclingDocument (via export_to_dict())
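
The archive layout above, one JSON member per component, can be reproduced with `zipfile` and `json`. This is a stdlib sketch of the layout only: `save_archive` and `load_archive` are hypothetical helpers, the payloads are placeholders, and only some of the listed members are written.

```python
import json
import zipfile
from datetime import datetime, timezone
from io import BytesIO


def save_archive(parts: dict) -> bytes:
    """Write each entry of `parts` as <name>.json inside an in-memory ZIP."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in parts.items():
            archive.writestr(f"{name}.json", json.dumps(payload))
    return buffer.getvalue()


def load_archive(data: bytes) -> dict:
    """Read every *.json member back into a dict keyed by member stem."""
    with zipfile.ZipFile(BytesIO(data)) as archive:
        return {
            member[: -len(".json")]: json.loads(archive.read(member))
            for member in archive.namelist()
            if member.endswith(".json")
        }


parts = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "status": "success",
    "errors": [],
    "document": {"name": "demo", "texts": []},  # placeholder payload
}
restored = load_archive(save_archive(parts))
```

Because every member is plain JSON, the archive round-trips without loss and can also be inspected with any ZIP tool.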

Sources: docling/datamodel/document.py261-331

Loading ConversionResult

Previously saved results can be loaded back, which reconstructs the full ConversionResult object from the ZIP archive docling/datamodel/document.py333-414.

Sources: docling/datamodel/document.py333-414

Thread Safety and Concurrency

The DocumentConverter guards pipeline cache access with a global lock docling/document_converter.py72. This ensures thread-safe initialization when multiple threads request the same pipeline simultaneously. Once a pipeline is cached, multiple threads can use it concurrently only if the pipeline implementation itself is thread-safe (most pipelines are).
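
The locking pattern can be demonstrated in isolation. In this sketch `_CACHE_LOCK`, `PipelineCache`, and `worker` are hypothetical names, and `object` stands in for an expensive pipeline constructor; it shows lock-guarded get-or-create in general, not the converter's actual code.

```python
import threading

_CACHE_LOCK = threading.Lock()  # module-level lock shared by all callers


class PipelineCache:
    def __init__(self):
        self._pipelines = {}
        self.builds = 0  # how many times the factory actually ran

    def get(self, key, factory):
        """Return the cached pipeline for key, building it at most once."""
        with _CACHE_LOCK:
            if key not in self._pipelines:
                self.builds += 1
                self._pipelines[key] = factory()
            return self._pipelines[key]


cache = PipelineCache()
results = []


def worker():
    # Every thread asks for the same key; `object` is the stand-in factory.
    results.append(cache.get(("DemoPipeline", "abc123"), object))


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

All eight threads receive the same instance and the factory runs once, which is the property that prevents duplicate model loading during concurrent initialization.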

Sources: docling/document_converter.py72 docling/document_converter.py520-543
