This page documents the DocumentConverter class, which is the primary Python API for converting documents in Docling. It covers initialization, configuration with format options, conversion methods (convert(), convert_all(), convert_string()), and the structure of conversion results.
For information about configuring pipeline options (OCR, table structure, VLM settings), see Pipeline Configuration API. For practical code examples, see Usage Examples.
The DocumentConverter class docling/document_converter.py189-661 orchestrates document conversion by routing input documents to appropriate processing pipelines based on their format. It manages pipeline initialization, caching, and execution.
The constructor accepts two parameters to control which document formats are processed and how they are configured:
Parameters:
- allowed_formats: List of InputFormat enum values specifying which document types to accept. If None, all supported formats are allowed docling/document_converter.py221-223
- format_options: Dictionary mapping InputFormat to FormatOption instances that specify the pipeline class, backend, and pipeline-specific options for each format docling/document_converter.py250-257

Attributes:
- allowed_formats: The list of allowed input formats docling/document_converter.py221-223
- format_to_options: Mapping of formats to their FormatOption configurations docling/document_converter.py250-257
- initialized_pipelines: Cache of initialized pipelines keyed by (pipeline_class, options_hash) docling/document_converter.py258-260

Sources: docling/document_converter.py189-261
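As a rough illustration of the two constructor parameters, the sketch below builds a converter restricted to PDF and DOCX and disables OCR for PDFs. It assumes docling is installed; PdfFormatOption and PdfPipelineOptions are the PDF-specific option classes from docling's public modules, and the helper name build_converter is hypothetical. The imports sit inside the function so the snippet can be loaded without importing docling up front.

```python
def build_converter():
    """Build a DocumentConverter limited to PDF and DOCX (assumes docling is installed)."""
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pdf_options = PdfPipelineOptions()
    pdf_options.do_ocr = False  # skip OCR, e.g. for digitally born PDFs

    return DocumentConverter(
        # Only these formats are accepted by the converter.
        allowed_formats=[InputFormat.PDF, InputFormat.DOCX],
        # Per-format configuration; here only PDF is customized.
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_options),
        },
    )
```

Calling build_converter() returns a converter whose PDF handling uses the customized options.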
Each input format is associated with a FormatOption that specifies the processing pipeline and document backend to use. The format options form a class hierarchy:
Key FormatOption Fields docling/document_converter.py75-84:
- pipeline_cls: The BasePipeline subclass to use for processing
- backend: The AbstractDocumentBackend subclass to use for parsing
- pipeline_options: Optional PipelineOptions instance for pipeline configuration
- backend_options: Optional BackendOptions instance for backend configuration

Default Format Options:
The system provides default options for each format via _get_default_option() docling/document_converter.py158-186. Common configurations include:
| Input Format | Pipeline | Backend |
|---|---|---|
| InputFormat.PDF | StandardPdfPipeline | DoclingParseDocumentBackend |
| InputFormat.DOCX | SimplePipeline | MsWordDocumentBackend |
| InputFormat.XLSX | SimplePipeline | MsExcelDocumentBackend |
| InputFormat.PPTX | SimplePipeline | MsPowerpointDocumentBackend |
| InputFormat.HTML | SimplePipeline | HTMLDocumentBackend |
| InputFormat.MD | SimplePipeline | MarkdownDocumentBackend |
| InputFormat.IMAGE | StandardPdfPipeline | ImageDocumentBackend |
| InputFormat.AUDIO | AsrPipeline | NoOpBackend |
Sources: docling/document_converter.py75-186 docling/datamodel/base_models.py37-44
The DocumentConverter caches initialized pipelines to avoid redundant model loading. Pipelines are keyed by a tuple of (pipeline_class, options_hash) where options_hash is an MD5 hash of the serialized pipeline options docling/document_converter.py267-272
Pipeline Retrieval Flow:
Cache Key Generation docling/document_converter.py267-272:
- Serialize the pipeline options with pipeline_options.model_dump()
- Compute an MD5 hash of the serialized options (with usedforsecurity=False)
- Use (pipeline_class, hash) as cache key

This ensures that requests with identical pipeline configurations reuse the same cached pipeline, avoiding duplicate model loading.
Sources: docling/document_converter.py262-292
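The cache-key idea can be sketched with the standard library alone. This is a toy model under stated assumptions, not docling's code: options are a plain dict standing in for pipeline_options.model_dump(), JSON is an assumed serialization, and PipelineCache and FakePipeline are hypothetical names.

```python
import hashlib
import json


def options_hash(options: dict) -> str:
    """Fingerprint a serialized options dict; MD5 is used for identity, not security."""
    serialized = json.dumps(options, sort_keys=True, default=str)
    return hashlib.md5(serialized.encode("utf-8"), usedforsecurity=False).hexdigest()


class PipelineCache:
    """Toy cache keyed by (pipeline_class, options_hash)."""

    def __init__(self):
        self._pipelines = {}

    def get(self, pipeline_class, options: dict):
        key = (pipeline_class, options_hash(options))
        if key not in self._pipelines:
            # Expensive construction (model loading) happens only on a cache miss.
            self._pipelines[key] = pipeline_class(options)
        return self._pipelines[key]


class FakePipeline:
    """Stand-in for a real pipeline; records the options it was built with."""

    def __init__(self, options: dict):
        self.options = options


cache = PipelineCache()
first = cache.get(FakePipeline, {"do_ocr": True})
second = cache.get(FakePipeline, {"do_ocr": True})   # same key: reuses `first`
third = cache.get(FakePipeline, {"do_ocr": False})   # different options: new pipeline
```

Because the key includes the options hash, changing any option yields a separate cached pipeline rather than silently reusing one built with different settings.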
Converts a single document from a file path, URL, or DocumentStream:
Parameters:
- source: Input document as file path (Path), URL (str), or DocumentStream
- headers: HTTP headers for URL sources (dict of string key-value pairs)
- raises_on_error: If True, raises ConversionError on failure; if False, errors are captured in ConversionResult.errors
- max_num_pages: Maximum pages to accept (documents exceeding this are skipped)
- max_file_size: Maximum file size in bytes
- page_range: Tuple of (start_page, end_page) for selective conversion (1-indexed)

Returns:
- ConversionResult containing the converted DoclingDocument and metadata

Raises:
- ConversionError: If conversion fails and raises_on_error=True

Sources: docling/document_converter.py294-336
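A minimal usage sketch for convert(), assuming docling is installed. The helper name convert_one and the page limit of 50 are illustrative choices; the imports sit inside the function so the snippet loads without importing docling up front.

```python
def convert_one(source, max_num_pages=50):
    """Convert one document; return Markdown, or None if conversion did not succeed."""
    from docling.datamodel.base_models import ConversionStatus
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(
        source,
        raises_on_error=False,        # capture errors in result.errors instead of raising
        max_num_pages=max_num_pages,  # documents exceeding this limit are not converted
    )
    if result.status != ConversionStatus.SUCCESS:
        for item in result.errors:
            print(f"{item.module_name}: {item.error_message}")
        return None
    return result.document.export_to_markdown()
```

With raises_on_error=False the caller decides how to treat failures, which tends to suit services that should not crash on a single bad input.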
Converts multiple documents in batch, yielding results as they complete:
Parameters:
- source: Iterable of document sources (file paths, URLs, or DocumentStream objects)
- headers: Single set of HTTP headers applied to all URL sources
- Remaining parameters: same as convert()

Returns:
- Iterator of ConversionResult instances, yielded as documents are processed

Behavior:
- If raises_on_error=False, failed documents yield a ConversionResult with status=FAILURE
- If raises_on_error=True, the first failure stops iteration and raises an exception

Sources: docling/document_converter.py338-400
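One way to consume a batch without aborting on failures is to tally statuses as results arrive. In the sketch below, tally_statuses and convert_batch are hypothetical helpers, convert_batch assumes docling is installed, and FakeStatus is only a stand-in for ConversionStatus so the tally can be shown without running a conversion.

```python
from collections import Counter
from enum import Enum
from types import SimpleNamespace


def tally_statuses(results):
    """Count results per status name; accepts any iterable of objects with a .status enum."""
    return Counter(result.status.name for result in results)


def convert_batch(sources):
    """Convert many documents, collecting failures instead of raising (assumes docling)."""
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    # convert_all yields lazily, so results are tallied as each document is processed.
    return tally_statuses(converter.convert_all(sources, raises_on_error=False))


class FakeStatus(Enum):
    """Stand-in for ConversionStatus, used only for this demonstration."""
    SUCCESS = "success"
    FAILURE = "failure"


demo = tally_statuses(
    [
        SimpleNamespace(status=FakeStatus.SUCCESS),
        SimpleNamespace(status=FakeStatus.FAILURE),
        SimpleNamespace(status=FakeStatus.SUCCESS),
    ]
)
```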
Converts document content provided as a string (supports Markdown and HTML):
Parameters:
- content: Document content as string
- format: Must be InputFormat.MD or InputFormat.HTML
- name: Filename for the document (default: timestamp-based name). The appropriate file extension is appended if missing.

Returns:
- ConversionResult containing the converted document

Raises:
- ValueError: If format is neither InputFormat.MD nor InputFormat.HTML

Implementation Detail:
Wraps the string content in a DocumentStream with a BytesIO buffer and passes it to the main conversion pipeline docling/document_converter.py430-458
Sources: docling/document_converter.py402-458
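The wrapping step can be approximated with the standard library. This is a sketch of the described behavior, not docling's implementation: the string keys "md" and "html" stand in for InputFormat.MD and InputFormat.HTML, the timestamp pattern is an assumption, and wrap_string is a hypothetical name.

```python
from datetime import datetime
from io import BytesIO

# Hypothetical stand-ins for the two supported formats and their file extensions.
EXTENSIONS = {"md": ".md", "html": ".html"}


def wrap_string(content, fmt, name=None):
    """Return (name, stream) for string content, mirroring the described behavior."""
    if fmt not in EXTENSIONS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if name is None:
        # Timestamp-based default name; the exact pattern here is an assumption.
        name = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    if not name.endswith(EXTENSIONS[fmt]):
        name += EXTENSIONS[fmt]  # append the extension only when it is missing
    return name, BytesIO(content.encode("utf-8"))


name, stream = wrap_string("# Title\n\nBody text.", "md", name="notes")
```

The returned pair corresponds to what a DocumentStream carries: a filename used for format detection plus an in-memory byte buffer.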
The following diagram shows how documents flow through the converter from input to output:
Sources: docling/document_converter.py461-661 docling/datamodel/document.py441-528
A ConversionResult contains the converted document and comprehensive metadata about the conversion process.
Key Fields:
| Field | Type | Description |
|---|---|---|
| input | InputDocument | Metadata about the input document |
| status | ConversionStatus | Overall conversion status |
| errors | list[ErrorItem] | Errors encountered during conversion |
| document | DoclingDocument | The converted document (empty if failed) |
| pages | list[Page] | Page-level data including predictions and images |
| timings | dict[str, ProfilingItem] | Performance metrics per pipeline stage |
| confidence | ConfidenceReport | Quality scores for OCR, layout, tables |
| assembled | AssembledUnit | Assembled page elements |
Sources: docling/datamodel/document.py417-420 docling/datamodel/document.py242-414
The InputDocument class represents metadata about the source document before conversion:
Key Fields docling/datamodel/document.py111-225:
| Field | Type | Description |
|---|---|---|
| file | PurePath | Path representation of the input file |
| document_hash | str | Stable hash of file content or stream |
| valid | bool | Whether the document passed validation |
| format | InputFormat | Detected or specified format |
| filesize | Optional[int] | Size in bytes (if known) |
| page_count | int | Number of pages (for paginated formats) |
| limits | DocumentLimits | Size and page constraints |
| backend_options | Optional[BackendOptions] | Backend-specific configuration |
Validation Logic:
- File size is checked against limits.max_file_size docling/datamodel/document.py159-160
- Page count is checked against limits.max_num_pages for paginated backends docling/datamodel/document.py189-192
- Invalid documents are marked valid=False but can still be returned in results

Sources: docling/datamodel/document.py111-225
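The two limit checks can be pictured as a small predicate. The sketch below is a simplified model: Limits is a hypothetical stand-in for DocumentLimits, and the real checks live inside InputDocument rather than in a free function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    """Hypothetical stand-in for DocumentLimits."""
    max_file_size: int
    max_num_pages: int


def is_valid(filesize: Optional[int], page_count: int, limits: Limits, paginated: bool) -> bool:
    """Return False when a document exceeds either limit."""
    if filesize is not None and filesize > limits.max_file_size:
        return False  # file too large
    if paginated and page_count > limits.max_num_pages:
        return False  # too many pages (only meaningful for paginated backends)
    return True


limits = Limits(max_file_size=1_000_000, max_num_pages=10)
small_pdf = is_valid(200_000, 5, limits, paginated=True)
long_pdf = is_valid(200_000, 11, limits, paginated=True)
huge_file = is_valid(2_000_000, 1, limits, paginated=False)
```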
The ConversionStatus enum docling/datamodel/base_models.py46-52 indicates the outcome of conversion:
| Status | Meaning |
|---|---|
| PENDING | Conversion not yet started |
| STARTED | Conversion in progress |
| FAILURE | Complete failure (no output produced) |
| SUCCESS | Complete success (all pages converted) |
| PARTIAL_SUCCESS | Some pages converted, some failed |
| SKIPPED | Document skipped (e.g., exceeded limits) |
Sources: docling/datamodel/base_models.py46-52
Each error encountered during conversion is recorded as an ErrorItem docling/datamodel/base_models.py182-186:
| Field | Type | Description |
|---|---|---|
| component_type | DoclingComponentType | Which component failed (backend, model, pipeline, etc.) |
| module_name | str | Python module where error occurred |
| error_message | str | Error description |
Component Types docling/datamodel/base_models.py167-172:
- DOCUMENT_BACKEND: Backend parser errors
- MODEL: ML model inference errors
- DOC_ASSEMBLER: Document assembly errors
- PIPELINE: Pipeline orchestration errors
- USER_INPUT: Invalid user input

Sources: docling/datamodel/base_models.py167-186
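For logging, the three ErrorItem fields can be flattened into one line per error. format_errors below is a hypothetical helper; it relies only on the field names from the table above, and the example item is a plain stand-in object with made-up values rather than a real ErrorItem.

```python
from types import SimpleNamespace


def format_errors(errors):
    """Render each error as '[COMPONENT] module: message'."""
    lines = []
    for item in errors:
        # Enum members expose .name; plain strings fall back to str().
        component = getattr(item.component_type, "name", str(item.component_type))
        lines.append(f"[{component}] {item.module_name}: {item.error_message}")
    return lines


example = SimpleNamespace(
    component_type="DOCUMENT_BACKEND",
    module_name="my_backend",
    error_message="could not parse page 3",
)
report = format_errors([example])
```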
The ConversionResult can be saved to a ZIP archive containing structured JSON:
Archive Structure:
- timestamp.json: Save timestamp (ISO format)
- version.json: Docling version information
- status.json: Conversion status
- errors.json: Error items
- pages.json: Page metadata
- timings.json: Performance metrics
- confidence.json: Quality scores
- document.json: DoclingDocument (via export_to_dict())

Sources: docling/datamodel/document.py261-331
Previously saved results can be loaded:
This reconstructs the full ConversionResult object from the ZIP archive docling/datamodel/document.py333-414
Sources: docling/datamodel/document.py333-414
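The archive layout, one JSON file per component inside a ZIP, can be illustrated with the standard library. This sketch shows the layout only and does not reproduce docling's save and load methods: save_archive and load_archive are hypothetical names, and the payloads are placeholder values.

```python
import io
import json
import zipfile
from datetime import datetime, timezone


def save_archive(parts: dict) -> bytes:
    """Write each entry of `parts` as '<name>.json' inside an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        stamp = datetime.now(timezone.utc).isoformat()
        archive.writestr("timestamp.json", json.dumps(stamp))
        for name, payload in parts.items():
            archive.writestr(f"{name}.json", json.dumps(payload))
    return buffer.getvalue()


def load_archive(data: bytes) -> dict:
    """Read every '<name>.json' entry back into a dict keyed by name."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {
            entry[: -len(".json")]: json.loads(archive.read(entry))
            for entry in archive.namelist()
        }


blob = save_archive({"status": "success", "errors": [], "document": {"name": "demo"}})
restored = load_archive(blob)
```

Keeping each component in its own file means a consumer can read, say, only status.json without parsing the full document.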
The DocumentConverter uses a global lock for pipeline cache access:
This ensures thread-safe initialization when multiple threads request the same pipeline simultaneously docling/document_converter.py72. Once a pipeline is cached, multiple threads can safely use it concurrently, provided the pipeline implementation is itself thread-safe (most pipelines are).
Sources: docling/document_converter.py72 docling/document_converter.py520-543
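The locking pattern can be sketched as a lock-guarded lookup around a shared dictionary. This is a minimal model, not docling's code: _CACHE_LOCK, get_or_create, and make_pipeline are hypothetical names, and object() stands in for an expensive pipeline.

```python
import threading

_CACHE_LOCK = threading.Lock()  # one lock shared by every caller
_cache = {}
build_log = []  # records each time the factory actually runs


def get_or_create(key, factory):
    """Return the cached value for `key`, building it at most once across threads."""
    with _CACHE_LOCK:
        if key not in _cache:
            _cache[key] = factory()
        return _cache[key]


def make_pipeline():
    build_log.append("built")
    return object()  # stand-in for an expensive pipeline


results = []
threads = [
    threading.Thread(target=lambda: results.append(get_or_create("pdf", make_pipeline)))
    for _ in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The check and the insertion happen under the same lock, so two threads that miss the cache at the same moment cannot both build the pipeline.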