This document describes the StandardPdfPipeline class and its threaded, queue-based architecture for processing PDF documents. The pipeline achieves true parallelism by running multiple stages concurrently with bounded queues connecting them, allowing OCR, layout detection, table structure prediction, and assembly to process different batches simultaneously.
For information about the sequential/legacy PDF pipeline, see Legacy Standard PDF Pipeline. For VLM-based processing, see VLM Pipeline. For configuration options, see Pipeline Configuration API.
Sources: docling/pipeline/standard_pdf_pipeline.py:1-14
The StandardPdfPipeline implements a producer-consumer pattern where pages flow through five sequential stages, each running in its own thread. Pages are wrapped in ThreadedItem envelopes that track run identifiers, error state, and the page payload. Bounded ThreadedQueue instances connect stages, providing natural backpressure when downstream stages cannot keep up.
Per-Run Isolation: Every execute() call creates its own RunContext with fresh queues and worker threads. This ensures concurrent invocations never share mutable state, even when the same StandardPdfPipeline instance is used to process multiple documents.
Deterministic Run Tracking: Pages are tracked with monotonic run IDs generated by itertools.count() rather than relying on Python's id() function, which may be reused after garbage collection.
Explicit Shutdown Propagation: Queues support an explicit close() operation that propagates downstream. When a producer closes its output queue, downstream consumers detect the closure and terminate gracefully without requiring sentinel values.
Model Initialization Once: Heavy models (OCR, layout, table structure) are initialized once during __init__() and shared read-only across all worker threads. No model weights are modified during inference.
Sources: docling/pipeline/standard_pdf_pipeline.py:1-14, docling/pipeline/standard_pdf_pipeline.py:429-438
ThreadedItem (@dataclass) wraps each page as it travels between stages. The run_id field distinguishes pages from concurrent document conversions, the payload holds the Page object, and is_failed/error track processing failures. When a stage encounters an error, it marks is_failed=True and attaches the exception, allowing the item to flow through remaining stages without further processing.
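Based on that description, a minimal sketch of the envelope might look like the following (field names follow the text above; this is not the actual docling definition):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ThreadedItem:
    # Sketch of the envelope described above, not the real docling class.
    run_id: int
    payload: Any                      # the Page object in the real pipeline
    is_failed: bool = False
    error: Optional[BaseException] = None

# A stage that hits an error marks the item instead of raising further:
item = ThreadedItem(run_id=1, payload="page-0")
try:
    raise ValueError("OCR engine crashed")
except ValueError as exc:
    item.is_failed = True
    item.error = exc
```

Downstream stages can then check `is_failed` and simply forward the item without reprocessing it.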
Sources: docling/pipeline/standard_pdf_pipeline.py:75-85
The ThreadedQueue class implements a bounded, thread-safe queue with blocking operations and explicit closure semantics:
| Method | Behavior |
|---|---|
| put(item, timeout) | Blocks until the queue has space or is closed. Returns False if closed. |
| get_batch(size, timeout) | Blocks until at least one item is available or the queue is closed. Returns up to size items. |
| close() | Marks the queue closed and wakes all waiting threads. |
| closed (property) | Returns True if the queue has been closed. |
The queue uses deque for efficient O(1) append/popleft operations and threading.Condition variables for blocking coordination. The _not_full condition guards against producer overflow, while _not_empty guards against consumer underflow.
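Putting those pieces together, a stripped-down sketch of such a queue (illustrative, not docling's implementation) could look like this:

```python
import threading
from collections import deque

class ThreadedQueue:
    """Bounded queue sketch with explicit close(), per the description above."""

    def __init__(self, max_size: int = 512) -> None:
        self._items: deque = deque()
        self._max_size = max_size
        self._lock = threading.Lock()
        self._not_full = threading.Condition(self._lock)   # guards producer overflow
        self._not_empty = threading.Condition(self._lock)  # guards consumer underflow
        self._closed = False

    @property
    def closed(self) -> bool:
        with self._lock:
            return self._closed

    def put(self, item, timeout=None) -> bool:
        with self._not_full:
            while len(self._items) >= self._max_size and not self._closed:
                if not self._not_full.wait(timeout):
                    return False              # timed out waiting for space
            if self._closed:
                return False                  # refuse writes after close()
            self._items.append(item)
            self._not_empty.notify()
            return True

    def get_batch(self, size, timeout=None) -> list:
        with self._not_empty:
            while not self._items and not self._closed:
                if not self._not_empty.wait(timeout):
                    return []                 # timed out waiting for items
            batch = [self._items.popleft() for _ in range(min(size, len(self._items)))]
            self._not_full.notify_all()
            return batch

    def close(self) -> None:
        with self._lock:
            self._closed = True
            self._not_empty.notify_all()      # wake blocked consumers
            self._not_full.notify_all()       # wake blocked producers
```

Note how close() wakes every waiter, which is what lets consumers terminate without sentinel values.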
Sources: docling/pipeline/standard_pdf_pipeline.py:112-179
Each processing stage is represented by a ThreadedPipelineStage instance that owns:
- input_queue: ThreadedQueue for receiving items
- _outputs: list[ThreadedQueue] for forwarding processed items
- model: Any instance (e.g., OcrModel, LayoutModel) shared read-only
- _thread: threading.Thread that processes batches
- _postprocess: Callable for cleanup (e.g., releasing page resources)

The stage groups incoming items by run_id to maximize batch coherence, then invokes model(conv_res, pages) on each group. If any exception occurs, all items in that run are marked failed and passed downstream.
Sources: docling/pipeline/standard_pdf_pipeline.py:181-321
Sources: docling/pipeline/standard_pdf_pipeline.py:530-587
PreprocessThreadedStage (lines 323-411)
Lazily loads page backends just-in-time rather than up-front. For each incoming ThreadedItem, the stage:

1. Checks whether page._backend is None
2. Calls conv_res.input._backend.load_page(page.page_no - 1) to load the PDF page backend
3. Populates page.size
4. Runs PagePreprocessingModel to generate scaled images

This approach delays expensive PDF parsing until the page is actually needed, distributing I/O across the pipeline execution.
OCR Stage (batch_size=64)
Processes batches of up to 64 pages through the OCR model (RapidOCR, EasyOCR, Tesseract, or OCRMac). The model adds OCR-generated text cells to pages that lack embedded text.
Layout Stage (batch_size=64)
Runs the layout detection model (Heron by default) to identify document elements: text blocks, section headers, tables, figures, formulas, etc. This is typically the most compute-intensive stage and benefits from large batches on GPU.
Table Stage (batch_size=4)
Applies TableFormer to predict table structure (rows, columns, cell spans) for detected table regions. Uses smaller batches because table structure models are memory-intensive.
Assemble Stage (batch_size=1)
Runs PageAssembleModel to combine clusters and cells into a structured representation. Executes the optional _release_page_resources() postprocessor to clean up images and backends.
Sources: docling/pipeline/standard_pdf_pipeline.py:323-411, docling/pipeline/standard_pdf_pipeline.py:530-587
Sources: docling/pipeline/standard_pdf_pipeline.py:590-704
The main thread in _build_document() alternates between feeding pages into the first stage and draining results from the output queue.
This interleaving prevents deadlocks where the main thread blocks feeding while stages wait to emit results.
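The alternation can be sketched with stdlib queues (feed_and_drain is an illustrative name; the real loop in _build_document() also honors timeouts and run IDs):

```python
import queue

def feed_and_drain(pages, in_q, out_q):
    # Sketch of the interleaving: offer one page downstream, then drain any
    # finished pages, so the main thread never blocks on a full input queue
    # while results pile up behind it.
    results, pending = [], list(pages)
    while pending:
        try:
            in_q.put_nowait(pending[0])
            pending.pop(0)
        except queue.Full:
            pass                        # input full: drain instead of blocking
        while True:
            try:
                results.append(out_q.get_nowait())
            except queue.Empty:
                break
    return results
```

After the last page is fed, the real pipeline keeps draining until the output queue closes.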
Sources: docling/pipeline/standard_pdf_pipeline.py:620-686
Each execute() call generates a unique run_id from self._run_seq = itertools.count(1) and creates a fresh RunContext with new queues, stages, and threads. Models initialized during __init__() are shared read-only across all runs. Worker threads only mutate thread-local state and the items flowing through their queues.
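A minimal sketch of that isolation pattern (RunContext modeled as a plain dict standing in for queues, stages, and threads):

```python
import itertools

class PipelineSketch:
    # Sketch of per-run isolation: one shared monotonic counter, fresh
    # per-run state on every execute() call.
    def __init__(self):
        self._run_seq = itertools.count(1)

    def execute(self):
        run_id = next(self._run_seq)      # atomic under CPython's GIL
        return {"run_id": run_id, "queues": [], "threads": []}
```

Because each call builds its own context, two concurrent execute() calls never touch the same queues.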
Sources: docling/pipeline/standard_pdf_pipeline.py:429-438, docling/pipeline/standard_pdf_pipeline.py:414-422, docling/pipeline/standard_pdf_pipeline.py:530-587
The pipeline coordinates with external libraries through:
pypdfium2_lock: A global threading.Lock defined in docling/utils/locks.py that serializes access to pypdfium2 APIs. PDF page loading and rendering operations acquire this lock to prevent race conditions in the underlying C library.
Backend Lazy Loading: The PreprocessThreadedStage loads page backends just-in-time within its worker thread, avoiding contention during document initialization.
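The pypdfium2_lock pattern can be sketched as follows (load_page_safely is an illustrative wrapper, not a docling function):

```python
import threading

# Shared module-level lock, as described for docling/utils/locks.py.
pypdfium2_lock = threading.Lock()

def load_page_safely(doc_backend, page_index):
    # Sketch: every call into the non-thread-safe C library is wrapped in
    # the shared lock; doc_backend is any object with a load_page method.
    with pypdfium2_lock:
        return doc_backend.load_page(page_index)
```

A single module-level lock serializes all pypdfium2 calls across every worker thread and every concurrent run.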
Sources: docling/utils/locks.py:1-3, docling/backend/pypdfium2_backend.py:44-47, docling/pipeline/standard_pdf_pipeline.py:323-411
When a stage encounters an exception processing a batch, it marks all items in that run_id group as failed.
Failed items continue flowing through downstream stages without further processing, preserving the page's error information. The main thread collects these failures in ProcessingResult.failed_pages.
Sources: docling/pipeline/standard_pdf_pipeline.py:275-310
The ThreadedPdfPipelineOptions.document_timeout (in seconds) limits total processing time.
When the timeout is exceeded:

- The main thread adds the run_id to ctx.timed_out_run_ids
- Stages detect the timed-out run_id and skip processing but pass items through
- The conversion fails with RuntimeError("document timeout exceeded")

This allows in-flight work to complete while preventing new work from starting.
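The cooperative deadline check can be sketched in a single function (run_with_deadline and its arguments are illustrative names):

```python
import time

def run_with_deadline(pages, process, document_timeout):
    # Sketch of the cooperative timeout: once the deadline passes, no new
    # work starts (items already in flight would still drain through the
    # stages), and the run is reported as a timeout failure.
    deadline = time.monotonic() + document_timeout
    done = []
    for page in pages:
        if time.monotonic() > deadline:
            raise RuntimeError("document timeout exceeded")
        done.append(process(page))
    return done
```

Using time.monotonic() rather than time.time() makes the deadline immune to wall-clock adjustments.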
Sources: docling/pipeline/standard_pdf_pipeline.py:626-644, docling/pipeline/standard_pdf_pipeline.py:261-269, docling/pipeline/standard_pdf_pipeline.py:689-697
The _integrate_results() method translates ProcessingResult.failed_pages into ErrorItem instances appended to ConversionResult.errors, each with component_type=DoclingComponentType.PIPELINE and a descriptive message.
Sources: docling/pipeline/standard_pdf_pipeline.py:706-746
| Option | Type | Default | Description |
|---|---|---|---|
| layout_batch_size | int | 64 | Pages per layout model batch |
| table_batch_size | int | 4 | Pages per table model batch |
| ocr_batch_size | int | 64 | Pages per OCR batch |
| queue_max_size | int | 512 | Maximum items in each queue |
| batch_polling_interval_seconds | float | 5.0 | Timeout for get_batch() calls |
| document_timeout | Optional[float] | None | Maximum seconds per document |
Inherits from PdfPipelineOptions, which provides:

- do_ocr, do_table_structure, do_code_enrichment, etc.
- ocr_options, layout_options, table_structure_options
- generate_page_images, generate_picture_images
- accelerator_options for device selection

Batch sizes should be tuned based on GPU memory and model characteristics. Larger batches improve GPU utilization but increase latency for the first result.
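A configuration sketch tying the table above together. The import path and field names follow the table and this page's imports section, but may differ between docling releases; check pipeline_options.py in your installed version:

```python
# Hedged sketch: verify names against your docling release.
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions

opts = ThreadedPdfPipelineOptions(
    layout_batch_size=32,     # smaller batches -> lower first-result latency
    table_batch_size=4,
    ocr_batch_size=32,
    queue_max_size=256,       # tighter bound -> less peak memory
    document_timeout=300.0,   # fail a document after 5 minutes
)
```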
Sources: docling/datamodel/pipeline_options.py (referenced in imports)
After all pages complete the threaded stages, the pipeline runs sequential post-processing in the main thread:

- ReadingOrderModel sorts document elements logically across pages
- LayoutPostprocessor refines cluster bboxes and removes overlaps
- The results are assembled into a DoclingDocument

These steps are not threaded because they operate on the complete document and have complex interdependencies.
Sources: docling/pipeline/standard_pdf_pipeline.py:747-840
The LegacyStandardPdfPipeline processes pages sequentially in batches through a list of models.
While a batch of pages flows through the entire pipeline together, only one stage is active at a time. The threaded pipeline achieves true parallelism: while the layout model processes batch N, the OCR model can process batch N+1, and the table model can process batch N-1.
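The sequential pattern is simple to sketch (legacy_run is an illustrative name; the real pipeline iterates a build_pipe list of models):

```python
def legacy_run(model_chain, page_batches):
    # Sketch of the sequential pattern: each batch passes through every
    # model before the next batch begins, so only one stage is ever busy.
    results = []
    for batch in page_batches:
        for model in model_chain:
            batch = model(batch)
        results.extend(batch)
    return results
```

Contrast this with the threaded pipeline, where each model runs in its own thread and works on a different batch at the same time.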
The threaded pipeline is now the default StandardPdfPipeline implementation. The legacy version remains available for compatibility and debugging.
Sources: docling/pipeline/legacy_standard_pdf_pipeline.py:80-95, docling/pipeline/standard_pdf_pipeline.py:429-438