This document describes the StandardPdfPipeline class and its threaded, queue-based architecture for processing PDF documents. The pipeline achieves true parallelism by running multiple stages concurrently with bounded queues connecting them, allowing OCR, layout detection, table structure prediction, and assembly to process different batches simultaneously.
For information about the sequential/legacy PDF pipeline, see Legacy Standard PDF Pipeline. For VLM-based processing, see VLM Pipeline. For configuration options, see Pipeline Configuration API.
Sources: docling/pipeline/standard_pdf_pipeline.py:1-14
The StandardPdfPipeline implements a producer-consumer pattern where pages flow through five sequential stages, each running in its own thread. Pages are wrapped in ThreadedItem envelopes that track run identifiers, error state, and the page payload. Bounded ThreadedQueue instances connect stages, providing natural backpressure when downstream stages cannot keep up.
Per-Run Isolation: Every execute() call creates its own RunContext with fresh queues and worker threads. This ensures concurrent invocations never share mutable state, even when the same StandardPdfPipeline instance is used to process multiple documents.
Deterministic Run Tracking: Pages are tracked with monotonic run IDs generated by itertools.count() rather than relying on Python's id() function, which may be reused after garbage collection.
Explicit Shutdown Propagation: Queues support an explicit close() operation that propagates downstream. When a producer closes its output queue, downstream consumers detect the closure and terminate gracefully without requiring sentinel values.
Model Initialization Once: Heavy models (OCR, layout, table structure) are initialized once during __init__() and shared read-only across all worker threads. No model weights are modified during inference.
Sources: docling/pipeline/standard_pdf_pipeline.py:1-14, docling/pipeline/standard_pdf_pipeline.py:429-438
ThreadedItem (@dataclass) wraps each page as it travels between stages. The run_id field distinguishes pages from concurrent document conversions, the payload holds the Page object, and is_failed/error track processing failures. When a stage encounters an error, it marks is_failed=True and attaches the exception, allowing the item to flow through remaining stages without further processing.
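Based on that description, a minimal sketch of the envelope might look like the following (field names follow the text above; this is not the actual docling definition):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ThreadedItem:
    # Sketch of the envelope described above, not the real docling class.
    run_id: int
    payload: Any                      # the Page object in the real pipeline
    is_failed: bool = False
    error: Optional[BaseException] = None

# A stage that hits an error marks the item instead of raising further:
item = ThreadedItem(run_id=1, payload="page-0")
try:
    raise ValueError("OCR engine crashed")
except ValueError as exc:
    item.is_failed = True
    item.error = exc
```

Downstream stages can then check `is_failed` and simply forward the item without reprocessing it.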
Sources: docling/pipeline/standard_pdf_pipeline.py:75-85
The ThreadedQueue class implements a bounded, thread-safe queue with blocking operations and explicit closure semantics:
| Method | Behavior |
|---|---|
| put(item, timeout) | Blocks until the queue has space or is closed. Returns False if closed. |
| get_batch(size, timeout) | Blocks until at least one item is available or the queue is closed. Returns up to size items. |
| close() | Marks the queue closed and wakes all waiting threads. |
| closed (property) | Returns True if the queue has been closed. |
The queue uses deque for efficient O(1) append/popleft operations and threading.Condition variables for blocking coordination. The _not_full condition guards against producer overflow, while _not_empty guards against consumer underflow.
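Putting those pieces together, a stripped-down sketch of such a queue (illustrative, not docling's implementation) could look like this:

```python
import threading
from collections import deque

class ThreadedQueue:
    """Bounded queue sketch with explicit close(), per the description above."""

    def __init__(self, max_size: int = 512) -> None:
        self._items: deque = deque()
        self._max_size = max_size
        self._lock = threading.Lock()
        self._not_full = threading.Condition(self._lock)   # guards producer overflow
        self._not_empty = threading.Condition(self._lock)  # guards consumer underflow
        self._closed = False

    @property
    def closed(self) -> bool:
        with self._lock:
            return self._closed

    def put(self, item, timeout=None) -> bool:
        with self._not_full:
            while len(self._items) >= self._max_size and not self._closed:
                if not self._not_full.wait(timeout):
                    return False              # timed out waiting for space
            if self._closed:
                return False                  # refuse writes after close()
            self._items.append(item)
            self._not_empty.notify()
            return True

    def get_batch(self, size, timeout=None) -> list:
        with self._not_empty:
            while not self._items and not self._closed:
                if not self._not_empty.wait(timeout):
                    return []                 # timed out waiting for items
            batch = [self._items.popleft() for _ in range(min(size, len(self._items)))]
            self._not_full.notify_all()
            return batch

    def close(self) -> None:
        with self._lock:
            self._closed = True
            self._not_empty.notify_all()      # wake blocked consumers
            self._not_full.notify_all()       # wake blocked producers
```

Note how close() wakes every waiter, which is what lets consumers terminate without sentinel values.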
Sources: docling/pipeline/standard_pdf_pipeline.py:112-179
Each processing stage is represented by a ThreadedPipelineStage instance that owns:
- input_queue: ThreadedQueue for receiving items
- _outputs: list[ThreadedQueue] for forwarding processed items
- model: Any instance (e.g., OcrModel, LayoutModel) shared read-only
- _thread: threading.Thread that processes batches
- _postprocess: Callable for cleanup (e.g., releasing page resources)

The stage groups incoming items by run_id to maximize batch coherence, then invokes model(conv_res, pages) on each group. If any exception occurs, all items in that run are marked failed and passed downstream.
Sources: docling/pipeline/standard_pdf_pipeline.py:181-321
Sources: docling/pipeline/standard_pdf_pipeline.py:530-587
PreprocessThreadedStage (lines 323-411)
Lazily loads page backends just-in-time rather than up-front. For each incoming ThreadedItem, the stage:

1. Checks whether page._backend is None
2. Calls conv_res.input._backend.load_page(page.page_no - 1) to load the PDF page backend
3. Populates page.size
4. Runs PagePreprocessingModel to generate scaled images

This approach delays expensive PDF parsing until the page is actually needed, distributing I/O across the pipeline execution.
OCR Stage (batch_size=64)
Processes batches of up to 64 pages through the OCR model (RapidOCR, EasyOCR, Tesseract, or OCRMac). The model adds OCR-generated text cells to pages that lack embedded text.
Layout Stage (batch_size=64)
Runs the layout detection model (Heron by default) to identify document elements: text blocks, section headers, tables, figures, formulas, etc. This is typically the most compute-intensive stage and benefits from large batches on GPU.
Table Stage (batch_size=4)
Applies TableFormer to predict table structure (rows, columns, cell spans) for detected table regions. Uses smaller batches because table structure models are memory-intensive.
Assemble Stage (batch_size=1)
Runs PageAssembleModel to combine clusters and cells into a structured representation. Executes the optional _release_page_resources() postprocessor to clean up images and backends.
Sources: docling/pipeline/standard_pdf_pipeline.py:323-411, docling/pipeline/standard_pdf_pipeline.py:530-587
Sources: docling/pipeline/standard_pdf_pipeline.py:590-704
The main thread in _build_document() alternates between feeding pages into the first stage and draining results from the output queue.
This interleaving prevents deadlocks where the main thread blocks feeding while stages wait to emit results.
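The alternation can be sketched with stdlib queues (feed_and_drain is an illustrative name; the real loop in _build_document() also honors timeouts and run IDs):

```python
import queue

def feed_and_drain(pages, in_q, out_q):
    # Sketch of the interleaving: offer one page downstream, then drain any
    # finished pages, so the main thread never blocks on a full input queue
    # while results pile up behind it.
    results, pending = [], list(pages)
    while pending:
        try:
            in_q.put_nowait(pending[0])
            pending.pop(0)
        except queue.Full:
            pass                        # input full: drain instead of blocking
        while True:
            try:
                results.append(out_q.get_nowait())
            except queue.Empty:
                break
    return results
```

After the last page is fed, the real pipeline keeps draining until the output queue closes.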
Sources: docling/pipeline/standard_pdf_pipeline.py:620-686
Each execute() call generates a unique run_id from self._run_seq = itertools.count(1) and creates a fresh RunContext with new queues, stages, and threads. Models initialized during __init__() are shared read-only across all runs. Worker threads only mutate thread-local state and the items flowing through their queues.
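A minimal sketch of that isolation pattern (RunContext modeled as a plain dict standing in for queues, stages, and threads):

```python
import itertools

class PipelineSketch:
    # Sketch of per-run isolation: one shared monotonic counter, fresh
    # per-run state on every execute() call.
    def __init__(self):
        self._run_seq = itertools.count(1)

    def execute(self):
        run_id = next(self._run_seq)      # atomic under CPython's GIL
        return {"run_id": run_id, "queues": [], "threads": []}
```

Because each call builds its own context, two concurrent execute() calls never touch the same queues.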
Sources: docling/pipeline/standard_pdf_pipeline.py:429-438, docling/pipeline/standard_pdf_pipeline.py:414-422, docling/pipeline/standard_pdf_pipeline.py:530-587
The pipeline coordinates with external libraries through:
pypdfium2_lock: A global threading.Lock defined in docling/utils/locks.py that serializes access to pypdfium2 APIs. PDF page loading and rendering operations acquire this lock to prevent race conditions in the underlying C library.
Backend Lazy Loading: The PreprocessThreadedStage loads page backends just-in-time within its worker thread, avoiding contention during document initialization.
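The pypdfium2_lock pattern can be sketched as follows (load_page_safely is an illustrative wrapper, not a docling function):

```python
import threading

# Shared module-level lock, as described for docling/utils/locks.py.
pypdfium2_lock = threading.Lock()

def load_page_safely(doc_backend, page_index):
    # Sketch: every call into the non-thread-safe C library is wrapped in
    # the shared lock; doc_backend is any object with a load_page method.
    with pypdfium2_lock:
        return doc_backend.load_page(page_index)
```

A single module-level lock serializes all pypdfium2 calls across every worker thread and every concurrent run.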
Sources: docling/utils/locks.py:1-3, docling/backend/pypdfium2_backend.py:44-47, docling/pipeline/standard_pdf_pipeline.py:323-411
When a stage encounters an exception processing a batch, it marks all items in that run_id group as failed.
Failed items continue flowing through downstream stages without further processing, preserving the page's error information. The main thread collects these failures in ProcessingResult.failed_pages.
Sources: docling/pipeline/standard_pdf_pipeline.py:275-310
The ThreadedPdfPipelineOptions.document_timeout (in seconds) limits total processing time.
When the timeout is exceeded:

- The main thread adds the run_id to ctx.timed_out_run_ids
- Stages detect the timed-out run_id and skip processing but pass items through
- The conversion fails with RuntimeError("document timeout exceeded")

This allows in-flight work to complete while preventing new work from starting.
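The cooperative deadline check can be sketched in a single function (run_with_deadline and its arguments are illustrative names):

```python
import time

def run_with_deadline(pages, process, document_timeout):
    # Sketch of the cooperative timeout: once the deadline passes, no new
    # work starts (items already in flight would still drain through the
    # stages), and the run is reported as a timeout failure.
    deadline = time.monotonic() + document_timeout
    done = []
    for page in pages:
        if time.monotonic() > deadline:
            raise RuntimeError("document timeout exceeded")
        done.append(process(page))
    return done
```

Using time.monotonic() rather than time.time() makes the deadline immune to wall-clock adjustments.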
Sources: docling/pipeline/standard_pdf_pipeline.py:626-644, docling/pipeline/standard_pdf_pipeline.py:261-269, docling/pipeline/standard_pdf_pipeline.py:689-697
The _integrate_results() method translates ProcessingResult.failed_pages into ErrorItem instances appended to ConversionResult.errors, each with component_type=DoclingComponentType.PIPELINE and a descriptive message.
Sources: docling/pipeline/standard_pdf_pipeline.py:706-746
| Option | Type | Default | Description |
|---|---|---|---|
| layout_batch_size | int | 64 | Pages per layout model batch |
| table_batch_size | int | 4 | Pages per table model batch |
| ocr_batch_size | int | 64 | Pages per OCR batch |
| queue_max_size | int | 512 | Maximum items in each queue |
| batch_polling_interval_seconds | float | 5.0 | Timeout for get_batch() calls |
| document_timeout | Optional[float] | None | Maximum seconds per document |
Inherits from PdfPipelineOptions, which provides:

- do_ocr, do_table_structure, do_code_enrichment, etc.
- ocr_options, layout_options, table_structure_options
- generate_page_images, generate_picture_images
- accelerator_options for device selection

Batch sizes should be tuned based on GPU memory and model characteristics. Larger batches improve GPU utilization but increase latency for the first result.
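A configuration sketch tying the table above together. The import path and field names follow the table and this page's imports section, but may differ between docling releases; check pipeline_options.py in your installed version:

```python
# Hedged sketch: verify names against your docling release.
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions

opts = ThreadedPdfPipelineOptions(
    layout_batch_size=32,     # smaller batches -> lower first-result latency
    table_batch_size=4,
    ocr_batch_size=32,
    queue_max_size=256,       # tighter bound -> less peak memory
    document_timeout=300.0,   # fail a document after 5 minutes
)
```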
Sources: docling/datamodel/pipeline_options.py (referenced in imports)
After all pages complete the threaded stages, the pipeline runs sequential post-processing in the main thread:

- ReadingOrderModel sorts document elements logically across pages
- LayoutPostprocessor refines cluster bboxes and removes overlaps
- The results are assembled into a DoclingDocument

These steps are not threaded because they operate on the complete document and have complex interdependencies.
Sources: docling/pipeline/standard_pdf_pipeline.py:747-840
The LegacyStandardPdfPipeline processes pages sequentially in batches through a list of models.
While a batch of pages flows through the entire pipeline together, only one stage is active at a time. The threaded pipeline achieves true parallelism: while the layout model processes batch N, the OCR model can process batch N+1, and the table model can process batch N-1.
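The sequential pattern is simple to sketch (legacy_run is an illustrative name; the real pipeline iterates a build_pipe list of models):

```python
def legacy_run(model_chain, page_batches):
    # Sketch of the sequential pattern: each batch passes through every
    # model before the next batch begins, so only one stage is ever busy.
    results = []
    for batch in page_batches:
        for model in model_chain:
            batch = model(batch)
        results.extend(batch)
    return results
```

Contrast this with the threaded pipeline, where each model runs in its own thread and works on a different batch at the same time.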
The threaded pipeline is now the default StandardPdfPipeline implementation. The legacy version remains available for compatibility and debugging.
Sources: docling/pipeline/legacy_standard_pdf_pipeline.py:80-95, docling/pipeline/standard_pdf_pipeline.py:429-438