The Standard PDF Pipeline (StandardPdfPipeline) is a production-ready, multi-threaded PDF processing pipeline that converts PDF documents into structured DoclingDocument representations. It implements a five-stage architecture (preprocess, OCR, layout detection, table structure, assembly) with each stage running in its own thread, connected by bounded queues that provide backpressure and enable concurrent processing of multiple pages.
This document covers the threading architecture, stage design, feed-drain execution model, timeout handling, and document assembly. For VLM-based PDF processing, see VLM Pipeline. For the simpler single-threaded pipeline used by declarative backends, see Simple Pipeline. For configuration details, see Pipeline Configuration API.
Sources: docling/pipeline/standard_pdf_pipeline.py1-14
StandardPdfPipeline processes PDF documents by breaking them into pages and streaming each page through a sequence of ML models running in separate threads. The pipeline ensures per-run isolation: each execute() call creates fresh queues and threads.

The class extends ConvertPipeline and is instantiated with a ThreadedPdfPipelineOptions object that configures batch sizes, queue sizes, timeouts, and which ML models to enable.
Sources: docling/pipeline/standard_pdf_pipeline.py433-442 docling/pipeline/standard_pdf_pipeline.py1-14
ThreadedItem is the envelope that flows between pipeline stages. It wraps a Page object along with metadata:
| Field | Type | Purpose |
|---|---|---|
| payload | Page \| None | The page being processed |
| run_id | int | Unique identifier for this execution (monotonic across pipeline instance) |
| page_no | int | Page number in the document |
| conv_res | ConversionResult | Reference to the parent conversion result |
| error | Exception \| None | Any exception encountered during processing |
| is_failed | bool | Whether this item has failed processing |
The run_id field is critical for isolation: it allows the pipeline to process multiple documents concurrently (when called from multiple threads) without mixing their data, and enables timeout logic to mark specific runs as abandoned.
Sources: docling/pipeline/standard_pdf_pipeline.py81-91
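The envelope can be sketched as a dataclass with the fields from the table above (defaults and type stand-ins are assumptions; the real class wraps docling's Page and ConversionResult types):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ThreadedItem:
    """Envelope passed between pipeline stages (illustrative sketch)."""
    payload: Optional[Any]            # the Page being processed, or None
    run_id: int                       # monotonic per-execution identifier
    page_no: int                      # page number within the document
    conv_res: Any                     # reference to the parent ConversionResult
    error: Optional[Exception] = None # exception encountered, if any
    is_failed: bool = False           # whether processing of this item failed
```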
ProcessingResult aggregates the outcome of a single pipeline run:
| Field | Type | Purpose |
|---|---|---|
| pages | List[Page] | Successfully processed pages |
| failed_pages | List[Tuple[int, Exception]] | Failed pages with their errors |
| total_expected | int | Total pages expected to be processed |
It provides properties to determine success state: is_partial_success, is_complete_failure, success_count, failure_count.
Sources: docling/pipeline/standard_pdf_pipeline.py93-116
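A minimal sketch of this aggregate and its success-state properties (property semantics inferred from their names; treat the exact boundary conditions as assumptions):

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

@dataclass
class ProcessingResult:
    """Aggregated outcome of one pipeline run (illustrative sketch)."""
    pages: List[Any] = field(default_factory=list)
    failed_pages: List[Tuple[int, Exception]] = field(default_factory=list)
    total_expected: int = 0

    @property
    def success_count(self) -> int:
        return len(self.pages)

    @property
    def failure_count(self) -> int:
        return len(self.failed_pages)

    @property
    def is_partial_success(self) -> bool:
        # Some pages succeeded, but not all that were expected.
        return 0 < self.success_count < self.total_expected

    @property
    def is_complete_failure(self) -> bool:
        return self.success_count == 0 and self.failure_count > 0
```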
ThreadedQueue is a bounded FIFO queue with explicit close() semantics. Unlike Python's standard queue.Queue, it propagates closure downstream: when a queue is closed, consumers waiting on get_batch() are woken and can detect the closure, allowing stages to terminate deterministically without sentinel values.
Key methods:
- put(item, timeout): block until the queue accepts the item or is closed (returns False if closed)
- get_batch(size, timeout): return up to size items, blocking until at least one is available or the queue is closed
- close(): mark the queue as closed and wake all waiting threads

Sources: docling/pipeline/standard_pdf_pipeline.py118-183
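The close-propagation behavior can be sketched with a threading.Condition; this is a simplified stand-in for the real class, not its actual implementation:

```python
import threading
from collections import deque
from typing import Any, List, Optional

class ThreadedQueue:
    """Bounded FIFO with explicit close() semantics (illustrative sketch)."""

    def __init__(self, max_size: int = 128) -> None:
        self._items: deque = deque()
        self._max_size = max_size
        self._closed = False
        self._cond = threading.Condition()

    def put(self, item: Any, timeout: Optional[float] = None) -> bool:
        with self._cond:
            # Wait for space; give up if the queue closes or the wait times out.
            while len(self._items) >= self._max_size and not self._closed:
                if not self._cond.wait(timeout=timeout):
                    return False  # timed out
            if self._closed:
                return False
            self._items.append(item)
            self._cond.notify_all()
            return True

    def get_batch(self, size: int, timeout: Optional[float] = None) -> List[Any]:
        with self._cond:
            # Block until at least one item is available or the queue is closed.
            while not self._items and not self._closed:
                if not self._cond.wait(timeout=timeout):
                    break  # timed out; return whatever is available
            batch = [self._items.popleft() for _ in range(min(size, len(self._items)))]
            self._cond.notify_all()
            return batch

    def close(self) -> None:
        with self._cond:
            self._closed = True
            self._cond.notify_all()  # wake all waiting producers and consumers
```

Because get_batch returns an empty list once the queue is closed and drained, consumers can terminate deterministically without sentinel values.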
Per-Run Isolation
Each call to _build_document() (from execute()) creates a fresh RunContext containing new stage instances, queues, and a unique run_id. This ensures that concurrent calls to the pipeline never share mutable state. The run_id is generated from a monotonic counter self._run_seq (initialized with itertools.count(1)).
Heavy ML models (ocr_model, layout_model, table_model, assemble_model, reading_order_model) are initialized once in _init_models() and stored on the pipeline instance. Worker threads access these models read-only.
Sources: docling/pipeline/standard_pdf_pipeline.py418-426 docling/pipeline/standard_pdf_pipeline.py436-442 docling/pipeline/standard_pdf_pipeline.py539-596
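The run-id generation described above can be sketched as follows; the _run_seq attribute and itertools.count(1) come from the source, while the wrapper class and lock are assumptions added for clarity:

```python
import itertools
import threading

class PipelineRunSequence:
    """Monotonic run-id source, mirroring self._run_seq = itertools.count(1)."""

    def __init__(self) -> None:
        self._run_seq = itertools.count(1)
        self._lock = threading.Lock()  # assumption: guard concurrent execute() calls

    def next_run_id(self) -> int:
        with self._lock:
            return next(self._run_seq)
```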
Each stage is implemented as a ThreadedPipelineStage that wraps a model, manages an input queue, and forwards results to one or more output queues. The stage runs in its own worker thread executing the _run() method:
- Fetch a batch from the input_queue using get_batch(batch_size, batch_timeout)
- Group items by run_id to maximize batching efficiency
- Check timed_out_run_ids: if the run is timed out, skip processing and mark items as failed
- Repeat until the input_queue is closed

When the input queue is closed and empty, the stage closes all output queues and terminates.
Sources: docling/pipeline/standard_pdf_pipeline.py185-325
PreprocessThreadedStage is a specialized stage that lazily loads PDF page backends. When a Page enters this stage with page._backend == None, the stage:
- Retrieves the document backend from conv_res.input._backend
- Calls backend.load_page(page.page_no - 1) to get a page backend
- Stores the result on page._backend
- Calls page_backend.get_size() to populate page.size

This deferred loading reduces memory usage and allows pages to be loaded just-in-time as they flow through the pipeline.
Sources: docling/pipeline/standard_pdf_pipeline.py327-416
Runs the ocr_model (created via get_ocr_factory()) on pages. The model can be Tesseract, EasyOCR, RapidOCR, or OCRMac depending on configuration. Batch size is controlled by pipeline_options.ocr_batch_size (default: 64).
Sources: docling/pipeline/standard_pdf_pipeline.py548-555 docling/pipeline/standard_pdf_pipeline.py460-461 docling/pipeline/standard_pdf_pipeline.py512-521
Runs the layout_model (e.g., EGRET, HERON) for object detection on page images. The model identifies bounding boxes and labels for text blocks, figures, tables, etc. Batch size is controlled by pipeline_options.layout_batch_size (default: 64).
Sources: docling/pipeline/standard_pdf_pipeline.py556-563 docling/pipeline/standard_pdf_pipeline.py461-469
Runs the table_model (typically TableFormer with OTSL) to recognize table structure within table bounding boxes identified by layout detection. Batch size is controlled by pipeline_options.table_batch_size (default: 4, since table models are more computationally expensive).
Sources: docling/pipeline/standard_pdf_pipeline.py564-571 docling/pipeline/standard_pdf_pipeline.py470-479
Runs the assemble_model (PageAssembleModel) which combines the outputs of previous stages into a unified representation. It runs with batch_size=1 and includes a postprocess callback (_release_page_resources()) that:
- Clears page._image_cache if images aren't needed downstream
- Unloads page._backend if not needed for enrichment
- Deletes page.parsed_page if generate_parsed_pages is disabled

This cleanup minimizes memory usage before pages are returned.
Sources: docling/pipeline/standard_pdf_pipeline.py572-580 docling/pipeline/standard_pdf_pipeline.py523-534
Sources: docling/pipeline/standard_pdf_pipeline.py539-596
The main thread in _build_document() operates a feed-drain loop that interleaves page submission and result collection:
Feed Phase: Attempt to enqueue pages using non-blocking put(timeout=0). Continue until the queue is full or all pages are submitted. When the last page is queued, close the input queue to signal no more data.
Drain Phase: Pull results using get_batch(batch_size=32, timeout=0.05). Sort items by run_id and update ProcessingResult.
Timeout Check: Before each iteration, check if elapsed_time > document_timeout. If so, add run_id to the shared timed_out_run_ids set, close the input queue, and break immediately (don't wait for in-flight work).
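The three phases above can be sketched with the standard library queue as a stand-in for the stage chain. This is a deliberate simplification: the real pipeline feeds an input queue, drains a separate output queue, and closes the input queue after the last page; here a single passthrough queue illustrates the interleaving and the timeout check:

```python
import queue
import time
from typing import Any, List, Set

def feed_drain(
    pages: List[Any],
    pipeline_q: "queue.Queue[Any]",   # stands in for the whole stage chain
    document_timeout: float,
    run_id: int,
    timed_out_run_ids: Set[int],
) -> List[Any]:
    """Interleave page submission and result collection (simplified sketch)."""
    start = time.monotonic()
    pending = list(pages)
    results: List[Any] = []
    while len(results) < len(pages):
        # Timeout check: abandon the run rather than waiting for in-flight work.
        if time.monotonic() - start > document_timeout:
            timed_out_run_ids.add(run_id)
            break
        # Feed phase: non-blocking puts until the queue is full or pages run out.
        while pending:
            try:
                pipeline_q.put_nowait(pending[0])
            except queue.Full:
                break  # backpressure: queue is full, switch to draining
            pending.pop(0)
        # Drain phase: pull whatever results are ready.
        try:
            results.append(pipeline_q.get(timeout=0.05))
        except queue.Empty:
            pass
    return results
```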
Sources: docling/pipeline/standard_pdf_pipeline.py628-710
Timeout is configured via pipeline_options.document_timeout (in seconds). When a document exceeds this limit:
- The main thread adds the run_id to RunContext.timed_out_run_ids (a shared set[int])
- Each stage checks this set in _process_batch() before processing a batch
- If an item's run_id is in timed_out_run_ids, the stage marks all items as failed with error "document timeout exceeded" and passes them through without calling the model

Already-completed work is preserved (items that reached the output queue before timeout are included in results), but new work is aborted. The status is set to ConversionStatus.PARTIAL_SUCCESS if any pages completed.
Sources: docling/pipeline/standard_pdf_pipeline.py636-653 docling/pipeline/standard_pdf_pipeline.py265-273 docling/pipeline/standard_pdf_pipeline.py354-362 docling/pipeline/standard_pdf_pipeline.py716-755
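The stage-side check can be sketched as follows; dict items stand in for ThreadedItem, the function name is illustrative, and only the error text comes from the source:

```python
from typing import Any, Dict, List, Set

def fail_timed_out_batch(batch: List[Dict[str, Any]], run_id: int,
                         timed_out_run_ids: Set[int]) -> bool:
    """If the batch's run has timed out, mark every item failed and return
    True so the caller passes items through without invoking the model."""
    if run_id not in timed_out_run_ids:
        return False
    for item in batch:
        item["is_failed"] = True
        item["error"] = TimeoutError("document timeout exceeded")
    return True
```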
After the feed-drain loop completes, _integrate_results() updates the ConversionResult:
| Action | Implementation |
|---|---|
| Filter successful pages | Keep only pages in proc.pages by page number |
| Record errors | Add ErrorItem for each failed page to conv_res.errors |
| Set status | FAILURE if all failed, PARTIAL_SUCCESS if some failed or timeout, SUCCESS otherwise |
| Clean up resources | Clear image cache if not needed, unload backends, delete parsed pages |
The _integrate_results() method does not wait for in-flight work after timeout; any pages still in the pipeline are marked as failed during the timeout phase.
Sources: docling/pipeline/standard_pdf_pipeline.py716-755
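The status rule from the table above can be expressed as a small function; the enum mirrors docling's ConversionStatus names, but this decision function is an illustrative reading of the table, not the library's code:

```python
from enum import Enum

class ConversionStatus(Enum):
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    FAILURE = "failure"

def integrate_status(success_count: int, failure_count: int,
                     timed_out: bool) -> ConversionStatus:
    """FAILURE if all pages failed, PARTIAL_SUCCESS if some failed or the
    document timed out, SUCCESS otherwise."""
    if success_count == 0 and failure_count > 0:
        return ConversionStatus.FAILURE
    if failure_count > 0 or timed_out:
        return ConversionStatus.PARTIAL_SUCCESS
    return ConversionStatus.SUCCESS
```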
After page-level processing completes, _assemble_document() runs in the main thread:
- Collects elements, headers, and body from each page.assembled
- Runs the reading_order_model to establish document-level reading order
- If generate_page_images, attaches full page images to DoclingDocument.pages[page_no].image
- If generate_picture_images or generate_table_images, crops element bounding boxes from page images and attaches them to PictureItem.image or TableItem.image
- Aggregates confidence scores using np.nanmean() or np.nanquantile()
- Calls _add_failed_pages_to_document() to insert empty PageItem entries for failed pages (ensures page break markers are generated correctly during export)

After assembly, the enrichment pipeline runs (inherited from ConvertPipeline):

- CodeFormulaVlmModel detects code blocks and mathematical formulas

Sources: docling/pipeline/standard_pdf_pipeline.py756-853 docling/pipeline/standard_pdf_pipeline.py855-916 docling/pipeline/standard_pdf_pipeline.py483-509
Sources: docling/pipeline/standard_pdf_pipeline.py756-853
StandardPdfPipeline is configured via ThreadedPdfPipelineOptions, which includes:
| Parameter | Type | Default | Purpose |
|---|---|---|---|
| batch_polling_interval_seconds | float | 0.05 | Timeout for get_batch() calls |
| queue_max_size | int | 128 | Maximum items in each queue (controls backpressure) |
| ocr_batch_size | int | 64 | Batch size for OCR stage |
| layout_batch_size | int | 64 | Batch size for layout stage |
| table_batch_size | int | 4 | Batch size for table stage |
| document_timeout | float \| None | None | Maximum processing time per document (seconds) |
| do_ocr | bool | True | Enable OCR |
| do_table_structure | bool | True | Enable table structure recognition |
| do_code_enrichment | bool | False | Enable code block detection |
| do_formula_enrichment | bool | False | Enable formula detection |
| generate_page_images | bool | False | Include full page images in output |
| generate_picture_images | bool | False | Include cropped picture images in output |
| generate_parsed_pages | bool | False | Include parsed page data in output |
| images_scale | float | 2.0 | Scaling factor for page images |
The pipeline also uses ocr_options, layout_options, table_structure_options, code_formula_options, and accelerator_options to configure individual models.
Sources: docling/pipeline/standard_pdf_pipeline.py436-438 docling/datamodel/pipeline_options.py
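A configuration sketch wiring these options into a converter. The option names come from the table above; the import paths and PdfFormatOption wiring follow docling's documented public API but may differ between versions, so treat this as illustrative:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

opts = ThreadedPdfPipelineOptions(
    ocr_batch_size=64,       # pages per OCR batch
    layout_batch_size=64,    # pages per layout-detection batch
    table_batch_size=4,      # table structure is the most expensive stage
    queue_max_size=128,      # bounded queues provide backpressure
    document_timeout=300.0,  # abandon a document after 5 minutes
    do_ocr=True,
    do_table_structure=True,
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
# result = converter.convert("report.pdf")
```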
Threads are created as daemon=False to ensure they complete before Python exits. When stopping stages, the pipeline:
- Closes all queues (waking threads blocked in get_batch() calls)
- Calls thread.join(timeout=15.0) to wait for graceful shutdown

Abandoned threads may occur if a model inference call or PDF backend operation is blocking (e.g., a long-running Tesseract OCR or a stuck pypdfium2_lock). The pipeline design minimizes this risk by using timeouts where possible, but model libraries don't always support interruption.
Sources: docling/pipeline/standard_pdf_pipeline.py226-240 docling/pipeline/standard_pdf_pipeline.py599-606
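The shutdown sequence above can be sketched as follows (function name and the abandoned-thread return value are illustrative; the close-then-join-with-timeout structure is from the source):

```python
import threading
from typing import Any, List

def stop_stages(queues: List[Any], threads: List[threading.Thread],
                join_timeout: float = 15.0) -> List[threading.Thread]:
    """Close all queues to wake blocked workers, then join each thread with
    a bounded wait. Returns any threads that failed to stop (abandoned)."""
    for q in queues:
        q.close()  # wakes consumers blocked in get_batch()
    abandoned = []
    for t in threads:
        t.join(timeout=join_timeout)
        if t.is_alive():
            abandoned.append(t)  # e.g. stuck inside a model call
    return abandoned
```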