The Standard PDF Pipeline (StandardPdfPipeline) is a production-ready, multi-threaded PDF processing pipeline that converts PDF documents into structured DoclingDocument representations. It implements a five-stage architecture (preprocess, OCR, layout detection, table structure, assembly) with each stage running in its own thread, connected by bounded queues that provide backpressure and enable concurrent processing of multiple pages.
This document covers the threading architecture, stage design, feed-drain execution model, timeout handling, and document assembly. For VLM-based PDF processing, see VLM Pipeline. For the simpler single-threaded pipeline used by declarative backends, see Simple Pipeline. For configuration details, see Pipeline Configuration API.
Sources: docling/pipeline/standard_pdf_pipeline.py1-14
StandardPdfPipeline processes PDF documents by breaking them into pages and streaming each page through a sequence of ML models running in separate threads. The pipeline ensures per-run isolation: each execute() call creates fresh queues and threads.

The class extends ConvertPipeline and is instantiated with a ThreadedPdfPipelineOptions object that configures batch sizes, queue sizes, timeouts, and which ML models to enable.
Sources: docling/pipeline/standard_pdf_pipeline.py433-442 docling/pipeline/standard_pdf_pipeline.py1-14
ThreadedItem is the envelope that flows between pipeline stages. It wraps a Page object along with metadata:
| Field | Type | Purpose |
|---|---|---|
| payload | Page \| None | The page being processed |
| run_id | int | Unique identifier for this execution (monotonic across pipeline instance) |
| page_no | int | Page number in the document |
| conv_res | ConversionResult | Reference to the parent conversion result |
| error | Exception \| None | Any exception encountered during processing |
| is_failed | bool | Whether this item has failed processing |
The run_id field is critical for isolation: it allows the pipeline to process multiple documents concurrently (when called from multiple threads) without mixing their data, and enables timeout logic to mark specific runs as abandoned.
Sources: docling/pipeline/standard_pdf_pipeline.py81-91
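The envelope can be sketched as a dataclass with the fields from the table above (defaults and type stand-ins are assumptions; the real class wraps docling's Page and ConversionResult types):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ThreadedItem:
    """Envelope passed between pipeline stages (illustrative sketch)."""
    payload: Optional[Any]            # the Page being processed, or None
    run_id: int                       # monotonic per-execution identifier
    page_no: int                      # page number within the document
    conv_res: Any                     # reference to the parent ConversionResult
    error: Optional[Exception] = None # exception encountered, if any
    is_failed: bool = False           # whether processing of this item failed
```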
ProcessingResult aggregates the outcome of a single pipeline run:
| Field | Type | Purpose |
|---|---|---|
| pages | List[Page] | Successfully processed pages |
| failed_pages | List[Tuple[int, Exception]] | Failed pages with their errors |
| total_expected | int | Total pages expected to be processed |
It provides properties to determine success state: is_partial_success, is_complete_failure, success_count, failure_count.
Sources: docling/pipeline/standard_pdf_pipeline.py93-116
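A minimal sketch of this aggregate and its success-state properties (property semantics inferred from their names; treat the exact boundary conditions as assumptions):

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

@dataclass
class ProcessingResult:
    """Aggregated outcome of one pipeline run (illustrative sketch)."""
    pages: List[Any] = field(default_factory=list)
    failed_pages: List[Tuple[int, Exception]] = field(default_factory=list)
    total_expected: int = 0

    @property
    def success_count(self) -> int:
        return len(self.pages)

    @property
    def failure_count(self) -> int:
        return len(self.failed_pages)

    @property
    def is_partial_success(self) -> bool:
        # Some pages succeeded, but not all that were expected.
        return 0 < self.success_count < self.total_expected

    @property
    def is_complete_failure(self) -> bool:
        return self.success_count == 0 and self.failure_count > 0
```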
ThreadedQueue is a bounded FIFO queue with explicit close() semantics. Unlike Python's standard queue.Queue, it propagates closure downstream: when a queue is closed, consumers waiting on get_batch() are woken and can detect the closure, allowing stages to terminate deterministically without sentinel values.
Key methods:
- put(item, timeout): block until the queue accepts the item or is closed (returns False if closed)
- get_batch(size, timeout): return up to size items, blocking until at least one is available or the queue is closed
- close(): mark the queue as closed and wake all waiting threads

Sources: docling/pipeline/standard_pdf_pipeline.py118-183
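The close-propagation behavior can be sketched with a threading.Condition; this is a simplified stand-in for the real class, not its actual implementation:

```python
import threading
from collections import deque
from typing import Any, List, Optional

class ThreadedQueue:
    """Bounded FIFO with explicit close() semantics (illustrative sketch)."""

    def __init__(self, max_size: int = 128) -> None:
        self._items: deque = deque()
        self._max_size = max_size
        self._closed = False
        self._cond = threading.Condition()

    def put(self, item: Any, timeout: Optional[float] = None) -> bool:
        with self._cond:
            # Wait for space; give up if the queue closes or the wait times out.
            while len(self._items) >= self._max_size and not self._closed:
                if not self._cond.wait(timeout=timeout):
                    return False  # timed out
            if self._closed:
                return False
            self._items.append(item)
            self._cond.notify_all()
            return True

    def get_batch(self, size: int, timeout: Optional[float] = None) -> List[Any]:
        with self._cond:
            # Block until at least one item is available or the queue is closed.
            while not self._items and not self._closed:
                if not self._cond.wait(timeout=timeout):
                    break  # timed out; return whatever is available
            batch = [self._items.popleft() for _ in range(min(size, len(self._items)))]
            self._cond.notify_all()
            return batch

    def close(self) -> None:
        with self._cond:
            self._closed = True
            self._cond.notify_all()  # wake all waiting producers and consumers
```

Because get_batch returns an empty list once the queue is closed and drained, consumers can terminate deterministically without sentinel values.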
Per-Run Isolation
Each call to _build_document() (from execute()) creates a fresh RunContext containing new stage instances, queues, and a unique run_id. This ensures that concurrent calls to the pipeline never share mutable state. The run_id is generated from a monotonic counter self._run_seq (initialized with itertools.count(1)).
Heavy ML models (ocr_model, layout_model, table_model, assemble_model, reading_order_model) are initialized once in _init_models() and stored on the pipeline instance. Worker threads access these models read-only.
Sources: docling/pipeline/standard_pdf_pipeline.py418-426 docling/pipeline/standard_pdf_pipeline.py436-442 docling/pipeline/standard_pdf_pipeline.py539-596
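The run-id generation described above can be sketched as follows; the _run_seq attribute and itertools.count(1) come from the source, while the wrapper class and lock are assumptions added for clarity:

```python
import itertools
import threading

class PipelineRunSequence:
    """Monotonic run-id source, mirroring self._run_seq = itertools.count(1)."""

    def __init__(self) -> None:
        self._run_seq = itertools.count(1)
        self._lock = threading.Lock()  # assumption: guard concurrent execute() calls

    def next_run_id(self) -> int:
        with self._lock:
            return next(self._run_seq)
```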
Each stage is implemented as a ThreadedPipelineStage that wraps a model, manages an input queue, and forwards results to one or more output queues. The stage runs in its own worker thread executing the _run() method:
- Fetch a batch from the input_queue using get_batch(batch_size, batch_timeout)
- Group items by run_id to maximize batching efficiency
- Check timed_out_run_ids: if the run is timed out, skip processing and mark items as failed
- Repeat until the input_queue is closed

When the input queue is closed and empty, the stage closes all output queues and terminates.
Sources: docling/pipeline/standard_pdf_pipeline.py185-325
PreprocessThreadedStage is a specialized stage that lazily loads PDF page backends. When a Page enters this stage with page._backend == None, the stage:
- Retrieves the document backend from conv_res.input._backend
- Calls backend.load_page(page.page_no - 1) to get a page backend
- Stores the result on page._backend
- Calls page_backend.get_size() to populate page.size

This deferred loading reduces memory usage and allows pages to be loaded just-in-time as they flow through the pipeline.
Sources: docling/pipeline/standard_pdf_pipeline.py327-416
Runs the ocr_model (created via get_ocr_factory()) on pages. The model can be Tesseract, EasyOCR, RapidOCR, or OCRMac depending on configuration. Batch size is controlled by pipeline_options.ocr_batch_size (default: 64).
Sources: docling/pipeline/standard_pdf_pipeline.py548-555 docling/pipeline/standard_pdf_pipeline.py460-461 docling/pipeline/standard_pdf_pipeline.py512-521
Runs the layout_model (e.g., EGRET, HERON) for object detection on page images. The model identifies bounding boxes and labels for text blocks, figures, tables, etc. Batch size is controlled by pipeline_options.layout_batch_size (default: 64).
Sources: docling/pipeline/standard_pdf_pipeline.py556-563 docling/pipeline/standard_pdf_pipeline.py461-469
Runs the table_model (typically TableFormer with OTSL) to recognize table structure within table bounding boxes identified by layout detection. Batch size is controlled by pipeline_options.table_batch_size (default: 4, since table models are more computationally expensive).
Sources: docling/pipeline/standard_pdf_pipeline.py564-571 docling/pipeline/standard_pdf_pipeline.py470-479
Runs the assemble_model (PageAssembleModel) which combines the outputs of previous stages into a unified representation. It runs with batch_size=1 and includes a postprocess callback (_release_page_resources()) that:
- Clears page._image_cache if images aren't needed downstream
- Unloads page._backend if not needed for enrichment
- Deletes page.parsed_page if generate_parsed_pages is disabled

This cleanup minimizes memory usage before pages are returned.
Sources: docling/pipeline/standard_pdf_pipeline.py572-580 docling/pipeline/standard_pdf_pipeline.py523-534
Sources: docling/pipeline/standard_pdf_pipeline.py539-596
The main thread in _build_document() operates a feed-drain loop that interleaves page submission and result collection:
Feed Phase: Attempt to enqueue pages using non-blocking put(timeout=0). Continue until the queue is full or all pages are submitted. When the last page is queued, close the input queue to signal no more data.
Drain Phase: Pull results using get_batch(batch_size=32, timeout=0.05). Sort items by run_id and update ProcessingResult.
Timeout Check: Before each iteration, check if elapsed_time > document_timeout. If so, add run_id to the shared timed_out_run_ids set, close the input queue, and break immediately (don't wait for in-flight work).
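The three phases above can be sketched with the standard library queue as a stand-in for the stage chain. This is a deliberate simplification: the real pipeline feeds an input queue, drains a separate output queue, and closes the input queue after the last page; here a single passthrough queue illustrates the interleaving and the timeout check:

```python
import queue
import time
from typing import Any, List, Set

def feed_drain(
    pages: List[Any],
    pipeline_q: "queue.Queue[Any]",   # stands in for the whole stage chain
    document_timeout: float,
    run_id: int,
    timed_out_run_ids: Set[int],
) -> List[Any]:
    """Interleave page submission and result collection (simplified sketch)."""
    start = time.monotonic()
    pending = list(pages)
    results: List[Any] = []
    while len(results) < len(pages):
        # Timeout check: abandon the run rather than waiting for in-flight work.
        if time.monotonic() - start > document_timeout:
            timed_out_run_ids.add(run_id)
            break
        # Feed phase: non-blocking puts until the queue is full or pages run out.
        while pending:
            try:
                pipeline_q.put_nowait(pending[0])
            except queue.Full:
                break  # backpressure: queue is full, switch to draining
            pending.pop(0)
        # Drain phase: pull whatever results are ready.
        try:
            results.append(pipeline_q.get(timeout=0.05))
        except queue.Empty:
            pass
    return results
```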
Sources: docling/pipeline/standard_pdf_pipeline.py628-710
Timeout is configured via pipeline_options.document_timeout (in seconds). When a document exceeds this limit:
- The main thread adds the run_id to RunContext.timed_out_run_ids (a shared set[int])
- Each stage checks this set in _process_batch() before processing a batch
- If an item's run_id is in timed_out_run_ids, the stage marks all items as failed with error "document timeout exceeded" and passes them through without calling the model

Already-completed work is preserved (items that reached the output queue before timeout are included in results), but new work is aborted. The status is set to ConversionStatus.PARTIAL_SUCCESS if any pages completed.
Sources: docling/pipeline/standard_pdf_pipeline.py636-653 docling/pipeline/standard_pdf_pipeline.py265-273 docling/pipeline/standard_pdf_pipeline.py354-362 docling/pipeline/standard_pdf_pipeline.py716-755
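The stage-side check can be sketched as follows; dict items stand in for ThreadedItem, the function name is illustrative, and only the error text comes from the source:

```python
from typing import Any, Dict, List, Set

def fail_timed_out_batch(batch: List[Dict[str, Any]], run_id: int,
                         timed_out_run_ids: Set[int]) -> bool:
    """If the batch's run has timed out, mark every item failed and return
    True so the caller passes items through without invoking the model."""
    if run_id not in timed_out_run_ids:
        return False
    for item in batch:
        item["is_failed"] = True
        item["error"] = TimeoutError("document timeout exceeded")
    return True
```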
After the feed-drain loop completes, _integrate_results() updates the ConversionResult:
| Action | Implementation |
|---|---|
| Filter successful pages | Keep only pages in proc.pages by page number |
| Record errors | Add ErrorItem for each failed page to conv_res.errors |
| Set status | FAILURE if all failed, PARTIAL_SUCCESS if some failed or timeout, SUCCESS otherwise |
| Clean up resources | Clear image cache if not needed, unload backends, delete parsed pages |
The _integrate_results() method does not wait for in-flight work after timeout; any pages still in the pipeline are marked as failed during the timeout phase.
Sources: docling/pipeline/standard_pdf_pipeline.py716-755
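The status rule from the table above can be expressed as a small function; the enum mirrors docling's ConversionStatus names, but this decision function is an illustrative reading of the table, not the library's code:

```python
from enum import Enum

class ConversionStatus(Enum):
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    FAILURE = "failure"

def integrate_status(success_count: int, failure_count: int,
                     timed_out: bool) -> ConversionStatus:
    """FAILURE if all pages failed, PARTIAL_SUCCESS if some failed or the
    document timed out, SUCCESS otherwise."""
    if success_count == 0 and failure_count > 0:
        return ConversionStatus.FAILURE
    if failure_count > 0 or timed_out:
        return ConversionStatus.PARTIAL_SUCCESS
    return ConversionStatus.SUCCESS
```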
After page-level processing completes, _assemble_document() runs in the main thread:
- Collects elements, headers, and body from each page.assembled
- Runs the reading_order_model to establish document-level reading order
- If generate_page_images, attaches full page images to DoclingDocument.pages[page_no].image
- If generate_picture_images or generate_table_images, crops element bounding boxes from page images and attaches them to PictureItem.image or TableItem.image
- Aggregates confidence scores using np.nanmean() or np.nanquantile()
- Calls _add_failed_pages_to_document() to insert empty PageItem entries for failed pages (ensures page break markers are generated correctly during export)

After assembly, the enrichment pipeline runs (inherited from ConvertPipeline):

- CodeFormulaVlmModel detects code blocks and mathematical formulas

Sources: docling/pipeline/standard_pdf_pipeline.py756-853 docling/pipeline/standard_pdf_pipeline.py855-916 docling/pipeline/standard_pdf_pipeline.py483-509
Sources: docling/pipeline/standard_pdf_pipeline.py756-853
StandardPdfPipeline is configured via ThreadedPdfPipelineOptions, which includes:
| Parameter | Type | Default | Purpose |
|---|---|---|---|
| batch_polling_interval_seconds | float | 0.05 | Timeout for get_batch() calls |
| queue_max_size | int | 128 | Maximum items in each queue (controls backpressure) |
| ocr_batch_size | int | 64 | Batch size for OCR stage |
| layout_batch_size | int | 64 | Batch size for layout stage |
| table_batch_size | int | 4 | Batch size for table stage |
| document_timeout | float \| None | None | Maximum processing time per document (seconds) |
| do_ocr | bool | True | Enable OCR |
| do_table_structure | bool | True | Enable table structure recognition |
| do_code_enrichment | bool | False | Enable code block detection |
| do_formula_enrichment | bool | False | Enable formula detection |
| generate_page_images | bool | False | Include full page images in output |
| generate_picture_images | bool | False | Include cropped picture images in output |
| generate_parsed_pages | bool | False | Include parsed page data in output |
| images_scale | float | 2.0 | Scaling factor for page images |
The pipeline also uses ocr_options, layout_options, table_structure_options, code_formula_options, and accelerator_options to configure individual models.
Sources: docling/pipeline/standard_pdf_pipeline.py436-438 docling/datamodel/pipeline_options.py
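A configuration sketch wiring these options into a converter. The option names come from the table above; the import paths and PdfFormatOption wiring follow docling's documented public API but may differ between versions, so treat this as illustrative:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

opts = ThreadedPdfPipelineOptions(
    ocr_batch_size=64,       # pages per OCR batch
    layout_batch_size=64,    # pages per layout-detection batch
    table_batch_size=4,      # table structure is the most expensive stage
    queue_max_size=128,      # bounded queues provide backpressure
    document_timeout=300.0,  # abandon a document after 5 minutes
    do_ocr=True,
    do_table_structure=True,
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
# result = converter.convert("report.pdf")
```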
Threads are created as daemon=False to ensure they complete before Python exits. When stopping stages, the pipeline:
- Closes all queues (waking threads blocked in get_batch() calls)
- Calls thread.join(timeout=15.0) to wait for graceful shutdown

Abandoned threads may occur if a model inference call or PDF backend operation is blocking (e.g., a long-running Tesseract OCR or a stuck pypdfium2_lock). The pipeline design minimizes this risk by using timeouts where possible, but model libraries don't always support interruption.
Sources: docling/pipeline/standard_pdf_pipeline.py226-240 docling/pipeline/standard_pdf_pipeline.py599-606
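The shutdown sequence above can be sketched as follows (function name and the abandoned-thread return value are illustrative; the close-then-join-with-timeout structure is from the source):

```python
import threading
from typing import Any, List

def stop_stages(queues: List[Any], threads: List[threading.Thread],
                join_timeout: float = 15.0) -> List[threading.Thread]:
    """Close all queues to wake blocked workers, then join each thread with
    a bounded wait. Returns any threads that failed to stop (abandoned)."""
    for q in queues:
        q.close()  # wakes consumers blocked in get_batch()
    abandoned = []
    for t in threads:
        t.join(timeout=join_timeout)
        if t.is_alive():
            abandoned.append(t)  # e.g. stuck inside a model call
    return abandoned
```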