This page documents how Docling integrates with popular AI frameworks and libraries for downstream tasks such as Retrieval-Augmented Generation (RAG), document processing, and agentic applications. The focus is on the integration patterns, export mechanisms, and framework-specific adaptations that enable Docling to work seamlessly with LangChain, LlamaIndex, Haystack, and other AI development frameworks.
For document chunking strategies specific to RAG applications, see Document Chunking. For the Model Context Protocol (MCP) server integration with AI agents, see MCP Server Integration. For general export format details, see Export Formats.
Docling serves as a document parsing and conversion layer that produces structured DoclingDocument objects. These documents can be exported to various formats and consumed by downstream AI frameworks. The integration pattern is designed to be framework-agnostic, with Docling focusing on high-quality document understanding and frameworks focusing on their specific AI/ML workflows.
Sources: mkdocs.yml 109-148, README.md 29-43
The standard integration pattern involves three stages:
1. Use `DocumentConverter` to parse input documents into `DoclingDocument` objects
2. Export the `DoclingDocument` to a format compatible with the target framework
3. Ingest the exported content into the framework's document or node structures

The `DoclingDocument` class from the `docling_core` library serves as the central integration point. It provides a unified representation with hierarchical document structure, typed content items (text, tables, pictures), and provenance metadata.
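The three stages above can be sketched as a small pipeline. The `StubConverter` and `StubDocument` classes below are stand-ins for Docling's `DocumentConverter` and `DoclingDocument` (note that the real `convert()` returns a result object whose `.document` attribute holds the `DoclingDocument`); only the control flow is shown here:

```python
# Sketch of the three-stage pattern: parse -> export -> ingest.
# StubConverter/StubDocument stand in for Docling's DocumentConverter
# and DoclingDocument so the pipeline shape is visible on its own.

class StubDocument:
    """Stand-in for docling_core's DoclingDocument."""
    def __init__(self, text: str) -> None:
        self._text = text

    def export_to_markdown(self) -> str:
        return self._text


class StubConverter:
    """Stand-in; real code: DocumentConverter().convert(src).document"""
    def convert(self, source: str) -> StubDocument:
        return StubDocument(f"# Parsed {source}")


def run_pipeline(converter, source, ingest):
    doc = converter.convert(source)        # stage 1: parse
    markdown = doc.export_to_markdown()    # stage 2: export
    return ingest(markdown, source)        # stage 3: ingest


records = run_pipeline(
    StubConverter(),
    "report.pdf",
    ingest=lambda text, src: {"page_content": text,
                              "metadata": {"source": src}},
)
```

The `ingest` callable is the only framework-specific piece; swapping it is what the per-framework sections below describe.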
Sources: README.md 34-36, docs/index.md 40
LangChain integration typically uses Docling as a document loader to populate Document objects for downstream processing. The integration leverages:
- Export of the `DoclingDocument` to Markdown text for LangChain's `Document` class

Integration Pattern:
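A hedged sketch of this mapping, assuming the Markdown text has already been produced by `export_to_markdown()`. The `LCDocument` dataclass below only mirrors the `page_content`/`metadata` shape of `langchain_core.documents.Document`; swap in the real class when LangChain is installed:

```python
from dataclasses import dataclass, field


# LCDocument mirrors LangChain's Document shape (page_content + metadata)
# so this sketch runs without langchain installed.
@dataclass
class LCDocument:
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_langchain_document(markdown: str, source: str) -> LCDocument:
    # markdown would come from DoclingDocument.export_to_markdown()
    return LCDocument(
        page_content=markdown,
        metadata={"source": source, "format": "markdown"},
    )


doc = to_langchain_document("# Annual Report\n\nRevenue grew 12%.", "report.pdf")
```

In practice the `langchain-docling` package ships a ready-made loader that performs this mapping, so hand-rolled adapters like this are mainly useful for custom metadata handling.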
Sources: mkdocs.yml 110-111, README.md 38
LlamaIndex integration uses Docling for document parsing with integration into LlamaIndex's Document and node structure:
- Mapping of `DoclingDocument` content items into LlamaIndex nodes

Integration Pattern:
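A hedged sketch of the item-to-node mapping. The input dicts stand in for `DoclingDocument` content items, and plain dicts mirror LlamaIndex `TextNode` objects, so the mapping itself stays framework-free; the field names on the Docling side are illustrative, not the exact `docling_core` schema:

```python
# Map per-item text plus provenance into TextNode-like dicts.
def items_to_nodes(items):
    nodes = []
    for item in items:
        nodes.append({
            "text": item["text"],
            "metadata": {
                "label": item.get("label", "text"),  # e.g. paragraph, table
                "page": item.get("page"),            # provenance for grounding
            },
        })
    return nodes


items = [
    {"text": "Introduction ...", "label": "section_header", "page": 1},
    {"text": "Docling parses PDFs.", "label": "paragraph", "page": 1},
]
nodes = items_to_nodes(items)
```

Keeping the label and page provenance in node metadata is what enables downstream filtering and visual grounding in the retrieval step.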
Sources: mkdocs.yml 112, README.md 38
Haystack integration positions Docling as a document converter in Haystack's pipeline architecture:
Integration Pattern:
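A sketch of what a converter component in this position looks like. Real Haystack 2.x components are declared with Haystack's `@component` decorator and declared output types; the plain class below mirrors only the `run()` contract (documents with `content` and `meta`) so the converter's role in the pipeline is clear without Haystack installed:

```python
# Plain-class sketch of a Haystack-style Docling converter component.
class DoclingConverterComponent:
    def __init__(self, export: str = "markdown") -> None:
        self.export = export

    def run(self, sources: list) -> dict:
        documents = []
        for src in sources:
            # Real code would call:
            #   DocumentConverter().convert(src).document.export_to_markdown()
            text = f"(converted {src} to {self.export})"
            documents.append({"content": text, "meta": {"source": src}})
        return {"documents": documents}


out = DoclingConverterComponent().run(["a.pdf", "b.docx"])
```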
Sources: mkdocs.yml 110, README.md 38
CrewAI integration uses Docling as a tool within agent workflows.
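A minimal sketch of exposing Docling as an agent tool. CrewAI declares tools via its own decorator or base class; here a plain function with a docstring (which agent frameworks typically surface as the tool description) shows the shape, with the real Docling call noted in a comment:

```python
# Sketch: a document-reading tool an agent framework could register.
def read_document(path: str) -> str:
    """Parse a document with Docling and return its Markdown text."""
    # Real code would return:
    #   DocumentConverter().convert(path).document.export_to_markdown()
    return f"# Contents of {path}"


# A toy tool registry standing in for the framework's tool list.
TOOLS = {"read_document": read_document}

result = TOOLS["read_document"]("contract.pdf")
```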
Sources: mkdocs.yml 141, README.md 38
Additional frameworks supported through the Docling integration ecosystem:
| Framework | Integration Type | Primary Use Case |
|---|---|---|
| txtai | Document loader | Semantic search and indexing |
| Semantica | Document processor | Knowledge graph construction |
| Bee Agent Framework | Agent tool | Document processing in agent workflows |
| Langflow | Node component | Visual workflow integration |
| Hector | Document parser | Domain-specific document processing |
| Apify | Actor integration | Web scraping and document processing |
| Data Prep Kit | Data pipeline | Large-scale document preparation |
| InstructLab | Knowledge base | Synthetic data generation |
| spaCy | NLP pipeline | Entity extraction and NLP tasks |
| Prodigy | Annotation tool | Document annotation workflows |
Sources: mkdocs.yml 137-166, README.md 116-117
Sources: README.md 73-81, docs/usage/index.md 5-20
Sources: mkdocs.yml 109-113, README.md 38
This pattern demonstrates how a single DoclingDocument can feed multiple frameworks simultaneously:
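The fan-out can be sketched as a single function: parse once, then hand one export to each framework adapter. The adapters here are trivial dict builders standing in for real framework ingestion, and the field names (`page_content`/`metadata` for LangChain, `content`/`meta` for Haystack) follow each framework's document shape:

```python
# Fan out one parsed document to several framework-specific shapes.
def fan_out(markdown: str, as_dict: dict, source: str) -> dict:
    return {
        # LangChain consumes Markdown text as page_content
        "langchain": {"page_content": markdown,
                      "metadata": {"source": source}},
        # Haystack documents carry content + meta
        "haystack": {"content": markdown, "meta": {"source": source}},
        # Custom pipelines can take the structured dict export directly
        "custom": as_dict,
    }


# markdown would come from export_to_markdown(), as_dict from export_to_dict()
targets = fan_out("# Title", {"schema_name": "DoclingDocument"}, "a.pdf")
```

Because parsing happens once, adding another downstream framework costs only one more adapter entry, not another conversion pass.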
Sources: README.md 34-36
The DoclingDocument class provides multiple export methods optimized for different framework requirements:
| Export Method | Output Format | Primary Use Case | Framework Compatibility |
|---|---|---|---|
| `export_to_markdown()` | Markdown text | Text-based processing, RAG | LangChain, LlamaIndex, Haystack |
| `export_to_json()` | JSON string | Structured data exchange | All frameworks |
| `export_to_dict()` | Python dictionary | Programmatic access | Custom integrations |
| `export_to_html()` | HTML string | Web rendering | Web applications |
| `export_to_doctags()` | DocTags format | Semantic structure | Specialized NLP pipelines |
Each export method supports options to control output formatting:
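A small dispatch sketch tying the table above to code: pick the export method by target framework rather than hard-coding one format. `StubDoc` stands in for a `DoclingDocument` and exposes the same method names:

```python
# StubDoc mirrors the DoclingDocument export method names from the table.
class StubDoc:
    def export_to_markdown(self):
        return "# md"

    def export_to_html(self):
        return "<h1>html</h1>"

    def export_to_dict(self):
        return {"schema_name": "DoclingDocument"}


# Target framework -> export method name (per the table above).
EXPORTERS = {
    "langchain": "export_to_markdown",
    "haystack": "export_to_markdown",
    "web": "export_to_html",
    "custom": "export_to_dict",
}


def export_for(doc, target: str):
    return getattr(doc, EXPORTERS[target])()


payload = export_for(StubDoc(), "langchain")
```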
Framework integrations typically require mapping Docling metadata to framework-specific schemas:
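A hedged sketch of such a mapping: Docling-style metadata (source file, page provenance, item label) flattened into the flat metadata dict most frameworks expect. The Docling-side field names below are illustrative, not the exact `docling_core` schema:

```python
# Flatten Docling-style metadata into a framework-friendly dict.
def map_metadata(docling_meta: dict) -> dict:
    return {
        "source": docling_meta.get("origin", {}).get("filename", "unknown"),
        "page": docling_meta.get("prov", [{}])[0].get("page_no"),
        "category": docling_meta.get("label", "text"),
    }


meta = map_metadata({
    "origin": {"filename": "report.pdf"},
    "prov": [{"page_no": 3}],
    "label": "table",
})
```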
Sources: README.md 36, docs/index.md 41
For RAG applications, document chunking is a critical integration point. Docling provides chunking capabilities that work with framework-specific chunkers.
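Docling ships real chunkers (covered on the Document Chunking page); the stdlib sketch below only illustrates the integration point itself: each chunk carries text plus metadata that a framework chunk or node can hold, here using heading-aware splitting of exported Markdown:

```python
# Toy heading-aware chunker over Markdown exported from a DoclingDocument.
def chunk_markdown(markdown: str) -> list:
    chunks, heading, lines = [], None, []

    def flush():
        if lines:
            chunks.append({"text": "\n".join(lines).strip(),
                           "metadata": {"heading": heading}})

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()                                   # close previous section
            heading, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    flush()                                           # close final section
    return [c for c in chunks if c["text"]]


chunks = chunk_markdown("# Intro\nHello.\n# Methods\nWe parse PDFs.")
```

Real Docling chunkers work on the structured `DoclingDocument` rather than its Markdown export, which preserves tables and provenance instead of re-deriving structure from text.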
See Document Chunking for detailed information on chunking strategies and implementation.
Sources: mkdocs.yml 103-106, docs/index.md 78
The Docling repository includes working examples demonstrating framework integrations:
| Example | Framework | File Location | Description |
|---|---|---|---|
| RAG with Haystack | Haystack | docs/examples/rag_haystack.ipynb | Complete RAG pipeline with Haystack |
| RAG with LangChain | LangChain | docs/examples/rag_langchain.ipynb | Document loading and RAG with LangChain |
| RAG with LlamaIndex | LlamaIndex | docs/examples/rag_llamaindex.ipynb | Query engine integration with LlamaIndex |
| RAG with Milvus | Vector Store | docs/examples/rag_milvus.ipynb | Vector store integration with Milvus |
| RAG with Qdrant | Vector Store | docs/examples/retrieval_qdrant.ipynb | Retrieval pipeline with Qdrant |
| RAG with Weaviate | Vector Store | docs/examples/rag_weaviate.ipynb | Document indexing with Weaviate |
| RAG with MongoDB | Vector Store | docs/examples/rag_mongodb.ipynb | Atlas Vector Search integration |
| RAG with Azure Search | Vector Store | docs/examples/rag_azuresearch.ipynb | Azure Cognitive Search integration |
| RAG with OpenSearch | Vector Store | docs/examples/rag_opensearch.ipynb | OpenSearch vector store integration |
| Visual Grounding | Multimodal | docs/examples/visual_grounding.ipynb | Spatial provenance for multimodal RAG |
Sources: mkdocs.yml 109-135
- Use `StandardPdfPipeline` for structured documents and `VlmPipeline` for complex layouts
- Reuse a single `DocumentConverter` to process multiple documents efficiently

Sources: README.md 84-85, docs/usage/index.md 22-24