This document describes the end-to-end RAG (Retrieval-Augmented Generation) pipeline in DB-GPT, covering knowledge ingestion, storage, and retrieval systems. The pipeline transforms raw documents into searchable knowledge through chunking, embedding, and storage in vector databases or knowledge graphs. For information about model integration and inference, see Model Integration and Inference. For retrieval strategies and reranking, see Retrieval Strategies and Reranking.
The RAG pipeline consists of five main layers: ingestion, chunking, embedding, storage, and retrieval. The StorageManager acts as a factory for creating storage connectors, while the KnowledgeService orchestrates the entire workflow.
The KnowledgeFactory creates Knowledge instances from various sources. Documents are loaded and transformed into Chunk objects with metadata.
Chunking Strategies
| Strategy | Description | Configuration |
|---|---|---|
| CHUNK_BY_SIZE | Fixed-size chunks with overlap | chunk_size, chunk_overlap |
| CHUNK_BY_SEPARATOR | Split by delimiters | Custom separators |
| CHUNK_BY_PAGE | Page-based splitting | Page boundaries |
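The CHUNK_BY_SIZE strategy can be sketched as a sliding window over the text. This is an illustrative simplification, not DB-GPT's actual splitter implementation:

```python
def chunk_by_size(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows, each sharing `chunk_overlap`
    characters with its predecessor (illustrative sketch)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step) if text[i:i + chunk_size]]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.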
The EmbeddingAssembler orchestrates embedding generation and storage persistence. It supports both synchronous and asynchronous document loading.
The vector store architecture uses a two-level abstraction: IndexStoreConfig for configuration and IndexStoreBase for operations.
Core Classes
| Class | Purpose | Key Methods |
|---|---|---|
| IndexStoreConfig | Configuration container | create_store() |
| VectorStoreConfig | Vector-specific config | Extends IndexStoreConfig |
| IndexStoreBase | Storage operations interface | load_document(), similar_search_with_scores() |
| VectorStoreBase | Vector-specific operations | Extends IndexStoreBase |
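The two-level abstraction can be sketched as a config class that knows how to build its matching store. Method names follow the table above; everything else (the dummy classes, return types) is an assumption:

```python
from abc import ABC, abstractmethod

class IndexStoreBase(ABC):
    """Operations interface: persist chunks, run similarity search."""
    @abstractmethod
    def load_document(self, chunks: list) -> list[str]: ...
    @abstractmethod
    def similar_search_with_scores(self, query: str, top_k: int, score_threshold: float) -> list: ...

class IndexStoreConfig(ABC):
    """Configuration container that acts as a factory for its store."""
    @abstractmethod
    def create_store(self) -> IndexStoreBase: ...

# A concrete backend plugs in by extending the pair:
class DummyVectorStore(IndexStoreBase):
    def __init__(self):
        self._docs = []
    def load_document(self, chunks):
        start = len(self._docs)
        self._docs.extend(chunks)
        return [str(i) for i in range(start, len(self._docs))]
    def similar_search_with_scores(self, query, top_k, score_threshold):
        return self._docs[:top_k]

class DummyVectorStoreConfig(IndexStoreConfig):
    def create_store(self):
        return DummyVectorStore()
```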
MilvusStore uses the Milvus vector database with HNSW indexing and supports full-text search in versions >= 2.5.0.
Key Configuration:
- Fields: pk_id, vector, content, metadata
- Index type: HNSW (default)
- Similarity metric: COSINE

Full-Text Search Support:
- sparse_vector field with BM25 function
- is_support_full_text_search() checks version compatibility
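A version gate like is_support_full_text_search() reduces to a semantic-version comparison against 2.5.0. A minimal sketch (the real check reads the version from the Milvus server):

```python
def supports_full_text_search(server_version: str) -> bool:
    """Return True when the Milvus version is at least 2.5.0 (illustrative)."""
    parts = tuple(int(p) for p in server_version.lstrip("v").split(".")[:3])
    return parts >= (2, 5, 0)
```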
ChromaStore provides local persistence with cosine similarity and collection-based organization.
Key Features:
Collection Name Validation:
- Names must match the pattern ^[a-zA-Z0-9_][-a-zA-Z0-9_.]*$
- Names must not contain consecutive dots (..)
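The validation rules combine the regex with the consecutive-dot restriction. A minimal sketch (the real validator may enforce additional limits, such as name length):

```python
import re

# Pattern from the rules above: leading alphanumeric/underscore, then
# alphanumerics, underscores, hyphens, or dots.
_NAME_RE = re.compile(r"^[a-zA-Z0-9_][-a-zA-Z0-9_.]*$")

def is_valid_collection_name(name: str) -> bool:
    """Accept a name only if it matches the pattern and has no '..' run."""
    return bool(_NAME_RE.match(name)) and ".." not in name
```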
ElasticStore integrates with Elasticsearch for full-text search and vector similarity.
Configuration:
- Text field: context
- Vector field: dense_vector
- Similarity metric: COSINE
| Store | Type | Key Features |
|---|---|---|
| PGVectorStore | PostgreSQL extension | Langchain integration, SQL-based storage |
| WeaviateStore | Weaviate cloud/local | GraphQL queries, schema-based |
| OceanBaseStore | OceanBase database | HNSW index, JSON metadata, L2/cosine/inner product |
The BuiltinKnowledgeGraph extracts entities and relationships from documents using LLM-based triplet extraction.
Triplet Extraction:
The TripletExtractor uses LLM prompts to identify entities and relationships:
Input: "Alice works at OpenAI and lives in San Francisco."
Output: [(Alice, works_at, OpenAI), (Alice, lives_in, San Francisco)]
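Parsing the LLM's output into structured triplets is a small text-processing step. An illustrative parser for the "(head, relation, tail)" format shown above (DB-GPT's actual prompt and parsing may differ):

```python
import re

def parse_triplets(llm_output: str) -> list[tuple[str, str, str]]:
    """Extract (head, relation, tail) tuples from LLM output (sketch)."""
    triplets = []
    for head, rel, tail in re.findall(r"\(([^,()]+),([^,()]+),([^,()]+)\)", llm_output):
        triplets.append((head.strip(), rel.strip(), tail.strip()))
    return triplets
```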
Graph Storage Backends:
- The GraphStoreFactory creates the configured graph store backend
The CommunitySummaryKnowledgeGraph extends BuiltinKnowledgeGraph with hierarchical community detection and summarization.
Configuration Parameters:
| Parameter | Default | Description |
|---|---|---|
| kg_extract_top_k | 5 | Top K for extraction search |
| kg_extract_score_threshold | 0.3 | Score threshold for extraction |
| kg_community_top_k | 50 | Top K communities |
| kg_community_score_threshold | 0.3 | Community score threshold |
| kg_triplet_graph_enabled | True | Enable triplet graph search |
| kg_document_graph_enabled | True | Enable document graph search |
| kg_extraction_batch_size | 20 | Batch size for extraction |
| kg_community_summary_batch_size | 20 | Batch size for summaries |
The StorageManager provides a unified factory interface for creating storage connectors based on configuration.
Storage Type Resolution:
Configuration Loading:
The storage manager reads configuration from app_config.rag.storage:
- storage.vector: Vector store config
- storage.graph: Knowledge graph config
- storage.full_text: Full-text search config
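Factory-style type resolution can be sketched as a registry lookup keyed by storage type. The registry contents and error message here are illustrative; DB-GPT resolves connectors from its own config classes:

```python
# Illustrative registry mapping storage-type names to connector factories.
STORE_REGISTRY = {
    "chroma": lambda cfg: f"ChromaStore({cfg['persist_path']})",
    "milvus": lambda cfg: f"MilvusStore({cfg['uri']})",
    "elasticsearch": lambda cfg: f"ElasticStore({cfg['uri']})",
}

def create_store(storage_type: str, cfg: dict):
    """Resolve a connector factory by type name, case-insensitively."""
    factory = STORE_REGISTRY.get(storage_type.lower())
    if factory is None:
        raise ValueError(f"Unsupported storage type: {storage_type}")
    return factory(cfg)
```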
The VectorStoreConnector wraps storage implementations with connection pooling and batch operations.
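The pooling behavior can be sketched as a cache of connectors keyed by store type and collection, so repeated lookups reuse one connection. This is a simplified stand-in for the real connector class:

```python
class VectorStoreConnector:
    """Sketch: connectors cached in a pools dict keyed by
    (vector_store_type, collection_name)."""
    _pools: dict[tuple[str, str], "VectorStoreConnector"] = {}

    def __init__(self, store_type: str, collection: str):
        self.store_type, self.collection = store_type, collection

    @classmethod
    def from_pool(cls, store_type: str, collection: str) -> "VectorStoreConnector":
        key = (store_type, collection)
        if key not in cls._pools:
            cls._pools[key] = cls(store_type, collection)
        return cls._pools[key]
```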
Connection Pooling:
- Connectors are cached in a pools dictionary keyed by (vector_store_type, collection_name)

Batch Loading:
DB-GPT organizes knowledge using a three-level hierarchy: Space → Document → Chunk.
KnowledgeSpace Context:
The context field stores JSON configuration:
The sync_document() method orchestrates the complete ingestion pipeline.
Sync Status Enum:
| Status | Description |
|---|---|
| TODO | Pending synchronization |
| RUNNING | Currently processing |
| FINISHED | Successfully synced |
| FAILED | Sync error occurred |
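The statuses from the table form a simple state machine: TODO moves to RUNNING, which ends in FINISHED or FAILED. The transition helper below is illustrative, not DB-GPT's actual logic:

```python
from enum import Enum

class SyncStatus(Enum):
    TODO = "TODO"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    FAILED = "FAILED"

def next_status(current: SyncStatus, ok: bool = True) -> SyncStatus:
    """Advance one step through the sync lifecycle (sketch)."""
    if current is SyncStatus.TODO:
        return SyncStatus.RUNNING
    if current is SyncStatus.RUNNING:
        return SyncStatus.FINISHED if ok else SyncStatus.FAILED
    return current  # FINISHED and FAILED are terminal
```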
Error Handling:
The EmbeddingAssembler creates retrievers that query the index store.
Retrieval Configuration:
The TimeWeightedEmbeddingRetriever combines semantic similarity with temporal decay.
Time Decay Formula:
combined_score = relevance_score * exp(-decay_rate * hours_passed)
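Written out as code, the formula downweights older documents exponentially: at hours_passed = 0 the semantic score passes through unchanged, and it decays toward zero as time passes.

```python
import math

def time_weighted_score(relevance_score: float, decay_rate: float, hours_passed: float) -> float:
    """The decay formula above: relevance * exp(-decay_rate * hours_passed)."""
    return relevance_score * math.exp(-decay_rate * hours_passed)
```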
External Storage Protocol:
The KnowledgeSpaceRetriever provides space-level retrieval with reranking support.
Automatic Top-K Adjustment:
Full-text search provides keyword-based retrieval alongside semantic search.
Configuration Example (dbgpt-bm25-rag.toml):
ElasticDocumentStore:
Milvus Full-Text Search (v2.5.0+):
Storage configuration is defined in .toml files under [rag.storage].
Vector Store Configuration:
Knowledge Graph Configuration:
RAG Pipeline Parameters:
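An illustrative shape for such a .toml file, covering the three storage sections named earlier. The section names follow [rag.storage] from the text, but the individual keys (type, persist_path, uri) are assumptions; consult the shipped example configs for exact names:

```toml
# Illustrative sketch only — key names are assumptions, not verified config.
[rag.storage.vector]
type = "chroma"
persist_path = "pilot/data/kb"

[rag.storage.graph]
type = "tugraph"

[rag.storage.full_text]
type = "elasticsearch"
uri = "localhost:9200"
```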
Metadata filters enable fine-grained document filtering during retrieval.
Filter Operations:
| Operator | Symbol | Description |
|---|---|---|
| EQ | == | Equal |
| NE | != | Not equal |
| GT | > | Greater than |
| LT | < | Less than |
| GTE | >= | Greater than or equal |
| LTE | <= | Less than or equal |
| IN | in | In list |
| NIN | not in | Not in list |
Example Usage:
Store-Specific Filter Conversion:
- Expression-based stores render filters as strings such as field == 'value'
- Operator-based stores map conditions to $eq, $gt, etc. operators
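The two conversion styles can be sketched side by side: one renders an expression string, the other builds a Chroma-style operator dict. The operator tables below mirror the filter table above; the function names are illustrative:

```python
# Symbol tables mirroring the filter-operator table (comparison ops only).
OPS = {"EQ": "==", "NE": "!=", "GT": ">", "LT": "<", "GTE": ">=", "LTE": "<="}
CHROMA_OPS = {"EQ": "$eq", "NE": "$ne", "GT": "$gt", "LT": "$lt", "GTE": "$gte", "LTE": "$lte"}

def to_expression(field: str, op: str, value) -> str:
    """Render a filter as an expression string, quoting string values."""
    rendered = f"'{value}'" if isinstance(value, str) else str(value)
    return f"{field} {OPS[op]} {rendered}"

def to_operator_dict(field: str, op: str, value) -> dict:
    """Render a filter as a Chroma-style {field: {$op: value}} dict."""
    return {field: {CHROMA_OPS[op]: value}}
```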