This document describes RAGFlow's multi-tier storage architecture, which separates structured metadata, unstructured document content with vector embeddings, raw file storage, and caching into distinct layers. The architecture is designed for scalability and supports multiple backend implementations for each storage tier.
For information about how documents are processed before storage, see Document Processing Pipeline. For retrieval and search operations, see Retrieval and Search System.
RAGFlow implements a four-tier storage architecture where each tier serves a specific purpose and can be independently scaled or swapped with alternative implementations:
Storage Tier Responsibilities:
| Tier | Purpose | Data Types | Example Backends |
|---|---|---|---|
| Relational DB | Structured metadata, relationships, configuration | User accounts, dataset configs, task status, conversation history | MySQL, OceanBase |
| Document Store | Full-text search, vector similarity, chunk storage | Document chunks, embeddings, tokenized text, positions | Elasticsearch, Infinity, OpenSearch |
| Object Storage | Raw file storage, binary data | Original documents (PDF, DOCX), chunk images, thumbnails | MinIO, S3, OSS, Azure Blob |
| Cache | Session state, temporary data, task queue | Task queue messages, file cache (12min TTL), OTP tokens | Redis, Valkey |
Sources: rag/svr/task_executor.py871-898 api/db/services/dialog_service.py447-451 common/settings.py docker-compose.yml
RAGFlow uses Peewee ORM with support for MySQL and OceanBase (MySQL-compatible). The schema is defined in api/db/db_models.py with automatic connection retry logic and connection pooling.
Primary Tables:
Sources: api/db/db_models.py420-550 api/db/db_models.py135-236
Document Table - Central table tracking all uploaded files:
- `run` field (UNSTART=0, RUNNING=1, CANCEL=2, DONE=3, FAIL=4)
- `chunk_num` (indexed chunks), `token_num` (token consumption)
- `kb_id` foreign key

Task Table - Manages asynchronous document processing:

- `progress` (0-1 float), `progress_msg` (status string)
- `retry_count` increments on failure (max 3 attempts)
- `doc_id`, used by task executor workers

Knowledgebase Table - Dataset configuration:

- `parser_config` JSON field
- `doc_num`, `chunk_num`, `token_num`
- `embd_id`

Connection Management:
api/db/db_models.py243-280 implements RetryingPooledMySQLDatabase, which automatically retries queries on connection loss:
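The behavior can be sketched as a retry wrapper. This is a simplified stand-in, not the actual implementation: the real class subclasses Peewee's pooled database, and the retry count and delay below are assumptions.

```python
import functools
import time

def retry_on_connection_loss(max_retries=3, base_delay=1.0):
    """Re-run a database call after a lost connection.

    Simplified stand-in for RetryingPooledMySQLDatabase; the real class
    subclasses Peewee's PooledMySQLDatabase and reconnects before retrying.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except ConnectionError:
                    if attempt == max_retries:
                        raise  # give up after the last attempt
                    time.sleep(base_delay * (attempt + 1))
        return wrapper
    return decorator

# Demo: a query that fails twice before succeeding.
calls = {"n": 0}

@retry_on_connection_loss(max_retries=3, base_delay=0.0)
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("MySQL server has gone away")
    return "ok"

assert flaky_query() == "ok"  # succeeds on the third attempt
```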
Sources: api/db/db_models.py243-280 api/db/services/document_service.py46-77 api/db/services/knowledgebase_service.py48-130
The DocStoreConnection abstract base class in common/doc_store/doc_store_base.py defines a unified interface for all document store implementations. This enables runtime switching between Elasticsearch, Infinity, and OpenSearch.
Core Interface Methods:
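As a sketch of the interface shape (method names are paraphrased from the operations described on this page, not the exact signatures in common/doc_store/doc_store_base.py):

```python
from abc import ABC, abstractmethod

class DocStoreConnection(ABC):
    """Illustrative sketch of the unified document-store interface."""

    @abstractmethod
    def createIdx(self, index_name, kb_id, vector_size):
        """Create an index sized for the embedding dimension."""

    @abstractmethod
    def insert(self, rows, index_name, kb_id):
        """Bulk-insert chunk documents."""

    @abstractmethod
    def search(self, select_fields, condition, match_exprs,
               order_by, offset, limit, index_names, kb_ids):
        """Run a query built from match/fusion/order expressions."""

    @abstractmethod
    def get(self, chunk_id, index_name, kb_ids):
        """Fetch a single chunk by ID."""

    @abstractmethod
    def delete(self, condition, index_name, kb_id):
        """Delete chunks matching a condition."""
```

Because every method is abstract, backends such as ESConnection and InfinityConnection must implement the full set before they can be instantiated.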
Query Expression Types:
| Expression | Purpose | Example |
|---|---|---|
| `MatchTextExpr` | Full-text search with field weights | `fields=["content_ltks^1.0", "title_tks^0.5"]` |
| `MatchDenseExpr` | Vector similarity search | `vector_column="q_768_vec", topk=10` |
| `FusionExpr` | Hybrid search combining text + vector | `method="weighted_sum", weights="0.05,0.95"` |
| `OrderByExpr` | Result ordering | `asc("page_num_int"), desc("create_timestamp_flt")` |
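A hybrid query is assembled by combining these expressions. The dataclasses below are illustrative stand-ins (the real classes and their fields live in common/doc_store and may differ); they show how the examples in the table fit together:

```python
from dataclasses import dataclass

@dataclass
class MatchTextExpr:          # full-text leg
    fields: list
    matching_text: str
    topn: int = 10

@dataclass
class MatchDenseExpr:         # vector leg
    vector_column: str
    embedding: list
    topk: int = 10

@dataclass
class FusionExpr:             # combines the two legs
    method: str
    params: dict

text_q = MatchTextExpr(["content_ltks^1.0", "title_tks^0.5"], "storage tiers")
dense_q = MatchDenseExpr("q_768_vec", [0.0] * 768, topk=10)
fusion = FusionExpr("weighted_sum", {"weights": "0.05,0.95"})

match_exprs = [text_q, dense_q, fusion]  # passed to the search() call
```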
Sources: common/doc_store/doc_store_base.py1-150 rag/nlp/search.py36-172
The ESConnection class rag/utils/es_conn.py34-400 implements the document store interface using Elasticsearch 8.x.
Index Structure:
Each index follows the naming pattern ragflow_{tenant_id} and contains:
- Vector fields (e.g. `q_768_vec` for 768-dimensional embeddings)
- Tokenized text fields: `*_ltks` (coarse), `*_sm_ltks` (fine-grained)
- Keyword fields with `*_kwd` suffixes

Mapping Configuration (dynamic based on vector_size and parser_id):
Search Implementation rag/utils/es_conn.py39-200:
Bulk Operations: The insert() method uses Elasticsearch bulk API for efficient batch insertion of chunks (typically 64 chunks per batch).
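The batching itself is straightforward. The helpers below are illustrative (names and batch handling are assumptions), but the action/source line interleaving is what the Elasticsearch bulk API expects:

```python
def to_bulk_operations(index_name, chunks):
    """Interleave action and source lines for the ES bulk API."""
    ops = []
    for chunk in chunks:
        ops.append({"index": {"_index": index_name, "_id": chunk["id"]}})
        ops.append({k: v for k, v in chunk.items() if k != "id"})
    return ops

def batched(chunks, batch_size=64):
    """Yield fixed-size batches (64 matches the typical chunk batch)."""
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]

chunks = [{"id": f"c{i}", "content_ltks": "..."} for i in range(130)]
batches = list(batched(chunks))
# 130 chunks -> batches of 64, 64, and 2
```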
Sources: rag/utils/es_conn.py34-400 common/doc_store/es_conn_base.py1-350
The InfinityConnection class rag/utils/infinity_conn.py30-700 implements the interface using Infinity Database, optimized for SQL-like operations with vector support.
Key Differences from Elasticsearch:
| Aspect | Infinity | Elasticsearch |
|---|---|---|
| Field naming | No suffix transformations (uses base field names) | Suffix-based (_ltks, _kwd, _vec) |
| Full-text search | Uses MATCH with specific analyzers like ft_content_rag_coarse | Uses query_string with Lucene syntax |
| Vector search | Native KNN operator with float arrays | script_score or knn with dense_vector |
| Fusion | FUSION('weighted_sum', ...) SQL function | Script-based score combination |
| Result format | Returns (DataFrame, total_count) tuple | Returns DataFrame with hits wrapper |
Field Mapping Strategy rag/utils/infinity_conn.py36-86:
Index Schema (defined in conf/infinity_mapping.json):
Search Query Example rag/utils/infinity_conn.py92-250:
Connection Pooling: Uses InfinityConnectionPool with Thrift protocol (port 23817) or HTTP (port 23820).
Sources: rag/utils/infinity_conn.py30-700 common/doc_store/infinity_conn_base.py1-800 conf/infinity_mapping.json1-30
The OpenSearchConnection class rag/utils/opensearch_conn.py1-800 provides OpenSearch compatibility as an alternative to Elasticsearch.
Implementation Notes:
- Uses the `opensearch-py` client instead of `elasticsearch-py`

Sources: rag/utils/opensearch_conn.py1-800
The document store backend is selected at runtime via the DOC_ENGINE environment variable:
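A sketch of the selection logic (the dispatch table and the `elasticsearch` default are assumptions based on this page; the actual lookup lives in common/settings.py):

```python
import os

# Maps DOC_ENGINE values to the connection classes described above.
_BACKENDS = {
    "elasticsearch": "rag.utils.es_conn.ESConnection",
    "infinity": "rag.utils.infinity_conn.InfinityConnection",
    "opensearch": "rag.utils.opensearch_conn.OpenSearchConnection",
}

def resolve_doc_engine(env=os.environ):
    """Return the dotted path of the configured document-store class."""
    engine = env.get("DOC_ENGINE", "elasticsearch").lower()
    if engine not in _BACKENDS:
        raise ValueError(f"Unsupported DOC_ENGINE: {engine}")
    return _BACKENDS[engine]
```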
Sources: common/settings.py README.md
RAGFlow abstracts object storage through the STORAGE_IMPL singleton, which provides a common interface for MinIO, S3, OSS, and Azure Blob Storage.
Common Operations:
| Method | Purpose | Returns |
|---|---|---|
| `put(bucket, name, binary)` | Upload file | Success boolean |
| `get(bucket, name)` | Download file | Binary data |
| `rm(bucket, name)` | Delete file | Success boolean |
| `obj_exist(bucket, name)` | Check existence | Boolean |
| `get_presigned_url(bucket, name, expires)` | Generate temporary URL | URL string |
Sources: rag/svr/task_executor.py240-242 api/db/services/file_service.py1-500
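An in-memory stand-in makes the interface shape concrete; every real backend (MinIO, S3, OSS, Azure Blob) implements these same methods against its own SDK:

```python
class InMemoryStorage:
    """Dict-backed illustration of the common object-storage interface."""

    def __init__(self):
        self._objects = {}

    def put(self, bucket, name, binary):
        self._objects[(bucket, name)] = binary
        return True

    def get(self, bucket, name):
        return self._objects[(bucket, name)]

    def rm(self, bucket, name):
        return self._objects.pop((bucket, name), None) is not None

    def obj_exist(self, bucket, name):
        return (bucket, name) in self._objects

storage = InMemoryStorage()
storage.put("kb_123", "report.pdf", b"%PDF-1.7 ...")
assert storage.obj_exist("kb_123", "report.pdf")
```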
Files are stored with a two-level addressing scheme:
- Bucket: `kb_id` (knowledge base ID)
- Object name: `doc.location` field from the Document table
- Chunk images stored by `chunk_id` (SHA256 hash)
- Thumbnails stored by `thumbnail_id`

Address Resolution api/db/services/file2document_service.py60-85:
Sources: api/db/services/file2document_service.py60-85 rag/svr/task_executor.py254-256
The MinIOConnection class rag/utils/minio_conn.py1-300 is the default object storage backend.
Configuration:
Key Features:
- `put()` creates the bucket if it does not exist rag/utils/minio_conn.py80-120
- Automatic retry on `S3Error` (5 attempts, 1-second base delay)

Usage Example rag/svr/task_executor.py240-270:
Sources: rag/utils/minio_conn.py1-300 rag/svr/task_executor.py240-270
The S3Connection class rag/utils/s3_conn.py1-250 supports AWS S3 and S3-compatible services.
Configuration:
Implementation Details:
- Uses the `boto3` library with connection pooling
- Retry policy: `botocore.config.Config(retries={'max_attempts': 5})`

Sources: rag/utils/s3_conn.py1-250
The OSSConnection class rag/utils/oss_conn.py1-200 integrates with Alibaba Cloud Object Storage Service.
Configuration:
Features:
- Uses the `oss2` library (Alibaba Cloud SDK)
- Handles `RequestError` and `ServerError`

Sources: rag/utils/oss_conn.py1-200
RAGFlow provides two Azure authentication methods:
SAS Token Authentication rag/utils/azure_sas_conn.py1-120:
Service Principal Authentication rag/utils/azure_spn_conn.py1-150:
Both implementations use the azure-storage-blob SDK and provide identical interface methods.
Sources: rag/utils/azure_sas_conn.py1-120 rag/utils/azure_spn_conn.py1-150
The storage backend is selected at runtime via environment variable:
Sources: common/settings.py
Redis (or its fork, Valkey) serves three primary functions: the task queue, the document file cache, and task-cancellation flags.
Sources: rag/utils/redis_conn.py1-400 api/db/services/task_service.py1-500
The task queue uses Redis Streams for reliable, distributed task processing rag/utils/redis_conn.py100-300:
Queue Structure:
Key Operations rag/svr/task_executor.py176-237:
Retry Logic: Unacknowledged messages are automatically redelivered to other consumers after a timeout (configurable, default 120s).
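The produce/claim/acknowledge cycle can be modeled in a few lines. This is an in-memory illustration of the Redis Streams pattern (XADD / XREADGROUP / XACK); it deliberately omits the real system's timeout-based redelivery via pending-entry lists:

```python
import itertools

class MiniStream:
    """Toy model of a Redis Stream with one consumer group."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.entries = {}   # msg_id -> payload (the stream)
        self.pending = set()  # delivered but not yet acknowledged

    def xadd(self, payload):
        msg_id = next(self._ids)
        self.entries[msg_id] = payload
        return msg_id

    def xreadgroup(self):
        # Deliver the oldest entry not yet claimed by a consumer.
        for msg_id in sorted(self.entries):
            if msg_id not in self.pending:
                self.pending.add(msg_id)
                return msg_id, self.entries[msg_id]
        return None  # nothing new to deliver

    def xack(self, msg_id):
        # Acknowledgement removes the message for good; in real Redis,
        # unacked messages are redelivered after the timeout instead.
        self.pending.discard(msg_id)
        self.entries.pop(msg_id, None)

q = MiniStream()
q.xadd({"task_id": "t1", "doc_id": "d1"})
msg_id, task = q.xreadgroup()  # a worker claims the task
q.xack(msg_id)                 # acknowledged after processing
```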
Sources: rag/utils/redis_conn.py100-300 rag/svr/task_executor.py176-237
The file cache system rag/svr/cache_file_svr.py1-100 stores frequently accessed document binaries in Redis to reduce object storage latency:
Configuration:
Cache Operations:
Use Case: During document processing, chunks from the same document are processed sequentially. Caching the original document for 12 minutes avoids repeated object storage fetches for multi-page documents.
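The read-through pattern looks roughly like this. The class and key format are illustrative; a plain dict stands in for Redis, so the 12-minute TTL is noted in a comment rather than enforced:

```python
CACHE_TTL_SECONDS = 12 * 60  # the 12-minute TTL described above

class FileCache:
    """Read-through cache in front of object storage (sketch)."""

    def __init__(self, storage):
        self.storage = storage  # any object-storage backend
        self._cache = {}        # stand-in for Redis; no real expiry

    def get_file(self, bucket, name):
        key = f"filecache:{bucket}/{name}"
        if key in self._cache:
            return self._cache[key]      # cache hit: skip object storage
        binary = self.storage.get(bucket, name)
        self._cache[key] = binary        # real system: SETEX with TTL
        return binary

class CountingStorage:
    """Test double that counts object-storage fetches."""

    def __init__(self, objects):
        self.objects = objects
        self.fetches = 0

    def get(self, bucket, name):
        self.fetches += 1
        return self.objects[(bucket, name)]

store = CountingStorage({("kb1", "doc.pdf"): b"pdf-bytes"})
cache = FileCache(store)
cache.get_file("kb1", "doc.pdf")
cache.get_file("kb1", "doc.pdf")
assert store.fetches == 1  # second read served from cache
```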
Sources: rag/svr/cache_file_svr.py1-100
Task cancellation uses Redis sets for efficient broadcast api/db/services/task_service.py200-250:
Task executor workers poll this flag periodically during long-running operations rag/svr/task_executor.py350-470.
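The set-based broadcast reduces to two operations (SADD to cancel, SISMEMBER to poll). The sketch below uses a Python set in place of Redis, with an illustrative polling loop:

```python
cancelled = set()  # stands in for a Redis set of cancelled task IDs

def cancel_task(task_id):
    cancelled.add(task_id)       # SADD <cancel-set> task_id

def is_cancelled(task_id):
    return task_id in cancelled  # SISMEMBER <cancel-set> task_id

def process_pages(task_id, pages):
    """Long-running work that polls the flag between units of work."""
    done = []
    for page in pages:
        if is_cancelled(task_id):
            break                # stop promptly once cancellation is seen
        done.append(page)
    return done

cancel_task("t42")
assert process_pages("t42", [1, 2, 3]) == []
assert process_pages("t7", [1, 2]) == [1, 2]
```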
Sources: api/db/services/task_service.py200-250 rag/svr/task_executor.py350-470
The complete flow from upload to searchable chunks involves coordination across all storage tiers:
Key Coordination Points:
- `Document.run` field reflects current processing state (updated by task executor)
- `Task.progress` (0-1) and `progress_msg` fields are updated in real time rag/svr/task_executor.py142-173
- Chunks are written to the document store with `chunk_id` as key rag/svr/task_executor.py301-325
- Metadata is synchronized via `DocMetadataService` api/db/services/doc_metadata_service.py1-200

Sources: api/db/services/file_service.py200-400 rag/svr/task_executor.py142-900
Search and retrieval operations coordinate between relational database (metadata) and document store (content):
Hybrid Search Process rag/nlp/search.py74-172:
- Full-text query built as a `MatchTextExpr` with field weights
- Query embedding wrapped in a `MatchDenseExpr`
- Scores combined via a `FusionExpr` with weighted sum (default 0.05 text, 0.95 vector)

Sources: api/db/services/dialog_service.py411-734 rag/nlp/search.py74-350
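The default weighting can be checked with a one-line worked example (the function name is illustrative; the actual combination happens inside the fusion handling):

```python
def fuse(text_score, vector_score, w_text=0.05, w_vec=0.95):
    """Weighted-sum fusion with the default 0.05/0.95 split."""
    return w_text * text_score + w_vec * vector_score

# A chunk scoring 0.8 on full text and 0.6 on vector similarity:
assert abs(fuse(0.8, 0.6) - 0.61) < 1e-9  # 0.05*0.8 + 0.95*0.6
```

With weights summing to 1, a fused score stays on the same 0-1 scale as its inputs, and the heavy vector weight means lexical matches mostly act as a tie-breaker.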
Document metadata is stored in both MySQL and document store, requiring synchronization:
Metadata Table (MySQL):
Metadata in Document Store:
Synchronization Points api/db/services/doc_metadata_service.py50-150:
Sources: api/db/services/doc_metadata_service.py1-200 rag/svr/task_executor.py407-448
Storage backend selection follows this priority:
Example Configuration:
Sources: common/settings.py service_conf.yaml.template
| Aspect | MinIO | S3 | OSS | Azure Blob | Elasticsearch | Infinity | OpenSearch |
|---|---|---|---|---|---|---|---|
| Setup Complexity | Low | Medium | Medium | Medium | Low | Low | Low |
| Cost | Free (self-hosted) | Pay-per-use | Pay-per-use | Pay-per-use | Free (self-hosted) | Free | Free |
| Latency | Lowest (local) | Medium | Medium | Medium | N/A | N/A | N/A |
| Scalability | Manual | Automatic | Automatic | Automatic | Cluster | Cluster | Cluster |
| Vector Search | N/A | N/A | N/A | N/A | Via script_score | Native KNN | Neural search |
| Full-Text | N/A | N/A | N/A | N/A | Lucene | Custom analyzers | Lucene |
Recommendations:
Sources: README.md docker-compose.yml
Object Storage:
Document Store:
- Tune `number_of_shards` based on cluster size (default: 1)
- `INFINITY_CONNECTION_POOL_SIZE` (default: 20)
- `EMBEDDING_BATCH_SIZE` (default: 64)

Redis:
- Set `maxmemory-policy` to `allkeys-lru` for cache eviction