The Vector Database System provides an abstraction layer for storing and retrieving document embeddings (vectors) that enable semantic search in AnythingLLM's RAG (Retrieval-Augmented Generation) implementation. This system supports 10+ vector database providers through a common interface, manages workspace-isolated namespaces, implements vector caching for performance optimization, and handles document-to-vector relationships.
For information about how vectors are used during chat execution, see Chat System Architecture. For details on embedding engines that generate the vectors, see Provider Architecture. For text splitting and chunking strategies, see Text Splitting and Chunking.
The vector database system follows a provider abstraction pattern where all database implementations extend the VectorDatabase base class and implement a common interface. Each workspace has an isolated namespace in the vector database, and the system maintains bidirectional mappings between documents and their embedded vectors.
All vector database providers extend the VectorDatabase base class and implement a standardized interface. This abstraction allows the application to switch between providers without changing business logic.
| Method | Purpose | Parameters |
|---|---|---|
| connect() | Establish connection to vector database | None |
| heartbeat() | Check database availability | None |
| addDocumentToNamespace() | Embed and store document chunks | namespace, documentData, fullFilePath, skipCache |
| deleteDocumentFromNamespace() | Remove document vectors | namespace, docId |
| performSimilaritySearch() | Query vectors and retrieve context | namespace, input, LLMConnector, similarityThreshold, topN, filterIdentifiers |
| hasNamespace() | Check if namespace exists | namespace |
| namespaceCount() | Count vectors in namespace | namespace |
| deleteVectorsInNamespace() | Delete entire namespace | client, namespace |
| curateSources() | Format search results | sources |
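The interface above can be sketched as a base class whose methods throw until a provider overrides them. The class and method bodies below are illustrative, not the actual server/utils implementation:

```javascript
// Hypothetical sketch of the provider abstraction: every method a provider
// must supply throws by default, so an incomplete provider fails loudly.
class VectorDatabase {
  async connect() { throw new Error("Not implemented by provider"); }
  async heartbeat() { throw new Error("Not implemented by provider"); }
  async hasNamespace(namespace) { throw new Error("Not implemented by provider"); }
  async namespaceCount(namespace) { throw new Error("Not implemented by provider"); }
}

// A toy in-memory provider implementing the same surface, useful to show
// how business logic stays provider-agnostic.
class MemoryVectorDb extends VectorDatabase {
  constructor() { super(); this.namespaces = new Map(); }
  async connect() { return { client: this }; }
  async heartbeat() { return { heartbeat: Date.now() }; }
  async hasNamespace(namespace) { return this.namespaces.has(namespace); }
  async namespaceCount(namespace) {
    return (this.namespaces.get(namespace) || []).length;
  }
}
```

Because callers only depend on the base-class surface, swapping `MemoryVectorDb` for any other provider requires no changes to business logic.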
The addDocumentToNamespace method implements a multi-stage pipeline that transforms document text into searchable vectors. The pipeline includes caching, text splitting, embedding generation, and batch insertion.
| Parameter | Source | Default | Purpose |
|---|---|---|---|
| chunkSize | SystemSettings.text_splitter_chunk_size | 1000 | Maximum characters per chunk |
| chunkOverlap | SystemSettings.text_splitter_chunk_overlap | 20 | Character overlap between chunks |
| chunkHeaderMeta | TextSplitter.buildHeaderMeta() | null | XML metadata header |
| chunkPrefix | EmbedderEngine.embeddingPrefix | "" | Model-specific prefix |
| Batch Size | Provider-specific | 100-500 | Vectors per batch insert |
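The four pipeline stages can be sketched as follows. The helper objects (cache, splitter, embedder, db) and their method names are illustrative stand-ins, not the real AnythingLLM API:

```javascript
// Hypothetical sketch of the addDocumentToNamespace pipeline:
// cache check -> text splitting -> embedding -> batch insertion.
async function addDocumentToNamespace(deps, namespace, documentData, fullFilePath) {
  const { cache, splitter, embedder, db } = deps;

  // 1. Cache check: reuse previously computed vectors when available.
  const cached = await cache.get(fullFilePath);
  if (cached) return db.insertBatches(namespace, cached);

  // 2. Text splitting: break the document into overlapping chunks.
  const chunks = await splitter.split(documentData.pageContent);

  // 3. Embedding: convert each chunk into a vector.
  const vectors = await embedder.embedChunks(chunks);

  // 4. Cache the result, then write vectors in provider-sized batches.
  const records = chunks.map((chunk, i) => ({ text: chunk, values: vectors[i] }));
  await cache.set(fullFilePath, records);
  return db.insertBatches(namespace, records);
}
```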
The performSimilaritySearch method executes semantic search queries against the vector database. It embeds the input query, searches for similar vectors, filters results by similarity threshold, and excludes pinned documents.
The system uses cosine similarity, but providers report raw distances under different metrics, so each provider's distanceToSimilarity method normalizes distances to a 0-1 similarity scale:

LanceDB Distance Conversion (server/utils/vectorDbProviders/lance/index.js:39-44):

if distance >= 1.0: return 1
if distance < 0: return 1 - abs(distance)
else: return 1 - distance

Chroma Distance Conversion (server/utils/vectorDbProviders/chroma/index.js:112-117) uses identical logic:

if distance >= 1.0: return 1
if distance < 0: return 1 - abs(distance)
else: return 1 - distance
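The pseudocode above translates directly into a small pure function (the null guard is an added safety check, not taken from the source):

```javascript
// Normalize a raw distance into a 0-1 similarity score, per the rules above.
function distanceToSimilarity(distance = null) {
  if (distance === null || typeof distance !== "number") return 0.0; // added guard
  if (distance >= 1.0) return 1;
  if (distance < 0) return 1 - Math.abs(distance);
  return 1 - distance;
}
```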
LanceDB supports reranking to improve result quality. The reranking process:

- Uses NativeEmbeddingReranker to reorder results by semantic relevance
- Returns the topN results after reranking

Reranking Configuration (server/utils/vectorDbProviders/lance/index.js:119-122):
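The reorder-then-truncate step can be illustrated with a generic sketch; the scoring function here is a stand-in for NativeEmbeddingReranker, whose real API is not shown in this page:

```javascript
// Illustrative reranking step: score each candidate, sort by score
// descending, and keep only the top-N results.
function rerank(results, scoreFn, topN = 4) {
  return results
    .map((r) => ({ ...r, rerank_score: scoreFn(r) }))
    .sort((a, b) => b.rerank_score - a.rerank_score)
    .slice(0, topN);
}
```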
Namespaces provide workspace-level isolation in the vector database. Each workspace has exactly one namespace, and all documents within that workspace are stored in that namespace. Namespace names are often normalized to meet provider-specific requirements.
| Operation | Method | Purpose |
|---|---|---|
| Check Existence | hasNamespace(namespace) | Returns true if the namespace exists |
| Get Vector Count | namespaceCount(namespace) | Returns number of vectors in namespace |
| Get Statistics | namespace(client, namespace) | Returns namespace metadata and count |
| Delete Namespace | deleteVectorsInNamespace(client, namespace) | Removes entire namespace and all vectors |
| Total Vectors | totalVectors() | Returns sum of vectors across all namespaces |
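As a sketch, totalVectors() can be expressed as a sum of namespaceCount() over all known namespaces; the real implementation is provider-specific and may query the database directly:

```javascript
// Hypothetical aggregation: total vector count across all namespaces.
async function totalVectors(db, namespaces) {
  let total = 0;
  for (const ns of namespaces) total += await db.namespaceCount(ns);
  return total;
}
```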
Different providers have different naming requirements:

Chroma Normalization (server/utils/vectorDbProviders/chroma/index.js:31-64):

Milvus/Zilliz Normalization (server/utils/vectorDbProviders/milvus/index.js:28-33):

- Prefixed with anythingllm_ if the first character is invalid

Weaviate Normalization (server/utils/vectorDbProviders/weaviate/index.js:1-511):

- Applies a camelCase() transformation

AstraDB Normalization (server/utils/vectorDbProviders/astra/index.js:10-16):

- Adds an ns_ prefix
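A Milvus-style normalization can be sketched as below. The character rules here are illustrative assumptions; the exact logic in milvus/index.js may differ:

```javascript
// Illustrative collection-name normalization: Milvus collection names must
// start with a letter or underscore, so an anythingllm_ prefix is added
// when the first character is invalid (assumed rule set).
function normalizeMilvusName(namespace) {
  let name = namespace.replace(/[^a-zA-Z0-9_]/g, "_"); // strip invalid chars
  if (!/^[a-zA-Z_]/.test(name)) name = `anythingllm_${name}`;
  return name;
}
```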
AnythingLLM supports 10+ vector database providers through the common interface. Each provider has specific connection requirements and features.
| Provider | Class | Connection ENV | Special Features |
|---|---|---|---|
| LanceDB | LanceDb | STORAGE_DIR | Local file-based, reranking support |
| Chroma | Chroma | CHROMA_ENDPOINT, CHROMA_API_KEY | Self-hosted, collection normalization |
| Pinecone | PineconeDB | PINECONE_API_KEY, PINECONE_INDEX | Cloud-hosted, serverless |
| Qdrant | QDrant | QDRANT_ENDPOINT, QDRANT_API_KEY | Self/cloud hosted, hybrid search |
| Milvus | Milvus | MILVUS_ADDRESS, MILVUS_USERNAME, MILVUS_PASSWORD | Open-source, high-performance |
| Zilliz | Zilliz | ZILLIZ_ENDPOINT, ZILLIZ_API_TOKEN | Cloud Milvus, extends Milvus class |
| Weaviate | Weaviate | WEAVIATE_ENDPOINT, WEAVIATE_API_KEY | GraphQL API, schema-based |
| AstraDB | AstraDB | ASTRA_DB_APPLICATION_TOKEN, ASTRA_DB_ENDPOINT | Cassandra-based, 20 vectors/batch limit |
LanceDB is the default vector database, requiring no external services. It stores vectors as files in the STORAGE_DIR/lancedb directory.
URI Construction (server/utils/vectorDbProviders/lance/index.js:22-27):

Unique Features:

- Reranking support via NativeEmbeddingReranker

Pinecone (server/utils/vectorDbProviders/pinecone/index.js:19-32):

- Heartbeat checks that the index reports status.ready

Qdrant (server/utils/vectorDbProviders/qdrant/index.js:19-37):

- Heartbeat calls client.api("cluster")?.clusterStatus()

AstraDB (server/utils/vectorDbProviders/astra/index.js:40-49):
Several providers require vector dimensions during collection creation. The system infers dimensions from the first embedded chunk:
Qdrant (server/utils/vectorDbProviders/qdrant/index.js:138-153):
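The lazy-creation pattern can be sketched as follows; the client methods below are stand-ins, not the real Qdrant client API:

```javascript
// Sketch of dimension inference: create the collection only when it does not
// exist yet, reading the vector size from the first embedded chunk.
async function ensureCollection(client, namespace, firstVector) {
  if (await client.hasCollection(namespace)) return;
  await client.createCollection(namespace, {
    vectors: { size: firstVector.length, distance: "Cosine" },
  });
}
```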
The vector cache stores computed embeddings to avoid re-embedding documents that haven't changed. This significantly improves performance when documents are re-indexed or moved between workspaces.
Cache Location: vector-cache/ directory (or STORAGE_DIR/vector-cache/)
Cache Content Format: Each cached file contains an array of chunks, where each chunk is an array of vector objects:
The cache uses the full file path as a unique identifier. The cachedVectorInformation function reads the cache file and returns the stored chunks.
Cache Lookup (server/utils/vectorDbProviders/lance/index.js:314-333):
| Scenario | Without Cache | With Cache |
|---|---|---|
| Re-embedding same document | Generate embeddings (slow) | Load from cache (fast) |
| Moving document between workspaces | Generate embeddings | Load from cache |
| API call to embedding provider | Yes (costs money) | No (free) |
| Computation time | ~1-5 seconds per document | ~50-200ms per document |
Text splitting breaks documents into manageable chunks before embedding. For detailed documentation on text splitting strategies, see Text Splitting and Chunking.
The TextSplitter class wraps the Langchain RecursiveCharacterTextSplitter and adds metadata injection capabilities.
Constructor Parameters (server/utils/TextSplitter/index.js:32-35):

- chunkPrefix: Prefix for model-specific requirements (e.g., "passage: ")
- chunkSize: Maximum characters per chunk (default 1000)
- chunkOverlap: Character overlap between chunks (default 20)
- chunkHeaderMeta: Metadata object to inject as XML header

The system can prepend XML-formatted metadata to each chunk, providing context to the LLM during retrieval:
Header Format (server/utils/TextSplitter/index.js:134-147):
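A sketch of what the header builder might produce; the document_metadata tag and key/value layout are assumptions about the format, not a copy of the real implementation:

```javascript
// Hypothetical chunk-header builder: render a metadata object as an XML-style
// block that is prepended to each chunk before embedding.
function chunkHeader(meta) {
  if (!meta || Object.keys(meta).length === 0) return "";
  const lines = Object.entries(meta).map(([k, v]) => `${k}: ${v}`);
  return `<document_metadata>\n${lines.join("\n")}\n</document_metadata>\n\n`;
}
```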
Header Building (server/utils/TextSplitter/index.js:64-118):
The buildHeaderMeta static method extracts relevant metadata fields:
- title → sourceDocument
- published → published
- chunkSource (if link:// or youtube://) → source

The system respects embedding model limits when determining chunk size:
Max Chunk Size Logic (server/utils/TextSplitter/index.js:47-57):
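The clamp can be sketched as below; the function name, fallback value, and parameter names are assumptions for illustration:

```javascript
// Sketch of the chunk-size clamp: the effective chunk size can never exceed
// the embedding model's maximum chunk length.
function effectiveChunkSize(preferred, embedderLimit) {
  if (isNaN(preferred) || preferred <= 0) preferred = 1000; // assumed fallback default
  return Math.min(preferred, embedderLimit);
}
```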
The DocumentVectors model maintains the relationship between documents and their embedded vectors. This bidirectional mapping enables document deletion, updates, and vector management.
Table: document_vectors
| Column | Type | Purpose |
|---|---|---|
| id | Integer | Primary key (auto-increment) |
| docId | String | Document identifier (foreign key) |
| vectorId | String | UUID of vector in vector database |
Bulk Insert (used during vectorization):
Query by Document (used during deletion):
Delete Mappings (after vector deletion):
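The three operations can be modeled with an in-memory sketch; the real model is backed by the application database, and these method names are illustrative:

```javascript
// Illustrative in-memory model of the document_vectors mapping table,
// mirroring the bulk insert / query / delete operations described above.
class DocumentVectors {
  constructor() { this.rows = []; this.nextId = 1; }
  bulkInsert(records) {
    // records: [{ docId, vectorId }, ...] — one row per stored vector.
    for (const r of records) this.rows.push({ id: this.nextId++, ...r });
    return records.length;
  }
  where(docId) {
    return this.rows.filter((r) => r.docId === docId);
  }
  deleteForDocument(docId) {
    const before = this.rows.length;
    this.rows = this.rows.filter((r) => r.docId !== docId);
    return before - this.rows.length; // number of mappings removed
  }
}
```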
The system exposes API endpoints for namespace management and statistics. These endpoints are typically accessed by administrators or the frontend UI.
| Endpoint | Method | Purpose |
|---|---|---|
| namespace-stats | POST | Get namespace statistics (vector count, metadata) |
| delete-namespace | POST | Delete entire namespace and all vectors |
| reset | POST | Delete all namespaces (full reset) |
Each provider implements these as async methods with the endpoint name:
namespace-stats Example (server/utils/vectorDbProviders/lance/index.js:458-468):

delete-namespace Example (server/utils/vectorDbProviders/lance/index.js:470-480):

reset Example (server/utils/vectorDbProviders/lance/index.js:482-487):
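A sketch of the three endpoint methods against a stub client; the client methods and return shapes below are assumptions for illustration, not the LanceDB implementation:

```javascript
// Hypothetical endpoint-named methods, as described above: each provider
// exposes an async method per management endpoint.
const ManagementEndpoints = {
  async "namespace-stats"(client, namespace) {
    if (!(await client.hasNamespace(namespace)))
      throw new Error("Namespace does not exist!");
    return { vectorCount: await client.namespaceCount(namespace) };
  },
  async "delete-namespace"(client, namespace) {
    const count = await client.namespaceCount(namespace);
    await client.deleteNamespace(namespace);
    return { message: `Namespace ${namespace} was deleted with ${count} vectors.` };
  },
  async reset(client) {
    // Full reset: remove every namespace and all of its vectors.
    for (const ns of await client.listNamespaces()) await client.deleteNamespace(ns);
    return { reset: true };
  },
};
```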