This document explains the end-to-end process of converting a document into vector embeddings and storing them in a vector database. The pipeline includes text chunking, embedding generation, vector caching, batch insertion, and database tracking via the DocumentVectors table.
For information about vector database providers and their configuration, see Vector Database Providers. For details on text splitting configuration and chunking strategies, see Text Splitting and Chunking. For the similarity search process that retrieves these vectors, see Similarity Search and Reranking.
The document vectorization pipeline is implemented in the addDocumentToNamespace method, present in every vector database provider class. This method orchestrates the transformation of raw document text into searchable vector embeddings.
Pipeline Stages:
| Stage | Purpose | Key Function | Output |
|---|---|---|---|
| Cache Check | Avoid recomputing embeddings for unchanged files | cachedVectorInformation() | {exists: boolean, chunks: array} |
| Text Splitting | Break document into processable chunks | TextSplitter.splitText() | Array of text strings |
| Embedding | Convert text to vector representations | EmbedderEngine.embedChunks() | Array of float arrays |
| Batch Preparation | Group vectors for efficient insertion | toChunks(vectors, batchSize) | Chunked arrays |
| Database Insertion | Store vectors in namespace | Provider-specific method | Success/failure status |
| Cache Storage | Save vectors for future reuse | storeVectorResult() | File written to disk |
| Tracking | Record vector IDs for document management | DocumentVectors.bulkInsert() | Database records |
Sources: `server/utils/vectorDbProviders/lance/index.js` 301-404, `server/utils/vectorDbProviders/chroma/index.js` 203-342, `server/utils/vectorDbProviders/qdrant/index.js` 155-330
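The stages in the table compose roughly as follows. This is a minimal sketch with hypothetical dependency wiring, not the exact provider code; the real `addDocumentToNamespace` implementations differ per provider:

```javascript
const uuidv4 = () => Math.random().toString(36).slice(2); // stand-in for the real uuid v4

// Hypothetical sketch of the addDocumentToNamespace pipeline stages.
// Helper names follow the table above; signatures are assumptions.
async function addDocumentToNamespace(namespace, doc, fullFilePath, deps) {
  const {
    cachedVectorInformation, TextSplitter, EmbedderEngine,
    insertBatch, storeVectorResult, DocumentVectors, toChunks,
  } = deps;

  if (!doc.pageContent?.length) return false; // input validation: empty documents fail early

  // Stage 1: cache check, so unchanged files skip re-embedding.
  const cache = await cachedVectorInformation(fullFilePath);
  let chunks;
  if (cache.exists) {
    chunks = cache.chunks.flat(); // cache files hold batches of vector objects
  } else {
    // Stages 2-3: split the text, then embed each chunk.
    const textChunks = await TextSplitter.splitText(doc.pageContent);
    const vectors = await EmbedderEngine.embedChunks(textChunks);
    if (!vectors?.length) throw new Error("Could not embed document chunks!");
    chunks = textChunks.map((text, i) => ({
      id: uuidv4(),
      values: vectors[i],
      metadata: { ...doc, text },
    }));
  }

  // Stages 4-5: insert in provider-appropriate batch sizes.
  for (const batch of toChunks(chunks, 500)) await insertBatch(namespace, batch);

  // Stage 6: persist vectors to the on-disk cache for future re-ingestion.
  if (!cache.exists) await storeVectorResult([chunks], fullFilePath);

  // Stage 7: track vector IDs so the document can be deleted later.
  await DocumentVectors.bulkInsert(
    chunks.map((c) => ({ docId: doc.docId, vectorId: c.id }))
  );
  return true;
}
```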
The vector cache is a performance optimization that stores computed embeddings on disk, indexed by file path UUID. When a document is re-ingested without changes, the system loads pre-computed vectors instead of re-embedding the text.
Cache File Format:
Each cache file contains an array of chunk batches. Each batch is an array of vector objects with the following structure:
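Based on that description, a cache file plausibly looks like the following (illustrative field names; the exact shape is provider-dependent):

```javascript
// Illustrative shape of a vector cache file: an array of chunk batches,
// each batch an array of vector objects. Field names are assumptions.
const cacheFile = [
  [ // batch 0
    {
      id: "3f8a...-uuid",                 // vector ID (UUID v4)
      values: [0.012, -0.034 /* ... */],  // embedding floats
      metadata: { text: "chunk text", title: "doc.txt" },
    },
    // ...more vectors in this batch
  ],
  // ...more batches
];
```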
Cache Behavior:
Cache hit: pre-computed vectors are loaded from disk and inserted directly, skipping text splitting and embedding.
Cache miss: the document is split and embedded, and the resulting vectors are written to disk via `storeVectorResult()`.
When the skip-cache flag passed to `addDocumentToNamespace()` is `true`, the cache check is bypassed and the document is always re-embedded.

Sources: `server/utils/vectorDbProviders/lance/index.js` 313-334, `server/utils/vectorDbProviders/chroma/index.js` 215-253, `server/utils/vectorDbProviders/pinecone/index.js` 127-149
When the cache is not available, the pipeline performs text splitting followed by embedding generation. This is the most computationally expensive part of the pipeline.
Configuration Sources:
| Parameter | Source | Default | Purpose |
|---|---|---|---|
| `chunkSize` | `SystemSettings.getValueOrFallback({label: "text_splitter_chunk_size"})` | Embedder limit | Maximum characters per chunk |
| `chunkOverlap` | `SystemSettings.getValueOrFallback({label: "text_splitter_chunk_overlap"})` | 20 | Overlap characters between chunks |
| `chunkHeaderMeta` | `TextSplitter.buildHeaderMeta(metadata)` | null | XML metadata prepended to chunks |
| `chunkPrefix` | `EmbedderEngine.embeddingPrefix` | "" | Model-specific prefix (e.g., "passage: ") |
Metadata Header Format (when present):
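A plausible sketch of the header construction and chunk assembly (the tag name, field layout, and prefix/header ordering are assumptions):

```javascript
// Plausible sketch of buildHeaderMeta: turn document metadata into an
// XML-style header prepended to every chunk. Tag name and layout are
// assumptions based on the configuration table above.
function buildHeaderMeta(metadata = {}) {
  const entries = Object.entries(metadata);
  if (entries.length === 0) return null;
  const lines = entries.map(([key, value]) => `${key}: ${value}`).join("\n");
  return `<document_metadata>\n${lines}\n</document_metadata>\n\n`;
}

// Assemble a chunk: model-specific prefix first (e.g. "passage: "),
// then the optional metadata header, then the chunk text.
function prepareChunk(text, { chunkHeaderMeta = null, chunkPrefix = "" } = {}) {
  return `${chunkPrefix}${chunkHeaderMeta ?? ""}${text}`;
}
```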
Sources: `server/utils/vectorDbProviders/lance/index.js` 340-361, `server/utils/vectorDbProviders/chroma/index.js` 259-279, `server/utils/TextSplitter/index.js` 47-118
After embedding generation, vectors are inserted into the vector database in batches to optimize performance and avoid API rate limits. Each provider uses different batch sizes based on its capabilities.
Provider-Specific Batch Sizes:
| Provider | Batch Size | Rationale | Implementation |
|---|---|---|---|
| LanceDB | 500 | File-based, fast local writes | `server/utils/vectorDbProviders/lance/index.js` 390 |
| Chroma | 500 | HTTP API, moderate limit | `server/utils/vectorDbProviders/chroma/index.js` 321 |
| Qdrant | 500 | Efficient batch upsert | `server/utils/vectorDbProviders/qdrant/index.js` 298 |
| Pinecone | 100 | Cloud API rate limits | `server/utils/vectorDbProviders/pinecone/index.js` 203 |
| Milvus | 100 | Database performance | `server/utils/vectorDbProviders/milvus/index.js` 260 |
| Weaviate | 500 | Batch object API | `server/utils/vectorDbProviders/weaviate/index.js` 343 |
| AstraDB | 20 | Strict API limit (documented) | `server/utils/vectorDbProviders/astra/index.js` 270 |
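The `toChunks(vectors, batchSize)` helper referenced in the pipeline table can be sketched as follows (a minimal stand-in for the real utility):

```javascript
// Minimal sketch of toChunks: split an array into batches of at most `size`.
function toChunks(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Example: 1,050 vectors with Pinecone's batch size of 100 yield 11 batches,
// the last holding the remaining 50 vectors.
const batches = toChunks(Array.from({ length: 1050 }, (_, i) => i), 100);
```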
Each vector database provider implements insertion differently based on its API requirements:
LanceDB: appends batches directly to the local table file; vector values and metadata live together in a flat object per row.
Chroma: adds ids, embeddings, and a separate `metadatas` array to the (normalized) collection over its HTTP API.
Qdrant: upserts points whose payloads may contain nested objects, using a wait parameter for synchronous confirmation.
Pinecone: upserts records with flat metadata objects through its namespace API.
Sources: `server/utils/vectorDbProviders/lance/index.js` 388-395, `server/utils/vectorDbProviders/chroma/index.js` 318-334, `server/utils/vectorDbProviders/qdrant/index.js` 294-322, `server/utils/vectorDbProviders/pinecone/index.js` 198-208, `server/utils/vectorDbProviders/milvus/index.js` 254-281
The DocumentVectors table maintains a mapping between document IDs (docId) and vector IDs (vectorId). This enables efficient document deletion and workspace management.
DocumentVectors Schema:
| Field | Type | Purpose |
|---|---|---|
| `id` | Integer (Primary Key) | Unique record identifier |
| `docId` | String | Foreign key to document |
| `vectorId` | String (UUID) | ID of vector in vector database |
Key Operations:
Bulk Insert (`DocumentVectors.bulkInsert(documentVectors)`): records one row per vector, mapping each `vectorId` back to its source `docId`.
Query by Document (`DocumentVectors.where({docId})`): retrieves the tracking records for a document, e.g. to collect its vector IDs before deletion.
Delete by IDs (`DocumentVectors.deleteIds(indexes)`): removes tracking records once the corresponding vectors have been removed from the vector database.
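The three operations can be illustrated with an in-memory stand-in for the database-backed table (field names follow the schema above; the implementation is illustrative only):

```javascript
// Illustrative in-memory equivalent of the DocumentVectors tracking model.
// The real model is database-backed; this mirrors only the three operations.
const DocumentVectors = {
  rows: [],
  _id: 0,
  async bulkInsert(documentVectors) {
    for (const { docId, vectorId } of documentVectors)
      this.rows.push({ id: ++this._id, docId, vectorId });
    return { documentVectors };
  },
  async where(clause) {
    return this.rows.filter((row) =>
      Object.entries(clause).every(([key, value]) => row[key] === value));
  },
  async deleteIds(ids) {
    this.rows = this.rows.filter((row) => !ids.includes(row.id));
    return true;
  },
};
```

Deleting a document then becomes: `where({docId})` to collect its vector IDs, delete those vectors from the vector database, then `deleteIds` on the tracking records.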
Sources: `server/utils/vectorDbProviders/lance/index.js` 290-398, `server/utils/vectorDbProviders/chroma/index.js` 344-360, `server/utils/vectorDbProviders/qdrant/index.js` 332-348
Namespaces isolate vectors for different workspaces, providing multi-tenancy. Each workspace has its own namespace in the vector database, ensuring data isolation.
Provider-Specific Namespace Naming:
Some providers require specific namespace naming conventions:
| Provider | Normalization | Example Transformation |
|---|---|---|
| Chroma | Regex validation: 3-63 chars, alphanumeric/underscore/hyphen | My Workspace! → My-Workspace |
| Milvus/Zilliz | Letters, numbers, underscores only; must start with letter/underscore | 123-workspace → anythingllm_123_workspace |
| Weaviate | CamelCase class names | my-workspace → myWorkspace |
| AstraDB | Prefix with ns_ | workspace → ns_workspace |
| Others | Use namespace as-is | my-workspace → my-workspace |
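As an illustration, the Milvus/Zilliz rule from the table could be implemented like this (a sketch matching the example transformation, not the project's actual helper):

```javascript
// Sketch of Milvus/Zilliz-style namespace normalization per the table above:
// only letters, numbers, and underscores are allowed, and the name must start
// with a letter or underscore. The "anythingllm_" prefix mirrors the example
// transformation shown in the table.
function normalizeMilvusNamespace(namespace) {
  let name = namespace.replace(/[^a-zA-Z0-9_]/g, "_");
  if (!/^[a-zA-Z_]/.test(name)) name = `anythingllm_${name}`;
  return name;
}
```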
Namespace Operations:
Collection Creation (on-demand): `getOrCreateCollection(client, namespace, dimensions)`
Namespace Deletion: `POST /v/:provider/delete-namespace`
Vector Count: `namespaceCount(namespace)` returns the number of vectors in a namespace

Sources: `server/utils/vectorDbProviders/chroma/index.js` 31-65, `server/utils/vectorDbProviders/milvus/index.js` 28-33, `server/utils/vectorDbProviders/weaviate/index.js` 138-196, `server/utils/vectorDbProviders/astra/index.js` 10-155, `server/utils/vectorDbProviders/qdrant/index.js` 112-153
The vectorization pipeline includes comprehensive error handling to ensure data integrity and provide actionable feedback.
Error Handling Patterns:
| Stage | Error Condition | Response | Example |
|---|---|---|---|
| Input Validation | Empty pageContent | Return early with false | server/utils/vectorDbProviders/lance/index.js310 |
| Cache Load | Cache file corrupted | Fall through to embedding | Logged, not fatal |
| Text Splitting | Invalid configuration | Throw error | Chunk size exceeds model limit |
| Embedding | Empty vectorValues array | Throw error: "Could not embed document chunks!" | server/utils/vectorDbProviders/lance/index.js383-386 |
| Database Insert | Connection failure | Throw error | Provider-specific error messages |
| Tracking Insert | DB constraint violation | Throw error | Duplicate vectorId |
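The input-validation and embedding guards from the table can be sketched together (the error string follows the table; the surrounding structure is an assumption):

```javascript
// Sketch of the guard clauses from the error-handling table: empty input
// returns early, while an empty embedding result is a hard failure.
async function vectorizeOrThrow(doc, embedChunks, textChunks) {
  if (!doc.pageContent?.length) return { success: false }; // empty input: early return, not fatal

  const vectorValues = await embedChunks(textChunks);
  if (!vectorValues || vectorValues.length === 0)
    throw new Error("Could not embed document chunks!"); // documented failure message

  return { success: true, vectorValues };
}
```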
Logging Strategy:
All providers inherit from the VectorDatabase base class, which provides a `this.logger()` method. Key log points:
- "Adding new vectorized document into namespace"
- "Snippets created from document: X"
- "Inserting vectorized chunks into [Provider] collection"

Sources: `server/utils/vectorDbProviders/lance/index.js` 308-403, `server/utils/vectorDbProviders/chroma/index.js` 338-341, `server/utils/vectorDbProviders/qdrant/index.js` 326-329
While the pipeline structure is consistent across providers, implementation details vary based on each database's API and capabilities.
| Provider | Dimension Handling | Vector ID Format | Metadata Storage | Special Features |
|---|---|---|---|---|
| LanceDB | Inferred from vectors | UUID v4 | Flat object with vector | Reranking support |
| Chroma | Inferred from vectors | UUID v4 | Separate metadatas array | Collection normalization |
| Qdrant | Required at creation | UUID v4 | Payloads with nested objects | Wait parameter for sync |
| Pinecone | Inferred from index | UUID v4 | Flat metadata object | Namespace API |
| Milvus | Required at creation | UUID v4 | JSON field | Auto-index with COSINE |
| Zilliz | Required at creation | UUID v4 | JSON field | Cloud version of Milvus |
| Weaviate | Inferred from vectors | UUID v4 | Flattened properties | CamelCase class names |
| AstraDB | Required at creation | UUID v4 (as _id) | Metadata object | 20-record batch limit |
Providers that require dimensions upfront (Qdrant, Milvus, AstraDB) infer them from the first vector:
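A minimal sketch of that inference (the helper name is hypothetical):

```javascript
// Infer collection dimensions from the first embedded vector, as done by
// providers that must declare dimensions at collection-creation time.
function inferDimensions(vectors) {
  if (!vectors?.length || !Array.isArray(vectors[0]))
    throw new Error("No vectors available to infer dimensions from.");
  return vectors[0].length;
}
```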
Weaviate requires nested objects to be flattened with underscore-separated keys:
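A sketch of such a flattening helper (the name and exact behavior are assumptions):

```javascript
// Flatten nested metadata into underscore-separated keys, as Weaviate
// properties cannot hold nested objects. Arrays are kept as-is.
function flattenObjectForWeaviate(obj, prefix = "", out = {}) {
  for (const [key, value] of Object.entries(obj)) {
    const flatKey = prefix ? `${prefix}_${key}` : key;
    if (value && typeof value === "object" && !Array.isArray(value)) {
      flattenObjectForWeaviate(value, flatKey, out); // recurse into nested objects
    } else {
      out[flatKey] = value;
    }
  }
  return out;
}

// { a: { b: 1 }, c: 2 } → { a_b: 1, c: 2 }
```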
Sources: `server/utils/vectorDbProviders/qdrant/index.js` 163-183, `server/utils/vectorDbProviders/milvus/index.js` 112-177, `server/utils/vectorDbProviders/astra/index.js` 126-176, `server/utils/vectorDbProviders/weaviate/index.js` 471-508
The following diagram maps the complete pipeline execution, showing all method calls and decision points:
Critical Execution Paths:
Cache hit: `cachedVectorInformation` → load vectors → insert to DB → track IDs (~seconds)
Cache miss: split text → embed chunks → insert to DB → store cache → track IDs

Transaction Boundaries:
`DocumentVectors` tracking

Sources: `server/utils/vectorDbProviders/lance/index.js` 301-404, `server/utils/vectorDbProviders/chroma/index.js` 203-342, `server/utils/vectorDbProviders/pinecone/index.js` 115-216