This document explains the end-to-end process of converting a document into vector embeddings and storing them in a vector database. The pipeline includes text chunking, embedding generation, vector caching, batch insertion, and database tracking via the DocumentVectors table.
For information about vector database providers and their configuration, see Vector Database Providers. For details on text splitting configuration and chunking strategies, see Text Splitting and Chunking. For the similarity search process that retrieves these vectors, see Similarity Search and Reranking.
The document vectorization pipeline is implemented in the addDocumentToNamespace method, present in every vector database provider class. This method orchestrates the transformation of raw document text into searchable vector embeddings.
Pipeline Stages:
| Stage | Purpose | Key Function | Output |
|---|---|---|---|
| Cache Check | Avoid recomputing embeddings for unchanged files | cachedVectorInformation() | {exists: boolean, chunks: array} |
| Text Splitting | Break document into processable chunks | TextSplitter.splitText() | Array of text strings |
| Embedding | Convert text to vector representations | EmbedderEngine.embedChunks() | Array of float arrays |
| Batch Preparation | Group vectors for efficient insertion | toChunks(vectors, batchSize) | Chunked arrays |
| Database Insertion | Store vectors in namespace | Provider-specific method | Success/failure status |
| Cache Storage | Save vectors for future reuse | storeVectorResult() | File written to disk |
| Tracking | Record vector IDs for document management | DocumentVectors.bulkInsert() | Database records |
Sources: `server/utils/vectorDbProviders/lance/index.js` 301-404, `server/utils/vectorDbProviders/chroma/index.js` 203-342, `server/utils/vectorDbProviders/qdrant/index.js` 155-330
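The stages in the table compose roughly as follows. This is a minimal sketch with hypothetical dependency wiring, not the exact provider code; the real `addDocumentToNamespace` implementations differ per provider:

```javascript
const uuidv4 = () => Math.random().toString(36).slice(2); // stand-in for the real uuid v4

// Hypothetical sketch of the addDocumentToNamespace pipeline stages.
// Helper names follow the table above; signatures are assumptions.
async function addDocumentToNamespace(namespace, doc, fullFilePath, deps) {
  const {
    cachedVectorInformation, TextSplitter, EmbedderEngine,
    insertBatch, storeVectorResult, DocumentVectors, toChunks,
  } = deps;

  if (!doc.pageContent?.length) return false; // input validation: empty documents fail early

  // Stage 1: cache check, so unchanged files skip re-embedding.
  const cache = await cachedVectorInformation(fullFilePath);
  let chunks;
  if (cache.exists) {
    chunks = cache.chunks.flat(); // cache files hold batches of vector objects
  } else {
    // Stages 2-3: split the text, then embed each chunk.
    const textChunks = await TextSplitter.splitText(doc.pageContent);
    const vectors = await EmbedderEngine.embedChunks(textChunks);
    if (!vectors?.length) throw new Error("Could not embed document chunks!");
    chunks = textChunks.map((text, i) => ({
      id: uuidv4(),
      values: vectors[i],
      metadata: { ...doc, text },
    }));
  }

  // Stages 4-5: insert in provider-appropriate batch sizes.
  for (const batch of toChunks(chunks, 500)) await insertBatch(namespace, batch);

  // Stage 6: persist vectors to the on-disk cache for future re-ingestion.
  if (!cache.exists) await storeVectorResult([chunks], fullFilePath);

  // Stage 7: track vector IDs so the document can be deleted later.
  await DocumentVectors.bulkInsert(
    chunks.map((c) => ({ docId: doc.docId, vectorId: c.id }))
  );
  return true;
}
```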
The vector cache is a performance optimization that stores computed embeddings on disk, indexed by file path UUID. When a document is re-ingested without changes, the system loads pre-computed vectors instead of re-embedding the text.
Cache File Format:
Each cache file contains an array of chunk batches. Each batch is an array of vector objects with the following structure:
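Based on that description, a cache file plausibly looks like the following (illustrative field names; the exact shape is provider-dependent):

```javascript
// Illustrative shape of a vector cache file: an array of chunk batches,
// each batch an array of vector objects. Field names are assumptions.
const cacheFile = [
  [ // batch 0
    {
      id: "3f8a...-uuid",                 // vector ID (UUID v4)
      values: [0.012, -0.034 /* ... */],  // embedding floats
      metadata: { text: "chunk text", title: "doc.txt" },
    },
    // ...more vectors in this batch
  ],
  // ...more batches
];
```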
Cache Behavior:
Cache hit: pre-computed vectors are loaded from disk and inserted directly, skipping text splitting and embedding.
Cache miss: the document is split and embedded, and the resulting vectors are written to disk via `storeVectorResult()`.
When the skip-cache flag passed to `addDocumentToNamespace()` is `true`, the cache check is bypassed and the document is always re-embedded.

Sources: `server/utils/vectorDbProviders/lance/index.js` 313-334, `server/utils/vectorDbProviders/chroma/index.js` 215-253, `server/utils/vectorDbProviders/pinecone/index.js` 127-149
When the cache is not available, the pipeline performs text splitting followed by embedding generation. This is the most computationally expensive part of the pipeline.
Configuration Sources:
| Parameter | Source | Default | Purpose |
|---|---|---|---|
| `chunkSize` | `SystemSettings.getValueOrFallback({label: "text_splitter_chunk_size"})` | Embedder limit | Maximum characters per chunk |
| `chunkOverlap` | `SystemSettings.getValueOrFallback({label: "text_splitter_chunk_overlap"})` | 20 | Overlap characters between chunks |
| `chunkHeaderMeta` | `TextSplitter.buildHeaderMeta(metadata)` | null | XML metadata prepended to chunks |
| `chunkPrefix` | `EmbedderEngine.embeddingPrefix` | "" | Model-specific prefix (e.g., "passage: ") |
Metadata Header Format (when present):
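A plausible sketch of the header construction and chunk assembly (the tag name, field layout, and prefix/header ordering are assumptions):

```javascript
// Plausible sketch of buildHeaderMeta: turn document metadata into an
// XML-style header prepended to every chunk. Tag name and layout are
// assumptions based on the configuration table above.
function buildHeaderMeta(metadata = {}) {
  const entries = Object.entries(metadata);
  if (entries.length === 0) return null;
  const lines = entries.map(([key, value]) => `${key}: ${value}`).join("\n");
  return `<document_metadata>\n${lines}\n</document_metadata>\n\n`;
}

// Assemble a chunk: model-specific prefix first (e.g. "passage: "),
// then the optional metadata header, then the chunk text.
function prepareChunk(text, { chunkHeaderMeta = null, chunkPrefix = "" } = {}) {
  return `${chunkPrefix}${chunkHeaderMeta ?? ""}${text}`;
}
```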
Sources: `server/utils/vectorDbProviders/lance/index.js` 340-361, `server/utils/vectorDbProviders/chroma/index.js` 259-279, `server/utils/TextSplitter/index.js` 47-118
After embedding generation, vectors are inserted into the vector database in batches to optimize performance and avoid API rate limits. Each provider uses different batch sizes based on its capabilities.
Provider-Specific Batch Sizes:
| Provider | Batch Size | Rationale | Implementation |
|---|---|---|---|
| LanceDB | 500 | File-based, fast local writes | `server/utils/vectorDbProviders/lance/index.js` 390 |
| Chroma | 500 | HTTP API, moderate limit | `server/utils/vectorDbProviders/chroma/index.js` 321 |
| Qdrant | 500 | Efficient batch upsert | `server/utils/vectorDbProviders/qdrant/index.js` 298 |
| Pinecone | 100 | Cloud API rate limits | `server/utils/vectorDbProviders/pinecone/index.js` 203 |
| Milvus | 100 | Database performance | `server/utils/vectorDbProviders/milvus/index.js` 260 |
| Weaviate | 500 | Batch object API | `server/utils/vectorDbProviders/weaviate/index.js` 343 |
| AstraDB | 20 | Strict API limit (documented) | `server/utils/vectorDbProviders/astra/index.js` 270 |
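The `toChunks(vectors, batchSize)` helper referenced in the pipeline table can be sketched as follows (a minimal stand-in for the real utility):

```javascript
// Minimal sketch of toChunks: split an array into batches of at most `size`.
function toChunks(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Example: 1,050 vectors with Pinecone's batch size of 100 yield 11 batches,
// the last holding the remaining 50 vectors.
const batches = toChunks(Array.from({ length: 1050 }, (_, i) => i), 100);
```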
Each vector database provider implements insertion differently based on its API requirements:
LanceDB: appends batches directly to the local table file; vector values and metadata live together in a flat object per row.
Chroma: adds ids, embeddings, and a separate `metadatas` array to the (normalized) collection over its HTTP API.
Qdrant: upserts points whose payloads may contain nested objects, using a wait parameter for synchronous confirmation.
Pinecone: upserts records with flat metadata objects through its namespace API.
Sources: `server/utils/vectorDbProviders/lance/index.js` 388-395, `server/utils/vectorDbProviders/chroma/index.js` 318-334, `server/utils/vectorDbProviders/qdrant/index.js` 294-322, `server/utils/vectorDbProviders/pinecone/index.js` 198-208, `server/utils/vectorDbProviders/milvus/index.js` 254-281
The DocumentVectors table maintains a mapping between document IDs (docId) and vector IDs (vectorId). This enables efficient document deletion and workspace management.
DocumentVectors Schema:
| Field | Type | Purpose |
|---|---|---|
| `id` | Integer (Primary Key) | Unique record identifier |
| `docId` | String | Foreign key to document |
| `vectorId` | String (UUID) | ID of vector in vector database |
Key Operations:
Bulk Insert (`DocumentVectors.bulkInsert(documentVectors)`): records one row per vector, mapping each `vectorId` back to its source `docId`.
Query by Document (`DocumentVectors.where({docId})`): retrieves the tracking records for a document, e.g. to collect its vector IDs before deletion.
Delete by IDs (`DocumentVectors.deleteIds(indexes)`): removes tracking records once the corresponding vectors have been removed from the vector database.
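The three operations can be illustrated with an in-memory stand-in for the database-backed table (field names follow the schema above; the implementation is illustrative only):

```javascript
// Illustrative in-memory equivalent of the DocumentVectors tracking model.
// The real model is database-backed; this mirrors only the three operations.
const DocumentVectors = {
  rows: [],
  _id: 0,
  async bulkInsert(documentVectors) {
    for (const { docId, vectorId } of documentVectors)
      this.rows.push({ id: ++this._id, docId, vectorId });
    return { documentVectors };
  },
  async where(clause) {
    return this.rows.filter((row) =>
      Object.entries(clause).every(([key, value]) => row[key] === value));
  },
  async deleteIds(ids) {
    this.rows = this.rows.filter((row) => !ids.includes(row.id));
    return true;
  },
};
```

Deleting a document then becomes: `where({docId})` to collect its vector IDs, delete those vectors from the vector database, then `deleteIds` on the tracking records.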
Sources: `server/utils/vectorDbProviders/lance/index.js` 290-398, `server/utils/vectorDbProviders/chroma/index.js` 344-360, `server/utils/vectorDbProviders/qdrant/index.js` 332-348
Namespaces isolate vectors for different workspaces, providing multi-tenancy. Each workspace has its own namespace in the vector database, ensuring data isolation.
Provider-Specific Namespace Naming:
Some providers require specific namespace naming conventions:
| Provider | Normalization | Example Transformation |
|---|---|---|
| Chroma | Regex validation: 3-63 chars, alphanumeric/underscore/hyphen | My Workspace! → My-Workspace |
| Milvus/Zilliz | Letters, numbers, underscores only; must start with letter/underscore | 123-workspace → anythingllm_123_workspace |
| Weaviate | CamelCase class names | my-workspace → myWorkspace |
| AstraDB | Prefix with ns_ | workspace → ns_workspace |
| Others | Use namespace as-is | my-workspace → my-workspace |
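As an illustration, the Milvus/Zilliz rule from the table could be implemented like this (a sketch matching the example transformation, not the project's actual helper):

```javascript
// Sketch of Milvus/Zilliz-style namespace normalization per the table above:
// only letters, numbers, and underscores are allowed, and the name must start
// with a letter or underscore. The "anythingllm_" prefix mirrors the example
// transformation shown in the table.
function normalizeMilvusNamespace(namespace) {
  let name = namespace.replace(/[^a-zA-Z0-9_]/g, "_");
  if (!/^[a-zA-Z_]/.test(name)) name = `anythingllm_${name}`;
  return name;
}
```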
Namespace Operations:
Collection Creation (on-demand): `getOrCreateCollection(client, namespace, dimensions)`
Namespace Deletion: `POST /v/:provider/delete-namespace`
Vector Count: `namespaceCount(namespace)` returns the number of vectors in a namespace

Sources: `server/utils/vectorDbProviders/chroma/index.js` 31-65, `server/utils/vectorDbProviders/milvus/index.js` 28-33, `server/utils/vectorDbProviders/weaviate/index.js` 138-196, `server/utils/vectorDbProviders/astra/index.js` 10-155, `server/utils/vectorDbProviders/qdrant/index.js` 112-153
The vectorization pipeline includes comprehensive error handling to ensure data integrity and provide actionable feedback.
Error Handling Patterns:
| Stage | Error Condition | Response | Example |
|---|---|---|---|
| Input Validation | Empty pageContent | Return early with false | server/utils/vectorDbProviders/lance/index.js310 |
| Cache Load | Cache file corrupted | Fall through to embedding | Logged, not fatal |
| Text Splitting | Invalid configuration | Throw error | Chunk size exceeds model limit |
| Embedding | Empty vectorValues array | Throw error: "Could not embed document chunks!" | server/utils/vectorDbProviders/lance/index.js383-386 |
| Database Insert | Connection failure | Throw error | Provider-specific error messages |
| Tracking Insert | DB constraint violation | Throw error | Duplicate vectorId |
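The input-validation and embedding guards from the table can be sketched together (the error string follows the table; the surrounding structure is an assumption):

```javascript
// Sketch of the guard clauses from the error-handling table: empty input
// returns early, while an empty embedding result is a hard failure.
async function vectorizeOrThrow(doc, embedChunks, textChunks) {
  if (!doc.pageContent?.length) return { success: false }; // empty input: early return, not fatal

  const vectorValues = await embedChunks(textChunks);
  if (!vectorValues || vectorValues.length === 0)
    throw new Error("Could not embed document chunks!"); // documented failure message

  return { success: true, vectorValues };
}
```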
Logging Strategy:
All providers inherit from the VectorDatabase base class, which provides a `this.logger()` method. Key log points:
- "Adding new vectorized document into namespace"
- "Snippets created from document: X"
- "Inserting vectorized chunks into [Provider] collection"

Sources: `server/utils/vectorDbProviders/lance/index.js` 308-403, `server/utils/vectorDbProviders/chroma/index.js` 338-341, `server/utils/vectorDbProviders/qdrant/index.js` 326-329
While the pipeline structure is consistent across providers, implementation details vary based on each database's API and capabilities.
| Provider | Dimension Handling | Vector ID Format | Metadata Storage | Special Features |
|---|---|---|---|---|
| LanceDB | Inferred from vectors | UUID v4 | Flat object with vector | Reranking support |
| Chroma | Inferred from vectors | UUID v4 | Separate metadatas array | Collection normalization |
| Qdrant | Required at creation | UUID v4 | Payloads with nested objects | Wait parameter for sync |
| Pinecone | Inferred from index | UUID v4 | Flat metadata object | Namespace API |
| Milvus | Required at creation | UUID v4 | JSON field | Auto-index with COSINE |
| Zilliz | Required at creation | UUID v4 | JSON field | Cloud version of Milvus |
| Weaviate | Inferred from vectors | UUID v4 | Flattened properties | CamelCase class names |
| AstraDB | Required at creation | UUID v4 (as _id) | Metadata object | 20-record batch limit |
Providers that require dimensions upfront (Qdrant, Milvus, AstraDB) infer them from the first vector:
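A minimal sketch of that inference (the helper name is hypothetical):

```javascript
// Infer collection dimensions from the first embedded vector, as done by
// providers that must declare dimensions at collection-creation time.
function inferDimensions(vectors) {
  if (!vectors?.length || !Array.isArray(vectors[0]))
    throw new Error("No vectors available to infer dimensions from.");
  return vectors[0].length;
}
```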
Weaviate requires nested objects to be flattened with underscore-separated keys:
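A sketch of such a flattening helper (the name and exact behavior are assumptions):

```javascript
// Flatten nested metadata into underscore-separated keys, as Weaviate
// properties cannot hold nested objects. Arrays are kept as-is.
function flattenObjectForWeaviate(obj, prefix = "", out = {}) {
  for (const [key, value] of Object.entries(obj)) {
    const flatKey = prefix ? `${prefix}_${key}` : key;
    if (value && typeof value === "object" && !Array.isArray(value)) {
      flattenObjectForWeaviate(value, flatKey, out); // recurse into nested objects
    } else {
      out[flatKey] = value;
    }
  }
  return out;
}

// { a: { b: 1 }, c: 2 } → { a_b: 1, c: 2 }
```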
Sources: `server/utils/vectorDbProviders/qdrant/index.js` 163-183, `server/utils/vectorDbProviders/milvus/index.js` 112-177, `server/utils/vectorDbProviders/astra/index.js` 126-176, `server/utils/vectorDbProviders/weaviate/index.js` 471-508
The following diagram maps the complete pipeline execution, showing all method calls and decision points:
Critical Execution Paths:
Cache hit: `cachedVectorInformation` → load vectors → insert to DB → track IDs (~seconds)
Cache miss: split text → embed chunks → insert to DB → store cache → track IDs

Transaction Boundaries:
`DocumentVectors` tracking

Sources: `server/utils/vectorDbProviders/lance/index.js` 301-404, `server/utils/vectorDbProviders/chroma/index.js` 203-342, `server/utils/vectorDbProviders/pinecone/index.js` 115-216