This page documents the text splitting and chunking system in AnythingLLM, which is responsible for breaking down large documents into smaller, manageable pieces (chunks) before vectorization. The system handles chunk size limits, overlap between chunks, metadata header injection, and embedding model-specific prefixes.
For information about how these chunks are vectorized and stored, see Document Vectorization Pipeline. For details about similarity search across these chunks, see Similarity Search and Reranking.
The TextSplitter class provides a configurable interface for splitting text documents into chunks. It wraps Langchain's RecursiveCharacterTextSplitter and adds AnythingLLM-specific features like metadata headers and model-specific prefixes.
Sources: server/utils/TextSplitter/index.js21-170
The TextSplitter accepts four key configuration parameters that control chunking behavior:
| Parameter | Type | Default | Description |
|---|---|---|---|
| chunkSize | number | 1000 | Maximum number of characters per chunk |
| chunkOverlap | number | 20 | Number of overlapping characters between consecutive chunks |
| chunkPrefix | string | "" | Prefix prepended to each chunk (model-specific requirement) |
| chunkHeaderMeta | Object | null | Metadata object to be formatted as an XML header |
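Putting the parameters together, configuration validation might be sketched as follows. This is a simplified stand-in for the real constructor in server/utils/TextSplitter/index.js; the option names follow the table above, and the overlap-versus-size check mirrors the guard the real class performs.

```javascript
// Minimal sketch of TextSplitter option handling (assumed shape; the real
// implementation lives in server/utils/TextSplitter/index.js).
function normalizeSplitterConfig({
  chunkSize = 1000,
  chunkOverlap = 20,
  chunkPrefix = "",
  chunkHeaderMeta = null,
} = {}) {
  // An overlap equal to or larger than the chunk itself would make
  // splitting impossible, so reject it outright.
  if (chunkOverlap >= chunkSize)
    throw new Error("chunkOverlap must be smaller than chunkSize");
  return { chunkSize, chunkOverlap, chunkPrefix, chunkHeaderMeta };
}
```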
The system uses a two-tier approach to determine the final chunk size, respecting both user preferences and embedder model limitations:
Sources: server/utils/TextSplitter/index.js47-57
The determineMaxChunkSize static method ensures chunks never exceed the embedding model's maximum input length:
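A sketch of that clamping behavior, reconstructed from the description and the warning message shown later on this page (the exact signature is an assumption):

```javascript
// Sketch: honor the user's preferred chunk size only up to the embedder's
// maximum input length, warning when the preference is clamped.
function determineMaxChunkSize(preferred = null, embedderLimit = 1000) {
  const prefValue = Number(preferred);
  const limit = Number(embedderLimit);
  if (!preferred || isNaN(prefValue) || prefValue > limit) {
    if (prefValue > limit)
      console.warn(
        `Text splitter chunk length of ${prefValue} exceeds embedder model max of ${limit}. Will use ${limit}.`
      );
    return limit;
  }
  return prefValue;
}
```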
Sources: server/utils/TextSplitter/index.js47-57
The system injects structured metadata at the beginning of each chunk to provide context to the embedding model. This metadata is formatted as XML for clear delineation.
Sources: server/utils/TextSplitter/index.js64-118 server/utils/TextSplitter/index.js135-147
The metadata header follows this structure:
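An illustrative header, with placeholder values in braces. The field names follow the mapping table below; the enclosing tag name is an assumption based on the XML formatting described above.

```
<document_metadata>
sourceDocument: {title of the document}
published: {publish date, if present}
source: {link:// or youtube:// source, if present}
</document_metadata>
```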
The buildHeaderMeta method selectively plucks relevant fields from the full document metadata:
| Source Field | Output Field | Condition |
|---|---|---|
| title | sourceDocument | Always included if present |
| published | published | Always included if present |
| chunkSource | source | Only if it starts with link:// or youtube:// |
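The field plucking in the table above can be sketched like this. The function shape is an assumption; the condition logic follows the table, and the real buildHeaderMeta in server/utils/TextSplitter/index.js may normalize values differently.

```javascript
// Sketch of selective metadata plucking (assumed shape; see buildHeaderMeta
// in server/utils/TextSplitter/index.js for the real logic).
function buildHeaderMeta(metadata = {}) {
  if (!metadata || Object.keys(metadata).length === 0) return null;
  const pluck = {};
  if (metadata.title) pluck.sourceDocument = metadata.title;
  if (metadata.published) pluck.published = metadata.published;
  if (
    typeof metadata.chunkSource === "string" &&
    (metadata.chunkSource.startsWith("link://") ||
      metadata.chunkSource.startsWith("youtube://"))
  )
    pluck.source = metadata.chunkSource; // the real code may normalize this value
  return Object.keys(pluck).length ? pluck : null;
}
```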
Sources: server/utils/TextSplitter/index.js64-118
Some embedding models require specific prefixes prepended to text for optimal performance. The system supports model-specific prefixes that are applied before the metadata header.
Sources: server/utils/TextSplitter/index.js125-147
The prefix is defined by the selected embedding engine and passed to the TextSplitter constructor. For example, some models like Cohere require a passage: prefix for document chunks.
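Since the page states the prefix is applied before the metadata header, the composition of a final chunk can be illustrated as below. The function name and ordering are assumptions drawn from that description.

```javascript
// Illustration of chunk decoration: model prefix first, then the metadata
// header, then the chunk text itself (ordering per the description above).
function decorateChunk(text, { chunkPrefix = "", header = "" } = {}) {
  return `${chunkPrefix}${header}${text}`;
}
```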
Sources: server/utils/vectorDbProviders/lance/index.js353 server/utils/vectorDbProviders/chroma/index.js272
All vector database providers follow a consistent pattern when using the TextSplitter. The integration occurs within the addDocumentToNamespace method of each provider.
Sources: server/utils/vectorDbProviders/lance/index.js341-355 server/utils/vectorDbProviders/chroma/index.js260-274 server/utils/vectorDbProviders/qdrant/index.js232-246
The following shows the typical integration pattern:
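A hedged sketch of that pattern, with a stubbed SystemSettings object standing in for the real database-backed model (getValueOrFallback is the lookup style used by the providers; this stub's behavior is simplified, and the resulting config would be passed to the TextSplitter constructor):

```javascript
// Stand-in for the real SystemSettings model: returns the stored value for a
// labeled setting, or the supplied fallback when nothing is configured.
const SystemSettings = {
  async getValueOrFallback({ label }, fallback) {
    const stored = {}; // pretend nothing is configured in the database
    return stored[label] ?? fallback;
  },
};

// Resolve the splitter options the way providers do: user settings first,
// clamped to the embedder's maximum chunk length.
async function buildSplitterConfig(embedderMaxChunkLength = 1000) {
  return {
    chunkSize: Math.min(
      await SystemSettings.getValueOrFallback(
        { label: "text_splitter_chunk_size" },
        embedderMaxChunkLength
      ),
      embedderMaxChunkLength
    ),
    chunkOverlap: await SystemSettings.getValueOrFallback(
      { label: "text_splitter_chunk_overlap" },
      20
    ),
  };
}
```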
Sources: server/utils/vectorDbProviders/lance/index.js341-355
The complete text splitting pipeline processes documents through multiple stages before vectorization:
Sources: server/utils/TextSplitter/index.js21-204 server/utils/vectorDbProviders/lance/index.js341-357
The underlying splitting engine from Langchain uses a recursive approach with multiple separator tiers:
The RecursiveCharacterTextSplitter attempts to split on separators in this order:
1. `\n\n` (paragraph breaks)
2. `\n` (line breaks)
3. `" "` (single spaces)
4. `""` (individual characters)

This hierarchy preserves natural document structure when possible, only falling back to character-level splitting when necessary to meet chunk size requirements.
Sources: server/utils/TextSplitter/index.js173-188
The overlap parameter ensures context continuity between chunks. When a document is split, the last chunkOverlap characters of one chunk become the first characters of the next chunk. This helps the embedding model maintain semantic continuity across chunk boundaries.
Default overlap is 20 characters, configurable via the text_splitter_chunk_overlap system setting.
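The overlap effect can be demonstrated with a naive fixed-width splitter. This is an illustration only: the real splitter is separator-aware rather than fixed-width, but the shared-boundary property is the same.

```javascript
// Naive fixed-width splitting with overlap: the tail of each chunk repeats
// at the head of the next one.
function splitWithOverlap(text, chunkSize, chunkOverlap) {
  const chunks = [];
  for (let i = 0; i < text.length; i += chunkSize - chunkOverlap)
    chunks.push(text.slice(i, i + chunkSize));
  return chunks;
}

const chunks = splitWithOverlap("abcdefghijklmnopqrstuvwxyz", 10, 4);
// Each chunk after the first begins with the last 4 characters of the
// previous chunk.
```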
Sources: server/utils/TextSplitter/index.js29 server/utils/vectorDbProviders/lance/index.js348-351
The text splitter retrieves configuration from the SystemSettings model, which provides both database-stored and environment variable settings.
Sources: server/utils/vectorDbProviders/lance/index.js343-351
| Setting Label | Purpose | Default Fallback |
|---|---|---|
| text_splitter_chunk_size | Maximum chunk size in characters | Model limit |
| text_splitter_chunk_overlap | Character overlap between chunks | 20 |
These settings can be modified through the admin UI or environment variables, and changes are validated by the updateENV pipeline documented in Configuration Management.
Sources: server/utils/vectorDbProviders/lance/index.js343-351 server/utils/vectorDbProviders/chroma/index.js262-270
While all vector database providers follow the same text splitting pattern, some have provider-specific constraints:
AstraDB has an additional chunk size constraint due to platform limitations. The chunk size is capped at 7500 characters regardless of other settings:
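The cap described above amounts to a simple clamp on whatever chunk size the normal resolution produces (sketch; the constant matches the stated limit, the function name is illustrative):

```javascript
// AstraDB-specific clamp: whatever the resolved chunk size is, it may not
// exceed the platform's 7500-character limit.
function astraChunkSize(resolvedChunkSize) {
  const ASTRA_MAX_CHUNK = 7500; // platform limitation
  return Math.min(resolvedChunkSize, ASTRA_MAX_CHUNK);
}
```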
Sources: server/utils/vectorDbProviders/astra/index.js210-218
Despite minor variations, all providers use the same core logic:
| Provider | Chunk Size Method | Overlap Setting | Metadata Header | Prefix Support |
|---|---|---|---|---|
| LanceDB | determineMaxChunkSize | ✓ | ✓ | ✓ |
| Chroma | determineMaxChunkSize | ✓ | ✓ | ✓ |
| Qdrant | determineMaxChunkSize | ✓ | ✓ | ✓ |
| Pinecone | determineMaxChunkSize | ✓ | ✓ | ✓ |
| Milvus | determineMaxChunkSize | ✓ | ✓ | ✓ |
| Weaviate | determineMaxChunkSize | ✓ | ✓ | ✓ |
| AstraDB | determineMaxChunkSize + cap | ✓ | ✓ | ✓ |
Sources: server/utils/vectorDbProviders/lance/index.js341-354 server/utils/vectorDbProviders/chroma/index.js260-273 server/utils/vectorDbProviders/qdrant/index.js232-245 server/utils/vectorDbProviders/pinecone/index.js157-170 server/utils/vectorDbProviders/milvus/index.js213-226 server/utils/vectorDbProviders/weaviate/index.js274-287 server/utils/vectorDbProviders/astra/index.js209-225
The text splitter fits into the larger document processing pipeline as follows:
Sources: server/utils/vectorDbProviders/lance/index.js301-399 server/utils/TextSplitter/index.js167-169
Choosing a chunk size involves tradeoffs in retrieval precision and storage:

| Smaller Chunks (< 500) | Larger Chunks (> 1500) |
|---|---|
| + More precise retrieval | + More context per chunk |
| + Less noise | + Fewer total vectors |
| - More vectors to store | - Less precise matching |
| - May lose context | - Potential information overload |
The default of 1000 characters provides a balance suitable for most use cases.
The system caches split and vectorized chunks to avoid re-processing documents. When a document is re-ingested, the provider first calls cachedVectorInformation(fullFilePath) to check for an existing cache. This caching occurs at the chunk level, storing the already-split text chunks along with their embeddings.
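The cache-first flow can be sketched as below. cachedVectorInformation is the real helper name; its return shape and the surrounding function names here are assumptions for illustration.

```javascript
// Sketch of the chunk-level cache check: re-use previously split and
// embedded chunks when available, otherwise split and embed from scratch.
async function addDocumentWithCache(fullFilePath, namespace, deps) {
  const { cachedVectorInformation, insertVectors, splitAndEmbed } = deps;
  const cacheResult = await cachedVectorInformation(fullFilePath);
  if (cacheResult.exists) {
    // Re-use the previously split chunks and their embeddings verbatim.
    return insertVectors(namespace, cacheResult.chunks);
  }
  return splitAndEmbed(fullFilePath, namespace);
}
```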
Sources: server/utils/vectorDbProviders/lance/index.js313-333
The text splitter is designed to be robust, but certain edge cases are handled:
If pageContent is empty or null, the vectorization process returns early without creating chunks:
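The guard amounts to a simple truthiness check on the document body before any splitting happens (sketch; the helper name is illustrative):

```javascript
// Sketch of the empty-content guard: documents with missing or blank
// pageContent produce no chunks and the provider returns early.
function canSplit(pageContent) {
  return typeof pageContent === "string" && pageContent.trim().length > 0;
}
```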
Sources: server/utils/vectorDbProviders/lance/index.js310
When user-configured chunk size exceeds model limits, the system logs a warning and uses the model limit:
[WARN] Text splitter chunk length of 2000 exceeds embedder model max of 1024. Will use 1024.
Sources: server/utils/TextSplitter/index.js53-55
The buildHeaderMeta method silently skips invalid or missing metadata fields rather than throwing errors, ensuring the splitting process continues even with incomplete metadata.