This page documents the text splitting and chunking system in AnythingLLM, which is responsible for breaking down large documents into smaller, manageable pieces (chunks) before vectorization. The system handles chunk size limits, overlap between chunks, metadata header injection, and embedding model-specific prefixes.
For information about how these chunks are vectorized and stored, see Document Vectorization Pipeline. For details about similarity search across these chunks, see Similarity Search and Reranking.
The TextSplitter class provides a configurable interface for splitting text documents into chunks. It wraps Langchain's RecursiveCharacterTextSplitter and adds AnythingLLM-specific features like metadata headers and model-specific prefixes.
Sources: server/utils/TextSplitter/index.js21-170
The TextSplitter accepts four key configuration parameters that control chunking behavior:
| Parameter | Type | Default | Description |
|---|---|---|---|
| chunkSize | number | 1000 | Maximum number of characters per chunk |
| chunkOverlap | number | 20 | Number of overlapping characters between consecutive chunks |
| chunkPrefix | string | "" | Prefix prepended to each chunk (model-specific requirement) |
| chunkHeaderMeta | Object | null | Metadata object to be formatted as an XML header |
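Putting the parameters together, configuration validation might be sketched as follows. This is a simplified stand-in for the real constructor in server/utils/TextSplitter/index.js; the option names follow the table above, and the overlap-versus-size check mirrors the guard the real class performs.

```javascript
// Minimal sketch of TextSplitter option handling (assumed shape; the real
// implementation lives in server/utils/TextSplitter/index.js).
function normalizeSplitterConfig({
  chunkSize = 1000,
  chunkOverlap = 20,
  chunkPrefix = "",
  chunkHeaderMeta = null,
} = {}) {
  // An overlap equal to or larger than the chunk itself would make
  // splitting impossible, so reject it outright.
  if (chunkOverlap >= chunkSize)
    throw new Error("chunkOverlap must be smaller than chunkSize");
  return { chunkSize, chunkOverlap, chunkPrefix, chunkHeaderMeta };
}
```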
The system uses a two-tier approach to determine the final chunk size, respecting both user preferences and embedder model limitations:
Sources: server/utils/TextSplitter/index.js47-57
The determineMaxChunkSize static method ensures chunks never exceed the embedding model's maximum input length:
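A sketch of that clamping behavior, reconstructed from the description and the warning message shown later on this page (the exact signature is an assumption):

```javascript
// Sketch: honor the user's preferred chunk size only up to the embedder's
// maximum input length, warning when the preference is clamped.
function determineMaxChunkSize(preferred = null, embedderLimit = 1000) {
  const prefValue = Number(preferred);
  const limit = Number(embedderLimit);
  if (!preferred || isNaN(prefValue) || prefValue > limit) {
    if (prefValue > limit)
      console.warn(
        `Text splitter chunk length of ${prefValue} exceeds embedder model max of ${limit}. Will use ${limit}.`
      );
    return limit;
  }
  return prefValue;
}
```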
Sources: server/utils/TextSplitter/index.js47-57
The system injects structured metadata at the beginning of each chunk to provide context to the embedding model. This metadata is formatted as XML for clear delineation.
Sources: server/utils/TextSplitter/index.js64-118 server/utils/TextSplitter/index.js135-147
The metadata header follows this structure:
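An illustrative header, with placeholder values in braces. The field names follow the mapping table below; the enclosing tag name is an assumption based on the XML formatting described above.

```
<document_metadata>
sourceDocument: {title of the document}
published: {publish date, if present}
source: {link:// or youtube:// source, if present}
</document_metadata>
```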
The buildHeaderMeta method selectively plucks relevant fields from the full document metadata:
| Source Field | Output Field | Condition |
|---|---|---|
| title | sourceDocument | Always included if present |
| published | published | Always included if present |
| chunkSource | source | Only if it starts with link:// or youtube:// |
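The field plucking in the table above can be sketched like this. The function shape is an assumption; the condition logic follows the table, and the real buildHeaderMeta in server/utils/TextSplitter/index.js may normalize values differently.

```javascript
// Sketch of selective metadata plucking (assumed shape; see buildHeaderMeta
// in server/utils/TextSplitter/index.js for the real logic).
function buildHeaderMeta(metadata = {}) {
  if (!metadata || Object.keys(metadata).length === 0) return null;
  const pluck = {};
  if (metadata.title) pluck.sourceDocument = metadata.title;
  if (metadata.published) pluck.published = metadata.published;
  if (
    typeof metadata.chunkSource === "string" &&
    (metadata.chunkSource.startsWith("link://") ||
      metadata.chunkSource.startsWith("youtube://"))
  )
    pluck.source = metadata.chunkSource; // the real code may normalize this value
  return Object.keys(pluck).length ? pluck : null;
}
```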
Sources: server/utils/TextSplitter/index.js64-118
Some embedding models require specific prefixes prepended to text for optimal performance. The system supports model-specific prefixes that are applied before the metadata header.
Sources: server/utils/TextSplitter/index.js125-147
The prefix is defined by the selected embedding engine and passed to the TextSplitter constructor. For example, some models like Cohere require a passage: prefix for document chunks.
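Since the page states the prefix is applied before the metadata header, the composition of a final chunk can be illustrated as below. The function name and ordering are assumptions drawn from that description.

```javascript
// Illustration of chunk decoration: model prefix first, then the metadata
// header, then the chunk text itself (ordering per the description above).
function decorateChunk(text, { chunkPrefix = "", header = "" } = {}) {
  return `${chunkPrefix}${header}${text}`;
}
```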
Sources: server/utils/vectorDbProviders/lance/index.js353 server/utils/vectorDbProviders/chroma/index.js272
All vector database providers follow a consistent pattern when using the TextSplitter. The integration occurs within the addDocumentToNamespace method of each provider.
Sources: server/utils/vectorDbProviders/lance/index.js341-355 server/utils/vectorDbProviders/chroma/index.js260-274 server/utils/vectorDbProviders/qdrant/index.js232-246
The following shows the typical integration pattern:
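A hedged sketch of that pattern, with a stubbed SystemSettings object standing in for the real database-backed model (getValueOrFallback is the lookup style used by the providers; this stub's behavior is simplified, and the resulting config would be passed to the TextSplitter constructor):

```javascript
// Stand-in for the real SystemSettings model: returns the stored value for a
// labeled setting, or the supplied fallback when nothing is configured.
const SystemSettings = {
  async getValueOrFallback({ label }, fallback) {
    const stored = {}; // pretend nothing is configured in the database
    return stored[label] ?? fallback;
  },
};

// Resolve the splitter options the way providers do: user settings first,
// clamped to the embedder's maximum chunk length.
async function buildSplitterConfig(embedderMaxChunkLength = 1000) {
  return {
    chunkSize: Math.min(
      await SystemSettings.getValueOrFallback(
        { label: "text_splitter_chunk_size" },
        embedderMaxChunkLength
      ),
      embedderMaxChunkLength
    ),
    chunkOverlap: await SystemSettings.getValueOrFallback(
      { label: "text_splitter_chunk_overlap" },
      20
    ),
  };
}
```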
Sources: server/utils/vectorDbProviders/lance/index.js341-355
The complete text splitting pipeline processes documents through multiple stages before vectorization:
Sources: server/utils/TextSplitter/index.js21-204 server/utils/vectorDbProviders/lance/index.js341-357
The underlying splitting engine from Langchain uses a recursive approach with multiple separator tiers:
The RecursiveCharacterTextSplitter attempts to split on separators in this order:
1. `\n\n` (paragraph breaks)
2. `\n` (line breaks)
3. `" "` (single spaces)
4. `""` (individual characters)

This hierarchy preserves natural document structure when possible, only falling back to character-level splitting when necessary to meet chunk size requirements.
Sources: server/utils/TextSplitter/index.js173-188
The overlap parameter ensures context continuity between chunks. When a document is split, the last chunkOverlap characters of one chunk become the first characters of the next chunk. This helps the embedding model maintain semantic continuity across chunk boundaries.
Default overlap is 20 characters, configurable via the text_splitter_chunk_overlap system setting.
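The overlap effect can be demonstrated with a naive fixed-width splitter. This is an illustration only: the real splitter is separator-aware rather than fixed-width, but the shared-boundary property is the same.

```javascript
// Naive fixed-width splitting with overlap: the tail of each chunk repeats
// at the head of the next one.
function splitWithOverlap(text, chunkSize, chunkOverlap) {
  const chunks = [];
  for (let i = 0; i < text.length; i += chunkSize - chunkOverlap)
    chunks.push(text.slice(i, i + chunkSize));
  return chunks;
}

const chunks = splitWithOverlap("abcdefghijklmnopqrstuvwxyz", 10, 4);
// Each chunk after the first begins with the last 4 characters of the
// previous chunk.
```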
Sources: server/utils/TextSplitter/index.js29 server/utils/vectorDbProviders/lance/index.js348-351
The text splitter retrieves configuration from the SystemSettings model, which provides both database-stored and environment variable settings.
Sources: server/utils/vectorDbProviders/lance/index.js343-351
| Setting Label | Purpose | Default Fallback |
|---|---|---|
| text_splitter_chunk_size | Maximum chunk size in characters | Model limit |
| text_splitter_chunk_overlap | Character overlap between chunks | 20 |
These settings can be modified through the admin UI or environment variables, and changes are validated by the updateENV pipeline documented in Configuration Management.
Sources: server/utils/vectorDbProviders/lance/index.js343-351 server/utils/vectorDbProviders/chroma/index.js262-270
While all vector database providers follow the same text splitting pattern, some have provider-specific constraints:
AstraDB has an additional chunk size constraint due to platform limitations. The chunk size is capped at 7500 characters regardless of other settings:
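The cap described above amounts to a simple clamp on whatever chunk size the normal resolution produces (sketch; the constant matches the stated limit, the function name is illustrative):

```javascript
// AstraDB-specific clamp: whatever the resolved chunk size is, it may not
// exceed the platform's 7500-character limit.
function astraChunkSize(resolvedChunkSize) {
  const ASTRA_MAX_CHUNK = 7500; // platform limitation
  return Math.min(resolvedChunkSize, ASTRA_MAX_CHUNK);
}
```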
Sources: server/utils/vectorDbProviders/astra/index.js210-218
Despite minor variations, all providers use the same core logic:
| Provider | Chunk Size Method | Overlap Setting | Metadata Header | Prefix Support |
|---|---|---|---|---|
| LanceDB | determineMaxChunkSize | ✓ | ✓ | ✓ |
| Chroma | determineMaxChunkSize | ✓ | ✓ | ✓ |
| Qdrant | determineMaxChunkSize | ✓ | ✓ | ✓ |
| Pinecone | determineMaxChunkSize | ✓ | ✓ | ✓ |
| Milvus | determineMaxChunkSize | ✓ | ✓ | ✓ |
| Weaviate | determineMaxChunkSize | ✓ | ✓ | ✓ |
| AstraDB | determineMaxChunkSize + cap | ✓ | ✓ | ✓ |
Sources: server/utils/vectorDbProviders/lance/index.js341-354 server/utils/vectorDbProviders/chroma/index.js260-273 server/utils/vectorDbProviders/qdrant/index.js232-245 server/utils/vectorDbProviders/pinecone/index.js157-170 server/utils/vectorDbProviders/milvus/index.js213-226 server/utils/vectorDbProviders/weaviate/index.js274-287 server/utils/vectorDbProviders/astra/index.js209-225
The text splitter fits into the larger document processing pipeline as follows:
Sources: server/utils/vectorDbProviders/lance/index.js301-399 server/utils/TextSplitter/index.js167-169
Choosing a chunk size involves tradeoffs in retrieval precision and storage:

| Smaller Chunks (< 500) | Larger Chunks (> 1500) |
|---|---|
| + More precise retrieval | + More context per chunk |
| + Less noise | + Fewer total vectors |
| - More vectors to store | - Less precise matching |
| - May lose context | - Potential information overload |
The default of 1000 characters provides a balance suitable for most use cases.
The system caches split and vectorized chunks to avoid re-processing documents. When a document is re-ingested, the provider first calls cachedVectorInformation(fullFilePath) to check for an existing cache. This caching occurs at the chunk level, storing the already-split text chunks along with their embeddings.
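The cache-first flow can be sketched as below. cachedVectorInformation is the real helper name; its return shape and the surrounding function names here are assumptions for illustration.

```javascript
// Sketch of the chunk-level cache check: re-use previously split and
// embedded chunks when available, otherwise split and embed from scratch.
async function addDocumentWithCache(fullFilePath, namespace, deps) {
  const { cachedVectorInformation, insertVectors, splitAndEmbed } = deps;
  const cacheResult = await cachedVectorInformation(fullFilePath);
  if (cacheResult.exists) {
    // Re-use the previously split chunks and their embeddings verbatim.
    return insertVectors(namespace, cacheResult.chunks);
  }
  return splitAndEmbed(fullFilePath, namespace);
}
```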
Sources: server/utils/vectorDbProviders/lance/index.js313-333
The text splitter is designed to be robust, but certain edge cases are handled:
If pageContent is empty or null, the vectorization process returns early without creating chunks:
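The guard amounts to a simple truthiness check on the document body before any splitting happens (sketch; the helper name is illustrative):

```javascript
// Sketch of the empty-content guard: documents with missing or blank
// pageContent produce no chunks and the provider returns early.
function canSplit(pageContent) {
  return typeof pageContent === "string" && pageContent.trim().length > 0;
}
```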
Sources: server/utils/vectorDbProviders/lance/index.js310
When user-configured chunk size exceeds model limits, the system logs a warning and uses the model limit:
[WARN] Text splitter chunk length of 2000 exceeds embedder model max of 1024. Will use 1024.
Sources: server/utils/TextSplitter/index.js53-55
The buildHeaderMeta method silently skips invalid or missing metadata fields rather than throwing errors, ensuring the splitting process continues even with incomplete metadata.