The Chat Streaming Engine is the core component responsible for processing chat messages in AnythingLLM. It orchestrates the Retrieval-Augmented Generation (RAG) pipeline by assembling context from multiple sources, managing chat history, and coordinating with LLM providers to generate responses. This document focuses on the backend streaming logic implemented primarily in server/utils/chats/stream.js.
For information about the frontend chat components, see page 7.2. For agent-specific functionality, see page 7.4. For embedded chat widgets, see page 7.5.
Sources: server/utils/chats/stream.js1-316
The chat engine operates through a multi-stage pipeline that processes user messages, gathers relevant context, and streams responses from LLM providers. The main entry point is the streamChatWithWorkspace function, which handles both standard workspace chats and thread-scoped conversations.
Sources: server/utils/chats/stream.js18-311
The streamChatWithWorkspace function is exported from server/utils/chats/stream.js and serves as the primary entry point for all chat processing. This function coordinates the entire RAG pipeline from context assembly to LLM invocation and response streaming.
| Parameter | Type | Source | Description |
|---|---|---|---|
| response | Express Response | HTTP endpoint | Used for Server-Sent Events streaming via writeResponseChunk() |
| workspace | Object | Workspace.get({slug}) | Contains chatProvider, chatModel, openAiTemp, openAiHistory, similarityThreshold, topN, chatMode, queryRefusalResponse |
| message | String | Request body | Raw user input before command/agent preprocessing |
| chatMode | String | Request body or workspace default | Validated against VALID_CHAT_MODE = ["chat", "query"] |
| user | Object \| null | JWT authentication | Contains id field for scoping history and permissions |
| thread | Object \| null | WorkspaceThread.get() | Contains id field for thread-scoped history |
| attachments | Array | Request body | Parsed file attachments from WorkspaceParsedFiles |
Sources: server/utils/chats/stream.js18-26 server/utils/chats/stream.js16
AnythingLLM supports two distinct chat modes that control how the system handles context and responses:
| Aspect | Query Mode | Chat Mode |
|---|---|---|
| Purpose | Strict factual answers from documents | Conversational interaction with optional context |
| Context Requirement | Required - refuses without context | Optional - can use general knowledge |
| Behavior without context | Returns refusal response | Proceeds with LLM's general knowledge |
| Use Case | Precise Q&A from documents | General conversation with RAG support |
Query mode enforces strict context requirements through two validation checkpoints. Both checkpoints persist a refusal chat record with include: false to maintain conversation continuity.
Key Design Detail: Refusal responses use include: false so they appear in chat history but don't consume context window space in future interactions.
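A minimal sketch of this checkpoint logic, assuming simplified shapes for the workspace object and the persisted record (the real implementation streams the refusal via writeResponseChunk and saves through WorkspaceChats.new; the fallback message text here is illustrative):

```javascript
// Hypothetical sketch of the query-mode refusal checkpoint.
// Returns the refusal payload to stream and persist, or null to proceed.
function queryModeRefusal(workspace, contextTexts) {
  const isQueryMode = workspace.chatMode === "query";
  if (!isQueryMode || contextTexts.length > 0) return null;

  const text =
    workspace.queryRefusalResponse ??
    "There is no relevant information in this workspace to answer your query.";

  // include: false keeps the refusal visible in the UI history while
  // excluding it from the context window of future prompts.
  return { text, type: "textResponse", include: false };
}
```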
Sources: server/utils/chats/stream.js60-92 server/utils/chats/stream.js200-227
The chat engine assembles context from four distinct sources, executed in parallel where possible to optimize performance. The final context is a merged collection from all sources.
Sources: server/utils/chats/stream.js102-196
Pinned documents (workspace_documents.pinned = true) are always included in context regardless of relevance to the query. The DocumentManager class handles token-aware limiting to prevent context overflow.
Implementation Details:
- DocumentManager.pinnedDocs() queries workspace_documents where pinned = true
- maxTokens caps pinned content at ~80% of LLMConnector.promptWindowLimit() to reserve space for chat history and vector results
- sourceIdentifier(doc) generates "title:{title}-timestamp:{published}" for deduplication
- pinnedDocIdentifiers[] is passed to performSimilaritySearch() via the filterIdentifiers parameter

Sources: server/utils/chats/stream.js115-132 server/utils/chats/index.js107-110
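A sketch of the token-aware gathering described above, assuming a doc shape with pageContent and token_count_estimate fields; the method names mirror the text, but this loop is illustrative rather than the actual DocumentManager internals:

```javascript
// Accumulate pinned docs until ~80% of the provider's context window is
// consumed, reserving the remainder for chat history and vector results.
async function gatherPinnedDocs(documentManager, llmConnector) {
  const maxTokens = Math.floor(llmConnector.promptWindowLimit() * 0.8);
  const contextTexts = [];
  const pinnedDocIdentifiers = [];
  let used = 0;

  for (const doc of await documentManager.pinnedDocs()) {
    if (used + doc.token_count_estimate > maxTokens) break;
    used += doc.token_count_estimate;
    contextTexts.push(doc.pageContent);
    // Identifier format matches the deduplication scheme documented above.
    pinnedDocIdentifiers.push(`title:${doc.title}-timestamp:${doc.published}`);
  }
  return { contextTexts, pinnedDocIdentifiers };
}
```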
Parsed files are user-uploaded attachments that are scoped to the workspace, thread, or user session. These are retrieved via WorkspaceParsedFiles.getContextFiles().
Sources: server/utils/chats/stream.js135-148
Vector search is conditionally executed based on embeddingsCount and uses workspace-scoped configuration parameters.
Workspace Configuration Parameters:
| Field | Default | Validation | Vector DB Param |
|---|---|---|---|
| workspace.similarityThreshold | 0.25 | Float: 0.0-1.0 | similarityThreshold |
| workspace.topN | 4 | Int: ≥ 1 | topN |
| workspace.vectorSearchMode | "default" | "default" \| "rerank" | rerank boolean |
Vector DB Method Signature:
- Returns {contextTexts: string[], sources: object[], message: string|null}
- A non-null message field indicates search failure (connection error, timeout, etc.)
- filterIdentifiers prevents re-fetching documents already in pinnedDocIdentifiers[]

Sources: server/utils/chats/stream.js61 server/utils/chats/stream.js150-178 server/models/workspace.js80-94
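A sketch of how these workspace parameters flow into the search call. The VectorDb stub here only models the documented result shape; the exact option names and the namespace/input parameters are assumptions:

```javascript
// Stand-in for a vector DB adapter; a real adapter queries the store.
// message is non-null only when the search fails.
const VectorDb = {
  async performSimilaritySearch({ similarityThreshold, topN, filterIdentifiers }) {
    return { contextTexts: [], sources: [], message: null };
  },
};

async function searchWorkspace(workspace, input, pinnedDocIdentifiers) {
  const result = await VectorDb.performSimilaritySearch({
    namespace: workspace.slug,
    input,
    similarityThreshold: workspace.similarityThreshold ?? 0.25,
    topN: workspace.topN ?? 4,
    filterIdentifiers: pinnedDocIdentifiers, // skip already-pinned documents
    rerank: workspace.vectorSearchMode === "rerank",
  });
  if (result.message !== null) throw new Error(result.message); // search failure
  return result;
}
```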
Recent chat history is retrieved and converted to prompt format for the LLM:
The messageLimit defaults to the workspace's openAiHistory setting (default: 20 messages).
Sources: server/utils/chats/stream.js102-107 server/utils/chats/index.js61-82
The fillSourceWindow utility (imported from server/utils/helpers/chat/index.js) implements intelligent context augmentation by backfilling from chat history when vector search returns fewer than topN results.
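A minimal sketch of the backfill idea, assuming simplified shapes for search results and history records; the parameter names and dedup-by-text approach are assumptions, not the helper's exact signature:

```javascript
// When vector search returns fewer than nDocs snippets, pad contextTexts
// with source texts remembered from recent chat history, newest first.
function fillSourceWindow({ nDocs = 4, searchResults = [], history = [] }) {
  const contextTexts = searchResults.map((s) => s.text);
  if (contextTexts.length >= nDocs) return contextTexts;

  // Walk history newest-first, harvesting previously cited source texts.
  for (const chat of [...history].reverse()) {
    for (const src of chat.sources ?? []) {
      if (contextTexts.length >= nDocs) break;
      if (!contextTexts.includes(src.text)) contextTexts.push(src.text);
    }
  }
  return contextTexts;
}
```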
Sources: server/utils/helpers/chat/index.js
The context and sources arrays are merged differently to prevent UX confusion:
| Array | Merge Strategy | Rationale |
|---|---|---|
contextTexts | Includes filledSources.contextTexts | LLM needs full historical context for coherent responses |
sources | Only vectorSearchResults.sources | User sees only current search citations, not backfilled sources |
Comment from source code (lines 188-196):
"Why does contextTexts get all the info, but sources only get current search? This is to give the ability of the LLM to 'comprehend' a contextual response without populating the Citations under a response with documents the user 'thinks' are irrelevant due to how we manage backfilling of the context to keep chats with the LLM more correct in responses."
This prevents GitHub issues like "LLM citing document that has no answer in it" while keeping answers accurate.
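The asymmetric merge can be sketched as follows; the variable names mirror the table above, and the object shapes are illustrative:

```javascript
// LLM prompt context and user-facing citations are built from different sets.
function mergeContext(vectorSearchResults, filledSources) {
  return {
    // The LLM sees everything, including history-backfilled snippets.
    contextTexts: [...filledSources.contextTexts],
    // The user sees citations only from the current search.
    sources: [...vectorSearchResults.sources],
  };
}
```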
Sources: server/utils/chats/stream.js180-196
Before context assembly, messages undergo preprocessing to handle special commands and agent invocations.
Sources: server/utils/chats/stream.js28-51 server/utils/chats/index.js8-37
Slash command processing occurs in two phases: built-in command execution and user preset substitution.
VALID_COMMANDS Registry (server/utils/chats/index.js:8-10):
grepCommand Implementation:
- User presets are fetched via SlashCommandPresets.getUserPresets(user?.id)
- Each command is matched against the start of the message with a case-insensitive new RegExp(`^(${cmd})`, "i")
- Built-in commands (e.g., "/reset") execute directly
- Matching presets are substituted via updatedMessage.replace(regex, preset.prompt)

Database Schema:

- The slash_command_presets table stores user-defined commands
- Key columns: command (e.g., "/summarize"), prompt (replacement text), uid (user ID), userId (FK)

Sources: server/utils/chats/stream.js28-40 server/utils/chats/index.js8-37 server/prisma/schema.prisma283-295
The grepAgents function detects @agent mentions and transitions to the AIbitat framework execution path:
Agent Detection Flow:
- grepAgents() parses updatedMessage for the @agent pattern
- On detection, creates a WorkspaceAgentInvocations record with a uuid
- Signals the client to open an agent session (agentInitWebsocketConnection()) and returns true
- Otherwise returns false and the normal chat flow continues

Database Persistence:

- The workspace_agent_invocations table tracks all agent sessions
- Key columns: uuid, prompt, closed, user_id, thread_id, workspace_id

Sources: server/utils/chats/stream.js43-51 server/utils/chats/agents.js (imported at line 7), server/prisma/schema.prisma201-215
After context assembly, the engine prepares the final prompt for the LLM by compressing messages to fit within the token limit.
The compressMessages method ensures the final prompt fits within the LLM's context window by intelligently truncating chat history while preserving system prompt, user prompt, and context.
chatPrompt Function (server/utils/chats/index.js:91-100):
Prompt Construction Hierarchy:
- System prompt: workspace.openAiPrompt → SystemSettings.saneDefaultSystemPrompt (fallback)
- SystemPromptVariables.expandSystemPromptVariables() performs variable substitution (e.g., {{username}}, {{date}})
- Chat history is supplied in {role: "user"|"assistant", content: string}[] format
- The current user prompt is the preprocessed updatedMessage

Compression Algorithm:

- LLMConnector.compressMessages() calculates token counts for each component
- If the total exceeds promptWindowLimit(), it truncates chat history from oldest to newest

Sources: server/utils/chats/stream.js231-240 server/utils/chats/index.js91-100
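A toy version of the oldest-first truncation described above. The length/4 token heuristic and function shape are assumptions, purely to illustrate the eviction order:

```javascript
// Crude token estimate; real connectors use a proper tokenizer.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Drop the oldest history messages until the estimated total fits the
// window; system prompt and user prompt are always preserved.
function compressMessages({ systemPrompt, history, userPrompt }, promptWindowLimit) {
  const fixed = estimateTokens(systemPrompt) + estimateTokens(userPrompt);
  const kept = [...history];
  while (
    kept.length > 0 &&
    fixed + kept.reduce((n, m) => n + estimateTokens(m.content), 0) > promptWindowLimit
  ) {
    kept.shift(); // evict the oldest message first
  }
  return [
    { role: "system", content: systemPrompt },
    ...kept,
    { role: "user", content: userPrompt },
  ];
}
```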
The engine supports both streaming and non-streaming responses depending on the LLM provider's capabilities.
The engine selects response mode based on LLMConnector.streamingEnabled(), which varies by provider implementation.
Sources: server/utils/chats/stream.js244-275
Sources: server/utils/chats/stream.js265-272
After the LLM response completes, the chat engine persists the conversation to the workspace_chats table via WorkspaceChats.new().
workspace_chats Table Schema (server/prisma/schema.prisma:186-199):
| Column | Type | Relation | Description |
|---|---|---|---|
| id | Int @id | Primary key | Auto-increment |
| workspaceId | Int | workspaces | Namespace for chat |
| prompt | String | - | User input (not preprocessed) |
| response | String | - | JSON.stringify({text, sources, type, attachments, metrics}) |
| include | Boolean | - | If false, excluded from recentChatHistory() |
| user_id | Int? | users | Multi-user mode scoping |
| thread_id | Int? | - | Thread scoping (no FK to avoid migration) |
| api_session_id | String? | - | API client partition key |
| feedbackScore | Boolean? | - | User feedback (true = positive, false = negative, null = none) |
| createdAt | DateTime | - | Auto-generated timestamp |
| lastUpdatedAt | DateTime | - | Auto-updated timestamp |
Key Implementation Notes:
- The response field stores a JSON-stringified object (not a Prisma JSON type, to maintain SQLite compatibility)
- include: false is used for refusal responses in query mode to exclude them from future context
- thread_id has no foreign key to prevent a full table migration when workspace_threads was added
- api_session_id enables stateful conversations for API clients without user accounts

Sources: server/utils/chats/stream.js277-309 server/prisma/schema.prisma186-199 server/models/workspaceChats.js5-31
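A sketch of the persisted record, assuming simplified inputs; the field names follow the schema table, but this builder function itself is hypothetical:

```javascript
// Assemble a workspace_chats row; response is JSON-stringified rather
// than stored as a Prisma JSON type, for SQLite compatibility.
function buildChatRecord({ workspaceId, prompt, text, sources, metrics, user, thread }) {
  return {
    workspaceId,
    prompt, // raw user input, not the preprocessed message
    response: JSON.stringify({ text, sources, type: "chat", attachments: [], metrics }),
    user_id: user?.id ?? null,
    thread_id: thread?.id ?? null,
    include: true,
  };
}
```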
The recentChatHistory function retrieves chat history filtered by workspace, user, thread, and include: true:
Returns { rawHistory, chatHistory } where:
- rawHistory: array of raw workspace_chats records
- chatHistory: converted to {role, content}[] format via convertToPromptHistory()

Sources: server/utils/chats/index.js61-82
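The conversion step can be sketched as follows; this is a simplified take on convertToPromptHistory(), assuming each row stores its text in the JSON response column:

```javascript
// Expand each workspace_chats row into a user/assistant message pair.
function convertToPromptHistory(rawHistory) {
  return rawHistory.flatMap((chat) => {
    const { text } = JSON.parse(chat.response);
    return [
      { role: "user", content: chat.prompt },
      { role: "assistant", content: text },
    ];
  });
}
```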
The sourceIdentifier function generates unique identifiers for source documents to prevent duplication:
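A simplified form of this helper, matching the "title:{title}-timestamp:{published}" format noted in the pinned-documents section; the default values here are an assumption:

```javascript
// Build a deduplication key from a document's title and publish timestamp.
function sourceIdentifier(doc = {}) {
  const { title = "unknown", published = "unknown" } = doc;
  return `title:${title}-timestamp:${published}`;
}

// Identifiers collected into pinnedDocIdentifiers[] are later passed to
// performSimilaritySearch() as filterIdentifiers to deduplicate results.
```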
Sources: server/utils/chats/index.js107-110
The writeResponseChunk utility writes streaming response chunks to the HTTP response, handling Server-Sent Events (SSE) formatting.
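A minimal sketch of SSE chunk writing: each chunk is serialized as a `data:` line terminated by a blank line, per the SSE wire format. The payload fields shown are illustrative; the real helper lives in server/utils/helpers/chat/responses.js:

```javascript
// Serialize one streaming chunk in Server-Sent Events framing.
function writeResponseChunk(response, data) {
  response.write(`data: ${JSON.stringify(data)}\n\n`);
}

// Usage with a stub Express-like response object:
const written = [];
const fakeResponse = { write: (s) => written.push(s) };
writeResponseChunk(fakeResponse, { uuid: "123", type: "textResponseChunk", textResponse: "Hel" });
```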
Sources: server/utils/helpers/chat/responses.js (referenced at server/utils/chats/stream.js6)
A parallel implementation for embedded chat widgets exists in server/utils/chats/embed.js. The streamChatWithForEmbed function reuses most of the RAG pipeline logic but differs in authentication, configuration, and persistence.
| Aspect | streamChatWithWorkspace | streamChatWithForEmbed |
|---|---|---|
| Function Signature | (response, workspace, message, chatMode, user, thread, attachments) | (response, embed, message, sessionId, {promptOverride, modelOverride, temperatureOverride, username}) |
| Entry Point | POST /workspace/:slug/stream-chat | POST /embed/:embedId/stream-chat |
| Authentication | JWT or instance password | sessionId UUID + allowlist domain check |
| Configuration Source | workspaces table | embed_configs + workspace relation |
| Persistence Target | workspace_chats table | embed_chats table |
| History Function | recentChatHistory({user, workspace, thread, messageLimit}) | recentEmbedChatHistory(sessionId, embed, messageLimit) |
| Source Handling | All sources included in response | filterSources() removes sources from /history endpoint |
| Rate Limiting | User daily limit or none | max_chats_per_day, max_chats_per_session |
| Middleware Stack | validApiKey or flexUserRoleValid | validEmbedConfig, setConnectionMeta, canRespond |
| Command Support | Slash commands + agent invocations | No commands/agents |
| Override Support | None (uses workspace config) | allow_prompt_override, allow_model_override, allow_temperature_override |
1. Configuration Overrides (Lines 22-28):
2. Rate Limiting (server/utils/middleware/embedMiddleware.js:107-157):
- max_chats_per_day: total chats for the embed across all sessions (last 24 hours)
- max_chats_per_session: total chats for a specific sessionId (last 24 hours)
- Exceeding either limit returns a 429 status with an errorMsg field for user-facing display

3. Source Filtering (server/models/embedChats.js:47-53):
4. Connection Metadata (server/utils/middleware/embedMiddleware.js:22-28):
Sources: server/utils/chats/embed.js11-223 server/utils/middleware/embedMiddleware.js9-171 server/models/embedChats.js47-53 server/endpoints/embed/index.js19-67
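The two rate-limit checks above can be sketched as below; the counts would come from embed_chats queries over the last 24 hours, and the function shape and error messages are illustrative rather than the middleware's actual helpers:

```javascript
// Evaluate embed rate limits; a falsy limit (null/0) means unlimited.
function canRespond(embed, counts) {
  const { max_chats_per_day, max_chats_per_session } = embed;
  if (max_chats_per_day && counts.dayCount >= max_chats_per_day)
    return { ok: false, status: 429, errorMsg: "Daily chat limit reached." };
  if (max_chats_per_session && counts.sessionCount >= max_chats_per_session)
    return { ok: false, status: 429, errorMsg: "Session chat limit reached." };
  return { ok: true };
}
```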
The chat engine reads configuration from the workspaces table:
| workspace Column | Default | Validation | Description |
|---|---|---|---|
| chatProvider | null | String or null | LLM provider key (e.g., "openai") |
| chatModel | null | String or null | Model identifier |
| chatMode | "chat" | "chat" \| "query" | Determines context requirement |
| openAiTemp | null | Float or null (≥ 0) | Temperature for LLM |
| openAiHistory | 20 | Int (≥ 0) | Message limit for history |
| openAiPrompt | null | String or null | System prompt (falls back to saneDefaultSystemPrompt) |
| similarityThreshold | 0.25 | Float (0.0-1.0) | Vector search cutoff |
| topN | 4 | Int (≥ 1) | Number of vector results |
| queryRefusalResponse | null | String or null | Custom refusal message |
| vectorSearchMode | "default" | "default" \| "rerank" | Enables LanceDB reranking |
These are validated by Workspace.validations object and applied in Workspace.update().
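Two of these validators can be sketched in the clamp-to-default style described above; the exact behavior of Workspace.validations is an assumption, and only the documented defaults and ranges are reused here:

```javascript
// Normalize incoming values, falling back to the documented defaults
// when input is missing or out of range.
const validations = {
  similarityThreshold: (v) => {
    const n = parseFloat(v);
    return Number.isFinite(n) && n >= 0 && n <= 1 ? n : 0.25;
  },
  topN: (v) => {
    const n = parseInt(v, 10);
    return Number.isInteger(n) && n >= 1 ? n : 4;
  },
};
```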
Sources: server/models/workspace.js60-131 server/prisma/schema.prisma126-155
The chat engine handles errors at multiple checkpoints:
1. Vector Search Failures
2. Empty Context in Query Mode
- Streams type: "textResponse" carrying the queryRefusalResponse
- Persists the refusal to workspace_chats with include: false

3. Command Execution Errors

- Only commands registered in VALID_COMMANDS are executed
- resetMemory returns success/error via writeResponseChunk

4. LLM Provider Errors

- Surfaced from getChatCompletion() or handleStream()

Sources: server/utils/chats/stream.js65-92 server/utils/chats/stream.js168-178 server/utils/chats/stream.js200-227
The chat engine optimizes performance through:
- Parallel context gathering where possible (pinned documents, parsed files, vector search)
- Token-aware context limiting via LLMConnector.promptWindowLimit()

Sources: server/utils/chats/stream.js115-196