This page covers the document management system within workspaces, including how documents are associated with workspaces, the pinned documents feature for persistent context, and the watched documents feature for automatic synchronization. For information about the initial document ingestion and vectorization pipeline, see Document Ingestion. For details about vector similarity search during chat, see Similarity Search and Reranking.
The document management system in workspaces provides three primary capabilities:
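- Document association: link documents from the documents/ folder to specific workspaces for RAG (Retrieval-Augmented Generation)
- Pinned documents: keep selected documents in the chat context at all times
- Watched documents: automatically re-synchronize documents from external sources

All document-workspace relationships are tracked in the workspace_documents table, which serves as the central registry for document management.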
Sources: server/prisma/schema.prisma:26-39, server/models/workspace.js:1-10
The workspace_documents table maintains the relationship between workspaces and their embedded documents:
| Field | Type | Description |
|---|---|---|
| id | Int | Primary key |
| docId | String | Unique identifier for the document |
| filename | String | Original filename |
| docpath | String | Path relative to documents/ folder (e.g., custom-documents/file.json) |
| workspaceId | Int | Foreign key to workspaces table |
| metadata | String | JSON string of additional metadata |
| pinned | Boolean | Whether document is pinned (always in context) |
| watched | Boolean | Whether document is watched for auto-sync |
| createdAt | DateTime | Creation timestamp |
| lastUpdatedAt | DateTime | Last update timestamp |
Diagram: workspace_documents Table Relationships
The docpath field stores the relative path from the documents/ directory, typically in format folder-name/filename-uuid.json. The docId serves as a unique identifier that can be used to track the document across the system.
Sources: server/prisma/schema.prisma:26-39, server/prisma/schema.prisma:297-315
Documents are stored in a hierarchical folder structure within the documents/ directory:
Diagram: Document Storage Hierarchy
Each JSON file contains:
- pageContent: The full text content of the document
- Metadata fields: title, docAuthor, description, docSource, chunkSource, published, wordCount, token_count_estimate

The documents/ path resolves to:

- server/storage/documents
- $STORAGE_DIR/documents

Sources: server/utils/files/index.js:6-9, server/utils/files/index.js:462-474
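To make the file layout concrete, the sketch below shows what a stored document object could look like and how a relative docpath maps onto a file under a documents root. The field names come from the list above. Every value is invented for illustration, and resolveDocumentFile is a hypothetical helper, not a function from the codebase.

```javascript
const path = require("path");

// Hypothetical document, shaped like the JSON files described above.
// All values are made up.
const exampleDocument = {
  title: "file.txt",
  docAuthor: "Unknown",
  description: "Example description",
  docSource: "Uploaded by the user",
  chunkSource: "",
  published: "1/1/2024, 12:00:00 PM",
  wordCount: 5,
  token_count_estimate: 7,
  pageContent: "The full text goes here.",
};

// Hypothetical helper: join a documents root with a docpath such as
// "custom-documents/file.json" to get the location of the JSON file.
function resolveDocumentFile(documentsRoot, docpath) {
  return path.join(documentsRoot, docpath);
}

const location = resolveDocumentFile(
  "server/storage/documents",
  "custom-documents/file.json"
);
```

Passing the root as a parameter keeps the sketch neutral about which of the two locations is active in a given deployment.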
The /v1/workspace/:slug/update-embeddings endpoint manages document associations:
Diagram: Document Addition/Removal Flow
Request body format:
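A request to this endpoint might be built as in the sketch below. The adds and deletes field names, the base URL, and the Bearer authorization scheme are assumptions made for illustration, so check them against the endpoint source before relying on them.

```javascript
// Sketch only: `adds`/`deletes`, the base URL, and the Bearer scheme are
// assumptions, not confirmed details of the API.
function buildUpdateEmbeddingsRequest(baseUrl, apiKey, slug, changes = {}) {
  const { adds = [], deletes = [] } = changes;
  return {
    url: `${baseUrl}/v1/workspace/${encodeURIComponent(slug)}/update-embeddings`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      // Each entry is a docpath relative to the documents/ folder.
      body: JSON.stringify({ adds, deletes }),
    },
  };
}

const embedRequest = buildUpdateEmbeddingsRequest(
  "http://localhost:3001/api", // hypothetical base URL
  "MY-API-KEY", // placeholder key
  "my-workspace",
  { adds: ["custom-documents/file.json"] }
);
// To send it: await fetch(embedRequest.url, embedRequest.options);
```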
The system performs these operations:
- For removals: deletes workspace_documents records and associated vector embeddings
- For additions: creates workspace_documents records and embeds documents if not already vectorized

Sources: server/endpoints/api/workspace/index.js:448-523, server/models/documents.js
Documents can be uploaded and automatically added to workspaces using the addToWorkspaces parameter:
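An upload with workspace association could be assembled roughly as follows. The file form field name and the comma-separated slug format for addToWorkspaces are assumptions, and the sketch relies on the global FormData and Blob available in Node 18 and later.

```javascript
// Sketch only: the `file` field name and the comma-separated format of
// `addToWorkspaces` are assumptions. Requires Node 18+ (global FormData/Blob).
function buildUploadForm(fileName, contents, workspaceSlugs) {
  const form = new FormData();
  form.append("file", new Blob([contents]), fileName);
  form.append("addToWorkspaces", workspaceSlugs.join(","));
  return form;
}

const uploadForm = buildUploadForm("notes.txt", "hello world", [
  "workspace-a",
  "workspace-b",
]);
// To send it (baseUrl and apiKey are placeholders):
// await fetch(`${baseUrl}/v1/document/upload`, {
//   method: "POST",
//   headers: { Authorization: `Bearer ${apiKey}` },
//   body: uploadForm,
// });
```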
This triggers:
Sources: server/endpoints/api/document/index.js:96-148, server/endpoints/api/document/index.js:138-142
Pinned documents are always included in the chat context window, regardless of vector similarity search results. This feature ensures critical information is available for every query.
Diagram: Pinned Documents in RAG Pipeline
The DocumentManager.pinnedDocs() method retrieves all pinned documents for a workspace:
- Queries workspace_documents WHERE workspaceId = X AND pinned = true
- Applies the maxTokens limit (defaults to 80% of LLM context window)
- Returns each document's pageContent and metadata

During chat processing, pinned documents are:

- Added to the contextTexts array
- Added to filterIdentifiers to prevent duplication

Sources: server/utils/chats/stream.js:115-132, server/utils/DocumentManager.js
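The token-limited selection of pinned documents can be pictured with the sketch below. It is not the DocumentManager implementation. The selectPinnedDocs name and the policy of skipping any document that would overflow the budget are assumptions, and token_count_estimate is the metadata field stored in each document file.

```javascript
// Hypothetical sketch: keep pinned documents, in order, while they fit
// within maxTokens. Skipping oversized documents is an assumed policy.
function selectPinnedDocs(pinnedDocs, maxTokens) {
  const selected = [];
  let usedTokens = 0;
  for (const doc of pinnedDocs) {
    const cost = doc.token_count_estimate ?? 0;
    if (usedTokens + cost > maxTokens) continue; // would exceed the budget
    selected.push(doc);
    usedTokens += cost;
  }
  return { selected, usedTokens };
}

const picked = selectPinnedDocs(
  [
    { title: "a", token_count_estimate: 500 },
    { title: "b", token_count_estimate: 400 },
    { title: "c", token_count_estimate: 100 },
  ],
  600
);
// picked.selected holds "a" and "c"; "b" is skipped because 500 + 400 > 600.
```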
Use the /v1/workspace/:slug/update-pin endpoint:
The system:
- Finds the workspace_documents record by workspaceId and docpath
- Sets the pinned field to the specified boolean value

Sources: server/endpoints/api/workspace/index.js:525-591
The sourceIdentifier() function creates unique identifiers for documents:
This identifier is used to populate filterIdentifiers, which prevents the vector search from returning documents that are already pinned, avoiding duplicate context.
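One plausible shape for such an identifier is sketched below. Combining the title and published fields, and falling back to a random UUID when either is missing, are assumptions about the approach rather than a copy of the function in server/utils/chats/index.js.

```javascript
const { randomUUID } = require("crypto");

// Sketch of an identifier builder. The title + published combination and
// the UUID fallback are assumptions.
function sourceIdentifier(sourceDocument) {
  if (!sourceDocument?.title || !sourceDocument?.published) return randomUUID();
  return `title:${sourceDocument.title}-timestamp:${sourceDocument.published}`;
}

// Collect identifiers of pinned documents so a later search can skip them.
const pinnedExamples = [
  { title: "handbook.pdf", published: "1/1/2024, 12:00:00 PM" },
];
const pinnedIdentifiers = pinnedExamples.map(sourceIdentifier);
```

Because the fallback is random, two documents without a title or publish date never collide, but they also can never be matched against each other for deduplication.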
Sources: server/utils/chats/index.js:107-110, server/utils/chats/stream.js:158
The watched documents feature enables automatic re-synchronization of documents from external sources. When enabled, the system periodically checks watched documents and updates them if they've become stale.
Diagram: Document Watch and Sync System
Sources: server/models/documentSyncQueue.js:10-90, server/jobs/sync-watched-documents.js:9-150
The DocumentSyncQueue.canWatch() method determines if a document is eligible for watching based on its chunkSource metadata field:
| Prefix | Description |
|---|---|
| link:// | Web pages (with .html title extension) |
| youtube:// | YouTube videos |
| confluence:// | Confluence wiki pages |
| github:// | GitHub repository files |
| gitlab:// | GitLab repository files |
| drupalwiki:// | Drupal wiki documents |
A document can only be watched if its chunkSource starts with one of these prefixes. This information is stored during the initial document ingestion.
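The prefix rule can be expressed as a small check. This is a hypothetical re-implementation for illustration, not the code of DocumentSyncQueue.canWatch(), and it does not model the additional .html title condition noted for web pages.

```javascript
// Prefixes taken from the table above.
const WATCHABLE_PREFIXES = [
  "link://",
  "youtube://",
  "confluence://",
  "github://",
  "gitlab://",
  "drupalwiki://",
];

// Hypothetical eligibility check based only on the chunkSource prefix.
function canWatch({ chunkSource = null } = {}) {
  if (typeof chunkSource !== "string") return false;
  return WATCHABLE_PREFIXES.some((prefix) => chunkSource.startsWith(prefix));
}
```

A locally uploaded file with an empty chunkSource fails this check, which matches the rule that only documents from external sources can be watched.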
Sources: server/models/documentSyncQueue.js:55-66
Each watched document has one associated document_sync_queues record:
| Field | Purpose |
|---|---|
| workspaceDocId | Foreign key to workspace_documents.id (UNIQUE) |
| staleAfterMs | Milliseconds before document is considered stale (default: 604800000 = 7 days) |
| nextSyncAt | DateTime when next sync should occur |
| lastSyncedAt | DateTime of last successful sync |
The system ensures no duplicate queues for the same document across multiple workspaces by checking if any workspace document with the same filename already has a queue.
Sources: server/prisma/schema.prisma:297-315, server/models/documentSyncQueue.js:72-89
The background worker (sync-watched-documents.js) runs periodically:
1. Finds queue records where nextSyncAt <= NOW()
2. Updates the stored document in the documents/ folder
3. Sets nextSyncAt = NOW() + staleAfterMs
4. Creates a document_sync_executions record

If a document fails sync 5 consecutive times (controlled by maxRepeatFailures), it is automatically unwatched to prevent repeated failures.
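The scheduling arithmetic and the failure cutoff can be modeled in memory as below. The consecutiveFailures and watched properties on the queue object are hypothetical bookkeeping for this sketch, and rescheduling after a failed attempt is an assumption. The real worker persists its state in the database tables described above.

```javascript
const MAX_REPEAT_FAILURES = 5; // stands in for maxRepeatFailures

// A queue record is due once its nextSyncAt has passed.
function isDue(queue, now) {
  return queue.nextSyncAt <= now;
}

// Hypothetical update applied after one sync attempt.
function recordSyncResult(queue, succeeded, now) {
  const consecutiveFailures = succeeded
    ? 0
    : (queue.consecutiveFailures ?? 0) + 1;
  return {
    ...queue,
    consecutiveFailures,
    lastSyncedAt: succeeded ? now : queue.lastSyncedAt,
    // Assumed: the next attempt is scheduled whether or not this one worked.
    nextSyncAt: new Date(now.getTime() + queue.staleAfterMs),
    watched: consecutiveFailures < MAX_REPEAT_FAILURES,
  };
}
```

With the default staleAfterMs of 604800000, a document synced now becomes due again exactly seven days later.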
Sources: server/jobs/sync-watched-documents.js:9-150, server/models/documentSyncQueue.js:22
The watch feature must be enabled at the system level:
To watch a document:
This:
- Creates a document_sync_queues record
- Sets workspace_documents.watched = true
- Schedules the next sync using staleAfterMs

To unwatch:
This:
- Deletes the document_sync_queues record (CASCADE delete)
- Sets workspace_documents.watched = false for all workspaces using that document

Sources: server/models/documentSyncQueue.js:34-38, server/models/documentSyncQueue.js:72-120
The viewLocalFiles() function provides a hierarchical view of all documents available for embedding:
Diagram: Document Listing Flow
The function returns a structure:
Sources: server/utils/files/index.js:35-100
Pinned Workspaces Lookup:
This query runs once per folder and returns a map of {filename: [workspaceId1, workspaceId2]}.
Watched Documents Lookup:
Returns a map of {filename: workspaceId}. Note: Only one workspace ID is stored per watched document since the queue is shared.
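The two lookup shapes can be built from workspace_documents-like rows as sketched below. The function names and the in-memory rows are hypothetical. The actual lookups are database queries scoped to one folder.

```javascript
// Hypothetical: filename -> list of workspace IDs that pin the document.
function pinnedWorkspacesByFilename(rows) {
  const map = {};
  for (const row of rows) {
    if (!row.pinned) continue;
    if (!Object.prototype.hasOwnProperty.call(map, row.filename)) {
      map[row.filename] = [];
    }
    map[row.filename].push(row.workspaceId);
  }
  return map;
}

// Hypothetical: filename -> a single workspace ID, since the sync queue
// is shared and only one ID is kept per watched document.
function watchedWorkspaceByFilename(rows) {
  const map = {};
  for (const row of rows) {
    if (!row.watched) continue;
    if (!Object.prototype.hasOwnProperty.call(map, row.filename)) {
      map[row.filename] = row.workspaceId;
    }
  }
  return map;
}
```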
Sources: server/utils/files/index.js:304-353
For large files (>150 MB as defined by FILE_READ_SIZE_THRESHOLD), the system uses streaming to parse JSON without loading the entire file into memory:
The pageContent field is always stripped from metadata when listing documents to reduce response size.
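The size check and the metadata stripping can be sketched as follows. The exact byte value behind FILE_READ_SIZE_THRESHOLD, the strategy labels, and the shape of the listing entry are assumptions. Only the 150 MB figure and the removal of pageContent come from the description above.

```javascript
// 150 MB; whether the real constant uses 1024-based units is an assumption.
const FILE_READ_SIZE_THRESHOLD = 150 * 1024 * 1024;

// Decide how a document file should be read, based on its size in bytes.
function readStrategyFor(sizeInBytes) {
  return sizeInBytes > FILE_READ_SIZE_THRESHOLD ? "stream" : "read-whole-file";
}

// Hypothetical listing entry: metadata only, with pageContent removed.
function toListingEntry(name, doc) {
  const { pageContent, ...metadata } = doc;
  return { name, ...metadata };
}
```

Streaming matters here because parsing a multi-hundred-megabyte JSON string in one call holds both the raw text and the parsed object in memory at once.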
Sources: server/utils/files/index.js:374-460
The Document.removeDocuments() method handles document removal from a workspace:
Diagram: Document Removal Process
The removal process:
1. Finds the workspace_documents records matching workspaceId and docpath
2. Looks up the associated entries in the document_vectors table
3. Deletes the document_vectors records
4. Deletes the workspace_documents record

Important: This only removes the document from the specified workspace. The source JSON file in documents/ remains intact for use in other workspaces.
Sources: server/models/documents.js
The purgeDocument() function completely removes a document from the entire system:
- Deletes the vector-cache/{uuid}.json file
- Deletes the documents/{folder}/{filename}.json file
- Calls Document.removeDocuments() for every workspace

After purging, the document no longer exists in the system and must be re-uploaded and re-processed.
Sources: server/utils/files/purgeDocument.js:13-23, server/endpoints/api/system/index.js:207-272
The purgeFolder() function removes entire folders (except custom-documents):
This is useful for cleaning up connector-generated folders (e.g., GitHub repos, Confluence spaces).
Sources: server/utils/files/purgeDocument.js:33-84
During chat execution, documents are assembled into context in this priority order:
1. Pinned documents (DocumentManager.pinnedDocs())
2. Sources supplied by fillSourceWindow()
3. Vector similarity search results (VectorDb.performSimilaritySearch())

Diagram: Context Assembly in Chat Pipeline
The filterIdentifiers array prevents duplication:
- Populated with the identifiers of pinned documents
- Passed to VectorDb.performSimilaritySearch() as an exclusion filter

Sources: server/utils/chats/stream.js:94-196
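The deduplication idea can be shown with a simplified in-memory sketch. Here the exclusion is applied to a list of search results after the fact, whereas the pipeline passes filterIdentifiers into the vector search itself. The assembleContext name and the identify callback are hypothetical, and the fillSourceWindow() step is not modeled.

```javascript
// Hypothetical sketch: pinned documents go in first, then search results
// whose identifier is not already covered by a pinned document.
function assembleContext(pinnedDocs, searchResults, identify) {
  const contextTexts = [];
  const filterIdentifiers = [];
  for (const doc of pinnedDocs) {
    contextTexts.push(doc.pageContent);
    filterIdentifiers.push(identify(doc));
  }
  const excluded = new Set(filterIdentifiers);
  for (const result of searchResults) {
    if (excluded.has(identify(result))) continue; // already in context
    contextTexts.push(result.pageContent);
  }
  return { contextTexts, filterIdentifiers };
}
```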
In query mode (chatMode = "query"), the system validates that document context exists:
This prevents the LLM from hallucinating answers when no relevant documents are found. In chat mode, the LLM can use general knowledge even without document context.
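The mode-dependent check reduces to a small guard, sketched below. The preflight name, the return shape, and the refusal text are hypothetical. Only the rule itself (query mode with no context does not reach the LLM) comes from the description above.

```javascript
// Hypothetical refusal text; the real message may differ.
const DEFAULT_REFUSAL =
  "There is no relevant information in this workspace to answer your query.";

// Decide whether the chat should proceed to the LLM.
function preflight(chatMode, contextTexts, refusal = DEFAULT_REFUSAL) {
  if (chatMode === "query" && contextTexts.length === 0) {
    return { proceed: false, textResponse: refusal };
  }
  return { proceed: true, textResponse: null };
}
```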
Sources: server/utils/chats/stream.js:198-227
The DocumentManager enforces a maximum token limit for pinned documents:
By default, pinned documents are limited to 80% of the model's context window, which leaves the remaining 20% for the rest of the prompt.
This prevents context overflow while maximizing the use of pinned content.
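As a worked example of the default, the budget for a given context window could be computed as below. The pinnedTokenBudget name and the rounding down to a whole token are assumptions.

```javascript
// Hypothetical helper: share of the context window available to pinned docs.
function pinnedTokenBudget(contextWindowTokens, fraction = 0.8) {
  return Math.floor(contextWindowTokens * fraction);
}

const budget = pinnedTokenBudget(8192); // 6553 tokens for pinned documents
const remainder = 8192 - budget; // 1639 tokens left for everything else
```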
Sources: server/utils/chats/stream.js:115-118, server/utils/DocumentManager.js
| Endpoint | Method | Purpose |
|---|---|---|
| /v1/workspace/:slug/update-embeddings | POST | Add/remove documents from workspace |
| /v1/workspace/:slug/update-pin | POST | Toggle pin status for a document |
| /v1/workspace/:slug | GET | Get workspace with documents included |
| /v1/documents | GET | List all documents in system with metadata |
| /v1/documents/folder/:folderName | GET | Get documents in specific folder |
| /v1/system/remove-documents | DELETE | Purge documents from entire system |
| /v1/document/upload | POST | Upload document with optional workspace association |
All endpoints require API key authentication via the validApiKey middleware.
Sources: server/endpoints/api/workspace/index.js:448-591, server/endpoints/api/document/index.js:600-706, server/endpoints/api/system/index.js:207-272