Document Management in Workspaces

This page covers the document management system within workspaces, including how documents are associated with workspaces, the pinned documents feature for persistent context, and the watched documents feature for automatic synchronization. For information about the initial document ingestion and vectorization pipeline, see Document Ingestion. For details about vector similarity search during chat, see Similarity Search and Reranking.

Overview

The document management system in workspaces provides three primary capabilities:

Document Association: Linking processed documents from the documents/ folder to specific workspaces for RAG (Retrieval-Augmented Generation)
Pinned Documents: Marking specific documents to be always included in the chat context window
Watched Documents: Automatically re-syncing documents from external sources when they become stale

All document-workspace relationships are tracked in the workspace_documents table, which serves as the central registry for document management.

Sources: server/prisma/schema.prisma:26-39, server/models/workspace.js:1-10

Database Schema

The workspace_documents table maintains the relationship between workspaces and their embedded documents:

Field	Type	Description
`id`	Int	Primary key
`docId`	String	Unique identifier for the document
`filename`	String	Original filename
`docpath`	String	Path relative to `documents/` folder (e.g., `custom-documents/file.json`)
`workspaceId`	Int	Foreign key to `workspaces` table
`metadata`	String	JSON string of additional metadata
`pinned`	Boolean	Whether document is pinned (always in context)
`watched`	Boolean	Whether document is watched for auto-sync
`createdAt`	DateTime	Creation timestamp
`lastUpdatedAt`	DateTime	Last update timestamp

Diagram: workspace_documents Table Relationships

The docpath field stores the path relative to the documents/ directory, typically in the format folder-name/filename-uuid.json. The docId is a unique identifier that can be used to track the document across the system.

Sources: server/prisma/schema.prisma:26-39, server/prisma/schema.prisma:297-315

Document Storage Structure

Documents are stored in a hierarchical folder structure within the documents/ directory:

Diagram: Document Storage Hierarchy

Each JSON file contains:

pageContent: The full text content of the document
Metadata fields: title, docAuthor, description, docSource, chunkSource, published, wordCount, token_count_estimate

The documents/ path resolves to:

Development: server/storage/documents
Production: $STORAGE_DIR/documents
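
As a rough illustration of the rule above, the path could be resolved as in the sketch below. The function name `resolveDocumentsPath` and the use of `NODE_ENV` to tell development from production are assumptions made for this example, not details confirmed by this page.

```javascript
// Hypothetical helper: resolve the documents/ directory from an env-like object.
// Assumption: NODE_ENV === "development" selects the in-repo storage folder.
function resolveDocumentsPath(env) {
  if (env.NODE_ENV === "development") {
    return "server/storage/documents";
  }
  if (!env.STORAGE_DIR) {
    throw new Error("STORAGE_DIR must be set outside development");
  }
  // Trim trailing slashes so the joined path has exactly one separator.
  return `${env.STORAGE_DIR.replace(/\/+$/, "")}/documents`;
}
```

Passing the environment as an argument (for example `resolveDocumentsPath(process.env)`) keeps the helper easy to test without touching real environment variables.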

Sources: server/utils/files/index.js:6-9, server/utils/files/index.js:462-474

Adding Documents to Workspaces

Via API Endpoint

The /v1/workspace/:slug/update-embeddings endpoint manages document associations:

Diagram: Document Addition/Removal Flow

Request body format:
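
A minimal sketch of such a request follows. It assumes the body carries two arrays of docpath strings named `adds` and `deletes`, and that the API key is sent as a Bearer token. The helper name, the `/api` base path, and the sample values are illustrative only. The helper builds the request without sending it, so it can be inspected before being passed to `fetch`.

```javascript
// Hypothetical helper: build (but do not send) an update-embeddings request.
// Assumed body shape: { adds: string[], deletes: string[] } of docpath values.
function buildUpdateEmbeddingsRequest(baseUrl, slug, apiKey, { adds = [], deletes = [] } = {}) {
  return {
    url: `${baseUrl}/v1/workspace/${encodeURIComponent(slug)}/update-embeddings`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumption: Bearer-style API key
      },
      body: JSON.stringify({ adds, deletes }),
    },
  };
}

const embeddingsRequest = buildUpdateEmbeddingsRequest(
  "http://localhost:3001/api", // assumed base URL
  "my-workspace",
  "MY_API_KEY",
  { adds: ["custom-documents/file.json"] }
);
// Later: const response = await fetch(embeddingsRequest.url, embeddingsRequest.options);
```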

The system performs these operations:

Removal: Deletes workspace_documents records and associated vector embeddings
Addition: Creates new workspace_documents records and embeds documents if not already vectorized

Sources: server/endpoints/api/workspace/index.js:448-523, server/models/documents.js

Via Upload with Auto-Embedding

Documents can be uploaded and automatically added to workspaces using the addToWorkspaces parameter:
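
A hedged sketch of how such an upload form might be assembled is shown below. It assumes the upload is a multipart form with the file in a `file` field and `addToWorkspaces` given as a comma-separated list of workspace slugs. Both field conventions and the helper name are assumptions. The code relies on the global `FormData` and `Blob` available in browsers and recent Node.js versions.

```javascript
// Hypothetical helper: assemble the multipart form for an upload request.
// Assumptions: the file goes in a "file" field and addToWorkspaces is a
// comma-separated list of workspace slugs.
function buildUploadForm(fileBlob, filename, workspaceSlugs = []) {
  const form = new FormData();
  form.append("file", fileBlob, filename);
  if (workspaceSlugs.length > 0) {
    form.append("addToWorkspaces", workspaceSlugs.join(","));
  }
  return form;
}

const uploadForm = buildUploadForm(
  new Blob(["hello world"], { type: "text/plain" }),
  "notes.txt",
  ["research", "support"]
);
// Later: await fetch(`${baseUrl}/v1/document/upload`, { method: "POST", headers, body: uploadForm });
```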

This triggers:

Document processing via Collector API
Automatic vectorization
Addition to specified workspaces

Sources: server/endpoints/api/document/index.js:96-148, server/endpoints/api/document/index.js:138-142

Pinned Documents

Purpose and Behavior

Pinned documents are always included in the chat context window, regardless of vector similarity search results. This feature ensures critical information is available for every query.

Diagram: Pinned Documents in RAG Pipeline

Implementation Details

The DocumentManager.pinnedDocs() method retrieves all pinned documents for a workspace:

Queries workspace_documents WHERE workspaceId = X AND pinned = true
Loads full document content from JSON files
Respects maxTokens limit (defaults to 80% of LLM context window)
Returns array of document objects with pageContent and metadata

During chat processing, pinned documents are:

Retrieved before vector similarity search
Added to contextTexts array
Included in source citations
Filtered from similarity search results using filterIdentifiers to prevent duplication
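
The selection logic described above can be sketched as follows. This is a simplified in-memory model, not the actual `DocumentManager` code: the function name, the record shape, and the policy of ignoring further documents once the running total reaches `maxTokens` are assumptions.

```javascript
// Simplified model of pinned-document selection under a token budget.
// `docs` stands in for the JSON files loaded from documents/.
function selectPinnedDocs(docs, maxTokens) {
  const selected = [];
  let tokens = 0;
  for (const doc of docs) {
    // Skip entries that lack the text needed for context assembly.
    if (typeof doc.pageContent !== "string") continue;
    // Assumed policy: once the budget is used up, ignore remaining documents.
    if (tokens >= maxTokens) break;
    selected.push(doc);
    tokens += doc.token_count_estimate || 0;
  }
  return { selected, tokens };
}

const pinnedResult = selectPinnedDocs(
  [
    { title: "a", pageContent: "...", token_count_estimate: 600 },
    { title: "b", pageContent: "...", token_count_estimate: 500 },
    { title: "c", pageContent: "...", token_count_estimate: 100 },
  ],
  1000
);
```

Under this policy the last admitted document can push the total past the budget (here, documents a and b are kept for 1100 tokens). A stricter variant would compare `tokens + doc.token_count_estimate` against `maxTokens` before adding.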

Sources: server/utils/chats/stream.js:115-132, server/utils/DocumentManager.js

Setting Pin Status

Use the /v1/workspace/:slug/update-pin endpoint:
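
A minimal sketch of a pin request, assuming the body has the shape `{ docPath, pinStatus }` and that the API key is sent as a Bearer token. The field names, the helper name, and the `/api` base path are assumptions for illustration.

```javascript
// Hypothetical helper: build (but do not send) an update-pin request.
// Assumed body shape: { docPath: string, pinStatus: boolean }.
function buildUpdatePinRequest(baseUrl, slug, apiKey, docPath, pinStatus) {
  return {
    url: `${baseUrl}/v1/workspace/${encodeURIComponent(slug)}/update-pin`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumption: Bearer-style API key
      },
      body: JSON.stringify({ docPath, pinStatus: Boolean(pinStatus) }),
    },
  };
}

const pinRequest = buildUpdatePinRequest(
  "http://localhost:3001/api", // assumed base URL
  "my-workspace",
  "MY_API_KEY",
  "custom-documents/file.json",
  true
);
// Later: await fetch(pinRequest.url, pinRequest.options);
```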

The system:

Finds the workspace_documents record by workspaceId and docpath
Updates the pinned field to the specified boolean value

Sources: server/endpoints/api/workspace/index.js:525-591

Deduplication Mechanism

The sourceIdentifier() function creates unique identifiers for documents:

This identifier is used to populate filterIdentifiers, which prevents the vector search from returning documents that are already pinned, avoiding duplicate context.
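
The deduplication idea can be sketched as below. The identifier scheme (title plus published timestamp), the fallback for documents that lack those fields, and the helper `excludePinned` are assumptions for illustration. Only the names `sourceIdentifier` and `filterIdentifiers` come from the text above.

```javascript
// Assumed identifier scheme: title plus published timestamp.
let fallbackCounter = 0;
function sourceIdentifier(doc) {
  if (!doc || !doc.title || !doc.published) {
    // No stable fields available: return a value that never matches another document.
    fallbackCounter += 1;
    return `unidentified-${fallbackCounter}`;
  }
  return `title:${doc.title}-timestamp:${doc.published}`;
}

// Drop search results whose identifier is already covered by a pinned document.
function excludePinned(searchResults, pinnedDocs) {
  const filterIdentifiers = new Set(pinnedDocs.map(sourceIdentifier));
  return searchResults.filter((doc) => !filterIdentifiers.has(sourceIdentifier(doc)));
}

const pinnedSample = [{ title: "handbook.pdf", published: "2024-01-01" }];
const searchSample = [
  { title: "handbook.pdf", published: "2024-01-01" },
  { title: "faq.md", published: "2024-02-02" },
];
const dedupedResults = excludePinned(searchSample, pinnedSample);
```

In the pipeline described here the identifiers are handed to the vector search as an exclusion filter. The post-filter above only demonstrates the effect, leaving `faq.md` as the single remaining result.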

Sources: server/utils/chats/index.js:107-110, server/utils/chats/stream.js:158

Watched Documents (Live Sync)

Overview

The watched documents feature enables automatic re-synchronization of documents from external sources. When enabled, the system periodically checks watched documents and updates them if they've become stale.

Diagram: Document Watch and Sync System

Sources: server/models/documentSyncQueue.js:10-90, server/jobs/sync-watched-documents.js:9-150

Eligibility and Validation

The DocumentSyncQueue.canWatch() method determines if a document is eligible for watching based on its chunkSource metadata field:

Prefix	Description
`link://`	Web pages (only when the document title ends in `.html`)
`youtube://`	YouTube videos
`confluence://`	Confluence wiki pages
`github://`	GitHub repository files
`gitlab://`	GitLab repository files
`drupalwiki://`	Drupal wiki documents

A document can only be watched if its chunkSource starts with one of these prefixes. This information is stored during the initial document ingestion.
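
The eligibility rule from the table can be sketched as a simple prefix check. This is an illustrative reimplementation, not the actual `DocumentSyncQueue.canWatch()` source. The argument shape and the constant name are assumptions.

```javascript
// Prefixes taken from the table above; link:// is handled separately.
const WATCHABLE_PREFIXES = ["youtube://", "confluence://", "github://", "gitlab://", "drupalwiki://"];

function canWatch({ title = "", chunkSource = "" } = {}) {
  if (typeof chunkSource !== "string" || typeof title !== "string") return false;
  // Web links qualify only when the stored title ends in .html.
  if (chunkSource.startsWith("link://")) return title.endsWith(".html");
  return WATCHABLE_PREFIXES.some((prefix) => chunkSource.startsWith(prefix));
}
```

For example, `canWatch({ title: "page.html", chunkSource: "link://https://example.com" })` is accepted, while a document whose `chunkSource` does not start with any listed prefix is rejected.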

Sources: server/models/documentSyncQueue.js:55-66

Queue Management

Each watched document has one associated document_sync_queues record:

Field	Purpose
`workspaceDocId`	Foreign key to `workspace_documents.id` (UNIQUE)
`staleAfterMs`	Milliseconds before document is considered stale (default: 604800000 = 7 days)
`nextSyncAt`	DateTime when next sync should occur
`lastSyncedAt`	DateTime of last successful sync

The system ensures no duplicate queues for the same document across multiple workspaces by checking if any workspace document with the same filename already has a queue.

Sources: server/prisma/schema.prisma:297-315, server/models/documentSyncQueue.js:72-89

Sync Execution

The background worker (sync-watched-documents.js) runs periodically:

1. Query stale documents: Find all queues where nextSyncAt <= NOW()
2. Fetch new content: Call appropriate connector based on document type
3. Compare content: Check if document content has changed
4. Update if changed:
   - Overwrite JSON file in documents/ folder
   - Purge vector cache for the document
   - Re-embed document to vector database across all workspaces
5. Schedule next sync: Update nextSyncAt = NOW() + staleAfterMs
6. Log execution: Create document_sync_executions record

If a document fails to sync 5 consecutive times (the threshold is set by maxRepeatFailures), it is automatically unwatched so that the worker stops retrying it.
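
The scheduling arithmetic and the failure threshold can be sketched as follows. The function names and the representation of the execution history as a newest-first list of booleans are assumptions. The 7-day default and the limit of 5 come from the text above.

```javascript
const DEFAULT_STALE_AFTER_MS = 7 * 24 * 60 * 60 * 1000; // 604800000 ms = 7 days
const MAX_REPEAT_FAILURES = 5;

// Next sync time is "now" plus the staleness window.
function calcNextSync(now, staleAfterMs = DEFAULT_STALE_AFTER_MS) {
  return new Date(now.getTime() + staleAfterMs);
}

// `history` lists past executions, newest first: true = success, false = failure.
function shouldUnwatch(history, maxRepeatFailures = MAX_REPEAT_FAILURES) {
  if (history.length < maxRepeatFailures) return false;
  return history.slice(0, maxRepeatFailures).every((ok) => ok === false);
}
```

A document synced at 2024-01-01T00:00:00Z with the default window would be due again at 2024-01-08T00:00:00Z. A single success among the five most recent runs resets the streak.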

Sources: server/jobs/sync-watched-documents.js:9-150, server/models/documentSyncQueue.js:22

Enabling/Disabling Watch

The watch feature must be enabled at the system level before any document can be watched.

Watching a document:

Creates a document_sync_queues record
Sets workspace_documents.watched = true
Schedules the first sync based on staleAfterMs

Unwatching a document:

Deletes the document_sync_queues record (CASCADE delete)
Sets workspace_documents.watched = false for all workspaces using that document

Sources: server/models/documentSyncQueue.js:34-38, server/models/documentSyncQueue.js:72-120

Document Retrieval and Display

Listing Documents in Workspaces

The viewLocalFiles() function provides a hierarchical view of all documents available for embedding:

Diagram: Document Listing Flow

The function returns a structure:

Sources: server/utils/files/index.js:35-100

Enrichment Queries

Pinned Workspaces Lookup:

This query runs once per folder and returns a map of {filename: [workspaceId1, workspaceId2]}.

Watched Documents Lookup:

Returns a map of {filename: workspaceId}. Note: Only one workspace ID is stored per watched document since the queue is shared.
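
The two map shapes can be illustrated with a small sketch that groups query rows in memory. The row shape `{ filename, workspaceId }`, the function names, and the choice to keep the first workspace ID seen for a watched file are assumptions.

```javascript
// Group pinned rows into { filename: [workspaceId, ...] }.
function buildPinnedMap(rows) {
  const map = Object.create(null); // no prototype, so any filename is a safe key
  for (const { filename, workspaceId } of rows) {
    if (!map[filename]) map[filename] = [];
    map[filename].push(workspaceId);
  }
  return map;
}

// Keep a single workspace ID per watched filename (the first one seen).
function buildWatchedMap(rows) {
  const map = Object.create(null);
  for (const { filename, workspaceId } of rows) {
    if (!(filename in map)) map[filename] = workspaceId;
  }
  return map;
}
```

Building each map once per folder lets every file entry be enriched with a constant-time lookup instead of a query per file.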

Sources: server/utils/files/index.js:304-353

Performance Optimization

For large files (>150 MB as defined by FILE_READ_SIZE_THRESHOLD), the system uses streaming to parse JSON without loading the entire file into memory:

The pageContent field is always stripped from metadata when listing documents to reduce response size.
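
Stripping `pageContent` can be sketched with object rest destructuring. The helper name `toListingEntry` is hypothetical.

```javascript
// Return a copy of the document without its (potentially large) pageContent.
function toListingEntry(document) {
  const { pageContent, ...metadata } = document;
  return metadata;
}

const listingEntry = toListingEntry({
  title: "file.txt",
  wordCount: 2,
  pageContent: "hello world",
});
```

The original object is left untouched, so the full text stays available to callers that still need it.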

Sources: server/utils/files/index.js:374-460

Document Removal

Removing from Workspace

The Document.removeDocuments() method handles document removal from a workspace:

Diagram: Document Removal Process

The removal process:

1. Finds all workspace_documents records matching workspaceId and docpath
2. For each document:
   - Retrieves associated vector IDs from document_vectors table
   - Deletes vectors from the vector database namespace
   - Deletes document_vectors records
   - Deletes workspace_documents record

Important: This only removes the document from the specified workspace. The source JSON file in documents/ remains intact for use in other workspaces.
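
The removal order can be modeled with in-memory stand-ins for the tables, the vector database, and the file store. This is a sketch only: the `state` object, the record shapes, and the assumption that each workspace_documents row has its own `docId` linking it to its document_vectors rows are illustrative, not the actual `Document.removeDocuments()` code.

```javascript
// In-memory model of removing documents from one workspace.
function removeDocuments(state, workspaceId, docpaths) {
  const targets = state.workspaceDocuments.filter(
    (d) => d.workspaceId === workspaceId && docpaths.includes(d.docpath)
  );
  for (const doc of targets) {
    // Collect the vector IDs recorded for this document.
    const vectorIds = state.documentVectors
      .filter((v) => v.docId === doc.docId)
      .map((v) => v.vectorId);
    // Delete those vectors from the workspace's namespace.
    const namespace = state.vectorDb[workspaceId] || new Set();
    vectorIds.forEach((id) => namespace.delete(id));
    // Delete the document_vectors rows, then the workspace_documents row.
    state.documentVectors = state.documentVectors.filter((v) => v.docId !== doc.docId);
    state.workspaceDocuments = state.workspaceDocuments.filter((d) => d.id !== doc.id);
  }
  return targets.length; // state.files is never touched
}

const removalState = {
  workspaceDocuments: [
    { id: 1, docId: "d1", docpath: "custom-documents/file.json", workspaceId: 1 },
    { id: 2, docId: "d2", docpath: "custom-documents/file.json", workspaceId: 2 },
  ],
  documentVectors: [
    { docId: "d1", vectorId: "v1" },
    { docId: "d2", vectorId: "v2" },
  ],
  vectorDb: { 1: new Set(["v1"]), 2: new Set(["v2"]) },
  files: new Set(["custom-documents/file.json"]),
};
const removedCount = removeDocuments(removalState, 1, ["custom-documents/file.json"]);
```

After the call, workspace 2 still holds its record and vector, and the source file entry remains in `files`.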

Sources: server/models/documents.js

Purging from System

The purgeDocument() function completely removes a document from the entire system:

Purge vector cache: Delete vector-cache/{uuid}.json file
Purge source document: Delete documents/{folder}/{filename}.json file
Remove from all workspaces: Call Document.removeDocuments() for every workspace

After purging, the document no longer exists in the system and must be re-uploaded and re-processed.

Sources: server/utils/files/purgeDocument.js:13-23, server/endpoints/api/system/index.js:207-272

Folder Purging

The purgeFolder() function removes entire folders (except custom-documents):

1. Lists all files in the folder
2. For each file:
   - Purges vector cache
   - Removes from all workspaces
3. Deletes the folder directory

This is useful for cleaning up connector-generated folders (e.g., GitHub repos, Confluence spaces).

Sources: server/utils/files/purgeDocument.js:33-84

Integration with Chat Pipeline

Context Assembly Priority

During chat execution, documents are assembled into context in this priority order:

1. Pinned documents (via DocumentManager.pinnedDocs())
2. Parsed files (temporary documents uploaded to thread/workspace)
3. History backfill (sources from previous messages via fillSourceWindow())
4. Vector similarity search results (via VectorDb.performSimilaritySearch())

Diagram: Context Assembly in Chat Pipeline

The filterIdentifiers array prevents duplication:

Contains source identifiers from pinned documents
Passed to VectorDb.performSimilaritySearch() as exclusion filter
Ensures vector search doesn't return already-included documents

Sources: server/utils/chats/stream.js:94-196

Query Mode Validation

In query mode (chatMode = "query"), the system validates that document context exists:

This prevents the LLM from hallucinating answers when no relevant documents are found. In chat mode, the LLM can use general knowledge even without document context.
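
The check can be sketched as a small guard. The function name, the return shape, and the default refusal text are placeholders chosen for this example. Only the `chatMode` value `"query"` and the idea of an empty context come from the text above.

```javascript
// Decide whether a chat turn may proceed to the LLM.
function checkQueryContext(
  chatMode,
  contextTexts,
  refusalMessage = "There is no relevant information in this workspace to answer your query." // placeholder text
) {
  if (chatMode === "query" && contextTexts.length === 0) {
    return { proceed: false, textResponse: refusalMessage };
  }
  return { proceed: true, textResponse: null };
}
```

With an empty `contextTexts` array, `"query"` mode returns the refusal without calling the model, while `"chat"` mode proceeds.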

Sources: server/utils/chats/stream.js:198-227

Token Limit Management

The DocumentManager enforces a maximum token limit for pinned documents:

By default, pinned documents are limited to 80% of the model's context window to leave room for:

System prompt
Chat history
User query
Vector search results
LLM response

This prevents context overflow while maximizing the use of pinned content.
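
As a worked example of the 80% default, the sketch below computes the pinned-document budget and what remains for everything else. The function name and the rounding with `Math.floor` are assumptions.

```javascript
// Split a context window into a pinned-document budget and the remainder.
function pinnedTokenBudget(contextWindow, ratio = 0.8) {
  const maxTokens = Math.floor(contextWindow * ratio);
  return { maxTokens, remaining: contextWindow - maxTokens };
}

const budget = pinnedTokenBudget(8192);
// budget.maxTokens is 6553 and budget.remaining is 1639
```

For an 8192-token model, pinned documents may use up to 6553 tokens, leaving 1639 for the system prompt, history, query, search results, and response.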

Sources: server/utils/chats/stream.js:115-118, server/utils/DocumentManager.js

API Endpoints Summary

Endpoint	Method	Purpose
`/v1/workspace/:slug/update-embeddings`	POST	Add/remove documents from workspace
`/v1/workspace/:slug/update-pin`	POST	Toggle pin status for a document
`/v1/workspace/:slug`	GET	Get workspace with documents included
`/v1/documents`	GET	List all documents in system with metadata
`/v1/documents/folder/:folderName`	GET	Get documents in specific folder
`/v1/system/remove-documents`	DELETE	Purge documents from entire system
`/v1/document/upload`	POST	Upload document with optional workspace association

All endpoints require API key authentication via the validApiKey middleware.

Sources: server/endpoints/api/workspace/index.js:448-591, server/endpoints/api/document/index.js:600-706, server/endpoints/api/system/index.js:207-272
