Document ingestion is the process of importing content from various sources into AnythingLLM, parsing it into structured formats, and preparing it for vectorization and embedding in workspaces. This page covers the complete document pipeline from initial upload or connector import through to storage as processed JSON files.
For information about how documents are embedded into workspaces and vectorized, see Document Vectorization Pipeline. For workspace-level document management, see Document Management in Workspaces. For information about the Collector service internals, see Collector Service Architecture.
The document ingestion system follows a multi-stage architecture where documents flow through distinct processing phases before becoming available for embedding.
Sources: server/endpoints/api/document/index.js 26-598, collector/extensions/index.js 9-227, server/utils/collectorApi.js, server/utils/files/index.js 1-503
The system provides multiple API endpoints for document ingestion, each handling different input types and use cases.
| Endpoint | Method | Purpose | Input Format |
|---|---|---|---|
| /v1/document/upload | POST | Upload files via multipart form | multipart/form-data with file field |
| /v1/document/upload/:folderName | POST | Upload to specific folder | multipart/form-data with target folder path |
| /v1/document/upload-link | POST | Scrape and ingest URL | JSON with link, scraperHeaders, metadata |
| /v1/document/raw-text | POST | Create document from text | JSON with textContent, metadata (requires title) |
All upload endpoints support optional parameters:
- addToWorkspaces: Comma-separated workspace slugs for immediate embedding
- metadata: JSON object with custom metadata fields (title, docAuthor, description, docSource)

Sources: server/endpoints/api/document/index.js 26-330, server/utils/files/multer.js 31-45, server/utils/collectorApi.js
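The upload flow above can be sketched as a small request builder for Node 18+ (which provides fetch, FormData, and Blob globally). The API key, base URL, and workspace slug are placeholders; the endpoint path and the file field name follow the table above.

```javascript
// Sketch: building a multipart upload request for /v1/document/upload.
// buildUploadRequest and its option names beyond addToWorkspaces/metadata
// are illustrative, not the actual server helpers.
function buildUploadRequest(fileName, fileBytes, apiKey, opts = {}) {
  const form = new FormData();
  // The endpoint expects the file under the "file" field.
  form.append("file", new Blob([fileBytes]), fileName);
  if (opts.addToWorkspaces) form.append("addToWorkspaces", opts.addToWorkspaces);
  if (opts.metadata) form.append("metadata", JSON.stringify(opts.metadata));
  return {
    url: "/v1/document/upload",
    options: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
    },
  };
}

// Usage (against a running instance): fetch(baseUrl + req.url, req.options)
const req = buildUploadRequest("notes.txt", Buffer.from("hello"), "API_KEY", {
  addToWorkspaces: "my-workspace",
});
```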
Data connectors enable automated ingestion from external sources. Each connector implements source-specific authentication and data extraction logic.
| Connector | Endpoint | Supported Content | Key Features |
|---|---|---|---|
| GitHub | /ext/github/repo | Repository files | Branch selection, path filtering |
| GitLab | /ext/gitlab/repo | Repository files, issues, wikis | Branch selection, optional issues/wikis |
| YouTube | /ext/youtube/transcript | Video transcripts | Automatic transcript extraction |
| Website Depth | /ext/website-depth | Web pages | Recursive crawling with depth/link limits |
| Confluence | /ext/confluence | Wiki pages | Cloud/server support, space-based filtering |
| DrupalWiki | /ext/drupalwiki | Wiki spaces | Space ID filtering |
| Obsidian | /ext/obsidian/vault | Markdown files | Vault structure preservation |
| Paperless-ngx | /ext/paperless-ngx | Stored documents | API token authentication |
Sources: frontend/src/models/dataConnector.js 5-233, server/endpoints/extensions/index.js 12-204, collector/extensions/index.js 9-227
After a file reaches the hot directory, the Collector service processes it through a type-specific parsing pipeline.
Each processed document is stored as a JSON file with the following standardized structure:
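A representative shape is shown below. The pageContent and chunkSource fields are described in this section; the remaining field names are illustrative of typical AnythingLLM document metadata and may not be exhaustive.

```javascript
// Representative (not authoritative) shape of a processed document JSON file.
const processedDocument = {
  id: "document-uuid",              // unique id assigned at processing time
  title: "example.pdf",
  docAuthor: "Unknown",
  description: "Uploaded file",
  docSource: "a file uploaded by the user",
  chunkSource: "",                  // protocol-prefixed for connector documents
  published: "1/1/2024, 12:00:00 PM",
  wordCount: 1234,
  pageContent: "Full extracted text of the document...",
  token_count_estimate: 1650,
};
```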
The chunkSource field is particularly important for document watching and resync operations. It contains protocol-prefixed identifiers:
- link://https://example.com - Web scraped content
- youtube://https://youtube.com/watch?v=... - YouTube transcripts
- confluence://baseUrl/... - Confluence pages
- github://repo/path/file.md - GitHub files
- gitlab://repo/path/file.md - GitLab files

Sources: server/utils/files/index.js 462-485, collector/extensions/index.js
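A minimal sketch of splitting a chunkSource value into its protocol and source identifier, mirroring the prefixes listed above (the real code paths handle this per-protocol rather than with one helper):

```javascript
// Split a chunkSource like "youtube://https://youtube.com/watch?v=abc"
// into { protocol, source }. Plain uploads have no protocol prefix.
function parseChunkSource(chunkSource) {
  const match = chunkSource.match(/^([a-z-]+):\/\/(.+)$/);
  if (!match) return { protocol: null, source: chunkSource };
  return { protocol: match[1], source: match[2] };
}
```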
Documents are organized in a hierarchical folder structure within the storage directory.
storage/
├── documents/ # Processed document JSON files
│ ├── custom-documents/ # Default folder for manual uploads
│ │ ├── file1.pdf-{uuid}.json
│ │ └── file2.txt-{uuid}.json
│ ├── github-repo/ # GitHub connector documents
│ ├── youtube-videos/ # YouTube connector documents
│ └── confluence-space/ # Confluence connector documents
├── vector-cache/ # Cached embeddings
│ ├── {uuid-v5-hash}.json # Cached vectors by filename hash
│ └── ...
├── direct-uploads/ # Temporary upload staging
└── assets/ # System assets (logos, etc.)
The system uses UUID v5 hashing based on the file path to identify cached vectors, enabling fast lookups without recomputing embeddings for unchanged documents.
Key file system operations are centralized in server/utils/files/index.js:
| Function | Purpose | Path Reference |
|---|---|---|
| viewLocalFiles() | List all documents in folder hierarchy | server/utils/files/index.js 35-100 |
| fileData(filePath) | Load and parse document JSON | server/utils/files/index.js 25-33 |
| cachedVectorInformation(filename) | Check for cached embeddings | server/utils/files/index.js 172-187 |
| storeVectorResult(vectorData, filename) | Cache computed embeddings | server/utils/files/index.js 191-202 |
| purgeSourceDocument(filename) | Delete document file | server/utils/files/index.js 205-219 |
| purgeVectorCache(filename) | Delete cached vectors | server/utils/files/index.js 222-231 |
Sources: server/utils/files/index.js 1-503
The vector cache prevents redundant embedding computations by storing previously generated embeddings for each document.
The cache uses UUID v5 hashing with the namespace uuidv5.URL and the document's full path as the name. This produces deterministic cache keys:
Cache files contain arrays of pre-chunked vectors with metadata:
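An illustrative cache entry is shown below: each element pairs a chunk's embedding values with the metadata needed to re-insert it into a vector database without re-embedding. Field names here are representative, not authoritative.

```javascript
// Representative shape of a vector-cache file's contents.
const cachedVectors = [
  {
    vector: [0.0132, -0.0841, 0.2213], // one float per embedding dimension
    metadata: {
      text: "First chunk of the document...",
      title: "example.pdf",
      chunkSource: "",
    },
  },
  // ...one entry per chunk of the source document
];
```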
Sources: server/utils/files/index.js172-202 server/models/documents.js
The Collector service runs as a separate process (port 8888 by default) and handles compute-intensive document processing operations. The main server communicates with it via HTTP through the CollectorApi client.
| Method | Purpose | Endpoint Called |
|---|---|---|
| online() | Check collector availability | GET / |
| processDocument(filename, metadata) | Process uploaded file | POST /process |
| processLink(link, scraperHeaders, metadata) | Scrape and process URL | POST /process-link |
| processRawText(textContent, metadata) | Create document from text | POST /raw-text |
| forwardExtensionRequest({endpoint, method, body}) | Forward connector requests | POST /ext/* |
| acceptedFileTypes() | Get supported MIME types | GET /accepts |
The Collector service is essential for document ingestion: if it is not online, document uploads fail with the error "Document processing API is not online."
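The client pattern can be sketched as below. The base URL matches the default port; the request body shape and the CollectorClient name are assumptions consistent with the table above, not the actual CollectorApi implementation.

```javascript
// Sketch of a CollectorApi-style HTTP client for the Collector service.
class CollectorClient {
  constructor(baseUrl = "http://localhost:8888") {
    this.baseUrl = baseUrl;
  }

  // Availability probe: GET / must respond for ingestion to proceed.
  async online() {
    try {
      const res = await fetch(this.baseUrl);
      return res.ok;
    } catch {
      return false;
    }
  }

  // Ask the Collector to parse a file already staged in the hot directory.
  async processDocument(filename, metadata = {}) {
    if (!(await this.online()))
      throw new Error("Document processing API is not online.");
    const res = await fetch(`${this.baseUrl}/process`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ filename, options: { metadata } }),
    });
    return res.json();
  }
}
```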
Sources: server/utils/collectorApi.js, server/endpoints/api/document/index.js 97-116, collector/extensions/index.js 9-227
Processed documents include rich metadata that enables source attribution in chat responses. The citation system uses the chunkSource field to link response snippets back to their original sources.
The frontend citation system recognizes special protocols in chunkSource and renders appropriate icons and links:
| Protocol | Icon | Linkable | Description |
|---|---|---|---|
| link:// | LinkSimple | Yes | Web scraped content |
| youtube:// | YoutubeLogo | Yes | YouTube video transcripts |
| github:// | GithubLogo | Yes | GitHub repository files |
| gitlab:// | GitlabLogo | Yes | GitLab repository files |
| confluence:// | Confluence | Yes | Confluence wiki pages |
| drupalwiki:// | DrupalWiki | Yes | DrupalWiki pages |
| obsidian:// | Obsidian | No | Obsidian vault notes |
| paperless-ngx:// | PaperlessNgx | Yes | Paperless-ngx documents |
| (none) | FileText | No | Generic uploaded files |
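The lookup the frontend performs can be sketched as a prefix table. Icon names mirror the table above; the real component in Citation/index.jsx renders React components and may differ in detail.

```javascript
// Map a chunkSource prefix to its citation icon and link behavior.
const CITATION_PROTOCOLS = {
  "link://": { icon: "LinkSimple", linkable: true },
  "youtube://": { icon: "YoutubeLogo", linkable: true },
  "github://": { icon: "GithubLogo", linkable: true },
  "gitlab://": { icon: "GitlabLogo", linkable: true },
  "confluence://": { icon: "Confluence", linkable: true },
  "drupalwiki://": { icon: "DrupalWiki", linkable: true },
  "obsidian://": { icon: "Obsidian", linkable: false },
  "paperless-ngx://": { icon: "PaperlessNgx", linkable: true },
};

function citationStyle(chunkSource = "") {
  for (const [prefix, style] of Object.entries(CITATION_PROTOCOLS)) {
    if (chunkSource.startsWith(prefix)) return style;
  }
  return { icon: "FileText", linkable: false }; // generic uploaded file
}
```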
Sources: frontend/src/components/WorkspaceChat/ChatContainer/ChatHistory/Citation/index.jsx 212-317
Documents imported from connectors can be marked as "watched" to enable automatic resynchronization when source content changes. This feature requires the experimental_live_file_sync system setting to be enabled.
The background worker (server/jobs/sync-watched-documents.js) periodically checks watched documents and fetches updated content from their sources. Supported document types for watching:
- Web links (link://)
- YouTube transcripts (youtube://)
- Confluence pages (confluence://)
- GitHub files (github://)
- GitLab files (gitlab://)
- DrupalWiki pages (drupalwiki://)

The resync process:

1. The worker identifies stale documents via DocumentSyncQueue.staleDocumentQueues()
2. It calls /ext/resync-source-document with the document's metadata
3. The Collector re-fetches the source and returns updated pageContent
4. The document is updated and a nextSyncAt timestamp is set for the next check

Failed syncs are tracked; after 5 consecutive failures (DocumentSyncQueue.maxRepeatFailures), the document is automatically unwatched.
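The failure-tracking rule can be sketched as a pure state transition. The threshold mirrors DocumentSyncQueue.maxRepeatFailures; the queue record's field names and the interval handling are illustrative.

```javascript
const MAX_REPEAT_FAILURES = 5; // mirrors DocumentSyncQueue.maxRepeatFailures

// Given a queue record and a sync outcome, compute the next record:
// successes reset the failure count, repeated failures unwatch the doc.
function nextQueueState(queue, syncSucceeded, intervalMs = 60 * 60 * 1000) {
  const nextSyncAt = new Date(Date.now() + intervalMs);
  if (syncSucceeded) return { ...queue, failures: 0, watched: true, nextSyncAt };
  const failures = queue.failures + 1;
  return {
    ...queue,
    failures,
    watched: failures < MAX_REPEAT_FAILURES, // auto-unwatch at the limit
    nextSyncAt,
  };
}
```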
Sources: server/models/documentSyncQueue.js 10-244, server/jobs/sync-watched-documents.js 9-106, collector/extensions/resync/index.js 1-203
The chat API endpoints support document attachments with MIME type application/anythingllm-document. These attachments are base64-encoded documents that are:

- Decoded and used as prompt context for the current chat request
- Added to the response's sources array for citation

This enables ephemeral document context for single-turn queries without polluting the document library.
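Constructing such an attachment can be sketched as below. The MIME type comes from the text above; the field names (name, mime, contentString) and the data-URL wrapping are assumptions about the payload shape, not a documented contract.

```javascript
// Sketch: build an ephemeral document attachment for a chat API request.
function buildDocumentAttachment(name, textContent) {
  const encoded = Buffer.from(textContent, "utf8").toString("base64");
  return {
    name,
    mime: "application/anythingllm-document",
    contentString: `data:application/anythingllm-document;base64,${encoded}`,
  };
}
```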
Sources: server/utils/chats/apiChatHandler.js 39-96, server/endpoints/api/workspace/index.js 609-621