Document ingestion is the process of importing content from various sources into AnythingLLM, parsing it into structured formats, and preparing it for vectorization and embedding in workspaces. This page covers the complete document pipeline from initial upload or connector import through to storage as processed JSON files.
For information about how documents are embedded into workspaces and vectorized, see Document Vectorization Pipeline. For workspace-level document management, see Document Management in Workspaces. For information about the Collector service internals, see Collector Service Architecture.
The document ingestion system follows a multi-stage architecture where documents flow through distinct processing phases before becoming available for embedding.
Sources: server/endpoints/api/document/index.js 26-598, collector/extensions/index.js 9-227, server/utils/collectorApi.js, server/utils/files/index.js 1-503
The system provides multiple API endpoints for document ingestion, each handling different input types and use cases.
| Endpoint | Method | Purpose | Input Format |
|---|---|---|---|
| /v1/document/upload | POST | Upload files via multipart form | multipart/form-data with file field |
| /v1/document/upload/:folderName | POST | Upload to specific folder | multipart/form-data with target folder path |
| /v1/document/upload-link | POST | Scrape and ingest URL | JSON with link, scraperHeaders, metadata |
| /v1/document/raw-text | POST | Create document from text | JSON with textContent, metadata (requires title) |
All upload endpoints support optional parameters:
- addToWorkspaces: Comma-separated workspace slugs for immediate embedding
- metadata: JSON object with custom metadata fields (title, docAuthor, description, docSource)

Sources: server/endpoints/api/document/index.js 26-330, server/utils/files/multer.js 31-45, server/utils/collectorApi.js
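The upload flow above can be sketched as a small request builder for Node 18+ (which provides fetch, FormData, and Blob globally). The API key, base URL, and workspace slug are placeholders; the endpoint path and the file field name follow the table above.

```javascript
// Sketch: building a multipart upload request for /v1/document/upload.
// buildUploadRequest and its option names beyond addToWorkspaces/metadata
// are illustrative, not the actual server helpers.
function buildUploadRequest(fileName, fileBytes, apiKey, opts = {}) {
  const form = new FormData();
  // The endpoint expects the file under the "file" field.
  form.append("file", new Blob([fileBytes]), fileName);
  if (opts.addToWorkspaces) form.append("addToWorkspaces", opts.addToWorkspaces);
  if (opts.metadata) form.append("metadata", JSON.stringify(opts.metadata));
  return {
    url: "/v1/document/upload",
    options: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
    },
  };
}

// Usage (against a running instance): fetch(baseUrl + req.url, req.options)
const req = buildUploadRequest("notes.txt", Buffer.from("hello"), "API_KEY", {
  addToWorkspaces: "my-workspace",
});
```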
Data connectors enable automated ingestion from external sources. Each connector implements source-specific authentication and data extraction logic.
| Connector | Endpoint | Supported Content | Key Features |
|---|---|---|---|
| GitHub | /ext/github/repo | Repository files | Branch selection, path filtering |
| GitLab | /ext/gitlab/repo | Repository files, issues, wikis | Branch selection, optional issues/wikis |
| YouTube | /ext/youtube/transcript | Video transcripts | Automatic transcript extraction |
| Website Depth | /ext/website-depth | Web pages | Recursive crawling with depth/link limits |
| Confluence | /ext/confluence | Wiki pages | Cloud/server support, space-based filtering |
| DrupalWiki | /ext/drupalwiki | Wiki spaces | Space ID filtering |
| Obsidian | /ext/obsidian/vault | Markdown files | Vault structure preservation |
| Paperless-ngx | /ext/paperless-ngx | Stored documents | API token authentication |
Sources: frontend/src/models/dataConnector.js 5-233, server/endpoints/extensions/index.js 12-204, collector/extensions/index.js 9-227
After a file reaches the hot directory, the Collector service processes it through a type-specific parsing pipeline.
Each processed document is stored as a JSON file with the following standardized structure:
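A representative shape is shown below. The pageContent and chunkSource fields are described in this section; the remaining field names are illustrative of typical AnythingLLM document metadata and may not be exhaustive.

```javascript
// Representative (not authoritative) shape of a processed document JSON file.
const processedDocument = {
  id: "document-uuid",              // unique id assigned at processing time
  title: "example.pdf",
  docAuthor: "Unknown",
  description: "Uploaded file",
  docSource: "a file uploaded by the user",
  chunkSource: "",                  // protocol-prefixed for connector documents
  published: "1/1/2024, 12:00:00 PM",
  wordCount: 1234,
  pageContent: "Full extracted text of the document...",
  token_count_estimate: 1650,
};
```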
The chunkSource field is particularly important for document watching and resync operations. It contains protocol-prefixed identifiers:
- link://https://example.com - Web scraped content
- youtube://https://youtube.com/watch?v=... - YouTube transcripts
- confluence://baseUrl/... - Confluence pages
- github://repo/path/file.md - GitHub files
- gitlab://repo/path/file.md - GitLab files

Sources: server/utils/files/index.js 462-485, collector/extensions/index.js
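A minimal sketch of splitting a chunkSource value into its protocol and source identifier, mirroring the prefixes listed above (the real code paths handle this per-protocol rather than with one helper):

```javascript
// Split a chunkSource like "youtube://https://youtube.com/watch?v=abc"
// into { protocol, source }. Plain uploads have no protocol prefix.
function parseChunkSource(chunkSource) {
  const match = chunkSource.match(/^([a-z-]+):\/\/(.+)$/);
  if (!match) return { protocol: null, source: chunkSource };
  return { protocol: match[1], source: match[2] };
}
```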
Documents are organized in a hierarchical folder structure within the storage directory.
storage/
├── documents/ # Processed document JSON files
│ ├── custom-documents/ # Default folder for manual uploads
│ │ ├── file1.pdf-{uuid}.json
│ │ └── file2.txt-{uuid}.json
│ ├── github-repo/ # GitHub connector documents
│ ├── youtube-videos/ # YouTube connector documents
│ └── confluence-space/ # Confluence connector documents
├── vector-cache/ # Cached embeddings
│ ├── {uuid-v5-hash}.json # Cached vectors by filename hash
│ └── ...
├── direct-uploads/ # Temporary upload staging
└── assets/ # System assets (logos, etc.)
The system uses UUID v5 hashing based on the file path to identify cached vectors, enabling fast lookups without recomputing embeddings for unchanged documents.
Key file system operations are centralized in server/utils/files/index.js:
| Function | Purpose | Path Reference |
|---|---|---|
| viewLocalFiles() | List all documents in folder hierarchy | server/utils/files/index.js 35-100 |
| fileData(filePath) | Load and parse document JSON | server/utils/files/index.js 25-33 |
| cachedVectorInformation(filename) | Check for cached embeddings | server/utils/files/index.js 172-187 |
| storeVectorResult(vectorData, filename) | Cache computed embeddings | server/utils/files/index.js 191-202 |
| purgeSourceDocument(filename) | Delete document file | server/utils/files/index.js 205-219 |
| purgeVectorCache(filename) | Delete cached vectors | server/utils/files/index.js 222-231 |
Sources: server/utils/files/index.js 1-503
The vector cache prevents redundant embedding computations by storing previously generated embeddings for each document.
The cache uses UUID v5 hashing with the namespace uuidv5.URL and the document's full path as the name. This produces deterministic cache keys:
Cache files contain arrays of pre-chunked vectors with metadata:
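An illustrative cache entry is shown below: each element pairs a chunk's embedding values with the metadata needed to re-insert it into a vector database without re-embedding. Field names here are representative, not authoritative.

```javascript
// Representative shape of a vector-cache file's contents.
const cachedVectors = [
  {
    vector: [0.0132, -0.0841, 0.2213], // one float per embedding dimension
    metadata: {
      text: "First chunk of the document...",
      title: "example.pdf",
      chunkSource: "",
    },
  },
  // ...one entry per chunk of the source document
];
```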
Sources: server/utils/files/index.js172-202 server/models/documents.js
The Collector service runs as a separate process (port 8888 by default) and handles compute-intensive document processing operations. The main server communicates with it via HTTP through the CollectorApi client.
| Method | Purpose | Endpoint Called |
|---|---|---|
| online() | Check collector availability | GET / |
| processDocument(filename, metadata) | Process uploaded file | POST /process |
| processLink(link, scraperHeaders, metadata) | Scrape and process URL | POST /process-link |
| processRawText(textContent, metadata) | Create document from text | POST /raw-text |
| forwardExtensionRequest({endpoint, method, body}) | Forward connector requests | POST /ext/* |
| acceptedFileTypes() | Get supported MIME types | GET /accepts |
The Collector service is essential for document ingestion: if it is not online, document uploads fail with the error "Document processing API is not online."
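The client pattern can be sketched as below. The base URL matches the default port; the request body shape and the CollectorClient name are assumptions consistent with the table above, not the actual CollectorApi implementation.

```javascript
// Sketch of a CollectorApi-style HTTP client for the Collector service.
class CollectorClient {
  constructor(baseUrl = "http://localhost:8888") {
    this.baseUrl = baseUrl;
  }

  // Availability probe: GET / must respond for ingestion to proceed.
  async online() {
    try {
      const res = await fetch(this.baseUrl);
      return res.ok;
    } catch {
      return false;
    }
  }

  // Ask the Collector to parse a file already staged in the hot directory.
  async processDocument(filename, metadata = {}) {
    if (!(await this.online()))
      throw new Error("Document processing API is not online.");
    const res = await fetch(`${this.baseUrl}/process`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ filename, options: { metadata } }),
    });
    return res.json();
  }
}
```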
Sources: server/utils/collectorApi.js, server/endpoints/api/document/index.js 97-116, collector/extensions/index.js 9-227
Processed documents include rich metadata that enables source attribution in chat responses. The citation system uses the chunkSource field to link response snippets back to their original sources.
The frontend citation system recognizes special protocols in chunkSource and renders appropriate icons and links:
| Protocol | Icon | Linkable | Description |
|---|---|---|---|
| link:// | LinkSimple | Yes | Web scraped content |
| youtube:// | YoutubeLogo | Yes | YouTube video transcripts |
| github:// | GithubLogo | Yes | GitHub repository files |
| gitlab:// | GitlabLogo | Yes | GitLab repository files |
| confluence:// | Confluence | Yes | Confluence wiki pages |
| drupalwiki:// | DrupalWiki | Yes | DrupalWiki pages |
| obsidian:// | Obsidian | No | Obsidian vault notes |
| paperless-ngx:// | PaperlessNgx | Yes | Paperless-ngx documents |
| (none) | FileText | No | Generic uploaded files |
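The lookup the frontend performs can be sketched as a prefix table. Icon names mirror the table above; the real component in Citation/index.jsx renders React components and may differ in detail.

```javascript
// Map a chunkSource prefix to its citation icon and link behavior.
const CITATION_PROTOCOLS = {
  "link://": { icon: "LinkSimple", linkable: true },
  "youtube://": { icon: "YoutubeLogo", linkable: true },
  "github://": { icon: "GithubLogo", linkable: true },
  "gitlab://": { icon: "GitlabLogo", linkable: true },
  "confluence://": { icon: "Confluence", linkable: true },
  "drupalwiki://": { icon: "DrupalWiki", linkable: true },
  "obsidian://": { icon: "Obsidian", linkable: false },
  "paperless-ngx://": { icon: "PaperlessNgx", linkable: true },
};

function citationStyle(chunkSource = "") {
  for (const [prefix, style] of Object.entries(CITATION_PROTOCOLS)) {
    if (chunkSource.startsWith(prefix)) return style;
  }
  return { icon: "FileText", linkable: false }; // generic uploaded file
}
```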
Sources: frontend/src/components/WorkspaceChat/ChatContainer/ChatHistory/Citation/index.jsx 212-317
Documents imported from connectors can be marked as "watched" to enable automatic resynchronization when source content changes. This feature requires the experimental_live_file_sync system setting to be enabled.
The background worker (server/jobs/sync-watched-documents.js) periodically checks watched documents and fetches updated content from their sources. Supported document types for watching:
- Web links (link://)
- YouTube transcripts (youtube://)
- Confluence pages (confluence://)
- GitHub files (github://)
- GitLab files (gitlab://)
- DrupalWiki pages (drupalwiki://)

The resync process:

1. The worker identifies stale documents via DocumentSyncQueue.staleDocumentQueues()
2. It calls /ext/resync-source-document with the document's metadata
3. The Collector re-fetches the source and returns updated pageContent
4. The document is updated and a nextSyncAt timestamp is set for the next check

Failed syncs are tracked; after 5 consecutive failures (DocumentSyncQueue.maxRepeatFailures), the document is automatically unwatched.
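The failure-tracking rule can be sketched as a pure state transition. The threshold mirrors DocumentSyncQueue.maxRepeatFailures; the queue record's field names and the interval handling are illustrative.

```javascript
const MAX_REPEAT_FAILURES = 5; // mirrors DocumentSyncQueue.maxRepeatFailures

// Given a queue record and a sync outcome, compute the next record:
// successes reset the failure count, repeated failures unwatch the doc.
function nextQueueState(queue, syncSucceeded, intervalMs = 60 * 60 * 1000) {
  const nextSyncAt = new Date(Date.now() + intervalMs);
  if (syncSucceeded) return { ...queue, failures: 0, watched: true, nextSyncAt };
  const failures = queue.failures + 1;
  return {
    ...queue,
    failures,
    watched: failures < MAX_REPEAT_FAILURES, // auto-unwatch at the limit
    nextSyncAt,
  };
}
```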
Sources: server/models/documentSyncQueue.js 10-244, server/jobs/sync-watched-documents.js 9-106, collector/extensions/resync/index.js 1-203
The chat API endpoints support document attachments with MIME type application/anythingllm-document. These attachments are base64-encoded documents that are:

- Decoded and used as prompt context for the current chat request
- Added to the response's sources array for citation

This enables ephemeral document context for single-turn queries without polluting the document library.
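Constructing such an attachment can be sketched as below. The MIME type comes from the text above; the field names (name, mime, contentString) and the data-URL wrapping are assumptions about the payload shape, not a documented contract.

```javascript
// Sketch: build an ephemeral document attachment for a chat API request.
function buildDocumentAttachment(name, textContent) {
  const encoded = Buffer.from(textContent, "utf8").toString("base64");
  return {
    name,
    mime: "application/anythingllm-document",
    contentString: `data:application/anythingllm-document;base64,${encoded}`,
  };
}
```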
Sources: server/utils/chats/apiChatHandler.js 39-96, server/endpoints/api/workspace/index.js 609-621