This document covers the LLM provider architecture in AnythingLLM, including the common interface that all providers implement, the factory pattern for provider selection, core methods, and provider-specific features. It explains how to work with existing providers and how to add new ones.
For information about using LLM providers in the chat system, see Chat System Architecture. For workspace-level LLM configuration, see Workspace Configuration. For agent-specific provider usage, see Agent System.
AnythingLLM supports 30+ LLM providers through a polymorphic architecture where each provider implements a common interface. This design allows the system to treat all providers uniformly while handling vendor-specific quirks internally.
All LLM provider classes implement the same set of core methods:
| Method | Purpose | Returns |
|---|---|---|
constructPrompt() | Formats messages for provider API | Message array |
getChatCompletion() | Synchronous chat completion | {textResponse, metrics} |
streamGetChatCompletion() | Initiates streaming response | Stream object |
handleStream() | Processes streaming chunks | Promise resolving to full text |
embedTextInput() | Embeds query text | Vector array |
embedChunks() | Embeds document chunks | Vector arrays |
compressMessages() | Fits messages to context window | Compressed message array |
promptWindowLimit() | Returns context window size | Number (tokens) |
isValidChatCompletionModel() | Validates model name | Boolean |
streamingEnabled() | Checks streaming support | Boolean |
Sources: server/utils/AiProviders/openAi/index.js:13-301, server/utils/AiProviders/anthropic/index.js:13-331, server/utils/AiProviders/ollama/index.js:14-489, server/utils/AiProviders/gemini/index.js:27-457
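The shared interface can be sketched as a class skeleton. This is an illustrative stub, not the project's actual code: method names come from the table above, but the bodies, the `EXAMPLE_MODEL_PREF` variable, and the default window size are assumptions.

```javascript
// Hypothetical skeleton of the common provider interface.
// Bodies are placeholders; real providers call vendor SDKs.
class ExampleLLMProvider {
  constructor(embedder = null, modelPreference = null) {
    this.model =
      modelPreference ?? process.env.EXAMPLE_MODEL_PREF ?? "example-model";
    this.embedder = embedder;
    this.limits = { system: 0.15, user: 0.15, history: 0.7 }; // typical split
  }

  // Providers signal streaming support by defining streamGetChatCompletion.
  streamingEnabled() {
    return "streamGetChatCompletion" in this;
  }

  promptWindowLimit() {
    return 8192; // static example; real providers may consult MODEL_MAP or an API
  }

  async isValidChatCompletionModel(_modelName = "") {
    return true; // many providers validate against a fetched model list here
  }

  constructPrompt({
    systemPrompt = "",
    contextTexts = [],
    chatHistory = [],
    userPrompt = "",
  }) {
    const system = {
      role: "system",
      content: `${systemPrompt}${contextTexts.join("\n")}`,
    };
    return [system, ...chatHistory, { role: "user", content: userPrompt }];
  }
}
```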
All provider constructors follow a consistent pattern:
- The model comes from the `modelPreference` parameter or falls back to an environment variable
- The embedder defaults to the `NativeEmbedder` when none is supplied

Sources: server/utils/AiProviders/openAi/index.js:14-34, server/utils/AiProviders/ollama/index.js:18-46, server/utils/AiProviders/gemini/index.js:28-62, server/utils/AiProviders/anthropic/index.js:14-40
Provider selection uses a factory pattern (mentioned in Diagram 5 of the high-level overview). The getLLMProvider() function:
- Reads the `LLM_PROVIDER` environment variable or the workspace-level override
- Instantiates and returns the matching provider class

This allows runtime provider switching without code changes.
Sources: Diagram 5 from high-level architecture
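A minimal sketch of such a factory follows; the provider keys and class names are illustrative and the returned objects stand in for real provider instances:

```javascript
// Hedged sketch of a provider factory keyed on an env variable or a
// workspace override. Names are examples, not the project's actual map.
function getLLMProviderSketch({ provider = null } = {}) {
  // A workspace-level override wins; otherwise fall back to the env variable.
  const selected = provider ?? process.env.LLM_PROVIDER ?? "openai";
  switch (selected) {
    case "openai":
      return { name: "OpenAiLLM" }; // real code: return new OpenAiLLM(embedder, model)
    case "anthropic":
      return { name: "AnthropicLLM" };
    case "ollama":
      return { name: "OllamaAILLM" };
    default:
      throw new Error(`Unknown LLM provider: ${selected}`);
  }
}
```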
The constructPrompt() method formats input into provider-specific message arrays. All providers accept the same parameters:
Message Structure:
The system message combines `systemPrompt` with the formatted `contextTexts`. Context formatting uses the `#appendContext()` helper:
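An illustrative re-implementation of that helper is shown below; the exact tag wording is an assumption based on the description that follows:

```javascript
// Sketch of the context-wrapping helper: each retrieved chunk is wrapped
// in numbered XML-style tags so the model can tell chunks apart.
function appendContext(contextTexts = []) {
  if (!contextTexts || contextTexts.length === 0) return "";
  return (
    "\nContext:\n" +
    contextTexts
      .map((text, i) => `[CONTEXT ${i}]:\n${text}\n[END CONTEXT ${i}]\n\n`)
      .join("")
  );
}
```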
This wraps each context chunk in XML-style tags for clear delineation.
Attachment Handling:
Providers generate content arrays for multimodal messages:
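A sketch of that transformation, using OpenAI-style content parts (the `contentString` field holding a base64 data URL is an assumption):

```javascript
// Turn a user prompt plus image attachments into a multimodal content array.
// With no attachments, providers typically send a plain string.
function formatUserContent(userPrompt, attachments = []) {
  if (attachments.length === 0) return userPrompt;
  const content = [{ type: "text", text: userPrompt }];
  for (const attachment of attachments) {
    content.push({
      type: "image_url",
      image_url: { url: attachment.contentString }, // base64 data URL
    });
  }
  return content;
}
```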
Provider Variations:
| Provider | Message Format | Special Handling |
|---|---|---|
| OpenAI | Standard {role, content} | input_text type for O-models |
| Anthropic | Standard, system separate | System as separate parameter |
| Ollama | Standard | Spreads attachments with formatChatHistory |
| Gemini | Standard | Models without system support use user/assistant emulation |
| Azure | Standard | O-type models use user role for system |
Sources: server/utils/AiProviders/openAi/index.js:107-126, server/utils/AiProviders/ollama/index.js:246-265, server/utils/AiProviders/gemini/index.js:341-378, server/utils/AiProviders/anthropic/index.js:128-148, server/utils/AiProviders/azureOpenAi/index.js:129-148
Synchronous completion method that returns the full response at once. The standard implementation:

1. Validates the model via `isValidChatCompletionModel()`
2. Wraps the API call in `LLMPerformanceMonitor.measureAsyncFunction()`
3. Returns a `{textResponse, metrics}` object

Metrics Structure:
- `prompt_tokens` - Input tokens consumed
- `completion_tokens` - Output tokens generated
- `total_tokens` - Sum of prompt + completion
- `outputTps` - Tokens per second (completion/duration)
- `duration` - API call duration in seconds
- `model` - Model identifier used
- `provider` - Provider class name
- `timestamp` - Completion time

Sources: server/utils/AiProviders/openAi/index.js:146-183, server/utils/AiProviders/gemini/index.js:380-413, server/utils/AiProviders/anthropic/index.js:150-183
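To make the shape concrete, here is a simplified sketch of a monitored completion call. The monitor API is reduced to a timing wrapper; `sendToApi` and the response shape are assumptions:

```javascript
// Minimal stand-in for LLMPerformanceMonitor.measureAsyncFunction():
// times an async call and reports the duration in seconds.
async function measureAsyncFunction(fn) {
  const start = Date.now();
  const output = await fn();
  const duration = (Date.now() - start) / 1000;
  return { output, duration };
}

// Sketch of a getChatCompletion() implementation assembling the metrics
// fields listed above from the API's usage block.
async function getChatCompletionSketch(sendToApi, messages) {
  const result = await measureAsyncFunction(() => sendToApi(messages));
  const usage = result.output.usage ?? {};
  const completionTokens = usage.completion_tokens ?? 0;
  return {
    textResponse: result.output.text,
    metrics: {
      prompt_tokens: usage.prompt_tokens ?? 0,
      completion_tokens: completionTokens,
      total_tokens: usage.total_tokens ?? 0,
      outputTps: result.duration > 0 ? completionTokens / result.duration : 0,
      duration: result.duration,
    },
  };
}
```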
Initiates streaming response. Returns a MonitoredStream object wrapped by LLMPerformanceMonitor:
The `measureStream()` wrapper attaches an `endMeasurement()` method to the stream for finalizing metrics once streaming completes.

Sources: server/utils/AiProviders/openAi/index.js:185-206, server/utils/AiProviders/ollama/index.js:321-342, server/utils/AiProviders/gemini/index.js:415-433
Processes streaming chunks and writes to HTTP response. Standard pattern:
Chunk Processing Variations:
| Provider | Chunk Format | Notes |
|---|---|---|
| OpenAI | {type, delta} | Uses response.output_text.delta |
| Anthropic | {type, delta} | Custom event types content_block_delta, message_stop |
| Ollama | {message: {content}} | Special handling for thinking property (reasoning tokens) |
| Gemini | Standard OpenAI format | Via OpenAI-compatible endpoint |
Some providers use handleDefaultStreamResponseV2() helper for standard OpenAI-format streams.
Sources: server/utils/AiProviders/openAi/index.js:208-282, server/utils/AiProviders/ollama/index.js:351-469, server/utils/AiProviders/anthropic/index.js:211-296
Ensures messages fit within context window by intelligently truncating. Two strategies exist:
Array Compressor (for OpenAI-format messages):
String Compressor (for providers with different formats):
The compressor allocates the context window according to `this.limits` (typically a 15%/15%/70% split).

Sources: server/utils/AiProviders/openAi/index.js:292-296, server/utils/AiProviders/anthropic/index.js:310-318, server/utils/AiProviders/ollama/index.js:479-484
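A toy version of the idea is sketched below. The real compressor uses a proper tokenizer and the per-section limits; here token counts are approximated as characters/4, which is purely illustrative:

```javascript
// Illustrative message compressor: evicts the oldest history messages
// until the estimated token count fits the window. The 4-chars-per-token
// estimate is a stand-in for the real tokenizer.
function compressMessagesSketch(messages, windowLimit) {
  const estimate = (msgs) =>
    msgs.reduce((sum, m) => sum + Math.ceil(String(m.content).length / 4), 0);
  const [system, ...rest] = messages;
  const compressed = [...rest];
  // Always keep the system prompt and the most recent message.
  while (compressed.length > 1 && estimate([system, ...compressed]) > windowLimit) {
    compressed.shift(); // evict oldest history first
  }
  return [system, ...compressed];
}
```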
Context window management varies significantly across providers due to different model capabilities and API behaviors.
Static Context Windows:
Some providers have fixed limits defined in MODEL_MAP:
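The pattern amounts to a static lookup table with a fallback. The model names and limits below are examples, not the project's actual `MODEL_MAP` contents:

```javascript
// Example static context-window table and lookup with a conservative fallback.
const MODEL_MAP_EXAMPLE = {
  "gpt-4o": 128000,
  "claude-3-5-sonnet": 200000,
};

function promptWindowLimitSketch(modelName, fallback = 4096) {
  return MODEL_MAP_EXAMPLE[modelName] ?? fallback;
}
```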
Dynamic Context Windows: Others fetch limits from provider APIs:
Ollama Context Window Caching:
Ollama caches context windows on first initialization:
User Override: User-defined limits take precedence but cannot exceed system limits:
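That precedence rule reduces to a clamp; a minimal sketch, assuming the user limit arrives as a string or number from configuration:

```javascript
// Clamp a user-defined context window to the system-discovered limit.
// A missing or non-numeric user value falls back to the system limit.
function effectiveWindow(userLimit, systemLimit) {
  const parsed = Number(userLimit);
  if (!userLimit || Number.isNaN(parsed)) return systemLimit;
  return Math.min(parsed, systemLimit);
}
```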
Sources: server/utils/AiProviders/ollama/index.js:78-115, server/utils/AiProviders/ollama/index.js:170-207, server/utils/AiProviders/lmStudio/index.js:73-112
Gemini uses filesystem caching for model metadata:
Cache structure:
Cache expires after 1 day (8.64e7 milliseconds).
Sources: server/utils/AiProviders/gemini/index.js:172-302, server/utils/AiProviders/gemini/index.js:81-89, server/utils/AiProviders/gemini/index.js:107-141
All providers use a similar allocation strategy (15%/15%/70%):
Lazy initialization pattern for performance:
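The lazy pattern can be sketched with a getter that constructs the SDK client on first access. Class and field names here are illustrative:

```javascript
// Sketch of lazy client initialization: the underlying SDK client is
// only constructed on first use and then reused.
class LazyClientProvider {
  #client = null;

  get client() {
    if (this.#client === null) {
      this.#client = { createdAt: Date.now() }; // real code: new OpenAI({ ... })
    }
    return this.#client;
  }
}
```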
Sources: server/utils/AiProviders/openAi/index.js:23-27, server/utils/AiProviders/ollama/index.js:56-67
All provider calls are wrapped in LLMPerformanceMonitor for metrics collection.
For synchronous completions:
The monitor times the call and derives the metrics fields listed above.
Sources: server/utils/AiProviders/openAi/index.js:152-163, server/utils/AiProviders/gemini/index.js:381-392
For streaming completions:
The monitored stream:

- Counts prompt tokens when `runPromptTokenCalculation: true`
- Exposes an `endMeasurement(usage)` method to finalize metrics

Token Counting:
When runPromptTokenCalculation: true, the monitor uses tiktoken or similar to count tokens since some providers don't return usage metadata.
Finalizing Metrics:
Stream handlers call endMeasurement() with final usage data:
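The shape of that finalization can be sketched as follows; the monitor internals are simplified and the returned field names mirror the metrics structure described earlier:

```javascript
// Illustrative monitored stream: records the start time and exposes
// endMeasurement(usage) to compute duration and throughput.
function makeMonitoredStream() {
  const start = Date.now();
  return {
    endMeasurement(usage = {}) {
      const duration = (Date.now() - start) / 1000;
      const completion = usage.completion_tokens ?? 0;
      return {
        ...usage,
        duration,
        outputTps: duration > 0 ? completion / duration : 0,
      };
    },
  };
}
```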
Sources: server/utils/AiProviders/ollama/index.js:322-342, server/utils/AiProviders/openAi/index.js:191-206, server/utils/AiProviders/ollama/index.js:367-393

Sources: server/utils/AiProviders/openAi/index.js:146-183, server/utils/AiProviders/ollama/index.js:321-342, server/utils/AiProviders/ollama/index.js:351-469
Ollama v0.9.0+ supports reasoning tokens via the `thinking` property. These are wrapped in `<think>...</think>` tags so the chat UI can render the model's reasoning separately.

**Streaming:**

```javascript
// When reasoning ends and regular content begins, close the <think> block
if (reasoningText.length > 0 && content.length > 0) {
  const endTag = "</think>";
  writeResponseChunk(response, { /* ...other chunk fields... */ textResponse: endTag });
  fullText += reasoningText + endTag;
  reasoningText = "";
}
fullText += content;
writeResponseChunk(response, { /* ...other chunk fields... */ textResponse: content });
```

**Non-streaming:**

```javascript
let content = res.message.content;
if (res.message.thinking)
  content = `<think>${res.message.thinking}</think>${content}`;
```
This makes reasoning visible in the chat UI for models that support chain-of-thought.
Custom Timeout Support:
Ollama allows custom timeouts for slow responses via OLLAMA_RESPONSE_TIMEOUT:
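A sketch of reading that timeout from the environment follows; the five-minute default is an assumption for illustration, not the project's documented value:

```javascript
// Read a custom response timeout (in ms) from OLLAMA_RESPONSE_TIMEOUT,
// falling back to an assumed 5-minute default for slow local models.
function ollamaTimeoutMs() {
  const parsed = Number(process.env.OLLAMA_RESPONSE_TIMEOUT);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : 5 * 60 * 1000;
}
```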
Sources: server/utils/AiProviders/ollama/index.js:398-451, server/utils/AiProviders/ollama/index.js:282-284, server/utils/AiProviders/ollama/index.js:135-164
Anthropic supports prompt caching to reduce costs for repeated system prompts:
System Prompt Builder:
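A hedged sketch of such a builder, using Anthropic's `cache_control` block field; the exact mapping from the env variable is an assumption consistent with the options listed below:

```javascript
// Build the system prompt as a content block, attaching cache_control
// when ANTHROPIC_CACHE_CONTROL is set to a supported TTL.
function buildSystemPrompt(systemText) {
  const ttl = process.env.ANTHROPIC_CACHE_CONTROL; // "5m", "1h", or unset
  const block = { type: "text", text: systemText };
  if (ttl === "5m" || ttl === "1h")
    block.cache_control = { type: "ephemeral", ttl };
  return [block];
}
```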
Configuration via ANTHROPIC_CACHE_CONTROL:
- `5m` - Cache for 5 minutes
- `1h` - Cache for 1 hour
- `none` or unset - No caching

Benefits:
Sources: server/utils/AiProviders/anthropic/index.js:73-86, server/utils/AiProviders/anthropic/index.js:93-102, frontend/src/components/LLMSelection/AnthropicAiOptions/index.jsx:64-85
Gemini distinguishes between stable (v1) and experimental (v1beta) models:
All models use the v1beta OpenAI-compatible endpoint:
System Prompt Support:
Some models don't support system prompts (Gemma variants):
Emulation for unsupported models:
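The emulation amounts to replaying the system text as a user/assistant exchange. A sketch, where the wording of the assistant acknowledgement is an assumption:

```javascript
// Emulate a system prompt for models without system-role support by
// prepending a synthetic user/assistant turn to the chat history.
function emulateSystemPrompt(systemPrompt, chatHistory = []) {
  return [
    { role: "user", content: systemPrompt },
    { role: "assistant", content: "Okay, I'll follow those instructions." },
    ...chatHistory,
  ];
}
```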
Frontend Model Grouping:
Gemini models are grouped by stability in UI:
Sources: server/utils/AiProviders/gemini/index.js:149-163, server/utils/AiProviders/gemini/index.js:39-44, server/utils/AiProviders/gemini/index.js:70-72, server/utils/AiProviders/gemini/index.js:349-368, frontend/src/components/LLMSelection/GeminiLLMOptions/index.jsx:74-80
Azure OpenAI deployments support reasoning models (o1, o1-mini, o3-mini) which require different handling:
System Message Handling:
O-type models don't accept system role:
Temperature Override:
O-type models don't support temperature:
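Both adjustments can be sketched together; this is an illustrative request builder, not the actual Azure provider code, keyed on the `AZURE_OPENAI_MODEL_TYPE` flag described below:

```javascript
// For reasoning ("O-type") deployments: remap system messages to the
// user role and omit the unsupported temperature parameter.
function azureRequestSketch(messages, temperature) {
  const isReasoning = process.env.AZURE_OPENAI_MODEL_TYPE === "reasoning";
  const mapped = isReasoning
    ? messages.map((m) => (m.role === "system" ? { ...m, role: "user" } : m))
    : messages;
  const request = { messages: mapped };
  if (!isReasoning) request.temperature = temperature;
  return request;
}
```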
Note: Azure doesn't expose model metadata, so users must manually configure AZURE_OPENAI_MODEL_TYPE="reasoning" for reasoning models.
Sources: server/utils/AiProviders/azureOpenAi/index.js:30-31, server/utils/AiProviders/azureOpenAi/index.js:137-138, server/utils/AiProviders/azureOpenAi/index.js:160
TogetherAI fetches and caches model list from API:
Context Window Lookup:
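A sketch of that lookup against the cached list; the `{ id, context_length }` entry shape is an assumption based on TogetherAI's model-list response:

```javascript
// Look up a model's context length in the cached model list, with a
// conservative fallback when the model is not found.
function contextWindowFor(modelId, cachedModels, fallback = 4096) {
  const entry = cachedModels.find((m) => m.id === modelId);
  return entry?.context_length ?? fallback;
}
```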
Sources: server/utils/AiProviders/togetherAi/index.js:18-78, server/utils/AiProviders/togetherAi/index.js:148-152
To add a new LLM provider, follow this implementation checklist:
Create server/utils/AiProviders/{provider-name}/index.js:
Context Formatting:
Attachment Handling:
Construct Prompt:
Sync Completion:
Stream Completion:
Handle Stream:
Utility Methods:
Embedding Methods:
Message Compression:
Add to getLLMProvider() factory (location varies, typically in a provider loader):
Create frontend/src/components/LLMSelection/NewProviderOptions/index.jsx:
If provider supports dynamic model listing:
Add to server/.env.example:
- `constructPrompt()` formats messages correctly
- `getChatCompletion()` returns a valid response
- `streamGetChatCompletion()` streams chunks properly
- `handleStream()` processes all chunk types

Sources: server/utils/AiProviders/openAi/index.js:13-301, server/utils/AiProviders/mistral/index.js:10-191, frontend/src/components/LLMSelection/GeminiLLMOptions/index.jsx:4-138, frontend/src/hooks/useGetProvidersModels.js:58-86
| Provider | API Type | Streaming | Vision | Context Window Source | Model Discovery | Special Features |
|---|---|---|---|---|---|---|
| OpenAI | OpenAI SDK | ✓ | ✓ | MODEL_MAP + API | API models endpoint | O-model support |
| Anthropic | Anthropic SDK | ✓ | ✓ | MODEL_MAP | Hardcoded list | Prompt caching |
| Ollama | Ollama SDK | ✓ | ✓ | API discovery | /api/tags endpoint | Reasoning tokens, custom timeout |
| Gemini | OpenAI-compatible | ✓ | ✓ | API discovery (cached) | /v1/models + /v1beta/models | Experimental models, system prompt emulation |
| Azure OpenAI | OpenAI SDK | ✓ | ✓ | Environment variable | User-configured | O-type model support |
| LM Studio | OpenAI SDK | ✓ | ✓ | API discovery | /api/v0/models | Multi-model chat |
| TogetherAI | OpenAI SDK | ✓ | ✓ | API discovery (cached) | /v1/models | Large model catalog |
| HuggingFace | OpenAI-compatible | ✓ | ✗ | Environment variable | N/A | Inference endpoints |
| LocalAI | OpenAI SDK | ✓ | ✓ | Environment variable | User-configured | Self-hosted |
| Mistral | OpenAI SDK | ✓ | ✓ | Fixed (32000) | Hardcoded | - |
Sources: All provider implementation files listed above