This document describes the common architecture shared by all LLM providers in AnythingLLM. It covers the interface contract, factory pattern, initialization lifecycle, context window management, and embedding engine pairing. For information about implementing new providers, see Adding New LLM Providers. For model discovery and caching strategies, see Model Management and Discovery.
AnythingLLM supports 30+ LLM providers through a polymorphic architecture where each provider implements a common interface while handling vendor-specific API interactions internally. All providers follow the same initialization pattern, expose the same core methods, and integrate with the same performance monitoring and embedding systems.
The architecture enforces:

- A common interface of core methods (`getChatCompletion`, `streamGetChatCompletion`, `constructPrompt`, etc.)
- A shared default embedding engine (`NativeEmbedder`)
- Unified performance tracking via `LLMPerformanceMonitor`

Example Providers: `OllamaAILLM`, `OpenAiLLM`, `GeminiLLM`, `AnthropicLLM`, `AzureOpenAiLLM`, `LMStudioLLM`, `TogetherAiLLM`, `MistralLLM`, `HuggingFaceLLM`, `LocalAiLLM`
Sources: server/utils/AiProviders/ollama/index.js14-46 server/utils/AiProviders/openAi/index.js13-34 server/utils/AiProviders/anthropic/index.js13-40
All provider classes implement a standard set of methods and properties. This interface enables the chat engine to interact with any provider without knowing implementation details.
Sources: server/utils/AiProviders/ollama/index.js14-489 server/utils/AiProviders/openAi/index.js13-301 server/utils/AiProviders/anthropic/index.js13-331 server/utils/AiProviders/gemini/index.js27-457
| Method | Purpose | Return Type | Required |
|---|---|---|---|
| `constructor(embedder, modelPreference)` | Initialize provider with optional custom embedder and model | N/A | Yes |
| `streamingEnabled()` | Check if streaming is supported | `boolean` | Yes |
| `promptWindowLimit()` | Get maximum context window in tokens | `number` | Yes |
| `isValidChatCompletionModel(modelName)` | Validate model name | `Promise<boolean>` | Yes |
| `constructPrompt(promptArgs)` | Build message array from components | `Array<Message>` | Yes |
| `getChatCompletion(messages, options)` | Synchronous completion request | `Promise<{textResponse, metrics}>` | Yes |
| `streamGetChatCompletion(messages, options)` | Streaming completion request | `Promise<Stream>` | Yes |
| `handleStream(response, stream, responseProps)` | Process streaming response chunks | `Promise<string>` | Yes |
| `embedTextInput(textInput)` | Generate embeddings for text | `Promise<Array<number>>` | Yes |
| `embedChunks(textChunks)` | Generate embeddings for multiple chunks | `Promise<Array<Array<number>>>` | Yes |
| `compressMessages(promptArgs, rawHistory)` | Compress prompt to fit context window | `Promise<Array<Message>>` | Yes |
Sources: server/utils/AiProviders/ollama/index.js166-484 server/utils/AiProviders/openAi/index.js52-296
Every provider instance exposes these properties:
The limits object allocates context window space proportionally: system prompt (15%), history (15%), and user input (70%). These values are recalculated when the model changes.
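The proportional split described above can be sketched as follows (function and property names are illustrative, not the actual AnythingLLM source):

```javascript
// Hypothetical sketch of the 15/15/70 context-window allocation.
function buildLimits(promptWindowLimit) {
  return {
    system: Math.floor(promptWindowLimit * 0.15), // 15% for system prompt
    history: Math.floor(promptWindowLimit * 0.15), // 15% for chat history
    user: Math.floor(promptWindowLimit * 0.7), // 70% for user input
  };
}

const limits = buildLimits(4096);
console.log(limits); // { system: 614, history: 614, user: 2867 }
```

Recomputing these values whenever the model changes keeps the allocation correct when a workspace switches to a model with a different context window.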
Sources: server/utils/AiProviders/ollama/index.js18-46 server/utils/AiProviders/openAi/index.js22-34
Provider instantiation uses a factory function (not shown in provided files but referenced) that selects the correct class based on configuration. The factory checks both system-level settings (process.env.LLM_PROVIDER) and workspace-level overrides (workspace.chatProvider).
The factory also handles embedding engine selection. If no custom embedder is provided, it defaults to NativeEmbedder, which uses local Xenova/transformers models for offline operation.
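Since the factory itself is not shown in the provided files, here is a hedged sketch of the selection logic using stub classes; the real factory's name, paths, and signatures may differ:

```javascript
// Stub provider classes standing in for the real implementations.
class OpenAiLLM { constructor(embedder, model) { this.name = "openai"; this.model = model; this.embedder = embedder; } }
class OllamaAILLM { constructor(embedder, model) { this.name = "ollama"; this.model = model; this.embedder = embedder; } }
class NativeEmbedder {} // default embedder when none is supplied

const PROVIDERS = { openai: OpenAiLLM, ollama: OllamaAILLM };

function getLLMProvider({ provider = null, model = null } = {}) {
  // Workspace-level override wins; otherwise fall back to the system setting.
  const selected = provider ?? process.env.LLM_PROVIDER ?? "openai";
  const ProviderClass = PROVIDERS[selected];
  if (!ProviderClass) throw new Error(`Unknown LLM provider: ${selected}`);
  // No custom embedder supplied -> default to NativeEmbedder.
  return new ProviderClass(new NativeEmbedder(), model);
}

const llm = getLLMProvider({ provider: "ollama", model: "llama3" });
console.log(llm.name); // "ollama"
```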
Sources: Reference to factory pattern in server/utils/AiProviders/ollama/index.js18 server/utils/AiProviders/openAi/index.js14
Provider constructors follow a consistent initialization pattern:
Sources: server/utils/AiProviders/ollama/index.js18-46 server/utils/AiProviders/gemini/index.js28-62 server/utils/AiProviders/anthropic/index.js14-40
Each provider validates required configuration in the constructor:
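A hedged sketch of that check, modeled on the OpenAI provider (the real providers throw their own per-provider messages, and the exact env var handling is an assumption):

```javascript
// Fail-fast: the constructor throws immediately when required config is missing.
class OpenAiLLMSketch {
  constructor(embedder = null, modelPreference = null) {
    if (!process.env.OPEN_AI_KEY) throw new Error("No OpenAI API key was set.");
    this.model = modelPreference ?? process.env.OPEN_MODEL_PREF ?? "gpt-4o";
    this.embedder = embedder;
  }
}

try {
  delete process.env.OPEN_AI_KEY;
  new OpenAiLLMSketch();
} catch (e) {
  console.log(e.message); // "No OpenAI API key was set."
}
```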
This fail-fast approach prevents runtime errors when providers are selected but not properly configured.
Sources: server/utils/AiProviders/ollama/index.js19-20 server/utils/AiProviders/openAi/index.js15 server/utils/AiProviders/anthropic/index.js15-16
Providers wrap vendor SDKs with consistent interfaces. Most use the OpenAI SDK for compatibility:
| Provider | SDK/Client | Base URL |
|---|---|---|
| `OllamaAILLM` | `ollama` NPM package | `process.env.OLLAMA_BASE_PATH` |
| `OpenAiLLM` | `openai` SDK | OpenAI default endpoint |
| `GeminiLLM` | `openai` SDK | `https://generativelanguage.googleapis.com/v1beta/openai/` |
| `AnthropicLLM` | `@anthropic-ai/sdk` | Anthropic default endpoint |
| `AzureOpenAiLLM` | `openai` SDK | Formatted from `AZURE_OPENAI_ENDPOINT` |
| `LMStudioLLM` | `openai` SDK | `process.env.LMSTUDIO_BASE_PATH/v1` |
| `TogetherAiLLM` | `openai` SDK | `https://api.together.xyz/v1` |
| `MistralLLM` | `openai` SDK | `https://api.mistral.ai/v1` |
The OpenAI SDK is used widely because many providers implement OpenAI-compatible APIs, simplifying integration.
Sources: server/utils/AiProviders/ollama/index.js33-37 server/utils/AiProviders/gemini/index.js40-44 server/utils/AiProviders/anthropic/index.js20-24 server/utils/AiProviders/togetherAi/index.js84-89
Context window management is critical for preventing token limit errors. Each provider reports its maximum context window and dynamically allocates space between system prompt, chat history, and user input.
Sources: server/utils/AiProviders/ollama/index.js170-207 server/utils/AiProviders/gemini/index.js107-141 server/utils/AiProviders/openAi/index.js56-62
Static Providers (OpenAI, Anthropic, Mistral): Context windows are hardcoded in MODEL_MAP:
Dynamic Providers (Ollama, LMStudio, Gemini): Context windows are fetched from provider APIs and cached:
Caching prevents API calls on every chat request. Ollama and LMStudio cache context windows in static properties; Gemini writes to filesystem (storage/models/gemini/models.json).
Sources: server/utils/AiProviders/ollama/index.js78-115 server/utils/AiProviders/lmStudio/index.js73-112 server/utils/AiProviders/gemini/index.js172-302
Providers respect user-defined token limits via environment variables:
- `OLLAMA_MODEL_TOKEN_LIMIT`
- `LMSTUDIO_MODEL_TOKEN_LIMIT`
- `AZURE_OPENAI_TOKEN_LIMIT`
- `LOCAL_AI_MODEL_TOKEN_LIMIT`
- `HUGGING_FACE_LLM_TOKEN_LIMIT`

These limits are enforced as the minimum of the user-defined and system-detected values to prevent exceeding actual model capabilities:
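A minimal sketch of that clamping, assuming illustrative names (the per-provider code differs in detail):

```javascript
// Clamp a user-defined token limit against the detected model maximum.
function effectiveTokenLimit(userLimitEnv, detectedLimit) {
  const userLimit = Number(userLimitEnv);
  // Ignore missing or invalid user settings and trust the detected value.
  if (!userLimitEnv || Number.isNaN(userLimit) || userLimit <= 0) return detectedLimit;
  return Math.min(userLimit, detectedLimit);
}

console.log(effectiveTokenLimit("4096", 8192)); // 4096 (user limit wins)
console.log(effectiveTokenLimit(undefined, 8192)); // 8192 (no user limit set)
console.log(effectiveTokenLimit("999999", 8192)); // 8192 (capped at model max)
```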
Sources: server/utils/AiProviders/ollama/index.js178-191 server/utils/AiProviders/lmStudio/index.js138-153
When prompts exceed context limits, the compressMessages method truncates history intelligently:
The messageArrayCompressor (or messageStringCompressor for Anthropic) uses the provider's limits object to determine how many tokens to allocate for each component (system prompt, history, user input).
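The history-truncation idea can be sketched as below; this is a simplified stand-in, not the actual `messageArrayCompressor` (which also handles system prompt and user input budgets), and the word-count tokenizer is a crude placeholder:

```javascript
// Crude token estimate: word count stands in for a real tokenizer.
const countTokens = (msg) => msg.content.split(/\s+/).length;

// Drop the oldest messages until history fits its token budget,
// walking newest-to-oldest so the most recent turns survive.
function compressHistory(history, historyTokenBudget) {
  const kept = [];
  let used = 0;
  for (const msg of [...history].reverse()) {
    const cost = countTokens(msg);
    if (used + cost > historyTokenBudget) break;
    kept.unshift(msg); // restore chronological order
    used += cost;
  }
  return kept;
}

const history = [
  { role: "user", content: "first question about setup" },
  { role: "assistant", content: "a long detailed answer with many words here" },
  { role: "user", content: "latest question" },
];
console.log(compressHistory(history, 10).length); // 2 (oldest message dropped)
```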
Sources: server/utils/AiProviders/ollama/index.js479-484 server/utils/AiProviders/anthropic/index.js310-318
All providers are paired with an embedding engine for generating query embeddings during RAG similarity search. The pairing happens in the constructor:
The provider's embedTextInput and embedChunks methods delegate to the paired embedder:
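The delegation can be sketched with a stub embedder (illustrative names; the real embedders return model-generated vectors):

```javascript
// Stub embedder returning a fake vector so the pattern is runnable offline.
class StubEmbedder {
  async embedTextInput(text) { return [text.length, 0, 0]; }
  async embedChunks(chunks) { return Promise.all(chunks.map((c) => this.embedTextInput(c))); }
}

// The provider simply forwards embedding calls to whatever it was paired with.
class ProviderSketch {
  constructor(embedder) { this.embedder = embedder; }
  async embedTextInput(textInput) { return this.embedder.embedTextInput(textInput); }
  async embedChunks(textChunks) { return this.embedder.embedChunks(textChunks); }
}

(async () => {
  const provider = new ProviderSketch(new StubEmbedder());
  console.log(await provider.embedTextInput("hello")); // [ 5, 0, 0 ]
})();
```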
This delegation pattern allows the chat engine to call provider.embedTextInput() without knowing which embedding engine is active. The system-level embedding provider configuration (process.env.EMBEDDING_ENGINE) determines which embedder is instantiated and passed to LLM providers.
Sources: server/utils/AiProviders/ollama/index.js38 server/utils/AiProviders/ollama/index.js472-477 server/utils/AiProviders/openAi/index.js29
NativeEmbedder is the default for offline operation. It uses Xenova/transformers to run embedding models locally (e.g., Xenova/all-MiniLM-L6-v2). This enables AnythingLLM to function without external API dependencies for embeddings.
Sources: server/utils/EmbeddingEngines/native (referenced but not in provided files)
All provider API calls are wrapped in LLMPerformanceMonitor to track metrics:
Sources: server/utils/AiProviders/ollama/index.js268-319 server/utils/AiProviders/openAi/index.js152-183
Every completion returns a standardized metrics object:
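The field names below follow what the performance monitor reports (token counts, duration, output tokens-per-second), but treat the exact shape as an assumption:

```javascript
// Illustrative construction of the standardized metrics object.
function buildMetrics({ promptTokens, completionTokens, durationSeconds }) {
  return {
    prompt_tokens: promptTokens,
    completion_tokens: completionTokens,
    total_tokens: promptTokens + completionTokens,
    outputTps: completionTokens / durationSeconds, // output tokens per second
    duration: durationSeconds,
  };
}

console.log(buildMetrics({ promptTokens: 120, completionTokens: 60, durationSeconds: 2 }));
// { prompt_tokens: 120, completion_tokens: 60, total_tokens: 180, outputTps: 30, duration: 2 }
```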
Sources: server/utils/AiProviders/ollama/index.js305-318 server/utils/AiProviders/openAi/index.js169-182
Streaming responses use LLMPerformanceMonitor.measureStream():
The monitor wraps the stream and tracks metrics in real-time. Providers call stream.endMeasurement(usage) when streaming completes to finalize metrics.
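A toy stand-in for that wrapping, assuming illustrative behavior (the real `measureStream` does proper token accounting rather than counting chunks):

```javascript
// Wrap an async iterable, count chunks as they pass through, and expose
// endMeasurement() for the provider to call when streaming completes.
function measureStream(stream) {
  const start = Date.now();
  let chunkCount = 0;
  const wrapped = (async function* () {
    for await (const chunk of stream) {
      chunkCount += 1; // crude stand-in for real token counting
      yield chunk;
    }
  })();
  wrapped.endMeasurement = (usage = {}) => ({
    completion_tokens: chunkCount,
    ...usage, // vendor-reported usage overrides the local count
    duration: (Date.now() - start) / 1000,
  });
  return wrapped;
}

(async () => {
  async function* fakeStream() { yield "Hel"; yield "lo"; }
  const stream = measureStream(fakeStream());
  let text = "";
  for await (const chunk of stream) text += chunk;
  console.log(text, stream.endMeasurement().completion_tokens); // Hello 2
})();
```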
Sources: server/utils/AiProviders/ollama/index.js322-342 server/utils/AiProviders/gemini/index.js416-433
All providers implement constructPrompt() to transform RAG components into the message format their API expects:
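A minimal OpenAI-style sketch of that transformation; the argument names mirror this page, but the exact context delimiters and message shape vary per provider:

```javascript
// Build a message array from RAG components: system prompt + formatted
// context, then chat history, then the current user input.
function constructPrompt({ systemPrompt = "", contextTexts = [], chatHistory = [], userPrompt = "" }) {
  const context = contextTexts
    .map((text, i) => `[CONTEXT ${i}]:\n${text}\n[END CONTEXT ${i}]\n\n`)
    .join("");
  const system = { role: "system", content: `${systemPrompt}${context}` };
  return [system, ...chatHistory, { role: "user", content: userPrompt }];
}

const messages = constructPrompt({
  systemPrompt: "Answer using the context. ",
  contextTexts: ["AnythingLLM supports 30+ providers."],
  chatHistory: [{ role: "user", content: "hi" }, { role: "assistant", content: "hello" }],
  userPrompt: "How many providers are supported?",
});
console.log(messages.length); // 4
```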
Sources: server/utils/AiProviders/ollama/index.js246-265 server/utils/AiProviders/openAi/index.js107-126
Context from similarity search is formatted and appended to the system prompt:
This ensures context is clearly delimited for the model to reference during response generation.
Sources: server/utils/AiProviders/ollama/index.js117-127 server/utils/AiProviders/openAi/index.js40-50
For providers supporting vision, attachments are converted to content arrays:
This allows models like GPT-4 Vision, Claude 3, and Gemini to analyze images alongside text prompts.
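A sketch of that conversion using OpenAI-style `image_url` content parts (the attachment field names here are assumptions):

```javascript
// Convert attachments into a multimodal content array; plain text passes
// through unchanged when there are no attachments.
function attachmentsToContent(userPrompt, attachments = []) {
  if (!attachments.length) return userPrompt;
  return [
    { type: "text", text: userPrompt },
    ...attachments.map((att) => ({
      type: "image_url",
      image_url: { url: att.contentString }, // base64 data URL from the upload
    })),
  ];
}

const content = attachmentsToContent("What is in this image?", [
  { contentString: "data:image/png;base64,iVBORw0KGgo=" },
]);
console.log(content.length); // 2
```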
Sources: server/utils/AiProviders/openAi/index.js87-100 server/utils/AiProviders/gemini/index.js320-334
While all providers implement the common interface, many have unique capabilities exposed through additional methods or properties.
| Provider | Feature | Implementation |
|---|---|---|
| `OllamaAILLM` | Reasoning token support | Wraps `message.thinking` in `<think>` tags |
| `OllamaAILLM` | Custom fetch timeout | `#applyFetch()` uses undici `Agent` for extended timeouts |
| `AnthropicLLM` | Prompt caching | `cacheControl` getter with 5m/1h TTL options |
| `GeminiLLM` | Experimental models | `isExperimentalModel()` checks v1beta API access |
| `GeminiLLM` | System prompt emulation | Models without system support use user/assistant workaround |
| `AzureOpenAiLLM` | O-type model handling | `isOTypeModel` flag changes message format (system → user) |
| `LMStudioLLM` | Dynamic model discovery | Fetches available models via `/api/v0/models` |
| `TogetherAiLLM` | Model caching | Caches model list to disk for 1 week |
Sources: server/utils/AiProviders/ollama/index.js398-425 server/utils/AiProviders/anthropic/index.js73-86 server/utils/AiProviders/gemini/index.js149-163 server/utils/AiProviders/azureOpenAi/index.js30-31
Ollama models can return reasoning content separately from main output:
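A sketch of the `<think>` wrapping described above (the `thinking`/`content` field names follow Ollama's chat response; the exact handling in AnythingLLM may differ):

```javascript
// Prefix the answer with the model's reasoning, wrapped in <think> tags,
// so the UI can render it separately from the main response.
function formatWithReasoning(message) {
  if (!message.thinking) return message.content;
  return `<think>${message.thinking}</think>${message.content}`;
}

console.log(formatWithReasoning({ thinking: "Check units first.", content: "42 km" }));
// <think>Check units first.</think>42 km
```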
This makes internal reasoning visible to users without polluting the main response.
Sources: server/utils/AiProviders/ollama/index.js398-425
Anthropic supports ephemeral caching of system prompts to reduce costs:
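An illustrative shape of a cached system prompt: the prompt is sent as a content block carrying `cache_control` with an ephemeral TTL (`"5m"` or `"1h"`). The wiring of the `cacheControl` getter around this is an assumption:

```javascript
// System prompt as an Anthropic content block with prompt caching enabled.
const systemBlocks = [
  {
    type: "text",
    text: "You are a helpful assistant for this workspace.",
    cache_control: { type: "ephemeral", ttl: "5m" }, // or "1h"
  },
];
console.log(systemBlocks[0].cache_control.type); // ephemeral
```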
Repeated system prompts are cached by Anthropic for 5 minutes or 1 hour, reducing token costs significantly for multi-turn conversations.
Sources: server/utils/AiProviders/anthropic/index.js73-102
Gemini distinguishes between stable (v1) and experimental (v1beta) models:
All models use the same OpenAI-compatible endpoint (v1beta/openai/), regardless of stability status.
Sources: server/utils/AiProviders/gemini/index.js149-163
Azure deployments using reasoning models (o1, o3-mini) require special handling:
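A minimal sketch of that adjustment, assuming illustrative names (the real logic lives in `AzureOpenAiLLM`):

```javascript
// For O-type models: remap system messages to the user role and strip
// the unsupported temperature option before sending the request.
function adaptForOTypeModel(messages, options = {}) {
  const adapted = messages.map((m) =>
    m.role === "system" ? { ...m, role: "user" } : m
  );
  const { temperature, ...rest } = options; // O-type models reject temperature
  return { messages: adapted, options: rest };
}

const result = adaptForOTypeModel(
  [{ role: "system", content: "Be concise." }, { role: "user", content: "Hi" }],
  { temperature: 0.7, model: "o1" }
);
console.log(result.messages[0].role); // user
```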
O-type models don't accept system role messages and don't support temperature parameters.
Sources: server/utils/AiProviders/azureOpenAi/index.js24-148
The frontend uses hooks to fetch available models dynamically and display them in settings forms.
Sources: frontend/src/hooks/useGetProvidersModels.js1-86 frontend/src/components/LLMSelection/GeminiLLMOptions/index.jsx64-138
Grouped Providers (TogetherAI, OpenAI, Novita, OpenRouter, etc.): Models are organized by creator organization:
This creates <optgroup> elements in the model selector for better UX when hundreds of models are available.
Ungrouped Providers: Models are displayed as a flat list.
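The grouping described above can be sketched as follows (field names are assumptions; the frontend hook's actual shape may differ):

```javascript
// Group a flat model list by owning organization so the UI can render
// one <optgroup> per organization.
function groupModels(models) {
  return models.reduce((groups, model) => {
    const org = model.organization ?? "Other";
    (groups[org] ??= []).push(model);
    return groups;
  }, {});
}

const grouped = groupModels([
  { id: "meta-llama/Llama-3-8b", organization: "Meta" },
  { id: "mistralai/Mixtral-8x7B", organization: "MistralAI" },
  { id: "meta-llama/Llama-3-70b", organization: "Meta" },
]);
console.log(Object.keys(grouped)); // [ 'Meta', 'MistralAI' ]
```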
Sources: frontend/src/hooks/useGetProvidersModels.js40-86 frontend/src/components/LLMSelection/GeminiLLMOptions/index.jsx116-135
Each provider has a dedicated settings component that renders API key fields, model selectors, and provider-specific options:
- `AnthropicAiOptions`: API key + model selector + prompt caching toggle
- `GeminiLLMOptions`: API key + dynamic model selector (Stable/Experimental grouping)

These components use `System.customModels()` to fetch available models when the API key changes.
Sources: frontend/src/components/LLMSelection/AnthropicAiOptions/index.jsx1-191 frontend/src/components/LLMSelection/GeminiLLMOptions/index.jsx1-139
The provider architecture in AnythingLLM achieves flexibility through a shared interface contract, factory-based instantiation with workspace-level overrides, proportional context window allocation, pluggable embedding engines, and unified performance monitoring.
This architecture allows AnythingLLM to support 30+ providers while maintaining a consistent API surface for the chat engine.