The LLM Integration System provides a unified abstraction layer for interacting with 10+ commercial and open-source LLM providers. It manages tenant-specific configurations, API keys, model selection, and implements cross-cutting concerns including error handling, retry logic, usage tracking, and optional observability through Langfuse. The system supports six model types: chat, embedding, rerank, vision (image-to-text), text-to-speech (TTS), and speech-to-text (ASR).
For information about how LLMs are used within the agent workflow system, see Agent Tools and ReAct Loop. For details on tenant management and multi-tenancy, see User and Tenant Management.
System Architecture: LLM Integration Layers
The LLM integration system follows a layered architecture with clear separation between configuration, abstraction, and provider-specific implementations:
Architecture Description: Configuration flows from conf/llm_factories.json through the service layer to runtime model selection. The API Layer exposes Quart HTTP endpoints in api/apps/llm_app.py for managing LLM configurations. The Service Layer provides database access via LLMService and TenantLLMService, plus the LLMBundle wrapper that adds usage tracking, retry logic, and observability. The Model Type Registries are dictionaries (ChatModel, EmbeddingModel, etc.) populated dynamically at module load time by scanning for classes with _FACTORY_NAME attributes. Provider Implementations inherit from abstract base classes (Base) and implement provider-specific logic for API calls, authentication, and response parsing.
Sources: rag/llm/__init__.py129-178 api/db/services/llm_service.py85-118 api/apps/llm_app.py32-56 rag/llm/chat_model.py63 rag/llm/embedding_model.py37 rag/llm/rerank_model.py28
RAGFlow supports six distinct model types, each with its own registry and base class:
| Model Type | Registry Variable | Base Class Location | Purpose |
|---|---|---|---|
| CHAT | ChatModel | rag/llm/chat_model.py63-486 | Text generation, multi-turn conversations, tool calling |
| EMBEDDING | EmbeddingModel | rag/llm/embedding_model.py37-51 | Text-to-vector conversion, similarity search |
| RERANK | RerankModel | rag/llm/rerank_model.py28-54 | Search result reordering, relevance scoring |
| IMAGE2TEXT | CvModel | rag/llm/cv_model.py42-246 | Image description, vision-language models |
| TTS | TTSModel | rag/llm/tts_model.py67-79 | Text-to-speech synthesis |
| SPEECH2TEXT | Seq2txtModel | rag/llm/sequence2txt_model.py31-42 | Audio transcription |
Dynamic Provider Discovery at Module Load
The registry pattern allows dynamic discovery of provider implementations at module load time. Each provider class defines _FACTORY_NAME as a string (e.g., "OpenAI") or list of strings (e.g., ["VLLM", "OpenAI-API-Compatible"]) to identify which factory names it supports. The scan happens in rag/llm/__init__.py150-178 using inspect.getmembers() to find all classes that subclass Base and have the _FACTORY_NAME attribute.
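The scan described above can be sketched as follows. This is a self-contained miniature, not the real provider classes: the tiny `Base` hierarchy and `build_registry()` are illustrative, but the `_FACTORY_NAME` string-or-list convention and the `inspect.getmembers()` scan mirror the text.

```python
import inspect
import sys

class Base:
    """Stand-in for a model-type abstract base class (illustrative)."""

class OpenAIChat(Base):
    _FACTORY_NAME = "OpenAI"

class VLLMChat(Base):
    # A list value registers the same class under several factory names.
    _FACTORY_NAME = ["VLLM", "OpenAI-API-Compatible"]

def build_registry(module) -> dict:
    """Collect Base subclasses that declare _FACTORY_NAME, mirroring the
    module-load scan; list-valued names produce multiple registry entries."""
    registry = {}
    for _, cls in inspect.getmembers(module, inspect.isclass):
        if issubclass(cls, Base) and cls is not Base and hasattr(cls, "_FACTORY_NAME"):
            names = cls._FACTORY_NAME
            for name in [names] if isinstance(names, str) else names:
                registry[name] = cls
    return registry

# In RAGFlow the scan runs over each module in MODULE_MAPPING;
# here we scan the current module for the demo classes above.
ChatModel = build_registry(sys.modules[__name__])
```

Because registration happens at import time, adding a provider only requires defining a class with a `_FACTORY_NAME` attribute in the right module.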
Sources: rag/llm/__init__.py138-178 rag/llm/__init__.py150-178
Each model type defines an abstract base class that provider implementations must subclass:
Chat Model Class Hierarchy
The Base class in rag/llm/chat_model.py63-486 provides core functionality:
OpenAI and AsyncOpenAI clients with configurable timeout (default 600s from LLM_TIMEOUT_SECONDS)max_retries: Default 5 from LLM_MAX_RETRIES env varbase_delay: Default 2.0s from LLM_BASE_DELAY env varmax_rounds: Default 5 for tool calling loops_classify_error() maps exception messages to LLMErrorCode enum valuesbind_tools() attaches tool session and definitionsasync_chat_with_tools() for non-streaming tool executionasync_chat_streamly_with_tools() for streaming with tool calls_append_history() formats tool call/response pairs for conversation contextSources: rag/llm/chat_model.py63-486 rag/llm/chat_model.py64-67 rag/llm/chat_model.py69-75 rag/llm/chat_model.py80-99 rag/llm/chat_model.py247-330
Embedding Model Interface
The embedding base class defines two core methods in rag/llm/embedding_model.py37-51:
Method Specifications:
| Method | Input | Output | Purpose |
|---|---|---|---|
| encode(texts: list) | List of document texts | (embeddings: np.ndarray, token_count: int) | Batch encoding for documents/passages |
| encode_queries(text: str) | Single query string | (embedding: np.ndarray, token_count: int) | Query encoding (some providers optimize separately) |
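A toy provider following this interface might look like the sketch below. Only the return convention matches the table; the length-based "model" is a pure placeholder, and `ToyEmbed` is hypothetical, not a RAGFlow class.

```python
import numpy as np

class ToyEmbed:
    """Illustrative embedding provider honoring the
    (embeddings, token_count) return convention from the table above."""

    def encode(self, texts: list) -> tuple:
        # Placeholder "embedding": two features derived from each text.
        vecs = np.array(
            [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]
        )
        # Whitespace tokenization stands in for real token counting.
        tokens = sum(len(t.split()) for t in texts)
        return vecs, tokens

    def encode_queries(self, text: str) -> tuple:
        # Many providers reuse encode(); some call a query-optimized endpoint.
        vecs, tokens = self.encode([text])
        return vecs[0], tokens
```

Real implementations replace the placeholder math with a provider API call and the provider's reported token usage.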
Example implementation: OpenAIEmbed in rag/llm/embedding_model.py91-123. Provider implementations typically handle authentication, request batching, and provider-specific response parsing around this interface.
Sources: rag/llm/embedding_model.py37-123 rag/llm/embedding_model.py91-123
Rerank models implement similarity(query: str, texts: list) -> (rank: np.ndarray, token_count: int). The base class provides _normalize_rank() to scale scores to the 0-1 range while avoiding division by zero when all ranks are identical (rag/llm/rerank_model.py39-54).
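A minimal sketch of that normalization guard, assuming simple min-max scaling (illustrative, not the library's exact code):

```python
import numpy as np

def normalize_rank(scores: np.ndarray) -> np.ndarray:
    """Scale raw rerank scores to [0, 1]. When every score is identical,
    the (max - min) denominator would be zero, so return a flat ranking
    instead of dividing."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores, dtype=float)
    return (scores - lo) / (hi - lo)
```

The flat-ranking fallback is the interesting part: without it, a batch of equally relevant passages would crash the reranker instead of returning a neutral ordering.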
Sources: rag/llm/rerank_model.py28-54
Provider classes are automatically discovered and registered at module import time through introspection:
The registration code in rag/llm/__init__.py150-178:
1. Iterate over MODULE_MAPPING (chat_model, embedding_model, rerank_model, cv_model, sequence2txt_model, tts_model, ocr_model)
2. Import each module with importlib.import_module()
3. Call inspect.getmembers() to find all classes
4. Keep classes that subclass the Base abstract class
5. Keep classes that define the _FACTORY_NAME attribute
6. Register each class in the matching registry (ChatModel, EmbeddingModel, etc.)

If _FACTORY_NAME is a list (e.g., ["VLLM", "OpenAI-API-Compatible"]), the class is registered under multiple factory names.
Sources: rag/llm/__init__.py138-178
The llm_factories.json file defines available providers and their models:
An example factory definition appears in conf/llm_factories.json2-199.
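The original example is not reproduced here; a hypothetical entry using the fields mentioned elsewhere on this page (llm_name, tags, max_tokens, model_type, is_tools) might look like the following. Treat the exact schema as an assumption and consult conf/llm_factories.json for the real structure.

```json
{
  "name": "OpenAI",
  "llm": [
    {
      "llm_name": "gpt-4o",
      "tags": "LLM,CHAT,IMAGE2TEXT,128K",
      "max_tokens": 128000,
      "model_type": "chat",
      "is_tools": true
    }
  ]
}
```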
Sources: conf/llm_factories.json1-200
The validation logic in api/apps/llm_app.py59-144 tests API keys by:
- Instantiating the matching model class (e.g., EmbeddingModel[factory])
- Issuing a small test request under a timeout (LLM_TIMEOUT_SECONDS)

Sources: api/apps/llm_app.py59-144 api/apps/llm_app.py146-340
API keys are stored in the TenantLLM table with encrypted values. Special handling for providers with complex authentication:
| Provider | Key Format | Example Fields |
|---|---|---|
| VolcEngine | JSON string | ark_api_key, endpoint_id |
| BaiduYiyan | JSON string | yiyan_ak, yiyan_sk |
| Bedrock | JSON string | auth_mode, bedrock_ak, bedrock_sk, bedrock_region, aws_role_arn |
| Azure-OpenAI | JSON string | api_key, api_version |
| Google Cloud | JSON string | google_project_id, google_region, google_service_account_key (base64) |
The key assembly logic is in api/apps/llm_app.py159-209
Sources: api/apps/llm_app.py159-209
LLMBundle Initialization and Method Wrapping Flow
LLMBundle in api/db/services/llm_service.py85-450 wraps raw model instances and adds cross-cutting concerns.
The LLMBundle class in api/db/services/llm_service.py85-450 provides these wrapped methods:
| Method | Model Type | Implementation | Wrapping Concerns |
|---|---|---|---|
| encode(texts) | Embedding | api/db/services/llm_service.py95-118 | Token truncation to max_length, usage tracking, Langfuse tracing |
| encode_queries(query) | Embedding | api/db/services/llm_service.py120-133 | Usage tracking, Langfuse tracing |
| similarity(query, texts) | Rerank | api/db/services/llm_service.py135-147 | Usage tracking, Langfuse tracing |
| describe(image) | Vision | api/db/services/llm_service.py149-161 | Usage tracking, Langfuse tracing |
| describe_with_prompt(image, prompt) | Vision | api/db/services/llm_service.py163-175 | Usage tracking, Langfuse tracing |
| transcription(audio) | ASR | api/db/services/llm_service.py177-189 | Usage tracking, Langfuse tracing |
| stream_transcription(audio) | ASR | api/db/services/llm_service.py191-251 | Streaming support with fallback |
| tts(text) | TTS | api/db/services/llm_service.py253-265 | Streaming output with final token count |
| chat() | Chat | api/db/services/llm_service.py282-364 | Usage tracking, Langfuse tracing, tool binding |
| chat_streamly() | Chat | api/db/services/llm_service.py366-448 | Streaming with usage tracking |
The encode() method in api/db/services/llm_service.py99-107 implements automatic text truncation.
This prevents embedding requests from exceeding model token limits by truncating to 95% of max_length. The max_length value comes from the model's configured max_tokens in the database.
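A sketch of the 95%-of-max_length rule described above, using character counts as a stand-in for tokens (an assumption; the real code truncates by tokens via the project's token utilities):

```python
def truncate_texts(texts: list, max_length: int) -> list:
    """Clip each text to 95% of the model's max_length so embedding
    requests stay under the provider's token limit. Characters stand in
    for tokens here; swap in a real tokenizer for production use."""
    limit = int(max_length * 0.95)
    return [t[:limit] for t in texts]
```

The 5% headroom absorbs tokenizer variance: a text that is exactly at the limit by one tokenizer's count may exceed it by another's.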
Sources: api/db/services/llm_service.py85-450 api/db/services/llm_service.py95-118 api/db/services/llm_service.py99-107
The Base._classify_error() method in rag/llm/chat_model.py80-99 maps exception messages to error codes.
The error classification enables intelligent retry decisions. Only certain error types trigger retries.
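A minimal sketch of message-based classification in the spirit of _classify_error(). The keyword patterns and the enum subset here are illustrative examples, not the real lookup table:

```python
from enum import Enum

class LLMErrorCode(Enum):
    """Illustrative subset of the error codes named on this page."""
    ERROR_RATE_LIMIT = "rate_limit"
    ERROR_AUTHENTICATION = "authentication"
    ERROR_SERVER = "server"
    ERROR_GENERIC = "generic"

def classify_error(exc: Exception) -> LLMErrorCode:
    """Match keywords in the exception message against known patterns;
    unmatched errors fall through to a generic, non-retryable code."""
    msg = str(exc).lower()
    if "429" in msg or "too many requests" in msg or "rate limit" in msg:
        return LLMErrorCode.ERROR_RATE_LIMIT
    if "401" in msg or "unauthorized" in msg or "api key" in msg:
        return LLMErrorCode.ERROR_AUTHENTICATION
    if any(code in msg for code in ("500", "502", "503", "504")):
        return LLMErrorCode.ERROR_SERVER
    return LLMErrorCode.ERROR_GENERIC
```

Classifying by message text is pragmatic because providers raise heterogeneous exception types but tend to embed consistent HTTP status codes and phrases.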
Sources: rag/llm/chat_model.py80-99
Retry Logic Flow in Base.async_chat()
Retry Configuration from rag/llm/chat_model.py69-78:
| Parameter | Default | Source | Purpose |
|---|---|---|---|
| max_retries | 5 | LLM_MAX_RETRIES env var | Maximum retry attempts |
| base_delay | 2.0s | LLM_BASE_DELAY env var | Base delay for exponential backoff |
| Jitter formula | base_delay * random.uniform(10, 150) | rag/llm/chat_model.py77-78 | 10x-150x multiplier (20s to 300s range) |
Retryable Error Codes defined in rag/llm/chat_model.py202-209:
- ERROR_RATE_LIMIT: API rate limit exceeded (HTTP 429, "too many requests")
- ERROR_SERVER: server-side errors (HTTP 500, 502, 503, 504)

All other error codes (ERROR_AUTHENTICATION, ERROR_INVALID_REQUEST, ERROR_TIMEOUT, ERROR_CONNECTION, ERROR_CONTENT_FILTER, ERROR_MODEL, ERROR_QUOTA) return immediately without retry.
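The retry decision and the jitter formula from the table above can be sketched as follows (a hedged sketch: function names are illustrative, and only the formula and the retryable-code set come from this page):

```python
import random

# Only these error codes trigger retries, per the list above.
RETRYABLE = {"ERROR_RATE_LIMIT", "ERROR_SERVER"}

def next_delay(base_delay: float = 2.0) -> float:
    """Jittered delay per the table: base_delay times a uniform 10x-150x
    multiplier, i.e. 20s-300s with the 2.0s default."""
    return base_delay * random.uniform(10, 150)

def should_retry(error_code: str, attempt: int, max_retries: int = 5) -> bool:
    """Retry only rate-limit and server errors, up to max_retries attempts."""
    return error_code in RETRYABLE and attempt < max_retries
```

The wide random multiplier spreads concurrent retries over a long window, which matters when many workers hit the same rate limit simultaneously.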
Sources: rag/llm/chat_model.py69-243 rag/llm/chat_model.py77-78 rag/llm/chat_model.py202-209 rag/llm/chat_model.py473-485
The is_tools flag comes from llm_factories.json and indicates whether a model supports function calling (conf/llm_factories.json15).
Sources: api/db/services/llm_service.py89-93 rag/llm/chat_model.py271-276
Streaming Tool Call Flow in async_chat_streamly_with_tools()
The streaming tool call implementation lives in rag/llm/chat_model.py332-441.
Key Implementation Details:
Tool Call Accumulation (rag/llm/chat_model.py359-368)

Parallel Tool Execution (rag/llm/chat_model.py404):
- Uses thread_pool_exec() to run self.toolcall_session.tool_call(name, args) in a thread pool

Result Formatting (rag/llm/chat_model.py245):
- Formats each result as <tool_call> XML: {"name": ..., "args": ..., "result": ...}

History Management (rag/llm/chat_model.py247-269):
- _append_history() adds an assistant message with a tool_calls array
- Tool responses are appended with tool_call_id and result content

Max Rounds Protection (rag/llm/chat_model.py412-415):
- After max_rounds iterations (default 5), appends an "Exceed max rounds" message

Sources: rag/llm/chat_model.py332-441 rag/llm/chat_model.py245 rag/llm/chat_model.py247-269 rag/llm/chat_model.py359-368 rag/llm/chat_model.py404 rag/llm/chat_model.py412-415
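The accumulation step can be sketched as follows, assuming OpenAI-style streamed deltas keyed by index where the first fragment carries the id/name and later fragments append argument text. This is a simplified stand-in, not the library's code:

```python
def accumulate_tool_calls(deltas: list) -> dict:
    """Merge streaming tool-call fragments. Each delta dict carries an
    'index' plus optional 'id', 'name', and partial 'arguments' text;
    fragments for the same index concatenate into one complete call."""
    calls = {}
    for delta in deltas:
        idx = delta["index"]
        entry = calls.setdefault(idx, {"id": "", "name": "", "arguments": ""})
        if delta.get("id"):
            entry["id"] = delta["id"]
        if delta.get("name"):
            entry["name"] = delta["name"]
        # Argument JSON arrives in pieces and must be concatenated in order.
        entry["arguments"] += delta.get("arguments", "")
    return calls
```

Only once a call's argument string is complete can it be parsed as JSON and dispatched to the tool session.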
Token Usage Tracking Flow
Token usage is tracked per tenant and optionally per model. The increase_usage() call pattern in LLMBundle appears at api/db/services/llm_service.py111-112.
Database Update Logic: usage records are tracked per model, keyed by:
- llm_name (e.g., "gpt-4o")
- model_type (e.g., "chat")

Usage Display: The /my_llms endpoint in api/apps/llm_app.py372-414 returns the tenant's configured models with per-model usage statistics.
Sources: api/db/services/llm_service.py111-112 api/apps/llm_app.py372-414 common/token_utils.py
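An in-memory sketch of the per-model counting (the real increase_usage() updates the TenantLLM table; UsageTracker here is a hypothetical stand-in):

```python
class UsageTracker:
    """Accumulate token usage per (tenant_id, llm_name) pair, in the
    spirit of increase_usage(). A database row replaces this dict in
    the real implementation."""

    def __init__(self):
        self.used = {}

    def increase_usage(self, tenant_id: str, llm_name: str, tokens: int) -> int:
        """Add tokens to the counter for this tenant/model and return
        the new running total."""
        key = (tenant_id, llm_name)
        self.used[key] = self.used.get(key, 0) + tokens
        return self.used[key]
```

Because every wrapped LLMBundle method reports its token count through the same call, per-tenant billing and quota checks need only read one counter.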
Langfuse Tracing in LLMBundle Methods
Optional observability through Langfuse traces each LLM operation. An example implementation appears in api/db/services/llm_service.py96-117.
Configuration:
- Langfuse tracing is set up during LLMBundle initialization
- Each wrapped method (encode, similarity, describe, chat, etc.) follows the same pattern

Traced Information:
- The model name from self.llm_name (e.g., "gpt-4o", "text-embedding-3-small")

Sources: api/db/services/llm_service.py96-132 api/db/services/llm_service.py96-117
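The wrapping pattern can be sketched with a generic decorator. This is a stand-in for the Langfuse span logic, not its API; TracedBundle, its traces list, and the placeholder encode() body are all hypothetical:

```python
import functools
import time

def traced(method):
    """Record model name, latency, and token count around a wrapped
    LLMBundle-style method that returns (result, token_count)."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        start = time.monotonic()
        result, tokens = method(self, *args, **kwargs)
        self.traces.append({
            "model": self.llm_name,
            "latency_s": time.monotonic() - start,
            "tokens": tokens,
        })
        return result, tokens
    return wrapper

class TracedBundle:
    def __init__(self, llm_name: str):
        self.llm_name = llm_name
        self.traces = []  # a real implementation would emit spans instead

    @traced
    def encode(self, texts):
        # Placeholder model call returning (vectors, token_count).
        return [[0.0] * 3 for _ in texts], sum(len(t.split()) for t in texts)
```

Applying the same decorator to every wrapped method is what keeps the tracing behavior uniform across encode, similarity, describe, chat, and the rest.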
The LLM management API is defined in api/apps/llm_app.py32-453:
| Endpoint | Method | Purpose | Authentication |
|---|---|---|---|
/factories | GET | List available LLM factories with model types | Required |
/set_api_key | POST | Set API key for a factory (tests all model types) | Required |
/add_llm | POST | Add a specific model with validation | Required |
/delete_llm | POST | Remove a specific model configuration | Required |
/enable_llm | POST | Enable/disable a model | Required |
/delete_factory | POST | Remove all models for a factory | Required |
/my_llms | GET | Get tenant's configured models with usage stats | Required |
/list | GET | List available models by type for tenant | Required |
API Key Validation in set_api_key() Endpoint
Key Implementation Details:
Timeout Protection (api/apps/llm_app.py75-100):
- Each test call runs under asyncio.wait_for() with a configurable timeout (default 10 seconds from LLM_TIMEOUT_SECONDS)

Test Queries:
- Embedding: mdl.encode(["Test if the api key is available"])
- Chat: mdl.async_chat(None, [{"role": "user", "content": "Hello! How are you doing!"}], {"temperature": 0.9, "max_tokens": 50})
- Rerank: mdl.similarity("What's the weather?", ["Is it sunny today?"])

Early Success Exit

Batch Save (api/apps/llm_app.py130-141):
- Updates via TenantLLMService.filter_update() if the record exists, otherwise inserts via TenantLLMService.save()

Sources: api/apps/llm_app.py59-143 api/apps/llm_app.py75-100 api/apps/llm_app.py130-141
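The timeout-protected probe can be sketched as follows. validate_model and the demo coroutines are hypothetical helpers, not RAGFlow functions; only the asyncio.wait_for() pattern and the 10-second default come from the text above:

```python
import asyncio

async def validate_model(probe, timeout_s: float = 10.0) -> bool:
    """Run one cheap test query under a timeout; any failure (timeout,
    auth error, network error) marks the key as invalid."""
    try:
        await asyncio.wait_for(probe(), timeout=timeout_s)
        return True
    except Exception:  # includes asyncio.TimeoutError
        return False

async def demo():
    async def ok():      # stands in for a working mdl.encode([...]) call
        await asyncio.sleep(0)
        return "ok"

    async def hangs():   # stands in for an unreachable endpoint
        await asyncio.sleep(60)

    return await validate_model(ok), await validate_model(hangs, timeout_s=0.01)
```

Wrapping the probe rather than the whole endpoint keeps one slow provider from blocking validation of the other model types.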
RAGFlow uses LiteLLM for certain providers to reduce implementation overhead. The mapping is defined in rag/llm/__init__.py25-127:
LiteLLM Provider Examples:
| Provider | Prefix | Base URL |
|---|---|---|
| Anthropic | "" (empty) | https://api.anthropic.com/ |
| Cohere | "" (empty) | N/A |
| DeepSeek | deepseek/ | N/A |
| Gemini | gemini/ | N/A |
| Groq | groq/ | N/A |
| Moonshot | moonshot/ | https://api.moonshot.cn/v1 |
| Ollama | ollama_chat/ | Configurable |
| OpenRouter | openai/ | https://openrouter.ai/api/v1 |
| ZHIPU-AI | openai/ | https://open.bigmodel.cn/api/paas/v4 |
Many OpenAI-API-compatible providers use the openai/ prefix with custom base URLs.
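The prefix composition can be sketched as below. LITELLM_MAP is a hypothetical subset built from the table above, not the real mapping structure in rag/llm/__init__.py:

```python
# Hypothetical subset of the prefix/base-URL table above.
LITELLM_MAP = {
    "DeepSeek": {"prefix": "deepseek/", "base_url": None},
    "Moonshot": {"prefix": "moonshot/", "base_url": "https://api.moonshot.cn/v1"},
    "ZHIPU-AI": {"prefix": "openai/", "base_url": "https://open.bigmodel.cn/api/paas/v4"},
}

def litellm_model_name(factory: str, model: str) -> str:
    """Compose the provider-prefixed model id that LiteLLM routes on,
    e.g. 'deepseek/deepseek-chat' or 'openai/<model>' plus a custom
    base URL for OpenAI-compatible providers."""
    return LITELLM_MAP[factory]["prefix"] + model
```

The openai/ prefix with a custom base_url is how one generic OpenAI-compatible client serves many otherwise unrelated providers.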
Sources: rag/llm/__init__.py63-127
The llm_factories.json uses tags to indicate model capabilities:
| Tag | Meaning | Example Models |
|---|---|---|
| LLM | Large language model (text generation) | All chat models |
| CHAT | Conversational interface | gpt-4o, claude-3-opus |
| IMAGE2TEXT | Vision language model | gpt-4o, qwen-vl-plus |
| TEXT EMBEDDING | Text-to-vector encoding | text-embedding-3-small, bge-m3 |
| TEXT RE-RANK | Search result reranking | gte-rerank, jina-reranker-v2 |
| TTS | Text-to-speech synthesis | tts-1, sambert-zhiru-v1 |
| SPEECH2TEXT | Audio transcription | whisper-1, qwen3-asr-flash |
| AGENT | Agent-specific optimization | qianwen-deepresearch-30b |
| MODERATION | Content moderation | OpenAI moderation models |
| 32K, 128K, 1M | Context window size | Various (indicates max_tokens) |
Models can have multiple tags; an example appears in conf/llm_factories.json11-16.
Sources: conf/llm_factories.json1-200
Certain providers require complex authentication beyond a simple API key:
Bedrock supports three authentication modes (api/apps/llm_app.py172-175):
- Explicit credentials: bedrock_ak, bedrock_sk, bedrock_region
- Role assumption: aws_role_arn, bedrock_region (assumes the role via STS)

Implementation in rag/llm/embedding_model.py461-503 shows the three authentication paths; the role-assumption path calls assume_role() to obtain temporary credentials.
VolcEngine assembles ark_api_key and endpoint_id into a JSON string stored in the api_key field (api/apps/llm_app.py163-166). The VolcEngineChat class then parses this JSON in its constructor (rag/llm/chat_model.py656-665).
Sources: api/apps/llm_app.py159-209 rag/llm/embedding_model.py461-503 rag/llm/chat_model.py653-665
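The assemble/parse round-trip for VolcEngine can be sketched as follows. The helper names are hypothetical; only the field names (ark_api_key, endpoint_id) and the JSON-in-api_key storage come from the text above:

```python
import json

def assemble_volc_key(ark_api_key: str, endpoint_id: str) -> str:
    """Pack both VolcEngine fields into one JSON string suitable for the
    single api_key column, as described above."""
    return json.dumps({"ark_api_key": ark_api_key, "endpoint_id": endpoint_id})

def parse_volc_key(stored: str) -> tuple:
    """Inverse step a VolcEngineChat-style constructor would perform
    when reading the stored credential back out."""
    data = json.loads(stored)
    return data["ark_api_key"], data["endpoint_id"]
```

Packing multiple fields into the existing api_key column avoids a schema change for every provider with non-standard credentials.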
The frontend LLM configuration UI is powered by constants and API calls:
The IconMap in web/src/constants/llm.ts69-134 maps factory names to icon file names, and the APIMapUrl dictionary (web/src/constants/llm.ts136-186) provides direct links to provider API key pages.
These constants enable the frontend to display appropriate icons and help links when users configure LLM providers.
Sources: web/src/constants/llm.ts1-187
System initialization populates LLM configurations from multiple sources:
The get_init_tenant_llm() function in api/db/services/llm_service.py36-83 reads default LLM configurations from settings.py.
These settings come from environment variables or service_conf.yaml. The initialization ensures the superuser has working LLM configurations for all required model types.
Sources: api/db/init_data.py166-179 api/db/services/llm_service.py36-83