The LLM Integration System provides a unified abstraction layer for interacting with 10+ commercial and open-source LLM providers. It manages tenant-specific configurations, API keys, model selection, and implements cross-cutting concerns including error handling, retry logic, usage tracking, and optional observability through Langfuse. The system supports six model types: chat, embedding, rerank, vision (image-to-text), text-to-speech (TTS), and speech-to-text (ASR).
For information about how LLMs are used within the agent workflow system, see Agent Tools and ReAct Loop. For details on tenant management and multi-tenancy, see User and Tenant Management.
System Architecture: LLM Integration Layers
The LLM integration system follows a layered architecture with clear separation between configuration, abstraction, and provider-specific implementations:
Architecture Description: Configuration flows from conf/llm_factories.json through the service layer to runtime model selection. The API Layer exposes Quart HTTP endpoints in api/apps/llm_app.py for managing LLM configurations. The Service Layer provides database access via LLMService and TenantLLMService, plus the LLMBundle wrapper that adds usage tracking, retry logic, and observability. The Model Type Registries are dictionaries (ChatModel, EmbeddingModel, etc.) populated dynamically at module load time by scanning for classes with _FACTORY_NAME attributes. Provider Implementations inherit from abstract base classes (Base) and implement provider-specific logic for API calls, authentication, and response parsing.
Sources: rag/llm/__init__.py129-178 api/db/services/llm_service.py85-118 api/apps/llm_app.py32-56 rag/llm/chat_model.py63 rag/llm/embedding_model.py37 rag/llm/rerank_model.py28
RAGFlow supports six distinct model types, each with its own registry and base class:
| Model Type | Registry Variable | Base Class Location | Purpose |
|---|---|---|---|
| CHAT | ChatModel | rag/llm/chat_model.py63-486 | Text generation, multi-turn conversations, tool calling |
| EMBEDDING | EmbeddingModel | rag/llm/embedding_model.py37-51 | Text-to-vector conversion, similarity search |
| RERANK | RerankModel | rag/llm/rerank_model.py28-54 | Search result reordering, relevance scoring |
| IMAGE2TEXT | CvModel | rag/llm/cv_model.py42-246 | Image description, vision-language models |
| TTS | TTSModel | rag/llm/tts_model.py67-79 | Text-to-speech synthesis |
| SPEECH2TEXT | Seq2txtModel | rag/llm/sequence2txt_model.py31-42 | Audio transcription |
Dynamic Provider Discovery at Module Load
The registry pattern allows dynamic discovery of provider implementations at module load time. Each provider class defines _FACTORY_NAME as a string (e.g., "OpenAI") or list of strings (e.g., ["VLLM", "OpenAI-API-Compatible"]) to identify which factory names it supports. The scan happens in rag/llm/__init__.py150-178 using inspect.getmembers() to find all classes that subclass Base and have the _FACTORY_NAME attribute.
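The scan described above can be sketched as follows. This is a self-contained miniature, not the real provider classes: the tiny `Base` hierarchy and `build_registry()` are illustrative, but the `_FACTORY_NAME` string-or-list convention and the `inspect.getmembers()` scan mirror the text.

```python
import inspect
import sys

class Base:
    """Stand-in for a model-type abstract base class (illustrative)."""

class OpenAIChat(Base):
    _FACTORY_NAME = "OpenAI"

class VLLMChat(Base):
    # A list value registers the same class under several factory names.
    _FACTORY_NAME = ["VLLM", "OpenAI-API-Compatible"]

def build_registry(module) -> dict:
    """Collect Base subclasses that declare _FACTORY_NAME, mirroring the
    module-load scan; list-valued names produce multiple registry entries."""
    registry = {}
    for _, cls in inspect.getmembers(module, inspect.isclass):
        if issubclass(cls, Base) and cls is not Base and hasattr(cls, "_FACTORY_NAME"):
            names = cls._FACTORY_NAME
            for name in [names] if isinstance(names, str) else names:
                registry[name] = cls
    return registry

# In RAGFlow the scan runs over each module in MODULE_MAPPING;
# here we scan the current module for the demo classes above.
ChatModel = build_registry(sys.modules[__name__])
```

Because registration happens at import time, adding a provider only requires defining a class with a `_FACTORY_NAME` attribute in the right module.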
Sources: rag/llm/__init__.py138-178 rag/llm/__init__.py150-178
Each model type defines an abstract base class that provider implementations must subclass:
Chat Model Class Hierarchy
The Base class in rag/llm/chat_model.py63-486 provides core functionality:
OpenAI and AsyncOpenAI clients with configurable timeout (default 600s from LLM_TIMEOUT_SECONDS)max_retries: Default 5 from LLM_MAX_RETRIES env varbase_delay: Default 2.0s from LLM_BASE_DELAY env varmax_rounds: Default 5 for tool calling loops_classify_error() maps exception messages to LLMErrorCode enum valuesbind_tools() attaches tool session and definitionsasync_chat_with_tools() for non-streaming tool executionasync_chat_streamly_with_tools() for streaming with tool calls_append_history() formats tool call/response pairs for conversation contextSources: rag/llm/chat_model.py63-486 rag/llm/chat_model.py64-67 rag/llm/chat_model.py69-75 rag/llm/chat_model.py80-99 rag/llm/chat_model.py247-330
Embedding Model Interface
The embedding base class defines two core methods in rag/llm/embedding_model.py37-51:
Method Specifications:
| Method | Input | Output | Purpose |
|---|---|---|---|
| encode(texts: list) | List of document texts | (embeddings: np.ndarray, token_count: int) | Batch encoding for documents/passages |
| encode_queries(text: str) | Single query string | (embedding: np.ndarray, token_count: int) | Query encoding (some providers optimize separately) |
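A toy provider following this interface might look like the sketch below. Only the return convention matches the table; the length-based "model" is a pure placeholder, and `ToyEmbed` is hypothetical, not a RAGFlow class.

```python
import numpy as np

class ToyEmbed:
    """Illustrative embedding provider honoring the
    (embeddings, token_count) return convention from the table above."""

    def encode(self, texts: list) -> tuple:
        # Placeholder "embedding": two features derived from each text.
        vecs = np.array(
            [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]
        )
        # Whitespace tokenization stands in for real token counting.
        tokens = sum(len(t.split()) for t in texts)
        return vecs, tokens

    def encode_queries(self, text: str) -> tuple:
        # Many providers reuse encode(); some call a query-optimized endpoint.
        vecs, tokens = self.encode([text])
        return vecs[0], tokens
```

Real implementations replace the placeholder math with a provider API call and the provider's reported token usage.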
Example implementation: OpenAIEmbed in rag/llm/embedding_model.py91-123. Provider implementations typically handle authentication, request batching, and provider-specific response parsing around this interface.
Sources: rag/llm/embedding_model.py37-123 rag/llm/embedding_model.py91-123
Rerank models implement similarity(query: str, texts: list) -> (rank: np.ndarray, token_count: int). The base class provides _normalize_rank() to scale scores to the 0-1 range while avoiding division by zero when all ranks are identical (rag/llm/rerank_model.py39-54).
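A minimal sketch of that normalization guard, assuming simple min-max scaling (illustrative, not the library's exact code):

```python
import numpy as np

def normalize_rank(scores: np.ndarray) -> np.ndarray:
    """Scale raw rerank scores to [0, 1]. When every score is identical,
    the (max - min) denominator would be zero, so return a flat ranking
    instead of dividing."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores, dtype=float)
    return (scores - lo) / (hi - lo)
```

The flat-ranking fallback is the interesting part: without it, a batch of equally relevant passages would crash the reranker instead of returning a neutral ordering.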
Sources: rag/llm/rerank_model.py28-54
Provider classes are automatically discovered and registered at module import time through introspection:
The registration code in rag/llm/__init__.py150-178:
1. Iterate over MODULE_MAPPING (chat_model, embedding_model, rerank_model, cv_model, sequence2txt_model, tts_model, ocr_model)
2. Import each module with importlib.import_module()
3. Call inspect.getmembers() to find all classes
4. Keep classes that subclass the Base abstract class
5. Keep classes that define the _FACTORY_NAME attribute
6. Register each class in the matching registry (ChatModel, EmbeddingModel, etc.)

If _FACTORY_NAME is a list (e.g., ["VLLM", "OpenAI-API-Compatible"]), the class is registered under multiple factory names.
Sources: rag/llm/__init__.py138-178
The llm_factories.json file defines available providers and their models:
An example factory definition appears in conf/llm_factories.json2-199.
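The original example is not reproduced here; a hypothetical entry using the fields mentioned elsewhere on this page (llm_name, tags, max_tokens, model_type, is_tools) might look like the following. Treat the exact schema as an assumption and consult conf/llm_factories.json for the real structure.

```json
{
  "name": "OpenAI",
  "llm": [
    {
      "llm_name": "gpt-4o",
      "tags": "LLM,CHAT,IMAGE2TEXT,128K",
      "max_tokens": 128000,
      "model_type": "chat",
      "is_tools": true
    }
  ]
}
```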
Sources: conf/llm_factories.json1-200
The validation logic in api/apps/llm_app.py59-144 tests API keys by:
- Instantiating the matching model class (e.g., EmbeddingModel[factory])
- Issuing a small test request under a timeout (LLM_TIMEOUT_SECONDS)

Sources: api/apps/llm_app.py59-144 api/apps/llm_app.py146-340
API keys are stored in the TenantLLM table with encrypted values. Special handling for providers with complex authentication:
| Provider | Key Format | Example Fields |
|---|---|---|
| VolcEngine | JSON string | ark_api_key, endpoint_id |
| BaiduYiyan | JSON string | yiyan_ak, yiyan_sk |
| Bedrock | JSON string | auth_mode, bedrock_ak, bedrock_sk, bedrock_region, aws_role_arn |
| Azure-OpenAI | JSON string | api_key, api_version |
| Google Cloud | JSON string | google_project_id, google_region, google_service_account_key (base64) |
The key assembly logic is in api/apps/llm_app.py159-209
Sources: api/apps/llm_app.py159-209
LLMBundle Initialization and Method Wrapping Flow
LLMBundle in api/db/services/llm_service.py85-450 wraps raw model instances and adds cross-cutting concerns.
The LLMBundle class in api/db/services/llm_service.py85-450 provides these wrapped methods:
| Method | Model Type | Implementation | Wrapping Concerns |
|---|---|---|---|
| encode(texts) | Embedding | api/db/services/llm_service.py95-118 | Token truncation to max_length, usage tracking, Langfuse tracing |
| encode_queries(query) | Embedding | api/db/services/llm_service.py120-133 | Usage tracking, Langfuse tracing |
| similarity(query, texts) | Rerank | api/db/services/llm_service.py135-147 | Usage tracking, Langfuse tracing |
| describe(image) | Vision | api/db/services/llm_service.py149-161 | Usage tracking, Langfuse tracing |
| describe_with_prompt(image, prompt) | Vision | api/db/services/llm_service.py163-175 | Usage tracking, Langfuse tracing |
| transcription(audio) | ASR | api/db/services/llm_service.py177-189 | Usage tracking, Langfuse tracing |
| stream_transcription(audio) | ASR | api/db/services/llm_service.py191-251 | Streaming support with fallback |
| tts(text) | TTS | api/db/services/llm_service.py253-265 | Streaming output with final token count |
| chat() | Chat | api/db/services/llm_service.py282-364 | Usage tracking, Langfuse tracing, tool binding |
| chat_streamly() | Chat | api/db/services/llm_service.py366-448 | Streaming with usage tracking |
The encode() method in api/db/services/llm_service.py99-107 implements automatic text truncation.
This prevents embedding requests from exceeding model token limits by truncating to 95% of max_length. The max_length value comes from the model's configured max_tokens in the database.
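A sketch of the 95%-of-max_length rule described above, using character counts as a stand-in for tokens (an assumption; the real code truncates by tokens via the project's token utilities):

```python
def truncate_texts(texts: list, max_length: int) -> list:
    """Clip each text to 95% of the model's max_length so embedding
    requests stay under the provider's token limit. Characters stand in
    for tokens here; swap in a real tokenizer for production use."""
    limit = int(max_length * 0.95)
    return [t[:limit] for t in texts]
```

The 5% headroom absorbs tokenizer variance: a text that is exactly at the limit by one tokenizer's count may exceed it by another's.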
Sources: api/db/services/llm_service.py85-450 api/db/services/llm_service.py95-118 api/db/services/llm_service.py99-107
The Base._classify_error() method in rag/llm/chat_model.py80-99 maps exception messages to error codes.
The error classification enables intelligent retry decisions. Only certain error types trigger retries.
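A minimal sketch of message-based classification in the spirit of _classify_error(). The keyword patterns and the enum subset here are illustrative examples, not the real lookup table:

```python
from enum import Enum

class LLMErrorCode(Enum):
    """Illustrative subset of the error codes named on this page."""
    ERROR_RATE_LIMIT = "rate_limit"
    ERROR_AUTHENTICATION = "authentication"
    ERROR_SERVER = "server"
    ERROR_GENERIC = "generic"

def classify_error(exc: Exception) -> LLMErrorCode:
    """Match keywords in the exception message against known patterns;
    unmatched errors fall through to a generic, non-retryable code."""
    msg = str(exc).lower()
    if "429" in msg or "too many requests" in msg or "rate limit" in msg:
        return LLMErrorCode.ERROR_RATE_LIMIT
    if "401" in msg or "unauthorized" in msg or "api key" in msg:
        return LLMErrorCode.ERROR_AUTHENTICATION
    if any(code in msg for code in ("500", "502", "503", "504")):
        return LLMErrorCode.ERROR_SERVER
    return LLMErrorCode.ERROR_GENERIC
```

Classifying by message text is pragmatic because providers raise heterogeneous exception types but tend to embed consistent HTTP status codes and phrases.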
Sources: rag/llm/chat_model.py80-99
Retry Logic Flow in Base.async_chat()
Retry Configuration from rag/llm/chat_model.py69-78:
| Parameter | Default | Source | Purpose |
|---|---|---|---|
| max_retries | 5 | LLM_MAX_RETRIES env var | Maximum retry attempts |
| base_delay | 2.0s | LLM_BASE_DELAY env var | Base delay for exponential backoff |
| Jitter formula | base_delay * random.uniform(10, 150) | rag/llm/chat_model.py77-78 | 10x-150x multiplier (20s to 300s range) |
Retryable Error Codes defined in rag/llm/chat_model.py202-209:
- ERROR_RATE_LIMIT: API rate limit exceeded (HTTP 429, "too many requests")
- ERROR_SERVER: server-side errors (HTTP 500, 502, 503, 504)

All other error codes (ERROR_AUTHENTICATION, ERROR_INVALID_REQUEST, ERROR_TIMEOUT, ERROR_CONNECTION, ERROR_CONTENT_FILTER, ERROR_MODEL, ERROR_QUOTA) return immediately without retry.
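The retry decision and the jitter formula from the table above can be sketched as follows (a hedged sketch: function names are illustrative, and only the formula and the retryable-code set come from this page):

```python
import random

# Only these error codes trigger retries, per the list above.
RETRYABLE = {"ERROR_RATE_LIMIT", "ERROR_SERVER"}

def next_delay(base_delay: float = 2.0) -> float:
    """Jittered delay per the table: base_delay times a uniform 10x-150x
    multiplier, i.e. 20s-300s with the 2.0s default."""
    return base_delay * random.uniform(10, 150)

def should_retry(error_code: str, attempt: int, max_retries: int = 5) -> bool:
    """Retry only rate-limit and server errors, up to max_retries attempts."""
    return error_code in RETRYABLE and attempt < max_retries
```

The wide random multiplier spreads concurrent retries over a long window, which matters when many workers hit the same rate limit simultaneously.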
Sources: rag/llm/chat_model.py69-243 rag/llm/chat_model.py77-78 rag/llm/chat_model.py202-209 rag/llm/chat_model.py473-485
The is_tools flag comes from llm_factories.json and indicates whether a model supports function calling (conf/llm_factories.json15).
Sources: api/db/services/llm_service.py89-93 rag/llm/chat_model.py271-276
Streaming Tool Call Flow in async_chat_streamly_with_tools()
The streaming tool call implementation lives in rag/llm/chat_model.py332-441.
Key Implementation Details:
Tool Call Accumulation (rag/llm/chat_model.py359-368)

Parallel Tool Execution (rag/llm/chat_model.py404):
- Uses thread_pool_exec() to run self.toolcall_session.tool_call(name, args) in a thread pool

Result Formatting (rag/llm/chat_model.py245):
- Formats each result as <tool_call> XML: {"name": ..., "args": ..., "result": ...}

History Management (rag/llm/chat_model.py247-269):
- _append_history() adds an assistant message with a tool_calls array
- Tool responses are appended with tool_call_id and result content

Max Rounds Protection (rag/llm/chat_model.py412-415):
- After max_rounds iterations (default 5), appends an "Exceed max rounds" message

Sources: rag/llm/chat_model.py332-441 rag/llm/chat_model.py245 rag/llm/chat_model.py247-269 rag/llm/chat_model.py359-368 rag/llm/chat_model.py404 rag/llm/chat_model.py412-415
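The accumulation step can be sketched as follows, assuming OpenAI-style streamed deltas keyed by index where the first fragment carries the id/name and later fragments append argument text. This is a simplified stand-in, not the library's code:

```python
def accumulate_tool_calls(deltas: list) -> dict:
    """Merge streaming tool-call fragments. Each delta dict carries an
    'index' plus optional 'id', 'name', and partial 'arguments' text;
    fragments for the same index concatenate into one complete call."""
    calls = {}
    for delta in deltas:
        idx = delta["index"]
        entry = calls.setdefault(idx, {"id": "", "name": "", "arguments": ""})
        if delta.get("id"):
            entry["id"] = delta["id"]
        if delta.get("name"):
            entry["name"] = delta["name"]
        # Argument JSON arrives in pieces and must be concatenated in order.
        entry["arguments"] += delta.get("arguments", "")
    return calls
```

Only once a call's argument string is complete can it be parsed as JSON and dispatched to the tool session.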
Token Usage Tracking Flow
Token usage is tracked per tenant and optionally per model. The increase_usage() call pattern in LLMBundle appears at api/db/services/llm_service.py111-112.
Database Update Logic: usage records are tracked per model, keyed by:
- llm_name (e.g., "gpt-4o")
- model_type (e.g., "chat")

Usage Display: The /my_llms endpoint in api/apps/llm_app.py372-414 returns the tenant's configured models with per-model usage statistics.
Sources: api/db/services/llm_service.py111-112 api/apps/llm_app.py372-414 common/token_utils.py
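An in-memory sketch of the per-model counting (the real increase_usage() updates the TenantLLM table; UsageTracker here is a hypothetical stand-in):

```python
class UsageTracker:
    """Accumulate token usage per (tenant_id, llm_name) pair, in the
    spirit of increase_usage(). A database row replaces this dict in
    the real implementation."""

    def __init__(self):
        self.used = {}

    def increase_usage(self, tenant_id: str, llm_name: str, tokens: int) -> int:
        """Add tokens to the counter for this tenant/model and return
        the new running total."""
        key = (tenant_id, llm_name)
        self.used[key] = self.used.get(key, 0) + tokens
        return self.used[key]
```

Because every wrapped LLMBundle method reports its token count through the same call, per-tenant billing and quota checks need only read one counter.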
Langfuse Tracing in LLMBundle Methods
Optional observability through Langfuse traces each LLM operation. An example implementation appears in api/db/services/llm_service.py96-117.
Configuration:
- Langfuse tracing is set up during LLMBundle initialization
- Each wrapped method (encode, similarity, describe, chat, etc.) follows the same pattern

Traced Information:
- The model name from self.llm_name (e.g., "gpt-4o", "text-embedding-3-small")

Sources: api/db/services/llm_service.py96-132 api/db/services/llm_service.py96-117
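The wrapping pattern can be sketched with a generic decorator. This is a stand-in for the Langfuse span logic, not its API; TracedBundle, its traces list, and the placeholder encode() body are all hypothetical:

```python
import functools
import time

def traced(method):
    """Record model name, latency, and token count around a wrapped
    LLMBundle-style method that returns (result, token_count)."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        start = time.monotonic()
        result, tokens = method(self, *args, **kwargs)
        self.traces.append({
            "model": self.llm_name,
            "latency_s": time.monotonic() - start,
            "tokens": tokens,
        })
        return result, tokens
    return wrapper

class TracedBundle:
    def __init__(self, llm_name: str):
        self.llm_name = llm_name
        self.traces = []  # a real implementation would emit spans instead

    @traced
    def encode(self, texts):
        # Placeholder model call returning (vectors, token_count).
        return [[0.0] * 3 for _ in texts], sum(len(t.split()) for t in texts)
```

Applying the same decorator to every wrapped method is what keeps the tracing behavior uniform across encode, similarity, describe, chat, and the rest.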
The LLM management API is defined in api/apps/llm_app.py32-453:
| Endpoint | Method | Purpose | Authentication |
|---|---|---|---|
/factories | GET | List available LLM factories with model types | Required |
/set_api_key | POST | Set API key for a factory (tests all model types) | Required |
/add_llm | POST | Add a specific model with validation | Required |
/delete_llm | POST | Remove a specific model configuration | Required |
/enable_llm | POST | Enable/disable a model | Required |
/delete_factory | POST | Remove all models for a factory | Required |
/my_llms | GET | Get tenant's configured models with usage stats | Required |
/list | GET | List available models by type for tenant | Required |
API Key Validation in set_api_key() Endpoint
Key Implementation Details:
Timeout Protection (api/apps/llm_app.py75-100):
- Each test call runs under asyncio.wait_for() with a configurable timeout (default 10 seconds from LLM_TIMEOUT_SECONDS)

Test Queries:
- Embedding: mdl.encode(["Test if the api key is available"])
- Chat: mdl.async_chat(None, [{"role": "user", "content": "Hello! How are you doing!"}], {"temperature": 0.9, "max_tokens": 50})
- Rerank: mdl.similarity("What's the weather?", ["Is it sunny today?"])

Early Success Exit

Batch Save (api/apps/llm_app.py130-141):
- Updates via TenantLLMService.filter_update() if the record exists, otherwise inserts via TenantLLMService.save()

Sources: api/apps/llm_app.py59-143 api/apps/llm_app.py75-100 api/apps/llm_app.py130-141
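The timeout-protected probe can be sketched as follows. validate_model and the demo coroutines are hypothetical helpers, not RAGFlow functions; only the asyncio.wait_for() pattern and the 10-second default come from the text above:

```python
import asyncio

async def validate_model(probe, timeout_s: float = 10.0) -> bool:
    """Run one cheap test query under a timeout; any failure (timeout,
    auth error, network error) marks the key as invalid."""
    try:
        await asyncio.wait_for(probe(), timeout=timeout_s)
        return True
    except Exception:  # includes asyncio.TimeoutError
        return False

async def demo():
    async def ok():      # stands in for a working mdl.encode([...]) call
        await asyncio.sleep(0)
        return "ok"

    async def hangs():   # stands in for an unreachable endpoint
        await asyncio.sleep(60)

    return await validate_model(ok), await validate_model(hangs, timeout_s=0.01)
```

Wrapping the probe rather than the whole endpoint keeps one slow provider from blocking validation of the other model types.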
RAGFlow uses LiteLLM for certain providers to reduce implementation overhead. The mapping is defined in rag/llm/__init__.py25-127:
LiteLLM Provider Examples:
| Provider | Prefix | Base URL |
|---|---|---|
| Anthropic | "" (empty) | https://api.anthropic.com/ |
| Cohere | "" (empty) | N/A |
| DeepSeek | deepseek/ | N/A |
| Gemini | gemini/ | N/A |
| Groq | groq/ | N/A |
| Moonshot | moonshot/ | https://api.moonshot.cn/v1 |
| Ollama | ollama_chat/ | Configurable |
| OpenRouter | openai/ | https://openrouter.ai/api/v1 |
| ZHIPU-AI | openai/ | https://open.bigmodel.cn/api/paas/v4 |
Many OpenAI-API-compatible providers use the openai/ prefix with custom base URLs.
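The prefix composition can be sketched as below. LITELLM_MAP is a hypothetical subset built from the table above, not the real mapping structure in rag/llm/__init__.py:

```python
# Hypothetical subset of the prefix/base-URL table above.
LITELLM_MAP = {
    "DeepSeek": {"prefix": "deepseek/", "base_url": None},
    "Moonshot": {"prefix": "moonshot/", "base_url": "https://api.moonshot.cn/v1"},
    "ZHIPU-AI": {"prefix": "openai/", "base_url": "https://open.bigmodel.cn/api/paas/v4"},
}

def litellm_model_name(factory: str, model: str) -> str:
    """Compose the provider-prefixed model id that LiteLLM routes on,
    e.g. 'deepseek/deepseek-chat' or 'openai/<model>' plus a custom
    base URL for OpenAI-compatible providers."""
    return LITELLM_MAP[factory]["prefix"] + model
```

The openai/ prefix with a custom base_url is how one generic OpenAI-compatible client serves many otherwise unrelated providers.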
Sources: rag/llm/__init__.py63-127
The llm_factories.json uses tags to indicate model capabilities:
| Tag | Meaning | Example Models |
|---|---|---|
| LLM | Large language model (text generation) | All chat models |
| CHAT | Conversational interface | gpt-4o, claude-3-opus |
| IMAGE2TEXT | Vision language model | gpt-4o, qwen-vl-plus |
| TEXT EMBEDDING | Text-to-vector encoding | text-embedding-3-small, bge-m3 |
| TEXT RE-RANK | Search result reranking | gte-rerank, jina-reranker-v2 |
| TTS | Text-to-speech synthesis | tts-1, sambert-zhiru-v1 |
| SPEECH2TEXT | Audio transcription | whisper-1, qwen3-asr-flash |
| AGENT | Agent-specific optimization | qianwen-deepresearch-30b |
| MODERATION | Content moderation | OpenAI moderation models |
| 32K, 128K, 1M | Context window size | Various (indicates max_tokens) |
Models can have multiple tags; an example appears in conf/llm_factories.json11-16.
Sources: conf/llm_factories.json1-200
Certain providers require complex authentication beyond a simple API key:
Bedrock supports three authentication modes (api/apps/llm_app.py172-175):
- Explicit credentials: bedrock_ak, bedrock_sk, bedrock_region
- Role assumption: aws_role_arn, bedrock_region (assumes the role via STS)

Implementation in rag/llm/embedding_model.py461-503 shows the three authentication paths; the role-assumption path calls assume_role() to obtain temporary credentials.
VolcEngine assembles ark_api_key and endpoint_id into a JSON string stored in the api_key field (api/apps/llm_app.py163-166). The VolcEngineChat class then parses this JSON in its constructor (rag/llm/chat_model.py656-665).
Sources: api/apps/llm_app.py159-209 rag/llm/embedding_model.py461-503 rag/llm/chat_model.py653-665
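The assemble/parse round-trip for VolcEngine can be sketched as follows. The helper names are hypothetical; only the field names (ark_api_key, endpoint_id) and the JSON-in-api_key storage come from the text above:

```python
import json

def assemble_volc_key(ark_api_key: str, endpoint_id: str) -> str:
    """Pack both VolcEngine fields into one JSON string suitable for the
    single api_key column, as described above."""
    return json.dumps({"ark_api_key": ark_api_key, "endpoint_id": endpoint_id})

def parse_volc_key(stored: str) -> tuple:
    """Inverse step a VolcEngineChat-style constructor would perform
    when reading the stored credential back out."""
    data = json.loads(stored)
    return data["ark_api_key"], data["endpoint_id"]
```

Packing multiple fields into the existing api_key column avoids a schema change for every provider with non-standard credentials.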
The frontend LLM configuration UI is powered by constants and API calls:
The IconMap in web/src/constants/llm.ts69-134 maps factory names to icon file names, and the APIMapUrl dictionary (web/src/constants/llm.ts136-186) provides direct links to provider API key pages.
These constants enable the frontend to display appropriate icons and help links when users configure LLM providers.
Sources: web/src/constants/llm.ts1-187
System initialization populates LLM configurations from multiple sources:
The get_init_tenant_llm() function in api/db/services/llm_service.py36-83 reads default LLM configurations from settings.py.
These settings come from environment variables or service_conf.yaml. The initialization ensures the superuser has working LLM configurations for all required model types.
Sources: api/db/init_data.py166-179 api/db/services/llm_service.py36-83