This page provides an overview of DB-GPT's model integration and inference capabilities. It covers the Service-oriented Multi-model Management Framework (SMMF), the two primary deployment strategies (local and API proxy), and how models are registered, configured, and used for inference.
For information about how models are used within RAG pipelines, see RAG Pipeline and Knowledge Management. For information about multi-agent workflows that use models, see Multi-Agents and AWEL Workflows.
The SMMF is DB-GPT's core abstraction for managing and serving large language models. It provides a unified interface for interacting with 50+ different LLMs, regardless of whether they are deployed locally or accessed through API proxies. The framework decouples application logic from model deployment details, enabling developers to switch between models without code changes.
The key design principles of SMMF are a unified interface across heterogeneous backends, pluggable model adapters, and a clean separation between application logic and model serving.
Sources: README.md:1-363, README.zh.md:1-409, docs/docs/modules/smmf.md:1-12
DB-GPT supports two primary deployment strategies for language models, each optimized for different use cases and operational requirements.
Local deployment runs models on infrastructure controlled by the user, providing maximum data privacy and customization. This strategy is implemented through multiple inference backends:
| Backend | Description | Use Case |
|---|---|---|
| HuggingFace Transformers | Direct model loading using the Transformers library | Development, smaller models, CPU inference |
| vLLM | High-throughput inference engine with PagedAttention | Production deployments, high concurrency |
| llama.cpp | Optimized C/C++ implementation | CPU/Metal inference, edge devices |
| MLX | Apple Silicon optimized framework | macOS deployment, M-series chips |
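As an illustration of the trade-offs in the table above, the following sketch picks a backend from a hardware and workload description. This is a hypothetical heuristic, not DB-GPT's actual selection logic; the type and function names are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class DeploymentTarget:
    """Hypothetical description of where a model will run."""
    has_cuda_gpu: bool
    is_apple_silicon: bool
    expected_concurrency: int  # concurrent requests to serve

def choose_backend(target: DeploymentTarget) -> str:
    """Pick a local inference backend following the trade-offs in the
    table above (illustrative heuristic only)."""
    if target.has_cuda_gpu and target.expected_concurrency > 1:
        return "vllm"          # high-throughput GPU serving (PagedAttention)
    if target.is_apple_silicon:
        return "mlx"           # optimized for Apple M-series chips
    if not target.has_cuda_gpu:
        return "llama.cpp"     # CPU/Metal inference, quantized models
    return "transformers"      # simple default for development

print(choose_backend(DeploymentTarget(True, False, 32)))  # -> vllm
```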
Local deployment keeps all data on user-controlled infrastructure and gives full control over the serving stack, at the cost of provisioning and operating the required hardware.
API proxy deployment accesses models through external API providers, enabling rapid deployment without infrastructure requirements. Supported providers include OpenAI, DeepSeek, Alibaba (Qwen), Zhipu (GLM), 01.AI (Yi), and Baichuan (see the model support matrix below). This strategy requires no GPU infrastructure, scales with the provider, and makes new models available as soon as the provider exposes them, at the cost of sending data to a third party.
Sources: README.md:180-293, README.zh.md:191-318
Figure 1: SMMF Architecture - Model Integration Flow
This diagram illustrates the complete flow from application code through the SMMF to model inference. The ModelRegistry serves as the central hub for model discovery and routing, while the Worker Manager handles lifecycle management and load balancing across multiple model instances.
Sources: README.md:180-293, docs/docs/modules/smmf.md:1-12
The Model Registry is the central component that maintains information about available models and routes requests to appropriate workers. Models are configured through `.toml` configuration files that define the model name, its provider or backend, and the parameters needed to reach it.
Configuration file naming conventions:
- `dbgpt-proxy-*.toml`: configuration for API proxy models
- `dbgpt-local-*.toml`: configuration for locally deployed models

The registry provides model discovery, request routing, and status tracking for registered workers.
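The naming convention can be sketched as a small classifier. This is only an illustration of the convention described above, not DB-GPT's actual configuration loader:

```python
from pathlib import PurePath

def deployment_kind(config_path: str) -> str:
    """Classify a config file by the dbgpt-proxy-* / dbgpt-local-*
    naming convention (sketch of the convention only)."""
    name = PurePath(config_path).name
    if name.startswith("dbgpt-proxy-") and name.endswith(".toml"):
        return "proxy"
    if name.startswith("dbgpt-local-") and name.endswith(".toml"):
        return "local"
    return "unknown"

print(deployment_kind("configs/dbgpt-proxy-openai.toml"))  # -> proxy
```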
Sources: README.md:180-293
Figure 2: Model Inference Request Sequence
This sequence diagram shows the complete lifecycle of an inference request from application code to model execution. The flow supports both synchronous and streaming responses, with the worker handling protocol translation between the application and backend.
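The two response modes can be sketched as follows. The worker and token stream here are stand-ins invented for the example; a real worker would call the configured model backend:

```python
import asyncio
from typing import AsyncIterator

async def stream_tokens(prompt: str) -> AsyncIterator[str]:
    """Stand-in for a worker producing tokens incrementally
    (hypothetical; a real worker would invoke the model backend)."""
    for token in ["Hello", ", ", "world", "!"]:
        await asyncio.sleep(0)  # yield control, as real I/O would
        yield token

async def generate(prompt: str, stream: bool):
    """Illustrates the two response modes in Figure 2: a streaming
    request yields chunks as they arrive; a synchronous request
    collects them into one response."""
    if stream:
        return [chunk async for chunk in stream_tokens(prompt)]
    parts = []
    async for chunk in stream_tokens(prompt):
        parts.append(chunk)
    return "".join(parts)

print(asyncio.run(generate("hi", stream=False)))  # -> Hello, world!
```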
Sources: README.md:55-81
DB-GPT supports extensive model families across both local and proxy deployments. The following table summarizes major model families and their deployment options:
| Model Family | Provider | Local Support | API Proxy | Notable Models |
|---|---|---|---|---|
| DeepSeek | DeepSeek AI | ✅ | ✅ | DeepSeek-R1, DeepSeek-V3, DeepSeek-Coder |
| Qwen | Alibaba | ✅ | ✅ | Qwen3-235B, QwQ-32B, Qwen2.5-Coder |
| GLM | Tsinghua (Zhipu) | ✅ | ✅ | GLM-Z1-32B, GLM-4, ChatGLM |
| Llama | Meta | ✅ | ❌ | Llama-3.1-405B, Llama-3.1-70B |
| Gemma | Google | ✅ | ❌ | Gemma-2-27B, Gemma-7B |
| Yi | 01.AI | ✅ | ✅ | Yi-1.5-34B, Yi-34B |
| GPT | OpenAI | ❌ | ✅ | GPT-4, GPT-3.5-Turbo |
| Baichuan | Baichuan | ✅ | ✅ | Baichuan2-13B |
| InternLM | Shanghai AI Lab | ✅ | ❌ | InternLM2 |
| Mixtral | Mistral AI | ✅ | ❌ | Mixtral-8x7B |
| Phi | Microsoft | ✅ | ❌ | Phi-3 |
The complete list of supported models is maintained in the model configuration directory and can be extended through custom adapters.
Sources: README.md:184-293, README.zh.md:195-318
Figure 3: Worker Pool Management and Load Balancing
The Worker Manager maintains a pool of model worker instances, each capable of handling inference requests. The Model Controller routes incoming requests to available workers based on factors such as worker health and current load. Worker lifecycle management covers registration, health checking, and deregistration of worker instances.
For detailed information on worker implementation, see Model Workers and Inference Backends.
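A toy model of this routing, assuming a least-loaded policy over healthy workers (the class and policy are illustrative, not DB-GPT's actual Worker Manager):

```python
class WorkerPool:
    """Toy worker pool: workers report in-flight request counts, and
    routing prefers the least-loaded healthy worker (illustrative)."""
    def __init__(self):
        self._load = {}      # worker id -> in-flight requests
        self._healthy = set()

    def add(self, worker_id: str) -> None:
        self._load[worker_id] = 0
        self._healthy.add(worker_id)

    def mark_unhealthy(self, worker_id: str) -> None:
        self._healthy.discard(worker_id)  # excluded from routing

    def route(self) -> str:
        candidates = list(self._healthy)
        if not candidates:
            raise RuntimeError("no healthy workers")
        chosen = min(candidates, key=lambda w: self._load[w])
        self._load[chosen] += 1  # track in-flight work
        return chosen

pool = WorkerPool()
pool.add("worker-a")
pool.add("worker-b")
pool.mark_unhealthy("worker-b")
print(pool.route())  # -> worker-a
```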
Sources: README.md:55-81
The SMMF integrates with higher-level DB-GPT components through standardized interfaces:
AWEL (Agentic Workflow Expression Language) workflows can reference models by name in their DAG definitions. The workflow engine automatically resolves model names to worker instances through the Model Registry.
Multi-agent systems use models for reasoning, planning, and tool selection. Agents specify model requirements (e.g., reasoning capability, context length) and the SMMF selects appropriate models.
RAG pipelines use models for both embedding generation and text generation. The SMMF supports separate model configurations for embedding models and generation models, enabling optimized model selection for each task.
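The per-task split between embedding and generation models can be sketched as a simple mapping. The model names here are examples, not a prescribed configuration:

```python
# Hypothetical per-task model selection mirroring the RAG description:
# embedding and generation use independently configured models.
TASK_MODELS = {
    "embedding": "bge-large-zh",   # example embedding model name
    "generation": "deepseek-v3",   # example generation model name
}

def model_for(task: str) -> str:
    """Resolve the configured model for a pipeline task."""
    try:
        return TASK_MODELS[task]
    except KeyError:
        raise ValueError(f"unknown task {task!r}") from None

print(model_for("embedding"))  # -> bge-large-zh
```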
For more details on these integrations, see the dedicated pages on AWEL workflows, multi-agents, and RAG.
Sources: README.md:55-81
Model configuration files follow a standardized TOML format:
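A hypothetical sketch of such a file, following the `dbgpt-proxy-*.toml` naming convention mentioned above; the key names and structure are illustrative, so consult the shipped example configurations for the exact schema:

```toml
# Hypothetical dbgpt-proxy-example.toml (key names are illustrative)
[models]

[[models.llms]]
name = "deepseek-chat"            # model name applications reference
provider = "proxy/deepseek"       # proxy backend for this provider
api_key = "${env:DEEPSEEK_API_KEY}"

[[models.embeddings]]
name = "bge-large-zh"             # separate embedding model for RAG
provider = "proxy/openai"
```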
Configuration files are loaded by the Model Registry during startup or when models are dynamically registered.
Sources: README.md:180-293
The SMMF is designed for high-throughput inference, drawing on optimization techniques such as the vLLM backend's PagedAttention (see the backend table above) and hardware acceleration.
For detailed information on hardware acceleration and performance optimization, see Hardware Acceleration and Performance.
Sources: README.md:180-293, docs/docs/modules/smmf.md:1-12
The SMMF supports dynamic model switching and fallback mechanisms: if a model or its workers become unavailable, requests can be redirected to an alternative. These capabilities enable deployment strategies such as falling back from a locally deployed model to an API proxy, or routing different workloads to different models.
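A minimal fallback pattern along these lines might look like the following; the model names and the `call_model` callable are hypothetical stand-ins for the real inference path:

```python
def generate_with_fallback(prompt, model_chain, call_model):
    """Try each model in order until one succeeds (illustrative
    fallback pattern; all names are hypothetical)."""
    errors = []
    for model_name in model_chain:
        try:
            return call_model(model_name, prompt)
        except Exception as exc:
            errors.append((model_name, exc))  # remember and move on
    raise RuntimeError(f"all models failed: {errors}")

# Usage: fall back from a primary local model to an API proxy.
def fake_call(model, prompt):
    if model == "local-glm-4":
        raise ConnectionError("worker down")
    return f"[{model}] ok"

print(generate_with_fallback("hi", ["local-glm-4", "proxy-gpt-4"], fake_call))
```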
Sources: README.md:55-81
Model integration carries security considerations, particularly around where data is processed: API proxy deployment sends prompts to external providers, while local deployment keeps all data on user infrastructure.
For privacy-focused deployments, local deployment is recommended. For more information, see Model Configuration and Deployment.
Sources: README.md:294-297
The SMMF is designed for extensibility. Adding support for a new model involves:

1. Implementing the `BaseModelAdapter` interface
2. Providing a `.toml` configuration file

Custom adapters can also be developed for backends or providers that are not supported out of the box.
For detailed implementation guidance, see Model Adapters and Proxy Models.
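The extension flow can be sketched as follows; the method names on the adapter interface and the resolution helper are assumptions for illustration, not DB-GPT's exact API:

```python
from abc import ABC, abstractmethod

class BaseModelAdapter(ABC):
    """Sketch of the adapter contract described above
    (method names are illustrative)."""
    @abstractmethod
    def match(self, model_name: str) -> bool:
        """Return True if this adapter can serve the given model."""

    @abstractmethod
    def generate(self, prompt: str, **params) -> str:
        """Run inference and return the completion."""

class EchoAdapter(BaseModelAdapter):
    """Trivial adapter used here only to show the registration flow."""
    def match(self, model_name: str) -> bool:
        return model_name.startswith("echo-")

    def generate(self, prompt: str, **params) -> str:
        return prompt.upper()

ADAPTERS = [EchoAdapter()]  # registered adapters, checked in order

def resolve_adapter(model_name: str) -> BaseModelAdapter:
    for adapter in ADAPTERS:
        if adapter.match(model_name):
            return adapter
    raise LookupError(f"no adapter for {model_name!r}")

print(resolve_adapter("echo-test").generate("hello"))  # -> HELLO
```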
Sources: README.md:180-293
Next Steps: For detailed information on specific aspects of model integration, see the related pages referenced throughout this document.