The Service-oriented Multi-model Management Framework (SMMF) is DB-GPT's unified infrastructure for managing and deploying Large Language Models (LLMs). SMMF provides a consistent interface for interacting with 50+ models through both local deployment and API proxies, abstracting away the complexities of different inference frameworks, hardware requirements, and API protocols.
This document covers the architecture, components, deployment strategies, and configuration of SMMF. For information about how models are used in RAG pipelines, see RAG Pipeline and Knowledge Management. For model configuration in Docker deployments, see Docker Base Image and Build System. For acceleration-specific implementation details, see Hardware Acceleration and Performance.
Sources: README.md:1-363, docs/docs/modules/smmf.md:1-158
SMMF addresses the challenge that there is no de facto standard for deploying and serving LLMs. New models are constantly released with different requirements, training methods, and inference frameworks. Without SMMF, applications would need to implement custom adapters for each model, significantly increasing development complexity and maintenance burden.
SMMF solves this by providing:
- A consistent interface across 50+ models, both locally deployed and API-proxied
- Pluggable workers that abstract away inference frameworks and hardware requirements
- Declarative model registration via .toml configuration files

Sources: README.md:55-183, docs/docs/modules/smmf.md:1-30
Sources: README.md:180-292; Diagram 3 from high-level architecture
The Model Registry maintains metadata about available models, including their type (local/proxy), configuration parameters, and deployment status. It serves as the central source of truth for model discovery and selection.
Key Responsibilities:
- Maintaining metadata for every registered model (type, configuration parameters, deployment status)
- Serving as the source of truth for model discovery and selection
- Loading model definitions from .toml configuration files

Configuration Format: Models are registered via TOML files in two categories:
- dbgpt-local-*.toml: Local models deployed using inference frameworks
- dbgpt-proxy-*.toml: Remote API models accessed through proxy adapters

Sources: docs/docs/modules/smmf.md:32-80
The Model Controller manages the complete lifecycle of model workers, from initialization to shutdown. It coordinates between the registry, worker manager, and individual workers.
Key Responsibilities:
- Starting, monitoring, and shutting down model workers
- Registering and deregistering workers with the Model Registry
- Coordinating with the Worker Manager on request dispatch
- Detecting and handling worker failures
Sources: docs/docs/modules/smmf.md:82-105
The Worker Manager handles the operational aspects of the worker pool, including load distribution and resource allocation.
Key Responsibilities:
- Distributing incoming requests across the worker pool
- Allocating resources and tracking worker capacity
- Maintaining pool membership as workers join and leave
Load Balancing: The Worker Manager distributes requests across available workers using strategies such as round-robin, least-connections, or weighted distribution based on worker capacity.
Sources: docs/docs/modules/smmf.md:107-130
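The round-robin and least-connections strategies described above can be sketched as follows. This is an illustrative sketch with hypothetical names, not the actual DB-GPT Worker Manager API:

```python
from itertools import cycle

class WorkerPool:
    """Illustrative sketch of Worker Manager routing strategies
    (hypothetical names, not DB-GPT's actual classes)."""

    def __init__(self, workers):
        self.workers = list(workers)
        self._rr = cycle(self.workers)
        # In-flight request count per worker, used by least-connections.
        self.active = {w: 0 for w in self.workers}

    def pick_round_robin(self):
        # Each call returns the next worker in a fixed rotation.
        return next(self._rr)

    def pick_least_connections(self):
        # Route to the worker with the fewest in-flight requests.
        return min(self.workers, key=lambda w: self.active[w])

pool = WorkerPool(["worker-a", "worker-b"])
print(pool.pick_round_robin())        # worker-a
pool.active["worker-a"] = 3
print(pool.pick_least_connections())  # worker-b
```

Weighted distribution would extend `pick_least_connections` by dividing each worker's load by its declared capacity before taking the minimum.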
Model Workers are the execution units that host and serve models. Each worker encapsulates a model instance and its associated inference framework.
Worker Types:
| Worker Type | Inference Framework | Use Case |
|---|---|---|
| HuggingFace Worker | Transformers library | General-purpose local inference |
| vLLM Worker | vLLM engine | High-throughput production serving |
| LLAMA.cpp Worker | LLAMA.cpp | CPU/Metal inference, reduced memory |
| MLX Worker | Apple MLX | Optimized for Apple Silicon |
| Proxy Worker | API client | Remote model access via APIs |
Sources: README.md:180-292; Diagram 3 from high-level architecture
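All of these worker types expose the same contract to the rest of SMMF. A minimal sketch of that idea, with hypothetical names and a stub worker standing in for a real inference framework:

```python
from abc import ABC, abstractmethod

class ModelWorker(ABC):
    """Illustrative common contract shared by SMMF worker types
    (hypothetical names, not DB-GPT's actual interface)."""

    @abstractmethod
    def load(self, config: dict) -> None:
        """Load the model according to its TOML-derived config."""

    @abstractmethod
    def generate(self, prompt: str, **params) -> str:
        """Run inference and return the completion."""

class EchoWorker(ModelWorker):
    # Stub "inference framework" used only to make the sketch runnable;
    # a real worker would wrap Transformers, vLLM, llama.cpp, etc.
    def load(self, config: dict) -> None:
        self.name = config.get("model_name", "echo")

    def generate(self, prompt: str, **params) -> str:
        return f"[{self.name}] {prompt}"

w = EchoWorker()
w.load({"model_name": "demo"})
print(w.generate("hello"))  # [demo] hello
```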
Local deployment runs models on the same infrastructure as the DB-GPT application, providing full control over model execution, data privacy, and performance tuning.
The default local deployment option using the HuggingFace Transformers library.
Characteristics:
- Broad model coverage through the Transformers ecosystem
- Minimal setup; the default choice for local deployment
- Supports 8-bit and 4-bit quantized loading
Configuration Parameters:
- model_name: HuggingFace model identifier or local path
- model_type: Specific model architecture (e.g., llama, qwen, chatglm)
- device: Target device (cuda, cpu)
- load_in_8bit, load_in_4bit: Quantization flags

Sources: README.md:180-183
vLLM provides high-throughput, low-latency serving for production workloads through continuous batching and optimized memory management.
Characteristics:
- Continuous batching of concurrent requests
- Optimized GPU memory management
- Tensor parallelism across multiple GPUs

Use Cases:
- High-throughput production serving
- Multi-user deployments with strict latency requirements
Configuration Parameters:
- max_model_len: Maximum sequence length
- gpu_memory_utilization: Fraction of GPU memory to use
- tensor_parallel_size: Number of GPUs for tensor parallelism
- quantization: Quantization method (awq, gptq)

Sources: README.md:180-183; Diagram 3 from high-level architecture
CPU-optimized inference engine with optional Metal acceleration for Apple devices.
Characteristics:
- Runs quantized GGUF models with a small memory footprint
- CPU-first inference with optional Metal layer offloading

Use Cases:
- Deployments without a dedicated GPU
- Apple devices and memory-constrained environments
Configuration Parameters:
- model_path: Path to GGUF format model
- n_ctx: Context window size
- n_gpu_layers: Number of layers to offload to GPU (Metal)
- n_threads: CPU thread count

Sources: README.md:180-183; Diagram 3 from high-level architecture
Apple's machine learning framework optimized for Apple Silicon (M1/M2/M3).
Characteristics:
- Native Apple Silicon support with unified memory
- No CUDA dependency

Use Cases:
- Local development and serving on M-series Macs
Sources: README.md:180-183; Diagram 3 from high-level architecture
API proxy deployment connects to remote LLM services through standardized adapters, enabling access to commercial models without local infrastructure.
| Provider | Models Supported | Configuration Key |
|---|---|---|
| OpenAI | GPT-4, GPT-3.5, GPT-4-Turbo | openai_api_key, openai_api_base |
| DeepSeek | DeepSeek-V3, DeepSeek-R1, DeepSeek-Coder | deepseek_api_key |
| Qwen | Qwen-2.5, QwQ-32B | qwen_api_key |
| Ollama | Any Ollama-served model | ollama_api_base |
| Baidu (Wenxin) | ERNIE models | wenxin_api_key, wenxin_secret_key |
| Alibaba (Tongyi) | Qwen series | tongyi_api_key |
| Zhipu (ChatGLM) | GLM-4, GLM-Z1 | zhipu_api_key |
Sources: README.md:184-318, README.zh.md:195-323
Each provider has a dedicated adapter implementing a common interface:
Adapter Responsibilities:
- Translating requests into the provider's API format
- Normalizing provider responses to SMMF's common interface
- Managing authentication credentials
- Mapping provider-specific errors to common error types
Configuration Parameters (common):
- proxy_server_url: Base URL for the API
- proxy_api_key: Authentication key
- proxy_api_version: API version (if applicable)
- proxies: HTTP proxy configuration

Sources: docs/docs/modules/smmf.md:32-80
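Putting the common parameters together, a hedged sketch of how a proxy adapter might translate an SMMF-style request into an OpenAI-compatible HTTP payload. Class and method names here are hypothetical, not DB-GPT's actual adapter API:

```python
class ProxyAdapter:
    """Illustrative proxy adapter sketch: builds an OpenAI-compatible
    chat request from common SMMF proxy parameters."""

    def __init__(self, proxy_server_url: str, proxy_api_key: str):
        self.url = proxy_server_url
        self.key = proxy_api_key

    def build_request(self, model: str, prompt: str, temperature: float = 0.7):
        # Translate the generic request into the provider's wire format.
        return {
            "url": f"{self.url}/chat/completions",
            "headers": {"Authorization": f"Bearer {self.key}"},
            "json": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": temperature,
            },
        }

adapter = ProxyAdapter("https://api.example.com/v1", "sk-test")
req = adapter.build_request("gpt-4", "hello")
print(req["url"])  # https://api.example.com/v1/chat/completions
```

A real adapter would also stream tokens and normalize error payloads; this sketch shows only the request-translation step.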
SMMF uses TOML format for model configuration, separated into local and proxy configurations.
Example structure for dbgpt-local-llama.toml:
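The example file is not reproduced in this extract; the following is a hypothetical sketch using the documented section names. Actual keys in DB-GPT's shipped configurations may differ:

```toml
# Hypothetical sketch of dbgpt-local-llama.toml
[model]
model_name = "meta-llama/Llama-2-7b-chat-hf"
model_type = "llama"

[deployment]
device = "cuda"

[optimization]
load_in_4bit = true

[inference]
temperature = 0.7
max_new_tokens = 512
```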
Key Sections:
- [model]: Model identification and location
- [deployment]: Hardware allocation
- [optimization]: Quantization and acceleration
- [inference]: Generation parameters

Sources: Inferred from docs/docs/modules/smmf.md:1-158 and standard TOML configuration patterns
Example structure for dbgpt-proxy-openai.toml:
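As with the local example, the file itself is not reproduced here; a hypothetical sketch using the documented section names (actual keys may differ):

```toml
# Hypothetical sketch of dbgpt-proxy-openai.toml
[proxy]
provider = "openai"
model = "gpt-4"

[authentication]
proxy_api_key = "${env:OPENAI_API_KEY}"
proxy_server_url = "https://api.openai.com/v1"

[parameters]
temperature = 0.7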
Key Sections:
- [proxy]: Provider identification
- [authentication]: API credentials
- [parameters]: Provider-specific parameters

Sources: Inferred from docs/docs/modules/smmf.md:1-158
Sources: docs/docs/modules/smmf.md:82-105
The initialization process loads models into memory and prepares workers for serving requests.
Initialization Steps:
1. Parse the model's TOML configuration and validate parameters
2. Allocate the target device and memory
3. Load model weights, applying any quantization settings
4. Register the worker and mark the model's status as Ready
Sources: docs/docs/modules/smmf.md:107-130
Routing Strategies:
- Round-robin across healthy workers
- Least-connections based on in-flight request counts
- Weighted distribution based on worker capacity
Sources: docs/docs/modules/smmf.md:107-130
Workers report health metrics to the Controller:
| Metric | Description | Threshold |
|---|---|---|
| Response Time | Average inference latency | Alert if > 5s |
| Error Rate | Failed requests / total requests | Alert if > 5% |
| Memory Usage | GPU/CPU memory consumption | Alert if > 90% |
| Queue Length | Pending requests | Alert if > 100 |
| Heartbeat | Last successful ping | Mark unhealthy if > 30s |
Unhealthy workers are removed from the pool and optionally restarted.
Sources: Inferred from docs/docs/modules/smmf.md:82-130
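The threshold checks in the table above can be expressed as a small predicate. Metric names here are illustrative, not DB-GPT's actual metric schema:

```python
# Thresholds from the health-monitoring table (illustrative names).
THRESHOLDS = {
    "response_time_s": 5.0,   # average inference latency
    "error_rate": 0.05,       # failed / total requests
    "memory_usage": 0.90,     # fraction of GPU/CPU memory used
    "queue_length": 100,      # pending requests
    "heartbeat_age_s": 30.0,  # seconds since last successful ping
}

def is_healthy(metrics: dict) -> bool:
    """A worker is healthy only if every metric is at or below its limit."""
    return all(metrics[k] <= limit for k, limit in THRESHOLDS.items())

ok = {"response_time_s": 1.2, "error_rate": 0.01, "memory_usage": 0.6,
      "queue_length": 4, "heartbeat_age_s": 2.0}
stale = dict(ok, heartbeat_age_s=45.0)  # missed heartbeats -> unhealthy
print(is_healthy(ok), is_healthy(stale))  # True False
```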
SMMF integrates multiple acceleration techniques to optimize inference performance. These are implemented in the packages/dbgpt-accelerator package.
Flash Attention is an optimized attention mechanism that reduces memory usage and improves speed through kernel fusion and memory hierarchy optimization.
Benefits:
- Reduced memory usage: attention memory scales linearly rather than quadratically with sequence length
- Faster computation through fused kernels and better use of the GPU memory hierarchy
Supported Models: Most transformer-based architectures (Llama, Qwen, ChatGLM, etc.)
Configuration: Enabled by default in vLLM; requires the flash-attn package for HuggingFace workers
Sources: README.md:126-127; Diagram 3 from high-level architecture
Quantization reduces model precision to lower memory usage and increase throughput.
Precision Options:
- FP16/BF16: Standard half-precision baseline
- 8-bit (load_in_8bit): Roughly halves weight memory with minimal quality loss
- 4-bit (load_in_4bit): Roughly quarters weight memory with a small quality trade-off

Configuration: Set load_in_8bit or load_in_4bit for HuggingFace workers, or the quantization parameter (awq, gptq) for vLLM
Use Cases: Memory-constrained environments, consumer GPUs
Sources: README.md:126-127; Diagram 3 from high-level architecture
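The memory savings are simple arithmetic over parameter count and bit width. A rough weights-only estimate (activations and KV cache add more on top):

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: parameters x bits per weight.
    Weights only; runtime memory is higher."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
# 7B model at 16-bit: ~14.0 GB
# 7B model at 8-bit: ~7.0 GB
# 7B model at 4-bit: ~3.5 GB
```

This is why 4-bit quantization brings a 7B model within reach of consumer GPUs with 8 GB of memory.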
GPTQ (Generative Pre-trained Transformer Quantization) performs post-training quantization optimized for generation tasks.
Characteristics:
- Post-training quantization: no retraining of the original model required
- Quantizes layer by layer to minimize reconstruction error on generation tasks
- Typically used with 4-bit weights

Configuration: Load a GPTQ-quantized checkpoint and set the quantization method to gptq (vLLM)
Sources: README.md:126-127; Diagram 3 from high-level architecture
SMMF supports multiple CUDA versions to accommodate different GPU driver environments:
| CUDA Version | Compatible GPUs | Status |
|---|---|---|
| 11.8 | RTX 30xx, A100, V100 | Supported |
| 12.1 | RTX 40xx, H100 | Supported |
| 12.4 | Latest GPUs | Supported |
Installation: CUDA-specific wheels are available for torch, vllm, and acceleration libraries.
Sources: README.md:126-127; Diagram 3 from high-level architecture
vLLM is the recommended backend for production deployments requiring high throughput.
Architecture:
- PagedAttention manages the KV cache in fixed-size blocks, reducing memory fragmentation
- A continuous-batching scheduler adds and removes sequences between decoding steps
Performance: 2-3x higher throughput than HuggingFace Transformers
Sources: Diagram 3 from high-level architecture
HuggingFace's optimized serving framework.
Characteristics:
- Continuous batching and token streaming
- Tight integration with the HuggingFace model ecosystem
Sources: Diagram 3 from high-level architecture
NVIDIA's high-performance inference engine.
Characteristics:
- Compiled, kernel-fused execution optimized for NVIDIA GPUs
- Supports low-precision quantization for maximum throughput
Sources: Diagram 3 from high-level architecture
The Model Registry is the central component that tracks all available models and their metadata.
Key Operations:
- register_model(): Add a new model from configuration
- get_model(): Retrieve model metadata by name
- list_models(): Query models with filters (type, status, provider)
- update_status(): Update model availability

Sources: docs/docs/modules/smmf.md:32-80
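The four operations can be illustrated with an in-memory sketch. This is a hypothetical implementation; DB-GPT's actual registry is backed by the metadata database:

```python
class ModelRegistry:
    """Minimal in-memory sketch of the registry operations
    (illustrative, not DB-GPT's actual class)."""

    def __init__(self):
        self._models = {}

    def register_model(self, name, config):
        # New models start in the Starting state until a worker is ready.
        self._models[name] = {"config": config, "status": "Starting",
                              "type": config.get("deployment_type", "local")}

    def get_model(self, name):
        return self._models[name]

    def list_models(self, status=None):
        # Optional filter by deployment status.
        return [n for n, m in self._models.items()
                if status is None or m["status"] == status]

    def update_status(self, name, status):
        self._models[name]["status"] = status

reg = ModelRegistry()
reg.register_model("llama-7b", {"deployment_type": "local"})
reg.update_status("llama-7b", "Ready")
print(reg.list_models(status="Ready"))  # ['llama-7b']
```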
SMMF provides model inference for RAG operations:
Model Types Used:
- Chat/completion models for answer generation
- Embedding models for document vectorization (e.g., bge-large-en-v1.5)

For details on RAG integration, see RAG Pipeline and Knowledge Management.
Sources: examples/rag/embedding_rag_example.py:1-75
Agents use SMMF for LLM inference during planning and execution:
For details on agent integration, see Multi-Agents and AWEL Workflows.
Sources: Diagram 4 from high-level architecture
SMMF provides the LLM for natural language to SQL translation:
For details on GBI, see Generative Business Intelligence (GBI).
Sources: Diagram 6 from high-level architecture
Model metadata and operational state are persisted using SQLAlchemy.
Models are stored in the metadata database with the following structure:
| Column | Type | Description |
|---|---|---|
| id | Integer | Primary key |
| model_name | String | Unique model identifier |
| model_type | String | Model architecture (llama, qwen, etc.) |
| deployment_type | Enum | Local or Proxy |
| config_json | JSON | Full configuration |
| status | Enum | Ready, Starting, Failed, etc. |
| created_at | DateTime | Registration timestamp |
| updated_at | DateTime | Last status update |
Storage Interface: SQLAlchemyStorage provides CRUD operations
Sources: packages/dbgpt-core/src/dbgpt/storage/metadata/db_storage.py:1-137
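The table maps naturally onto a relational schema. A stdlib sqlite3 sketch of the same columns (DB-GPT itself uses SQLAlchemy, and its actual models may differ in detail):

```python
import json
import sqlite3

# Create a table mirroring the metadata columns described above (sketch only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE model_metadata (
        id INTEGER PRIMARY KEY,
        model_name TEXT UNIQUE NOT NULL,
        model_type TEXT,
        deployment_type TEXT CHECK (deployment_type IN ('local', 'proxy')),
        config_json TEXT,
        status TEXT DEFAULT 'Starting',
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        updated_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute(
    "INSERT INTO model_metadata (model_name, model_type, deployment_type,"
    " config_json) VALUES (?, ?, ?, ?)",
    ("llama-7b", "llama", "local", json.dumps({"device": "cuda"})),
)
row = conn.execute(
    "SELECT model_name, status FROM model_metadata").fetchone()
print(row)  # ('llama-7b', 'Starting')
```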
Failure Scenarios:
- Worker crash: the worker is removed from the pool and optionally restarted
- Model load failure: the model's status is marked Failed in the registry
- Request timeout: an error response is returned to the caller
Sources: Inferred from docs/docs/modules/smmf.md:82-130
SMMF implements timeouts at multiple levels:
| Timeout Type | Default | Configurable |
|---|---|---|
| Connection | 10s | Yes |
| Inference | 60s | Yes |
| Queue Wait | 120s | Yes |
Requests exceeding timeouts return error responses to prevent resource exhaustion.
Sources: Inferred from system design principles
SMMF supports request batching to increase throughput:
Static Batching: Accumulate requests for a fixed time window (e.g., 100ms)
Dynamic Batching (vLLM): Continuously add/remove requests based on sequence length
Configuration: Batch size limits and the accumulation window are backend-specific; vLLM manages batch composition automatically
Sources: Inferred from vLLM integration
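Static batching amounts to draining a request stream until either the window closes or the batch fills. A minimal sketch with hypothetical parameter names (vLLM's continuous batching is more dynamic than this):

```python
import time

def static_batch(requests, window_ms: float = 100, max_batch: int = 8):
    """Drain requests until the time window closes or the batch is full."""
    batch = []
    deadline = time.monotonic() + window_ms / 1000
    it = iter(requests)
    while len(batch) < max_batch and time.monotonic() < deadline:
        try:
            batch.append(next(it))
        except StopIteration:
            break  # no more pending requests in this window
    return batch

print(static_batch(["r1", "r2", "r3"], max_batch=2))  # ['r1', 'r2']
```

Dynamic batching replaces the fixed window with a scheduler loop that re-forms the batch after every decoding step as sequences finish or arrive.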
Model Weight Caching: Avoid reloading frequently-used models
KV Cache: Store key-value pairs for efficient autoregressive generation
Response Caching: Cache responses for duplicate queries (optional)
Sources: Inferred from system design
| Metric | Purpose | Export Method |
|---|---|---|
| Requests/sec | Throughput monitoring | Prometheus |
| P50/P95/P99 Latency | Performance tracking | Prometheus |
| GPU Utilization | Resource optimization | NVIDIA-SMI |
| Memory Usage | Capacity planning | Prometheus |
| Error Rate | Reliability monitoring | Prometheus |
Sources: Inferred from production best practices
SMMF provides a comprehensive framework for managing heterogeneous LLM deployments through:
- A unified interface over 50+ local and proxy models
- A registry/controller/worker-manager architecture with pluggable workers
- Declarative TOML configuration for both deployment modes
- Integrated acceleration techniques (Flash Attention, quantization, optimized serving backends)
This architecture enables DB-GPT to support rapid model iteration and diverse deployment scenarios while maintaining consistent application behavior.
Sources: README.md:1-363, docs/docs/modules/smmf.md:1-158; Diagrams 1, 3 from high-level architecture