This document describes the hardware acceleration capabilities and performance optimization features in DB-GPT. It covers the packages/dbgpt-accelerator package structure, CUDA support across multiple versions, quantization techniques, high-throughput inference engines, and platform-specific optimizations. The system provides flexible installation options through optional dependencies that automatically configure the appropriate acceleration backend based on the target hardware platform.
For model deployment strategies and worker management, see Model Workers and Inference Backends. For model configuration, see Model Configuration and Deployment.
DB-GPT's hardware acceleration is implemented through the packages/dbgpt-accelerator monorepo package, which consists of two sub-packages that handle different aspects of acceleration:
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml1-197 packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml1-33
| Component | Package | Purpose |
|---|---|---|
| dbgpt-acc-auto | packages/dbgpt-accelerator/dbgpt-acc-auto | Manages PyTorch installation, CUDA versions, and acceleration backends |
| dbgpt-acc-flash-attn | packages/dbgpt-accelerator/dbgpt-acc-flash-attn | Provides Flash Attention support with a no-build-isolation configuration |
| PyTorch index sources | Configuration in pyproject.toml | Custom package indices for CUDA-specific PyTorch builds |
The dbgpt-acc-auto package is referenced as a dependency in packages/dbgpt-app/pyproject.toml13 and provides the foundation for all hardware acceleration features.
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml1-19 packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml1-18
DB-GPT supports three CUDA versions with platform-specific PyTorch installations through the uv package manager's custom index feature. The system uses conditional dependencies based on platform markers to install the correct PyTorch build.
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml46-61 packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml125-148
The system enforces specific PyTorch version constraints based on the target platform:
| Platform | PyTorch Version | Notes |
|---|---|---|
| macOS x86_64 | >=2.2.1,<2.3 | PyTorch 2.3.0+ no longer supports Intel Macs (macOS 11.0+ ARM64 only) |
| macOS ARM64 | >=2.2.1 | Full support for Apple Silicon |
| Linux | >=2.2.1 | All CUDA versions supported |
| Windows | >=2.2.1 | CUDA 11.8, 12.1, 12.4 supported |
These constraints are defined in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml28-45 with conditional markers like torch>=2.2.1; sys_platform != 'darwin' or platform_machine != 'x86_64'.
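The marker quoted above can be evaluated directly with the packaging library, which implements the same marker semantics pip and uv apply at install time. A minimal sketch (the environment dicts are illustrative):

```python
# Evaluate the torch conditional marker quoted above for two platforms.
from packaging.markers import Marker

marker = Marker("sys_platform != 'darwin' or platform_machine != 'x86_64'")

# Intel Mac: the marker is False, so this torch requirement is skipped there
# (the stricter >=2.2.1,<2.3 constraint applies instead).
print(marker.evaluate({"sys_platform": "darwin", "platform_machine": "x86_64"}))  # False

# Linux x86_64: the marker is True, so torch>=2.2.1 applies.
print(marker.evaluate({"sys_platform": "linux", "platform_machine": "x86_64"}))  # True
```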
The tool.uv.sources section in the pyproject.toml maps PyTorch packages to their appropriate index sources:
The corresponding index definitions are in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml125-143:
- pytorch-cu118: https://download.pytorch.org/whl/cu118
- pytorch-cu121: https://download.pytorch.org/whl/cu121
- pytorch-cu124: https://download.pytorch.org/whl/cu124

Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml150-189
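These index names and URLs typically map to `[[tool.uv.index]]` entries. A sketch of what the definitions plausibly look like (the names and URLs are from this page; the `explicit` flag follows uv's documented index syntax and is an assumption here):

```toml
[[tool.uv.index]]
name = "pytorch-cu118"
url = "https://download.pytorch.org/whl/cu118"
explicit = true

[[tool.uv.index]]
name = "pytorch-cu121"
url = "https://download.pytorch.org/whl/cu121"
explicit = true

[[tool.uv.index]]
name = "pytorch-cu124"
url = "https://download.pytorch.org/whl/cu124"
explicit = true
```

With `explicit = true`, a package is only resolved from an index when a `tool.uv.sources` entry points at it, which keeps CUDA-specific wheels from leaking into unrelated dependencies.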
The conflict resolution ensures mutually exclusive CUDA versions through packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml110-118:
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml26-61 packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml110-118
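uv expresses mutually exclusive extras through its `tool.uv.conflicts` table. A sketch using the CUDA extra names from this page (the exact declaration in the repository may differ):

```toml
[tool.uv]
conflicts = [
    [
        { extra = "cuda118" },
        { extra = "cuda121" },
        { extra = "cuda124" },
    ],
]
```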
DB-GPT supports multiple quantization methods to reduce model memory footprint and improve inference speed. The quantization implementations are provided as optional extras in the dbgpt-acc-auto package.
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml78-100
BitsAndBytes provides 4-bit and 8-bit quantization for LLM inference with minimal accuracy loss. It requires CUDA support and is only available on Windows and Linux platforms.
Installation:
Dependencies:
- bitsandbytes>=0.39.0 - Core quantization library
- accelerate - HuggingFace acceleration utilities

Platform Constraint: sys_platform == 'win32' or sys_platform == 'linux'
The implementation is defined in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml78-81
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml78-81
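Given the dependencies and platform constraint listed above, the quant_bnb extra plausibly reduces to an optional-dependency entry like the following sketch (package names, versions, and the marker are from this page; the exact layout in the file is an assumption):

```toml
[project.optional-dependencies]
quant_bnb = [
    "bitsandbytes>=0.39.0; sys_platform == 'win32' or sys_platform == 'linux'",
    "accelerate; sys_platform == 'win32' or sys_platform == 'linux'",
]
```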
GPTQ (Generative Pre-trained Transformer Quantization) provides efficient 4-bit weight quantization with optimized kernels for inference.
Installation:
Dependencies:
- optimum - HuggingFace optimization toolkit
- auto-gptq - GPTQ implementation

Hardware Requirements:
The configuration is in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml97-100
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml82-100
GPTQModel is an advanced GPTQ implementation with additional features and optimizations. It requires transformers > 4.48.3 and is installed with build isolation disabled to ensure correct dependency resolution.
Installation:
Dependencies:
- optimum - HuggingFace optimization toolkit
- device-smi>=0.3.3 - Device monitoring
- tokenicer - Tokenization utilities
- gptqmodel - Advanced GPTQ implementation

Build Configuration: The package requires special build configuration in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml194-196:
This disables build isolation for gptqmodel to ensure proper dependency resolution during installation.
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml90-96 packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml194-196
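The special build configuration described above matches uv's documented `no-build-isolation-package` setting; a sketch of the likely form:

```toml
[tool.uv]
no-build-isolation-package = ["gptqmodel"]
```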
| Method | Bit-width | Platform Support | Primary Use Case |
|---|---|---|---|
| BitsAndBytes | 4-bit, 8-bit | Windows, Linux | Easy integration, HuggingFace models |
| Auto-GPTQ | 4-bit | All with CUDA/ROCm | Production inference, optimized kernels |
| GPTQModel | 4-bit | All with CUDA/ROCm | Advanced features, latest optimizations |
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml78-100
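A quick way to check which of the backends from the table are present in the current environment is to probe for their modules. A sketch (the import names are the conventional ones for these PyPI packages, assumed here):

```python
# Probe for quantization backends without importing them (find_spec only
# locates each module, so heavy CUDA initialization is avoided).
import importlib.util

BACKEND_MODULES = {
    "BitsAndBytes": "bitsandbytes",
    "Auto-GPTQ": "auto_gptq",
    "GPTQModel": "gptqmodel",
}

available = {
    name: importlib.util.find_spec(module) is not None
    for name, module in BACKEND_MODULES.items()
}
print(available)
```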
DB-GPT integrates multiple high-throughput inference engines optimized for different use cases and hardware platforms. These engines provide significant performance improvements over standard HuggingFace Transformers inference.
vLLM is a high-throughput and memory-efficient inference engine for LLMs. It uses PagedAttention to manage attention key-value memory efficiently and supports continuous batching.
Installation:
Package Definition:
Platform Constraint: Linux only (sys_platform == 'linux')
Minimum Version: vLLM 0.7.0 or higher
The vLLM extra is defined in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml67-70
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml67-70
MLX provides optimized inference for Apple Silicon (M1, M2, M3 chips) with Metal acceleration. It's specifically designed for efficient LLM inference on macOS ARM64 architecture.
Installation:
Package Definition:
Platform Constraint: macOS only (sys_platform == 'darwin')
Minimum Version: mlx-lm 0.25.2 or higher
The MLX integration is defined in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml71-73
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml71-73
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml67-73 packages/dbgpt-core/pyproject.toml116-122
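From the platform constraints and minimum versions stated above, the two engine extras plausibly look like this sketch (names, versions, and markers are from this page; the layout is an assumption):

```toml
[project.optional-dependencies]
vllm = ["vllm>=0.7.0; sys_platform == 'linux'"]
mlx = ["mlx-lm>=0.25.2; sys_platform == 'darwin'"]
```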
LLAMA.cpp provides efficient CPU and GPU inference for LLaMA-based models with support for Metal on macOS and CUDA on Linux/Windows.
Installation from dbgpt-core:
Package Definition in dbgpt-core: packages/dbgpt-core/pyproject.toml116-122 defines:
- llama_cpp = ["llama-cpp-python"] - Core library
- llama_cpp_server = ["llama-cpp-server-py-core>=0.1.4", "llama-cpp-server-py>=0.1.4"] - Server components

Sources: packages/dbgpt-core/pyproject.toml116-122
| Engine | Platform | Primary Advantage | Best For |
|---|---|---|---|
| vLLM | Linux | Highest throughput, PagedAttention | Production serving, batch processing |
| MLX | macOS ARM64 | Metal acceleration, low latency | Apple Silicon development/inference |
| LLAMA.cpp | Cross-platform | CPU optimization, Metal/CUDA support | Resource-constrained environments |
| HuggingFace | Cross-platform | Maximum compatibility, ease of use | Development, prototyping |
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml67-73 packages/dbgpt-core/pyproject.toml116-122
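The comparison above can be summarized as a simple platform-based default. The sketch below is a simplification for illustration, not DB-GPT's actual selection logic:

```python
# Choose a default inference engine from the current platform.
import platform
import sys

def default_engine() -> str:
    if sys.platform == "linux":
        return "vllm"        # highest throughput, PagedAttention
    if sys.platform == "darwin" and platform.machine() == "arm64":
        return "mlx"         # Metal acceleration on Apple Silicon
    return "llama.cpp"       # cross-platform CPU/Metal/CUDA fallback

print(default_engine())
```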
Flash Attention is an I/O-aware attention algorithm that reduces memory access and improves computational efficiency. DB-GPT provides a dedicated package for Flash Attention integration.
The dbgpt-acc-flash-attn package uses dependency groups to manage the Flash Attention installation process with specific build requirements.
Sources: packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml1-33
The package defines three dependency groups in packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml19-28:
build group:
- setuptools>=75.8.0 - Required for building CUDA extensions

direct group:
- torch>=2.2.1 - PyTorch dependency needed before flash-attn installation

main group:
- flash-attn>=2.5.8 - The Flash Attention library itself

A critical aspect of Flash Attention installation is the no-build-isolation setting in packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml31-32:
This configuration:
- Disables build isolation for the flash-attn package
- Allows flash-attn to access PyTorch during build time

Sources: packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml30-32
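Putting the three dependency groups and the isolation setting together, the package's pyproject.toml plausibly contains something like this sketch (group names, packages, and versions are from this page; the PEP 735 dependency-groups layout and uv's no-build-isolation-package option are the assumed mechanisms):

```toml
[dependency-groups]
build = ["setuptools>=75.8.0"]
direct = ["torch>=2.2.1"]
main = ["flash-attn>=2.5.8"]

[tool.uv]
no-build-isolation-package = ["flash-attn"]
```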
Method 1: Direct installation of the flash-attn package
Method 2: Through dbgpt-acc-auto
The flash_attn extra in dbgpt-acc-auto is defined in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml101-103:
Flash Attention requires:
Sources: packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml1-33 packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml101-103
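The two methods might be run as follows; these commands are a hedged sketch (the extra name comes from the extras reference on this page, and pip's --no-build-isolation flag mirrors the uv setting described above). Method 1's ordering matters because flash-attn needs PyTorch present at build time:

```shell
# Method 1: direct installation (install torch first, then flash-attn
# without build isolation so its build can see the installed torch)
pip install "torch>=2.2.1"
pip install "flash-attn>=2.5.8" --no-build-isolation

# Method 2: through the dbgpt-acc-auto extra
pip install "dbgpt-acc-auto[flash_attn]"
```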
DB-GPT's hardware acceleration system includes platform-specific optimizations that automatically configure the appropriate dependencies based on the target environment.
The system uses environment markers to conditionally install packages. These markers are defined throughout the pyproject.toml files and evaluated at installation time.
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml28-61
macOS has special version constraints due to PyTorch 2.3.0's requirement for macOS 11.0+ ARM64. The system differentiates between Intel and Apple Silicon Macs:
Intel Mac (x86_64):
Apple Silicon (ARM64):
These constraints are defined in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml29-35
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml28-45
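The two macOS constraints can be checked with packaging's SpecifierSet, which implements the same version-matching semantics as pip:

```python
# Verify which torch versions satisfy each macOS constraint from above.
from packaging.specifiers import SpecifierSet

intel_mac = SpecifierSet(">=2.2.1,<2.3")   # macOS x86_64
apple_silicon = SpecifierSet(">=2.2.1")    # macOS arm64

print("2.2.2" in intel_mac)       # True: 2.2.x is the last series for Intel Macs
print("2.3.0" in intel_mac)       # False: excluded by the <2.3 bound
print("2.3.0" in apple_silicon)   # True
```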
Linux supports the widest range of optimizations with different combinations based on architecture and CUDA version:
| Architecture | CUDA 11.8 | CUDA 12.1 | CUDA 12.4 | CPU-only | vLLM |
|---|---|---|---|---|---|
| x86_64 | ✓ | ✓ | ✓ | ✓ | ✓ |
| aarch64 | ✓ | ✓ | ✓ | ✓ | ✓ |
Platform markers for Linux with different architectures:
- platform_machine != 'aarch64' and platform_python_implementation != 'PyPy' and sys_platform == 'linux'
- platform_machine == 'aarch64' and platform_python_implementation != 'PyPy' and sys_platform == 'linux'

Sources: uv.lock1-74
Windows supports CUDA acceleration but not vLLM or MLX:
Supported:
Not Supported:
The Windows platform marker is sys_platform == 'win32' as seen in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml46-50
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml46-61
The dependency resolution includes separate markers for PyPy and CPython implementations:
This ensures compatibility with both Python implementations, though some acceleration features (like compiled CUDA extensions) may have limitations on PyPy.
Sources: uv.lock5-74
DB-GPT provides multiple installation strategies to accommodate different hardware configurations and use cases. The flexible dependency system allows users to install only what they need.
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml26-103
For basic functionality without hardware acceleration:
This installs the dependencies from packages/dbgpt-core/pyproject.toml14-26:
- aiohttp==3.8.4
- chardet==5.1.0
- cachetools
- pydantic>=2.6.0
- typeguard
- snowflake-id
- typing_inspect
- tomli>=2.2.1

Sources: packages/dbgpt-core/pyproject.toml14-26
For production GPU deployment with all acceleration features:
The dbgpt-app[base] extra includes packages/dbgpt-app/pyproject.toml41-43:
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml26-103 packages/dbgpt-app/pyproject.toml41-43
For development with automatic platform detection:
The monorepo structure is defined in pyproject.toml31-40:
Sources: pyproject.toml31-40 packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml28-36
RAG with Vector Stores:
Data Analytics (GBI):
Graph RAG:
These extras are defined in packages/dbgpt-ext/pyproject.toml28-82
Sources: packages/dbgpt-ext/pyproject.toml27-89
DB-GPT includes benchmarking utilities to measure and compare model inference performance across different configurations.
The benchmark script is located at scripts/run_llm_benchmarks.sh1-21 and supports testing multiple models with various input/output lengths and parallelization levels.
Script Parameters:
| Parameter | Default Value | Description |
|---|---|---|
| input_lens | "8,8,256,1024" | Comma-separated input sequence lengths |
| output_lens | "256,512,1024,1024" | Comma-separated output sequence lengths |
| parallel_nums | "1,2,4,16,32" | Comma-separated parallel request counts |
Usage:
Sources: scripts/run_llm_benchmarks.sh1-14
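Assuming the input and output lengths are paired positionally (which the equal-length defaults suggest; the script's actual sweep may differ), the default parameters expand to the following run grid:

```python
# Expand the script's comma-separated defaults into (input, output, parallel) runs.
from itertools import product

input_lens = [int(x) for x in "8,8,256,1024".split(",")]
output_lens = [int(x) for x in "256,512,1024,1024".split(",")]
parallel_nums = [int(x) for x in "1,2,4,16,32".split(",")]

pairs = list(zip(input_lens, output_lens))       # positional pairing assumed
runs = [(i, o, p) for (i, o), p in product(pairs, parallel_nums)]
print(len(runs))   # 4 length pairs x 5 parallelism levels = 20 runs
print(runs[0])     # (8, 256, 1)
```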
The benchmark script calls the Python benchmark module with model-specific configurations:
Default Benchmark Configurations:
Vicuna-7B HuggingFace:
Vicuna-7B vLLM:
Baichuan2-7B HuggingFace:
Baichuan2-7B vLLM:
These configurations are defined in scripts/run_llm_benchmarks.sh17-20
Sources: scripts/run_llm_benchmarks.sh11-20
The benchmark measures key performance indicators for LLM inference:
The DB_GPT_MODEL_BENCHMARK=true environment variable enables benchmark mode in the inference system.
Sources: scripts/run_llm_benchmarks.sh1-21
| Variable | Purpose | Example |
|---|---|---|
| DB_GPT_MODEL_BENCHMARK | Enable benchmark mode | DB_GPT_MODEL_BENCHMARK=true |
| CUDA_VISIBLE_DEVICES | Select GPU devices | CUDA_VISIBLE_DEVICES=0,1 |
Current versions across the monorepo (all at 0.7.5):
| Package | Version File | Version |
|---|---|---|
| dbgpt | packages/dbgpt-core/src/dbgpt/_version.py1 | 0.7.5 |
| dbgpt-ext | packages/dbgpt-ext/src/dbgpt_ext/_version.py1 | 0.7.5 |
| dbgpt-app | packages/dbgpt-app/src/dbgpt_app/_version.py1 | 0.7.5 |
| dbgpt-serve | packages/dbgpt-serve/src/dbgpt_serve/_version.py1 | 0.7.5 |
| dbgpt-client | packages/dbgpt-client/src/dbgpt_client/_version.py1 | 0.7.5 |
| dbgpt-acc-auto | packages/dbgpt-accelerator/dbgpt-acc-auto/src/dbgpt_acc_auto/_version.py1 | 0.7.5 |
| dbgpt-acc-flash-attn | packages/dbgpt-accelerator/dbgpt-acc-flash-attn/src/dbgpt_acc_flash_attn/_version.py1 | 0.7.5 |
Sources: packages/dbgpt-core/src/dbgpt/_version.py1 packages/dbgpt-ext/src/dbgpt_ext/_version.py1 packages/dbgpt-app/src/dbgpt_app/_version.py1 packages/dbgpt-serve/src/dbgpt_serve/_version.py1 packages/dbgpt-client/src/dbgpt_client/_version.py1 packages/dbgpt-accelerator/dbgpt-acc-auto/src/dbgpt_acc_auto/_version.py1 packages/dbgpt-accelerator/dbgpt-acc-flash-attn/src/dbgpt_acc_flash_attn/_version.py1
| Extra | Package | Platform | Dependencies |
|---|---|---|---|
| auto | dbgpt-acc-auto | All | PyTorch (auto-detect) |
| cpu | dbgpt-acc-auto | All | PyTorch CPU |
| cuda118 | dbgpt-acc-auto | Linux, Windows | PyTorch CUDA 11.8 |
| cuda121 | dbgpt-acc-auto | Linux, Windows | PyTorch CUDA 12.1 |
| cuda124 | dbgpt-acc-auto | Linux, Windows | PyTorch CUDA 12.4 |
| vllm | dbgpt-acc-auto | Linux | vLLM >=0.7.0 |
| mlx | dbgpt-acc-auto | macOS | mlx-lm >=0.25.2 |
| quant_bnb | dbgpt-acc-auto | Linux, Windows | bitsandbytes, accelerate |
| quant_gptq | dbgpt-acc-auto | All | optimum, auto-gptq |
| quant_gptqmodel | dbgpt-acc-auto | All | optimum, gptqmodel |
| flash_attn | dbgpt-acc-auto | All (CUDA) | dbgpt-acc-flash-attn |
| llama_cpp | dbgpt | All | llama-cpp-python |
| llama_cpp_server | dbgpt | All | llama-cpp-server-py |
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml26-103 packages/dbgpt-core/pyproject.toml116-122