This document describes the hardware acceleration capabilities and performance optimization features in DB-GPT. It covers the packages/dbgpt-accelerator package structure, CUDA support across multiple versions, quantization techniques, high-throughput inference engines, and platform-specific optimizations. The system provides flexible installation options through optional dependencies that automatically configure the appropriate acceleration backend based on the target hardware platform.
For model deployment strategies and worker management, see Model Workers and Inference Backends. For model configuration, see Model Configuration and Deployment.
DB-GPT's hardware acceleration is implemented through the packages/dbgpt-accelerator monorepo package, which consists of two sub-packages that handle different aspects of acceleration:
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml1-197 packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml1-33
| Component | Package | Purpose |
|---|---|---|
| dbgpt-acc-auto | packages/dbgpt-accelerator/dbgpt-acc-auto | Manages PyTorch installation, CUDA versions, and acceleration backends |
| dbgpt-acc-flash-attn | packages/dbgpt-accelerator/dbgpt-acc-flash-attn | Provides Flash Attention support with a no-build-isolation configuration |
| PyTorch index sources | Configuration in pyproject.toml | Custom package indices for CUDA-specific PyTorch builds |
The dbgpt-acc-auto package is referenced as a dependency in packages/dbgpt-app/pyproject.toml13 and provides the foundation for all hardware acceleration features.
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml1-19 packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml1-18
DB-GPT supports three CUDA versions with platform-specific PyTorch installations through the uv package manager's custom index feature. The system uses conditional dependencies based on platform markers to install the correct PyTorch build.
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml46-61 packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml125-148
The system enforces specific PyTorch version constraints based on the target platform:
| Platform | PyTorch Version | Notes |
|---|---|---|
| macOS x86_64 | >=2.2.1,<2.3 | PyTorch 2.3.0+ no longer supports Intel Macs (macOS 11.0+ ARM64 only) |
| macOS ARM64 | >=2.2.1 | Full support for Apple Silicon |
| Linux | >=2.2.1 | All CUDA versions supported |
| Windows | >=2.2.1 | CUDA 11.8, 12.1, 12.4 supported |
These constraints are defined in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml28-45 with conditional markers like torch>=2.2.1; sys_platform != 'darwin' or platform_machine != 'x86_64'.
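The marker quoted above can be evaluated directly with the packaging library, which implements the same marker semantics pip and uv apply at install time. A minimal sketch (the environment dicts are illustrative):

```python
# Evaluate the torch conditional marker quoted above for two platforms.
from packaging.markers import Marker

marker = Marker("sys_platform != 'darwin' or platform_machine != 'x86_64'")

# Intel Mac: the marker is False, so this torch requirement is skipped there
# (the stricter >=2.2.1,<2.3 constraint applies instead).
print(marker.evaluate({"sys_platform": "darwin", "platform_machine": "x86_64"}))  # False

# Linux x86_64: the marker is True, so torch>=2.2.1 applies.
print(marker.evaluate({"sys_platform": "linux", "platform_machine": "x86_64"}))  # True
```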
The tool.uv.sources section in the pyproject.toml maps PyTorch packages to their appropriate index sources:
The corresponding index definitions are in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml125-143:
- pytorch-cu118: https://download.pytorch.org/whl/cu118
- pytorch-cu121: https://download.pytorch.org/whl/cu121
- pytorch-cu124: https://download.pytorch.org/whl/cu124

Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml150-189
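These index names and URLs typically map to `[[tool.uv.index]]` entries. A sketch of what the definitions plausibly look like (the names and URLs are from this page; the `explicit` flag follows uv's documented index syntax and is an assumption here):

```toml
[[tool.uv.index]]
name = "pytorch-cu118"
url = "https://download.pytorch.org/whl/cu118"
explicit = true

[[tool.uv.index]]
name = "pytorch-cu121"
url = "https://download.pytorch.org/whl/cu121"
explicit = true

[[tool.uv.index]]
name = "pytorch-cu124"
url = "https://download.pytorch.org/whl/cu124"
explicit = true
```

With `explicit = true`, a package is only resolved from an index when a `tool.uv.sources` entry points at it, which keeps CUDA-specific wheels from leaking into unrelated dependencies.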
The conflict resolution ensures mutually exclusive CUDA versions through packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml110-118:
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml26-61 packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml110-118
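uv expresses mutually exclusive extras through its `tool.uv.conflicts` table. A sketch using the CUDA extra names from this page (the exact declaration in the repository may differ):

```toml
[tool.uv]
conflicts = [
    [
        { extra = "cuda118" },
        { extra = "cuda121" },
        { extra = "cuda124" },
    ],
]
```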
DB-GPT supports multiple quantization methods to reduce model memory footprint and improve inference speed. The quantization implementations are provided as optional extras in the dbgpt-acc-auto package.
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml78-100
BitsAndBytes provides 4-bit and 8-bit quantization for LLM inference with minimal accuracy loss. It requires CUDA support and is only available on Windows and Linux platforms.
Installation:
Dependencies:
- bitsandbytes>=0.39.0 - Core quantization library
- accelerate - HuggingFace acceleration utilities

Platform Constraint: sys_platform == 'win32' or sys_platform == 'linux'
The implementation is defined in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml78-81
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml78-81
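Given the dependencies and platform constraint listed above, the quant_bnb extra plausibly reduces to an optional-dependency entry like the following sketch (package names, versions, and the marker are from this page; the exact layout in the file is an assumption):

```toml
[project.optional-dependencies]
quant_bnb = [
    "bitsandbytes>=0.39.0; sys_platform == 'win32' or sys_platform == 'linux'",
    "accelerate; sys_platform == 'win32' or sys_platform == 'linux'",
]
```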
GPTQ (Generative Pre-trained Transformer Quantization) provides efficient 4-bit weight quantization with optimized kernels for inference.
Installation:
Dependencies:
- optimum - HuggingFace optimization toolkit
- auto-gptq - GPTQ implementation

Hardware Requirements:
The configuration is in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml97-100
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml82-100
GPTQModel is an advanced GPTQ implementation with additional features and optimizations. It requires transformers > 4.48.3 and is installed with build isolation disabled to ensure correct dependency resolution.
Installation:
Dependencies:
- optimum - HuggingFace optimization toolkit
- device-smi>=0.3.3 - Device monitoring
- tokenicer - Tokenization utilities
- gptqmodel - Advanced GPTQ implementation

Build Configuration: The package requires special build configuration in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml194-196:
This disables build isolation for gptqmodel to ensure proper dependency resolution during installation.
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml90-96 packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml194-196
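The special build configuration described above matches uv's documented `no-build-isolation-package` setting; a sketch of the likely form:

```toml
[tool.uv]
no-build-isolation-package = ["gptqmodel"]
```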
| Method | Bit-width | Platform Support | Primary Use Case |
|---|---|---|---|
| BitsAndBytes | 4-bit, 8-bit | Windows, Linux | Easy integration, HuggingFace models |
| Auto-GPTQ | 4-bit | All with CUDA/ROCm | Production inference, optimized kernels |
| GPTQModel | 4-bit | All with CUDA/ROCm | Advanced features, latest optimizations |
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml78-100
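A quick way to check which of the backends from the table are present in the current environment is to probe for their modules. A sketch (the import names are the conventional ones for these PyPI packages, assumed here):

```python
# Probe for quantization backends without importing them (find_spec only
# locates each module, so heavy CUDA initialization is avoided).
import importlib.util

BACKEND_MODULES = {
    "BitsAndBytes": "bitsandbytes",
    "Auto-GPTQ": "auto_gptq",
    "GPTQModel": "gptqmodel",
}

available = {
    name: importlib.util.find_spec(module) is not None
    for name, module in BACKEND_MODULES.items()
}
print(available)
```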
DB-GPT integrates multiple high-throughput inference engines optimized for different use cases and hardware platforms. These engines provide significant performance improvements over standard HuggingFace Transformers inference.
vLLM is a high-throughput and memory-efficient inference engine for LLMs. It uses PagedAttention to manage attention key-value memory efficiently and supports continuous batching.
Installation:
Package Definition:
Platform Constraint: Linux only (sys_platform == 'linux')
Minimum Version: vLLM 0.7.0 or higher
The vLLM extra is defined in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml67-70
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml67-70
MLX provides optimized inference for Apple Silicon (M1, M2, M3 chips) with Metal acceleration. It's specifically designed for efficient LLM inference on macOS ARM64 architecture.
Installation:
Package Definition:
Platform Constraint: macOS only (sys_platform == 'darwin')
Minimum Version: mlx-lm 0.25.2 or higher
The MLX integration is defined in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml71-73
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml71-73
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml67-73 packages/dbgpt-core/pyproject.toml116-122
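From the platform constraints and minimum versions stated above, the two engine extras plausibly look like this sketch (names, versions, and markers are from this page; the layout is an assumption):

```toml
[project.optional-dependencies]
vllm = ["vllm>=0.7.0; sys_platform == 'linux'"]
mlx = ["mlx-lm>=0.25.2; sys_platform == 'darwin'"]
```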
LLAMA.cpp provides efficient CPU and GPU inference for LLaMA-based models with support for Metal on macOS and CUDA on Linux/Windows.
Installation from dbgpt-core:
Package Definition in dbgpt-core: packages/dbgpt-core/pyproject.toml116-122 defines:
- llama_cpp = ["llama-cpp-python"] - Core library
- llama_cpp_server = ["llama-cpp-server-py-core>=0.1.4", "llama-cpp-server-py>=0.1.4"] - Server components

Sources: packages/dbgpt-core/pyproject.toml116-122
| Engine | Platform | Primary Advantage | Best For |
|---|---|---|---|
| vLLM | Linux | Highest throughput, PagedAttention | Production serving, batch processing |
| MLX | macOS ARM64 | Metal acceleration, low latency | Apple Silicon development/inference |
| LLAMA.cpp | Cross-platform | CPU optimization, Metal/CUDA support | Resource-constrained environments |
| HuggingFace | Cross-platform | Maximum compatibility, ease of use | Development, prototyping |
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml67-73 packages/dbgpt-core/pyproject.toml116-122
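The comparison above can be summarized as a simple platform-based default. The sketch below is a simplification for illustration, not DB-GPT's actual selection logic:

```python
# Choose a default inference engine from the current platform.
import platform
import sys

def default_engine() -> str:
    if sys.platform == "linux":
        return "vllm"        # highest throughput, PagedAttention
    if sys.platform == "darwin" and platform.machine() == "arm64":
        return "mlx"         # Metal acceleration on Apple Silicon
    return "llama.cpp"       # cross-platform CPU/Metal/CUDA fallback

print(default_engine())
```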
Flash Attention is an I/O-aware attention algorithm that reduces memory access and improves computational efficiency. DB-GPT provides a dedicated package for Flash Attention integration.
The dbgpt-acc-flash-attn package uses dependency groups to manage the Flash Attention installation process with specific build requirements.
Sources: packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml1-33
The package defines three dependency groups in packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml19-28:
build group:
- setuptools>=75.8.0 - Required for building CUDA extensions

direct group:
- torch>=2.2.1 - PyTorch dependency needed before flash-attn installation

main group:
- flash-attn>=2.5.8 - The Flash Attention library itself

A critical aspect of Flash Attention installation is the no-build-isolation setting in packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml31-32:
This configuration:
- Disables build isolation for the flash-attn package
- Allows flash-attn to access PyTorch during build time

Sources: packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml30-32
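Putting the three dependency groups and the isolation setting together, the package's pyproject.toml plausibly contains something like this sketch (group names, packages, and versions are from this page; the PEP 735 dependency-groups layout and uv's no-build-isolation-package option are the assumed mechanisms):

```toml
[dependency-groups]
build = ["setuptools>=75.8.0"]
direct = ["torch>=2.2.1"]
main = ["flash-attn>=2.5.8"]

[tool.uv]
no-build-isolation-package = ["flash-attn"]
```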
Method 1: Direct installation of the flash-attn package
Method 2: Through dbgpt-acc-auto
The flash_attn extra in dbgpt-acc-auto is defined in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml101-103:
Flash Attention requires:
Sources: packages/dbgpt-accelerator/dbgpt-acc-flash-attn/pyproject.toml1-33 packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml101-103
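The two methods might be run as follows; these commands are a hedged sketch (the extra name comes from the extras reference on this page, and pip's --no-build-isolation flag mirrors the uv setting described above). Method 1's ordering matters because flash-attn needs PyTorch present at build time:

```shell
# Method 1: direct installation (install torch first, then flash-attn
# without build isolation so its build can see the installed torch)
pip install "torch>=2.2.1"
pip install "flash-attn>=2.5.8" --no-build-isolation

# Method 2: through the dbgpt-acc-auto extra
pip install "dbgpt-acc-auto[flash_attn]"
```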
DB-GPT's hardware acceleration system includes platform-specific optimizations that automatically configure the appropriate dependencies based on the target environment.
The system uses environment markers to conditionally install packages. These markers are defined throughout the pyproject.toml files and evaluated at installation time.
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml28-61
macOS has special version constraints due to PyTorch 2.3.0's requirement for macOS 11.0+ ARM64. The system differentiates between Intel and Apple Silicon Macs:
Intel Mac (x86_64):
Apple Silicon (ARM64):
These constraints are defined in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml29-35
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml28-45
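The two macOS constraints can be checked with packaging's SpecifierSet, which implements the same version-matching semantics as pip:

```python
# Verify which torch versions satisfy each macOS constraint from above.
from packaging.specifiers import SpecifierSet

intel_mac = SpecifierSet(">=2.2.1,<2.3")   # macOS x86_64
apple_silicon = SpecifierSet(">=2.2.1")    # macOS arm64

print("2.2.2" in intel_mac)       # True: 2.2.x is the last series for Intel Macs
print("2.3.0" in intel_mac)       # False: excluded by the <2.3 bound
print("2.3.0" in apple_silicon)   # True
```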
Linux supports the widest range of optimizations with different combinations based on architecture and CUDA version:
| Architecture | CUDA 11.8 | CUDA 12.1 | CUDA 12.4 | CPU-only | vLLM |
|---|---|---|---|---|---|
| x86_64 | ✓ | ✓ | ✓ | ✓ | ✓ |
| aarch64 | ✓ | ✓ | ✓ | ✓ | ✓ |
Platform markers for Linux with different architectures:
- platform_machine != 'aarch64' and platform_python_implementation != 'PyPy' and sys_platform == 'linux'
- platform_machine == 'aarch64' and platform_python_implementation != 'PyPy' and sys_platform == 'linux'

Sources: uv.lock1-74
Windows supports CUDA acceleration but not vLLM or MLX:
Supported:
Not Supported:
The Windows platform marker is sys_platform == 'win32' as seen in packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml46-50
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml46-61
The dependency resolution includes separate markers for PyPy and CPython implementations:
This ensures compatibility with both Python implementations, though some acceleration features (like compiled CUDA extensions) may have limitations on PyPy.
Sources: uv.lock5-74
DB-GPT provides multiple installation strategies to accommodate different hardware configurations and use cases. The flexible dependency system allows users to install only what they need.
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml26-103
For basic functionality without hardware acceleration:
This installs the dependencies from packages/dbgpt-core/pyproject.toml14-26:
- aiohttp==3.8.4
- chardet==5.1.0
- cachetools
- pydantic>=2.6.0
- typeguard
- snowflake-id
- typing_inspect
- tomli>=2.2.1

Sources: packages/dbgpt-core/pyproject.toml14-26
For production GPU deployment with all acceleration features:
The dbgpt-app[base] extra includes packages/dbgpt-app/pyproject.toml41-43:
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml26-103 packages/dbgpt-app/pyproject.toml41-43
For development with automatic platform detection:
The monorepo structure is defined in pyproject.toml31-40:
Sources: pyproject.toml31-40 packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml28-36
RAG with Vector Stores:
Data Analytics (GBI):
Graph RAG:
These extras are defined in packages/dbgpt-ext/pyproject.toml28-82
Sources: packages/dbgpt-ext/pyproject.toml27-89
DB-GPT includes benchmarking utilities to measure and compare model inference performance across different configurations.
The benchmark script is located at scripts/run_llm_benchmarks.sh1-21 and supports testing multiple models with various input/output lengths and parallelization levels.
Script Parameters:
| Parameter | Default Value | Description |
|---|---|---|
| input_lens | "8,8,256,1024" | Comma-separated input sequence lengths |
| output_lens | "256,512,1024,1024" | Comma-separated output sequence lengths |
| parallel_nums | "1,2,4,16,32" | Comma-separated parallel request counts |
Usage:
Sources: scripts/run_llm_benchmarks.sh1-14
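Assuming the input and output lengths are paired positionally (which the equal-length defaults suggest; the script's actual sweep may differ), the default parameters expand to the following run grid:

```python
# Expand the script's comma-separated defaults into (input, output, parallel) runs.
from itertools import product

input_lens = [int(x) for x in "8,8,256,1024".split(",")]
output_lens = [int(x) for x in "256,512,1024,1024".split(",")]
parallel_nums = [int(x) for x in "1,2,4,16,32".split(",")]

pairs = list(zip(input_lens, output_lens))       # positional pairing assumed
runs = [(i, o, p) for (i, o), p in product(pairs, parallel_nums)]
print(len(runs))   # 4 length pairs x 5 parallelism levels = 20 runs
print(runs[0])     # (8, 256, 1)
```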
The benchmark script calls the Python benchmark module with model-specific configurations:
Default Benchmark Configurations:
Vicuna-7B HuggingFace:
Vicuna-7B vLLM:
Baichuan2-7B HuggingFace:
Baichuan2-7B vLLM:
These configurations are defined in scripts/run_llm_benchmarks.sh17-20
Sources: scripts/run_llm_benchmarks.sh11-20
The benchmark measures key performance indicators for LLM inference:
The DB_GPT_MODEL_BENCHMARK=true environment variable enables benchmark mode in the inference system.
Sources: scripts/run_llm_benchmarks.sh1-21
| Variable | Purpose | Example |
|---|---|---|
| DB_GPT_MODEL_BENCHMARK | Enable benchmark mode | DB_GPT_MODEL_BENCHMARK=true |
| CUDA_VISIBLE_DEVICES | Select GPU devices | CUDA_VISIBLE_DEVICES=0,1 |
Current versions across the monorepo (all at 0.7.5):
| Package | Version File | Version |
|---|---|---|
| dbgpt | packages/dbgpt-core/src/dbgpt/_version.py1 | 0.7.5 |
| dbgpt-ext | packages/dbgpt-ext/src/dbgpt_ext/_version.py1 | 0.7.5 |
| dbgpt-app | packages/dbgpt-app/src/dbgpt_app/_version.py1 | 0.7.5 |
| dbgpt-serve | packages/dbgpt-serve/src/dbgpt_serve/_version.py1 | 0.7.5 |
| dbgpt-client | packages/dbgpt-client/src/dbgpt_client/_version.py1 | 0.7.5 |
| dbgpt-acc-auto | packages/dbgpt-accelerator/dbgpt-acc-auto/src/dbgpt_acc_auto/_version.py1 | 0.7.5 |
| dbgpt-acc-flash-attn | packages/dbgpt-accelerator/dbgpt-acc-flash-attn/src/dbgpt_acc_flash_attn/_version.py1 | 0.7.5 |
Sources: packages/dbgpt-core/src/dbgpt/_version.py1 packages/dbgpt-ext/src/dbgpt_ext/_version.py1 packages/dbgpt-app/src/dbgpt_app/_version.py1 packages/dbgpt-serve/src/dbgpt_serve/_version.py1 packages/dbgpt-client/src/dbgpt_client/_version.py1 packages/dbgpt-accelerator/dbgpt-acc-auto/src/dbgpt_acc_auto/_version.py1 packages/dbgpt-accelerator/dbgpt-acc-flash-attn/src/dbgpt_acc_flash_attn/_version.py1
| Extra | Package | Platform | Dependencies |
|---|---|---|---|
| auto | dbgpt-acc-auto | All | PyTorch (auto-detect) |
| cpu | dbgpt-acc-auto | All | PyTorch CPU |
| cuda118 | dbgpt-acc-auto | Linux, Windows | PyTorch CUDA 11.8 |
| cuda121 | dbgpt-acc-auto | Linux, Windows | PyTorch CUDA 12.1 |
| cuda124 | dbgpt-acc-auto | Linux, Windows | PyTorch CUDA 12.4 |
| vllm | dbgpt-acc-auto | Linux | vLLM >=0.7.0 |
| mlx | dbgpt-acc-auto | macOS | mlx-lm >=0.25.2 |
| quant_bnb | dbgpt-acc-auto | Linux, Windows | bitsandbytes, accelerate |
| quant_gptq | dbgpt-acc-auto | All | optimum, auto-gptq |
| quant_gptqmodel | dbgpt-acc-auto | All | optimum, gptqmodel |
| flash_attn | dbgpt-acc-auto | All (CUDA) | dbgpt-acc-flash-attn |
| llama_cpp | dbgpt | All | llama-cpp-python |
| llama_cpp_server | dbgpt | All | llama-cpp-server-py |
Sources: packages/dbgpt-accelerator/dbgpt-acc-auto/pyproject.toml26-103 packages/dbgpt-core/pyproject.toml116-122