Generative Business Intelligence (GBI) is one of DB-GPT's core capabilities, providing foundational data intelligence technology for enterprise report analysis and business insights. GBI enables natural language interaction with diverse data sources—including relational databases, data warehouses, analytics databases, and file formats—through an intelligent Text2SQL pipeline that translates user questions into executable SQL queries and generates visualizations.
This page covers the GBI system architecture, Text2SQL translation pipeline, data source integrations, query execution mechanisms, and visualization generation. For knowledge-based retrieval capabilities, see RAG Pipeline and Knowledge Management. For agent-based data analysis workflows, see Multi-Agents and AWEL Workflows.
Sources: README.md 70-73, README.zh.md 70
The GBI system implements a complete pipeline from natural language input to visual dashboard output, with components organized into distinct layers for data integration, query translation, execution, and presentation.
Pipeline Stages:
1. Natural language parsing (intent detection and entity extraction)
2. Schema linking (mapping entities to tables and columns)
3. SQL generation and validation
4. Query optimization
5. Execution against the target data source
6. Visualization and report generation
Sources: README.md 168-171
GBI supports 10+ data source types through an extensible connector architecture. Each data source type is installed as an optional dependency package in the monorepo structure.
| Data Source Type | Examples | Package Extra | Configuration Location |
|---|---|---|---|
| RDBMS | MySQL, PostgreSQL, Oracle, MSSQL | datasource_postgres, datasource_mssql, datasource_oracle | Connection strings in web UI |
| Analytics Databases | ClickHouse, DuckDB | datasource_clickhouse, datasource_duckdb | Connection parameters via UI |
| Data Warehouses | Hive, Spark | Core integration | Cluster configuration |
| NoSQL | MongoDB, Redis | Optional extras | Connection URI |
| File Sources | CSV, Excel, JSON | Core rag extra | File upload interface |
Connector Pattern: Each data source implements a common connector interface, registered in a factory for dynamic instantiation based on data source type. Connection pooling ensures efficient resource utilization across concurrent queries.
Metadata Extraction: Upon connection, connectors extract and cache schema information (tables, columns, data types, relationships) to support schema linking during query translation.
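Sketched in Python, the connector pattern plus factory registration might look like the following (all class and function names here are illustrative, not DB-GPT's actual API):

```python
from abc import ABC, abstractmethod


class BaseConnector(ABC):
    """Common interface every data source connector implements (illustrative)."""

    @abstractmethod
    def get_tables(self) -> list[str]:
        """Return table names for schema extraction and caching."""

    @abstractmethod
    def run_query(self, sql: str) -> list[tuple]:
        """Execute SQL and return result rows."""


# Factory registry: data source type -> connector class.
_CONNECTORS: dict[str, type[BaseConnector]] = {}


def register_connector(ds_type: str):
    def decorator(cls):
        _CONNECTORS[ds_type] = cls
        return cls
    return decorator


def create_connector(ds_type: str, **params) -> BaseConnector:
    """Instantiate the registered connector for a data source type."""
    if ds_type not in _CONNECTORS:
        raise ValueError(f"Unsupported data source type: {ds_type}")
    return _CONNECTORS[ds_type](**params)


@register_connector("sqlite")
class SQLiteConnector(BaseConnector):
    def __init__(self, path: str):
        self.path = path

    def get_tables(self) -> list[str]:
        return []  # stub: a real connector would query sqlite_master

    def run_query(self, sql: str) -> list[tuple]:
        return []  # stub
```

Each new data source then needs only a connector subclass and a `@register_connector` line; the factory gives callers a uniform entry point regardless of backend.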
Sources: README.md 299, docs/sidebars.js 82-100, docs/docs/installation/integrations/clickhouse_install.md 1-38, docs/docs/installation/integrations/postgres_install.md 1-41, docs/docs/installation/integrations/duckdb_install.md 1-42, packages/dbgpt-core/src/dbgpt/storage/metadata/db_storage.py 1-52
The Text2SQL pipeline is the core of GBI, translating natural language questions into executable SQL queries through a multi-stage process leveraging LLMs and domain knowledge.
82.5% Spider Accuracy: through model fine-tuning and prompt engineering, SQL generation achieves 82.5% execution accuracy on the Spider benchmark, a standard Text2SQL evaluation dataset.
Sources: README.md 74, README.zh.md 72
The first stage parses user input to identify query intent and extract key entities.
Intent Detection: classifies the question by query type (for example lookup, aggregation, trend, or comparison) so later stages can shape the SQL accordingly.
Entity Extraction: identifies the business entities the question references, such as metrics, dimensions, filter values, and time ranges.
Implementation: uses LLM-based classification combined with regex patterns and domain-specific lexicons.
Schema linking maps extracted entities to actual database schema elements (tables, columns) using metadata and similarity scoring.
Matching Strategies: exact name matches, fuzzy string similarity between entity mentions and table or column names, and lookups in the business glossary.
Domain Knowledge Integration: The business glossary stores mappings like:
- `"last month"` → `WHERE order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)`

Historical Query Learning: frequently used schema elements for similar queries are prioritized in ranking.
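Combining glossary lookups with fuzzy matching, schema linking can be sketched as below; the glossary contents and function name are hypothetical, and real implementations typically layer embedding-based similarity on top:

```python
from difflib import SequenceMatcher

# Hypothetical business glossary: business term -> schema element.
GLOSSARY = {
    "customer": "user_accounts",
    "revenue": "orders.amount",
}


def link_entity(entity: str, schema_elements: list[str], threshold: float = 0.6):
    """Map an extracted entity to a schema element: glossary first, then fuzzy match."""
    term = entity.lower()
    if term in GLOSSARY:
        return GLOSSARY[term]
    # Fuzzy matching: pick the most similar table/column name above the threshold.
    best, best_score = None, 0.0
    for element in schema_elements:
        score = SequenceMatcher(None, term, element.lower()).ratio()
        if score > best_score:
            best, best_score = element, score
    return best if best_score >= threshold else None
```

Returning `None` below the threshold lets the pipeline fall back to asking the user rather than guessing a wrong column.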
The SQL generator constructs executable queries using the linked schema context and LLM-based generation.
Prompt Construction:
```text
Context: Tables [users, orders, products] with columns [...]
Question: What were the top 5 products by revenue last month?
Schema:
- orders table: order_id, user_id, product_id, order_date, amount
- products table: product_id, name, category
Generate SQL:
```
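A simplified version of this prompt assembly might be written as follows (function name and exact template wording are illustrative):

```python
def build_sql_prompt(question: str, schema: dict[str, list[str]]) -> str:
    """Assemble a Text2SQL prompt from the question and the linked schema.

    `schema` maps table name -> column names for the tables chosen by
    schema linking, so the LLM only sees relevant context.
    """
    lines = [f"Context: Tables {list(schema)}"]
    lines.append(f"Question: {question}")
    lines.append("Schema:")
    for table, columns in schema.items():
        lines.append(f"- {table} table: {', '.join(columns)}")
    lines.append("Generate SQL:")
    return "\n".join(lines)
```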
LLM-Based Generation: the LLM completes the prompt with a candidate SQL query, constrained to the tables and columns supplied in the schema context.
Syntax Validation: generated SQL is validated against the target dialect's grammar and the linked schema (referenced tables and columns must exist) before execution.
Fallback Mechanisms: if generation or validation fails, the error message is fed back to the LLM for regeneration; after repeated failures, the user is asked to rephrase or clarify the question.
Before execution, the generated SQL undergoes optimization for performance.
Optimization Techniques: typical rewrites include adding LIMIT clauses to unbounded result sets, pruning columns the visualization does not need, and pushing filter predicates down before joins.
Execution Plan Analysis: the candidate query is passed through the database's EXPLAIN facility so expensive operations (such as full table scans on large tables) can be flagged before execution.
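As a concrete illustration with SQLite (other dialects expose the same idea through their own `EXPLAIN` variants):

```python
import sqlite3


def explain_query(conn: sqlite3.Connection, sql: str) -> list[str]:
    """Return SQLite's plan rows, e.g. to flag full scans before running a query."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
    return [row[-1] for row in rows]  # last column holds the plan detail text


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
plan = explain_query(conn, "SELECT * FROM orders WHERE order_id = 1")
# A primary-key lookup shows up as a SEARCH rather than a SCAN in the plan.
```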
Sources: README.md 74, README.zh.md 72
The execution layer manages query dispatch, connection pooling, error handling, and result retrieval across multiple data sources.
Connection Pooling: each configured data source maintains a pool of reusable connections, so concurrent queries avoid per-request connection setup and the backing database is protected from connection exhaustion.
Query Timeout: every query runs under a configurable time limit; queries that exceed it are cancelled (see the error-handling table below).
Error Handling:
| Error Type | Handling Strategy |
|---|---|
| Syntax Error | Return to SQL generation with error feedback |
| Connection Timeout | Retry with exponential backoff (3 attempts) |
| Permission Denied | Log error, notify user of access restrictions |
| Data Source Unavailable | Fail fast with clear error message |
| Query Timeout | Cancel query, suggest optimization or data reduction |
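The retry row from the table above might be implemented roughly like this (a sketch; the real executor's names differ):

```python
import time


def execute_with_retry(run_query, sql: str, attempts: int = 3, base_delay: float = 0.5):
    """Retry a failing query with exponential backoff (0.5s, 1s, 2s), per the table above.

    `run_query` is any callable that executes SQL and raises ConnectionError
    on connection timeouts.
    """
    for attempt in range(attempts):
        try:
            return run_query(sql)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # fail fast after the final attempt
            time.sleep(base_delay * (2 ** attempt))
```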
Result Streaming: For large result sets, data is streamed in chunks to avoid memory exhaustion and enable incremental visualization.
Result Caching: Identical queries within a time window (configurable, default 5 minutes) return cached results to reduce database load.
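A minimal sketch of such a time-windowed cache, keyed by data source and SQL text (class name hypothetical):

```python
import time


class ResultCache:
    """Time-windowed result cache; default TTL 300s matches the 5-minute window."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[tuple, tuple[float, object]] = {}

    def get(self, datasource: str, sql: str):
        """Return the cached result, or None if absent or expired."""
        entry = self._store.get((datasource, sql))
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, datasource: str, sql: str, result) -> None:
        self._store[(datasource, sql)] = (time.monotonic(), result)
```

Keying on the datasource as well as the SQL text keeps identical queries against different databases from colliding.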
Sources: packages/dbgpt-core/src/dbgpt/storage/metadata/db_storage.py 1-52
The visualization layer transforms query results into interactive charts and narrative reports using the GPT-Vis protocol.
GPT-Vis is a visualization protocol developed by DB-GPT that defines a declarative format for specifying chart types, data mappings, and styling. LLMs generate GPT-Vis specifications from query results, which are rendered in the web UI.
The system automatically selects appropriate chart types based on data characteristics:
| Data Pattern | Chart Type | GPT-Vis Spec |
|---|---|---|
| Time series | Line chart | `{"type": "line", "x": "date", "y": "value"}` |
| Categorical comparison | Bar chart | `{"type": "bar", "x": "category", "y": "count"}` |
| Part-to-whole | Pie chart | `{"type": "pie", "value": "amount", "category": "segment"}` |
| Correlation | Scatter plot | `{"type": "scatter", "x": "var1", "y": "var2"}` |
| Distribution | Histogram | `{"type": "histogram", "value": "metric", "bins": 20}` |
| Geographic | Map | `{"type": "map", "region": "country", "value": "sales"}` |
Chart Selection Logic: column types and cardinality in the result set are matched against the data patterns above to pick a default chart, which the user can override.
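A heuristic selector over these patterns might look like the following sketch, assuming column roles (`temporal`, `categorical`, `numeric`) have already been inferred from the result set:

```python
def select_chart(columns: dict[str, str]) -> dict:
    """Pick a GPT-Vis spec from column roles, following the table above.

    `columns` maps column name -> inferred role: 'temporal', 'categorical',
    or 'numeric'. Falls back to a plain table when no pattern matches.
    """
    roles = list(columns.values())
    names = list(columns)
    if "temporal" in roles and "numeric" in roles:  # time series -> line chart
        return {"type": "line",
                "x": names[roles.index("temporal")],
                "y": names[roles.index("numeric")]}
    if "categorical" in roles and "numeric" in roles:  # comparison -> bar chart
        return {"type": "bar",
                "x": names[roles.index("categorical")],
                "y": names[roles.index("numeric")]}
    if roles.count("numeric") >= 2:  # correlation -> scatter plot
        nums = [n for n, r in columns.items() if r == "numeric"]
        return {"type": "scatter", "x": nums[0], "y": nums[1]}
    return {"type": "table"}
```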
The dashboard combines multiple visualization elements:
Components: charts rendered from GPT-Vis specifications, tables showing the underlying result rows, and text panels carrying generated commentary.
Narrative Generation: the LLM summarizes the query results into a short natural-language report that accompanies the charts.
Interactive Features: users can drill into chart elements, adjust filters, and switch chart types from the web UI.
Sources: README.md 113, README.zh.md 83
The knowledge enhancement layer improves Text2SQL accuracy over time by maintaining business glossaries, query history, and comprehensive metadata.
Purpose: Map domain-specific business terminology to technical database schema.
Example Entries:
- `customer` → `user_accounts` table
- `revenue` → `total_sales_amount` column
- `active users` → `WHERE status = 'active' AND last_login >= DATE_SUB(NOW(), INTERVAL 30 DAY)`
- `fiscal year` → custom date calculation based on the company's fiscal calendar
Population Methods: glossary entries can be added manually by administrators or accumulated from user corrections to generated queries.
Query Log Storage: every query is logged with the original question, the generated SQL, the tables it touched, and its execution outcome.
Pattern Mining: recurring question-to-table patterns are extracted from the query log and fed back into schema linking.
Example: If 80% of queries about "customer orders" join users, orders, and order_items tables, this pattern is cached and used to prioritize these tables in schema linking for similar future queries.
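This kind of pattern mining can be sketched as a frequency count over the query log; the log record shape here is hypothetical:

```python
from collections import Counter


def mine_join_patterns(query_log: list[dict], min_support: float = 0.8) -> dict:
    """Find the dominant table set per topic when it appears in >= min_support of queries.

    Each log entry is assumed to look like
    {"topic": "customer orders", "tables": ["users", "orders", ...]}.
    """
    by_topic: dict[str, list[frozenset]] = {}
    for entry in query_log:
        by_topic.setdefault(entry["topic"], []).append(frozenset(entry["tables"]))

    patterns = {}
    for topic, table_sets in by_topic.items():
        pattern, freq = Counter(table_sets).most_common(1)[0]
        if freq / len(table_sets) >= min_support:
            patterns[topic] = set(pattern)  # cache for schema-linking priority
    return patterns
```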
Schema Synchronization: cached schema metadata is periodically re-synchronized with the live data sources so that added, renamed, or dropped tables remain correctly linkable.
Statistics Updates: table and column statistics (row counts, distinct values, value distributions) are refreshed on a schedule.
These statistics inform query optimization and help LLMs generate more accurate WHERE clause predicates.
Sources: README.md 168-171
GBI is configured through the main DB-GPT configuration file and data source connection parameters in the web UI.
Data source types are enabled by installing the corresponding optional extras during installation (see the Package Extra column in the table above, e.g. `datasource_postgres` for PostgreSQL).
Data sources are then configured through the web UI, which manages them via the `/api/v2/datasources` endpoint.
Connection String Format (varies by data source):
- `mysql://user:password@host:port/database`
- `postgresql://user:password@host:port/database`
- `clickhouse://user:password@host:port/database`
- `duckdb:///path/to/database.db` (file-based)

Additional GBI options are set in `configs/dbgpt-proxy-openai.toml`.
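Such URIs can be decomposed with the standard library before being handed to a connector; a sketch (not DB-GPT's actual parsing code):

```python
from urllib.parse import urlparse


def parse_connection_string(uri: str) -> dict:
    """Split a datasource URI into connector parameters; the scheme selects the connector."""
    parsed = urlparse(uri)
    return {
        "type": parsed.scheme,
        "host": parsed.hostname,      # None for file-based URIs like duckdb
        "port": parsed.port,
        "user": parsed.username,
        "password": parsed.password,
        "database": parsed.path.lstrip("/"),
    }
```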
Sources: docs/docs/quickstart.md 1-350, docs/docs/installation/sourcecode.md 1-227, docs/docs/installation/integrations/clickhouse_install.md 1-38, docs/docs/installation/integrations/postgres_install.md 1-41
User Question: "What were the top 5 products by revenue last month?"
GBI Pipeline Execution:
- "products" → `products` table
- "revenue" → `orders.amount` column
- "last month" → `WHERE order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)`

Sources: README.md 168-171
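The remaining stages (generation and execution) can be illustrated end to end on SQLite with toy data; note the real pipeline targets the configured data source and would use MySQL's `DATE_SUB` rather than SQLite's `date()`:

```python
import sqlite3

# Toy schema and data for the worked example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, product_id INTEGER, order_date TEXT, amount REAL);
    INSERT INTO products VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO orders VALUES
        (1, 1, date('now', '-5 days'), 120.0),
        (2, 2, date('now', '-10 days'), 300.0),
        (3, 1, date('now', '-90 days'), 999.0);  -- outside the one-month window
""")

# Query shaped like the pipeline's output for the example question.
sql = """
    SELECT p.name, SUM(o.amount) AS revenue
    FROM orders o JOIN products p ON o.product_id = p.product_id
    WHERE o.order_date >= date('now', '-1 month')
    GROUP BY p.name
    ORDER BY revenue DESC
    LIMIT 5
"""
rows = conn.execute(sql).fetchall()
# rows -> [('Gadget', 300.0), ('Widget', 120.0)]
```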