Generative Business Intelligence (GBI)

Relevant source files

Purpose and Scope

Generative Business Intelligence (GBI) is one of DB-GPT's core capabilities, providing foundational data intelligence technology for enterprise report analysis and business insights. GBI enables natural language interaction with diverse data sources—including relational databases, data warehouses, analytics databases, and file formats—through an intelligent Text2SQL pipeline that translates user questions into executable SQL queries and generates visualizations.

This page covers the GBI system architecture, Text2SQL translation pipeline, data source integrations, query execution mechanisms, and visualization generation. For knowledge-based retrieval capabilities, see RAG Pipeline and Knowledge Management. For agent-based data analysis workflows, see Multi-Agents and AWEL Workflows.

Sources: README.md70-73 README.zh.md70

System Architecture

The GBI system implements a complete pipeline from natural language input to visual dashboard output, with components organized into distinct layers for data integration, query translation, execution, and presentation.

GBI Pipeline Overview

Pipeline Stages:

Query Translation: Natural language → Intent detection → Schema linking → SQL generation → Optimization
Execution: Validation → Connection pool management → Query execution → Error handling
Visualization: Data transformation → Chart generation via GPT-Vis → Dashboard assembly
Enhancement: Metadata, domain knowledge, and query history improve translation accuracy

Sources: README.md168-171

Data Source Integration

GBI supports 10+ data source types through an extensible connector architecture. Each data source type is installed as an optional dependency package in the monorepo structure.

Supported Data Sources

Data Source Type	Examples	Package Extra	Configuration Location
RDBMS	MySQL, PostgreSQL, Oracle, MSSQL	`datasource_postgres`, `datasource_mssql`, `datasource_oracle`	Connection strings in web UI
Analytics Databases	ClickHouse, DuckDB	`datasource_clickhouse`, `datasource_duckdb`	Connection parameters via UI
Data Warehouses	Hive, Spark	Core integration	Cluster configuration
NoSQL	MongoDB, Redis	Optional extras	Connection URI
File Sources	CSV, Excel, JSON	Core `rag` extra	File upload interface

Data Source Connector Architecture

Connector Pattern: Each data source implements a common connector interface, registered in a factory for dynamic instantiation based on data source type. Connection pooling ensures efficient resource utilization across concurrent queries.

Metadata Extraction: Upon connection, connectors extract and cache schema information (tables, columns, data types, relationships) to support schema linking during query translation.

Sources: README.md299 docs/sidebars.js82-100 docs/docs/installation/integrations/clickhouse_install.md1-38 docs/docs/installation/integrations/postgres_install.md1-41 docs/docs/installation/integrations/duckdb_install.md1-42 packages/dbgpt-core/src/dbgpt/storage/metadata/db_storage.py1-52

Text2SQL Pipeline

The Text2SQL pipeline is the core of GBI, translating natural language questions into executable SQL queries through a multi-stage process leveraging LLMs and domain knowledge.

Pipeline Architecture

82.5% Spider Accuracy: Through fine-tuning and prompt engineering, the SQL generation achieves 82.5% execution accuracy on the Spider benchmark, a standard Text2SQL evaluation dataset.

Sources: README.md74 README.zh.md72

Stage 1: Natural Language Query Understanding

The first stage parses user input to identify query intent and extract key entities.

Intent Detection: Classifies queries into types:

Aggregation queries: COUNT, SUM, AVG operations
Filtering queries: WHERE clause conditions
Join queries: Multi-table relationships
Temporal queries: Date/time-based filters
Ranking queries: ORDER BY, TOP-N

Entity Extraction: Identifies:

Table/column mentions (exact or fuzzy matches)
Numeric values and date ranges
Comparison operators (greater than, less than, equals)
Aggregation functions

Implementation: Uses LLM-based classification combined with regex patterns and domain-specific lexicons.

Stage 2: Schema Linking

Schema linking maps extracted entities to actual database schema elements (tables, columns) using metadata and similarity scoring.

Matching Strategies:

Exact Match: Direct string equality between query entities and schema names
Fuzzy Match: Levenshtein distance for typos and variations
Semantic Match: Embedding-based cosine similarity for conceptual alignment (e.g., "revenue" → "total_sales")
Alias Resolution: Business glossary maps domain terms to technical schema names

Domain Knowledge Integration: The business glossary stores mappings like:

"customers" → "user_accounts" table
"order value" → "order_total_amount" column
"last month" → WHERE order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)

Historical Query Learning: Frequently used schema elements for similar queries are prioritized in ranking.

Stage 3: SQL Generation

The SQL generator constructs executable queries using the linked schema context and LLM-based generation.

Prompt Construction:

Context: Tables [users, orders, products] with columns [...]
Question: What were the top 5 products by revenue last month?
Schema: 
  - orders table: order_id, user_id, product_id, order_date, amount
  - products table: product_id, name, category
Generate SQL:

LLM-Based Generation:

Uses configured LLM (local or proxy) from SMMF layer
Temperature typically set low (0.1-0.3) for deterministic output
Few-shot examples included in prompt for complex query patterns

Syntax Validation: Generated SQL is validated against:

Database-specific SQL dialect (MySQL, PostgreSQL, etc.)
Semantic correctness (valid column references, table joins)
Security checks (no DROP/DELETE in read-only contexts)

Fallback Mechanisms: If generation fails:

Retry with simplified schema context
Use template-based generation for common query patterns
Return error with suggested schema elements

Stage 4: Query Optimization

Before execution, the generated SQL undergoes optimization for performance.

Optimization Techniques:

Index utilization: Rewrite queries to use indexed columns
Join order optimization: Place smaller tables first in join sequences
Subquery elimination: Convert correlated subqueries to joins where possible
Predicate pushdown: Move WHERE filters closer to data source

Execution Plan Analysis:

Generate EXPLAIN plan to estimate query cost
Warn users of potentially expensive operations (full table scans, Cartesian products)

Sources: README.md74 README.zh.md72

Query Execution

The execution layer manages query dispatch, connection pooling, error handling, and result retrieval across multiple data sources.

Execution Engine Architecture

Connection Pooling:

Maintains pools of reusable database connections per data source
Configurable pool size (min/max connections)
Connection lifecycle management (acquire, release, cleanup)
Health checks to detect stale connections

Query Timeout:

Configurable timeout per data source type
Graceful cancellation of long-running queries
User notification with partial results if available

Error Handling:

Error Type	Handling Strategy
Syntax Error	Return to SQL generation with error feedback
Connection Timeout	Retry with exponential backoff (3 attempts)
Permission Denied	Log error, notify user of access restrictions
Data Source Unavailable	Fail fast with clear error message
Query Timeout	Cancel query, suggest optimization or data reduction

Result Streaming: For large result sets, data is streamed in chunks to avoid memory exhaustion and enable incremental visualization.

Result Caching: Identical queries within a time window (configurable, default 5 minutes) return cached results to reduce database load.

Sources: packages/dbgpt-core/src/dbgpt/storage/metadata/db_storage.py1-52

Visualization and Reporting

The visualization layer transforms query results into interactive charts and narrative reports using the GPT-Vis protocol.

GPT-Vis Protocol

GPT-Vis is a visualization protocol developed by DB-GPT that defines a declarative format for specifying chart types, data mappings, and styling. LLMs generate GPT-Vis specifications from query results, which are rendered in the web UI.

Chart Type Selection

The system automatically selects appropriate chart types based on data characteristics:

Data Pattern	Chart Type	GPT-Vis Spec
Time series	Line chart	`{"type": "line", "x": "date", "y": "value"}`
Categorical comparison	Bar chart	`{"type": "bar", "x": "category", "y": "count"}`
Part-to-whole	Pie chart	`{"type": "pie", "value": "amount", "category": "segment"}`
Correlation	Scatter plot	`{"type": "scatter", "x": "var1", "y": "var2"}`
Distribution	Histogram	`{"type": "histogram", "value": "metric", "bins": 20}`
Geographic	Map	`{"type": "map", "region": "country", "value": "sales"}`

Chart Selection Logic:

Analyze result set dimensions (1D, 2D, multi-dimensional)
Detect temporal columns (DATE, TIMESTAMP types)
Identify categorical vs. continuous variables
LLM generates GPT-Vis specification with appropriate chart type
Fallback to table view if no suitable visualization

Report Assembly

The dashboard combines multiple visualization elements:

Components:

Title: Natural language question
Summary Statistics: Key metrics (count, sum, average)
Primary Visualization: Main chart answering the question
Supporting Visualizations: Additional charts for context
Narrative Explanation: LLM-generated text describing insights
Data Table: Raw results for drill-down

Narrative Generation:

LLM analyzes query results to identify trends, outliers, and patterns
Generates human-readable explanations: "Sales increased 15% in Q3 compared to Q2, driven primarily by the Electronics category."
Highlights notable data points and provides context

Interactive Features:

Filter controls for date ranges, categories
Drill-down to detailed views
Export to PDF, Excel, or image formats

Sources: README.md113 README.zh.md83

Knowledge Enhancement

The knowledge enhancement layer improves Text2SQL accuracy over time by maintaining business glossaries, query history, and comprehensive metadata.

Components

Business Glossary

Purpose: Map domain-specific business terminology to technical database schema.

Example Entries:

customer: user_accounts table
revenue: total_sales_amount column
active users: WHERE status = 'active' AND last_login >= DATE_SUB(NOW(), INTERVAL 30 DAY)
fiscal year: Custom date calculation based on company fiscal calendar

Population Methods:

Manual curation: Domain experts define mappings
Automatic learning: Frequent query patterns extracted from history
User feedback: Corrections when generated SQL is inaccurate

Query History and Learning

Query Log Storage: Every query is logged with:

Natural language input
Generated SQL
Execution success/failure
Result row count
Latency
User feedback (thumbs up/down, corrections)

Pattern Mining:

Identify frequently co-occurring tables in joins
Extract common WHERE clause patterns
Build template library for recurring query types

Example: If 80% of queries about "customer orders" join users, orders, and order_items tables, this pattern is cached and used to prioritize these tables in schema linking for similar future queries.

Metadata Freshness

Schema Synchronization:

Periodic polling of data sources to detect schema changes (new tables, columns)
Incremental updates to metadata store
Notification of schema changes that may affect existing queries

Statistics Updates:

Cardinality estimates for tables and columns
Data distribution histograms for numeric columns
Null percentage tracking

These statistics inform query optimization and help LLMs generate more accurate WHERE clause predicates.

Sources: README.md168-171

Configuration

GBI is configured through the main DB-GPT configuration file and data source connection parameters in the web UI.

Configuration File

Data source types are enabled by installing corresponding optional extras during installation:

Data Source Connection

Data sources are configured through the web UI at /api/v2/datasources:

Navigate to Data Sources section in web UI
Click Add Data Source
Select data source type (MySQL, PostgreSQL, ClickHouse, etc.)
Provide connection parameters:
- Host/URI
- Port
- Database name
- Username/password (encrypted at rest)
Test connection
Save configuration

Connection String Format (varies by data source):

MySQL: mysql://user:password@host:port/database
PostgreSQL: postgresql://user:password@host:port/database
ClickHouse: clickhouse://user:password@host:port/database
DuckDB: duckdb:///path/to/database.db (file-based)

GBI Feature Flags

Configuration options in configs/dbgpt-proxy-openai.toml:

Sources: docs/docs/quickstart.md1-350 docs/docs/installation/sourcecode.md1-227 docs/docs/installation/integrations/clickhouse_install.md1-38 docs/docs/installation/integrations/postgres_install.md1-41

Usage Example

User Question: "What were the top 5 products by revenue last month?"

GBI Pipeline Execution:

Query Understanding: Detected as aggregation query with temporal filter and ranking
Schema Linking: Mapped "products" → products table, "revenue" → orders.amount column, "last month" → WHERE order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
SQL Generation:
Execution: Query executed against connected MySQL database, returned 5 rows
Visualization: Generated bar chart via GPT-Vis with products on x-axis, revenue on y-axis
Narrative: "The top product was Widget A with $45,230 in revenue, followed by Gadget B at $38,120. Electronics category dominated the top 5."

Sources: README.md168-171

Generative Business Intelligence (GBI)

Relevant source files

Purpose and Scope

Sources: README.md70-73 README.zh.md70

System Architecture

GBI Pipeline Overview

Pipeline Stages:

Query Translation: Natural language → Intent detection → Schema linking → SQL generation → Optimization
Execution: Validation → Connection pool management → Query execution → Error handling
Visualization: Data transformation → Chart generation via GPT-Vis → Dashboard assembly
Enhancement: Metadata, domain knowledge, and query history improve translation accuracy

Sources: README.md168-171

Data Source Integration

GBI supports 10+ data source types through an extensible connector architecture. Each data source type is installed as an optional dependency package in the monorepo structure.

Supported Data Sources

Data Source Type	Examples	Package Extra	Configuration Location
RDBMS	MySQL, PostgreSQL, Oracle, MSSQL	`datasource_postgres`, `datasource_mssql`, `datasource_oracle`	Connection strings in web UI
Analytics Databases	ClickHouse, DuckDB	`datasource_clickhouse`, `datasource_duckdb`	Connection parameters via UI
Data Warehouses	Hive, Spark	Core integration	Cluster configuration
NoSQL	MongoDB, Redis	Optional extras	Connection URI
File Sources	CSV, Excel, JSON	Core `rag` extra	File upload interface

Data Source Connector Architecture

Metadata Extraction: Upon connection, connectors extract and cache schema information (tables, columns, data types, relationships) to support schema linking during query translation.

Text2SQL Pipeline

The Text2SQL pipeline is the core of GBI, translating natural language questions into executable SQL queries through a multi-stage process leveraging LLMs and domain knowledge.

Pipeline Architecture

82.5% Spider Accuracy: Through fine-tuning and prompt engineering, the SQL generation achieves 82.5% execution accuracy on the Spider benchmark, a standard Text2SQL evaluation dataset.

Sources: README.md74 README.zh.md72

Stage 1: Natural Language Query Understanding

The first stage parses user input to identify query intent and extract key entities.

Intent Detection: Classifies queries into types:

Aggregation queries: COUNT, SUM, AVG operations
Filtering queries: WHERE clause conditions
Join queries: Multi-table relationships
Temporal queries: Date/time-based filters
Ranking queries: ORDER BY, TOP-N

Entity Extraction: Identifies:

Table/column mentions (exact or fuzzy matches)
Numeric values and date ranges
Comparison operators (greater than, less than, equals)
Aggregation functions

Implementation: Uses LLM-based classification combined with regex patterns and domain-specific lexicons.

Stage 2: Schema Linking

Schema linking maps extracted entities to actual database schema elements (tables, columns) using metadata and similarity scoring.

Matching Strategies:

Exact Match: Direct string equality between query entities and schema names
Fuzzy Match: Levenshtein distance for typos and variations
Semantic Match: Embedding-based cosine similarity for conceptual alignment (e.g., "revenue" → "total_sales")
Alias Resolution: Business glossary maps domain terms to technical schema names

Domain Knowledge Integration: The business glossary stores mappings like:

"customers" → "user_accounts" table
"order value" → "order_total_amount" column
"last month" → WHERE order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)

Historical Query Learning: Frequently used schema elements for similar queries are prioritized in ranking.

Stage 3: SQL Generation

The SQL generator constructs executable queries using the linked schema context and LLM-based generation.

Prompt Construction:

Context: Tables [users, orders, products] with columns [...]
Question: What were the top 5 products by revenue last month?
Schema: 
  - orders table: order_id, user_id, product_id, order_date, amount
  - products table: product_id, name, category
Generate SQL:

LLM-Based Generation:

Uses configured LLM (local or proxy) from SMMF layer
Temperature typically set low (0.1-0.3) for deterministic output
Few-shot examples included in prompt for complex query patterns

Syntax Validation: Generated SQL is validated against:

Database-specific SQL dialect (MySQL, PostgreSQL, etc.)
Semantic correctness (valid column references, table joins)
Security checks (no DROP/DELETE in read-only contexts)

Fallback Mechanisms: If generation fails:

Retry with simplified schema context
Use template-based generation for common query patterns
Return error with suggested schema elements

Stage 4: Query Optimization

Before execution, the generated SQL undergoes optimization for performance.

Optimization Techniques:

Index utilization: Rewrite queries to use indexed columns
Join order optimization: Place smaller tables first in join sequences
Subquery elimination: Convert correlated subqueries to joins where possible
Predicate pushdown: Move WHERE filters closer to data source

Execution Plan Analysis:

Generate EXPLAIN plan to estimate query cost
Warn users of potentially expensive operations (full table scans, Cartesian products)

Sources: README.md74 README.zh.md72

Query Execution

The execution layer manages query dispatch, connection pooling, error handling, and result retrieval across multiple data sources.

Execution Engine Architecture

Connection Pooling:

Maintains pools of reusable database connections per data source
Configurable pool size (min/max connections)
Connection lifecycle management (acquire, release, cleanup)
Health checks to detect stale connections

Query Timeout:

Configurable timeout per data source type
Graceful cancellation of long-running queries
User notification with partial results if available

Error Handling:

Error Type	Handling Strategy
Syntax Error	Return to SQL generation with error feedback
Connection Timeout	Retry with exponential backoff (3 attempts)
Permission Denied	Log error, notify user of access restrictions
Data Source Unavailable	Fail fast with clear error message
Query Timeout	Cancel query, suggest optimization or data reduction

Result Streaming: For large result sets, data is streamed in chunks to avoid memory exhaustion and enable incremental visualization.

Result Caching: Identical queries within a time window (configurable, default 5 minutes) return cached results to reduce database load.

Sources: packages/dbgpt-core/src/dbgpt/storage/metadata/db_storage.py1-52

Visualization and Reporting

The visualization layer transforms query results into interactive charts and narrative reports using the GPT-Vis protocol.

GPT-Vis Protocol

Chart Type Selection

The system automatically selects appropriate chart types based on data characteristics:

Data Pattern	Chart Type	GPT-Vis Spec
Time series	Line chart	`{"type": "line", "x": "date", "y": "value"}`
Categorical comparison	Bar chart	`{"type": "bar", "x": "category", "y": "count"}`
Part-to-whole	Pie chart	`{"type": "pie", "value": "amount", "category": "segment"}`
Correlation	Scatter plot	`{"type": "scatter", "x": "var1", "y": "var2"}`
Distribution	Histogram	`{"type": "histogram", "value": "metric", "bins": 20}`
Geographic	Map	`{"type": "map", "region": "country", "value": "sales"}`

Chart Selection Logic:

Analyze result set dimensions (1D, 2D, multi-dimensional)
Detect temporal columns (DATE, TIMESTAMP types)
Identify categorical vs. continuous variables
LLM generates GPT-Vis specification with appropriate chart type
Fallback to table view if no suitable visualization

Report Assembly

The dashboard combines multiple visualization elements:

Components:

Title: Natural language question
Summary Statistics: Key metrics (count, sum, average)
Primary Visualization: Main chart answering the question
Supporting Visualizations: Additional charts for context
Narrative Explanation: LLM-generated text describing insights
Data Table: Raw results for drill-down

Narrative Generation:

LLM analyzes query results to identify trends, outliers, and patterns
Generates human-readable explanations: "Sales increased 15% in Q3 compared to Q2, driven primarily by the Electronics category."
Highlights notable data points and provides context

Interactive Features:

Filter controls for date ranges, categories
Drill-down to detailed views
Export to PDF, Excel, or image formats

Sources: README.md113 README.zh.md83

Knowledge Enhancement

The knowledge enhancement layer improves Text2SQL accuracy over time by maintaining business glossaries, query history, and comprehensive metadata.

Components

Business Glossary

Purpose: Map domain-specific business terminology to technical database schema.

Example Entries:

customer: user_accounts table
revenue: total_sales_amount column
active users: WHERE status = 'active' AND last_login >= DATE_SUB(NOW(), INTERVAL 30 DAY)
fiscal year: Custom date calculation based on company fiscal calendar

Population Methods:

Manual curation: Domain experts define mappings
Automatic learning: Frequent query patterns extracted from history
User feedback: Corrections when generated SQL is inaccurate

Query History and Learning

Query Log Storage: Every query is logged with:

Natural language input
Generated SQL
Execution success/failure
Result row count
Latency
User feedback (thumbs up/down, corrections)

Pattern Mining:

Identify frequently co-occurring tables in joins
Extract common WHERE clause patterns
Build template library for recurring query types

Metadata Freshness

Schema Synchronization:

Periodic polling of data sources to detect schema changes (new tables, columns)
Incremental updates to metadata store
Notification of schema changes that may affect existing queries

Statistics Updates:

Cardinality estimates for tables and columns
Data distribution histograms for numeric columns
Null percentage tracking

These statistics inform query optimization and help LLMs generate more accurate WHERE clause predicates.

Sources: README.md168-171

Configuration

GBI is configured through the main DB-GPT configuration file and data source connection parameters in the web UI.

Configuration File

Data source types are enabled by installing corresponding optional extras during installation:

Data Source Connection

Data sources are configured through the web UI at /api/v2/datasources:

Navigate to Data Sources section in web UI
Click Add Data Source
Select data source type (MySQL, PostgreSQL, ClickHouse, etc.)
Provide connection parameters:
- Host/URI
- Port
- Database name
- Username/password (encrypted at rest)
Test connection
Save configuration

Connection String Format (varies by data source):

MySQL: mysql://user:password@host:port/database
PostgreSQL: postgresql://user:password@host:port/database
ClickHouse: clickhouse://user:password@host:port/database
DuckDB: duckdb:///path/to/database.db (file-based)

GBI Feature Flags

Configuration options in configs/dbgpt-proxy-openai.toml:

Usage Example

User Question: "What were the top 5 products by revenue last month?"

GBI Pipeline Execution:

Query Understanding: Detected as aggregation query with temporal filter and ranking
Schema Linking: Mapped "products" → products table, "revenue" → orders.amount column, "last month" → WHERE order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
SQL Generation:
Execution: Query executed against connected MySQL database, returned 5 rows
Visualization: Generated bar chart via GPT-Vis with products on x-axis, revenue on y-axis
Narrative: "The top product was Widget A with $45,230 in revenue, followed by Gadget B at $38,120. Electronics category dominated the top 5."

Sources: README.md168-171

Generative Business Intelligence (GBI)

Purpose and Scope

System Architecture

GBI Pipeline Overview

Data Source Integration

Supported Data Sources

Data Source Connector Architecture

Text2SQL Pipeline

Pipeline Architecture

Stage 1: Natural Language Query Understanding

Stage 2: Schema Linking

Stage 3: SQL Generation

Stage 4: Query Optimization

Query Execution

Execution Engine Architecture

Visualization and Reporting

GPT-Vis Protocol

Chart Type Selection

Report Assembly

Knowledge Enhancement

Components

Business Glossary

Query History and Learning

Metadata Freshness

Configuration

Configuration File

Data Source Connection

GBI Feature Flags

Usage Example

On this page

Generative Business Intelligence (GBI)

Purpose and Scope

System Architecture

GBI Pipeline Overview

Data Source Integration

Supported Data Sources

Data Source Connector Architecture

Text2SQL Pipeline

Pipeline Architecture

Stage 1: Natural Language Query Understanding

Stage 2: Schema Linking

Stage 3: SQL Generation

Stage 4: Query Optimization

Query Execution

Execution Engine Architecture

Visualization and Reporting

GPT-Vis Protocol

Chart Type Selection

Report Assembly

Knowledge Enhancement

Components

Business Glossary

Query History and Learning

Metadata Freshness

Configuration

Configuration File

Data Source Connection

GBI Feature Flags

Usage Example

On this page