OpenChatBI
OpenChatBI is an open source, chat-based intelligent BI tool powered by large language models, designed to help users query, analyze, and visualize data through natural language conversations. Built on LangGraph and LangChain ecosystem, it provides chat agents and workflows that support natural language to SQL conversion and streamlined data analysis.
Join the Slack channel to discuss: https://join.slack.com/t/openchatbicommunity/shared_invite/zt-3jpzpx9mv-Sk88RxpO4Up0L~YTZYf4GQ
<img src="https://github.com/zhongyu09/openchatbi/raw/main/example/demo.gif" alt="Demo" width="800">

Core Features
- Natural Language Interaction: Get data analysis results by asking questions in natural language
- Automatic SQL Generation: Convert natural language queries into SQL statements using advanced text2sql workflows with schema linking and well-organized prompt engineering
- Data Visualization: Generate intuitive data visualizations (via plotly)
- Data Catalog Management: Automatically discovers and indexes database table structures, supports flexible catalog storage backends with vector-based or BM25-based retrieval, and makes it easy to maintain business explanations for tables and columns and to optimize prompts
- Time Series Forecasting: In-house forecasting models that can be called as tools
- Code Execution: Execute Python code for data analysis and visualization
- Interactive Problem-Solving: Proactively ask users for more context when information is incomplete
- Persistent Memory: Conversation management and user characteristic memory based on LangGraph checkpointing
- MCP Support: Integration with MCP tools by configuration
- Knowledge Base Integration: Answer complex questions by combining catalog-based knowledge retrieval with external knowledge base retrieval (via MCP tools)
- Web UI Interface: Provides two sample UIs, a simple and a streaming web interface built with Gradio and Streamlit, both easy to integrate with other web applications
Roadmap
- Anomaly Detection Algorithm: Time series anomaly detection
- Root Cause Analysis Algorithm: Multi-dimensional drill-down capabilities for anomaly investigation
Getting Started
Installation & Setup
Prerequisites
- Python 3.11 or higher
- Access to a supported LLM provider (OpenAI, Anthropic, etc.)
- Data Warehouse (Database) credentials (like Presto, PostgreSQL, MySQL, etc.)
- (Optional) Embedding model for vector-based retrieval - if not available, BM25-based retrieval will be used
- (Optional) Docker - required only for `docker` executor mode
Note on Chinese Text Segmentation: For better Chinese text retrieval, jieba is used for word segmentation. However, jieba is not compatible with Python 3.12+. On Python 3.12 and higher, the system automatically falls back to simple punctuation-based segmentation for Chinese text.
Installation
- Using uv (recommended):
git clone git@github.com:zhongyu09/openchatbi
uv sync
- Using pip:
pip install openchatbi
- For development:
git clone git@github.com:zhongyu09/openchatbi
uv sync --group dev
Optional: If you want to use pysqlite3 (for a newer SQLite build), you can install it manually. If the build fails, install the SQLite development headers first:
On macOS, install sqlite using Homebrew:
brew install sqlite
brew info sqlite
export LDFLAGS="-L/opt/homebrew/opt/sqlite/lib"
export CPPFLAGS="-I/opt/homebrew/opt/sqlite/include"
On Amazon Linux / RHEL / CentOS:
sudo yum install sqlite-devel
On Ubuntu / Debian:
sudo apt-get update
sudo apt-get install libsqlite3-dev
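To confirm which SQLite build your Python environment will actually use, a quick check (not part of OpenChatBI itself) is:

```python
import sqlite3

# Prefer pysqlite3 (which bundles a newer SQLite) when it is installed;
# otherwise keep the stdlib sqlite3 module.
try:
    import pysqlite3 as sqlite3  # type: ignore[no-redef]
except ImportError:
    pass

# Version of the linked SQLite library, e.g. "3.45.1"
print(sqlite3.sqlite_version)
```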
Run Demo
Run the demo using an example dataset from the Spider dataset. You need to provide your OpenAI API key, or change the config to use another LLM provider.
Note: The demo example includes embedding model configuration. If you want to run without an embedding model, you can remove the embedding_model section in the config - BM25 retrieval will be used automatically.
cp example/config.yaml openchatbi/config.yaml
sed -i 's/YOUR_API_KEY_HERE/[YOUR OPENAI API KEY]/g' openchatbi/config.yaml
python run_streamlit_ui.py
Configuration
- Create configuration file
Copy the configuration template:
cp openchatbi/config.yaml.template openchatbi/config.yaml
Or create an empty YAML file.
- Configure your LLMs:
```yaml
# Select which provider to use
default_llm: openai

# Define one or more providers
llm_providers:
  openai:
    default_llm:
      class: langchain_openai.ChatOpenAI
      params:
        api_key: YOUR_API_KEY_HERE
        model: gpt-4.1
        temperature: 0.02
        max_tokens: 8192

    # Optional: Embedding model for vector-based retrieval and memory tools
    # If not configured, BM25-based retrieval will be used, and the memory tools will not work
    embedding_model:
      class: langchain_openai.OpenAIEmbeddings
      params:
        api_key: YOUR_API_KEY_HERE
        model: text-embedding-3-large
        chunk_size: 1024
```
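Each model entry pairs a `class` path with `params`, so a loader for this layout presumably resolves the class dynamically and instantiates it with the parameters. Here is a minimal sketch of that pattern (not OpenChatBI's actual code; `build_from_spec` is an illustrative name), demonstrated with a stdlib class standing in for `ChatOpenAI`:

```python
import importlib
from typing import Any


def build_from_spec(spec: dict[str, Any]) -> Any:
    """Instantiate the configured `class` with `params` (illustrative loader)."""
    module_path, _, class_name = spec["class"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**spec.get("params", {}))


# A real spec would be e.g. {"class": "langchain_openai.ChatOpenAI", "params": {...}};
# a stdlib stand-in keeps this sketch self-contained and runnable:
obj = build_from_spec({"class": "datetime.timedelta", "params": {"days": 7}})
print(obj)  # 7 days, 0:00:00
```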
- Configure your data warehouse:
```yaml
organization: Your Company
dialect: presto
data_warehouse_config:
  uri: "presto://user@host:8080/catalog/schema"
  include_tables:
    - your_table_name
  database_name: "catalog.schema"
```
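The `uri` follows the standard `dialect://user@host:port/path` shape, so each credential and connection detail is encoded in one string. A quick stdlib check (illustrative only) shows how it decomposes:

```python
from urllib.parse import urlparse

uri = "presto://user@host:8080/catalog/schema"
parts = urlparse(uri)

print(parts.scheme)    # "presto"  (the SQL dialect)
print(parts.username)  # "user"
print(parts.hostname)  # "host"
print(parts.port)      # 8080
print(parts.path)      # "/catalog/schema"
```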
Running the Application
- Invoking LangGraph:

```shell
export CONFIG_FILE=YOUR_CONFIG_FILE_PATH
```

```python
from openchatbi import get_default_graph

graph = get_default_graph()
graph.invoke(
    {"messages": [{"role": "user", "content": "Show me ctr trends for the past 7 days"}]},
    config={"configurable": {"thread_id": "1"}},
)
```
System-generated SQL:

```sql
SELECT date, SUM(clicks)/SUM(impression) AS ctr
FROM ad_performance
WHERE date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY date
ORDER BY date;
```
- Sample Web UI:
Streamlit-based UI:
streamlit run sample_ui/streamlit_ui.py
Gradio-based UI:
python sample_ui/streaming_ui.py
Configuration Instructions
The configuration template is provided at config.yaml.template. Key configuration sections include:
Basic Settings
- organization: Organization name (e.g., "Your Company")
- dialect: Database dialect (e.g., "presto")
- bi_config_file: Path to BI configuration file (e.g., "example/bi.yaml")
Catalog Store Configuration
- catalog_store: Configuration for data catalog storage
  - store_type: Storage type (e.g., "file_system")
  - data_path: Path to catalog data on the file system (e.g., "./example")
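Assuming the options nest under `catalog_store` as in the template, a file-system catalog store would look roughly like this:

```yaml
catalog_store:
  store_type: file_system
  data_path: ./example
```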
Data Warehouse Configuration
- data_warehouse_config: Database connection settings
  - uri: Connection string for your database
  - include_tables: List of tables to include in the catalog; leave empty to include all tables
  - database_name: Database name for the catalog
  - token_service: Token service URL (for data warehouses that need token authentication, like Presto)
  - user_name/password: Token service credentials
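For a token-authenticated Presto warehouse, these keys would combine roughly as follows (the token service URL and credentials below are placeholders, not defaults shipped with OpenChatBI):

```yaml
data_warehouse_config:
  uri: "presto://user@host:8080/catalog/schema"
  include_tables: []            # empty: include all tables
  database_name: "catalog.schema"
  token_service: "https://token-service.example.com"   # placeholder URL
  user_name: YOUR_USER_NAME
  password: YOUR_PASSWORD
```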
LLM Configuration
Various LLMs are supported via LangChain; see the LangChain API Reference (https://python.langchain.com/api_reference/reference.html#integrations) for the full list of integrations that support chat_models. You can configure different LLMs for different tasks:
- default_llm: Primary language model for general tasks
- embedding_model: (Optional) Model for embedding generation. If not configured, BM25-based text retrieval will be used as a fallback, and the memory tools will not work
- text2sql_llm: (Optional) Specialized model for SQL generation. If not configured, uses default_llm
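For example, to route SQL generation to a dedicated model while keeping a general-purpose default, the config could look like this (model names are placeholders; the layout follows the class/params pattern used elsewhere in the template):

```yaml
default_llm:
  class: langchain_openai.ChatOpenAI
  params:
    api_key: YOUR_API_KEY_HERE
    model: gpt-4.1

# Optional: used only for SQL generation; falls back to default_llm if omitted
text2sql_llm:
  class: langchain_openai.ChatOpenAI
  params:
    api_key: YOUR_API_KEY_HERE
    model: YOUR_SQL_MODEL
```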
Multiple providers (optional):
- Configure multiple providers under `llm_providers` and select one with `default_llm: <provider_name>`.
- In `sample_ui/streamlit_ui.py`, a provider dropdown appears when `llm_providers` is configured.
- In `sample_api/async_api.py`, pass `provider` in the `/chat/stream` request body.
Commonly used LLM providers and their corresponding classes and installation commands:
- Anthropic: `langchain_anthropic.ChatAnthropic`, `pip install langchain-anthropic`
- OpenAI: `langchain_openai.ChatOpenAI`, `pip install langchain-openai`
- Azure OpenAI: `langchain_openai.AzureChatOpenAI`, `pip install langchain-openai`
- Google Vertex AI: `langchain_google_vertexai.ChatVertexAI`, `pip install langchain-google-vertexai`
- Bedrock: `langchain_aws.ChatBedrock`, `pip install langchain-aws`
- Hugging Face: `langchain_huggingface.ChatHuggingFace`, `pip install langchain-huggingface`
- DeepSeek: `langchain_deepseek.ChatDeepSeek`, `pip install langchain-deepseek`
- Ollama: `langchain_ollama.ChatOllama`, `pip install langchain-ollama`
Advanced Configuration
OpenChatBI supports sophisticated customization through prompt engineering and catalog management features:
- Prompt Engineering Configuration: Customize system prompts, business glossaries, and data warehouse introductions
- Data Catalog Management: Configure table metadata, column descriptions, and SQL generation rules
- Business Rules: Define table selection criteria and domain-specific SQL constraints
- Forecasting Service: Configure the forecasting service URL and prompt based on your own deployment
For detailed configuration options and examples, see the Advanced Features section.
Architecture Overview
OpenChatBI is built using a modular architecture with clear separation of concerns:
- LangGraph Workflows: Core orchestration using state machines for complex multi-step processes
- Catalog Management: Flexible data catalog system with intelligent retrieval (vector-based or BM25 fallback)
- Text2SQL Pipeline: Advanced natural language to SQL conversion with schema linking
- Code Execution: Sandboxed Python execution environment for data analysis
- Tool Integration: Extensible tool system for human interaction and knowledge search
- Persistent Memory: Conversation management and user memory based on LangGraph checkpointing
