# Zillion: Make sense of it all

Semantic data modeling and analytics with a sprinkle of AI. https://totalhack.github.io/zillion/
## Introduction
Zillion is a data modeling and analytics tool that allows combining and
analyzing data from multiple datasources through a simple API. It acts as a semantic layer
on top of your data, writes SQL so you don't have to, and easily bolts onto existing
database infrastructure via SQLAlchemy Core. The Zillion NLP extension has experimental
support for AI-powered natural language querying and warehouse configuration.
With Zillion you can:
- Define a warehouse that contains a variety of SQL and/or file-like datasources
- Define or reflect metrics, dimensions, and relationships in your data
- Run multi-datasource reports and combine the results in a DataFrame
- Flexibly aggregate your data with multi-level rollups and table pivots
- Customize or combine fields with formulas
- Apply technical transformations including rolling, cumulative, and rank statistics
- Apply automatic type conversions - i.e. get a "year" dimension for free from a "date" column
- Save and share report specifications
- Utilize ad hoc or public datasources, tables, and fields to enrich reports
- Query your warehouse with natural language (NLP extension)
- Leverage AI to bootstrap your warehouse configurations (NLP extension)
## Table of Contents
- Installation
- Primer
- Example - Sales Analytics
- Advanced Topics
- Supported DataSources
- Multiprocess Considerations
- Demo UI / Web API
- Docs
- How to Contribute
<a name="installation"></a>
## Installation
**Warning**: This project is in an alpha state and is subject to change. Please test carefully for production usage and report any issues.
```shell
$ pip install zillion
```

or

```shell
$ pip install zillion[nlp]
```
<a name="primer"></a>
## Primer
The following is a quick overview of some theory and nomenclature used in data warehousing with Zillion, which will be useful if you are newer to this area. You can also skip ahead to the usage example or the warehouse/datasource creation quickstart options.
In short: Zillion writes SQL for you and makes data accessible through a very simple API:

```python
result = warehouse.execute(
    metrics=["revenue", "leads"],
    dimensions=["date"],
    criteria=[
        ("date", ">", "2020-01-01"),
        ("partner", "=", "Partner A")
    ]
)
```
<a name="metrics-and-dimensions"></a>
### Metrics and Dimensions
In Zillion there are two main types of Fields that will be used in
your report requests:
- **Dimensions**: attributes of data used for labelling, grouping, and filtering
- **Metrics**: facts and measures that may be broken down along dimensions
A Field encapsulates the concept of a column in your data. For example, you
may have a Field called "revenue". That Field may occur across several
datasources or possibly in multiple tables within a single datasource. Zillion
understands that all of those columns represent the same concept, and it can try
to use any of them to satisfy reports requesting "revenue".
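To make this concrete, here is a hypothetical sketch (as a Python dict) of two datasources whose differently named physical columns both map to a shared "revenue" field. The specific schema keys used here are illustrative assumptions; consult the Zillion example configs for the real `WarehouseConfigSchema` format.

```python
# Hypothetical sketch: mapping physical columns to a shared logical field.
# The exact config keys are assumptions for illustration only.
config = {
    "metrics": [
        {"name": "revenue", "type": "numeric(10,2)", "aggregation": "sum"},
    ],
    "datasources": {
        "sales_db": {
            "tables": {
                "main.orders": {
                    "type": "metric",
                    # Physical column "order_total" maps to logical "revenue"
                    "columns": {"order_total": {"fields": ["revenue"]}},
                }
            }
        },
        "finance_db": {
            "tables": {
                "main.ledger": {
                    # Physical column "gross_amount" also maps to "revenue"
                    "type": "metric",
                    "columns": {"gross_amount": {"fields": ["revenue"]}},
                }
            }
        },
    },
}
```

With a mapping like this, a report requesting "revenue" could be satisfied from either datasource.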
Likewise there are two main types of tables used to structure your warehouse:
- **Dimension Tables**: reference/attribute tables containing only related dimensions
- **Metric Tables**: fact tables that may contain metrics and some related dimensions/attributes
Dimension tables are often static or slowly growing in terms of row count and contain attributes tied to a primary key. Some common examples would be lists of US Zip Codes or company/partner directories.
Metric tables are generally more transactional in nature. Some common examples would be records for web requests, ecommerce sales, or stock market price history.
<a name="warehouse-theory"></a>
### Warehouse Theory
If you really want to go deep on dimensional modeling and the drill-across querying technique Zillion employs, I recommend reading Ralph Kimball's book on data warehousing.

To summarize, drill-across querying forms one or more queries to satisfy a report request for metrics that may exist across multiple datasources and/or tables at a particular dimension grain.
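As a rough illustration (plain pandas with made-up data, not Zillion's internals), drill-across can be pictured as running one query per metric source and then joining the results on the shared dimension grain:

```python
import pandas as pd

# Hypothetical per-datasource query results at the dimension grain {date}:
# "revenue" comes from one datasource, "leads" from another
revenue = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "revenue": [100.0, 150.0]})
leads = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "leads": [5, 7]})

# Drill-across: combine the per-source results on the shared grain
combined = revenue.merge(leads, on="date", how="outer")
```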
Zillion supports flexible warehouse setups such as
snowflake or
star schemas, though it isn't
picky about it. You can specify table relationships through a parent-child
lineage, and Zillion can also infer acceptable joins based on the presence
of dimension table primary keys. Zillion does not support many-to-many relationships at this time, though most analytics-focused scenarios should be able to work around that by adding views to the model if needed.
<a name="query-layers"></a>
### Query Layers
Zillion reports can be thought of as running in two layers:
- **DataSource Layer**: SQL queries against the warehouse's datasources
- **Combined Layer**: a final SQL query against the combined data from the DataSource Layer
The Combined Layer is just another SQL database (in-memory SQLite by default) that is used to tie the datasource data together and apply a few additional features such as rollups, row filters, row limits, sorting, pivots, and technical computations.
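A rough sketch of the idea, using plain `sqlite3` and pandas with made-up data (this is not Zillion's actual implementation):

```python
import sqlite3
import pandas as pd

# Pretend this is the data produced by the DataSource Layer queries
df = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "revenue": [100.0, 150.0]})

# Combined Layer concept: load the results into an in-memory SQLite DB,
# then run one final query over the combined data (e.g. a revenue rollup)
conn = sqlite3.connect(":memory:")
df.to_sql("combined_layer", conn, index=False)
total = conn.execute("SELECT SUM(revenue) FROM combined_layer").fetchone()[0]
```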
<a name="warehouse-creation"></a>
### Warehouse Creation
There are multiple ways to quickly initialize a warehouse from a local or remote file:
```python
# Path/link to a CSV, XLSX, XLS, JSON, HTML, or Google Sheet
# This builds a single-table Warehouse for quick/ad-hoc analysis.
url = "https://raw.githubusercontent.com/totalhack/zillion/master/tests/dma_zip.xlsx"
wh = Warehouse.from_data_file(url, ["Zip_Code"])  # Second arg is primary key

# Path/link to a sqlite database
# This can build a single or multi-table Warehouse
url = "https://github.com/totalhack/zillion/blob/master/tests/testdb1?raw=true"
wh = Warehouse.from_db_file(url)

# Path/link to a WarehouseConfigSchema (or pass a dict)
# This is the recommended production approach!
config = "https://raw.githubusercontent.com/totalhack/zillion/master/examples/example_wh_config.json"
wh = Warehouse(config=config)
```
Zillion also provides a helper script to bootstrap a DataSource configuration file for an existing database. See `zillion.scripts.bootstrap_datasource_config.py`. The bootstrap script requires a connection/database url and an output file as arguments. See the `--help` output for more options, including the optional `--nlp` flag that leverages OpenAI to infer configuration information such as column types, table types, and table relationships. The NLP feature requires the NLP extension to be installed, as well as the following set in your Zillion config file:
- OPENAI_MODEL
- OPENAI_API_KEY
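For example, a config fragment along these lines (the exact file format and location are assumptions here; check the Zillion docs for your installed version):

```yaml
# Hypothetical Zillion config fragment for the NLP features
OPENAI_MODEL: gpt-4
OPENAI_API_KEY: your-api-key-here
```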
<a name="executing-reports"></a>
### Executing Reports
The main purpose of Zillion is to execute reports against a Warehouse.
At a high level you will be crafting reports as follows:
```python
result = warehouse.execute(
    metrics=["revenue", "leads"],
    dimensions=["date"],
    criteria=[
        ("date", ">", "2020-01-01"),
        ("partner", "=", "Partner A")
    ]
)

print(result.df)  # Pandas DataFrame
```
When comparing to writing SQL, it helps to think of the dimensions as the target columns of a SQL group by, the metrics as the columns you are aggregating, and the criteria as the where clause. Note that criteria are applied in the DataSource Layer SQL queries.
The ReportResult has a Pandas DataFrame with the dimensions as the index and
the metrics as the columns.
A Report is said to have a grain, which defines the dimensions each metric
must be able to join to in order to satisfy the Report requirements. The
grain is a combination of all dimensions, including those referenced in
criteria or in metric formulas. In the example above, the grain would be
{date, partner}. Both "revenue" and "leads" must be able to join to those
dimensions for this report to be possible.
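For instance, the grain of the example report above can be sketched as the union of the requested dimensions and any dimensions referenced in criteria (a simplified illustration, not Zillion's actual grain logic):

```python
metrics = ["revenue", "leads"]
dimensions = ["date"]
criteria = [
    ("date", ">", "2020-01-01"),
    ("partner", "=", "Partner A"),
]

# Simplified: grain = requested dimensions + dimensions used in criteria
grain = set(dimensions) | {field for field, op, value in criteria}
# Every metric in the report must be able to join to every dimension in grain
```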
These concepts can take time to sink in.
