# Zillion: Make sense of it all

Semantic data modeling and analytics with a sprinkle of AI. https://totalhack.github.io/zillion/
## Introduction
Zillion is a data modeling and analytics tool that allows combining and
analyzing data from multiple datasources through a simple API. It acts as a semantic layer
on top of your data, writes SQL so you don't have to, and easily bolts onto existing
database infrastructure via SQLAlchemy Core. The Zillion NLP extension has experimental
support for AI-powered natural language querying and warehouse configuration.
With Zillion you can:
- Define a warehouse that contains a variety of SQL and/or file-like datasources
- Define or reflect metrics, dimensions, and relationships in your data
- Run multi-datasource reports and combine the results in a DataFrame
- Flexibly aggregate your data with multi-level rollups and table pivots
- Customize or combine fields with formulas
- Apply technical transformations including rolling, cumulative, and rank statistics
- Apply automatic type conversions - i.e. get a "year" dimension for free from a "date" column
- Save and share report specifications
- Utilize ad hoc or public datasources, tables, and fields to enrich reports
- Query your warehouse with natural language (NLP extension)
- Leverage AI to bootstrap your warehouse configurations (NLP extension)
## Table of Contents
- Installation
- Primer
- Example - Sales Analytics
- Advanced Topics
- Supported DataSources
- Multiprocess Considerations
- Demo UI / Web API
- Docs
- How to Contribute
<a name="installation"></a>
## Installation
**Warning**: This project is in an alpha state and is subject to change. Please test carefully for production usage and report any issues.
```shell
$ pip install zillion
```

or

```shell
$ pip install zillion[nlp]
```
<a name="primer"></a>
## Primer
The following is a quick overview of some theory and nomenclature used in data warehousing with Zillion, which will be useful if you are newer to this area. You can also skip ahead to the usage example or the warehouse/datasource creation quickstart options.
In short: Zillion writes SQL for you and makes data accessible through a very simple API:

```python
result = warehouse.execute(
    metrics=["revenue", "leads"],
    dimensions=["date"],
    criteria=[
        ("date", ">", "2020-01-01"),
        ("partner", "=", "Partner A")
    ]
)
```
<a name="metrics-and-dimensions"></a>
### Metrics and Dimensions
In Zillion there are two main types of Fields that will be used in
your report requests:
- **Dimensions**: attributes of data used for labelling, grouping, and filtering
- **Metrics**: facts and measures that may be broken down along dimensions
A Field encapsulates the concept of a column in your data. For example, you
may have a Field called "revenue". That Field may occur across several
datasources or possibly in multiple tables within a single datasource. Zillion
understands that all of those columns represent the same concept, and it can try
to use any of them to satisfy reports requesting "revenue".
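To make this concrete, here is a hypothetical sketch (as a Python dict) of two datasources whose differently named physical columns both map to a shared "revenue" field. The specific schema keys used here are illustrative assumptions; consult the Zillion example configs for the real `WarehouseConfigSchema` format.

```python
# Hypothetical sketch: mapping physical columns to a shared logical field.
# The exact config keys are assumptions for illustration only.
config = {
    "metrics": [
        {"name": "revenue", "type": "numeric(10,2)", "aggregation": "sum"},
    ],
    "datasources": {
        "sales_db": {
            "tables": {
                "main.orders": {
                    "type": "metric",
                    # Physical column "order_total" maps to logical "revenue"
                    "columns": {"order_total": {"fields": ["revenue"]}},
                }
            }
        },
        "finance_db": {
            "tables": {
                "main.ledger": {
                    # Physical column "gross_amount" also maps to "revenue"
                    "type": "metric",
                    "columns": {"gross_amount": {"fields": ["revenue"]}},
                }
            }
        },
    },
}
```

With a mapping like this, a report requesting "revenue" could be satisfied from either datasource.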
Likewise there are two main types of tables used to structure your warehouse:
- **Dimension Tables**: reference/attribute tables containing only related dimensions
- **Metric Tables**: fact tables that may contain metrics and some related dimensions/attributes
Dimension tables are often static or slowly growing in terms of row count and contain attributes tied to a primary key. Some common examples would be lists of US Zip Codes or company/partner directories.
Metric tables are generally more transactional in nature. Some common examples would be records for web requests, ecommerce sales, or stock market price history.
<a name="warehouse-theory"></a>
### Warehouse Theory
If you really want to go deep on dimensional modeling and the drill-across querying technique Zillion employs, I recommend reading Ralph Kimball's book on data warehousing.

To summarize, drill-across querying forms one or more queries to satisfy a report request for metrics that may exist across multiple datasources and/or tables at a particular dimension grain.
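As a rough illustration (plain pandas with made-up data, not Zillion's internals), drill-across can be pictured as running one query per metric source and then joining the results on the shared dimension grain:

```python
import pandas as pd

# Hypothetical per-datasource query results at the dimension grain {date}:
# "revenue" comes from one datasource, "leads" from another
revenue = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "revenue": [100.0, 150.0]})
leads = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "leads": [5, 7]})

# Drill-across: combine the per-source results on the shared grain
combined = revenue.merge(leads, on="date", how="outer")
```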
Zillion supports flexible warehouse setups such as
snowflake or
star schemas, though it isn't
picky about it. You can specify table relationships through a parent-child
lineage, and Zillion can also infer acceptable joins based on the presence
of dimension table primary keys. Zillion does not support many-to-many relationships at this time, though most analytics-focused scenarios should be able to work around that by adding views to the model if needed.
<a name="query-layers"></a>
### Query Layers
Zillion reports can be thought of as running in two layers:
- **DataSource Layer**: SQL queries against the warehouse's datasources
- **Combined Layer**: a final SQL query against the combined data from the DataSource Layer
The Combined Layer is just another SQL database (in-memory SQLite by default) that is used to tie the datasource data together and apply a few additional features such as rollups, row filters, row limits, sorting, pivots, and technical computations.
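A rough sketch of the idea, using plain `sqlite3` and pandas with made-up data (this is not Zillion's actual implementation):

```python
import sqlite3
import pandas as pd

# Pretend this is the data produced by the DataSource Layer queries
df = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "revenue": [100.0, 150.0]})

# Combined Layer concept: load the results into an in-memory SQLite DB,
# then run one final query over the combined data (e.g. a revenue rollup)
conn = sqlite3.connect(":memory:")
df.to_sql("combined_layer", conn, index=False)
total = conn.execute("SELECT SUM(revenue) FROM combined_layer").fetchone()[0]
```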
<a name="warehouse-creation"></a>
### Warehouse Creation
There are multiple ways to quickly initialize a warehouse from a local or remote file:
```python
# Path/link to a CSV, XLSX, XLS, JSON, HTML, or Google Sheet
# This builds a single-table Warehouse for quick/ad-hoc analysis.
url = "https://raw.githubusercontent.com/totalhack/zillion/master/tests/dma_zip.xlsx"
wh = Warehouse.from_data_file(url, ["Zip_Code"])  # Second arg is primary key

# Path/link to a sqlite database
# This can build a single or multi-table Warehouse
url = "https://github.com/totalhack/zillion/blob/master/tests/testdb1?raw=true"
wh = Warehouse.from_db_file(url)

# Path/link to a WarehouseConfigSchema (or pass a dict)
# This is the recommended production approach!
config = "https://raw.githubusercontent.com/totalhack/zillion/master/examples/example_wh_config.json"
wh = Warehouse(config=config)
```
Zillion also provides a helper script to bootstrap a DataSource configuration file for an existing database. See `zillion.scripts.bootstrap_datasource_config.py`. The bootstrap script requires a connection/database url and an output file as arguments. See the `--help` output for more options, including the optional `--nlp` flag that leverages OpenAI to infer configuration information such as column types, table types, and table relationships. The NLP feature requires the NLP extension to be installed, as well as the following set in your Zillion config file:
- OPENAI_MODEL
- OPENAI_API_KEY
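For example, a config fragment along these lines (the exact file format and location are assumptions here; check the Zillion docs for your installed version):

```yaml
# Hypothetical Zillion config fragment for the NLP features
OPENAI_MODEL: gpt-4
OPENAI_API_KEY: your-api-key-here
```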
<a name="executing-reports"></a>
### Executing Reports
The main purpose of Zillion is to execute reports against a Warehouse.
At a high level you will be crafting reports as follows:
```python
result = warehouse.execute(
    metrics=["revenue", "leads"],
    dimensions=["date"],
    criteria=[
        ("date", ">", "2020-01-01"),
        ("partner", "=", "Partner A")
    ]
)

print(result.df)  # Pandas DataFrame
```
When comparing to writing SQL, it helps to think of the dimensions as the target columns of a SQL group by, the metrics as the columns you are aggregating, and the criteria as the where clause. Note that criteria are applied in the DataSource Layer SQL queries.
The ReportResult has a Pandas DataFrame with the dimensions as the index and
the metrics as the columns.
A Report is said to have a grain, which defines the dimensions each metric
must be able to join to in order to satisfy the Report requirements. The
grain is a combination of all dimensions, including those referenced in
criteria or in metric formulas. In the example above, the grain would be
{date, partner}. Both "revenue" and "leads" must be able to join to those
dimensions for this report to be possible.
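For instance, the grain of the example report above can be sketched as the union of the requested dimensions and any dimensions referenced in criteria (a simplified illustration, not Zillion's actual grain logic):

```python
metrics = ["revenue", "leads"]
dimensions = ["date"]
criteria = [
    ("date", ">", "2020-01-01"),
    ("partner", "=", "Partner A"),
]

# Simplified: grain = requested dimensions + dimensions used in criteria
grain = set(dimensions) | {field for field, op, value in criteria}
# Every metric in the report must be able to join to every dimension in grain
```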
These concepts can take time to sink in.
