Allstars
a smart and progressive semantic layer
Install / Use
/learn @preset-io/AllstarsREADME
SQL All ⭐ Stars
<img width=600 src="https://github.com/preset-io/estrella/assets/487433/390e73c4-f99b-4912-a6b2-830b0b9a0375">:warning: WIP: unfinished project. Shared for the value of the ideas exposed here. Let us know if you're interested in teaming up or using the ideas shared here.
A Smart & Progressive Semantic Layer
SQL All ⭐ Stars or allstars is a smart semantic layer that takes
a new approach:
- mostly inferred: it looks at your physical schema and makes educated guesses around your semantics (joins, metrics, dimensions, hierarchies). Eventually it will also look at your usage patterns to infer more useful semantics.
- progressively enrichable: building on it's inferences, you can curate, enrich and rewire things as needed at your own pace. It's useful on day 0, and becomes more useful as you extend the semantics.
- served as a virtual database: it exposes itself primarily as a simple database, as one or many collections of flat tables. This means it's universally adopted by anything that can talk to a database. Under the hood, allstars takes your simple SQL against those wide tables, and transpiles the SQL into more complex SQL doing the joining, the union-ing and whatever else is needed to get to your data
- RESTful too: get all the JSONs you need here too, allowing deeper integrations beyond the virtual database
- 100% open: unlike a lot of semantic layers that are tightly coupled with some BI tool, this is 100% open and self-standing
Served as a virtual database
SELECT customer_name, SUM(units_sold), SUM(pre_orders)
FROM ⭐
GROUP BY ⭐;
Something that's novel about SQL All ⭐ Stars is the idea of exposing this semantic layer primarily as a virtual database. This shows as a one or many wide tables exposing essentially all of the columns that you want to expose from your database. While you or your BI tool run very simple SQL against large tables, allstars transpiles this query into a more complex query against the underlying physical schema.
Without getting too deep in the mechanics, there's one large table to rule them all (the "superstar" ⭐ table), and one table per "query context", representing sets of tables that can be joined together to answer queries.
If you are familiar with dimensional modeling and star schemas, you can think of a "query context" as a specific star within a collection of star schemas. If you expose multiple star schemas within your semantic layer, you'll typically have one query context per star schema, and each one will be exposed as a virtual table.
Now about that "superstar" ⭐ table, what happens if you try to use metrics and dimensions that are unrelated (in different query contexts)? Well SQL Allstars will simply return a clear error message at in "The metric [inventory] cannot be joined with [customer.name]".
On top of these large tables, allstars exposes a set of metadata tables with all the semantics (metrics, dimensions, hierarchies, ...)
SELECT ⭐ FROM ⭐.metrics;
SELECT ⭐ FROM ⭐.dimensions;
SELECT ⭐ FROM ⭐.hierachies;
SELECT ⭐ FROM ⭐.tables;
{{ #... }}
If you are curious as to how it works behind the scene, the TLDR is that SQL Allstars is implemented as a Python dbapi driver that acts as a bit of a proxy in front of your database.
Sweet synthactic sugar
As a "transpiler with context" allstars can allow you to sprinkle some sugar in your SQL. Let's get ahead of ANSI SQL and implement things. Ideas to be implemented:
FROM ⭐- straight from the superstar tableGROUP BY ⭐- group by all dimensions referenced in the queryGROUP BY ⭐.DAY- same as above, but truncate the one time column by day, WEEK, MONTH or other timeframe keywords- trailing commas aloud
- a nice hint system
- window function abstraction(?)
RESTfull of itself
If the virtual database doesn't suit your needs, a RESTful API is also exposed. Comprehensive calls to extract and alter semantics, transpile SQL, sync different sources and targets, and much more.
Compiled, serialized, versioned and broadcastable
allstars makes inferences based on your physical schema, naming conventions and can "compiles" all this to a static, deterministic model that can be pushed to the file system for source control, or to a database for driver retrieval.
Some expected workflows around creating, enriching and updating your semantics:
- first inference: where allstars extracts your physical models, infers your joins and populates dimensions and metrics
- mechanics enrichment: where a data engineer or analyst engineer will add new objects, add join criterias that could not be detected, and creates some rich, relevant dimensions and metrics
- presentation erichment: add nice labels, descriptions and slap a folder structure around all your metrics, dimensions and hierarchies
- update: your schemas have evolved, new tables and column created, time to
allstars extractthe latest changes in your schema to keep things up-to-date
As you progressively make these changes, they are all "compiled", versioned and made static to serve predictable outcomes.
Sources & targets:
at launch
- the filesystem as a set of human readable yaml files
- API/DSL: if yaml is too verbose and you need something more dynamic, you can use the python objects directly, and compile to yaml
- a database, where the content is serialized into a
_metatable
future
- a git repo+ref, so that the driver can load up directly from a uri
- a REST service
- s3://artifact.zip
A Dynamic mode
In many cases you'll want the semantic layer to be deterministic and static. This is how semantic layers typically work.
Alternatively, you can run allstars in dynamic mode, point it to a database schema, and let it learn the schema it's working with, and receive hints as to how to behave as it goes. In this mode, it'll look up the physical schema, infer possible joins, and even receive SQL-like commands to enrichment as in
INSERT ('table1', 'table2', 'table_1.id = table2.id') INTO ⭐.joins;
INSERT ('Order Count', 'COUNT(DISTINCT order_id)', 'orders') INTO ⭐.metrics;
It's driver, not a proxy!
allstars is NOT a service that lives somewhere in between your database
and your application, it's a Python dbapi driver + sqlalchemy dialect.
This means it's NOT an external service you need to launch, keep up
and observe, it's meant to be installed and used as a simple database driver
and won't delay or buffer your queries.
What if I'm not in python? well, hoping a community develops and builds driver for other languages that interact with allstars REST service that's written in Python.
It's also very likely that someone will want to host allstars as a proxy
service to a database. There's probably an allstar-dbproxy project in the
future of allstars.
As allstars becomes more popular, decoupling the specificiation from the
python implementation will become important, using something like protobufs
to define the object schemas and share them across languages.
What's in an allstars semantic layer?
Some internals
Opening up the hood, the semantic layer has the following internal representation ->
Relations
Relations, as pointers to physical tables and views, including the underlying columns along with their data types. The semantic layer is essentially a fancy rich map to get to these things. SQL All ⭐ Stars has a full in-memory representation of all this to navigate it all.
Joins
To complete the picture from the relations, we have a full map of which joins can be used against the underlying relations. Each join maps two relations, has a type of join assigned (left, right, full, outer, ...) and has its cardinality (one-to-one, one-to-many, many-to-one, ...).
Query contexts
When semantic layers get big, it becomes large spaghetti plate of joins, and not all joins can be followed. The general goal is to start from a metric, and join it to it's dimension while avoiding "fan traps" where the metric gets duplicated while following a one-to-many join.
Query contexts are effectively a collection of joins that can be used together safely to generate a query. In traditional dimensional modeling, where you have a collection of fact tables and dimension tables, you'll typically have one query context per fact table. In the case of a dimensional query where metrics from multiple fact tables are chosen along with shared dimensions, the semantic layer will resolve by generating multiple queries, each against a single query context, and merge the results.
Exposed to users
The semantic layer is essentially a menu of user-relatable objects that act as a map to more complex set and more abstract set of physical database objects. Like on a restaurant menu, all of these objects are clearly organized in sections and sub-sections (folders), have nice labels, and longer descriptions.
Metric
A metric is a simple aggregate SQL expression against one or many relations. It typically will point to a single relation but can in come cases span across relation and require a join to compute. Note that metrics can reference other metrics, in which case they become a more complex metric that inherits the underlying relations.
Dimension
A dimension is a simple, non-aggregate SQL expression against one or many relations. Like metrics, they typically a built on top of a single relation but not always.
Note that we made the decision to use the term dimension even though it can be confused with the idea of a dimension table in dimensional modeling where it refers to a collection of attributes sharing the same atomicity. Here
