Aorist
Aorist is a code-generation tool for MLOps. Its aim is to generate legible code for common repetitive tasks in data science, such as data replication, common transformations, and machine learning operations. It produces readable, intuitive code that you can inspect, edit, and run yourself, so you can focus on the hard parts of your project while automating the repetitive ones. All Aorist needs is a description of how your data is formatted and organized, and where it needs to go.
Table of contents
Installation instructions
Go to <a href="https://aorist.scie.nz/" target="_blank">aorist.scie.nz</a> for installation instructions and a tutorial. You can find the developer guide below.
Developer Guide
Package organization
Aorist has a Rust core and a Python interface. The project relies on the following sub-projects:
- aorist_util -- a Rust crate containing small utility functions used across the project.
- aorist_derive -- a Rust crate exporting derive macros (and only those macros) used across the project.
- aorist_primitives -- a Rust crate exporting "primitive" macros (such as register_constraint, define_attribute, etc.) used to abstract away boilerplate code inside the Rust code base.
- aorist_concept -- a Rust crate dedicated to the aorist macro. This macro "decorates" structs and enums to make them "constrainable" in the aorist sense.
- aorist_ast -- a Rust crate implementing a cross-language Abstract Syntax Tree (AST), used for generating code in both Python and R. Aorist AST nodes get compiled into native Python or R AST nodes. More languages can be supported here.
- aorist_attributes -- a Rust crate exporting a taxonomy of data attributes (e.g. KeyStringIdentifier, POSIXTimestamp), which can be used to impose data quality and compliance constraints across table schemas.
- aorist_core -- the core Rust crate for the Aorist project. The main object taxonomy is defined here. New structs and enums can be added here.
- aorist_constraint -- a Rust crate listing constraints that can be applied to Aorist universes made up of concepts as listed in aorist_core. Multiple aorist_constraint crates can be compiled against the aorist_core crate.
- aorist -- a Rust crate exporting a Python library via a PyO3 binding. This directory also contains the conda recipe used for creating the aorist conda package (which includes the compiled Rust library, as well as a number of Python helpers).
- aorist_recipes -- a Python package containing recipes (using Python, TrinoSQL, R, or Bash) that can be used to satisfy constraints as defined in aorist_constraint. Multiple aorist_recipes packages can be provided at runtime.
- scienz -- a Python package containing a set of pre-defined datasets which can be used out-of-the-box with the aorist package.
How to build
Because Aorist is a mixed Rust / Python project, building involves two stages:
- first, a set of Rust libraries is built via cargo,
- then, a Python library is built via conda.
Rust library
Pre-requisites
You will need to install Rust in order to compile Aorist.
Building
You can build individual Rust libraries directly by running cargo build from within the respective directory listed in the Package Organization section.
To build the entire project, run cargo build from the root directory.
Conda library
Pre-requisites
- Install Anaconda.
- Make sure you use conda-forge, rather than the default conda channel:
  conda config --add channels conda-forge
  conda config --set channel_priority strict
- Create a new environment with mamba:
  conda create -n aorist-build -c conda-forge mamba
  conda activate aorist-build
- Install boa:
  mamba install "conda-build>=3.20" colorama \
    pip ruamel ruamel.yaml rich mamba jsonschema -c conda-forge
  cd ~ && git clone git@github.com:mamba-org/boa.git
  cd boa && pip install -e .
Building
Build the packages by running:
cd ~/aorist
cd aorist && conda mambabuild . && cd ..
anaconda upload [ARTIFACT] --label dev
conda search --override -c scienz/label/dev aorist
mamba install aorist dill astor -c conda-forge -c scienz/label/dev
cd aorist_recipes && conda mambabuild . && cd ..
cd scienz && conda mambabuild . && cd ..
Adding new datasets
You can add new canonical datasets to the scienz package. Once a dataset is accepted for publication, its associated metadata can be distributed painlessly. To do so, please follow the steps described below:
- Specify your datasets in a new Python file in the scienz/scienz directory. (You can look at other files in that directory for examples.)
- Make sure to import the datasets in scienz/__init__.py.
- Run conda build . from within the scienz subdirectory. The build step will also trigger a test, which ensures that your dataset is correctly specified.
- If conda build . succeeds, submit a Pull Request against scienz/aorist.
- Once the PR is accepted, the scienz package will be rebuilt and your dataset will be accessible via Anaconda.
How to test
Run the following commands:
pip install astor black dill
Inside aorist:
python build_for_testing.py
Inside aorist/scienz:
PYTHONPATH=$PYTHONPATH:../aorist_recipes:../scienz:../aorist python run_test.py
If no error messages appear, your new dataset has been successfully added.
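As an aside, the PYTHONPATH prefix above is what makes the sibling packages importable without installing them. A tiny self-contained illustration of the mechanism (demo_pkg and mod are throwaway names, not part of the repo):

```shell
# Create a throwaway package directory with one module...
mkdir -p demo_pkg
echo "VALUE = 42" > demo_pkg/mod.py
# ...and make it importable by prepending the directory to PYTHONPATH.
PYTHONPATH=demo_pkg python -c "import mod; print(mod.VALUE)"
```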
Overview of an Aorist universe
(note that the code examples below are provided for illustrative purposes and may have occasional bugs)
Let's say we are starting a new project which involves analyzing a number of large graph datasets, such as the ones provided by the SNAP project.
We will conduct our analysis in a mini data-lake, such as the Trino + MinIO solution specified by Walden.
We would like to replicate all these graphs into our data lake before we can start analyzing them. At a very high-level, this is achieved by defining a "universe", the totality of things we care about in our project. One such universe is specified below:
from snap import snap_dataset
from aorist import (
dag,
Universe,
ComplianceConfig,
HiveTableStorage,
MinioLocation,
StaticHiveTableLayout,
ORCEncoding,
)
from common import DEFAULT_USERS, DEFAULT_GROUPS, DEFAULT_ENDPOINTS
universe = Universe(
name="my_cluster",
datasets=[
snap_dataset,
],
endpoints=DEFAULT_ENDPOINTS,
users=DEFAULT_USERS,
groups=DEFAULT_GROUPS,
compliance=ComplianceConfig(
description="""
Testing workflow for data replication of SNAP data to
local cluster. The SNAP dataset collection is provided
as open data by Stanford University. The collection contains
various social and technological network graphs, with
reasonable and systematic efforts having been made to ensure
the removal of all Personally Identifiable Information.
""",
data_about_human_subjects=True,
contains_personally_identifiable_information=False,
),
)
The universe definition contains a number of things:
- the datasets we are talking about (more on these in a bit),
- the endpoints we have available (e.g. the fact that a MinIO server is available for storage, as opposed to HDFS or S3, etc., and where that server is available; what endpoint we should use for Presto / Trino, etc.)
- who the users and groups are that will access the dataset,
- some compliance annotations.
Note: Currently users, groups, and compliance annotations are supported as a proof of concept -- these concepts are not essential to an introduction so we will skip them for now.
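For intuition, the endpoints entry bundles the service locations mentioned above, such as the MinIO and Presto / Trino servers. A hypothetical sketch of what such a bundle might carry (field names, hosts, and ports are all illustrative, not Aorist's actual endpoint API):

```python
from dataclasses import dataclass

# Illustrative stand-in for an endpoint configuration; the real aorist
# endpoint objects differ. Hosts and ports below are made-up defaults.
@dataclass
class Endpoints:
    minio_host: str = "localhost"
    minio_port: int = 9000
    presto_host: str = "localhost"
    presto_port: int = 8080

DEFAULT_ENDPOINTS = Endpoints()
```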
Generating a DAG
To generate a flow that replicates our data, all we have to do is run:
DIALECT = "python"
out = dag(
universe, [
"AllAssetsComputed",
], DIALECT
)
This will generate a set of Python tasks which, for each asset (i.e., each graph) in our dataset, will:
- download it from its remote location,
- decompress it, if necessary,
- remove its header,
- convert the file to a CSV, if necessary,
- upload the CSV data to MinIO,
- create a Hive table backing the MinIO location,
- convert the CSV-based Hive table to an ORC-based Hive table,
- drop the temporary CSV-based Hive table.
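The first few of these steps (download, decompress, strip the header, convert to CSV) can be sketched by hand in plain Python. This is a hand-written illustration of the idea, not the code Aorist generates:

```python
import csv
import gzip
import io

def snap_to_csv(gz_bytes: bytes) -> str:
    """Convert a gzipped, tab-separated SNAP edge list with '#' comment
    headers into CSV text. Hand-written sketch, not Aorist-generated code."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    buf = io.StringIO()
    writer = csv.writer(buf)
    for line in text.splitlines():
        if line.startswith("#"):  # SNAP files prefix header lines with '#'
            continue
        writer.writerow(line.split("\t"))
    return buf.getvalue()
```

Aorist generates analogous (but complete) task code for each asset, including the MinIO upload and Hive table steps.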
This set of tasks is also known as a Directed Acyclic Graph (DAG). The same DAG can be generated as a Jupyter notebook, e.g. by setting:
DIALECT = "jupyter"
Or we can set DIALECT to "airflow" for an Airflow DAG.
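Whichever dialect you choose, the generated source can be inspected before running it. A minimal sketch, assuming dag returns the generated code as a string (the placeholder below stands in for that return value, since the exact return type may vary by version):

```python
# Placeholder for the string returned by
# dag(universe, ["AllAssetsComputed"], "python"), so this sketch is
# self-contained.
out = "# generated replication pipeline\n"

# Write the generated tasks to a file you can read, edit, and execute.
with open("replicate_snap.py", "w") as f:
    f.write(out)
```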
Aside: what is actually going on?
What Aorist does is quite complex -- the following is an explanation of the conceptual details, but you can skip it if you want something a bit more concrete:
- First, you describe the universe. This universe is actually a highly-structured hierarchy of concepts, each of which can be "constrained".
- A constraint is something that "needs to happen". In this example, all you declare that needs to happen is the constraint AllAssetsComputed. This constraint is attached to the Universe, which is a singleton object.
- Constraints attach to specific kinds of objects -- some attach to the entire Universe, others attach to tables, etc.
- Constraints are considered to be satisfied when their dependent constraints are satisfied.
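The satisfaction rule above can be illustrated with a toy resolver. This is purely conceptual: the constraint names other than AllAssetsComputed are made up, and Aorist's real constraint engine is far more sophisticated (in particular, leaf constraints map to recipes that must actually run, whereas here they count as trivially satisfied):

```python
def satisfied(constraint, deps, memo=None):
    """A constraint is satisfied once all of its dependencies are.
    Toy illustration only; assumes the dependency graph is acyclic."""
    if memo is None:
        memo = {}
    if constraint not in memo:
        memo[constraint] = all(
            satisfied(d, deps, memo) for d in deps.get(constraint, ())
        )
    return memo[constraint]

# Hypothetical dependency graph; constraints with no entry are leaves.
deps = {
    "AllAssetsComputed": ["AssetReplicated"],
    "AssetReplicated": ["AssetDownloaded", "AssetUploaded"],
}
```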
