<a href="https://aorist.scie.nz/" target="_blank"> <img src="https://user-images.githubusercontent.com/31727438/136563601-6cfa52c4-0307-45e1-9e77-68f26cc97976.png" alt="Aorist logo" title="Aorist" align="right" height="60" /> </a>

Aorist

Aorist is a code-generation tool for MLOps. Its aim is to generate legible code for common repetitive tasks in data science, such as data replication, common transformations, and machine learning operations.

Installation instructions

Go to <a href="https://aorist.scie.nz/" target="_blank">aorist.scie.nz</a> for installation instructions and a tutorial. You can find the developer guide below.

Developer Guide

Package organization

Aorist has a Rust core and a Python interface. The project relies on the following sub-projects:

  • aorist_util -- a Rust crate containing small utility functions used across the project.
  • aorist_derive -- Rust crate exporting derive macros (and only those macros) used across the project.
  • aorist_primitives -- Rust crate exporting "primitive" macros (such as register_constraint, define_attribute, etc.) used to abstract away boilerplate code inside the Rust code base.
  • aorist_concept -- a Rust crate dedicated to the aorist macro. This macro "decorates" structs and enums to make them "constrainable" in the aorist sense.
  • aorist_ast -- a Rust crate implementing a cross-language Abstract Syntax Tree (AST), used for generating code in both Python and R. Aorist AST nodes get compiled into native Python or R AST nodes. More languages can be supported here.
  • aorist_attributes -- this Rust crate exports a taxonomy of data attributes (e.g. KeyStringIdentifier, POSIXTimestamp), which can be used to impose data quality and compliance constraints across table schemas.
  • aorist_core -- This is the core Rust crate for the Aorist project. The main object taxonomy is defined here. New structs and enums can be added here.
  • aorist_constraint -- This Rust crate lists constraints that can be applied to Aorist universes made up of concepts as listed in aorist_core. Multiple aorist_constraint crates can be compiled against the aorist_core crate.
  • aorist -- This Rust crate exports a Python library via a PyO3 binding. This directory also contains the conda recipe used for creating the aorist conda package (which includes the compiled Rust library, as well as a number of Python helpers).
  • aorist_recipes -- This Python package contains recipes (using Python, TrinoSQL, R, or Bash) that can be used to satisfy constraints as defined in aorist_constraint. Multiple aorist_recipes packages can be provided at runtime.
  • scienz -- This Python package contains a set of pre-defined datasets which can be used out-of-the box with the aorist package.

How to build

Because Aorist is a mixed Rust / Python project, building involves two stages:

  • first, a set of Rust libraries is built via cargo;
  • then, a Python library is built via conda.

Rust library

Pre-requisites

You will need to install Rust in order to compile Aorist.

Building

You can build individual Rust libraries directly by running cargo build from within the respective directory listed in the Package Organization section.

To build the entire project run cargo build from the root directory.

Conda library

Pre-requisites

  1. Install Anaconda.

  2. Make sure you use conda-forge, rather than the default conda channel:

conda config --add channels conda-forge
conda config --set channel_priority strict

  3. Create a new environment with mamba:

conda create -n aorist-build -c conda-forge mamba
conda activate aorist-build

  4. Install boa:

mamba install "conda-build>=3.20" colorama \
    pip ruamel ruamel.yaml rich mamba jsonschema -c conda-forge
cd ~ && git clone git@github.com:mamba-org/boa.git
cd boa && pip install -e .

Building

Build the packages by running:

cd ~/aorist
cd aorist && conda mambabuild . && cd ..
anaconda upload [ARTIFACT] --label dev
conda search --override -c scienz/label/dev aorist

mamba install aorist dill astor -c conda-forge -c scienz/label/dev

cd aorist_recipes && conda mambabuild . && cd ..

cd scienz && conda mambabuild . && cd ..

Adding new datasets

You can add new canonical datasets to the scienz package. Once a dataset is accepted for publication, its associated metadata can be distributed painlessly. To do so, please follow the steps described below:

  1. Specify your datasets in a new Python file in the scienz/scienz directory (you can look at other files in that directory for examples).
  2. Make sure to import the datasets in scienz/__init__.py.
  3. Run conda build . from within the scienz subdirectory. The build step will also trigger a test, which ensures that your dataset is correctly specified.
  4. If conda build . succeeds, submit a Pull Request against scienz/aorist.
  5. Once the PR is accepted, the scienz package will be rebuilt and your dataset will be accessible via Anaconda.

How to test

Run the following commands:

pip install astor black dill

Inside aorist:

python build_for_testing.py

Inside aorist/scienz:

PYTHONPATH=$PYTHONPATH:../aorist_recipes:../scienz:../aorist python run_test.py

If no error messages appear, your new dataset has been successfully added.

Overview of an Aorist universe

(note that the code examples below are provided for illustrative purposes and may have occasional bugs)

Let's say we are starting a new project which involves analyzing a number of large graph datasets, such as the ones provided by the SNAP project.

We will conduct our analysis in a mini data-lake, such as the Trino + MinIO solution specified by Walden.

We would like to replicate all these graphs into our data lake before we can start analyzing them. At a very high-level, this is achieved by defining a "universe", the totality of things we care about in our project. One such universe is specified below:

from snap import snap_dataset
from aorist import (
    dag,
    Universe,
    ComplianceConfig,
    HiveTableStorage,
    MinioLocation,
    StaticHiveTableLayout,
    ORCEncoding,
)
from common import DEFAULT_USERS, DEFAULT_GROUPS, DEFAULT_ENDPOINTS

universe = Universe(
    name="my_cluster",
    datasets=[
        snap_dataset,
    ],
    endpoints=DEFAULT_ENDPOINTS,
    users=DEFAULT_USERS,
    groups=DEFAULT_GROUPS,
    compliance=ComplianceConfig(
        description="""
        Testing workflow for data replication of SNAP data to
        local cluster. The SNAP dataset collection is provided
        as open data by Stanford University. The collection contains
        various social and technological network graphs, with
        reasonable and systematic efforts having been made to ensure
        the removal of all Personally Identifiable Information.
        """,
        data_about_human_subjects=True,
        contains_personally_identifiable_information=False,
    ),
)

The universe definition contains a number of things:

  • the datasets we care about (more on these in a bit),
  • the endpoints available to us (e.g., that a MinIO server is available for storage, as opposed to HDFS or S3, and where it can be reached; which endpoint to use for Presto / Trino; etc.),
  • the users and groups that will access the dataset,
  • some compliance annotations.

Note: Currently users, groups, and compliance annotations are supported as a proof of concept -- these concepts are not essential to an introduction so we will skip them for now.

Generating a DAG

To generate a flow that replicates our data all we have to do is run:

DIALECT = "python"
out = dag(
  universe, [
    "AllAssetsComputed",
  ], DIALECT
)

This will generate a set of Python tasks, which will do the following, for each asset (i.e., each graph) in our dataset:

  • download it from its remote location,
  • decompress it, if necessary
  • remove its header,
  • convert the file to a CSV, if necessary
  • upload the CSV data to MinIO
  • create a Hive table backing the MinIO location
  • convert the CSV-based Hive table to an ORC-based Hive table
  • drop the temporary CSV-based Hive table
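The first few per-asset steps can be sketched in miniature. The following is an illustrative stand-in, not code Aorist actually generates: the helper names and the sample edge list are invented, and the "remote" download is simulated with an in-memory gzip blob.

```python
# Illustrative sketch (NOT Aorist-generated code): decompress a downloaded
# SNAP-style edge list, strip its comment header, and convert the
# tab-separated data to CSV, ready for upload to object storage.
import csv
import gzip
import io


def decompress(data: bytes) -> str:
    """Stand-in for the decompression step."""
    return gzip.decompress(data).decode()


def strip_header(text: str, comment_prefix: str = "#") -> list[str]:
    """Drop comment/header lines, keeping only data rows."""
    return [ln for ln in text.splitlines() if not ln.startswith(comment_prefix)]


def tsv_to_csv(lines: list[str]) -> str:
    """Rewrite tab-separated rows as CSV."""
    out = io.StringIO()
    writer = csv.writer(out)
    for ln in lines:
        writer.writerow(ln.split("\t"))
    return out.getvalue()


# Simulated "remote" asset: a gzipped edge list with a comment header.
raw = gzip.compress(b"# FromNodeId\tToNodeId\n0\t1\n0\t2\n1\t2\n")
csv_text = tsv_to_csv(strip_header(decompress(raw)))
print(csv_text)
```

The remaining steps (upload to MinIO, Hive table creation, ORC conversion) depend on the endpoints configured in the universe and are emitted as further tasks in the generated DAG.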

This set of tasks is also known as a Directed Acyclic Graph (DAG). The same DAG can be generated as a Jupyter notebook, e.g. by setting:

DIALECT = "jupyter"

Or we can set DIALECT to "airflow" for an Airflow DAG.

Aside: what is actually going on?

What Aorist does is quite complex -- the following is an explanation of the conceptual details, but you can skip it if you'd rather move on to something more concrete:

  • First, you describe the universe. This universe is a highly structured hierarchy of concepts, each of which can be "constrained".
  • A constraint is something that "needs to happen". In this example, the only constraint we declare is AllAssetsComputed, which is attached to the Universe, a singleton object.
  • Constraints attach to specific kinds of objects -- some attach to the entire Universe, others attach to tables, etc.
  • A constraint is considered satisfied when the constraints it depends on are satisfied.
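This dependency-driven notion of satisfaction amounts to ordering constraints topologically: a constraint can only run once everything it depends on has run. A minimal sketch of the idea, using hypothetical constraint names (this is not Aorist's implementation):

```python
# Sketch of constraint resolution as a topological sort. The constraint
# names and the dependency edges below are invented for illustration.
from graphlib import TopologicalSorter

# Map each constraint to the set of constraints it depends on.
deps = {
    "AllAssetsComputed": {"TableSchemasCreated", "DataDownloaded"},
    "TableSchemasCreated": {"DataDownloaded"},
    "DataDownloaded": set(),
}

# static_order() yields constraints only after their dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
```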
