Aethos
Automated Data Science and Machine Learning library to optimize workflow.
<i>"A collection of tools for Data Scientists and ML Engineers to automate their workflow of performing analysis to deploying models and pipelines."</i>
To track development of the project, you can view the Trello board.
What's new in Aethos 2.0? For a summary of new features and changes to Aethos in v2.0 you can read this blog post. Alternatively, there is a Google Colab notebook available here.
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
Table of Contents
- Introduction
- Usage
- Installation
- Development Phases
- Feedback
- Contributors
- Sponsors
- Acknowledgments
- For Developers
Introduction
Aethos is a library/platform that automates your data science and analytical tasks at any stage in the pipeline. Aethos is, at its core, a uniform API that helps automate analytical techniques from various libraries such as pandas, scikit-learn, gensim, etc.
Aethos provides:
- Automated data science cleaning, preprocessing, feature engineering and modelling techniques through one line of code
- Automated visualizations through one line of code
- Reusable code - no more copying code from notebook to notebook
- Automated dependency and corpus management
- Data science project templates
- Integrated 3rd party Jupyter plugins to make analyzing data more friendly
- Model analysis use cases - Confusion Matrix, ROC Curve, all metrics, decision tree plots, etc.
- Model interpretability - Local through SHAP and LIME, global through Morris Sensitivity
- Interactive checklists and tips to either remind or help you through your analysis.
- Comparing train and test data distribution
- Exporting trained models as a service (Generates the necessary code, files and folder structure)
- Experiment tracking with MLflow
- Pre-trained models - BERT, GPT2, etc.
- Statistical tests - ANOVA, T-test, etc.
You can view a full list of implemented techniques in the documentation or here: TECHNIQUES.md
Plus more coming soon such as:
- Testing for model drift
- Recommendation models
- Parallelization through Dask and/or Spark
- Uniform API for deep learning models
- Automated code and file generation for Jupyter notebook development and a Python file of your data pipeline.
Aethos makes it easy to build a proof of concept, experiment, and compare different techniques and models from various libraries. Imputation, visualization, scaling, dimensionality reduction, feature engineering, modelling, model results and model deployment are each done with a single, human-readable line of code!
Aethos builds on other open source libraries to enhance your analysis with richer statistical information, interactive visual plots, and statistical tests and models. All your tools are in one place, accessible with one line of code or a click. See the Acknowledgments below for the open source libraries used in this project.
Usage
For full documentation on all the techniques and models, click here or here
Examples can be viewed here
To start, we need to import Aethos as well as pandas.
Before that, you can optionally create a full data science folder structure by running aethos create from the command line and following the prompts.
Options
To enable extensions, such as QGrid interactive filtering, enable them as you would in pandas:
import aethos as at
at.options.interactive_df = True
The following options are currently available:
- interactive_df: Interactive grid with QGrid
- interactive_table: Interactive grid with Itable - comes with built in client side searching
- project_metrics: Setting project metrics
  - Project metrics is a metric or set of metrics to evaluate models.
- track_experiments: Uses MLflow to track models and experiments.
User options, such as the directory where images and projects are saved, can be edited in the config file. This is located at USER_HOME/.aethos/.
This is also the default location where images and projects are stored.
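As a rough illustration of where that directory lives, the snippet below resolves USER_HOME/.aethos/ with the standard library. It only builds the path; the name of the config file inside it is not stated here, so none is assumed.

```python
from pathlib import Path

# USER_HOME/.aethos/ as described above; Path.home() resolves the
# current user's home directory on Linux, macOS and Windows.
aethos_dir = Path.home() / ".aethos"

print(aethos_dir)           # full path to the Aethos directory
print(aethos_dir.exists())  # True only once the directory has been created
```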
New in 2.0
The Data and Model objects no longer exist. Instead, there are multiple objects you can use, each with a more specific purpose.
Analysis - Used to analyze, visualize and run statistical models (t-tests, anovas, etc.)
Classification - Used to analyze, visualize, run statistical models and train classification models.
Regression - Used to analyze, visualize, run statistical models and train regression models.
Unsupervised - Used to analyze, visualize, run statistical models and train unsupervised models.
ClassificationModelAnalysis - Used to analyze, interpret and visualize results of a Classification model.
RegressionModelAnalysis - Used to analyze, interpret and visualize results of a Regression model.
UnsupervisedModelAnalysis - Used to analyze, interpret and visualize results of an Unsupervised model.
TextModelAnalysis - Used to analyze, interpret and visualize results of a Text model.
Now all modelling and analysis can be achieved via one object.
Analysis
import aethos as at
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Ashton-Sidhu/aethos/develop/examples/data/train.csv') # load data into pandas
# Initialize a Classification object with training data
# By default, if no test data (x_test) is provided, the data is split with 20% going to the test set
#
# Specify the target field as 'Survived'
df = at.Classification(x_train, target='Survived')
df.x_train # View your training data
df.x_test # View your testing data
df # Glance at your training data
df[df.Age > 25] # Filter the data
df.x_train['new_col'] = [1,2] # Columns can also be assigned directly on the training set
df.x_test['new_col'] = [1,2] # ... and on the test set
df.data_report(title='Titanic Summary', output_file='titanic_summary.html') # Automate EDA with pandas profiling with an autogenerated report
df.describe() # Display a high level view of your data using an extended version of pandas describe
df.column_info() # Display info about each column in your data
df.describe_column('Fare') # Get in-depth statistics about the 'Fare' column
df.mean() # Run pandas functions on the aethos objects
df.missing_data # View your missing data at anytime
df.correlation_matrix() # Generate a correlation matrix for your training data
df.predictive_power() # Calculates the predictive power of each variable
df.autoviz() # Runs autoviz to perform EDA on your data
df.pairplot() # Generate pairplots for your training data features at any time
df.checklist() # Will provide an interactive checklist to keep track of your cleaning tasks
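The default 20% hold-out mentioned in the comments above can be pictured with a small standalone sketch. This is plain Python, not Aethos code: the function name, the shuffling, and the fixed seed are assumptions made for illustration, and how Aethos actually performs the split is not specified here.

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and hold out `test_fraction` of them as a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))
train, test = train_test_split_rows(rows)
print(len(train), len(test))  # 8 2
```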
NOTE: One of the benefits of using aethos is that any method you apply to your training set is also applied to your test set. For any method that requires fitting (e.g. replacing missing data with the mean), the method is fit on the training data and then applied to the testing data to avoid data leakage.
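To make the fit-on-train, apply-to-test idea concrete, here is a minimal sketch using mean imputation. It is plain Python with hypothetical helper names (fit_mean, fill_missing) and made-up values, not the Aethos implementation.

```python
from statistics import mean

def fit_mean(values):
    """Learn the fill value from the training column only."""
    return mean(v for v in values if v is not None)

def fill_missing(values, fill_value):
    """Replace None entries with a previously learned fill value."""
    return [fill_value if v is None else v for v in values]

train_age = [22.0, None, 38.0, 30.0]
test_age = [None, 54.0]

fill = fit_mean(train_age)                   # (22 + 38 + 30) / 3 = 30.0
train_filled = fill_missing(train_age, fill)
test_filled = fill_missing(test_age, fill)   # test reuses the train statistic

print(train_filled)  # [22.0, 30.0, 38.0, 30.0]
print(test_filled)   # [30.0, 54.0]
```

The test column never contributes to the fill value, which is what keeps information from the test set out of the fitted statistic.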
# Replace missing values in the 'Fare' and 'Embarked' column with the most common values in each of the respective columns.
df.replace_missing_mostcommon('Fare', 'Embarked')
# To create a "checkpoint" of your data (i.e. if you just want to test this analytical method), assign the result to a variable
df_checkpoint = df.replace_missing_mostcommon('Fare', 'Embarked')
# Replace missing values in the 'Age' column with a random value that follows the probability distribution of the 'Age' column in the training set.
df.replace_missing_random_discrete('Age')
df.drop('Cabin') # Drop the cabin column
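The random replacement described above for 'Age' can be sketched as sampling from the non-missing training values, so that frequent values are drawn more often. This is an illustrative standalone version with hypothetical helper names and made-up values; the exact sampling scheme Aethos uses is an assumption here.

```python
import random

def fit_observed(train_values):
    """Collect the non-missing training values. Drawing uniformly from this
    list reproduces the empirical distribution of the column."""
    return [v for v in train_values if v is not None]

def fill_random(values, observed, rng):
    """Replace each None with a random draw from the observed values."""
    return [rng.choice(observed) if v is None else v for v in values]

rng = random.Random(0)  # seeded only to make the sketch repeatable
train_age = [22, 22, 22, 38, None, None]
test_age = [None, 54]

observed = fit_observed(train_age)            # fitted on the training set only
train_filled = fill_random(train_age, observed, rng)
test_filled = fill_random(test_age, observed, rng)
print(train_filled, test_filled)
```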
As you may have noticed, a lot of the tasks for cleaning and exploring the data have been reduced to a single command, and each can be customized by providing the respective keyword arguments (see documentation).
# Create a barplot of the mean survival rate grouped by age.
df.barplot(x='Age', y='Survived', method='mean')
# Plots a scatter plot of Age vs. Fare and colours the dots based off the Survived column.
df.scatterplot(x='Age', y='Fare', color='Survived')
# One hot encode the `Person` and `Embarked` columns and then drop the original columns
df.onehot_encode('Person', 'Embarked', drop_col=True)
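For readers unfamiliar with one-hot encoding, the sketch below shows the transformation on a list of plain dictionaries. The helper onehot_encode_rows and the Embarked_S style column names are assumptions for illustration; only the drop_col keyword is taken from the call above.

```python
def onehot_encode_rows(rows, column, drop_col=True):
    """Add one 0/1 indicator column per category found in `column`."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = dict(row)
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        if drop_col:
            del new_row[column]  # drop the original categorical column
        encoded.append(new_row)
    return encoded

rows = [{"Embarked": "S", "Fare": 7.25}, {"Embarked": "C", "Fare": 71.28}]
encoded = onehot_encode_rows(rows, "Embarked")
for row in encoded:
    print(row)
# {'Fare': 7.25, 'Embarked_C': 0, 'Embarked_S': 1}
# {'Fare': 71.28, 'Embarked_C': 1, 'Embarked_S': 0}
```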
Modelling
Running a Single Model
Models can be trained one at a time or several at once. A model can be trained by passing parameters for the underlying sklearn, xgboost, etc. constructor, by passing a gridsearch dictionary and params, or by cross validating with gridsearch and params.
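A gridsearch dictionary maps each parameter name to a list of candidate values, and the search evaluates one model per combination, typically scoring each with cross validation. The sketch below shows only that expansion step in plain Python; the parameter names are hypothetical, and how Aethos accepts the dictionary is not shown here.

```python
from itertools import product

def expand_grid(param_grid):
    """Yield one parameter dict per combination in a grid-search dictionary."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

# Hypothetical grid; the keys would be constructor parameters of the chosen model.
param_grid = {"max_depth": [3, 5], "n_estimators": [50, 100, 200]}

candidates = list(expand_grid(param_grid))
print(len(candidates))  # 6
print(candidates[0])    # {'max_depth': 3, 'n_estimators': 50}
```

The number of candidates is the product of the list lengths, so each extra parameter multiplies the number of models to train.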
After a model has been run, it comes with use cases such as plotting ROC curves, calculating performance metrics, confusion matr
