# PipelineX

PipelineX: Python package to build ML pipelines for experimentation with Kedro, MLflow, and more
## PipelineX Overview

PipelineX is a Python package to build ML pipelines for experimentation with Kedro, MLflow, and more.
PipelineX provides the following options, which can be used independently or together.

- HatchDict: Python in YAML/JSON

  HatchDict is a Python dict parser that enables you to include Python objects in YAML/JSON files.

  Note: HatchDict can be used with or without Kedro.

- Flex-Kedro: Kedro plugin for flexible config
  - Flex-Kedro-Pipeline: Kedro plugin for quicker pipeline set up
  - Flex-Kedro-Context: Kedro plugin for YAML lovers

- MLflow-on-Kedro: Kedro plugin for MLflow users

  MLflow-on-Kedro provides integration of Kedro with MLflow via Kedro DataSets and Hooks.

  Note: You do not need to install MLflow if you do not use it.

- Kedro-Extras: Kedro plugin to use various Python packages

  Kedro-Extras provides Kedro DataSets, decorators, and wrappers to use various Python packages such as:
  - PyTorch
  - Ignite
  - pandas
  - OpenCV
  - Memory Profiler
  - NVIDIA Management Library

  Note: You do not need to install Python packages you do not use.
Please refer here to find out how PipelineX differs from other pipeline/workflow packages: Airflow, Luigi, Gokart, Metaflow, and Kedro.
## Install PipelineX

### [Option 1] Install from PyPI

```bash
pip install pipelinex
```

### [Option 2] Development install

This is recommended only if you want to modify the source code of PipelineX.

```bash
git clone https://github.com/Minyus/pipelinex.git
cd pipelinex
python setup.py develop
```
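Note: recent versions of setuptools deprecate `python setup.py develop`; an editable install via pip achieves the same result:

```shell
# Editable (development) install: equivalent to `python setup.py develop`
pip install -e .
```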
## Prepare development environment for PipelineX

You can install packages and organize the development environment with pipenv. Refer to the pipenv documentation to install pipenv. Once you have installed pipenv, you can use it to install and organize your environment.

```bash
# install dependent libraries
$ pipenv install

# install development libraries
$ pipenv install --dev

# install pipelinex
$ pipenv run install

# install pipelinex via setup.py
$ pipenv run install_dev

# lint python code
$ pipenv run lint

# format python code
$ pipenv run fmt

# sort imports
$ pipenv run sort

# apply mypy to python code
$ pipenv run vet

# get into shell
$ pipenv shell

# run test
$ pipenv run test
```
## Prepare Docker environment for PipelineX

```bash
git clone https://github.com/Minyus/pipelinex.git
cd pipelinex
docker build --tag pipelinex .
docker run --rm -it pipelinex
```
## Getting Started with PipelineX

### Kedro (0.17-0.18) Starter projects

Kedro starters (Cookiecutter templates) to use Kedro, Scikit-learn, MLflow, and PipelineX are available at: kedro-starters-sklearn

The Iris dataset is included and used, but you can easily switch to the Kaggle Titanic dataset.
## Example/Demo Projects tested with Kedro 0.16

- Image classification using PyTorch
  - parameters.yml at conf/base/parameters.yml
  - Essential packages: PyTorch, Ignite, Shap, Kedro, MLflow
  - Application: Image classification
  - Data: MNIST images
  - Model: CNN (Convolutional Neural Network)
  - Loss: Cross-entropy

- Kaggle competition using PyTorch
  - parameters.yml at kaggle/conf/base/parameters.yml
  - Essential packages: PyTorch, Ignite, pandas, numpy, Kedro, MLflow
  - Application: Kaggle competition to predict the results of American Football plays
  - Data: Sparse heatmap-like field images and tabular data
  - Model: Combination of CNN and MLP
  - Loss: Continuous Rank Probability Score (CRPS)

- Image processing using OpenCV
  - parameters.yml at conf/base/parameters.yml
  - Essential packages: OpenCV, Scikit-image, numpy, TensorFlow (pretrained model), Kedro, MLflow
  - Application: Image processing to estimate the empty area ratio of a cuboid container on a truck
  - Data: Container images

- Uplift Modeling using CausalLift
  - parameters.yml at conf/base/parameters.yml
  - Essential packages: CausalLift, Scikit-learn, XGBoost, pandas, Kedro
  - Application: Uplift Modeling to find which customers should be targeted and which should not for a marketing campaign (treatment)
  - Data: generated by simulation
## HatchDict: Python in YAML/JSON

### Python objects in YAML/JSON

#### Introduction to YAML

YAML is a common text format used for application config files.

YAML's most notable advantage is that it allows users to mix 2 styles: block style and flow style.
Example:

```python
import yaml
from pprint import pprint  # pretty-print for a clearer look

# Read parameters dict from a YAML file in actual use
params_yaml = """
block_style_demo:
  key1: value1
  key2: value2
flow_style_demo: {key1: value1, key2: value2}
"""
parameters = yaml.safe_load(params_yaml)

print("### 2 styles in YAML ###")
pprint(parameters)
```

```
### 2 styles in YAML ###
{'block_style_demo': {'key1': 'value1', 'key2': 'value2'},
 'flow_style_demo': {'key1': 'value1', 'key2': 'value2'}}
```
To store a highly nested (hierarchical) dict or list, YAML is more convenient than hard-coding in Python code.

- YAML's block style, which uses indentation, allows users to omit the opening and closing symbols that specify a Python dict or list (`{}` or `[]`).
- YAML's flow style, which uses opening and closing symbols, allows users to specify a Python dict or list within a single line.

So, is simply using YAML with Python the best way for Machine Learning experimentation?

Let's check out the next example.
Example:

```python
import yaml
from pprint import pprint  # pretty-print for a clearer look

# Read parameters dict from a YAML file in actual use
params_yaml = """
model_kind: LogisticRegression
model_params:
  C: 1.23456
  max_iter: 987
  random_state: 42
"""
parameters = yaml.safe_load(params_yaml)

print("### Before ###")
pprint(parameters)

model_kind = parameters.get("model_kind")
model_params_dict = parameters.get("model_params")

if model_kind == "LogisticRegression":
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(**model_params_dict)

elif model_kind == "DecisionTree":
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier(**model_params_dict)

elif model_kind == "RandomForest":
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(**model_params_dict)

else:
    raise ValueError("Unsupported model_kind.")

print("\n### After ###")
print(model)
```

```
### Before ###
{'model_kind': 'LogisticRegression',
 'model_params': {'C': 1.23456, 'max_iter': 987, 'random_state': 42}}

### After ###
LogisticRegression(C=1.23456, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=987,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=42, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
```
This way is inefficient, as we need to add import and if statements for the options in the Python code in addition to modifying the YAML config file.
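The if/elif dispatch above can also be avoided with dynamic imports. The sketch below is not PipelineX's API, just an illustration of the general mechanism using Python's standard importlib; the `load_obj` helper name and the `parameters` dict are hypothetical:

```python
import importlib


def load_obj(dotted_path: str):
    """Import a class or function from a dotted path such as 'collections.OrderedDict'."""
    module_path, _, name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, name)


# Hypothetical parameters dict, as it might come out of yaml.safe_load
parameters = {
    "model_kind": "collections.OrderedDict",  # any importable dotted path
    "model_params": {"a": 1, "b": 2},
}

model_cls = load_obj(parameters["model_kind"])
model = model_cls(**parameters["model_params"])
print(type(model).__name__)  # OrderedDict
```

With this mechanism, adding a new model option only requires editing the YAML file; no new import or if branch is needed in the Python code. This is essentially what dynamic-loading config parsers rely on under the hood.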
Any better way?

### Python tags in YAML

PyYAML provides UnsafeLoader, which can load Python objects without import.
Example usage of `!!python/object`:

```python
import yaml
# You do not need `import sklearn.linear_model` using PyYAML's UnsafeLoader

# Read parameters dict from a YAML file in actual use
params_yaml = """
model:
  !!python/object:sklearn.linear_model.LogisticRegression
  C: 1.23456
  max_iter: 987
  random_state: 42
"""
parameters = yaml.unsafe_load(params_yaml)  # unsafe_load required

model = parameters.get("model")

print("### model object by PyYAML's UnsafeLoader ###")
print(model)
```

```
### model object by PyYAML's UnsafeLoader ###
LogisticRegression(C=1.23456, class_weight=None, dual=None, fit_intercept=None,
                   intercept_scaling=None, l1_ratio=None, max_iter=987,
                   multi_class=None, n_jobs=None, penalty=None, random_state=42,
                   solver=None, tol=None, verbose=None, warm_start=None)
```
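As the loader name suggests, unsafe_load can construct arbitrary Python objects, so it should only be used on config files you trust. PyYAML's safe_load, by contrast, rejects Python tags outright. A small check, assuming PyYAML is installed:

```python
import yaml

doc = "model: !!python/object:collections.OrderedDict {}"

try:
    yaml.safe_load(doc)
    rejected = False
except yaml.YAMLError:
    # safe_load has no constructor registered for the !!python/object tag,
    # so it raises a ConstructorError (a subclass of YAMLError)
    rejected = True

print("safe_load rejected the Python tag:", rejected)
```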