# PipelineX

PipelineX: Python package to build ML pipelines for experimentation with Kedro, MLflow, and more
## PipelineX Overview

PipelineX is a Python package to build ML pipelines for experimentation with Kedro, MLflow, and more.
PipelineX provides the following options, which can be used independently or together.

- HatchDict: Python in YAML/JSON

  HatchDict is a Python dict parser that enables you to include Python objects in YAML/JSON files.

  Note: HatchDict can be used with or without Kedro.

- Flex-Kedro: Kedro plugin for flexible config
  - Flex-Kedro-Pipeline: Kedro plugin for quicker pipeline set up
  - Flex-Kedro-Context: Kedro plugin for YAML lovers

- MLflow-on-Kedro: Kedro plugin for MLflow users

  MLflow-on-Kedro provides integration of Kedro with MLflow via Kedro DataSets and Hooks.

  Note: You do not need to install MLflow if you do not use it.

- Kedro-Extras: Kedro plugin to use various Python packages

  Kedro-Extras provides Kedro DataSets, decorators, and wrappers to use various Python packages such as:
  - PyTorch
  - Ignite
  - pandas
  - OpenCV
  - Memory Profiler
  - NVIDIA Management Library

  Note: You do not need to install Python packages you do not use.
Please refer here to find out how PipelineX differs from other pipeline/workflow packages: Airflow, Luigi, Gokart, Metaflow, and Kedro.
## Install PipelineX

### [Option 1] Install from PyPI

```bash
pip install pipelinex
```

### [Option 2] Development install

This is recommended only if you want to modify the source code of PipelineX.

```bash
git clone https://github.com/Minyus/pipelinex.git
cd pipelinex
python setup.py develop
```
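Note: recent versions of setuptools deprecate `python setup.py develop`; an editable install via pip achieves the same result:

```shell
# Editable (development) install: equivalent to `python setup.py develop`
pip install -e .
```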
## Prepare development environment for PipelineX

You can install packages and organize the development environment with pipenv. Refer to the pipenv documentation to install pipenv. Once you have installed pipenv, you can use it to install and organize your environment.

```bash
# install dependent libraries
$ pipenv install

# install development libraries
$ pipenv install --dev

# install pipelinex
$ pipenv run install

# install pipelinex via setup.py
$ pipenv run install_dev

# lint python code
$ pipenv run lint

# format python code
$ pipenv run fmt

# sort imports
$ pipenv run sort

# apply mypy to python code
$ pipenv run vet

# get into shell
$ pipenv shell

# run test
$ pipenv run test
```
## Prepare Docker environment for PipelineX

```bash
git clone https://github.com/Minyus/pipelinex.git
cd pipelinex
docker build --tag pipelinex .
docker run --rm -it pipelinex
```
## Getting Started with PipelineX

### Kedro (0.17-0.18) Starter projects

Kedro starters (Cookiecutter templates) to use Kedro, Scikit-learn, MLflow, and PipelineX are available at: kedro-starters-sklearn

The Iris dataset is included and used, but you can easily switch to the Kaggle Titanic dataset.
## Example/Demo Projects tested with Kedro 0.16

- Image classification using PyTorch
  - parameters.yml at conf/base/parameters.yml
  - Essential packages: PyTorch, Ignite, Shap, Kedro, MLflow
  - Application: Image classification
  - Data: MNIST images
  - Model: CNN (Convolutional Neural Network)
  - Loss: Cross-entropy

- Kaggle competition using PyTorch
  - parameters.yml at kaggle/conf/base/parameters.yml
  - Essential packages: PyTorch, Ignite, pandas, numpy, Kedro, MLflow
  - Application: Kaggle competition to predict the results of American Football plays
  - Data: Sparse heatmap-like field images and tabular data
  - Model: Combination of CNN and MLP
  - Loss: Continuous Rank Probability Score (CRPS)

- Image processing using OpenCV
  - parameters.yml at conf/base/parameters.yml
  - Essential packages: OpenCV, Scikit-image, numpy, TensorFlow (pretrained model), Kedro, MLflow
  - Application: Image processing to estimate the empty area ratio of a cuboid container on a truck
  - Data: Container images

- Uplift Modeling using CausalLift
  - parameters.yml at conf/base/parameters.yml
  - Essential packages: CausalLift, Scikit-learn, XGBoost, pandas, Kedro
  - Application: Uplift Modeling to find which customers should be targeted and which should not for a marketing campaign (treatment)
  - Data: generated by simulation
## HatchDict: Python in YAML/JSON

### Python objects in YAML/JSON

#### Introduction to YAML

YAML is a common text format used for application config files.

YAML's most notable advantage is that it allows users to mix 2 styles: block style and flow style.
Example:

```python
import yaml
from pprint import pprint  # pretty-print for a clearer look

# Read parameters dict from a YAML file in actual use
params_yaml = """
block_style_demo:
  key1: value1
  key2: value2
flow_style_demo: {key1: value1, key2: value2}
"""
parameters = yaml.safe_load(params_yaml)

print("### 2 styles in YAML ###")
pprint(parameters)
```

```
### 2 styles in YAML ###
{'block_style_demo': {'key1': 'value1', 'key2': 'value2'},
 'flow_style_demo': {'key1': 'value1', 'key2': 'value2'}}
```
To store a highly nested (hierarchical) dict or list, YAML is more convenient than hard-coding in Python code.

- YAML's block style, which uses indentation, allows users to omit the opening and closing symbols that specify a Python dict or list (`{}` or `[]`).
- YAML's flow style, which uses opening and closing symbols, allows users to specify a Python dict or list within a single line.

So, is simply using YAML with Python the best way for Machine Learning experimentation?

Let's check out the next example.
Example:

```python
import yaml
from pprint import pprint  # pretty-print for a clearer look

# Read parameters dict from a YAML file in actual use
params_yaml = """
model_kind: LogisticRegression
model_params:
  C: 1.23456
  max_iter: 987
  random_state: 42
"""
parameters = yaml.safe_load(params_yaml)

print("### Before ###")
pprint(parameters)

model_kind = parameters.get("model_kind")
model_params_dict = parameters.get("model_params")

if model_kind == "LogisticRegression":
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(**model_params_dict)

elif model_kind == "DecisionTree":
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier(**model_params_dict)

elif model_kind == "RandomForest":
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(**model_params_dict)

else:
    raise ValueError("Unsupported model_kind.")

print("\n### After ###")
print(model)
```

```
### Before ###
{'model_kind': 'LogisticRegression',
 'model_params': {'C': 1.23456, 'max_iter': 987, 'random_state': 42}}

### After ###
LogisticRegression(C=1.23456, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=987,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=42, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
```
This way is inefficient, as we need to add import and if statements for the options in the Python code in addition to modifying the YAML config file.
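The if/elif dispatch above can also be avoided with dynamic imports. The sketch below is not PipelineX's API, just an illustration of the general mechanism using Python's standard importlib; the `load_obj` helper name and the `parameters` dict are hypothetical:

```python
import importlib


def load_obj(dotted_path: str):
    """Import a class or function from a dotted path such as 'collections.OrderedDict'."""
    module_path, _, name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, name)


# Hypothetical parameters dict, as it might come out of yaml.safe_load
parameters = {
    "model_kind": "collections.OrderedDict",  # any importable dotted path
    "model_params": {"a": 1, "b": 2},
}

model_cls = load_obj(parameters["model_kind"])
model = model_cls(**parameters["model_params"])
print(type(model).__name__)  # OrderedDict
```

With this mechanism, adding a new model option only requires editing the YAML file; no new import or if branch is needed in the Python code. This is essentially what dynamic-loading config parsers rely on under the hood.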
Any better way?

### Python tags in YAML

PyYAML provides UnsafeLoader, which can load Python objects without import.
Example usage of `!!python/object`:

```python
import yaml
# You do not need `import sklearn.linear_model` using PyYAML's UnsafeLoader

# Read parameters dict from a YAML file in actual use
params_yaml = """
model:
  !!python/object:sklearn.linear_model.LogisticRegression
  C: 1.23456
  max_iter: 987
  random_state: 42
"""
parameters = yaml.unsafe_load(params_yaml)  # unsafe_load required

model = parameters.get("model")

print("### model object by PyYAML's UnsafeLoader ###")
print(model)
```

```
### model object by PyYAML's UnsafeLoader ###
LogisticRegression(C=1.23456, class_weight=None, dual=None, fit_intercept=None,
                   intercept_scaling=None, l1_ratio=None, max_iter=987,
                   multi_class=None, n_jobs=None, penalty=None, random_state=42,
                   solver=None, tol=None, verbose=None, warm_start=None)
```
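As the loader name suggests, unsafe_load can construct arbitrary Python objects, so it should only be used on config files you trust. PyYAML's safe_load, by contrast, rejects Python tags outright. A small check, assuming PyYAML is installed:

```python
import yaml

doc = "model: !!python/object:collections.OrderedDict {}"

try:
    yaml.safe_load(doc)
    rejected = False
except yaml.YAMLError:
    # safe_load has no constructor registered for the !!python/object tag,
    # so it raises a ConstructorError (a subclass of YAMLError)
    rejected = True

print("safe_load rejected the Python tag:", rejected)
```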