OptimalFlow
Author: Tony Dong
<img src="https://github.com/tonyleidong/OptimalFlow/blob/master/docs/OptimalFlow_Logo.png" width="150">

Documentation: https://Optimal-Flow.readthedocs.io/
Publication
Lei (Tony) Dong, Syed Khader. "OptimalFlow: Omni-ensemble and Scalable Automated Machine Learning." GENPACT GVector Augmented Intelligence Conference, November 2020.
Installation
pip install OptimalFlow
Introduction
OptimalFlow is an omni-ensemble, scalable automated machine learning Python toolkit. It uses Pipeline Cluster Traversal Experiments (PCTE) and a Selection-based Feature Preprocessor with Ensemble Encoding (SPEE) to help data scientists build optimal models and automate supervised learning workflows with simpler code.
OptimalFlow wraps the Scikit-learn supervised learning framework to automatically create an ensemble of machine learning pipelines (an Omni-ensemble Pipeline Cluster) based on algorithm permutations in each framework component. Its preprocessing module includes feature engineering methods such as missing value imputation, categorical feature encoding, numeric feature standardization, and outlier winsorization. The models inherit algorithms from Scikit-learn and XGBoost estimators for classification and regression problems, and the extendable coding structure supports adding models from external estimator libraries, which distinguishes OptimalFlow's scalability from most other AutoML toolkits.
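The preprocessing methods mentioned above (imputation, encoding, standardization, winsorization) can be sketched with plain scikit-learn and pandas. This is only an illustrative example of the same ideas on a toy dataset, not OptimalFlow's own API:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy dataset with a numeric and a categorical column, a missing value, and an outlier.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 200.0],   # 200 is an outlier
    "city": ["NY", "LA", "NY", "SF", "LA"],
})

# Outlier winsorization: clip the numeric column at its 5th/95th percentiles.
lo, hi = df["age"].quantile([0.05, 0.95])
df["age"] = df["age"].clip(lo, hi)

# Impute and standardize numeric features; one-hot encode categorical features.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 5 rows: 1 scaled numeric column + 3 one-hot city columns
```

OptimalFlow's autoPP module goes further by generating permutations of such preprocessing choices as part of the pipeline cluster.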
OptimalFlow uses Pipeline Cluster Traversal Experiments as the optimizer to build an omni-ensemble workflow that searches for an optimal baseline model, covering feature preprocessing/selection optimization, hyperparameter tuning, model selection, and assessment.
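As a rough illustration of the traversal idea (a conceptual sketch in plain scikit-learn, not OptimalFlow's implementation), the experiment loop enumerates every cross-matched combination of candidate components, cross-validates each resulting pipeline, and keeps the best baseline:

```python
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate algorithms for each workflow component; their cross-product
# forms a small "pipeline cluster" to traverse.
scalers = [("standard", StandardScaler()), ("minmax", MinMaxScaler())]
selectors = [("kbest10", SelectKBest(f_classif, k=10)),
             ("kbest20", SelectKBest(f_classif, k=20))]
models = [("logit", LogisticRegression(max_iter=5000)),
          ("tree", DecisionTreeClassifier(random_state=0))]

# Traverse all 2 x 2 x 2 = 8 cross-matched pipelines and score each one.
results = {}
for (s_name, s), (f_name, f), (m_name, m) in product(scalers, selectors, models):
    pipe = Pipeline([("scale", s), ("select", f), ("model", m)])
    results[(s_name, f_name, m_name)] = cross_val_score(pipe, X, y, cv=5).mean()

best = max(results, key=results.get)
print(best, round(results[best], 3))
```

PCTE applies this traversal at a much larger scale, across preprocessing permutations, feature selectors, estimators, and hyperparameter search spaces.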
Since version 0.1.10, it has included a "no-code" Web App as an application demo built on OptimalFlow. The web app allows simple clicks and selections for all of the parameters inside OptimalFlow, which means users can build an end-to-end Automated Machine Learning workflow without coding at all! (Read more details: https://optimal-flow.readthedocs.io/en/latest/webapp.html)

Compared with other popular "AutoML or Automated Machine Learning" APIs, OptimalFlow is designed as an omni-ensemble ML workflow optimizer with a higher-level API, aiming to avoid manual, repetitive train-and-evaluate experiments in general pipeline building.
To achieve that, OptimalFlow applies the Pipeline Cluster Traversal Experiments algorithm to assemble all cross-matching pipelines covering the major tasks of a machine learning workflow, and applies traversal experiments to search for the optimal baseline model.
Besides, by modularizing all key pipeline components into reusable packages, it allows all components to be custom tuned, with high scalability.
<img src="https://github.com/tonyleidong/OptimalFlow/blob/master/docs/OptimalFlow_Workflow.PNG" width="980">

The core concept in OptimalFlow is Pipeline Cluster Traversal Experiments, a method first presented by Tony Dong at the Genpact 2020 GVector Conference to optimize and automate machine learning workflows using an ensemble-pipeline algorithm.
Compared with the repetitive single-pipeline experiments of other automated or classic machine learning workflows, Pipeline Cluster Traversal Experiments is more powerful, with a larger coverage scope, at finding the best model without manual intervention, and also more flexible in coping with unseen data due to its ensemble design in each component.
<img src="https://github.com/tonyleidong/OptimalFlow/blob/master/docs/PipelineClusterTraversalExperiments.PNG" width="980">

In summary, OptimalFlow provides the following benefits to data scientists:
- Easy coding - High-level APIs to implement PCTE, with each machine learning component highly automated and modularized;
- Easy transformation - Focuses on process automation for local implementation, making it easy to migrate current operations and meet compliance restrictions, e.g. pharmaceutical compliance policies;
- Easy maintenance - Wraps machine learning components into well-modularized code packages without miscellaneous parameters inside;
- Light & swift - Easy to transplant among projects, and quick and convenient to deploy compared with cloud-based solutions (e.g. Microsoft Azure AutoML);
- Omni ensemble - Easy for data scientists to run iterated experiments across all ensemble components in the workflow;
- High scalability - Each module allows new algorithms to be added easily owing to its ensemble and reusable coding design; PCTE and SPEE make it easier to adapt unseen datasets with the right pipeline;
- Customization - Supports custom settings to add/remove algorithms or modify hyperparameters for elastic requirements.
Core Modules:
- autoPP for feature preprocessing
- autoFS for classification/regression features selection
- autoCV for classification/regression model selection and evaluation
- autoPipe for Pipeline Cluster Traversal Experiments
- autoViz for pipeline cluster visualization. Currently available: model retrieval diagram
- autoFlow for logging & tracking.
Notebook Demo:
An End-to-End OptimalFlow Automated Machine Learning Tutorial with Real Projects
- Part 1: https://towardsdatascience.com/end-to-end-optimalflow-automated-machine-learning-tutorial-with-real-projects-formula-e-laps-8b57073a7b50
- Part 2: https://towardsdatascience.com/end-to-end-optimalflow-automated-machine-learning-tutorial-with-real-projects-formula-e-laps-31d810539102
Other Stories:
- Ensemble Feature Selection in Machine Learning using OptimalFlow - Easy Way with Simple Code to Select top Features: https://towardsdatascience.com/ensemble-feature-selection-in-machine-learning-by-optimalflow-49f6ee0d52eb
- Ensemble Model Selection & Evaluation in Machine Learning using OptimalFlow - Easy Way with Simple Code to Select the Optimal Model: https://towardsdatascience.com/ensemble-model-selection-evaluation-in-machine-learning-by-optimalflow-9e5126308f12
- Build No-code Automated Machine Learning Model with OptimalFlow Web App: https://towardsdatascience.com/build-no-code-automated-machine-learning-model-with-optimalflow-web-app-8acaad8262b1
- Feature Preprocessor in Automated Machine Learning: https://tonyleidong.medium.com/feature-preprocessor-in-automated-machine-learning-c3af6f22f015
Support OptimalFlow
If you like OptimalFlow, please consider starring or forking it on GitHub and spreading the word!
Please, Avoid Selling this Work as Yours
Voice from the Author: I am glad if you find OptimalFlow useful and helpful. Feel free to add it to your project and let more people know how good it is. But please avoid simply changing the name and selling it as your own work. That's not why I'm sharing the source code, at all. All copyrights are reserved by Tony Dong under the MIT license.
License:
MIT
Updates History:
Updates on 9/29/2020
- Added a SearchinSpace settings page in the Web App. Users can custom-set estimators'/regressors' parameters for optimal tuning outputs.
- Modified some layouts of existing pages in Web App.
Updates on 9/16/2020
- Created a Web App based on the Flask framework as OptimalFlow's GUI, to build PCTE Automated Machine Learning with simple clicks, without any coding at all!
- The Web App includes PCTE workflow builder, LogsViewer, Visualization, and Documentation sections.
- Fixed filename issues in the autoViz module, and removed the auto_open function when generating new HTML-format plots.
Updates on 8/31/2020
- Modified autoPP's default_parameters: removed "None" from "scaler", modified "sparsity" : [0.50], modified "cols" : [100]
- Modified autoViz clf_table_report()'s coloring settings
- Fixed bugs in autoViz reg_table_report()'s gradient coloring function
Updates on 8/28/2020
- Fixed the evaluate_model() function's round() bug when coping with classification problems
- Moved SVM-based algorithms out of fastClassifier & fastRegressor's default estimator settings
- Moved SVM-based algorithms out of the autoFS class's default selector settings
Updates on 8/26/2020
- Fixed evaluate_model()
