AutoMunge
Tabular feature encoding pipelines for machine learning with options for string parsing, missing data infill, and stochastic perturbations.

Table of Contents
- Default Transformations
- Library of Transformations
- Custom Transformation Functions
- Custom ML Infill Functions
- Final Model Training
Introduction
Automunge is an open source python library that has formalized and automated the data preparations for tabular learning in between the workflow boundaries of received “tidy data” (one column per feature and one row per sample) and returned dataframes suitable for the direct application of machine learning. Under automation numeric features are normalized, categoric features are binarized, and missing data is imputed. Data transformations are fit to properties of a training set for a consistent basis on any partitioned “validation data” or additional “test data”. When preparing training data, a compact python dictionary is returned recording the steps and parameters of transformation, which then may serve as a key for preparing additional data on a consistent basis.
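The "fit to a training set, apply consistently elsewhere" pattern at the heart of automunge(.) and postmunge(.) can be illustrated with a minimal sketch. This is not Automunge code, just the underlying concept, shown here with min-max scaling (comparable in spirit to the library's 'mnmx' category):

```python
# Illustrative sketch only: record transformation parameters from training
# data, then apply those same parameters to any subsequently available data,
# as automunge(.)/postmunge(.) do internally.

def fit_minmax(train_column):
    """Record the transformation parameters from the training data."""
    return {'min': min(train_column), 'max': max(train_column)}

def apply_minmax(column, params):
    """Apply the recorded parameters to any subsequent data."""
    span = params['max'] - params['min']
    return [(value - params['min']) / span for value in column]

train = [0, 5, 10]
params = fit_minmax(train)          # fit on the training set only
print(apply_minmax(train, params))  # [0.0, 0.5, 1.0]
print(apply_minmax([2], params))    # later data scaled on the train basis
```

In Automunge the recorded parameters live in the returned postprocess_dict, which plays the role of `params` above for every feature in the set.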
In other words, put simply:
- automunge(.) prepares tabular data for machine learning with encodings and missing data infill, and may channel stochastic perturbations into features
- postmunge(.) consistently and very efficiently prepares additional data on the same basis
We make machine learning easy.
In addition to data preparations under automation, Automunge may also serve as a platform for engineering data pipelines. An extensive internal library of univariate transformations includes options like numeric translations, bin aggregations, date-time encodings, noise injections, categoric encodings, and even “parsed categoric encodings” in which categoric strings are vectorized based on shared grammatical structure between entries. Feature transformations may be mixed and matched in sets that include generations and branches of derivations by use of our “family tree primitives”. Feature transformations fit to properties of a training set may be custom defined from a very simple template for incorporation into a pipeline. Dimensionality reductions may be applied, such as by principal component analysis, feature importance rankings, or categoric consolidations. Missing data receives “ML infill”, in which models are trained for a feature to impute missing entries based on properties of the surrounding features. Random sampling may be channeled into features as stochastic perturbations.
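To make the idea of a categoric encoding fit to a training set concrete, here is a minimal one-hot style sketch. This is not Automunge internals (the library's 'onht' category is comparable in spirit, and its actual handling of previously unseen entries follows the library's own conventions); it simply shows how recording the training-set categories yields consistent columns for any later data:

```python
# Illustrative sketch only: fit categoric encoding columns on a training
# set, then encode later data with the same columns; entries unseen in
# training encode as all zeros in this sketch.

def fit_onehot(train_column):
    # record the sorted set of training-set categories
    return sorted(set(train_column))

def apply_onehot(column, categories):
    # one column per training-set category, in a fixed order
    return [[1 if value == cat else 0 for cat in categories]
            for value in column]

categories = fit_onehot(['red', 'blue', 'red'])
print(categories)                                  # ['blue', 'red']
print(apply_onehot(['red', 'green'], categories))  # [[0, 1], [0, 0]]
```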
Be sure to check out our Tutorial Notebooks. If you are looking for something to cite, our paper Tabular Engineering with Automunge was accepted to the Data-Centric AI workshop at NeurIPS 2021.
Install, Initialize, and Basics
Automunge is now available for pip install:
pip install Automunge
Or to upgrade:
pip install Automunge --upgrade
Once installed, run this in a local session to initialize:
from Automunge import *
am = AutoMunge()
For example, for train set processing with default parameters, run:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train)
Importantly, if the df_train set passed to automunge(.) includes a column intended for use as labels, it should be designated with the labels_column parameter.
Or for subsequent consistent processing of train or test data, using the dictionary returned from original application of automunge(.), run:
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, df_test)
I find it helpful to show these functions with the full range of arguments included for reference, so that a user may simply copy and paste this form.
#for automunge(.) function on original train and test data
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train, df_test = False,
             labels_column = False, trainID_column = False, testID_column = False,
             valpercent = 0.0, floatprecision = 32, cat_type = False, shuffletrain = True, noise_augment = 0,
             dupl_rows = False, TrainLabelFreqLevel = False, powertransform = False, binstransform = False,
             MLinfill = True, infilliterate = 1, randomseed = False, eval_ratio = .5,
             numbercategoryheuristic = 255, pandasoutput = 'dataframe', NArw_marker = True,
             featureselection = False, featurethreshold = 0., inplace = False, orig_headers = False,
             Binary = False, PCAn_components = False, PCAexcl = [], excl_suffix = False,
             ML_cmnd = {'autoML_type':'randomforest',
                        'MLinfill_cmnd':{'RandomForestClassifier':{}, 'RandomForestRegressor':{}},
                        'PCA_type':'default',
                        'PCA_cmnd':{}},
             assigncat = {'1010':[], 'onht':[], 'ordl':[], 'bnry':[], 'hash':[], 'hsh2':[],
                          'DP10':[], 'DPoh':[], 'DPod':[], 'DPbn':[], 'DPhs':[], 'DPh2':[],
                          'nmbr':[], 'mnmx':[], 'retn':[], 'DPnb':[], 'DPmm':[], 'DPrt':[],
                          'bins':[], 'pwr2':[], 'bnep':[], 'bsor':[], 'por2':[], 'bneo':[],
                          'ntgr':[], 'srch':[], 'or19':[], 'tlbn':[], 'excl':[], 'exc2':[]},
             assignparam = {'global_assignparam' : {'(parameter)': 42},
                            'default_assignparam' : {'(category)' : {'(parameter)' : 42}},
                            '(category)' : {'(column)' : {'(parameter)' : 42}}},
             assigninfill = {'stdrdinfill':[], 'MLinfill':[], 'zeroinfill':[], 'oneinfill':[],
                             'adjinfill':[], 'meaninfill':[], 'medianinfill':[], 'negzeroinfill':[],
                             'interpinfill':[], 'modeinfill':[], 'lcinfill':[], 'naninfill':[]},
             assignnan = {'categories':{}, 'columns':{}, 'global':[]},
             transformdict = {}, processdict = {}, evalcat = False, ppd_append = False,
             entropy_seeds = False, random_generator = False, sampling_dict = False,
             privacy_encode = False, encrypt_key = False, printstatus = 'summary', logger = {})
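As one illustration of the transformdict parameter, custom transformation sets are specified with the family tree primitives noted in the introduction. The sketch below is hypothetical: the 'cstm' root category, the column assignment, and the particular transform choices are invented for illustration, with primitive names per the library's documented convention (see the Custom Transformation Functions section for the authoritative format):

```python
# A hedged sketch of a transformdict entry using the family tree primitives.
# The 'cstm' category and 'hypothetical_column' are illustrative placeholders.
transformdict = {'cstm': {'parents': [],
                          'siblings': [],
                          'auntsuncles': ['mnmx'],   # e.g. a normalization
                          'cousins': ['NArw'],       # e.g. missing-data markers
                          'children': [],
                          'niecesnephews': [],
                          'coworkers': [],
                          'friends': []}}

# the new root category could then be assigned to a column via assigncat
assigncat = {'cstm': ['hypothetical_column']}
```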
Please remember to save the automunge(.) returned object postprocess_dict, such as with the pickle library, so it can later be passed to the postmunge(.) function to consistently prepare subsequently available data.
#sample code to save the postprocess_dict dictionary returned from automunge(.)
import pickle
with open('filename.pickle', 'wb') as handle:
    pickle.dump(postprocess_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
#to load for later use in postmunge(.) in another notebook
import pickle
with open('filename.pickle', 'rb') as handle:
    postprocess_dict = pickle.load(handle)
#Please note that if you included externally initialized functions in an automunge(.) call,
#such as custom_train transformation functions or customML inference functions,
#they will need to be reinitialized prior to loading the postprocess_dict with pickle.
We can then apply the postprocess_dict saved from a prior application of automunge for consistent processing of additional data.
#for postmunge(.) function on additional available train or test data
#using the postprocess_dict object returned from original automunge(.) application
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, df_test,
             testID_column = False,
             pandasoutput = 'dataframe', printstatus = 'summary',
             dupl_rows = False, TrainLabelFreqLevel = False,
             featureeval = False, traindata = False, noise_augment = 0,
             driftreport = False, inversion = False,
             returnedsets = True, shuffletrain = False,
             entropy_seeds = False, random_generator = False, sampling_dict = False,
             randomseed = False, encrypt_key = False, logger = {})
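Among the parameters above, inversion supports recovering the pre-encoding form of prepared data. The underlying idea can be sketched independently of the library, using min-max scaling as a stand-in for an invertible encoding (this is a concept illustration, not Automunge code):

```python
# Illustrative sketch only: an invertible encoding recovered from recorded
# fit parameters, a stand-in for recovering a feature's original form.

def encode(column, params):
    span = params['max'] - params['min']
    return [(v - params['min']) / span for v in column]

def invert(column, params):
    span = params['max'] - params['min']
    return [v * span + params['min'] for v in column]

params = {'min': 0, 'max': 10}   # parameters recorded when fitting to train
encoded = encode([2, 5], params)
print(invert(encoded, params))   # recovers [2.0, 5.0]
```

In Automunge the recorded parameters are retained in the postprocess_dict, which is what makes a consistent inversion possible after the fact.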
The functions accept pandas dataframe or numpy array input and return encoded dataframes with a consistent order of columns between train and test data. (For numpy array input, any label column should be positioned as the final column in the set.) The functions return data with categoric features translated to numeric encodings and numeric features normalized, making them suitable for direct application to a machine learning model in the framework of a user's choice, including sets for the various activities of a generic machine learning project such as training (train), validation (val), and inference (test). The automunge(.) function also returns a python dictionary (the "postprocess_dict") that can then be passed to postmunge(.) to consistently prepare additional data on the same basis.
