<a href="https://github.com/haghish/mlim"><img src='man/figures/mlim.png' align="right" height="200" /></a>

mlim : Single and Multiple Imputation with Automated Machine Learning

<!--<a href="https://github.com/haghish/mlim"><img src="./web/mlim.png" align="left" width="140" hspace="10" vspace="6"></a> -->


<!-- [![](man/figures/handbook_stupid.svg)](https://github.com/haghish/mlim_handbook/blob/main/mlim_handbook.pdf) [![](https://cranlogs.r-pkg.org/badges/mlim?color=a958d1)](https://cran.r-project.org/package=mlim) https://shields.io/ [![GitHub dev](https://img.shields.io/github/v/tag/haghish/mlim.svg?sort=semver?color=2eb885)](https://github.com/haghish/mlim/releases/?include_prereleases&sort=semver "View GitHub releases") -->

mlim is the first missing data imputation software to implement automated machine learning for performing multiple or single imputation of missing data. The software, currently implemented as an R package, brings the state of the art of machine learning to provide a versatile missing data solution for various data types (continuous, binary, multinomial, and ordinal). In a nutshell, mlim is expected to outperform other available missing data imputation software on several grounds. For example, mlim is expected to deliver:

  1. Lower imputation error compared to other missing data imputation software.
  2. Higher imputation fairness when the data suffer from severe class imbalance or non-normal distributions, or when the variables (features) interact with one another.
  3. Faster imputation of big datasets, because mlim excels at making efficient use of the available CPU cores, and its runtime scales fairly well as the data grow large.

The high performance of mlim comes mainly from fine-tuning an ELNET algorithm, which often outperforms standard statistical procedures and untuned machine learning algorithms, and generalizes very well. However, mlim is an active research project and hence comes with an experimental optimization toolkit for exploring multiple imputation with industry-standard machine learning algorithms such as Deep Learning, Gradient Boosting Machine, Extreme Gradient Boosting, and Stacked Ensembles. These algorithms can be used either for imputing missing data or for optimizing already imputed data, but they are NOT used by default NOR recommended to all users. Advanced users interested in exploring imputation with these algorithms are encouraged to read the free handbook (see below). These algorithms, as noted, are experimental, and the author intends to examine their effectiveness in academic research (at this point). If you are interested in collaborating, get in touch with the author.
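As a quick illustration of the intended workflow, the sketch below follows the usage pattern from the package documentation: `mlim.na()` adds artificial missingness to a complete dataset, `mlim()` imputes it with the default fine-tuned ELNET, and `mlim.error()` evaluates the imputation against the original data. Argument names are assumptions based on the documented API; check `?mlim` for the current signature.

```r
# Sketch: single imputation with mlim's default ELNET algorithm.
# Requires the mlim package (and its H2O backend) to be installed.
library(mlim)

# add 10% artificial missingness so the imputation error can be measured later
irisNA <- mlim.na(iris, p = 0.1, seed = 2022)

# single imputation with the default, fine-tuned ELNET algorithm
imputed <- mlim(irisNA, seed = 2022)

# compare the imputed values with the original (complete) data
mlim.error(imputed, irisNA, iris)
```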

<!-- > **NOTE**: Prior to version 0.3.0, `mlim` did not use a stochastic procedure and thus, the multiple imputation algorithm was inflating the relationships between the imputed variables. A solution is implemented in version 0.3.0, which is automatically activated for multiple imputation and is currently under testing. You can investigate the code for `stochastic = TRUE` argument to see how this procedure is implemented. In stochastic multiple imputation, `mlim` uses the estimated RMSE of each continuous variable as an indication of standard error and replaces the imputed values with stochastic values drawn with a mean equal to the imputed value and SD equal to the RMSE. For factor variables, however, `mlim` draws a random value based on estimated probabilities of each factor level for each missing value. These two procedures are still under testing... Meanwhile, for a single imputation, `mlim` continues to be the top performer among other R packages that I have tested. -->

Fine-tuning missing data imputation

Simply put, for each variable in the dataset, mlim automatically fine-tunes a fast machine learning model, which results in significantly lower imputation error compared to classical statistical models or even untuned machine learning imputation software based on Random Forest or unsupervised learning algorithms. Moreover, mlim is intended to give social scientists a powerful solution to their missing data problem: a tool that automatically adapts to different variable types, missing at different rates, with unknown distributions, and with high correlations or interactions between variables. But it is not just about higher accuracy! mlim also delivers fairer imputation, particularly for categorical and ordinal variables, because it automatically balances the levels of the variable, minimizing the bias resulting from class imbalance, which is common in social science data and has usually been ignored by missing data imputation software.

<!-- The figure below shows the normalized RMSE of the imputation of several algorithms, including `MICE`, `missForest`, `missRanger`, and `mlim`. Here, two of **`mlim`**'s algorithms, Elastic Net (ELNET) and Gradient Boosting Machine (GBM) are used for the imputation and the result are compared with Random Forest imputations as well as Multiple Imputation with Chained Equations (MICE), which uses Predictive Mean Matching (PMM). This imputation was carried out on __iris__ dataset in R, by adding 10% artifitial missing data and comparing the imputed values with the original. -->

mlim outperforms other R packages for all variable types: continuous, binary (factor), multinomial (factor), and ordinal (ordered factor). The reason for this improved performance is that mlim:

  • Automatically fine-tunes the parameters of the machine learning models
  • Delivers very high prediction accuracy
  • Does not make any assumption about the distribution of the data
  • Takes the interactions between the variables into account
  • Can, to some extent, take the hierarchical structure of the data into account
    • Imputes missing data in nested observations with higher accuracy than HLM imputation methods
  • Does not force a particular linear model
  • Uses a blend of different machine learning models
<!-- Download mlim multiple imputation handbook ------------------------------------------ <a href="https://github.com/haghish/mlim_handbook/blob/main/mlim_handbook.pdf"><img src='https://github.com/haghish/mlim_handbook/blob/main/figures/handbook.png' align="left" height="150" /></a> `mlim` comes with a free and open-source handbook to help you get started with either single or multiple imputation. The handbook is written in _LaTeX_ and its source is publically hosted on GitHub, visit [github.com/haghish/mlim_handbook](https://github.com/haghish/mlim_handbook) for more information. <br> <br> <br> <br> -->

Procedure: From preimputation to imputation and postimputation

When a dataframe with NAs is given to mlim, the NAs are first replaced with plausible values (e.g., the mean or mode) to prepare the dataset for the imputation, as shown in the flowchart below:

<img src='man/figures/flowchart_base.png' align="center" width="300" />
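The preimputation step can be sketched in base R as follows. This is an illustrative stand-in for what mlim does internally, not mlim's actual code: numeric columns receive the column mean, factor columns the most frequent level.

```r
# Illustrative preimputation: replace NAs with the mean (numeric columns)
# or the mode (factor columns), producing a complete dataset that the
# machine-learning imputation can then refine.
preimpute <- function(df) {
  for (v in names(df)) {
    miss <- is.na(df[[v]])
    if (!any(miss)) next
    if (is.numeric(df[[v]])) {
      df[[v]][miss] <- mean(df[[v]], na.rm = TRUE)
    } else {
      tab <- table(df[[v]])                        # counts per level
      df[[v]][miss] <- names(tab)[which.max(tab)]  # the mode
    }
  }
  df
}

dfNA <- iris
dfNA[1, "Sepal.Length"] <- NA
dfNA[2, "Species"] <- NA
df0 <- preimpute(dfNA)
anyNA(df0)  # FALSE: every NA now holds a plausible placeholder
```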

mlim follows three steps to optimize the missing data imputation. This procedure is optional and depends on the amount of computing resources available to you. In general, ELNET imputation already outperforms the other single and multiple imputation methods available in R. However, the imputation error can be reduced further by training stronger algorithms such as GBM, XGB, DL, or even Ensemble, which stacks several models on top of one another. For the majority of users, GBM or XGB (XGB is available only on macOS and Linux) will significantly improve on the ELNET imputation, if enough time is allowed to generate many models and fine-tune them.

<img src='man/figures/procedure5.png' align="center" height="400" />

You do not necessarily need the postimputation step. Once you have imputed the data with ELNET, you can stop there. ELNET is a relatively fast algorithm and is easier to fine-tune than GBM, XGB, DL, or Ensemble. In addition, ELNET generalizes nicely and is less prone to overfitting. The flowchart below shows the procedure of the mlim algorithm. When using mlim, you can use ELNET to impute a dataset with NAs or to optimize the imputed values of a dataset that has already been imputed. If you wish to go the extra mile, you can use heavier algorithms to activate the postimputation procedure, but it is strictly optional, and mlim does not use postimputation by default.

<img src='man/figures/flowchart_optimization.png' align="center" width="300" />
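If you do want the optional postimputation, the documented way to request stronger learners is the `algos` argument of `mlim()`. The call below is a sketch based on the package documentation; the argument name and accepted values are assumptions that may need adjusting for your version of the package.

```r
library(mlim)

irisNA <- mlim.na(iris, p = 0.1, seed = 2022)

# default: fast, fine-tuned ELNET imputation only
fast <- mlim(irisNA, seed = 2022)

# optional: let GBM refine the ELNET imputation (postimputation).
# This is much slower and strictly optional.
refined <- mlim(irisNA, algos = c("ELNET", "GBM"), seed = 2022)
```

Because postimputation multiplies the training time, it is worth running the ELNET-only call first and only escalating if the remaining error justifies the extra compute.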

Fast imputation with ELNET (without postimputation)

Below are some comparisons between different R packages for carrying out multiple imputation (bars with error) and single imputation. In these analyses, I only used the ELNET algorithm, which fine-tunes much faster than the other algorithms (GBM, XGBoost, and DL). As is evident, ELNET already outperforms all other single and multiple imputation procedures available in the R language.
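To reproduce this kind of comparison on your own data, the usual recipe is to add artificial missingness to a complete dataset, impute it with each package, and compare the normalized errors. The sketch below uses `missRanger` as an example Random-Forest-based competitor; function and argument names follow each package's documentation and should be verified against your installed versions.

```r
library(mlim)
library(missRanger)  # a Random-Forest-based competitor, for comparison

irisNA <- mlim.na(iris, p = 0.1, seed = 2022)

# impute with mlim (default ELNET) and with missRanger
imp_mlim   <- mlim(irisNA, seed = 2022)
imp_ranger <- missRanger(irisNA, seed = 2022)

# normalized imputation error of each method against the original data
mlim.error(imp_mlim, irisNA, iris)
mlim.error(imp_ranger, irisNA, iris)
```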
