<a href="https://github.com/haghish/mlim"><img src='man/figures/mlim.png' align="right" height="200" /></a>

mlim : Single and Multiple Imputation with Automated Machine Learning

<!--<a href="https://github.com/haghish/mlim"><img src="./web/mlim.png" align="left" width="140" hspace="10" vspace="6"></a> -->


<!-- [![](man/figures/handbook_stupid.svg)](https://github.com/haghish/mlim_handbook/blob/main/mlim_handbook.pdf) [![](https://cranlogs.r-pkg.org/badges/mlim?color=a958d1)](https://cran.r-project.org/package=mlim) https://shields.io/ [![GitHub dev](https://img.shields.io/github/v/tag/haghish/mlim.svg?sort=semver?color=2eb885)](https://github.com/haghish/mlim/releases/?include_prereleases&sort=semver "View GitHub releases") -->

mlim is the first missing data imputation software to implement automated machine learning for performing multiple or single imputation of missing data. The software, currently implemented as an R package, brings the state of the art of machine learning to provide a versatile missing data solution for various data types (continuous, binary, multinomial, and ordinal). In a nutshell, mlim is expected to outperform other available missing data imputation software on several grounds. For example, mlim is expected to deliver:

  1. Lower imputation error compared to other missing data imputation software.
  2. Higher imputation fairness when the data suffer from severe class imbalance or non-normal distributions, or when the variables (features) interact with one another.
  3. Faster imputation of big datasets, because mlim excels at making efficient use of the available CPU cores, and its runtime scales fairly well as the data grow large.

The high performance of mlim comes mainly from fine-tuning an ELNET algorithm, which often outperforms standard statistical procedures and untuned machine learning algorithms, and generalizes very well. However, mlim is an active research project and hence comes with an experimental optimization toolkit for exploring multiple imputation with industry-standard machine learning algorithms such as Deep Learning, Gradient Boosting Machine, Extreme Gradient Boosting, and Stacked Ensembles. These algorithms can be used either for imputing missing data or for optimizing already imputed data, but they are NOT used by default NOR recommended to all users. Advanced users interested in exploring imputation with these algorithms are encouraged to read the free handbook (see below). These algorithms, as noted, are experimental, and the author intends to examine their effectiveness in academic research (at this point). If you are interested in collaborating, get in touch with the author.
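As a quick illustration of the intended workflow, the sketch below follows the usage pattern from the package documentation: `mlim.na()` adds artificial missingness to a complete dataset, `mlim()` imputes it with the default fine-tuned ELNET, and `mlim.error()` evaluates the imputation against the original data. Argument names are assumptions based on the documented API; check `?mlim` for the current signature.

```r
# Sketch: single imputation with mlim's default ELNET algorithm.
# Requires the mlim package (and its H2O backend) to be installed.
library(mlim)

# add 10% artificial missingness so the imputation error can be measured later
irisNA <- mlim.na(iris, p = 0.1, seed = 2022)

# single imputation with the default, fine-tuned ELNET algorithm
imputed <- mlim(irisNA, seed = 2022)

# compare the imputed values with the original (complete) data
mlim.error(imputed, irisNA, iris)
```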

<!-- > **NOTE**: Prior to version 0.3.0, `mlim` did not use a stochastic procedure and thus, the multiple imputation algorithm was inflating the relationships between the imputed variables. A solution is implemented in version 0.3.0, which is automatically activated for multiple imputation and is currently under testing. You can investigate the code for `stochastic = TRUE` argument to see how this procedure is implemented. In stochastic multiple imputation, `mlim` uses the estimated RMSE of each continuous variable as an indication of standard error and replaces the imputed values with stochastic values drawn with a mean equal to the imputed value and SD equal to the RMSE. For factor variables, however, `mlim` draws a random value based on estimated probabilities of each factor level for each missing value. These two procedures are still under testing... Meanwhile, for a single imputation, `mlim` continues to be the top performer among other R packages that I have tested. -->

Fine-tuning missing data imputation

Simply put, for each variable in the dataset, mlim automatically fine-tunes a fast machine learning model, which results in significantly lower imputation error compared to classical statistical models or even untuned machine learning imputation software based on Random Forest or unsupervised learning algorithms. Moreover, mlim is intended to give social scientists a powerful solution to their missing data problem: a tool that automatically adapts to different variable types, missing at different rates, with unknown distributions, and with high correlations or interactions between variables. But it is not just about higher accuracy! mlim also delivers fairer imputation, particularly for categorical and ordinal variables, because it automatically balances the levels of the variable, minimizing the bias resulting from class imbalance, which is common in social science data and has usually been ignored by missing data imputation software.

<!-- The figure below shows the normalized RMSE of the imputation of several algorithms, including `MICE`, `missForest`, `missRanger`, and `mlim`. Here, two of **`mlim`**'s algorithms, Elastic Net (ELNET) and Gradient Boosting Machine (GBM) are used for the imputation and the result are compared with Random Forest imputations as well as Multiple Imputation with Chained Equations (MICE), which uses Predictive Mean Matching (PMM). This imputation was carried out on __iris__ dataset in R, by adding 10% artifitial missing data and comparing the imputed values with the original. -->

mlim outperforms other R packages for all variable types: continuous, binary (factor), multinomial (factor), and ordinal (ordered factor). The reason for this improved performance is that mlim:

  • Automatically fine-tunes the parameters of the machine learning models
  • Delivers very high prediction accuracy
  • Does not make any assumption about the distribution of the data
  • Takes the interactions between the variables into account
  • Can, to some extent, take the hierarchical structure of the data into account
    • Imputes missing data in nested observations with higher accuracy than HLM imputation methods
  • Does not force a particular linear model
  • Uses a blend of different machine learning models
<!-- Download mlim multiple imputation handbook ------------------------------------------ <a href="https://github.com/haghish/mlim_handbook/blob/main/mlim_handbook.pdf"><img src='https://github.com/haghish/mlim_handbook/blob/main/figures/handbook.png' align="left" height="150" /></a> `mlim` comes with a free and open-source handbook to help you get started with either single or multiple imputation. The handbook is written in _LaTeX_ and its source is publically hosted on GitHub, visit [github.com/haghish/mlim_handbook](https://github.com/haghish/mlim_handbook) for more information. <br> <br> <br> <br> -->

Procedure: From preimputation to imputation and postimputation

When a dataframe with NAs is given to mlim, the NAs are first replaced with plausible values (e.g., the mean or mode) to prepare the dataset for the imputation, as shown in the flowchart below:

<img src='man/figures/flowchart_base.png' align="center" width="300" />
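The preimputation step can be sketched in base R as follows. This is an illustrative stand-in for what mlim does internally, not mlim's actual code: numeric columns receive the column mean, factor columns the most frequent level.

```r
# Illustrative preimputation: replace NAs with the mean (numeric columns)
# or the mode (factor columns), producing a complete dataset that the
# machine-learning imputation can then refine.
preimpute <- function(df) {
  for (v in names(df)) {
    miss <- is.na(df[[v]])
    if (!any(miss)) next
    if (is.numeric(df[[v]])) {
      df[[v]][miss] <- mean(df[[v]], na.rm = TRUE)
    } else {
      tab <- table(df[[v]])                        # counts per level
      df[[v]][miss] <- names(tab)[which.max(tab)]  # the mode
    }
  }
  df
}

dfNA <- iris
dfNA[1, "Sepal.Length"] <- NA
dfNA[2, "Species"] <- NA
df0 <- preimpute(dfNA)
anyNA(df0)  # FALSE: every NA now holds a plausible placeholder
```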

mlim follows three steps to optimize the missing data imputation. This procedure is optional and depends on the amount of computing resources available to you. In general, ELNET imputation already outperforms the other single and multiple imputation methods available in R. However, the imputation error can be reduced further by training stronger algorithms such as GBM, XGB, DL, or even Ensemble, which stacks several models on top of one another. For the majority of users, GBM or XGB (XGB is available only on macOS and Linux) will significantly improve on the ELNET imputation, if enough time is allowed to generate many models and fine-tune them.

<img src='man/figures/procedure5.png' align="center" height="400" />

You do not necessarily need the postimputation step. Once you have imputed the data with ELNET, you can stop there. ELNET is a relatively fast algorithm and is easier to fine-tune than GBM, XGB, DL, or Ensemble. In addition, ELNET generalizes nicely and is less prone to overfitting. The flowchart below shows the procedure of the mlim algorithm. When using mlim, you can use ELNET to impute a dataset with NAs or to optimize the imputed values of a dataset that has already been imputed. If you wish to go the extra mile, you can use heavier algorithms to activate the postimputation procedure, but it is strictly optional, and mlim does not use postimputation by default.

<img src='man/figures/flowchart_optimization.png' align="center" width="300" />
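If you do want the optional postimputation, the documented way to request stronger learners is the `algos` argument of `mlim()`. The call below is a sketch based on the package documentation; the argument name and accepted values are assumptions that may need adjusting for your version of the package.

```r
library(mlim)

irisNA <- mlim.na(iris, p = 0.1, seed = 2022)

# default: fast, fine-tuned ELNET imputation only
fast <- mlim(irisNA, seed = 2022)

# optional: let GBM refine the ELNET imputation (postimputation).
# This is much slower and strictly optional.
refined <- mlim(irisNA, algos = c("ELNET", "GBM"), seed = 2022)
```

Because postimputation multiplies the training time, it is worth running the ELNET-only call first and only escalating if the remaining error justifies the extra compute.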

Fast imputation with ELNET (without postimputation)

Below are some comparisons between different R packages for carrying out multiple imputation (bars with error) and single imputation. In these analyses, I only used the ELNET algorithm, which fine-tunes much faster than the other algorithms (GBM, XGBoost, and DL). As is evident, ELNET already outperforms all other single and multiple imputation procedures available in the R language.
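To reproduce this kind of comparison on your own data, the usual recipe is to add artificial missingness to a complete dataset, impute it with each package, and compare the normalized errors. The sketch below uses `missRanger` as an example Random-Forest-based competitor; function and argument names follow each package's documentation and should be verified against your installed versions.

```r
library(mlim)
library(missRanger)  # a Random-Forest-based competitor, for comparison

irisNA <- mlim.na(iris, p = 0.1, seed = 2022)

# impute with mlim (default ELNET) and with missRanger
imp_mlim   <- mlim(irisNA, seed = 2022)
imp_ranger <- missRanger(irisNA, seed = 2022)

# normalized imputation error of each method against the original data
mlim.error(imp_mlim, irisNA, iris)
mlim.error(imp_ranger, irisNA, iris)
```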
