Mdgc

Provides functions to impute missing values using Gaussian copulas for mixed data types.


mdgc

This package contains a marginal likelihood approach to estimating the model discussed by Hoff (2007), Zhao and Udell (2020b), and Zhao and Udell (2020a). That is, it is a missing data approach where one uses Gaussian copulas in the latter case. We have modified the Fortran code by Genz and Bretz (2002) to supply an approximation of the gradient of the log marginal likelihood and to use an approximation of the marginal likelihood similar to the CDF approximation in Genz and Bretz (2002). We have also used the same Fortran code to perform the imputation conditional on a covariance matrix and the observed data. The method is described by Christoffersen et al. (2021), which can be found at arxiv.org.

Importantly, we also extend the model used by Zhao and Udell (2020b) to support multinomial variables. Thus, our model supports continuous, binary, ordinal, and multinomial variables, which makes it applicable to a large number of data sets.

The package can be useful for many other models. For instance, the methods are directly applicable to other Gaussian copula models and some mixed effect models. All methods are implemented in C++, support computation in parallel, and should be easy to port to other languages.

Installation

The package can be installed from GitHub by calling:

remotes::install_github("boennecd/mdgc")

or from CRAN by calling:

install.packages("mdgc")

The code benefits from being built with automatic vectorization, so having e.g. -O3 -mtune=native in the CXX11FLAGS variable in your Makevars file may be useful.

The Model

We observe four types of variables for each observation: continuous, binary, ordinal, and multinomial variables. Let $\vec X_i$ be a K-dimensional vector for the i’th observation. The variables $X_{ij}$ are continuous if $j\in\mathcal C$, binary if $j\in\mathcal B$ with probability $p_j$ of being true, ordinal if $j\in\mathcal O$ with $m_j$ levels and borders $\alpha_{j0} = -\infty < \alpha_{j1}<\cdots < \alpha_{jm_j} = \infty$, and multinomial if $j\in\mathcal M$ with $m_j$ levels. $\mathcal C$, $\mathcal B$, $\mathcal O$, and $\mathcal M$ are mutually exclusive.

We assume that there is a latent variable $\vec Z_i$ which is multivariate normally distributed such that:

$\begin{align*} \vec Z_i & \sim N\left(\vec\mu, \Sigma\right) \\ X_{ij} &= f_j(Z_{ih(j)}) & j &\in \mathcal C \\ X_{ij} &= 1_{\{Z_{ih(j)} > \underbrace{-\Phi^{-1}(p_{j})}_{\mu_{h(j)}}\}} & j &\in \mathcal B \\ X_{ij} &= k\Leftrightarrow \alpha_{jk} < Z_{ih(j)} \leq \alpha_{j,k + 1} & j &\in \mathcal O\wedge k = 0,\dots, m_j -1 \\ X_{ij} &= k \Leftrightarrow Z_{i,h(j) + k} \geq \max(Z_{ih(j)},\cdots,Z_{i,h(j) + m_j - 1}) & j&\in \mathcal M \wedge k = 0,\dots, m_j -1 \end{align*}$

where $1_{\{\cdot\}}$ is one if the condition in the subscript is true and zero otherwise, $h(j)$ is a map to the index of the first latent variable associated with the j’th variable in $\vec X_i$, and $f_j$ is a bijective function. We only estimate some of the means, the $\vec\mu$, and some of the covariance parameters. Furthermore, we set $Z_{ih(j)} = 0$ if $j\in\mathcal M$ and assume that the variable is uncorrelated with all the other elements of $\vec Z_i$.
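The mapping from the latent variables to the observed variables can be illustrated with a small base R simulation. This is only a sketch and not code from the package: it assumes a zero-mean latent vector with unit variances, so that the binary threshold is $-\Phi^{-1}(p_j)$, and the correlation matrix, `p_bin`, the borders `alpha`, and the choice of an exponential marginal for $f_j$ are all made up for illustration. The multinomial case is left out.

```r
# Sketch: one continuous, one binary, and one ordinal variable generated from
# a latent Gaussian vector. All parameter values below are illustrative
# assumptions and not defaults of the mdgc package.
set.seed(1)
n <- 5000L

# latent correlation matrix (assumed values)
Sigma <- matrix(c(1.0, 0.5, 0.3,
                  0.5, 1.0, 0.2,
                  0.3, 0.2, 1.0), nrow = 3L)
p_bin <- 0.3                      # probability that the binary variable is TRUE
alpha <- c(-Inf, -0.5, 0.5, Inf)  # borders of an ordinal variable with 3 levels

# rows of Z are draws from N(0, Sigma) as chol() returns R with R'R = Sigma
Z <- matrix(rnorm(n * 3L), nrow = n) %*% chol(Sigma)

# continuous: a bijective map f_j, here to an exponential marginal
x_cont <- qexp(pnorm(Z[, 1]))
# binary: TRUE when the latent variable exceeds -qnorm(p_bin)
x_bin <- Z[, 2] > -qnorm(p_bin)
# ordinal: level k when alpha[k + 1] < Z <= alpha[k + 2], with k = 0, 1, 2
x_ord <- cut(Z[, 3], breaks = alpha, labels = FALSE) - 1L

dat <- data.frame(x_cont, x_bin, x_ord = ordered(x_ord, levels = 0:2))
```

The share of `TRUE` values in `x_bin` should be close to `p_bin`, and the three observed variables inherit their dependence from `Sigma`.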

In principle, we could use other distributions than a multivariate normal distribution for $\vec Z_i$. However, the multivariate normal distribution has the advantage that it is very easy to marginalize, which is convenient when we have to estimate the model with missing entries. It also has some computational advantages for approximating the log marginal likelihood, as similar intractable problems have been thoroughly studied.
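The marginalization property can be illustrated in the simplest setting, where all variables are continuous and the latent variables are observed directly. Missing entries are then handled by dropping the corresponding elements of the mean vector and the corresponding rows and columns of the covariance matrix. The function below, `log_dens_observed`, is a hypothetical helper written for this illustration in base R. It is not part of the package and it does not cover the binary, ordinal, or multinomial cases, which require approximating multivariate normal integrals.

```r
# Sketch: log density of the observed entries of z under N(mu, Sigma).
# Missing entries are coded as NA. Hypothetical helper for illustration only.
log_dens_observed <- function(z, mu, Sigma) {
  obs <- !is.na(z)
  if (!any(obs)) return(0)  # nothing observed: the marginal density is one

  # the marginal distribution only needs the observed sub-vector/sub-matrix
  S <- Sigma[obs, obs, drop = FALSE]
  R <- chol(S)  # S = R'R with R upper triangular
  # solve R'r = z - mu such that sum(r^2) is the quadratic form
  r <- backsolve(R, z[obs] - mu[obs], transpose = TRUE)

  -0.5 * sum(r^2) - sum(log(diag(R))) - 0.5 * sum(obs) * log(2 * pi)
}

Sigma <- matrix(c(1.0, 0.5, 0.3,
                  0.5, 1.0, 0.2,
                  0.3, 0.2, 1.0), nrow = 3L)
z <- c(0.3, NA, -1)  # the second entry is missing
ll <- log_dens_observed(z, mu = numeric(3), Sigma = Sigma)
```

No integration over the missing entry is needed here, which is what makes the Gaussian assumption convenient.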

Examples

Below, we provide an example similar to Zhao and Udell (2020b, Section 7.1). The authors use a data set with a random correlation matrix, 5 continuous variables, 5 binary variables, and 5 ordinal variables with 5 levels. There is a total of 2000 observations and 30% of the variables are missing completely at random.
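A data set with the dimensions described above could be simulated roughly as follows in base R. This is a sketch under stated assumptions and not the code used by Zhao and Udell (2020b) or by this package: the way the random correlation matrix is drawn, the binary threshold at zero, and the equally likely ordinal levels are choices made only for this illustration.

```r
# Sketch: 2000 observations of 5 continuous, 5 binary, and 5 ordinal variables
# with 30% of the entries missing completely at random. The construction of
# the correlation matrix and the thresholds are assumptions.
set.seed(2)
n <- 2000L
p <- 15L

# random correlation matrix: rescale a random positive definite matrix
A <- matrix(rnorm(2L * p * p), nrow = 2L * p)
Sigma <- cov2cor(crossprod(A))

# latent variables with rows drawn from N(0, Sigma)
Z <- matrix(rnorm(n * p), nrow = n) %*% chol(Sigma)

dat <- data.frame(Z)  # the columns are named X1, ..., X15
is_bin <- 6:10
is_ord <- 11:15
# binary variables: threshold at zero (assumed)
dat[is_bin] <- lapply(dat[is_bin], function(z) z > 0)
# ordinal variables: five equally likely levels (assumed)
borders <- c(-Inf, qnorm(1:4 / 5), Inf)
dat[is_ord] <- lapply(dat[is_ord], function(z)
  ordered(cut(z, breaks = borders, labels = FALSE), levels = 1:5))

# mask 30% of the entries completely at random
is_na <- matrix(runif(n * p) < 0.3, nrow = n)
dat[is_na] <- NA
```

The first five columns are left as the latent variables themselves, which corresponds to an identity $f_j$.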

To summarize the results of Zhao and Udell (2020b), their approximate EM algorithm converges in what seems to be 20-25 seconds (with a pure R implementation, to be fair), while the MCMC algorithm used by Hoff (2007) takes more than 150 seconds. These figures should be kept in mind when looking at the results below. Importantly, Zhao and Udell (2020b) use an approximation in the E-step of an EM algorithm which is fast but might be crude in some settings. Using a potentially arbitrarily precise approximation of the log marginal likelihood is useful if this can be done quickly enough.

We will provide a quick example and an even shorter example where we show how to use the methods in the package to estimate the correlation matrix and to perform the imputation. We then show a simulation study where we compare with the method suggested by Zhao and Udell (2020b).

The last section called [adding multinomial variables](#adding-m
