
mdgc


This package contains a marginal likelihood approach to estimating the model discussed by Hoff (2007), Zhao and Udell (2020b), and Zhao and Udell (2020a). That is, a missing data approach where one uses Gaussian copulas in the latter case. We have modified the Fortran code by Genz and Bretz (2002) to supply an approximation of the gradient of the log marginal likelihood and to use an approximation of the marginal likelihood similar to the CDF approximation in Genz and Bretz (2002). We have also used the same Fortran code to perform the imputation conditional on a covariance matrix and the observed data. The method is described by Christoffersen et al. (2021), which can be found at arxiv.org.

Importantly, we also extend the model used by Zhao and Udell (2020b) to support multinomial variables. Thus, our model supports continuous, binary, ordinal, and multinomial variables, which makes it applicable to a large number of data sets.

The package can be useful for many other models. For instance, the methods are directly applicable to other Gaussian copula models and some mixed effect models. All methods are implemented in C++, support computation in parallel, and should be easy to port to other languages.

Installation

The package can be installed from GitHub by calling:

remotes::install_github("boennecd/mdgc")

or from CRAN by calling:

install.packages("mdgc")

The code benefits from being built with automatic vectorization, so having e.g. -O3 -mtune=native in the CXX11FLAGS variable in your Makevars file may be useful.
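As a rough sketch, a user-level Makevars file with these flags could look as follows. The file location (typically `~/.R/Makevars` on Linux and macOS) is an assumption about a common setup, and depending on the R version and the C++ standard the package is compiled with, a different variable than `CXX11FLAGS` may be the one that takes effect.

```make
# Assumed location: ~/.R/Makevars (user-level file read when R compiles packages)
# Flags taken from the text above; -mtune=native is specific to GCC/Clang-style compilers
CXX11FLAGS = -O3 -mtune=native
```

The package has to be reinstalled from source after the change for the flags to be used.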

The Model

We observe four types of variables for each observation: continuous, binary, ordinal, and multinomial variables. Let $\vec X_i$ be a $K$ dimensional vector for the $i$’th observation. The variables $X_{ij}$ are continuous if $j\in\mathcal C$, binary if $j\in\mathcal B$ with probability $p_j$ of being true, ordinal if $j\in\mathcal O$ with $m_j$ levels and borders $\alpha_{j0} = -\infty < \alpha_{j1} < \cdots < \alpha_{jm_j} = \infty$, and multinomial if $j\in\mathcal M$ with $m_j$ levels. $\mathcal C$, $\mathcal B$, $\mathcal O$, and $\mathcal M$ are mutually exclusive.

We assume that there is a latent variable $\vec Z_i$ which is multivariate normally distributed such that:

\begin{align*}
\vec Z_i & \sim N\left(\vec\mu, \Sigma\right) \\
X_{ij} &= f_j(Z_{ih(j)}) & j &\in \mathcal C \\
X_{ij} &= 1_{\{Z_{ih(j)} > \underbrace{-\Phi^{-1}(p_{j})}_{\mu_{h(j)}}\}} & j &\in \mathcal B \\
X_{ij} &= k \Leftrightarrow \alpha_{jk} < Z_{ih(j)} \leq \alpha_{j,k + 1} & j &\in \mathcal O \wedge k = 0,\dots, m_j - 1 \\
X_{ij} &= k \Leftrightarrow Z_{i,h(j) + k} \geq \max(Z_{ih(j)},\cdots,Z_{i,h(j) + m_j - 1}) & j &\in \mathcal M \wedge k = 0,\dots, m_j - 1
\end{align*}

where $1_{\{\cdot\}}$ is one if the condition in the subscript is true and zero otherwise, $h(j)$ is a map to the index of the first latent variable associated with the $j$’th variable in $\vec X_i$, and $f_j$ is a bijective function. We only estimate some of the means, the $\vec\mu$, and some of the covariance parameters. Furthermore, we set $Z_{ih(j)} = 0$ if $j\in\mathcal M$ and assume that the variable is uncorrelated with all the other elements of $\vec Z_i$.
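To make the mapping from latent to observed variables concrete, here is a minimal base R sketch that simulates one variable of each type from the model above. It does not use the package. The helper `rmvn`, the covariance matrix, the means, the borders, and the choice $f_j = \exp$ are all hypothetical values picked for illustration, and the binary latent variable is assumed to have zero mean and unit variance so that the threshold $-\Phi^{-1}(p_j)$ gives probability $p_j$.

```r
set.seed(1)

# Hypothetical helper: n draws from N(mu, Sigma) via the Cholesky factor
rmvn <- function(n, mu, Sigma) {
  Z <- matrix(rnorm(n * length(mu)), nrow = n)
  sweep(Z %*% chol(Sigma), 2L, mu, `+`)
}

n <- 1000L

# Free latent variables: continuous, binary, ordinal, and the two
# non-reference levels of a three-level multinomial variable. The reference
# level's latent variable is fixed at zero, so it is not simulated.
Sigma <- diag(5)
Sigma[1, 2] <- Sigma[2, 1] <-  0.5
Sigma[2, 3] <- Sigma[3, 2] <- -0.3
Sigma[1, 4] <- Sigma[4, 1] <-  0.4
Sigma[4, 5] <- Sigma[5, 4] <-  0.2
mu <- c(0, 0, 0, -0.5, 0.25)  # illustrative means

Z <- rmvn(n, mu, Sigma)

p     <- 0.3                      # P(X = 1) for the binary variable
alpha <- c(-Inf, -0.5, 0.8, Inf)  # borders for an ordinal variable with 3 levels

X <- data.frame(
  # continuous: a bijective transformation of the latent variable
  cont  = exp(Z[, 1]),
  # binary: indicator of the latent variable exceeding -Phi^{-1}(p)
  bin   = Z[, 2] > -qnorm(p),
  # ordinal: level k when alpha_k < Z <= alpha_{k + 1}, with k = 0, 1, 2
  ord   = cut(Z[, 3], breaks = alpha, labels = FALSE) - 1L,
  # multinomial: index of the largest latent variable, reference fixed at 0
  multi = max.col(cbind(0, Z[, 4:5]), ties.method = "first") - 1L
)

head(X)
```

Only the observed `X` would be available in practice; the latent `Z` is what the model integrates over.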

In principle, we could use other distributions than a multivariate normal distribution for $\vec Z_i$. However, the multivariate normal distribution has the advantage that it is very easy to marginalize, which is convenient when we have to estimate the model with missing entries. It also has some computational advantages for approximating the log marginal likelihood, as similar intractable problems have been thoroughly studied.
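The ease of marginalization can be illustrated with a few lines of base R: the marginal distribution of the observed entries of a multivariate normal vector is obtained by subsetting the mean vector and the covariance matrix. The numbers below are made up for illustration. The conditional mean at the end is the standard multivariate normal formula and applies directly only when the observed latent values are known exactly, as for continuous variables.

```r
# Illustrative three-dimensional latent distribution
mu    <- c(0, 1, -1)
Sigma <- matrix(c(1.0, 0.5, 0.2,
                  0.5, 2.0, 0.3,
                  0.2, 0.3, 1.5), nrow = 3, byrow = TRUE)

# Suppose the second entry is missing for this observation
obs <- c(TRUE, FALSE, TRUE)

# Marginalizing over the missing entry amounts to dropping its row and column
mu_obs    <- mu[obs]
Sigma_obs <- Sigma[obs, obs, drop = FALSE]

# Conditional mean of the missing entry given hypothetical observed values
z_obs <- c(0.5, -0.2)
cond_mean <- mu[!obs] +
  Sigma[!obs, obs, drop = FALSE] %*% solve(Sigma_obs, z_obs - mu_obs)
```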

Examples

Below, we provide an example similar to Zhao and Udell (2020b, Section 7.1). The authors use a data set with a random correlation matrix, 5 continuous variables, 5 binary variables, and 5 ordinal variables with 5 levels. There is a total of 2000 observations and 30% of the variables are missing completely at random.
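As a small illustration of what missing completely at random means at these dimensions, the sketch below masks each entry of a 2000 by 15 matrix independently with probability 0.3. The data matrix is a placeholder of standard normal draws, not the data used in the example, and the seed is arbitrary.

```r
set.seed(2)
n <- 2000L  # observations
K <- 15L    # 5 continuous + 5 binary + 5 ordinal variables

# Placeholder data; the real example uses mixed variable types
X <- matrix(rnorm(n * K), nrow = n, ncol = K)

# Each entry is masked independently of the data with probability 0.3
is_missing <- matrix(runif(n * K) < 0.3, nrow = n, ncol = K)
X[is_missing] <- NA

mean(is.na(X))  # observed fraction of missing entries
```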

To summarize the results of Zhao and Udell (2020b), they show that their approximate EM algorithm converges in what seems to be 20-25 seconds (this is with a pure R implementation, to be fair) while it takes more than 150 seconds for the MCMC algorithm used by Hoff (2007). These figures should be kept in mind when looking at the results below. Importantly, Zhao and Udell (2020b) use an approximation in the E-step of an EM algorithm which is fast but might be crude in some settings. Using a potentially arbitrarily precise approximation of the log marginal likelihood is useful if this can be done quickly enough.
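To give a sense of the quantity being approximated: for binary and ordinal variables, an observation's likelihood contribution is a multivariate normal probability over a rectangle. The sketch below estimates one such probability by plain Monte Carlo for two latent variables with an assumed correlation of 0.5, and compares it with the known closed form for the bivariate positive orthant. This is a crude stand-in for illustration only and is not the approximation used by the package.

```r
set.seed(3)
rho   <- 0.5
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)

# Draw from N(0, Sigma) via the Cholesky factor
n_sim <- 100000L
Z <- matrix(rnorm(2 * n_sim), nrow = n_sim) %*% chol(Sigma)

# Monte Carlo estimate of P(Z_1 > 0, Z_2 > 0)
mc_est <- mean(Z[, 1] > 0 & Z[, 2] > 0)

# Closed form for the bivariate case: 1/4 + asin(rho) / (2 * pi)
exact <- 0.25 + asin(rho) / (2 * pi)

c(monte_carlo = mc_est, exact = exact)
```

The precision of such an estimate can be increased at will by drawing more samples, which is the trade-off between accuracy and computation time referred to above.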

We will provide a quick example and an even shorter example where we show how to use the methods in the package to estimate the correlation matrix and to perform the imputation. We then show a simulation study where we compare with the method suggested by Zhao and Udell (2020b).

The last section called [adding multinomial variables](#adding-m
