Mdgc
Provides functions to impute missing values using Gaussian copulas for mixed data types.
mdgc
This package contains a marginal likelihood approach to estimating the model discussed by Hoff (2007), Zhao and Udell (2020b), and Zhao and Udell (2020a). That is, a missing data approach where one uses Gaussian copulas in the latter case. We have modified the Fortran code by Genz and Bretz (2002) to supply an approximation of the gradient for the log marginal likelihood and to use an approximation of the marginal likelihood similar to the CDF approximation in Genz and Bretz (2002). We have also used the same Fortran code to perform the imputation conditional on a covariance matrix and the observed data. The method is described by Christoffersen et al. (2021) which can be found at arxiv.org.
Importantly, we also extend the model used by Zhao and Udell (2020b) to support multinomial variables. Thus, our model supports continuous, binary, ordinal, and multinomial variables, which makes it applicable to a large number of data sets.
The package can be useful for a lot of other models. For instance, the methods are directly applicable to other Gaussian copula models and some mixed effect models. All methods are implemented in C++, support computation in parallel, and should be easy to port to other languages.
Installation
The package can be installed from GitHub by calling:
remotes::install_github("boennecd/mdgc")
or from CRAN by calling:
install.packages("mdgc")
The code benefits from being built with automatic vectorization, so
having e.g. -O3 -mtune=native in the CXX11FLAGS flags in your Makevars
file may be useful.
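As a minimal sketch, a user-level Makevars file could contain the line below. The file location (`~/.R/Makevars` on Unix-like systems) and whether your R version consults `CXX11FLAGS` for this package are assumptions to check against your own setup.

```make
# ~/.R/Makevars (assumed location of the user-level file)
CXX11FLAGS = -O3 -mtune=native
```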
The Model
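We observe four types of variables for each observation: continuous, binary, ordinal, and multinomial variables. Let $\vec X_i$ be a $K$ dimensional vector for the i’th observation. The variables $X_{ij}$ are continuous if $j \in \mathcal C$, binary if $j \in \mathcal B$ with probability $p_j$ of being true, ordinal if $j \in \mathcal O$ with $m_j$ levels and borders $\alpha_{j0} = -\infty < \alpha_{j1} < \cdots < \alpha_{jm_j} = \infty$, and multinomial if $j \in \mathcal M$ with $m_j$ levels. $\mathcal C$, $\mathcal B$, $\mathcal O$, and $\mathcal M$ are mutually exclusive.

We assume that there is a latent variable $\vec Z_i$ which is multivariate normally distributed such that:

$$
\begin{align*}
\vec Z_i &\sim N\left(\vec\mu, \Sigma\right) \\
X_{ij} &= f_j(Z_{ih(j)}) & j &\in \mathcal C \\
X_{ij} &= 1_{\{Z_{ih(j)} > -\Phi^{-1}(p_j)\}} & j &\in \mathcal B \\
X_{ij} &= k \Leftrightarrow \alpha_{jk} < Z_{ih(j)} \leq \alpha_{j,k+1} & j &\in \mathcal O \wedge k = 0, \dots, m_j - 1 \\
X_{ij} &= k \Leftrightarrow Z_{i,h(j)+k} \geq \max(Z_{ih(j)}, \dots, Z_{i,h(j)+m_j-1}) & j &\in \mathcal M \wedge k = 0, \dots, m_j - 1
\end{align*}
$$

where $1_{\{\cdot\}}$ is one if the condition in the subscript is true and zero otherwise, $h(j)$ is a map to the index of the first latent variable associated with the j’th variable in $\vec X_i$ and $f_j$ is a bijective function. We only estimate some of the means, the $\vec\mu$, and some of the covariance parameters. Furthermore, we set $Z_{ih(j)} = 0$ if $j \in \mathcal M$ and assume that the variable is uncorrelated with all the other $\vec Z_i$’s.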
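The latent-variable construction described above can be illustrated with a small base R simulation. This is only a sketch: it does not use the package, and the dimensions, correlation matrix, means, borders, and all object names are hypothetical choices made for illustration.

```r
# Sketch only: every name and parameter value below is hypothetical.
set.seed(1)
n <- 2000L

# Latent layout: continuous (1), binary (2), ordinal (3), and a 3-level
# multinomial variable (4, 5, 6). h maps a variable to its first latent index.
h <- c(cont = 1L, bin = 2L, ord = 3L, mult = 4L)

# Correlation matrix and means of the five free latent variables.
Sigma <- 0.3^abs(outer(1:5, 1:5, "-"))
mu <- c(0, 0, 0, 0.5, -0.5)

# Draw the free latent variables and add the multinomial reference variable,
# which is fixed at zero and therefore uncorrelated with the rest.
Z_free <- sweep(matrix(rnorm(n * 5L), n) %*% chol(Sigma), 2L, mu, "+")
Z <- cbind(Z_free[, 1:3], 0, Z_free[, 4:5])

p_bin <- 0.3                                      # P(binary variable is TRUE)
borders <- c(-Inf, qnorm(c(0.2, 0.5, 0.8)), Inf)  # four ordinal levels

X <- data.frame(
  # continuous: a bijective map of the latent variable (exp is an assumption)
  cont = exp(Z[, h[["cont"]]]),
  # binary: indicator that the latent variable exceeds its border
  bin  = Z[, h[["bin"]]] > -qnorm(p_bin),
  # ordinal: level k = 0, ..., 3 given by the interval (border_k, border_{k+1}]
  ord  = cut(Z[, h[["ord"]]], borders, labels = FALSE) - 1L,
  # multinomial: level k = 0, 1, 2 of the largest associated latent variable
  mult = max.col(Z[, h[["mult"]] + 0:2], ties.method = "first") - 1L)
head(X)
```

The binary latent variable has mean zero and unit variance here, so the share of `TRUE` values should be close to `p_bin`.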
In principle, we could use other distributions than a multivariate
normal distribution for the latent variable. However, the multivariate
normal distribution has the advantage that it is very easy to
marginalize, which is convenient when we have to estimate the model with
missing entries. It also has some computational advantages for
approximating the log marginal likelihood, as similar intractable
problems have been thoroughly studied.
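To make the marginalization remark concrete, the standard property of the multivariate normal distribution can be written as below. The split into observed ($o$) and missing ($m$) latent components is notation introduced here for illustration and is not taken from the package.

$$
\begin{pmatrix} \vec Z_o \\ \vec Z_m \end{pmatrix} \sim
N\left(
\begin{pmatrix} \vec\mu_o \\ \vec\mu_m \end{pmatrix},
\begin{pmatrix} \Sigma_{oo} & \Sigma_{om} \\ \Sigma_{mo} & \Sigma_{mm} \end{pmatrix}
\right)
\quad\Rightarrow\quad
\vec Z_o \sim N\left(\vec\mu_o, \Sigma_{oo}\right)
$$

So the latent variables tied to missing entries can be dropped by deleting the corresponding elements of the mean vector and the corresponding rows and columns of the covariance matrix. No integration is needed for that step.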
Examples
Below, we provide an example similar to Zhao and Udell (2020b, Section 7.1). The authors use a data set with a random correlation matrix, 5 continuous variables, 5 binary variables, and 5 ordinal variables with 5 levels. There is a total of 2000 observations and 30% of the entries are missing completely at random.
To summarize the results of Zhao and Udell (2020b): they show that their approximate EM algorithm converges in what seems to be 20-25 seconds (to be fair, this is with a pure R implementation), while the MCMC algorithm used by Hoff (2007) takes more than 150 seconds. These figures should be kept in mind when looking at the results below. Importantly, Zhao and Udell (2020b) use an approximation in the E-step of an EM algorithm which is fast but might be crude in some settings. Using a potentially arbitrarily precise approximation of the log marginal likelihood is useful if this can be done quickly enough.
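A data set with this layout can be simulated with base R as sketched below. The source only states the dimensions and the missingness rate. How the random correlation matrix is drawn, the zero border for the binary variables, and the equal-probability borders for the ordinal variables are assumptions made here, and all object names are hypothetical.

```r
# Sketch only: the simulation details are assumptions, not the authors' setup.
set.seed(1)
n <- 2000L   # observations
p <- 15L     # 5 continuous + 5 binary + 5 ordinal variables

# One (assumed) way to draw a random correlation matrix.
Sigma <- cov2cor(crossprod(matrix(rnorm(p * p), p)))

# Latent multivariate normal variables with correlation matrix Sigma.
Z <- matrix(rnorm(n * p), n) %*% chol(Sigma)

cont <- Z[, 1:5]                                   # continuous variables
bin  <- Z[, 6:10] > 0                              # binary variables
ord  <- apply(Z[, 11:15], 2L, cut,                 # ordinal variables, 5 levels
              breaks = c(-Inf, qnorm(1:4 / 5), Inf), labels = FALSE)
dat <- data.frame(cont = cont, bin = bin, ord = ord)

# Mask 30% of the entries completely at random.
is_miss <- matrix(runif(n * p) < 0.3, n)
dat[is_miss] <- NA

dim(dat)
mean(is.na(dat))   # fraction of missing entries, should be near 0.3
```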
We will provide a quick example and an even shorter example where we show how to use the methods in the package to estimate the correlation matrix and to perform the imputation. We then show a simulation study where we compare with the method suggested by Zhao and Udell (2020b).
The last section called [adding multinomial variables](#adding-m
