Comets
Algorithm-agnostic significance testing in supervised learning with multimodal data
Covariance Measure Tests (COMETs) in R <img src='inst/comets-pkg.png' align="right" height="138.5" />
The Generalised [1], Projected [2], weighted generalised [3], kernel
generalised [4] Covariance Measure tests (GCM, PCM, wGCM, kGCM tests) can be
used to test conditional independence between a real-valued response $Y$ and
features/modalities $X$ given additional features/modalities $Z$ using any
sufficiently predictive supervised learning algorithm. An extension of the GCM
to censored responses was proposed in [5] and is implemented with survival
regression methods. The comets R package implements these covariance measure
tests (COMETs) with a user-friendly interface which allows the user to use any
sufficiently predictive supervised learning algorithm of their choosing. The
default is to use random forests implemented in ranger for all regressions. A
Python version of this package is also available.
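To give a rough idea of what a covariance measure test computes, the following is a minimal sketch of the univariate GCM statistic from [1], written in base R with linear regressions standing in for the learner. The function name `gcm_sketch` and the simulated data are illustrative assumptions; this is not the implementation used in comets, which defaults to random forests.

```r
set.seed(1)
n <- 500
z <- rnorm(n)
x <- z + rnorm(n)
y <- z + rnorm(n)  # Y depends on Z only, so the null hypothesis holds

# Hypothetical helper: univariate GCM statistic with linear regressions
gcm_sketch <- function(y, x, z) {
  res_y <- residuals(lm(y ~ z))  # residuals of the Y on Z regression
  res_x <- residuals(lm(x ~ z))  # residuals of the X on Z regression
  r <- res_y * res_x             # products of residuals
  # Normalised mean of the residual products, approximately N(0, 1) under H0
  stat <- sqrt(length(r)) * mean(r) / sqrt(mean(r^2) - mean(r)^2)
  c(statistic = stat, p.value = 2 * pnorm(-abs(stat)))
}

out <- gcm_sketch(y, x, z)
print(out)
```

Any regression method that yields residuals could replace `lm()` here, which is the sense in which the tests are algorithm-agnostic.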
Here, we showcase how to use comets with a simple example in which $Y$ is not
independent of $X$ given $Z$. More elaborate examples including conditional
variable significance testing and modality selection on real-world data can be
found in [6].
library("comets")
set.seed(1)
n <- 300
X <- matrix(rnorm(2 * n), ncol = 2)
colnames(X) <- c("X1", "X2")
Z <- matrix(rnorm(2 * n), ncol = 2)
colnames(Z) <- c("Z1", "Z2")
Y <- X[, 1]^2 + Z[, 2] + rnorm(n)
GCM <- gcm(Y, X, Z) # plot(GCM)
The output for the GCM test, which fails to reject the null hypothesis of
conditional independence in this example, is shown below. The residuals for the
$Y$ on $Z$ and $X$ on $Z$ regressions can be investigated by calling plot(GCM)
(not shown here).
##
## Generalized covariance measure test
##
## data: gcm(Y = Y, X = X, Z = Z)
## X-squared = 2.8211, df = 2, p-value = 0.244
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
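One way to see why the GCM test has little power in this example: $Y$ depends on $X$ only through $X_1^2$, and for a standard normal $X_1$ the covariance between $X_1^2$ and $X_1$ is zero. The GCM targets exactly this kind of covariance between residuals, so a purely quadratic dependence goes undetected. The short simulation below illustrates this in base R; the variable names are chosen for illustration only.

```r
set.seed(1)
x1 <- rnorm(1e5)

# A linear signal in X1 is strongly correlated with X1 ...
cov_linear <- cov(x1, x1)
# ... whereas a purely quadratic signal is (nearly) uncorrelated with X1
cov_quadratic <- cov(x1^2, x1)

print(c(linear = cov_linear, quadratic = cov_quadratic))
```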
The PCM test can be run likewise.
PCM <- pcm(Y, X, Z) # plot(PCM)
The output is shown below. The PCM test correctly rejects the null hypothesis of conditional independence in this example.
##
## Projected covariance measure test
##
## data: pcm(Y = Y, X = X, Z = Z)
## Z = 4.8589, p-value = 5.901e-07
## alternative hypothesis: true E[Y | X, Z] is not equal to E[Y | Z]
The comets package contains an alternative formula-based interface, in which
$H_0 : Y \perp\hspace{-5pt}\perp X \mid Z$ can be supplied as Y ~ X | Z with a
corresponding data argument. This interface is implemented in comets() and
shown below.
dat <- data.frame(Y = Y, X, Z)
comets(Y ~ X1 + X2 | Z1 + Z2, data = dat, test = "gcm")
##
## Generalized covariance measure test
##
## data: comets(formula = Y ~ X1 + X2 | Z1 + Z2, data = dat, test = "gcm")
## X-squared = 3.2184, df = 2, p-value = 0.2
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
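The variable names in the formula refer to columns of the data frame. As a small illustration in base R, the snippet below rebuilds the simulated data and shows that the matrix column names carry over into `dat`, which is why the formula is written in terms of X1, X2, Z1 and Z2.

```r
set.seed(1)
n <- 300
X <- matrix(rnorm(2 * n), ncol = 2)
colnames(X) <- c("X1", "X2")
Z <- matrix(rnorm(2 * n), ncol = 2)
colnames(Z) <- c("Z1", "Z2")
Y <- X[, 1]^2 + Z[, 2] + rnorm(n)

# Matrix arguments are expanded into one column per matrix column
dat <- data.frame(Y = Y, X, Z)
names(dat)
# "Y"  "X1" "X2" "Z1" "Z2"
```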
Specifying regression methods
Different regression methods can be supplied for both GCM and PCM tests using the
reg_* arguments (for instance, reg_YonZ in gcm() for the regression of $Y$
on $Z$). Pre-implemented regressions are "rf" for random forests and "lasso"
for cross-validated $L_1$-penalized regression. Custom regression functions can
be supplied as character strings or functions. They require a residuals() (GCM
and PCM) or predict() (PCM only) method and the following structure:
my_regression <- function(y, x, ...) {
  ret <- <run the regression>
  class(ret) <- "my_regression"
  ret
}
predict.my_regression <- function(object, data, ...) {
  <run the prediction routine>
}
residuals.my_regression <- function(object, response, data, ...) {
  <run the routine for computing residuals>
}
The input y is vector-valued, while x and data are matrix-valued. The output of
predict.my_regression() should be a vector of length NROW(data).
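As a concrete instance of the template above, here is a minimal sketch of a custom regression based on ordinary least squares via `stats::lm.fit()`. The name `ols_regression` and the choice of storing only the coefficients are illustrative assumptions, not part of comets.

```r
# Hypothetical custom regression: ordinary least squares with an intercept
ols_regression <- function(y, x, ...) {
  fit <- stats::lm.fit(x = cbind(1, as.matrix(x)), y = y)
  ret <- list(coefficients = fit$coefficients)
  class(ret) <- "ols_regression"
  ret
}

# Returns a vector of fitted values of length NROW(data)
predict.ols_regression <- function(object, data, ...) {
  drop(cbind(1, as.matrix(data)) %*% object$coefficients)
}

# Residuals are the response minus the prediction on the supplied data
residuals.ols_regression <- function(object, response, data, ...) {
  response - predict(object, data = data)
}

# Quick check on simulated data
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
y <- 1 + 2 * x[, 1] - x[, 2] + rnorm(100)
fit <- ols_regression(y, x)
pred <- predict(fit, data = x)
res <- residuals(fit, response = y, data = x)
```

Following the description above, such a function could then be passed through one of the reg_* arguments, for example reg_YonZ = ols_regression in gcm(); that call is not run here.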
Usage example: Survival response
For survival responses, comets offers the TRAM-GCM test [5] and supports
parametric and semiparametric survival models from the survival package, as
well as random survival forests from ranger. As an example, we test whether
survival is independent of sex given age in the cancer dataset, once using a
Cox model and once using a random survival forest. Both tests reject the null
hypothesis at conventional significance levels.
library("survival")
data("cancer", package = "survival")
cancer$surv <- with(cancer, Surv(time, status == 2))
comets(surv ~ sex | age, data = cancer, reg_YonZ = "cox")
##
## Generalized covariance measure test
##
## data: comets(formula = surv ~ sex | age, data = cancer, reg_YonZ = "cox")
## X-squared = 7.7608, df = 1, p-value = 0.005339
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
comets(surv ~ sex | age, data = cancer, reg_YonZ = "survforest")
##
## Generalized covariance measure test
##
## data: comets(formula = surv ~ sex | age, data = cancer, reg_YonZ = "survforest")
## X-squared = 14.827, df = 1, p-value = 0.0001178
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
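The response in the example above is a `Surv` object combining follow-up time with an event indicator. The sketch below shows how that response is built, assuming the status coding documented in the survival package for this dataset (1 = censored, 2 = dead); the intermediate variable `event` is introduced here for illustration.

```r
library("survival")
data("cancer", package = "survival")

# status == 2 marks an observed death; all other observations are censored
event <- cancer$status == 2
surv <- Surv(cancer$time, event)

# Censored observations are printed with a trailing "+"
head(surv)
table(event)
```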
Usage example: Multivariate response
The GCM test also supports multivariate responses. Continuing the example from
above, we generate a bivariate response $Y$. Internally, since $Y$ and $X$ are
both two dimensional, four random forest regressions are performed. Advanced
usage with the multivariate argument also allows the specification of
multivariate regression models (this option is experimental).
bivY <- cbind(Y, 0.5 * X[, 1] + Z[, 1] + rnorm(n))
gcm(bivY, X, Z)
##
## Generalized covariance measure test
##
## data: gcm(Y = bivY, X = X, Z = Z)
## X-squared = 42.177, df = 4, p-value = 1.533e-08
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
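The four degrees of freedom correspond to the four pairwise products between the two columns of residuals for $Y$ and the two for $X$. The following base R sketch mimics this with linear regressions in place of random forests. The quadratic-form statistic used here is an assumption that matches the chi-squared output with df = 4 shown above; it is not taken from the comets source code, and the numbers will differ from the package output.

```r
set.seed(1)
n <- 300
X <- matrix(rnorm(2 * n), ncol = 2)
Z <- matrix(rnorm(2 * n), ncol = 2)
bivY <- cbind(X[, 1]^2 + Z[, 2] + rnorm(n),
              0.5 * X[, 1] + Z[, 1] + rnorm(n))

# Residual matrices (n x 2 each) from linear regressions on Z
res_y <- residuals(lm(bivY ~ Z))
res_x <- residuals(lm(X ~ Z))

# All pairwise residual products: an n x 4 matrix
R <- do.call(cbind, lapply(seq_len(ncol(res_y)),
                           function(j) res_x * res_y[, j]))

# Quadratic form in the column means, compared with a chi-squared distribution
m <- colMeans(R)
stat <- drop(n * crossprod(m, solve(var(R), m)))
df <- ncol(R)
p_value <- pchisq(stat, df = df, lower.tail = FALSE)
print(c(statistic = stat, df = df, p.value = p_value))
```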
Installation
The development version of comets can be installed using:
# install.packages("remotes")
remotes::install_github("LucasKook/comets")
A stable version of comets can be installed from CRAN via:
install.packages("comets")
Replication materials
All results in [6] can be reproduced by running make all in ./inst after
downloading all required data from the
Zenodo repository.
The scripts for reproducing the results manually can be found in ./inst/code/
for the CCLE data (ccle.R), TCGA data (multiomics.R) and MIMIC data
(mimic.R).
Citation
Please cite the comets package as
@article{10.1093/bib/bbae475,
title={{Algorithm-Agnostic Significance Testing in Supervised Learning With Multimodal Data}},
author={Lucas Kook and Anton Rask Lundborg},
year={2024},
journal={Briefings in Bioinformatics},
volume={25},
number={6},
doi={10.1093/bib/bbae475},
}
References
[1] Shah, R. D., & Peters, J. (2020). The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics, 48(3), 1514-1538. doi:10.1214/19-aos1857
[2] Lundborg, A. R., Kim, I., Shah, R. D., & Samworth, R. J. (2024). The Projected Covariance Measure for assumption-lean variable significance testing. The Annals of Statistics, 52(6), 2851-2878. doi:10.1214/24-AOS2447
[3] Scheidegger, C., Hörrmann, J., & Bühlmann, P. (2022). The weighted generalised covariance measure. Journal of Machine Learning Research, 23(273), 1-68.
[4] Fernández, T., & Rivera, N. (2024). A general framework for the analysis of kernel-based tests. Journal of Machine Learning Research, 25(95), 1-40.
[5] Kook, L., Saengkyongam, S., Lundborg, A. R., Hothorn, T., & Peters, J. (2025). Model-based causal feature selection for general response types. Journal of the American Statistical Association, 120(550), 1090-1101. doi:10.1080/01621459.2024.2395588
[6] Kook, L., & Lundborg, A. R. (2024). Algorithm-agnostic significance testing in supervised learning with multimodal data. Briefings in Bioinformatics, 25(6). doi:10.1093/bib/bbae475
