causalglm : interpretable and model-robust causal inference for heterogeneous treatment effects

In the search for answers to causal questions, assuming parametric models can be dangerous. Even with a seemingly small amount of confounding and model misspecification, they can give biased answers. One way of mitigating this challenge is to parametrically model only the feature of the data-generating distribution that you care about. It is not even necessary to assume your parametric model is correct! Instead, view it as a "working" or "approximation" model and define your estimand as the best causal approximation of the true nonparametric estimand with respect to your parametric working model. This allows for causal estimates and robust inference with no parametric assumptions on the functional form of any feature of the data-generating distribution. Alternatively, you can use a semiparametric model, which assumes only that the parametric form of the relevant part of the data distribution is correct. Let the data speak for itself and use machine learning to model the nuisance features of the data that are not directly related to your causal question. It is in fact possible to get robust and efficient inference for causal quantities using machine learning. Why worry about things that don't matter for your question? It is not worth the risk of being wrong.
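To make the risk concrete, here is a small base R simulation. It is only an illustration and does not use causalglm; the data-generating process, sample size and seed are assumptions chosen for the example. The treatment depends on W^2, so a linear adjustment for W leaves confounding in place.

```r
# Illustrative simulation (base R only, not part of causalglm).
set.seed(1)
n <- 5000
W <- runif(n, min = -1, max = 1)                      # baseline confounder
A <- rbinom(n, size = 1, prob = plogis(3 * W^2 - 1))  # treatment depends on W^2
Y <- 1 * A + 3 * W^2 + rnorm(n, sd = 0.5)             # true treatment effect is 1

# Misspecified model: adjusts for W linearly and misses the W^2 confounding.
fit_wrong <- lm(Y ~ A + W)
# Correctly specified model: includes the quadratic term.
fit_right <- lm(Y ~ A + W + I(W^2))

est_wrong <- unname(coef(fit_wrong)["A"])
est_right <- unname(coef(fit_right)["A"])
print(c(misspecified = est_wrong, correct = est_right))
```

The misspecified fit should land well away from the true effect of 1, while the correctly specified fit should be close to it. In practice the correct form is unknown, which is the case for leaving the nuisance parts to machine learning.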

This package fully utilizes the powerful tlverse/tmle3 generalized targeted learning framework as well as the machine-learning frameworks tlverse/sl3 and tlverse/hal9001. We recommend taking a look at these packages and the rest of the tlverse!

For in-progress theoretical details and method descriptions, see the writeup causalglm.pdf in the "paper" folder.

For a walk-through guide with example code, check out vignette.Rmd in the "vignette" folder.

Installing

NOTE: This package is actively in development and is subject to continuous change. Feel free to contact me personally or through the Issues tab.

To install this package, install the devtools CRAN package and run:

if (!require(devtools)) {
  install.packages("devtools")
}
devtools::install_github("tlverse/causalglm")

This package also requires the following GitHub packages. Make sure to update these packages to the versions below when installing.

devtools::install_github("tlverse/hal9001@master")
devtools::install_github("tlverse/tmle3@general_submodels_devel")
devtools::install_github("tlverse/sl3@Larsvanderlaan-formula_fix")

What is causalglm?

causalglm is a package for interpretable and robust causal inference for heterogeneous treatment effects using generalized linear working models and machine learning. Unlike parametric methods based on generalized linear models and semiparametric methods based on partially linear models, the methods implemented in causalglm do not assume that any user-specified parametric models are correctly specified. That is, causalglm does not assume that the true data-generating distribution satisfies the parametric model. Instead, the user-specified parametric model is viewed as an approximation or "working model", and an interpretable estimand is defined as a projection of the true conditional treatment effect estimand onto the working model. Moreover, causalglm, unlike glm, only requires a user-specified parametric working model for the causal estimand of interest. All nuisance components of the data-generating distribution are left unspecified and learned data-adaptively using machine learning. Thus, causalglm provides not only nonparametrically robust inference but also estimates (different from glm) for causally interpretable estimands that maximally adjust for confounding. To allow for valid inference with the use of variable selection and machine learning, Targeted Maximum Likelihood Estimation (van der Laan and Rose, 2011) is employed.
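The idea of an estimand defined by projection can be sketched in base R. This is not causalglm's estimator: the true CATE function below is an assumption made up for the example, W is taken to be uniform on [-1, 1], and the projection is assumed to be a least-squares one.

```r
# Illustrative sketch (base R only, not causalglm's estimator).
# Assumed true CATE, nonlinear in W: CATE(W) = 1 + W + 2 * W^2.
true_cate <- function(w) 1 + w + 2 * w^2

# A fine grid standing in for W ~ Uniform(-1, 1).
w_grid <- seq(-1, 1, length.out = 2001)

# Linear working model CATE(W) ~ beta0 + beta1 * W. The least-squares
# projection coefficients define the working-model estimand.
projection <- lm(true_cate(w_grid) ~ w_grid)
beta <- unname(coef(projection))
print(beta)  # close to c(5/3, 1), since E[W^2] = 1/3 under Uniform(-1, 1)
```

Even though the linear working model is wrong here, the target (intercept near 5/3, slope 1) is well defined. That target is what working-model-based inference is about.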

The statistical data-structure used throughout this package is O = (W,A,Y) where W represents a random vector of baseline (pretreatment) covariates/confounders, A is a binary, categorical or continuous treatment assignment, and Y is some outcome variable. For marginal structural models, we also consider a subvector V of W that represents a subset of baseline variables that are of interest.
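As a sketch of this data structure, the following base R snippet simulates a small data set with two baseline covariates, a binary treatment and a continuous outcome. The variable names (W1, W2, W_names, V_names) and the data-generating process are hypothetical choices for illustration only.

```r
# Simulated example of O = (W, A, Y); all names and distributions are assumptions.
set.seed(42)
n <- 200
W1 <- runif(n, min = -1, max = 1)                 # continuous baseline covariate
W2 <- rbinom(n, size = 1, prob = 0.5)             # binary baseline covariate
A <- rbinom(n, size = 1, prob = plogis(0.5 * W1 + 0.3 * W2))  # binary treatment
Y <- rnorm(n, mean = A * (1 + W1) + W2, sd = 0.5) # outcome

dat <- data.frame(W1, W2, A, Y)

W_names <- c("W1", "W2")  # all baseline covariates/confounders W
V_names <- "W1"           # subvector V of W, of interest for marginal structural models
str(dat)
```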

Estimands supported by `causalglm`

causalglm supports causal working-model-based estimands for the

Conditional average treatment effect (CATE) for arbitrary outcomes: E[Y|A=a,W] - E[Y|A=0,W] (categorical and continuous treatments)
Conditional odds ratio (OR) for binary outcomes: {P(Y=1|A=1,W)/P(Y=0|A=1,W)} / {P(Y=1|A=0,W)/P(Y=0|A=0,W)} (binary treatments and continuous treatments)
Conditional relative risk (RR) for binary, count or nonnegative outcomes: E[Y|A=a,W]/E[Y|A=0,W] (categorical and continuous treatments)
Conditional treatment-specific mean (TSM) : E[Y|A=a,W] (categorical treatments)
Conditional average treatment effect among the treated (CATT) : the best approximation of E[Y|A=a,W] - E[Y|A=0,W] based on a user-specified formula/parametric model among the treated (i.e. observations with A=a) (categorical treatments)
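To see how these conditional estimands relate, the sketch below evaluates them for a binary treatment and binary outcome from a known outcome regression. The regression function p_y is an assumption invented for the example. In practice it is unknown and has to be estimated.

```r
# Assumed outcome regression (hypothetical, for illustration only):
# P(Y = 1 | A = a, W = w) = plogis(-1 + 0.5 * w + a * (1 + w)).
p_y <- function(a, w) plogis(-1 + 0.5 * w + a * (1 + w))
odds <- function(p) p / (1 - p)

w <- c(-1, 0, 1)

cate <- p_y(1, w) - p_y(0, w)                    # E[Y|A=1,W] - E[Y|A=0,W]
odds_ratio <- odds(p_y(1, w)) / odds(p_y(0, w))  # conditional OR
rel_risk <- p_y(1, w) / p_y(0, w)                # conditional RR
tsm <- p_y(1, w)                                 # treatment-specific mean at a = 1

# Under this logistic form the conditional log odds ratio is exactly 1 + w.
print(log(odds_ratio))
```

Here the log odds ratio is linear in W, so a working model such as ~ 1 + W would be correct for the OR, while the CATE and RR of the same distribution are not linear in W.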

All methods support binary treatments, and most support categorical treatments. Continuous treatments are supported only for the CATE, OR and RR, through contglm. Each method allows for arbitrary user-specified parametric working models for the estimands. For binary and categorical treatments, the working model used for all estimands is of the form E[Y|A=a,W] - E[Y|A=0,W] = formula(W) where formula is specified by the user. For continuous treatments (only supported by contglm), the working models used are:

CATE(a,W) := E[Y|A=a,W] - E[Y|A=0,W] = 1(a > 0) * formula_binary(W) + a * formula_continuous(W)
log OR(a,W) := log {P(Y=1|A=a,W)/P(Y=0|A=a,W)} - log {P(Y=1|A=0,W)/P(Y=0|A=0,W)} = 1(a > 0) * formula_binary(W) + a * formula_continuous(W)
log RR(a,W) := log E[Y|A=a,W] - log E[Y|A=0,W] = 1(a > 0) * formula_binary(W) + a * formula_continuous(W)

Since estimates and inference are provided for the best approximation of the estimand by these working models, the outputs of contglm can be viewed as the best linear approximation to the continuous treatment estimand.
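The continuous-treatment working model, 1(a > 0) * formula_binary(W) + a * formula_continuous(W), can be evaluated by hand as in the sketch below. The helper working_model and the coefficient values are hypothetical, and both formulas are assumed to be ~ 1 + W. This is not how contglm is called.

```r
# Hypothetical helper: evaluates
#   1(a > 0) * formula_binary(W) + a * formula_continuous(W)
# with both formulas assumed to be ~ 1 + W.
working_model <- function(a, w, beta_binary, beta_continuous) {
  X <- cbind(1, w)  # design matrix for ~ 1 + W
  as.numeric(a > 0) * drop(X %*% beta_binary) +
    a * drop(X %*% beta_continuous)
}

effect <- working_model(
  a = c(0, 1, 2),
  w = c(0.5, 0.5, 0.5),
  beta_binary = c(1, 2),       # effect of receiving any treatment: 1 + 2 * W
  beta_continuous = c(0.5, 1)  # additional effect per unit dose: 0.5 + W
)
print(effect)  # 0 3 4
```

At a = 0 the effect is zero by construction. The binary part gives a jump for any positive dose and the continuous part adds a per-unit slope.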

causalglm also supports the following working marginal structural model estimands:

Working marginal structural models for the CATE: E[CATE(W)|V] := E[E[Y|A=a,W] - E[Y|A=0,W]|V] (categorical treatments)
Working marginal structural models for the RR: E[E[Y|A=a,W]|V]/E[E[Y|A=0,W]|V] (categorical treatments)
Working marginal structural models for the TSM : E[E[Y|A=a,W]|V] (categorical treatments)
Working marginal structural models for the CATT : E[CATE(W)|V, A=a] := E[E[Y|A=a,W] - E[Y|A=0,W]|V, A=a] (categorical treatments)
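The marginal structural estimand E[CATE(W)|V] averages the conditional effect over the covariates other than V. A minimal base R sketch follows. It assumes W = (V, U), with U uniform on [-1, 1] and independent of V, and uses a made-up CATE function.

```r
# Assumed CATE with W = (V, U) (hypothetical): CATE(V, U) = 1 + V + U^2.
cate <- function(v, u) 1 + v + u^2

# U ~ Uniform(-1, 1), independent of V, approximated by a fine grid.
u_grid <- seq(-1, 1, length.out = 2001)

# E[CATE(W) | V = v]: average the CATE over U for each value of v.
msm_cate <- function(v) sapply(v, function(v_i) mean(cate(v_i, u_grid)))

v <- c(-1, 0, 1)
msm_values <- msm_cate(v)
print(msm_values)  # approximately v + 4/3, since E[U^2] = 1/3
```

The resulting function of V alone is what a working marginal structural model, such as ~ 1 + V, would then approximate.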

Methods provided by `causalglm`

causalglm consists of 2 main functions:

npglm for robust nonparametric estimation and inference for user-specified working models for the CATE, CATT, TSM, RR or OR
contglm for robust nonparametric estimation and inference for user-specified working models for the CATE, OR and RR as a function of a continuous or ordered numeric treatment.

And 3 more specialized functions:

msmglm for robust nonparametric estimation and inference for user-specified working marginal structural models for the CATE, CATT, TSM or RR
spglm for semiparametric estimation and inference for correctly specified parametric models for the CATE, RR and OR
causalglmnet for semiparametric estimation and inference with high dimensional confounders W (a custom wrapper function for spglm focused on big data where standard ML may struggle)

For most user applications with discrete treatments, npglm suffices. For continuous treatments, users may use contglm.

Outputs provided by all `causalglm` methods

The outputs of the methods include:

Coefficient estimates (using the S3 summary function)
Z-scores and p-values for coefficients
95% confidence intervals for coefficients
Individual-level treatment-effect predictions and 95% confidence (prediction) intervals can be extracted with the predict function and argument data.
Plotting with plot_msm for objects returned by msmglm.
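For readers unfamiliar with these quantities, the sketch below shows how a Z-score, two-sided p-value and 95% confidence interval follow from a coefficient estimate and its standard error. It assumes standard Wald-type (normal-approximation) inference. The numbers are made up and are not output from causalglm.

```r
# Made-up coefficient estimate and standard error, for illustration only.
estimate <- 0.5
std_error <- 0.1

z_score <- estimate / std_error                # Wald Z-score
p_value <- 2 * pnorm(-abs(z_score))            # two-sided p-value
ci_95 <- estimate + c(-1, 1) * qnorm(0.975) * std_error  # 95% confidence interval

print(c(z = z_score, p = p_value, lower = ci_95[1], upper = ci_95[2]))
```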

Which method to use?

A rule of thumb for choosing between these methods is as follows:

Use npglm if you believe your parametric model for the treatment effect estimand is a good approximation
Use msmglm if you want to know how the treatment effect is causally affected by one or a number of variables V (fully adjusting for the remaining variables W) (or to learn univariate confounder-adjusted variable importance measures!)
Use contglm if your treatment is continuous or ordered and you are interested in the treatment effect per unit dose.
Use causalglmnet if the variables W for which to adjust are (very) high dimensional.
Use spglm if you believe your parametric model for the treatment effect estimand is correct (not recommended)

msmglm deals with marginal structural models for the conditional treatment effect estimands. This method is useful if you are only interested in modeling the causal treatment effect as a function of a subset of variables V adjusting for all the available confounders W that remain. This allows for parsimonious causal modeling, still maximally adjusting for confounding. This function can be used to understand the causal variable importance of individual variables (by having V be a single variable) and allows for nice plots (see plot_msm). contglm is a version of npglm that provides inference for working-model-based estimands for conditional treatment effects of continuous or ordered treatments.

User-friendly interface

causalglm has a minimalistic yet still quite flexible front-end. Check out the vignette to see how to use it! The neces
