Collaborative Targeted Maximum Likelihood Estimation

Collaborative Targeted Maximum Likelihood Estimation (C-TMLE) is an extention of Targeted Maximum Likelihood Estimation (TMLE). It applies variable/model selection for nuisance parameter (e.g. the propensity score) estimation in a 'collaborative' way, by directly optimizing the empirical metric on the causal estimator.

In this package, we implemented the general template of C-TMLE, for the estimation of the average treatment effect (ATE).

The package also offers convenient functions for discrete C-TMLE for variable selection, and LASSO-C-TMLE for model selection of LASSO, in estimation of the propensity score (PS).

Installation

To install the CRAN release version of ctmle:

install.packages('ctmle')

To install the development version (requires the devtools package):

devtools::install_github('jucheng1992/ctmle')

C-TMLE for variable selection

In this section, we start with examples of discrete C-TMLE for variable selection, using greedy forward searching, and scalable discrete C-TMLE with pre-ordering option.

library(ctmle)
#> Loading required package: SuperLearner
#> Loading required package: nnls
#> Super Learner
#> Version: 2.0-22
#> Package created on 2017-07-18
#> Loading required package: tmle
#> Welcome to the tmle package, version 1.2.0-5
#> 
#> Use tmleNews() to see details on changes and bug fixes
#> Loading required package: glmnet
#> Loading required package: Matrix
#> Loading required package: foreach
#> Loaded glmnet 2.0-10
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(123)

N <- 1000
p = 5
Wmat <- matrix(rnorm(N * p), ncol = p)
beta1 <- 4+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5]
beta0 <- 2+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5]
tau <- 2
gcoef <- matrix(c(-1,-1,rep(-(3/((p)-2)),(p)-2)),ncol=1)
W <- as.matrix(Wmat)

g <- 1/(1+exp(W%*%gcoef /3))
A <- rbinom(N, 1, prob = g)

epsilon <-rnorm(N, 0, 1)
Y  <- beta0 + tau * A + epsilon

# With initial estimate of Q
Q <- cbind(rep(mean(Y[A == 0]), N), rep(mean(Y[A == 1]), N))

time_greedy <- system.time(
      ctmle_discrete_fit1 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), Q = Q,
                                           preOrder = FALSE, detailed = TRUE)
)
ctmle_discrete_fit2 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat),
                                     preOrder = FALSE, detailed = TRUE)


time_preorder <- system.time(
      ctmle_discrete_fit3 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), Q = Q,
                                           preOrder = TRUE,
                                           order = rev(1:p), detailed = TRUE)
)

Scalable (discrete) C-TMLE takes much less computation time:

time_greedy
#>    user  system elapsed 
#>   1.589   0.045   1.646
time_preorder
#>    user  system elapsed 
#>   0.994   0.012   1.008

Show the brief results from greedy CTMLE:

ctmle_discrete_fit1
#> C-TMLE result:
#>  parameter estimate:  1.99472 
#>  estimated variance:  0.00838 
#>             p-value:  <2e-16 
#>   95% conf interval: (1.81533, 2.1741)

Summary function offers detial information of which variable is selected.

summary(ctmle_discrete_fit1)
#> 
#> Number of candidate TMLE estimators created:  6 
#> A candidate TMLE estimator was created at each move, as each new term
#> was incorporated into the model for g.
#> ---------------------------------------------------------------------- 
#>         term added cleverCovar estimate cv-RSS cv-varIC cv-penRSS
#> cand 1 (intercept)           1     4.22   19.9   0.0788     14045
#> cand 2          X2           1     3.22   19.6   0.0851     13818
#> cand 3          X5           1     2.61   19.1   0.0870     13485
#> cand 4          X1           1     2.00   18.3   0.0955     12945
#> cand 5          X4           2     1.99   18.3   0.0950     12937
#> cand 6          X3           3     2.01   18.3   0.1008     12941
#> ---------------------------------------------------------------------- 
#> Selected TMLE estimator is candidate 5 
#> 
#> Each TMLE candidate was created by fluctuating the initial fit, Q0(A,W)=E[Y|A,W], obtained in stage 1.
#> 
#>  cand 1: Q1(A,W) = Q0(A,W) + epsilon1a * h1a 
#>              h1a is based on an intercept-only model for treatment mechanism g(A,W)
#> 
#>      cand 2: Q2(A,W) = Q0(A,W) + epsilon1b * h1b 
#>              h1b is based on a treatment mechanism model containing covariates X2
#> 
#>      cand 3: Q3(A,W) = Q0(A,W) + epsilon1c * h1c 
#>              h1c is based on a treatment mechanism model containing covariates X2, X5
#> 
#>      cand 4: Q4(A,W) = Q0(A,W) + epsilon1d * h1d 
#>              h1d is based on a treatment mechanism model containing covariates X2, X5, X1
#> 
#>      cand 5: Q5(A,W) = Q0(A,W) + epsilon1d * h1d + epsilon2 * h2                     = Q4(A,W) + epsilon2 * h2,
#>              h2 is based on a treatment mechanism model containing covariates X2, X5, X1, X4
#> 
#>      cand 6: Q6(A,W) = Q0(A,W) + epsilon1d * h1d + epsilon2 * h2 + epsilon3 * h3                     = Q5(A,W) + epsilon3 * h3,
#>              h3 is based on a treatment mechanism model containing covariates X2, X5, X1, X4, X3
#> 
#> ---------- 
#> C-TMLE result:
#>  parameter estimate:  1.99472 
#>  estimated variance:  0.00838 
#>             p-value:  <2e-16 
#>   95% conf interval: (1.81533, 2.1741)

LASSO-C-TMLE for model selection of LASSO

In this section, we introduce the LASSO-C-TMLE algorithm for model selection of LASSO in the estimation of the propensity score. We implemented three variations of the LASSO-C-TMLE algorithm. For simplicity, we call them C-TMLE1-3. See technical details in the corresponding references.

# Generate high-dimensional data
set.seed(123)

N <- 1000
p = 100
Wmat <- matrix(rnorm(N * p), ncol = p)
beta1 <- 4 + 2 * Wmat[,1] + 2 * Wmat[,2] + 2 * Wmat[,5] + 2 * Wmat[,6] + 2 * Wmat[,8]
beta0 <- 2 + 2 * Wmat[,1] + 2 * Wmat[,2] + 2 * Wmat[,5] + 2 * Wmat[,6] + 2 * Wmat[,8]
tau <- 2
gcoef <- matrix(c(-1,-1,rep(-(3/((p)-2)),(p)-2)),ncol=1)
W <- as.matrix(Wmat)

g <- 1/(1+exp(W%*%gcoef /3))
A <- rbinom(N, 1, prob = g)

epsilon <-rnorm(N, 0, 1)
Y  <- beta0 + tau * A + epsilon

# With initial estimate of Q
Q <- cbind(rep(mean(Y[A == 0]), N), rep(mean(Y[A == 1]), N))

glmnet_fit <- cv.glmnet(y = A, x = W, family = 'binomial', nlambda = 20)

We start build a sequence of lambdas from the lambda selected by cross-validation, as the model selected by cv.glmnet would over-smooth w.r.t. the target parameter.

lambdas <- glmnet_fit$lambda[(which(glmnet_fit$lambda==glmnet_fit$lambda.min)):length(glmnet_fit$lambda)]

We fit C-TMLE1 algorithm by feed the algorithm with a vector of lambda, in decreasing order:

time_ctmlelasso1 <- system.time(
      ctmle_fit1 <- ctmleGlmnet(Y = Y, A = A,
                                W = data.frame(W = W),
                                Q = Q, lambdas = lambdas, ctmletype=1, 
                                family="gaussian",gbound=0.025, V=5)
)

We fit C-TMLE2 algorithm:

time_ctmlelasso2 <- system.time(
      ctmle_fit2 <- ctmleGlmnet(Y = Y, A = A,
                                W = data.frame(W = W),
                                Q = Q, lambdas = lambdas, ctmletype=2, 
                                family="gaussian",gbound=0.025, V=5)
)

For C-TMLE3, we need two gn estimators, one with lambda selected by cross-validation, and the other with lambda slightly different from the selected lambda:

gcv <- predict.cv.glmnet(glmnet_fit, newx=W, s="lambda.min",type="response")
gcv <- bound(gcv,c(0.025,0.975))

s_prev <- glmnet_fit$lambda[(which(glmnet_fit$lambda == glmnet_fit$lambda.min))] * (1+5e-2)
gcvPrev <- predict.cv.glmnet(glmnet_fit,newx = W,s = s_prev,type="response")
gcvPrev <- bound(gcvPrev,c(0.025,0.975))

time_ctmlelasso3 <- system.time(
      ctmle_fit3 <- ctmleGlmnet(Y = Y, A = A, W = W, Q = Q,
                                ctmletype=3, g1W = gcv, g1WPrev = gcvPrev,
                                family="gaussian",
                                gbound=0.025, V = 5)
)

Les't compare the running time for each LASSO-C-TMLE

time_ctmlelasso1
#>    user  system elapsed 
#>  15.005   0.104  15.266
time_ctmlelasso2
#>    user  system elapsed 
#>  18.351   0.083  18.528
time_ctmlelasso3
#>    user  system elapsed 
#>   0.005   0.000   0.006

Finally, we compare three C-TMLE estimates:

ctmle_fit1
#> C-TMLE result:
#>  parameter estimate:  2.20368 
#>  estimated variance:  0.09796 
#>             p-value:  1.9124e-12 
#>   95% conf interval: (1.59022, 2.81714)
ctmle_fit2
#> C-TMLE result:
#>  parameter estimate:  2.16669 
#>  estimated variance:  0.05327 
#>             p-value:  <2e-16 
#>   95% conf interval: (1.71429, 2.61908)
ctmle_fit3
#> C-TMLE result:
#>  parameter estimate:  2.02388 
#>  estimated variance:  0.04972 
#>             p-value:  <2e-16 
#>   95% conf interval: (1.58684, 2.46093)

Show which regularization parameter (lambda) is selected by C-TMLE1:

lambdas[ctmle_fit1$best_k]
#> [1] 0.004409285

In comparison, we show which regularization parameter (lambda) is selected by cv.glmnet:

glmnet_fit$lambda.min
#> [1] 0.03065303

Ctmle

Install / Use

README

Collaborative Targeted Maximum Likelihood Estimation

Installation

C-TMLE for variable selection

LASSO-C-TMLE for model selection of LASSO

Advanced topic: the general template of C-TMLE