<!-- README.md is generated from README.Rmd. Please edit that file -->

ck37r


My R toolkit for organizing analysis projects, cleaning data for machine learning, parallelizing code for multiple cores or in a SLURM cluster, and extended functionality for SuperLearner and TMLE. Some of the SuperLearner functions may eventually be migrated into the SuperLearner package.

Installation

Install the latest release from CRAN:

install.packages("ck37r") 

Install the development version from GitHub (recommended):

# install.packages("remotes")
remotes::install_github("ck37/ck37r")

Functions

  • Project Utilities
    • import_csvs - import all CSV files in a given directory.
    • load_all_code - source() all R files in a given directory.
    • load_packages - load a list of packages; for the ones that fail it can attempt to install them automatically from CRAN, then load them again.
  • Machine Learning
    • categoricals_to_factors - convert numeric categoricals into factors.
    • factors_to_indicators - convert all factors in a dataframe to series of indicators (one-hot encoding).
    • impute_missing_values - impute missing values in a dataframe (median for numerics and mode for factors, or via GLRM or k-nearest neighbors) and add missingness indicators.
    • missingness_indicators - return a matrix of missingness indicators for a dataframe, (optionally) omitting any constant or collinear columns.
    • rf_count_terminal_nodes - count the number of terminal nodes in each tree in a random forest. That information can then be used to grid-search the maximum number of nodes allowed in a Random Forest (along with mtry).
    • standardize - standardize a dataset (center, scale), optionally omitting certain variables.
    • vim_corr - rudimentary variable importance based on correlation with an outcome.
  • Parallelization
    • parallelize - starts a multicore or multinode parallel cluster. Automatically detects parallel nodes in a SLURM environment, so code works seamlessly on a laptop or a cluster.
    • stop_cluster - stops a cluster started by parallelize().
  • SuperLearner
    • auc_table - table of cross-validated AUCs for each learner in an ensemble, including SE, CI, and p-value. Supports SuperLearner and CV.SuperLearner objects.
    • gen_superlearner - create a SuperLearner and CV.SuperLearner function setup to transparently use a certain parallelization configuration.
    • cvsl_weights - table of the meta-weight distribution for each learner in a CV.SuperLearner analysis.
    • cvsl_auc - cross-validated AUC for a CV.SuperLearner analysis.
    • plot_roc - ROC plot with AUC and CI for a SuperLearner or CV.SuperLearner object.
    • plot.SuperLearner - plot risk estimates and CIs for a SuperLearner, similar to the CV.SuperLearner plot but without the SL or Discrete SL rows.
    • prauc_table - table of cross-validated PR-AUCs for each learner in an ensemble, including SE and CI. Supports SuperLearner and CV.SuperLearner objects.
    • sl_stderr - calculate standard error for each learner’s risk in SL.
    • SL.h2o_auto() - wrapper for h2o’s automatic machine learning system, to be added to SuperLearner.
    • SL.bartMachine2() - wrapper for bartMachine, to be added to SuperLearner.
  • TMLE
    • tmle_parallel - allows the SuperLearner estimation in TMLE to be customized, esp. to support parallel estimation via mcSuperLearner and snowSuperLearner.
    • setup_parallel_tmle - helper function to start a cluster and setup SuperLearner and tmle_parallel to use the created cluster.
  • h2o
    • h2o_init_multinode() - function to start an h2o cluster on multiple nodes from within R, intended for use on SLURM or other multi-node clusters.
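As a quick orientation before the worked examples, here is a minimal sketch combining a few of the utilities above. It follows the function descriptions in this list; the exact arguments and return values are assumptions, so check each function's help page (e.g. ?parallelize) for the full signatures.

```r
library(ck37r)

# Load these packages, attempting to install any that are missing from CRAN.
load_packages(c("ggplot2", "dplyr"), auto_install = TRUE)

# Start a multicore cluster (or a multinode one under SLURM); per the list
# above, stop_cluster() accepts the cluster that parallelize() starts.
cl = parallelize()

# ... run parallel code here ...

# Shut the cluster down when finished.
stop_cluster(cl)
```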

Examples

Impute missing values

# Load a test dataset.
# TODO: need to switch to a different dataset.
data(PimaIndiansDiabetes2, package = "mlbench")

# Check for missing values.
colSums(is.na(PimaIndiansDiabetes2))
#> pregnant  glucose pressure  triceps  insulin     mass pedigree      age 
#>        0        5       35      227      374       11        0        0 
#> diabetes 
#>        0

# Impute missing data and add missingness indicators.
# Don't impute the outcome though.
result = impute_missing_values(PimaIndiansDiabetes2, skip_vars = "diabetes")

# Confirm we have no missing data.
colSums(is.na(result$data))
#>      pregnant       glucose      pressure       triceps       insulin 
#>             0             0             0             0             0 
#>          mass      pedigree           age      diabetes  miss_glucose 
#>             0             0             0             0             0 
#> miss_pressure  miss_triceps  miss_insulin     miss_mass 
#>             0             0             0             0
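The returned list also records the value used to impute each column. Assuming the element is named impute_values (check ?impute_missing_values if your version differs), it can be inspected directly:

```r
# Values used for each imputed column (median for numerics, mode for factors).
result$impute_values
```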

Impute with GLRM

This example uses the default hyperparameters; in practice they should be tuned.

#############
# Generalized low-rank model imputation via h2o.
result2 = impute_missing_values(PimaIndiansDiabetes2, type = "glrm", skip_vars = "diabetes")

# Confirm we have no missing data.
colSums(is.na(result2$data))
#>      pregnant       glucose      pressure       triceps       insulin 
#>             0             0             0             0             0 
#>          mass      pedigree           age      diabetes  miss_glucose 
#>             0             0             0             0             0 
#> miss_pressure  miss_triceps  miss_insulin     miss_mass 
#>             0             0             0             0

Load packages

This loads a vector of packages, automatically installing any packages that aren’t already installed.

# Load these 4 packages and install them if necessary.
load_packages(c("MASS", "SuperLearner", "tmle", "doParallel"), auto_install = TRUE)
#> Loaded gam 1.20
#> Super Learner
#> Version: 2.0-28
#> Package created on 2021-05-04
#> Loaded glmnet 4.1-3
#> Welcome to the tmle package, version 1.5.0-1.1
#> 
#> Major changes since v1.3.x. Use tmleNews() to see details on changes and bug fixes

Random Forest: count terminal nodes

We first estimate one standard Random Forest and examine how many terminal nodes are in each decision tree. We take the maximum of that as the most data-adaptive Random Forest in terms of decision tree size, then compare to Random Forests that are restricted to smaller decision trees. This allows the SuperLearner to explore under- vs. over-fitting for a Random Forest. See Segal (2004) and Segal & Xiao (2011) for details on overfitting in Random Forests.

library(SuperLearner)
library(ck37r)

data(Boston, package = "MASS")

set.seed(1)
(sl = SuperLearner(Boston$medv, subset(Boston, select = -medv), family = gaussian(),
                  cvControl = list(V = 3L),
                  SL.library = c("SL.mean", "SL.glm", "SL.randomForest")))
#> Loading required namespace: randomForest
#> 
#> Call:  
#> SuperLearner(Y = Boston$medv, X = subset(Boston, select = -medv), family = gaussian(),  
#>     SL.library = c("SL.mean", "SL.glm", "SL.randomForest"), cvControl = list(V = 3L)) 
#> 
#> 
#> 
#>                         Risk     Coef
#> SL.mean_All         85.22648 0.000000
#> SL.glm_All          25.31448 0.063065
#> SL.randomForest_All 12.89105 0.936935

summary(rf_count_terminal_nodes(sl$fitLibrary$SL.randomForest_All$object))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   133.0   163.0   167.0   166.6   171.0   187.0

(max_terminal_nodes = max(rf_count_terminal_nodes(sl$fitLibrary$SL.randomForest_All$object)))
#> [1] 187

# Now run create.Learner() based on that maximum.

# It is often handy to convert a hyperparameter to log scale before testing a ~linear grid.
# NOTE: -0.7 ~ log(0.5); a multiplier of 0.5 on log(max) yields sqrt(max).
(maxnode_seq = unique(round(exp(log(max_terminal_nodes) * exp(c(-0.6, -0.35, -0.15, 0))))))
#> [1]  18  40  90 187

rf = create.Learner("SL.randomForest", detailed_names = TRUE,
                    name_prefix = "rf",
                    params = list(ntree = 100L), # fewer trees for testing speed only.
                    tune = list(maxnodes = maxnode_seq))

# We see that an RF with simpler decision trees performs better than the default.
(sl = SuperLearner(Boston$medv, subset(Boston, select = -medv), family = gaussian(),
                  cvControl = list(V = 3L),
                  SL.library = c("SL.mean", "SL.glm", rf$names)))
#> 
#> Call:  
#> SuperLearner(Y = Boston$medv, X = subset(Boston, select = -medv), family = gaussian(),  
#>     SL.library = c("SL.mean", "SL.glm", rf$names), cvControl = list(V = 3L)) 
#> 
#> 
#>                 Risk      Coef
#> SL.mean_All 84.50244 0.0000000
#> SL.glm_All  24.49011 0.0000000
#> rf_18_All   13.47510 0.0000000
#> rf_40_All   11.82894 0.0000000
#> rf_90_All   11.09016 0.5311252
#> rf_187_All  11.08278 0.4688748

Parallel TMLE

library(ck37r)
library(tmle)

# Use multiple cores as available.
ck37r::setup_parallel_tmle()

# Basic SL library.
sl_lib = c("SL.mean", "SL.rpart", "SL.glmnet")

# Set a parallel-compatible seed so cross-validation folds are reproducible.
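A hedged sketch of how this example could continue, assuming a binary outcome Y, treatment A, and covariate dataframe W already exist in the workspace; the tmle_parallel() argument names are assumed to mirror tmle::tmle() and should be verified against ?tmle_parallel.

```r
# Parallel-safe RNG stream so results replicate across workers.
set.seed(1, kind = "L'Ecuyer-CMRG")

# Hypothetical call -- Y, A, and W must be defined beforehand; the argument
# names follow tmle::tmle() and are an assumption here.
result = tmle_parallel(Y = Y, A = A, W = W,
                       Q.SL.library = sl_lib,
                       g.SL.library = sl_lib,
                       family = "binomial")
result
```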