sigclust2

Introduction
Installing
Testing
Plotting
References
Session Information

<a name="intro"></a> Introduction

This package may be used to assess statistical significance in hierarchical clustering. To assess significance in high-dimensional data, the approach assumes that a cluster may be well approximated by a single Gaussian (normal) distribution. Given the results of hierarchical clustering, the approach sequentially tests from the root node whether the data at each split/join correspond to one or more Gaussian distributions. The hypothesis test performed at each node is based on a Monte Carlo simulation procedure, and the family-wise error rate (FWER) is controlled across the dendrogram using a sequential testing procedure.

An illustration of the basic usage of the package's testing procedure is provided in the Testing section. Variations on the basic testing procedure are described in the associated subsections. Basic plotting procedures are described in the Plotting section.

<a name="install"></a> Installing

To install sigclust2, which depends on packages in both CRAN and Bioconductor, use the following call to the BiocManager package:

R> BiocManager::install("pkimes/sigclust2")

The package can then be loaded using the standard call to library.

suppressPackageStartupMessages(library("sigclust2"))

While not necessary, installing the Rclusterpp package for faster clustering is recommended when using sigclust2. Unforunately, the Rclusterpp package is no longer available on CRAN. However, Rclusterpp can still be installed from the author's GitHub repo with the following command from the devtools package (or equivalently with a call to BiocManager::install again).

R> devtools::install_github("nolanlab/Rclusterpp")

<a name="test"></a> Testing

For the following examples, we will use a simple toy example with 150 samples (n) with 100 measurements (p). The data are simulated from three Gaussian (normal) distributions.

set.seed(1508)
n1 <- 60; n2 <- 40; n3 <- 50; n <- n1 + n2 + n3
p <- 100
data <- matrix(rnorm(n*p), nrow=n, ncol=p)
data[, 1] <- data[, 1] + c(rep(2, n1), rep(-2, n2), rep(0, n3))
data[, 2] <- data[, 2] + c(rep(0, n1+n2), rep(sqrt(3)*3, n3))

The separation of the three underlying distributions can be observed from a PCA (principal components analysis) scatterplot. While the separation is clear in the first 2 PCs, recall that the data actually exists in 100 dimensions.

data_pc <- prcomp(data)
par(mfrow=c(1, 2))
plot(data_pc$x[, 2], data_pc$x[, 1], xlab="PC2", ylab="PC1")
plot(data_pc$x[, 3], data_pc$x[, 1], xlab="PC3", ylab="PC1")

plot of chunk unnamed-chunk-4

The SHC testing procedure is performed using the shc function. The function requires the following three arguments:

x: the data as a matrix with samples in rows,
metric: the dissimilarity metric, and
linkage: the linkage function to be used for hierarchical clustering.

For reasons outlined in the corresponding paper (Kimes et al. 2017) relating to how the method handles testing when n << p, we recommmend using "euclidean" as the metric, and any of "ward.D2", "single", "average", "complete" as the linkage. If a custom dissimilarity metric is desired, either of vecmet or matmet should be specified, as described later in this section.

If metric functions which do not statisfy rotation invariance are desired, e.g. one minus Pearson correlation ("cor") or L1 ("manhattan"), null_alg = "2means" and ci = "2CI" should be specified. The null_alg and ci parameters specify the algorithm for clustering and measure of "cluster strength" used to generate the null distribution for assessing significance. Since the K-means algorithm (2means) optimizes the 2-means CI (2CI), the resulting p-value will be conservative. However, since the hierarchical algorithm is not rotation invariant, using null_alg = "hclust" or ci = "linkage" produces unreliable results. An example for testing using Pearson correlation is given later in this section.

For now, we just use the recommended and default parameters.

shc_result <- shc(data, metric="euclidean", linkage="ward.D2")

The output is a S3 object of class shc, and a brief description of the analysis results can be obtained by the summary function.

summary(shc_result)

## 
## shc object created using shc(..)
## --------------------------------
## Clustering Parameters:
##     dissimilarity = euclidean
##     linkage = ward.D2
## Testing Parameters:
##     n_sim = 100
##     icovest = 1
##     ci = 2CI
##     null_alg = hclust
##     n_min = 10
##     FWER control = FALSE

The analysis output can be accessed using the $ accessor. More details on the different entries can be found in the documentation for the shc function.

names(shc_result)

##  [1] "in_mat"     "in_args"    "eigval_dat" "eigval_sim" "backvar"   
##  [6] "nd_type"    "ci_dat"     "ci_sim"     "p_emp"      "p_norm"    
## [11] "idx_hc"     "hc_dat"

The computed p-values are probably of greatest interest. Two p-values are computed as part of the SHC testing procedure: (1) an empirical p-value (p_emp), and (2) a Gaussian approximate p-value (p_norm). The p-values are computed based on comparing the observed strength of clustering in the data against the expected strength of clustering under the null hypothesis that the data from a single cluster. The null distribution is approximated using a specified number of simulated datasets (n_sim = 100 default argument). p_emp is the empirical p-value computed from the collection of simulated null datasets. p_norm is an approximation to the empirical p-value which provides more continuous p-values. nd_type stores the results of the test and takes values in: n_small, no_test, sig, not_sig, cutoff_skipped. With the default implementation of shc using no FWER control, all nodes are either cutoff_skipped or n_small.

The p-values are reported for each of 149 (n-1) nodes along the hierarchical dendrogram. The entries of p_emp and p_norm are ordered descending from the top of the dendrogram, with the first entry corresponding to the very top (root) node of the tree.

data.frame(result = head(shc_result$nd_type, 5),
           round(head(shc_result$p_norm, 5), 5),
           round(head(shc_result$p_emp, 5), 5))

##           result hclust_2CI hclust_2CI.1
## 1 cutoff_skipped    0.00000         0.00
## 2 cutoff_skipped    0.41475         0.37
## 3 cutoff_skipped    0.88019         0.90
## 4 cutoff_skipped    0.84834         0.85
## 5 cutoff_skipped    0.86693         0.88

In addition to values between 0 and 1, some p-values are reported as 2. These values correspond to nodes which were not tested, either because of the implemented family-wise error rate (FWER) controlling procedure (alpha) or the minimum tree size for testing (min_n).

Variations on the standard testing procedure are possible by changing the default parameters of the call to shc(..).

<a name="newmetric"></a>Explicitly specifying a dissimilarity function

The method also supports specifying your own metric function through the vecmet and matmet parameters. Only one of vecmet and matmet should be specified. If either is specified, the metric parameter will be ignored. The vecmet parameter should be passed a function which takes two vectors as input and returns the dissimilarity between the two vectors. The matmet parameter should be passed a function which takes a matrix as input and returns a dist object of dissimilarities of the matrix rows.

The vecmet example is not actually run in this tutorial since it is incredibliy computationally expensive. Internally, the function passed to vecmet is wrapped in the following call to outer to compute dissimilarities between all rows of a matrix.

as.dist(outer(split(x, row(x)), split(x, row(x)), Vectorize(vecmet)))

The following simple benchmarking example with cor illustrates the overhead for using outer to call on a vector function rather than using an optimized matrix dissimilarity function.

vfun <- function(x, y) {1 - cor(x, y)}
mfun1 <- function(x) {
    as.dist(outer(split(x, row(x)), split(x, row(x)),
                  Vectorize(vfun)))
}
mfun2 <- function(x) { as.dist(1 - cor(t(x))) }

system.time(mfun1(data))

##    user  system elapsed 
##   1.884   0.053   1.997

system.time(mfun2(data))

##    user  system elapsed 
##   0.001   0.000   0.002

The first matrix correlation function, mfun1, is written it would be processed if vfun were passed to shc as vecmet. The second funtion, mfun2, is a function that could be passed to matmet. The performance difference is clearly significant.

When specifying a custom dissimilarity function for shc, it is important to remember that the function must be used to compute dissimilarity matrices n_sim times for each node. In our toy example where n_sim = 100 and n = 150, this means calling on the dissimilarity function >10,000 times.

Our custom function, mfun2 can be passed to shc through the matmet parameter.

Sigclust2

Install / Use

README

sigclust2

Contents

<a name="intro"></a> Introduction

<a name="install"></a> Installing

<a name="test"></a> Testing

<a name="newmetric"></a>Explicitly specifying a dissimilarity function