PCAtools: everything Principal Component Analysis

Kevin Blighe, Aaron Lun 2021-07-23

Introduction

Principal Component Analysis (PCA) is a very powerful technique that has wide applicability in data science, bioinformatics, and further afield. It was initially developed to analyse large volumes of data in order to tease out the differences/relationships between the logical entities being analysed. It extracts the fundamental structure of the data without the need to build any model to represent it. This ‘summary’ of the data is arrived at through a process of reduction that can transform the large number of variables into a lesser number that are uncorrelated (i.e. the ‘principal components’), while at the same time being capable of easy interpretation on the original data (Blighe and Lun 2019) (Blighe 2013).

PCAtools provides functions for data exploration via PCA, and allows the user to generate publication-ready figures. PCA is performed via BiocSingular (Lun 2019) - users can also identify optimal number of principal components via different metrics, such as elbow method and Horn’s parallel analysis (Horn 1965) (Buja and Eyuboglu 1992), which has relevance for data reduction in single-cell RNA-seq (scRNA-seq) and high dimensional mass cytometry data.

Installation

1. Download the package from Bioconductor

    if (!requireNamespace('BiocManager', quietly = TRUE))
        install.packages('BiocManager')

    BiocManager::install('PCAtools')

Note: to install development version direct from GitHub:

    if (!requireNamespace('devtools', quietly = TRUE))
        install.packages('devtools')

    devtools::install_github('kevinblighe/PCAtools')

2. Load the package into R session

    library(PCAtools)

Quick start: DESeq2

For this example, we will follow the tutorial (from Section 3.1) of RNA-seq workflow: gene-level exploratory analysis and differential expression. Specifically, we will load the ‘airway’ data, where different airway smooth muscle cells were treated with dexamethasone.

  library(airway)
  library(magrittr)

  data('airway')
  airway$dex %<>% relevel('untrt')

Annotate the Ensembl gene IDs to gene symbols:

  ens <- rownames(airway)

  library(org.Hs.eg.db)
  symbols <- mapIds(org.Hs.eg.db, keys = ens,
    column = c('SYMBOL'), keytype = 'ENSEMBL')
  symbols <- symbols[!is.na(symbols)]
  symbols <- symbols[match(rownames(airway), names(symbols))]
  rownames(airway) <- symbols
  keep <- !is.na(rownames(airway))
  airway <- airway[keep,]

Normalise the data and transform the normalised counts to variance-stabilised expression levels:

  library('DESeq2')

  dds <- DESeqDataSet(airway, design = ~ cell + dex)
  dds <- DESeq(dds)
  vst <- assay(vst(dds))

Conduct principal component analysis (PCA):

  p <- pca(vst, metadata = colData(airway), removeVar = 0.1)

## -- removing the lower 10% of variables based on variance

A scree plot

  screeplot(p, axisLabSize = 18, titleLabSize = 22)

Figure 1: A scree plot

A bi-plot

Different interpretations of the biplot exist. In the OMICs era, for most general users, a biplot is a simple representation of samples in a 2-dimensional space, usually focusing on just the first two PCs:

  biplot(p)

However, the original definition of a biplot by Gabriel KR (Gabriel 1971) is a plot that plots both variables and observations (samples) in the same space. The variables are indicated by arrows drawn from the origin, which indicate their ‘weight’ in different directions. We touch on this later via the plotLoadings function.

  biplot(p, showLoadings = TRUE,
    labSize = 5, pointSize = 5, sizeLoadingsNames = 5)

Figure 2: A bi-plot

Quick start: Gene Expression Omnibus (GEO)

Here, we will instead start with data from Gene Expression Omnibus. We will load breast cancer gene expression data with recurrence free survival (RFS) from Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis.

First, let’s read in and prepare the data:

  library(Biobase)
  library(GEOquery)

  # load series and platform data from GEO
    gset <- getGEO('GSE2990', GSEMatrix = TRUE, getGPL = FALSE)
    mat <- exprs(gset[[1]])

  # remove Affymetrix control probes
    mat <- mat[-grep('^AFFX', rownames(mat)),]

  # extract information of interest from the phenotype data (pdata)
   idx <- which(colnames(pData(gset[[1]])) %in%
      c('relation', 'age:ch1', 'distant rfs:ch1', 'er:ch1',
        'ggi:ch1', 'grade:ch1', 'size:ch1',
        'time rfs:ch1'))
    metadata <- data.frame(pData(gset[[1]])[,idx],
      row.names = rownames(pData(gset[[1]])))

  # tidy column names
    colnames(metadata) <- c('Study', 'Age', 'Distant.RFS', 'ER', 'GGI', 'Grade',
      'Size', 'Time.RFS')

  # prepare certain phenotypes of interest
    metadata$Study <- gsub('Reanalyzed by: ', '', as.character(metadata$Study))
    metadata$Age <- as.numeric(gsub('^KJ', NA, as.character(metadata$Age)))
    metadata$Distant.RFS <- factor(metadata$Distant.RFS,
      levels = c(0,1))
    metadata$ER <- factor(gsub('\\?', NA, as.character(metadata$ER)),
      levels = c(0,1))
    metadata$ER <- factor(ifelse(metadata$ER == 1, 'ER+', 'ER-'),
      levels = c('ER-', 'ER+'))
    metadata$GGI <- as.numeric(as.character(metadata$GGI))
    metadata$Grade <- factor(gsub('\\?', NA, as.character(metadata$Grade)),
      levels = c(1,2,3))
    metadata$Grade <- gsub(1, 'Grade 1', gsub(2, 'Grade 2', gsub(3, 'Grade 3', metadata$Grade)))
    metadata$Grade <- factor(metadata$Grade, levels = c('Grade 1', 'Grade 2', 'Grade 3'))
    metadata$Size <- as.numeric(as.character(metadata$Size))
    metadata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', NA, metadata$Time.RFS))

  # remove samples from the pdata that have any NA value
    discard <- apply(metadata, 1, function(x) any(is.na(x)))
    metadata <- metadata[!discard,]

  # filter the expression data to match the samples in our pdata
    mat <- mat[,which(colnames(mat) %in% rownames(metadata))]

  # check that sample names match exactly between pdata and expression data 
    all(colnames(mat) == rownames(metadata))

## [1] TRUE

Conduct principal component analysis (PCA):

  p <- pca(mat, metadata = metadata, removeVar = 0.1)

## -- removing the lower 10% of variables based on variance

A bi-plot

  biplot(p)

  biplot(p, showLoadings = TRUE, lab = NULL)

Figure 3: A bi-plot

One of the probes pointing downward is 205225_at, which targets the ESR1 gene. This is already a useful validation, as the oestrogen receptor, which is in part encoded by ESR1, is strongly represented by PC2 (y-axis), with negative-to-positive receptor status going from top-to-bottom.

A pairs plot

  pairsplot(p)

Figure 4: A pairs plot

A loadings plot

If the biplot was previously generated with showLoadings = TRUE, check how this loadings plot corresponds to the biplot loadings - they should match up for the top hits.

  plotloadings(p, labSize = 3)

## -- variables retained:

## 215281_x_at, 214464_at, 211122_s_at, 210163_at, 204533_at, 205225_at, 209351_at, 205044_at, 202037_s_at, 204540_at, 215176_x_at, 214768_x_at, 212671_s_at, 219415_at, 37892_at, 208650_s_at, 206754_s_at, 205358_at, 205380_at, 205825_at

## Warning: ggrepel: 17 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Figure 5: A loadings plot

An eigencor plot

  eigencorplot(p,
    metavars = c('Study','Age','Distant.RFS','ER',
      'GGI','Grade','Size','Time.RFS'))

Figure 6: An eigencor plot

Access the internal data

The rotated data that represents the observations / samples is stored in rotated, while the variable loadings are stored in loadings

  p$rotated[1:5,1:5]

##                PC1        PC2        PC3        PC4       PC5
## GSM65752 -30.24272  43.826310   3.781677 -39.536149 18.612835
## GSM65753 -37.73436 -15.464421  -4.913100  -5.877623  9.060108
## GSM65755 -29.95155   7.788280 -22.980076 -15.222649 23.123766
## GSM65757 -33.73509   1.261410 -22.834375   2.494554 13.629207
## GSM65758 -40.95958  -8.588458   4.995440  14.340150  0.417101

  p$loadings[1:5,1:5]

##                     PC1         PC2          PC3        PC4           PC5
## 206378_at -0.0024336244 -0.05312797 -0.004809456 0.04045087  0.0096616577
## 205916_at -0.0051057533  0.00122765 -0.010593760 0.04023264  0.0285972617
## 206799_at  0.0005723191 -0.05048096 -0.009992964 0.02568142  0.0024626261
## 205242_at  0.0129147329  0.02867789  0.007220832 0.04424070 -0.0006138609
## 206509_at  0.0019058729 -0.05447596 -0.004979062 0.01510060 -0.0026213610

Advanced features

All functions in PCAtools are highly configurable and should cover virtually all basic and advanced user requirements. The following sections take a look at some of these advanced features, and form a somewhat practical example of how one can use PCAtools to make a clinical interpretation of data.

First,

PCAtools

Install / Use

README

PCAtools: everything Principal Component Analysis

Introduction

Installation

1. Download the package from Bioconductor

2. Load the package into R session

Quick start: DESeq2

Conduct principal component analysis (PCA):

A scree plot

A bi-plot

Quick start: Gene Expression Omnibus (GEO)

A bi-plot

A pairs plot

A loadings plot

An eigencor plot

Access the internal data

Advanced features