SPONGE
Sparse Partial correlation ON Gene Expression - an R package for fast and robust ceRNA network inference
Install / Use
/learn @daisybio/SPONGEREADME
title: "Sparse Partial correlation ON Gene Expression with SPONGE"
author: "Markus List, Azim Dehghani Amirabad, Dennis Kostka, Marcel H. Schulz"
date: "r Sys.Date()"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{SPONGE vignette}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
Purpose
<div style="float:right;"><img src="https://raw.githubusercontent.com/mlist/SPONGE/master/vignettes/sponge_logo.png" alt="SPONGE logo" style="width: 200px; display: block; margin-left: auto; margin-right: auto;"/></div>SPONGE is the first method to solve the computationally demanding task of reporting significant ceRNA interactions efficiently on a genome-wide scale. Beyond ceRNAs, this method is well suited to infer other types of regulatory interactions such as transcription factor regulation.
Introduction
MicroRNAs (miRNAs) are small 19-22 nucleotide long molecules that facilitate the degradation of messenger RNA (mRNA) transcripts targeted via matching seed sequences. The competing endogenous RNA (ceRNA) hypothesis suggests that mRNAs that possess binding sites for the same miRNAs are in competition. This motivates the existence of so-called sponges, i.e., genes that exert strong regulatory control via miRNA binding in a ceRNA interaction network. It is currently an unsolved problem how to estimate miRNA-mediated ceRNA interactions genome-wide. The most widely used approach considers miRNA and mRNA expression jointly measured for the same cell state. Several partial association methods have been proposed for determining ceRNA interaction strength using conditional mutual information or partial correlation, for instance.
However, we identified three key limitations of existing approaches that prevent the construction of an accurate genome-wide ceRNA interaction network: (i) none of the existing methods considers the combinatorial effect of several miRNAs; (ii) due to the computational demand, the inference of a ceRNA interaction for all putative gene-miRNA-gene interactions in the human genome is prohibitive; (iii) an efficient strategy for determining the significance of inferred ceRNA interactions is missing, and thus important parameters of the system are neglected.
To overcome these challenges, we developed a novel method called Sparse partial correlation on gene expression (SPONGE). We reduce the computational complexity of constructing a genome-wide ceRNA interaction network in several steps. First, we consider only miRNA-gene interactions that are either predicted or experimentally validated. Second, we retain only miRNA-gene interactions that have a negative coefficient in a regularized regression model. Third, instead of each gene-miRNA-gene triplet, we compute a single sensitivity correlation (correlation - partial correlation) for each gene-gene pair given all shared miRNAs that pass the above filter as putative regulators. Finally, we derived the first mathematical formulation to simulate the null distribution of the process for different parameters of the system: number of miRNAs, correlation between genes and sample number. Our formulation enables the computation of empirical p-values for the statistic in a very efficient manner, an order of magnitude faster than previous methods. Analyses revealed that previous studies have underestimated the effect of these parameters in their network inference. Network centrality measures can be applied to SPONGE inferred ceRNA networks to reveal known and novel sponges, many of which are potential biomarkers.
Further details demonstrating how SPONGE improves over the state of the art and how SPONGE inferred ceRNA networks can be used for biomarker discovery will be available in our paper (manuscript submitted, link will be included here at a later point).
General Workflow
<img src="https://raw.githubusercontent.com/mlist/SPONGE/master/vignettes/overview.png" alt="SPONGE workflow" style="width: 600px; display: block; margin-left: auto; margin-right: auto;"/>Overview of the SPONGE workflow. (A) Predicted and/or experimentally validated gene-miRNA interactions are subjected to regularized regression on gene and miRNA expression data. Interactions with negative coefficients for miRNA regulators are retained since they indicate miRNA induced inhibition of gene expression. (B) We compute sensitivity correlation coefficients for gene pairs based on shared miRNAs identified in (A). (C) Given the sample number, we compute empirical null models for various gene-gene correlation coefficients (k, columns) and number of miRNAs (m, rows). Sensitivity correlation coefficients are assigned to the best matching null model and a p-value is inferred. (D) After multiple testing correction, significant ceRNA interactions can be used to construct a genome-wide, disease or tissue-specific ceRNA interaction network. In the following hands-on tutorial, we will highlight how each of these steps can be achieved with the SPONGE R package.
(A) gene-miRNA interactions
We start with loading the package and its dependencies
library(SPONGE)
SPONGE comes with a very small example gene and miRNA expression dataset useful for illustrating functionality. We suggest obtaining larger expression data sets via the GEOquery or TCGAbiolinks R packages, for instance. After loading the package, the example datasets can be accessed:
Gene expression:
head(gene_expr)
knitr::kable(gene_expr[1:5,1:8])
miRNA expression:
head(mir_expr)
knitr::kable(mir_expr[1:5,1:5])
Note that the expected format for both expression matrices is that columns correspond to genes / miRNAs and rows correspond to samples. Please make sure that samples are ordered identically in both matrices and that each sample has an entry in both matrices (paired gene and miRNA expression data).
SPONGE uses a two-tier filtering approach to identify gene-miRNA interactions.
First, the user may incorporate prior knowledge to narrow down the number of gene-miRNA interactions. This could be, for instance, the output of a sequence-based miRNA target prediction software such as TargetScan or miRcode or experimental evidence obtained from databases such as miRTarBase or LncBase. The user is free to use his or her trusted resource(s) and only needs to make sure that the miRNA-gene interactions are formatted as SPONGE expects it:
SPONGE allows for an arbitrary number of miRNA-gene interaction matrices $I$ that need to be formatted with genes $G$ in rows and miRNAs $M$ in columns. For each element $i_{g, m} \in I$, we consider the miRNA-gene interaction if $i_{g, m} > 0$. While SPONGE only checks if an entry is zero or not, this approach allows for additional information such as the respective number of binding sites or a score to be encoded in the matrix for later analysis. We included examples for TargetScan:
head(targetscan_symbol)
knitr::kable(targetscan_symbol[1:5,1:5])
Note that the gene and miRNA identifiers in the interaction matrices need to be of the same type and format as the ones used in the expression matrices. Otherwise they can not be matched by SPONGE.
SPONGE combines user-provided matrices as the ones shown above by summing up their entries and subsequently identifies gene-miRNA interaction candidates with non-zero entries. In the second step, we use sparse regularized regression with the gene expression as dependent and miRNA expression as explanatory variables. More specifically, we use elasticnet (via the R package glmnet) to balance lasso and ridge regression. Lasso drives coefficients of miRNAs with negligible influence on gene expression to zero, while ridge regression prevents us from kicking out correlated miRNAs that have similar influence. Elasticnet has two parameters, $\alpha$ is used to balance the influence of lasso and ridge regression. $\lambda$ is the regularization parameter. We select the optimal parameters for elastic net as follows: For each $\alpha = (0.1, 0.2, ..., 1.0)$ we perform 10x cross validation to obtain the optimal parameter $\lambda$. The best $\alpha$ is then selected such that the residual sum of squares error is minimal. We are only interested in miRNAs that repress gene expression. Hence, for each gene, we retain only those miRNAs as interaction partners that have a negative coefficient. Instead of checking if the coefficients are smaller than zero, SPONGE also allows for a more stringent threshold to be set. The default is -0.05. Alternatively, SPONGE offers the possibility to perform an F-test to identify important miRNAs. See documentation for details.
We apply step A of the SPONGE workflow as follows:
genes_miRNA_candidates <- sponge_gene_miRNA_interaction_filter(
gene_expr = gene_expr,
mir_expr = mir_expr,
mir_predicted_targets = targetscan_symbol)
The result is a list of genes, where each entry holds a data frame with retained miRNAs and their coefficients (or p-values in case the F-test is used). Note that the mir_predicted_targets parameter may also be NULL. In this case, no prior knowledge is used and all putative miRNA-gene interactions are used in elasticnet regression.
genes_miRNA_candidates[1:2]
(B) ceRNA interactions
In the second step of the SPONGE workflow, we identify ceRNA actions candidates. SPONGE uses the information from (A) and checks for each pair of genes if they share one or more miRNAs with regulatory effect. This reduce
