GSECA

Gene Set Enrichment Class Analysis for heterogeneous RNA sequencing data

Generate Convert Improve

Install / Use

/learn @ceredamatteo-lab/GSECA

About this skill

Quality Score

0/100

README

General Notes and How to cite GSECA

GSECA is a R application with a Shiny web interface. The stand-alone version can be run from R GUI or from UNIX shell. Examples on how to use GSECA are provided below.

For further details please refer to our paper in Nucleic Acid Research here

GSECA description

GSECA is an R software that implements Gene Set Enrichment Class Analysis to detect deregulated biological processes in heterogeneous dataset using RNA sequencing experiments. Given two cohorts ( Cases / Controls ), the algorithm follows three main steps:

1. Finite Mixture Modeling (FMM): for each sample, GSECA computes the distribution of gene expression levels and fits it with a two-components Gaussian Mixture Model, using the Expectation-Maximization algorithm.
2. Data Discretization (DD): once the sample-specific parameters of the FMM are estimated, GSECA implements a supervised data discretization approach, identifying 7 Expression Classes (EC). Each gene is assigned to one of these EC, accordingly to its expression level.
3. Statistical Framework: discretized data are then evaluated in a statistical framework to detect altered biological processesbetween the two cohorts. For each tested gene set, the cumulative proportions of genes in each EC are compared by Fisher's Exact test. To quantify the degree of perturbation across the ECs for all gene sets, the algorithm combines the significance level of each comparison into an enrichment score ES. To reduce false positive discoveries and correct for different sample size of the cohorts, two (optional) bootstrapping procedures are implemented, which measure the empirical P value and the success rate (SR) of each ES.

GSECA provides to the user a graphical overview of the variation of expression of each gene set across the seven classess between the two cohorts. The deregulated gene sets are visualized as EC map (see below image). The EC maps display the difference of the cumulative proportion of genes of a gene set in the seven ECs between the two cohorts as triangles, whose size is proportional to such difference. Furthermore, the upper and the lower vertex of the triangles represent enrichment and depletion in the cohort A as compared to B, respectively. Finally, GSECA orders AGSs accordingly to their ES (SR and emprical P-values, if calculated), thus obtaining the list of the most altered processes in the phenotype of interest.

For further details check our paper in Nucleic Acid Research here

Usage

GSECA requires as input files:

Gene expression matrix (".tsv", tab separated): matrix of normalized gene expression levels from RNA-seq experiments. Rows represent genes, columns represent samples and the corresponding expression levels. The first column must contain ensembl gene id, thus the first row must contain the label "ensembl_gene_id" followed by sample identifiers (e.g. barcodes). one line one gene be sure that your data do not contains duplicated gene ids. no ensg.version if you use ensembl_gene_id, please be sure to remove the version of each ensg (i.e. ".XX") before running GSECA.

ensembl_gene_id TCGA.YL.A8HL TCGA.EJ.5516 TCGA.KK.A8IC TCGA.EJ.7314 ...
ENSG00000000003        12.37        23.16        11.61        13.97 ...
ENSG00000000005         0.01         0.13         0.03         0.00 ...
ENSG00000000419        23.83        24.07        32.79        21.10 ...
ENSG00000000457         4.69         4.92         2.76         2.48 ...
ENSG00000000460         0.78         0.95         0.77         0.77 ...
...

Sample type labels (".tsv", tab separated): an ordered list of phenotype labels (CASE / CNTR), one per row matching the order of samples given in the gene expression matrix. The first row must contain the label "x".

[1] "CASE" "CASE" "CASE" .... "CNTR" "CNTR" "CNTR" ...

Gene sets (".gmt" file): the list of gene sets to be tested. It can be predefined by the user, or selected from a collection of pre-processed gene sets of biological pathways and diseases included in the Shiny app. If you provide you own gene set list, please be sure that it satisfies all gmt format requirments (see details here)

List of 4
 $ gene_set_A  : chr [1:61] "geneA" "geneB" "geneC" "geneD" ...
 $ gene_set_B  : chr [1:31] "geneE" "geneF" "geneB" "geneG"...
 $ gene_set_C  : chr [1:26] "geneA" "geneF" "geneH" "geneI" ...
 $ ...

How to run GSECA as a stand-alone app

Clone the repository on your local machine and open R from the repository folder.

git lfs clone https://github.com/matteocereda/GSECA.git

To use GSECA functions, load source the file Scripts/config.R in the R Global Environment:

source("Scripts/config.R")

GSECA can be run on an R shell following the commands reported in the script Scripts/GSECA.R

Examples

Example of running GSECA on an R shell.

An example dataset is provided, which contains 100 prostate adenocarcinoma RNA-seq samples (i.e. 50 cases + 50 controls) of the TCGA-PRAD dataset, obtained from the GDC data portal (https://gdc-portal.nci.nih.gov). The samples are stratified for the somatic loss of PTEN.

# load configurations
source("Scripts/config.R")

# Gene expression matrix & sample type
M = read.delim("Examples/PRAD.ptenloss.M.tsv")
L = read.delim("Examples/PRAD.ptenloss.L.tsv")[,1]

# Gene set list
pl = read.gmt.file("gene_sets/cereda.158.KEGG.gmt")

# Run GSECA
res = GSECA_executor(  M # Gene Expression matrix
                     , L # Sample label list
                     , pl # gene set list
                     , outdir = "Results" #outdir folder
                     , analysis = "my_analysis"# analysis name
                     , N.CORES = 2 # number of cores
                     , EMPIRICAL = T # true if empirical p-value is requested
                     , BOOTSTRP = T
                     , nsim = 2 # number of bootstrapping
                     , AS = 0.25 # AS threshold
                     , PEMP = 1 # p.emp threshold
                     , SR   = 0.7 # success rate threshold
                     , toprank = 10 # success rate threshold
                     , iphen = c("CASE", "CNTR") #  phenotype lables
)

These are our settings:

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.1 (2019-07-05)
 os       macOS Catalina 10.15.1      
 system   x86_64, darwin15.6.0        
 ui       RStudio                     
 language (EN)                        
 collate  en_GB.UTF-8                 
 ctype    en_GB.UTF-8                 
 tz       Europe/Rome                 
 date     2019-12-31                  

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 ! package              * version    date       lib source                                  
   assertthat             0.2.1      2019-03-21 [1] CRAN (R 3.6.0)                          
   backports              1.1.5      2019-10-02 [1] CRAN (R 3.6.0)                          
   bibtex                 0.4.2      2017-06-30 [1] CRAN (R 3.6.0)                          
   Biobase              * 2.44.0     2019-05-02 [1] Bioconductor                            
   BiocGenerics         * 0.30.0     2019-05-02 [1] Bioconductor                            
   BiocManager            1.30.10    2019-11-16 [1] CRAN (R 3.6.1)                          
   BiocParallel         * 1.18.1     2019-08-06 [1] Bioconductor                            
   bit                    1.1-14     2018-05-29 [1] CRAN (R 3.6.0)                          
   bit64                  0.9-7      2017-05-08 [1] CRAN (R 3.6.0)                          
   bitops               * 1.0-6      2013-08-17 [1] CRAN (R 3.6.0)                          
   blob                   1.2.0      2019-07-09 [1] CRAN (R 3.6.0)                          
   callr                  3.4.0      2019-12-09 [1] CRAN (R 3.6.0)                          
   caTools                1.17.1.3   2019-11-30 [1] CRAN (R 3.6.0)                          
   circlize             * 0.4.8      2019-09-08 [1] CRAN (R 3.6.0)                          
   cli                    2.0.0      2019-12-09 [1] CRAN (R 3.6.0)                          
   clue                   0.3-57     2019-02-25 [1] CRAN (R 3.6.0)                          
   cluster                2.1.0      2019-06-19 [1] CRAN (R 3.6.1)                          
   codetools              0.2-16     2018-12-24 [1] CRAN (R 3.6.1)                          
   colorspace             1.4-1      2019-03-18 [1] CRAN (R 3.6.0)                          
   ComplexHeatmap       * 2.1.1      2019-10-18 [1] Github (jokergoo/ComplexHeatmap@1cb7fea)
   crayon                 1.3.4      2017-09-16 [1] CRAN (R 3.6.0)                          
   crosstalk              1.0.0      2016-12-21 [1] CRAN (R 3.6.0)                          
   data.table           * 1.12.8     2019-12-09 [1] CRAN (R 3.6.0)                          
   DBI                    1.0.0      2018-05-02 [1] CRAN (R 3.6.0)                          
   DelayedArray         * 0.10.0     2019-05-02 [1] Bioconductor                            
   dendextend             1.13.2     2019-12-02 [1] CRAN (R 3.6.0)                          
   desc                   1.2.0      2018-05-01 [1] CRAN (R 3.6.0)                          
   dev