GEGVIC

GEGVIC is a workflow to analyse Gene Expression, Genomic Variations and Immune cell Composition of tumour samples using Next Generation Sequencing data. This is a common need in the majority of the laboratories in the world, however, many times the high variety of tools available to perform each individual task can confuse and difficult the process.

Here we present an easy-to-use tool that requires few input files, provides a good flexibility and produces appealing outputs when comparing a group of samples for (i) differential gene expression, (ii) genomic variations and (iii) immune cell composition.

<figure> <img src="vignettes/GEGVIC_outline.png" style="width:60.0%" alt="GEGVIC outline" /> <figcaption aria-hidden="true">GEGVIC outline</figcaption> </figure>

Installation

NOTE!!: GEGVIC does not run properly under R version 4.2 since there were problems with some third package dependencies. Use either versions $\le$ 4.1.3 or $\ge$ 4.3.0.

You can install the development version of GEGVIC from GitHub with:

# install.packages("devtools")
devtools::install_github("oriolarques/GEGVIC")

Since this package requires many dependencies, it is recommended to execute the following code before the first usage to prepare the environment correctly.

# CRAN packages 
if(!require(shiny)) install.packages("shiny")
if(!require(dplyr)) install.packages("dplyr")
if(!require(tibble)) install.packages("tibble")
if(!require(tidyr)) install.packages("tidyr")
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(ggrepel)) install.packages("ggrepel")
if(!require(rlang)) install.packages("rlang")
if(!require(ggplotify)) install.packages("ggplotify")
if(!require(ggpubr)) install.packages("ggpubr")
if(!require(patchwork)) install.packages("patchwork")
if(!require(gridExtra)) install.packages("gridExtra")
if(!require(pheatmap)) install.packages("pheatmap")
if(!require(devtools)) install.packages("devtools")
if(!require(remotes)) install.packages("remotes")
if(!require(DT)) install.packages("DT")
if(!require(shinyFiles)) install.packages("shinyFiles")
if(!require(shinythemes)) install.packages("shinythemes")
if(!require(tm)) install.packages("tm")
if(!require(rmarkdown)) install.packages("rmarkdown")

# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if(!require(DESeq2)) BiocManager::install("DESeq2")
if(!require(apeglm)) BiocManager::install("apeglm")
if(!require(maftools)) BiocManager::install("maftools")
if(!require(clusterProfiler)) BiocManager::install("clusterProfiler")
if(!require(GSEAmining)) BiocManager::install("GSEAmining")
if(!require(GSEABase)) BiocManager::install("GSEABase")
if(!require(GSVA)) BiocManager::install("GSVA")
if(!require(SummarizedExperiment)) BiocManager::install("SummarizedExperiment")
if(!require(BSgenome)) BiocManager::install("BSgenome")
if(!require(BSgenome.Hsapiens.UCSC.hg19)) BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
if(!require(BSgenome.Hsapiens.UCSC.hg38)) BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
if(!require(BSgenome.Mmusculus.UCSC.mm10)) BiocManager::install("BSgenome.Mmusculus.UCSC.mm10")
if(!require(BSgenome.Mmusculus.UCSC.mm39)) BiocManager::install("BSgenome.Mmusculus.UCSC.mm39")
if(!require(DO.db)) BiocManager::install("DO.db")
if(!require(GO.db)) BiocManager::install("GO.db")

# Github packages
remotes::install_github("icbi-lab/immunedeconv")
devtools::install_github('raerose01/deconstructSigs')
devtools::install_github("oriolarques/GEGVIC")

Input data format

GEGVIC requires three main input data:

RNA-sequencing raw counts (Counts): Table containing raw gene counts as rows and samples as columns. The first column must contain gene identifiers that can be either NCBI ID, ENSEMBL gene ID or HGNC ID and its column name MUST be adequately named as either:

entrezgene_id
ensembl_gene_id
hgnc_symbol

input_counts

Genomic variations data (Muts): Table containing short variant calls. Necessary columns MUST have the following names (following the MAF format):
- Hugo_Symbol: Gene symbol from HGNC.
- Chromosome: Affected chromosome.
- Start_Position: Mutation start coordinate.
- End_Position: Mutation end coordinate.
- Reference_Allele: The plus strand reference allele at this position. Includes the deleted sequence for a deletion or “-” for an insertion.
- Tumor_Seq_Allele2: Tumor sequencing discovery allele.
- Variant_Classification: Translational effect of variant allele. Can be one of the following: Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, RNA, Targeted_Region.
- Variant_Type: Type of mutation. Can be: ‘SNP’ (Single nucleotide polymorphism), ‘DNP’ (Double nucleotide polymorphism), ‘INS’ (Insertion), ‘DEL’ (Deletion).
- Tumor_Sample_Barcode: Sample name.

input_muts

Samples metadata: Table that contains additional information to the samples to create groups such as response to a therapy. The first column MUST be named Samples and contain the same nomenclature for each sample as in the RNA-sequencing raw counts and Genomic variations data tables.

input_metadata

Example of usage

Here, we will explore the workflow, how to use each function and the outputs that they generate. Users can use their own data (with the appropriate format as indicated before) by loading them in the R workspace, however, the package comes with pre-loaded input data from a subset of the TCGA-COADREAD cohort.

Notes:

All the functions names have a prefix that indicate to which module they belong.
For further information about specific function argument, install the GEGVIC package and use the help function or visit the description page for the corresponding function in the GitHub respository under the ‘R/’ section.

# load the package
library(GEGVIC)

1. Gene Expression module (GE)

This module uses the functionalities provided by the DESeq2 package.

1.1. PCA

First, using the ge_pca() function we can perform a PCA to evaluate how samples and groups relate to each other. For that, we indicate the raw counts file (sample_counts), how the gene identifiers are encoded (‘ensembl_gene_id’), the metadata file (sample_metadata) and the unquoted name of the column that contains the groups of interest as the response argument. Then, the design should be a formula that expresses how the counts for each gene depend on the variables in the metadata, and finally the colours to represent each sample group. The function outputs a plot.

ge_pca(counts = sample_counts,
       genes_id = 'ensembl_gene_id',
       metadata = sample_metadata,
       response = MSI_status,
       design = 'MSI_status',
       colors = c('orange', 'black'))

PCA

1.2. Differential gene expression

Then, we can compute differential gene expression between groups of interest using the ge_diff_exp() function and store the results in an object (results.dds).

We need to define new parameters such as the samples group that will be used as the level of reference (the group to which the others will be compared against in a form of a vector) and the shrinkage method of the log2 fold changes to be applied (or not).

In the case there are multiple levels of comparison the object will be in a form of a list.

results.dds <- ge_diff_exp(counts = sample_counts,
                           genes_id = 'ensembl_gene_id',
                           metadata = sample_metadata,
                           design = 'MSI_status',
                           ref_level = c('MSI_status', 'MSS'),
                           shrink = 'apeglm')

1.3. Gene annotation

In the case that the gene identifiers provided are not in form of HGNC symbols but are NCBI or ENSEMBL ID, we have to use the ge_annot() function to perform the appropriate conversion and store the results in a new object (annot.res). For that we will have to indicate a query from the biomart package with the following attributes: ensembl_gene_id, hgnc_symbol, entrezgene_id, transcript_length, refseq_mrna. GEGVIC has already available the following databases:

Genome Reference Consortium Human Build 37: ensembl_biomart_GRCh37.
Genome Reference Consortium Human Build 38: ensembl_biomart_GRCh38_p13.
Genome Reference Consortium Mouse Build 38 (mm10): ensembl_biomart_GRCm38_p6.
Genome Reference Consortium Mouse Build 39 (mm39): ensembl_biomart_GRCm39.

annot.res <- ge_annot(results_dds = results.dds,
                      genes_id = 'ensembl_gene_id',
                      biomart = ensembl_biomart_GRCh38_p13)

1.4. Volcano plot

To represent differential gene expression in form of Volcano plots the function ge_volcano() is used to generate a plot for each comparison groups. In the plot, the top ten most significantly up- and dw-regulated genes will be highlighted. Furthermore, the function allow users to define the fold change and adjusted p-value to further customize the plot.

ge_volcano(annot_res = annot.res, 
           fold_change = 2, 
           p.adj = 0.05)

Volcano plot

GEGVIC

Install / Use

README

GEGVIC

Installation

Input data format

Example of usage

1. Gene Expression module (GE)

1.1. PCA

1.2. Differential gene expression

1.3. Gene annotation

1.4. Volcano plot

1.5. GSEA: Gene Set E