GEGVIC
A workflow to analyse Gene Expression, Genetic Variations and Immune cell Composition of tumour samples using Next Generation Sequencing data
Install / Use
/learn @oriolarques/GEGVICREADME
GEGVIC
<!-- badges: start --> <!-- badges: end -->GEGVIC is a workflow to analyse Gene Expression, Genomic Variations and Immune cell Composition of tumour samples using Next Generation Sequencing data. This is a common need in the majority of the laboratories in the world, however, many times the high variety of tools available to perform each individual task can confuse and difficult the process.
Here we present an easy-to-use tool that requires few input files, provides a good flexibility and produces appealing outputs when comparing a group of samples for (i) differential gene expression, (ii) genomic variations and (iii) immune cell composition.
<figure> <img src="vignettes/GEGVIC_outline.png" style="width:60.0%" alt="GEGVIC outline" /> <figcaption aria-hidden="true">GEGVIC outline</figcaption> </figure>Installation
NOTE!!: GEGVIC does not run properly under R version 4.2 since there were problems with some third package dependencies. Use either versions $\le$ 4.1.3 or $\ge$ 4.3.0.
You can install the development version of GEGVIC from
GitHub with:
# install.packages("devtools")
devtools::install_github("oriolarques/GEGVIC")
Since this package requires many dependencies, it is recommended to execute the following code before the first usage to prepare the environment correctly.
# CRAN packages
if(!require(shiny)) install.packages("shiny")
if(!require(dplyr)) install.packages("dplyr")
if(!require(tibble)) install.packages("tibble")
if(!require(tidyr)) install.packages("tidyr")
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(ggrepel)) install.packages("ggrepel")
if(!require(rlang)) install.packages("rlang")
if(!require(ggplotify)) install.packages("ggplotify")
if(!require(ggpubr)) install.packages("ggpubr")
if(!require(patchwork)) install.packages("patchwork")
if(!require(gridExtra)) install.packages("gridExtra")
if(!require(pheatmap)) install.packages("pheatmap")
if(!require(devtools)) install.packages("devtools")
if(!require(remotes)) install.packages("remotes")
if(!require(DT)) install.packages("DT")
if(!require(shinyFiles)) install.packages("shinyFiles")
if(!require(shinythemes)) install.packages("shinythemes")
if(!require(tm)) install.packages("tm")
if(!require(rmarkdown)) install.packages("rmarkdown")
# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if(!require(DESeq2)) BiocManager::install("DESeq2")
if(!require(apeglm)) BiocManager::install("apeglm")
if(!require(maftools)) BiocManager::install("maftools")
if(!require(clusterProfiler)) BiocManager::install("clusterProfiler")
if(!require(GSEAmining)) BiocManager::install("GSEAmining")
if(!require(GSEABase)) BiocManager::install("GSEABase")
if(!require(GSVA)) BiocManager::install("GSVA")
if(!require(SummarizedExperiment)) BiocManager::install("SummarizedExperiment")
if(!require(BSgenome)) BiocManager::install("BSgenome")
if(!require(BSgenome.Hsapiens.UCSC.hg19)) BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
if(!require(BSgenome.Hsapiens.UCSC.hg38)) BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
if(!require(BSgenome.Mmusculus.UCSC.mm10)) BiocManager::install("BSgenome.Mmusculus.UCSC.mm10")
if(!require(BSgenome.Mmusculus.UCSC.mm39)) BiocManager::install("BSgenome.Mmusculus.UCSC.mm39")
if(!require(DO.db)) BiocManager::install("DO.db")
if(!require(GO.db)) BiocManager::install("GO.db")
# Github packages
remotes::install_github("icbi-lab/immunedeconv")
devtools::install_github('raerose01/deconstructSigs')
devtools::install_github("oriolarques/GEGVIC")
Input data format
GEGVIC requires three main input data:
- RNA-sequencing raw counts (Counts): Table containing raw gene counts as rows and samples as columns. The first column must contain gene identifiers that can be either NCBI ID, ENSEMBL gene ID or HGNC ID and its column name MUST be adequately named as either:
-
entrezgene_id
-
ensembl_gene_id
-
hgnc_symbol

- Genomic variations data (Muts): Table containing short variant
calls. Necessary columns MUST have the following names
(following the MAF
format):
- Hugo_Symbol: Gene symbol from HGNC.
- Chromosome: Affected chromosome.
- Start_Position: Mutation start coordinate.
- End_Position: Mutation end coordinate.
- Reference_Allele: The plus strand reference allele at this position. Includes the deleted sequence for a deletion or “-” for an insertion.
- Tumor_Seq_Allele2: Tumor sequencing discovery allele.
- Variant_Classification: Translational effect of variant allele. Can be one of the following: Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Silent, Splice_Site, Translation_Start_Site, Nonstop_Mutation, RNA, Targeted_Region.
- Variant_Type: Type of mutation. Can be: ‘SNP’ (Single nucleotide polymorphism), ‘DNP’ (Double nucleotide polymorphism), ‘INS’ (Insertion), ‘DEL’ (Deletion).
- Tumor_Sample_Barcode: Sample name.

- Samples metadata: Table that contains additional information to the samples to create groups such as response to a therapy. The first column MUST be named Samples and contain the same nomenclature for each sample as in the RNA-sequencing raw counts and Genomic variations data tables.

Example of usage
Here, we will explore the workflow, how to use each function and the outputs that they generate. Users can use their own data (with the appropriate format as indicated before) by loading them in the R workspace, however, the package comes with pre-loaded input data from a subset of the TCGA-COADREAD cohort.
Notes:
- All the functions names have a prefix that indicate to which module they belong.
- For further information about specific function argument, install the GEGVIC package and use the help function or visit the description page for the corresponding function in the GitHub respository under the ‘R/’ section.
# load the package
library(GEGVIC)
1. Gene Expression module (GE)
This module uses the functionalities provided by the DESeq2
package.
1.1. PCA
First, using the ge_pca() function we can perform a PCA to evaluate
how samples and groups relate to each other. For that, we indicate the
raw counts file (sample_counts), how the gene identifiers are encoded
(‘ensembl_gene_id’), the metadata file (sample_metadata) and the
unquoted name of the column that contains the groups of interest as
the response argument. Then, the design should be a formula that
expresses how the counts for each gene depend on the variables in the
metadata, and finally the colours to represent each sample group. The
function outputs a plot.
ge_pca(counts = sample_counts,
genes_id = 'ensembl_gene_id',
metadata = sample_metadata,
response = MSI_status,
design = 'MSI_status',
colors = c('orange', 'black'))

1.2. Differential gene expression
Then, we can compute differential gene expression between groups of
interest using the ge_diff_exp() function and store the results in an
object (results.dds).
We need to define new parameters such as the samples group that will be used as the level of reference (the group to which the others will be compared against in a form of a vector) and the shrinkage method of the log2 fold changes to be applied (or not).
In the case there are multiple levels of comparison the object will be in a form of a list.
results.dds <- ge_diff_exp(counts = sample_counts,
genes_id = 'ensembl_gene_id',
metadata = sample_metadata,
design = 'MSI_status',
ref_level = c('MSI_status', 'MSS'),
shrink = 'apeglm')
1.3. Gene annotation
In the case that the gene identifiers provided are not in form of HGNC
symbols but are NCBI or ENSEMBL ID, we have to use the ge_annot()
function to perform the appropriate conversion and store the results in
a new object (annot.res). For that we will have to indicate a query
from the biomart
package
with the following attributes: ensembl_gene_id, hgnc_symbol,
entrezgene_id, transcript_length, refseq_mrna. GEGVIC has already
available the following databases:
-
Genome Reference Consortium Human Build 37: ensembl_biomart_GRCh37.
-
Genome Reference Consortium Human Build 38: ensembl_biomart_GRCh38_p13.
-
Genome Reference Consortium Mouse Build 38 (mm10): ensembl_biomart_GRCm38_p6.
-
Genome Reference Consortium Mouse Build 39 (mm39): ensembl_biomart_GRCm39.
annot.res <- ge_annot(results_dds = results.dds,
genes_id = 'ensembl_gene_id',
biomart = ensembl_biomart_GRCh38_p13)
1.4. Volcano plot
To represent differential gene expression in form of Volcano plots the
function ge_volcano() is used to generate a plot for each comparison
groups. In the plot, the top ten most significantly up- and dw-regulated
genes will be highlighted. Furthermore, the function allow users to
define the fold change and adjusted p-value to further customize the
plot.
ge_volcano(annot_res = annot.res,
fold_change = 2,
p.adj = 0.05)

