GSE37001
RNA-Seq Analysis - Jeremy Leipzig M6A
Install / Use
/learn @BarryDigby/GSE37001README
M6A robustness test
<p markdown="1" align="center"> <img src="assets/images/M6a_paper.png" alt="paper_header"> </p>This repository contains the results of differential gene, isofrom, exon and intron usage in siRNA directed METTL3 knockdown & control HepG2 cells (n=2) in an attempt to faithfully reproduce the RNA-Seq analysis performed in the Nature paper Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq.
Methods and results are detailed below for posterity, with comments describing missing information that would have aided the analysis from a reviewers perpective. The results from this repository are to be used in conjunction with a M6A peak calling analysis of HepG2 cells, to interrogate correlations between M6A peaks and differential gene/isoform/exon/intron usage, a topic that has 13 contrasting citations to date.
Analysis overview
- Download data
- Quantification
- Differentially expressed genes
- Differentially expressed isoforms
- Differentially expressed exons
- Differentially expressed introns
- Comparison of results
Please note: The direction of fold change is with respect to control. I.e a log2FC value of 2 refers to up-regulation in the METTL3 knockdown samples.
Download data
Raw sequencing data was downloaded using SRAtools fastq-dump via a singularity container. The nextflow script to download the reads is provided in scripts/ and the set of commands used is given below:
singularity pull sratoolkit.img docker://pegi3s:sratoolkit
nextflow -bg run dl_sra.nf --sra_id 'SRP012096' -with-singularity 'sratoolkit.img'
</details>
SRA ID's can be mapped to the corresponding experiment metadata using the table below:
| SRA ID | Experimental design | |:------------------:|:----------------------:| | SRR456526.fastq.gz | METTL3_KD1.fastq.gz | | SRR456527.fastq.gz | METTL3_KD2.fastq.gz | | SRR456528.fastq.gz | Mock_control1.fastq.gz | | SRR456529.fastq.gz | Mock_control2.fastq.gz |
Reference genome and GTF files were prepared as per the paper, using H. sapiens ENSEMBL release 54 (NCBI36/hg18).
wget http://ftp.ensembl.org/pub/release-54/fasta/homo_sapiens/dna/Homo_sapiens.NCBI36.54.dna.toplevel.fa.gz
guzip Homo_sapiens.NCBI36.54.dna.toplevel.fa.gz
wget http://ftp.ensembl.org/pub/release-54/gtf/homo_sapiens/Homo_sapiens.NCBI36.54.gtf.gz
gunzip Homo_sapiens.NCBI36.54.gtf.gz
</details>
Quantification
RNA-Seq analysis was performed using nf-core/rnaseq v3.1 with default parameters (except for the reference files provided). The samples.csv file provided to nf-core/rnaseq is given below, of note the dataset is single-end and unstranded:
| sample | fastq_1 | fastq_2 | strandedness | |---------------|------------------------------|---------|--------------| | METTL3_KD1 | fastq/METTL3_KD1.fastq.gz | | unstranded | | METTL3_KD2 | fastq/METTL3_KD2.fastq.gz | | unstranded | | Mock_control1 | fastq/Mock_control1.fastq.gz | | unstranded | | Mock_control2 | fastq/Mock_control2.fastq.gz | | unstranded |
<details markdown="1"> <summary>Nextflow command</summary>
nextflow pull nf-core/rnaseq
nextflow -bg run nf-core/rnaseq -profile singularity --input 'samples.csv' --fasta 'assets/Homo_sapiens.NCBI36.54.dna.toplevel.fa' --gtf 'Homo_sapiens.NCBI36.54.gtf' --max_memory '62.GB' --max_cpus 16
</details>
Differentially expressed genes
Pairwise comparisons of knockdown vs control populations were generated using DESeq2 with a simple model design ~ condition.
The R code used to generate results are given below:
# stage files
meta <- data.frame(row.names=c("METTL3_KD1", "METTL3_KD2", "Mock_control1", "Mock_control2"),
"condition"=c("KD", "KD", "CTRL", "CTRL"),
"replicates"=as.factor(c("1", "2", "1", "2")))
dir <- "/data/projects/leipzig/results/star_salmon"
files <- file.path(dir, rownames(meta), "quant.sf")
# Use release 54 here *crucial for accurate coordinates across DE analyses
mart <- useMart("ENSEMBL_MART_ENSEMBL",
dataset="hsapiens_gene_ensembl",
host="may2009.archive.ensembl.org",
path="/biomart/martservice",
archive=FALSE)
tx2gene <- getBM(attributes = c("ensembl_transcript_id", "hgnc_symbol"), mart = mart)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
# DDS object
dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ condition)
dds$condition <- relevel(dds$condition, ref="CTRL")
dds <- DESeq(dds)
# DGE
res <- results(dds, alpha=0.05, c("condition", "KD", "CTRL"))
res_df <- as.data.frame(res)
# functions
# use paper cutoff here (FC > 2, FDR 5%)
get_upregulated <- function(df){
key <- intersect(rownames(df)[which(df$log2FoldChange>=2)], rownames(df)[which(df$padj<=0.05)])
results <- as.data.frame((df)[which(rownames(df) %in% key),])
results <- results[order(-results$log2FoldChange),]
return(results)
}
get_downregulated <- function(df){
key <- intersect(rownames(df)[which(df$log2FoldChange<=-2)], rownames(df)[which(df$padj<=0.05)])
results <- as.data.frame((df)[which(rownames(df) %in% key),])
results <- results[order(results$log2FoldChange),]
return(results)
}
annotate_de_genes <- function(df){
df$hgnc_symbol <- rownames(df)
mart <- useMart("ENSEMBL_MART_ENSEMBL",
dataset="hsapiens_gene_ensembl",
host="may2009.archive.ensembl.org",
path="/biomart/martservice",
archive=FALSE)
info <- getBM(attributes=c("hgnc_symbol",
"ensembl_gene_id",
"chromosome_name",
"start_position",
"end_position",
"strand"),
filters = c("hgnc_symbol"),
values = df$hgnc_symbol,
mart = mart,
useCache=FALSE)
tmp <- merge(df, info, by="hgnc_symbol")
tmp$strand <- gsub("-1", "-", tmp$strand)
tmp$strand <- gsub("1", "+", tmp$strand)
tmp$hgnc_symbol <- make.names(tmp$hgnc_symbol, unique = T)
#tmp <- tmp[!grepl("CHR", tmp$chromosome_name),]
output_col <- c("hgnc", "ensembl_gene_id", "chromosome", "start", "end", "strand", "log2FC", "pvalue", "padj")
tmp <- subset(tmp, select=c(hgnc_symbol, ensembl_gene_id, chromosome_name, start_position, end_position, strand, log2FoldChange, pvalue, padj))
colnames(tmp) <- output_col
if(min(tmp$Log2FC) > 0){
tmp <- tmp[order(-tmp$log2FC),]
}else{
tmp <- tmp[order(tmp$log2FC),]
}
return(tmp)
}
# get up regulated
up <- get_upregulated(res_df)
up$hgnc_symbol <- rownames(up)
up <- annotate_de_genes(up)
# get down regulated
down <- get_downregulated(res_df)
down$hgnc_symbol <- rownames(down)
down <- annotate_de_genes(down)
# not sure why this is not working in the function, re-run to order by LFC
down <- down[order(down$log2FC),]
# write to file
write.table(up, "/data/github/GSE37001/gene/DESeq2_gene_upregulated.txt", sep="\t", quote = FALSE, row.names = FALSE)
write.table(down, "/data/github/GSE37001/gene/DESeq2_gene_downregulated.txt", sep="\t", quote = FALSE, row.names = FALSE)
</details>
Comments
The number of differentially expressed genes returned by the analysis (428) were significantly lower than those reported by the study (1977), despite using the same cut-off values (LFC > 2 & FDR 5%).
Despite the discrepancy in results, this was a relatively simple analysis to perform. The paper stated which reference genome files were used (ENSEMBL release 54) which is crucial in returning the same genomic coordinates for M6A peak overlap analysis.
Differentially expressed isoforms
The analysis was performed using the DESeq2 workflow above with one change: txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = TRUE) to use transcript counts in the analysis instead of gene counts. The workflow is given below.
# same as above, but use TX counts
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = TRUE)
# DDS object
dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ condition )
dds$condition <- relevel(dds$condition, ref="CTRL")
dds <- DESeq(dds)
# DGE
res <- results(dds, alpha=0.05, c("condition", "KD", "CTRL"))
res_df <- as.data.frame(res)
# functions
# use paper cutoff here (FC > 2, FDR 5%)
get_upregulated <- function(df){
key <- intersect(rownames(df)[which(df$log2FoldChange>=2)], rownames(df)[which(df$padj<=0.05)])
results <- as.data.frame((df)[which(rownames(df) %in% key),])
results <- results[order(-results$log2FoldChange),]
return(results)
}
get_downregulated <- function(df){
key <- intersect(rownames(df)[which(df$log2FoldChange<=-2)], rownames(df)[which(df$padj<=0.05)])
results <- as.data.frame((df)[which(rownames(df) %in% key),])
results <- results[order(results$log2FoldChange),]
return(results)
}
annotate_de_genes <- function(df){
df$hgnc_symbol <- rownames(df)
mart <- useMart("ENSEMBL_MART_ENSEMBL",
dataset="hsapiens_gen
