M6A robustness test

This repository contains the results of differential gene, isofrom, exon and intron usage in siRNA directed METTL3 knockdown & control HepG2 cells (n=2) in an attempt to faithfully reproduce the RNA-Seq analysis performed in the Nature paper Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq.

Methods and results are detailed below for posterity, with comments describing missing information that would have aided the analysis from a reviewers perpective. The results from this repository are to be used in conjunction with a M6A peak calling analysis of HepG2 cells, to interrogate correlations between M6A peaks and differential gene/isoform/exon/intron usage, a topic that has 13 contrasting citations to date.

Please note: The direction of fold change is with respect to control. I.e a log2FC value of 2 refers to up-regulation in the METTL3 knockdown samples.

Download data

Raw sequencing data was downloaded using SRAtools fastq-dump via a singularity container. The nextflow script to download the reads is provided in scripts/ and the set of commands used is given below:

<details markdown="1"> <summary>Download raw reads</summary>

singularity pull sratoolkit.img docker://pegi3s:sratoolkit
nextflow -bg run dl_sra.nf --sra_id 'SRP012096' -with-singularity 'sratoolkit.img'

</details>

SRA ID's can be mapped to the corresponding experiment metadata using the table below:

Reference genome and GTF files were prepared as per the paper, using H. sapiens ENSEMBL release 54 (NCBI36/hg18).

<details markdown="1"> <summary>Download reference files</summary>

wget http://ftp.ensembl.org/pub/release-54/fasta/homo_sapiens/dna/Homo_sapiens.NCBI36.54.dna.toplevel.fa.gz
guzip Homo_sapiens.NCBI36.54.dna.toplevel.fa.gz

wget http://ftp.ensembl.org/pub/release-54/gtf/homo_sapiens/Homo_sapiens.NCBI36.54.gtf.gz
gunzip Homo_sapiens.NCBI36.54.gtf.gz

</details>

Quantification

RNA-Seq analysis was performed using nf-core/rnaseq v3.1 with default parameters (except for the reference files provided). The samples.csv file provided to nf-core/rnaseq is given below, of note the dataset is single-end and unstranded:

| sample | fastq_1 | fastq_2 | strandedness | |---------------|------------------------------|---------|--------------| | METTL3_KD1 | fastq/METTL3_KD1.fastq.gz | | unstranded | | METTL3_KD2 | fastq/METTL3_KD2.fastq.gz | | unstranded | | Mock_control1 | fastq/Mock_control1.fastq.gz | | unstranded | | Mock_control2 | fastq/Mock_control2.fastq.gz | | unstranded |

<details markdown="1"> <summary>Nextflow command</summary>

nextflow pull nf-core/rnaseq
nextflow -bg run nf-core/rnaseq -profile singularity --input 'samples.csv' --fasta 'assets/Homo_sapiens.NCBI36.54.dna.toplevel.fa' --gtf 'Homo_sapiens.NCBI36.54.gtf' --max_memory '62.GB' --max_cpus 16

</details>

Differentially expressed genes

Pairwise comparisons of knockdown vs control populations were generated using DESeq2 with a simple model design ~ condition.

The R code used to generate results are given below:

<details markdown="1"> <summary>Differential gene analysis</summary>

# stage files
meta <- data.frame(row.names=c("METTL3_KD1", "METTL3_KD2", "Mock_control1", "Mock_control2"),
                   "condition"=c("KD", "KD", "CTRL", "CTRL"),
                   "replicates"=as.factor(c("1", "2", "1", "2")))
dir <- "/data/projects/leipzig/results/star_salmon"
files <- file.path(dir, rownames(meta), "quant.sf")
# Use release 54 here *crucial for accurate coordinates across DE analyses
mart <- useMart("ENSEMBL_MART_ENSEMBL",
                dataset="hsapiens_gene_ensembl",
                host="may2009.archive.ensembl.org",
                path="/biomart/martservice",
                archive=FALSE)
tx2gene <- getBM(attributes = c("ensembl_transcript_id", "hgnc_symbol"), mart = mart)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
# DDS object
dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ condition)
dds$condition <- relevel(dds$condition, ref="CTRL")
dds <- DESeq(dds)
# DGE
res <- results(dds, alpha=0.05, c("condition", "KD", "CTRL"))
res_df <- as.data.frame(res)
# functions
# use paper cutoff here (FC > 2, FDR 5%)
get_upregulated <- function(df){

    key <- intersect(rownames(df)[which(df$log2FoldChange>=2)], rownames(df)[which(df$padj<=0.05)])

    results <- as.data.frame((df)[which(rownames(df) %in% key),])
    results <- results[order(-results$log2FoldChange),]
    return(results)
}
get_downregulated <- function(df){

    key <- intersect(rownames(df)[which(df$log2FoldChange<=-2)], rownames(df)[which(df$padj<=0.05)])

    results <- as.data.frame((df)[which(rownames(df) %in% key),])
    results <- results[order(results$log2FoldChange),]
    return(results)
}
annotate_de_genes <- function(df){

    df$hgnc_symbol <- rownames(df)
    mart <- useMart("ENSEMBL_MART_ENSEMBL",
                dataset="hsapiens_gene_ensembl",
                host="may2009.archive.ensembl.org",
                path="/biomart/martservice",
                archive=FALSE)
    info <- getBM(attributes=c("hgnc_symbol",
                               "ensembl_gene_id",
                               "chromosome_name",
                               "start_position",
                               "end_position",
                               "strand"),
                  filters = c("hgnc_symbol"),
                  values = df$hgnc_symbol,
                  mart = mart,
                  useCache=FALSE)

    tmp <- merge(df, info, by="hgnc_symbol")
    tmp$strand <- gsub("-1", "-", tmp$strand)
    tmp$strand <- gsub("1", "+", tmp$strand)
    tmp$hgnc_symbol <- make.names(tmp$hgnc_symbol, unique = T)
    #tmp <- tmp[!grepl("CHR", tmp$chromosome_name),]

    output_col <- c("hgnc", "ensembl_gene_id", "chromosome", "start", "end", "strand", "log2FC", "pvalue", "padj")
    tmp <- subset(tmp, select=c(hgnc_symbol, ensembl_gene_id, chromosome_name, start_position, end_position, strand, log2FoldChange, pvalue, padj))
    colnames(tmp) <- output_col

    if(min(tmp$Log2FC) > 0){
        tmp <- tmp[order(-tmp$log2FC),]
    }else{
        tmp <- tmp[order(tmp$log2FC),]
    }

    return(tmp)

}
# get up regulated
up <- get_upregulated(res_df)
up$hgnc_symbol <- rownames(up)
up <- annotate_de_genes(up)
# get down regulated
down <- get_downregulated(res_df)
down$hgnc_symbol <- rownames(down)
down <- annotate_de_genes(down)
# not sure why this is not working in the function, re-run to order by LFC
down <- down[order(down$log2FC),]
# write to file
write.table(up, "/data/github/GSE37001/gene/DESeq2_gene_upregulated.txt", sep="\t", quote = FALSE, row.names = FALSE)
write.table(down, "/data/github/GSE37001/gene/DESeq2_gene_downregulated.txt", sep="\t", quote = FALSE, row.names = FALSE)

</details>

Comments

The number of differentially expressed genes returned by the analysis (428) were significantly lower than those reported by the study (1977), despite using the same cut-off values (LFC > 2 & FDR 5%).

Despite the discrepancy in results, this was a relatively simple analysis to perform. The paper stated which reference genome files were used (ENSEMBL release 54) which is crucial in returning the same genomic coordinates for M6A peak overlap analysis.

Differentially expressed isoforms

The analysis was performed using the DESeq2 workflow above with one change: txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = TRUE) to use transcript counts in the analysis instead of gene counts. The workflow is given below.

<details markdown="1"> <summary>Differential isoform analysis</summary>

# same as above, but use TX counts
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = TRUE)
# DDS object
dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ condition )
dds$condition <- relevel(dds$condition, ref="CTRL")
dds <- DESeq(dds)
# DGE
res <- results(dds, alpha=0.05, c("condition", "KD", "CTRL"))
res_df <- as.data.frame(res)
# functions
# use paper cutoff here (FC > 2, FDR 5%)
get_upregulated <- function(df){

    key <- intersect(rownames(df)[which(df$log2FoldChange>=2)], rownames(df)[which(df$padj<=0.05)])

    results <- as.data.frame((df)[which(rownames(df) %in% key),])
    results <- results[order(-results$log2FoldChange),]
    return(results)
}
get_downregulated <- function(df){

    key <- intersect(rownames(df)[which(df$log2FoldChange<=-2)], rownames(df)[which(df$padj<=0.05)])

    results <- as.data.frame((df)[which(rownames(df) %in% key),])
    results <- results[order(results$log2FoldChange),]
    return(results)
}
annotate_de_genes <- function(df){

    df$hgnc_symbol <- rownames(df)
    mart <- useMart("ENSEMBL_MART_ENSEMBL",
                dataset="hsapiens_gen

GSE37001

Install / Use

README

M6A robustness test

Analysis overview

Download data

Quantification

Differentially expressed genes

Comments

Differentially expressed isoforms