DRUMMER

DRUMMER is designed to identify RNA modifications at nucleotide-level resolution on distinct transcript isoforms through the comparative analysis of basecall errors in Nanopore direct RNA sequencing (DRS) datasets.

DRUMMER was designed and implemented by Jonathan S. Abebe and Daniel P. Depledge

Updates

DRUMMER v1.0 has now been released with the following improvements:

- a single output summary file containing all candidate site information across all biological replicates
- a parsing test figure to confirm that all reads have been included
- improved tutorials for both human and viral datasets

NOTE: Please note that the DRUMMER dependency bam-readcount is reportedly unstable on OSX systems and may cause DRUMMER to fail.

Introduction
Installation
- Pre-requisites
- Setup
Running DRUMMER
Output
Running DRUMMER with the test datasets
Data preparation
- Alignment and filtering
- Setting up a transcript list for isoform mode
Troubleshooting
Citation
Wisdom

Introduction

Experimental design defines the context of analysis and the modifications likely to be identified. For instance, comparison of DRS datasets from a parental cell line and METTL3-knockout cell line allows for the detection of m6A.

Installation

Pre-requisites

DRUMMER requires the following packages to be installed and available in the user's PATH:

SAMTools v1.3 or higher

BEDTools v2.26 or higher

BASH v4.2 or higher

Python3 and modules: seaborn scipy pandas numpy biopython matplotlib
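
Before the first run, it can help to confirm that the pre-requisites are visible from the current environment. The following is a minimal sketch, not part of DRUMMER itself. The executable names (samtools, bedtools, bash) are assumed to be the usual command names for the listed tools, and the helper name find_missing is hypothetical. It only checks presence, not the minimum versions listed above.

```python
import importlib.util
import shutil

# Assumed executable names for SAMTools, BEDTools and BASH.
REQUIRED_TOOLS = ["samtools", "bedtools", "bash"]
# Import names of the Python modules; biopython is imported as "Bio".
REQUIRED_MODULES = ["seaborn", "scipy", "pandas", "numpy", "Bio", "matplotlib"]


def find_missing(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES):
    """Return (missing_tools, missing_modules) without importing any module."""
    missing_tools = [t for t in tools if shutil.which(t) is None]
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    return missing_tools, missing_modules


if __name__ == "__main__":
    tools, modules = find_missing()
    print("missing executables:", tools or "none")
    print("missing Python modules:", modules or "none")
```

Version requirements still need to be checked by hand, for example with samtools --version and bedtools --version.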

Installing DRUMMER with git

git clone https://github.com/DepledgeLab/DRUMMER
cd DRUMMER
python DRUMMER.py -h

Note that upon installation, we strongly recommend testing DRUMMER using one or more of the test datasets included - see Running DRUMMER with the test datasets

Setting up Conda environment

##Install environment 
conda env create --file environment-setup.yml 

##Activate DRUMMER environment
conda activate DRUMMER

##Run DRUMMER
python DRUMMER.py -h

##Deactivate DRUMMER environment
conda deactivate

Running DRUMMER

DRUMMER requires two co-ordinate sorted and indexed BAM files as input. These should contain read alignments for the treatment (RNA modification absent) and control (RNA modification present) datasets (see Data Preparation section below). DRUMMER can be run in either exome or isoform mode. Exome mode (-a exome) uses DRS read alignments against the genome of a given organism to identify putatively modified bases, while isoform mode (-a isoform) uses DRS read alignments against the transcriptome of a given organism to provide a high-resolution mapping. While isoform mode is more sensitive, it is also (currently) slower.

Usage:

python DRUMMER.py -r [FASTA] -l|-n [TARGETS] -c [CONTROL] -t [TREATMENT] -o [OUTPUT] -a [RUNMODE] (OPTIONS)

Required flags

-r              fasta format reference genome (exome) or transcriptome (isoforms)

-l              list of transcripts (isoform) to be examined (single column or seven-column format)
OR
-n              name of genome (exome) - must match fasta file header

-c              sorted.bam file - control (RNA modification(s) present)
-t              sorted.bam file - treatment (RNA modification(s) absent)
-o              output directory
-a              runmode (exome|isoform)

Optional flags

-z              odds ratio cutoff (default = +/- 1.5)
-p              padj cutoff (default = < 0.05 for both OR and G-test)
-m              run in m6A mode (default = false)
-f              Reference fraction difference (default = 0.01)
-v              produce visualizations for individual transcripts (default = false)
-i              Filter for indels or retain
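
When DRUMMER is launched from a script, for example once per replicate pair, the required flags above can be assembled programmatically. The sketch below only builds the argument list and does not execute anything. The function name build_drummer_command is hypothetical, and the "-m True" form follows the test commands in this README.

```python
import shlex


def build_drummer_command(reference, targets, control_bam, treatment_bam,
                          output_dir, runmode, m6a=False):
    """Assemble the argument list for one DRUMMER run (nothing is executed)."""
    if runmode not in ("exome", "isoform"):
        raise ValueError("runmode must be 'exome' or 'isoform'")
    # -n (genome name) pairs with exome mode, -l (transcript list) with isoform mode.
    target_flag = "-n" if runmode == "exome" else "-l"
    cmd = [
        "python", "DRUMMER.py",
        "-r", reference,
        target_flag, targets,
        "-c", control_bam,
        "-t", treatment_bam,
        "-o", output_dir,
        "-a", runmode,
    ]
    if m6a:
        cmd += ["-m", "True"]
    return cmd


if __name__ == "__main__":
    cmd = build_drummer_command(
        reference="TESTDATA/Adenovirus-Ad5.fasta",
        targets="Ad5",
        control_bam="TESTDATA/exome.Ad5.MOD.bam",
        treatment_bam="TESTDATA/exome.Ad5.UNMOD.bam",
        output_dir="exome-test",
        runmode="exome",
        m6a=True,
    )
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from within the DRUMMER directory.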

Output

When run to completion, DRUMMER generates a single tab-separated text file (summary.txt) within each comparison directory containing all predicted candidate RNA modification sites with contextual information (genome position, isoform position, sequence motif, etc.). When run in m6A mode (-m True), a distribution plot is also generated in an accompanying .pdf file (m6A_plot.pdf). The output directory 'complete_analysis' contains individual data files for each reference sequence provided. A second, optional (-v) directory 'visualization' contains plots of G-test scores (accumulation/depletion) and odds ratio scores versus position for each individual reference sequence.

A detailed description of column headers in the summary.txt file is shown below. For the individual outputs, please see the accompanying file 'individual_output_headers.txt' for a full description of headers.

[1] transcript_id:      name of transcript (isoform mode only)
[2] chromosome:         name of chromosome
[3] reference_base:     reference nucleotide at this position
[4] pos_mod:            position of nucleotide on transcript (isoform mode) or genome (exome mode)
[5] depth_mod:          read depth at this position (RNA modification present dataset)
[6] ref_fraction_mod:   fraction of reads with reference base at this position (RNA modification present dataset)
[7] depth_unmod:        read depth at this position (RNA modification absent dataset)
[8] ref_fraction_unmod: fraction of reads with reference base at this position (RNA modification absent dataset)
[9] frac_diff:          difference between ref_fraction_unmod and ref_fraction_mod
[10] odds_ratio:        odds ratio
[11] OR_padj:           odds ratio adjusted p-value (Bonferroni)
[12] eleven_bp_motif:   sequence (11-mer) centered on current position
[13] G_test:            result of 2x5 G-test
[14] G_padj:            G-test adjusted p-value (Bonferroni)
[15] candidate_site:    values limited to candidate, [candidate masked], or empty based on cutoffs chosen
[16] nearest_ac:        (m6A only) distance (nt) to nearest AC motif (-ve indicates upstream, +ve indicates downstream)
[17] nearest_ac_motif:  (m6A only) sequence (5-mer) of nearest AC motif (centered on A)
[18] genomic_position:  position of nucleotide on genome (isoform mode)
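
To make the G_test column more concrete, the sketch below computes the G statistic, G = 2 * sum(O * ln(O / E)), for a 2x5 table of per-position counts. It is a generic textbook formulation, not DRUMMER's own code. The five categories (here four bases plus one catch-all) and all counts are assumptions chosen for illustration.

```python
import math


def g_test_statistic(table):
    """G statistic for a contingency table given as a list of equal-length rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed == 0:
                continue  # a zero cell contributes nothing to the sum
            expected = row_totals[i] * col_totals[j] / grand_total
            g += observed * math.log(observed / expected)
    return 2.0 * g


def ref_fraction(counts, ref_index):
    """Fraction of reads carrying the reference base at this position."""
    return counts[ref_index] / sum(counts)


if __name__ == "__main__":
    # Hypothetical counts at one position; index 0 is the reference base.
    mod_counts = [80, 5, 5, 5, 5]      # RNA modification present dataset
    unmod_counts = [95, 2, 1, 1, 1]    # RNA modification absent dataset
    print("ref_fraction_mod:", ref_fraction(mod_counts, 0))
    print("ref_fraction_unmod:", ref_fraction(unmod_counts, 0))
    print("G:", g_test_statistic([mod_counts, unmod_counts]))
```

A 2x5 table has (2-1)*(5-1) = 4 degrees of freedom, so a p-value can be obtained from the chi-squared distribution with 4 degrees of freedom, for instance via scipy.stats.chi2.sf. Whether DRUMMER derives its p-values in exactly this way is not stated here.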

When DRUMMER is run to completion on multiple biological replicates, a single tab-separated text file (multiple_comp.txt) is written to the main output directory specified with the -o flag. This file summarizes the information from each biological replicate.

A detailed description of column headers in the multiple_comp.txt file is shown below. Each putative candidate position, with its accompanying information, is merged across biological replicates wherever the same position occurs in more than one. Values in columns [1-9, 11-16] are taken from the replicate that has the highest G-test [8] and odds ratio [11].

[1] transcript_id:
[2] position:
[3] genomic_position:
[4] strand:                
[5] eleven_bp_motif:
[6] nearest_ac_motif:
[7] nearest_ac:
[8] max-G_test:                                           Maximum G-test value seen for this location across all replicates
[9] max-G_padj:                                           padj value corresponding to the max G-test value
[10] support:                                             The number of biological replicates this site occurs in
[11] max_odds:                                            Maximum odds ratio seen for this location across all replicates
[12] max_odds_padj:                                       padj value corresponding to the max odds value
[13] accumulation:          
[14] depletion:
[15] frac_diff:
[16] Comparison1-pos:Gtest:padj:OR:ORpadj:frac_diff:      Information relating to biological replicate 1, same as in the corresponding summary.txt file
[17] Comparison2-pos:Gtest:padj:OR:ORpadj:frac_diff:      Information relating to biological replicate 2, same as in the corresponding summary.txt file
[18] Comparison3-pos:Gtest:padj:OR:ORpadj:frac_diff:                                           ...
[19] Comparison4-pos:Gtest:padj:OR:ORpadj:frac_diff:                                           ...

Running DRUMMER with the test datasets

Several test datasets are included in the DRUMMER repository and can be used to verify DRUMMER is working correctly in your environment. Note that expected outputs are reliant on default parameters and changing these may change the output.

m6A detection in a sample adenovirus dataset using 'exome' mode

The following command parses genome-level alignments to identify putative m6A sites in the adenovirus exome. The command should run to completion in ~5 mins and identify 2 candidate sites

python DRUMMER.py -r TESTDATA/Adenovirus-Ad5.fasta -n Ad5 -o exome-test -c TESTDATA/exome.Ad5.MOD.bam -t TESTDATA/exome.Ad5.UNMOD.bam -a exome -m True

(note: run python DRUMMER.py from within DRUMMER directory)

m6A detection in a sample adenovirus dataset using 'isoform' mode

The following command parses transcriptome-level alignments to identify putative m6A sites in a limited adenovirus transcriptome comprising seven transcript isoforms originating from the E3 locus. The command should run to completion in ~5 mins and identify 9 candidate sites across three distinct transcripts (E3.12K = 2 sites, E3.RIDa = 1 site, E3.10K = 6 sites)

python DRUMMER.py -r TESTDATA/Ad5_v9.1_complete.fasta -l TESTDATA/Ad5.sample.transcripts.txt -o isoform-test -c TESTDATA/isoform.Ad5.MOD.bam -t TESTDATA/isoform.Ad5.UNMOD.bam -a isoform -m True
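
After a test run, the number of reported sites can be compared with the expected counts quoted above (2 for the exome test, 9 for the isoform test). The sketch below counts rows whose candidate_site column equals "candidate" in every summary.txt found under an output directory. It assumes the file's header row uses the column names listed in the Output section, and that the expected counts refer to these rows and exclude masked candidates. The function name count_candidates is hypothetical.

```python
import csv
from pathlib import Path


def count_candidates(output_dir):
    """Map each summary.txt under output_dir to its number of 'candidate' rows."""
    counts = {}
    for summary in sorted(Path(output_dir).rglob("summary.txt")):
        with summary.open(newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            counts[str(summary)] = sum(
                1
                for row in reader
                if (row.get("candidate_site") or "").strip() == "candidate"
            )
    return counts


if __name__ == "__main__":
    # Output directory names taken from the -o values in the test commands.
    for out_dir in ("exome-test", "isoform-test"):
        for path, n in count_candidates(out_dir).items():
            print(f"{path}: {n} candidate site(s)")
```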

Multiple biological replicates

The following command sh
