DRUMMER
DRUMMER: Detection of RNA modifications in nanopore direct RNA Sequencing datasets
DRUMMER is designed to identify RNA modifications at nucleotide-level resolution on distinct transcript isoforms through the comparative analysis of basecall errors in Nanopore direct RNA sequencing (DRS) datasets.
DRUMMER was designed and implemented by Jonathan S. Abebe and Daniel P. Depledge
Updates
DRUMMER v1.0 has now been released with the following improvements:
- a single output summary file containing all candidate site information across all biological replicates
- a parsing test figure to confirm that all reads have been included
- improved tutorials for both human and viral datasets
NOTE: Please note that the DRUMMER dependency bam-readcount is reportedly unstable on OSX systems and may cause DRUMMER to fail.
Table of contents
- Introduction
- Installation
- Running DRUMMER
- Output
- Running DRUMMER with the test datasets
- Data preparation
- Troubleshooting
- Citation
- Wisdom
Introduction
Experimental design defines the context of analysis and the modifications likely to be identified. For instance, comparison of DRS datasets from a parental cell line and METTL3-knockout cell line allows for the detection of m6A.
Installation
Pre-requisites
DRUMMER requires the following packages to be installed and available in the user's PATH:
SAMTools v1.3 or higher
BEDTools v2.26 or higher
BASH v4.2 or higher
Python3 and modules: seaborn scipy pandas numpy biopython matplotlib
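The pre-requisites above can be checked before a first run. The following is a minimal sketch, not part of DRUMMER itself. The executable names (samtools, bedtools, bash) and the import name Bio for biopython are assumptions based on how these packages are usually installed, and the sketch does not check version numbers.

```python
import importlib.util
import shutil

# Assumed executable names; the README lists packages, not binaries.
REQUIRED_TOOLS = ["samtools", "bedtools", "bash"]
# Import names for the listed Python modules (biopython is imported as "Bio").
REQUIRED_MODULES = ["seaborn", "scipy", "pandas", "numpy", "Bio", "matplotlib"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def find_missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level Python modules that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Missing tools:", find_missing_tools() or "none")
    print("Missing Python modules:", find_missing_modules() or "none")
```

An empty result from both functions only shows that the tools and modules are present; the minimum versions listed above still need to be confirmed by hand.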
Installing DRUMMER with git
git clone https://github.com/DepledgeLab/DRUMMER
cd DRUMMER
python DRUMMER.py -h
After installation, we strongly recommend testing DRUMMER using one or more of the included test datasets (see Running DRUMMER with the test datasets).
Setting up Conda environment
##Install environment
conda env create --file environment-setup.yml
##Activate DRUMMER environment
conda activate DRUMMER
##Run DRUMMER
python DRUMMER.py -h
##Deactivate DRUMMER environment
conda deactivate
Running DRUMMER
DRUMMER requires two coordinate-sorted and indexed BAM files as input. These should contain read alignments for the treatment (RNA modification absent) and control (RNA modification present) datasets (see Data Preparation section below). DRUMMER can be run in either exome or isoform mode. Exome mode (-a exome) uses DRS read alignments against the genome of a given organism to identify putatively modified bases, while isoform mode (-a isoform) uses DRS read alignments against the transcriptome of a given organism to provide a high-resolution mapping. While isoform mode is more sensitive, it is also (currently) slower.
Usage:
python DRUMMER.py -r [FASTA] -l|-n [TARGETS] -c [CONTROL] -t [TREATMENT] -o [OUTPUT] -a [RUNMODE] (OPTIONS)
Required flags
-r fasta format reference genome (exome) or transcriptome (isoforms)
-l list of transcripts (isoform) to be examined (single column or seven-column format)
OR
-n name of genome (exome) - must match fasta file header
-c sorted.bam file - control (RNA modification(s) present)
-t sorted.bam file - treatment (RNA modification(s) absent)
-o output directory
-a runmode (exome|isoform)
Optional flags
-z odds ratio cutoff (default = +/- 1.5)
-p padj cutoff (default = < 0.05 for both OR and G-test)
-m run in m6A mode (default = false)
-f Reference fraction difference (default = 0.01)
-v produce visualizations for individual transcripts (default = false)
-i Filter for indels or retain
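When DRUMMER is called from a script, the flags above can be assembled programmatically. The sketch below is one hedged way to do this. The function build_drummer_command and its parameter names are hypothetical; only the flags themselves come from this README, and the "-m True" form follows the test-dataset commands.

```python
def build_drummer_command(reference, control, treatment, output, runmode,
                          targets=None, genome_name=None, m6a=False):
    """Assemble a DRUMMER argument list (hypothetical helper, not part of DRUMMER)."""
    if runmode not in ("exome", "isoform"):
        raise ValueError("runmode must be 'exome' or 'isoform'")

    cmd = ["python", "DRUMMER.py", "-r", reference]
    if runmode == "isoform":
        # Isoform mode takes a list of transcripts via -l.
        if targets is None:
            raise ValueError("isoform mode requires a transcript list (-l)")
        cmd += ["-l", targets]
    else:
        # Exome mode takes the genome name via -n (must match the fasta header).
        if genome_name is None:
            raise ValueError("exome mode requires a genome name (-n)")
        cmd += ["-n", genome_name]

    cmd += ["-c", control, "-t", treatment, "-o", output, "-a", runmode]
    if m6a:
        cmd += ["-m", "True"]
    return cmd


example = build_drummer_command(
    reference="TESTDATA/Adenovirus-Ad5.fasta",
    control="TESTDATA/exome.Ad5.MOD.bam",
    treatment="TESTDATA/exome.Ad5.UNMOD.bam",
    output="exome-test",
    runmode="exome",
    genome_name="Ad5",
    m6a=True,
)
print(" ".join(example))
```

The returned list can be passed to subprocess.run(cmd, check=True) with the DRUMMER directory as the working directory.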
Output
When run to completion, DRUMMER generates a single tab-separated text file (summary.txt) within each comparison directory containing all predicted candidate RNA modification sites with contextual information (genome position, isoform position, sequence motif, etc). When run in m6A mode (-m True), a distribution plot is also generated in an accompanying .pdf file (m6A_plot.pdf). The output directory 'complete_analysis' contains individual data files for each reference sequence provided. A second (optional [-v]) directory 'visualization' contains individual plots of G-test scores (accumulation/depletion) versus position for each individual reference sequence, along with each odds ratio score.
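For readers unfamiliar with the G-test scores mentioned above, the statistic for a contingency table is G = 2 * sum(O * ln(O / E)), where O is an observed count and E is the count expected from the row and column totals. The sketch below computes it for a 2x5 table. It assumes the five categories are the four bases plus indels and uses made-up counts; it illustrates the general statistic and is not DRUMMER's own implementation.

```python
import math


def g_statistic(table):
    """Return G = 2 * sum(O * ln(O / E)) over all cells of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed == 0:
                continue  # a zero cell contributes nothing to the sum
            expected = row_totals[i] * col_totals[j] / grand_total
            g += observed * math.log(observed / expected)
    return 2.0 * g


# Hypothetical counts at one position, ordered A, C, G, U, indel (assumed categories).
modified = [70, 5, 10, 5, 10]
unmodified = [95, 1, 2, 1, 1]
print(g_statistic([modified, unmodified]))
```

A 2x5 table has (2 - 1) * (5 - 1) = 4 degrees of freedom, so G can be compared against a chi-squared distribution with 4 degrees of freedom to obtain a p-value.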
A detailed description of column headers in the summary.txt file is shown below. For the individual outputs, please see the accompanying file 'individual_output_headers.txt' for a full description of headers.
[1] transcript_id: name of transcript (isoform mode only)
[2] chromosome: name of chromosome
[3] reference_base: reference nucleotide at this position
[4] pos_mod: position of nucleotide on transcript (isoform mode) or genome (exome mode)
[5] depth_mod: read depth at this position (RNA modification present dataset)
[6] ref_fraction_mod: fraction of reads with reference base at this position (RNA modification present dataset)
[7] depth_unmod: read depth at this position (RNA modification absent dataset)
[8] ref_fraction_unmod: fraction of reads with reference base at this position (RNA modification absent dataset)
[9] frac_diff: difference between ref_fraction_unmod and ref_fraction_mod
[10] odds_ratio: odds ratio
[11] OR_padj: odds ratio adjusted p-value (bonferroni)
[12] eleven_bp_motif: sequence (11-mer) centered on current position
[13] G_test: result of 2x5 G-test
[14] G_padj: G-test adjusted p-value (bonferroni)
[15] candidate_site: values limited to candidate, [candidate masked], or empty based on cutoffs chosen
[16] nearest_ac: (m6A only) distance (nt) to nearest AC motif (-ve indicates upstream, +ve indicates downstream)
[17] nearest_ac_motif: (m6A only) sequence (5-mer) of nearest AC motif (centered on A)
[18] genomic_position: position of nucleotide on genome (isoform mode)
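The column list above suggests a simple way to pull candidate sites out of a summary.txt file. The following sketch assumes that the file's header row uses the column names exactly as listed (without the bracketed numbers) and that candidate rows carry the literal value "candidate" in the candidate_site column; both should be checked against a real output file.

```python
import csv


def filter_candidates(lines):
    """Return rows whose candidate_site column equals 'candidate'.

    `lines` is any iterable of tab-separated text lines, header first,
    for example an open file handle.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [
        row for row in reader
        if (row.get("candidate_site") or "").strip() == "candidate"
    ]


# Tiny made-up example using three of the documented columns.
demo = [
    "transcript_id\tpos_mod\tcandidate_site\n",
    "E3.10K\t120\tcandidate\n",
    "E3.10K\t121\t[candidate masked]\n",
    "E3.10K\t122\t\n",
]
for row in filter_candidates(demo):
    print(row["transcript_id"], row["pos_mod"])
```

For a real file, open it with open("summary.txt", newline="") and pass the handle to filter_candidates.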
When run to completion, assuming multiple biological replicates, a single tab-separated text file (multiple_comp.txt) should exist in the main directory that the user specified using the -o flag. The file contains information relating to each biological replicate, summarized into a single file.
A detailed description of column headers in the multiple_comp.txt file is shown below. Each putative candidate position and its accompanying information is merged with the same candidate position in the other biological replicates, if it occurs there. Information in columns [1-9, 11-16] is taken from the replicate that has the highest G-test [8] and odds ratio [11].
[1] transcript_id:
[2] position:
[3] genomic_position:
[4] strand:
[5] eleven_bp_motif:
[6] nearest_ac_motif:
[7] nearest_ac:
[8] max-G_test: Maximum G-test value seen for this location across all replicates
[9] max-G_padj: padj value corresponding to the max G-test value
[10] support: The number of biological replicates this site occurs in
[11] max_odds: Maximum odds ratio seen for this location across all replicates
[12] max_odds_padj: padj value corresponding to the max odds value
[13] accumulation:
[14] depletion:
[15] frac_diff:
[16] Comparison1-pos:Gtest:padj:OR:ORpadj:frac_diff: Information relating to biological replicate 1, same as the corresponding summary.txt file
[17] Comparison2-pos:Gtest:padj:OR:ORpadj:frac_diff: Information relating to biological replicate 2, same as the corresponding summary.txt file
[18] Comparison3-pos:Gtest:padj:OR:ORpadj:frac_diff: ...
[19] Comparison4-pos:Gtest:padj:OR:ORpadj:frac_diff: ...
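The per-replicate column headers above suggest that each Comparison cell packs six values into one colon-separated string. The sketch below splits such a cell into named fields. The cell format is an assumption inferred from the header names alone, and the example values are made up, so the parser should be checked against a real multiple_comp.txt file.

```python
# Field order taken from the header "pos:Gtest:padj:OR:ORpadj:frac_diff".
COMPARISON_FIELDS = ("pos", "Gtest", "padj", "OR", "ORpadj", "frac_diff")


def parse_comparison(cell):
    """Split a Comparison cell into a dict, or return None if it does not fit.

    Assumes six colon-separated values; all but 'pos' are read as floats.
    """
    parts = cell.strip().split(":")
    if len(parts) != len(COMPARISON_FIELDS):
        return None  # empty cell (site absent in this replicate) or unexpected format
    record = {"pos": parts[0]}
    for name, value in zip(COMPARISON_FIELDS[1:], parts[1:]):
        try:
            record[name] = float(value)
        except ValueError:
            record[name] = None  # keep going if a value is not numeric
    return record


print(parse_comparison("1234:56.7:1e-05:2.5:0.001:0.12"))
```

Returning None for cells that do not match keeps sites that are missing from a replicate distinct from sites with parsed values.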
Running DRUMMER with the test datasets
Several test datasets are included in the DRUMMER repository and can be used to verify DRUMMER is working correctly in your environment. Note that expected outputs are reliant on default parameters and changing these may change the output.
m6A detection in a sample adenovirus dataset using 'exome' mode
The following command parses genome-level alignments to identify putative m6A sites in the adenovirus exome. The command should run to completion in ~5 mins and identify 2 candidate sites.
python DRUMMER.py -r TESTDATA/Adenovirus-Ad5.fasta -n Ad5 -o exome-test -c TESTDATA/exome.Ad5.MOD.bam -t TESTDATA/exome.Ad5.UNMOD.bam -a exome -m True
(note: run python DRUMMER.py from within DRUMMER directory)
m6A detection in a sample adenovirus dataset using 'isoform' mode
The following command parses transcriptome-level alignments to identify putative m6A sites in a limited adenovirus transcriptome comprising seven transcript isoforms originating from the E3 locus. The command should run to completion in ~5 mins and identify 9 candidate sites across three distinct transcripts (E3.12K = 2 sites, E3.RIDa = 1 site, E3.10K = 6 sites).
python DRUMMER.py -r TESTDATA/Ad5_v9.1_complete.fasta -l TESTDATA/Ad5.sample.transcripts.txt -o isoform-test -c TESTDATA/isoform.Ad5.MOD.bam -t TESTDATA/isoform.Ad5.UNMOD.bam -a isoform -m True
Multiple biological replicates
The following command sh
