CHEUI
Concurrent identification of m6A and m5C modifications in individual molecules from nanopore sequencing
CHEUI: Methylation (CH<sub>3</sub>) Estimation Using Ionic current <img src="https://github.com/comprna/CHEUI/blob/master/misc/CHEUI_logo.png" width="280" height="250">
About CHEUI
CHEUI (Methylation (CH<sub>3</sub>) Estimation Using Ionic current) is an RNA modification detection software for Oxford Nanopore direct RNA sequencing data. CHEUI can be used to detect m6A and m5C in individual reads at single-nucleotide resolution from any sample (e.g. a single condition), or to detect differential m6A or m5C between any two conditions. CHEUI uses a two-stage deep learning method to detect m6A and m5C transcriptome-wide at single-read and single-site resolution in any sequence context (i.e. without any sequence constraints).
CHEUI is open source and freely available under an Academic Public License (see copy of the license in this repository).
Table of Contents
- Dependencies
- Outline of CHEUI-solo and CHEUI-diff
- Preprocessing data before running CHEUI
- Install CHEUI
- IMPORTANT
- Detect m6A and m5C modifications in one condition
- Identify differential RNA modifications between two conditions
Dependencies
python=3.7
numpy==1.19.2
pandas==1.3.4
tensorflow-gpu==2.4.1
keras-preprocessing==1.1.2
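A quick way to confirm that an environment matches the pinned versions above is to import each package and compare its `__version__` attribute. The following is a minimal sketch, not part of CHEUI: the `check_versions` helper and the `PINNED` table are illustrative names, and the sketch assumes the usual import names `tensorflow` (for tensorflow-gpu) and `keras_preprocessing` (for keras-preprocessing).

```python
import importlib

# Pinned versions from the dependency list, keyed by import name
# (assumption: tensorflow-gpu imports as "tensorflow",
#  keras-preprocessing imports as "keras_preprocessing").
PINNED = {
    "numpy": "1.19.2",
    "pandas": "1.3.4",
    "tensorflow": "2.4.1",
    "keras_preprocessing": "1.1.2",
}


def check_versions(pinned):
    """Return {module_name: status} where status is 'ok', 'missing', or a mismatch note."""
    report = {}
    for module_name, wanted in pinned.items():
        try:
            module = importlib.import_module(module_name)
        except Exception:
            # The package is absent or fails to import in this environment.
            report[module_name] = "missing"
            continue
        found = getattr(module, "__version__", None)
        if found == wanted:
            report[module_name] = "ok"
        else:
            report[module_name] = f"found {found}, expected {wanted}"
    return report


# Example use (hypothetical helper, run inside the cheui environment):
#   for name, status in check_versions(PINNED).items():
#       print(name, status)
```

A mismatch is not necessarily fatal, but the versions listed above are the ones the repository states as dependencies.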
Outline of CHEUI-solo and CHEUI-diff
<img src="https://github.com/comprna/CHEUI/blob/master/misc/pipeline_CHEUI-solo+diff_github.png" width="900" height="500">
Preprocessing data before running CHEUI
Before running CHEUI:
- Raw signal data (fast5) should be basecalled using Guppy version 4.0.11 or later (https://community.nanoporetech.com/downloads/guppy/). The basecaller model used was template_rna_r9.4.1_70bps*.
- Basecalled sequences (fastq) should be aligned to a reference transcriptome using minimap2 and primary, positive strand alignments should be selected, e.g.
minimap2 -ax map-ont -k14 <transcriptome fasta> <read fastq> | samtools view -F 2324 -b | samtools sort > <sorted-bam-file>
samtools index <sorted-bam-file>
- Signal data should be resquiggled to aligned sequences using Nanopolish (https://nanopolish.readthedocs.io/en/latest/), ensuring that events are rescaled, e.g.
nanopolish index -s <sequencing_summary.txt> -d <fast5_folder> <read fastq>
nanopolish eventalign -t 48 \
--reads <read fastq> \
--bam <sorted-bam-file> \
--genome <transcriptome fasta> \
--scale-events --signal-index --samples --print-read-names > nanopolish_out.txt
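The value 2324 passed to `samtools view -F` in the alignment step is a sum of SAM flag bits, and `-F` discards any alignment that has at least one of those bits set. The sketch below decodes the value to show which alignments are excluded; the bit meanings come from the SAM specification, while the helper names `decode_flag` and `is_kept` are illustrative and not part of samtools or CHEUI.

```python
# SAM flag bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """List the names of all flag bits set in `value`, lowest bit first."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if value & bit]


def is_kept(read_flag, exclude=2324):
    """Mimic `samtools view -F <exclude>`: keep a read only if no excluded bit is set."""
    return (read_flag & exclude) == 0


print(decode_flag(2324))
# ['unmapped', 'reverse_strand', 'secondary', 'supplementary']
```

In other words, 2324 = 4 + 16 + 256 + 2048, so the filter keeps only mapped, forward-strand, primary alignments, which matches the "primary, positive strand" selection described above.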
Install CHEUI
Installation can be performed manually or using Conda (recommended).
Manual installation:
git clone https://github.com/comprna/CHEUI.git
cd CHEUI/test
Conda installation with manual CUDA installation (recommended):
conda create --name cheui python=3.7 tensorflow-gpu=2.4.1 pandas=1.3.4 -y && conda activate cheui
git clone https://github.com/comprna/CHEUI.git
cd CHEUI/test
Conda installation with integrated CUDA installation (not recommended):
conda create --name cheui python=3.7 tensorflow-gpu=2.4.1 pandas=1.3.4 conda-forge::cudatoolkit-dev -y && conda activate cheui
git clone https://github.com/comprna/CHEUI.git
cd CHEUI/test
IMPORTANT
Please follow the instructions below carefully.
- Notice that for detecting m6A or m5C, the nanopolish output files require different preprocessing scripts: CHEUI_preprocess_m6A.py for m6A and CHEUI_preprocess_m5C.py for m5C.
- CHEUI model 1 (read-level predictions) and model 2 (site-level predictions) use different predictive models for m6A and m5C, which have to be specified using the --DL_model flag. For m6A: ```../CHEUI_trained_models/CHEUI_m6A_model1.h5``` and ```../CHEUI_trained_models/CHEUI_m6A_model2.h5```. For m5C: ```../CHEUI_trained_models/CHEUI_m5C_model1.h5``` and ```../CHEUI_trained_models/CHEUI_m5C_model2.h5```.
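Because the preprocessing script and both model files must all match the modification being analysed, it can help to resolve them from a single lookup. This is a minimal sketch under stated assumptions: the file names are those listed above, while the `cheui_paths` helper and its default directories (relative to `CHEUI/test`) are illustrative and not part of CHEUI.

```python
# File names taken from the notes above; the helper itself is hypothetical.
CHEUI_FILES = {
    "m6A": {
        "preprocess": "CHEUI_preprocess_m6A.py",
        "model1": "CHEUI_m6A_model1.h5",
        "model2": "CHEUI_m6A_model2.h5",
    },
    "m5C": {
        "preprocess": "CHEUI_preprocess_m5C.py",
        "model1": "CHEUI_m5C_model1.h5",
        "model2": "CHEUI_m5C_model2.h5",
    },
}


def cheui_paths(modification, scripts_dir="../scripts",
                models_dir="../CHEUI_trained_models"):
    """Return the preprocessing script and model paths for 'm6A' or 'm5C'."""
    if modification not in CHEUI_FILES:
        raise ValueError(f"unknown modification: {modification!r}")
    names = CHEUI_FILES[modification]
    return {
        "preprocess": f"{scripts_dir}/{names['preprocess']}",
        "model1": f"{models_dir}/{names['model1']}",  # read-level model
        "model2": f"{models_dir}/{names['model2']}",  # site-level model
    }


print(cheui_paths("m5C")["model2"])
```

Keeping the three paths together makes it harder to pair, for example, m6A-preprocessed signals with an m5C model by accident.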
Detect m6A and m5C modifications in one condition
CHEUI preprocessing step
This script takes the output from nanopolish and creates a file containing the signals corresponding to 9-mers centered on A nucleotides, together with their IDs.
../scripts/CHEUI_preprocess_m6A.py --help
required arguments:
-i, --input_nanopolish Nanopolish output file. Nanopolish should be run with the following flags:
nanopolish eventalign --reads <in.fasta> --bam
<in.bam> --genome <genome.fa> --print-read-names
--scale-events --samples > <out.txt>
-m, --kmer_model file containing the expected signal k-mer means
(available at CHEUI/kmer_models/model_kmer.csv)
-o, --out_dir output directory
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-s <str>, --suffix_name <str>
name to use for output files
-n CPU, --cpu CPU Number of CPUs (threads) to use
Example command of the preprocessing step for m6A:
python3 ../scripts/CHEUI_preprocess_m6A.py -i nanopolish_output_test.txt -m ../kmer_models/model_kmer.csv -o out_A_signals+IDs.p -n 15
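To make the notion of "9-mers centered on A" concrete, the sketch below lists every 9-mer in a sequence whose middle base is A, and the five overlapping 5-mers each one contains. It only illustrates the windowing on a plain sequence string; it is not CHEUI's implementation, it does not touch signal data, and the function names, the 0-based coordinates, the skipping of positions near the sequence ends, and the 5-mer size are all assumptions made for illustration.

```python
def ninemers_centered_on(seq, base="A", flank=4):
    """Return (position, 9-mer) pairs for every `base` with `flank` bases on each side.

    Positions are 0-based; bases closer than `flank` to either end are skipped.
    """
    seq = seq.upper()
    hits = []
    for i in range(flank, len(seq) - flank):
        if seq[i] == base:
            hits.append((i, seq[i - flank:i + flank + 1]))
    return hits


def overlapping_kmers(ninemer, k=5):
    """Split a 9-mer into its overlapping k-mers (five of them when k=5)."""
    return [ninemer[j:j + k] for j in range(len(ninemer) - k + 1)]


for pos, window in ninemers_centered_on("GGCTAGACTTG"):
    print(pos, window, overlapping_kmers(window))
# 4 GGCTAGACT ['GGCTA', 'GCTAG', 'CTAGA', 'TAGAC', 'AGACT']
# 6 CTAGACTTG ['CTAGA', 'TAGAC', 'AGACT', 'GACTT', 'ACTTG']
```

Every 5-mer in a window contains the central base, which is why a 9-mer is the span of sequence whose events are affected by that one nucleotide when events are 5-mers.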
The processing of the Nanopolish output for m5C is very similar:
../scripts/CHEUI_preprocess_m5C.py --help
required arguments:
-i, --input_nanopolish Nanopolish output file. Nanopolish should be run with the following flags:
nanopolish eventalign --reads <in.fasta> --bam
<in.bam> --genome <genome.fa> --print-read-names
--scale-events --samples > <out.txt>
-m, --kmer_model file containing the expected signal k-mer means
(available at CHEUI/kmer_models/model_kmer.csv)
-o, --out_dir output directory
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-s <str>, --suffix_name <str>
name to use for output files
-n CPU, --cpu CPU Number of cores to use
Example command of the preprocessing step for m5C:
python3 ../scripts/CHEUI_preprocess_m5C.py -i nanopolish_output_test.txt -m ../kmer_models/model_kmer.csv -o out_C_signals+IDs.p -n 15
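Since the two preprocessing commands differ only in the script name and output directory, they can be generated from one function when both modifications are processed in a pipeline. The following is a sketch only: `preprocess_command` is a hypothetical helper, it uses just the `-i`, `-m`, `-o` and `-n` flags shown in the help text above, and the default paths assume the working directory is `CHEUI/test`.

```python
def preprocess_command(modification, nanopolish_out, out_dir,
                       kmer_model="../kmer_models/model_kmer.csv",
                       cpus=1, scripts_dir="../scripts"):
    """Build the argument list for the Python preprocessing script of 'm6A' or 'm5C'."""
    if modification not in ("m6A", "m5C"):
        raise ValueError(f"unknown modification: {modification!r}")
    script = f"{scripts_dir}/CHEUI_preprocess_{modification}.py"
    return [
        "python3", script,
        "-i", nanopolish_out,   # nanopolish eventalign output
        "-m", kmer_model,       # expected signal k-mer means
        "-o", out_dir,          # output directory
        "-n", str(cpus),        # number of CPUs to use
    ]


cmd = preprocess_command("m5C", "nanopolish_output_test.txt",
                         "out_C_signals+IDs.p", cpus=15)
print(" ".join(cmd))
# To execute it, one option is: subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues with characters such as `+` in the output directory names.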
CHEUI preprocessing step -- C++ version
A faster method to run the CHEUI preprocessing step. The C++ version is 2-10 times faster than the Python version.
Installation
cd ../scripts/preprocessing_CPP/
./build.sh
Parameters of the program
$ ./CHEUI -h
required arguments:
-i, --input-nanopolish Nanopolish output file. Nanopolish should be run with the following flags:
nanopolish eventalign --reads <in.fasta> --bam
<in.bam> --genome <genome.fa> --print-read-names
--scale-events --samples > <out.txt>
-m, --kmer-model file containing the expected signal k-mer means
(available at CHEUI/kmer_models/model_kmer.csv)
-o, --out-dir output directory
--m6A/--m5C preprocessing type
optional arguments:
-h, --help show this help message and exit
-s <str>, --suffix_name <str>
name to use for output files
-n CPU, --cpu CPU Number of cores to use
-t, --temp-dir temp file directory (default: out dir)
Example command of the preprocessing step for m6A:
./CHEUI -i ../../test/nanopolish_output_test.txt -o ../../test/out_A_signals+IDs.p/ -m ../../kmer_models/model_kmer.csv -n 16 --m6A
Example command of the preprocessing step for m5C:
./CHEUI -i ../../test/nanopolish_output_test.txt -o ../../test/out_C_signals+IDs.p/ -m ../../kmer_models/model_kmer.csv -n 16 --m5C
For a large nanopolish file, we recommend splitting it into smaller files, running the preprocessing step on each of them, and then combining the outputs with the following command:
python3 ../scripts/combine_binary_file.py -i [output binary folder] -o [combined output file name]
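The repository text does not say how the nanopolish file should be split, so the following is only one possible approach. It assumes the eventalign output is tab-separated with a single header line, that all rows of a read are contiguous, and that the fourth column identifies the read (as with `--print-read-names`). The header is copied into every chunk and chunks are cut only at read boundaries; whether CHEUI needs either property is not confirmed here. The `split_eventalign` name and the chunk file naming are illustrative.

```python
import os


def split_eventalign(path, out_dir, max_rows=1_000_000, read_col=3):
    """Split an eventalign file into chunks of roughly `max_rows` rows each.

    A new chunk starts only when the read identifier changes, so the rows of
    one read are never spread over two files. Returns the chunk paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    rows_in_chunk = 0
    last_read = None
    with open(path) as src:
        header = src.readline()
        for line in src:
            read_id = line.split("\t")[read_col]
            chunk_full = rows_in_chunk >= max_rows and read_id != last_read
            if out is None or chunk_full:
                if out is not None:
                    out.close()
                chunk_path = os.path.join(
                    out_dir, f"chunk_{len(chunk_paths):04d}.txt")
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w")
                out.write(header)  # every chunk keeps the column header
                rows_in_chunk = 0
            out.write(line)
            rows_in_chunk += 1
            last_read = read_id
    if out is not None:
        out.close()
    return chunk_paths
```

Each chunk can then be passed to the preprocessing step separately, and the resulting outputs combined with `combine_binary_file.py` as shown above.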
