SkillAgentSearch skills...

DENA

Deep learning model used to detect RNA m6a with read level based on the Nanopore direct RNA data.

Install / Use

/learn @weir12/DENA
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

image image image image

DENA (Deeplearning Explore Nanopore m6A)

Deep learning model used to detect RNA m6a with read level based on the Nanopore direct RNA data.

Author: liang Ou

E-mail:liangou@ips.ac.cn

Sincere thanks to in-house_scripts developed by Hang Qin (https://github.com/q1134269149).

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Download well-trained LSTM-model files

The links of DENA model files are:

(1) Baidu network disk (百度网盘): https://pan.baidu.com/s/166FmGUziN91kCLknern6Rw?pwd=ocm8;

(2) Google Drive: https://drive.google.com/file/d/10GZaENvetOZ0ClSsHxDhSal0VXHlmK26/view?usp=sharing.

Please clone DENA again, and download the updated DENA model files through the above links, and replace the old model files, and then re-perform RNA m6A modification prediction using the code of step4 (Predict). Thanks!

0.Prerequisites

Utilizing Conda or virtualenv to create a relatively independent & clean work environment may be a wise choice for using DENA

Here are What things you need to install(Please confirm one by one):

  1. Unix like system(centos,ubuntu,etc)
  2. Cuda-supported graphics cards(optional)
  3. Python>=3.7.x and Pytorch
  4. tombo,minimap2,samtools

Note:

You can get source code of DENA from zenodo with the link: https://zenodoorg/record/5603381 (Discard).

And you can also learn more by reading our research entitled "DENA: training an authentic neural network model using Nanopore sequencing data of Arabidopsis transcripts for detection and quantification of N6-methyladenosine on RNA" .

Input data required

  1. a batch of fast5 files containing the raw current signals
  2. a fastq file which is contain basecalled sequence corresponding fast5 above
  3. Appropriate reference sequence(Transcriptome is recommended for RNA data)

*Tips: ${variable} : You need to assign it with the your actual value

1.Obtain coordinates matching motif in fasta sequence of reference (Must be transcriptome reference)

python3 LSTM_extract.py get_pos --fasta ${fasta_fn}  --motif 'RRACH' --output ./candidate_predict_pos.txt

You will get result(candidate_predict_pos.txt) like this

AT1G01010.1     17      22      +       AAACC

*Note:Please confirm that the transcriptome reference is provided instead of the genomic reference before this step.

New version function

  • We no longer need external C++ tools
  • Fixed compatibility bugs in FASTA file of some species

2.fast5 base-calling, Signal re-sqguiggle and sequence alignment

2.1 fast5 base-calling

(Optional)If the fast5 files was multi_read_fast5 files, it is necessary to convert folders containing multi_read_fast5 files into single_read_fast5 files using https://github.com/nanoporetech/ont_fast5_api.

multi_to_single_fast5 -t 20 -i ${multi_read_fast5_folder} -s ${single_read_fast5_folder} --recursive

This step is to obtain the fastq sequences from fast5 files by base-calling using [guppy](Preferred version3.2.4)

${SoftPath}/guppy_basecaller -i ${single_read_fast5_folder} -s ${outfile} --flowcell FLO-MIN106 --kit SQK-RNA001 --cpu_threads_per_caller {thread} --qscore_filtering --fast5_out --records_per_fastq 0 --recursive
cat ${outfile}/pass/*.fastq > basecalls.fq
  • ${SoftPath}: the path of guppy software
  • ${single_read_fast5_folder}: the path of single fast5 files that need to base-call.
  • ${outfile}: the path of output folder
    *Note: Please check the version of flowcell and kit of the Library Building used in the experiments,and set them correctly.

2.2 tombo re-sqguiggle

This step is to obtain a unique mapping between the signal fragment of each base of each reads and the reference sequence For detailed help, please see https://github.com/nanoporetech/tombo

tombo resquiggle --rna --processes {thread} --corrected-group RawGenomeCorrected_001 --basecall-group Basecall_1D_001 --include-event-stdev --overwrite --ignore-read-locks ${params.fast5} ${params.ref}

*Note: Please check the basecall-group to be used before re-sqguiggle, and set the corrected-group.

2.3 sequence alianment based on minimap2

For detailed help, please see minimap2 samtools

minimap2 -ax map-ont -L --secondary=no ${transcriptome} ${basecalls.fq} | samtools view -bh -F 2324 | samtools sort -O bam > basecalls.bam
samtools index basecalls.bam
  • ${transcriptome}: the fasta of transcriptome reference
  • ${basecalls.fq}: the fastq of base-calling from fast5 files in step 2.1

3.extract features

New version function

  • Support for reading BAM files in BRI mode to reduce memory consumption

Install the C ++ libraries and Python wrappers to enable this functionality https://github.com/nanoporetech/bripy https://github.com/jts/bri

  • Flexible window Settings are now supported
  • In this step,you need provide two input params for program:fast5_folder(has re-squiggled by tombo) and bam file(sorted & index)

Parameters panel

	parser.add_argument('--processes',default=24,type=int,
						help=("Number of processes allocated"))
	parser_a = subparsers.add_parser('get_pos',formatter_class=argparse.RawDescriptionHelpFormatter,help='get candidate position')
	parser_a.add_argument('--fasta',required=True, default=None,
						help=("reference fasta"))
	parser_a.add_argument('--motif',  default='RRACH',
						help=("specifies a motif pattern"))
	parser_a.add_argument('--output', default='./candidate_predict_pos.txt',
						help=("output file"))
	parser_a.set_defaults(func=get_pos)
	parser_b = subparsers.add_parser('predict',formatter_class=argparse.RawDescriptionHelpFormatter,help='predict')
	parser_b.add_argument('--fast5',required=True, default=None,
						help=("a directory(has been re-squiggled by tombo) that contains the FAST5 files"))	
	parser_b.add_argument('--corr_grp',default="RawGenomeCorrected_000",
						help=("Analysis slot containing the re-squiggle information"))		
	parser_b.add_argument('--bam',required=True, default=None,	
						help=("BAM file used to extract base-quality(feature)"))
	parser_b.add_argument('--sites',default='./candidate_predict_pos.txt',	
						help=("candidate position are used to extract features of mapped reads"))
	parser_b.add_argument('--label',required=True,	
						help=("The string used to distinguish the sample"))	
	parser_b.add_argument('--windows',required=True,nargs=2,metavar='3',type=int,
						help=("Window drift away from the center of m6A"))
	parser_b.add_argument('--debug',action='store_true',default=False,
						help=("Enable debug mode (output more detailed run log)"))
	parser_b.add_argument('--bri',action='store_true',default=False,
						help=("Enable BRI mode (Reduce RAM consumption of BAM files)"))	
python3 LSTM_extract.py --processes ${number} predict --fast5 ${fast5_fn}  --corr_grp ${RawGenomeCorrected_000} --bam ${bam_fn}  --sites ${candidate_predict_pos.txt} --label ${any meaningful string} --windows 2 2
  • ${RawGenomeCorrected_000}: The path of corr_grp generated in step 2.2. Please confirm to set the same corr_grp as step 2.2.

  • ${number}: The Number of threads, default: 25. *Note: --windows 2 2 indicates that a total of 5 bases are extracted, which contains the candidate modified site and 2 bases upstream and downstream of it, e.g. "AAACA".

  • You will get result(*.tmp) like this

>AT4G35300.4_2258_GGACT
37b79f1c-c3c2-4c6f-a25c-65e618b7bb6f    28.0,27.0,24.0,31.0,24.0,74.0,6.0,33.0,27.0,63.0,2.6680189601886424,2.4252261046166588,0.16661375589914051,-0.4574264926055352,0.8548283129364287,2.6957830373199303,2.568959475115271,0.08321765590395497,-0.5128530864579424,0.8695237415728408,0.45954324955726,0.7483187244003591,0.23116851931302654,0.21901705697503512,0.23758996696952467
e944b3ff-156c-409f-95f3-996dfa3d3fd3    26.0,30.0,28.0,25.0,31.0,158.0,6.0,47.0,37.0,27.0,2.3582877438291905,2.354059678502405,-0.22480153918158693,-0.5296594244218555,1.084794467401169,2.5690832257777787,2.4814037210635487,-0.24292374684288695,-0.5435391915773902
View on GitHub
GitHub Stars22
CategoryEducation
Updated1y ago
Forks5

Languages

Python

Security Score

75/100

Audited on Oct 9, 2024

No findings