RATTLE
Reference-free reconstruction and error correction of transcriptomes from Nanopore long-read sequencing
RATTLE
Reference-free reconstruction and quantification of transcriptomes from long-read sequencing
- de la Rubia I, Srivastava A, Xue W, Indi JA, Carbonell-Sala S, Lagarde J, Albà MM, Eyras E. RATTLE: reference-free reconstruction and quantification of transcriptomes from Nanopore sequencing. Genome Biol. 2022 Jul 8;23(1):153. doi: https://doi.org/10.1186/s13059-022-02715-w. PMID: 35804393
Table of Contents
- Requirements
- Installation
- Quick start
- Running RATTLE
- Example datasets
- Reference based benchmarking
- Snakemake
Requirements
GCC, G++ with C++14 support
(GCC version < 10)
Installation
- Clone the repository
git clone --recurse-submodules https://github.com/comprna/RATTLE
- Build RATTLE
cd RATTLE
./build.sh
(this will generally take less than 1 minute)
Quick start
We provide here some of the most common commands used when running RATTLE. Note: the commands and parameters are still under development and may change in future versions.
- Cluster cDNA Nanopore reads at gene level with 24 threads
$ ./rattle cluster -i reads.fq -t 24 -o .
- Cluster cDNA Nanopore reads with labels at gene level with 24 threads
$ ./rattle cluster -i input_1.fq,input_2.fq -l label_1,label_2 -t 24 -o .
- Cluster cDNA Nanopore reads at isoform level with 24 threads
$ ./rattle cluster -i reads.fq -t 24 --iso
- Cluster RNA Nanopore reads at isoform level with 24 threads
$ ./rattle cluster -i reads.fq -t 24 --iso --rna
- View clustering summary (csv with read_id,cluster_id)
$ ./rattle cluster_summary -i reads.fq -c clusters.out
- Extract 1 fastq file per cluster in clusters/ folder
$ mkdir clusters
$ ./rattle extract_clusters -i reads.fq -c transcripts.out -o clusters --fastq
- Correct reads with 24 threads using isoform clusters
$ ./rattle correct -i reads.fq -c clusters.out -t 24
- Polish RNA consensus sequences and build final transcriptome using 24 threads
$ ./rattle polish -i consensi.fq -t 24 --rna
Running RATTLE
We provide here the details of each RATTLE command.
Clustering
This is the first and most important step in RATTLE. This command will generate the first set of read clusters representing potential genes and transcripts.
$ ./rattle cluster -h
-h, --help
shows this help message
-i, --input
input fasta/fastq file or compressed with .gz extension, will automatically check file extension (required)
-l, --label
labels for the files in order of entry (optional)
-o, --output
output folder (default: .)
-t, --threads
number of threads to use (default: 1)
-k, --kmer-size
k-mer size for gene clustering (default: 10, maximum: 16)
-s, --score-threshold
minimum score for two reads to be in the same gene cluster (default: 0.2)
-v, --max-variance
max allowed variance for two reads to be in the same gene cluster (default: 1000000)
--iso
perform clustering at the isoform level
--iso-kmer-size
k-mer size for isoform clustering (default: 11, maximum: 16)
--iso-score-threshold
minimum score for two reads to be in the same isoform cluster (default: 0.3)
--iso-max-variance
max allowed variance for two reads to be in the same isoform cluster (default: 25)
-B, --bv-start-threshold
starting threshold for the bitvector k-mer comparison (default: 0.4)
-b, --bv-end-threshold
ending threshold for the bitvector k-mer comparison (default: 0.2)
-f, --bv-falloff
falloff value for the bitvector threshold for each iteration (default: 0.05)
-r, --min-reads-cluster
minimum number of reads per cluster (default: 0)
-p, --repr-percentile
cluster representative percentile (default: 0.15)
--rna
use this mode if data is direct RNA (disables checking both strands)
--verbose
use this flag to print the progress of the run
--raw
set this flag to use all the reads without any length filtering (off by default)
--lower-length
filter out reads shorter than this value (default: 150)
--upper-length
filter out reads longer than this value (default: 100,000)
This clustering step will generate a file containing read clusters in binary format (clusters.out). To work with these clusters, the following commands are used.
You can run the RATTLE pipeline with multiple inputs. RATTLE will keep track of the source label for each read. That is, RATTLE will create gene clusters, transcript clusters, and consensus transcripts with all reads from all input samples, and you can identify which reads from each input sample were used in each cluster or transcript. Specify the multiple inputs and labels separated by commas:
-i input_1,input_2,input_3,...,input_n -l label_1,label_2,label_3,...,label_n
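When many samples are involved, the comma-separated lists can be assembled programmatically. The following is a minimal sketch, not part of RATTLE: `build_input_args` and the file names and labels are hypothetical, and only the `-i`/`-l` syntax shown above is taken from this README.

```python
def build_input_args(samples):
    """Build the -i/-l arguments from (path, label) pairs, keeping their order."""
    paths = [path for path, _ in samples]
    labels = [label for _, label in samples]
    # Commas separate the entries, so they cannot appear inside a path or label.
    if any("," in item for item in paths + labels):
        raise ValueError("paths and labels must not contain commas")
    return ["-i", ",".join(paths), "-l", ",".join(labels)]


# Hypothetical input files and labels, for illustration only.
samples = [("sample_a.fq", "ctrl"), ("sample_b.fq", "treated")]
args = build_input_args(samples)
print(" ".join(["./rattle", "cluster"] + args + ["-t", "24", "-o", "."]))
```

Building both lists from the same pairs keeps each label aligned with its file, which matters because labels are matched to inputs by order of entry.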
Description of clustering parameters
General parameters for the clustering step
--raw
If this flag is used, all reads from the input are used without any filtering for length, i.e. the --lower-length and --upper-length parameters below are not used.
--lower-length (default: 150)
By default, we do not use reads that are shorter than 150 nt. This limit can be increased to produce longer transcript models. Nanopore short-read sequencing is becoming possible, so this lower bound could be lowered to enable the reference-free reconstruction of small non-coding RNAs.
--upper-length (default: 100,000)
Although very long transcripts are possible, we generally found reads longer than 100,000 nt not to be reliable, possibly resulting from experimental artifacts. As data improves, this parameter can be relaxed to identify ultra-long transcripts.
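The interaction of the three length-related options can be summarised in a few lines. This is an illustrative sketch rather than RATTLE's code: the function name `keep_read` is hypothetical, and treating both bounds as inclusive is an assumption (the help text only says reads shorter or longer than the limits are filtered out).

```python
def keep_read(length, lower=150, upper=100_000, raw=False):
    """Return True if a read of this length passes the length filter."""
    if raw:
        # --raw disables length filtering; lower and upper are ignored.
        return True
    # Assumption: a read exactly at a bound is kept.
    return lower <= length <= upper


lengths = [80, 150, 2_300, 100_000, 250_000]
print([n for n in lengths if keep_read(n)])
print([n for n in lengths if keep_read(n, raw=True)])
```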
Parameters related to the bitvector comparison in the Clustering step
-B, --bv-start-threshold (default: 0.4)
This threshold is the minimum bitvector score for two reads to be considered potentially in the same gene cluster. The bitvector score is defined as the fraction of unique k-mers that two reads have in common over the maximum number of unique k-mers in the two reads. If the score is above this threshold, the two reads are compared using the LIS similarity score (see --score-threshold below). RATTLE performs multiple iterations of this test with all reads, starting at the value of -B and decreasing in steps of -f until the threshold -b is reached. These multiple iterations make it possible to test all reads under various conditions for clustering. To see the result of changing this parameter, please see Table S11 from RATTLE’s paper.
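The score definition above can be written out directly. The sketch below uses Python sets in place of actual bitvectors, so it illustrates the formula only, not RATTLE's implementation; the function names and the k-mer size used in the example are assumptions.

```python
def unique_kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bitvector_score(read_a, read_b, k):
    """Shared unique k-mers divided by the larger of the two unique k-mer counts."""
    kmers_a = unique_kmers(read_a, k)
    kmers_b = unique_kmers(read_b, k)
    largest = max(len(kmers_a), len(kmers_b))
    if largest == 0:
        return 0.0  # both reads are shorter than k
    return len(kmers_a & kmers_b) / largest


# Toy reads and k=4, chosen only to keep the example small.
score = bitvector_score("ACGTACGTTG", "ACGTACGTTGCA", k=4)
print(score)  # 6 shared unique k-mers over max(6, 8) = 0.75
```

With the default -B of 0.4, this toy pair would pass the first test and move on to the LIS comparison.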
-b, --bv-end-threshold (default: 0.2)
The ending threshold for the bitvector score in the iterations. A low value for -b makes it possible to rescue reads that have not been clustered in previous iterations. If -b is close to -B (or the same), only one or a few iterations will be performed. This will make the clustering less sensitive, potentially resulting in many unclustered reads. To see the result of changing this parameter, please see Table S11 from RATTLE’s paper.
-f, --bv-falloff (default: 0.05)
This is the step-change value between the first (-B) and final (-b) bitvector score thresholds and determines the number of iterations to perform clustering. A small value will provide more resolution in the definition of clusters but will result in more iterations, potentially leading to longer computational time. To see the result of changing this parameter, please see Table S11 from RATTLE’s paper.
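To see how -B, -b and -f together determine the number of iterations, the sequence of thresholds can be enumerated. This is a sketch under stated assumptions: the function name is hypothetical, the end threshold is assumed to be included when a step lands on it, and no extra iteration is added when the steps do not land exactly on -b. RATTLE's handling of that last case is not described here.

```python
import math


def bitvector_thresholds(start=0.4, end=0.2, falloff=0.05):
    """List the threshold used in each iteration, from start down to end."""
    if falloff <= 0 or start < end:
        raise ValueError("need falloff > 0 and start >= end")
    # Small epsilon guards against floating-point error in the division.
    steps = math.floor((start - end) / falloff + 1e-9)
    # Round to hide floating-point noise such as 0.35000000000000003.
    return [round(start - i * falloff, 10) for i in range(steps + 1)]


print(bitvector_thresholds())           # defaults: five iterations
print(bitvector_thresholds(0.4, 0.4))   # -b equal to -B: a single iteration
```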
-r, --min-reads-cluster (default: 0)
Only clusters with more than this number of reads will be reported and used in the next step. The default means that singletons (clusters composed of a single read) are also included.
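The "more than" wording implies a strict comparison, which the following sketch makes explicit. The function name and the representation of clusters as lists of read IDs are hypothetical.

```python
def filter_clusters(clusters, min_reads=0):
    """Keep clusters with strictly more than min_reads reads."""
    return [cluster for cluster in clusters if len(cluster) > min_reads]


clusters = [["r1"], ["r2", "r3"], ["r4", "r5", "r6"]]
print(len(filter_clusters(clusters)))               # default 0 keeps singletons
print(len(filter_clusters(clusters, min_reads=1)))  # drops the singleton
```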
-p, --repr-percentile (default: 0.15)
In the iterative clustering algorithm, reads are tested against a representative of a cluster rather than against all the reads in that cluster. The value of -p is the percentile position, in the ranking of the cluster's reads sorted by length (from longest to shortest), of the read used as representative. The smaller the value, the closer to the top of the ranking. The longest read in a cluster may seem to be a better representative. However, during RATTLE optimization we observed that this is not always the case, and that using a read a few positions below (0.15 percentile, i.e. the 15th position in a cluster of 100 reads) results in better performance.
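The choice of representative can be sketched as follows. The function name is hypothetical and the rounding rule is an assumption, chosen so that 0.15 in a cluster of 100 reads selects the 15th longest read as in the example above; RATTLE's exact rule for small clusters is not stated here.

```python
def pick_representative(reads, percentile=0.15):
    """Return the read at the given percentile of the length ranking."""
    if not reads:
        raise ValueError("cluster is empty")
    ranked = sorted(reads, key=len, reverse=True)  # longest first
    # Assumption: 1-based rank, rounded, clamped to a valid position.
    rank = min(max(1, round(percentile * len(ranked))), len(ranked))
    return ranked[rank - 1]


# Toy cluster of 100 reads with lengths 1..100.
cluster = ["A" * n for n in range(1, 101)]
print(len(pick_representative(cluster)))       # 15th longest read
print(len(pick_representative(cluster, 0.0)))  # falls back to the longest read
```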
Parameters related to the LIS-based similarity in the Clustering step
-k, --kmer-size (default: 10, maximum: 16)
This is the size of k-mer used to compare two reads using the Longest Increasing Subsequence (LIS) algorithm (see RATTLE’s paper for details). A low value will enable a more sensitive comparison but will result in longer computing times. A higher value will make the comparison faster, but may miss cases due to sequencing errors. For reads with low error rate, this can be set to a value higher than the default. The maximum of 16 is used to ensure the efficiency of the algorithmic implementation.
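The trade-off between k-mer size and tolerance to sequencing errors can be illustrated with a small LIS computation. This is a simplified sketch, not RATTLE's implementation: the function names are hypothetical, only the first occurrence of each k-mer is used, and the way the collinear k-mer count is turned into the similarity score is not covered here (see RATTLE's paper).

```python
from bisect import bisect_left


def first_positions(seq, k):
    """Map each k-mer to the position of its first occurrence."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], i)
    return positions


def lis_length(values):
    """Length of the longest strictly increasing subsequence."""
    tails = []  # tails[n] = smallest tail of an increasing run of length n + 1
    for value in values:
        idx = bisect_left(tails, value)
        if idx == len(tails):
            tails.append(value)
        else:
            tails[idx] = value
    return len(tails)


def collinear_kmers(read_a, read_b, k):
    """Count shared k-mers that appear in the same order in both reads."""
    pos_b = first_positions(read_b, k)
    kmers_a = (read_a[i:i + k] for i in range(len(read_a) - k + 1))
    matches = [pos_b[kmer] for kmer in kmers_a if kmer in pos_b]
    return lis_length(matches)


original = "ACGTTGCAAGCT"
with_error = "ACGTTACAAGCT"  # one substitution at position 5
print(collinear_kmers(original, with_error, k=3))
print(collinear_kmers(original, with_error, k=4))
```

In this toy case a single substitution removes three of the ten 3-mers but four of the nine 4-mers, which is why larger k values are better suited to reads with a low error rate.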
-s, --score-threshold (default: 0.2)
This parameter sets the minimum
