RATTLE

Reference-free reconstruction and quantification of transcriptomes from long-read sequencing

de la Rubia I, Srivastava A, Xue W, Indi JA, Carbonell-Sala S, Lagarde J, Albà MM, Eyras E. RATTLE: reference-free reconstruction and quantification of transcriptomes from Nanopore sequencing. Genome Biol. 2022 Jul 8;23(1):153. doi: https://doi.org/10.1186/s13059-022-02715-w. PMID: 35804393

Requirements
Installation
Quick start
Running RATTLE
Example datasets
- Human direct RNA sequencing
Reference based benchmarking
Snakemake

Requirements

GCC, G++ with C++14 support

(GCC version < 10)

Installation

Clone the repository

git clone --recurse-submodules https://github.com/comprna/RATTLE

Build RATTLE

cd RATTLE
./build.sh

(this will generally take less than 1 minute)

Quick start

We provide here some of the most common commands used when running RATTLE. Note: the commands and parameters are still under development and may change in future versions.

Cluster cDNA Nanopore reads at gene level with 24 threads

$ ./rattle cluster -i reads.fq -t 24 -o .

Cluster cDNA Nanopore reads with labels at gene level with 24 threads

$ ./rattle cluster -i reads.fq -l label_1 -t 24 -o .

Cluster cDNA Nanopore reads at isoform level with 24 threads

$ ./rattle cluster -i reads.fq -t 24 --iso

Cluster RNA Nanopore reads at isoform level with 24 threads

$ ./rattle cluster -i reads.fq -t 24 --iso --rna

View clustering summary (csv with read_id,cluster_id)

$ ./rattle cluster_summary -i reads.fq -c clusters.out

Extract 1 fastq file per cluster in clusters/ folder

$ mkdir clusters
$ ./rattle extract_clusters -i reads.fq -c transcripts.out -o clusters --fastq

Correct reads with 24 threads using isoform clusters

$ ./rattle correct -i reads.fq -c clusters.out -t 24

Polish RNA consensus sequences and build final transcriptome using 24 threads

$ ./rattle polish -i consensi.fq -t 24 --rna

Running RATTLE

We provide here the details of each RATTLE command.

Clustering

This is the first and most important step in RATTLE. This command will generate the first set of read clusters representing potential genes and transcripts.

$ ./rattle cluster -h
    -h, --help
        shows this help message
    -i, --input
        input fasta/fastq file or compressed with .gz extension, will automatically check file extension (required)
    -l, --label
        labels for the files in order of entry (optional)
    -o, --output
        output folder (default: .)
    -t, --threads
        number of threads to use (default: 1)
    -k, --kmer-size
        k-mer size for gene clustering (default: 10, maximum: 16)
    -s, --score-threshold
        minimum score for two reads to be in the same gene cluster (default: 0.2)
    -v, --max-variance
        max allowed variance for two reads to be in the same gene cluster (default: 1000000)
    --iso
        perform clustering at the isoform level
    --iso-kmer-size
        k-mer size for isoform clustering (default: 11, maximum: 16)
    --iso-score-threshold
        minimum score for two reads to be in the same isoform cluster (default: 0.3)
    --iso-max-variance
        max allowed variance for two reads to be in the same isoform cluster (default: 25)
    -B, --bv-start-threshold
        starting threshold for the bitvector k-mer comparison (default: 0.4)
    -b, --bv-end-threshold
        ending threshold for the bitvector k-mer comparison (default: 0.2)
    -f, --bv-falloff
        falloff value for the bitvector threshold for each iteration (default: 0.05)
    -r, --min-reads-cluster
        minimum number of reads per cluster (default: 0)
    -p, --repr-percentile
        cluster representative percentile (default: 0.15)
    --rna
        use this mode if data is direct RNA (disables checking both strands)
    --verbose
        use this flag to print the progress of the run
    --raw
        set this flag to use all the reads without any length filtering (off by default)
    --lower-length
        filter out reads shorter than this value (default: 150)
    --upper-length
        filter out reads longer than this value (default: 100,000)

This clustering step will generate a file containing read clusters in binary format (clusters.out). To work with these clusters, the following commands are used.

You can run the RATTLE pipeline with multiple inputs. RATTLE will keep track of the source label for each read. That is, RATTLE will create gene clusters, transcript clusters, and consensus transcripts with all reads from all input samples, and you can identify which reads from each input sample were used in each cluster or transcript. To do so, specify the multiple inputs and labels separated by commas:

-i input_1,input_2,input_3,...,input_n -l label_1,label_2,label_3,...,label_n
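The comma-separated form above can also be assembled programmatically, for example when launching RATTLE from a script. The following is only a minimal sketch: the helper name `cluster_args` and the file names are hypothetical, and it uses only the `-i`, `-l`, `-t` and `-o` flags listed in the help text. The check that inputs and labels have the same length is an assumption based on the help text saying labels are given "in order of entry".

```python
def cluster_args(inputs, labels, threads=1, output="."):
    """Build an argument list for `rattle cluster` with several labelled inputs.

    Hypothetical helper: it only formats arguments and does not run RATTLE.
    """
    if len(inputs) != len(labels):
        # Assumption: one label per input file, in the same order.
        raise ValueError("need exactly one label per input file")
    return [
        "./rattle", "cluster",
        "-i", ",".join(inputs),   # comma-separated input files
        "-l", ",".join(labels),   # comma-separated labels, same order
        "-t", str(threads),
        "-o", output,
    ]


if __name__ == "__main__":
    # Example file names and labels are made up for illustration.
    print(" ".join(cluster_args(["a.fq", "b.fq"], ["ctrl", "treated"], threads=24)))
```

The resulting list can be passed to `subprocess.run` if desired.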

Description of clustering parameters

General parameters for the clustering step

--raw

If this flag is used, all reads from the input are used without any filtering for length, i.e. the --lower-length and --upper-length parameters below are not used.

--lower-length (default: 150)

By default, we do not use reads that are shorter than 150 nt. This limit can be increased to produce longer transcript models. Nanopore short-read sequencing is becoming possible, so this lower bound could be lowered to enable the reference-free reconstruction of small non-coding RNAs.

--upper-length (default: 100,000)

Although very long transcripts are possible, we generally found reads longer than 100,000 nt not to be reliable, possibly resulting from experimental artifacts. As data improves, this parameter can be relaxed to identify ultra-long transcripts.
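The interaction of --raw, --lower-length and --upper-length can be summarised with a small sketch. This is an illustration of the behaviour described above, not RATTLE's actual code: the function name `keep_read` is hypothetical, and treating reads of exactly the lower or upper length as kept is an assumption.

```python
def keep_read(length, lower=150, upper=100_000, raw=False):
    """Return True if a read of this length would pass the length filter.

    Illustrative only; defaults mirror --lower-length and --upper-length.
    """
    if raw:
        # --raw: use every read, ignoring both length bounds.
        return True
    # Drop reads shorter than `lower` or longer than `upper`.
    # Assumption: reads exactly at a bound are kept.
    return lower <= length <= upper


if __name__ == "__main__":
    for n in (100, 150, 2_000, 250_000):
        print(n, keep_read(n), keep_read(n, raw=True))
```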

Parameters related to the bitvector comparison in the Clustering step

-B, --bv-start-threshold (default: 0.4)

This threshold is the minimum bitvector score for two reads to be considered as potentially belonging to the same gene cluster. The bitvector score is defined as the fraction of unique k-mers that two reads have in common over the maximum number of unique k-mers in the two reads. If the score is above this threshold, the two reads are compared using the LIS similarity score (see --score-threshold below). RATTLE performs multiple iterations of this test with all reads, starting at the value of “-B” and decreasing in steps of “-f” until the threshold “-b” is reached. These multiple iterations make it possible to test all reads under various conditions for clustering. To see the result of changing this parameter, please see Table S11 from RATTLE’s paper.
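As an illustration of the bitvector score definition, here is a minimal sketch that uses Python sets in place of bitvectors. The function names and the k-mer size passed in are assumptions for the example. RATTLE's own implementation is in C++ and the k-mer size it uses for this comparison is not stated in this section.

```python
def unique_kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bitvector_score(read_a, read_b, k):
    """Shared unique k-mers divided by the larger unique k-mer count.

    Sketch of the definition in the text; sets stand in for bitvectors.
    """
    kmers_a = unique_kmers(read_a, k)
    kmers_b = unique_kmers(read_b, k)
    denominator = max(len(kmers_a), len(kmers_b))
    if denominator == 0:
        # Both reads are shorter than k, so there is nothing to compare.
        return 0.0
    return len(kmers_a & kmers_b) / denominator


if __name__ == "__main__":
    # Toy sequences with k=4: one shared k-mer out of four unique ones.
    print(bitvector_score("ACGTACGT", "ACGTTTTT", 4))
```

With the default -B of 0.4, a pair of reads scoring 0.25 like the toy pair in this sketch would not be compared further in the first iteration, but could be in a later iteration once the threshold has dropped.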

-b, --bv-end-threshold (default: 0.2)

The ending threshold for the bitvector score in the iterations. A low value for -b makes it possible to rescue reads that have not been clustered in previous iterations. If -b is close to -B (or the same), only one or a few iterations will be performed. This will make the clustering less sensitive, potentially resulting in many unclustered reads. To see the result of changing this parameter, please see Table S11 from RATTLE’s paper.

-f, --bv-falloff (default: 0.05)

This is the step size between the first (-B) and final (-b) bitvector score thresholds, and it determines the number of clustering iterations to perform. A small value will provide more resolution in the definition of clusters but will result in more iterations, potentially leading to longer computation times. To see the result of changing this parameter, please see Table S11 from RATTLE’s paper.
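To make the relationship between -B, -b and -f concrete, the sketch below lists the thresholds that would be visited for a given setting. It is an illustration under stated assumptions, not RATTLE's code: the function name is hypothetical, and it assumes the final threshold -b is itself included as the last iteration.

```python
def threshold_schedule(start=0.4, end=0.2, falloff=0.05):
    """List the bitvector thresholds from `start` down to `end` in steps of `falloff`.

    Defaults mirror -B, -b and -f. Assumption: `end` is included when reached.
    """
    if falloff <= 0:
        raise ValueError("falloff must be positive")
    thresholds = []
    step = 0
    while True:
        # Recompute from `start` and round to avoid accumulating float error.
        value = round(start - step * falloff, 10)
        if value < end:
            break
        thresholds.append(value)
        step += 1
    return thresholds


if __name__ == "__main__":
    print(threshold_schedule())                # default settings
    print(threshold_schedule(0.4, 0.4, 0.05))  # -b equal to -B
    print(len(threshold_schedule(falloff=0.01)))
```

Under these assumptions the defaults give five iterations, setting -b equal to -B gives a single iteration, and a smaller -f gives proportionally more iterations.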

-r, --min-reads-cluster (default: 0)

Only clusters with more than this number of reads will be reported and used in the next step. The default means that singletons (clusters composed of a single read) are also included.

-p, --repr-percentile (default: 0.15)

In the iterative clustering algorithm, reads are tested against a representative of each cluster rather than against all the reads in that cluster. The value of -p is the percentile position, in the ranking of a cluster's reads sorted by length (from longest to shortest), of the read used as representative. The smaller the value, the closer to the top of the ranking. The longest read in a cluster may seem to be a better representative. However, during RATTLE optimization, we observed that this is not always the case, and that using a read a few positions below the top (0.15 percentile, i.e. the 15th position in a cluster of 100 reads) results in better performance.
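The choice of representative can be sketched as follows. This is only an illustration of the description above: the function name is hypothetical, and the exact rounding of the percentile position (here, rounding to the nearest position with a minimum of 1) is an assumption chosen so that 0.15 maps to the 15th read in a cluster of 100.

```python
def pick_representative(reads, percentile=0.15):
    """Pick the read at the given percentile of the length ranking.

    Reads are ranked from longest to shortest. Illustrative sketch only;
    the rounding rule is an assumption, not taken from RATTLE's source.
    """
    if not reads:
        raise ValueError("cluster has no reads")
    ranked = sorted(reads, key=len, reverse=True)
    # 1-based position in the ranking, never below the first read.
    position = max(1, round(percentile * len(ranked)))
    return ranked[position - 1]


if __name__ == "__main__":
    # Toy cluster: 100 reads with lengths 1..100.
    cluster = ["A" * n for n in range(1, 101)]
    print(len(pick_representative(cluster)))
```

In the toy cluster the 15th longest read is selected rather than the longest one, which matches the example in the text.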

Parameters related to the LIS-based similarity in the Clustering step

-k, --kmer-size (default: 10, maximum: 16)

This is the k-mer size used to compare two reads with the Longest Increasing Subsequence (LIS) algorithm (see RATTLE’s paper for details). A low value will enable a more sensitive comparison but will result in longer computing times. A higher value will make the comparison faster, but may miss matches because of sequencing errors. For reads with a low error rate, this can be set to a value higher than the default. The maximum of 16 is used to ensure the efficiency of the algorithmic implementation.
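For readers unfamiliar with the LIS idea, the sketch below shows one generic way to count k-mer matches that appear in the same order in two reads. It is not RATTLE's implementation: the function names are hypothetical, using only the first occurrence of each k-mer in the second read is a simplifying assumption, and how RATTLE turns such a chain into a similarity score is described in the paper rather than here.

```python
from bisect import bisect_left


def lis_length(values):
    """Length of the longest strictly increasing subsequence (O(n log n))."""
    tails = []  # tails[i]: smallest tail of an increasing subsequence of length i+1
    for v in values:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)


def colinear_kmer_matches(read_a, read_b, k=10):
    """Count shared k-mers that occur in the same order in both reads.

    Assumption: each k-mer is mapped to its first position in read_b.
    """
    first_pos_b = {}
    for j in range(len(read_b) - k + 1):
        first_pos_b.setdefault(read_b[j:j + k], j)
    # Positions in read_b of the k-mers of read_a, in read_a order.
    positions = [
        first_pos_b[read_a[i:i + k]]
        for i in range(len(read_a) - k + 1)
        if read_a[i:i + k] in first_pos_b
    ]
    return lis_length(positions)


if __name__ == "__main__":
    print(colinear_kmer_matches("ACGTTGCA", "ACGTTGCA", k=3))
```

A smaller k yields more k-mers per read and therefore more positions to chain, which is consistent with the sensitivity and run-time trade-off described above.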

-s, --score-threshold (default: 0.2)

This parameter sets the minimum
