Improving gene isoform quantification with miniQuant

miniQuant features:

- Optimal use of long and/or short RNA-seq reads: transcript abundance estimation for two data scenarios, long reads alone and hybrid (long reads + short reads), with hybrid mode integrating the strengths of both technologies.
- Fast RNA-seq quantification: less than 15 minutes to analyze 40 million unaligned paired-end short reads plus 5 million long reads on a standard laptop computer.
- Novel K-value metric: the K-value captures a key feature of the sequence-sharing pattern that causes particularly high abundance estimation error. It identifies a problematic set of gene isoforms whose quantification is likely erroneous and to which researchers should pay extra attention.

Our newest version is recommended for its faster speed and better performance. However, to reproduce the results in our Nature Biotechnology paper, download the older version, miniQuant v1.0. Feel free to run miniQuant online without installation!

Dependency
Installation
Usage
- Gene isoform quantification by miniQuant
  - 1. If quantify using long reads data alone
  - 2. If quantify using short and long reads data in hybrid mode
- Calculate K-value by miniQuant

Dependency

Linux operating system (tested on Red Hat 8.8)

Installation

1. Download the latest binary executable (`wget https://github.com/Augroup/miniQuant/releases/download/latest/miniQuant_linux_latest.tar.gz`) and decompress it with `tar -zxvf miniQuant_linux_latest.tar.gz`.
2. `cd miniQuant_linux && chmod +x miniQuant`
3. (Optional, only if you want to call miniQuant directly from the command line) `cp ./miniQuant /usr/local/bin; cp ./miniQuant ~/.local/bin`
4. Run `./miniQuant`

If your operating system doesn't have GLIBC 2.28 or later

1. Download the latest binary executable (`wget https://github.com/Augroup/miniQuant/releases/download/latest/miniQuant_linux_latest.tar.gz`) and decompress it with `tar -zxvf miniQuant_linux_latest.tar.gz`.
2. `cd miniQuant_linux && chmod +x miniQuant`
3. Run miniQuant in a Docker or Singularity container with one of the following commands:
   - `docker run -i -t tidesun/miniquant:latest ./miniQuant`
   - `singularity run docker://tidesun/miniquant:latest ./miniQuant`
   - `singularity run https://miniquant.s3.us-east-2.amazonaws.com/miniQuant_latest.sif ./miniQuant`
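To decide between the native binary and the container route, it can help to check the local GLIBC version first. The following is a minimal sketch, not part of miniQuant: the helper names `parse_version` and `needs_container` are hypothetical, and treating an undetected or non-glibc C library as "use the container" is a conservative assumption.

```python
import platform

MIN_GLIBC = (2, 28)  # minimum version named in the heading above


def parse_version(text):
    """Turn a dotted version string such as '2.28' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def needs_container(libc_name, libc_version, minimum=MIN_GLIBC):
    """Return True when the native binary is unlikely to run on this system."""
    if libc_name != "glibc" or not libc_version:
        # Assumption: an undetected or non-glibc libc means "use the container".
        return True
    return parse_version(libc_version) < minimum


if __name__ == "__main__":
    # platform.libc_ver() reports the libc the Python interpreter is linked to.
    name, version = platform.libc_ver()
    route = "Docker/Singularity" if needs_container(name, version) else "native binary"
    print(f"detected libc: {name or 'unknown'} {version or ''} -> use {route}")
```

The check reads the C library that the Python interpreter itself is linked against, which on a typical Linux installation is the system GLIBC.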

Build by source

To compile from source, you need a C compiler and GNU make installed. Then run `make` in the `src` directory.

Usage

miniQuant provides two options for gene isoform quantification:

1. Quantify using long-read data alone.
2. Quantify using short-read and long-read data in hybrid mode.

A toy dataset is provided in `example/`. Follow the example commands below.

1. If quantify using long reads data alone

<details> <summary>Click me</summary>

miniQuant requires reference transcript sequences in FASTA format (`-r`) and long-read RNA-seq sequences in plain or gzipped FASTA/FASTQ format (`-l`) as input.

Example: quantify using long reads data (`example/LR.fasta.gz`) with reference transcripts sequences (e.g. `example/reference.fa`), results in `miniQuant_LR_alone_res` folder

miniQuant quant -r example/reference.fa -l example/LR.fasta.gz -t 1 -o miniQuant_LR_alone_res

Available parameters

Required arguments:
  -r, --reference arg           Reference sequence file in plain or gzipped
                                FASTA format
  -l, --long_reads arg          Input long reads file in plain or gzipped
                                FASTA/FASTQ format.(default: "")

Optional arguments:
  -o arg, --output arg          The path of output folder. (default: ./miniQuant_res/)
  --long_reads_library_prep arg The library preparation for long reads.
                                Choices:[cDNA-ONT,dRNA-ONT,cDNA-PacBio]
                                (default: cDNA-ONT)
  -t arg, --threads arg         Number of threads. Default is 1.
  --mem arg                     Max RAM usage in GB allowed when aligning 
                                the reads (default: 20.0)

Results explanation

The result is a TSV file (`miniQuant_LR_alone_res/abundance.tsv`) reporting the abundance of each transcript, one transcript per line, with the following columns:

- Transcript_id: transcript ID as given in the reference sequences (`--reference`).
- TPM: relative transcript abundance in TPM (transcripts per million).
- Expected_num_long_reads: expected count of long reads for the transcript, relative to the total number of input long reads.
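The columns above can be read with a few lines of Python. This is a minimal sketch, assuming the file starts with a header row named as in the example output (`Transcript_id`, `TPM`, `Expected_num_long_reads`). The helpers `read_abundance` and `expressed` and the 1.0 TPM cutoff are hypothetical choices for illustration, not part of miniQuant.

```python
import csv
import io


def read_abundance(handle):
    """Parse abundance.tsv rows into (transcript_id, tpm, expected_long_reads)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["Transcript_id"], float(row["TPM"]), float(row["Expected_num_long_reads"]))
        for row in reader
    ]


def expressed(records, min_tpm=1.0):
    """Keep transcripts at or above min_tpm, highest TPM first."""
    kept = [rec for rec in records if rec[1] >= min_tpm]
    return sorted(kept, key=lambda rec: rec[1], reverse=True)


# Three rows copied from the example output in this section; a real run would
# pass open("miniQuant_LR_alone_res/abundance.tsv") instead of this string.
SAMPLE = (
    "Transcript_id\tTPM\tExpected_num_long_reads\n"
    "ENST00000379080.5\t0\t0\n"
    "ENST00000379081.5\t30326.9\t9.85623\n"
    "ENST00000294244.9\t807593\t262.468\n"
)

records = read_abundance(io.StringIO(SAMPLE))
for transcript_id, tpm, reads in expressed(records):
    print(f"{transcript_id}\t{tpm}\t{reads}")
```

Sorting by TPM puts the dominant isoforms first, which is usually the quickest way to sanity-check a run on the toy dataset.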

<details> <summary>Click me for example</summary>

| Transcript_id | TPM | Expected_num_long_reads |
| --- | --- | --- |
| ENST00000379080.5 | 0 | 0 |
| ENST00000379081.5 | 30326.9 | 9.85623 |
| ENST00000379084.5 | 0 | 0 |
| ENST00000379087.5 | 0.000665636 | 0.000000216332 |
| ENST00000379089.5 | 0 | 0 |
| ENST00000651358.1 | 2.68181 | 0.00087159 |
| ENST00000445726.5 | 2.76325 | 0.000898056 |
| ENST00000297620.8 | 31447 | 10.2203 |
| ENST00000422409.5 | 0.0521039 | 0.0000169338 |
| ENST00000379078.1 | 12294.7 | 3.99577 |
| ENST00000294244.9 | 807593 | 262.468 |
| ENST00000540893.1 | 56604.9 | 18.3966 |
| ENST00000535820.1 | 61728.4 | 20.0617 |

</details> </details>

2. If quantify using short and long reads data in hybrid mode

<details> <summary>Click me</summary>

Hybrid mode integrates short-read and long-read RNA-seq data from the same organism for better quantification performance.
In hybrid mode, miniQuant requires reference transcript sequences in FASTA format (`-r`), long-read RNA-seq sequences in plain or gzipped FASTA/FASTQ format (`-l`), and paired-end short-read RNA-seq sequences in plain or gzipped FASTA/FASTQ format (`-1` and `-2`) as input.

Example: quantify using short reads (e.g. `example/SR_R1.fasta.gz` and `example/SR_R2.fasta.gz`) and long reads (e.g. `example/LR.fasta.gz`) with reference transcripts sequences (e.g. `example/reference.fa`), results in `miniQuant_hybrid_res` folder

miniQuant quant -r example/reference.fa -l example/LR.fasta.gz -1 example/SR_R1.fasta.gz -2 example/SR_R2.fasta.gz -t 1 -o miniQuant_hybrid_res

Available parameters

Required arguments:
  -r, --reference arg           Reference sequence file in plain or gzipped
                                FASTA format
  -l, --long_reads arg          Input long reads file in plain or gzipped
                                FASTA/FASTQ format.(default: "")
  -1, --short_reads_pair_1 arg  Input short reads pair 1 in plain or
                                gzipped FASTA/FASTQ format. Leave blank if using
                                only long reads. (default: "")
  -2, --short_reads_pair_2 arg  Input short reads pair 2 in plain or
                                gzipped FASTA/FASTQ format. Leave blank if using
                                only long reads. (default: "")

Optional arguments:
  -o arg, --output arg          The path of output folder. (default: ./miniQuant_res/)
  --long_reads_library_prep arg The library preparation for long reads.
                                Choices:[cDNA-ONT,dRNA-ONT,cDNA-PacBio]
                                (default: cDNA-ONT)
  --short_reads_strandness arg  The strandness of short reads.
                                Choices:[unstranded,fr-stranded,rf-stranded]
                                (default: unstranded)
                                *fr-stranded: Strand specific reads, first
                                read forward
                                *rf-stranded: Strand specific reads, first
                                read reverse
  -t arg, --threads arg         Number of threads. Default is 1.
  --mem arg                     Max RAM usage in GB allowed when aligning 
                                the reads (default: 20.0)
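When miniQuant is driven from a pipeline script, the command line can be assembled programmatically. The sketch below uses only the flags listed above (`-r`, `-l`, `-1`, `-2`, `-t`, `-o`). The function `build_quant_command` is a hypothetical helper, and it assumes `miniQuant` is on the `PATH`.

```python
import shlex


def build_quant_command(reference, long_reads, output, short_reads=None, threads=1):
    """Assemble a `miniQuant quant` argument list from the flags listed above."""
    cmd = ["miniQuant", "quant", "-r", reference, "-l", long_reads]
    if short_reads is not None:
        read_1, read_2 = short_reads  # hybrid mode takes both mates of each pair
        cmd += ["-1", read_1, "-2", read_2]
    cmd += ["-t", str(threads), "-o", output]
    return cmd


hybrid = build_quant_command(
    "example/reference.fa",
    "example/LR.fasta.gz",
    "miniQuant_hybrid_res",
    short_reads=("example/SR_R1.fasta.gz", "example/SR_R2.fasta.gz"),
)
print(shlex.join(hybrid))
# To actually run it: subprocess.run(hybrid, check=True)
```

Omitting `short_reads` produces the long-read-alone form of the command, so one helper covers both modes.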

Results explanation

The result is a TSV file (`miniQuant_hybrid_res/abundance.tsv`) reporting the abundance of each transcript, one transcript per line, with the following columns:

- Transcript_id: transcript ID as given in the reference sequences (`--reference`).
- TPM: relative transcript abundance in TPM (transcripts per million), calculated by integrating both short and long reads.
- Expected_num_long_reads: expected count of long reads for the transcript, calculated by integrating both short and long reads, relative to the total number of input long reads.
- Expected_num_short_read_pairs: expected count of short read pairs for the transcript, relative to the total number of input short read pairs.
- Effective_length: effective length of the transcript.
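One way to see what the short reads contribute is to compare the TPM column of a hybrid run with that of a long-read-alone run. The following is a minimal sketch under the assumption that both files carry a header row with `Transcript_id` and `TPM` columns; `tpm_by_transcript` and `tpm_shift` are hypothetical helper names, and the sample rows are copied from the two example outputs in this README.

```python
import csv
import io


def tpm_by_transcript(handle):
    """Map each Transcript_id to its TPM value from an abundance.tsv handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Transcript_id"]: float(row["TPM"]) for row in reader}


def tpm_shift(long_only, hybrid):
    """Return {transcript: hybrid TPM - long-read-alone TPM} for shared IDs."""
    shared = long_only.keys() & hybrid.keys()
    return {tid: hybrid[tid] - long_only[tid] for tid in shared}


# In practice, pass open(".../abundance.tsv") handles instead of these strings.
LONG_ONLY = (
    "Transcript_id\tTPM\tExpected_num_long_reads\n"
    "ENST00000379081.5\t30326.9\t9.85623\n"
    "ENST00000379084.5\t0\t0\n"
)
HYBRID = (
    "Transcript_id\tTPM\tExpected_num_long_reads\t"
    "Expected_num_short_read_pairs\tEffective_length\n"
    "ENST00000379081.5\t10983.9\t3.56975\t89.2826\t3309\n"
    "ENST00000379084.5\t0\t0\t0\t659\n"
)

shift = tpm_shift(
    tpm_by_transcript(io.StringIO(LONG_ONLY)),
    tpm_by_transcript(io.StringIO(HYBRID)),
)
for transcript_id in sorted(shift):
    print(f"{transcript_id}\t{shift[transcript_id]:+.1f}")
```

Large shifts flag isoforms whose estimate depends strongly on which data types are supplied.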

<details> <summary>Click me for example</summary>

| Transcript_id | TPM | Expected_num_long_reads | Expected_num_short_read_pairs | Effective_length |
| --- | --- | --- | --- | --- |
| ENST00000379080.5 | 0.000143216 | 0.0000000465451 | 0.00000118102 | 3357 |
| ENST00000379081.5 | 10983.9 | 3.56975 | 89.2826 | 3309 |
| ENST00000379084.5 | 0 | 0 | 0 | 659 |
| ENST00000379087.5 | 11371 | 3.69557 | 93.2673 | 3339 |
| ENST00000379089.5 | 283.145 | 0.0920222 | 2.35789 | 3390 |
| ENST00000651358.1 | 9.77862 | 0.00317805 | 0.0

</details> </details>
