Mikado - pick your transcript: a pipeline to determine and select the best RNA-Seq prediction

Mikado is a lightweight Python3 pipeline to identify the most useful or “best” set of transcripts from multiple transcript assemblies. Our approach leverages transcript assemblies generated by multiple methods to define expressed loci, assign a representative transcript to each, and return a set of gene models that selects against transcripts that are chimeric, fragmented, or have a short or disrupted CDS. Loci are first defined based on overlap criteria, and each transcript therein is scored based on up to 50 available metrics relating to ORF and cDNA size, relative position of the ORF within the transcript, UTR length and presence of multiple ORFs. Mikado can also use BLAST data to score transcripts based on protein similarity and to identify and split chimeric transcripts. Optionally, junction confidence data as provided by [Portcullis][Portcullis] can be used to improve the assessment. The best-scoring transcripts are selected as the primary transcripts of their respective gene loci; additionally, Mikado can bring back other valid splice variants that are compatible with the primary isoform.

Mikado uses GTF or GFF files as mandatory input. Non-mandatory but highly recommended input data can be generated by obtaining a set of reliable splicing junctions with [Portcullis][Portcullis], by locating coding ORFs on the transcripts using either [Transdecoder][Transdecoder] or [Prodigal][Prodigal], and by obtaining homology information through either [BLASTX][Blast+] or [DIAMOND][DIAMOND].

Our approach can also incorporate sequences generated by de novo Illumina assemblers or reads generated by long-read technologies such as PacBio.

Extended documentation is hosted on ReadTheDocs: http://mikado.readthedocs.org/

Installation


Docker Installation

Mikado can be run with Docker. If you don't have Docker, install it first. The command below then pulls the Docker image with Mikado installed (on first use) and prints the Mikado help:

VERSION=2.3.5rc3
docker run gemygk/mikado:v${VERSION} mikado -h
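To work on your own files, the container needs access to a directory on the host. A minimal sketch of a reusable wrapper follows, assuming you want the current directory visible inside the container. The function name `mikado_docker` and the `/data` mount point are illustrative choices and not part of Mikado.

```shell
VERSION=2.3.5rc3
IMAGE="gemygk/mikado:v${VERSION}"

# Hypothetical wrapper (not shipped with Mikado): runs mikado inside the
# container with the current directory mounted at /data (an arbitrary path).
mikado_docker() {
    docker run --rm -v "$PWD":/data -w /data "$IMAGE" mikado "$@"
}

# Usage, once Docker is installed:
#   docker pull "$IMAGE"    # fetch the image explicitly
#   mikado_docker -h        # prints the Mikado help
```

The `--rm` flag removes the container after each run, so repeated invocations do not accumulate stopped containers.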

Singularity Installation

Mikado can be run with Singularity. If you don't have Singularity, install it first. The command below then pulls the image with Mikado installed and prints the Mikado help:

VERSION=2.3.5rc3
singularity exec docker://gemygk/mikado:v${VERSION} mikado -h

Alternatively, build and run a local Singularity image:

# Create a Singularity definition file

$ cat mikado-2.3.5rc3.def
bootstrap: docker
from: gemygk/mikado:v2.3.5rc3

# Build the Singularity image
$ sudo singularity build mikado-2.3.5rc3.sif mikado-2.3.5rc3.def

# Execute Mikado
$ singularity exec mikado-2.3.5rc3.sif mikado -h
usage: Mikado [-h] [--version] {configure,prepare,serialise,pick,compare,util} ...

Mikado is a program to analyse RNA-Seq data and determine the best transcript for each locus in accordance to user-specified criteria.

optional arguments:
  -h, --help            show this help message and exit
  --version             Print Mikado current version and exit.

Components:
  {configure,prepare,serialise,pick,compare,util}
                        These are the various components of Mikado:
    configure           This utility guides the user through the process of creating a configuration file for Mikado.
    prepare             Mikado prepare analyses an input GTF file and prepares it for the picking analysis by sorting its transcripts and performing some simple consistency checks.
    serialise           Mikado serialise creates the database used by the pick program. It handles Junction and ORF BED12 files as well as BLAST XML results.
    pick                Mikado pick analyses a sorted GTF/GFF files in order to identify its loci and choose the best transcripts according to user-specified criteria. It is dependent on files produced by the "prepare" and "serialise"
                        components.
    compare             Mikado compare produces a detailed comparison of reference and prediction files. It has been directly inspired by Cufflinks's cuffcompare and ParsEval.
    util                Miscellaneous utilities
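The definition file in the transcript above is displayed with `cat` but its creation is not shown. One way to write it is with a here-document; this is a sketch that reuses the same two lines, with the `VERSION` and `DEF_FILE` variables introduced here only for convenience.

```shell
VERSION=2.3.5rc3
DEF_FILE="mikado-${VERSION}.def"

# Write the two-line definition file; ${VERSION} is expanded by the shell.
cat > "$DEF_FILE" <<EOF
bootstrap: docker
from: gemygk/mikado:v${VERSION}
EOF

# Show the result for a quick visual check.
cat "$DEF_FILE"

# Then build and run (requires Singularity and root privileges for the build):
#   sudo singularity build "mikado-${VERSION}.sif" "$DEF_FILE"
#   singularity exec "mikado-${VERSION}.sif" mikado -h
```

Keeping the version in a single variable makes it harder for the image tag and the file names to drift apart.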

Conda/Mamba/Manual Installation

Mikado can be installed with conda or mamba. If you don't have either, install mamba first (via Miniforge, as shown below); you can then create a new environment with Mikado installed.

Install mamba with PyPy 3.9 in the base environment (https://github.com/conda-forge/miniforge?tab=readme-ov-file#miniforge-pypy3)

Replace /path/to with your installation directory when following the steps below:

/path/to/src
[src]$ wget -c https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge-pypy3-Linux-x86_64.sh
[src]$ bash Miniforge-pypy3-Linux-x86_64.sh

In this example, the base environment is installed to /path/to/x86_64/

If you have chosen not to have conda modify your shell scripts at all, to activate conda's base environment in your current shell session, please do:

/path/to/src
[src]$ eval "$(/path/to/x86_64/bin/conda shell.bash hook)"

Install Git

/path/to/src
(base) [src]$ mamba install -y git

Clone mikado

/path/to/src
(base) [src]$ git clone git@github.com:EI-CoreBioinformatics/mikado.git
(base) [src]$ cd mikado

Install Mikado dependencies

/path/to/src/mikado
(base) [mikado]$ mamba env create -f environment.yml --prefix /path/to/x86_64/envs/mikado_env

Activate mikado_env

/path/to/src/mikado
(base) [mikado]$ conda activate mikado_env
(mikado_env) [mikado]$
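Before installing anything with pip, it can help to confirm that the intended environment is really the active one. The sketch below relies on the `CONDA_PREFIX` variable that conda sets on activation; the helper name `in_conda_env` is hypothetical.

```shell
# Hypothetical helper: succeeds only if the active conda environment's
# prefix ends in the expected environment name.
in_conda_env() {
    expected="$1"
    case "${CONDA_PREFIX:-}" in
        */"$expected") return 0 ;;
        *) return 1 ;;
    esac
}

if in_conda_env mikado_env; then
    echo "mikado_env is active"
else
    echo "mikado_env is not active; run: conda activate mikado_env" >&2
fi
```

Running the later `pip3 install` steps in the wrong environment would install the dependencies into the base environment instead, which this check is meant to catch early.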

Check that all dependencies are installed. A full list of library dependencies can be found in the file requirements.txt

/path/to/src/mikado
(mikado_env) [mikado]$ pip3 install wheel==0.37.1 numpy==1.23.3 cython==0.29.32
(mikado_env) [mikado]$ pip3 install -r requirements.txt

You should see the status 'Requirement already satisfied' when running the commands above.

Building the wheel with bdist_wheel requires gcc (tested with gcc v5.2.0 and v9.4.0)

/path/to/src/mikado
(mikado_env) [mikado]$ python3 setup.py bdist_wheel
(mikado_env) [mikado]$ pip3 install dist/*.whl

Now that the installation is complete, display the Mikado help to confirm that it works

/path/to/src/mikado
(mikado_env) [mikado]$ mikado -h
usage: Mikado [-h] [--version] {configure,prepare,serialise,pick,compare,util} ...

Mikado is a program to analyse RNA-Seq data and determine the best transcript for each locus in accordance to user-specified criteria.

optional arguments:
  -h, --help            show this help message and exit
  --version             Print Mikado current version and exit.

Components:
  {configure,prepare,serialise,pick,compare,util}
                        These are the various components of Mikado:
    configure           This utility guides the user through the process of creating a configuration file for Mikado.
    prepare             Mikado prepare analyses an input GTF file and prepares it for the picking analysis by sorting its transcripts and performing some simple consistency checks.
    serialise           Mikado serialise creates the database used by the pick program. It handles Junction and ORF BED12 files as well as BLAST XML results.
    pick                Mikado pick analyses a sorted GTF/GFF files in order to identify its loci and choose the best transcripts according to user-specified criteria. It is dependent on files produced by the "prepare" and "serialise"
                        components.
    compare             Mikado compare produces a detailed comparison of reference and prediction files. It has been directly inspired by Cufflinks's cuffcompare and ParsEval.
    util                Miscellaneous utilities
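For scripted setups, the same check can be made non-interactively. This is a small sketch that uses only the `--version` flag listed in the help output above; the `MIKADO_STATUS` variable is introduced here for illustration.

```shell
# Report whether the mikado executable is reachable, and print its version if so.
if command -v mikado >/dev/null 2>&1; then
    MIKADO_STATUS="found"
    mikado --version
else
    MIKADO_STATUS="missing"
    echo "mikado is not on PATH; activate mikado_env first" >&2
fi
```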

Additional dependencies

Mikado by itself requires only a database backend such as SQLite (MySQL and PostgreSQL are also supported). The Daijin pipeline, however, requires additional programs to run.

For driving Mikado through Daijin, the following programs are required:

[DIAMOND][DIAMOND] or [Blast+][Blast+] to provide protein homology. DIAMOND is preferred for its speed.
[Prodigal][Prodigal] or [Transdecoder][Transdecoder] to calculate ORFs. The versions of Transdecoder that we tested scale poorly in runtime and disk usage as the input dataset grows. Prodigal is much faster and lighter; note, however, that the data in our paper were generated with Transdecoder, not Prodigal. Prodigal is currently the default.
Mikado also makes use of a dataset of RNA-Seq high-quality junctions. We are using [Portcullis][Portcullis] to calculate this data alongside the alignments and assemblies.
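A quick way to see which of these programs are already available is to look for their executables on the PATH. The sketch below assumes the tools are installed under their usual executable names (`diamond`, `blastx`, `prodigal`, `TransDecoder.LongOrfs`, `portcullis`); these names are an assumption about your setup, and the helper `missing_tools` is hypothetical.

```shell
# Hypothetical helper: prints, one per line, each named executable
# that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Only one homology tool and one ORF caller are needed, so a name
# printed here is a problem only if its alternative is missing too.
missing_tools diamond blastx prodigal TransDecoder.LongOrfs portcullis
```

An empty output means every listed executable was found.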

If you plan to generate the alignment and assembly part as well through Daijin, the pipeline requires the following:

SAMTools
If you have short-read RNA-Seq data:
- At least one short-read RNA-Seq aligner, choice between [GSNAP], [GMAP][GMAP], [STAR][STAR], [TopHat2][TopHat2], [HI
