Kaiju
Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
Kaiju is a program for the taxonomic classification of high-throughput sequencing reads, e.g., Illumina or Roche/454, from whole-genome sequencing of metagenomic DNA. Reads are directly assigned to taxa using the NCBI taxonomy and a reference database of protein sequences from microbial and viral genomes.
The program is described in Menzel, P. et al. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257 (open access).
See the release notes for details on all releases.
Authors
Peter Menzel pmenzel@gmail.com
Anders Krogh krogh@binf.ku.dk
License
Copyright (c) 2015-2024 Peter Menzel and Anders Krogh
Kaiju is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
Kaiju is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the file LICENSE for more details.
You should have received a copy of the GNU General Public License along with the source code. If not, see http://www.gnu.org/licenses/.
Installation
Compiling Kaiju from source
Kaiju's source code can be downloaded directly from GitHub either as a compressed archive or using the git command line client:
git clone https://github.com/bioinformatics-centre/kaiju.git
This will create the directory kaiju in the current directory.
Kaiju is written in C/C++11 for Linux. It uses the zlib library for reading gzip-compressed files. If not already installed, it is necessary to install the zlib development library, e.g. on Ubuntu using:
sudo apt install libz-dev
For compiling Kaiju and its associated programs, type:
cd kaiju/src
make
After compilation, Kaiju's executable files are available in the kaiju/bin directory.
You can add this directory to your shell's $PATH variable or copy all files from kaiju/bin to a directory in your $PATH.
Installation via Bioconda
Kaiju is also available via the bioconda channel and can be installed with:
conda install -c bioconda kaiju
# or
mamba install -c bioconda kaiju
Creating the Kaiju index
Before classification of reads, Kaiju's database index needs to be built from the reference protein database. You can either create a local index based on the currently available reference databases, or download a pre-built index.
For creating a local index, the program kaiju-makedb in the bin/ directory
will download a source database and the taxonomy files from the NCBI FTP server,
convert them into a protein database and construct Kaiju's index (the
Burrows-Wheeler transform and the FM-index) in one go.
kaiju-makedb needs curl and wget for downloading the reference databases.
The downloaded files can be very large, depending on the selected reference database.
It is therefore recommended to run kaiju-makedb in a directory with at least 500 GB of free space.
Example usage:
mkdir kaijudb
cd kaijudb
kaiju-makedb -s <DB>
The table below lists the available source databases.
Use the database name shown in the first column as argument to option -s in kaiju-makedb.
The last column denotes the required memory for running Kaiju with the
respective index and for creating the index (in brackets).
| Index name | Description | Sequences<sup>*</sup> | RAM in GB (makedb)<sup>*</sup> |
| --- | --- | --- | --- |
| refseq | Completely assembled and annotated reference genomes of Archaea, Bacteria, and viruses from the NCBI RefSeq database. | 164 M | 111 (144) |
| refseq_nr | Sequences for Archaea, Bacteria, viruses and microbial eukaryotes from the NCBI RefSeq non-redundant protein collection. | 276 M | 153 (263) |
| refseq_ref | Protein sequences from representative assemblies of Archaea and Bacteria from NCBI RefSeq plus viruses from NCBI RefSeq. | 76.9 M | 54 (70) |
| progenomes | Representative set of genomes from the proGenomes v3 database and viruses from the NCBI RefSeq database. | 141 M | 102 (120) |
| viruses | Only viruses from the NCBI RefSeq database. | 0.68 M | 0.5 (0.6) |
| plasmids | Plasmid sequences from the NCBI RefSeq database. | 7 M | 4 (6) |
| fungi | Fungi sequences from the NCBI RefSeq database. | 6.5 M | 6 (9) |
| nr | Subset of the NCBI BLAST nr database containing all proteins belonging to Archaea, Bacteria, and viruses. | 353 M | 219 (491) |
| nr_euk | Like nr, but additionally includes proteins from fungi and microbial eukaryotes; see the taxon list in bin/kaiju-taxonlistEuk.tsv. | 397 M | 250 (432) |
| rvdb | Protein sequences from RVDB-prot. | 39 M | 86 (253) |
* as of late 2024.
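As a rough aid for choosing an index, the RAM column can be turned into a small lookup. The following is a minimal sketch, not part of Kaiju: the function name `indexes_that_fit` is hypothetical, and the numbers are copied from the table above (late 2024), so they will drift as the databases grow.

```python
# RAM requirements in GB, copied from the table above: (running Kaiju, kaiju-makedb).
INDEX_RAM_GB = {
    "refseq": (111, 144),
    "refseq_nr": (153, 263),
    "refseq_ref": (54, 70),
    "progenomes": (102, 120),
    "viruses": (0.5, 0.6),
    "plasmids": (4, 6),
    "fungi": (6, 9),
    "nr": (219, 491),
    "nr_euk": (250, 432),
    "rvdb": (86, 253),
}


def indexes_that_fit(ram_gb, build_locally=False):
    """Return the index names whose listed RAM requirement is at most ram_gb.

    With build_locally=True the (larger) makedb figure is used instead of
    the figure for running Kaiju with a finished index.
    """
    column = 1 if build_locally else 0
    return sorted(name for name, ram in INDEX_RAM_GB.items() if ram[column] <= ram_gb)


if __name__ == "__main__":
    print(indexes_that_fit(64))
    print(indexes_that_fit(64, build_locally=True))
```

On a machine with 64 GB of RAM, for example, `refseq_ref` appears in the first list but not in the second, which suggests downloading that index rather than building it locally.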
Pre-built indexes for each reference database can be downloaded.
By default, kaiju-makedb uses 5 parallel threads for constructing the index, which can
be changed by using the option -t. Note that a higher number of threads
increases the memory usage during index construction, while reducing the number
of threads decreases memory usage.
After kaiju-makedb is finished, only the files kaiju_db_*.fmi, nodes.dmp,
and names.dmp are needed to run Kaiju.
Custom database
It is also possible to make a custom database from a collection of protein sequences. The input must be a FASTA file in which each header is the numeric NCBI taxon identifier of the protein sequence. The identifier can optionally be prefixed by another identifier (e.g. a counter) followed by an underscore, for example:
>1_1358
MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQN
>2_44689
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ
>3_352472
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYEDFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKRIEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM
>4_91061
MTNPFENDNYTYKVLKNEEGQYSLWPAFLDVPIGWNVVHKEASRNDCLQYVENNWEDLNPKSNQVGKKILVGKR
...
The taxon identifiers must be contained in the NCBI taxonomy files nodes.dmp and names.dmp.
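Both requirements on the headers can be checked before building the index. The following is a minimal sketch under stated assumptions, not a Kaiju tool: the function names are hypothetical, the taxon ID is taken to be the number after the last underscore, and nodes.dmp is assumed to use the standard NCBI dump layout in which the first `|`-separated field is the taxon ID.

```python
import re

# A header is either ">1358" or ">prefix_1358"; the taxon ID is the trailing number.
# Assumption: the ID follows the last underscore in the header.
HEADER_RE = re.compile(r"^>(?:.*_)?(\d+)$")


def load_taxon_ids(nodes_lines):
    """Collect taxon IDs from the lines of nodes.dmp (first '|'-separated field)."""
    return {int(line.split("|", 1)[0].strip()) for line in nodes_lines if line.strip()}


def check_headers(fasta_lines, known_ids):
    """Return (line_number, header, reason) for every problematic FASTA header."""
    problems = []
    for number, line in enumerate(fasta_lines, start=1):
        line = line.rstrip("\n")
        if not line.startswith(">"):
            continue  # sequence line, nothing to check here
        match = HEADER_RE.match(line)
        if match is None:
            problems.append((number, line, "no numeric taxon ID"))
        elif int(match.group(1)) not in known_ids:
            problems.append((number, line, "taxon ID not in nodes.dmp"))
    return problems


if __name__ == "__main__":
    with open("nodes.dmp") as nodes, open("proteins.faa") as fasta:
        for problem in check_headers(fasta, load_taxon_ids(nodes)):
            print(*problem, sep="\t")
```

An empty result means every header carries a taxon ID that nodes.dmp knows about.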
Then, Kaiju's index is created using the programs kaiju-mkbwt and kaiju-mkfmi. For example, if the database FASTA file is called proteins.faa, then run:
kaiju-mkbwt -n 5 -a ACDEFGHIKLMNPQRSTVWY -o proteins proteins.faa
kaiju-mkfmi proteins
which creates the file proteins.fmi that is used by Kaiju. Note that the protein sequences may only contain the uppercase letters of the 20 standard amino acids; all other characters need to be removed.
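Removing the disallowed characters can be done with a short filter before running kaiju-mkbwt. This is a minimal sketch, not part of Kaiju: `clean_fasta` and the output file name are hypothetical. It follows the note above literally, so lowercase letters are dropped as well rather than converted.

```python
# The same 20 letters that are passed to kaiju-mkbwt with -a.
ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def clean_fasta(lines):
    """Yield FASTA lines with every disallowed character dropped from sequence lines.

    Header lines are passed through unchanged; sequence lines that end up
    empty after filtering are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line
        else:
            cleaned = "".join(ch for ch in line if ch in ALPHABET)
            if cleaned:
                yield cleaned


if __name__ == "__main__":
    # "proteins.clean.faa" is an illustrative output name.
    with open("proteins.faa") as src, open("proteins.clean.faa", "w") as dst:
        for out_line in clean_fasta(src):
            dst.write(out_line + "\n")
```

A record whose sequence consists only of disallowed characters is left as a bare header by this sketch, so such records may need to be removed separately.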
Running Kaiju
Kaiju requires at least three arguments:
kaiju -t nodes.dmp -f kaiju_db_*.fmi -i inputfile.fastq
Replace kaiju_db_*.fmi by the actual .fmi file depending on the selected database.
For example, when running kaiju-makedb -s refseq, the corresponding index file is refseq/kaiju_db_refseq.fmi.
For paired-end reads use -i firstfile.fastq and -j secondfile.fastq.
The reads must be in the same order in both files. Kaiju will strip suffixes
from the read names by deleting all characters after a / or space. The read
names are then compared between the first and second file and an error is
issued if they are not identical.
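The name comparison described above can be reproduced to check a pair of files before a long run. This is a minimal sketch of the described rule, not Kaiju's own code: the function names are hypothetical, and the delimiter itself is assumed to be dropped along with the suffix, which does not affect the comparison.

```python
def strip_suffix(name):
    """Cut the read name at the first '/' or space, dropping the rest."""
    for i, ch in enumerate(name):
        if ch in "/ ":
            return name[:i]
    return name


def first_mismatch(names_1, names_2):
    """Return the index of the first pair whose stripped names differ, else None.

    Only pairs present in both sequences are compared; a difference in
    length has to be checked separately.
    """
    for i, (a, b) in enumerate(zip(names_1, names_2)):
        if strip_suffix(a) != strip_suffix(b):
            return i
    return None


if __name__ == "__main__":
    print(strip_suffix("read1/1"))
    print(first_mismatch(["r1/1", "r2/1"], ["r1/2", "r3/2"]))
```

Here `names_1` and `names_2` stand for the read names taken from the first and second file in file order.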
Kaiju can read input files in FASTQ and FASTA format, which may also be gzip-compressed.
By default, Kaiju will print the output to the terminal (STDOUT).
The output can also be written to a file using the -o option:
kaiju -t nodes.dmp -f kaiju_db.fmi -i inputfile.fastq -o kaiju.out
Kaiju can use multiple parallel threads, which can be specified with the -z option, e.g. for using 25 parallel threads:
kaiju -z 25 -t nodes.dmp -f kaiju_db.fmi -i inputfile.fastq -o kaiju.out
kaiju-multi
While kaiju can only process one input, kaiju-multi can take a comma-separated list of input files (and optionally output files) for processing multiple samples at once:
kaiju-multi -z 25 -t nodes.dmp -f kaiju_db.fmi -i sample1_R1.fastq,sample2_R1.fastq,sample3_R1.fastq -j sample1_R2.fastq,sample2_R2.fastq,sample3_R2.fastq -o sample1.out,sample2.out,sample3.out
These lists must have the same length. It's also possible to merge all outputs into one file using output redirection:
kaiju-multi -z 25 -t nodes.dmp -f kaiju_db.fmi -i sample1_R1.fastq,sample2_R1.fastq,sample3_R1.fastq -j sample1_R2.fastq,sample2_R2.fastq,sample3_R2.fastq > all_samples.out
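When the sample lists are generated by a script, the equal-length requirement can be checked while assembling the command. This is a minimal sketch: the helper `kaiju_multi_command` is hypothetical, and it uses only the options shown above (-z, -t, -f, -i, -j, -o).

```python
def kaiju_multi_command(nodes, fmi, first_reads, second_reads=None, outputs=None, threads=1):
    """Build the kaiju-multi argument list after checking that all lists match in length."""
    for label, files in (("-j", second_reads), ("-o", outputs)):
        if files is not None and len(files) != len(first_reads):
            raise ValueError(
                f"{label} has {len(files)} entries, but -i has {len(first_reads)}"
            )
    cmd = ["kaiju-multi", "-z", str(threads), "-t", nodes, "-f", fmi,
           "-i", ",".join(first_reads)]
    if second_reads is not None:
        cmd += ["-j", ",".join(second_reads)]
    if outputs is not None:
        cmd += ["-o", ",".join(outputs)]
    return cmd


if __name__ == "__main__":
    samples = ["sample1", "sample2", "sample3"]
    cmd = kaiju_multi_command(
        "nodes.dmp",
        "kaiju_db.fmi",
        [f"{s}_R1.fastq" for s in samples],
        [f"{s}_R2.fastq" for s in samples],
        [f"{s}.out" for s in samples],
        threads=25,
    )
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Leaving out `outputs` corresponds to the redirection variant, where all results go to STDOUT.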
Run modes
The default run mode is Greedy with three allowed mismatches.
The number of allowed mismatches can be changed using option -e.
In Greedy mode, matches are filtered by a minimum length and score, but also by their E-value (similar to blastp), which can be adjusted with the option -E. The default value is 0.01.
The cutoffs for minimum required match length and match score can be changed using the options -m (default: 11) and -s (default: 65).
The run
