Taxor: Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters

Citation

Ulrich, J. U., & Renard, B. Y. (2024). Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters. Genome Research, gr-278623. doi: 10.1101/gr.278623.123

Description
Installation
Databases
Commands
Usage

<a name="description"></a>Description

Taxor is a taxonomic classification and profiling tool that efficiently classifies DNA sequences against large sets of genomic reference sequences. Taxor stores k-mers in an optimized hierarchical interleaved XOR filter (HIXF) index and combines k-mer similarity and genome coverage information for precise taxonomic classification and profiling. It features:

Low false positive rates for k-mer matching
NCBI taxonomy integration
Open canonical syncmers as k-mer selection scheme for improved downsampling
classification with binning and taxonomic profiling
read reassignment EM algorithm for multi-matching reads
advanced filtration of search results
taxonomic and sequence abundance reports with genome size correction

Benchmarking results based on simulated and real long-read data sets demonstrate that Taxor enables more precise taxonomic classification and profiling of microbial populations while having a smaller memory footprint than other tools.

<a name="installation"></a>Installation

The easiest way is to install Taxor via Conda. However, you can also build Taxor on your own using the following commands. Just make sure that you have installed CMake (>=3.16) and GCC (>= 10).

git clone https://github.com/JensUweUlrich/Taxor.git
cd Taxor
mkdir build
cd build
cmake  ../src
cmake --build . --config Release

<a name="databases"></a>Pre-built databases

Users can easily build custom databases as described below or use the following pre-built database index files

|Kingdom |Source | Parameters | Size | Link | MD5 Hash | |:-------------------------------------|:----------------------|:------------|:--------|:-------------|:---------| | Viruses | Genbank Release 258 | k=22, s=12 | 373 MB | download | 0e8edd19a6314450f88f556dcf6b7c95 | | Archaea, Bacteria, Fungi, Viruses | RefSeq Release 216 | k=22, s=12 | 9.9 GB | download | 768ee0320dcf41f5b15efafa028ba836 |

<a name="commands"></a>Commands

|Subcommand |Function | |:-------------------------------------------------------------------------|:---------------------------------------------------------------| |build | Construct HIXF index from fasta reference files | |search | Search sequences against a database index | |profile | Generate the taxonomic profile from search results |

<a name="build"></a>Taxor build

taxor-build - Creates and HIXF index of a given set of fasta files
==================================================================

DESCRIPTION
    Creates an HIXF index using either k-mers or syncmers

OPTIONS

  Basic options:
    -h, --help
          Prints the help page.
    -hh, --advanced-help
          Prints the help page including advanced options.
    --version
          Prints the version information.
    --copyright
          Prints the copyright/license information.
    --export-help (std::string)
          Export the help page information. Value must be one of [html, man].

  Main options:
    --input-file (std::string)
          tab-separated-value file containing taxonomy information and reference file names
    --input-sequence-dir (std::string)
          directory containing the fasta reference files Default: .
    --output-filename (std::string)
          A file name for the resulting index. Default: .
    --kmer-size (signed 32 bit integer)
          size of kmers used for index construction Default: 20. Value must be in range [1,30].
    --syncmer-size (signed 32 bit integer)
          size of syncmer used for index construction Default: 10. Value must be in range [1,26].
    --threads (signed 32 bit integer)
          The number of threads to use. Default: 1. Value must be in range [1,32].
    --use-syncmer
          enable using syncmers for smaller index size

input-file This file contains all relevant information about the organisms in the database, which will be indexed. All values are tab-separated and the file should have following columns:

Column 1: Assembly accession: the assembly accession.version reported in this field is a unique identifier for the set of sequences in this particular version of the genome assembly.
Column 2: Taxonomy ID: the NCBI taxonomy identifier for the organism from which the genome assembly was derived. The NCBI Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases. The taxonomy record can be retrieved from the NCBI Taxonomy resource: https://www.ncbi.nlm.nih.gov/taxonomy/
Column 3: FTP path: the path to the directory on the NCBI genomes FTP site from which data for this genome assembly can be downloaded
Column 4: Organism name
Column 5: Taxonomy string
Column 6: Taxonomy ID string

A two-line example of such a file is provided below. You can easily create such a file by following the preprocessing steps described in the Usage section.

GCF_000002495.2	318829	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/495/GCF_000002495.2_MG8	Pyricularia oryzae	k__Eukaryota;p__Ascomycota;c__Sordariomycetes;o__Magnaporthales;f__Pyriculariaceae;g__Pyricularia;s__Pyricularia oryzae	2759;4890;147550;639021;2528436;48558;318829
GCF_000002515.2	28985	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/515/GCF_000002515.2_ASM251v1	Kluyveromyces lactis	k__Eukaryota;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;f__Saccharomycetaceae;g__Kluyveromyces;s__Kluyveromyces lactis	2759;4890;4891;4892;4893;4910;28985

input-sequence-dir Path to the directory containing fasta files (compressed) of organisms listed in the tab-separated file explained above. The file stem of the fasta files needs to match the last directory path string of the FTP path in column 3 of the input file (e.g. GCF_000002495.2_MG8)

output-filename Path to the output file containing the hierarchical interleaved XOR filter index of the reference sequences and taxonomy information for the profiling step.

kmer-size Size of k-length-substrings used for pseudo-mapping. When using syncmers for downsampling, the kmer-size has to be even-numbered because of using open canonical syncmers. The maximum supported k-mer size is 30.

syncmer-size Size of the substrings used for selecting a k-mer for pseudo-mapping. The syncmer-size also has to be even-numbered because of the usage of open canonical syncmers. This number needs to be smaller than the k-mer size and the maximum supported size is 26.

use-syncmer Switch that enables the usage of syncmers for downsampling of k-mers.

threads Number of threads used for computing the hierarchical structure and building the HIXF index.

<a name="search"></a>Taxor search

taxor-search - Queries a file of DNA sequences against an HIXF index
====================================================================

DESCRIPTION
    Query sequences against the taxor HIXF index structure

OPTIONS

  Basic options:
    -h, --help
          Prints the help page.
    -hh, --advanced-help
          Prints the help page including advanced options.
    --version
          Prints the version information.
    --copyright
          Prints the copyright/license information.
    --export-help (std::string)
          Export the help page information. Value must be one of [html, man].

  Main options:
    --index-file (std::string)
          taxor index file containing HIXF index and reference sequence information
    --query-file (std::string)
          file containing sequences to query against the index Default: .
    --output-file (std::string)
          A file name for the resulting output. Default: .
    --threads (unsigned 8 bit integer)
          The number of threads to use. Default: 1. Value must be in range [1,32].
    --percentage (double)
          If set, this threshold is used instead of the k-mer/syncmer models. Default: -1. Value must be in range
          [0,1].
    --error-rate (double)
          Expected error rate of reads that will be queried Default: 0.04. Value must be in range [0,1].

index-file Path to the file containing the hierarchical interleaved XOR filter index of the reference sequences and taxonomy information for the profiling step.

query-file Path to a fast(a/q) file containing sequenced reads of a sample, which shall be taxonomically classified. This file can be gzip compressed.

output-file Path to the output file containing result

Taxor

Install / Use

README