Taxor

Fast and space-efficient taxonomic classification of long reads

<center><img src="img/Logo.png" alt="Taxonomic classification with XOR filters" width="300"/></center>

Taxor: Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters

Citation

Ulrich, J. U., & Renard, B. Y. (2024). Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters. Genome Research, gr-278623. doi: 10.1101/gr.278623.123

<a name="description"></a>Description

Taxor is a taxonomic classification and profiling tool that efficiently classifies DNA sequences against large sets of genomic reference sequences. Taxor stores k-mers in an optimized hierarchical interleaved XOR filter (HIXF) index and combines k-mer similarity and genome coverage information for precise taxonomic classification and profiling. It features:

  • Low false positive rates for k-mer matching
  • NCBI taxonomy integration
  • Open canonical syncmers as k-mer selection scheme for improved downsampling
  • Classification with binning and taxonomic profiling
  • Read reassignment EM algorithm for multi-matching reads
  • Advanced filtration of search results
  • Taxonomic and sequence abundance reports with genome size correction

Benchmarking results based on simulated and real long-read data sets demonstrate that Taxor enables more precise taxonomic classification and profiling of microbial populations while having a smaller memory footprint than other tools.

<a name="installation"></a>Installation

The easiest way to install Taxor is via Conda. <br> Alternatively, you can build Taxor from source with the following commands. This requires CMake (>= 3.16) and GCC (>= 10).

git clone https://github.com/JensUweUlrich/Taxor.git
cd Taxor
mkdir build
cd build
cmake  ../src
cmake --build . --config Release

<a name="databases"></a>Pre-built databases

Users can build custom databases as described below or use the following pre-built database index files:

| Kingdom | Source | Parameters | Size | Link | MD5 Hash |
|:----------------------------------|:--------------------|:-----------|:-------|:---------|:---------------------------------|
| Viruses | GenBank Release 258 | k=22, s=12 | 373 MB | download | 0e8edd19a6314450f88f556dcf6b7c95 |
| Archaea, Bacteria, Fungi, Viruses | RefSeq Release 216 | k=22, s=12 | 9.9 GB | download | 768ee0320dcf41f5b15efafa028ba836 |
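
After downloading one of these index files, it can be worth checking it against the MD5 hash listed in the table. The following is a minimal sketch using only the Python standard library; the function names and the chunk size are illustrative choices and are not part of Taxor.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected_hex):
    """Compare a file's MD5 digest with a hash copied from the table above."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_md5(path_to_download, "768ee0320dcf41f5b15efafa028ba836")` should return `True` for an intact copy of the RefSeq index, where `path_to_download` is wherever you saved the file.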

<a name="commands"></a>Commands

| Subcommand | Function |
|:-----------|:---------------------------------------------------|
| build | Construct HIXF index from fasta reference files |
| search | Search sequences against a database index |
| profile | Generate the taxonomic profile from search results |

<a name="build"></a>Taxor build

taxor-build - Creates an HIXF index of a given set of fasta files
=================================================================

DESCRIPTION
    Creates an HIXF index using either k-mers or syncmers

OPTIONS

  Basic options:
    -h, --help
          Prints the help page.
    -hh, --advanced-help
          Prints the help page including advanced options.
    --version
          Prints the version information.
    --copyright
          Prints the copyright/license information.
    --export-help (std::string)
          Export the help page information. Value must be one of [html, man].

  Main options:
    --input-file (std::string)
          tab-separated-value file containing taxonomy information and reference file names
    --input-sequence-dir (std::string)
          directory containing the fasta reference files Default: .
    --output-filename (std::string)
          A file name for the resulting index. Default: .
    --kmer-size (signed 32 bit integer)
          size of kmers used for index construction Default: 20. Value must be in range [1,30].
    --syncmer-size (signed 32 bit integer)
          size of syncmer used for index construction Default: 10. Value must be in range [1,26].
    --threads (signed 32 bit integer)
          The number of threads to use. Default: 1. Value must be in range [1,32].
    --use-syncmer
          enable using syncmers for smaller index size

<b> input-file</b><br> This file contains all relevant information about the organisms that will be indexed in the database. All values are tab-separated, and the file should have the following columns:

  • Column 1: Assembly accession: the assembly accession.version reported in this field is a unique identifier for the set of sequences in this particular version of the genome assembly.
  • Column 2: Taxonomy ID: the NCBI taxonomy identifier for the organism from which the genome assembly was derived. The NCBI Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases. The taxonomy record can be retrieved from the NCBI Taxonomy resource: https://www.ncbi.nlm.nih.gov/taxonomy/
  • Column 3: FTP path: the path to the directory on the NCBI genomes FTP site from which data for this genome assembly can be downloaded
  • Column 4: Organism name
  • Column 5: Taxonomy string
  • Column 6: Taxonomy ID string

A two-line example of such a file is provided below. You can easily create such a file by following the preprocessing steps described in the Usage section.

GCF_000002495.2	318829	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/495/GCF_000002495.2_MG8	Pyricularia oryzae	k__Eukaryota;p__Ascomycota;c__Sordariomycetes;o__Magnaporthales;f__Pyriculariaceae;g__Pyricularia;s__Pyricularia oryzae	2759;4890;147550;639021;2528436;48558;318829
GCF_000002515.2	28985	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/515/GCF_000002515.2_ASM251v1	Kluyveromyces lactis	k__Eukaryota;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;f__Saccharomycetaceae;g__Kluyveromyces;s__Kluyveromyces lactis	2759;4890;4891;4892;4893;4910;28985
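
To make the six-column layout concrete, the following sketch parses such rows into typed records. It is an illustration only and not part of Taxor; the `ReferenceRecord` type and `parse_reference_table` function are hypothetical names, and the sketch assumes that every row has exactly six tab-separated fields.

```python
from typing import List, NamedTuple


class ReferenceRecord(NamedTuple):
    accession: str            # column 1: assembly accession.version
    taxid: int                # column 2: NCBI taxonomy ID
    ftp_path: str             # column 3: FTP path of the assembly
    organism: str             # column 4: organism name
    taxonomy: List[str]       # column 5, split on ';'
    taxid_lineage: List[int]  # column 6, split on ';'


def parse_reference_table(lines):
    """Parse tab-separated rows with the six columns described above."""
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        accession, taxid, ftp_path, organism, taxonomy, lineage = fields
        records.append(ReferenceRecord(
            accession, int(taxid), ftp_path, organism,
            taxonomy.split(";"), [int(t) for t in lineage.split(";")],
        ))
    return records
```

Passing an open file handle (for example `parse_reference_table(open(path))`) works because the function only iterates over lines.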

<b> input-sequence-dir</b><br> Path to the directory containing the (compressed) fasta files of the organisms listed in the tab-separated file explained above. The file stem of each fasta file needs to match the last path component of the FTP path in column 3 of the input file (e.g. GCF_000002495.2_MG8).
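
A quick consistency check between column 3 and the sequence directory might look like the sketch below. It is not part of Taxor. The list of fasta suffixes is an assumption made for illustration, since the exact rule Taxor uses to derive the file stem is not stated here, and the helper names are hypothetical.

```python
# Assumed fasta suffixes; NOT taken from the Taxor documentation.
FASTA_SUFFIXES = (".fna.gz", ".fa.gz", ".fasta.gz", ".fna", ".fa", ".fasta")


def expected_stem(ftp_path):
    """Return the last path component of the FTP path (column 3)."""
    return ftp_path.rstrip("/").rsplit("/", 1)[-1]


def file_stem(filename):
    """Strip one of the assumed fasta suffixes from a file name."""
    for suffix in FASTA_SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    return filename


def missing_references(ftp_paths, filenames):
    """Return the expected stems that have no matching file name."""
    present = {file_stem(name) for name in filenames}
    return [stem for stem in map(expected_stem, ftp_paths) if stem not in present]
```

Calling `missing_references(paths_from_column_3, os.listdir(sequence_dir))` would then list the assemblies whose fasta file is absent or named differently.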

<b> output-filename</b><br> Path to the output file containing the hierarchical interleaved XOR filter index of the reference sequences and taxonomy information for the profiling step.

<b> kmer-size</b><br> Size of the k-length substrings (k-mers) used for pseudo-mapping. When syncmers are used for downsampling, the k-mer size has to be even because Taxor uses open canonical syncmers. The maximum supported k-mer size is 30.

<b> syncmer-size</b><br> Size of the substrings used for selecting a k-mer for pseudo-mapping. The syncmer size also has to be even because open canonical syncmers are used. It must be smaller than the k-mer size, and the maximum supported size is 26.
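
The size constraints stated for kmer-size and syncmer-size can be checked before starting a long build. The following is a small sketch that encodes only the rules given in this README; the function name `check_build_sizes` is hypothetical, and Taxor itself may report violations differently.

```python
def check_build_sizes(kmer_size, syncmer_size=None):
    """Return a list of violated constraints (empty if none).

    Pass syncmer_size=None when syncmers are not used.
    """
    problems = []
    if not 1 <= kmer_size <= 30:
        problems.append("kmer-size must be in the range [1, 30]")
    if syncmer_size is not None:
        if not 1 <= syncmer_size <= 26:
            problems.append("syncmer-size must be in the range [1, 26]")
        if kmer_size % 2 != 0:
            problems.append("kmer-size must be even when syncmers are used")
        if syncmer_size % 2 != 0:
            problems.append("syncmer-size must be even")
        if syncmer_size >= kmer_size:
            problems.append("syncmer-size must be smaller than kmer-size")
    return problems
```

The parameters of the pre-built databases (k=22, s=12) pass all checks, so `check_build_sizes(22, 12)` returns an empty list.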

<b> use-syncmer</b><br> Switch that enables the usage of syncmers for downsampling of k-mers.
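
To give an intuition for how syncmer-based downsampling thins out the k-mer set, the sketch below shows the general open-syncmer idea: a k-mer is kept only if its smallest substring of length s sits at a fixed offset. This is not Taxor's implementation. Lexicographic ordering of the substrings, an offset of 0, and canonicalisation by taking the smaller of a k-mer and its reverse complement are all assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Lexicographically smaller of a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))


def is_open_syncmer(kmer, s, offset=0):
    """True if the smallest s-length substring of kmer starts at `offset`.

    Ties are resolved in favour of the leftmost occurrence (an assumption).
    """
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return smers.index(min(smers)) == offset


def select_syncmers(sequence, k, s, offset=0):
    """Return the canonical k-mers of `sequence` that are open syncmers."""
    selected = []
    for i in range(len(sequence) - k + 1):
        kmer = canonical(sequence[i:i + k])
        if is_open_syncmer(kmer, s, offset):
            selected.append(kmer)
    return selected
```

Because the selection depends only on the content of each k-mer, the same k-mer is kept or dropped consistently in references and reads, which is what makes this kind of downsampling usable for matching.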

<b> threads</b><br> Number of threads used for computing the hierarchical structure and building the HIXF index.

<a name="search"></a>Taxor search

taxor-search - Queries a file of DNA sequences against an HIXF index
====================================================================

DESCRIPTION
    Query sequences against the taxor HIXF index structure

OPTIONS

  Basic options:
    -h, --help
          Prints the help page.
    -hh, --advanced-help
          Prints the help page including advanced options.
    --version
          Prints the version information.
    --copyright
          Prints the copyright/license information.
    --export-help (std::string)
          Export the help page information. Value must be one of [html, man].

  Main options:
    --index-file (std::string)
          taxor index file containing HIXF index and reference sequence information
    --query-file (std::string)
          file containing sequences to query against the index Default: .
    --output-file (std::string)
          A file name for the resulting output. Default: .
    --threads (unsigned 8 bit integer)
          The number of threads to use. Default: 1. Value must be in range [1,32].
    --percentage (double)
          If set, this threshold is used instead of the k-mer/syncmer models. Default: -1. Value must be in range
          [0,1].
    --error-rate (double)
          Expected error rate of reads that will be queried Default: 0.04. Value must be in range [0,1].

<b> index-file</b><br> Path to the file containing the hierarchical interleaved XOR filter index of the reference sequences and taxonomy information for the profiling step.

<b> query-file</b><br> Path to a fast(a/q) file containing sequenced reads of a sample, which shall be taxonomically classified. This file can be gzip compressed.

<b> output-file</b><br> Path to the output file containing result
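
When running searches from a script, the options above can be assembled and range-checked before the call. The sketch below builds an argument list from the documented flags. The executable name `taxor` followed by the `search` subcommand is an assumption based on the commands table, the function name is hypothetical, and the file names in the usage note are placeholders.

```python
def search_command(index_file, query_file, output_file,
                   threads=1, error_rate=0.04, percentage=None,
                   executable="taxor"):
    """Build an argument list for a search run from the documented options.

    `executable` is an assumed name; adjust it to your installation.
    """
    if not 1 <= threads <= 32:
        raise ValueError("threads must be in the range [1, 32]")
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error-rate must be in the range [0, 1]")
    cmd = [
        executable, "search",
        "--index-file", str(index_file),
        "--query-file", str(query_file),
        "--output-file", str(output_file),
        "--threads", str(threads),
        "--error-rate", str(error_rate),
    ]
    # Only pass --percentage when a threshold is explicitly requested,
    # so that the k-mer/syncmer models remain in effect otherwise.
    if percentage is not None:
        if not 0.0 <= percentage <= 1.0:
            raise ValueError("percentage must be in the range [0, 1]")
        cmd += ["--percentage", str(percentage)]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, for example with `search_command("refseq.hixf", "reads.fq.gz", "results.txt", threads=8)`.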
