Taxor: Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters
Citation
Ulrich, J. U., & Renard, B. Y. (2024). Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters. Genome Research, gr-278623. doi: 10.1101/gr.278623.123
<a name="description"></a>Description
Taxor is a taxonomic classification and profiling tool that efficiently classifies DNA sequences against large sets of genomic reference sequences. Taxor stores k-mers in an optimized hierarchical interleaved XOR filter (HIXF) index and combines k-mer similarity and genome coverage information for precise taxonomic classification and profiling. It features:
- Low false positive rates for k-mer matching
- NCBI taxonomy integration
- Open canonical syncmers as k-mer selection scheme for improved downsampling
- Classification with binning and taxonomic profiling
- Read reassignment EM algorithm for multi-matching reads
- Advanced filtration of search results
- Taxonomic and sequence abundance reports with genome size correction
Benchmarking results based on simulated and real long-read data sets demonstrate that Taxor enables more precise taxonomic classification and profiling of microbial populations while having a smaller memory footprint than other tools.
<a name="installation"></a>Installation
The easiest way to install Taxor is via Conda. <br> Alternatively, you can build Taxor from source using the following commands. Make sure that CMake (>= 3.16) and GCC (>= 10) are installed.
git clone https://github.com/JensUweUlrich/Taxor.git
cd Taxor
mkdir build
cd build
cmake ../src
cmake --build . --config Release
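Before building from source, it can help to confirm that the installed toolchain meets the stated minimums. The sketch below is one way to do that. The helper name `version_ge` is hypothetical, and the comparison assumes GNU `sort` with its `-V` (version sort) option.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Assumes GNU sort, whose -V option orders version strings (3.5 < 3.16).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Possible use against the documented minimums (CMake >= 3.16, GCC >= 10):
# version_ge "$(cmake --version | head -n 1 | awk '{print $3}')" 3.16 || echo "CMake too old"
# version_ge "$(gcc -dumpversion)" 10 || echo "GCC too old"
```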
<a name="databases"></a>Pre-built databases
Users can easily build custom databases as described below or use the following pre-built database index files:

|Kingdom |Source | Parameters | Size | Link | MD5 Hash |
|:-------------------------------------|:----------------------|:------------|:--------|:-------------|:---------|
| Viruses | GenBank Release 258 | k=22, s=12 | 373 MB | download | 0e8edd19a6314450f88f556dcf6b7c95 |
| Archaea, Bacteria, Fungi, Viruses | RefSeq Release 216 | k=22, s=12 | 9.9 GB | download | 768ee0320dcf41f5b15efafa028ba836 |
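After downloading a pre-built index, the MD5 hash listed in the table can be used to check that the file arrived intact. The following is a minimal sketch, assuming GNU coreutils `md5sum` is available. The function name `verify_md5` and the file name in the usage comment are hypothetical.

```shell
# Compare a downloaded index file against the MD5 hash listed in the table.
# Assumes GNU coreutils md5sum (on macOS, `md5 -q` is the usual equivalent).
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<hash>  <filename>"; keep only the hash.
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Example with a hypothetical local file name for the viral index:
# verify_md5 viruses.hixf 0e8edd19a6314450f88f556dcf6b7c95
```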
<a name="commands"></a>Commands
|Subcommand |Function |
|:-------------------------------------------------------------------------|:---------------------------------------------------------------|
|build | Construct HIXF index from fasta reference files |
|search | Search sequences against a database index |
|profile | Generate the taxonomic profile from search results |
<a name="build"></a>Taxor build
taxor-build - Creates an HIXF index of a given set of fasta files
=================================================================
DESCRIPTION
Creates an HIXF index using either k-mers or syncmers
OPTIONS
Basic options:
-h, --help
Prints the help page.
-hh, --advanced-help
Prints the help page including advanced options.
--version
Prints the version information.
--copyright
Prints the copyright/license information.
--export-help (std::string)
Export the help page information. Value must be one of [html, man].
Main options:
--input-file (std::string)
tab-separated-value file containing taxonomy information and reference file names
--input-sequence-dir (std::string)
directory containing the fasta reference files Default: .
--output-filename (std::string)
A file name for the resulting index. Default: .
--kmer-size (signed 32 bit integer)
size of kmers used for index construction Default: 20. Value must be in range [1,30].
--syncmer-size (signed 32 bit integer)
size of syncmer used for index construction Default: 10. Value must be in range [1,26].
--threads (signed 32 bit integer)
The number of threads to use. Default: 1. Value must be in range [1,32].
--use-syncmer
enable using syncmers for smaller index size
<b> input-file</b><br> This file contains all relevant information about the organisms that will be indexed in the database. All values are tab-separated, and the file should have the following columns:
- Column 1: Assembly accession: the assembly accession.version reported in this field is a unique identifier for the set of sequences in this particular version of the genome assembly.
- Column 2: Taxonomy ID: the NCBI taxonomy identifier for the organism from which the genome assembly was derived. The NCBI Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases. The taxonomy record can be retrieved from the NCBI Taxonomy resource: https://www.ncbi.nlm.nih.gov/taxonomy/
- Column 3: FTP path: the path to the directory on the NCBI genomes FTP site from which data for this genome assembly can be downloaded
- Column 4: Organism name
- Column 5: Taxonomy string
- Column 6: Taxonomy ID string
A two-line example of such a file is provided below. You can easily create such a file by following the preprocessing steps described in the Usage section.
GCF_000002495.2 318829 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/495/GCF_000002495.2_MG8 Pyricularia oryzae k__Eukaryota;p__Ascomycota;c__Sordariomycetes;o__Magnaporthales;f__Pyriculariaceae;g__Pyricularia;s__Pyricularia oryzae 2759;4890;147550;639021;2528436;48558;318829
GCF_000002515.2 28985 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/515/GCF_000002515.2_ASM251v1 Kluyveromyces lactis k__Eukaryota;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;f__Saccharomycetaceae;g__Kluyveromyces;s__Kluyveromyces lactis 2759;4890;4891;4892;4893;4910;28985
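A malformed input file is easier to catch before starting a long index build. The sketch below checks only what the column list above states: six tab-separated columns, with a numeric taxonomy ID in column 2. The function name `check_input_tsv` is hypothetical, and Taxor itself may apply different or stricter checks.

```shell
# Report lines of the input file that do not have exactly six tab-separated
# columns or whose taxonomy ID (column 2) is not a positive integer.
# Exits with a non-zero status if any problem was found.
check_input_tsv() {
    awk -F '\t' '
        NF != 6          { printf "line %d: expected 6 columns, found %d\n", NR, NF; bad = 1; next }
        $2 !~ /^[0-9]+$/ { printf "line %d: taxonomy ID \"%s\" is not numeric\n", NR, $2; bad = 1 }
        END              { exit bad + 0 }
    ' "$1"
}

# Example (hypothetical file name):
# check_input_tsv refseq_accessions_taxonomy.tsv && echo "input file looks fine"
```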
<b> input-sequence-dir</b><br> Path to the directory containing the (compressed) fasta files of the organisms listed in the tab-separated file explained above. The file stem of each fasta file needs to match the last path component of the FTP path in column 3 of the input file (e.g. GCF_000002495.2_MG8).
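The naming rule above can be checked before building an index. This sketch derives the expected stem from column 3 of the input file and lists stems for which no file is found in the sequence directory. Both function names are hypothetical, and matching files by the prefix `<stem>*` is an assumption about how the fasta files are named (for example with a suffix such as `_genomic.fna.gz`).

```shell
# Print the expected file stem per reference, i.e. the last path component
# of the FTP path in column 3 (a trailing slash is ignored).
expected_stems() {
    awk -F '\t' '{ p = $3; sub(/\/+$/, "", p); n = split(p, parts, "/"); print parts[n] }' "$1"
}

# Print every expected stem without a matching file in the sequence directory.
# Assumption: a file matches if its name starts with the stem.
missing_references() {
    tsv=$1
    dir=$2
    expected_stems "$tsv" | while IFS= read -r stem; do
        found=no
        for f in "$dir"/"$stem"*; do
            [ -e "$f" ] && found=yes && break
        done
        [ "$found" = yes ] || echo "$stem"
    done
}

# Example (hypothetical paths):
# missing_references refseq_accessions_taxonomy.tsv ./refseq
```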
<b> output-filename</b><br> Path to the output file containing the hierarchical interleaved XOR filter index of the reference sequences and taxonomy information for the profiling step.
<b> kmer-size</b><br> Size of the substrings of length k used for pseudo-mapping. When syncmers are used for downsampling, the kmer-size must be an even number because Taxor uses open canonical syncmers. The maximum supported k-mer size is 30.
<b> syncmer-size</b><br> Size of the substrings used to select a k-mer for pseudo-mapping. The syncmer-size must also be an even number because of the open canonical syncmers. It needs to be smaller than the k-mer size, and the maximum supported size is 26.
<b> use-syncmer</b><br> Switch that enables the usage of syncmers for downsampling of k-mers.
<b> threads</b><br> Number of threads used for computing the hierarchical structure and building the HIXF index.
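The constraints on kmer-size and syncmer-size described above can be summarized as a small pre-flight check. This is a sketch of the documented rules only, not of Taxor's own argument validation, and the function name `check_sizes` is hypothetical.

```shell
# Check a k-mer/syncmer size pair against the documented constraints.
# Usage: check_sizes K S [syncmer]   (pass "syncmer" when --use-syncmer is set)
check_sizes() {
    k=$1
    s=$2
    mode=${3:-}
    [ "$k" -ge 1 ] && [ "$k" -le 30 ] || { echo "kmer-size must be in [1,30]" >&2; return 1; }
    if [ "$mode" = syncmer ]; then
        [ "$s" -ge 1 ] && [ "$s" -le 26 ] || { echo "syncmer-size must be in [1,26]" >&2; return 1; }
        [ $((k % 2)) -eq 0 ] || { echo "kmer-size must be even with syncmers" >&2; return 1; }
        [ $((s % 2)) -eq 0 ] || { echo "syncmer-size must be even" >&2; return 1; }
        [ "$s" -lt "$k" ] || { echo "syncmer-size must be smaller than kmer-size" >&2; return 1; }
    fi
}

# The parameters of the pre-built indexes (k=22, s=12) pass the check:
# check_sizes 22 12 syncmer && echo "sizes are valid"
```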
<a name="search"></a>Taxor search
taxor-search - Queries a file of DNA sequences against an HIXF index
====================================================================
DESCRIPTION
Query sequences against the taxor HIXF index structure
OPTIONS
Basic options:
-h, --help
Prints the help page.
-hh, --advanced-help
Prints the help page including advanced options.
--version
Prints the version information.
--copyright
Prints the copyright/license information.
--export-help (std::string)
Export the help page information. Value must be one of [html, man].
Main options:
--index-file (std::string)
taxor index file containing HIXF index and reference sequence information
--query-file (std::string)
file containing sequences to query against the index Default: .
--output-file (std::string)
A file name for the resulting output. Default: .
--threads (unsigned 8 bit integer)
The number of threads to use. Default: 1. Value must be in range [1,32].
--percentage (double)
If set, this threshold is used instead of the k-mer/syncmer models. Default: -1. Value must be in range [0,1].
--error-rate (double)
Expected error rate of reads that will be queried Default: 0.04. Value must be in range [0,1].
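As a rough illustration of why the expected error rate matters for k-mer matching, assume that sequencing errors hit each base independently with probability e. The chance that a given k-mer of a read is error-free is then as shown below. This is a textbook approximation given for intuition only and is not necessarily the model Taxor applies internally. The values e = 0.04 (the default error rate) and k = 22 (the k-mer size of the pre-built indexes) are taken from this document.

```latex
P(\text{k-mer error-free}) = (1 - e)^{k},
\qquad
e = 0.04,\; k = 22 \;\Rightarrow\; (0.96)^{22} \approx 0.41
```

Under this assumption, fewer than half of the 22-mers of a read with a 4% error rate would be expected to match their source genome exactly.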
<b> index-file</b><br> Path to the file containing the hierarchical interleaved XOR filter index of the reference sequences and taxonomy information for the profiling step.
<b> query-file</b><br> Path to a FASTA or FASTQ file containing the sequenced reads of a sample that are to be taxonomically classified. This file can be gzip compressed.
<b> output-file</b><br> Path to the output file containing result
