Sepia

Sepia is a (taxonomic) read classifier written in Rust based on (extensions) of the kraken2 algorithms (https://github.com/DerrickWood/kraken2), hence the name Sepia (just like the kraken a cephalopod, and the rust colored pigment derived from it's ink sac). The reason I wrote Sepia is to create a software package that can be easily adapted to novel taxonomies or updates of taxonomies (e.g. GTDB), alter- novel algorithms or data structures, and has features I would like myself (e.g., a fast batch mode).

Installation

Precompiled binaries

Download one of the precompiled binaries (currently only for macOS and Linux) here.

Install from source:

Install Rust (https://www.rust-lang.org/)

Clone the sepia repository:

git clone https://github.com/hcdenbakker/sepia.git

Compile sepia with cargo (Rust's compiler):
```
 cd sepia; cargo build --release
```

Running Sepia

Building an index

To build an index you need collection of reference files (I masked low complexity regions using NCBI's DustMasker (https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/app/dustmasker/) in the 'masked folder') and a tab delimited file with the path to the reference file and the taxonomy string of the reference file. In the ref_demo.txt file you will find an example of what this should look like; we have reference genomes of Salmonella and Escherichia of GTDB r95, and the associated taxonomy in string format.

./masked/Salmonella_enterica_masked.fasta       d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica
./masked/Salmonella_enterica_C_masked.fasta     d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica_C
./masked/Salmonella_enterica_E_masked.fasta     d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica_E
./masked/Salmonella_enterica_D_masked.fasta     d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica_D
./masked/Salmonella_bongori_masked.fasta        d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella bongori
./masked/Escherichia_flexneri_masked.fasta      d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri
./masked/Escherichia_coli_masked.fasta  d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli
./masked/Escherichia_dysenteriae_masked.fasta   d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia dysenteriae
./masked/Escherichia_coli_D_masked.fasta        d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli_D
./masked/Escherichia_albertii_masked.fasta      d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia albertii
./masked/Escherichia_coli_C_masked.fasta        d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli_C
./masked/Escherichia_marmotae_masked.fasta      d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia marmotae
./masked/Escherichia_sp000208585_masked.fasta   d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia sp000208585
./masked/Escherichia_fergusonii_masked.fasta    d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia fergusonii
./masked/Escherichia_sp001660175_masked.fasta   d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia sp001660175
./masked/Escherichia_sp002965065_masked.fasta   d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia sp002965065

If we have these files in place we can use the build subcommand to build an index.

./sepia build --help

builds an index

USAGE:
    sepia build [OPTIONS] --index <index> --kmer <k-mer_size> --minimizer <minimizer> --refs <ref_file>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -c, --batch <batch>            Sets size of batch of reads to be processed in parallel (default 300)
    -g, --gamma <gamma>            gamma parameter used for perfect hash function in boom mode, default value 5.0
    -i, --index <index>            
    -k, --kmer <k-mer_size>        Sets k-mer size
    -m, --minimizer <minimizer>    minimizer size, default length minimizer is 0 (no minimizers), unless indicated
                                   otherwise [default: 0]
    -M, --mode <mode>              build kraken2-like db (sepia), or a index with a perfect hash function (boom)
                                   [default: sepia]
    -r, --refs <ref_file>          Sets the input file to use
    -p, --threads <threads>        number of threads to use, if not set one thread will be used

For the demo we build an index of the Salmonella and Escherichia references using a compact hash table (the data structure for the index used in kraken2), which is the default in Sepia:

./sepia build -r ref_demo.txt -k 31 -m 21 -p 4 -c 2 -i demo_index

This creates an index called 'demo_index'. Alternatively we can build an index if the -M boom option, which will create a smaller index with a perfect hash function. While the index is smaller with a perfect hash function, the amount of RAM needed to build it is considerably larger, this is one of the experimental elements of Sepia.

In the index directory/folder you will find a text file reporting potential taxonomic incongruities called taxonomy_ambiguities.txt. It is up to the user to solve these incongruities if necessary, for example:

s__Hypnum_recurvatum:
root;k__Viridiplantae;p__Bryophyta;c__Bryopsida;o__Hypnales;f__Hypnaceae;g__Drepanium/s__Hypnum_recurvatum
root;k__Viridiplantae;p__Bryophyta;c__Bryopsida;o__Hypnales;f__Hypnaceae;g__Hypnum/s__Hypnum_recurvatum

In this case the species name s__Hypnum_recurvatum is found in two different lineages, once in the genus g__Drepanium, and once (as expected) in the genus g__Hypnum. This clearly a situation that needs to be corrected, as it may affect the accuracy of Sepia to classify reads belonging to either g__Drepanium or g__Hypnum. Here is another example that does not need correction:

g__Gomphus:
root;k__Metazoa;p__Arthropoda;c__Insecta;o__Odonata;f__Gomphidae/g__Gomphus
root;k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Gomphales;f__Gomphaceae/g__Gomphus

In this case the genus g__Gomphus is found in two lineages; (i) a lineage leading up to a family of Dragonflies (f__Gomphidae) and (ii) a lineage leading up a family of ectomycorrhizal mushrooms (f__Gomphaceae). This happens quite often when taxonomies that are covered by different nomenclature codes (https://en.wikipedia.org/wiki/Nomenclature_codes) are combined in one index. Taxonomies in Sepia are encoded in such a way that this should not pose a problem.

Classifying reads

./sepia classify --help

classifies reads using an lca approach

USAGE:
    sepia classify [FLAGS] [OPTIONS] --index <index> --prefix <prefix> --query <query>

FLAGS:
    -h, --help           Prints help information
    -H, --hll            include HLL based estimates of cardinilaty of kmers and duplicity of kmers
    -I, --interleaved    Type of sequence data; pe: paired-end fastq.gz, se: single-end fastq.gz, inter: interleaved
                         paired-end fastq.gz, fasta: reads in fasta format
    -V, --version        Prints version information

OPTIONS:
    -c, --batch <batch>                        Sets size of batch of reads to be processed in parallel (default 50,000)
    -z, --compress_output <compress_output>    gzip compress read classification file [default: true]
    -i, --index <index>                        index to be used for search
    -n, --prefix <prefix>                      prefix for output file(-s)
    -Q, --quality <quality>                    kmers with nucleotides below this minimum phred score will be excluded
                                               from the analyses (default 15)
    -q, --query <query>                        query file(-s)fastq.gz
    -t, --threads <threads>                    number of threads to use, if not set the maximum available number threads
                                               will be used

To classify reads, we use the classify subcommand. Thanks to needletail, sepia will automatically detect if the files are fastq or fasta files, and if they are compressed (e.g., fastq.gz). If two files are given, Sepia will assume the files are paired end. If you use an interleaved paired-end file, you have to use the -I flag, otherwise Sepia will assume it is a single end file. The -H can be used to implement hyperloglog, however the currently implementation is slow and still experimental. To perform read classification with the previously created index with default values: `sepia classify -q your_dat

Sepia

Install / Use

README

Sepia

Installation

Precompiled binaries

Install from source:

Running Sepia

Building an index

Classifying reads