ReferenceSeeker: rapid determination of appropriate reference genomes

Description
Input & Output
Installation
- BioConda
- GitHub
Usage
Examples
Databases
- RefSeq
- Custom
Dependencies
Citation
Feedback

Description

ReferenceSeeker rapidly determines closely related reference genomes following a scalable hierarchical approach: a fast kmer profile-based database lookup of candidate reference genomes, followed by the computation of specific average nucleotide identity (ANI) values for those candidates.

ReferenceSeeker computes kmer-based genome distances between a query genome and potential reference genome candidates via Mash (Ondov et al. 2016). For the resulting candidates, ReferenceSeeker then computes (optionally bidirectional) ANI values and, by default, keeps genomes meeting community standard thresholds (ANI >= 95 % & conserved DNA >= 69 %) (Goris, Konstantinidis et al. 2007). The selected genomes are ranked by the product of ANI and conserved DNA values to take both genome coverage and identity into account.
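The filtering and ranking described above can be sketched in a few lines of Python. This is an illustration only, not ReferenceSeeker's own code: the `Candidate` class, the `select_references` function and the `ref_A` to `ref_D` values are hypothetical, and the values are given in percent as in the output tables.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical container for one candidate reference genome."""
    assembly_id: str
    ani: float            # percent
    conserved_dna: float  # percent


def select_references(candidates, min_ani=95.0, min_conserved_dna=69.0):
    """Keep candidates meeting both thresholds, best ANI x conserved DNA first."""
    kept = [
        c for c in candidates
        if c.ani >= min_ani and c.conserved_dna >= min_conserved_dna
    ]
    return sorted(kept, key=lambda c: c.ani * c.conserved_dna, reverse=True)


# Made-up example values (assumptions for illustration).
candidates = [
    Candidate("ref_B", 98.61, 83.13),
    Candidate("ref_C", 96.00, 60.00),  # fails the conserved DNA threshold
    Candidate("ref_A", 99.87, 96.80),
    Candidate("ref_D", 94.00, 99.00),  # fails the ANI threshold
]

ranked = select_references(candidates)
for c in ranked:
    print(c.assembly_id, round(c.ani * c.conserved_dna, 1))
```

Ranking by the product means that a genome with very high identity but poor coverage does not outrank one that is strong in both.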

Custom databases can be built with local genomes. For further convenience, we provide pre-built databases with sequences from RefSeq (https://www.ncbi.nlm.nih.gov/refseq), GTDB and PLSDB comprising the following taxa:

bacteria
archaea
fungi
protozoa
viruses

as well as plasmids.

Both ANI and conserved DNA values are calculated because Mash distance values correlate well with ANI values for closely related genomes, but not with conserved DNA values. A kmer fingerprint-based comparison alone cannot distinguish whether a kmer is missing due to, for instance, a SNP or the absence of the subsequence containing that kmer. As DNA conservation (next to DNA identity) is very important for many kinds of analyses, e.g. reference-based SNP detection, ranking potential reference genomes by Mash distance alone is often not sufficient to select the most appropriate reference genomes. If desired, ANI and conserved DNA values can be computed bidirectionally.
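A toy Python example may make this ambiguity concrete. It is a simplified sketch using exact kmer sets on a random sequence, not Mash's MinHash sketches; the sequence, the kmer size `K = 11` and all function names are assumptions chosen for illustration.

```python
import random


def kmers(seq, k):
    """Return the set of all kmers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two kmer sets."""
    return len(a & b) / len(a | b)


K = 11
rng = random.Random(42)
reference = "".join(rng.choice("ACGT") for _ in range(300))

# Case 1: a single SNP in the middle; the whole sequence is still present.
pos = 150
substitute = "A" if reference[pos] != "A" else "C"
snp_query = reference[:pos] + substitute + reference[pos + 1:]

# Case 2: the last K bases are absent; that subsequence is not conserved.
truncated_query = reference[:-K]

ref_kmers = kmers(reference, K)
missing_after_snp = ref_kmers - kmers(snp_query, K)
missing_after_truncation = ref_kmers - kmers(truncated_query, K)

# Both cases show up only as "some reference kmers are missing".
print("missing after SNP:       ", len(missing_after_snp))
print("missing after truncation:", len(missing_after_truncation))
print("Jaccard (SNP):           ", jaccard(ref_kmers, kmers(snp_query, K)))
print("Jaccard (truncation):    ", jaccard(ref_kmers, kmers(truncated_query, K)))
```

In the first case the full sequence would still align (with slightly reduced identity), while in the second case part of the sequence has no counterpart at all. The kmer sets alone do not reveal which situation applies, which is why an alignment-based step is needed.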

[Figure: Mash distance (Mash D) vs. ANI / conserved DNA (conDNA)]

Input & Output

Input

Path to a taxon database and a draft or finished genome in (zipped) fasta format:

$ referenceseeker ~/bacteria GCF_000013425.1.fna

Output

Tab separated lines to STDOUT comprising the following columns:

Unidirectionally (query -> references):

RefSeq Assembly ID
Mash Distance
ANI
Conserved DNA
NCBI Taxonomy ID
Assembly Status
Organism

#ID    Mash Distance    ANI    Con. DNA    Taxonomy ID    Assembly Status    Organism
GCF_000013425.1    0.00000    100.00    100.00    93061    complete    Staphylococcus aureus subsp. aureus NCTC 8325
GCF_001900185.1    0.00002    100.00    99.89     46170    complete    Staphylococcus aureus subsp. aureus HG001
GCF_900475245.1    0.00004    100.00    99.57     93061    complete    Staphylococcus aureus subsp. aureus NCTC 8325 NCTC8325
GCF_001018725.2    0.00016    100.00    99.28     1280     complete    Staphylococcus aureus FDAARGOS_10
GCF_003595465.1    0.00185    99.86     96.81     1280     complete    Staphylococcus aureus USA300-SUR6
GCF_003595385.1    0.00180    99.87     96.80     1280     complete    Staphylococcus aureus USA300-SUR2
GCF_003595365.1    0.00180    99.87     96.80     1280     complete    Staphylococcus aureus USA300-SUR1
GCF_001956815.1    0.00180    99.87     96.80     46170    complete    Staphylococcus aureus subsp. aureus USA300_SUR1
...
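For downstream scripting, the unidirectional output can be read with a few lines of Python. The following is a minimal sketch, assuming the output was captured as text and follows the seven tab-separated columns listed above; `COLUMNS` and `parse_results` are hypothetical names and not part of ReferenceSeeker.

```python
# Hypothetical column keys for the seven unidirectional output columns.
COLUMNS = [
    "id", "mash_distance", "ani", "conserved_dna",
    "taxonomy_id", "assembly_status", "organism",
]


def parse_results(text):
    """Parse tab-separated ReferenceSeeker output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = [f.strip() for f in line.split("\t")]
        row = dict(zip(COLUMNS, fields))
        row["mash_distance"] = float(row["mash_distance"])
        row["ani"] = float(row["ani"])
        row["conserved_dna"] = float(row["conserved_dna"])
        row["taxonomy_id"] = int(row["taxonomy_id"])
        rows.append(row)
    return rows


# Two rows taken from the example output above.
sample = "\n".join([
    "#ID\tMash Distance\tANI\tCon. DNA\tTaxonomy ID\tAssembly Status\tOrganism",
    "GCF_000013425.1\t0.00000\t100.00\t100.00\t93061\tcomplete\t"
    "Staphylococcus aureus subsp. aureus NCTC 8325",
    "GCF_003595465.1\t0.00185\t99.86\t96.81\t1280\tcomplete\t"
    "Staphylococcus aureus USA300-SUR6",
])

rows = parse_results(sample)
for row in rows:
    print(row["id"], row["ani"], row["conserved_dna"], row["organism"])
```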

Bidirectionally (query -> references [QR] & references -> query [RQ]):

RefSeq Assembly ID
Mash Distance
QR ANI
QR Conserved DNA
RQ ANI
RQ Conserved DNA
NCBI Taxonomy ID
Assembly Status
Organism

#ID    Mash Distance    QR ANI    QR Con. DNA    RQ ANI    RQ Con. DNA    Taxonomy ID    Assembly Status    Organism
GCF_000013425.1    0.00000    100.00    100.00    100.00    100.00    93061    complete    Staphylococcus aureus subsp. aureus NCTC 8325
GCF_001900185.1    0.00002    100.00    99.89     100.00    99.89     46170    complete    Staphylococcus aureus subsp. aureus HG001
GCF_900475245.1    0.00004    100.00    99.57     99.99     99.67     93061    complete    Staphylococcus aureus subsp. aureus NCTC 8325 NCTC8325
GCF_001018725.2    0.00016    100.00    99.28     99.95     98.88     1280     complete    Staphylococcus aureus FDAARGOS_10
GCF_001018915.2    0.00056    99.99     96.35     99.98     99.55     1280     complete    Staphylococcus aureus NRS133
GCF_001019415.2    0.00081    99.99     94.47     99.98     99.36     1280     complete    Staphylococcus aureus NRS146
GCF_001018735.2    0.00096    100.00    94.76     99.98     98.58     1280     complete    Staphylococcus aureus NRS137
GCF_003354885.1    0.00103    99.93     96.63     99.93     96.66     1280     complete    Staphylococcus aureus 164
...
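The two directions can differ noticeably, as in the NRS146 row above (QR conserved DNA 94.47 vs. RQ conserved DNA 99.36). A small Python sketch to surface such gaps follows; it assumes that QR conserved DNA describes the share of the query found in the reference and RQ the reverse, and `conserved_dna_asymmetry` is a hypothetical helper, not part of ReferenceSeeker.

```python
def conserved_dna_asymmetry(rows):
    """Return (assembly_id, RQ - QR conserved DNA), largest absolute gap first."""
    gaps = [(assembly_id, round(rq - qr, 2)) for assembly_id, qr, rq in rows]
    return sorted(gaps, key=lambda g: abs(g[1]), reverse=True)


# (assembly ID, QR conserved DNA, RQ conserved DNA) from the example output above.
rows = [
    ("GCF_000013425.1", 100.00, 100.00),
    ("GCF_001018915.2", 96.35, 99.55),
    ("GCF_001019415.2", 94.47, 99.36),
    ("GCF_003354885.1", 96.63, 96.66),
]

asymmetry = conserved_dna_asymmetry(rows)
for assembly_id, gap in asymmetry:
    print(f"{assembly_id}\t{gap:+.2f}")
```

Under the stated assumption, a large positive gap would suggest that the query carries sequence that is absent from the reference, while most of the reference is still covered by the query.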

Installation

ReferenceSeeker can be installed via Conda and Git(Hub). In either case, a taxon database must be downloaded; we provide pre-built databases for download at Zenodo. For more information, have a look at Databases.

BioConda

The preferred way to install and run ReferenceSeeker is Conda using the Bioconda channel:

$ conda install -c bioconda referenceseeker
$ referenceseeker --help

GitHub

Alternatively, you can use the GitHub repository directly:

- install the required Python dependencies (if not already present)
- clone the latest version of the repository
- install the required 3rd party executables (Mash, MUMmer4)

$ pip3 install --user biopython xopen
$ git clone https://github.com/oschwengers/referenceseeker.git
$ # install Mash & MUMmer
$ ./referenceseeker/bin/referenceseeker --help

Test

To test your installation, we prepared a tiny mock database comprising 4 Salmonella spp. genomes and a query assembly (SRA: SRR498276) in the test directory:

$ git clone https://github.com/oschwengers/referenceseeker.git

  # GitHub installation
$ ./referenceseeker/bin/referenceseeker referenceseeker/test/db referenceseeker/test/data/Salmonella_enterica_CFSAN000189.fasta

  # BioConda installation
$ referenceseeker referenceseeker/test/db referenceseeker/test/data/Salmonella_enterica_CFSAN000189.fasta

Expected output:

#ID    Mash Distance    ANI    Con. DNA    Taxonomy ID    Assembly Status    Organism
GCF_000439415.1    0.00003    100.00    99.55    1173427    complete    Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000189
GCF_900205275.1    0.01522    98.61     83.13    90370      complete    Salmonella enterica subsp. enterica serovar Typhi

Usage

Usage:

usage: referenceseeker [--crg CRG] [--ani ANI] [--conserved-dna CONSERVED_DNA]
                       [--unfiltered] [--bidirectional] [--help] [--version]
                       [--verbose] [--threads THREADS]
                       <database> <genome>

Rapid determination of appropriate reference genomes.

positional arguments:
  <database>            ReferenceSeeker database path
  <genome>              target draft genome in fasta format

Filter options / thresholds:
  These options control the filtering and alignment workflow.

  --crg CRG, -r CRG     Max number of candidate reference genomes to pass kmer
                        prefilter (default = 100)
  --ani ANI, -a ANI     ANI threshold (default = 0.95)
  --conserved-dna CONSERVED_DNA, -c CONSERVED_DNA
                        Conserved DNA threshold (default = 0.69)
  --unfiltered, -u      Set kmer prefilter to extremely conservative values
                        and skip species level ANI cutoffs (ANI >= 0.95 and
                        conserved DNA >= 0.69
  --bidirectional, -b   Compute bidirectional ANI/conserved DNA values
                        (default = False)

Runtime & auxiliary options:
  --help, -h            Show this help message and exit
  --version, -V         show program's version number and exit
  --verbose, -v         Print verbose information
  --threads THREADS, -t THREADS
                        Number of used threads (default = number of available
                        CPU cores)

Examples

Installation:

$ conda install -c bioconda referenceseeker
$ wget https://zenodo.org/record/4415843/files/bacteria-refseq.tar.gz
$ tar -xzf bacteria-refseq.tar.gz
$ rm bacteria-refseq.tar.gz

Simple:

$ # referenceseeker <REFERENCE_SEEKER_DB> <GENOME>
$ referenceseeker bacteria-refseq/ genome.fasta

Expert: verbose output and increased output of candidate reference genomes using a defined number of threads:
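One way to assemble such an invocation from the flags documented under Usage is sketched below in Python. The `build_command` helper is hypothetical, and the values 500 (candidate reference genomes) and 8 (threads) are illustrative assumptions, not recommendations from the authors.

```python
import shlex


def build_command(database, genome, crg=None, threads=None,
                  verbose=False, bidirectional=False):
    """Build a referenceseeker argument list from the documented options."""
    cmd = ["referenceseeker"]
    if crg is not None:
        cmd += ["--crg", str(crg)]          # max candidate reference genomes
    if verbose:
        cmd.append("--verbose")
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if bidirectional:
        cmd.append("--bidirectional")
    cmd += [database, genome]               # positional arguments come last
    return cmd


cmd = build_command("bacteria-refseq/", "genome.fasta",
                    crg=500, threads=8, verbose=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once ReferenceSeeker and a database are installed.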
