# rCRUX: Generate CRUX metabarcoding reference libraries in R
<!-- badges: start --> <!-- badges: end -->

Authors: Luna Gal, Zachary Gold, Ramon Gallego, Shaun Nielsen, Katherine Silliman, Emily Curd<br/>
Inspiration: The late, great Jesse Gomer. Coding extraordinaire and dear friend.<br/>
License: GPL-3 <br/>
Support: Support for the development of this tool was provided by CalCOFI, NOAA, Landmark College, and VBRN. <br/>
Acknowledgments: This work benefited from the amazing input of many, including Lenore Pipes, Sarah Stinson, Gaurav Kandlikar, and Maura Palacios Mejia. <br/>

Published manuscript<br/>
Pre-print<br/>
Pre-made databases<br/>

eDNA metabarcoding is increasingly used to survey biological communities using common universal and novel genetic loci. There is a need for an easy-to-implement computational tool that can generate metabarcoding reference libraries for any locus that are both specific and comprehensive. We have reimagined CRUX (Curd et al. 2019) and developed the rCRUX package for the R system for statistical computing (R Core Team 2021) to fit this need by generating taxonomy and fasta files for any user-defined locus. The typical workflow involves using get_seeds_local() or get_seeds_remote() to simulate in silico PCR (e.g. Ye et al. 2012) and acquire a set of sequences analogous to PCR products containing the metabarcode primer sequences. The sequences, or "seeds", recovered from the in silico PCR step are used to search databases for complementary sequences that lack one or both primers. This search step, blast_seeds(), iteratively aligns seed sequences against a local NCBI database using a taxonomic-rank-based stratified random sampling approach, and results in a comprehensive database of primer-specific reference barcode sequences from NCBI. Using derep_and_clean_db(), the database is de-replicated by DNA sequence, with identical sequences collapsed into a representative read.
If there are multiple possible taxonomic paths for a read, the taxonomic path is collapsed to the lowest taxonomic agreement.
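The three-step workflow described above can be sketched in R as follows. All paths are placeholders, and the argument names shown for blast_seeds() and derep_and_clean_db() (including the seeds_output_path and summary_path file names) are illustrative assumptions; consult each function's help page for the exact interface.

```r
library(rCRUX)

# Placeholder paths -- substitute your own locations
blast_db_path           <- "/my/directory/ncbi_nt/nt"
accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql"
output_directory_path   <- "/my/directory/12S_V5F1_run"
metabarcode_name        <- "12S_V5F1"

# 1) In silico PCR: recover "seed" sequences containing both primers
get_seeds_local("TAGAACAGGCTCCTCTAG",   # forward primer
                "TTAGATACCCCACTATGC",   # reverse primer
                metabarcode_name,
                output_directory_path,
                accession_taxa_sql_path,
                blast_db_path)

# 2) Iteratively blast the seeds against the local database.
#    seeds_output_path should point at the filtered .csv written by
#    step 1; the file name here is hypothetical -- see ?blast_seeds.
seeds_output_path <- file.path(output_directory_path, "filtered_seeds.csv")
blast_seeds(seeds_output_path,
            blast_db_path,
            accession_taxa_sql_path,
            output_directory_path,
            metabarcode_name)

# 3) De-replicate and clean the resulting database.
#    summary_path is again a hypothetical name -- see ?derep_and_clean_db.
summary_path <- file.path(output_directory_path, "blast_seeds_summary.csv")
derep_and_clean_db(output_directory_path, summary_path, metabarcode_name)
```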
## Typical Workflow
<img src="/flowcharts/rCRUX_overview_flowchart.png" width = 10000 />

## Installation
Install from GitHub:
# install.packages("devtools")
devtools::install_github("CalCOFI/rCRUX", build_vignettes = TRUE)
library(rCRUX)
## Dependencies
NOTE: These dependencies only need to be downloaded once, or again as NCBI updates its databases. rCRUX can access and successfully build metabarcode references using databases stored on external drives. <br/>
### BLAST+
NCBI's BLAST+ suite must be locally installed and accessible in the user's path. NCBI provides installation instructions for Windows, Linux, and macOS. Versions 2.10.1 through 2.13.0 are verified compatible with rCRUX.
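A quick way to confirm that the BLAST+ binaries are visible from R (base R only; an empty string from Sys.which() means the executable was not found on the PATH):

```r
# Check that blastn is on the PATH that R sees
Sys.which("blastn")                    # "" means blastn was not found
system2("blastn", args = "-version")   # prints the version, e.g. "blastn: 2.13.0+"
```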
The following is an example shell script to download the BLAST+ executables (macOS binaries shown; pick the build matching your OS):
cd /path/to/Applications
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.10.1/ncbi-blast-2.10.1+-x64-macosx.tar.gz
tar -zxvf ncbi-blast-2.10.1+-x64-macosx.tar.gz
This link may help if you are using RStudio and having trouble adding blast+ to your path.
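If RStudio does not inherit your shell's PATH, the blast+ bin directory can also be appended for the current R session. The directory below is an assumption; point it at your own installation:

```r
# Append the blast+ bin directory to the PATH for this R session only
old_path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(old_path,
                        "/path/to/Applications/ncbi-blast-2.10.1+/bin",
                        sep = .Platform$path.sep))
Sys.which("blastn")   # should now resolve to the blastn binary
```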
### Blast-formatted database
rCRUX requires a local blast-formatted nucleotide database. These can be user generated, or a pre-formatted database can be downloaded from NCBI. NCBI provides a tool (a perl script) for downloading databases as part of the blast+ package. A brief help page can be found here.
The following shell script can be used to download the blast-formatted nucleotide database. There are also taxon specific databases (e.g. nt_euk, nt_prok, and nt_viruses).
mkdir NCBI_blast_nt
cd NCBI_blast_nt
wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.???.tar.gz*"
time for file in *.tar.gz; do tar -zxvf "$file"; done
cd ..
You can test your nt blast database by running the following command in a terminal:
blastdbcmd -db '/my/directory/ncbi_nt/nt' -dbtype nucl -entry MN937193.1 -range 499-633
If you do not get the following output, something went wrong with the build.
>MN937193.1:499-633 Jaydia carinatus mitochondrion, complete genome
TTAGATACCCCACTATGCCTAGTCTTAAACCTAGATAGAACCCTACCTATTCTATCCGCCCGGGTACTACGAGCACCAGC
TTAAAACCCAAAGGACTTGGCGGCGCTTCACACCCACCTAGAGGAGCCTGTTCTA
Possible errors include, but are not limited to:

- Partial downloads of database files. Extracting each TAR archive (e.g. nt.00.tar.gz) should result in 8 files with the following extensions: .nhd, .nhi, .nhr, .nin, .nnd, .nni, .nog, and .nsq. If a few archives fail during download, you can re-download and unpack only those that failed. You do not have to re-download all archives.
- You downloaded and built a blast database from NCBI fasta files but did not specify `-parse_seqids`.
The nt database is ~242 GB (as of 8/31/22) and can take several hours (overnight) to build. Loss of internet connection can lead to partially downloaded files and blastn errors (see above).
Note: Several blast formatted databases can be searched simultaneously. See documentation for details.
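BLAST+ itself accepts several databases as a single space-separated -db argument. Assuming rCRUX passes blast_db_path through to blastn unchanged (check the documentation for details), multiple databases could be specified like this (paths are placeholders):

```r
# Space-separated paths search several blast-formatted databases at once
blast_db_path <- "/my/directory/ncbi_nt_euk/nt_euk /my/directory/ncbi_nt_prok/nt_prok"
```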
### Taxonomizr
rCRUX uses the taxonomizr package for taxonomic assignment based on NCBI Taxonomy IDs (taxids). Many rCRUX functions require a path to a local taxonomizr-readable sqlite database. This database can be built using taxonomizr's prepareDatabase() function.
This database is ~72 GB (as of 8/31/22) and can take several hours (overnight) to build. Loss of internet connection can lead to partially downloaded files and taxonomizr run errors.
The following code can be used to build this database:
library(taxonomizr)
accession_taxa_sql_path <- "/my/accessionTaxa.sql"
prepareDatabase(accession_taxa_sql_path)
Note: For poor bandwidth connections, please see the taxonomizr readme for manual installation of the accessionTaxa.sql database. If built manually, make sure to delete any files other than the accessionTaxa.sql database (e.g. keeping nucl_gb.accession2taxid.gz leads to a warning message).
## Example pipeline
The following example shows a simple rCRUX pipeline from start to finish. Note that this example will require internet access and considerable database storage (~314 GB, see section above), run time (mainly for blastn), and system resources to execute.
Note: Blast databases and the taxonomic assignment database (accessionTaxa.sql) can be stored on an external hard drive. This increases run time, but it is a good option if computer storage capacity is limited.
There are two options to generate seeds for the database-generating blast step: get_seeds_local() or get_seeds_remote(). The local option is slower, but it is not subject to the memory limitations of the NCBI primer-blast API. The local option is recommended if the user is building a large database, wants to include any taxid in the search, wants to use multiple forward or reverse primers, and/or has many degenerate sites in their primer set. It also caches run data, so if a run is interrupted the user can pick it up from the last successful round of blast by resubmitting the original command.
### get_seeds_local
This example uses default parameters, with the exception of evalue to minimize run time.
forward_primer_seq <- "TAGAACAGGCTCCTCTAG"
reverse_primer_seq <- "TTAGATACCCCACTATGC"
output_directory_path <- "/my/directory/12S_V5F1_local_111122_e300" # path to desired output directory
metabarcode_name <- "12S_V5F1" # desired name of metabarcode locus
accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql" # path to taxonomizr sql database
blast_db_path <- "/my/directory/ncbi_nt/nt" # path to blast formatted database
get_seeds_local(forward_primer_seq,
reverse_primer_seq,
metabarcode_name,
output_directory_path,
accession_taxa_sql_path,
blast_db_path, evalue = 300)
Two output .csv files are automatically created at this path based on the arguments passed to get_seeds_local(). One includes all unfiltered output; the other contains the filtered output.
