AutoTax
Generate de novo taxonomy of full length 16S rRNA sequences directly from environmental samples
AutoTax is a workflow that automatically generates de novo taxonomy from full length 16S rRNA amplicon sequence variants (FL-ASVs). This allows generation of ecosystem-specific de novo taxonomic databases based on any environmental sample(s). It does so by combining several different software tools, listed below, into a single BASH script that otherwise only requires a single FASTA file as input. For a more detailed description of AutoTax, please refer to the paper Dueholm et al, 2020.
Table of Contents
- What the script does
- Installation and requirements
- Environment variables
- Usage
- Running AutoTax from a container (recommended)
- Unit tests
- Generating input full-length 16S sequences
- See also
- vsearch to replace usearch
What the script does
In brief, the script performs the following steps:
- Check user input, files and folders, and check for installed R packages, installing missing ones
Generate/identify FL-ASVs
- Orient the sequences based on the SILVA taxonomic database (usearch)
- Dereplicate the input sequences (both strands), and determine the coverage of each unique sequence (usearch)
- Denoise the dereplicated sequences using UNOISE3, with minsize = 2 by default (usearch)
- Remove all sequences that match exactly (100% identity) with other, but longer sequences (R)
- Sort the sequences based on coverage, and rename the sequences in order of occurrence, in the format FLASVx.length, e.g. FLASV123.1410 (R)
- If desired, update an existing FL-ASV database (FASTA file) by matching the generated FL-ASVs to the database, replacing identical FL-ASVs with longer sequences if any, and adding the new ones to the end of the FASTA file, renamed to continue numbering from the database (R)
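The containment filter and the renaming step above can be illustrated with a small awk sketch. This is not the code AutoTax runs (the pipeline uses findLongSeqs and R for these steps); the function names drop_contained and rename_flasvs are hypothetical, and the sketch assumes single-line FASTA records whose headers carry a usearch-style ;size=N coverage annotation.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; AutoTax itself uses findLongSeqs and R here.
# Assumes single-line FASTA records read from stdin.

# Hypothetical helper: drop every sequence that is an exact substring
# of another, longer sequence (naive all-vs-all comparison).
drop_contained() {
  awk '
    /^>/ { n++; header[n] = $0; next }
    { seq[n] = $0 }
    END {
      for (i = 1; i <= n; i++) {
        keep = 1
        for (j = 1; j <= n; j++) {
          if (i != j && length(seq[j]) > length(seq[i]) && index(seq[j], seq[i]) > 0) {
            keep = 0
            break
          }
        }
        if (keep) print header[i] "\n" seq[i]
      }
    }
  '
}

# Hypothetical helper: sort by the ";size=N" coverage annotation (highest
# first) and rename records to FLASV<rank>.<sequence length>.
rename_flasvs() {
  awk '
    /^>/ { header = $0; next }
    {
      size = 1
      if (match(header, /size=[0-9]+/)) size = substr(header, RSTART + 5, RLENGTH - 5)
      print size "\t" $0
    }
  ' |
    sort -t $'\t' -k1,1nr -s |
    awk -F '\t' '{ printf(">FLASV%d.%d\n%s\n", NR, length($2), $2) }'
}
```

Used as a filter chain, for example drop_contained < denoised.fa | rename_flasvs, where denoised.fa is a placeholder file name.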
Generate de novo taxonomy
- Perform a multiple sequence alignment of the FL-ASVs with both the SILVA and SILVA typestrains databases using SINA, then trim, strip gaps, format, and sort based on FL-ASV IDs (multithreading doesn't always preserve ordering) (SINA+awk+R)
- Assign taxonomy to that of the best hit in both the SILVA and SILVA typestrains databases (usearch)
- Cluster the FL-ASVs at different identity thresholds each corresponding to a taxonomic level and use the FL-ASV ID of the cluster centroids as a de novo placeholder name at each level (usearch, thresholds from Yarza et al, 2014)
- Reformat the output from the last 2 steps into 3 separate tables where each column contains the taxonomy at each taxonomic level (Kingdom->Species) of each FL-ASV (R)
- Merge the 3 tables so that the de novo taxonomy fills in wherever the identity to the best hit in SILVA and SILVA typestrains is below the taxonomic threshold for that level (R)
- Manually curate the taxonomy based on a replacement file if any (R)
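The merge logic can be sketched as a per-level decision: keep the assigned name when the identity to the best hit reaches the threshold for that level, otherwise fall back to a de novo placeholder. The sketch below is an illustration under assumptions, not the R code AutoTax uses. The helper names are hypothetical, the thresholds are the commonly cited values from Yarza et al, 2014, and the placeholder pattern prefix_level_number is inferred from the denovo_s_23 example.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; helper names are hypothetical.

# Identity thresholds in percent per taxonomic level, as commonly cited
# from Yarza et al, 2014 (assumption: AutoTax may apply them differently).
threshold_for_level() {
  case "$1" in
    Phylum)  echo 75.0 ;;
    Class)   echo 78.5 ;;
    Order)   echo 82.0 ;;
    Family)  echo 86.5 ;;
    Genus)   echo 94.5 ;;
    Species) echo 98.7 ;;
    *)       return 1 ;;
  esac
}

# Usage: pick_name LEVEL IDENTITY ASSIGNED_NAME CENTROID_NUMBER
# Prints the assigned name if IDENTITY reaches the threshold for LEVEL,
# otherwise a placeholder such as denovo_s_23.
pick_name() {
  local thr
  thr=$(threshold_for_level "$1") || return 1
  awk -v id="$2" -v thr="$thr" -v name="$3" -v lvl="$1" -v n="$4" \
      -v prefix="${denovo_prefix:-denovo}" 'BEGIN {
    if (id + 0 >= thr + 0) print name
    else printf("%s_%s_%s\n", prefix, tolower(substr(lvl, 1, 1)), n)
  }'
}
```

For instance, pick_name Genus 96.2 Nitrospira 23 keeps the assigned genus, while pick_name Species 97.0 Nitrospira_defluvii 23 falls back to a placeholder because 97.0 is below the species threshold.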
Output the taxonomy in the following formats:
- FL-ASVs in FASTA format with usearch SINTAX formatted taxonomy in the headers (R)
- FL-ASVs in FASTA format with DADA2 formatted taxonomy in the headers (R)
- QIIME formatted table (R)
- CSV files of the individual tables mentioned earlier as well as the combined, complete taxonomy for each FL-ASV (R)
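As an illustration of the first format, the sketch below turns one tab-separated row (ID followed by Kingdom to Species) into a FASTA header using the usearch SINTAX convention of a ;tax= annotation with single-letter rank prefixes. The function name sintax_header is hypothetical, and the exact prefixes AutoTax writes (for example d: versus k: for the top level) are an assumption here.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; AutoTax writes these headers from R.

# Hypothetical helper: read tab-separated rows of
#   ID, Kingdom, Phylum, Class, Order, Family, Genus, Species
# from stdin and print SINTAX-style FASTA headers. Empty levels are skipped.
sintax_header() {
  awk -F '\t' '{
    split("d p c o f g s", rank, " ")
    tax = ""
    for (i = 2; i <= NF && i <= 8; i++) {
      if ($i == "") continue
      tax = tax (tax == "" ? "" : ",") rank[i - 1] ":" $i
    }
    printf(">%s;tax=%s;\n", $1, tax)
  }'
}
```

The sequence itself would follow each header on the next line, as in any FASTA file.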
Installation and requirements
The easiest and recommended way to run AutoTax is through the official Docker container, either by using Docker (privileged, but most convenient) or Apptainer/Singularity (non-privileged), see Running AutoTax from a container (recommended). This ensures complete reproducibility, and our unit tests have only been designed for running AutoTax through the container.
Alternatively, install AutoTax natively by downloading the autotax.bash script and then making sure the required tools are installed and available in the PATH variable:
wget https://raw.githubusercontent.com/KasperSkytte/AutoTax/main/autotax.bash
or clone the git repository (recursively to include submodules):
git clone --recursive https://github.com/KasperSkytte/AutoTax.git
cd AutoTax
Required software tools
- usearch (11)
- SINA (1.6 or later)
- GNU parallel (20161222-1)
- findLongSeqs, credit goes to Nick Green. The initial R implementation was extremely inefficient
- R (3.5 or later) with the following packages installed (the script will attempt to install if missing):
  - Biostrings (from Bioconductor through BiocManager::install())
  - doParallel
  - stringr (and stringi)
  - data.table
  - tidyr
  - dplyr
- standard linux tools awk, grep, and cat (already included in most Linux distributions)
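For a native install, a quick way to confirm that the tools are reachable through PATH is a small check like the one below. This is a minimal sketch: the function check_tools is hypothetical, and the binary names passed to it (in particular usearch11 and sina) are assumptions that may differ on your system.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not part of autotax.bash.

# Hypothetical helper: report which of the given commands cannot be
# found in PATH. Returns non-zero if at least one is missing.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Binary names are assumptions; adjust them to match your installation.
check_tools usearch11 sina parallel Rscript awk grep cat ||
  echo "Install the missing tools before running AutoTax natively." >&2
```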
Required database files
AutoTax is tailored for the SILVA database, which is also required. SILVA and SILVA typestrains database files in both UDB and ARB format are needed. A zip file with all 4 files for SILVA releases 132+138 can be found on figshare here, but won't be updated in the future. Instead use the getsilvadb.sh script which will download all the required files for a chosen release version directly from https://www.arb-silva.de/, and then automagically reformat, extract typestrains, and generate UDB databases for usearch. This is perhaps also easiest through a container, see Running getsilvadb.sh through a container.
Now it's important to make sure the paths to these files are set correctly. This is done by setting a few environment variables.
Environment variables
Before running AutoTax, it's important to set a few options and the file paths to the respective database files. Inspect autotax.bash for the defaults. To override them, set the following environment variables in the current shell (e.g. using export):
- silva_db: Path to the SILVA .arb database file
- silva_udb: Path to the SILVA SSURef database file in .udb format
- typestrains_udb: Path to the typestrains database file in .udb format
- denovo_prefix: A character string which will be the prefix for de novo taxonomy, resulting in e.g. denovo_s_23 if set to denovo (default)
- denoise_minsize: The minimum abundance of each unique input sequence. Input sequences with lower abundance than this threshold will be discarded. Passed on directly to UNOISE3 during the denoise step. Set this to 1 to skip denoising, e.g. if the input sequences are already pre-processed or are the output from a previous autotax run. In those cases the pipeline would otherwise fail because the denoise step outputs 0 sequences.
- usearch_global_threads: Any usearch_global command will be split into smaller separate jobs using GNU parallel, as the multithreading implementation in usearch does not scale linearly. It's much faster to run many smaller jobs. This sets the max number of threads each parallel command will use. Increase this if you lack memory.
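A minimal sketch of setting these variables in the current shell before a run is shown below. All paths are hypothetical placeholders, and the value for usearch_global_threads is only an example; check autotax.bash for the actual defaults.

```shell
#!/usr/bin/env bash
# Hypothetical placeholder paths; point these at your own SILVA files.
export silva_db="/path/to/silva/SILVA_SSURef.arb"
export silva_udb="/path/to/silva/SILVA_SSURef.udb"
export typestrains_udb="/path/to/silva/SILVA_typestrains.udb"

# Prefix for de novo names, giving e.g. denovo_s_23.
export denovo_prefix="denovo"

# Minimum abundance passed to UNOISE3; set to 1 to skip denoising.
export denoise_minsize=2

# Example value only: max threads per parallel usearch_global job.
export usearch_global_threads=5
```

Because the variables are exported, a subsequent bash autotax.bash call started from the same shell will see them.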
Usage
Make sure the script is executable with chmod +x autotax.bash.
Type bash autotax.bash -h to show available options and version:
$ bash autotax.bash -h
Pipeline for extracting Full-length 16S rRNA Amplicon Sequence Variants (FL-ASVs) from full length 16S rRNA gene DNA sequences and generating de novo taxonomy
Version: 1.7.5
Options:
-h Display this help text and exit.
-i Input FASTA file with full length DNA sequences to process (required).
-c Cluster the resulting FL-ASVs at 99% (before generating de novo taxonomy),
do chimera filtering on the clusters, and then add them on top in the same way as when using -d.
-d FASTA file with previously processed FL-ASV sequences.
FL-ASVs generated from the input sequences will then be appended to this and de novo taxonomy is rerun.
-t Maximum number of threads to use. Default is all available cores except 2.
-b Run all BATS unit tests to assure everything is working as intended (requires git).
-v Print version and exit.
Using the example data in /test/example_data/ a usage example would be:
bash autotax.bash -i test/example_data/10k_fSSUs.fa -t 20
The main output files can then be found in the output/ folder and all intermediate files along the way in temp/.
Customization
The autotax.bash script essentially consists of individual functions that can be used independently. This is what makes it possible to run unit tests on a BASH script, but it also makes it possible to source the individual functions manually from another script to create custom workflows or to resume from a previous run. Simply adding . autotax.bash to the script won't run autotax, but will load the functions.
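Since the function names are best looked up in autotax.bash itself, one way to see what sourcing makes available is to compare the shell's function list before and after. The helper below is a hypothetical sketch and not part of AutoTax; it sources the given script in a subshell so the current shell is left untouched.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; list_new_functions is a hypothetical helper.

# Print the names of the functions that sourcing the given script adds.
# Runs in a subshell, so nothing is loaded into the calling shell.
list_new_functions() {
  (
    before=$(declare -F | awk '{ print $3 }')
    . "$1"
    declare -F | awk '{ print $3 }' | grep -vxF -e "$before"
  )
}

# Example (path is an assumption): list_new_functions ./autotax.bash
```

A custom workflow script would then source autotax.bash directly and call only the functions it needs.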
Running AutoTax from a container (recommended)
To run AutoTax through a docker container first install Docker Engine - Community as described there. A prebuilt image autotax based on Ubuntu Linux 20.04 with all the required software and d
