AutoTax

AutoTax is a workflow that automatically generates de novo taxonomy from full length 16S rRNA amplicon sequence variants (FL-ASVs). This allows generation of ecosystem-specific de novo taxonomic databases based on any environmental sample(s). It does so by combining several different software tools, listed below, into a single BASH script that otherwise only requires a single FASTA file as input. For a more detailed description of AutoTax, please refer to the paper Dueholm et al, 2020.

What the script does
Installation and requirements
- Required software tools
- Required database files
Environment variables
Usage
- Customization
Running AutoTax from a container (recommended)
- Running getsilvadb.sh through a container
- Important notes when running AutoTax through a container
Unit tests
Generating input full-length 16S sequences
See also
vsearch to replace usearch


What the script does

In brief, the script performs the following steps:

Check user input, files and folders, and check for installed R packages, installing missing ones

Generate/identify FL-ASVs

Orient the sequences based on the SILVA taxonomic database (usearch)
Dereplicate the input sequences (both strands), and determine the coverage of each unique sequence (usearch)
Denoise the dereplicated sequences using UNOISE3, with minsize = 2 by default (usearch)
Remove all sequences that match exactly (100% identity) with other, but longer sequences (R)
Sort the sequences based on coverage, and rename the sequences in order of occurrence, in the format FLASVx.length, e.g. FLASV123.1410 (R)
If desired, update an existing FL-ASV database (FASTA file) by matching the generated FL-ASVs to the database, replacing identical FL-ASVs with longer sequences if any, and adding the new ones to the end of the FASTA file, renamed to continue numbering from the database (R)
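
To illustrate the FLASVx.length naming convention from the renaming step, here is a minimal sketch in awk. It is not the R code AutoTax uses; the function name rename_flasvs is hypothetical, and the sketch assumes the input FASTA file has already been sorted by coverage.

```shell
#!/usr/bin/env bash
# Rename FASTA records to FLASV<index>.<length> in the order they appear.
# Assumption: the input is already sorted by coverage. Multi-line sequences
# are joined into a single line before their length is measured.
rename_flasvs() {
  awk '
    # Print the record collected so far, if there is one
    function emit() {
      if (n > 0) printf(">FLASV%d.%d\n%s\n", n, length(seq), seq)
    }
    /^>/ { emit(); n++; seq = ""; next }     # new header: flush previous record
    { gsub(/[ \t\r]/, ""); seq = seq $0 }    # sequence line: strip whitespace, append
    END { emit() }                           # flush the last record
  ' "$1"
}
```

Calling rename_flasvs sorted.fa prints the renamed records to standard output, so the result can be redirected to a new file.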

Generate de novo taxonomy

Perform a multiple sequence alignment of the FL-ASVs with both the SILVA and SILVA typestrains databases using SINA, then trim, strip gaps, format, and sort based on FL-ASV IDs (multithreading doesn't always preserve ordering) (SINA+awk+R)
Assign taxonomy to that of the best hit in both the SILVA and SILVA typestrains databases (usearch)
Cluster the FL-ASVs at different identity thresholds each corresponding to a taxonomic level and use the FL-ASV ID of the cluster centroids as a de novo placeholder name at each level (usearch, thresholds from Yarza et al, 2014)
Reformat the output from the last 2 steps into 3 separate tables where each column contains the taxonomy at each taxonomic level (Kingdom->Species) of each FL-ASV (R)
Merge the 3 tables so that the de novo taxonomy fills in wherever the taxonomy assigned from SILVA and SILVA typestrains is below the taxonomic thresholds (R)
Manually curate the taxonomy based on a replacement file if any (R)
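
The threshold logic behind the clustering and merging steps can be sketched roughly as follows. This is a simplified illustration, not the R code AutoTax runs: the function names yarza_threshold and pick_name are hypothetical, the percent identity values are the ones commonly cited from Yarza et al, 2014 (check autotax.bash for the values actually used), and the real workflow merges three full tables rather than single names.

```shell
#!/usr/bin/env bash
# Percent identity threshold per taxonomic rank (values as commonly cited
# from Yarza et al, 2014; assumed here, verify against autotax.bash).
yarza_threshold() {
  case "$1" in
    species) echo 98.7 ;;
    genus)   echo 94.5 ;;
    family)  echo 86.5 ;;
    order)   echo 82.0 ;;
    class)   echo 78.5 ;;
    phylum)  echo 75.0 ;;
    *) echo "unknown rank: $1" >&2; return 1 ;;
  esac
}

# Keep the reference name if the identity to the best hit reaches the
# threshold for that rank, otherwise fall back to the de novo placeholder.
# Arguments: rank, percent identity, reference name, de novo name
pick_name() {
  local rank="$1" identity="$2" ref_name="$3" denovo_name="$4"
  local threshold
  threshold="$(yarza_threshold "$rank")" || return 1
  awk -v id="$identity" -v t="$threshold" -v r="$ref_name" -v d="$denovo_name" \
    'BEGIN { print ((id + 0 >= t + 0) ? r : d) }'
}
```

For example, pick_name genus 96.0 Nitrospira denovo_g_12 keeps the reference genus name, while the same 96.0% identity at species level would return the de novo placeholder instead.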

Output the taxonomy in the following formats:

FL-ASVs in FASTA format with usearch SINTAX formatted taxonomy in the headers (R)
FL-ASVs in FASTA format with DADA2 formatted taxonomy in the headers (R)
QIIME formatted table (R)
CSV files of the individual tables mentioned earlier as well as the combined, complete taxonomy for each FL-ASV (R)

Installation and requirements

The easiest and recommended way to run AutoTax is through the official Docker container, either by using Docker (privileged, but most convenient) or Apptainer/Singularity (non-privileged), see Running AutoTax from a container (recommended). This ensures complete reproducibility, and our unit tests have only been designed for running AutoTax through the container.

Alternatively, install AutoTax natively by downloading the autotax.bash script and then making sure the required tools are installed and available in the PATH variable:

wget https://raw.githubusercontent.com/KasperSkytte/AutoTax/main/autotax.bash

or clone the git repository (recursively to include submodules):

git clone --recursive https://github.com/KasperSkytte/AutoTax.git
cd AutoTax

Required software tools

usearch (11)
SINA (1.6 or later)
GNU parallel (20161222-1)
findLongSeqs, credit goes to Nick Green. The initial R implementation was extremely inefficient
R (3.5 or later) with the following packages installed (the script will attempt to install if missing):
- Biostrings (from Bioconductor through BiocManager::install())
- doParallel
- stringr (and stringi)
- data.table
- tidyr
- dplyr
Standard Linux tools awk, grep, and cat (already included in most Linux distributions)

Required database files

AutoTax is tailored for the SILVA database, which is also required. SILVA and SILVA typestrains database files in both UDB and ARB format are needed. A zip file with all 4 files for SILVA releases 132 and 138 is available on figshare, but it won't be updated in the future. Instead, use the getsilvadb.sh script, which downloads all the required files for a chosen release version directly from https://www.arb-silva.de/, and then automagically reformats them, extracts typestrains, and generates UDB databases for usearch. This is perhaps also easiest through a container, see Running getsilvadb.sh through a container.

Now it's important to make sure the paths to these files are set correctly. This is done by setting a few environment variables.

Environment variables

Before running AutoTax, it's important to set a few options and the file paths to the respective database files. This is done by setting the following environment variables in the current shell (e.g. using export). Inspect the autotax.bash script for the defaults:

silva_db: Path to the SILVA .arb database file
silva_udb: Path to the SILVA SSURef database file in .udb format
typestrains_udb: Path to the typestrains database file in .udb format
denovo_prefix: A character string which will be the prefix for de novo taxonomy, resulting in e.g. denovo_s_23 if set to denovo (default)
denoise_minsize: The minimum abundance of each unique input sequence. Input sequences with lower abundance than this threshold will be discarded. Passed on directly to UNOISE3 during the denoise step. Set this to 1 to skip denoising, e.g. if the input sequences are already pre-processed or are the output from a previous AutoTax run. In those cases every sequence is typically unique, so a higher value would leave 0 sequences after this step and the pipeline would fail.
usearch_global_threads: Any usearch_global command will be split into smaller separate jobs using GNU parallel as the multithreading implementation in usearch does not scale linearly. It's much faster to run many smaller jobs. This sets the max number of threads each parallel command will use. Increase this if you lack memory.
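
As a minimal sketch, the variables above could be set like this before running the script. The file paths are hypothetical placeholders, and the value for usearch_global_threads is only illustrative and not necessarily the default; denovo and 2 are the defaults mentioned in this README.

```shell
#!/usr/bin/env bash
# Hypothetical paths: point these at wherever your SILVA files actually live
export silva_db="/path/to/SILVA_SSURef.arb"
export silva_udb="/path/to/SILVA_SSURef.udb"
export typestrains_udb="/path/to/SILVA_typestrains.udb"

# Prefix for de novo names, giving e.g. denovo_s_23 (default per this README)
export denovo_prefix="denovo"

# UNOISE3 minimum abundance (default per this README); 1 skips denoising
export denoise_minsize=2

# Threads per usearch_global job (illustrative value, check autotax.bash)
export usearch_global_threads=5
```

Because the variables are exported, they are also visible to autotax.bash when it is started from the same shell.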

Usage

Make sure the script is executable with chmod +x autotax.bash.

Type bash autotax.bash -h to show available options and version:

$ bash autotax.bash -h
Pipeline for extracting Full-length 16S rRNA Amplicon Sequence Variants (FL-ASVs) from full length 16S rRNA gene DNA sequences and generating de novo taxonomy
Version: 1.7.5
Options:
  -h    Display this help text and exit.
  -i    Input FASTA file with full length DNA sequences to process (required).
  -c    Cluster the resulting FL-ASVs at 99% (before generating de novo taxonomy),
          do chimera filtering on the clusters, and then add them on top in the same way as when using -d.
  -d    FASTA file with previously processed FL-ASV sequences.
          FL-ASVs generated from the input sequences will then be appended to this and de novo taxonomy is rerun.
  -t    Maximum number of threads to use. Default is all available cores except 2.
  -b    Run all BATS unit tests to assure everything is working as intended (requires git).
  -v    Print version and exit.

Using the example data in test/example_data/, a usage example would be: bash autotax.bash -i test/example_data/10k_fSSUs.fa -t 20.

The main output files can then be found in the output/ folder and all intermediate files along the way in temp/.

Customization

The autotax.bash script essentially consists of individual functions that can be used independently. This is what makes it possible to run unit tests on a BASH script, but it also makes it possible to source the individual functions manually from another script to create custom workflows or to resume from a previous run. Simply adding . autotax.bash to the script won't run AutoTax, but will load the functions.
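
When building a custom workflow it helps to know which functions become available after sourcing. The sketch below is one possible way to list them; the helper name list_sourced_functions is hypothetical and not part of AutoTax, and it relies on the bash builtin declare -F. The sourcing happens in a subshell, so the calling shell is left untouched.

```shell
#!/usr/bin/env bash
# Print the names of the functions a script defines when it is sourced.
# Works by comparing the defined functions before and after sourcing.
list_sourced_functions() {
  local script="$1"
  (
    before="$(declare -F | awk '{print $3}' | sort)"
    . "$script" >/dev/null || exit 1
    after="$(declare -F | awk '{print $3}' | sort)"
    # comm -13 keeps only the lines unique to the second list
    comm -13 <(printf '%s\n' "$before") <(printf '%s\n' "$after")
  )
}

# Only attempt this if autotax.bash is present in the current folder
if [ -f ./autotax.bash ]; then
  list_sourced_functions ./autotax.bash
fi
```

Any function name printed this way can then be called from a custom script after a plain . autotax.bash line.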

Running AutoTax from a container (recommended)

To run AutoTax through a docker container first install Docker Engine - Community as described there. A prebuilt image autotax based on Ubuntu Linux 20.04 with all the required software and d
