
eDNAFlow


About eDNAFlow

eDNAFlow is a fully automated pipeline that employs a number of state-of-the-art applications to process eDNA data from raw sequences (single-end or paired-end) through to the generation of curated and uncurated zero-radius operational taxonomic units (ZOTUs) and their abundance tables. As part of eDNAFlow, we also provide an in-house Python script to assign taxonomy to ZOTUs based on user-specified thresholds for assigning the Lowest Common Ancestor (LCA). The pipeline is based on Nextflow and Singularity, which enable a scalable, portable and reproducible workflow using software containers on a local computer, in the cloud, or on high-performance computing (HPC) clusters.

For more information on eDNAFlow and the other software used as part of the workflow, please read "eDNAFlow, an automated, reproducible and scalable workflow for analysis of environmental DNA (eDNA) sequences exploiting Nextflow and Singularity" in Molecular Ecology Resources, DOI: https://doi.org/10.1111/1755-0998.13356. If you use eDNAFlow, we would appreciate it if you cited the eDNAFlow paper and the papers describing the underlying software.

Setup and test the pipeline

Note: The instructions below have only been tested on Ubuntu systems. Mac instructions will be available soon.

To run the pipeline, first Nextflow and Singularity have to be installed or made available for loading as modules (e.g. in the case of running it on an HPC cluster) on your system. This pipeline was built and tested with versions 19.10 and 3.5.2 of Nextflow and Singularity, respectively. We strongly suggest that you first try the pipeline on your local machine using the test dataset provided.

We provide scripts that will install Nextflow and Singularity on your local machine, if they are not already present, and then run the pipeline on the test dataset. The scripts have been successfully tested on Ubuntu 16, 18 and 20.

Note: Do not use the install scripts on an HPC system, or anywhere you do not have sudo permission.

Alternatively, for manual installation of Nextflow, follow the instructions at nextflow installation. To install Singularity version 3.5.2 manually, follow the instructions at singularity installation. If working on HPC, you may need to contact your HPC helpdesk.
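Before trying the test data, it can help to confirm that both tools are actually visible on your PATH. A minimal check (this helper is not part of eDNAFlow; it only reports what is installed):

```shell
#!/usr/bin/env bash
# Report whether a required tool is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found; install it or load the corresponding module on your HPC"
  fi
}

check_tool nextflow
check_tool singularity
```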

Follow the steps below:

1- Clone the Git repository so that all the scripts and test data are downloaded and in one folder. To clone the repository to your directory, run this command:

git clone https://github.com/mahsa-mousavi/eDNAFlow.git

2- Next, in your terminal go to the "install" directory, which is located inside the "eDNAFlow" directory (e.g. cd eDNAFlow/install)

3- Once inside the install directory, run bash install_and_se_testRun.sh to try the pipeline on single-end test data, or run bash install_and_pe_testRun.sh to test it on paired-end test data. To test the lca script for taxonomy assignment, run bash install_and_lca_testRun.sh. As each step completes you will see a ✔ next to the relevant step, or ✘ if it fails.

4- If all goes well, you can now find all the results inside the folder "testData2_Play"
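The four steps above can be collected into one convenience script (not part of eDNAFlow itself). It defaults to a dry run that only prints each command; set DRY_RUN=0 to actually execute them:

```shell
#!/usr/bin/env bash
# Wrapper for the setup steps above. DRY_RUN=1 (the default) prints
# each command instead of running it, for a first look.
set -u
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run git clone https://github.com/mahsa-mousavi/eDNAFlow.git
run cd eDNAFlow/install
run bash install_and_se_testRun.sh  # or install_and_pe_testRun.sh / install_and_lca_testRun.sh
```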

What to expect in the result folders

In folders 00 to 09 you will find soft link(s) to the final result files of each step, as explained below. To inspect intermediate files or debug a particular step, check the relevant folder inside the work directory (for debugging, look at the files starting with .command*, check the logs, etc. as necessary).

00_fastQC_YourSequenceFileName: Quality checking results of raw file (FastQC package)

01_a_quality_Filtering_YourSequenceFileName: The filtered fastq file (adapterRemoval package)

01_b_fastQC_YourSequenceFileName: Quality checking results of filtered file (FastQC package)

02_assigned_dmux_YourSequenceFileName_yourBarcodeFileNames: Demultiplexed file for each barcode file (OBITools package)

03_Length_filtered_YourSequenceFileName: Length filtered demultiplexed file (OBITools package)

04_splitSamples_YourSequenceFileName: Split files per sample (OBITools package)

05_relabel_Cat_YourSequenceFileName: Count of filtered demultiplexed reads in each sample (i.e. CountOfSeq.txt), each demultiplexed sample file, concatenated fastq and fasta files

06_Uniques_ZOTUs_YourSequenceFileName: Unique file, Final ZOTU fasta file and Final ZOTU table (this table is uncurated) (USEARCH package)

07_blast_YourSequenceFileName: Blast result, match file and table for generating curated result (BLAST package)

08_lulu_YourSequenceFileName: LULU results including map file and curated ZOTU table (LULU package)

09_taxonomyAssigned_lca_result_qCov#_id#_diff#: Intermediate and final taxonomy assignment result files

work: Holds all the results, intermediate files, ...

.nextflow: Nextflow generated folder holding history info

.nextflow.log: Nextflow generated log file(s) which can be used for debugging and checking what each number in work directory maps to
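When a step fails, the recorded exit code of every task sits in its subdirectory of the work directory alongside the .command* files. A small helper that lists failed task directories (it assumes Nextflow's standard per-task .exitcode file, and is demonstrated here on a mocked-up work directory):

```shell
#!/usr/bin/env bash
# List Nextflow task directories whose recorded exit code is non-zero.
list_failed_tasks() {
  local workdir="$1"
  find "$workdir" -name .exitcode | while read -r f; do
    if [ "$(cat "$f")" != "0" ]; then
      echo "failed: $(dirname "$f")"
    fi
  done
}

# Demo on a fake work directory; on a real run use: list_failed_tasks work
demo=$(mktemp -d)
mkdir -p "$demo/ab/123" "$demo/cd/456"
echo 0 > "$demo/ab/123/.exitcode"
echo 1 > "$demo/cd/456/.exitcode"
list_failed_tasks "$demo"
```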

Running eDNAFlow on your data

Make sure the eDNAFlow scripts (including eDNAFlow.nf, nextflow.config and lulu.R) and the conf and LCA_taxonomyAssignment_scripts folders are in the same directory as your unzipped sequencing files and Multiplex identifier (MID) tag files (here referred to as "barcode" files).

Download database

One of the mandatory parameters to run eDNAFlow is to provide a path to a local GenBank nucleotide (nt) and/or your custom database. To download the NCBI nucleotide database locally, follow the steps below.

  1. Download the official BLAST+ container with Singularity using the command below (tested on Ubuntu 18.04):

singularity pull --dir directoryName docker://ncbi/blast:2.10.0

directoryName is the path to the directory where you want to keep the container image.

  2. Make a folder where you want to keep the database, and from there run the following command:

singularity run directoryName/blast_2.10.0.sif update_blastdb.pl --decompress nt

* Please be aware that step 2 will take some time and requires a large amount of free disk space due to the size of the GenBank nucleotide database. For us it took under 2 hours on the NCBI default 1-core setting (~10 MB per second), and it ran considerably faster, at almost 100 MB per second, using an HPC data transfer node (hpc-data.pawsey.org.au) or copyq (with 16 cores) on Zeus.
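Given the database size, a quick free-space check before step 2 can save an interrupted download. A minimal sketch; the 500 GB threshold is an assumption, so adjust it to the current size of the nt database:

```shell
#!/usr/bin/env bash
# Check that a directory's filesystem has at least the requested
# number of gigabytes available before starting a large download.
has_free_gb() {
  local dir="$1" need_gb="$2"
  local avail_kb
  # -P keeps df output on one line per filesystem; -k reports KB.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

if has_free_gb . 500; then
  echo "enough space for the nt database"
else
  echo "less than 500 GB free; choose another disk"
fi
```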

Basic command usage

Examples of basic commands to run the pipeline on your local machine on single-end/paired-end data with multiple barcode files, using a BLAST and/or custom database:

For single-end run: nextflow run eDNAFlow.nf --reads 'file.fastq' --barcode 'bc_*.txt' --blast_db 'path2/LocalGenbankDatabase/nt' [OPTIONS]

For paired-end run: nextflow run eDNAFlow.nf --barcode 'pe_bc*' --blast_db 'Path2TestBlastDataset/file.fasta' --custom_db 'path2/customDatabase/myDb' [OPTIONS]

For running LCA taxonomy assignment script: nextflow run eDNAFlow.nf --taxonomyAssignment --zotuTable "path2/curatedOruncurated_ZotuTable_file" --blastFile "path2/blastResult_file" --lca_output "my_lca_result" [OPTIONS]
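The invocation patterns above differ only in their flags. One way to keep a run reproducible is to assemble the command from shell variables and review it before executing; the file names below are placeholders for your own data, not eDNAFlow defaults:

```shell
#!/usr/bin/env bash
# Assemble a single-end eDNAFlow command from variables and echo it
# for review. All paths here are placeholders.
READS='file.fastq'
BARCODE='bc_*.txt'
BLAST_DB='path2/LocalGenbankDatabase/nt'

CMD="nextflow run eDNAFlow.nf --reads '$READS' --barcode '$BARCODE' --blast_db '$BLAST_DB'"
echo "$CMD"
# When the paths look right, run it with: eval "$CMD"
```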

Description of run options

eDNAFlow allows execution of all or part of the pipeline, as long as the correct file formats are provided. For example, the user may choose to run eDNAFlow on a raw file that hasn't been demultiplexed, or opt to provide an already demultiplexed file. Similarly, a user may have performed the clustering with a different algorithm (e.g. DADA2) and be interested only in using the lca script.

The following parameters can be adjusted on the command line to achieve different goals.

To see a list of available options run: nextflow run eDNAFlow.nf --help

Mandatory parameters if your sequences are NOT demultiplexed

--reads 'read.fastq': provide the name of your raw fastq file
