Barque v1.8.5

Environmental DNA metabarcoding analysis

Developed by Eric Normandeau in Louis Bernatchez's laboratory.

Licence information at the end of this file.

Description

Barque is a fast eDNA metabarcoding analysis pipeline that first denoises and then annotates ASVs or OTUs, using high-quality barcoding databases.

Barque can produce denoised OTUs and annotate them using a custom database. These annotated OTUs can then be used as a database themselves to find read counts per OTU per sample, effectively annotating the reads with the OTUs that were previously found. In this process, some of the OTUs are annotated to the species level, some to the genus or higher levels.

Citation

Barque is described as an accurate and efficient eDNA analysis pipeline in:

Mathon L, Guérin P-E, Normandeau E, Valentini A, Noel C, Lionnet C, Linard B, Thuiller W, Bernatchez L, Mouillot D, Dejean T, Manel S. 2021. Benchmarking bioinformatic tools for fast and accurate eDNA metabarcoding species identification. Molecular Ecology Resources.

https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13430

It is also presented in:

Hakimzadeh A et al. 2023. A pile of pipelines: An overview of the bioinformatics software for metabarcoding data analyses. Molecular Ecology Resources.

https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13847

Use cases

Monitoring invasive species
Confirming the presence of specific species
Characterizing meta-communities in varied environments
Improving species distribution knowledge of cryptic taxa
Following loss of species over medium to long-term monitoring

Since Barque depends on high-quality barcoding databases, it is especially useful for amplicons that already have large databases, like COI amplicons from the Barcode of Life Database (BOLD) or 12S amplicons from the mitofish database. It can also use any other database once it has been converted to the Barque format, for example the Silva database for the 18S gene or a custom database. If species annotations are not possible, Barque can be used in OTU mode.

Installation

To use Barque, you will need a local copy of its repository. Releases are available from the GitHub repository listed below. It is recommended to always use the latest version, even the development version. You can either download an archive of the latest release from that repository or get the latest commit (recommended) with the following git command:

git clone https://github.com/enormandeau/barque

Dependencies

To run Barque, you will also need to have the following programs installed on your computer.

Barque will only work on GNU Linux or OSX
bash 4+
python 3.5+ (you can use miniconda3 to install python)
python distutils package
R 3+ (ubuntu/mint: sudo apt-get install r-base-core)
java (ubuntu/mint: sudo apt-get install default-jre)
gnu parallel
flash (read merger) v1.2.11+
vsearch v2.14.2+
- /!\ v2.14.2+ required /!\
- Barque will not work with older versions of vsearch
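
Since the minimum versions above matter, here is a minimal Python sketch of how one might check them before a run. It is not part of Barque: the helper names (`parse_version`, `meets_minimum`, `missing_programs`) are hypothetical, and the version strings passed to it are assumed to contain a dotted number such as `2.14.2`. Versions are compared as tuples of integers because a plain string comparison would rank `2.9.1` above `2.14.2`.

```python
import re
import shutil


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(found, minimum):
    """True if the version in `found` is at least the version in `minimum`."""
    return parse_version(found) >= parse_version(minimum)


def missing_programs(names):
    """Return the program names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]
```

For example, `meets_minimum("2.9.1", "2.14.2")` is false even though the string `"2.9.1"` sorts after `"2.14.2"`. The executable names to pass to `missing_programs` (for instance `vsearch` or `flash`) depend on how the programs were installed on your system.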

Preparation

Install dependencies
Download a copy of the Barque repository (see Installation above)
Edit 02_info/primers.csv to provide information about the primer pair to use
Get or prepare the database (see the Formatting database section below), place the .fasta.gz file in the 03_databases folder, and give it a name that matches the information in the 02_info/primers.csv file.
Modify the parameters in 02_info/barque_config.sh for your run
Launch Barque, for example with ./barque 02_info/barque_config.sh

Overview of Barque steps

During the analyses, the following steps are performed:

Filter and trim raw reads (trimmomatic)
Merge paired-end reads (flash)
Split merged reads by amplicon (Python script)
Look for chimeras and denoise reads (vsearch and unoise3 algorithm)
Merge unique reads (Python script)
Find species or OTUs associated with unique, denoised reads (vsearch)
Summarize results (Python script)
- Tables of phylum, genus, and species counts per sample, including multiple hits
- Number of retained reads per sample at each analysis step with figure
- Most frequent non-annotated sequences to blast on NCBI nt/nr
- Species counts for these non-annotated sequences
- Sequence groups for cases of multiple hits

Running the pipeline

For each new project, get a new copy of Barque from the source listed in the Installation section. In this case, you do not need to modify the primer and config files.

Running on the test dataset

If you want to test Barque, jump straight to the Test dataset section at the end of this file. Later, be sure to read through the README to understand the program and its outputs.

Preparing samples

Copy your demultiplexed paired-end sample files into the 04_data folder. You need one pair of files per sample. The reads in these files must still contain the sequences of the primer pair that you used during the PCR. Depending on the format in which you received your sequences from the sequencing facility, you may have to demultiplex them before you can use Barque.

IMPORTANT: The file names must follow one of these two formats:

# Format 1
SampleID_*_R1.fastq.gz
SampleID_*_R2.fastq.gz

# Format 2
SampleID_*_R1_001.fastq.gz
SampleID_*_R2_001.fastq.gz

Notes: Each sample name, or SampleID, must not contain any underscore (_) and must be followed by an underscore (_). The star (*) can be any string of text that does not contain space characters. You can use dashes (-) to separate parts of your sample names, e.g.:

# Format 1
PopA-sample123_ANYTHING_R1.fastq.gz

# Format 2
PopA-sample123_ANYTHING_R1_001.fastq.gz
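
The naming rules above can be checked before launching a run. The following Python sketch is one reading of those rules, not code taken from Barque: the regular expression and the helper names (`parse_sample_file`, `unpaired_samples`) are hypothetical, and the sketch assumes that the part matched by the star is not empty.

```python
import re

# SampleID (no underscore, no space), then "_", any text without spaces,
# then "_R1" or "_R2", an optional "_001", and the ".fastq.gz" extension.
SAMPLE_FILE = re.compile(
    r"^(?P<sample>[^_\s]+)_(?P<middle>\S+)_R(?P<read>[12])(?:_001)?\.fastq\.gz$"
)


def parse_sample_file(name):
    """Return (SampleID, read number) or None if the name does not fit."""
    match = SAMPLE_FILE.match(name)
    if match is None:
        return None
    return match.group("sample"), int(match.group("read"))


def unpaired_samples(names):
    """Return the SampleIDs that do not have both an R1 and an R2 file."""
    reads = {}
    for name in names:
        parsed = parse_sample_file(name)
        if parsed is not None:
            sample, read = parsed
            reads.setdefault(sample, set()).add(read)
    return sorted(s for s, found in reads.items() if found != {1, 2})
```

Calling `parse_sample_file` on each file name in 04_data shows which names do not fit either format (they return `None`), and `unpaired_samples` lists the samples that are missing one file of the pair.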

Formatting database

You need to put a database in gzip-compressed Fasta format, or .fasta.gz, in the 03_databases folder.

An augmented version of the mitofish 12S database, as well as 16S and cytb, are already available in Barque.

The pre-formatted BOLD databases ready for Barque can be downloaded from the address below. Note that you will need to rename the downloaded file to bold.fasta.gz.

https://www.ibis.ulaval.ca/services/bioinformatique/barque_databases/

If you want to use a newer version of the BOLD database, you will need to download all the animal BINs from this page, put the downloaded Fasta files in 03_databases/bold_bins (you will need to create that folder), and run the following commands to format the BOLD database:

# Format each BIN individually (~10 minutes)
# Note: the `species_to_remove.txt` file is optional
ls -1 03_databases/bold_bins/*.fas.gz |
    parallel ./01_scripts/util/format_databases/format_bold_database.py \
    {} {.}_prepared.fasta.gz

# Concatenate the resulting formatted bins into one file
gunzip -c 03_databases/bold_bins/*_prepared.fasta.gz | gzip - > 03_databases/bold.fasta.gz

For other databases, get the database and format it:
- Name lines must contain 3 information fields separated by an underscore (_)
- Ex: >Phylum_Genus_species
- Ex: >Family_Genus_species
- Ex: >Mammal_rattus_norvegicus
- gzip-compressed Fasta format (DATABASE_NAME.fasta.gz)
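
To catch formatting problems in a custom database early, a check along these lines can help. This is a sketch, not a Barque script: the function names are hypothetical, and it assumes that a name line must contain exactly three non-empty fields separated by underscores, which is one reading of the rule above.

```python
import gzip


def header_is_valid(line):
    """True if a Fasta name line has exactly three non-empty fields split by "_"."""
    if not line.startswith(">"):
        return False
    fields = line[1:].strip().split("_")
    return len(fields) == 3 and all(fields)


def invalid_headers(path):
    """Return the name lines of a gzip-compressed Fasta file that fail the check."""
    bad = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">") and not header_is_valid(line):
                bad.append(line.strip())
    return bad
```

Running `invalid_headers("03_databases/DATABASE_NAME.fasta.gz")` returns an empty list when every name line follows the three-field convention.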

Configuration file

Make a copy of 02_info/barque_config.sh and modify its parameters as needed.

Launching Barque

Launch the barque executable with the name of your configuration file as an argument, like this:

./barque 02_info/<YOUR_CONFIG_FILE>

Reducing false positives

Two of the parameters in the config file can help reduce the presence of false positive annotations in the results: MIN_HITS_EXPERIMENT and MIN_HITS_SAMPLE. The defaults for both of these are very permissive and should be modified if false positives are problematic in the results. Additionally, the following script is provided to filter out species annotations that fall below a minimum proportion of reads in each sample: filter_sites_by_proportion.py. This filter is especially useful if the different samples have very unequal numbers of reads. Having a high-quality database will also help reduce false annotations. Finally, manual curation of the results is recommended with any eDNA analysis, regardless of the software used.
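
To illustrate why a proportion-based filter helps when samples have unequal read numbers, here is a minimal Python sketch of the idea. It is not the code of filter_sites_by_proportion.py: the function name, the nested-dictionary input, and the choice to set filtered counts to zero are all assumptions made for the example.

```python
def filter_by_proportion(counts, min_proportion):
    """Zero out counts below min_proportion of their sample's total reads.

    counts: {species: {sample: number of reads}} (hypothetical layout)
    """
    # Total number of reads per sample, summed over all species
    totals = {}
    for per_sample in counts.values():
        for sample, reads in per_sample.items():
            totals[sample] = totals.get(sample, 0) + reads

    filtered = {}
    for species, per_sample in counts.items():
        filtered[species] = {
            sample: reads
            if totals[sample] > 0 and reads / totals[sample] >= min_proportion
            else 0
            for sample, reads in per_sample.items()
        }
    return filtered
```

With a threshold of 0.02, a species with 10 reads out of 1000 in one sample (1%) is removed from that sample, while a species with only 5 reads out of 100 in another sample (5%) is kept. A fixed minimum number of reads could not make that distinction.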

Results

Once the pipeline has finished running, all result files are found in the 12_results folder.

After a run, it is recommended to make a copy of this folder and name it with the current date, e.g.:

cp -r 12_results 12_results_PROJECT_NAME_2024-02-29_SOME_ADDITIONAL_INFO
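
If you archive results often, the same naming convention can be scripted. The sketch below is only an illustration: `dated_results_name` and `archive_results` are hypothetical helpers, not Barque scripts, and they assume the command is run from the root of the Barque folder.

```python
import datetime
import shutil


def dated_results_name(project, extra="", day=None):
    """Build a folder name like 12_results_PROJECT_2024-02-29_EXTRA."""
    day = day or datetime.date.today()
    parts = ["12_results", project, day.isoformat()]
    if extra:
        parts.append(extra)
    return "_".join(parts)


def archive_results(project, extra="", source="12_results"):
    """Copy the results folder to a dated folder and return its name."""
    destination = dated_results_name(project, extra)
    # copytree raises an error if the destination already exists,
    # so an earlier archive is never overwritten silently.
    shutil.copytree(source, destination)
    return destination
```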

Taxa count tables, named after the primer names

PRIMER_genus_table.csv
PRIMER_phylum_table.csv
PRIMER_species_table.csv

Sequence dropout report and figure

sequence_dropout.csv: Lists how many sequences were present in each sample at every analysis step. Depending on library and sequencing quality, as well as the biological diversity found at the sample site, more or fewer sequences are lost at each of the analysis steps. The figure sequence_dropout_figure.png shows how many sequences are retained for each sample at each step of the pipeline.

Most frequent non-annotated sequences

most_frequent_non_annotated_sequences.fasta: Sequences that are frequent in the samples but were not annotated by the pipeline. This Fasta file should be used to query the NCBI nt/nr database using the online portal found here to see what species may have been m