Tourmaline

Amplicon sequence processing workflow using QIIME 2 and Snakemake

Tourmaline 2

Tourmaline 2 is an amplicon sequence processing workflow for Illumina sequence data that uses QIIME 2 and the software packages it wraps. Tourmaline 2 manages commands, inputs, and outputs using the Snakemake workflow management system.

Major changes in v2 vs. v1

To use the Legacy v1 version of Tourmaline, check out the V1 branch of this repository!

Run via tourmaline.sh script

Instead of interacting with Snakemake rules directly, the main way to run Tourmaline 2 is through the tourmaline.sh script. This script lets you run one or more workflow steps at a time, choose the config file for each step, and set the maximum number of cores. You must run it from within the tourmaline directory, but output files can be written to any location.

Usage:

conda activate snakemake-tour2
./tourmaline.sh --step [qaqc,repseqs,taxonomy] --configfile [config1,config2,config3] --cores N

You can still run individual Snakemake rules as before. Each of the three steps (explained in more detail below) has its own Snakefile, so you must specify the correct Snakefile when running an individual rule.

Providing externally-generated data

Unlike Tourmaline 1, you can start any of the three workflow steps with data from an external program, so long as it is formatted correctly. For example, if you already have ASV sequences and just want to assign taxonomy with Tourmaline, you can format them for QIIME 2 (code to help with this below) and just provide the file path in your config file.

Overview

Tourmaline 2 is a modular Snakemake pipeline for processing DNA metabarcoding data. The pipeline consists of three main steps, plus an optional fourth step:

Step 1. Sequence quality assurance and quality control

Called "qaqc" in Tourmaline 2 code.
Processes raw fastq files (paired-end or single-end data).
Provides sequence quality plots for demultiplexed raw and/or trimmed reads.
Optionally trims primer sequences from raw reads.
Creates a QIIME 2 sequence artifact.

Step 2. Representative sequences (denoising and ASV generation)

Called "repseqs" in Tourmaline 2 code.
Generates ASVs using the specified method (DADA2 or Deblur).
Optional filtering based on length, abundance, and prevalence.
Produces feature table and representative sequences.

Step 3. Taxonomy assignment

Called "taxonomy" in Tourmaline 2 code.
Generates taxonomic assignments and visualizations.
Assigns taxonomy using one of four methods: naive-bayes, consensus-blast, consensus-vsearch, or bt2-blca.

Step 4. Generate bioinformatics metadata

Creates a file with metadata about the analysis using FAIR eDNA terms.
File can be read into the NOAA Ocean DNA Explorer.

Setup Requirements

Conda (Miniconda works well)
QIIME 2 (2024.10) amplicon workflow

Snakemake conda environment, with extra packages installed

conda create -c conda-forge -c bioconda -n snakemake-tour2 snakemake biopython yq parallel

V2 (default) branch of Tourmaline

git clone https://github.com/aomlomics/tourmaline.git

bowtie2-blca conda environment (required only if running BLCA taxa assignment)

conda create -c conda-forge -c bioconda -n bt2-blca biopython muscle=3.8 bowtie2

Running Requirements

snakemake-tour2 environment must be activated
Required configuration files for each step
Input data files (vary depending on starting step)
Must run from the Tourmaline directory downloaded from GitHub, which contains the tourmaline.sh script and Snakefiles
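
The requirements above can be checked before starting a run. The following is a minimal sketch, not part of Tourmaline: the function name tourmaline_preflight is hypothetical, and it assumes the environment was created with the name snakemake-tour2 as shown in Setup Requirements. It relies on the CONDA_DEFAULT_ENV variable that conda sets to the name of the active environment.

```shell
# Preflight check before calling ./tourmaline.sh (hypothetical helper).
# Arguments: the config files you intend to pass to the run.
# Returns 0 if the requirements look satisfied, 1 otherwise.
tourmaline_preflight() {
  local ok=0
  local cfg

  # conda stores the name of the active environment in CONDA_DEFAULT_ENV.
  if [ "${CONDA_DEFAULT_ENV:-}" != "snakemake-tour2" ]; then
    echo "error: run 'conda activate snakemake-tour2' first" >&2
    ok=1
  fi

  # The workflow must be started from the cloned Tourmaline directory.
  if [ ! -f ./tourmaline.sh ]; then
    echo "error: tourmaline.sh not found in $(pwd)" >&2
    ok=1
  fi

  # Every config file given as an argument must exist.
  for cfg in "$@"; do
    if [ ! -f "$cfg" ]; then
      echo "error: config file not found: $cfg" >&2
      ok=1
    fi
  done

  return "$ok"
}
```

For example, calling tourmaline_preflight config_01_qaqc.yaml from the Tourmaline directory prints nothing and returns 0 when the environment is active and the file exists.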

Configuration Files

The pipeline uses three main configuration files, one for each step. These files can have any name, and example files are provided.

1. Sample/QA/QC Configuration (config_01_qaqc.yaml)

Key parameters:

run_name: [your_run_name]              # Name for this qaqc run, will be a prefix for outputs
output_dir: [path]                     # Output directory path
raw_fastq_path: [path]                 # Path to raw fastq files
paired_end: [True/False]               # Whether data is paired-end
to_trim: [True/False]                  # Whether to trim sequences

# Trimming parameters
fwd_primer: [sequence]                 # Forward primer sequence
rev_primer: [sequence]                 # Reverse primer sequence
discard_untrimmed: [True/False]        # Whether to discard sequences without the primer
minimum_length: [int]                  # Minimum sequence length to keep after trimming

QA/QC Input Files

There are three options for input files in the QA/QC step. You must choose one and leave the others blank in the config file:

# Full path to raw demultiplexed fastq files. Sample names will be the prefix of the file names.
raw_fastq_path: [path]
# Full path to pre-trimmed fastq files. Sample names will be the prefix of the file names.
trimmed_fastq_path: [path]
# Relative path and file name of a QIIME2 manifest file. It can point to trimmed or untrimmed reads.
sample_manifest_file: [path/filename]
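
As an illustration of the "choose one and leave the others blank" rule, the sketch below writes a QA/QC config that starts from raw fastq files. Only the key names come from the parameters listed above. The file name, paths, run name, and minimum length are placeholders, and the primers shown are the widely used 515F/806R 16S rRNA pair, included only as an example. The final grep is a simple sanity check and is not something Tourmaline itself requires.

```shell
# Write an example QA/QC config (all values are illustrative placeholders).
config_file="my_config_01_qaqc.yaml"

cat > "$config_file" <<'EOF'
run_name: test_run
output_dir: /path/to/output
paired_end: True
to_trim: True

# Exactly one of the three input options is set; the other two stay blank.
raw_fastq_path: /path/to/raw_fastq
trimmed_fastq_path:
sample_manifest_file:

# Trimming parameters
fwd_primer: GTGYCAGCMGCCGCGGTAA
rev_primer: GGACTACNVGGGTWTCTAAT
discard_untrimmed: True
minimum_length: 100
EOF

# Sanity check: count how many of the three input options have a value.
n_inputs=$(grep -cE '^(raw_fastq_path|trimmed_fastq_path|sample_manifest_file):[[:space:]]*[^[:space:]#]' "$config_file")
echo "input options set: $n_inputs"
```

If the count is anything other than 1, more than one input option (or none) has been filled in.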

Sample Manifest Format

You can provide either the current QIIME2 tab-separated file format or the legacy comma-separated format. The file must have the correct headers:

Tab-separated

Paired-end:

sample-id  forward-absolute-filepath     reverse-absolute-filepath
sample1    /path/to/sample1_R1.fastq.gz  /path/to/sample1_R2.fastq.gz

Single-end:

sample-id  absolute-filepath
sample1    /path/to/sample1_R1.fastq.gz
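
A tab-separated paired-end manifest like the one above can be generated from a directory of fastq files. This is a rough sketch rather than a Tourmaline utility: the function name make_pe_manifest is hypothetical, and it assumes the files are named {sample}_R1.fastq.gz and {sample}_R2.fastq.gz. Files without a matching R2 mate are skipped with a warning.

```shell
# make_pe_manifest DIR OUT
# Write a paired-end TSV manifest for every DIR/*_R1.fastq.gz that has an R2 mate.
make_pe_manifest() {
  local fastq_dir out r1 r2 sample
  fastq_dir=$(cd "$1" && pwd)   # the manifest needs absolute file paths
  out=$2

  printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n' > "$out"

  for r1 in "$fastq_dir"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue                  # the glob matched nothing
    r2=${r1%_R1.fastq.gz}_R2.fastq.gz         # expected reverse-read file
    if [ ! -e "$r2" ]; then
      echo "warning: no R2 file for $r1, skipping" >&2
      continue
    fi
    sample=$(basename "$r1" _R1.fastq.gz)     # sample name is the file prefix
    printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2" >> "$out"
  done
}
```

Usage might look like make_pe_manifest /path/to/fastq my_pe.manifest, after which the output file can be given as sample_manifest_file.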

CSV (legacy)

Paired-end:

sample-id,absolute-filepath,direction
sample1,/path/to/sample1_R1.fastq.gz,forward
sample1,/path/to/sample1_R2.fastq.gz,reverse

Single-end:

sample-id,absolute-filepath
sample1,/path/to/sample1_R1.fastq.gz

FASTQ Files without a manifest file

Paired-end naming: {sample}_R1.fastq.gz and {sample}_R2.fastq.gz
Alternative format: {sample}_R1_001.fastq.gz and {sample}_R2_001.fastq.gz
Single-end naming: {sample}_R1.fastq.gz or {sample}_R1_001.fastq.gz
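
To preview which sample names the naming patterns above would yield for a directory, something like the following can be used. It is a sketch that mirrors the naming rule described here, not Tourmaline's internal code, and the function name list_samples is hypothetical.

```shell
# list_samples DIR
# Print one sample name per R1 file in DIR, accepting both
# {sample}_R1.fastq.gz and {sample}_R1_001.fastq.gz.
list_samples() {
  local f name
  for f in "$1"/*_R1.fastq.gz "$1"/*_R1_001.fastq.gz; do
    [ -e "$f" ] || continue          # skip patterns that matched nothing
    name=$(basename "$f")
    name=${name%_R1_001.fastq.gz}    # strip the longer suffix if present
    name=${name%_R1.fastq.gz}        # otherwise strip the shorter one
    printf '%s\n' "$name"
  done | sort -u
}
```

Running list_samples /path/to/raw_fastq before a run is a quick way to spot misnamed files, since they will be missing from the list.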

2. Representative sequences configuration (config_02_repseqs.yaml)

Key parameters:

run_name: [your_run_name] # Name for this repseqs run; can be the same as or different from the qaqc run name
output_dir: [path]        # Output directory path
asv_method: [method]      # ASV method (dada2pe, dada2se, deblur)

# DADA2 parameters (if using dada2pe/dada2se)

dada2_trunc_len_f: [int]   # Forward read truncation length
dada2pe_trunc_len_r: [int] # Reverse read truncation length (paired-end only)
dada2_trim_left_f: [int]   # Number of bases to trim from start of forward reads
dada2pe_trim_left_r: [int] # Number of bases to trim from start of reverse reads (paired-end only)

# Filtering options
to_filter: [True/False]        # Whether to apply filtering
repseq_min_length: [int]       # Minimum ASV length
repseq_max_length: [int]       # Maximum ASV length
repseq_min_abundance: [float]  # Minimum abundance threshold
repseq_min_prevalence: [float] # Minimum prevalence threshold

Repseqs input files

You have two options for providing files to the repseqs step:

1) Provide an existing Tourmaline QA/QC run

Either use the same run_name and output_dir for both steps, or
Use a different run_name for the repseqs step and provide the sample_run_name you want to use. This can be helpful if you are testing different trimming parameters.

2) Provide an externally generated QIIME2 sequence archive (.qza)

To generate a QIIME2 sequence archive, you need a manifest file linking sample names with the absolute file paths of the fastq.gz files (see the TSV format above).

Activate the qiime2-amplicon-2024.10 environment.

conda activate qiime2-amplicon-2024.10

Import the reads into a QIIME2 artifact. Adjust the command to match your manifest file name and your desired output .qza file name and path.

Paired-end data

qiime tools import \
   --type 'SampleData[PairedEndSequencesWithQuality]' \
   --input-path my_pe.manifest \
   --output-path output-file_pe_fastq.qza \
   --input-format PairedEndFastqManifestPhred33V2

Single-end data

qiime tools import \
   --type 'SampleData[SequencesWithQuality]' \
   --input-path my_se.manifest \
   --output-path output-file_se_fastq.qza \
   --input-format SingleEndFastqManifestPhred33V2

3. Taxonomy configuration (config_03_taxonomy.yaml)

Key parameters:

run_name: [your_run_name] # Name for this pipeline run
output_dir: [path]        # Output directory path
classify_method: [method] # Classification method (naive-bayes, consensus-blast, consensus-vsearch, bt2-blca)
collapse_taxalevel: [int] # Creates an additional table where ASV counts are collapsed to the provided taxonomic level
classify_threads: [int]   # Number of threads to use for classification
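
Because classify_method accepts only the four values listed above, a typo is easy to catch before a run starts. The sketch below is a hypothetical helper, not part of Tourmaline. It assumes the value is written unquoted on a single line of the config file, as in the parameter listing above.

```shell
# check_classify_method CONFIG
# Succeed only if classify_method in CONFIG is one of the four listed methods.
check_classify_method() {
  local method
  # Take the value after "classify_method:", dropping any trailing comment.
  method=$(sed -n 's/^classify_method:[[:space:]]*\([^#[:space:]]*\).*/\1/p' "$1")
  case "$method" in
    naive-bayes|consensus-blast|consensus-vsearch|bt2-blca)
      echo "classify_method OK: $method" ;;
    *)
      echo "error: unsupported classify_method: '$method'" >&2
      return 1 ;;
  esac
}
```

If the method is bt2-blca, the bt2-blca conda environment from Setup Requirements is also needed.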
