<p align="center"> <img src="documentation/images/lightmode/logo.png" alt="logo" width="500"/> </p>

Natrix is an open-source bioinformatics pipeline for the preprocessing of long and short raw amplicon sequencing data. The need for a scalable, reproducible workflow for processing environmental amplicon data led to its development. The pipeline is divided into quality assessment, dereplication, chimera detection, split-sample merging, ASV or OTU generation, and taxonomic assessment. It is written in Snakemake (Köster and Rahmann 2018), a workflow management engine for developing data analysis workflows. Snakemake ensures the reproducibility of a workflow by automatically deploying the dependencies of workflow steps (rules) and scales seamlessly to different computing environments such as servers, computer clusters, and cloud services.

Although Natrix has only been tested with 16S and 18S amplicon data, it should also work for other kinds of sequencing data. Each step of the pipeline is implemented as a separate rule, and each rule with additional dependencies has its own Conda environment, created automatically the first time the pipeline is started. This encapsulation of rules and their dependencies allows for hassle-free sharing of rules between workflows.

To use the latest features and updates, it is recommended to use the dev branch of Natrix2, which contains developments and patches not yet available in the main branch.

Fig. 1: DAG of the Natrix2 workflow: Schematic representation of the Natrix2 workflow, showing the processing of two split samples using AmpliconDuo. The color scheme represents the main steps; dashed lines outline the OTU variant and dotted lines the ASV variant of the workflow. Stars mark updates to the original Natrix workflow. Details of the ONT part are depicted in Fig. 2.

Fig. 2: Schematic diagram of processing nanopore reads with Natrix2 for OTU generation and taxonomic assignment. The color scheme represents the main steps of this variant of the workflow.


Table of contents

  1. Dependencies
  2. Installation
  3. Sequence Count
  4. Tutorial
  5. Cluster execution
  6. Output
  7. Workflow
  8. Primertable
  9. Configuration
  10. References
  11. Citation
  12. Troubleshooting

Dependencies

  • Linux (recommended)
    The pipeline was developed and tested on the Ubuntu distribution. Linux provides a stable, high-performance environment for running computationally intensive bioinformatics workflows and ensures compatibility with most scientific software.

  • Snakemake
    Workflow management system for defining, organizing, and executing reproducible and scalable data analyses. Workflows are described in a readable, Python-based language and executed with automatic handling of dependencies, parallelization, and reproducibility across different computing environments.

  • Conda
    Cross-platform package and environment manager used to install all required software in isolated, reproducible environments. Conda ensures that the correct versions of all dependencies are used and allows easy sharing of the computational environment.

  • GNU Screen
    Terminal multiplexer that allows long-running pipeline executions to run in detached sessions, preventing termination when the terminal connection is interrupted. GNU Screen is available in the repositories of most Linux distributions:

    • Debian/Ubuntu-based systems: apt-get install screen
    • RHEL-based systems: yum install screen
    • Arch-based systems: pacman -S screen

Installation

Conda can be installed via the Anaconda or Miniconda platforms, with Miniconda3 being recommended for most users; on Linux systems, it can be obtained using:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh # Download Miniconda installer
bash Miniconda3-latest-Linux-x86_64.sh # Run Miniconda installer

Dependencies will be automatically installed using Conda environments and can be found in the corresponding environment.yaml files in the envs folder and the natrix.yaml file in the root directory of the pipeline.

To install Natrix, you need the open-source package management system Conda and, if you want to run Natrix using the accompanying pipeline.sh script, GNU Screen. After cloning this repository to a folder of your choice, it is recommended to create a general Natrix Conda environment using the provided natrix.yaml file. Important: after setting up the natrix.yaml environment, make sure to check the Sequence Count section before starting the workflow. From the main folder of the cloned repository, run the following command:

conda env create -f natrix.yaml # Create the Natrix Conda environment

Natrix comes with an example primertable (example_data.csv), an example configfile (example_data.yaml), and an example amplicon dataset located in the /example_data folder. To try out Natrix using the example data (Illumina_data or Nanopore_data), run the following command:

$ ./pipeline.sh # Start Natrix2 pipeline script
Natrix2 Pipeline Script # Output
Enter project name (e.g. illumina_swarm): # Output
$ illumina_swarm # Select illumina_swarm config

The pipeline will then start a screen session using the project name (here, illumina_swarm) as the session name and begin downloading dependencies for the workflow rules. To detach from the screen session, press Ctrl+a, d (first press Ctrl+a, then d). To reattach to a running screen session, type:

screen -r # Reattach to the running screen session

When the workflow has finished, press Ctrl+a, k (first press Ctrl+a, then k) to terminate the screen session and stop any remaining processes.


Sequence Count

Before starting the workflow, check the number of sequences in your input files (*.fastq, *.fastq.gz), as the workflow may abort if too few sequences are present. Experience has shown that the workflow fails when the number of sequences is below 150. To avoid this, analyze your data using the nseqc.py tool.

The tool compares a user-specified threshold (150 is recommended) with the number of sequences in each file. If a file falls below the threshold, a warning is issued, and the affected files should be moved out of the input folder to prevent workflow errors.
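The check can be sketched in a few lines of Python. This is a hypothetical re-implementation for illustration only (the function names `count_sequences` and `files_below_threshold` are not part of Natrix2); in practice, use the bundled natrixlib/nseqc.py.

```python
import gzip
from pathlib import Path

def count_sequences(fastq_path):
    """Count records in a FASTQ file (4 lines per record), gzipped or plain."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        return sum(1 for _ in handle) // 4

def files_below_threshold(folder, threshold=150):
    """Return (filename, count) pairs for FASTQ files below the threshold."""
    flagged = []
    for path in sorted(Path(folder).glob("*.fastq*")):
        count = count_sequences(path)
        if count < threshold:
            flagged.append((path.name, count))
    return flagged
```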

Using the nseqc Tool

First, change to the root directory of the pipeline. Then, run the following command:

python3 natrixlib/nseqc.py <folder_path> <threshold> # Check sequence counts in FASTQ files

Once you have checked your data with the tool, you can start the workflow as usual.


Tutorial

Prerequisites: dataset, primertable, and configuration file

The FASTQ files need to follow a specific naming convention:

<p align="center"> <img src="documentation/images/lightmode/filename.png" alt="naming" width="400"/> </p> <p><b>Fig. 3</b>: Specific naming for the FASTQ files</p>
samplename_unit_direction.fastq.gz # Samplename = sample ID, unit = A/B, direction = R1 or R2

with:

  • samplename: name of the sample, without special characters.
  • unit: identifier for split-samples (A, B). If the split-sample approach is not used, the unit identifier is A, but it still needs to be specified.
  • direction: identifier for forward (R1) and reverse (R2) reads of the same sample. For single-end reads, the direction identifier is R1 and still needs to be specified.

A dataset should look like this (two samples, paired-end, no split-sample approach):

S2016RU_A_R1.fastq.gz # Sample S2016RU, unit A, forward read (R1)
S2016RU_A_R2.fastq.gz # Sample S2016RU, unit A, reverse read (R2)
S2016BY_A_R1.fastq.gz # Sample S2016BY, unit A, forward read (R1)
S2016BY_A_R2.fastq.gz # Sample S2016BY, unit A, reverse read (R2)
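The naming convention above can be sketched as a small parser. This helper is illustrative only and not part of the Natrix2 code base; the pattern assumes sample names contain no underscores or other special characters, as required by the convention.

```python
import re

# samplename_unit_direction.fastq[.gz], e.g. S2016RU_A_R1.fastq.gz
FILENAME_PATTERN = re.compile(
    r"^(?P<samplename>[A-Za-z0-9]+)_(?P<unit>[AB])_(?P<direction>R[12])\.fastq(\.gz)?$"
)

def parse_sample_filename(filename):
    """Split a FASTQ filename into (samplename, unit, direction)."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(
            f"{filename} does not follow samplename_unit_direction.fastq(.gz)"
        )
    return match.group("samplename"), match.group("unit"), match.group("direction")
```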

In addition to the FASTQ files generated during sequencing, Natrix requires a primertable containing the sample names and, if present, the length of poly-N tails, primer sequences, and barcodes used for each sample and read direction. Except for the sample names, all other information may be omitted if the data has already been preprocessed or does not contain the corresponding subsequences. Natrix also requires a configuration file in YAML format that specifies parameter values for the tools used in the pipeline.

The primertable, configuration file, and the folder containing the FASTQ files must all be located in the root directory of the pipeline and share the same project name (with their respective extensions: project.yaml, project.csv, and the project folder containing the FASTQ files). The first configfile entry,
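The shared-name requirement can be verified with a small helper before starting the workflow. This is a hypothetical sketch (the function `check_project_layout` is not part of Natrix2), assuming the layout described above: project.yaml, project.csv, and a project folder, all in the pipeline root.

```python
from pathlib import Path

def check_project_layout(root, project):
    """Return the expected paths that are missing for a given project name."""
    root = Path(root)
    expected = [
        root / f"{project}.yaml",  # configuration file
        root / f"{project}.csv",   # primertable
        root / project,            # folder containing the FASTQ files
    ]
    return [str(path) for path in expected if not path.exists()]
```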
