Natrix
Open-source bioinformatics pipeline for the preprocessing of raw sequencing data.
Install / Use
Natrix is an open-source bioinformatics pipeline for the preprocessing of raw sequencing data. The need for a scalable, reproducible workflow for the processing of environmental amplicon data led to the development of Natrix. It is divided into quality assessment, read assembly, dereplication, chimera detection, split-sample merging, ASV or OTU generation and taxonomic assessment. The pipeline is written in Snakemake (Köster and Rahmann 2018), a workflow management engine for the development of data analysis workflows. Snakemake ensures reproducibility of a workflow by automatically deploying the dependencies of workflow steps (rules) and scales seamlessly to different computing environments like servers, computer clusters or cloud services. While Natrix has only been tested with 16S and 18S amplicon data, it should also work for other kinds of sequencing data. The pipeline contains a separate rule for each step, and each rule with additional dependencies has a separate conda environment that is automatically created when the pipeline is started for the first time. This encapsulation of rules and their dependencies allows for hassle-free sharing of rules between workflows.
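The per-rule encapsulation described above can be illustrated with a minimal Snakemake rule sketch. The rule name, file paths and environment file below are hypothetical, not taken from the Natrix source; the point is the conda directive, which makes Snakemake create a dedicated environment for this rule on first run:

```snakemake
# Hypothetical rule sketch: names and paths are illustrative only.
rule chimera_detection:
    input:
        "results/assembly/{sample}_merged.fasta"
    output:
        "results/chimera/{sample}_nonchimeric.fasta"
    conda:
        # per-rule environment, deployed automatically on the first run
        "envs/chimera.yaml"
    shell:
        "vsearch --uchime_denovo {input} --nonchimeras {output}"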
DAG of an example workflow: each node represents a rule instance to be executed. The direction of each edge represents the order in which the rules are executed, with dashed lines showing rules that are exclusive to the OTU variant and dotted lines showing rules exclusive to the ASV variant of the workflow. Disjoint paths in the DAG can be executed in parallel. Below is a schematic representation of the main steps of the pipeline; the color coding represents which rules belong to which main step.
If you use Natrix, please cite: Welzel, M., Lange, A., Heider, D. et al. Natrix: a Snakemake-based workflow for processing, clustering, and taxonomically assigning amplicon sequencing reads. BMC Bioinformatics 21, 526 (2020). https://doi.org/10.1186/s12859-020-03852-4
Dependencies
- Conda
- GNU screen (optional)
Conda can be downloaded as part of the Anaconda or the Miniconda platforms (Python 3.7). We recommend installing Miniconda3. On Linux you can get it with:
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
GNU screen can be found in the repositories of most Linux distributions:
- Debian / Ubuntu based: apt-get install screen
- RHEL based: yum install screen
- Arch based: pacman -S screen
All other dependencies will be automatically installed using conda environments and can be found in the corresponding environment.yaml files in the envs folder and the natrix.yaml file in the root directory of the pipeline.
Getting Started
To install Natrix, you'll need the open-source package management system conda and, if you want to try Natrix using the accompanying pipeline.sh script, GNU screen.
After cloning this repository to a folder of your choice, it is recommended to create a general natrix conda environment with the accompanying natrix.yaml. In the main folder of the cloned repository, execute the following command:
$ conda env create -f natrix.yaml
This will create a conda environment containing all dependencies for Snakemake itself.
Natrix comes with an example primertable (example_data.csv), an example configfile (example_data.yaml) and an example amplicon dataset in the folder example_data.
To try out Natrix using the example data, type in the following command:
$ ./pipeline.sh
Please enter the name of the project
$ example_data
The pipeline will then start a screen session using the project name (here, example_data) as the session name and will begin downloading dependencies for the rules. To detach from the screen session, press Ctrl+a, d (first press Ctrl+a and then d). To reattach to a running screen session, type in:
$ screen -r
When the workflow has finished, you can press Ctrl+a, k (first press Ctrl+a and then k). This will end the screen session and any processes that are still running.
Tutorial
Prerequisites: dataset, primertable and configuration file
The FASTQ files need to follow a specific naming convention:
samplename_unit_direction.fastq.gz
with:
- samplename as the name of the sample, without special characters.
- unit, identifier for split-samples (A, B). If the split-sample approach is not used, the unit identifier is simply A.
- direction, identifier for forward (R1) and reverse (R2) reads of the same sample. If the reads are single-end, the direction identifier is R1.
A dataset should look like this (two samples, paired-end, no split-sample approach):
S2016RU_A_R1.fastq.gz
S2016RU_A_R2.fastq.gz
S2016BY_A_R1.fastq.gz
S2016BY_A_R2.fastq.gz
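A quick way to check that a dataset follows this convention is a small regular expression, sketched here in Python. The pattern is an assumption derived from the description above ("no special characters" is read as letters and digits only); it is not part of Natrix itself:

```python
import re

# samplename (letters/digits only), unit (A or B), direction (R1 or R2)
NAME_PATTERN = re.compile(r"^[A-Za-z0-9]+_[AB]_R[12]\.fastq\.gz$")

def follows_convention(filename: str) -> bool:
    """Return True if a file name matches samplename_unit_direction.fastq.gz."""
    return NAME_PATTERN.match(filename) is not None

for f in ["S2016RU_A_R1.fastq.gz", "S2016BY_A_R2.fastq.gz", "bad-name_R1.fastq.gz"]:
    print(f, follows_convention(f))
```

Running the check on the example dataset above accepts all four files; a name with a hyphen or a missing unit identifier is rejected.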
Besides the FASTQ data from the sequencing process, Natrix needs a primertable containing the sample names and, if they exist in the data, the length of the poly-N tails, the sequence of the primers and the barcodes used for each sample and direction. Apart from the sample names, all other information can be omitted if the data was already preprocessed or does not contain the corresponding subsequence. Natrix also needs a configuration file in YAML format, specifying parameter values for the tools used in the pipeline.
The primertable, the configfile and the folder containing the FASTQ files all have to be in the root directory of the pipeline and share the same name (with their corresponding file extensions: project.yaml, project.csv and the project folder containing the FASTQ files). The first configfile entry (filename) also needs to be the name of the project.
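The layout requirement can be sketched as a small Python check. The helper name is hypothetical; it only mirrors the naming rule described above:

```python
from pathlib import Path

def project_layout_ok(root: str, project: str) -> bool:
    """Check that <project>.yaml, <project>.csv and the <project> data folder
    all exist in the pipeline root directory and share the project name."""
    root_dir = Path(root)
    return (
        (root_dir / f"{project}.yaml").is_file()   # configfile
        and (root_dir / f"{project}.csv").is_file()  # primertable
        and (root_dir / project).is_dir()            # FASTQ folder
    )

# Example: the bundled example data would be laid out as
# example_data.yaml, example_data.csv and the folder example_data/.
```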
Running Natrix with the pipeline.sh script
If everything is configured correctly, you can start the pipeline by typing the following commands into your terminal emulator:
$ ./pipeline.sh
Please enter the name of the project
$ example_data
The pipeline will then start a screen session using the project name as the session name and will begin downloading dependencies for the rules. To detach from the screen session, press Ctrl+a, d (first press Ctrl+a and then d). To reattach to a running screen session, type in:
$ screen -r
When the workflow has finished, you can press Ctrl+a, k (first press Ctrl+a and then k). This will end the screen session and any processes that are still running.
Running Natrix with Docker or docker-compose
Pulling the image from Dockerhub
Natrix can be run inside a Docker container. For this, Docker has to be installed. Please have a look at the Docker website to find out how to install Docker and set up an environment if you have not used it before.
The easiest way to run the Docker container is to download the pre-built container from Dockerhub.
$ docker pull mw55/natrix
The docker container has all environments pre-installed, eliminating the need for downloading the environments during first-time initialization of the workflow. To connect to the shell inside the docker container, input the following command:
$ docker run -it --label natrix_container -v </host/database>:/app/database -v </host/results>:/app/results -v </host/input_folder>:/app/input -v </host/demultiplexed>:/app/demultiplexed mw55/natrix bash
/host/database is the full path to a local folder, in which you wish to install the database (SILVA or NCBI). This part is optional and only needed if you want to use BLAST for taxonomic assignment.
/host/results is the full path to a local folder in which the results of the workflow should be stored for the container to use.
/host/input_folder is the full path to a local folder in which the input (the project folder, the project.yaml and the project.csv) should be saved.
/host/demultiplexed is the full path to a local folder in which the demultiplexed data, or, if demultiplexing is turned off, the input data will be saved.
After you have connected to the container shell, you can follow the "running Natrix manually" tutorial.
Directly starting the workflow using docker-compose
Alternatively, you can start the workflow using the docker-compose command in the root directory of the workflow (it will pull the latest Natrix image from Dockerhub):
$ PROJECT_NAME="<project>" docker-compose up (-d)
with <project> being the name of your project, e.g.:
$ PROJECT_NAME="example_data" docker-compose up # sudo might be needed
All output folders will be available at /srv/docker/natrix/. Make sure to copy your project folder, project.yaml and project.csv files to /srv/docker/natrix/input/, or create a new volume mapping using the docker-compose.yml file. By default, the container will wait until the input files exist. At first launch the container will download the required databases to /srv/docker/natrix/databases/; this process might take a while.
Building the container yourself
If you prefer to build the Docker container yourself from the repository (for example, if you modified the source code of Natrix), the container can be built and started directly (the host folders described above have to exist beforehand).