
PRONAME: PROcessing NAnopore MEtabarcoding data

PRONAME is an open-source bioinformatics pipeline for processing Nanopore metabarcoding sequencing data. The pipeline is written mainly in bash and is packaged as a Docker image, which only needs to be pulled from Docker Hub to be ready to use. The Docker image includes all developed scripts, dependencies and precompiled reference databases.

The pipeline is divided into four steps: (i) Nanopore sequencing data is first imported into PRONAME to trim adapter and primer sequences (optional) and to visualize raw read length and quality (proname_import). (ii) The second script (proname_filter) differentiates simplex from duplex reads, which makes it possible to take advantage of the higher-accuracy duplex reads introduced with the V14 sequencing chemistry. Reads that do not meet length and quality criteria are then filtered out. (iii) The next script (proname_refine) clusters the reads, corrects sequencing errors by polishing with Medaka, a tool dedicated to Nanopore data, and discards chimeric sequences. (iv) The last script (proname_taxonomy) performs the taxonomic analysis of the generated high-accuracy consensus sequences. If desired, the pipeline can also generate a phyloseq object and import the generated files into QIIME2 for further analyses (diversity, abundance, etc.).

The rEGEN-B (rrn operons Extracted from GENomes of Bacteria) database developed in this work is included in the Docker image and is also directly available on Figshare.

Additional information can also be found in our associated publication.

Installation

If you don't have Docker on your computer, you can find instructions to install it here.

The PRONAME Docker image is available on Docker Hub.

This repository includes two Docker images, each optimized for a specific architecture:

amd64: For x86_64 (Intel/AMD) processors (i.e. suitable for most Linux & Windows machines)
arm64: For ARM-based processors (e.g., Apple M1/M2, Raspberry Pi)

These images provide flexibility for users on different hardware platforms.

Download

To download an image, please run one of the following commands:

Command to pull the image for the amd64 architecture:

```
docker pull benn888/proname:v2.1.4-amd64
```

Command to pull the image for the arm64 architecture:

```
docker pull benn888/proname:v2.1.4-arm64
```

Note that, depending on your installation, running Docker commands may require sudo privileges.

You can run this command to confirm that the image has successfully been downloaded and is available:

docker images

Then, the simplest way to run a new container is to use this command:

docker run -it --name proname_container benn888/proname:v2.1.4-<arch>

where <arch> should be replaced by amd64 or arm64.
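If you are unsure which tag applies to your machine, the architecture can be read from `uname -m`. The sketch below is only an illustration: `proname_arch` is a hypothetical helper name, not part of PRONAME, and the mapping assumes that `uname -m` reports `x86_64`, `aarch64` or `arm64`.

```shell
# Hypothetical helper (not shipped with PRONAME): map a machine
# architecture string to the suffix used in the image tags above.
proname_arch() {
  case "$1" in
    x86_64|amd64)  echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Build the image name for the current machine and print it.
if arch="$(proname_arch "$(uname -m)")"; then
  echo "benn888/proname:v2.1.4-${arch}"
fi
```

The printed name can then be passed to `docker pull` or `docker run`.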

However, a more effective way to launch a container is to set up a shared volume that mounts a host directory directly in the container. This setup allows access to raw sequencing data in the container and enables direct access to PRONAME results from the host machine:

docker run -it --rm --name proname_container -v /path/to/host/data:/data benn888/proname:v2.1.4-<arch>

where /path/to/host/data is the path to the directory on your host machine containing the raw sequencing data, and /data is the directory in the container where this data will be accessible. Place any files resulting from the PRONAME analysis in /data to access them directly from the host machine.

Although we did not encounter any memory issues when testing and using PRONAME, adjusting the memory available to Docker may be useful in some cases.

For Docker Desktop on macOS, in particular, ensure that the setting Settings > General > Use Rosetta for x86_64/amd64 emulation on Apple Silicon is unchecked.
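As a rough sketch of how the shared volume and a memory cap can be combined, the helper below prints (but does not execute) a `docker run` command. `proname_run_cmd` is a hypothetical name and `16g` is an arbitrary example value, not a recommendation from the PRONAME authors; `--memory` is the standard `docker run` option for limiting container memory. On Docker Desktop, the total memory available to containers is set separately in the application's resource settings.

```shell
# Hypothetical helper: print (not execute) a docker run command that mounts
# a host directory at /data and caps container memory with --memory.
# Paths containing spaces would need extra quoting and are not handled here.
proname_run_cmd() {
  host_dir="$1"   # directory on the host holding the raw sequencing data
  arch="$2"       # amd64 or arm64
  mem="$3"        # memory cap in docker syntax, e.g. 16g (example value)
  printf 'docker run -it --rm --name proname_container --memory %s -v %s:/data benn888/proname:v2.1.4-%s\n' \
    "$mem" "$host_dir" "$arch"
}

proname_run_cmd /path/to/host/data amd64 16g
```

Printing the command first makes it easy to check the mount path and tag before starting the container.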

And that's it! You are now ready to analyze your Nanopore metabarcoding data with PRONAME. The pipeline consists of four scripts:

proname_import
proname_filter
proname_refine
proname_taxonomy

These scripts must be run in this order, with their required arguments. To list all arguments of a script together with a usage example, type its name followed by "--help" (e.g. proname_import --help). A tutorial detailing the whole workflow is also presented below.
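To browse the help menus of all four scripts in pipeline order, a small loop such as the following can be used inside the container. The variable name `PRONAME_STEPS` is introduced here for illustration only; outside the container the scripts are not expected to be on the PATH, so the loop just reports that.

```shell
# The four PRONAME scripts, in the order in which they must be run.
PRONAME_STEPS="proname_import proname_filter proname_refine proname_taxonomy"

for step in $PRONAME_STEPS; do
  if command -v "$step" >/dev/null 2>&1; then
    # Inside the container: show the arguments and the usage example.
    "$step" --help
  else
    # Outside the container the scripts are not available.
    echo "$step not found (run this inside the PRONAME container)"
  fi
done
```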

Tutorial

0. Before PRONAME

The Nanopore sequencing data imported into PRONAME must be fastq files, i.e. basecalled reads. If you have raw-signal data (fast5 or pod5 files), basecall them first, preferably with Dorado. For this tutorial, all fastq files (one file per sample) have been placed in the RawData directory.

The fastq files analyzed in this tutorial can be found under accession PRJNA1299388 on the NCBI website.
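Before importing, it can be useful to check that the input directory contains one fastq file per sample and that each file holds reads. The sketch below is not part of PRONAME; it assumes uncompressed files with a `.fastq` extension and the usual four lines per record, and the function names are hypothetical.

```shell
# Count reads in an uncompressed FASTQ file, assuming four lines per record.
count_reads() {
  awk 'END { print int(NR / 4) }' "$1"
}

# Print file name and read count for every .fastq file in a directory.
report_reads() {
  for fq in "$1"/*.fastq; do
    [ -e "$fq" ] || continue   # skip if the glob matched nothing
    printf '%s\t%s\n' "$(basename "$fq")" "$(count_reads "$fq")"
  done
}

report_reads RawData
```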

1. proname_import

The first step is to import the sequencing data into PRONAME. Since adapter and primer sequences have not been removed yet, the --trimadapters and --trimprimers arguments are set to "yes" and the primer sequences are provided (5'->3'). Given that the V14 sequencing chemistry was used, the --duplex argument is set to "yes", so that separate length-vs-quality scatterplots are generated for simplex, duplex and simplex+duplex reads.

proname_import \
  --inputpath RawData \
  --threads 48 \
  --duplex yes \
  --trimadapters yes \
  --sequencingkit SQK-LSK114 \
  --trimprimers yes \
  --fwdprimer AGRGTTYGATYMTGGCTCAG \
  --revprimer CGACATCGAGGTGCCAAAC

Here is the complete list of available arguments for proname_import:

| Command | Arguments | Description | Mandatory arguments |
| ------- | --------- | ----------- | ------------------- |
| proname_import | --inputpath | Path to the folder containing raw fastq files. To prevent file conflicts and ensure accurate sequence counting, your raw FASTQ files must be stored in a separate directory (e.g., /data/RawData). Do not place them directly in your working directory, as this is where proname_import writes all its output files. Mixing input and output files in the same location can lead to errors and unreliable results. | Yes |
| | --threads | Number of threads to use for the Guppy adapter-trimming step and/or the Cutadapt primer-trimming step. You can find the number of available threads on your computer by running the command 'nproc --all'. [Default: 2] | No |
| | --duplex | Indicate whether your sequencing data include duplex reads or not. Duplex reads are high-quality reads that were introduced with the kit 14 chemistry. [Options: "yes" or "no"] | Yes |
| | --trimadapters | Indicate whether your sequencing data contain adapters that should be trimmed. [Options: "yes" or "no"] | Yes |
| | --sequencingkit | Name of the ONT sequencing kit used to generate the library(-ies). [Default: "SQK-LSK114"] | No |
| | --trimprimers | Indicate whether your sequencing data contain primers that should be trimmed. [Options: "yes" or "no"] | Yes |
| | --fwdprimer | The sequence of the forward primer used during PCR to amplify DNA. If barcoded primers were used to multiplex samples, please provide here only the target-specific part of the primer in 5'->3' orientation. This argument is required if --trimprimers is set to "yes". | ~ |
| | --revprimer | The sequence of the reverse primer used during PCR to amplify DNA. If barcoded primers were used to multiplex samples, please provide here only the target-specific part of the primer in 5'->3' orientation. This argument is required if --trimprimers is set to "yes". | ~ |
| | --primererrorrate | Maximum allowed mismatch rate between primer sequences and reads during primer trimming. This value controls the tolerance to sequencing errors in primer regions. For example, 0.15 allows up to 15% mismatches between the primer and the read. Increasing this value may recover more reads but also raises the risk of unspecific matches. [Default: 0.15] | No |
| | --nocounting | When this argument is set to "yes", counting of simplex/duplex reads is not performed. [Options: "yes" or "no", Default: "no"] | No |
| | --plotformat | Format of the scatterplot visualization files produced. It can be either "png" or "html". Since NanoPlot produces empty png plots for an unknown reason, it is only used to generate html visualizations. png plots are produced using the custom script scaleq.py. [Default: "png"] | No |
| | --noscatterplot | When this argument is set to "yes", no length vs. quality scatterplot is generated. Since this is a time-consuming step, this possibility has been made available to increase the flexibility of the pipeline. However, skipping the scatterplot generation is strongly discouraged: visual inspection of these plots is crucial for deciding which type of read to work with (duplex and/or simplex) and which length and quality thresholds to apply. [Options: "yes" or "no", Default: "no"] | No |
| | --version | Print the version of the pipeline. | No |
| | --verbose | Activate verbose/debug mode (no redirections). | No |
| | --help | Print the help menu. | No |
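To make the --primererrorrate default more concrete: Cutadapt's user guide describes the number of allowed errors as the error rate multiplied by the length of the matched adapter sequence, rounded down. Assuming PRONAME passes the value to Cutadapt unchanged (an assumption, not something stated in the table above), the tolerance for the tutorial primers can be estimated as follows. `allowed_errors` is a hypothetical helper that takes the rate in percent to stay within integer arithmetic.

```shell
# Allowed errors = floor(rate x primer length), using integer arithmetic.
# The rate is given in percent (15 means 0.15) to avoid floating point.
allowed_errors() {
  primer="$1"
  rate_percent="$2"
  echo $(( ${#primer} * rate_percent / 100 ))
}

allowed_errors AGRGTTYGATYMTGGCTCAG 15   # forward primer, 20 nt: prints 3
allowed_errors CGACATCGAGGTGCCAAAC 15    # reverse primer, 19 nt: prints 2
```

Under this assumption, the slightly shorter reverse primer tolerates one error fewer than the forward primer at the same rate.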

The generated simplex_duplex_read_distribution.tsv file shows that enough duplex reads were sequenced:

| Sample_name | Simplex_reads | Duplex_reads |
| ----------- | ------------- | ------------ |
| sample10 | 405356 | 38219 |
| sample1 | 254762 | 58214 |
| sample2 | 473202 | 33534 |
| sample3 | 250737 | 48763 |
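To put these counts in perspective, the share of duplex reads per sample can be computed directly from the file. The sketch below assumes that the file is tab-separated with a single header line and the three columns shown above; `duplex_share` is a hypothetical helper, not a PRONAME command.

```shell
# Print the percentage of duplex reads per sample from a tab-separated file
# with a header line and the columns Sample_name, Simplex_reads, Duplex_reads.
duplex_share() {
  awk -F '\t' 'NR > 1 && ($2 + $3) > 0 {
    printf "%s\t%.1f\n", $1, 100 * $3 / ($2 + $3)
  }' "$1"
}

# Only run on the real file if it exists in the current directory.
if [ -f simplex_duplex_read_distribution.tsv ]; then
  duplex_share simplex_duplex_read_distribution.tsv
fi
```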
