
aMeta: an accurate and memory-efficient ancient Metagenomic profiling workflow
About
aMeta is a Snakemake workflow for identifying microbial sequences in ancient DNA shotgun metagenomics samples. The workflow performs:
- trimming adapter sequences and removing reads shorter than 30 bp with Cutadapt
- quality control before and after trimming with FastQC and MultiQC
- k-mer-based taxonomic sequence classification with KrakenUniq
- sequence alignment with Bowtie2 and screening for common microbial pathogens
- deamination pattern analysis with MapDamage2
- Lowest Common Ancestor (LCA) sequence alignment with Malt
- authentication and validation of identified microbial species with MaltExtract
You can get an overview of aMeta from the rule graph (DAG) below:

When using aMeta and / or the pre-built databases provided together with the workflow for your research projects, please cite our article:
Zoé Pochon*, Nora Bergfeldt*, Emrah Kırdök, Mário Vicente, Thijessen Naidoo,
Tom van der Valk, N. Ezgi Altınışık, Maja Krzewińska, Love Dalén, Anders Götherström*,
Claudio Mirabello*, Per Unneberg* and Nikolay Oskolkov*,
aMeta: an accurate and memory-efficient ancient Metagenomic profiling workflow,
Genome Biology 2023, 24 (242), https://doi.org/10.1186/s13059-023-03083-9
Authors
- Nikolay Oskolkov (@LeandroRitter) nikolay.oskolkov@scilifelab.se
- Claudio Mirabello (@clami66) claudio.mirabello@scilifelab.se
- Per Unneberg (@percyfal) per.unneberg@scilifelab.se
Installation
Clone the GitHub repository, then create and activate the aMeta conda
environment (here and below, cd aMeta means navigating to the root of the
cloned aMeta directory). For this purpose, we recommend installing
your own conda, for example Miniconda from
https://docs.conda.io/en/latest/miniconda.html, and mamba from
https://mamba.readthedocs.io/en/latest/installation.html:
git clone https://github.com/NBISweden/aMeta
cd aMeta
# For conda version < 23.10 use mamba instead of conda
conda env create -f workflow/envs/environment.yaml
# We added preliminary support for Snakemake v8. To try it out, change the conda command to
# conda env create -f workflow/envs/environment.v8.yaml
conda activate aMeta
Run a test to make sure that the workflow was installed correctly:
cd .test
./runtest.sh -j 1
Here and below, -j specifies the number of threads the workflow may use. Please make sure that the installation and the test run complete successfully before running aMeta on your real data. Problems with the installation or the test run often come from an unstable internet connection or from particular conda settings used e.g. on compute clusters, so we advise using your own freshly installed conda. Also note that the test run currently needs ~16 GB of RAM, which is suitable for a regular laptop. When executing the test run on a compute cluster, however, you should assign more than one core to the job: a single cluster core may have less than 16 GB of RAM (for example ~8 GB), which can make the test run fail on an HPC system even though it runs fine on a laptop.
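On an HPC system you can make the memory requirement explicit in the job request. Below is a minimal Slurm sketch for submitting the test run; the account and partition names, core count, and time limit are placeholders you must adapt to your cluster:

```shell
# write a Slurm batch script for the test run
# (account, partition, and time limit below are placeholders)
cat > runtest.sbatch <<'EOF'
#!/bin/bash
#SBATCH -A my_account          # placeholder: your project/account
#SBATCH -p core                # placeholder: your partition
#SBATCH -n 4                   # several cores so total RAM exceeds ~16 GB
#SBATCH --mem=16G              # the test run needs about 16 GB of RAM
#SBATCH -t 02:00:00
cd .test
./runtest.sh -j 4
EOF
# submit with: sbatch runtest.sbatch
```

Requesting the memory via --mem (rather than relying on per-core defaults) avoids the failure mode described above where one core carries less than 16 GB.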
Quick start
To run the workflow you need to prepare a tab-delimited sample file config/samples.tsv with at least two columns, and a configuration file config/config.yaml. Below we provide examples of both files.
Here is an example of samples.tsv; it assumes that the fastq files are located in the aMeta/data folder:
sample fastq
foo data/foo.fq.gz
bar data/bar.fq.gz
Currently, it is important that each sample name in the first column exactly matches the base name of the fastq file in the second column. For example, a fastq file "data/foo.fq.gz" specified in the "fastq" column must have the name "foo" in the "sample" column. Please make sure that the two columns match in this way.
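Because mismatched names are an easy mistake to make, it can be worth checking the sample sheet before launching the workflow. A small shell sketch of such a check, demonstrated here on a throwaway file (point the awk command at config/samples.tsv to check your real sample sheet):

```shell
# throwaway sample sheet with one correct row (foo) and one mismatched row (baz)
printf 'sample\tfastq\nfoo\tdata/foo.fq.gz\nbaz\tdata/qux.fq.gz\n' > /tmp/samples.tsv

# flag rows where the sample name does not match the fastq file basename
awk -F'\t' 'NR > 1 {
    n = split($2, p, "/"); base = p[n]        # fastq file basename
    sub(/\.(fq|fastq)\.gz$/, "", base)        # strip the extension
    if ($1 != base) print "mismatch: sample=" $1 " fastq=" $2
}' /tmp/samples.tsv > /tmp/samples_check.txt

cat /tmp/samples_check.txt   # reports the "baz" row only
```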
Below is an example of config.yaml. Here you will need to download a few databases that we have made public (or build the databases yourself).
samplesheet: "config/samples.tsv"
# KrakenUniq Microbial NCBI NT database (if you are interested in prokaryotes only)
# can be downloaded from https://doi.org/10.17044/scilifelab.20518251
krakenuniq_db: resources/DBDIR_KrakenUniq_MicrobialNT
# KrakenUniq full NCBI NT database (if you are interested in prokaryotes and eukaryotes)
# can be downloaded from https://doi.org/10.17044/scilifelab.20205504
#krakenuniq_db: resources/DBDIR_KrakenUniq_Full_NT
# Bowtie2 index and helping files for following up microbial pathogens
# can be downloaded from https://doi.org/10.17044/scilifelab.21185887
bowtie2_db: resources/library.pathogen.fna
bowtie2_seqid2taxid_db: resources/seqid2taxid.pathogen.map
pathogenomesFound: resources/pathogensfound.very_inclusive.tab
# Bowtie2 index for full NCBI NT (for quick followup of prokaryotes and eukaryotes),
# can be downloaded from https://doi.org/10.17044/scilifelab.21070063 (please unzip files)
# For using Bowtie2 NT index, replace "bowtie2_db" and "bowtie2_seqid2taxid_db" above by
#bowtie2_db: resources/library.fna
#bowtie2_seqid2taxid_db: resources/seqid2taxid.map.orig
# Helping files for building Malt database
# can be downloaded from https://doi.org/10.17044/scilifelab.21070063
malt_nt_fasta: resources/library.fna
malt_seqid2taxid_db: resources/seqid2taxid.map.orig
malt_accession2taxid: resources/nucl_gb.accession2taxid
# A path for downloading NCBI taxonomy files (performed automatically)
# you do not need to change this line
ncbi_db: resources/ncbi
# Breadth and depth of coverage filters
# default thresholds are very conservative, can be tuned by users
n_unique_kmers: 1000
n_tax_reads: 200
There are several ways to download the database files. One option is to follow https://docs.figshare.com/#articles_search and search for the last number in each database link above in the "article_id" search bar; this gives you the download URL for each file. You can then download it with wget inside a screen or tmux session, or with aria2c (https://aria2.github.io/). N.B. We strongly recommend not mixing the databases in the same directory but placing each in its own folder, otherwise they may overwrite each other.

Also, if you use the KrakenUniq full NCBI NT database and / or the Bowtie2 index of full NCBI NT, please keep in mind that the reference genomes used for building the database / index were imported as is from the BLASTN tool https://blast.ncbi.nlm.nih.gov/Blast.cgi. This implies that, to minimize resource usage, the majority of eukaryotic reference genomes (including the human reference genome) included in the database / index may be of poor quality. In contrast, the vast majority of microbial reference genomes included in the NCBI NT database / index are of very good (complete) quality. Therefore, if the goal of your analysis is human / animal microbiome profiling, we recommend using the Microbial NCBI NT database / index; this ensures that human / animal reads are not accidentally assigned to microbial organisms. The full NCBI NT database / index, however, is very useful if you work with e.g. sedimentary or environmental ancient DNA and your goal is simply to detect, in an unbiased way, all prokaryotic and eukaryotic organisms present in your samples, without trying to precisely quantify their abundance.
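To avoid the overwriting problem, one possible layout is a dedicated folder per database, created before downloading into it. A sketch (the folder names here are illustrative, not required by the workflow; the aria2c options are standard ones, and the download URL must be taken from the figshare article_id search described above):

```shell
# one folder per database so files from different archives never collide
mkdir -p resources/krakenuniq_db resources/bowtie2_db resources/malt_db

# example: download one archive into its own folder with aria2c
# (-x 8 opens up to 8 connections, -d sets the download directory;
#  replace <download-url> with the URL found via the figshare search)
# aria2c -x 8 -d resources/krakenuniq_db "<download-url>"
```

If you use custom folder names like these, remember to update the corresponding paths in config/config.yaml.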
After you have prepared the sample and configuration files, please install the job-specific environments, update the Krona taxonomy, and modify the default Java heap space parameters for the Malt jobs:
cd aMeta
# install job-specific environments
snakemake --snakefile workflow/Snakefile --use-conda --conda-create-envs-only -j 20
# update Krona taxonomy
env=$(grep krona .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
cd $env/opt/krona/
./updateTaxonomy.sh taxonomy
cd -
# modify default java heap space parameters for Malt jobs
env=$(grep hops .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
conda activate $env
version=$(conda list malt --json | grep version | sed -e "s/\"//g" | awk '{print $2}')
cd $env/opt/malt-$version
sed -i -e "s/-Xmx64G/-Xmx512G/" malt-build.vmoptions
sed -i -e "s/-Xmx64G/-Xmx512G/" malt-run.vmoptions
cd -
conda deactivate
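The 512G heap value above is matched to a large-memory node; scale it to the RAM actually available on your machine. If you want to confirm the substitution pattern before touching the files in the conda environment, you can rehearse the same sed command on a scratch file:

```shell
# rehearse the heap-size substitution on a scratch file
# (same sed pattern as used on the real vmoptions files above)
printf -- '-Xmx64G\n' > /tmp/malt-run.vmoptions
sed -i -e "s/-Xmx64G/-Xmx512G/" /tmp/malt-run.vmoptions
cat /tmp/malt-run.vmoptions   # now reads -Xmx512G
```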
Finally, the workflow can be run using the following command line:
cd aMeta
snakemake --snakefile workflow/Snakefile --use-conda -j 20
In the sections More configuration options, Environment module configuration and Runtime configuration we give more information about fine-tuning the configuration, as well as instructions on how to run the workflow in a compute cluster environment.
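As a preview, a minimal sketch of a cluster run using Snakemake's --cluster interface (available in Snakemake 7 and earlier; the sbatch account, partition, and time values are placeholders for your site):

```shell
# sketch: dispatch each workflow job to Slurm
# (my_account / core / 10:00:00 are placeholders; {threads} is filled in
#  by Snakemake from each rule's thread request)
snakemake --snakefile workflow/Snakefile --use-conda -j 20 \
    --cluster "sbatch -A my_account -p core -n {threads} -t 10:00:00"
```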
More details about running aMeta can be found in the step-by-step tutorial available in the aMeta/vignettes directory.
Main results of the workflow and their interpretation
All output files of the workflow are located in the aMeta/results directory. To get a quick overview of the ancient microbes present in your samples, check the heatmap in results/overview_heatmap_scores.pdf.

The heatmap shows the microbial species (rows) authenticated in each sample (columns). The colors and numbers in the heatmap represent authentication scores, i.e. a numeric quantification of eight quality metrics that provide evidence of authentic ancient microbial presence.
