MetaBench

MetaBench is a pipeline to continuously benchmark metagenomics analysis tools. It covers database construction (build), taxonomic binning and profiling.

It supports:

multiple tools
- in several release versions
multiple databases
- multiple range of parameters
multiple samples
- for binning and profiling both with multiple range of parameters
multiple evaluation metrics and threshold
performance benchmarks with optional repeats

It outputs:

standardized JSON files for integration
bioboxes output files for binning and profiling
interactive dashboard to analyze results

It requires:

Reference sequences to build databases
Set of reads in fastq and ground truth in the bioboxes format

Current configured tools:

ganon v2.0.0
kmcp v0.9.4
metacache v2.3.1
kraken2 v2.1.3
bracken v2.9

Installation and requirements

MetaBench is written in Snakemake and makes use of conda/mamba internally to install dependencies. It uses Bokeh to plot the interactive dashboard.

mamba create -n metabench_env snakemake genome_updater pandas "bokeh==2.4.3"
source activate metabench_env
pip install randomname
git clone https://github.com/pirovc/metabench.git
cd metabench

Usage example

Build

Downloading a small reference set for the build with genome_updater:

genome_updater.sh -d refseq -g bacteria -c "reference genome" -f "genomic.fna.gz" -o example/bac_rs -b refgen -t 8 -a

Create config/build_test.yaml:

workdir: "example/build/"
threads: 8
repeat: 1

tools:
  ganon:
    "2.0.0": ""
  kmcp:
    "0.9.4": ""

dbs:
  "bac_rs_refgen":
    folder: "../bac_rs/refgen/files/"
    extension: ".fna.gz"
    taxonomy: "ncbi"
    taxonomy_files: "../bac_rs/refgen/taxdump.tar.gz"
    assembly_summary: "../bac_rs/refgen/assembly_summary.txt"

run:
   ganon:
     "2.0.0":
       bac_rs_refgen:
         fixed_args:
           "--ncbi-file-info": "../bac_rs/refgen/assembly_summary.txt"
         args:
           "--max-fp": [0.0001, ""]
   kmcp:
     "0.9.4":
       bac_rs_refgen:
         fixed_args:
         args:

In the example above MetaBench is set to build databases for 2 tools (ganon and kmcp). kmcp will run with default parameters only (no args:) and ganon will run with --max-fp 0.0001 and default parameters.

Verify run with --dry-run:

snakemake -s metabench/build.smk --configfile config/build_test.yaml --cores 8 --use-conda --dry-run

Run it:

snakemake -s metabench/build.smk --configfile config/build_test.yaml --cores 8 --use-conda

If everything finished correctly, the following files will be created:

<details> <summary>Files</summary>

$ tree -A example/build/

example/build/
├── ganon
│   └── 2.0.0
│       └── bac_rs_refgen
│           ├── default
│           │   ├── ganon_db.hibf
│           │   ├── ganon_db.ibf -> ganon_db.hibf
│           │   └── ganon_db.tax
│           ├── default.build.bench.json
│           ├── default.build.bench.tsv
│           ├── default.build.log
│           ├── default.build.size.tsv
│           ├── --max-fp=0.0001
│           │   ├── ganon_db.hibf
│           │   ├── ganon_db.ibf -> ganon_db.hibf
│           │   └── ganon_db.tax
│           ├── --max-fp=0.0001.build.bench.json
│           ├── --max-fp=0.0001.build.bench.tsv
│           ├── --max-fp=0.0001.build.log
│           └── --max-fp=0.0001.build.size.tsv
└── kmcp
    └── 0.9.4
        └── bac_rs_refgen
            ├── default
            │   └── kmcp_db
            │       ├── name.map
            │       ├── R001
            │       │   ├── _block001.uniki
            │       │   ├── _block002.uniki
            │       │   ├── __db.yml
            │       │   └── __name_mapping.tsv
            │       ├── taxid.map
            │       └── taxonomy
            │           ├── citations.dmp
            │           ├── delnodes.dmp
            │           ├── division.dmp
            │           ├── gc.prt
            │           ├── gencode.dmp
            │           ├── images.dmp
            │           ├── merged.dmp
            │           ├── names.dmp
            │           ├── nodes.dmp
            │           └── readme.txt
            ├── default.build.bench.json
            ├── default.build.bench.tsv
            ├── default.build.log
            └── default.build.size.tsv

</details>

*.build.bench.json contains the standardized metrics in JSON format. If repeat > 1 in the config file, only the fastest run is selected.
*.build.bench.tsv contains the raw benchmark metrics from Snakemake. If repeat > 1 in the config file, one line for each run will be reported.
*.build.log contains the STDOUT and STDERR from the run.
*.build.size.tsv contains the size in bytes for the mandatory database files (du --bytes).

Obs: note that if no arguments are used in args: section of the configuration, the database folder/files will be named default. If parameters are used, databases are created based on them (--max-fp 0.0001 -> --max-fp=0.0001, if more than one, connected by underscore "_"). Any information provided in fixed_args: is not accounted for file/folder names.

Check the config/build_example.yaml for more examples on how to use the configuration file. Multiple databases, range of parameters and others can be configured to be executed in the same run.

Classify (binning + profiling)

Classification includes both binning and profiling procedures. It requires databases (as created in the build process above) and one or more samples with single or paired fastq files.

Create config/classify_test.yaml:

workdir: "example/classify/"
threads: 8
repeat: 1

tools:
  ganon:
    "2.0.0": ""
  kmcp:
    "0.9.4": ""

samples:
  "mende.10species.10K":
    fq1: "../../files/illumina_10species.10K.1.fq.gz"
    fq2: "../../files/illumina_10species.10K.2.fq.gz"

run:
  ganon:
    "2.0.0":
      dbs: 
        "bac_rs_refgen": "../../example/build/ganon/2.0.0/bac_rs_refgen/"
      fixed_args:
      binning_args:
        "--rel-cutoff": [0.25, 0.8]
      profiling_args:

  kmcp:
    "0.9.4":
      dbs: 
        "bac_rs_refgen": "../../example/build/kmcp/0.9.4/bac_rs_refgen/"
      fixed_args:
      binning_args:
      profiling_args:

Verify run with --dry-run:

snakemake -s metabench/classify.smk --configfile config/classify_test.yaml --cores 8 --use-conda --dry-run

Run it:

snakemake -s metabench/classify.smk --configfile config/classify_test.yaml --cores 8 --use-conda

If everything finished correctly, the following files will be created:

<details> <summary>Files</summary>

$ tree -A example/classify/

example/classify/
├── ganon
│   └── 2.0.0
│       └── mende.10species.10K
│           └── bac_rs_refgen
│               ├── default
│               │   ├── --rel-cutoff=0.25
│               │   │   ├── default.profiling.bench.json
│               │   │   ├── default.profiling.bench.tsv
│               │   │   ├── default.profiling.bioboxes.gz
│               │   │   └── default.profiling.log
│               │   ├── --rel-cutoff=0.25.binning.bench.json
│               │   ├── --rel-cutoff=0.25.binning.bench.tsv
│               │   ├── --rel-cutoff=0.25.binning.bioboxes.gz
│               │   ├── --rel-cutoff=0.25.binning.log
│               │   ├── --rel-cutoff=0.25.rep
│               │   ├── --rel-cutoff=0.8
│               │   │   ├── default.profiling.bench.json
│               │   │   ├── default.profiling.bench.tsv
│               │   │   ├── default.profiling.bioboxes.gz
│               │   │   └── default.profiling.log
│               │   ├── --rel-cutoff=0.8.binning.bench.json
│               │   ├── --rel-cutoff=0.8.binning.bench.tsv
│               │   ├── --rel-cutoff=0.8.binning.bioboxes.gz
│               │   ├── --rel-cutoff=0.8.binning.log
│               │   └── --rel-cutoff=0.8.rep
│               └── --max-fp=0.0001
│                   ├── --rel-cutoff=0.25
│                   │   ├── default.profiling.bench.json
│                   │   ├── default.profiling.bench.tsv
│                   │   ├── default.profiling.bioboxes.gz
│                   │   └── default.profiling.log
│                   ├── --rel-cutoff=0.25.binning.bench.json
│                   ├── --rel-cutoff=0.25.binning.bench.tsv
│                   ├── --rel-cutoff=0.25.binning.bioboxes.gz
│                   ├── --rel-cutoff=0.25.binning.log
│                   ├── --rel-cutoff=0.25.rep
│                   ├── --rel-cutoff=0.8
│                   │   ├── default.profiling.bench.json
│                   │   ├── default.profiling.bench.tsv
│                   │   ├── default.profiling.bioboxes.gz
│                   │   └── default.profiling.log
│                   ├── --rel-cutoff=0.8.binning.bench.json
│                   ├── --rel-cutoff=0.8.binning.bench.tsv
│                   ├── --rel-cutoff=0.8.binning.bioboxes.gz
│                   ├── --rel-cutoff=0.8.binning.log
│                   └── --rel-cutoff=0.8.rep
└── kmcp
    └── 0.9.4
        └── mende.10species.10K
            └── bac_rs_refgen
                └── default
                    ├── default
                    │   ├── default.profiling.bench.json
                    │   ├── default.profiling.bench.tsv
                    │   ├── default.profiling.bioboxes.gz
                    │   └── default.profiling.log
                    ├── default.binning.bench.json
                    ├── default.binning.bench.tsv
                    ├── default.binning.bioboxes.gz
                    └── default.binning.log

</details>

*.profiling.bioboxes.gz contains the standardized profiling output in bioboxes format.
*.binning.bioboxes.gz contains the standardized binning output in bioboxes format.
*.bench.json contains the runtime metrics in JSON format. If repeat > 1 in the config file, only the fastest run is selected. This file wi

Metabench

Install / Use

README

MetaBench

Installation and requirements

Usage example

Build

Classify (binning + profiling)

Related Skills