MetaCerberus

Python code for versatile Functional Ontology Assignments for Metagenomes (FOAM) searching via Hidden Markov Models (HMMs), with an environmental focus on shotgun meta-omics data

Welcome to MetaCerberus

Check out our MetaCerberus ReadTheDocs Documentation and Tutorial!

About

MetaCerberus transforms raw sequencing data (i.e., genomic, transcriptomic, metagenomic, and metatranscriptomic) into knowledge. It is a start-to-finish Python pipeline for versatile functional annotation via Hidden Markov Models (HMMs) against the Functional Ontology Assignments for Metagenomes (FOAM), KEGG, CAZy/dbCAN, VOG, pVOG, PHROG, COG, and a variety of other databases, including user-customized databases, enabling complete metabolic analysis across the tree of life (i.e., bacteria, archaea, phage, viruses, eukaryotes, and whole ecosystems). MetaCerberus also provides automatic differential statistics using DESeq2/edgeR, pathway enrichment with GAGE, and pathway visualization with Pathview in R.

Art by Andra Buchan

Installing MetaCerberus

Option 1) Mamba

Mamba install from bioconda with all dependencies:

Linux/OSX-64

Install mamba using conda

conda install conda-forge::mamba

NOTE: Make sure you install mamba in your base conda environment unless you have OSX with ARM architecture (M1/M2 Macs). Follow the OSX-ARM instructions below if you have a Mac with ARM architecture.

Install MetaCerberus with mamba

mamba create -n metacerberus -c conda-forge -c bioconda metacerberus
conda activate metacerberus
metacerberus.py --setup
metacerberus.py --download

OSX-ARM (M1/M2)

Set up conda environment

conda create -y -n metacerberus
conda activate metacerberus
conda config --env --set subdir osx-64

Install mamba, python, and pydantic inside the environment

conda install -y -c conda-forge mamba python=3.10 "pydantic<2"

Install MetaCerberus with mamba

mamba install -y -c conda-forge -c bioconda metacerberus
metacerberus.py --setup
metacerberus.py --download

NOTE: Mamba is the fastest installer; Anaconda or Miniconda can be slow. Also, install mamba from conda, not from pip. The pip version of mamba does not work for this installation.

Option 2) Anaconda - Linux/OSX-64 Only

Anaconda install from bioconda with all dependencies:

conda create -n metacerberus -c conda-forge -c bioconda metacerberus -y
conda activate metacerberus
metacerberus.py --setup
metacerberus.py --download

Option 3) Manual with conda/mamba from GitHub

git clone https://github.com/raw-lab/MetaCerberus.git 
cd MetaCerberus
bash install_metacerberus.sh
conda activate MetaCerberus
metacerberus.py --download

MetaCerberus Lite

We also have a lite version of MetaCerberus on Anaconda that depends only on the most basic dependencies.
This can make it faster and easier to install, as it is less likely to conflict with other dependencies on the system.

To install the lite version, use "metacerberus-lite" instead of "metacerberus" from Bioconda, following the steps listed above.

mamba create -n metacerberus -c conda-forge -c bioconda metacerberus-lite
conda activate metacerberus
metacerberus.py --setup
metacerberus.py --download

Additional dependencies, such as FastQC and fastp, can be installed manually in the environment if those pipeline steps are needed.

Brief Overview

MetaCerberus has three basic modes: quality control (QC) for raw reads, formatting/gene prediction, and annotation.

MetaCerberus accepts three types of input: 1) raw read data from any sequencing platform (Illumina, PacBio, or Oxford Nanopore), 2) assembled contigs, such as MAGs, vMAGs, isolate genomes, or a collection of contigs, and 3) amino acid fasta (.faa), previously called pORFs.
We offer customization, including running all databases together, individually, or as a user-specified selection. For example, a user can run only the prokaryotic- or eukaryotic-specific KOfams, or a single database such as dbCAN; both are easily configured within MetaCerberus.
In QC mode, raw reads are quality-checked with FastQC both before and after trimming. Reads are trimmed according to data type: if the data are Illumina or PacBio, fastp is called; otherwise the data are assumed to be Oxford Nanopore and Porechop is used.
For Illumina reads, an optional bbmap step is available to remove the phiX174 genome or a user-provided contaminant genome. Phage phiX174 is a common contaminant on the Illumina platform because it is used as the library spike-in control. We highly recommend this removal if viral analysis is conducted, as phiX174 reads would otherwise produce false-positive hits to ssDNA microviruses within a sample.
We include a --skip_decon option to skip the phiX174 filtration, because the filtration may remove common k-mers that are shared with ssDNA phages.
In the formatting and gene prediction stage, contigs and genomes are checked for N repeats, which are removed by default.
Contig/genome statistics (e.g., N50, N90, max contig) are computed via our custom module, Metaome Stats.
Contigs can be converted to pORFs using Prodigal, FragGeneScanRs, or Prodigal-gv, as specified by user preference.
Scaffold annotation is not recommended because N's lead to ambiguous annotation.
Both Prodigal and FragGeneScanRs can be used together via our --super option; we recommend FragGeneScanRs for samples rich in eukaryotes.
FragGeneScanRs found more ORFs and KOs than Prodigal for a simulated eukaryote-rich metagenome.
HMMER searches against the above databases use user-specified bitscore and e-value cut-offs or our minimum defaults (i.e., bitscore = 25, e-value = 1 x 10^-9).
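The default cut-offs mentioned at the end of the overview can be illustrated with a small sketch. This is not MetaCerberus's own code: the Hit record and the filter_hits function are hypothetical names, and the sketch assumes a hit must satisfy both the bitscore and the e-value cut-off to be kept.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Minimum defaults quoted in the overview above.
DEFAULT_MIN_BITSCORE = 25.0
DEFAULT_MAX_EVALUE = 1e-9


@dataclass(frozen=True)
class Hit:
    """One pORF-to-HMM match (hypothetical record, not a MetaCerberus type)."""
    porf: str
    hmm: str
    bitscore: float
    evalue: float


def filter_hits(
    hits: Iterable[Hit],
    min_bitscore: float = DEFAULT_MIN_BITSCORE,
    max_evalue: float = DEFAULT_MAX_EVALUE,
) -> List[Hit]:
    """Keep hits that pass both cut-offs.

    Assumption: both conditions must hold; the source text does not
    spell out how the two thresholds are combined.
    """
    return [
        h for h in hits
        if h.bitscore >= min_bitscore and h.evalue <= max_evalue
    ]


if __name__ == "__main__":
    example = [
        Hit("orf1", "K00001", 80.2, 1e-20),  # passes both defaults
        Hit("orf2", "K00002", 20.0, 1e-12),  # bitscore too low
        Hit("orf3", "K00003", 30.0, 1e-5),   # e-value too high
    ]
    for hit in filter_hits(example):
        print(hit.porf, hit.hmm, hit.bitscore, hit.evalue)
```

Passing looser values for min_bitscore or max_evalue mimics a user-specified cut-off in place of the defaults.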

Input formats

Input can come from any next-generation sequencing technology (Illumina, PacBio, Oxford Nanopore):
type 1: raw reads (.fastq format)
type 2: nucleotide fasta (.fasta, .fa, .fna, .ffn format), raw reads assembled into contigs
type 3: protein fasta (.faa format), assembled contigs whose genes have been translated into amino acid sequences
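As a rough illustration of the three input types, the extension list above can be turned into a lookup. The function name classify_input and the returned labels are hypothetical and are not part of MetaCerberus; the sketch also assumes uncompressed files, so a name such as reads.fastq.gz is rejected.

```python
from pathlib import Path

# Extension-to-type mapping taken from the input format list above.
# The labels on the right are illustrative names, not MetaCerberus identifiers.
INPUT_TYPES = {
    ".fastq": "raw_reads",   # type 1
    ".fasta": "nucleotide",  # type 2
    ".fa": "nucleotide",
    ".fna": "nucleotide",
    ".ffn": "nucleotide",
    ".faa": "protein",       # type 3
}


def classify_input(path: str) -> str:
    """Return the input type implied by the file extension.

    Assumption: files are uncompressed, so only the final suffix is checked.
    Raises ValueError for extensions not in the list above.
    """
    suffix = Path(path).suffix.lower()
    if suffix not in INPUT_TYPES:
        raise ValueError(f"Unrecognized input extension: {suffix!r}")
    return INPUT_TYPES[suffix]


if __name__ == "__main__":
    for name in ("sample.fastq", "bins/MAG_01.fna", "proteins.faa"):
        print(name, "->", classify_input(name))
```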

Output Files

If an output directory is given, that folder will be created and all files will be stored in it.
If no output directory is specified, the 'results_metacerberus' subfolder will be created in the current directory.
GAGE/Pathview analysis is provided as separate R scripts.
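The output folder behaviour described above can be sketched in a few lines. The helper resolve_output_dir and its base parameter are hypothetical and only illustrate the rule; the default subfolder name is the one stated in the text.

```python
from pathlib import Path
from typing import Optional

# Default subfolder name as stated in the text above.
DEFAULT_SUBFOLDER = "results_metacerberus"


def resolve_output_dir(
    user_dir: Optional[str] = None,
    base: Optional[Path] = None,
) -> Path:
    """Pick and create the output folder (hypothetical helper).

    If user_dir is given it is used as-is; otherwise the default
    subfolder is created under base, which defaults to the current
    working directory.
    """
    base = base if base is not None else Path.cwd()
    out = Path(user_dir) if user_dir else base / DEFAULT_SUBFOLDER
    out.mkdir(parents=True, exist_ok=True)
    return out
```

Calling resolve_output_dir() with no arguments creates results_metacerberus in the current directory, while resolve_output_dir("my_run") creates and returns my_run instead.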

Visualization of Outputs

We use Plotly to visualize the data.
Once the program has run, the HTML reports with the visuals are saved in the folder of the last pipeline step.
The HTML files require plotly.js to be present. A copy is provided in the package and is saved to the report folder.

Annotation Rules

Rule 1 is for finding high-quality matches across databases. It is a score pre-filtering module for pORF thresholds: each pORF match to an HMM is recorded using the default or a user-selected cut-off (i.e., e-value/bit score), either per database independently, across all default databases (e.g., finding the best hit), or per the user's selection of databases.
Rule 2 is to avoid missing genes encoding proteins with dual domains that do not overlap. It is the non-overlapping dual-domain module for pORF thresholds: if two HMM hits from the same database are non-overlapping, both are counted as long as they are within the default or user-selected score (i.e., e-value/bit score).
Rule 3 is to ensure overlapping dual domains are not missed. This is the dual independent overlapping domain module for convergent binary domain pORFs. If two domains within a pORF overlap by <10 amino acids (e.g., COG1 and COG4), then both domains are counted and reported, due to the dual-domain issue within a single pORF. If a function
