MetaCerberus
Python code for versatile Functional Ontology Assignments for Metagenomes, searching via Hidden Markov Models (HMMs) with an environmental focus on shotgun meta-omics data
Welcome to MetaCerberus
Check out our MetaCerberus ReadTheDocs Documentation and Tutorial!
About
MetaCerberus transforms raw sequencing data (i.e., genomic, transcriptomic, metagenomic, and metatranscriptomic) into knowledge. It is start-to-finish Python code for versatile analysis with the Functional Ontology Assignments for Metagenomes (FOAM), KEGG, CAZy/dbCAN, VOG, pVOG, PHROG, COG, and a variety of other databases, including user-customized databases. It uses Hidden Markov Models (HMMs) for functional annotation, enabling complete metabolic analysis across the tree of life (i.e., bacteria, archaea, phage, viruses, eukaryotes, and whole ecosystems). MetaCerberus also provides automatic differential statistics using DESeq2/EdgeR, pathway enrichment with GAGE, and pathway visualization with Pathview R.
Art by Andra Buchan
Installing MetaCerberus
Option 1) Mamba
- Mamba install from bioconda with all dependencies:
Linux/OSX-64
- Install mamba using conda
conda install conda-forge::mamba
- NOTE: Make sure you install mamba in your base conda environment unless you have OSX with ARM architecture (M1/M2 Macs). Follow the OSX-ARM instructions below if you have a Mac with ARM architecture.
- Install MetaCerberus with mamba
mamba create -n metacerberus -c conda-forge -c bioconda metacerberus
conda activate metacerberus
metacerberus.py --setup
metacerberus.py --download
OSX-ARM (M1/M2)
- Set up conda environment
conda create -y -n metacerberus
conda activate metacerberus
conda config --env --set subdir osx-64
- Install mamba, python, and pydantic inside the environment
conda install -y -c conda-forge mamba python=3.10 "pydantic<2"
- Install MetaCerberus with mamba
mamba install -y -c conda-forge -c bioconda metacerberus
metacerberus.py --setup
metacerberus.py --download
- NOTE: Mamba is the fastest installer; Anaconda or Miniconda can be slow. Install mamba with conda, not pip: the pip version of mamba does not work for this install.
Option 2) Anaconda - Linux/OSX-64 Only
- Anaconda install from bioconda with all dependencies:
conda create -n metacerberus -c conda-forge -c bioconda metacerberus -y
conda activate metacerberus
metacerberus.py --setup
metacerberus.py --download
Option 3) Manual with conda/mamba from Github
git clone https://github.com/raw-lab/MetaCerberus.git
cd MetaCerberus
bash install_metacerberus.sh
conda activate MetaCerberus
metacerberus.py --download
MetaCerberus Lite
We also provide a lite version of MetaCerberus on Anaconda that depends only on the most basic dependencies.
This can make it faster and easier to install, as it is less likely to conflict with other dependencies on the system.
To install the "lite" version, use "metacerberus-lite" instead of "metacerberus" from Bioconda, following the details listed above.
mamba create -n metacerberus -c conda-forge -c bioconda metacerberus-lite
conda activate metacerberus
metacerberus.py --setup
metacerberus.py --download
Additional dependencies such as fastqc and fastp can be installed in the environment manually if desired for those steps in the pipeline.
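Before running the QC steps with the lite install, it can help to confirm that the optional tools are actually visible in the active environment. The following is a minimal sketch, not part of MetaCerberus: the `find_tools` helper and the `OPTIONAL_TOOLS` list are hypothetical names, and the list contains only the two tools named above.

```python
import shutil

# Optional QC tools named in this README; this list is illustrative,
# not the full set of executables MetaCerberus may call.
OPTIONAL_TOOLS = ["fastqc", "fastp"]


def find_tools(names):
    """Map each executable name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(OPTIONAL_TOOLS).items():
        print(f"{name}: {path or 'not found'}")
```

Run it inside the activated `metacerberus` environment; any tool reported as not found can then be installed manually.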
Brief Overview
<p align="center"> <img src="https://raw.githubusercontent.com/raw-lab/MetaCerberus/main/img/workflow.jpg" alt="MetaCerberus Workflow" height=600> </p>

MetaCerberus has three basic modes: quality control (QC) for raw reads, formatting/gene prediction, and annotation.
- MetaCerberus can use three different types of input file: 1) raw read data from any sequencing platform (Illumina, PacBio, or Oxford Nanopore), 2) assembled contigs, as MAGs, vMAGs, isolate genomes, or a collection of contigs, 3) amino acid fasta (.faa), previously called pORFs.
- We offer customization, including running all databases together, individually or specifying select databases. For example, if a user wants to run prokaryotic or eukaryotic-specific KOfams, or an individual database alone such as dbCAN, both are easily customized within MetaCerberus.
- In QC mode, raw reads are quality checked with FastQC both before and after trimming. Reads are trimmed according to data type: fastp is called for Illumina or PacBio data; otherwise the data is assumed to be Oxford Nanopore and Porechop is used.
- If Illumina reads are used, an optional bbmap step is available to remove the phiX174 genome or a user-provided contaminant genome. Phage phiX174 is a common contaminant on the Illumina platform because it is the library spike-in control. We highly recommend this removal when viral analysis is conducted, as phiX174 reads would otherwise give false positives for ssDNA microviruses within a sample.
- We include a --skip_decon option to skip the phiX174 filtration, because the filtration may remove common k-mers that are shared with ssDNA phages.
- In the formatting and gene prediction stage, contigs and genomes are checked for N repeats. These N repeats are removed by default.
- We compute contig/genome statistics (e.g., N50, N90, max contig) via our custom module Metaome Stats.
- Contigs can be converted to pORFs using Prodigal, FragGeneScanRs, or Prodigal-gv, as specified by user preference.
- Scaffold annotation is not recommended due to N's providing ambiguous annotation.
- Both Prodigal and FragGeneScanRs can be used via our --super option, and we recommend using FragGeneScanRs for samples rich in eukaryotes.
- FragGeneScanRs found more ORFs and KOs than Prodigal for a simulated eukaryote-rich metagenome.
- HMMER searches the above databases using user-specified bitscore and e-value cutoffs or our minimum defaults (i.e., bitscore = 25, e-value = 1 x 10<sup>-9</sup>).
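Two of the steps in the list above are easy to illustrate: the contig statistics (N50, N90, max contig) and the default score cutoffs. The sketch below is not the Metaome Stats module or the MetaCerberus filtering code. The function names `nx` and `passes_cutoffs` are hypothetical, and it is an assumption that a hit must satisfy both cutoffs and that the comparisons are inclusive.

```python
def nx(lengths, fraction):
    """Return the Nx statistic for a list of contig lengths.

    Nx is the length L such that contigs of length >= L together cover
    at least `fraction` of the total assembly length (0.5 gives N50).
    """
    if not lengths:
        return 0
    target = sum(lengths) * fraction
    running = 0
    # Walk from the longest contig down until the target is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # not reached when all lengths are positive


def passes_cutoffs(bitscore, evalue, min_bitscore=25.0, max_evalue=1e-9):
    """Check one hit against the defaults quoted above.

    Assumption: both conditions must hold, and the bounds are inclusive.
    """
    return bitscore >= min_bitscore and evalue <= max_evalue


if __name__ == "__main__":
    contigs = [100, 80, 60, 40, 20]  # made-up lengths for illustration
    print("N50:", nx(contigs, 0.5))
    print("N90:", nx(contigs, 0.9))
    print("max contig:", max(contigs))
    print("hit kept:", passes_cutoffs(bitscore=31.2, evalue=3e-12))
```

For the made-up lengths above, the total is 300, so N50 is reached at the second-longest contig and N90 at the fourth.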
Input formats
- From any NextGen sequencing technology (Illumina, PacBio, Oxford Nanopore)
- type 1 raw reads (.fastq format)
- type 2 nucleotide fasta (.fasta, .fa, .fna, .ffn format): raw reads assembled into contigs
- type 3 protein fasta (.faa format): assembled contigs whose genes have been converted to amino acid sequences
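The three input types above map cleanly onto file extensions. The following sketch shows one way to classify a file by its extension; it is an illustration only, not how MetaCerberus itself detects input types. The `input_type` function is a hypothetical name, and compressed files (e.g., `.gz`) are deliberately not handled.

```python
from pathlib import Path

# Extensions exactly as listed in the input formats above.
READ_EXTS = {".fastq"}
NUCLEOTIDE_EXTS = {".fasta", ".fa", ".fna", ".ffn"}
PROTEIN_EXTS = {".faa"}


def input_type(filename):
    """Return 1 (raw reads), 2 (nucleotide fasta), or 3 (protein fasta)."""
    ext = Path(filename).suffix.lower()
    if ext in READ_EXTS:
        return 1
    if ext in NUCLEOTIDE_EXTS:
        return 2
    if ext in PROTEIN_EXTS:
        return 3
    raise ValueError(f"unrecognized input extension: {ext!r}")


if __name__ == "__main__":
    for name in ["sample.fastq", "contigs.fna", "proteins.faa"]:
        print(name, "-> type", input_type(name))
```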
Output Files
- If an output directory is given, that folder will be created where all files are stored.
- If no output directory is specified, the 'results_metacerberus' subfolder will be created in the current directory.
- Gage/Pathview R analysis provided as separate scripts within R.
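The output directory behavior described above (use the given folder, otherwise fall back to `results_metacerberus` in the current directory) can be sketched with `pathlib`. This is an illustration of that rule, not the MetaCerberus implementation; `resolve_output_dir` and its `base` parameter are hypothetical.

```python
from pathlib import Path

DEFAULT_DIRNAME = "results_metacerberus"  # default folder name from this README


def resolve_output_dir(user_dir=None, base=None):
    """Create and return the output folder.

    `user_dir` is the directory requested by the user, if any.
    `base` stands in for the current directory (hypothetical parameter,
    added so the fallback can be exercised without changing directories).
    """
    base_path = Path(base) if base is not None else Path.cwd()
    out = Path(user_dir) if user_dir else base_path / DEFAULT_DIRNAME
    out.mkdir(parents=True, exist_ok=True)
    return out
```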
Visualization of Outputs
- We use Plotly to visualize the data
- Once the program has run, the HTML reports with the visuals are saved in the output of the last step of the pipeline.
- The HTML files require plotly.js to be present. A copy is provided in the package and is saved to the report folder.
Annotation Rules
<p align="center"> <img src="https://raw.githubusercontent.com/raw-lab/MetaCerberus/main/img/Rules.jpg" alt="MetaCerberus Rules" height=600> </p>

- Rule 1 finds high-quality matches across databases. It is a score pre-filtering module for pORF thresholds: each pORF match to an HMM is recorded using the default or a user-selected cut-off (i.e., e-value/bit score), either per database independently, across all default databases (e.g., finding the best hit), or per user specification of the selected database.
- Rule 2 avoids missing genes that encode proteins with two non-overlapping domains. It is the non-overlapping dual domain module for pORF thresholds: if two HMM hits from the same database do not overlap, both are counted as long as they are within the default or user-selected score (i.e., e-value/bit score).
- Rule 3 ensures that overlapping dual domains are not missed. It is the dual independent overlapping domain module for convergent binary domain pORFs: if two domains within a pORF overlap by <10 amino acids (e.g., COG1 and COG4), both domains are counted and reported due to the dual domain issue within a single pORF. If a function
