MGM
MGM (Microbial General Model) as a large-scaled pretrained language model for interpretable microbiome data analysis.
Install / Use
/learn @HUST-NingKang-Lab/MGMREADME
English| 中文
MGM
🎊 MGM has been published in Advanced Science .
Microbial General Model (MGM) is a large-scale pretrained language model designed for interpretable microbiome data analysis. MGM supports a variety of tasks, including data preparation, model training, and inference, making it a versatile tool for microbiome research.

Installation
From PyPI
Install MGM using pip:
pip install microformer-mgm
From Source
Install MGM from the source code:
python setup.py install
MicroCorpus-260K
The MicroCorpus-260K dataset includes 263,302 microbiome samples sourced from MGnify, ideal for training your own MGM model. It is available for download on OneDrive. The dataset includes:
MicroCorpus-260K.pkl: Normalized microbiome corpus (mean and standard deviation across all samples).MicroCorpus-260K_unnorm.pkl: Unnormalized microbiome corpus.mgnify_biomes.csv: Metadata for the samples in the dataset.
Loading MicroCorpus-260K
Load the dataset in Python:
from pickle import load
corpus = load(open('MicroCorpus-260K.pkl', 'rb'))
corpus[0] # Access the first sample (dict with input_ids and attention_mask)
abundance = corpus.data # Access the abundance data
Example Notebook
The MGM_Interpretability_Guideline.ipynb Jupyter Notebook provides a comprehensive guide for using MGM to perform interpretable microbiome analysis. It demonstrates how to:
- Fine-tune MGM on an infant microbiome dataset.
- Extract sample embeddings for visualization.
- Compute attention weights to identify keystone microbial genera.
This notebook is ideal for researchers seeking to understand MGM's interpretability features and apply them to their own datasets.
CLI Usage
MGM is accessed via a command-line interface (CLI) with various modes. The general syntax is:
mgm <mode> [options]
Below, the modes are grouped into Data Preparation, Model Training, and Inference for better organization.
Data Preparation
construct
Converts abundance data into a microbiome corpus, normalized using phylogeny, and ranked from high to low genus abundance.
- Input: Abundance data in
hdf5,csv, ortsvformat (features in rows, samples in columns) - Output: A
.pklfile containing the microbiome corpus
Example:
mgm construct -i data/abundance.csv -o data/corpus.pkl
Note: For
hdf5files, use-kto specify the key (default isgenus).
Model Training
pretrain
Pretrains the MGM model using causal language modeling on a microbiome corpus. Optionally, trains a generator with labeled data.
- Input:
- Microbiome corpus (
.pkl) - Optional: Label file (
.csv, two columns: sample ID and label)
- Microbiome corpus (
- Output: Pretrained MGM model
Examples:
mgm pretrain -i data/corpus.pkl -o models/pretrained_model
mgm pretrain -i data/corpus.pkl -l data/labels.csv -o models/generator_model --with-label
Note: Use
--from-scratchto train from scratch instead of loading pretrained weights. If a label file is provided, the tokenizer and model embedding layer are updated.
train
Trains a supervised MGM model from scratch using labeled data.
- Input:
- Microbiome corpus (
.pkl) - Label file (
.csv, two columns: sample ID and label)
- Microbiome corpus (
- Output: Supervised MGM model
Example:
mgm train -i data/corpus.pkl -l data/labels.csv -o models/supervised_model
finetune
Finetunes a pretrained MGM model for a specific task using labeled data.
- Input:
- Microbiome corpus (
.pkl) - Label file (
.csv, two columns: sample ID and label) - Optional: Pretrained model (defaults to MicroCorpus-260K pretrained model if not specified)
- Microbiome corpus (
- Output: Finetuned MGM model
Example:
mgm finetune -i data/corpus.pkl -l data/labels.csv -m models/pretrained_model -o models/finetuned_model
Inference
predict
Generates predictions using a finetuned MGM model. Optionally evaluates against ground truth labels.
- Input:
- Microbiome corpus (
.pkl) - Optional: Label file (
.csv) for evaluation - Supervised MGM model
- Microbiome corpus (
- Output: Prediction results (
.csv)
Example:
mgm predict -E -i data/corpus.pkl -l data/labels.csv -m models/finetuned_model -o data/predictions.csv
Note: Use
-Ewith a label file to compare predictions with ground truth.
generate
Generates synthetic microbiome data using a pretrained MGM model.
- Input:
- Pretrained MGM model
- Optional: Prompt file (
.txt, one label per line) for labeled generation
- Output: Synthetic genus tensors (
.pkl)
Example:
mgm generate -m models/generator_model -p data/prompt.txt -n 100 -o data/synthetic.pkl
Note: Use
-nto specify the number of samples to generate.
reconstruct
Reconstructs abundance data from a ranked corpus, with optional training of a reconstructor model or label decoding.
- Input:
- Abundance file (e.g.,
csv) for training the reconstructor, or a trained model checkpoint. - Ranked corpus (
.pkl) for reconstruction - Optional: Generator model and prompt file (text, one label per line) for labeled data
- Abundance file (e.g.,
- Output:
- Reconstructed corpus (
.pkl) - Reconstructor model (if training)
- Decoded labels (if applicable)
- Reconstructed corpus (
Examples:
mgm reconstruct -a data/abundance.csv -i data/synthetic.pkl -g models/generator_model -w True -o data/reconstructor
mgm reconstruct -r data/reconstructor_model.ckpt -i data/synthetic.pkl -g models/generator_model -w True -o data/reconstructed
For more details on any mode, run:
mgm <mode> --help
Maintainers
| Name | Email | Organization | |---------------|---------------------------|----------------------------------------------------------| | Haohong Zhang | haohongzh@gmail.com | PhD Student, School of Life Science and Technology, HUST | | Zixin Kang | 29590kang@gmail.com | Undergraduate, School of Life Science and Technology, HUST | | Kang Ning | ningkang@hust.edu.cn | Professor, School of Life Science and Technology, HUST |
