# 🌫️ MIST: Metabolite Inference with Spectrum Transformers
This repository provides implementations and code examples for Metabolite Inference with Spectrum Transformers (MIST). MIST models can be used to predict molecular fingerprints from tandem mass spectrometry data and, when trained in a contrastive learning framework, enable embedding and structure annotation by database lookup. Rather than directly embedding binned spectra, MIST applies a transformer architecture to encode and learn to represent collections of chemical formulae. MIST has also since been extended to predict precursor chemical formulae as MIST-CF.
Samuel Goldman, Jeremy Wohlwend, Martin Strazar, Guy Haroush, Ramnik J. Xavier, Connor W. Coley
Update: This branch provides an updated version of the MIST method for increased usability and developability. See the change log for specific details.

## Table of Contents

1. [Install & setup](#setup)
2. [Quick start](#quickstart)
3. [Data](#data)
4. [Training models](#training)

## Install & setup <a name="setup"></a>
After cloning the repository with git, the environment and package can be installed. Note that the provided environment attempts to use CUDA 11.1; if you do not plan to use GPU support, comment that line out in environment.yml before running the commands below. We strongly recommend replacing conda with mamba for a faster install (e.g., mamba env create -f environment.yml).
```
conda env create -f environment.yml
conda activate ms-gen
pip install -r requirements.txt
python setup.py develop
```
This environment was tested on Ubuntu 20.04.1 with CUDA version 11.4. It takes roughly 10 minutes to install using mamba.
## Quick start <a name="quickstart"></a>
After creating a python environment, pretrained models can be used to:
- Predict fingerprints from spectra (`quickstart/model_predictions/fp_preds/`)
- Annotate spectra by ranking candidates in a reference smiles list (`quickstart/model_predictions/retrieval/`)
- Embed spectra into a dense continuous space (`quickstart/model_predictions/contrastive_embed/`)
To showcase these capabilities, we include an MGF file, quickstart/quickstart.mgf (a sample from the Mills et al. data), along with a set of sample smiles quickstart/lookup_smiles.txt.
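For readers unfamiliar with the MGF format used by quickstart/quickstart.mgf: each spectrum sits between BEGIN IONS / END IONS blocks, with KEY=VALUE metadata lines followed by "m/z intensity" peak pairs. The following is a minimal, hypothetical parser sketch using only the standard library (not the repository's own loading code; the example spectrum is invented):

```python
def parse_mgf(text: str) -> list:
    """Parse MGF text into a list of {"meta": ..., "peaks": ...} spectra."""
    spectra, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            current = {"meta": {}, "peaks": []}
        elif line == "END IONS":
            spectra.append(current)
            current = None
        elif current is not None and line:
            if "=" in line and not line[0].isdigit():
                # Metadata line, e.g. PEPMASS=180.0634
                key, _, val = line.partition("=")
                current["meta"][key] = val
            else:
                # Peak line: m/z and intensity separated by whitespace
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra

# Invented example spectrum for illustration
example = """BEGIN IONS
PEPMASS=180.0634
CHARGE=1+
83.0491 12.5
163.0390 100.0
END IONS"""

spectra = parse_mgf(example)
```

Each parsed spectrum then exposes its metadata (`spectra[0]["meta"]["PEPMASS"]`) and peak list for downstream featurization.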
```
conda activate ms-gen
. quickstart/00_download_models.sh
. quickstart/01_run_models.sh
```
Output predictions can be found in quickstart/model_predictions and are included by default with the repository. We provide an additional notebook notebooks/mist_demo.ipynb that shows these calls programmatically, rather than in the command line.
## Data <a name="data"></a>
Training models requires the use of paired mass spectra data and unpaired libraries of molecules as annotation candidates.
### Downloading and preparing paired datasets
We utilize two datasets to train models:
- csi2022: H+ spectra from GNPS, NIST, MONA, and others, kindly provided by Kai Duhrkop from the SIRIUS and CSI:FingerID team. Most of our benchmarking is done on this dataset.
- canopus_train: Public data extracted from GNPS and prepared by the 2021 CANOPUS methods paper. This has since been renamed "NPLIB1" in our subsequent papers.
Each paired spectra dataset will have the following standardized folders and components, living under a single dataset folder:
- labels.tsv: A file containing the columns ["dataset", "spec", "name", "ionization", "formula", "smiles", "inchikey", "instrument"], where "smiles" corresponds to an achiral version of the smiles string.
- spec_files: A directory containing each .ms file in the dataset
- subformulae: Outputs of a subformula labeling program run on the corresponding .ms directory
- magma_outputs: Outputs of a MAGMa program run on the corresponding spec files directory
- splits: A directory containing all splits, each stored as a table with two columns: name and category (train, val, test, or exclude)
- retrieval_hdf: Folder to hold hdf files used for retrieval and contrastive model training. Note we construct these with relevant isomers for the dataset.
- [optional] prev_results: Folder to hold any previous results on the dataset if benchmarked by another author
- [optional] data augmentation: Model training also makes use of simulated spectra from a forward model. After training these forward models, we store the relevant predictions inside separate folders here.
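As a concrete illustration of the labels.tsv layout described above, here is a minimal sketch of indexing such a file by spectrum id with the standard library. The column names follow the list above; the sample row (a caffeine-like entry) is invented for illustration:

```python
import csv
import io

# Hypothetical one-row labels file with the standardized columns listed above.
LABELS_TSV = (
    "dataset\tspec\tname\tionization\tformula\tsmiles\tinchikey\tinstrument\n"
    "canopus_train\tCCMSLIB001\tcaffeine\t[M+H]+\tC8H10N4O2\t"
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C\tRYYVLZVUVIJVGH\tOrbitrap\n"
)

def load_labels(fh) -> dict:
    """Index label rows by their spectrum id ("spec") for quick lookup."""
    return {row["spec"]: row for row in csv.DictReader(fh, delimiter="\t")}

labels = load_labels(io.StringIO(LABELS_TSV))
```

In practice the same function can be pointed at data/paired_spectra/canopus_train/labels.tsv once the dataset has been downloaded.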
We are not able to redistribute the CSI2022 dataset. The canopus_train dataset (including split changes) can be downloaded and prepared for minimal model execution:
```
. data_processing/canopus_train/00_download_canopus_data.sh
```
We intentionally do not include the retrieval HDF file in the data download, as the retrieval file is large (>5 GB). This can be re-made by following the instructions below to process PubChem (or one of the other unpaired libraries), then running python data_processing/canopus_train/03_retrieval_hdf.py. The full data processing pipeline used to prepare the relevant files can be found in data_processing/canopus_train/ (i.e., subformulae assignment, MAGMa execution, retrieval and contrastive dataframe construction, subsetting of smiles to be used in augmentation, and assigning subformulae to the provided augmented mgf).
### Unpaired molecules
We consider processing three example datasets to be used as unpaired molecules: biomols, a dataset of biologically relevant molecules prepared by Duhrkop et al. for the CANOPUS manuscript; hmdb, the Human Metabolome Database; and pubchem, the most complete dataset of molecules. Instructions for downloading and processing each of these can be found in data_processing/mol_libraries/.
MIST uses these databases of molecules (without spectra) in two ways:
- Data augmentation: To train our models, we utilize an auxiliary forward molecule-to-spectrum model to add training examples to the dataset. The primary requirements are that these augmented spectra are provided as a labels file and an mgf file. We provide an example of this in `data/paired_spectra/canopus_train/aug_iceberg_canopus_train/`. See the ms-pred github repository for details on training a model and exporting an mgf. See `data_processing/canopus_train/04_subset_smis.sh` for how we subsetted the biomolecules dataset to create labels for the ms-pred prediction and `data_processing/canopus_train/05_buid_aug_mgf.sh` for how we process the resulting mgf into subformulae assignments after export.
- Retrieval libraries: A second use for these libraries is to build retrieval databases or contrastive decoys. See `data_processing/canopus_train/03_retrieval_hdf.py` for call signatures to construct both of these, after creating a mapping of chemical formula to smiles (e.g., `data_processing/mol_libraries/pubchem/02_make_formula_subsets.sh`).
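Conceptually, formula-restricted retrieval works by limiting candidates to those sharing the query's chemical formula and then ranking them by fingerprint similarity. The toy sketch below illustrates the idea with Tanimoto similarity over hand-made bit sets; it is not the repository's HDF-based retrieval code, and all names and data are invented:

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit indices."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Invented formula -> [(smiles, fingerprint)] candidate database.
candidate_db = {
    "C8H10N4O2": [
        ("smiles_A", {1, 2, 3, 8}),
        ("smiles_B", {1, 2, 3, 4}),
        ("smiles_C", {9, 10}),
    ],
}

def retrieve(formula: str, predicted_fp: set, k: int = 3) -> list:
    """Rank candidates sharing `formula` by similarity to a predicted fingerprint."""
    cands = candidate_db.get(formula, [])
    ranked = sorted(cands, key=lambda c: tanimoto(predicted_fp, c[1]), reverse=True)
    return [smiles for smiles, _ in ranked[:k]]

# A predicted fingerprint most similar to smiles_B's bits.
top = retrieve("C8H10N4O2", {1, 2, 3, 4, 5})
```

Restricting the candidate pool by formula first is what makes the precomputed formula-to-smiles mappings above useful: the expensive similarity ranking only runs over isomers of the precursor.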
## Training models <a name="training"></a>
After downloading the canopus_train dataset, the following two commands demonstrate how to train models that can be used (as illustrated in the quickstart). The config files specify the exact parameters used in experiments as reported in the paper.
MIST Fingerprint model:
```
CUDA_VISIBLE_DEVICES=0 python src/mist/train_mist.py \
--cache-featurizers \
--labels-file 'data/paired_spectra/canopus_train/labels.tsv' \
--subform-folder 'data/paired_spectra/canopus_train/subformulae/subformulae_default/' \
--spec-folder 'data/paired_spectra/canopus_train/spec_files/' \
--magma-folder 'data/paired_spectra/canopus_train/magma_outputs/magma_tsv/' \
--fp-names morgan4096 \
--num-workers 16 \
--seed 1 \
--gpus 1 \
--augment-data \
--batch-size 128 \
--iterative-preds 'growing' \
--iterative-loss-weight 0.4 \
--learning-rate 0.00077 \
--weight-decay 1e-07 \
--lr-decay-frac 0.9 \
--hidden-size 256 \
--pairwise-featurization \
--peak-attn-layers 2 \
--refine-layers 4 \
--spectra-dropout 0.1 \
--magma-aux-loss \
--magma-loss-lambda 8 \
--magma-modulo 512 \
--split-file 'data/paired_spectra/canopus_train/splits/canopus_hplus_100_0.tsv' \
--forward-labels 'data/paired_spectra/canopus_train/aug_iceberg_canopus_train/biomols_filtered_smiles_canopus_train_labels.tsv' \
--forward-aug-folder 'data/paired_spectra/canopus_train/aug_iceberg_canopus_train/canopus_hplus_100_0/subforms/' \
--frac-orig 0.6 \
--form-embedder 'pos-cos' \
--no-diffs \
--save-dir results/canopus_fp_mist/split_0
```
Contrastive model:
```
CUDA_VISIBLE_DEVICES=0 python src/mist/train_contrastive.py \
--seed 1 \
--labels-file 'data/paired_spectra/canopus_train/labels.tsv' \
--subform-folder 'data/paired_spectra/canopus_train/subformulae/subformulae_default/' \
--spec-folder 'data/paired_spectra/canopus_train/spec_files/' \
--magma-folder 'data/paired_spectra/canopus_train/magma_outputs/' \
--hdf-file 'data/paired_spectra/canopus_train/retrieval_hdf/intpubchem_with_morgan4096_retrieval_db_contrast.h5' \
--augment-data \
--contrastive-weight 0.6 \
--contrastive-scale 16 \
--num-decoys 64 \
--max-db-decoys 256 \
--decoy-norm-exp 4 \
--negative-strategy 'hardisomer_tani_pickled' \
--dist-name 'cosine' \
--learning-rate 0.00057 \
--weight-decay 1e-07 \
--scheduler \
--lr-decay-frac 0.7138
```
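At lookup time, a contrastive model of this kind annotates a spectrum by embedding it and its candidate structures into a shared space and picking the nearest candidate under the training distance (here cosine, per --dist-name 'cosine'). A toy sketch of that nearest-neighbor step in pure Python; the embeddings and candidate names are invented placeholders, not model outputs:

```python
import math

def cosine_distance(u, v) -> float:
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Invented spectrum embedding and candidate embeddings.
spec_embedding = [0.9, 0.1, 0.0]
candidates = {
    "candidate_A": [1.0, 0.0, 0.0],
    "candidate_B": [0.0, 1.0, 0.0],
}

# Annotation = nearest candidate under cosine distance.
best = min(candidates, key=lambda name: cosine_distance(spec_embedding, candidates[name]))
```

The num-decoys/negative-strategy flags above control which hard negatives the model is pushed away from during training so that this distance is discriminative at lookup time.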