
MS2Query - Reliable and fast MS/MS spectral-based analogue search

Overview

Please cite this article when using MS2Query. The publication can be found here: https://rdcu.be/c8Hkc

MS2Query uses MS2 mass spectral data to find the best match in a library and can search for both analogues and exact matches. A pretrained library for MS2Query, based on the GNPS library, is available. In our benchmarking, MS2Query performs better than current standards in the field such as the cosine score and the modified cosine score. MS2Query is easy to install (see below) and scales to large numbers of MS2 spectra.

Workflow

MS2Query is a tool for MSMS library matching, searching both for analogues and exact matches in one run. The workflow first uses MS2Deepscore to calculate spectral similarity scores between a query spectrum and all library spectra. Because the MS2Deepscore embeddings of the library spectra are pre-computed, this full-library comparison is very fast. The 2000 library spectra with the highest MS2Deepscore are selected. In contrast to other analogue search methods, no preselection on precursor m/z is performed. MS2Query then re-ranks these candidates with a random forest that combines 5 features, so that the best analogue or exact match ends up at the top. The random forest predicts a score between 0 and 1 for each pair of library and query spectrum, and the highest scoring library match is selected. By using a minimum threshold for this score, unreliable matches are filtered out.

For questions regarding MS2Query, please open an issue on GitHub or contact niek.dejonge@wur.nl
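The two stages described above (preselection on embedding similarity, then re-ranking with a score threshold) can be sketched in a few lines. This is only an illustration, not the MS2Query implementation: it assumes cosine similarity between embedding vectors, and the function names, the toy embeddings and the `toy_scores` lookup that stands in for the random forest are all hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def preselect(query: Sequence[float],
              library: Sequence[Sequence[float]],
              top_k: int = 2000) -> List[int]:
    """Return the indices of the top_k library embeddings most similar to the query.

    No filtering on precursor m/z happens here; every library entry is compared.
    """
    ranked = sorted(range(len(library)),
                    key=lambda i: cosine_similarity(query, library[i]),
                    reverse=True)
    return ranked[:top_k]


def best_match(candidates: Sequence[int],
               rerank_score: Callable[[int], float],
               min_score: float) -> Optional[Tuple[int, float]]:
    """Re-rank the candidates and return (index, score) of the best one.

    rerank_score is a stand-in for the random forest (hypothetical here).
    Returns None when even the best candidate scores below min_score.
    """
    if not candidates:
        return None
    scored = [(i, rerank_score(i)) for i in candidates]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= min_score else None


# Toy data: three library embeddings and one query embedding.
library = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query = [1.0, 0.1]
candidates = preselect(query, library, top_k=2)

# Made-up model scores per library index, replacing the random forest.
toy_scores = {0: 0.55, 1: 0.82, 2: 0.10}
match = best_match(candidates, toy_scores.get, min_score=0.7)
```

Note that the entry with the highest embedding similarity (index 0) is not the final match: the re-ranking step can promote a different candidate, which is the point of the second stage.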

Installation guide

Prepare environment

We recommend creating an Anaconda environment with

conda create --name ms2query python=3.9
conda activate ms2query

Pip install MS2Query

MS2Query can simply be installed by running:

pip install ms2query

All dependencies are installed automatically; they are listed in setup.py. The installation is expected to take about 2 minutes. MS2Query is tested by continuous integration on MacOS, Windows and Ubuntu for Python versions 3.9 and 3.10.

Run MS2Query from command line

Download default library

When running for the first time, a pretrained MS2Query library should be downloaded. Change the folder in the command below to the location where the library should be stored, and set --ionmode to the required ion mode (positive or negative).

ms2query --library .\folder_to_store_the_library --download --ionmode positive

Alternatively all model files can be manually downloaded from https://zenodo.org/record/6124552 for positive mode and https://zenodo.org/record/7104184 for negative mode.

Preprocessing mass spectra

MS2Query is run on all MS2 spectra in a spectrum file. MS2Query does not do any peak picking or clustering of similar MS2 spectra. If your files contain many MS2 spectra per feature it is advised to first reduce the number of MS2 spectra by clustering or feature selection. There are multiple tools available that do this. One reliable method is using MZMine for preprocessing, https://mzmine.github.io/mzmine_documentation/index.html. As input for MS2Query you can use the MGF file of the FBMN output of MZMine, see https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking-with-mzmine2/.
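To judge whether a file contains many MS2 spectra per feature, it can help to count them before running MS2Query. The sketch below is a minimal plain-text MGF scan and not part of MS2Query. It assumes that each spectrum sits between `BEGIN IONS` and `END IONS` lines and that the feature is stored under a `FEATURE_ID=` key, which may differ in your export; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def spectra_per_feature(mgf_lines: Iterable[str],
                        key: str = "FEATURE_ID") -> Counter:
    """Count MS2 spectra per feature in the lines of an MGF file.

    The key name is an assumption; spectra without that key are
    counted under None.
    """
    counts: Counter = Counter()
    inside = False
    feature = None
    for raw in mgf_lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            inside, feature = True, None
        elif line == "END IONS":
            if inside:
                counts[feature] += 1
            inside = False
        elif inside and line.upper().startswith(key + "="):
            feature = line.split("=", 1)[1]
    return counts


# Small in-memory example; for a real file, pass an open file handle:
#     with open("spectra.mgf") as handle:
#         counts = spectra_per_feature(handle)
example = [
    "BEGIN IONS", "FEATURE_ID=1", "PEPMASS=100.0", "50.0 10", "END IONS",
    "BEGIN IONS", "FEATURE_ID=1", "PEPMASS=100.0", "51.0 12", "END IONS",
    "BEGIN IONS", "FEATURE_ID=2", "PEPMASS=200.0", "60.0 5", "END IONS",
]
counts = spectra_per_feature(example)
```

If most features appear only once, the file is probably already reduced to one spectrum per feature.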

Running MS2Query

After downloading a default library, MS2Query can be run on your MS2 spectra. Run the command below and specify the location where your spectra are stored. If a spectrum file is specified, all spectra in this file will be processed. If a folder is specified, all spectrum files within this folder will be processed. The results generated by MS2Query are stored as csv files in a results directory within the same directory as your query spectra.

ms2query --spectra .\location_of_spectra --library .\library_folder --ionmode positive

To do a test run with dummy data you can download the file dummy_spectra.mgf. The expected results can be found in expected_results_dummy_data.csv. After downloading the library files, running on the dummy data is expected to take less than half a minute.

Run ms2query --help for more info/options, or see below:

usage: MS2Query [-h] [--spectra SPECTRA] --library LIBRARY_FOLDER [--ionmode {positive,negative}] [--download] [--results RESULTS] [--filter_ionmode]

MS2Query is a tool for MSMS library matching, searching both for analogues and exact matches in one run

optional arguments:
  -h, --help            show this help message and exit
  --spectra SPECTRA     The MS2 query spectra that should be processed. If a directory is specified all spectrum files in the directory will be processed. Accepted formats are: "mzML", "json", "mgf", "msp", "mzxml", "usi" or a
                        pickled matchms object
  --library LIBRARY_FOLDER
                        The directory containing the library spectra (in sqlite), models and precalculated embeddings, to download add --download
  --ionmode {positive,negative}
                        Specify the ionization mode used
  --download            This will download the most up to date model and library. The model will be stored in the folder given as the second argument. The model will be downloaded in the ionization mode specified under --ionmode
  --results RESULTS     The folder in which the results should be stored. The default is a new results folder in the folder with the spectra
  --filter_ionmode      Filter out all spectra that are not in the specified ion-mode. The ion mode can be specified by using --ionmode
  --additional_metadata Return additional metadata columns in the results, for example --additional_metadata retention_time feature_id
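When the command line tool is called from another program, the options listed above can be assembled as an argument list and passed to `subprocess`. The sketch below uses only flags shown in the help text; the helper names `build_ms2query_command` and `run_ms2query` are hypothetical and not part of MS2Query.

```python
import subprocess
from typing import List, Optional


def build_ms2query_command(library: str,
                           ionmode: str,
                           spectra: Optional[str] = None,
                           results: Optional[str] = None,
                           download: bool = False,
                           filter_ionmode: bool = False) -> List[str]:
    """Build the argument list for the ms2query command line tool."""
    if ionmode not in ("positive", "negative"):
        raise ValueError("ionmode must be 'positive' or 'negative'")
    command = ["ms2query", "--library", library, "--ionmode", ionmode]
    if spectra is not None:
        command += ["--spectra", spectra]
    if results is not None:
        command += ["--results", results]
    if download:
        command.append("--download")
    if filter_ionmode:
        command.append("--filter_ionmode")
    return command


def run_ms2query(**options) -> None:
    """Run ms2query and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ms2query_command(**options), check=True)


# Only builds the command; nothing is executed here.
command = build_ms2query_command(library="library_folder",
                                 ionmode="positive",
                                 spectra="location_of_spectra")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.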

Interpretation of results

The output is a csv file; an example can be found in expected_results_dummy_data.csv. For each of your input spectra MS2Query predicts a library match. It is important to check the ms2query_model_prediction column. This column contains a score between 0 and 1 that indicates how likely it is that the found match is a good match or analogue; the closer the score is to 1, the more likely. Use this score to select only the reliable hits, because a prediction is returned for every spectrum, no matter how low its score is. There is no strict minimum for this score, and a good threshold depends on your research goal. If high recall is important you might want a low threshold, and if high reliability is more important you might want a high threshold. As a general indication, scores > 0.7 contain many good analogues and exact matches. In the range of 0.6-0.7 the results can still be useful but should be analysed with more caution, and results below 0.6 can often best be discarded.

MS2Query does not need separate workflows for analogue search and exact match search; it automatically selects the most likely library spectrum. If it is important for your research question to separate potential exact matches from potential analogues, the column with the precursor mz difference can be used, since exact matches should have no precursor mz difference. The rightmost columns contain estimated molecular classes based on the molecular structure of the predicted library molecule. These columns can be used to get a quick overview of the kind of compounds that were found.
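The score ranges above translate directly into a small post-processing step on the results file. The sketch below uses only the standard library and the ms2query_model_prediction column named above. The function names and the label texts are hypothetical, and the cut-offs simply mirror the general indication given here, not a fixed rule.

```python
import csv
from typing import Dict, List

SCORE_COLUMN = "ms2query_model_prediction"


def load_results(csv_path: str) -> List[Dict[str, str]]:
    """Read an MS2Query results csv into a list of row dictionaries."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def confidence_label(score: float) -> str:
    """Map a model prediction to the rough categories described in the text."""
    if score > 0.7:
        return "reliable"
    if score >= 0.6:
        return "use with caution"
    return "discard"


def select_hits(rows: List[Dict[str, str]],
                threshold: float = 0.7) -> List[Dict[str, str]]:
    """Keep only the rows whose model prediction is above the threshold."""
    return [row for row in rows if float(row[SCORE_COLUMN]) > threshold]


# In-memory example rows; with a real file use load_results("results.csv").
rows = [
    {SCORE_COLUMN: "0.91"},
    {SCORE_COLUMN: "0.65"},
    {SCORE_COLUMN: "0.30"},
]
labels = [confidence_label(float(row[SCORE_COLUMN])) for row in rows]
hits = select_hits(rows)
```

Lowering `threshold` trades reliability for recall, as discussed above.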
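Separating exact matches from analogues on the precursor mz difference can be sketched as follows. The column name `precursor_mz_difference` and the tolerance of 0.01 are assumptions for illustration; check the header of your own results file and choose a tolerance that fits your instrument. The function name is hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed column name; verify it against the header of your results csv.
MZ_DIFF_COLUMN = "precursor_mz_difference"

Rows = List[Dict[str, str]]


def split_exact_and_analogues(rows: Rows,
                              tolerance: float = 0.01) -> Tuple[Rows, Rows]:
    """Split result rows into (potential exact matches, potential analogues).

    A row counts as a potential exact match when the absolute precursor
    mz difference is within the tolerance (an illustrative default).
    """
    exact: Rows = []
    analogues: Rows = []
    for row in rows:
        if abs(float(row[MZ_DIFF_COLUMN])) <= tolerance:
            exact.append(row)
        else:
            analogues.append(row)
    return exact, analogues


# In-memory example rows with made-up values.
example_rows = [
    {MZ_DIFF_COLUMN: "0.0"},
    {MZ_DIFF_COLUMN: "14.0157"},
    {MZ_DIFF_COLUMN: "-0.002"},
]
exact, analogues = split_exact_and_analogues(example_rows)
```

A matching precursor mz does not prove an identical structure, so the first group is best read as potential exact matches.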

Build MS2Query into other tools

If you want to incorporate MS2Query into another tool, it might be easier to run MS2Query from a Python script instead of from the command line. The guide below can be used as a starting point.

Below you can find an example script for running MS2Query. Before running the script, replace the variables ms2query_library_files_directory and ms2_spectra_directory with the correct directories.

This script will first download files for a default MS2Query library
