MFAssignR
The MFAssignR package was designed for multi-element molecular formula (MF) assignment of ultrahigh resolution mass spectrometry measurements. A number of tools for internal mass recalibration, MF assignment, signal-to-noise evaluation, and unambiguous formula selections are provided.
Additional Documentation
This package is the focus of a peer-reviewed journal article in Environmental Research. The citation is: Schum S.K., Brown L.E., Mazzoleni L.R., MFAssignR: Molecular formula assignment software for ultrahigh resolution mass spectrometry analysis of environmental complex mixtures, Environmental Research, https://doi.org/10.1016/j.envres.2020.11011, volume 191, (2020).
If you use this package, please cite this publication as well.
Package Overview and References
The MFAssignR package was designed for multi-element molecular formula (MF) assignment of ultrahigh resolution mass spectrometry measurements. It provides tools for internal mass recalibration, MF assignment, signal-to-noise evaluation, and unambiguous MF assignments. The package contains MFAssign(), MFAssign_RMD(), MFAssignCHO(), MFAssignCHO_RMD(), SNplot(), HistNoise(), KMDNoise(), RecalList(), Recal(), and IsoFiltR(), which are described in the sections below. Note that the functions with “RMD” in their names were designed to be run within an R Markdown file and are otherwise identical to the corresponding non-“RMD” versions. To learn more, please see the section titled “Semi-Automated MFAssignR Functions” in the User Manual. The function parameter settings and the output both require careful evaluation by the user; several function outputs are provided to assist with these evaluations.
Molecular Formula (MF) Assignment
The MF assignment algorithm in MFAssign was adapted from the low mass moiety CHOFIT assignment algorithm developed by Green and Perdue (2015). Briefly, the CHOFIT algorithm uses low mass moieties such as CH4O-1 and C4O-3 to move around in the O/C and H/C space to assign MF with C, H, and O without conventional loops. The MFAssignCHO function uses the CHOFIT strategy to assign MF with C, H, and O. Additional combinatorial assignments with various heteroatoms are made using nested loops that subtract the mass of a heteroatom from the measured ion mass, creating a CHO “core” mass, which can then be assigned using the low mass moiety CHOFIT approach. The MFAssign function uses this latter approach with several additional heteroatoms. Further information is available in Green and Perdue (2015) and Perdue and Green (2015).
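The heteroatom subtraction step can be illustrated with a short base R sketch. This is not the package's internal code: the helper name `cho_core_masses`, its arguments, and the example ion mass are hypothetical, only 14N and 32S are considered, and the ion charge and any adduct correction are ignored for simplicity.

```r
# Monoisotopic masses (Da) of the two heteroatoms used in this sketch.
mass_N <- 14.0030740   # 14N
mass_S <- 31.9720707   # 32S

# Hypothetical helper: loop over heteroatom counts and return the CHO
# "core" mass left after subtracting each combination from the ion mass.
cho_core_masses <- function(ion_mass, max_N = 2, max_S = 1) {
  out <- list()
  for (n in 0:max_N) {
    for (s in 0:max_S) {
      core <- ion_mass - n * mass_N - s * mass_S
      if (core > 0) {
        out[[length(out) + 1]] <- data.frame(N = n, S = s, core_mass = core)
      }
    }
  }
  do.call(rbind, out)
}

# Illustrative ion mass only; each row is a core mass that would then be
# passed to a CHO-only assignment step.
cores <- cho_core_masses(301.0929, max_N = 1, max_S = 1)
print(cores)
```

Each row of `cores` corresponds to one heteroatom combination, so the CHO-only search runs once per row rather than once per full elemental combination.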
MFAssign()
MFAssign assigns molecular formulas to a two-column or three-column data frame. The first column is ion mass and the second column is intensity. The optional third column can hold anything, but it was designed for retention time, which allows better formula assignment of LC-MS data.
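As a minimal sketch of the expected input layout, the data frame below follows the column order described above. The column names and all numeric values are illustrative placeholders; the text specifies only the column order, so treat the names as an assumption.

```r
# Illustrative three-column input: ion mass, intensity, retention time.
# Values are placeholders, not measured data.
peaks <- data.frame(
  mass      = c(269.1031, 255.0874, 283.1187),
  intensity = c(8.4e5, 1.2e6, 5.1e5),
  RT        = c(4.9, 4.2, 5.6)
)

# Basic sanity checks before handing the data to an assignment function.
stopifnot(is.numeric(peaks$mass), is.numeric(peaks$intensity))

# Sorting by mass is optional here; it just makes the list easier to read.
peaks <- peaks[order(peaks$mass), ]
print(peaks)
```

For a two-column input, the `RT` column would simply be omitted.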
Using the low mass moiety and combinatorial assignment approach, MFAssign can be used to assign MF with 12C, 1H, and 16O and a variety of heteroatoms and isotopes, including 2H, 13C, 14N, 15N, 31P, 32S, 34S, 35Cl, 37Cl, 19F, 79Br, 81Br, and 127I. It can also assign Na+ adducts, which are common in positive ion mode. Because the number of chemically reasonable MF grows with the number of possible elements and with molecular weight, the output provides a list of ambiguous and unambiguous MF.
In MFAssignR, we use a de novo concept for MF assignment, where de novo means the first in series. This approach takes advantage of the mass spectral patterns typically observed in natural organic matter. The most frequent mass differences are 2.01565, 14.01565, and 15.99491, which correspond to H2, CH2, and O. These patterns are used to constrain the number of chemically reasonable MF assigned to ions above the user-defined ‘de novo’ cutoff (e.g., m/z 300). In MFAssign, this is done using Kendrick mass defects and z* sorting. First, Kendrick mass defects (KMD) and z* values are calculated with a CH2 Kendrick base to sort the measured masses into CH2 homologous series (Stenson et al., 2003). The function then selects 1 to 3 members of each CH2 homologous series from the ions below the user-defined cutoff and attempts to assign MF to them. Ambiguous MF are returned to the unassigned list, while unambiguous MF are used as seeds for additional assignments using CH2, O, H2, H2O, and CH2O MF extensions (Kujawinski and Behn, 2006). For the formula extensions, the KMD and z* values for each of these bases are calculated and then used to assign MF through the addition or subtraction of the series bases.
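The CH2 Kendrick sorting can be sketched in a few lines of base R. The formulas used here are the commonly cited ones (KM = m/z * 14 / 14.01565, KMD = nominal mass - KM, z* = (nominal mass mod 14) - 14); whether MFAssign rounds or computes them in exactly this way is an assumption, and the helper name `ch2_series` and the example masses are hypothetical.

```r
# Hypothetical helper: CH2-based Kendrick mass defect and z* for each mass.
ch2_series <- function(mz) {
  nominal <- round(mz)                 # nominal (integer) mass
  km      <- mz * (14 / 14.01565)      # Kendrick mass on a CH2 base
  data.frame(
    mz     = mz,
    KMD    = round(nominal - km, 3),   # rounded so series members compare equal
    z_star = nominal %% 14 - 14
  )
}

# The first three masses differ by 14.01565 (CH2); the fourth is unrelated.
mz <- c(200.06848, 214.08413, 228.09978, 201.07650)
series <- ch2_series(mz)

# Masses that share both KMD and z* fall in the same CH2 homologous series.
series$series_id <- paste(series$KMD, series$z_star)
print(series)
```

Grouping on the combined KMD and z* key is what allows a few low-mass members of a series to stand in for the whole series during the initial assignment.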
“MFAssign” functions track how many “paths” can be used to assign each MF and if a single mass has multiple MF. By default, the functions will choose the MF that has the largest number of paths that intercept with it. For example, if a single mass has two possible MF and one has 20 potential “paths” to it, while the other has 4, the function will choose the MF with 20 paths. Work is ongoing to track these paths and the associated MF in the data frame output of these functions. Overall, the multi-path MF extension approach greatly reduces the number of ambiguous assignments and provides an increased level of confidence in the final MF list because the MF are related to unambiguous MF assigned below the user defined cutoff. To reduce the number of ambiguous sulfur assignments, sulfur containing MF used as seeds must be unambiguous and have a matching 34S peak.
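The path-based prioritization can be illustrated as follows. This is a sketch of the selection rule only, assuming the path counts are already known; the data frame, the placeholder labels `MF_A` to `MF_C`, and the helper `pick_most_supported` are hypothetical and not part of the package output.

```r
# Placeholder candidates: two competing MF for one mass, one MF for another.
candidates <- data.frame(
  mass    = c(401.1453, 401.1453, 415.1610),
  formula = c("MF_A", "MF_B", "MF_C"),
  paths   = c(20, 4, 12),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each mass, keep the candidate with the most paths.
# Ties are all kept, which leaves that mass ambiguous.
pick_most_supported <- function(cand) {
  best <- ave(cand$paths, cand$mass, FUN = max)
  cand[cand$paths == best, , drop = FALSE]
}

chosen <- pick_most_supported(candidates)
print(chosen)
```

Here `MF_A` (20 paths) is kept over `MF_B` (4 paths) for the shared mass, mirroring the example in the text.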
To allow ambiguity in the formula assignments, the "Ambig" parameter can be turned "on" or "off". When enabled, this option turns off the path frequency prioritization step described above, so all chemically reasonable MF assignments are retained for each mass. Additionally, an "MSMS" parameter is available, which can be used to assign MF in a data set that is not very continuous (e.g., MS/MS data). In this case, no pre-filtering of the ions below the “DeNovo” threshold is done, meaning that all ions below the threshold are assigned directly. This causes the function to run somewhat slower, but can improve assignment coverage in some situations. These parameters replace the MFAssignAll and MFAssignMSMS functions from previous versions (<= v.0.0.3).
MFAssignCHO()
MFAssignCHO is a simplified version of MFAssign used only to assign MF with CHO elements. MFAssignCHO runs faster than MFAssign and can be used for preliminary MF assignments prior to the selection of internal recalibration ions in conjunction with RecalList and Recal, which are described below.
Isotope Filtering
The IsoFiltR function can identify prospective 13C and 34S isotope ions. This is done to avoid incorrect monoisotopic MF assignments. This function operates on a two-column or three-column data frame using the same structure as the MFAssign function.
IsoFiltR() identifies potential isotope masses using a four-step identification method.
- First, the mass list is transformed to identify mass difference pairs appropriate for the element under investigation (delta mass of 1.003355 for C or 1.995797 for S, with +/- 5 ppm mass error). Only pairs that meet this criterion move on to step two.
- Using the mass difference between 12C/13C (1.003355) or 32S/34S (1.995797), a KMD value can be calculated for a specific isotope. This means that the 12C (32S) monoisotopic peak will be in a KMD homologous series with its matching 13C (34S) isotopic peak, analogous to a CH2 homologous series. If the KMD values are equivalent for the candidate pair, the masses are considered to be in a series and the pair moves on to step three. The equations for 13C are: KM = (1 / 1.003355) * m/z and KMD = nominal mass - KM. For 34S, 2 / 1.995797 replaces 1 / 1.003355.
- Isotope pairs are separated using a “Resolution Enhanced KMD” approach adapted from Zheng et al. (2019). Resolution enhanced KMD values are calculated by first dividing the mass of a homologous series base (in this case CH2) by an integer that is determined experimentally to achieve the desired separation. The adjusted base mass is then used in the usual KM and KMD calculation to give the “resolution enhanced” KMD (re-KMD). For example, with the integer 21: BaseMass_adj = 14.01565 / 21, re-KM = (round(BaseMass_adj) / BaseMass_adj) * m/z, and re-KMD = round(re-KM) - re-KM.
  For 13C, the integer 21 is used in the re-KMD calculation, while for 34S it is 12. Masses that are 12/13 C or 32/34 S pairs then have specific re-KMD difference values, which are used to select the pairs of masses most likely to be isotope pairs. The re-KMD differences (polyisotope – monoisotope) can be either positive or negative because the re-KM values are rounded. The values are -0.291 and 0.709 for 32/34 S and -0.496 and 0.503 for 12/13 C. If the re-KMD difference for a candidate pair matches one of these two values, the pair moves on to step four.
- The abundance ratios are used to constrain the remaining isotope pairs to ensure that the isotope masses are not too large or too small relative to the intensity of the monoisotopic peak. The limits on this are loose due to the variation in the polyisotope abundance with analyte signal (similar to isotope dilution) as observed in ultrahigh resolution Orbitrap and FT-ICR measurements.
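The first three steps can be sketched for 13C in base R using the equations given above. This is an illustration and not IsoFiltR's implementation: the helper names, the example masses, the choice to express the ppm error relative to the monoisotopic m/z, and the use of round(KM) as the nominal mass are all assumptions. The sign of the re-KMD difference depends on the KMD sign convention and subtraction order, so the sketch checks only its magnitude.

```r
delta_13C <- 1.003355   # 13C - 12C mass difference from the text

# Step 1: pairs whose mass difference matches delta within ppm_tol
# (ppm taken relative to the candidate monoisotopic m/z; an assumption).
find_pairs <- function(mz, delta, ppm_tol = 5) {
  idx  <- expand.grid(mono = seq_along(mz), iso = seq_along(mz))
  diff <- mz[idx$iso] - mz[idx$mono]
  keep <- abs(diff - delta) / mz[idx$mono] * 1e6 <= ppm_tol
  data.frame(mono = mz[idx$mono[keep]], iso = mz[idx$iso[keep]])
}

# Step 2: isotope-based KMD, KM = (n / delta) * m/z, with round(KM)
# standing in for the nominal mass.
iso_kmd <- function(mz, delta, n = 1) {
  km <- mz * (n / delta)
  round(km) - km
}

# Step 3: resolution enhanced KMD on a CH2 base divided by an integer.
re_kmd <- function(mz, divisor) {
  base_adj <- 14.01565 / divisor
  re_km    <- (round(base_adj) / base_adj) * mz
  round(re_km) - re_km
}

# Illustrative masses: the second is the first plus 1.003355.
mz    <- c(301.07176, 302.075115, 315.08741)
pairs <- find_pairs(mz, delta_13C)

same_series <- abs(iso_kmd(pairs$mono, delta_13C) -
                   iso_kmd(pairs$iso,  delta_13C)) < 1e-6
re_diff <- re_kmd(pairs$iso, 21) - re_kmd(pairs$mono, 21)

print(pairs)
print(same_series)
print(abs(re_diff))   # magnitude close to 0.5 for a 12C/13C pair
```

A pair would continue to the abundance ratio check only if all three tests agree; the intensity limits for that fourth step are not given in this section, so they are left out of the sketch.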
The candidate pairs that make it through these four steps are put into two data frames, Mono and Iso, which contain the monoisotopic and isotopic masses, respectively. All of the masses that were not flagged as possible mono/iso pairs are then added to the Mono output data frame. In complex mixtures, some masses can be flagged as both monoisotopic and isotopic. In these cases, the masses are included in both outputs and are classified as either monoisotopic or isotopic after the MF assignment.
When the two data frame outputs from IsoF
