Asari
asari, metabolomics data preprocessing
Trackable and scalable Python program for high-resolution LC (Li et al. Nature Communications 14.1 (2023): 4113) and GC (publication to come) metabolomics datasets:
- Taking advantage of high mass resolution to prioritize mass separation and alignment
- Peak detection on a composite map instead of repeated on individual samples
- Statistics-guided peak detection, based on local maxima and prominence, with selective use of smoothing
- Reproducible; track and backtrack between features and EICs
- Tracking peak quality and selectivity metrics on m/z, chromatography and annotation databases
- Scalable, performance conscious, disciplined use of memory and CPU
- Transparent, JSON centric data structures, easy to chain other tools
A web server (https://asari.app) and full pipeline are available now.
A set of tutorials is hosted at https://github.com/shuzhao-li-lab/asari_pcpfm_tutorials/.
Install
- From the PyPI repository: pip3 install asari-metabolomics. Add --upgrade to update to new versions.
- Or clone from source code: https://github.com/shuzhao-li/asari . One can run it as a Python module by calling the Python interpreter. The GitHub repo is often ahead of the PyPI versions.
- Requires Python 3.8+. Installation takes ~5 seconds if common libraries already exist.
- Python >= 3.12 is currently incompatible with the GC workflow because of limitations installing numba and matchms on these newer Python versions.
- One can use the web version (https://asari.app) without local installation.
Input
Input data are centroid mzML files from LC, GC or DI metabolomics. Example datasets can be found at https://github.com/shuzhao-li-lab/data.
We use ThermoRawFileParser (https://github.com/compomics/ThermoRawFileParser) to convert Thermo .RAW files to .mzML. If your input files are Thermo raw files, you can also perform the conversion either by using the convert subcommand or by passing --convert_raw True with the process command.
Msconvert in ProteoWizard (https://proteowizard.sourceforge.io/tools.shtml) can handle the conversion of most vendor data formats and .mzXML files.
MS/MS spectra are ignored in the default LC-MS workflow but handled by alternative workflows.
Use
If installed from pip, one can run asari as a command in a terminal, followed by a subcommand for specific tasks.
For help information:
asari -h
To process all mzML files under directory mydir/projectx_dir:
asari process --mode pos --input mydir/projectx_dir
To get a statistical description of a single file (useful for understanding data and parameters):
asari analyze --input mydir/projectx_dir/file_to_analyze.mzML
To get annotations on a tab-delimited feature table:
asari annotate --mode pos --ppm 10 --input mydir/projectx_dir/feature_table_file.tsv
To estimate the min peak height automatically, add this argument:
--autoheight True
To output additional extraction table on a targeted list of m/z values from target_mzs.txt:
asari extract --input mydir/projectx_dir --target target_mzs.txt
This is useful for adding QC checks during data processing; e.g., the target_mzs.txt file can list spike-in controls.
Alternatively, you can do:
asari qc_report --input mydir/projectx_dir/ --spikeins target_trios.csv
Or add this to the process command to generate the reports during processing:
--single_file_qc_reports true --spikeins target_trios.csv
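Conceptually, the spike-in QC check matches each target m/z against the observed feature m/z values within a ppm tolerance. A minimal sketch of this idea (not asari's implementation; the m/z values below are hypothetical):

```python
# Illustrative sketch (not asari's code): match a list of target m/z
# values against observed feature m/z within a +/- ppm window.
def match_targets(target_mzs, observed_mzs, ppm=5):
    """Return {target: [matching observed m/z values]}."""
    hits = {}
    for t in target_mzs:
        tol = t * ppm * 1e-6
        hits[t] = [m for m in observed_mzs if abs(m - t) <= tol]
    return hits

# e.g. hypothetical spike-in controls at 152.0706 and 180.0634
found = match_targets([152.0706, 180.0634], [152.0708, 441.2003])
# -> 152.0706 is recovered (152.0708 is within 5 ppm); 180.0634 is missing
```

A missing spike-in in such a check would flag a sample for closer inspection.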
To launch a dashboard in your web browser after the project is processed into directory process_result_dir:
asari viz --input process_result_dir
As an alternative to the standalone command, asari can run as a module via the Python interpreter; one needs to point to the module location, e.g.:
python3 -m asari.main process --mode pos --input mydir/projectx_dir
An example output feature table is test/HighOnly_HILICpos_preferred_Feature_table.tsv, which was from:
asari process -i MT202304_2phase_HILICpos -o highonly --min_peak_height 1000000 --anno F
Graphical Interface
A prototype graphical interface is provided on install. You can start the GUI by running in a terminal:
asari_gui
Ask your IT support to create a desktop icon if desired.
Workflow Selection - GC / LC / Other
Asari processes both GC and LC data via different workflows.
Workflows can be selected by passing --workflow <workflow_name> to asari.
Some workflows require additional parameters. To list the available workflows:
asari list_workflows
We have three workflows currently:
- LC - default workflow for asari
- GC - GC workflow for asari; uses retention index for normalization
- LC_START - alternative LC workflow with spanning tree alignment
Output
A typical run may generate a directory on disk like this:
rsvstudy_asari_project_427105156
├── Annotated_empricalCompounds.json
├── Feature_annotation.tsv
├── export
│ ├── _mass_grid_mapping.csv
│ ├── cmap.pickle
│ ├── full_Feature_table.tsv
│ └── unique_compound__Feature_table.tsv
├── pickle
│ ├── Blank_20210803_003.pickle
│ ├── ...
├── preferred_Feature_table.tsv
└── project.json
The recommended feature table is preferred_Feature_table.tsv.
All peaks that meet signal (SNR) and shape standards are kept in export/full_Feature_table.tsv
(these thresholds are input parameters, but default values are fine for most people).
That is, if a feature is only present in one sample, it will be reported,
as we think this is important for applications like exposome and personalized medicine.
The filtering decisions are left to end users.
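Since filtering is left to end users, downstream filtering can be done with a few lines of Python. A minimal sketch using the standard library (the column name 'snr' is an assumption; check it against the actual header of your feature table):

```python
import csv
import io

# Minimal sketch: filter a feature table by a numeric column.
# The column name 'snr' is an assumption; check your table's header.
def filter_features(tsv_text, column="snr", min_value=5.0):
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row[column]) >= min_value]

# tiny in-memory example table with hypothetical values
demo = "id_number\tmz\tsnr\nF1\t152.0706\t12\nF2\t441.2003\t2\n"
kept = filter_features(demo)  # keeps F1 only
```

In practice one would read export/full_Feature_table.tsv from disk and apply whatever thresholds suit the study.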
The pickle folder keeps intermediate files during processing.
They are removed after the processing by default, to save disk space.
Users can choose to keep them by specifying --keep_intermediates True.
Optionally, users may choose to save intermediates as JSON files using --storage_format json,
which may be safer than using pickle at the expense of additional disk space. Passing --compress true
will store the files in individual zip files, saving disk space. Enabling compression can be
intensive on the CPU/memory subsystem of your machine; use with care.
Dashboard
After data are processed, users can run asari viz --input process_result_dir to launch a dashboard for inspecting the data, where 'process_result_dir' refers to the result folder. The dashboard uses these files under the result folder: 'project.json', 'export/cmap.pickle', 'export/epd.pickle' and 'export/full_Feature_table.tsv'. Thus, one can move the folder around, but modifying these files is not a good idea. Please note that pickle files are for internal use, and one should not trust pickle files from other people.

Parameters
For the LC workflows, only one parameter in asari requires real attention: m/z precision, which is set to 5 ppm by default. Most modern instruments are fine with 5 ppm, but one may want to change it if needed.
Default ionization mode is pos. Change to neg if needed, by specifying --mode neg in command line.
Users can supply a custom parameter file xyz.yaml, via --parameters xyz.yaml in command line.
A template YAML file can be found at test/parameters.yaml.
When the above methods overlap, command line arguments take priority.
That is, the command line overwrites xyz.yaml, which overwrites the default asari parameters in default_parameters.py.
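This precedence rule can be sketched as successive dictionary updates, where later sources overwrite earlier ones (the parameter names below are illustrative, not asari's exact defaults):

```python
# Sketch of the stated precedence: defaults < xyz.yaml < command line.
# Parameter names and values below are illustrative only.
defaults = {"mode": "pos", "mz_tolerance_ppm": 5, "min_peak_height": 100000}
from_yaml = {"min_peak_height": 500000}   # e.g. values loaded from xyz.yaml
from_cli = {"mode": "neg"}                # e.g. values parsed from argv

params = {**defaults, **from_yaml, **from_cli}  # later sources win
# -> {'mode': 'neg', 'mz_tolerance_ppm': 5, 'min_peak_height': 500000}
```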
The GC workflow requires, in addition to passing --workflow GC to the process command, an appropriately formatted --retention_index_standards
file in .csv format. Examples are provided in the db folder. You can also specify which database to use by passing --GC_Database <path_to_msp> or --GC_Database <database_name>, where <database_name> is one of the supported libraries in /db/gcms_libraries.json. By default, MoNA GC-MS is used.
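Retention index normalization typically maps retention times onto an index scale by linear interpolation between bracketing standards (e.g. an alkane series). A generic Kovats-style sketch, not asari's implementation, using hypothetical standards:

```python
import bisect

# Generic linear-interpolation retention index (Kovats-style sketch);
# not asari's implementation. Standards are (retention_time, index) pairs.
def retention_index(rt, standards):
    standards = sorted(standards)
    rts = [s[0] for s in standards]
    i = bisect.bisect_right(rts, rt)
    if i == 0:                       # before the first standard
        return standards[0][1]
    if i == len(standards):         # after the last standard
        return standards[-1][1]
    (rt0, ri0), (rt1, ri1) = standards[i - 1], standards[i]
    return ri0 + (ri1 - ri0) * (rt - rt0) / (rt1 - rt0)

# e.g. hypothetical C10 alkane eluting at 300 s (RI 1000), C11 at 360 s (RI 1100):
ri = retention_index(330.0, [(300.0, 1000), (360.0, 1100)])  # -> 1050.0
```

Expressing elution on an index scale makes runs comparable across instruments and temperature programs.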
Algorithms
Basic data concepts follow https://github.com/shuzhao-li/metDataModel, organized as
Experiment
├── Sample
│   ├── MassTrack
│   │   ├── Peak
│   │   └── Peak
│   ├── MassTrack
│   │   ├── Peak
│   │   └── Peak
│   ...
├── Sample
...
└── Sample
A sample here corresponds to an injection file in LC-MS experiments. A MassTrack is an extracted chromatogram for a specific m/z measurement, covering the full retention time range. Therefore, a MassTrack may include multiple mass traces (EICs/XICs, as they are referred to in the literature). A peak (an elution peak at a specific m/z) is specific to a sample, but a feature is defined at the level of an experiment after correspondence.
Additional details:
- Use of MassTracks simplifies m/z correspondence, which results in a MassGrid
- Two modes of m/z correspondence: a clustering method for studies with >= N samples (default N = 10); and a slower method based on landmark peaks and verified mass precision
- Chromatogram construction is based on m/z values via flexible bins and frequency counts (in lieu of histograms)
- Elution peak alignment is based on LOWESS
- Use of integers for RT scan numbers and intensities, for computing efficiency
- Avoidance of mathematical curves where possible, for computing efficiency
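The clustering mode of m/z correspondence can be illustrated by grouping sorted m/z values whose gaps stay within the ppm tolerance. This sketch only illustrates the idea, not asari's MassGrid code:

```python
# Illustrative sketch of ppm-based m/z grouping (not asari's MassGrid code):
# sort all m/z values, then start a new group whenever the gap to the
# previous value exceeds the ppm tolerance.
def group_mz(values, ppm=5):
    groups, current = [], []
    for mz in sorted(values):
        if current and (mz - current[-1]) / current[-1] * 1e6 > ppm:
            groups.append(current)
            current = []
        current.append(mz)
    if current:
        groups.append(current)
    return groups

grid = group_mz([180.0634, 180.0636, 152.0706, 152.0707, 300.1])
# -> three groups: the two 152.07xx values, the two 180.063x values, and 300.1
```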
Selectivity is tracked for:
- mSelectivity, how distinct the m/z measurements are
- cSelectivity, how distinct the chromatographic elution peaks are
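As a toy illustration of the idea behind mSelectivity (asari's actual formulas are explained in doc/README.md), a selectivity-like score can be derived from the distance to the nearest neighboring m/z value:

```python
import math

# Toy illustration only, not asari's formula: a selectivity-like score
# in (0, 1], high when an m/z value has no close neighbors, low when
# another m/z value sits nearby (scale chosen arbitrarily at 10 ppm).
def mz_isolation(mz, others, ppm_scale=10.0):
    ppm_dists = [abs(mz - o) / mz * 1e6 for o in others if o != mz]
    if not ppm_dists:
        return 1.0
    return 1.0 - math.exp(-min(ppm_dists) / ppm_scale)

lonely = mz_isolation(500.0, [100.0, 900.0])      # no close neighbor -> near 1
crowded = mz_isolation(500.0, [500.001, 900.0])   # 2 ppm neighbor -> low score
```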
Step-by-step algorithms are explained in doc/README.md.
This package uses mass2chem, khipu and JMS for mass search and annotation functions.
Performance
Asari is designed to run > 1000 samples on a laptop computer. The performance is achieved via
- Imple
