PythonCentricPipelineForMetabolomics
Python pipeline for metabolomics data preprocessing, QC, standardization and annotation
Introduction
The Python-Centric Pipeline for Metabolomics (PCPFM) takes raw LC-MS metabolomics data and readies it for downstream statistical analysis. The pipeline can
- convert Thermo .raw to mzML (ThermoRawFileParser)
- process mzML data to feature tables (Asari)
- perform quality control
- data normalization and batch correction
- pre-annotation to group features into empirical compounds (khipu)
- perform MS1 annotation using an authentic compound library, a public database (e.g., HMDB, LIPID MAPS), or a custom database
- perform MS2 annotation (matchms) using a custom database (default MoNA)
- output data in standardized formats (.txt, JSON), ready for downstream analysis
Asari provides a visual dashboard for exploring and inspecting individual features. We are working to add support for GC and other data types.
Note that to replicate the presented results you will need to run the download_extras command (see below).
Citations
Please cite these publications if you use PCPFM and Asari:
- Mitchell, J.M., Chi, Y., Thapa, M., Pang, Z., Xia, J. and Li, S., 2024. Common data models to streamline metabolomics processing and annotation, and implementation in a Python pipeline. PLOS Computational Biology, 20(6), p.e1011912. (https://doi.org/10.1371/journal.pcbi.1011912)
- Li, S., Siddiqa, A., Thapa, M., Chi, Y. and Zheng, S., 2023. Trackable and scalable LC-MS metabolomics data processing using asari. Nature Communications, 14(1), p.4113. (https://www.nature.com/articles/s41467-023-39889-1)
Recent Changes
Please see VERSION_LOG.md for details on recent changes; this log exists for documentation and also because the manuscript is under review. Notably, there was an issue with sample names that do not match their mzML file names; this was fixed as of 2/28/24.
Workflow
This is a basic overview of the various steps in the pipeline and workflow:
<img width="871" alt="image" src="https://github.com/shuzhao-li-lab/PythonCentricPipelineForMetabolomics/assets/10132705/60b92ee0-e855-41df-be5d-509a0b5f5f2f">

Quick Start
See the workflows under examples/workflows/bash_workflows for example processing pipelines to get started. You will need an appropriately formatted sequence file / sample metadata file along with mzML files. You can work with .raw files, but support is limited. Creating properly formatted metadata sheets by hand is easy for small studies; for larger studies the preprocessing step can help (manual creation is still recommended for full flexibility).
Examples of basic and advanced pipeline and Asari usage, along with additional details on running Asari, are located here: https://github.com/shuzhao-li-lab/asari_pcpfm_tutorials
PythonCentricPipelineForMetabolomics (PCPFM)
The PythonCentricPipelineForMetabolomics (PCPFM) aims to be an all-in-one pre-processing pipeline for LC-MS metabolomics datasets leveraging the data quality and performance improvements offered by our pre-processing software Asari.
- Inputs should include a set of raw files (.raw or .mzML) and a CSV file for metadata (minimally, sample names and file paths).
- Outputs are intended to be immediately usable for downstream analysis (e.g., MetaboAnalyst or common tools in R, Python, etc.). These include feature tables that are optionally blank-masked, normalized, batch-corrected, annotated, or otherwise curated by PCPFM, as well as empirical compounds, a JSON file representing putative metabolites that can be annotated with MS1, MS2, or authentic standards. The outputs are organized as follows:
Experiment Directory/
  annotations/
    empCpd.json
    ...
  asari_results/
    preferred_Feature_table.tsv
    export/
      full_feature_table.tsv
  converted_acquisitions/
    sample1.mzML
    sample2.mzML
    ...
  feature_tables/
    user_created_table_1.tsv
    user_created_table_2.tsv
    ...
  results/
    annotation_table
    feature_table
    sample_annotation_table
  QAQC_figs/
    user_created_table_1/
      pca.png
      tsne.png
      ...
    user_created_table_2/
      pca.png
      tsne.png
      ...
    ...
  raw_acquisitions/
    sample1.raw
    sample2.raw
    ...
  experiment.json
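The exported files in this layout can be gathered programmatically for downstream work. Below is a minimal sketch using only the standard library; load_outputs is a hypothetical helper written for this README, not part of the PCPFM API, and it assumes the directory layout shown above.

```python
import json
import os

def load_outputs(experiment_dir):
    """Collect the main PCPFM outputs from an experiment directory.

    Returns a dict holding the empirical-compound JSON (if present)
    and the paths of any user-created feature tables.
    """
    outputs = {"empCpds": None, "feature_tables": []}

    # annotations/empCpd.json holds the empirical compounds, if generated.
    empcpd_path = os.path.join(experiment_dir, "annotations", "empCpd.json")
    if os.path.exists(empcpd_path):
        with open(empcpd_path) as fh:
            outputs["empCpds"] = json.load(fh)

    # feature_tables/ holds the user-created .tsv tables.
    table_dir = os.path.join(experiment_dir, "feature_tables")
    if os.path.isdir(table_dir):
        outputs["feature_tables"] = sorted(
            os.path.join(table_dir, name)
            for name in os.listdir(table_dir)
            if name.endswith(".tsv")
        )
    return outputs
```

The returned paths can then be fed to pandas, R, or any other downstream tool.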
Installation
The preferred installation mechanism is pip:
pip install pcpfm
or download the source and install manually:
pip install -e . or pip install .
Additional files, such as the LC-MS/MS databases from MoNA and JMS-compliant versions of the HMDB and LMSD, can be downloaded and placed in the correct directory by running:
pcpfm download_extras
after the basic installation is complete. By running this command, you agree to the terms and conditions of those third-party resources; notably, the HMDB is NOT to be used for commercial purposes.
Annotation sources such as the HMDB, while free for public non-commercial use, are not redistributed in this package. The command above downloads them once you agree to respect their licenses; the currently available sources are the HMDB and the LC-MS/MS Orbitrap database from MoNA.
Basic Usage
Preparing experiment metadata
Goal: to organize metadata in a CSV file.
This step is optional; you can also provide a manually crafted sequence file instead. The examples in the manuscript use manually constructed sequence files.
An example command:
pcpfm preprocess -s ./Sequence.csv --new_csv_path ./NewSequence.csv --name_field='Name' --path_field='Path' --preprocessing_config ./pcpfm/preprocessing.json
This command creates a new CSV file, ./NewSequence.csv, using the rules specified in preprocessing.json, assuming each sample is located either at --path_field or in the CSV's directory under its 'File Name'.
It is typical that the sequence file contains sufficient information for metadata. However, some instruments do not allow all values for all fields in a sequence file. This step is therefore to prepare metadata from the sequence file.
An example of input CSV file:
| Sample Type | Name | Filepath |
|-------------|----------------|------------------------------------|
| Blank | SZ_01282024_01 | my_experiment/SZ_01282024_01.raw |
| QC | SZ_01282024_07 | my_experiment/SZ_01282024_07.raw |
| Unknown | SZ_01282024_13 | my_experiment/SZ_01282024_13.raw |
| ... | ... | ... |
Other fields are supported and can be used during an analysis. As a basic recommendation, you should include a field for sample type (e.g., "Type") with a string for each type of sample (e.g., standards are marked 'STD', blanks are 'BLANK', etc.) and a "Batch" field if your samples were collected in multiple batches and you want to do batch correction. All fields are read in and stored in the underlying data structures, and any number of fields is supported.
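For larger studies, the metadata CSV itself can be generated programmatically rather than by hand. A minimal sketch using only the standard library; write_metadata is a hypothetical helper for this README, and the column names simply follow the example table above.

```python
import csv
import io

def write_metadata(rows):
    """Serialize sample metadata rows to CSV text with the columns
    used in the example above."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["Sample Type", "Name", "Filepath"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

samples = [
    {"Sample Type": "Blank", "Name": "SZ_01282024_01",
     "Filepath": "my_experiment/SZ_01282024_01.raw"},
    {"Sample Type": "QC", "Name": "SZ_01282024_07",
     "Filepath": "my_experiment/SZ_01282024_07.raw"},
]
csv_text = write_metadata(samples)
```

Writing the text to a file then gives a sequence file that pcpfm can consume directly.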
The preprocessing command can help with the creation of these CSV files. It takes a dictionary, provided as JSON, whose top-level keys are the fields you want to add to the sequence file. Each field maps desired field values to a set of rules specifying when that value should be used. A rule can search for any number of substrings, given as a list of strings under a "substrings" key, within the fields listed under a "search" key. An "else" key can be provided with a string value that is used if none of the substrings are found in any searched field.
For example,
"sample_type":
{
"qstd": {
"substrings": ["qstd", "QSTD", "Qstd"],
"search": ["File Name", "Sample ID"]
},
...
would result in the "sample_type" field being populated with "qstd" if any of the substrings are observed in either the "File Name" or "Sample ID" fields in the csv file.
If multiple rules match, each specified value is added only once; distinct matches are concatenated with "_". Thus, if a sample matches both the qstd and blank definitions, its type becomes "qstd_blank".
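The rule semantics described above can be illustrated with a short sketch. This is an illustration only, not the PCPFM implementation; infer_field_value is a hypothetical function name.

```python
def infer_field_value(row, field_rules):
    """Apply preprocessing rules for one field to one sequence-file row.

    field_rules maps candidate values to {"substrings": [...], "search": [...]};
    an optional "else" entry supplies a fallback string. Distinct matches
    are concatenated with "_".
    """
    matches = []
    fallback = None
    for value, rule in field_rules.items():
        if value == "else":
            fallback = rule  # fallback is a plain string
            continue
        hit = any(
            sub in str(row.get(col, ""))
            for sub in rule["substrings"]
            for col in rule["search"]
        )
        if hit and value not in matches:
            matches.append(value)
    return "_".join(matches) if matches else fallback

rules = {
    "qstd": {"substrings": ["qstd", "QSTD", "Qstd"],
             "search": ["File Name", "Sample ID"]},
    "blank": {"substrings": ["blank", "BLANK"],
              "search": ["File Name", "Sample ID"]},
    "else": "unknown",
}
```

For example, a row whose File Name contains both "QSTD" and "blank" would receive the value "qstd_blank", while a row matching nothing falls back to "unknown".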
Furthermore, preprocessing will attempt to map the specified file path to a local path. If the path exists, no changes are made; if it does not, the pipeline checks for a .mzML or .raw file in the same location as the sequence file whose name matches the Name field. If such a file exists, a field called "InferredPath" is created to store its path.
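That fallback logic amounts to the following sketch; infer_path is a hypothetical stand-in for the behavior described above, not PCPFM's own code.

```python
import os

def infer_path(stated_path, sequence_dir, name):
    """Return the stated path if it exists; otherwise look for a .mzML
    or .raw file named after the sample next to the sequence file.
    Returns None if no candidate exists."""
    if os.path.exists(stated_path):
        return stated_path
    for ext in (".mzML", ".raw"):
        candidate = os.path.join(sequence_dir, name + ext)
        if os.path.exists(candidate):
            return candidate
    return None
```

When the stated path is stale (e.g., the data moved machines), the file found next to the sequence file becomes the "InferredPath".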
An example preprocessing configuration is provided under default_configs/default_preprocessing.json.
Note that unlike other commands, there is no reasonable default configuration for this step, as it depends greatly on your data. Furthermore, please note that any missing values will be cast to the appropriate placeholder using the logic in pandas' read_csv function. This can cause missing fields to become np.nan or an empty string. If you do not want this behavior, do not pass empty fields.
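The placeholder behavior can be checked directly. The snippet below assumes pandas is installed and simply demonstrates read_csv's default handling of an empty field:

```python
import io

import pandas as pd

# A sequence file where the second sample has an empty Filepath field.
csv_text = "Name,Filepath\nS1,my_experiment/S1.raw\nS2,\n"
df = pd.read_csv(io.StringIO(csv_text))

# pandas casts the empty field to NaN rather than an empty string.
missing = df.loc[1, "Filepath"]
```

Passing keep_default_na=False to read_csv would keep such fields as empty strings instead, but the safest option is to avoid empty fields in the sequence file altogether.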
Assemble Experiment
Goal: to create a directory on disk to store the project.
An example command:
pcpfm assemble -s ./sequence.csv --name_field='Name' --path_field='InferredPath' -o . -j my_experiment
This will create an experiment in the local directory with the name 'my_experiment'. The experiment object is used throughout processing and stores intermediates; it is saved as a directory on disk in the location specified by -o.
