MetaSpectraST
An unsupervised and database-independent analysis tool for metaproteomic MS/MS data using spectrum clustering.
metaSpectraST
metaSpectraST is an unsupervised and database-independent analysis tool for metaproteomic MS/MS data using spectrum clustering. It clusters all experimentally observed MS/MS spectra based on their spectral similarity and creates a representative consensus spectrum for each cluster, using the spectrum clustering algorithm implemented in the spectral library search engine SpectraST.
Spectrally similar MS/MS spectra that are grouped in one spectral cluster are presumed to originate from the same peptide sequence. metaSpectraST therefore treats them as replicate spectra and quantitatively profiles samples by counting the number (spectral count, SC) or intensity (spectral index, SI<sub>N</sub>) of replicate spectra in each spectral cluster.
The metaSpectraST spectral clusters also offer a portal to integrate and reconcile multiple peptide identification approaches, including database search, open modification search, and de novo sequencing. For each spectral cluster, the sequences assigned by the different identification methods to the raw spectra and to their consensus spectrum vote for the consensus peptide sequence of the cluster through a heuristic reconciliation scheme and the majority rule.
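The voting idea can be illustrated with a minimal sketch. It applies plain majority voting only and does not reproduce the heuristic reconciliation scheme of metaSpectraST, which is not described here. The function name, the cluster identifiers, the example sequences, and the rule that a tie yields no consensus are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(assignments):
    """Return the most frequent sequence among the votes, or None.

    `assignments` holds one entry per vote; None marks a spectrum that an
    identification method could not assign. Returning None on a tie is an
    assumption of this sketch, not a documented metaSpectraST rule.
    """
    votes = Counter(seq for seq in assignments if seq is not None)
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two top candidates
    return ranked[0][0]


# Hypothetical votes collected for three spectral clusters.
cluster_votes = {
    "cluster_1": ["PEPTIDEK", "PEPTIDEK", None, "PEPTLDEK"],
    "cluster_2": [None, None],
    "cluster_3": ["AAAK", "AAGK"],
}
consensus = {cid: majority_vote(seqs) for cid, seqs in cluster_votes.items()}
```

In this toy input only cluster_1 receives a consensus sequence; the other two clusters stay unassigned because they have no votes or a tied vote.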
With metaSpectraST you can:
- Quickly profile and compare the microbial communities of your samples;
- Classify your metaproteomic (or proteomic) samples;
- Validate biological/technical replicates;
- Integrate and reconcile multiple peptide/protein identification approaches for further taxonomic or functional studies.
Installation
Dependencies
- Python version >= 3.7, R version 4.1.3
- SpectraST (v5.0)
SpectraST is an integral component of the Trans-Proteomic Pipeline (TPP) software suite. A compiled executable file is included here, which can be used alone without other TPP components.
We encourage users to download and install the entire TPP suite, which provides other useful functionalities such as raw data importation, automatic validation of search results, protein inference, and quantification and visualization. Please refer to the guides for TPP Linux installation, and the official download site for Windows installer.
- edgeR (v3.34.0)
metaSpectraST normalizes the data using the trimmed mean of M-values (TMM) normalization method implemented in the edgeR package. edgeR is not necessary if you would like to normalize the data with other methods. Please refer to Bioconductor-edgeR for further information.
To install the edgeR package, start R and enter:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("edgeR")
You may also need to install the limma (v3.48.1) package:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("limma")
Installing metaSpectraST
- Download or clone the repository:
git clone https://github.com/bravokid47/metaSpectraST.git
- Make metaSpectraST executable by adding the directory yourpath/metaspectrast/ to the environment variable $PATH, or just copy the following line to the ~/.bashrc or ~/.bash_profile file and source the file.
export PATH="$PATH:yourpath/metaspectrast";
Quick start
Data format
metaSpectraST can perform spectral clustering from the following data formats:
- mzML format
- mzXML format
- mgf format
Please note that mgf format is required for computing the normalized spectral index (SI<sub>N</sub>). File formats can be converted with msconvert or ThermoRawFileParser.
Modules of metaSpectraST
There are 6 individual modules in metaSpectraST. Run the following command to get an explanation of the 6 modules.
metaspectrast -h
Output
>>>
_________ metaSpectraST by Hao, Chunlin _________
metaSpectraST v=0.0
Usage: metaspectrast [module]
Module:
1 cluster Clustering MS/MS spectra and create consensus spectra
2 computesc Spectral count-based (SC) sample profiling
3 computesin Normalized spectral index (SIn) based sample profiling
4 normalize Normlizing the data matrix of sample profiles (SC or SIn)
5 classify Hierarchically clustering and classifying samples
6 reconcile Reconciliation scheme
Each module is run separately. For example, to run the computesin module,
metaspectrast computesin -h
Output
>>>
usage: metaSpectraST_SIn.py [-h] [-s [SPTXT]] -m MGF [MGF ...]
metaSpectraST (v0.0) by Hao, Chunlin.
Compute normalized spectral index (SIn) of consensus spectra.
optional arguments:
-h, --help show this help message and exit
-s [SPTXT] consensus spectra .sptxt file, grandConsensus.sptxt by default.
-m MGF [MGF ...] raw spectra data sets in MGF format
Step 1: performing spectral clustering
Run the following command to perform spectral clustering:
metaspectrast cluster <path/*mzML>
The fragmentation type of the spectra (ETD, HCD, or CID-QTOF) can be specified with the -i option. The option is off by default, in which case the fragmentation type can be determined from the data files.
metaspectrast cluster -i HCD <path/*mzML>
When this step is done, it produces the following output files in the working directory. The file bar.splib is the spectral library in a binary format. The file bar.sptxt is a human-readable version of bar.splib. The files bar.spidx and bar.pepidx are indices on the precursor m/z value and the peptide, respectively. The file grandConsensus.sptxt is the library of consensus spectra, which will be used in the subsequent steps. A library of consensus spectra in .mgf format is also produced, named grandConsensus.mgf.
Step 2: profiling samples
Each consensus spectrum created in Step 1 can be quantified by counting the number (spectral count, SC) or intensity (spectral index, SI<sub>N</sub>) of the replicate spectra (raw spectra) in the corresponding spectral cluster in each sample. The quantified consensus spectra can then be used to profile the samples.
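The two quantities can be sketched as follows. This is not the metaSpectraST implementation: the record layout, the names, and the use of a single intensity value per raw spectrum are assumptions made for illustration, since how the intensity of a spectrum is measured is not specified here.

```python
from collections import defaultdict

# Hypothetical records: (cluster id, sample name, intensity of one raw spectrum).
records = [
    ("c1", "sampleA", 100.0),
    ("c1", "sampleA", 300.0),
    ("c1", "sampleB", 50.0),
    ("c2", "sampleB", 200.0),
]


def profile(records):
    """Return per-cluster, per-sample spectral counts and summed intensities."""
    sc = defaultdict(lambda: defaultdict(int))    # number of replicate spectra
    si = defaultdict(lambda: defaultdict(float))  # summed spectrum intensity
    for cluster, sample, intensity in records:
        sc[cluster][sample] += 1
        si[cluster][sample] += intensity
    return sc, si


sc, si = profile(records)
```

Here cluster c1 has a spectral count of 2 and a summed intensity of 400.0 in sampleA. Both values are unnormalized at this point.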
Spectral count-based (SC) profiling
metaspectrast computesc -s <path/grandConsensus.sptxt>
When it is done, it produces two CSV files, unnorm_consensusPep_SC.csv and consensusSpec_RawSpectra_idx.csv. The file unnorm_consensusPep_SC.csv contains the unnormalized spectral count of the consensus spectra in each sample, which can be normalized by the normalize module (see Step 3) or simply by the sum of the spectral counts in each data set. The file consensusSpec_RawSpectra_idx.csv is an index that maps each raw spectrum to its consensus spectrum.
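Normalizing by the per-data-set sum can be sketched as below. The column layout of unnorm_consensusPep_SC.csv is not described here, so the sketch works on an in-memory mapping of cluster to per-sample counts. The function name and the example values are hypothetical.

```python
def normalize_by_sample_total(matrix):
    """Divide every value by the total of its sample (data set).

    `matrix` maps cluster -> {sample: value}. A sample whose total is zero
    keeps zeros, which is an assumption of this sketch.
    """
    totals = {}
    for row in matrix.values():
        for sample, value in row.items():
            totals[sample] = totals.get(sample, 0) + value
    return {
        cluster: {
            sample: (value / totals[sample] if totals[sample] else 0.0)
            for sample, value in row.items()
        }
        for cluster, row in matrix.items()
    }


# Hypothetical unnormalized spectral counts for two clusters and two samples.
counts = {
    "c1": {"sampleA": 2, "sampleB": 1},
    "c2": {"sampleA": 0, "sampleB": 3},
}
normalized = normalize_by_sample_total(counts)
```

After normalization the values of each sample sum to 1, so samples with different numbers of acquired spectra become comparable.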
Normalized spectral index-based (SI<sub>N</sub>) profiling
metaspectrast computesin -s <path/grandConsensus.sptxt> -m <path/*mgf>
Note that each .mgf file has to be named the same as the corresponding input file in Step 1.
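A quick check of the naming requirement before running computesin might look like the sketch below. It assumes that "named the same" means an identical base name with a different extension. The helper name and the example paths are hypothetical.

```python
from pathlib import Path


def unmatched_stems(clustered_files, mgf_files):
    """Compare base names (file names without extension) of two file lists.

    Returns the base names that lack an .mgf counterpart and the .mgf base
    names that lack a clustered input file, both sorted.
    """
    cluster_stems = {Path(f).stem for f in clustered_files}
    mgf_stems = {Path(f).stem for f in mgf_files}
    return sorted(cluster_stems - mgf_stems), sorted(mgf_stems - cluster_stems)


# Hypothetical file lists: the inputs of Step 1 and the .mgf files for Step 2.
missing_mgf, extra_mgf = unmatched_stems(
    ["data/run1.mzML", "data/run2.mzML"],
    ["mgf/run1.mgf", "mgf/run3.mgf"],
)
```

In this example run2 has no matching .mgf file and run3.mgf has no matching Step 1 input, so both lists are non-empty and the names should be fixed first.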
When it is done, it produces three CSV files, unnorm_consensusPep_SI.csv, consensusPep_SIn.csv and consensusSpec_RawSpectra_idx.csv. Similar to SC profiling, the file unnorm_consensusPep_SI.csv contains the unnormalized spectral index of the consensus spectra in each sample, which can be normalized by the normalize module (see Step 3). The file consensusPep_SIn.csv contains the same data as unnorm_consensusPep_SI.csv, but normalized by the sum of the spectral index in each data set. The file consensusSpec_RawSpectra_idx.csv is an index that maps each raw spectrum to its consensus spectrum.
Step 3: classifying samples and visualization
Hierarchical clustering of samples can be performed based on their SI<sub>N</sub> or SC profiles, but the profiles need to be normalized first.
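The general idea of hierarchically clustering samples by their normalized profiles can be sketched in a few lines. The distance measure and linkage method used by the classify module are not stated here, so Euclidean distance and single linkage are swapped in purely for illustration. The sample names and profile values are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(profiles):
    """Agglomerative clustering; returns merges as (group_a, group_b, distance).

    `profiles` maps sample name -> list of normalized values, one value per
    consensus spectrum, in the same order for every sample.
    """
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Single linkage: distance between the closest pair of members.
            d = min(euclidean(profiles[x], profiles[y]) for x in a for y in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Hypothetical normalized profiles over three consensus spectra.
profiles = {
    "gut_1": [0.6, 0.3, 0.1],
    "gut_2": [0.5, 0.4, 0.1],
    "soil_1": [0.1, 0.1, 0.8],
}
merges = single_linkage(profiles)
```

With these toy values the two gut samples merge first, and the soil sample joins them in the second and final merge.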
Normalization
Normalization of the SI<
