SsGSEA2.0
Single sample Gene Set Enrichment analysis (ssGSEA) and PTM Enrichment Analysis (PTM-SEA)
ssGSEA2.0/PTM-SEA
Resources for gene-centric single sample Gene Set Enrichment Analysis (ssGSEA) of gene expression data (e.g. mRNAs, proteins) and site-centric PTM Signature Enrichment Analysis (PTM-SEA) [1] of phosphoproteomics data sets using the PTM signatures database (PTMsigDB) [1].
Disclaimer
The primary purpose of this repository is to supplement our manuscript in which we describe PTM-SEA and PTMsigDB. While ssGSEA2.0 presents an updated version of the original ssGSEA R implementation, we want to acknowledge that this is not the primary repository for ssGSEA. The official codebase for ssGSEA can be found here, and the official GenePattern module to perform ssGSEA can be accessed here.
ssGSEA 2.0
This is an updated version of the original ssGSEA [2,3] R implementation. Depending on the input dataset and the chosen database (gene sets or PTM signatures), the software performs either ssGSEA or PTM-SEA, respectively. The Molecular Signatures Database (MSigDB) [4] provides a large collection of curated gene sets. Gene sets are stored as plain text in GMT format. A current version of the MSigDB gene set collections can be found in the db/msigdb subfolder. MSigDB gene sets are released under the Creative Commons Attribution 4.0 International License. The license terms can be found in the db/msigdb folder.
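Because GMT is a plain-text, tab-delimited format (one set per line: set name, a description field, then the members), a database file can be inspected before running an analysis. The following is a minimal Python sketch, independent of the R scripts in this repository; the function name `read_gmt` and the toy set names and members are assumptions made for illustration only.

```python
import io


def read_gmt(handle):
    """Parse GMT text: each line is name <TAB> description <TAB> member1 <TAB> member2 ..."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and lines without any members
        name, _description, *members = fields
        # Drop empty strings left behind by trailing tabs.
        gene_sets[name] = [m for m in members if m]
    return gene_sets


# Toy in-memory example; the set names and members are made up for illustration.
example = io.StringIO(
    "SET_A\tna\tTP53\tMDM2\tCDKN1A\n"
    "SET_B\tna\tEGFR\tERBB2\n"
)
gene_sets = read_gmt(example)
print(gene_sets["SET_A"])
```

To read a real database, pass an open file handle for a GMT file instead of the in-memory example.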
ssGSEA2.0/PTM-SEA accepts input files in Gene Cluster Text (GCT) format, either v1.2 or v1.3. Morpheus provides a convenient way to convert your data tables into GCT format.
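For small tables, a GCT v1.2 file can also be written directly. The sketch below assumes the commonly documented v1.2 layout: a `#1.2` version line, a line with the number of rows and the number of samples, a header line starting with `Name` and `Description`, then one tab-delimited line per row. The function name `write_gct_v12` and the toy values are hypothetical; this is not part of the repository.

```python
import io


def write_gct_v12(handle, row_ids, sample_ids, values, descriptions=None):
    """Write a numeric matrix as GCT v1.2 text (rows = features, columns = samples)."""
    if descriptions is None:
        descriptions = row_ids  # reuse the row id when no description is available
    handle.write("#1.2\n")
    handle.write(f"{len(row_ids)}\t{len(sample_ids)}\n")
    handle.write("\t".join(["Name", "Description", *sample_ids]) + "\n")
    for row_id, description, row in zip(row_ids, descriptions, values):
        if len(row) != len(sample_ids):
            raise ValueError(f"row {row_id!r} has {len(row)} values, expected {len(sample_ids)}")
        handle.write("\t".join([row_id, description, *(str(v) for v in row)]) + "\n")


# Toy matrix with made-up values, written to memory instead of a file.
buf = io.StringIO()
write_gct_v12(
    buf,
    row_ids=["TP53", "EGFR"],
    sample_ids=["sample_1", "sample_2"],
    values=[[0.5, -1.2], [2.0, 0.1]],
)
gct_text = buf.getvalue()
print(gct_text)
```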
For more information about the GSEA method and MSigDB please visit http://software.broadinstitute.org/gsea/.
PTMsigDB v2.0.0
Please check out our new website for PTMsigDB. We have updated PTMsigDB to version v2.0.0, in which we provide better and more consistent annotation of each PTM site. We have also included a disease category comprising signatures associated with certain diseases, curated from the table Disease-associated_sites available at PhosphoSitePlus (PSP) [5].
The PTM signatures database (PTMsigDB) is a database of modification site-specific signatures of perturbations, kinase activities and signaling pathways curated from more than 2,500 publications, and it provides the foundation for performing PTM-SEA. A unique advantage of PTMsigDB over other pathway databases is that each PTM site is annotated with its reported direction of change upon a specific perturbation or signaling event, and this direction is incorporated into the scoring scheme of PTM-SEA. The foundation of PTMsigDB is PhosphoSitePlus (PSP) [5], a comprehensive systems biology resource for PTMs, which provides high-quality curation and annotation of PTMs at the individual residue level. A collection of PTM sites whose levels are collectively regulated in a curated pathway or upon a perturbation is defined as a signature set. Signature sets in PTMsigDB can be separated into different categories: 1) Perturbation signatures derived from treatment of cells with perturbagens such as small molecules or growth factors; 2) Signature sets of molecular signaling pathways; 3) Kinase-substrate signatures; and 4) Disease-associated signature sets.
To ensure a high degree of compatibility with phosphorylation datasets generated by different software packages and searched against different protein sequence databases, PTMsigDB represents phosphorylation sites in signatures using three different identifiers: 1) PSP site group ID; 2) UniProt-centric ID; 3) Flanking sequence (Table 1). While the PSP site group ID provides an unambiguous representation of PTM sites within protein families and across species [5], using this type of identifier restricts the analysis to PTM sites present in PSP. We generally recommend using the flanking sequence as the site identifier, since it is less sensitive to updates made to protein sequence databases.
| Database format | Site accession | Example in PTMsigDB | Example in dataset | Download |
| ----------------- | -------------- | ------------------- | ------------------- | ------------ |
| UniProt-centric | Uniprot_acc;site-type;direction | Q06609;Y315-p;u | Q06609;Y315-p | human<br>mouse<br>rat |
| Flanking sequence | +/-7aa flanking seq-type;direction | ETRICKIYDSPCLPE-p;u | ETRICKIYDSPCLPE-p | human<br>mouse<br>rat |
| PSP site group id | site_grp_id-type;direction | 448324-p;u | 448324-p | human<br>mouse<br>rat |
Table 1: PTM site representation in PTMsigDB. The direction of change for a PTM site in a signature is indicated by ;u (up-regulation) or ;d (down-regulation). Please note that the annotation of directionality is a feature of PTMsigDB (column: Example in PTMsigDB) and must not be included when generating compatible site identifiers for a particular dataset (column: Example in dataset).
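The identifier conventions in Table 1 can be illustrated with a short Python sketch. It builds a +/-7 amino acid flanking-sequence identifier for a dataset row and separates the direction suffix from a PTMsigDB entry. The function names, the toy protein sequence and the `_` padding used near protein termini are assumptions for illustration; check the padding and letter case against the PTMsigDB files you download.

```python
def flanking_window(protein_seq, position, flank=7, pad="_"):
    """Return the residues from position-flank to position+flank (position is 1-based).

    Windows that run past a protein terminus are padded with `pad` (an assumption).
    """
    idx = position - 1
    if not 0 <= idx < len(protein_seq):
        raise ValueError(f"position {position} is outside the sequence")
    padded = pad * flank + protein_seq + pad * flank
    # Residue idx sits at idx + flank in the padded string, so the window starts at idx.
    return padded[idx : idx + 2 * flank + 1]


def dataset_site_id(flank_seq, mod_type="p"):
    """Build a dataset-side identifier, e.g. 'ETRICKIYDSPCLPE-p' (no direction suffix)."""
    return f"{flank_seq.upper()}-{mod_type}"


def split_signature_entry(entry):
    """Split a PTMsigDB entry such as 'Q06609;Y315-p;u' into (site_id, direction)."""
    site_id, sep, direction = entry.rpartition(";")
    if not sep or direction not in ("u", "d"):
        raise ValueError(f"no ';u' or ';d' suffix in {entry!r}")
    return site_id, direction


# Toy sequence (not a real protein) built so that residue 10 is the central Y.
toy_seq = "MA" + "ETRICKIYDSPCLPE" + "GG"
site_id = dataset_site_id(flanking_window(toy_seq, 10))
print(site_id)
print(split_signature_entry("Q06609;Y315-p;u"))
```

Only the identifier without the `;u` or `;d` suffix belongs in the dataset; the suffix appears only on the database side.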
PTM-SEA
PTM-Signature Enrichment Analysis (PTM-SEA) is a modified version of ssGSEA that performs site-specific signature analysis by scoring PTMsigDB's bi-directional signature sets. The input to PTM-SEA is a single site-centric data matrix, m, stored in GCT v1.2 or GCT v1.3 format, together with the PTM signatures database (PTMsigDB). Each row in m represents a single phosphorylation site confidently localized to a specific amino acid residue, and each column holds the measured abundances for one sample. Multiple phosphorylation sites detected on the same peptide have to be converted into separate site-specific entries, one for every site. While some proteomics software packages, such as MaxQuant [6], readily produce single site-centric PTM reports, the use of other software packages might require additional preprocessing steps.
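The conversion of multiply phosphorylated peptides into separate site-specific rows might look like the Python sketch below. The input structure (a list of site identifiers plus one abundance vector per peptide), the function name and the accessions are hypothetical; real search-engine reports differ and usually need their own parsing step first.

```python
def expand_to_single_sites(peptide_rows):
    """Turn peptide-level rows into one row per localized site.

    peptide_rows: list of (site_ids, abundances), where site_ids lists every
    localized site on the peptide and abundances holds one value per sample.
    """
    site_rows = []
    for site_ids, abundances in peptide_rows:
        for site_id in site_ids:
            # Each site inherits a copy of the peptide-level measurements.
            site_rows.append((site_id, list(abundances)))
    return site_rows


# Hypothetical accessions and values: one doubly and one singly phosphorylated peptide.
peptide_rows = [
    (["P12345;S10-p", "P12345;T14-p"], [1.2, -0.4]),
    (["P67890;Y88-p"], [0.3, 0.9]),
]
site_rows = expand_to_single_sites(peptide_rows)
for site_id, abundances in site_rows:
    print(site_id, abundances)
```

The sketch does not merge rows when the same site is observed on several peptides; how to combine such duplicates is a separate analysis decision that is not covered here.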
How can I use these tools?
ssGSEA2.0/PTM-SEA can be run on a local PC/Mac in R or RStudio. In addition, ssGSEA2.0/PTM-SEA can be accessed on Broad's public GenePattern [7] server. Below we provide instructions on how to run ssGSEA2.0/PTM-SEA.
Example dataset
We provide an example dataset that can be used to test PTM-SEA. The dataset is based on Supplemental Table 6 in [1].
Single site-centric phosphoproteome dataset
GenePattern
GenePattern is a powerful platform to deploy and run software or entire analysis pipelines in a web browser [7]. We have implemented ssGSEA2.0/PTM-SEA as a GenePattern module, which can be accessed at the link below. Please note that access to the public GenePattern server requires free registration.
PTM-SEA in GenePattern: https://tinyurl.com/PTM-SEA-GP
R-GUI / RStudio
The script ssgsea-gui.R requires little or no knowledge of R or of the command line. Input files and databases can be specified via Windows file dialogs that are invoked automatically. The first dialog lets you choose a folder containing input files in GCT v1.2 or GCT v1.3 format. The script loops over all GCT files in this directory and runs ssGSEA on each file separately. The second dialog window lets you choose one or multiple gene set databases in GMT format such as [MSigDB](http://software.broadinstitute.org/gsea/msig
