
ProteomicsAnalysisPipeline

A highly customizable proteomics analysis pipeline for TMT and Olink data.


Analysis Pipeline - Textual

Introduction

Created by Thomas Pelowitz (github.com/tjpel) for the use of the Sung Research Group at Mayo Clinic (PI: Dr. Jaeyun Sung) and the Mayo Clinic Proteomics Core (Supervisor: Benjamin Madden).

This pipeline automatically performs data cleaning, processing, transformation, normalization, and visualization on several forms of proteomics data.

Installation

This pipeline requires several widely available Python packages, listed with their versions in the requirements.txt file. To install them:

First, install Python version 3.12 or later. Next, install the necessary packages using one of the following methods:
Pip: pip install -r requirements.txt
Conda: with an active conda environment, conda install --file requirements.txt
Finally, if you wish to create .svg visualization files, you must install the kaleido library through pip:
pip install kaleido

Data and Directory Preparation

For the pipeline to perform correctly, the input must match the expected format. First, prepare a directory for the input data: the project directory should contain a data directory, which in turn contains a raw directory.

Place your input data within your {project_directory}/data/raw directory. This data must follow a format specific to the type of data; formatting examples are provided in the examples directory of this repo. A few requirements are common to all file types.

  1. The file is in an .xlsx format.

  2. The first sheet is labeled "Data". This sheet's purpose is to provide and describe the proteins and their counts from the experiment. This sheet will contain metadata columns describing the proteins, along with protein counts for each sample. Each sample column must begin with "Sample ".

  3. The second sheet is labeled "Sample Information". This sheet describes which samples belong to which study groups. The first column, "Sample ID", will contain the names of samples, using the same names as they were given in the "Data" sheet. Each subsequent column will describe a category of study groups from the experiment, with the values of the column being study groups in the category. For example, a column "Drug Dose" may describe which samples have received a "Full Dose", "Half Dose", or "No Dose".

  4. The third sheet, "Study Group Information", describes which study groups act as controls or cases for a category of study groups. The values of the "Study Group" column should be the same study groups as in "Sample Information". "Study Group ID" refers to the category of the study group, and "Study Group Type" values should be "Case" or "Control".

  5. The final "Notes" sheet is optional and is a place to record details about the experiment or findings. This is the ONLY place where notes may be written within the input file; unexpected text on any other sheet may cause the pipeline to not work as intended.

It is highly recommended that you use the examples as a reference to format the input for your project.
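The "Sample " prefix convention on the "Data" sheet is what lets the pipeline separate protein metadata columns from per-sample count columns. A minimal sketch of that split, using a hypothetical helper name and illustrative column names (only "Assay" appears in this document; the others are made up):

```python
def split_columns(columns):
    # Columns whose names start with "Sample " hold counts for one sample;
    # everything else is treated as protein metadata.
    sample_cols = [c for c in columns if c.startswith("Sample ")]
    metadata_cols = [c for c in columns if not c.startswith("Sample ")]
    return metadata_cols, sample_cols
```

For example, split_columns(["Assay", "Sample A1", "Sample A2"]) separates the single metadata column from the two sample columns.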

Configuration

Many aspects of the pipeline's behavior are controlled by the configuration file, config.json. This file must follow the expected structure exactly for the pipeline to perform as intended; please refer to the example configuration as you set up the configuration for your project. The objects are ordered so that those closer to the top are edited more frequently over the course of setting up and performing an analysis. The following parent objects must be present:

"project_information"

This object describes the basic information about the project. Objects within this include:

  • "file_type": Values: One of ["Olink", "TMT Phospho", "TMT Protein List"]. This object describes the type of file and data the pipeline should expect. Note: pTyr is included in "TMT Phospho".
  • "relative_path": A relative path to the project directory from where the analysis_pipeline.py script will be called from.
  • "raw_data_name": The name (including the .xlsx extension) of the raw data file.
"ordered_pipeline"

This object dictates the data cleaning, normalization, and transformation steps, and their order, to be performed by the pipeline. Objects within it are formatted as "X" : {object details}, where "X" is an integer denoting the position, in ascending order, at which the process will be performed. The object details give the process and its arguments: the name of the process is denoted as "Name" : process_name, and any argument for the process as "Argument" : argument. Currently, no process takes multiple arguments.
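A minimal config.json sketch. The path and file name are hypothetical, the process names and the "Name"/"Argument" keys follow the conventions described in this README, and steps without an argument are assumed to simply omit the "Argument" key:

```json
{
  "project_information": {
    "file_type": "Olink",
    "relative_path": "./my_project",
    "raw_data_name": "raw_data.xlsx"
  },
  "ordered_pipeline": {
    "1": { "Name": "Drop Duplicates" },
    "2": { "Name": "Remove Proteins With >=X% Values Missing Globally", "Argument": 60 },
    "3": { "Name": "Median Normalization" },
    "4": { "Name": "LogX Transformation", "Argument": 2 }
  }
}
```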

Pipeline processes:

  • "Drop Duplicates" This method drops rows that lack a unique value in the primary key column ("Assay" for Olink, "Modifications in Master Protein" for TMT Phospho). The row with the first instance of a repeated value (i.e. the row with the lowest index) is kept, and all others are removed. Proteins and values removed this way are not added to the "Removed Proteins" intermediate dataset nor to the delivery dataset.

  • "Remove Proteins With >=X% Values Missing in Each Group" This method removes proteins that are missing from at least a given proportion of samples in every study group of a category. Argument: Values: One of [integer, float] such that 0 <= argument <= 100. A protein missing from a proportion of samples equal to or greater than this threshold in each study group of a category will be removed. Example: If the argument value is set to 60, and a protein is missing from 60% or more of the samples in study groups 1 and 2 of study group category A, that protein will be removed if category A contains only groups 1 and 2. If category A also contains a study group 3 where the same protein is missing from only 40% of its samples, the protein will not be removed.

  • "Remove Proteins With >=X% Values Missing Globally" This method removes proteins that are missing from at least a given proportion of all samples. Argument: Values: One of [integer, float] such that 0 <= argument <= 100. A protein missing from a proportion of samples equal to or greater than this threshold will be removed.
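The global filter above can be sketched in plain Python. This is an illustrative implementation, not the pipeline's actual code: None marks a missing measurement, and the function name and data layout are assumptions:

```python
def remove_missing_globally(counts, threshold_pct):
    # counts: {protein name: list of values (None = missing), one per sample}.
    # Drop any protein missing in >= threshold_pct percent of samples.
    kept = {}
    for protein, values in counts.items():
        missing_frac = sum(v is None for v in values) / len(values)
        if missing_frac * 100 < threshold_pct:
            kept[protein] = values
    return kept
```

With a threshold of 60, a protein missing in 3 of 4 samples (75%) is dropped, while one missing in 1 of 4 (25%) is kept.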

  • "Median Normalization" This method normalizes the values for each sample by subtracting the median protein count of that sample from each protein count in that sample. Where $x_{i,j}$ is the prenormalized count of protein $i$ for sample $j$, $\tilde{x}_{j}$ is the median count of sample $j$, and $y_{i,j}$ is the count of protein $i$ for sample $j$ after normalization: $$y_{i,j} = x_{i,j} - \tilde{x}_{j}$$

  • "Mean Normalization" This method normalizes the values for each sample by subtracting the mean protein count of that sample from each protein count in that sample. Where $x_{i,j}$ is the prenormalized count of protein $i$ for sample $j$, $\bar{x}_{j}$ is the mean count of sample $j$, and $y_{i,j}$ is the count of protein $i$ for sample $j$ after normalization: $$y_{i,j} = x_{i,j} - \bar{x}_{j}$$

  • "Total Value Normalization" This method normalizes the values for each sample by setting each protein's value to its proportion of the sum of all protein counts for that sample. Where $x_{i,j}$ is the prenormalized count of protein $i$ for sample $j$ and $y_{i,j}$ is the count of protein $i$ for sample $j$ after normalization: $$y_{i,j} = \frac{x_{i,j}}{\sum_{i}{x_{i,j}}}$$
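The median and total-value formulas above are simple per-sample operations. A minimal sketch using only the standard library (function names are illustrative, not the pipeline's API):

```python
import statistics

def median_normalize(sample_values):
    # Subtract the sample's median from every protein count in that sample.
    med = statistics.median(sample_values)
    return [v - med for v in sample_values]

def total_value_normalize(sample_values):
    # Express each count as its proportion of the sample's total counts.
    total = sum(sample_values)
    return [v / total for v in sample_values]
```

For instance, median_normalize([1, 2, 3]) yields [-1, 0, 1], and total_value_normalize([1, 1, 2]) yields [0.25, 0.25, 0.5].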

  • "Quantile Normalization" This method quantile normalizes the values across samples, forcing every sample to share the same distribution of values.
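One common way to implement quantile normalization, shown here as a sketch rather than the pipeline's actual code (it assumes a complete matrix with no missing values, and ties are broken by sort order): each value is replaced by the mean, across samples, of the values sharing its within-sample rank.

```python
import numpy as np

def quantile_normalize(matrix):
    # matrix: proteins (rows) x samples (columns).
    ranks = np.argsort(np.argsort(matrix, axis=0), axis=0)  # rank of each value within its sample
    rank_means = np.sort(matrix, axis=0).mean(axis=1)       # mean across samples at each rank
    return rank_means[ranks]                                # map every value to its rank's mean
```

After this step, every column contains exactly the same set of values (the rank means), just in different orders.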

  • "Impute Missing Values with X% of the Minimum Value of Sample" This method replaces each missing value with a fraction of the minimum value of the sample the missing value belongs to. Argument: Values: One of [integer, float] such that 0 <= argument. The proportion of the sample's minimum value that a missing value will be set to. A value greater than 100 will set missing values to a value greater than the minimum value of the sample.

  • "Impute Missing Values with X% of the Minimum Value of Protein" This method replaces each missing value with a fraction of the minimum value of the protein the missing value belongs to. Argument: Values: One of [integer, float] such that 0 <= argument. The proportion of the protein's minimum value that a missing value will be set to. A value greater than 100 will set missing values to a value greater than the minimum value of the protein.
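The sample-minimum variant can be sketched as follows; the protein-minimum variant is identical with the roles of rows and columns swapped. The function name is illustrative, and None again stands in for a missing value:

```python
def impute_with_pct_of_sample_min(sample_values, pct):
    # Replace each missing value (None) with pct% of the minimum
    # observed value in this sample.
    observed = [v for v in sample_values if v is not None]
    fill = min(observed) * pct / 100
    return [fill if v is None else v for v in sample_values]
```

With pct=50 and a sample minimum of 2, each missing value becomes 1.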

  • "<i>Z</i>-Score Transformation" This method transforms all protein counts into <i>Z</i>-scores. <i>Z</i>-score transformation is defined as follows, where $x_{i,j}$ is the pretransformed count of protein $i$ for sample $j$, $\bar{x}_{j}$ and $\sigma_j$ are the mean and standard deviation of the counts in sample $j$, and $y_{i,j}$ is the count of protein $i$ for sample $j$ after transformation: $$y_{i,j} = \frac{x_{i,j} - \bar{x}_{j}}{\sigma_j}$$

  • "LogX Transformation" This method transforms each protein count value to its log<sub><i>a</i></sub> value, where <i>a</i> is the value of this method's argument. If the dataset contains negative values at the time this process is performed, a pseudo count will first be introduced: the (minimum value of the dataset - 1) is subtracted from every value, giving a new minimum value of 1. Argument: Values: One of [2]. As of now, this method can only perform log<sub>2</sub>() operations; this functionality may be expanded in the future if the need arises.
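The pseudo-count rule above can be made concrete with a short sketch (illustrative only; the function name is not the pipeline's API, and it follows the README's rule of shifting only when negative values are present):

```python
import math

def log2_transform(values):
    # If negatives are present, subtract (minimum - 1) from every value
    # so the new minimum is 1, then take log base 2 of each value.
    m = min(values)
    if m < 0:
        values = [v - (m - 1) for v in values]
    return [math.log2(v) for v in values]
```

For example, log2_transform([1.0, 2.0, 8.0]) yields [0.0, 1.0, 3.0]; with a negative input like [-3.0, 1.0], the values are first shifted to [1.0, 5.0] before taking logs.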

"comparisons"

This object describes the statistical comparisons to be performed between the case and control study groups defined in "Study Group Information".
