proDA

The goal of proDA is to identify differentially abundant proteins in label-free mass spectrometry data. The main challenge with this kind of data is the large number of missing values. The missing values do not occur at random but predominantly at low intensities, which means that they cannot simply be ignored. Existing methods have mostly focused on replacing the missing values with some reasonable number (“imputation”) and then running classical methods. But imputation is problematic because it obscures the amount of available information, which in turn can lead to over-confident predictions.
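To make the problem concrete, here is a small base R simulation. It is only a sketch under assumed numbers (the mean, the spread, and the shape of the dropout probability are made up and not taken from real data). It shows that if low values are more likely to go missing, simply dropping the missing values overestimates the mean.

```r
set.seed(1)
# True log2 abundances of one protein across many hypothetical replicates
true_values <- rnorm(1000, mean = 20, sd = 1)

# Assumed dropout probability: the lower the intensity, the more likely a value is missing
p_missing <- pnorm(true_values, mean = 20, sd = 1, lower.tail = FALSE)
observed <- ifelse(runif(1000) < p_missing, NA, true_values)

mean(true_values)              # close to 20
mean(observed, na.rm = TRUE)   # biased upwards, because low values drop out more often
mean(is.na(observed))          # fraction of missing values
```

Ignoring the missing values therefore does not just lose precision, it shifts the estimate.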

proDA, on the other hand, does not impute missing values but constructs a probabilistic dropout model. For each sample, it fits a sigmoidal dropout curve. This information can then be used to infer means across samples and the associated uncertainty, without the intermediate imputation step. proDA supports full linear models with variance and location moderation.
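As an illustration of what a sigmoidal dropout curve looks like, the sketch below uses a cumulative normal curve as one possible sigmoid. The helper `dropout_prob()` and its parameters `inflection` and `scale` are hypothetical names chosen for this example; they are not part of the proDA API.

```r
# Hypothetical helper (not part of proDA): probability that a value with
# log2 intensity x is missing, modeled as a decreasing sigmoid.
# `inflection` is the intensity at which half of the values drop out,
# `scale` controls how steep the curve is.
dropout_prob <- function(x, inflection, scale) {
  pnorm(x, mean = inflection, sd = scale, lower.tail = FALSE)
}

intensities <- seq(15, 25, by = 2.5)
probs <- dropout_prob(intensities, inflection = 20, scale = 1.5)
round(probs, 3)

# Visualize the curve for one hypothetical sample
curve(dropout_prob(x, inflection = 20, scale = 1.5), from = 15, to = 25,
      xlab = "log2 intensity", ylab = "Probability of dropout")
```

In this picture, each sample would get its own inflection point and scale, so samples with more dropout have a curve shifted towards higher intensities.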

For full details, please see our preprint:

Constantin Ahlmann-Eltze and Simon Anders: proDA: Probabilistic Dropout Analysis for Identifying Differentially Abundant Proteins in Label-Free Mass Spectrometry. bioRxiv 661496 (Jun 2019)

Installation

proDA is implemented as an R package.

You can install it from Bioconductor by typing the following commands into R:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("proDA")

To get the latest development version from GitHub, you can use the devtools package:

# install.packages("devtools")
devtools::install_github("const-ae/proDA")

The pkgdown documentation for the package is available at https://const-ae.github.io/proDA/reference.

In the following section, I will give a very brief overview of the main functionality of the proDA package, aimed at experienced R users. New users are advised to skip this “quickstart” and go directly to the “proDA Walkthrough” section, where I give a complete walkthrough and explain in detail what steps are necessary for the analysis of label-free mass spectrometry data.

Quickstart

The three steps that are necessary to analyze the data are

Load the data (see vignette on loading MaxQuant output files)
Fit the probabilistic dropout model (proDA())
Test in which proteins the coefficients of the model differ (test_diff())

# Load the package
library(proDA)
# Generate some dataset with known structure
syn_dataset <- generate_synthetic_data(n_proteins = 100, n_conditions = 2)

# The abundance matrix
syn_dataset$Y[1:5, ]
#>           Condition_1-1 Condition_1-2 Condition_1-3 Condition_2-1 Condition_2-2 Condition_2-3
#> protein_1            NA            NA      18.88592            NA      18.72059      20.06119
#> protein_2      21.37123      20.53557      18.83239      20.41027      21.73266      21.16719
#> protein_3            NA      18.77742      18.98681            NA            NA      19.20291
#> protein_4      25.44209      25.15151      25.38142      25.22754      24.95229      24.97185
#> protein_5      23.46724      23.15808      23.21357      23.29562      23.25999      23.57925

# Assignment of the samples to the two conditions
syn_dataset$groups
#> [1] Condition_1 Condition_1 Condition_1 Condition_2 Condition_2 Condition_2
#> Levels: Condition_1 Condition_2

# Fit the probabilistic dropout model
fit <- proDA(syn_dataset$Y, design = syn_dataset$groups)

# Identify which proteins differ between Condition 1 and 2
test_diff(fit, `Condition_1` - `Condition_2`, sort_by = "pval", n_max = 5)
#> # A tibble: 5 x 10
#>   name              pval adj_pval  diff t_statistic    se    df avg_abundance n_approx n_obs
#>   <chr>            <dbl>    <dbl> <dbl>       <dbl> <dbl> <dbl>         <dbl>    <dbl> <dbl>
#> 1 protein_96  0.00000248 0.000248  8.62        39.4 0.219     4          22.2     4.02     4
#> 2 protein_95  0.0000103  0.000513 -4.84       -27.6 0.175     4          21.2     6.       6
#> 3 protein_91  0.0000528  0.00176  -4.17       -18.3 0.228     4          19.1     4.01     4
#> 4 protein_98  0.000236   0.00479   4.35        12.5 0.348     4          21.6     6.00     6
#> 5 protein_100 0.000239   0.00479   2.49        12.5 0.200     4          21.3     4.95     5

Other helpful functions for quality control are median_normalization() and dist_approx().
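To give an idea of what these two steps do conceptually, here is a base R sketch on a made-up matrix. `median_normalize_sketch()` is a hypothetical stand-in written for this example, not the implementation of median_normalization(), and base R's dist() is used only as a simple substitute for dist_approx().

```r
# Toy log2 abundance matrix: 4 proteins x 3 samples (made-up numbers),
# where sample s3 is shifted upwards by roughly 1
Y <- cbind(s1 = c(20.0, 22.0,   NA, 25.0),
           s2 = c(20.2, 21.9, 18.1,   NA),
           s3 = c(21.1, 23.0, 19.0, 26.0))

# Hypothetical helper: remove the median deviation of each sample
# from the per-protein average, ignoring missing values
median_normalize_sketch <- function(Y) {
  row_avg <- rowMeans(Y, na.rm = TRUE)
  sample_shift <- apply(Y - row_avg, 2, median, na.rm = TRUE)
  sweep(Y, 2, sample_shift, FUN = "-")
}

Y_norm <- median_normalize_sketch(Y)

# Euclidean distances between samples; dist() skips pairs with missing
# entries and rescales the sum accordingly
sample_dist <- dist(t(Y_norm))
sample_dist
```

The missing values stay missing after this kind of normalization; only the observed values are shifted.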

proDA Walkthrough

proDA is an R package that implements a powerful probabilistic dropout model to identify differentially abundant proteins. The package was specifically designed for label-free mass spectrometry data, and in particular for handling its many missing values.

But all this is useless if you cannot load your data and get it into a usable shape. In the following sections, I will explain how to load the abundance matrix and prepare it for the analysis. The steps that I will go through are

Load the proteinGroups.txt MaxQuant output table
Extract the intensity columns and create the abundance matrix
Replace the zeros with NAs and take the log2() of the data
Normalize the data using median_normalization()
Inspect sample structure with a heatmap of the distance matrix (dist_approx())
Fit the probabilistic dropout model with proDA()
Identify differentially abundant proteins with test_diff()

Load Data

I will now demonstrate how to load a MaxQuant output file. For more information about other approaches for loading the data, please take a look at the vignette on loading data.

MaxQuant is one of the most popular tools for handling raw MS data. It produces a number of files. The important file that contains the protein intensities is called proteinGroups.txt. It is a large table with detailed information about the identification and quantification process for each protein group (which I will from now on just call “protein”).

This package comes with an example proteinGroups.txt file, located in the package folder. The file contains the reduced output from an experiment studying the different DHHCs in Drosophila melanogaster.

system.file("extdata/proteinGroups.txt", package = "proDA", mustWork = TRUE)
#> [1] "/Users/ahlmanne/Library/R/3.6/library/proDA/extdata/proteinGroups.txt"

In this example, I will use the base R functions to load the data, because they don’t require any additional dependencies.

# Load the table into memory
maxquant_protein_table <- read.delim(
    system.file("extdata/proteinGroups.txt", package = "proDA", mustWork = TRUE),
    stringsAsFactors = FALSE
)

As I have mentioned, the table contains a lot of information (359 columns!!), but we are first of all interested in the columns which contain the measured intensities.

# I use a regular expression (regex) to select the intensity columns
intensity_colnames <- grep("^LFQ\\.intensity\\.", colnames(maxquant_protein_table), value=TRUE)
head(intensity_colnames)
#> [1] "LFQ.intensity.CG1407.01" "LFQ.intensity.CG1407.02" "LFQ.intensity.CG1407.03"
#> [4] "LFQ.intensity.CG4676.01" "LFQ.intensity.CG4676.02" "LFQ.intensity.CG4676.03"


# Create the intensity matrix
abundance_matrix <- as.matrix(maxquant_protein_table[, intensity_colnames])
# Adapt column and row names
colnames(abundance_matrix) <- sub("^LFQ\\.intensity\\.", "", intensity_colnames)
rownames(abundance_matrix) <- maxquant_protein_table$Protein.IDs
# Print some rows of the matrix with short names so they fit on the screen
abundance_matrix[46:48, 1:6]
#>                                       CG1407.01 CG1407.02 CG1407.03 CG4676.01 CG4676.02 CG4676.03
#> A0A0B4K6W1;P08970                        713400    845440         0         0   1032600         0
#> A0A0B4K6W2;A0A0B4K7S0;P55824-3;P55824   5018800   4429500   2667200         0   8780200   1395800
#> A0A0B4K6X7;A1Z8J0                             0         0         0         0         0         0

After extracting the bits from the table we most care about, we will have to modify it.

Firstly, MaxQuant codes missing values as 0. This is misleading, because the actual abundance probably was not zero, but just some value too small to be detected by the mass spectrometer. Accordingly, I will replace all 0 with NA.

Secondly, the raw intensity values have a linear mean-variance relation. This is undesirable, because a change of x units can be a large shift if the mean is small or irrelevant if the mean is large. Luckily, taking the logarithm of the intensities makes the variance approximately independent of the mean. After this transformation, a change of x units is as significant for highly abundant proteins as it is for low-abundance ones.

abundance_matrix[abundance_matrix == 0] <- NA
abundance_matrix <- log2(abundance_matrix)
abundance_matrix[46:48, 1:6]
#>                                       CG1407.01 CG1407.02 CG1407.03 CG4676.01 CG4676.02 CG4676.03
#> A0A0B4K6W1;P08970                      19.44435  19.68934        NA        NA  19.97785        NA
#> A0A0B4K6W2;A0A0B4K7S0;P55824-3;P55824  22.25891  22.07871  21.34689        NA  23.06582  20.41266
#> A0A0B4K6X7;A1Z8J0                            NA        NA        NA        NA        NA        NA
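To illustrate why the log transformation helps, here is a small simulation with multiplicative noise. The intensities are hypothetical and the noise level is an assumption made for this sketch; it is not derived from the example dataset.

```r
set.seed(42)
# Hypothetical raw intensities of a low and a highly abundant protein,
# both with the same relative (multiplicative) noise
low  <- 1e5 * 2^rnorm(500, sd = 0.15)
high <- 1e7 * 2^rnorm(500, sd = 0.15)

# Raw scale: the spread grows with the mean
c(low = sd(low), high = sd(high))

# log2 scale: the spread is about the same for both proteins
c(low = sd(log2(low)), high = sd(log2(high)))
```

On the log2 scale, a difference of one unit corresponds to a two-fold change, regardless of how abundant the protein is.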

Quality Control

Quality control (QC) is essential for a successful bioinformatics analysis, because any dataset shows some unwanted variation or could even contain more serious errors, such as a sample swap.

Often we start by normalizing the data to remove potential sample-specific effects. But already this step is challenging, because the missing values cannot easily be corrected for. Thus, a first helpful plot is to look at how many missing values are in each sample.
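Counting the missing values per sample needs only base R. The following is a minimal sketch on a small made-up matrix (the names `toy_matrix`, `n_missing` and `frac_missing` are chosen for this example); the same calls should work on any numeric matrix in which missing values are coded as NA.

```r
# Toy log2 abundance matrix with missing values (made-up numbers)
toy_matrix <- cbind(
  sample_A = c(19.4, 22.3, NA, 24.1),
  sample_B = c(19.7, 22.1, NA,   NA),
  sample_C = c(  NA, 21.3, NA, 23.8)
)

# Number and fraction of missing values in each sample
n_missing <- colSums(is.na(toy_matrix))
frac_missing <- colMeans(is.na(toy_matrix))
n_missing

# A bar plot makes samples with unusually many missing values easy to spot
barplot(n_missing, xlab = "Sample", ylab = "# missing values")
```

A sample with far more missing values than the others is a candidate for closer inspection before normalization.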
