iq: an R package for protein quantification

This R package provides an implementation of the MaxLFQ algorithm by Cox et al. (2014) in a comprehensive pipeline for DIA-MS (Pham et al. 2020). It also offers options for protein quantification using the N most intense fragment ions, using all fragment ions, and the Tukey's median polish algorithm. In general, the tool can be used to integrate multiple proportional observations into a single quantitative value.

Citation

Pham TV, Tran CTM, Henneman AA, Pham LHC, Le DG, Can AH, Bui PHL, Piersma SR, Jimenez CR. Boosting the speed and accuracy of protein quantification algorithms in mass spectrometry-based proteomics. Journal of Proteome Research 2026 Feb 6; 25(2):1198–1203. https://doi.org/10.1021/acs.jproteome.5c01038

For version 1.x.x, please cite

Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics, Bioinformatics 2020 Apr 15;36(8):2611-2613. https://doi.org/10.1093/bioinformatics/btz961

Installation

The package is hosted on CRAN. It is best to install from within R.

install.packages("iq")

Usage

See a recent example for processing a Spectronaut output.

Or an older vignette for processing output from Spectronaut, OpenSWATH and MaxQuant with some visualization.

A blog post on converting a .parquet file to a .tsv file.

The package can be loaded in the usual manner

library("iq")

To process a DIA-NN output

For version of DIA-NN prior to 2.0, the following is an iq function call to filter on the Q.Value, PG.Q.Value, Lib.Q.Value, and Lib.PG.Q.Value for a match-between run (MBR) DIA-NN search as discussed here.

process_long_format("report.tsv", 
                    sample_id = "Run",
                    intensity_col = "Fragment.Quant.Raw",
                    output_filename = "report-pg-global.txt", 
                    annotation_col = c("Protein.Names", "Genes"),
                    filter_double_less = c("Q.Value" = "0.01", "PG.Q.Value" = "0.05", 
                                           "Lib.Q.Value" = "0.01", "Lib.PG.Q.Value" = "0.01"))

DIA-NN version 2.0 uses the parquet data format for output. We can use the R package arrow to read the data. However, the fragment intensities are not reported by the default setting. To perform quantification using MS/MS fragments, one must switch on the --export-quant option. Then we can use R to create a .tsv as an intermediate step as follows

require("arrow")
# if the package "arrow" is not available, you can install it by 
# install.packages("arrow") 

raw <- arrow::read_parquet("report.parquet")

# create a new column called "Intensities"
raw$Intensities = paste(raw$Fr.0.Quantity, raw$Fr.1.Quantity, raw$Fr.2.Quantity, 
                        raw$Fr.3.Quantity, raw$Fr.4.Quantity, raw$Fr.5.Quantity, 
                        raw$Fr.6.Quantity, raw$Fr.7.Quantity, raw$Fr.8.Quantity, 
                        raw$Fr.9.Quantity, raw$Fr.10.Quantity, raw$Fr.11.Quantity, sep = ";")

write.table(raw, "report.tsv", sep = "\t", row.names = FALSE, quote = FALSE)

# using the new column "Intensities"
iq::process_long_format("report.tsv", 
                        output_filename = "report-protein-group.txt", 
                        sample_id = "Run",
                        intensity_col = "Intensities",
                        annotation_col = c("Protein.Ids","Protein.Names", "Genes"),
                        filter_double_less = c("Q.Value" = "0.01", "PG.Q.Value" = "0.05", 
                                               "Lib.Q.Value" = "0.01", 
                                               "Lib.PG.Q.Value" = "0.01"))

Alternatively, we can use the aggregated intensities in Precursor.Normalisedas discussed here.

iq::process_long_format(arrow::read_parquet("report.parquet"), 
                        output_filename = "report-protein-group.txt", 
                        sample_id = "Run",
                        intensity_col = "Precursor.Normalised",
                        intensity_col_sep = NULL,
                        annotation_col = c("Protein.Ids","Protein.Names", "Genes"),
                        filter_double_less = c("Q.Value" = "0.01", "PG.Q.Value" = "0.05", 
                                               "Lib.Q.Value" = "0.01", 
                                               "Lib.PG.Q.Value" = "0.01"))

Similarly, for a DIA-NN search without MBR

iq::process_long_format(arrow::read_parquet("report.parquet"), 
                        output_filename = "report-protein-group.txt", 
                        sample_id = "Run",
                        intensity_col = "Precursor.Normalised",
                        intensity_col_sep = NULL,
                        annotation_col = c("Protein.Ids","Protein.Names", "Genes"),
                        filter_double_less = c("Q.Value" = "0.01", "PG.Q.Value" = "0.05", 
                                               "Global.Q.Value" = "0.01", 
                                               "Global.PG.Q.Value" = "0.01"))

Finally, use the parameter peptide_extractor if you want to get the number of peptides per protein, for example with MBR and the --export-quant option.

iq::process_long_format("report.tsv", 
                        output_filename = "report-protein-group.txt", 
                        sample_id = "Run",
                        intensity_col = "Intensities",
                        annotation_col = c("Protein.Ids","Protein.Names", "Genes"),
                        filter_double_less = c("Q.Value" = "0.01", "PG.Q.Value" = "0.05", 
                                               "Lib.Q.Value" = "0.01", 
                                               "Lib.PG.Q.Value" = "0.01"),
                        peptide_extractor = function(x) gsub("[0-9].*$", "", x))

To process a Spectronaut output

Use this export schema iq.rs to make a long report, for example "Spectronaut_Report.xls".

process_long_format("Spectronaut_Report.xls",
                    output_filename = "iq-MaxLFQ.tsv", 
                    sample_id  = "R.FileName",
                    primary_id = "PG.ProteinGroups",
                    secondary_id = c("EG.Library", "FG.Id", "FG.Charge", "F.FrgIon", 
                                     "F.Charge", "F.FrgLossType"),
                    intensity_col = "F.PeakArea",
                    annotation_col = c("PG.Genes", "PG.ProteinNames", "PG.FastaFiles"),
                    filter_string_equal = c("F.ExcludedFromQuantification" = "False"),
                    filter_double_less = c("PG.Qvalue" = "0.01", "EG.Qvalue" = "0.01"),
                    log2_intensity_cutoff = 0)

New in version 2

A simple example

Quantify a data matrix for a single protein

X <- matrix(c(10, 12, 12, NA, NA, 10), nrow = 2, 
            dimnames = list(c("ion1", "ion2"), 
                            c("sample1", "sample2", "sample3")))

#      sample1 sample2 sample3
# ion1      10      12      NA
# ion2      12      NA      10

iq::process_matrix(X, method = "maxlfq_bit")

# [1] 11 13  9

The result by maxlfq-bit shows that sample2 is 4-fold higher than sample1 (difference of 2 in log2 space). This two samples share ion1 only whose quantitative difference is reflected in the result. Similarly, sample 1 is 4-fold higher than sample3.

The new iq data format

A new data format is introduced to support large-scale data processing. This example convert the usual tab-deliminated format file Bruderer15-DIA-longformat-compact.txt to the new iq format

primary_id <- "PG.ProteinGroups"
sample_id  <- "R.Condition" 
secondary_id <- c("EG.ModifiedSequence", "FG.Charge", "F.FrgIon", "F.Charge")
annotation_col <- c("PG.Genes", "PG.ProteinNames")

iq::long_format_to_iq_format("Bruderer15-DIA-longformat-compact.txt", "bruderer15.iq", 
                        primary_id = primary_id,
                        secondary_id = secondary_id,
                        sample_id = sample_id,
                        annotation_col = annotation_col,
                        intensity_col = "F.PeakArea",
                        intensity_col_sep = NULL,
                        normalization = "median")

The new data is contained in a folder bruderer15.iq. Then we can process the data using different method

iq::process_iq_format("bruderer15.iq", output_filename = "result-maxlfq_bit.tsv",
                      method = "maxlfq_bit")

iq::process_iq_format("bruderer15.iq", output_filename = "result-maxlfq.tsv",
                      method = "maxlfq")

a <- read.delim("result-maxlfq.tsv")
b <- read.delim("result-maxlfq_bit.tsv")

max(a[,4:27] - b[,4:27], na.rm = TRUE)
# [1] 1.012523e-13

min(a[,4:27] - b[,4:27], na.rm = TRUE)
# [1] -1.012523e-13

The new method maxlfq-bit and the current implementation maxlfq should give the same result within the tolerance of floating-point arithmetic.

Cluster processing

The iq data format makes it very easy to process subsets of proteins on different processing nodes. Here is an example of using the Dutch national computing cluster for processing. You will need to adapt the shell script to your cluster processing and storage architec

Iq

Install / Use

README

iq: an R package for protein quantification

Installation

Usage

New in version 2