Ukbnmr

Tools for processing Nightingale NMR biomarker data in UK Biobank

The ukbnmr R package

This package provides utilities for working with the UK Biobank NMR metabolomics data.

There are three groups of functions in this package:

  1. Data extraction
  2. Removal of technical variation
  3. Recomputing derived biomarkers and computing additional biomarker ratios after adjustment for biological covariates

All functions are designed to be applied directly to the UK Biobank phenotype data on the UK Biobank Research Analysis Platform after the NMR metabolomics fields have been extracted using the Table Exporter tool.

This package also works with datasets predating the Research Analysis Platform, which have been extracted using the ukbconv tool or processed with the ukbtools R package.

This package also provides a data.frame of biomarker information, loaded as nmr_info, and a data.frame of sample processing information, loaded as sample_qc_info. See help("nmr_info") and help("sample_qc_info") for details on column contents.

Installation

The most up-to-date version of this package can be installed from this GitHub repository using the remotes package:

remotes::install_github("sritchie73/ukbnmr", dependencies = TRUE, build_vignettes = TRUE)

And major releases can also be installed directly from CRAN:

install.packages("ukbnmr")

Citation

If using this package to remove additional technical variation or compute additional biomarker ratios, please cite:

Ritchie S. C. et al., Quality control and removal of technical variation of NMR metabolic biomarker data in ~120,000 UK Biobank participants, Sci Data 10 64 (2023). doi: 10.1038/s41597-023-01949-y.

Note that several updates have been made to the package and algorithm based on subsequent releases of NMR metabolomic biomarker data that have expanded to cover all ~500,000 UK Biobank participants with blood samples. These updates are described in more detail in the Algorithms for removing technical variation section below. The impact of technical variation and its removal in the full UK Biobank data are also shown in the Technical variation in the full UK Biobank NMR data section below.

Citation is appreciated, but not expected, if simply using the data extraction functions for convenience to extract the NMR biomarker data and associated information as-is into analysis-ready data.frames.

Data Extraction Functions

Three data extraction functions are supplied by this package for extracting the UK Biobank NMR data and associated processing information and quality control tags into an analysis-ready format from the CSV or TSV files of field data saved by the Table Exporter tool on the UK Biobank Research Analysis Platform.

Exported field data saved by the Table Exporter has column names following a naming scheme with the format "p<field_id>_i<visit_index>_a<repeat_index>".
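To make the naming scheme concrete, here is a minimal base R sketch that splits column names of this form into their components. The helper name parse_ukb_column() and the example field IDs are hypothetical and used only for illustration; the package's own extraction functions handle this for you and may do it differently. Names that do not match the scheme exactly (such as "eid") are returned with NA components.

```r
# Hypothetical helper (not part of ukbnmr): split column names of the form
# "p<field_id>_i<visit_index>_a<repeat_index>" into their components.
parse_ukb_column <- function(column_names) {
  pattern <- "^p([0-9]+)_i([0-9]+)_a([0-9]+)$"
  matched <- grepl(pattern, column_names)

  # Pull out one capture group as an integer, leaving NA for non-matching names
  extract <- function(group) {
    out <- rep(NA_integer_, length(column_names))
    out[matched] <- as.integer(sub(pattern, group, column_names[matched]))
    out
  }

  data.frame(
    column = column_names,
    field_id = extract("\\1"),
    visit_index = extract("\\2"),
    repeat_index = extract("\\3"),
    stringsAsFactors = FALSE
  )
}

# The field IDs below are made up for illustration
parse_ukb_column(c("eid", "p12345_i0_a0", "p12345_i1_a0"))
```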

The extract_biomarkers() function extracts from this raw field data a data.frame containing one column per NMR biomarker, each labelled with a short, descriptive (and analysis-friendly) column name. Each row of the extracted data.frame corresponds to a single observation for a UK Biobank participant at either baseline assessment (2006-2010) or the first repeat assessment (2012-2013): rows are uniquely identifiable by their combination of "eid" and "visit_index" columns. The "eid" column contains the project-specific identifier for each participant and the "visit_index" column contains either a 0 or 1 depending on whether the biomarker was quantified from blood samples taken at baseline assessment (visit_index == 0) or at the first repeat assessment (visit_index == 1). Mappings between biomarker column names and UK Biobank field identifiers, along with detailed descriptions of each biomarker, are provided in the nmr_info data.frame that is bundled with this package.
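As a rough illustration of this row structure, the sketch below builds a toy data.frame with the same "eid" and "visit_index" key columns and shows how to check uniqueness and select a time point. The biomarker column name and all values are invented; real output from extract_biomarkers() has one column per biomarker.

```r
# Toy stand-in for the output of extract_biomarkers(); the biomarker column
# name and all values are invented for illustration.
nmr <- data.frame(
  eid = c(1001, 1001, 1002),
  visit_index = c(0, 1, 0),
  example_biomarker = c(4.2, 4.5, 5.1)
)

# Each (eid, visit_index) pair should appear exactly once
stopifnot(!any(duplicated(nmr[, c("eid", "visit_index")])))

# Keep only measurements from the baseline assessment
baseline <- nmr[nmr$visit_index == 0, ]

# Participants with a measurement at both baseline and first repeat assessment
repeat_eids <- intersect(
  nmr$eid[nmr$visit_index == 0],
  nmr$eid[nmr$visit_index == 1]
)
```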

The extract_biomarker_qc_flags() function similarly returns a data.frame with one column for each biomarker, with observations containing the quality control flags for the measurement of the respective biomarker for the UK Biobank participant and timepoint indicated in the "eid" and "visit_index" columns. Observations with no quality control flags contain NA. In instances where there were multiple quality control flags, the individual flags are separated by "; ".
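Because multiple flags are stored in a single string, it can be useful to split them before counting or filtering. The following base R sketch does this for one toy column; the flag text is invented for illustration and does not correspond to actual UK Biobank flag values.

```r
# Toy stand-in for one biomarker column from extract_biomarker_qc_flags();
# the flag text is invented for illustration.
flags <- c(NA, "Flag A", "Flag A; Flag B")

# Split each observation into its individual flags
flag_list <- strsplit(flags, "; ", fixed = TRUE)

# Number of flags per observation (0 where the value is NA)
n_flags <- vapply(flag_list, function(x) sum(!is.na(x)), integer(1))

# Tabulate how often each individual flag occurs (NA entries are dropped)
flag_counts <- table(unlist(flag_list))
```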

The extract_sample_qc_flags() function similarly returns a data.frame with one column for each of the NMR sample processing flags and quality control flags for each sample for the respective UK Biobank participant ("eid") and timepoint ("visit_index"). Mappings between sample processing column names and UK Biobank field identifiers, along with detailed descriptions of each sample processing flag, are provided in the sample_qc_info data.frame that is bundled with this package.

An example workflow for extracting these data and saving them for later use:

library(ukbnmr)
library(data.table) # for fast reading and writing of csv files using fread() and fwrite()

# Load exported field data saved by the Table Exporter tool on the RAP
exported <- fread("path/to/exported_ukbiobank_phenotype_data.csv")

nmr <- extract_biomarkers(exported)
biomarker_qc_flags <- extract_biomarker_qc_flags(exported)
sample_qc_flags <- extract_sample_qc_flags(exported)

fwrite(nmr, file="path/to/nmr_biomarker_data.csv")
fwrite(biomarker_qc_flags, file="path/to/nmr_biomarker_qc_flags.csv")
fwrite(sample_qc_flags, file="path/to/nmr_sample_qc_flags.csv")

Remember to use the dx upload tool provided by the UK Biobank Research Analysis Platform to save these files to your persistent project storage for later use.

Removal of technical variation

The remove_technical_variation() function removes additional technical variation present in the UK Biobank NMR data (see section below for details), returning a list containing the corrected NMR biomarker data, biomarker QC flags, and sample processing information in analysis-ready data.frames.

Note that no prefiltering of samples or columns should be performed prior to running this function: the algorithms used for removing technical variation expect all the data to be present.

This function takes a little over an hour to run, and requires at least 32 GB of RAM, so you will want to save the output, rather than incorporate this function into your analysis scripts.

An example workflow for using this function and saving the output for loading into future R sessions or other programs:

library(ukbnmr)
library(data.table) # for fast reading and writing of csv files using fread() and fwrite()

# Load exported field data saved by the Table Exporter tool on the RAP
exported <- fread("path/to/exported_ukbiobank_phenotype_data.csv")

processed <- remove_technical_variation(exported) 

fwrite(processed$biomarkers, file="path/to/nmr_biomarker_data.csv")
fwrite(processed$biomarker_qc_flags, file="path/to/nmr_biomarker_qc_flags.csv")
fwrite(processed$sample_processing, file="path/to/nmr_sample_qc_flags.csv")
fwrite(processed$log_offset, file="path/to/nmr_biomarker_log_offset.csv")
fwrite(processed$outlier_plate_detection, file="path/to/outlier_plate_info.csv")

Remember to use the dx upload tool provided by the UK Biobank Research Analysis Platform to save these files to your persistent project storage for later use.
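Since the function is slow and memory-hungry, one way to avoid re-running it by accident is a small compute-once wrapper. The sketch below is only an illustration: load_or_compute() is a hypothetical helper, not part of ukbnmr, and it caches to a single RDS file for brevity because the output is a list. The CSV files written above remain the better choice if the data need to be read by other programs.

```r
# Hypothetical helper (not part of ukbnmr): run an expensive step once and
# reuse the saved result. `compute` is any zero-argument function and `path`
# is where the result is cached as an RDS file.
load_or_compute <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Intended usage (not run here; assumes `exported` has been loaded as above
# and that the cache path is one you have chosen):
# processed <- load_or_compute(
#   "path/to/nmr_processed.rds",
#   function() remove_technical_variation(exported)
# )
```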

Algorithms for removing technical variation

Three versions of the QC algorithm have been developed:

  1. Version 1 was designed based on the first phase of data released to the public covering ~120,000 UK Biobank participants.
  2. Version 2 made several improvements to the algorithm based on the subsequent second public release of data covering an additional ~150,000 participants.
  3. Version 3 (the default) makes some further minor tweaks primarily so that the algorithm is compatible with the full public data release covering all ~500,000 participants.

Algorithm version 1

Version 1 of the algorithm is as described in Ritchie et al. 2023, which was developed based on the technical variation observed in the NMR metabolomics data
