ukbREST

Title: ukbREST: efficient and streamlined data access for reproducible research of large biobanks

Authors: Milton Pividori and Hae Kyung Im

DOI: https://doi.org/10.1093/bioinformatics/bty925

Im-Lab (http://hakyimlab.org/), Section of Genetic Medicine, Department of Medicine, The University of Chicago.

Center for Translational Data Science (https://ctds.uchicago.edu/), The University of Chicago.

Join our mailing list here: https://groups.google.com/d/forum/ukbrest

Abstract

Large biobanks, such as UK Biobank with half a million participants, are changing the scale and availability of genotypic and phenotypic data for researchers to ask fundamental questions about the biology of health and disease. The breadth of the UK Biobank data is enabling discoveries at an unprecedented pace. However, this size and complexity pose new challenges to investigators who need to keep the accruing data up to date, comply with potential consent changes, and efficiently and reproducibly extract subsets of the data to answer specific scientific questions. Here we propose a tool called ukbREST designed for the UK Biobank study (easily extensible to other biobanks), which allows authorized users to efficiently retrieve phenotypic and genetic data. It exposes a REST API that makes data highly accessible inside a private and secure network, allowing the data specification in a human readable text format easily shareable with other researchers. These characteristics make ukbREST an important tool to make biobank’s valuable data more readily accessible to the research community and facilitate reproducibility of the analysis, a key aspect of science.

Architecture and setup overview

News

2019-12-06: the installation steps for macOS and PostgreSQL have been updated. Check it out!
2018-11-25: fix when a dataset has a data-field already loaded. Docker image is now updated. Check out the documentation (Section Duplicated data-fields).

Installation

You only need to install ukbREST on one server or computer; clients can connect to it and run queries using standard tools such as curl. The quickest way to get ukbREST is to use our Docker image, so install Docker and follow the steps below. Once Docker is installed, make sure it has enough disk space (in macOS, go to Preferences/Disk and increase the value). Take a look at the wiki for the general specifications expected of a computer/server.

If you just want to give ukbREST a try, and you are not a UK Biobank user, you can follow the guide in the wiki and use our simulated data.

Step 1: Pre-process

If you are an approved UK Biobank researcher, you are probably already familiar with this step. Once you have downloaded your encrypted application files, decrypt them and convert them to CSV and HTML formats using ukbconv. Check out the Data Showcase documentation.

Copy all CSV and HTML files to a single folder (for example, one called phenotype). You will have one CSV and one HTML file per dataset, each with a specific Basket ID. The listing below shows four datasets with Basket IDs 1111, 2222, 3333 and 4444:

$ ls -lh phenotype/*
-rw-rw-r-- 1   6.6G Jul  2 23:22 phenotype/ukb1111.csv
-rw-rw-r-- 1   6.4M Jul  2 23:19 phenotype/ukb1111.html
-rw-rw-r-- 1   2.7G Jul  2 23:20 phenotype/ukb2222.csv
-rw-rw-r-- 1   4.5M Jul  2 23:19 phenotype/ukb2222.html
-rw-rw-r-- 1  1012M Jul  2 23:22 phenotype/ukb3333.csv
-rw-rw-r-- 1   192K Jul  2 23:19 phenotype/ukb3333.html
-rw-rw-r-- 1    22G Jul  2 23:24 phenotype/ukb4444.csv
-rw-rw-r-- 1   4.1M Jul  2 23:19 phenotype/ukb4444.html

Make sure your phenotype CSV files do not have overlapping data-fields (use the latest data refresh for each basket).
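The overlap check above can be automated before loading. The following is a minimal sketch, not part of ukbREST: it assumes the CSV headers produced by ukbconv consist of an eid column followed by columns named like 31-0.0 (data-field, instance, array), and the function names are hypothetical. It reads only the first row of each file, so it stays fast on multi-gigabyte datasets.

```python
import csv
from collections import defaultdict
from pathlib import Path


def data_fields_in_header(csv_path):
    """Return the set of data-field IDs found in the header row of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as handle:
        header = next(csv.reader(handle), [])
    # Assumed header layout: "eid" followed by columns such as "31-0.0"
    # (data-field 31, instance 0, array 0).
    return {column.split("-")[0] for column in header if column != "eid"}


def find_overlapping_fields(phenotype_dir):
    """Map each data-field ID present in more than one CSV to the files containing it."""
    files_by_field = defaultdict(list)
    for csv_path in sorted(Path(phenotype_dir).glob("*.csv")):
        for field in data_fields_in_header(csv_path):
            files_by_field[field].append(csv_path.name)
    return {field: files for field, files in files_by_field.items() if len(files) > 1}
```

Calling find_overlapping_fields("phenotype") returns an empty dictionary when no data-field is shared between baskets; any entry it does return names the files to reconcile before loading.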

For the genotype data you'll also have a specific folder, for instance, called genotype. Here you have to copy your bgen, bgi (BGEN index files) and sample (BGEN sample) files:

$ ls -lh genotype/*
-rw-rw-r-- 1  114G Mar 16 09:51 genotype/ukb_imp_chr10_v3.bgen
-rw-rw-r-- 1  198M Mar 16 10:12 genotype/ukb_imp_chr10_v3.bgen.bgi
-rw-rw-r-- 1  109G Mar 16 09:52 genotype/ukb_imp_chr11_v3.bgen
-rw-rw-r-- 1  201M Mar 16 10:12 genotype/ukb_imp_chr11_v3.bgen.bgi
-rw-rw-r-- 1  109G Mar 16 09:54 genotype/ukb_imp_chr12_v3.bgen
[...]
-rw-rw-r-- 1  9.3M Apr  6 09:41 genotype/ukb12345_imp_chr1_v3_s487395.sample
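Before moving on, it can help to confirm that the genotype folder is complete. The sketch below is an illustration rather than a ukbREST feature: it assumes, following the listing above, that each index file is named after its BGEN file with a .bgi suffix appended, and the function name is hypothetical.

```python
from pathlib import Path


def check_genotype_folder(genotype_dir):
    """Summarize the BGEN, BGEN index and sample files found in a folder."""
    folder = Path(genotype_dir)
    bgen_files = sorted(folder.glob("*.bgen"))
    # Assumed naming convention: "<name>.bgen" is indexed by "<name>.bgen.bgi".
    missing_index = [
        path.name for path in bgen_files
        if not path.with_name(path.name + ".bgi").exists()
    ]
    sample_files = sorted(path.name for path in folder.glob("*.sample"))
    return {
        "bgen": [path.name for path in bgen_files],
        "missing_index": missing_index,
        "sample": sample_files,
    }
```

A non-empty missing_index list, or a sample list without exactly the file you intend to use, is worth resolving before the loading step.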

Step 2: Setup

Here we are going to start PostgreSQL and load the phenotype data into it. Start Docker in your server/computer and pull the PostgreSQL and ukbREST images:

$ docker pull postgres:11

$ docker pull hakyimlab/ukbrest

Create a network in Docker that we'll use to connect ukbREST with PostgreSQL:

$ docker network create ukb

Start the PostgreSQL container (here we are using user test with password test; you should choose a stronger one):

$ docker run -d --name pg --net ukb -p 127.0.0.1:5432:5432 \
  -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test \
  -e POSTGRES_DB=ukb \
  postgres:11

Keep in mind that the above command runs PostgreSQL with the default settings, which can make queries sent to ukbREST very slow. See the installation instructions in the wiki for more details.
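If you replace the test/test credentials used above with a stronger password, any special characters in it must be percent-encoded when they appear in the database URI. The sketch below shows one way to generate a password and build a URI in the same postgresql://user:password@host:port/database form used in this guide. The helper names are hypothetical, and it assumes that the URI parser used by ukbREST follows the usual percent-decoding convention.

```python
import secrets
from urllib.parse import quote


def generate_password(n_bytes=24):
    """Return a random URL-safe password built from n_bytes of randomness."""
    return secrets.token_urlsafe(n_bytes)


def build_db_uri(user, password, host="pg", port=5432, database="ukb"):
    """Build a PostgreSQL URI, percent-encoding the user and password."""
    # Characters such as "@", ":" and "/" would otherwise be read as URI delimiters.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{database}"
```

The defaults (host pg, port 5432, database ukb) mirror the container name and settings used in this guide; the resulting string is what you would pass as UKBREST_DB_URI.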

Then use the ukbREST Docker image to load your phenotype data into the PostgreSQL database. Here we are only loading your CSV/HTML main datasets, but keep in mind that you can also load Sample-QC or relatedness data, which is provided separately in UK Biobank. This is covered in the wiki.

In the command below, replace the placeholder paths with the full paths of your genotype and phenotype folders, and replace the .sample file name with the name of your own file.

$ docker run --rm --net ukb \
  -v /full/path/to/genotype/folder/:/var/lib/genotype \
  -v /full/path/to/phenotype/folder/:/var/lib/phenotype \
  -e UKBREST_GENOTYPE_BGEN_SAMPLE_FILE="ukb12345_imp_chr1_v3_s487395.sample" \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  -e UKBREST_LOADING_N_JOBS=2 \
  hakyimlab/ukbrest --load
[...]
2018-07-20 22:50:34,962 - ukbrest - INFO - Loading finished!

We have found that CSV files sometimes have the wrong encoding, which makes Python fail when reading them. If ukbREST runs into this, you will see an error message about a Unicode decoding error. Check out the documentation to learn how to fix it.
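To see where such an encoding problem sits before attempting a fix, you can scan the file for bytes that do not decode. This is a diagnostic sketch only, not the fix described in the documentation; it assumes the file is expected to be UTF-8, and the function name is hypothetical.

```python
def find_undecodable_lines(csv_path, encoding="utf-8", max_hits=10):
    """Return (line number, byte offset in line, offending bytes) for bad lines."""
    hits = []
    # Read raw bytes so that decoding errors can be caught line by line.
    with open(csv_path, "rb") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode(encoding)
            except UnicodeDecodeError as error:
                bad_bytes = raw_line[error.start:error.end]
                hits.append((line_number, error.start, bad_bytes))
                if len(hits) >= max_hits:
                    break  # A few examples are enough to diagnose the issue.
    return hits
```

An empty result means every line decodes with the given encoding. Each reported line has at least one offending byte sequence, of which only the first is listed, and those bytes usually hint at the file's actual encoding.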

You can also adjust the number of cores used when loading the data with the variable UKBREST_LOADING_N_JOBS (set to 2 cores in the example above).

The documentation also explains the SQL schema, so you can take full advantage of it.

Once your main datasets are loaded, you only need to complete two more steps: 1) load the data-field codings and 2) create some useful SQL functions. Each step is a single command.

To load the data-field codings, run this:

$ docker run --rm --net ukb \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  hakyimlab/ukbrest --load-codings

This will load most of the data-field codings from the UK Biobank Data Showcase (they are in .tsv format in the codings folder). This includes, for instance, data coding 19, which is used for data-field 41202 (Diagnoses - main ICD10). For your application, however, you may need to download a few more if you have specific data-fields. This is covered in the documentation.

Finally, run this command to create some useful SQL functions you will likely use in your queries:

$ docker run --rm --net ukb \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  hakyimlab/ukbrest --load-sql

Step 3: Start

Now you only need to start the ukbREST server:

$ docker run --rm --net ukb -p 127.0.0.1:5000:5000 \
  -e UKBREST_SQL_CHUNKSIZE="10000" \
  -e UKBREST_DB_URI="postgresql://test:test@pg:5432/ukb" \
  hakyimlab/ukbrest

Note that, for security reasons, these commands publish the ukbREST and PostgreSQL ports only on 127.0.0.1, so both servers are reachable only from the computer/server where they are running. No one else on the network will be able to make queries.

Check out the documentation to set up ukbREST in a private and secure network and to add user authentication and SSL encryption.

Step 4: Query

Once ukbREST is up and running, you can request any data-field using different query methods. Column names for data-fields have this format: c{DATA_FIELD_ID}_{INSTANCE}_{ARRAY}.
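The column naming format can be produced and parsed programmatically, which helps when a query involves many data-fields. The helpers below are a small sketch with hypothetical names; they only implement the c{DATA_FIELD_ID}_{INSTANCE}_{ARRAY} pattern described above and do not talk to ukbREST.

```python
import re

# Matches names such as "c41202_0_0": data-field 41202, instance 0, array 0.
COLUMN_PATTERN = re.compile(r"^c(\d+)_(\d+)_(\d+)$")


def column_name(data_field_id, instance=0, array=0):
    """Build a column name from a data-field ID, instance and array index."""
    return f"c{data_field_id}_{instance}_{array}"


def parse_column_name(name):
    """Split a column name into (data_field_id, instance, array) integers."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a data-field column name: {name!r}")
    return tuple(int(group) for group in match.groups())
```

For example, column_name(41202) gives the first array element of the first instance of data-field 41202 (Diagnoses - main ICD10), and parse_column_name reverses the mapping.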

Phenotype queries

ukbREST lets you make queries in differ
