PHESANT - PHEnome Scan ANalysis Tool
Run a phenome scan (pheWAS, Mendelian randomisation (MR)-pheWAS etc.) in UK Biobank.
There are three components in this project:
1) Running a phenome scan in UK Biobank
2) Post-processing of results
3) PHESANT-viz: Visualising the results
General requirements
R for parts 1 and 2 above. Tested with R-3.3.1-ATLAS. Phenome scan requires the R packages: optparse (V1.3.2), MASS (V7.3-45), lmtest (V0.9-34), nnet (V7.3-12), forestplot (V1.7) and data.table (V1.10.4).
Java for part 3 above. Tested with jdk-1.8.0-66.
Citing this project
Please cite:
Millard LAC, et al. Software Application Profile: PHESANT: a tool for performing automated phenome scans in UK Biobank. International Journal of Epidemiology (2017).
1) Running a phenome scan
A phenome scan is run using WAS/phenomeScan.r. This is ready to go. One amendment you may wish to make before running PHESANT is the TRAIT_OF_INTEREST column in the variable information file (see below).
The PHESANT phenome scan processing pipeline is illustrated in the figure here, and described in detail in the paper above.
The phenome scan is run with the following command:
cd WAS/
Rscript phenomeScan.r \
--phenofile=<phenotypesFilePath> \
--traitofinterestfile=<traitOfInterestFilePath> \
--variablelistfile="../variable-info/outcome-info.tsv" \
--datacodingfile="../variable-info/data-coding-ordinal-info.csv" \
--traitofinterest=<traitOfInterestName> \
--resDir=<resultsDirectoryPath> \
--userId=<userIdFieldName>
The following example runs part 1 of 20 of a sensitivity analysis phenome scan (adjusting for age, sex and assessment centre, see below), using a non-genetic trait of interest:
cd WAS/
Rscript phenomeScan.r \
--phenofile=<phenotypesFilePath> \
--traitofinterestfile=<traitOfInterestFilePath> \
--variablelistfile="../variable-info/outcome-info.tsv" \
--datacodingfile="../variable-info/data-coding-ordinal-info.csv" \
--traitofinterest=<traitOfInterestName> \
--resDir=<resultsDirectoryPath> \
--userId=<userIdFieldName> \
--sensitivity \
--genetic=FALSE \
--partIdx=1 \
--numParts=20
Required arguments
Arg | Description
-------|--------
phenofile | Comma separated file containing phenotypes. Each row is a participant, the first column contains the participant id and the remaining columns are phenotypes. Where there are multiple columns for a phenotype these must be adjacent in the file. Specifically, for a given field in Biobank the instances should be adjacent and within each instance the arrays should be adjacent. Each variable name needs to be changed to the format 'x[varid]_[instance]_[array]' (we use the prefix 'x' so that the variable names are valid in R).
variablelistfile | Tab separated file containing information about each phenotype, that is used to process them (see below).
datacodingfile | Comma separated file containing information about data codings (see below).
traitofinterest | Variable name as in traitofinterestfile.
resDir | Directory where you want the results to be stored.
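The variable naming requirement for phenofile can be sketched as a small header-renaming helper. This is an illustration only and not part of PHESANT: the function name to_phesant_name is hypothetical, and it assumes the source headers follow a `[varid]-[instance].[array]` pattern (for example `31-0.0`); adjust the pattern if your export names columns differently.

```python
import re

# Assumed input pattern: "<varid>-<instance>.<array>", e.g. "31-0.0".
_UKB_HEADER = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def to_phesant_name(header: str) -> str:
    """Convert a header such as '31-0.0' to 'x31_0_0'.

    Headers that do not match the assumed pattern (for example the
    participant id column) are returned unchanged.
    """
    match = _UKB_HEADER.match(header)
    if match is None:
        return header
    varid, instance, array = match.groups()
    return f"x{varid}_{instance}_{array}"


if __name__ == "__main__":
    for name in ["userId", "31-0.0", "22009-0.10"]:
        print(name, "->", to_phesant_name(name))
```

Applying the helper to the header row only leaves the data rows untouched, so the adjacency of instances and arrays in the file is preserved.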
Optional arguments
Arg | Description
-------|--------
traitofinterestfile | Comma separated file containing the trait of interest (e.g. a SNP, genetic risk score or observed phenotype). Each row is a participant and there should be two columns: the user ID and the trait of interest. Where this argument is not supplied, the trait of interest should be a column in the phenofile.
confounderfile | Comma separated file containing the confounders, so that you can choose what confounders to use in the phenome scan.
userId | User id column as in the traitofinterestfile and the phenofile (default: userId).
partIdx | Subset of phenotypes you want to run (for parallelising).
numParts | Number of subsets you are using (for parallelising).
sensitivity | By default analyses are adjusted for age (field 21022), sex (field 31) and, if the genetic argument is set to TRUE, genotype chip (a binary variable derived from field 22000). If the sensitivity argument is used (by including --sensitivity when running PHESANT), analyses additionally adjust for assessment centre (field 54) and, if the genetic argument is set to TRUE, the first 10 genetic principal components (fields 22009_0_1 to 22009_0_10). If you wish to choose your own confounders to use in the phenome scan you can use the confounderfile option (described above).
genetic | By default genetic=TRUE, and we assume the trait of interest is a genetic variable (e.g. a SNP or genetic risk score). If this is not the case (e.g. you are running an environment-wide association study) then set this flag to FALSE. This option determines which variables are controlled for in analyses, see the sensitivity argument above.
save | Instead of running the phenome scan, the generated phenotypes are stored to file, in resDir. If this option is used then the traitofinterest argument is not required.
confidenceintervals | By default confidenceintervals=TRUE, but specifying confidenceintervals=FALSE means that PHESANT doesn't calculate the association confidence intervals (which may speed up PHESANT).
standardise | By default standardise=TRUE, but specifying standardise=FALSE means that PHESANT will not standardise the exposure variable. E.g. use this option for binary exposure variables.
tab | By default the phenotype file (phenofile) is comma separated, but tab=TRUE can be specified when your file is tab delimited (e.g. using the r option for UK Biobank's ukbconv utility).
mincase | Minimum size of phenotype categories (default is 10).
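The adjustment sets described for the sensitivity and genetic arguments can be summarised as a small function. This is an illustrative sketch of the rules stated in the table, not PHESANT's own code: the function name confounder_fields is hypothetical, and the returned strings are simply the field identifiers quoted above.

```python
from typing import List


def confounder_fields(genetic: bool = True, sensitivity: bool = False) -> List[str]:
    """Return the confounder fields implied by the two flags."""
    fields = ["21022", "31"]  # age, sex: always adjusted for
    if genetic:
        # Genotype chip (a binary variable derived from field 22000).
        fields.append("22000")
    if sensitivity:
        fields.append("54")  # assessment centre
        if genetic:
            # First 10 genetic principal components.
            fields.extend(f"22009_0_{i}" for i in range(1, 11))
    return fields


if __name__ == "__main__":
    print(confounder_fields())
    print(confounder_fields(genetic=False, sensitivity=True))
```

Supplying a confounderfile replaces this built-in choice with your own set of confounders.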
The numParts and partIdx arguments are both used to parallelise the phenome scan. E.g. setting numParts to 5 will divide the set of phenotypes into 5 (rough) parts and then partIdx can be used to call the phenome scan on a specific part (1-5).
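One way to use these two arguments together is to generate one command per part and run each as a separate job. The sketch below only builds the argument lists; the helper name part_commands is hypothetical, and the placeholder values in angle brackets are the ones used in the command above and must be replaced with real paths and names.

```python
from typing import List


def part_commands(num_parts: int, base_args: List[str]) -> List[List[str]]:
    """Build one phenomeScan.r command per part (partIdx runs from 1 to num_parts)."""
    return [
        ["Rscript", "phenomeScan.r", *base_args,
         f"--partIdx={idx}", f"--numParts={num_parts}"]
        for idx in range(1, num_parts + 1)
    ]


if __name__ == "__main__":
    base = [
        "--phenofile=<phenotypesFilePath>",
        "--traitofinterestfile=<traitOfInterestFilePath>",
        "--variablelistfile=../variable-info/outcome-info.tsv",
        "--datacodingfile=../variable-info/data-coding-ordinal-info.csv",
        "--traitofinterest=<traitOfInterestName>",
        "--resDir=<resultsDirectoryPath>",
        "--userId=<userIdFieldName>",
    ]
    for cmd in part_commands(5, base):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, cwd="WAS")` or written into a cluster job script, so that the parts run independently.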
Data coding file
Data codes define a set of values that can be assigned to a given field. A data code can be assigned to more than one variable, which is why we use a separate file describing the necessary information for each data code. For example, there are several fields about diet that have data code 100009.
The data coding file should have the following columns:
- dataCode - The ID of the data code.
- ordinal - Whether the field is ordinal (value 1) or not (value 0). This field is only used for fields of the categorical (single) field type. Value -1 denotes this is not needed because the field is binary.
- ordering - Any needed corrections for the numeric ordering of a data code specified by Biobank. This field is only used for data codes specified as ordinal in the ordinal column. For example, data code 100001 has values half, 1 and 2+ coded as 555, 1 and 200, respectively. We need the 'half' value to be less than the '1' value, so we change the order to '555|1|200'. NB: if this column is used and any value is not included, then that value is set to NA (i.e. this field can be used to remove and reorder values at the same time).
- reassignments - Any value changes that are needed. For example, in data code 100662, the values 7 and 6 may be deemed equal (both representing 'never visited by friends/family'), so we can set '7=6' to assign the value 6 to all participants with the value 7.
- default_value - A default value assigned to all participants with no value for the field, but with a value for the field stated in the default_value_related_field column below. This is used where a category is not explicitly stated in the field but instead needs to be determined by looking at whether another field has a value. Typically, this occurs where there is no category for 'none' in a questionnaire field, because participants were told they did not have to mark 'none' but could instead leave it blank (see for example section 5.3 in the 24 hour diet questionnaire manual). Hence, we assume that if they completed the questionnaire and have not ticked a value, then the value is 'none'. See default value example below.
- default_value_related_field - The field used to determine which participants are assigned the default value. All participants with a value in the field stated here, and with no value for a field with this data code, are assigned the default value stated in default_value.
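The ordering and reassignments columns described in the list above can be illustrated with a minimal sketch. It is not PHESANT's implementation: all function names are hypothetical, representing the corrected order as a 1-based rank is a choice made for this illustration, and the use of '|' to separate several reassignments is an assumption. None stands in for NA.

```python
from typing import Dict, List, Optional


def apply_ordering(values: List[Optional[int]], spec: str) -> List[Optional[int]]:
    """Map each value to its 1-based position in an ordering such as '555|1|200'.

    Values that are missing or not listed in the ordering become None (NA).
    """
    rank = {int(code): pos for pos, code in enumerate(spec.split("|"), start=1)}
    return [rank.get(v) for v in values]


def parse_reassignments(spec: str) -> Dict[int, int]:
    """Parse 'old=new' pairs, e.g. '7=6' (assumed '|'-separated if several)."""
    pairs = (item.split("=") for item in spec.split("|") if item)
    return {int(old): int(new) for old, new in pairs}


def apply_reassignments(values: List[Optional[int]], spec: str) -> List[Optional[int]]:
    """Replace each value that has a reassignment; leave all others unchanged."""
    mapping = parse_reassignments(spec)
    return [mapping.get(v, v) for v in values]


if __name__ == "__main__":
    # Data code 100001: 'half' (555) must sort below 1, and 2+ (200) above it.
    print(apply_ordering([1, 555, 200, 300], "555|1|200"))
    # Data code 100662: treat value 7 as value 6.
    print(apply_reassignments([7, 6, 1], "7=6"))
```

In the first call the value 300 is absent from the ordering and so becomes None, which mirrors the note that unlisted values are set to NA.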
Example of default value
In the data code information file we specify default_value=0 and default_value_related_field=20080 for data code 100006.
Field 100200, for example, has data code 100006.
Therefore all participants with a value for field 20080, but with no value in field 100200, are assigned value 0 for field 100200.
Intuitively, all participants who have answered the 24-hour recall diet questionnaire have a value in field 20080, and of these, we assume that those with no value for field 100200 have opted for 'none' implicitly, by not ticking any option.
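The rule in this example can be written out as a short sketch. The function name apply_default_value and the participant ids are hypothetical, and None stands in for a missing value; this illustrates the logic described above rather than PHESANT's own code.

```python
from typing import Dict, Optional


def apply_default_value(
    field_values: Dict[str, Optional[int]],
    related_values: Dict[str, Optional[int]],
    default_value: int,
) -> Dict[str, Optional[int]]:
    """Assign default_value where the field is missing but the related field is not."""
    result: Dict[str, Optional[int]] = {}
    for participant, value in field_values.items():
        has_related = related_values.get(participant) is not None
        if value is None and has_related:
            result[participant] = default_value
        else:
            result[participant] = value
    return result


if __name__ == "__main__":
    # Hypothetical participants: p2 completed the questionnaire (field 20080)
    # but left field 100200 blank; p3 did not complete the questionnaire.
    field_100200 = {"p1": 3, "p2": None, "p3": None}
    field_20080 = {"p1": 1, "p2": 1, "p3": None}
    print(apply_default_value(field_100200, field_20080, default_value=0))
```

Here p2 receives the default value 0, while p3 stays missing because there is no evidence that the questionnaire was completed.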
Variable information file
This file was initially the UK Biobank data dictionary, which can be downloaded from the UK Biobank website here. This data dic
