dbGaPCheckup

Easy checks for data integrity and proper formatting of the dbGaP subject phenotype data set and data dictionary.

<!-- README.md is generated from README.Rmd. Please edit that file -->

1 Overview

<!-- badges: start --> <!-- badges: end -->

I want to make the data sing, but it is torturing me instead. Will the real data please stand up?

The goal of dbGaPCheckup is to make your National Library of Medicine database of Genotypes and Phenotypes (dbGaP) data set submission a tiny bit easier. Specifically, our package implements several check, awareness, utility, and reporting functions designed to help you ensure that your Subject Phenotype data set and data dictionary meet a variety of dbGaP-specific formatting requirements. The available functions are listed below.

The software announcement for our package has been published in BMC Bioinformatics and is available at https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05200-8.

Heinsberg, L.W., Weeks, D.E. dbGaPCheckup: pre-submission checks of dbGaP-formatted subject phenotype files. BMC Bioinformatics 24, 77 (2023). https://doi.org/10.1186/s12859-023-05200-8

| Function_Name | Function_Type | Function_Description |
|:------------------|:--------------|:---------------------|
| field_check | check | Checks for dbGaP required fields: variable name (VARNAME), variable description (VARDESC), units (UNITS), and variable value and meaning (VALUES). |
| pkg_field_check | check | Checks for package-level required fields: variable type (TYPE), minimum value (MIN), and maximum value (MAX). |
| dimension_check | check | Checks that the number of variables matches between the data set and data dictionary. |
| name_check | check | Checks that variable names match between the data set and data dictionary. |
| id_check | check | Checks that the first column of the data set is the primary ID for each participant labeled as SUBJECT_ID, that values contain no illegal characters or padded zeros, and that each participant has an ID. |
| duplicate_id_check | check | Checks whether duplicated SUBJECT_ID values are present in the data set (expected and allowable in longitudinal data, but possibly an error in cross-sectional submissions). |
| row_check | check | Checks for empty or duplicate rows in the data set and data dictionary. |
| NA_check | check | Checks for NA values in the data set and, if NA values are present, also checks for an encoded NA value=meaning description. |
| type_check | check | If a TYPE field exists, this function checks for any TYPE entries that aren’t allowable per dbGaP instructions. |
| values_check | check | Checks for potential errors in the VALUES columns by ensuring that (1) entries use the required VALUE=MEANING format (e.g., 0=Yes or 1=No) with only one equals sign per cell; (2) there are no leading/trailing spaces near the equals sign; (3) all variables of TYPE encoded have VALUES entries; (4) all variables with VALUES entries are listed as TYPE encoded; and (5) there are no duplicated MEANINGs (e.g., 1=Yes; 2=Yes) within the same variable. |
| integer_check | check | Checks for variables that appear to be incorrectly listed as TYPE integer. |
| decimal_check | check | Checks for variables that appear to be incorrectly listed as TYPE decimal. |
| misc_format_check | check | Checks miscellaneous dbGaP formatting requirements to ensure that (1) there are no empty variable names; (2) there are no duplicate variable names; (3) variable names do not contain “dbgap”; (4) there are no duplicate column names in the dictionary; and (5) column names falling after the VALUES column are unnamed. |
| description_check | check | Checks for unique and non-missing descriptions (VARDESC) for every variable in the data dictionary. |
| minmax_check | check | Checks for variables that have values exceeding the listed MIN or MAX. |
| ascii_check | check | Scans for non-ASCII characters (e.g., with accents) and newline and carriage return characters (e.g., line breaks) |
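
To make a few of the rules in the table concrete, here is a minimal base R sketch of the kind of logic behind name_check, duplicate_id_check, and rules (1) and (2) of values_check. It is not the package's implementation and does not call dbGaPCheckup at all: the helper names (`sketch_name_mismatch`, `sketch_duplicate_ids`, `sketch_bad_values`) and the toy data are hypothetical, and the real functions may differ in arguments, output, and edge-case handling.

```r
# Toy data dictionary and data set (made-up values, for illustration only).
dict <- data.frame(
  VARNAME = c("SUBJECT_ID", "SEX", "AGE"),
  VARDESC = c("Participant ID", "Sex of participant", "Age at enrollment"),
  stringsAsFactors = FALSE
)
data_set <- data.frame(
  SUBJECT_ID = c("A1", "A2", "A2"),
  SEX = c(0, 1, 0),
  AGE = c(34, 51, 47),
  BMI = c(22.1, 27.4, 25.0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: variable names present on one side but not the other.
sketch_name_mismatch <- function(dict, data_set) {
  list(
    missing_from_dictionary = setdiff(names(data_set), dict$VARNAME),
    missing_from_data_set   = setdiff(dict$VARNAME, names(data_set))
  )
}

# Hypothetical helper: SUBJECT_ID values that occur more than once.
sketch_duplicate_ids <- function(data_set) {
  ids <- data_set$SUBJECT_ID
  unique(ids[duplicated(ids)])
}

# Hypothetical helper: VALUES cells that do not look like VALUE=MEANING,
# i.e. cells without exactly one "=" or with whitespace next to the "=".
sketch_bad_values <- function(cells) {
  cells <- cells[!is.na(cells)]
  n_equals <- lengths(regmatches(cells, gregexpr("=", cells, fixed = TRUE)))
  padded <- grepl("\\s=|=\\s", cells)
  cells[n_equals != 1 | padded]
}

print(sketch_name_mismatch(dict, data_set))
print(sketch_duplicate_ids(data_set))
print(sketch_bad_values(c("0=No", "1=Yes", "2 = Maybe", "3=a=b", "Unknown", NA)))
```

In this toy example, BMI is flagged as missing from the dictionary, A2 is reported as a duplicated ID, and the last three non-missing VALUES cells are flagged. For an actual submission, use the package's own functions listed in the table rather than these sketches.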
