SkillAgentSearch skills...

AllelicSeries

Implementation of gene-level rare coding variant association tests targeting allelic series: cases where increasingly deleterious mutations have increasingly large phenotypic effects.

Install / Use

/learn @insitro/AllelicSeries
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

README

Updated: 2025-03-26

Allelic Series

CRAN_Status_Badge

This package implements gene-level rare variant association tests targeting allelic series: genes where increasingly deleterious mutations have increasingly large phenotypic effects. The main COding-variant Allelic Series Test (COAST) was designed to operate on the benign missense variants (BMVs), damaging missense variants (DMVs), and protein truncating variants (PTVs) within a gene. COAST uses a set of adjustable weights that tailor the test towards rejecting the null for genes where the mean phenotypic impact increases monotonically from BMVs to DMVs to PTVs. Such genes are of candidate therapeutic interest due to the existence of a dose-response relationship between gene functionality and phenotypic effect. COAST-SS extends COAST to accept summary statistics as input. Both methods have been generalized to allow for arbitrary number of discrete annotation categories, for example the 4 category scale: benign missense, damaging missense, low-confidence loss of function, high-confidence loss of function. For additional details, see:

  • McCaw ZR, O’Dushlaine C, Somineni H, Bereket M, Klein C, Karaletsos T, Casale FP, Koller D, Soare TW. (2023) “An allelic-series rare-variant association test for candidate-gene discovery” doi:10.1016/j.ajhg.2023.07.001.

  • McCaw ZR, Gao J, Dey R, Tucker S, Zhang Y, Research Team insitro, Gronsbell J, Li X, Fox E, O’Dushlaine C, Soare TW. (2025) “A Scalable Framework for Identifying Allelic Series from Summary Statistics” doi:10.1016/j.ajhg.2025.09.012.

Installation

# Install from CRAN.
install.packages("AllelicSeries")

# Install from GitHub.
remotes::install_github("insitro/AllelicSeries", build_vignettes = TRUE)
library(AllelicSeries)

To view vignettes:

browseVignettes("AllelicSeries")

Example data

Here, data for n = 1000 subjects at snps = 300 variants with minor allele frequencies between 0.1% and 0.5% (maf_range) are simulated. By default, variants are annotated into 1 of 3 categories, representing BMVs, DMVs, and PTVs. beta specifies the mean per-variant effect sizes in each annotation category. Setting beta inversely proportional to sqrt(n) makes the power independent of the sample size. For additional details, see the Data Generation vignette.

set.seed(101)
n <- 1000
data <- DGP(
  n = n,
  snps = 300,
  maf_range = c(0.001, 0.005),
  beta = c(1, 2, 3) / sqrt(n)
)

The example data are a list with the following components:

  • anno: An snps by 1 annotation vector coded as 1 for BMVs, 2 for DMVs, and 3 for PTVs. Note that the values of (1, 2, 3) simply identify different categories of variants; weights other than these can be set when performing the association test.

  • covar: An n by 6 covariate matrix where n is the number of subjects and the columns correspond to: an intercept, age, sex, and 3 genetic PCs (pc1, pc2, pc3).

  • geno: An n by snps genotype matrix with additive coding.

  • pheno: An n by 1 phenotype vector.

COding-variant Allelic Series Test

results <- COAST(
  anno = data$anno,
  covar = data$covar,
  geno = data$geno,
  pheno = data$pheno,
  weights = c(1, 2, 3)
)

The function COAST performs the coding-variant allelic series test. The required inputs are the annotation vector, a covariate matrix, the per-variant genotype matrix, and the phenotype vector.

  • The function allows any number of annotation categories, coded as consecutive integers starting at 1. The example data have categories of c(1, 2, 3). The length of anno should match the number of columns of the genotype matrix geno. Note: a previous version of the package used zero-indexed annotation categories. The main functions will still run with zero-indexed annotations, but one-based indexing has been adopted as it is the default in R.

  • If unspecified, covar will default to an intercept vector (i.e. a vector of 1s). If covar is specified, an intercept will automatically be added if not detected.

  • weights encodes the relative importance of the annotation categories. The example weights of c(1, 2, 3) target a genetic architecture where effect sizes increase with increasing deleteriousness: BMVs have a relative effect of 1, DMVs have a relative effect of 2, and PTVs have a relative effect of 3. Weights of c(1, 1, 1) target instead a genetic architecture where all variant types have equivalent expected magnitudes.

show(results)
## Effect Sizes:
##         test beta    se
## 1       base 0.03 0.027
## 2       base 0.07 0.028
## 3       base 0.07 0.062
## 4        ind 0.01 0.048
## 5        ind 0.16 0.048
## 6        ind 0.05 0.068
## 7  max_count 0.03 0.015
## 8    max_ind 0.07 0.025
## 9  sum_count 0.03 0.010
## 10   sum_ind 0.04 0.016
## 
## 
## Counts:
##   anno alleles variants carriers
## 1    1     834      148      574
## 2    2     730      126      516
## 3    3     154       26      144
## 
## 
## P-values:
##           test   type     pval
## 1     baseline burden 2.64e-02
## 2          ind burden 5.84e-03
## 3    max_count burden 2.34e-02
## 4      max_ind burden 8.95e-03
## 5    sum_count burden 2.62e-03
## 6      sum_ind burden 4.39e-03
## 7 allelic_skat   skat 3.38e-03
## 8         omni   omni 4.36e-03

By default, the output of COAST includes a data.frame of estimated effect sizes from the burden tests, a data frame of counts showing the number of alleles, variants, and carriers in each class that contributed to the test, and a data.frame of p-values, with the omni test denoting the final, overall p-value. The effect sizes data.frame is accessed via:

results@Betas
##         test        beta         se
## 1       base 0.029482960 0.02667753
## 2       base 0.070413431 0.02760127
## 3       base 0.073294520 0.06157383
## 4        ind 0.007297359 0.04823273
## 5        ind 0.163297146 0.04755333
## 6        ind 0.054511276 0.06780962
## 7  max_count 0.034206336 0.01508838
## 8    max_ind 0.066202953 0.02532496
## 9  sum_count 0.031465473 0.01045514
## 10   sum_ind 0.044181239 0.01551063

the counts data.frame is accessed via:

results@Counts
##   anno alleles variants carriers
## 1    1     834      148      574
## 2    2     730      126      516
## 3    3     154       26      144

and the p-values data.frame via:

results@Pvals
##           test   type        pval
## 1     baseline burden 0.026375086
## 2          ind burden 0.005839884
## 3    max_count burden 0.023386302
## 4      max_ind burden 0.008945286
## 5    sum_count burden 0.002616191
## 6      sum_ind burden 0.004393287
## 7 allelic_skat   skat 0.003376729
## 8         omni   omni 0.004363169

To return the omnibus p-value only, specify return_omni_only = TRUE when calling COAST. For additional details, see the COAST vignette.

Different numbers of annotation categories

The following provides a minimal example of generating data and running COAST with a different number of annotation categories, in this case 4, which might represent benign missense variants, damaging missense variants, low-confidence loss of function, and high-confidence loss of function. The main difference when analyzing data with a different number of annotation categories is that the weights argument should be provided to COAST, and should have length equal to the number of annotation categories.

withr::local_seed(102)

# Generate data.
n <- 1e2
data4 <- DGP(
  n = n,
  snps = 400,
  beta = c(1, 2, 3, 4) / sqrt(n),
  prop_anno = c(0.4, 0.3, 0.2, 0.1),
  weights = c(1, 1, 1, 1)
)

# Run COAST
results <- COAST(
  anno = data4$anno,
  covar = data4$covar,
  geno = data4$geno,
  pheno = data4$pheno,
  weights = c(1, 2, 3, 4)
)
show(results)
## Effect Sizes:
##         test  beta    se
## 1       base  0.01 0.065
## 2       base  0.23 0.062
## 3       base  0.19 0.096
## 4       base  0.61 0.116
## 5        ind -0.10 0.253
## 6        ind  0.53 0.205
## 7        ind  0.31 0.169
## 8        ind  0.96 0.186
## 9  max_count  0.18 0.035
## 10   max_ind  0.47 0.090
## 11 sum_count  0.11 0.018
## 12   sum_ind  0.19 0.035
## 
## 
## Counts:
##   anno alleles variants carriers
## 1    1     194      169       86
## 2    2     159      131       78
## 3    3      71       61       51
## 4    4      41       39       31
## 
## 
## P-values:
##           test   type     pval
## 1     baseline burden 9.67e-10
## 2          ind burden 5.86e-07
## 3    max_count burden 4.26e-07
## 4      max_ind burden 1.65e-07
## 5    sum_count burden 5.37e-10
## 6      sum_ind burden 8.24e-08
## 7 allelic_skat   skat 3.27e-06
## 8         omni   omni 4.11e-09

COAST from Summary Statistics

Summary statistics calculation

The function CalcSumstats calculates summary statistics starting either from:

  • The direct output of DGP, or
  • An annotation vector anno, genotype matrix geno, and phenotype vector pheno, formatted as provided by DGP.
  • Providing covariates covar is optional. If omitted, an intercept-only design matrix is adopted by default. If covar is specified, an intercept will automatically be added if not detected.

Related Skills

View on GitHub
GitHub Stars14
CategoryDevelopment
Updated5mo ago
Forks2

Languages

R

Security Score

72/100

Audited on Oct 22, 2025

No findings