Panstripe
post processing of bacterial pangenome gene presence/absence matrices
Install / Use
/learn @gtonkinhill/PanstripeREADME
panstripe
<!-- badges: start --> <!-- badges: end --> <p align="center"> <img src="https://github.com/gtonkinhill/panstripe/blob/main/inst/vignette-supp/panstripe.png" alt="alt text" width="500"> </p>Panstripe improves the post processing of bacterial pangenome analyses. In particular it aims to replace the dubious but popular pangenome accumulation curve. The package is currently under development so frequent changes are to be expected.
- Installation
- Quick Start
- Citation
- Comparing Pangenomes
- Open vs Closed
- Rate vs Size
- Output
- Alternative models
- Plots
- U-plot
Installation
panstripe is currently available on GitHub. It can be installed with
devtools
install.packages("remotes")
remotes::install_github("gtonkinhill/panstripe")
If you would like to also build the vignette with your installation run:
remotes::install_github("gtonkinhill/panstripe", build_vignettes = TRUE)
Quick Start
Panstripe takes as input a phylogeny and gene presence/absence matrix in Rtab format as is produced by Panaroo, Roary, PIRATE and other prokaryotic pangenome tools.
library(panstripe)
library(ape)
library(patchwork)
set.seed(1234)
### NOTE: here we load example files from the panstripe package. You should
### replace these variables with the relevant paths to the files you are using.
phylo.file.name <- system.file("extdata", "tree.newick", package = "panstripe")
rtab.file.name <- system.file("extdata", "gene_presence_absence.Rtab", package = "panstripe")
###
# Load files
pa <- read_rtab(rtab.file.name)
tree <- read.tree(phylo.file.name)
# Run panstripe
fit <- panstripe(pa, tree)
fit$summary
#> # A tibble: 6 × 7
#> term estimate std.error statistic p.value `bootstrap CI 2.5%`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 istip 1.49 0.282 5.29 8.03e- 7 0.978
#> 2 core 2.97 0.247 12.0 9.69e-21 2.54
#> 3 depth 0.0572 0.0619 0.923 3.58e- 1 -0.0859
#> 4 istip:core -1.93 0.431 -4.48 2.10e- 5 -2.82
#> 5 p 1.25 NA NA NA 1.11
#> 6 phi 2.58 NA NA NA 1.92
#> # ℹ 1 more variable: `bootstrap CI 97.5%` <dbl>
# Plot results
plot_pangenome_params(fit)
<!-- -->
plot_pangenome_cumulative(fit)
<!-- -->
A significant p-value for the tip term indicates that there is a
different rate of gene exchange at the tips of the phylogeny compared
with the internal branches. This is usually driven by annotation errors
or highly mobile elements that do not persist long enough to be observed
in multiple genomes.
A significant p-value for the core term indicates that there is a
significant association between the core genome branch length and the
number of gene exchange events.
The depth term is less interesting but indicates when there is a
difference in our ability to detect older gene exchange events.
Citation
To cite panstripe please use: Tonkin-Hill, G. et al. Robust analysis of prokaryotic pangenome gene gain and loss rates with Panstripe. bioRxiv (2022) doi:10.1101/2022.04.23.489244
Comparing Pangenomes
As panstripe uses a GLM framework it is straightforward to compare the
slope terms between datasets. The compare_pangenomes function
considers the interaction term between the data sets and the core,
tip and depth terms.
IMPORTANT: It is important that the phylogenies for the pangenomes being compared are on the same scale. This can be achieved most easily by using time scaled phylogenies. Alternatively, it is possible to build a single phylogeny and partition it into clades or use SNP scaled phylogenies.
Here, we simulate two pangenomes with different gene exhange rates, fit
the model using panstripe and run the compare_pangenomes function.
# Simulate a fast gene gain/loss rate with error
sim_fast <- simulate_pan(rate = 0.001, ngenomes = 200)
sim_slow <- simulate_pan(rate = 5e-04, ngenomes = 200)
# Run panstripe
fit_fast <- panstripe(sim_fast$pa, sim_fast$tree)
fit_slow <- panstripe(sim_slow$pa, sim_slow$tree)
# Compare the fits
result <- compare_pangenomes(fit_fast, fit_slow)
result$summary
#> # A tibble: 4 × 7
#> term estimate std.error statistic p.value `bootstrap CI 2.5%`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 depth -0.0578 0.0268 -2.16 0.0314 -0.108
#> 2 istip -0.0464 0.173 -0.268 0.788 -0.266
#> 3 core -0.683 0.165 -4.15 0.0000374 -0.972
#> 4 dispersion model NA NA 0.439 0.508 NA
#> # ℹ 1 more variable: `bootstrap CI 97.5%` <dbl>
A significant p-value for the tip term indicates that the two
pangenomes differ in the rates of gene presence and absence assigned to
the tips of the phylogeny. This is usually driven by either differences
in the annotation error rates between data sets or differences in the
gain and loss of highly mobile elements that do not persist long enough
to be observed in multiple genomes.
A significant p-value for the core term indicates that two data sets
have different rates of gene gain and loss.
The depth term is less interesting but indicates when there is a
difference in our ability to detect older gene exchange events between
the two pangenomes.
The dispersion parameter indicates if there is a significant
difference in the dispersion of the two pangenomes. This suggests that
the relationship between the rate of gene exchange and the size of each
event differs in the two pangenomes. Here, the p-value is obtained using
a Likelihood Ratio Test.
## Plot the results
plot_pangenome_params(list(fast = fit_fast, slow = fit_slow), legend = FALSE) + plot_pangenome_cumulative(list(fast = fit_fast,
slow = fit_slow)) + plot_layout(nrow = 1)
<!-- -->
Open vs Closed
The definition of what constitutes an open or closed pangenome is somewhat ambiguous. We prefer to consider whether there is evidence for a temporal signal in the pattern of gene gain and loss. This prevents annotation errors leading to misleading results.
After fitting a panstripe model the significance of the temporal
signal can be assessed by considering the coefficient and p-value of the
‘core’ term in the model. The uncertainty of this estimate can also be
investigated by looking at the bootstrap confidence intervals of the
core term.
Let’s simulate a ‘closed’ pangenome
sim_closed <- simulate_pan(rate = 0, ngenomes = 100)
fit_closed <- panstripe(sim_closed$pa, sim_closed$tree)
fit_closed$summary
#> # A tibble: 6 × 7
#> term estimate std.error statistic p.value `bootstrap CI 2.5%`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 istip 1.53 0.143 10.7 3.09e-21 1.13
#> 2 core -723. 38024. -0.0190 9.85e- 1 -783.
#> 3 depth -0.0452 0.0374 -1.21 2.29e- 1 -0.145
#> 4 istip:core 723. 38024. 0.0190 9.85e- 1 493.
#> 5 p 1.05 NA NA NA 1.00
#> 6 phi 0.399 NA NA NA 0.199
#> # ℹ 1 more variable: `bootstrap CI 97.5%` <dbl>
The p-value indicates that there is not a significant association between core branch lengths and gene gain/loss. This is typical of species that undergo very little to no recombination such as M. tuberculosis.
Rate vs Size
While comparing the core parameter of the model identifies differences
in the association between branch lengths and gene gain and loss it does
not indicate whether this is driven by higher rates of recombination or
simply larger recombination events involving more genes.
The panstripe model allows these two scenarios to be investigated by
allowing the dispersion parameter in the model to be different for each
pangenome.
Here, we simulate two data sets with the same recombination rate but where each recombination event differs in the number of genes involved.
sim_large <- simulate_pan(rate = 0.001, ngenomes = 100, mean_trans_size = 4)
sim_small <- simulate_pan(rate = 0.001, ngenomes = 100, mean_trans_size = 3)
fit_large <- panstripe(sim_large$pa, sim_large$tree)
fit_small <- panstripe(sim_small$pa, sim_small$tree)
# Compare the fits
result <- compare_pangenomes(fit_large, fit_small)
result$summary
#> # A tibble: 4 × 7
#> term estimate std.error statistic p.value `bootstrap CI 2.5%`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 depth 0.0858 0.0490 1.75 0.0807 -0.0169
#> 2 istip -0.181 0.297 -0.609 0.543
Related Skills
node-connect
349.7kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
109.7kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
349.7kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
349.7kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
