BactaxR
Bacterial taxonomy construction and evaluation in R
Overview
bactaxR is an R package which contains functions to aid in average-nucleotide identity (ANI)-centric bacterial taxonomy construction and evaluation. Specific functions include:
- Parsing output from <a href="https://github.com/ParBLiSS/FastANI">fastANI</a>
- Identification of ANI-based genomospecies breakpoints
- ANI-based dendrogram construction
- Identification of medoid genomes using selected genomospecies thresholds
- Mapping discrete traits (e.g., genomospecies, presence or absence of a phenotypic trait) to phylogenies
Post issues at https://github.com/lmc297/bactaxR/issues
Citation
If you found bactaxR and/or its source code to be useful, please cite:
Carroll, Laura M., Martin Wiedmann, Jasna Kovac. 2020. "Proposal of a Taxonomic Nomenclature for the Bacillus cereus Group Which Reconciles Genomic Definitions of Bacterial Species with Clinical and Industrial Phenotypes." mBio 11(1): e00034-20; DOI: 10.1128/mBio.00034-20.
Installation
- Download R, if necessary: https://www.r-project.org/
- Download RStudio, if necessary: https://www.rstudio.com/products/rstudio/download/
- Open RStudio, and install the devtools package, if necessary, by typing the following command into RStudio's console:
install.packages("devtools")
- Install ggtree from Bioconductor, if necessary, by running the following commands from RStudio's console:
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("ggtree")
- Load devtools by typing the following command into RStudio's console:
library(devtools)
- Install bactaxR by typing the following command into RStudio's console:
install_github("lmc297/bactaxR")
Note: Users who get an error when installing bactaxR should run the following command before attempting to install bactaxR again: Sys.setenv("R_REMOTES_NO_ERRORS_FROM_WARNINGS"=TRUE)
- Load bactaxR by typing the following command into RStudio's console:
library(bactaxR)
Tutorials
Tutorial 1: Construct a histogram and dendrogram using pairwise ANI values, identify medoid genomes, and visualize medoid-based clusters in a graph
- For this tutorial, we're going to use pairwise ANI values that were calculated between 36 B. cereus group genomes using <a href="https://github.com/ParBLiSS/FastANI">FastANI</a> (a subset of the original data set, which will save time and memory; for all 2,231 genomes used in the full data set, see Supplementary Table S1 of the paper).
Click <a href="https://raw.githubusercontent.com/lmc297/bactaxR/master/data/bactaxR_fastani_output.txt">here</a> to download the data set. This tab-separated file was produced by FastANI, with query genomes in the first column, reference genomes in the second column, and ANI values in the third column.
Feel free to save this file in the directory of your choice; if you would like to follow along with this tutorial exactly, using identical path/file names, save this file in your home directory as bactaxR_fastani_output.txt.
- Open RStudio; if you have not already installed bactaxR, follow the installation instructions above.
- If you have not already done so, load bactaxR using the following command:
library(bactaxR)
- Let's store our pairwise ANI information as a bactaxRObject; this will allow us to construct dendrograms and graphs and identify medoid genomes, exactly as was done in the paper. To do so, run the following command (replace ~/bactaxR_fastani_output.txt with the path to your own file, if necessary):
ani <- read.ANI(file = "~/bactaxR_fastani_output.txt")
This command:
- Uses the read.ANI function in bactaxR to read bactaxR_fastani_output.txt, an output file produced by FastANI
- Stores genome names and pairwise comparisons as a bactaxRObject, assigning it to the variable name ani
Note: bactaxR can use pairwise ANI values calculated using any ANI tool, not just FastANI; you can use read.ANI with any headerless, tab-delimited file where query genome is in the first column, reference genome is in the second, and ANI values (ranging from 0 to 100) are in the third. Additionally, if you already have a data frame of query genomes/reference genomes/ANI values loaded into R, you can use the load.ANI function to store it as a bactaxRObject. Note that both of these functions check to make sure that these ANI values are pairwise all-vs-all ANI values (i.e., all query genome names must be identically present in the reference genome column, and vice-versa). Additionally, your ANI values should be between 0 and 100 (i.e., if they are between 0 and 1, multiply them by 100; for example, 0.95 ANI becomes 95 ANI, 0.971 ANI becomes 97.1 ANI). See ?read.ANI and ?load.ANI for more information.
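To make the input requirements in the note above concrete, here is a minimal base R sketch that builds a toy three-column table, rescales 0-1 values to 0-100, and checks that the table is all-vs-all. The genome names, ANI values, and column names (query, reference, ANI) are made up for illustration, and the "every pair appears exactly once" check is an assumption of this sketch that may be stricter than what read.ANI and load.ANI verify. See ?load.ANI for the exact arguments before passing a data frame like this to bactaxR.

```r
# Toy all-vs-all ANI table in the query / reference / ANI layout.
# Genome names and ANI values are made up for illustration.
genomes <- c("genomeA", "genomeB", "genomeC")
ani_df <- expand.grid(query = genomes, reference = genomes,
                      stringsAsFactors = FALSE)
ani_df$ANI <- c(1.000, 0.971, 0.950,
                0.971, 1.000, 0.948,
                0.950, 0.948, 1.000)

# Rescale to 0-100 if the values look like proportions (0-1).
if (max(ani_df$ANI) <= 1) {
  ani_df$ANI <- ani_df$ANI * 100
}

# All-vs-all check: the same set of names must appear in both columns.
same_names <- setequal(ani_df$query, ani_df$reference)

# Assumption of this sketch: every query/reference pair appears exactly once,
# so n genomes give n^2 rows (e.g., 36 genomes give 1,296 comparisons).
n_genomes <- length(unique(ani_df$query))
all_pairs <- nrow(unique(ani_df[, c("query", "reference")])) == n_genomes^2 &&
  nrow(ani_df) == n_genomes^2

is_all_vs_all <- same_names && all_pairs
```

If is_all_vs_all is FALSE for your own table, the usual cause is a genome that appears as a query but never as a reference (or vice versa).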
We can obtain a summary of our bactaxR object using the following command:
summary(ani)
This should tell us that our data set has 36 genomes and 1,296 total comparisons; this makes sense, because 36^2 = 1,296 (i.e., every genome is compared against every genome, including itself).
- Next, we'll construct a histogram using our pairwise ANI values. To build a histogram and store it as a variable h, run the following command:
h <- ANI.histogram(bactaxRObject = ani, bindwidth = 0.1)
This command:
- Builds a histogram using pairwise ANI values stored in a bactaxRObject (here, we're using our bactaxRObject which we named ani)
- Uses a histogram bin width of 0.1
To view the histogram, just run:
h
For more options for annotating/displaying your histogram, see ?ANI.histogram
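To see what a bin width of 0.1 means in practice, here is a small base R sketch that bins a handful of made-up ANI values. It uses hist() from base R rather than ANI.histogram, so it only illustrates the binning idea and is not how bactaxR draws its plot; the variable names (ani_values, bin_width, h_sketch) are hypothetical.

```r
# Made-up ANI values standing in for the pairwise values in a bactaxRObject.
ani_values <- c(91.23, 91.27, 94.84, 95.06, 95.13, 97.31, 97.36, 97.38, 100, 100)

bin_width <- 0.1

# Bin edges spaced bin_width apart, padded so they cover the full range.
breaks <- seq(floor(min(ani_values)), ceiling(max(ani_values)), by = bin_width)

# plot = FALSE returns the counts without drawing; plot(h_sketch) would draw it.
h_sketch <- hist(ani_values, breaks = breaks, plot = FALSE)

# Midpoints of the bins that actually contain comparisons. Wide empty
# stretches between occupied bins are candidate genomospecies breakpoints.
occupied <- h_sketch$mids[h_sketch$counts > 0]
```

A smaller bin width resolves narrower gaps but makes the plot noisier, so it can be worth trying a few values.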
- Next, we will construct a dendrogram and identify medoid genomes with a single command. Most researchers have relied on a <a href="https://www.nature.com/articles/s41467-018-07641-9">genomospecies threshold of 95</a>, so let's use that as a threshold for identifying medoid genomes here. To build a dendrogram and identify medoid genomes at a 95 ANI genomospecies threshold, run the following command:
dend <- ANI.dendrogram(bactaxRObject = ani, ANI_threshold = 95, xline = c(4,5,6,7.5), xlinecol = c("#ffc425", "#f37735", "deeppink4", "black"), label_size = 0.5)
This command:
- Constructs a dendrogram, using the methods described in the paper, with ANI dissimilarity plotted along the X-axis
- Identifies medoid genomes at a 95 ANI threshold, using the ANI_threshold parameter
- Annotates the dendrogram using vertical lines at the specified ANI dissimilarity (i.e., X-axis) threshold(s), using the xline parameter for X-axis position and the xlinecol parameter for color information (here, we have vertical lines at dissimilarity values of 4, 5, 6, and 7.5, which correspond to ANI values of 96, 95, 94, and 92.5, respectively; these parameters are just for annotating the dendrogram plot, and have no analytical value/effect on the identification of medoid genomes or dendrogram construction)
- Annotates the dendrogram using tip labels with size 0.5 (label_size = 0.5; by default, this is set to an arbitrarily small number so that tip labels are hidden)
See ?ANI.dendrogram for a complete list of options.
We can see the medoid genomes identified at our specified ANI threshold (i.e., 95) by running dend$medoid_genomes
We can see the clusters to which all of our genomes were assigned at our specified ANI threshold using dend$cluster_assignments
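The relationship between ANI, ANI dissimilarity, clusters, and medoids can be sketched in base R with a toy matrix. This is an illustration under stated assumptions, not a copy of what ANI.dendrogram does internally: the genome names and ANI values are made up, average-linkage clustering is an assumption of this sketch, and the medoid is taken to be the cluster member with the smallest total dissimilarity to the other members. See the paper and ?ANI.dendrogram for the methods bactaxR actually uses.

```r
# Toy symmetric ANI matrix for five made-up genomes (values are illustrative).
genomes <- paste0("g", 1:5)
ani_mat <- matrix(c(
  100.0,  98.5,  97.9,  91.0,  90.8,
   98.5, 100.0,  98.2,  91.2,  90.9,
   97.9,  98.2, 100.0,  91.1,  91.0,
   91.0,  91.2,  91.1, 100.0,  96.4,
   90.8,  90.9,  91.0,  96.4, 100.0
), nrow = 5, byrow = TRUE, dimnames = list(genomes, genomes))

# ANI dissimilarity is 100 - ANI, so a 95 ANI threshold is a height of 5.
ani_threshold <- 95
dissim <- 100 - ani_mat
cut_height <- 100 - ani_threshold

# The same conversion maps the xline positions back to ANI values.
xline <- c(4, 5, 6, 7.5)
xline_as_ani <- 100 - xline  # 96, 95, 94, 92.5

# Average linkage is an assumption of this sketch.
tree <- hclust(as.dist(dissim), method = "average")
clusters <- cutree(tree, h = cut_height)

# Medoid of each cluster: the member with the smallest total
# dissimilarity to the other members (ties go to the first member).
medoids <- sapply(split(names(clusters), clusters), function(members) {
  within <- dissim[members, members, drop = FALSE]
  members[which.min(rowSums(within))]
})
```

With these toy values, g1 to g3 fall into one cluster and g4 and g5 into another, because the two groups share only about 91 ANI with each other.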
- Let's construct an ANI similarity graph using our pairwise ANI values, and color it using the 95 ANI cluster assignments we produced in the dendrogram step above. If we look at ?ANI.graph, we can see that we need to supply metadata (i.e., the discrete attributes which we will use to color our graph; in our case, cluster assignment) in the form of a named vector.
To do this, we'll create a vector, metadata, which contains our cluster assignments:
metadata <- dend$cluster_assignments$Cluster
- Next, we'll name our vector of cluster assignments with their associated genome labels:
names(metadata) <- dend$cluster_assignments$Genome
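If named vectors are unfamiliar, the following base R sketch shows what the two commands above produce, using a made-up stand-in for dend$cluster_assignments. The Genome and Cluster column names follow the tutorial; the rows and the names cluster_assignments and metadata_sketch are hypothetical.

```r
# Stand-in for dend$cluster_assignments; the rows are made up.
cluster_assignments <- data.frame(
  Genome = c("g1", "g2", "g3", "g4"),
  Cluster = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Values are cluster assignments; names are the genome labels.
metadata_sketch <- cluster_assignments$Cluster
names(metadata_sketch) <- cluster_assignments$Genome

# Look up one genome's cluster by name.
g3_cluster <- metadata_sketch[["g3"]]

# List the genomes belonging to each cluster.
genomes_by_cluster <- split(names(metadata_sketch), metadata_sketch)
```

Because the values are matched to genomes by name, it is worth checking that every genome in the bactaxRObject has an entry before plotting.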
- Now we can construct our graph as follows (we'll use a 95 ANI threshold like we did before):
ANI.graph(bactaxRObject = ani, ANI_threshold = 95,
metadata = metadata,
legend_pos_x = -1.5, show_legend = TRUE, graphout_niter = 1000000,
legend_ncol = 1, edge_color = "black")
This command:
- Constructs a graph, drawing an edge between any two genomes which share an ANI value greater than or equal to ANI_threshold (here, we set this to 95)
- Colors nodes (i.e., points) using the named vector metadata (here, we used the clusters identified at a 95 ANI threshold in the dendrogram step above)
- Annotates and colors the graph according to various user-supplied parameters (see ?ANI.graph for more details)
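The edge rule described above (an edge wherever two genomes share an ANI value greater than or equal to the threshold) can be sketched in base R as follows. The matrix is made up and assumed to be symmetric, and the names adjacency, edges, and isolated are hypothetical; this illustrates the thresholding idea only and does not reproduce how ANI.graph builds or lays out its graph.

```r
# Toy symmetric ANI matrix (made-up values).
genomes <- paste0("g", 1:4)
ani_mat <- matrix(c(
  100.0,  97.2,  95.0,  90.1,
   97.2, 100.0,  94.6,  90.3,
   95.0,  94.6, 100.0,  90.0,
   90.1,  90.3,  90.0, 100.0
), nrow = 4, byrow = TRUE, dimnames = list(genomes, genomes))

ani_threshold <- 95

# Adjacency matrix: TRUE where two distinct genomes share ANI >= threshold.
adjacency <- ani_mat >= ani_threshold
diag(adjacency) <- FALSE

# Edge list, keeping each undirected pair once (upper triangle only).
idx <- which(adjacency & upper.tri(adjacency), arr.ind = TRUE)
edges <- data.frame(from = genomes[idx[, "row"]],
                    to = genomes[idx[, "col"]],
                    stringsAsFactors = FALSE)

# Genomes with no edges appear as isolated nodes in the graph.
isolated <- genomes[rowSums(adjacency) == 0]
```

In this toy example g2 and g3 are both connected to g1 but not to each other (94.6 ANI), which is the kind of pattern a graph view can reveal.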
Tutorial 2: Annotate a phylogeny using discrete traits
- For this tutorial, we're going to annotate a phylogeny constructed using 79 marker genes identified in 2,231 B. cereus group genomes, using discrete metadata (i.e., species assignments and presence/absence of phenotypic traits).
Click <a href="https://raw.githubusercontent.com/lmc297/bactaxR/master/data/bactaxR_phylogeny.nwk">here</a> to download the phylogeny (in <a href="https://en.wikipedia.org/wiki/Newick_format">Newick</a> format).
Feel free to save this file in the directory of your choice; if you would like to follow along with this tutorial exactly, using identical path/file names, save this file in your home directory as bactaxR_phylogeny.nwk.
- Click <a href="https://github.com/lmc297/bactaxR/blob/master/data/sup_tab
