Homologizer
Homologizer: phasing gene copies into polyploid subgenomes
Install / Use
/learn @wf8/HomologizerREADME
Phasing gene copies into polyploid subgenomes using homologizer
Will Freyman (willfreyman@gmail.com)
Carl Rothfels (crothfels@berkeley.edu)
This tutorial describes the usage of homologizer to phase gene copies into
polyploid subgenomes.
The tutorial is an abbreviated version of a soon-to-be published paper in Methods in Molecular Biology.
Please see that paper for many more details and practical considerations for running homologizer analyses.
If you use homologizer, please cite the paper in which we first describe the method:
- Freyman, W.A., Johnson, M.G., and C.J. Rothfels. 2022. Homologizer: phylogenetic phasing of gene copies into polyploid subgenomes. bioRxiv 2020.10.22.351486v4
homologizer is implemented in RevBayes. Please see http://revbayes.com to download and install RevBayes.
For users without previous RevBayes experience, we recommend the tutorials at http://revbayes.com.
Introduction
The first analysis described below uses homologizer
to phase gene copies into the subgenomes of a set of allopolyploids.
The second example analysis uses homologizer to test whether the observed
gene copies of one of the allopolyploids are homeologs from distinct
subgenomes or allelic variants arising from the same subgenome.
We provide the data and the full scripts
required to run these examples in the repo at
http://github.com/wf8/homologizer/.
The homologizer model

The homologizer model phases gene copies into polyploid subgenomes over a mul-tree,
as illustrated in the image above.
Panel A: A phylogenetic network with a single
a single hybridization (reticulation) giving rise to an allopolyploid.
Panel B: The mul-tree representation of the phylogenetic network
has two tips (red and orange) representing the two subgenomes of the allopolyploid.
Panel C: Six loci were sequenced from
the allopolyploid (red and orange squares).
Two copies (red and orange) of each locus were recovered.
Loci 2, 3, and 5 are incorrectly phased; that is they are
incorrectly assigned to the wrong subgenome.
Panel D: After phasing, the gene copies of each locus
are now assigned to the correct subgenome.
Panel E: An alternative phasing model allows for the two gene copies to be allelic variants
from the same subgenome.
In this example, both copies of locus 2 and 3 are allelic variants from the orange subgenome and no copies of these loci were
recovered from the red subgenome.
Tutorial: phasing gene copies
Our first example analysis uses homologizer
to phase gene copies into the subgenomes of a set of allopolyploids.
The output of the analysis is the posterior distribution of phased homeologs,
i.e., the posterior distribution of the assignments of each gene copy,
for each locus, into each of the subgenomes of the polyploids. Since we perform
joint inference of the phasing and phylogeny, the posterior distribution of the
multi-locus phylogeny is also inferred, along with all other parameters of the model.
In this example analysis we use a reduced version of the dataset from the fern family Cystopteridaceae previously analyzed in Rothfels et. al (2017) and Freyman et. al (2020) (reduced to increase the speed of the analyses). The data consist of four single-copy nuclear loci (ApPEFP_C, gapCpSh, IBR3, and pgiC) for a sample of 11 diploids and two tetraploids.
Here we gloss over many of the details in the phylogenetic model
(e.g., substitution models)
so that we can focus on the phasing aspect of the analysis.
Detailed tutorials on these other aspects of RevBayes can be found at http://revbayes.com.
Setting up the analysis
Below we'll step through the code for this analysis line by line. However, if you have downloaded the data and script from
the git repo http://github.com/wf8/homologizer/
you can run the full analysis by typing rb src/cystopteridaceae.Rev in your terminal window
from the homologizer directory.
Our first step is to define a vector that holds the input sequence alignment files, one for each locus.
alignments = ["data/APP.nex",
"data/GAP.nex",
"data/IBR.nex",
"data/PGI.nex"]
We'll now loop through and read in each alignment, saving
them to the vector data.
num_loci = alignments.size()
for (i in 1:num_loci) {
data[i] = readDiscreteCharacterData(alignments[i])
}
Next we set the initial phase assignments
for the polyploid accession xCystocarpium_7974.
We need to set the phase assignment here to enable the MCMC to initialize.
We can randomly assign gene copies to subgenomes; the assignment
should not affect the final outcome of the analysis assuming the MCMC
is allowed to converge.
We do this by calling the function setHomeologPhase
on each of the alignments. In the alignments the sequences for this
accession are named 7974_copy1 through 7974_copy4.
We wish to phase those copies among four mul-tree tips,
xCystocarpium_7974_A through xCystocarpium_7974_D.
for (i in 1:num_loci) {
data[i].setHomeologPhase("7974_copy1", "xCystocarpium_7974_A")
data[i].setHomeologPhase("7974_copy2", "xCystocarpium_7974_B")
data[i].setHomeologPhase("7974_copy3", "xCystocarpium_7974_C")
data[i].setHomeologPhase("7974_copy4", "xCystocarpium_7974_D")
}
This data set contains a second polyploid C_tasmanica_6379.
This accession, though, only has two subgenomes and two gene copies of each locus.
However, it is missing a sequence for the gene IBR;
for IBR there is only a single copy: 6379_copy1.
Recalling that IBR is the third sequence alignment we read in,
we can add a blank second IBR gene copy for C_tasmanica_6379:
data[3].addMissingTaxa("6379_copy2")
Now we again loop through the alignments, this time setting the initial phase
for C_tasmanica_6379.
for (i in 1:num_loci) {
data[i].setHomeologPhase("6379_copy1", "C_tasmanica_6379_A")
data[i].setHomeologPhase("6379_copy2", "C_tasmanica_6379_B")
}
The next few sections of code are fairly standard for Rev phylogenetic analyses,
and not unique to a homologizer analysis.
Since some of the diploid accessions are also missing sequences for some loci, we
now add any blank sequences needed so all the alignments contain all the accessions:
for (i in 1:num_loci) {
for (j in 1:num_loci) {
data[i].addMissingTaxa(data[j].taxa())
}
}
We'll need some useful information from the alignments:
num_tips = data[1].ntaxa()
n_branches = 2 * num_tips - 3
Now create a vector of branch lengths. We'll draw each
branch length from an exponential distribution.
We'll also add MCMC scaling moves for each branch length (which we'll store in the moves vector, indexed by the mvi counter).
mvi = 0
for (i in 1:n_branches) {
branch_lengths[i] ~ dnExponential(10)
moves[++mvi] = mvScale(branch_lengths[i], weight=1.0)
}
We'll use a uniform topology prior that puts equal probability on all unrooted, fully resolved topologies. Additionally, we'll add MCMC moves for the topology, the nearest-neighbor interchange (NNI) and subtree pruning and regrafting (SPR) tree arrangment moves.
topology ~ dnUniformTopology(data[1].taxa())
moves[++mvi] = mvNNI(topology, weight=40.0)
moves[++mvi] = mvSPR(topology, weight=40.0)
Finally, we combine the topology and the branch length vector into a deterministic node that represents our phylogeny:
tree := treeAssembly(topology, branch_lengths)
For the nucleotide substitution models we will specify a general time-reversible (GTR)
model for each locus.
We will use an uninformative Dirichlet distribution as prior on the stationary
frequencies (pi), and for the
six exchangeability rates er.
To estimate pi and er we use the MCMC move
mvSimplexElementScale, which randomly changes one element of the simplex
and then rescales the other elements so that they sum to one again.
For each locus we construct the GTR rate matrix Q using the function fnGTR
which puts together pi and er.
for (i in 1:num_loci) {
er_prior <- v(1,1,1,1,1,1)
er[i] ~ dnDirichlet(er_prior)
er[i].setValue(simplex(v(1,1,1,1,1,1)))
moves[++mvi] = mvSimplexElementScale(er[i], weight=5)
pi_prior <- v(1,1,1,1)
pi[i] ~ dnDirichlet(pi_prior)
pi[i].setValue(simplex(v(1,1,1,1)))
moves[++mvi] = mvSimplexElementScale(pi[i], weight=5)
Q[i] := fnGTR(er[i], pi[i])
}
Additionally, we estimate a substitution rate multiplier for each of the alignments except the first one. We draw the rate multipliers from an exponential distribution:
for (i in 1:num_loci) {
if (i == 1) {
rate_multiplier[i] <- 1.0
} else {
rate_multiplier[i] ~ dnExponential(1)
moves[++mvi] = mvScale(rate_multiplier[i], weight=5)
}
}
Our sequence evolution models are continuous-time Markov chains (CTMC) over
the phylogeny. So we pass a GTR rate matrices Q,
a rate_multiplier,
and the tree into a phylogenetic CTMC distribution, one for each locus.
for (i in 1:num_loci) {
ctmc[i] ~ dnPhyloCTMC(tree=tree, Q=Q[i], branchRates=rate_multiplier[i], type="DNA")
}
We now have fully defined our phylogenetic model, so we wrap it up and declare it complete:
mymodel = model(Q)
To infer the phasing, though, we wish to add MCMC phasing proposals.
We use the function mvHomeologPhase to define a phasing proposal
that swaps the sequences between any two mul-tree tips for a given locus.
Since our polyploid accession `C_tasmani
Related Skills
node-connect
352.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
111.1kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
352.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
352.2kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
