SkillAgentSearch skills...

Homologizer

Homologizer: phasing gene copies into polyploid subgenomes

Install / Use

/learn @wf8/Homologizer
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Phasing gene copies into polyploid subgenomes using homologizer

DOI

Will Freyman (willfreyman@gmail.com)
Carl Rothfels (crothfels@berkeley.edu)

This tutorial describes the usage of homologizer to phase gene copies into polyploid subgenomes. The tutorial is an abbreviated version of a soon-to-be published paper in Methods in Molecular Biology. Please see that paper for many more details and practical considerations for running homologizer analyses. If you use homologizer, please cite the paper in which we first describe the method:

  • Freyman, W.A., Johnson, M.G., and C.J. Rothfels. 2022. Homologizer: phylogenetic phasing of gene copies into polyploid subgenomes. bioRxiv 2020.10.22.351486v4

homologizer is implemented in RevBayes. Please see http://revbayes.com to download and install RevBayes. For users without previous RevBayes experience, we recommend the tutorials at http://revbayes.com.

Introduction

The first analysis described below uses homologizer to phase gene copies into the subgenomes of a set of allopolyploids. The second example analysis uses homologizer to test whether the observed gene copies of one of the allopolyploids are homeologs from distinct subgenomes or allelic variants arising from the same subgenome. We provide the data and the full scripts required to run these examples in the repo at http://github.com/wf8/homologizer/.

The homologizer model

The homologizer model phases gene copies into polyploid subgenomes over a mul-tree, as illustrated in the image above. Panel A: A phylogenetic network with a single a single hybridization (reticulation) giving rise to an allopolyploid. Panel B: The mul-tree representation of the phylogenetic network has two tips (red and orange) representing the two subgenomes of the allopolyploid. Panel C: Six loci were sequenced from the allopolyploid (red and orange squares). Two copies (red and orange) of each locus were recovered. Loci 2, 3, and 5 are incorrectly phased; that is they are incorrectly assigned to the wrong subgenome. Panel D: After phasing, the gene copies of each locus are now assigned to the correct subgenome. Panel E: An alternative phasing model allows for the two gene copies to be allelic variants from the same subgenome. In this example, both copies of locus 2 and 3 are allelic variants from the orange subgenome and no copies of these loci were recovered from the red subgenome.

Tutorial: phasing gene copies

Our first example analysis uses homologizer to phase gene copies into the subgenomes of a set of allopolyploids. The output of the analysis is the posterior distribution of phased homeologs, i.e., the posterior distribution of the assignments of each gene copy, for each locus, into each of the subgenomes of the polyploids. Since we perform joint inference of the phasing and phylogeny, the posterior distribution of the multi-locus phylogeny is also inferred, along with all other parameters of the model.

In this example analysis we use a reduced version of the dataset from the fern family Cystopteridaceae previously analyzed in Rothfels et. al (2017) and Freyman et. al (2020) (reduced to increase the speed of the analyses). The data consist of four single-copy nuclear loci (ApPEFP_C, gapCpSh, IBR3, and pgiC) for a sample of 11 diploids and two tetraploids.

Here we gloss over many of the details in the phylogenetic model (e.g., substitution models) so that we can focus on the phasing aspect of the analysis. Detailed tutorials on these other aspects of RevBayes can be found at http://revbayes.com.

Setting up the analysis

Below we'll step through the code for this analysis line by line. However, if you have downloaded the data and script from the git repo http://github.com/wf8/homologizer/ you can run the full analysis by typing rb src/cystopteridaceae.Rev in your terminal window from the homologizer directory.

Our first step is to define a vector that holds the input sequence alignment files, one for each locus.

alignments = ["data/APP.nex",
              "data/GAP.nex",
              "data/IBR.nex",
              "data/PGI.nex"]

We'll now loop through and read in each alignment, saving them to the vector data.

num_loci = alignments.size()
for (i in 1:num_loci) {
    data[i] = readDiscreteCharacterData(alignments[i])
}

Next we set the initial phase assignments for the polyploid accession xCystocarpium_7974. We need to set the phase assignment here to enable the MCMC to initialize. We can randomly assign gene copies to subgenomes; the assignment should not affect the final outcome of the analysis assuming the MCMC is allowed to converge.

We do this by calling the function setHomeologPhase on each of the alignments. In the alignments the sequences for this accession are named 7974_copy1 through 7974_copy4. We wish to phase those copies among four mul-tree tips, xCystocarpium_7974_A through xCystocarpium_7974_D.

for (i in 1:num_loci) {
    data[i].setHomeologPhase("7974_copy1", "xCystocarpium_7974_A")
    data[i].setHomeologPhase("7974_copy2", "xCystocarpium_7974_B")
    data[i].setHomeologPhase("7974_copy3", "xCystocarpium_7974_C")
    data[i].setHomeologPhase("7974_copy4", "xCystocarpium_7974_D")
}

This data set contains a second polyploid C_tasmanica_6379. This accession, though, only has two subgenomes and two gene copies of each locus. However, it is missing a sequence for the gene IBR; for IBR there is only a single copy: 6379_copy1. Recalling that IBR is the third sequence alignment we read in, we can add a blank second IBR gene copy for C_tasmanica_6379:

data[3].addMissingTaxa("6379_copy2")

Now we again loop through the alignments, this time setting the initial phase for C_tasmanica_6379.

for (i in 1:num_loci) {
    data[i].setHomeologPhase("6379_copy1", "C_tasmanica_6379_A")
    data[i].setHomeologPhase("6379_copy2", "C_tasmanica_6379_B")
}

The next few sections of code are fairly standard for Rev phylogenetic analyses, and not unique to a homologizer analysis. Since some of the diploid accessions are also missing sequences for some loci, we now add any blank sequences needed so all the alignments contain all the accessions:

for (i in 1:num_loci) {
    for (j in 1:num_loci) {
        data[i].addMissingTaxa(data[j].taxa())
    }
}

We'll need some useful information from the alignments:

num_tips = data[1].ntaxa()
n_branches = 2 * num_tips - 3

Now create a vector of branch lengths. We'll draw each branch length from an exponential distribution. We'll also add MCMC scaling moves for each branch length (which we'll store in the moves vector, indexed by the mvi counter).

mvi = 0
for (i in 1:n_branches) {
    branch_lengths[i] ~ dnExponential(10)
    moves[++mvi] = mvScale(branch_lengths[i], weight=1.0)
}

We'll use a uniform topology prior that puts equal probability on all unrooted, fully resolved topologies. Additionally, we'll add MCMC moves for the topology, the nearest-neighbor interchange (NNI) and subtree pruning and regrafting (SPR) tree arrangment moves.

topology ~ dnUniformTopology(data[1].taxa())
moves[++mvi] = mvNNI(topology, weight=40.0)
moves[++mvi] = mvSPR(topology, weight=40.0)

Finally, we combine the topology and the branch length vector into a deterministic node that represents our phylogeny:

tree := treeAssembly(topology, branch_lengths)

For the nucleotide substitution models we will specify a general time-reversible (GTR) model for each locus. We will use an uninformative Dirichlet distribution as prior on the stationary frequencies (pi), and for the six exchangeability rates er. To estimate pi and er we use the MCMC move mvSimplexElementScale, which randomly changes one element of the simplex and then rescales the other elements so that they sum to one again. For each locus we construct the GTR rate matrix Q using the function fnGTR which puts together pi and er.

for (i in 1:num_loci) {

    er_prior <- v(1,1,1,1,1,1)
    er[i] ~ dnDirichlet(er_prior)
    er[i].setValue(simplex(v(1,1,1,1,1,1)))
    moves[++mvi] = mvSimplexElementScale(er[i], weight=5)

    pi_prior <- v(1,1,1,1)
    pi[i] ~ dnDirichlet(pi_prior)
    pi[i].setValue(simplex(v(1,1,1,1)))
    moves[++mvi] = mvSimplexElementScale(pi[i], weight=5)

    Q[i] := fnGTR(er[i], pi[i])
}

Additionally, we estimate a substitution rate multiplier for each of the alignments except the first one. We draw the rate multipliers from an exponential distribution:

for (i in 1:num_loci) {
    if (i == 1) {
        rate_multiplier[i] <- 1.0
    } else {
        rate_multiplier[i] ~ dnExponential(1)
        moves[++mvi] = mvScale(rate_multiplier[i], weight=5)
    }
}

Our sequence evolution models are continuous-time Markov chains (CTMC) over the phylogeny. So we pass a GTR rate matrices Q, a rate_multiplier, and the tree into a phylogenetic CTMC distribution, one for each locus.

for (i in 1:num_loci) {
    ctmc[i] ~ dnPhyloCTMC(tree=tree, Q=Q[i], branchRates=rate_multiplier[i], type="DNA")
}

We now have fully defined our phylogenetic model, so we wrap it up and declare it complete:

mymodel = model(Q)

To infer the phasing, though, we wish to add MCMC phasing proposals. We use the function mvHomeologPhase to define a phasing proposal that swaps the sequences between any two mul-tree tips for a given locus. Since our polyploid accession `C_tasmani

Related Skills

View on GitHub
GitHub Stars11
CategoryDevelopment
Updated2mo ago
Forks1

Languages

R

Security Score

75/100

Audited on Feb 3, 2026

No findings