SubPhaser
To phase, partition and visualize subgenomes of a neoallopolyploid or hybrid based on the subgenome-specific repetitive kmers.
Quick install and start
git clone https://github.com/zhangrengang/SubPhaser
cd SubPhaser
# install
conda env create -f SubPhaser.yaml
conda activate SubPhaser
python setup.py install
# start
cd example_data
# small genome (Arabidopsis_suecica: 270Mb)
bash test_Arabidopsis.sh
# middle genome (peanut: 2.6Gb)
bash test_peanut.sh
# large genome (wheat: 14Gb)
bash test_wheat.sh
Table of Contents
- Introduction
- Inputs
- Run SubPhaser
- Run SubPhaser through Singularity/Apptainer
- Outputs
- When SubPhaser does not work
- Citation
- Applications
- Contact
- Full Usage and Default Parameters
Introduction
For many allopolyploid species, their diploid progenitors are unknown or extinct, making it impossible to unravel their subgenomes.
Here, we develop SubPhaser to phase subgenomes, by using repetitive kmers as the "differential signatures", assuming that repetitive sequences (mainly transposable elements) were expanded across chromosomes in the progenitors' independently evolving period. The tool also identifies genome-wide subgenome-specific regions and long terminal repeat retrotransposons (LTR-RTs), which will provide insights into the evolutionary history of allopolyploidization.
For details of methods and benchmarking results of SubPhaser, please see the paper in New Phytologist and its Supplementary Material including performances in dozens of chromosome-level neoallopolyploid/hybrid genomes published before October, 2021.
There are four main modules:
- The core module to phase subgenomes:
  - Count kmers by jellyfish.
  - Identify the differential kmers among homoeologous chromosome sets.
  - Cluster into subgenomes by a K-Means algorithm and estimate confidence level by the bootstrap.
  - Evaluate whether subgenomes are successfully phased by hierarchical clustering and principal component analysis (PCA).
- The module to identify and test the enrichments of subgenome-specific kmers:
  - Identify subgenome-specific kmers.
  - Identify significant enrichments of subgenome-specific kmers by genome window/bin, which is useful to identify inter-subgenomic exchanges (refer to Supplementary Material for identifying bona fide exchanges) and/or assembly errors (e.g. switch errors and hamming errors).
  - Identify subgenome-specific enrichments with user-defined features (e.g. transposable elements, genes) via -custom_features.
- The LTR module to identify and analyze subgenome-specific LTR-RT elements (disable by -disable_ltr):
  - Identify the LTR-RTs by LTRharvest and/or LTRfinder (time-consuming for large genomes, especially LTRfinder).
  - Classify the LTR-RTs by TEsorter.
  - Identify subgenome-specific LTR-RTs by testing the enrichment of subgenome-specific kmers.
  - Estimate the insertion age of subgenome-specific LTR-RTs, which helps to estimate the time of the divergence–hybridization period(s) (the period in which the progenitors are evolving independently; refer to Supplementary Material for estimating the time period).
  - Reconstruct phylogenetic trees of subgenome-specific LTR/Gypsy and LTR/Copia elements, which helps to infer the evolutionary history of these LTR-RTs (disable by -disable_ltrtree, time-consuming for large genomes).
- The visualization module to visualize genome-wide data (disable by -disable_circos):
  - Identify the homoeologous blocks simply by minimap2 (disable by -disable_blocks, time-consuming for large genomes).
  - Integrate and visualize the whole genome-wide data by circos.
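The first two steps of the core module (counting kmers and picking those that differ among homoeologous chromosomes) can be illustrated with a toy sketch. This is not SubPhaser's implementation: SubPhaser counts kmers with jellyfish, whereas the sketch uses a plain Python Counter, and the function names and the `min_count` and `min_fold` thresholds are hypothetical parameters chosen for illustration. The clustering and bootstrap steps are not shown.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count all overlapping k-mers in one sequence (toy stand-in for jellyfish)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def differential_kmers(chrom_seqs, homoeolog_sets, k=3, min_count=3, min_fold=2.0):
    """Select k-mers that are over-represented on one chromosome of a set.

    A k-mer is kept when, within at least one homoeologous set, its total count
    reaches min_count and its length-normalized frequency on the top chromosome
    is at least min_fold times that on the runner-up.
    min_count and min_fold are illustrative assumptions, not SubPhaser options.
    """
    counts = {chrom: count_kmers(seq, k) for chrom, seq in chrom_seqs.items()}
    selected = set()
    for hset in homoeolog_sets:
        # All k-mers seen on any chromosome of this homoeologous set.
        kmers = set().union(*(counts[chrom] for chrom in hset))
        for kmer in kmers:
            if sum(counts[chrom][kmer] for chrom in hset) < min_count:
                continue  # too rare to be a repetitive signature
            freqs = sorted(
                (counts[chrom][kmer] / len(chrom_seqs[chrom]) for chrom in hset),
                reverse=True,
            )
            top, second = freqs[0], freqs[1]
            if second == 0 or top / second >= min_fold:
                selected.add(kmer)
    return selected


# Two tiny made-up "chromosomes" carrying different repeats.
chrom_seqs = {
    "Chr1A": "ACGACGACGACGTTTT",
    "Chr1B": "GGTGGTGGTGGTTTTT",
}
selected = differential_kmers(chrom_seqs, [("Chr1A", "Chr1B")])
print(sorted(selected))
```

In this toy input the repeat-derived kmers of each chromosome are selected, while the shared TTT kmer is not, because its frequency is similar on both chromosomes.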
Below is an example of the output figures for wheat (ABD, 1n=3x=21):
Figure. Phased subgenomes of allohexaploid bread wheat genome. Colors are unified with each subgenome in subplots B-F, i.e. the same color means the same subgenome.
- (A) The histogram of differential k-mers among homoeologous chromosome sets.
- (B) Heatmap and clustering of differential k-mers. The x-axis, differential k-mers; y-axis, chromosomes. The vertical color bar shows the subgenome to which each chromosome is assigned; the horizontal color bar shows the subgenome to which each k-mer is specific (blank for non-specific k-mers).
- (C) Principal component analysis (PCA) of differential k-mers. Points indicate chromosomes.
- (D) Chromosomal characteristics (window size: 1 Mb). Rings from outer to inner:
  - (1) Subgenome assignments by a k-Means algorithm.
  - (2) Significant enrichment of subgenome-specific k-mers (blank for non-enriched windows).
  - (3) Normalized proportion of subgenome-specific k-mers.
  - (4-6) Density distribution (count) of each subgenome-specific k-mer set.
  - (7) Density distribution (count) of subgenome-specific LTR-RTs and other LTR-RTs (the outermost, in grey).
  - (8) Homoeologous blocks of each homoeologous chromosome set.
- (E) Insertion time of subgenome-specific LTR-RTs.
- (F) A phylogenetic tree of 1,000 randomly subsampled LTR/Gypsy elements.
Note: On the clustering heatmap (Fig. B) and PCA plot (Fig. C), a subgenome is defined as well-phased if it has clearly distinguishable patterns of both differential k-mers and homoeologous chromosomes, indicating that each subgenome shares subgenome-specific features as expected. If the subgenomes are not well-phased, the downstream analyses (which may fail) are meaningless and should be ignored.
Sometimes, just a few abnormal chromosomes are mistakenly assigned by the k-Means method, as judged from the heatmap, PCA and/or circos plots.
In this case, users can manually adjust the subgenome assignments (edit and rename the *chrom-subgenome.tsv file) and then feed the file to SubPhaser via the -sg_assigned option for downstream analysis.
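For orientation on the insertion-age step of the LTR module, the sketch below shows the calculation commonly used for LTR-RTs: the two LTRs of one element are identical at insertion, so their divergence d, corrected here with the Jukes-Cantor model, gives an age T = d / (2 × mu). This is an assumption about the general approach, not a description of SubPhaser's code; the function name is hypothetical and the default substitution rate `mu` is only a placeholder that should be replaced with a rate appropriate for the taxon.

```python
import math


def ltr_insertion_age(ltr5, ltr3, mu=1.3e-8):
    """Estimate insertion age in years from the two aligned LTRs of one element.

    ltr5, ltr3: aligned sequences of equal length ('-' marks a gap).
    mu: substitutions per site per year (placeholder value, an assumption).
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("LTR sequences must be aligned to the same length")
    # Compare only columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)  # raw proportion of differences
    if p >= 0.75:
        raise ValueError("divergence too high for the Jukes-Cantor correction")
    d = -0.75 * math.log(1 - 4 * p / 3)  # Jukes-Cantor corrected distance
    return d / (2 * mu)  # both LTRs accumulate substitutions, hence the factor 2


# Made-up example: one mismatch in 100 aligned sites.
left = "A" * 100
right = "A" * 99 + "G"
age = ltr_insertion_age(left, right)
print(f"{age:.0f} years")
```

Because the estimate scales inversely with `mu`, the absolute ages depend heavily on the assumed rate, while the relative ordering of elements does not.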
Inputs
- Chromosome-level genome sequences (fasta format), e.g. the wheat genome (haploid assembly, ABD, 1n=3x=21).
  Note: do not use a genome hard-masked by RepeatMasker, as subphaser depends on repeat sequences.
- Configuration of homoeologous chromosome sets, e.g.
Chr1A Chr1B Chr1D # each row is one homoeologous chromosome set
Chr2B Chr2A Chr2D # chromosome order is arbitrary and does not matter
Chr3D Chr3B Chr3A # separate with blank character(s)
Chr4B Chr4D Chr4A
5A|Chr5A 5B|Chr5B 5D|Chr5D # will rename chromosome id to 5A, 5B and 5D, respectively
Chr6A,Chr7A Chr6B,Chr7B Chr6D,Chr7D # will treat multiple chromosomes together using ","
If some homoeologous relationships are ambiguous, they can be placed as singletons that will not be used to identify differential kmers. For example:
Chr1A Chr1B Chr1D
Chr2B Chr2A Chr2D
Chr3D Chr3B Chr3A
Chr4B Chr4D
Chr4A # singleton(s) will skip the step to identify differential kmers
...
- [Optional] Sequences of genomic features (fasta format, with -custom_features): Any sequences of genomic features, such as transposable elements (TEs), long terminal repeat retrotransposons (LTR-RTs), simple repeats and genes, can be supplied to identify the subgenome-specific ones.
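To make the configuration syntax described above concrete, here is a minimal parsing sketch. It is not SubPhaser's parser: the function name and return structure are hypothetical, and it assumes that `#` starts a comment, as in the examples. It handles the three conventions shown: `new|old` renaming, `,` for grouping several chromosomes, and single-column rows as singletons.

```python
def parse_sg_config(text):
    """Parse a homoeologous chromosome set configuration (illustrative only).

    Returns (sets, singletons). Each row is a list of (label, members) tuples,
    where label is the new id given before '|' (or the field itself) and
    members is the tuple of chromosome ids joined by ','.
    """
    sets, singletons = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (assumed syntax)
        if not line:
            continue
        row = []
        for field in line.split():  # fields are separated by blank characters
            new_id, _, chroms = field.rpartition("|")
            members = tuple(chroms.split(","))
            row.append((new_id or chroms, members))
        # A row with one field is a singleton; otherwise it is a homoeologous set.
        (sets if len(row) > 1 else singletons).append(row)
    return sets, singletons


example = """
Chr1A Chr1B Chr1D   # each row is one homoeologous chromosome set
5A|Chr5A 5B|Chr5B 5D|Chr5D
Chr6A,Chr7A Chr6B,Chr7B Chr6D,Chr7D
Chr4A
"""
sets, singletons = parse_sg_config(example)
for row in sets:
    print(row)
print("singletons:", singletons)
```

A quick parse like this can help catch malformed rows (for example, a missing blank between two ids) before starting a long run.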
Run SubPhaser
Run with default parameters:
subphaser -i genome.fasta.gz -c sg.config
Run with just the core algorithm enabled:
subphaser -i genome.fasta.gz -c sg.config -just_core
or
subphaser -i genome.fasta.gz -c sg.config -disable_ltr -disable_circos
Change key parameters when differential kmers are too few (see Fig. A):
subphaser -i genome.fasta.gz -c sg.config -k 15 -q 50 -f 2
Multiple genomes (e.g. two related species):
subphaser -i genomeA.fasta.gz genomeB.fasta.gz -c sg.config
Multiple config files:
subphaser -i genome.fasta.gz -c sg1.config sg2.config
Input custom feature (e.g. transposable element, gene) sequences for subgenome-specific enrichments:
subphaser -i genome.fasta.gz -c sg.config -custom_features TEs.fasta genes.fasta
Define custom colors for subgenomes:
subphaser -i genome.fasta.gz -c sg.config -colors "#f9c00c,#00b9f1,#7200da"
Run SubPhaser through Singularity/Apptainer
Alternatively, you can run subphaser through a Singularity/Apptainer container:
# install
apptainer remote add --no-login SylabsCloud cloud.sylabs.io
apptainer remote use SylabsCloud
apptainer pull subphaser.sif library://shang-hongyun/collection/subphaser.sif:1.2.6
# run
./subphaser.sif subphaser -h
Outputs
phase-results/
├── k15_q200_f2.circos/ # config and data files for circos plot, so developers are able to re-plot with some custom modifications
├── k15_q200_f2.kmer_freq.pdf # histogram of differential kmers, useful to adjust option `-q`
├── k15_q200_f2.kmer.mat # differential kmer matrix (m kmer × n chromosome)
├── k15_q200_f2.kmer.mat.pdf # heatmap of the kmer matrix
├── k15_q200_f2.kmer.mat.R # R script for the heatmap plot
├── k15_q200_f2.kmer_pca.pdf # PCA plot of the kmer matrix
├── k15_q200_f2.chrom-subgenome.tsv # subgenome assignments and bootstrap values
├── k15_q200_f2.sig.kmer-subgenome.tsv # subgenome-specific kmers
├── k15_q200_f2.bin.enrich # subgenome-specific enrichments by genome wind
