Quick install and start

git clone https://github.com/zhangrengang/SubPhaser
cd SubPhaser

# install
conda env create -f SubPhaser.yaml
conda activate SubPhaser
python setup.py install

# start
cd example_data
# small genome    (Arabidopsis_suecica: 270Mb)
bash test_Arabidopsis.sh
# medium genome   (peanut: 2.6Gb)
bash test_peanut.sh
# large genome    (wheat: 14Gb)
bash test_wheat.sh

Introduction
Inputs
Run SubPhaser
Run SubPhaser through Singularity/Apptainer
Outputs
When SubPhaser does not work
Citation
Applications
Contact
Full Usage and Default Parameters

Introduction

For many allopolyploid species, their diploid progenitors are unknown or extinct, making it impossible to unravel their subgenomes. Here, we develop SubPhaser to phase subgenomes, by using repetitive kmers as the "differential signatures", assuming that repetitive sequences (mainly transposable elements) were expanded across chromosomes in the progenitors' independently evolving period. The tool also identifies genome-wide subgenome-specific regions and long terminal repeat retrotransposons (LTR-RTs), which will provide insights into the evolutionary history of allopolyploidization.

For details of methods and benchmarking results of SubPhaser, please see the paper in New Phytologist and its Supplementary Material including performances in dozens of chromosome-level neoallopolyploid/hybrid genomes published before October, 2021.

There are mainly four modules:

The core module to phase subgenomes:
- Count kmers by jellyfish.
- Identify the differential kmers among homoeologous chromosome sets.
- Cluster into subgenomes by a K-Means algorithm and estimate confidence level by the bootstrap.
- Evaluate whether subgenomes are successfully phased by hierarchical clustering and principal component analysis (PCA).
The module to identify and test the enrichments of subgenome-specific kmers:
- Identify subgenome-specific kmers.
- Identify significant enrichments of subgenome-specific kmers by genome window/bin, which is useful to identify inter-subgenomic exchanges (refer to Supplementary Material for identifying bona fide exchanges) and/or assembly errors (e.g. switch errors and hamming errors).
- Identify subgenome-specific enrichments with user-defined features (e.g. transposable elements, genes) via -custom_features.
The LTR module to identify and analyze subgenome-specific LTR-RT elements (disable by -disable_ltr):
- Identify the LTR-RTs by LTRharvest and/or LTRfinder (time-consuming for large genomes, especially LTRfinder).
- Classify the LTR-RTs by TEsorter.
- Identify subgenome-specific LTR-RTs by testing the enrichment of subgenome-specific kmers.
- Estimate the insertion age of subgenome-specific LTR-RTs, which is helpful to estimate the time of divergence–hybridization period(s) (the period in which the progenitors are evolving independently; refer to Supplementary Material for estimating the time period).
- Reconstruct phylogenetic trees of subgenome-specific LTR/Gypsy and LTR/Copia elements, which is helpful to infer the evolutionary history of these LTR-RTs (disable by -disable_ltrtree, time-consuming for large genomes).
The visualization module to visualize genome-wide data (disable by -disable_circos):
- Identify the homoeologous blocks with a simple minimap2 alignment (disable by -disable_blocks, time-consuming for large genomes).
- Integrate and visualize the whole genome-wide data by circos.
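
To make the first two steps of the core module more concrete, here is a minimal Python sketch under stated assumptions. It is not SubPhaser's implementation: a plain `Counter` stands in for jellyfish, and the selection rule (highest normalized frequency at least `min_fold` times the second-highest within one homoeologous set) is only one plausible criterion. The function names and the `min_total` and `min_fold` thresholds are hypothetical and chosen for the toy data.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count all overlapping k-mers in one sequence (toy stand-in for jellyfish)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def differential_kmers(chrom_seqs, k, min_total=3, min_fold=2.0):
    """Pick k-mers that are much more frequent on one chromosome of a set.

    chrom_seqs: dict of chromosome id -> sequence for ONE homoeologous set
                (at least two chromosomes).
    min_total, min_fold: illustrative thresholds, not SubPhaser defaults.
    Returns a dict of k-mer -> id of the chromosome where it is most frequent.
    """
    counts = {c: count_kmers(s, k) for c, s in chrom_seqs.items()}
    # Normalize by the number of k-mers per chromosome so that
    # chromosomes of different lengths are comparable.
    totals = {c: max(sum(cnt.values()), 1) for c, cnt in counts.items()}
    selected = {}
    for kmer in set().union(*counts.values()):
        raw = {c: counts[c][kmer] for c in counts}
        if sum(raw.values()) < min_total:
            continue  # too rare to be a repeat-derived signature
        freqs = sorted(((raw[c] / totals[c], c) for c in raw), reverse=True)
        (top_freq, top_chrom), (second_freq, _) = freqs[0], freqs[1]
        # An absent runner-up counts as an infinite fold difference.
        if second_freq == 0 or top_freq / second_freq >= min_fold:
            selected[kmer] = top_chrom
    return selected


# Toy homoeologous set: each chromosome carries its own "repeat".
toy_set = {"ChrA": "ACGT" * 4, "ChrB": "T" * 16}
for kmer, chrom in sorted(differential_kmers(toy_set, k=4).items()):
    print(kmer, chrom)
```

In a real run the selected k-mers from all homoeologous sets would form the chromosome-by-k-mer matrix that is then clustered; that step is left out of this sketch.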

Below is an example of the output figures for wheat (ABD, 1n=3x=21):

Figure. Phased subgenomes of allohexaploid bread wheat genome. Colors are unified with each subgenome in subplots B-F, i.e. the same color means the same subgenome.

(A) The histogram of differential k-mers among homoeologous chromosome sets.
(B) Heatmap and clustering of differential k-mers. The x-axis, differential k-mers; y-axis, chromosomes. The vertical color bar shows the subgenome to which each chromosome is assigned; the horizontal color bar shows the subgenome to which each k-mer is specific (blank for non-specific k-mers).
(C) Principal component analysis (PCA) of differential k-mers. Points indicate chromosomes.
(D) Chromosomal characteristics (window size: 1 Mb). Rings from outer to inner:
- (1) Subgenome assignments by a k-Means algorithm.
- (2) Significant enrichment of subgenome-specific k-mers (blank for non-enriched windows).
- (3) Normalized proportion of subgenome-specific k-mers.
- (4-6) Density distribution (count) of each subgenome-specific k-mer set.
- (7) Density distribution (count) of subgenome-specific LTR-RTs and other LTR-RTs (the outermost, in grey).
- (8) Homoeologous blocks of each homoeologous chromosome set.
(E) Insertion time of subgenome-specific LTR-RTs.
(F) A phylogenetic tree of 1,000 randomly subsampled LTR/Gypsy elements.

Note: On the clustering heatmap (Fig. B) and PCA plot (Fig. C), a subgenome is considered well-phased if it shows clearly distinguishable patterns of both differential k-mers and homoeologous chromosomes, indicating that each subgenome shares subgenome-specific features as expected. If the subgenomes are not well-phased, the downstream analyses (which may also fail) are meaningless and should be ignored. Sometimes the heatmap, PCA and/or circos plots show that just a few abnormal chromosomes were mistakenly assigned by the k-Means method. In this case, users can manually adjust the subgenome assignments (edit and rename the *chrom-subgenome.tsv file) and then feed the file to SubPhaser with the -sg_assigned option for downstream analysis.

Inputs

Chromosome-level genome sequences (fasta format), e.g. the wheat genome (haploid assembly, ABD, 1n=3x=21). Note: do not use a genome hard-masked by RepeatMasker, as SubPhaser depends on repeat sequences.
Configuration of homoeologous chromosome sets, e.g.

Chr1A   Chr1B   Chr1D                      # each row is one homoeologous chromosome set
Chr2B   Chr2A   Chr2D                      # chromosome order within a row is arbitrary and has no effect
Chr3D   Chr3B   Chr3A                      # separate columns with blank character(s)
Chr4B   Chr4D   Chr4A
5A|Chr5A   5B|Chr5B   5D|Chr5D             # will rename chromosome id to 5A, 5B and 5D, respectively
Chr6A,Chr7A   Chr6B,Chr7B   Chr6D,Chr7D    # will treat multiple chromosomes together using ","

If some homoeologous relationships are ambiguous, they can be placed as singletons that will not be used to identify differential kmers. For example:

Chr1A   Chr1B   Chr1D
Chr2B   Chr2A   Chr2D
Chr3D   Chr3B   Chr3A
Chr4B   Chr4D
Chr4A					# singleton(s) will skip the step to identify differential kmers
...
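
The row syntax described above (blank-separated columns, `new|old` renaming, `,` grouping, `#` comments, singleton rows) can be summarized with a small parser. This is a hedged sketch for illustration only; `parse_sg_config` and `split_singletons` are hypothetical helper names and do not reflect how SubPhaser itself reads the file.

```python
def parse_sg_config(text):
    """Parse homoeologous-set rows.

    Returns (sets, renames):
      sets    - list of rows; each row is a list of groups; each group is a
                list of original chromosome ids treated together.
      renames - dict of original id -> new id for entries written "new|old".
    """
    sets, renames = [], {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        row = []
        for field in line.split():            # blank characters separate columns
            group = []
            for item in field.split(","):     # "," joins chromosomes into one unit
                if "|" in item:
                    new_id, old_id = item.split("|", 1)
                    renames[old_id] = new_id
                    item = old_id
                group.append(item)
            row.append(group)
        sets.append(row)
    return sets, renames


def split_singletons(sets):
    """Separate rows usable for differential kmers from singleton rows."""
    multi = [row for row in sets if len(row) > 1]
    single = [row for row in sets if len(row) == 1]
    return multi, single


example = """
Chr1A   Chr1B   Chr1D     # one homoeologous set
5A|Chr5A   5B|Chr5B   5D|Chr5D
Chr6A,Chr7A   Chr6B,Chr7B   Chr6D,Chr7D
Chr4A                     # singleton
"""
sets, renames = parse_sg_config(example)
multi, single = split_singletons(sets)
print(len(multi), "sets,", len(single), "singleton row(s)")
print(renames)
```

Keeping singleton rows in a separate list mirrors the behaviour described above: they stay in the analysis but are not used to identify differential kmers.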

[Optional] Sequences of genomic features (fasta format, with -custom_features): Any sequences of genomic features, such as transposable elements (TEs), long terminal repeat retrotransposons (LTR-RTs), simple repeats and genes, could be fed to identify the subgenome-specific ones.

Run SubPhaser

Run with default parameters:

subphaser -i genome.fasta.gz -c sg.config

Run with just the core algorithm enabled:

subphaser -i genome.fasta.gz -c sg.config -just_core

subphaser -i genome.fasta.gz -c sg.config -disable_ltr -disable_circos

Change key parameters when differential kmers are too few (see Fig. A):

subphaser -i genome.fasta.gz -c sg.config -k 15 -q 50 -f 2

Multiple genomes (e.g. two related species):

subphaser -i genomeA.fasta.gz genomeB.fasta.gz -c sg.config

Multiple config files:

subphaser -i genome.fasta.gz -c sg1.config sg2.config

Input custom feature (e.g. transposable element, gene) sequences for subgenome-specific enrichments:

subphaser -i genome.fasta.gz -c sg.config -custom_features TEs.fasta genes.fasta

Define custom colors for subgenomes:

subphaser -i genome.fasta.gz -c sg.config -colors "#f9c00c,#00b9f1,#7200da"
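
When many parameter combinations need to be tried, the command lines above can be assembled from a script. The following is a minimal sketch, assuming only the options shown in this section; `build_subphaser_cmd` is a hypothetical helper, not part of SubPhaser, and it only builds the argument list without running anything.

```python
import shlex


def build_subphaser_cmd(genomes, configs, k=None, q=None, f=None,
                        just_core=False, custom_features=(), colors=None):
    """Assemble a subphaser argument list from the options shown above."""
    cmd = ["subphaser", "-i", *genomes, "-c", *configs]
    for flag, value in (("-k", k), ("-q", q), ("-f", f)):
        if value is not None:          # omitted options keep the tool's defaults
            cmd += [flag, str(value)]
    if just_core:
        cmd.append("-just_core")
    if custom_features:
        cmd += ["-custom_features", *custom_features]
    if colors:
        cmd += ["-colors", ",".join(colors)]
    return cmd


# Reproduce the "too few differential kmers" example from above.
cmd = build_subphaser_cmd(["genome.fasta.gz"], ["sg.config"], k=15, q=50, f=2)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with values such as the `#`-prefixed color codes.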

Run SubPhaser through Singularity/Apptainer

Alternatively, you can run SubPhaser through a Singularity/Apptainer container:

# install
apptainer remote add --no-login SylabsCloud cloud.sylabs.io
apptainer remote use SylabsCloud
apptainer pull subphaser.sif library://shang-hongyun/collection/subphaser.sif:1.2.6

# run
./subphaser.sif subphaser -h

Outputs

phase-results/
├── k15_q200_f2.circos/                # config and data files for circos plot, so developers are able to re-plot with some custom modifications
├── k15_q200_f2.kmer_freq.pdf          # histogram of differential kmers, useful to adjust option `-q`
├── k15_q200_f2.kmer.mat               # differential kmer matrix (m kmer × n chromosome)
├── k15_q200_f2.kmer.mat.pdf           # heatmap of the kmer matrix
├── k15_q200_f2.kmer.mat.R             # R script for the heatmap plot
├── k15_q200_f2.kmer_pca.pdf           # PCA plot of the kmer matrix
├── k15_q200_f2.chrom-subgenome.tsv    # subgenome assignments and bootstrap values
├── k15_q200_f2.sig.kmer-subgenome.tsv # subgenome-specific kmers
├── k15_q200_f2.bin.enrich             # subgenome-specific enrichments by genome wind
