Allopolyploids
PhyloSD: Phylogenomic Subgenome Detection pipeline
Install / Use
/learn @eead-csic-compbio/AllopolyploidsREADME
PhyloSD: Phylogenomic Subgenome Detection pipeline
This pipeline was designed for the subgenome identification of homeologous diploid genomes present in allopolyploids, also considering potentially unknown progenitors.
The Phylogenomic Subgenomic Detection (PhyloSD) pipeline includes three sequential Nearest Diploid Species Node, Bootstrapping Refinement, and Subgenome Assignment algorithms.
The protocol is explained in detail in:
R Sancho, LA Inda, A Diaz-Perez,L Des Marais, S Gordon, JP Vogel, J Lusinska, R Hasterok, B Contreras-Moreira, P Catalan (2022) Tracking the Ancestry of Known and 'Ghost' Homeologous Subgenomes in Model Grass Brachypodium Polyploids. The Plant Journal http://doi.org/10.1111/tpj.15650
The supporting information (supplementary tables, figures, appendix and methods) that accompanies this paper can be found at Dryad repository https://doi.org/10.5061/dryad.ncjsxksqw
A Docker container with dependencies installed can be found at:
https://hub.docker.com/r/eeadcsiccompbio/allopolyploids
<!-- made with perl -lne 'if(/^(#{1,}) (.*)/){ ($i,$t)=($1,$2); $l=lc($t); $l=~s/\W/\-/g; print "$i [$t](#$l)"}'-->- 0) Pre-processing of the data set and installation of dependencies
- 1) NEAREST DIPLOID SPECIES NODE algorithm
- 2) BOOTSTRAPPING REFINEMENT algorithm
- 3) SUBGENOME ASSIGNMENT algorithm
- 4) OPTIONAL DATING ANALYSIS
0) Pre-processing of the data set and installation of dependencies
This pipeline has several dependencies, which can be installed as explained below in folder bin/:
|software|URL| |-------|---| |TrimAl|https://github.com/inab/trimal| |newick_utils|https://github.com/tjunier/newick_utils| |IQ-TREE|http://www.iqtree.org| |GET_HOMOLOGUES-EST|https://github.com/eead-csic-compbio/get_homologues| |concat_alignments.pl|https://github.com/vinuesa/get_phylomarkers| |Consensus.pl|https://github.com/josephhughes/Sequence-manipulation| |R|https://www.r-project.org|
In addition, the instructions below require wget, curl, make, git, perl and parallel on your system.
In Debian-like systems like Ubuntu please copy and paste the following command lines in your terminal to install:
# system dependencies
sudo apt-get install -y libdb-dev parallel git curl cpanminus
# R
sudo apt-get install -y r-base
# third-party binaries and Perl dependencies
make install
In RedHat-like systems:
# Perl
sudo yum install libdb-devel
curl -L https://cpanmin.us | perl - --sudo App::cpanminus
# R
sudo yum install epel-release
sudo apt-get install R
# third-party binaries and Perl dependencies
make install
In order to check yout installation please run:
make test
Data set and abbreviations
In order to download the data used for the Brachypodium, used in the tutorial below, benchmark please do:
make brachy
Note that Brachypodium RNA-seq data were also deposited at the European Nucleotide Archive with run accessions numbers:
- ERR3633153 (B. arbuscula)
- ERR3634426 (B. boissieri)
- ERR3634464 (B. hybridum)
- ERR3634593 (B. mexicanum) *
- ERR3634717 (B. phoenicoides Bpho422)
- ERR3634894 (B. phoenicoides Bpho6)
- ERR3634970 (B. pinnatum)
- ERR3636695 (B. retusum)
- ERR3636791 (B. rupestre)
- ERR3636828 (B. stacei)
B. sylvaticum RNA-seq data from accession Brasy-Esp were obtained from the study by Fox et al., 2013.
Transcriptomic data were also retrieved for the outgroup species Oryza sativa (SRX738077) and Hordeum vulgare (ERR159679), which were used to root the phylogenetic Brachypodium trees and to provide a stem branch for the grafting of ancestral homeologs/subgenomes.
The following abbreviations were used:
Bmex --> B.mexicanum
Bret --> B.retusum
Bboi --> B.boissieri
Bhyb --> B.hybridum
Brup --> B.rupestre
Bpho --> B.phoenicoides (accession Bpho6)
B422 --> B.phoenicoides (accession Bpho422)
Bsta --> B.stacei
Bdis --> B.distachyon
Barb --> B.arbuscula
Bpin --> B.pinnatum
Bsyl --> B.sylvaticum
Osat --> Oryza sativa
Hvul --> Hordeum vulgare
Run GET_HOMOLOGUES-EST to construct the core gene/transcript alignments
This first step requires https://github.com/eead-csic-compbio/get_homologues and the set of transcripts in folder genome_transcripts, most of them assembled de novo with https://github.com/trinityrnaseq/trinityrnaseq . With the following command we request clusters with sequence identity >= 80% and 75% coverage of the shortest sequence
bin/get_homologues/get_homologues-est.pl -d genome_transcripts \
-m cluster -I genome_transcripts/species.list -M -A -S 80 \
&> gen.M.A.S80.core.cluster.log
This produces 3675 clusters.
We can now compute pan-gene matrices for these core clusters:
bin/get_homologues/compare_clusters.pl \
-d genome_transcripts_est_homologues/arb8075_alltaxa_no_sorghum_no_sylCor.list_algOMCL_e0_S80_/ \
-o core_clusters_Brachypodium -m -n &> compare.core.log
We can optionally plot an Average Nucleotide Identity (ANI) matrix:
# Note: this requires additional R dependencies
bin/get_homologues/install_R_deps.R
bin/get_homologues/plot_matrix_heatmap.sh -i \
genome_transcripts_est_homologues/arb8075_alltaxa_no_sorghum_no_sylCor.list_algOMCL_e0_S80_Avg_identity.tab \
-H 14 -W 26 -t "ANI of transcripts in 3675 core clusters" -N -o pdf
From the clusters obtained earlier we can now produce MVIEW collapsed, multiple sequence alignments (MSA), while also annotating Pfam domains:
for FILE in `ls core_clusters/*.fna`; do
echo $FILE;
bin/get_homologues/annotate_cluster.pl -D -f $FILE -o $FILE.aln.fna -c 20 &>> collapse.align.core.log;
done
1) NEAREST DIPLOID SPECIES NODE algorithm
1.1) Edit polyconfig.pm file to adapt the previous information to your specific analyses. Define diploids, polyploids, outgroup, rooted species, ad-hoc labelling rules, ...
1.2) Simplify headers and names of files from collapsed core clusters
We might need to simplify the FASTA headers, which is something that should be taylored according to the user's data.
for FILE in *collapsed.fna.aln.fna; do
echo $FILE;
perl -p -i -e 's/>(.+?) .+/>$1/g; s/:\d+:\d+:[+-]//g' $FILE;
done
Original header:
>c43592_g1_i1_chr8:11969473:11975162:+[B422_80_75] collapsed:6,{12}, Pfam:Microtubule associated protein (MAP65/ASE1 family);(PF03999,)
Simplified header:
>c43592_g1_i1_chr8[B422_80_75]
The goal is to leave the species identifier as the last bit, in this case within [brackets]. You might need to follow a sligthly different approach with your own data.
You can also simplify the names of your files (optional):
for f in *.fna;
do mv -- "$f" "$(basename -- "$f" .collapsed.fna.aln.fna).fa";
done
Original names:
99998_c33211_g1_i1_chr2-26015115-26017843--.collapsed.fna.aln.fna
Renamed names:
99998_c33211_g1_i1_chr2-26015115-26017843--.fa
Note: All FASTA files are compressed in fasta.files.tar.bz2 and saved in 01_collapsed_core_clusters directory
1.3) Filter Mutiple Sequences Alignments (MSAs) according to the diploid block and remove inconsistent positions
We now take these alignments of nucleotide sequences of both diploid and polyploid species and produce trimmed FASTA files suitable for phylogenetic tree inference. The goal is to define a solid diploid backbone, which should be covered by outgroup sequences as well, and then use it to filter out polyploid sequences/alleles with diploid block overlap < $MINBLOCKOVERLAP. Therefore, some sequences and species might be removed from the initial input.
This process is computed using scripts/_trim_MSA_block.pl and the defauld parameters are defined in polyconfig.pm file:
Default values for paremeters controlling how aligned blocks of sequences are produced in script _trim_MSA_block.pl our $MINBLOCKLENGTH = 100; our $MAXGAPSPERBLOCK = 100; # tolerated gaps for diploids in block our $MINBLOCKOVERLAP = 0.50; # fraction of diploid block covered by outgroups & polyploids
We run the tool:
for FILE in *.fa;
do echo $FILE;
perl scripts/_trim_MSA_block.pl -i $FILE -o $FILE.block.fna \
-m 100 -M 100 -O 0.5 -R 'c\d+_g\d+_i' \
-t scripts/species2rename.tsv &>> blocks.log;
done
We recover 2001 aligned core clusters with a consistent diploid block. You can save the filter files in 02_blocks directory
Warning: In this step, some alignments can lost outgroups (Osat or Hvul). If we count the number of block.fna files with each taxon we recover: Osat --> 1925/2001 Hvul --> 1944/2001 Bsta --> 2001/2001 Bdis --> 2001/2001 Barb --> 2001/2001 Bpin --> 2001/2001 Bsyl --> 2001/2001
You can count the files with all diploid and outgroup taxa using the regular expression:
grep -e '^>.*_Osat' -e '^>.*_Hvul' -e '^>.*_Bsta' -e '^>.*_Bdis' -e '^>.*_Barb' -e '^>.*_Bpin' -e '^>.*_Bsyl' *.block.fna \
| cut -d":" -f1 | uniq -c | cut -d" " -f7,8 | grep -c "^7"
We recover 1878 alignments with all diploid and outgroup taxa.
We save a list of files with all diploid and ougroup taxa:
grep -e '^>.*_Osat' -e '^>.*_Hvul' -e '^>.*_Bsta' -e '^>.*_Bdis' -e '^>.*_Barb' -e '^>.*_Bpin' -e '^>.*_Bsyl' *.block.fna | cut -d":" -f1 | uniq -c | cut -d" " -f7,8 | grep "^7" | cut -d" " -f2 > list_block_all_fasta.txt
We can now trim the resulting blocks wi
Related Skills
node-connect
352.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
111.1kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
352.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
352.2kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
