SkillAgentSearch skills...

Allopolyploids

PhyloSD: Phylogenomic Subgenome Detection pipeline

Install / Use

/learn @eead-csic-compbio/Allopolyploids
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

PhyloSD: Phylogenomic Subgenome Detection pipeline

Build Status

This pipeline was designed for the subgenome identification of homeologous diploid genomes present in allopolyploids, also considering potentially unknown progenitors.

The Phylogenomic Subgenomic Detection (PhyloSD) pipeline includes three sequential Nearest Diploid Species Node, Bootstrapping Refinement, and Subgenome Assignment algorithms.

The protocol is explained in detail in:

R Sancho, LA Inda, A Diaz-Perez,L Des Marais, S Gordon, JP Vogel, J Lusinska, R Hasterok, B Contreras-Moreira, P Catalan (2022) Tracking the Ancestry of Known and 'Ghost' Homeologous Subgenomes in Model Grass Brachypodium Polyploids. The Plant Journal http://doi.org/10.1111/tpj.15650

The supporting information (supplementary tables, figures, appendix and methods) that accompanies this paper can be found at Dryad repository https://doi.org/10.5061/dryad.ncjsxksqw

A Docker container with dependencies installed can be found at:

https://hub.docker.com/r/eeadcsiccompbio/allopolyploids

<!-- made with perl -lne 'if(/^(#{1,}) (.*)/){ ($i,$t)=($1,$2); $l=lc($t); $l=~s/\W/\-/g; print "$i [$t](#$l)"}'-->

0) Pre-processing of the data set and installation of dependencies

This pipeline has several dependencies, which can be installed as explained below in folder bin/:

|software|URL| |-------|---| |TrimAl|https://github.com/inab/trimal| |newick_utils|https://github.com/tjunier/newick_utils| |IQ-TREE|http://www.iqtree.org| |GET_HOMOLOGUES-EST|https://github.com/eead-csic-compbio/get_homologues| |concat_alignments.pl|https://github.com/vinuesa/get_phylomarkers| |Consensus.pl|https://github.com/josephhughes/Sequence-manipulation| |R|https://www.r-project.org|

In addition, the instructions below require wget, curl, make, git, perl and parallel on your system.

In Debian-like systems like Ubuntu please copy and paste the following command lines in your terminal to install:

# system dependencies
sudo apt-get install -y libdb-dev parallel git curl cpanminus

# R 
sudo apt-get install -y r-base

# third-party binaries and Perl dependencies
make install

In RedHat-like systems:

# Perl
sudo yum install libdb-devel
curl -L https://cpanmin.us | perl - --sudo App::cpanminus

# R
sudo yum install epel-release
sudo apt-get install R

# third-party binaries and Perl dependencies
make install

In order to check yout installation please run:

make test

Data set and abbreviations

In order to download the data used for the Brachypodium, used in the tutorial below, benchmark please do:

make brachy 

Note that Brachypodium RNA-seq data were also deposited at the European Nucleotide Archive with run accessions numbers:

  • ERR3633153 (B. arbuscula)
  • ERR3634426 (B. boissieri)
  • ERR3634464 (B. hybridum)
  • ERR3634593 (B. mexicanum) *
  • ERR3634717 (B. phoenicoides Bpho422)
  • ERR3634894 (B. phoenicoides Bpho6)
  • ERR3634970 (B. pinnatum)
  • ERR3636695 (B. retusum)
  • ERR3636791 (B. rupestre)
  • ERR3636828 (B. stacei)

B. sylvaticum RNA-seq data from accession Brasy-Esp were obtained from the study by Fox et al., 2013.

Transcriptomic data were also retrieved for the outgroup species Oryza sativa (SRX738077) and Hordeum vulgare (ERR159679), which were used to root the phylogenetic Brachypodium trees and to provide a stem branch for the grafting of ancestral homeologs/subgenomes.

The following abbreviations were used:

Bmex --> B.mexicanum
Bret --> B.retusum
Bboi --> B.boissieri
Bhyb --> B.hybridum
Brup --> B.rupestre
Bpho --> B.phoenicoides (accession Bpho6)
B422 --> B.phoenicoides (accession Bpho422)
Bsta --> B.stacei
Bdis --> B.distachyon
Barb --> B.arbuscula
Bpin --> B.pinnatum
Bsyl --> B.sylvaticum
Osat --> Oryza sativa
Hvul --> Hordeum vulgare

Run GET_HOMOLOGUES-EST to construct the core gene/transcript alignments

This first step requires https://github.com/eead-csic-compbio/get_homologues and the set of transcripts in folder genome_transcripts, most of them assembled de novo with https://github.com/trinityrnaseq/trinityrnaseq . With the following command we request clusters with sequence identity >= 80% and 75% coverage of the shortest sequence

bin/get_homologues/get_homologues-est.pl -d genome_transcripts \
	-m cluster -I genome_transcripts/species.list -M -A -S 80 \
	&> gen.M.A.S80.core.cluster.log

This produces 3675 clusters.

We can now compute pan-gene matrices for these core clusters:

bin/get_homologues/compare_clusters.pl \
-d genome_transcripts_est_homologues/arb8075_alltaxa_no_sorghum_no_sylCor.list_algOMCL_e0_S80_/ \
-o core_clusters_Brachypodium -m -n &> compare.core.log

We can optionally plot an Average Nucleotide Identity (ANI) matrix:

# Note: this requires additional R dependencies
bin/get_homologues/install_R_deps.R
bin/get_homologues/plot_matrix_heatmap.sh -i \
   genome_transcripts_est_homologues/arb8075_alltaxa_no_sorghum_no_sylCor.list_algOMCL_e0_S80_Avg_identity.tab \
   -H 14 -W 26 -t "ANI of transcripts in 3675 core clusters" -N -o pdf

From the clusters obtained earlier we can now produce MVIEW collapsed, multiple sequence alignments (MSA), while also annotating Pfam domains:

for FILE in `ls core_clusters/*.fna`; do
   echo $FILE;
	bin/get_homologues/annotate_cluster.pl -D -f $FILE -o $FILE.aln.fna -c 20 &>> collapse.align.core.log;
done

1) NEAREST DIPLOID SPECIES NODE algorithm

1.1) Edit polyconfig.pm file to adapt the previous information to your specific analyses. Define diploids, polyploids, outgroup, rooted species, ad-hoc labelling rules, ...

1.2) Simplify headers and names of files from collapsed core clusters

We might need to simplify the FASTA headers, which is something that should be taylored according to the user's data.

for FILE in *collapsed.fna.aln.fna; do
echo $FILE;
perl -p -i -e 's/>(.+?) .+/>$1/g; s/:\d+:\d+:[+-]//g' $FILE;
done

Original header:
>c43592_g1_i1_chr8:11969473:11975162:+[B422_80_75] collapsed:6,{12}, Pfam:Microtubule associated protein (MAP65/ASE1 family);(PF03999,)

Simplified header:
>c43592_g1_i1_chr8[B422_80_75]

The goal is to leave the species identifier as the last bit, in this case within [brackets]. You might need to follow a sligthly different approach with your own data.

You can also simplify the names of your files (optional):

for f in *.fna;
do mv -- "$f" "$(basename -- "$f" .collapsed.fna.aln.fna).fa";
done

Original names:
99998_c33211_g1_i1_chr2-26015115-26017843--.collapsed.fna.aln.fna

Renamed names:
99998_c33211_g1_i1_chr2-26015115-26017843--.fa

Note: All FASTA files are compressed in fasta.files.tar.bz2 and saved in 01_collapsed_core_clusters directory

1.3) Filter Mutiple Sequences Alignments (MSAs) according to the diploid block and remove inconsistent positions

We now take these alignments of nucleotide sequences of both diploid and polyploid species and produce trimmed FASTA files suitable for phylogenetic tree inference. The goal is to define a solid diploid backbone, which should be covered by outgroup sequences as well, and then use it to filter out polyploid sequences/alleles with diploid block overlap < $MINBLOCKOVERLAP. Therefore, some sequences and species might be removed from the initial input.

This process is computed using scripts/_trim_MSA_block.pl and the defauld parameters are defined in polyconfig.pm file:

Default values for paremeters controlling how aligned blocks of sequences are produced in script _trim_MSA_block.pl our $MINBLOCKLENGTH = 100; our $MAXGAPSPERBLOCK = 100; # tolerated gaps for diploids in block our $MINBLOCKOVERLAP = 0.50; # fraction of diploid block covered by outgroups & polyploids

We run the tool:

for FILE in *.fa;
do echo $FILE;
perl scripts/_trim_MSA_block.pl -i $FILE -o $FILE.block.fna \
-m 100 -M 100 -O 0.5 -R 'c\d+_g\d+_i' \
-t scripts/species2rename.tsv &>> blocks.log;
done

We recover 2001 aligned core clusters with a consistent diploid block. You can save the filter files in 02_blocks directory

Warning: In this step, some alignments can lost outgroups (Osat or Hvul). If we count the number of block.fna files with each taxon we recover: Osat --> 1925/2001 Hvul --> 1944/2001 Bsta --> 2001/2001 Bdis --> 2001/2001 Barb --> 2001/2001 Bpin --> 2001/2001 Bsyl --> 2001/2001

You can count the files with all diploid and outgroup taxa using the regular expression:

grep -e '^>.*_Osat' -e '^>.*_Hvul' -e '^>.*_Bsta' -e '^>.*_Bdis' -e '^>.*_Barb' -e '^>.*_Bpin' -e '^>.*_Bsyl' *.block.fna \
| cut -d":" -f1 | uniq -c | cut -d" " -f7,8 | grep -c "^7"

We recover 1878 alignments with all diploid and outgroup taxa.

We save a list of files with all diploid and ougroup taxa:

grep -e '^>.*_Osat' -e '^>.*_Hvul' -e '^>.*_Bsta' -e '^>.*_Bdis' -e '^>.*_Barb' -e '^>.*_Bpin' -e '^>.*_Bsyl' *.block.fna | cut -d":" -f1 | uniq -c | cut -d" " -f7,8 | grep "^7" | cut -d" " -f2 > list_block_all_fasta.txt

We can now trim the resulting blocks wi

Related Skills

View on GitHub
GitHub Stars11
CategoryDevelopment
Updated1y ago
Forks1

Languages

Perl

Security Score

75/100

Audited on Oct 9, 2024

No findings