HomBlocks

HomBlocks: A multiple-alignment construction pipeline for organelle phylogenomics based on locally collinear block searching

If you find this useful, please cite:

Guiqi Bi, Yunxiang Mao, Qikun Xing, Min Cao, HomBlocks: A multiple-alignment construction pipeline for organelle phylogenomics based on locally collinear block searching, Genomics (2017), doi: 10.1016/j.ygeno.2017.08.001

What is HomBlocks?

HomBlocks is a new and highly efficient pipeline that uses a homologous-block search to construct multi-gene alignments. It automatically recognizes locally collinear blocks (LCBs) among organelle genomes and extracts phylogenetically informative regions, producing a multi-gene alignment within a few hours.

The traditional way of constructing multi-gene alignments for organelle phylogenomic analysis is time-consuming. To make it more efficient to build a sequence matrix from many organelle genomes, we developed a time-saving and accurate method for phylogenomic studies.

In this pipeline, the core conserved fragments of each genome (conserved coding genes, functional non-coding regions and rRNA) are picked out and joined into one long sequence per genome. This avoids the tedious alignment of every single gene and yields a phylogenetically informative, high-quality data matrix. Instead of a week of manual work, it usually takes less than an hour to construct the HomBlocks matrix for around two dozen organelle genomes. In addition, HomBlocks produces optimal partition schemes and sequence evolution models for RAxML, which are important in downstream phylogenetic analysis.

The conventional way to construct multi-gene alignments from organelle genomes

Almost all organelle genomics studies use multiple genes in their phylogenetic analyses to improve phylogenetic resolution. Usually, however, every set of orthologous genes must first be aligned on its own, and the aligned genes are then concatenated. Although some software, such as Sequence Matrix, can help with sequence extraction or concatenation, constructing multi-gene alignments from organelle genomes is a complex process that is prone to introducing human errors. Beyond that, the main concern for researchers is how long the alignment procedure takes. In general, even with the help of bioinformatics tools, it takes at least two weeks to make genome-wide alignments from the common genes of 30 higher plant chloroplast genomes (about 150 kb long, with at least 100 common genes). As a result, papers on plant chloroplast genomes commonly use fewer than 70 genes for phylogeny. Researchers also have to be patient and cautious, because a single-gene alignment containing errors can cause misplacements in the final alignment that go undetected. Generally speaking, organelle phylogenomic analysis provides precise tools for detecting genetic relationships, but constructing the multi-gene alignments is not convenient.

Reasons why alignment cannot be established using whole organelle genomes

The evolution of organelle genomes is dynamic and diverse in gene content, structure and sequence divergence. As a result, these genomes generally cannot be aligned directly as whole-genome sequences, as the picture below shows.

This picture is the Mauve result for a comparison of the plastid genomes of three green algae. There is a large inverted fragment in Ulva sp. relative to the other sequences (arrow B). Gene content and intergenic region lengths also differ (arrow C). Similarly, the number of gene introns differs among the genomes (arrow A). The most direct consequence is that the genomes differ in length (arrow D). For aligners, these characteristics can lead to fatal errors or corrupted alignments.

Organelle genomes within a species are usually conserved in both length and structure, so in some cases they can be aligned directly. But in nine cases out of ten, organelle genome research focuses on the interspecies level, where direct alignment is difficult to achieve.

Methodology

The workflow diagram is shown below.

HomBlocks uses progressiveMauve, which applies an anchored alignment algorithm, to identify locally collinear blocks (LCBs) shared by organelle genomes (chloroplast and mitochondrial genomes). The LCBs that co-exist in all of the organelle genomes are extracted and trimmed to retain the phylogenetically informative regions.

HomBlocks offers four methods for trimming LCBs: Gblocks, trimAl, noisy and BMGE. If none is specified, Gblocks is used.

The final alignment, composed of the trimmed LCBs, can be used in downstream analysis. An additional parameter lets users run PartitionFinder on the final alignment to select the best-fit DNA substitution model and the optimal partition scheme and models of sequence evolution for RAxML.
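To make the idea of LCBs that co-exist in all genomes concrete, here is a minimal shell sketch that counts such blocks in a Mauve XMFA file. It is an illustration only, not the code HomBlocks itself runs. It assumes the usual XMFA layout: `#` header lines, entries that start with `> seqnum:start-end strand file`, and a line beginning with `=` at the end of each block. The function name `count_shared_lcbs` is hypothetical.

```shell
# count_shared_lcbs FILE N
# Print how many XMFA blocks contain an entry from each of the N genomes.
# Hypothetical helper for illustration; not part of HomBlocks.
count_shared_lcbs() {
    awk -v n="$2" '
        /^#/ { next }                      # skip XMFA header lines
        /^>/ {                             # entry header: "> 3:120-560 + file"
            id = $0
            sub(/^> */, "", id)            # drop the leading ">" and spaces
            sub(/:.*/, "", id)             # keep only the sequence number
            if (!(id in seen)) { seen[id] = 1; count++ }
            next
        }
        /^=/ {                             # end of one block
            if (count == n) shared++
            split("", seen)                # portable way to clear the array
            count = 0
        }
        END { print shared + 0 }
    ' "$1"
}
```

For example, `count_shared_lcbs Xenarthrans.mauve.out 36` would report how many blocks contain all 36 genomes, assuming that file is the XMFA output of progressiveMauve.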

Installation

HomBlocks is a pipeline implemented in Perl 5. HomBlocks itself needs no installation step: all of the external executables it depends on are placed under the bin directory. Clone the repository:

git clone https://github.com/fenghen360/HomBlocks.git

Or download the zip archive into your working directory.

# Decompress the files (only needed if you downloaded the zip archive)
unzip HomBlocks-master.zip
cd HomBlocks-master

# HomBlocks.pl is the main program; you can check its usage with
perl HomBlocks.pl

# Check whether the programs in the bin directory are executable. If they are not, change their permissions.
cd bin
chmod 755 *

# Make the programs in PartitionFinderV1.1.1 executable
cd ..
cd PartitionFinderV1.1.1
chmod 755 *
chmod 755 PartitionFinder*
cd programs
chmod 755 *
cd ..
cd partfinder
chmod 755 *
Required software

perl with version above 5
java with version above 1.7 (required by BMGE.jar)
python with version above 2.7 (required by PartitionFinder)
circos (optional)

circos is not easy to install on a Linux server without root permissions. If you want to install it to visualize the genes involved in the alignments, you can use the Perl script cpanm.pl (http://xrl.us/cpanm) to install the Perl modules it needs. Otherwise, my advice is to do this visualization on the circoletto web server (http://tools.bat.infspire.org/circoletto/), giving it as input one final alignment sequence from a species and the corresponding set of single-gene sequences.
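Before running the pipeline, it can help to confirm that the required interpreters are on your PATH. The sketch below is one way to do that. The helper name `have_tool` is hypothetical, and the executable names `perl`, `java` and `python` are assumptions; on some systems the Python interpreter is installed under a different name. It checks presence only, not versions.

```shell
# have_tool NAME: succeed if NAME can be found on PATH.
# Hypothetical helper for illustration; not part of HomBlocks.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report each required tool (executable names are assumptions).
for tool in perl java python; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```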

Tutorial

HomBlocks is not complex to use. All it needs are FASTA or GenBank files (with a fasta, fas, fa or gb suffix). Put all of the sequences in one directory, like the test sequences in Xenarthrans/fasta.

To begin with, you can check the usage of HomBlocks by running it without any parameters.
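As a quick check before a run, the following sketch lists the files in a directory that carry one of the accepted suffixes. The function name `list_inputs` is hypothetical, and the matching is by file name only; it does not verify that the contents are valid FASTA or GenBank records.

```shell
# list_inputs DIR: print the files in DIR with a fasta, fas, fa or gb suffix.
# Hypothetical helper for illustration; not part of HomBlocks.
list_inputs() {
    for f in "$1"/*.fasta "$1"/*.fas "$1"/*.fa "$1"/*.gb; do
        # An unmatched pattern stays literal, so keep regular files only.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Piping the result through `wc -l`, for example `list_inputs Xenarthrans/fasta | wc -l`, gives the number of candidate input files, which you can compare with the number of genomes you expect.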

# check usage
perl HomBlocks.pl

# The screen output should look like this
usage: ./HomBlocks.pl <parameters>
         
parameters:
                -in=<file>                            Genome alignment outputfile derived from Muave. If you set --align, ignore this input parameter.
                -out_seq=<file>                       Output file of trimmed and concatenated sequences.
                -number=<int>                         Number of taxa used in aliggment (should be precious). If you set --align, ignore this input parameter.
                -min=<int>                            Minimum alignment length of a extracted module. (Default: unset)
                -method=[Gblocks|trimAl|BMGE|noisy]   To choose which program to be used in alignment trimming. (Default: Gblocks).
                --PartitionFinder                     To calculate the best subsitition model for each extracted colinear block and set best partition scheme by PartitionFinder.

                --align                               If you want to align sequences by mauve, add this parameter (Default: progressiveMauve).
                                                      Then you should split every sequence into a single file. File suffix with fasta,gb,fas,fa is acceptable.
                --path=                               Absolute path to directory where you put in fasta sequences (Under --align parameter).

                --mauve-out=                          The output file produced by mauve (Absolute path). If you set --align parameter.
               
           

                -help/h                               Print the usage.
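The usage text suggests two modes: letting HomBlocks run progressiveMauve itself (--align with --path), or reusing an existing Mauve output file through -in and -number. The sketch below assembles a command line for the second mode. It is inferred from the usage text only and has not been checked against the program; the function name `homblocks_rerun_cmd` and the file names in the example are hypothetical.

```shell
# homblocks_rerun_cmd MAUVE_OUT NUM_TAXA OUT_FASTA [METHOD]
# Print a HomBlocks command line that reuses an existing Mauve output file.
# Hypothetical helper; the flag combination is inferred from the usage text.
homblocks_rerun_cmd() {
    method=${4:-Gblocks}                   # Gblocks is the documented default
    case "$method" in
        Gblocks|trimAl|BMGE|noisy) ;;      # the four documented trimmers
        *) echo "unknown trimming method: $method" >&2; return 1 ;;
    esac
    printf 'perl HomBlocks.pl -in=%s -number=%s -out_seq=%s -method=%s\n' \
        "$1" "$2" "$3" "$method"
}

# Example with placeholder file names:
homblocks_rerun_cmd genomes.mauve.out 36 genomes.output.fasta trimAl
```

Printing the command rather than running it lets you inspect it first. Note that -number must match the number of taxa in the Mauve file exactly.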

Running with 36 Xenarthrans mitochondrial genomes as an example

The example dataset is taken from this paper: Gibb, G. et al. (2016). Shotgun mitogenomics provides a reference phylogenetic framework and timescale for living xenarthrans. Molecular Biology and Evolution, 33(3), 621-642.

Example run with parameters like this:

perl HomBlocks.pl --align --path=/public/home/mgb217/HomBlocks/Xenarthrans/fasta/ -out_seq=Xenarthrans.output.fasta  --mauve-out=Xenarthrans.mauve.out

The meanings of these parameters can be found in the HomBlocks usage. Note that --align and --path must be set together. --align means that you do not yet have a Mauve alignment result file, so HomBlocks runs progressiveMauve for LCB detection. --path gives the absolute path of the directory that holds your sequences. HomBlocks then detects the sequences in that directory. The screen output should look like this:

Totla 36 files detected!
The list of sequence
