MCScanX
MCScanX: Multiple Collinearity Scan toolkit X version. The most popular synteny analysis tool in the world!
:License: `BSD <http://creativecommons.org/licenses/BSD/>`_
Notice
The original authors are collaborating with Dr. Xi Zhang at Dalhousie University to maintain the MCScanX software. Dr. Zhang has developed a utility tool, `MCScanX_Assistant <https://github.com/zx0223winner/MCScanX_Assistant>`_, which streamlines data preparation and simplifies the installation process for MCScanX.
Contact: Xi.Zhang@dal.ca
Overview
The MCScanX package has two major components: a modified version of the `MCScan algorithm <https://github.com/tanghaibao/mcscan>`_ that makes MCScan more convenient to run and displays multiple alignments of syntenic blocks more clearly, and a variety of downstream analysis tools that perform different biological analyses on the synteny data generated by the modified MCScan algorithm.
All programs are run from the command line on Linux or Mac OS. Usage and help information is built into each program. To display it, run the program without any options::

    $ ./program_name

.. image:: https://lh4.ggpht.com/_O4Q4Y0oWQYU/Tcn3sydLaSI/AAAAAAAAA0w/foXv6yt4S2Y/s400/Figure1backup.jpg
   :alt: MCScanX flow chart

All code is copiable, distributable, modifiable, and usable without any restrictions.
Contact: Xi Zhang, Xi.Zhang@dal.ca; Yupeng Wang, wyp1125@gmail.com
Installation
Make
::::::
Simply put MCscanX.zip into a directory and run::

    $ unzip MCscanX.zip
    $ cd MCScanX
    $ make

The following is the list of executable programs
:::::::::::::::::::::::::::::::::::::::::::::::::
Main programs (in the main folder)
- MCScanX
- MCScanX_h
- Duplicate_gene_classifier
Downstream analysis programs (in the downstream_analyses folder)
- Tool 1. detect_syntenic_tandem_arrays
- Tool 2. dissect_multiple_alignment
- Tool 3. dot_plotter.java
- Tool 4. dual_synteny_plotter.java
- Tool 5. circle_plotter.java
- Tool 6. bar_plotter.java
- Tool 7. add_ka_and_ks_to_synteny.pl
- Tool 8. group_collinear_genes.pl
- Tool 9. detect_collinearity_within_gene_families.pl
- Tool 10. family_circle_plotter.java
- Tool 11. family_tree_plotter.java
- Tool 12. origin_enrichment_analysis.pl
Main programs
MCScanX
::::::::
This program, which implements a modified MCScan algorithm, detects syntenic blocks and progressively aligns multiple syntenic blocks against reference genomes (PIVOT).

- Usage

MCScanX reads two data files: xyz.blast and xyz.gff. The xyz.blast file is simply the direct BLASTP output in m8 format, as follows::

    AT1G50920 AT1G50920 100.00 671 0 0 1 671 1 671 0.0 1316

Here is a typical parameter setting for generating the xyz.blast file::

    $ blastall -i query_file -d database -p blastp -e 1e-10 -b 5 -v 5 -m 8 -o xyz.blast

The xyz.bed file holds gene positions in a tab-delimited format::

    chr# starting_position ending_position gene

Note: for chr#, a two-letter short name is used as prefix for the species; # is the chromosome number. (For example, the second chromosome of Arabidopsis thaliana should be denoted as at2.)
The bed format is defined `here <http://genome.ucsc.edu/FAQ/FAQformat.html#format1>`_ and is especially useful because many tools can handle bed files, most notably BEDTools.
The xyz.bed file can be generated by parsing the .gff3 file released by the sequencing initiatives.
Each gene may appear only once in the .bed file.
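As a rough illustration of the GFF3 parsing step, the following Python sketch converts ``gene`` features into the four-column layout described above and skips repeated genes. The function name, the ``ID=`` attribute lookup, the assumption that chromosome names end in a number, and the example records are all hypothetical; real GFF3 files differ in how they name chromosomes and genes, so the parsing will likely need adapting.

```python
import re


def gff3_to_positions(gff3_lines, species_prefix):
    """Yield (chr_id, start, end, gene) tuples from GFF3 'gene' features.

    Assumes chromosome names end in a number (e.g. 'Chr2') and that each
    gene feature carries an 'ID=' attribute. Both are illustrative guesses.
    """
    seen = set()
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # keep gene-level features only
        chrom_number = re.search(r"(\d+)$", cols[0])
        gene_id = re.search(r"ID=([^;]+)", cols[8])
        if not chrom_number or not gene_id:
            continue
        gene = gene_id.group(1)
        if gene in seen:
            continue  # each gene may appear only once
        seen.add(gene)
        chr_id = f"{species_prefix}{chrom_number.group(1)}"
        yield (chr_id, int(cols[3]), int(cols[4]), gene)


# Made-up records for illustration; not real annotation data.
example = [
    "##gff-version 3\n",
    "Chr2\tsource\tgene\t1025\t2810\t.\t+\t.\tID=geneA;Name=geneA\n",
    "Chr2\tsource\tmRNA\t1025\t2810\t.\t+\t.\tID=geneA.1;Parent=geneA\n",
    "Chr2\tsource\tgene\t1025\t2810\t.\t+\t.\tID=geneA;Name=geneA\n",
]
rows = list(gff3_to_positions(example, "at"))
for row in rows:
    print("\t".join(str(field) for field in row))
```

Each printed line is tab-delimited and can be redirected into the positions file.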
When comparing multiple genomes, simply concatenate all inter- and intra-species m8 BLAST output into the xyz.blast file and concatenate the gene positions of all species into the xyz.bed file.
For MCScanX to generate more reasonable results, it is advisable to restrict the number of BLASTP hits per gene to roughly the top 5. When xyz.blast and xyz.bed are ready, put them in the same folder. Then simply run::

    $ ./MCScanX dir/xyz

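If the BLASTP output was produced without a hit limit, the top-5 restriction can be applied afterwards. The sketch below is one possible way to do that in Python, assuming the standard 12-column m8 layout (e-value in column 11, bit score in column 12). The function name ``top_hits``, its default thresholds, and the sample records are illustrative and are not part of MCScanX.

```python
from collections import defaultdict


def top_hits(m8_lines, max_hits=5, max_evalue=1e-10):
    """Keep at most max_hits best-scoring hits per query gene.

    Hits are ranked by bit score (column 12) after discarding those whose
    e-value (column 11) exceeds max_evalue. Self-hits are left untouched.
    """
    by_query = defaultdict(list)
    for line in m8_lines:
        cols = line.split()
        if len(cols) < 12:
            continue  # not a complete m8 record
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue <= max_evalue:
            by_query[cols[0]].append((bitscore, line.rstrip("\n")))
    kept = []
    for hits in by_query.values():
        # Stable sort: hits with equal bit scores keep their input order.
        hits.sort(key=lambda hit: hit[0], reverse=True)
        kept.extend(text for _, text in hits[:max_hits])
    return kept


# Made-up records: seven strong hits and one weak hit for query Q1.
sample = [
    f"Q1\tS{i}\t90.00\t100\t0\t0\t1\t100\t1\t100\t1e-50\t{i * 10}\n"
    for i in range(1, 8)
]
sample.append("Q1\tS_weak\t30.00\t50\t0\t0\t1\t50\t1\t50\t1e-03\t999\n")
kept = top_hits(sample)
print("\n".join(kept))
```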
- Output

Running MCScanX produces one text file, xyz.synteny, containing pairwise synteny blocks as follows::

    ## Alignment 0: score=9171.0 e_value=0 N=187 at1&at1 plus
    0- 0: AT1G17240 AT1G72300 0
    0- 1: AT1G17290 AT1G72330 0
    ...
    0-185: AT1G22330 AT1G78260 1e-63
    0-186: AT1G22340 AT1G78270 3e-174
    ## Alignment 1: score=5084.0 e_value=5.6e-251 N=106 at1&at1 plus

and one directory, xyz.html, containing HTML files that display the multiple alignment of syntenic blocks against each chromosome. The HTML files must be viewed in a web browser. In an HTML file, the first column shows the number of syntenic blocks at each gene locus, the second column shows the genes in the PIVOT (reference chromosome), with tandem genes marked in red, and the remaining columns show the aligned syntenic blocks, in which only matching genes are displayed.
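For downstream scripting, the block listing can be read back into a simple structure. The following Python sketch assumes the layout of the sample output above: an ``Alignment`` header line (optionally prefixed with ``##``) followed by one gene pair per line. The function name ``parse_blocks`` and the dictionary keys are hypothetical.

```python
import re

# Header layout inferred from the sample output; the '##' prefix is optional here.
HEADER = re.compile(
    r"^(?:##\s*)?Alignment\s+(\d+):\s*score=(\S+)\s+e_value=(\S+)"
    r"\s+N=(\d+)\s+(\S+)\s+(\S+)"
)


def parse_blocks(lines):
    """Return a list of syntenic blocks, each with its header fields and gene pairs."""
    blocks = []
    for line in lines:
        header = HEADER.match(line.strip())
        if header:
            blocks.append({
                "id": int(header.group(1)),
                "score": float(header.group(2)),
                "e_value": float(header.group(3)),
                "n": int(header.group(4)),
                "chromosomes": header.group(5),
                "orientation": header.group(6),
                "pairs": [],
            })
            continue
        if line.startswith("#") or ":" not in line or not blocks:
            continue  # comment, elision marker, or pair before any header
        fields = line.split(":", 1)[1].split()
        if len(fields) >= 3:
            blocks[-1]["pairs"].append((fields[0], fields[1], float(fields[2])))
    return blocks


sample = """## Alignment 0: score=9171.0 e_value=0 N=187 at1&at1 plus
0- 0: AT1G17240 AT1G72300 0
0- 1: AT1G17290 AT1G72330 0
...
0-185: AT1G22330 AT1G78260 1e-63
0-186: AT1G22340 AT1G78270 3e-174
## Alignment 1: score=5084.0 e_value=5.6e-251 N=106 at1&at1 plus
""".splitlines()

blocks = parse_blocks(sample)
for block in blocks:
    print(block["id"], block["chromosomes"], block["n"], len(block["pairs"]))
```

The sample is abridged, so the number of parsed pairs is smaller than the ``N`` reported in the header.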
- MCScanX parameters (for advanced users)

[Usage]::

    ./MCScanX prefix_fn [options]
    -k MATCH_SCORE, final score=MATCH_SCORE+NUM_GAPS*GAP_PENALTY (default: 50)
    -g GAP_PENALTY, gap penalty (default: -1)
    -s MATCH_SIZE, number of genes required to call synteny (default: 5)
    -e E_VALUE, alignment significance (default: 1e-05)
    -u UNIT_DIST, average intergenic distance (default: 10000)
    -m MAX_GAPS, maximum gaps(one gap=UNIT_DIST) allowed (default: 20)
    -a only builds the pairwise blocks (.synteny file)
    -b patterns of syntenic blocks. 0:intra- and inter-species (default); 1:intra-species; 2:inter-species
    -h print this help page

MCScanX_h
::::::::::::::::::::::::::
The BLASTP input of MCScanX can be replaced by a tab-delimited file containing more reliable pairwise homologous relationships. In this case, users should run MCScanX_h instead. Running MCScanX_h is very similar to running MCScanX, except that the "xyz.blast" file is replaced by an "xyz.homology" file. At the bottom of the screen output, statistics on the numbers and percentages of collinear homolog pairs are shown.
Duplicate_gene_classifier
::::::::::::::::::::::::::
This program, which incorporates the MCScanX algorithm, classifies the origins of the duplicate genes of ONE genome as whole-genome/segmental (matching genes in syntenic blocks), tandem (consecutive repeats), proximal (in a nearby chromosomal region but not adjacent), or dispersed (modes other than segmental, tandem, and proximal) duplications.

- Usage::

    $ ./duplicate_gene_classifier dir/xyz

The input of duplicate_gene_classifier is the same as that of MCScanX, with an additional option for defining the maximum distance (in number of genes) between two proximal duplicates.
- Output

The output is a text file named xyz.gene_type, written to the same directory as the input files. It lists the origin of every gene in the xyz.gff file in a tab-delimited format::

    Gene gene_type(0/1/2/3/4)

Note: 0, 1, 2, 3, and 4 stand for singleton, dispersed, proximal, tandem, and segmental, respectively. This program should not be applied to data from multiple genomes.
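To summarize a xyz.gene_type file, the numeric codes can be mapped back to their labels and counted. The Python sketch below assumes two whitespace-separated columns per line, as described above; the function name and the sample records are hypothetical.

```python
from collections import Counter

# Code-to-label mapping taken from the note above.
GENE_TYPES = {
    0: "singleton",
    1: "dispersed",
    2: "proximal",
    3: "tandem",
    4: "segmental",
}


def count_gene_types(lines):
    """Count genes per duplication origin in gene_type records."""
    counts = Counter()
    for line in lines:
        cols = line.split()
        if len(cols) < 2 or not cols[1].isdigit():
            continue  # skip blank lines and any header line
        counts[GENE_TYPES.get(int(cols[1]), "unknown")] += 1
    return counts


# Made-up records for illustration only.
sample = ["geneA\t0\n", "geneB\t4\n", "geneC\t4\n", "geneD\t3\n", "\n"]
counts = count_gene_types(sample)
for label, number in sorted(counts.items()):
    print(f"{label}\t{number}")
```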
Downstream analyses
:::::::::::::::::::::
Detect_syntenic_tandem_arrays
::::::::::::::::::::::::::::::::::
Tandem duplications often complicate synteny detection. To enhance the power of synteny detection, MCScan algorithms use the gene with the best BLASTP hit to represent a tandem array. This program transforms matching genes in syntenic blocks into tandem arrays where tandem duplications exist.

- Usage::

    $ ./detect_syntenic_tandem_arrays -g gff_file -b blast_file -s synteny_file -o output_file

- Output

The path of output_file is specified by the user. If either gene of a syntenic pair is located in a tandem array, the syntenic pair is written to output_file.
Dissect_multiple_alignment
::::::::::::::::::::::::::::::
This program dissects the number of syntenic blocks at each gene locus of the reference genome(s) into the number of intra-species syntenic blocks and the number of inter-species syntenic blocks.

- Usage::

    $ ./dissect_multiple_alignment -g gff_file -s synteny_file -o output_file

- Output

The path of output_file is specified by the user. The first and second columns of output_file show the chromosomes and genes in the reference genome(s). The 3rd, 4th and 5th columns show the numbers of intra-species syntenic blocks, inter-species syntenic blocks and outgroup species respectively.
dot_plotter
:::::::::::::::
This Java program generates a dot plot of all the syntenic blocks on two sets of chromosomes given by the user. Note that a JDK is required to run the Java programs.

- Usage::

    $ java dot_plotter -g gff_file -s synteny_file -c control_file -o output_PNG_file

The input files are a gff file containing all gene positions, a synteny file generated by MCScanX, and a control file (.ctl) containing the plot size and chromosome IDs. The control file can easily be made by modifying the dot.ctl file::

    800 //dimension (in pixels) of x axis
    800 //dimension (in pixels) of y axis
    sb1,sb2,sb3,sb4,sb5,sb6,sb7,sb8,sb9,sb10 //chromosomes in x axis
    os1,os2,os3,os4,os5,os6,os7,os8,os9,os10,os11,os12 //chromosomes in y axis

Note that no space is allowed between adjacent chromosome IDs.
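When plotting many chromosome combinations, the control file can also be generated rather than edited by hand. The Python sketch below mirrors the four-line dot.ctl layout shown above and rejects chromosome IDs that contain whitespace or commas. The function name ``make_dot_ctl`` is hypothetical, and the single space before each ``//`` comment is an assumption copied from the listing; if in doubt, compare the result with the bundled dot.ctl.

```python
def make_dot_ctl(width, height, x_chromosomes, y_chromosomes):
    """Return the text of a dot_plotter control file (layout assumed from dot.ctl)."""
    for chrom in list(x_chromosomes) + list(y_chromosomes):
        # IDs are joined with bare commas, so whitespace or commas inside
        # an ID would corrupt the chromosome list.
        if not chrom or "," in chrom or any(ch.isspace() for ch in chrom):
            raise ValueError(f"invalid chromosome ID: {chrom!r}")
    lines = [
        f"{width} //dimension (in pixels) of x axis",
        f"{height} //dimension (in pixels) of y axis",
        ",".join(x_chromosomes) + " //chromosomes in x axis",
        ",".join(y_chromosomes) + " //chromosomes in y axis",
    ]
    return "\n".join(lines) + "\n"


ctl_text = make_dot_ctl(800, 800, ["sb1", "sb2"], ["os1", "os2", "os3"])
print(ctl_text, end="")
```

Writing ``ctl_text`` to a file gives a control file that can be passed to the ``-c`` option.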
- Output

The output is an image file (PNG format) that can be viewed with any image viewer. Each dot is a syntenic gene pair between the two sets of chromosomes. Different dot colors, generated randomly, represent different syntenic blocks.
dual_synteny_plotter
::::::::::::::::::::::::
This Java program generates a dual synteny plot, which links all the synteny blocks between two sets of chromosomes using straight lines.

- Usage::

    $ java dual_synteny_plotter -g gff_file -s synteny_file -c control_file -o output_PNG_file

The input files are a gff file containing all gene positions, a synteny file generated by MCScanX, and a control file (.ctl) containing the plot size and chromosome IDs. The control file can easily be made by modifying the column.ctl file::

    200 //plot width (in pixels)
    800 //plot height (in pixels)
    sb1,sb2 //chromo
