OrthNet
CLfinder-OrthNet, a pipeline to encode orthologs from multiple genomes and their evolutionary history into networks (OrthNets) based on co-linearity between them. OrthNets enable detection of all orthologous gene groups that share the same evolutionary history, using a search based on network topology
Install / Use
/learn @ohdongha/OrthNetREADME
CLfinder-OrthNet
Synopsis
A pipeline to detect co-linearity in gene orders among multiple closely-related genomes (CLfinder) and build networks of orthologs, or OrthNets, connecting ortholog nodes with co-linearity information (co-linear, transposed, or unable to determine) among them as edge properties (ONfinder).
- Ideally and in most cases, each OrthNet consists of orthologs derived from a single ancestral locus, including those duplicated specifically in a subset of genomes.
- In addition to visualize the evolutionary context/history of each ortholog group, users can search ortholog groups (as OrthNets) based on an evolutionary history pattern/context they represent.
- For example, users can retrieve all OrthNets (i.e. ortholog groups) that have underwent:
- tandem duplication events unique to a genome or a subset of genomes, either mono-, para-, or polyphyletic
- orthologs deleted, duplicated, or transposed uniquely in a genome or a subset of genomes,
- orthologs transposed and duplicated uniquely in a genome or a subset of genomes, etc.
- CLfinder and ONfinder (OrthNet finder) are two separable modules. Users can use only the CLfinder module to quickly obtain co-linearity information and a summary matrix of the pairwise comparisons for multiple genomes. The ONfinder module is optimized for creating OrthNets using inputs from CLfinder, but also can accept co-linearity information from other programs.
Jump to: Before starting; Preparing input files; Running CLfinder; Running ONfinder; Searching OrthNets; Annotating OrthNets; Notes; How to cite;
News
- 2018-12-13 CLfinder-OrthNet will be presented at Plant&Animal Genomes 2019 (Jan. 12-16, 2019 at San Diego, CA) in the following three workshops: Systems Genomics, N. G. Genome Annotation & Analysis, and Digital Tools & Resources (details and abstracts)
- 2018-10-31 The first CLfinder-OrthNet paper is published in DNA Research and now online.
Updates
- 2019-06-25 Minor fixes on temporary file name formats; added more options to
parse_gtf_2table.pyto help the preparation of input#1. - 2018-12-25 Added
cluster_OrthNet_topology_exact.py(previouslysearch_OrthNet_topology.py) andsearch_OrthNet_CNVs.py; minor bug fixes and improvements - 2018-06-05 Added options to use MMseqs2 to generate inputs #2 and #3. MMseqs2 is much faster than orthoMCL or BLASTP and also generate one HSP (High-scoring Segment Pair) per each sequence comparison.
Upcoming changes
create_table4Cytoscape.pygenerates a table of node properties that can be imported to Cytoscape, to facilitate visualization of OrthNet- Wrapper scripts for the CLfinder and ONfinder modules
- Tutorial and test dataset based on MMSeqs2
- Iterative MCL (testing)
- R-OrthNet : OrthNet with edges representing presence or absence of conservations in regulatory features between nodes
Before starting
Prerequisites - programs
- python 2.7
- mcl - other clustering options will be added in future updates.
- MMseqs2 - to generate input #2 and #3 (see below for alternatives)
- I assume users are familiar with basic linux commands.
- Optional
Prerequisites -genome data
- genome annotation of "representative" gene models (.gff, .gff3, or .gtf)
- representative gene model sequences for all loci (.fasta)
Installing
Copy all python scripts to a folder, add the folder to $PATH, and made them executable:
export PATH=<folder>:$PATH # in bash, add this to ~/.bashrc
chmod 755 <folder>/*.py
Preparing input files
CLfinder-OrthNet accept three inputs: 1. gene model coordinates (genome annotation), 2. within-species paralog groups, and 3. between species "best-hit" pairs for all pair of genomes
ProjectID and list of genomes
ProjectID.list includes all GenomeIDs that you want to compare, one per line. I recommend GenomeIDs to be simple (2~5 alphanumeric) and ProjectID to be unique by adding date or time-stamp. For example, below, 180101_crucifers will be the ProjectID to compare six crucifer genomes included in the first CLfinder-OrthNet article (https://doi.org/10.1101/236299):
echo 'Aly Ath Cru Esa Sir Spa' | tr ' ' '\n' > 180101_Crucifers.list
Input #1: gene model coordinates (genome annotation)
For each genome, coordinates of representative gene models are parsed from genome annotations in .gtf format. The parsed file will have strand, coordinates, number of exons, and length of the mRNA and CDS (ORF), one gene model per line.
-
If genome annotation is in .gff or .gff3 format, convert it to .gtf:
gffread input.gff -T -o output.gtf -
Parse the .gtf file into a .gtfParsed.txt file. Name the output file as "GenomeID.gtfParsed.txt". Repeat for all GenomeIDs:
parse_gtf_2table.py -pr input.gtf GenomeID.gtfParsed.txt > GenomeID.gtfParsed.logImportant! Genomes should contain one representative gene/transcript model per each locus. See Note 1
Input #2: within-species paralog groups (PGs)
A tab-delimited text file with GeneID and paralog group ID (PG), one gene per line for each genome:
GeneID PG
Gene1 PGxxxx
Gene2 PGxxxy
Gene3 PGxxxz
...
Input #2 can be prepared by various methods. Below are two example options:
method #1 Cluster all representative protein sequences using MMseqs2 for each genome and get "in-paralog" groups. Convert the .tsv output to input #2. Assuming all representative peptide sequences for each genome is GenomeID.pep.rep.fa:
mkdir tmp # temporary folder for MMseqs2 runs
mmseqs createdb GenomeID.pep.rep GenomeID_DB
mmseqs createindex GenomeID_DB tmp
mmseqs cluster GenomeID_DB GenomeID_c tmp --max-seqs 50000 -c 0.5
mmseqs createtsv GenomeID_DB GenomeID_DB GenomeID_c GenomeID_c.tsv
parse_mmseqs_clusters.py -H GenomeID_c.tsv GenomeID.PG
To run MMseqs2 clustering for all genomes in ProjectID.list :
<details> <summary>Click to expand</summary>mkdir tmp # temporary folder for MMseqs2 runs
while read g; do
mmseqs createdb ${g}.pep.rep ${g}_DB
mmseqs createindex ${g}_DB tmp
mmseqs cluster ${g}_DB ${g}_c tmp --max-seqs 50000 -c 0.5
mmseqs createtsv ${g}_DB ${g}_DB ${g}_c ${g}_c.tsv
parse_mmseqs_clusters.py -H ${g}_c.tsv ${g}.PG
done < ProjectID.list
Proceed to the next step with GenomeID.PG (input #2).
</details>method #2 (originally the default method) If orthoMCL is available, you can run it for each genome and get "in-paralog" groups. Convert the orthoMCL output (mclOutput_GenomeID.txt) to input #2:
parse_mclOutput.py -rH mclOutput_GenomeID.txt PG > GenomeID.PG
if converting multiple orthoMCL output files for genomes in ProjectID.list:
while read g; do parse_mclOutput.py mclOutput_${g}.txt PG -rH > ${g}.PG; done < ProjectID.list
Proceed to the next step with GenomeID.PG (input #2).
Input #3: between-species best-hit pairs (BHPairs)
A tab-delimited text file with the GeneID of the query gene and its 'best-hit' or best-hit candidate GeneID in the target genome, one pair per line, for all possible pairs of genomes in ProjectID.list. Below, I describe two methods using BLASTN (default) and MMseqs2 (alternative):
method #1 For all GenomeIDs in ProjectID.list, create a blast database for the representative CDS sequences in GenomeID.cds.rep.fa:
makeblastdb -in GenomeID.cds.rep.fa -dbtype nucl
To process multiple genomes listed in ProjectID.list:
while read g; do makeblastdb -in ${g}.cds.rep.fa -dbtype nucl; done < ProjectID.list
Create blast commands for all possible pair of genomes in ProjectID.list. For blastn:
create_pairwiseBLAST_commands.py ProjectID.list -n "-task blastn -evalue 1e-5 -max_target_seqs 10 -outfmt '6 std qlen slen'" > ProjectID_pairwiseBLASTN.sh
Check create_pairwiseBLAST_commands.py -h for detailed options to designate folders for CDS sequences or blastn output files, as well as options to use blastp or MMseqs2 on deduced peptide sequences instead.
Once blast commands (ProjectID_pairwiseBLASTN.sh) were created, users will want to run it in the background (e.g. using the linux screen command) and multiplex if possible, depending on the computational resource. Users can add -num_threads option to the string given with -n option in the example above.
method #2 Users can use peptide sequences deduced from representative gene models (GenomeID.pep.rep.fa) to generate input #3 using MMseqs2. First create and index DB for all genomes listed in ProjectID.list (if MMseqs2 was used for the input #2, this step must have been already done):
<details> <summary>Click to expand</summary>m
Related Skills
node-connect
353.3kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
111.7kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
353.3kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
353.3kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
