SkillAgentSearch skills...

OrthNet

CLfinder-OrthNet, a pipeline to encode orthologs from multiple genomes and their evolutionary history into networks (OrthNets) based on co-linearity between them. OrthNets enable detection of all orthologous gene groups that share the same evolutionary history, using a search based on network topology

Install / Use

/learn @ohdongha/OrthNet

README

CLfinder-OrthNet

Synopsis

A pipeline to detect co-linearity in gene orders among multiple closely-related genomes (CLfinder) and build networks of orthologs, or OrthNets, connecting ortholog nodes with co-linearity information (co-linear, transposed, or unable to determine) among them as edge properties (ONfinder).

  • Ideally and in most cases, each OrthNet consists of orthologs derived from a single ancestral locus, including those duplicated specifically in a subset of genomes.
  • In addition to visualize the evolutionary context/history of each ortholog group, users can search ortholog groups (as OrthNets) based on an evolutionary history pattern/context they represent.
  • For example, users can retrieve all OrthNets (i.e. ortholog groups) that have underwent:
    • tandem duplication events unique to a genome or a subset of genomes, either mono-, para-, or polyphyletic
    • orthologs deleted, duplicated, or transposed uniquely in a genome or a subset of genomes,
    • orthologs transposed and duplicated uniquely in a genome or a subset of genomes, etc.
  • CLfinder and ONfinder (OrthNet finder) are two separable modules. Users can use only the CLfinder module to quickly obtain co-linearity information and a summary matrix of the pairwise comparisons for multiple genomes. The ONfinder module is optimized for creating OrthNets using inputs from CLfinder, but also can accept co-linearity information from other programs.

Jump to: Before starting; Preparing input files; Running CLfinder; Running ONfinder; Searching OrthNets; Annotating OrthNets; Notes; How to cite;

News

  • 2018-12-13 CLfinder-OrthNet will be presented at Plant&Animal Genomes 2019 (Jan. 12-16, 2019 at San Diego, CA) in the following three workshops: Systems Genomics, N. G. Genome Annotation & Analysis, and Digital Tools & Resources (details and abstracts)
  • 2018-10-31 The first CLfinder-OrthNet paper is published in DNA Research and now online.

Updates

  • 2019-06-25 Minor fixes on temporary file name formats; added more options to parse_gtf_2table.py to help the preparation of input#1.
  • 2018-12-25 Added cluster_OrthNet_topology_exact.py (previously search_OrthNet_topology.py) and search_OrthNet_CNVs.py; minor bug fixes and improvements
  • 2018-06-05 Added options to use MMseqs2 to generate inputs #2 and #3. MMseqs2 is much faster than orthoMCL or BLASTP and also generate one HSP (High-scoring Segment Pair) per each sequence comparison.

Upcoming changes

  • create_table4Cytoscape.py generates a table of node properties that can be imported to Cytoscape, to facilitate visualization of OrthNet
  • Wrapper scripts for the CLfinder and ONfinder modules
  • Tutorial and test dataset based on MMSeqs2
  • Iterative MCL (testing)
  • R-OrthNet : OrthNet with edges representing presence or absence of conservations in regulatory features between nodes

Before starting

Prerequisites - programs

  • python 2.7
  • mcl - other clustering options will be added in future updates.
  • MMseqs2 - to generate input #2 and #3 (see below for alternatives)
  • I assume users are familiar with basic linux commands.
  • Optional
    • gffread - if you start with a .gff file to generate input #1
    • orthoMCL - an alternative to generate input #2
    • blast+ - an alternative to generate input #3 and annotation
    • Cytoscape - to visualize and print OrthNets

Prerequisites -genome data

  • genome annotation of "representative" gene models (.gff, .gff3, or .gtf)
  • representative gene model sequences for all loci (.fasta)

Installing

Copy all python scripts to a folder, add the folder to $PATH, and made them executable:

export PATH=<folder>:$PATH # in bash, add this to ~/.bashrc
chmod 755 <folder>/*.py

Preparing input files

CLfinder-OrthNet accept three inputs: 1. gene model coordinates (genome annotation), 2. within-species paralog groups, and 3. between species "best-hit" pairs for all pair of genomes

ProjectID and list of genomes

ProjectID.list includes all GenomeIDs that you want to compare, one per line. I recommend GenomeIDs to be simple (2~5 alphanumeric) and ProjectID to be unique by adding date or time-stamp. For example, below, 180101_crucifers will be the ProjectID to compare six crucifer genomes included in the first CLfinder-OrthNet article (https://doi.org/10.1101/236299):

echo 'Aly Ath Cru Esa Sir Spa' | tr ' ' '\n' > 180101_Crucifers.list

Input #1: gene model coordinates (genome annotation)

For each genome, coordinates of representative gene models are parsed from genome annotations in .gtf format. The parsed file will have strand, coordinates, number of exons, and length of the mRNA and CDS (ORF), one gene model per line.

  1. If genome annotation is in .gff or .gff3 format, convert it to .gtf:

    gffread input.gff -T -o output.gtf
    
  2. Parse the .gtf file into a .gtfParsed.txt file. Name the output file as "GenomeID.gtfParsed.txt". Repeat for all GenomeIDs:

    parse_gtf_2table.py -pr input.gtf GenomeID.gtfParsed.txt > GenomeID.gtfParsed.log
    

    Important! Genomes should contain one representative gene/transcript model per each locus. See Note 1

Input #2: within-species paralog groups (PGs)

A tab-delimited text file with GeneID and paralog group ID (PG), one gene per line for each genome:

GeneID	PG
Gene1	PGxxxx
Gene2	PGxxxy
Gene3	PGxxxz
...

Input #2 can be prepared by various methods. Below are two example options:

method #1 Cluster all representative protein sequences using MMseqs2 for each genome and get "in-paralog" groups. Convert the .tsv output to input #2. Assuming all representative peptide sequences for each genome is GenomeID.pep.rep.fa:

mkdir tmp # temporary folder for MMseqs2 runs
mmseqs createdb GenomeID.pep.rep GenomeID_DB
mmseqs createindex GenomeID_DB tmp
mmseqs cluster GenomeID_DB GenomeID_c tmp --max-seqs 50000 -c 0.5
mmseqs createtsv GenomeID_DB GenomeID_DB GenomeID_c GenomeID_c.tsv
parse_mmseqs_clusters.py -H GenomeID_c.tsv GenomeID.PG

To run MMseqs2 clustering for all genomes in ProjectID.list :

<details> <summary>Click to expand</summary>
mkdir tmp # temporary folder for MMseqs2 runs
while read g; do 
mmseqs createdb ${g}.pep.rep ${g}_DB
mmseqs createindex ${g}_DB tmp
mmseqs cluster ${g}_DB ${g}_c tmp --max-seqs 50000 -c 0.5
mmseqs createtsv ${g}_DB ${g}_DB ${g}_c ${g}_c.tsv
parse_mmseqs_clusters.py -H ${g}_c.tsv ${g}.PG
done < ProjectID.list

Proceed to the next step with GenomeID.PG (input #2).

</details>

method #2 (originally the default method) If orthoMCL is available, you can run it for each genome and get "in-paralog" groups. Convert the orthoMCL output (mclOutput_GenomeID.txt) to input #2:

parse_mclOutput.py -rH mclOutput_GenomeID.txt PG > GenomeID.PG

if converting multiple orthoMCL output files for genomes in ProjectID.list:

while read g; do parse_mclOutput.py mclOutput_${g}.txt PG -rH > ${g}.PG; done < ProjectID.list

Proceed to the next step with GenomeID.PG (input #2).

Input #3: between-species best-hit pairs (BHPairs)

A tab-delimited text file with the GeneID of the query gene and its 'best-hit' or best-hit candidate GeneID in the target genome, one pair per line, for all possible pairs of genomes in ProjectID.list. Below, I describe two methods using BLASTN (default) and MMseqs2 (alternative):

method #1 For all GenomeIDs in ProjectID.list, create a blast database for the representative CDS sequences in GenomeID.cds.rep.fa:

makeblastdb -in GenomeID.cds.rep.fa -dbtype nucl

To process multiple genomes listed in ProjectID.list:

while read g; do makeblastdb -in ${g}.cds.rep.fa -dbtype nucl; done < ProjectID.list

Create blast commands for all possible pair of genomes in ProjectID.list. For blastn:

create_pairwiseBLAST_commands.py ProjectID.list -n "-task blastn -evalue 1e-5 -max_target_seqs 10 -outfmt '6 std qlen slen'" > ProjectID_pairwiseBLASTN.sh

Check create_pairwiseBLAST_commands.py -h for detailed options to designate folders for CDS sequences or blastn output files, as well as options to use blastp or MMseqs2 on deduced peptide sequences instead.

Once blast commands (ProjectID_pairwiseBLASTN.sh) were created, users will want to run it in the background (e.g. using the linux screen command) and multiplex if possible, depending on the computational resource. Users can add -num_threads option to the string given with -n option in the example above.

method #2 Users can use peptide sequences deduced from representative gene models (GenomeID.pep.rep.fa) to generate input #3 using MMseqs2. First create and index DB for all genomes listed in ProjectID.list (if MMseqs2 was used for the input #2, this step must have been already done):

<details> <summary>Click to expand</summary>
m

Related Skills

View on GitHub
GitHub Stars10
CategoryDevelopment
Updated5mo ago
Forks2

Languages

Python

Security Score

92/100

Audited on Oct 27, 2025

No findings