DIVERGE v4

DIVERGE v4 is a Python package designed for large-scale analysis of functional divergence across multi-gene families. It is a major upgrade of the widely used DIVERGE software, incorporating a novel Super-Cluster algorithm, a modular Python structure, and a user-friendly web server. This package allows for the identification of amino acid sites undergoing significant evolutionary shifts, helping to uncover functional divergence after gene duplication.

DIVERGE analyzes two types of functional divergence:

Type-I: Significant differences in evolutionary rates at specific sites between gene clusters, indicating different functional constraints
Type-II: Subfamily-specific shifts in amino acid properties, where a site is conserved within each subfamily but carries amino acids with different physicochemical properties in different subfamilies
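
To make the two patterns concrete, here is a toy sketch that labels alignment columns by simple residue frequencies. It is not DIVERGE's statistical model (which estimates posterior probabilities under an evolutionary model), and the function names and thresholds are hypothetical. It also compares residue identity rather than physicochemical properties, which is a simplification.

```python
from collections import Counter

def column_conservation(column):
    """Return (fraction of the most common residue, that residue), ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0, None
    residue, count = Counter(residues).most_common(1)[0]
    return count / len(residues), residue

def classify_site(col_a, col_b, conserved=0.9, variable=0.6):
    """Toy labels only; the thresholds are arbitrary illustration values."""
    cons_a, top_a = column_conservation(col_a)
    cons_b, top_b = column_conservation(col_b)
    if cons_a >= conserved and cons_b >= conserved:
        # Conserved in both clusters: same residue, or a cluster-specific one
        return "type-II-like" if top_a != top_b else "conserved"
    if (cons_a >= conserved and cons_b <= variable) or \
       (cons_b >= conserved and cons_a <= variable):
        # Conserved in one cluster, variable in the other
        return "type-I-like"
    return "unclassified"

# Two made-up clusters of four aligned sequences each
cluster_a = ["KRDA", "KRDG", "KRDS", "KRDT"]
cluster_b = ["KEAA", "KELG", "KEVS", "KEIT"]

labels = [
    classify_site(col_a, col_b)
    for col_a, col_b in zip(zip(*cluster_a), zip(*cluster_b))
]
print(labels)
```

In this made-up alignment, the second column (R in one cluster, E in the other) is the Type-II-like pattern, and the third column (fixed D versus a mix of residues) is the Type-I-like pattern.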

Major Updates from DIVERGE v3

Novel Super-Cluster Algorithm: A statistically robust method for analyzing large gene families that:
- Replaces numerous one-to-one comparisons with a single computation
- Divides clusters into Super-Cluster pairs based on conservation patterns
- Provides more accurate functional divergence detection for multi-gene families
- Reduces computational complexity from exponential to linear
Modular Python Architecture:
- Base Layer (C++ API): Core data structures and computationally intensive functions
- Middle Layer (Python Wrapper): PyBind11-based bridge between C++ and Python
- Top Layer (High-Level Python API): User-friendly interface for model building and analysis
Comprehensive Database:
- Analysis results of 4,540 human protein families
- Covers 10,133 human genes and 215,480 protein sequences
- Built using phylogenetic data from PANTHER database
- Multiple sequence alignments from 19 selected vertebrate species
- Expert-reviewed phylogenetic trees

Features

Novel Super-Cluster Algorithm: A statistically robust method designed for large-scale analysis of functional divergence in multi-gene families, replacing numerous one-to-one comparisons with a single computation
Modular Python Package: Built for scalability and seamless integration into bioinformatics workflows, with 10 customizable modules for functional divergence analysis
Web Interface: A user-friendly web server developed using the Streamlit framework, making the package accessible even without programming knowledge
Comprehensive Database: Analysis results of 4,540 human protein families (comprising 10,133 human genes and 215,480 protein sequences), searchable by UniProtKB, Ensembl, HGNC IDs, or gene names

Installation

To install the DIVERGE v4 Python package, use pip:

pip install diverge

If you prefer to compile the package from source using setup.py, you will need to install the pybind11 library, which provides the C++ bindings for Python used in this package. You can install it via pip:

pip install pybind11

Once pybind11 is installed, you can compile DIVERGE v4 by running the following commands:

git clone https://github.com/zjupgx/diverge4.git
cd diverge4
python setup.py install

pybind11 is necessary because DIVERGE v4 uses C++ for its core data structures and computationally intensive tasks, which are exposed to Python via pybind11.
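
After installing, a quick way to confirm that the modules can be found is sketched below. It uses only the standard library. The import name `diverge` is taken from the usage examples in this README, and `check_modules` is a hypothetical helper, not part of the package.

```python
import importlib.util

def check_modules(names=("diverge", "pybind11")):
    """Return {module_name: True/False} depending on whether it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_modules()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

A missing `pybind11` only matters when building from source; a pip-installed `diverge` should be usable without it.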

Quick Start

Input Requirements

Multiple Sequence Alignment (MSA) File:
- Supported formats: FASTA or CLUSTAL
- Only amino acid alignments are allowed
- Gaps (-) are allowed in the alignment
Phylogenetic Tree File:
- Must be in Newick format
- Branch lengths are optional but recommended
- Internal node names should be removed to prevent program crashes
- Tree depth must be at least 3 for proper analysis
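
Internal node names can be removed before running an analysis. The following is a minimal sketch using a regular expression; `strip_internal_labels` is a hypothetical helper, and it assumes unquoted labels. Note that it also drops bootstrap values stored as internal labels.

```python
import re

def strip_internal_labels(newick):
    """Remove any label that directly follows ')' while keeping ':length'."""
    # In Newick, text right after a closing parenthesis names an internal node.
    return re.sub(r"\)[^():,;]+", ")", newick)

tree = "((A:0.1,B:0.2)n1:0.3,(C:0.1,D:0.2)95:0.4)root;"
cleaned = strip_internal_labels(tree)
print(cleaned)
```

For trees with quoted labels or comments, a dedicated phylogenetics library is a safer choice than a regular expression.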

Quick Usage Examples

Basic Functional Divergence Analysis

from diverge import Gu99

# Perform Type-I functional divergence analysis
gu99 = Gu99("alignment.aln", "cluster1.tree", "cluster2.tree")
print("Theta coefficient:", gu99.summary.iloc[0, 0])
print("Sites with high divergence (Qk > 0.9):", sum(gu99.results.iloc[:, 0] > 0.9))

SuperCluster Analysis for Large Gene Families

from diverge import SuperCluster

# Analyze multiple clusters with parallel processing
super_cluster = SuperCluster("alignment.aln", "tree1.tree", "tree2.tree", 
                           "tree3.tree", "tree4.tree", parallel=True)
print("Summary:", super_cluster.summary)
print("Results shape:", super_cluster.results.shape)

Batch Processing for High Performance

from diverge import Gu99Batch

# Process multiple datasets in parallel
batch = Gu99Batch(max_threads=8)
batch.add_task("dataset1.aln", "d1_tree1.tree", "d1_tree2.tree", task_name="Dataset_1")
batch.add_task("dataset2.aln", "d2_tree1.tree", "d2_tree2.tree", task_name="Dataset_2")
batch.calculate_batch()

# Get results
results = batch.get_successful_results()
batch.print_summary()
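
For readers unfamiliar with this pattern, the sketch below shows the general idea of running named tasks on a thread pool and separating successes from failures. It is a generic standard-library illustration, not the implementation of `Gu99Batch`; `run_batch` and `fake_analysis` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_batch(tasks, worker, max_threads=4):
    """Run worker(*args) for each named task; return (successes, failures)."""
    successes, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = {pool.submit(worker, *args): name for name, args in tasks.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                successes[name] = future.result()
            except Exception as exc:
                # One failing dataset should not abort the whole batch
                failures[name] = repr(exc)
    return successes, failures

def fake_analysis(aln_path, *tree_paths):
    """Hypothetical stand-in for a real analysis call."""
    if len(tree_paths) < 2:
        raise ValueError("need at least two cluster trees")
    return {"alignment": aln_path, "n_clusters": len(tree_paths)}

tasks = {
    "Dataset_1": ("dataset1.aln", "d1_tree1.tree", "d1_tree2.tree"),
    "Dataset_2": ("dataset2.aln", "d2_tree1.tree"),  # deliberately invalid
}
ok, failed = run_batch(tasks, fake_analysis)
print("succeeded:", sorted(ok))
print("failed:", failed)
```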

Conservation-Weighted Analysis

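from diverge import SuperCluster

# Cluster tree files, one per cluster (same example files as above)
tree_files = ["tree1.tree", "tree2.tree", "tree3.tree", "tree4.tree"]

# Apply conservation weighting for improved accuracy
conswins = {'cons_win_len': 3, 'lambda_param': 0.7}
super_cluster = SuperCluster("alignment.aln", *tree_files,
                           conswins=conswins, parallel=True)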

Comprehensive Documentation

📖 Complete User Guide - Detailed documentation covering:

All analysis methods with examples
SuperCluster algorithm details
Batch processing workflows
Performance optimization
Troubleshooting guide
Advanced features

Main Functional Modules

DIVERGE v4 provides independent analysis modules that can be combined into custom pipelines for functional divergence analysis. The main functions are:

| Function | Description |
|----------|-------------|
| Type-I Divergence (Gu99 method) | Detect type-I functional divergence using the Gu (1999) method |
| Type-I Divergence (Gu2001 method) | Detect type-I functional divergence using the Gu (2001) method. Requires phylogenetic tree file with branch length data |
| Type-II Divergence | Detect type-II functional divergence of gene families |
| Super-Cluster Analysis | Perform large-scale functional divergence analysis using the Super-Cluster method, designed for multi-gene families |
| Rate Variation Among Sites (RVS) | Estimate rate variations among sites for a given cluster. Only one cluster is allowed per run |
| Functional Distance Analysis | Estimate type-I functional distance between pairs of clusters and compute type-I functional branch lengths. Requires at least three clusters |
| FDR for Predictions | Calculate the false discovery rate of functionally diverging sites |
| Asymmetric Test for Type-I Functional Divergence | Test whether the degree of type-I functional divergence differs between duplicate genes. Requires three clusters |
| Effective Number of Sites | Estimate the effective number of sites related to type-I or type-II functional divergence. Requires two clusters |
| Gene-Specific Type-I Analysis | Site-specific posterior profile for predicting gene-specific type-I functional divergence-related sites. Requires three clusters |
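
As background for the FDR row, one common way to estimate a false discovery rate from site-specific posterior probabilities is to average (1 - Qk) over the sites that pass a cutoff. The sketch below illustrates that estimate on made-up numbers. Whether the DIVERGE FDR module uses exactly this formula is an assumption, and `posterior_fdr` is a hypothetical helper, not a package function.

```python
def posterior_fdr(posteriors, cutoff):
    """Return (number of selected sites, expected false discovery rate)."""
    selected = [q for q in posteriors if q > cutoff]
    if not selected:
        return 0, 0.0
    # Each selected site is a false positive with probability (1 - q)
    fdr = sum(1.0 - q for q in selected) / len(selected)
    return len(selected), fdr

# Made-up posterior probabilities for six sites
qk = [0.99, 0.95, 0.91, 0.60, 0.20, 0.05]
n_sites, fdr = posterior_fdr(qk, cutoff=0.9)
print(f"{n_sites} sites selected, estimated FDR = {fdr:.3f}")
```

Raising the cutoff selects fewer sites and lowers the estimated FDR, which is the trade-off to keep in mind when choosing a threshold such as Qk > 0.9.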

Super-Cluster Algorithm

The Super-Cluster algorithm is designed to efficiently analyze functional divergence in large gene families by:

Partitioning m clusters into two groups (Super-Cluster pairs)
Computing changes at amino acid sites for each Super-Cluster
Performing DIVERGE Type-I analysis on Super-Cluster pairs
Recording site-specific posterior probabilities for divergence profiling
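
The partitioning step can be illustrated by enumerating every way to split m clusters into two non-empty groups. This sketch only shows the space of candidate splits (2^(m-1) - 1 of them); which splits DIVERGE actually evaluates, and how conservation patterns guide that choice, is not described here. `super_cluster_pairs` is a hypothetical name, and cluster names are assumed to be unique.

```python
from itertools import combinations

def super_cluster_pairs(clusters):
    """Yield every split of `clusters` into two non-empty groups, each split once."""
    first, rest = clusters[0], clusters[1:]
    # Keep the first cluster in group A so (A, B) and (B, A) are not both produced.
    # size stops at len(rest) - 1 so that group B is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            group_a = (first,) + extra
            group_b = tuple(c for c in rest if c not in extra)
            yield group_a, group_b

pairs = list(super_cluster_pairs(["c1", "c2", "c3", "c4"]))
for group_a, group_b in pairs:
    print(group_a, "vs", group_b)
print(len(pairs), "candidate Super-Cluster pairs")
```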

This approach provides several advantages:

Reduces computational complexity
Improves statistical robustness
Enables analysis of larger gene families
Provides more intuitive functional divergence profiles

Web Server and Database

The web server (https://pgx.zju.edu.cn/diverge) provides:

Interactive Analysis: Upload MSA and phylogeny files for functional divergence analysis
Comprehensive Database: Access pre-computed analyses of 4,540 human protein families
Search Functionality: Query the database using UniProtKB ID, Ensembl ID, HGNC ID, or gene name
Functional Annotations: Access Gene Ontology terms, pathways, and protein class assignments for human proteins
Visualization Tools: Interactive visualization of results and amino acid sites

Troubleshooting

Common issues and solutions:

Sequence Name Mismatch: Ensure sequence names in MSA file exactly match those in tree file
Tree Depth Error: Verify tree file has at least 3 levels of depth
Internal Node Names: Remove names from internal nodes in tree file
File Format: Confirm MSA is in proper FASTA or CLUSTAL format
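
A name mismatch can be checked before running an analysis. The sketch below compares FASTA header names against Newick leaf labels using only the standard library. The helper names are hypothetical, CLUSTAL input is not handled, and it assumes unquoted leaf labels and that the sequence name is the first whitespace-delimited token of each FASTA header.

```python
import re

def fasta_names(text):
    """Sequence names: first whitespace-delimited token of each '>' header."""
    return {
        line[1:].split()[0]
        for line in text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }

def newick_leaf_names(newick):
    """Leaf labels: tokens that directly follow '(' or ','."""
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))

def compare_names(msa_text, newick):
    msa, tree = fasta_names(msa_text), newick_leaf_names(newick)
    return {
        "missing_from_msa": sorted(tree - msa),
        # May be non-empty if the tree covers only one cluster of the alignment
        "missing_from_tree": sorted(msa - tree),
    }

msa = ">seqA human\nMKV\n>seqB\nMKI\n>seqC\nMRV\n"
tree = "((seqA:0.1,seqB:0.2):0.3,seqX:0.4);"
report = compare_names(msa, tree)
print(report)
```

Names listed under `missing_from_msa` are the ones to fix first, since a tree leaf with no matching sequence cannot be analyzed.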

Citation

If you use DIVERGE in your research, please cite:

Chen Y, Xu X, Pan Y, Wang S, Zhao W, Zhou B, Zhou J, Zheng Y, Zhou Z, Gu X. DIVERGE v4: A Platform for Large-Scale Analysis of Functional Divergence Across Multi-Gene Families. Mol Biol Evol. 2025, 42(11):msaf277. doi: 10.1093/molbev/msaf277.
