SkillAgentSearch skills...

EagleC2

EagleC2: Deep learning–powered discovery and screening of complex, fine-scale, and heterogeneity-defining structural variations from Hi-C data

Install / Use

/learn @XiaoTaoWang/EagleC2
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

EagleC2

Hi-C has emerged as a powerful tool for detecting structural variations (SVs), but its sensitivity remains limited—particularly for SVs lacking canonical contact patterns. Here, we introduce EagleC2, a next-generation deep-learning framework that integrates an ensemble of convolutional neural networks (CNNs) with diverse architectures, trained on over 2.7 million image patches from 51 cancer Hi-C datasets with matched whole-genome sequencing (WGS) data. EagleC2 substantially outperforms its predecessor (EagleC) and other state-of-the-art methods, achieving consistently higher precision and recall across diverse validation datasets. Notably, it enables the discovery of non-canonical SVs—including complex rearrangements and fusions involving extremely small fragments—that are frequently missed by existing tools. In individual cancer genomes, EagleC2 detects over a thousand previously unrecognized SVs, the majority of which are supported by orthogonal evidence. To support clinical and diagnostic applications, EagleC2 also offers a rapid evaluation mode for accurately screening predefined SV lists, even at ultra-low coverage (e.g., 1x depth). When applied to single-cell Hi-C data from glioblastoma before and after erlotinib treatment, EagleC2 reveals extensive SV heterogeneity and dynamic structural changes, including events overlooked by conventional pipelines. These findings establish EagleC2 as a powerful and versatile framework for SV discovery, with broad applications in genome research, cancer biology, diagnostics, and therapeutic development.

.. image:: ./images/framework.png :align: center

Unique features of EagleC2

Compared with the original EagleC, EagleC2 has the following unique features:

  • EagleC2 is able to detect non-canonical SVs, including fine-scale complex rearrangements (multiple SVs clustered within a local window) and fusions involving extremely small fragments
  • EagleC2 offers a rapid evaluation mode for accurately screening predefined SV lists, even at ultra-low coverage (e.g., 1x depth)
  • EagleC2 supports arbitrary resolutions, without requiring model re-training for each resolution
  • EagleC2 enables fast genome-wide SV prediction on large Hi-C datasets without the need for distributed computing across multiple nodes
  • EagleC2 supports both CPU and GPU inference

Navigation

  • Installation_
  • Download pre-trained models_
  • Overview of the commands_
  • Quick start_
  • Visualize local contact patterns around SV breakpoints_
  • Post-processing and filtering of SV predictions_
  • Evaluation of predefined SVs_

Installation

EagleC2 and all required dependencies can be installed using mamba <https://github.com/conda-forge/miniforge>_ and pip <https://pypi.org/project/pip/>_.

After you have installed mamba successfully, you can create a conda environment for EagleC2 by executing the following commands (for Linux users)::

$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
$ mamba create -n EagleC2 hdbscan numba statsmodels python=3.11 cooler=0.9 joblib=1.3 numpy=1.26 scikit-learn=1.4 "tensorflow>=2.16"
$ mamba activate EagleC2
$ pip install eaglec

This will intall the core dependencies required to run EagleC2.

If you also wish to use the visualization module, please install the following additional packages (pyBigWig is only required if you want to plot signals from BigWig files)::

$ mamba install matplotlib pyBigWig

If you plan to use the gene fusion annotation module, please install::

$ mamba install pyensembl

For macOS users (tested on Apple M-series chips only), you can install EagleC2 and its dependencies with::

$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
$ mamba clean --all
$ mamba create -n EagleC2gpu python=3.11 hdbscan numba statsmodels joblib=1.3 numpy=1.26 scikit-learn=1.4
$ mamba activate EagleC2gpu
$ pip install --no-cache-dir cooler==0.9.1
$ pip install --no-cache-dir tensorflow==2.16.1 keras==3.3.3
$ pip install --no-cache-dir tensorflow-metal==1.1.0
$ pip install eaglec

Similarly, if you would like to use the visualization or gene fusion annotation modules on macOS, please install matplotlib, pyBigWig, and pyensembl as described above.

Download pre-trained models

Before proceeding, please download the pre-trained models <https://www.jianguoyun.com/p/DWhJeUsQh9qdDBjVpoEGIAA>_ for EagleC2.

Unlike EagleC, which relied on separate models trained for specific resolutions (e.g., 5 kb, 10 kb, 50 kb, and 500 kb) and sequencing depths, EagleC2 was trained on a unified dataset that integrates samples across a wide range of resolutions and depths. This allows for seamless application to data at arbitrary resolutions and sequencing depths, without the need for model re-training.

Overview of the commands

EagleC2 is distributed with eight command-line tools. You can command [-h] in a terminal window to view the basic usage of each command.

  • predictSV

    predictSV is the core command for predicting SVs from chromatin contact maps.

    Required inputs:

    1. Path to a .mcool file – This is a multi-resolution format for storing contact matrices. See cooler <https://github.com/open2c/cooler>_ for details. If you only have .hic files (see Juicer <https://github.com/aidenlab/juicer>), you can convert them to .mcool using hic2cool <https://github.com/4dn-dcic/hic2cool> or HiClift <https://github.com/XiaoTaoWang/HiCLift>_.
    2. Path to the folder containing the pre-trained models.

    Output:

    The predicted SVs will be written to a .txt file with 13 columns:

    • Breakpoint coordinates (chrom1, pos1, chrom2, pos2)
    • Probability values for each SV type (++, +-, -+, --, ++/--, and +-/-+)
    • The resolution of the contact matrix from which the SV was originally predicted
    • The finest resolution to which the SV can be refined
    • The number of bad bins near the SV breakpoints
  • plot-SVbreaks

    Plots a local contact map centered on the provided SV breakpoint coordinates. For intra-chromosomal SVs, contact counts will be distance-normalized. All contact matrices will be min-max scaled to the range [0, 1].

    The input breakpoint coordinates should follow the format: "chrom1,pos1,chrom2,pos2".

    This is useful for visually checking whether the expected contact patterns are present around SV breakpoints, including those identified by short-read or long-read whole-genome sequencing methods.

  • filterSV

    Filters the predicted SVs based on probability values.

  • evaluateSV

    Evaluates a predefined list of SVs using EagleC2 models.

  • reformatSV

    Reformats the output from predictSV into a format compatible with NeoLoopFinder <https://github.com/XiaoTaoWang/NeoLoopFinder>_.

  • annotate-gene-fusion

    Annotates gene fusion events for a list of SV breakpoints.

  • plot-interSVs

    Plots a contact map for a specified set of chromosomes, with predicted SVs marked.

  • plot-intraSVs

    Plots a contact map for a specified genomic region, with predicted SVs marked.

As the commands annotate-gene-fusion, plot-interSVs, and plot-intraSVs are directly inherited from the original EagleC, this documentation does not cover them in detail. For more information, please refer to the orignal EagleC documentation <https://github.com/XiaoTaoWang/EagleC>_

Quick Start

The following steps will guide you through the process of using EagleC2. All commands below are expected to be executed in a terminal window.

  1. Unzip the pre-trained models

Place the downloaded pre-trained models in your working directory and unzip the archive::

$ unzip EagleC2-models.zip

2. Download the test dataset

Download the test dataset FY1199.used_for_SVpredict.mcool <https://www.jianguoyun.com/p/DYoL0UgQh9qdDBjdpoEGIAA>_, which contains ~18 million contact pairs. This dataset is derived from FY1199, a human lymphoblastoid cell line with a known balanced inter-chromosomal translocation between chromosomes 11 and 22 (46,XY,t(11;22)(q23.3;q11.2)). Place the file in the same directory as the pre-trained models.

  1. Run the SV prediction command

Execute the following command to perform SV prediction on this Hi-C dataset::

$ predictSV --mcool FY1199.used_for_SVpredict.mcool --resolutions 25000,50000,100000 \
            --high-res 25000 --prob-cutoff-1 0.5 --prob-cutoff-2 0.5 -O FY1199_EagleC2 \
            -g hg38 --balance-type ICE -p 8 --intra-extend-size 1,1,1 --inter-extend-size 1,1,1

For view a full description of each parameter, run::

$ predictSV -h

What happens when you run the above command

This command performs genome-wide SV prediction on ICE-normalized contact matrices at 50 kb and 100 kb resolutions (as specified by --resolutions, excluding those listed in --high-res). To accelerate computation, pixels with significantly elevated contact counts are identified and extended by 1 bin on both ends (controlled by --intra-extend-size and --inter-extend-size; the values specified for these parameters correspond to each resolution listed in --resolutions) to cover potential SV breakpoints.

SV predicted at coarser resolutions are progressively refined at higher resolutions. For example, an SV initially predicted at 100 kb (with a probability cutoff of 0.5, set by --prob-cutoff-1) will be refined at 50 kb. If the probability at 50 kb exceeds the second cutoff (set by --prob-cutoff-2), the SV will be further refined at 25 kb. Otherwise, the 50 kb coordinates are reported as final.

SV predictions across all res

View on GitHub
GitHub Stars15
CategoryEducation
Updated13d ago
Forks0

Languages

Python

Security Score

75/100

Audited on Mar 27, 2026

No findings