EagleC2
EagleC2: Deep learning–powered discovery and screening of complex, fine-scale, and heterogeneity-defining structural variations from Hi-C data
Hi-C has emerged as a powerful tool for detecting structural variations (SVs), but its sensitivity remains limited—particularly for SVs lacking canonical contact patterns. Here, we introduce EagleC2, a next-generation deep-learning framework that integrates an ensemble of convolutional neural networks (CNNs) with diverse architectures, trained on over 2.7 million image patches from 51 cancer Hi-C datasets with matched whole-genome sequencing (WGS) data. EagleC2 substantially outperforms its predecessor (EagleC) and other state-of-the-art methods, achieving consistently higher precision and recall across diverse validation datasets. Notably, it enables the discovery of non-canonical SVs—including complex rearrangements and fusions involving extremely small fragments—that are frequently missed by existing tools. In individual cancer genomes, EagleC2 detects over a thousand previously unrecognized SVs, the majority of which are supported by orthogonal evidence. To support clinical and diagnostic applications, EagleC2 also offers a rapid evaluation mode for accurately screening predefined SV lists, even at ultra-low coverage (e.g., 1x depth). When applied to single-cell Hi-C data from glioblastoma before and after erlotinib treatment, EagleC2 reveals extensive SV heterogeneity and dynamic structural changes, including events overlooked by conventional pipelines. These findings establish EagleC2 as a powerful and versatile framework for SV discovery, with broad applications in genome research, cancer biology, diagnostics, and therapeutic development.
.. image:: ./images/framework.png
   :align: center
Unique features of EagleC2
Compared with the original EagleC, EagleC2 has the following unique features:
- EagleC2 is able to detect non-canonical SVs, including fine-scale complex rearrangements (multiple SVs clustered within a local window) and fusions involving extremely small fragments
- EagleC2 offers a rapid evaluation mode for accurately screening predefined SV lists, even at ultra-low coverage (e.g., 1x depth)
- EagleC2 supports arbitrary resolutions, without requiring model re-training for each resolution
- EagleC2 enables fast genome-wide SV prediction on large Hi-C datasets without the need for distributed computing across multiple nodes
- EagleC2 supports both CPU and GPU inference
Navigation
- `Installation`_
- `Download pre-trained models`_
- `Overview of the commands`_
- `Quick start`_
- `Visualize local contact patterns around SV breakpoints`_
- `Post-processing and filtering of SV predictions`_
- `Evaluation of predefined SVs`_
Installation
EagleC2 and all required dependencies can be installed using `mamba <https://github.com/conda-forge/miniforge>`_
and `pip <https://pypi.org/project/pip/>`_.
After you have installed mamba successfully, you can create a conda environment for EagleC2 by executing the following commands (for Linux users)::

    $ conda config --add channels defaults
    $ conda config --add channels bioconda
    $ conda config --add channels conda-forge
    $ mamba create -n EagleC2 hdbscan numba statsmodels python=3.11 cooler=0.9 joblib=1.3 numpy=1.26 scikit-learn=1.4 "tensorflow>=2.16"
    $ mamba activate EagleC2
    $ pip install eaglec

This will install the core dependencies required to run EagleC2.
If you also wish to use the visualization module, please install the following additional packages (pyBigWig is only required if you want to plot signals from BigWig files)::

    $ mamba install matplotlib pyBigWig

If you plan to use the gene fusion annotation module, please install::

    $ mamba install pyensembl

For macOS users (tested on Apple M-series chips only), you can install EagleC2 and its dependencies with::

    $ conda config --add channels defaults
    $ conda config --add channels bioconda
    $ conda config --add channels conda-forge
    $ mamba clean --all
    $ mamba create -n EagleC2gpu python=3.11 hdbscan numba statsmodels joblib=1.3 numpy=1.26 scikit-learn=1.4
    $ mamba activate EagleC2gpu
    $ pip install --no-cache-dir cooler==0.9.1
    $ pip install --no-cache-dir tensorflow==2.16.1 keras==3.3.3
    $ pip install --no-cache-dir tensorflow-metal==1.1.0
    $ pip install eaglec

Similarly, if you would like to use the visualization or gene fusion annotation modules on macOS, please install matplotlib, pyBigWig, and pyensembl as described above.
Download pre-trained models
Before proceeding, please download the `pre-trained models <https://www.jianguoyun.com/p/DWhJeUsQh9qdDBjVpoEGIAA>`_ for EagleC2.
Unlike EagleC, which relied on separate models trained for specific resolutions (e.g., 5 kb, 10 kb, 50 kb, and 500 kb) and sequencing depths, EagleC2 was trained on a unified dataset that integrates samples across a wide range of resolutions and depths. This allows for seamless application to data at arbitrary resolutions and sequencing depths, without the need for model re-training.
Overview of the commands
EagleC2 is distributed with eight command-line tools. You can run ``command [-h]`` in a
terminal window to view the basic usage of each command.
- predictSV
predictSV is the core command for predicting SVs from chromatin contact maps.
Required inputs:
- Path to a .mcool file – This is a multi-resolution format for storing contact
  matrices. See `cooler <https://github.com/open2c/cooler>`_ for details. If you
  only have .hic files (see `Juicer <https://github.com/aidenlab/juicer>`_), you can
  convert them to .mcool using `hic2cool <https://github.com/4dn-dcic/hic2cool>`_ or
  `HiClift <https://github.com/XiaoTaoWang/HiCLift>`_.
- Path to the folder containing the pre-trained models.
Output:
The predicted SVs will be written to a .txt file with 13 columns:
- Breakpoint coordinates (chrom1, pos1, chrom2, pos2)
- Probability values for each SV type (++, +-, -+, --, ++/--, and +-/-+)
- The resolution of the contact matrix from which the SV was originally predicted
- The finest resolution to which the SV can be refined
- The number of bad bins near the SV breakpoints
- plot-SVbreaks
Plots a local contact map centered on the provided SV breakpoint coordinates. For intra-chromosomal SVs, contact counts will be distance-normalized. All contact matrices will be min-max scaled to the range [0, 1].
The input breakpoint coordinates should follow the format: "chrom1,pos1,chrom2,pos2".
This is useful for visually checking whether the expected contact patterns are present around SV breakpoints, including those identified by short-read or long-read whole-genome sequencing methods.
- filterSV
Filters the predicted SVs based on probability values.
- evaluateSV
Evaluates a predefined list of SVs using EagleC2 models.
- reformatSV
Reformats the output from predictSV into a format compatible with
`NeoLoopFinder <https://github.com/XiaoTaoWang/NeoLoopFinder>`_.
- annotate-gene-fusion
Annotates gene fusion events for a list of SV breakpoints.
- plot-interSVs
Plots a contact map for a specified set of chromosomes, with predicted SVs marked.
- plot-intraSVs
Plots a contact map for a specified genomic region, with predicted SVs marked.
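As a rough illustration of the 13-column predictSV output described above, the sketch below parses one record and reports the most probable SV type. It assumes the file is tab-separated with columns in the order listed above. The function name ``parse_sv_record`` and the dictionary keys are hypothetical, and the example line is synthetic. Check your own output file (for instance, for a header line) before relying on it.

```python
# Order of the six probability columns, as listed in the predictSV output description.
SV_TYPES = ["++", "+-", "-+", "--", "++/--", "+-/-+"]


def parse_sv_record(line):
    """Parse one 13-column record (assumed tab-separated) into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 13:
        raise ValueError(f"expected 13 columns, got {len(fields)}")
    # Columns 5-10: one probability per SV type.
    probs = {t: float(p) for t, p in zip(SV_TYPES, fields[4:10])}
    best_type = max(probs, key=probs.get)
    return {
        "chrom1": fields[0],
        "pos1": int(fields[1]),
        "chrom2": fields[2],
        "pos2": int(fields[3]),
        "probabilities": probs,
        "best_type": best_type,
        "predicted_resolution": int(fields[10]),  # resolution of the original call
        "finest_resolution": int(fields[11]),     # finest resolution after refinement
        "bad_bins": int(fields[12]),              # bad bins near the breakpoints
    }


# Synthetic example line (not real EagleC2 output).
example = "\t".join([
    "chrA", "1000000", "chrB", "2500000",
    "0.01", "0.95", "0.02", "0.01", "0.0", "0.01",
    "50000", "25000", "0",
])
record = parse_sv_record(example)
print(record["best_type"], record["probabilities"][record["best_type"]])
```

Keeping all six probabilities in the parsed record, rather than only the top type, makes it easy to apply custom cutoffs later.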
As the commands annotate-gene-fusion, plot-interSVs, and plot-intraSVs are directly
inherited from the original EagleC, this documentation does not cover them in detail. For
more information, please refer to the `original EagleC documentation <https://github.com/XiaoTaoWang/EagleC>`_.
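For plot-SVbreaks, the two normalization steps mentioned above (distance normalization for intra-chromosomal maps, then min-max scaling to [0, 1]) can be sketched as follows. This is a simplified illustration, not EagleC2's implementation: it assumes a small symmetric square matrix whose diagonals correspond to genomic distances and divides each diagonal by its own mean. The function names are hypothetical.

```python
def distance_normalize(matrix):
    """Divide each diagonal of a symmetric square matrix by that diagonal's mean."""
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for d in range(n):
        diagonal = [matrix[i][i + d] for i in range(n - d)]
        mean = sum(diagonal) / len(diagonal)
        for i in range(n - d):
            value = matrix[i][i + d] / mean if mean > 0 else 0.0
            out[i][i + d] = value
            out[i + d][i] = value  # keep the matrix symmetric
    return out


def min_max_scale(matrix):
    """Linearly rescale all values to the range [0, 1]."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant matrix carries no contrast; map everything to 0.
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]


# Toy contact counts that decay with distance from the diagonal.
counts = [
    [4, 2, 1],
    [2, 4, 2],
    [1, 2, 4],
]
normalized = distance_normalize(counts)
scaled = min_max_scale([[0, 5], [10, 5]])
print(normalized)
print(scaled)
```

In the toy matrix every diagonal is constant, so distance normalization flattens it completely. In real data, an SV would show up as pixels that stay elevated relative to their genomic distance.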
Quick Start
The following steps will guide you through the process of using EagleC2. All commands below are expected to be executed in a terminal window.
1. Unzip the pre-trained models
Place the downloaded pre-trained models in your working directory and unzip the archive::

    $ unzip EagleC2-models.zip

2. Download the test dataset
Download the test dataset `FY1199.used_for_SVpredict.mcool <https://www.jianguoyun.com/p/DYoL0UgQh9qdDBjdpoEGIAA>`_,
which contains ~18 million contact pairs. This dataset is derived from FY1199,
a human lymphoblastoid cell line with a known balanced inter-chromosomal translocation
between chromosomes 11 and 22 (46,XY,t(11;22)(q23.3;q11.2)). Place the file in the
same directory as the pre-trained models.
3. Run the SV prediction command
Execute the following command to perform SV prediction on this Hi-C dataset::

    $ predictSV --mcool FY1199.used_for_SVpredict.mcool --resolutions 25000,50000,100000 \
                --high-res 25000 --prob-cutoff-1 0.5 --prob-cutoff-2 0.5 -O FY1199_EagleC2 \
                -g hg38 --balance-type ICE -p 8 --intra-extend-size 1,1,1 --inter-extend-size 1,1,1

To view a full description of each parameter, run::

    $ predictSV -h

What happens when you run the above command
This command performs genome-wide SV prediction on ICE-normalized contact matrices
at 50 kb and 100 kb resolutions (as specified by --resolutions, excluding those
listed in --high-res). To accelerate computation, pixels with significantly
elevated contact counts are identified and extended by 1 bin on both ends (controlled
by --intra-extend-size and --inter-extend-size; the values specified for these
parameters correspond to each resolution listed in --resolutions) to cover potential
SV breakpoints.
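The relationship between these parameters can be sketched in a few lines of Python. This is only an illustration of the description above, not EagleC2's code: the helper names are hypothetical, and "extended by 1 bin on both ends" is interpreted here as growing a candidate pixel by the extend size in each direction along both axes.

```python
def parse_int_list(arg):
    """Parse a comma-separated command-line value such as '25000,50000,100000'."""
    return [int(x) for x in arg.split(",")]


def scan_resolutions(resolutions_arg, high_res_arg):
    """Resolutions used for the genome-wide scan: --resolutions minus --high-res."""
    high_res = set(parse_int_list(high_res_arg))
    return [r for r in parse_int_list(resolutions_arg) if r not in high_res]


def extend_sizes_by_resolution(resolutions_arg, extend_arg):
    """Pair each resolution with the extend size given at the same position."""
    resolutions = parse_int_list(resolutions_arg)
    sizes = parse_int_list(extend_arg)
    if len(resolutions) != len(sizes):
        raise ValueError("need one extend size per resolution")
    return dict(zip(resolutions, sizes))


def extend_pixel(i, j, extend, n_rows, n_cols):
    """Pixels covered after growing (i, j) by `extend` bins, clipped to the matrix."""
    rows = range(max(0, i - extend), min(n_rows, i + extend + 1))
    cols = range(max(0, j - extend), min(n_cols, j + extend + 1))
    return [(r, c) for r in rows for c in cols]


scan = scan_resolutions("25000,50000,100000", "25000")
sizes = extend_sizes_by_resolution("25000,50000,100000", "1,1,1")
window = extend_pixel(5, 9, sizes[50000], n_rows=20, n_cols=20)
print(scan, sizes, len(window))
```

With an extend size of 1, each candidate pixel grows into a 3 x 3 window (fewer pixels at the matrix edge), so the model only needs to be evaluated on a small fraction of the full contact matrix.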
SVs predicted at coarser resolutions are progressively refined at higher resolutions.
For example, an SV initially predicted at 100 kb (with a probability cutoff of 0.5,
set by --prob-cutoff-1) will be refined at 50 kb. If the probability at 50 kb exceeds
the second cutoff (set by --prob-cutoff-2), the SV will be further refined at 25 kb.
Otherwise, the 50 kb coordinates are reported as final.
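The coarse-to-fine logic in this example can be written down as a small sketch. It follows only the description above and is not EagleC2's implementation: the function name ``refinement_path`` and the ``prob_at`` mapping (probability obtained after refining at a given resolution) are hypothetical. It also assumes that the first refinement step is always attempted for an SV that passed the first cutoff, and that each later step requires the probability at the current resolution to exceed the second cutoff.

```python
def refinement_path(start_res, resolutions, prob_at, prob_cutoff_2):
    """Return the resolutions visited for one SV; the last entry is the final one.

    start_res     -- resolution at which the SV was first predicted
    resolutions   -- all resolutions passed to --resolutions
    prob_at       -- hypothetical mapping: resolution -> probability after refinement
    prob_cutoff_2 -- cutoff that must be exceeded to keep refining
    """
    # Finer resolutions than the starting one, from coarser to finer.
    finer = sorted((r for r in resolutions if r < start_res), reverse=True)
    path = [start_res]
    for step, res in enumerate(finer):
        # After the first refinement, continue only if the current probability
        # exceeds the second cutoff.
        if step > 0 and prob_at.get(path[-1], 0.0) <= prob_cutoff_2:
            break
        path.append(res)
    return path


resolutions = [25000, 50000, 100000]
confident = refinement_path(100000, resolutions, {50000: 0.8}, prob_cutoff_2=0.5)
uncertain = refinement_path(100000, resolutions, {50000: 0.3}, prob_cutoff_2=0.5)
print(confident)
print(uncertain)
```

In the first call the SV stays confident at 50 kb and is refined down to 25 kb. In the second call the 50 kb probability falls below the cutoff, so the path stops there and the 50 kb coordinates are reported.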
SV predictions across all res
