EagleC
A deep-learning framework for predicting a full range of structural variations from bulk and single-cell contact maps
.. note:: `EagleC2 <https://github.com/XiaoTaoWang/EagleC2>`_ is now publicly available.
   Try it and let us know your feedback!
The Hi-C technique has been shown to be a promising method for detecting structural variations (SVs) in human genomes. However, algorithms that can use Hi-C data for full-range SV detection have been severely lacking. Current methods can only identify inter-chromosomal translocations and long-range intra-chromosomal SVs (>1Mb) at less-than-optimal resolution. Therefore, we developed EagleC, a framework that combines deep-learning and ensemble-learning strategies to predict a full range of SVs at high resolution. Importantly, we show that EagleC can uniquely capture a set of fusion genes that are missed by WGS or nanopore sequencing. Furthermore, EagleC also effectively captures SVs in data from other chromatin interaction platforms, such as HiChIP, ChIA-PET, and capture Hi-C. We applied EagleC to over 100 cancer cell lines and primary tumors and identified a valuable set of high-quality SVs. Finally, we demonstrate that EagleC can be applied to single-cell Hi-C and used to study SV heterogeneity in primary tumors.
.. image:: ./images/framework.png
   :align: center
Unique features of EagleC
- EagleC is able to accurately detect a full range of SVs including short-range SVs with breakpoint distance less than 100kb or even 50kb
- EagleC is designed to accept any 3C-based contact maps, including Hi-C, Micro-C, HiChIP, ChIA-PET, capture Hi-C, and single-cell Hi-C
- EagleC can be used to predict SVs in any species (it has been tested in human, mouse, and zebrafish)
Citation
Wang, X., Luan, Y., Yue, F. EagleC: A deep-learning framework for detecting a full range of structural variations from bulk and single-cell contact maps. Sci Adv. 2022.
Navigation
- `Installation`_
- `Download pre-trained models`_
- `Overview of the commands`_
- `Quick Start`_
- `Annotate gene fusions`_
- `Visualize predicted SVs on contact maps`_
- `Locate high-resolution coordinates given a list of low-resolution SVs`_
- `Predict SVs at higher resolutions`_
- `Predict SVs in other species`_
Installation
First, install the following Python packages using `mamba <https://mamba.readthedocs.io/en/latest/installation.html>`_::

    $ conda config --add channels defaults
    $ conda config --add channels bioconda
    $ conda config --add channels conda-forge
    $ mamba create -n EagleC scikit-learn statsmodels matplotlib cooler pyBigWig pyensembl python=3.8 joblib=1.0.1 cython=0.29.24 "tensorflow<=2.11"
.. note:: matplotlib and pyBigWig are only required if you want to use the visualization module to view the predicted SVs on contact maps, and pyensembl is only required if you want to annotate potential gene fusions given a list of SV breakpoints.
If you are installing EagleC on Linux, run the command below to install
EagleC from `PyPI <https://pypi.org/project/eaglec/>`_::

    $ pip install eaglec==0.1.9
If you are installing EagleC on macOS, please download and install an appropriate package
from `here <https://github.com/XiaoTaoWang/EagleC/releases>`_::

    $ pip install eaglec-0.1.9-cp38-cp38-macosx_10_9_x86_64.whl
Download pre-trained models
We have trained a series of EagleC models covering various sequencing depths for both bulk Hi-C maps and single-cell Hi-C maps. Before running EagleC, we recommend downloading these pre-trained models by executing the command below. During prediction, EagleC automatically selects the most appropriate models according to the number of contacts in your contact map::

    $ download-pretrained-models
Overview of the commands
EagleC is distributed with 6 command-line tools. Type ``<command> -h`` in a terminal
window to learn the basic usage of each command.
- predictSV

  predictSV is the main command we used to predict SVs from bulk Hi-C/HiChIP/ChIA-PET contact maps in this work. It is based on predictSV-single-resolution and automatically combines predictions from 5kb, 10kb, and 50kb resolutions. For 10kb and 50kb predictions, it further searches for the most probable breakpoint coordinates within a local region on 5kb contact maps so that all the reported SVs are at the 5kb resolution.

  The inputs to this command are three genome-wide contact maps at 5kb, 10kb, and 50kb resolutions in .cool format (cool URIs; refer to
  `cooler <https://github.com/open2c/cooler>`_ if you are not familiar with this format). If you only have `.hic files <https://github.com/aidenlab/juicer>`_, consider converting your files to the ".cool" format using `hic2cool <https://github.com/4dn-dcic/hic2cool>`_ or `pairLiftOver <https://github.com/XiaoTaoWang/pairLiftOver#usage>`_. The predicted SVs can be reported in two formats: 1) "--output-format full" reports 8 columns for each SV, including breakpoint coordinates (chrom1, pos1, chrom2, pos2) and probability values of each fusion type (++, +-, -+, and --) (refer to Figures S1-S2 for the definition of each fusion type); 2) "--output-format NeoLoopFinder" outputs a file (6 columns) that can be directly used as the `NeoLoopFinder <https://github.com/XiaoTaoWang/NeoLoopFinder>`_ input.

- predictSV-single-resolution

  This command predicts SVs at a single resolution. By default, it searches for SVs throughout the whole genome; however, it can also perform a local search on high-resolution matrices if SVs at lower resolutions are provided through the parameter "--low-resolution-breaks".

- merge-redundant-SVs

  This command merges multiple SV calls from the same sample. The inputs are one or more SV files from predictSV or predictSV-single-resolution in "full" format (8 columns). Again, the output format has two options ("full" and "NeoLoopFinder").

- annotate-gene-fusion

  This command can be used to annotate gene fusion events for a list of SV breakpoints. The inputs to this command are an SV file with breakpoint coordinate information (chrom1, pos1, chrom2, pos2) in the first four columns and an Ensembl gene release number.

- plot-interSVs

  This command can be used to plot a chromosome-wide contact map with predicted SVs marked on it.

- plot-intraSVs

  This command can be used to plot a local intra-chromosomal contact map with predicted SVs marked on it.
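The 8-column "full" format described for predictSV lends itself to simple post-processing. The sketch below is a minimal illustration and not part of EagleC. It assumes the file is tab-delimited, that the four probability columns appear in the order ++, +-, -+, -- as listed above, and that a header row (if present) can be recognized by a non-numeric second column. The name ``parse_full_sv_lines`` and the example record are hypothetical.

```python
from typing import Iterable, List, NamedTuple

# Assumed order of the probability columns in the "full" output format.
FUSION_TYPES = ("++", "+-", "-+", "--")


class SVCall(NamedTuple):
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    fusion_type: str    # fusion type with the highest probability
    probability: float  # probability value of that fusion type


def parse_full_sv_lines(lines: Iterable[str], sep: str = "\t") -> List[SVCall]:
    """Parse 8-column SV records (assumed layout) into SVCall tuples."""
    calls = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Skip blank or short lines and an assumed header row.
        if len(fields) < 8 or not fields[1].isdigit():
            continue
        probs = [float(value) for value in fields[4:8]]
        best = max(range(4), key=probs.__getitem__)
        calls.append(
            SVCall(fields[0], int(fields[1]), fields[2], int(fields[3]),
                   FUSION_TYPES[best], probs[best])
        )
    return calls


# Made-up record, for illustration only.
example = [
    "chrom1\tpos1\tchrom2\tpos2\t++\t+-\t-+\t--",
    "chr1\t1000000\tchr5\t2500000\t0.01\t0.97\t0.01\t0.01",
]
calls = parse_full_sv_lines(example)
for call in calls:
    print(call)
```

To read a real output file, pass an open file handle instead of the example list, after checking that its actual layout matches the assumptions stated above.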
Quick Start
First, let's download a processed Hi-C dataset (~163M contact pairs) for SK-N-AS (a neuroblastoma cell line)::

    $ wget -O SKNAS-MboI-allReps-filtered.mcool -L "https://www.dropbox.com/s/f80bgn11d7wfgq8/SKNAS-MboI-allReps-filtered.mcool?dl=0"
The downloaded ".mcool" file contains contact matrices at multiple resolutions. To list all
individual cool URIs within it, execute the ``cooler ls`` command below::

    $ cooler ls SKNAS-MboI-allReps-filtered.mcool
    SKNAS-MboI-allReps-filtered.mcool::/resolutions/5000
    SKNAS-MboI-allReps-filtered.mcool::/resolutions/10000
    SKNAS-MboI-allReps-filtered.mcool::/resolutions/25000
    SKNAS-MboI-allReps-filtered.mcool::/resolutions/50000
    SKNAS-MboI-allReps-filtered.mcool::/resolutions/100000
    SKNAS-MboI-allReps-filtered.mcool::/resolutions/250000
    SKNAS-MboI-allReps-filtered.mcool::/resolutions/500000
    SKNAS-MboI-allReps-filtered.mcool::/resolutions/1000000
    SKNAS-MboI-allReps-filtered.mcool::/resolutions/2500000
    SKNAS-MboI-allReps-filtered.mcool::/resolutions/5000000
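Each URI in the listing follows the pattern ``<file>::/resolutions/<bin size>``. As a small sketch, the three URIs needed by predictSV (5kb, 10kb, and 50kb) can be composed as shown below. The helper name ``cool_uri`` is hypothetical, and the pattern is inferred only from the listing above.

```python
def cool_uri(mcool_path: str, resolution: int) -> str:
    """Compose a cool URI for one resolution stored inside an .mcool file."""
    return f"{mcool_path}::/resolutions/{resolution}"


mcool = "SKNAS-MboI-allReps-filtered.mcool"
# Bin sizes in base pairs for the 5kb, 10kb, and 50kb matrices.
uris = {res: cool_uri(mcool, res) for res in (5000, 10000, 50000)}

for res, uri in uris.items():
    print(res, uri)
```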
Next, let's use the predictSV command to predict SVs on this dataset::

    $ predictSV --hic-5k SKNAS-MboI-allReps-filtered.mcool::/resolutions/5000 \
                --hic-10k SKNAS-MboI-allReps-filtered.mcool::/resolutions/10000 \
                --hic-50k SKNAS-MboI-allReps-filtered.mcool::/resolutions/50000 \
                -O SK-N-AS -g hg38 --balance-type CNV --output-format full \
                --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
As mentioned in `Overview of the commands`_, contact matrices at three resolutions
(5kb, 10kb, and 50kb) are used. Here are some suggestions for individual parameters:
- ``--balance-type``: By specifying "--balance-type CNV", as in the example, predictSV performs predictions on CNV-normalized matrices. You can also use ICE-normalized matrices by specifying "--balance-type ICE" or raw matrices by specifying "--balance-type Raw". In our tests, for the same sample, running on the raw matrix tends to detect more SVs with lower accuracy, while running on the CNV- or ICE-normalized matrices usually achieves higher accuracy but detects fewer SVs.

  .. note:: If you choose CNV, make sure you have run "correct-cnv" of the
     `NeoLoopFinder <https://github.com/XiaoTaoWang/NeoLoopFinder>`_
     toolkit before you run this command; if you choose ICE, make sure you have run
     "cooler balance" on your Hi-C matrices before you run this command.

- By default, we apply probability cutoffs of 0.8, 0.8, and 0.99999 at 5kb, 10kb, and 50kb resolutions, respectively. We found that this set of cutoffs achieved a good tradeoff between sensitivity and specificity in most of our tests. If you care more about sensitivity, lower these cutoffs.
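When the same call has to be repeated for many samples, it can help to assemble the command programmatically. The sketch below only builds and prints the argument list, using the flags shown in the example above; nothing is executed. The helper name ``build_predictsv_command`` and its default arguments are assumptions made for illustration.

```python
import shlex
from typing import List, Sequence


def build_predictsv_command(
    mcool: str,
    out_prefix: str,
    genome: str = "hg38",
    balance_type: str = "CNV",
    output_format: str = "full",
    cutoffs: Sequence[float] = (0.8, 0.8, 0.99999),
) -> List[str]:
    """Assemble a predictSV argument list from the flags used in the example."""
    if balance_type not in {"CNV", "ICE", "Raw"}:
        raise ValueError(f"unexpected balance type: {balance_type}")

    labels = ("5k", "10k", "50k")
    bin_sizes = (5000, 10000, 50000)

    cmd = ["predictSV"]
    # One cool URI per resolution, taken from the same .mcool file.
    for label, size in zip(labels, bin_sizes):
        cmd += [f"--hic-{label}", f"{mcool}::/resolutions/{size}"]
    cmd += ["-O", out_prefix, "-g", genome,
            "--balance-type", balance_type,
            "--output-format", output_format]
    # One probability cutoff per resolution.
    for label, cutoff in zip(labels, cutoffs):
        cmd += [f"--prob-cutoff-{label}", str(cutoff)]
    return cmd


cmd = build_predictsv_command("SKNAS-MboI-allReps-filtered.mcool", "SK-N-AS")
print(shlex.join(cmd))
```

Once EagleC is installed, the resulting list could be passed to ``subprocess.run(cmd, check=True)``.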
Running predictSV on a single CPU core is expected to be slow, as it iterates over submatrices of all candidate pixels on these contact matrices. To speed up the calculation, predictSV supports parallel computation for different intra-chromosomal and inter-chromosomal matrices, creating hidden lock files to avoid conflicts between jobs. This strategy is especially efficient when you are performing the calculation on a computational cluster. Depending on your cluster environment, you need to create a job submission script. Here is an example slurm script named "slurm-predictSV.sh"::

    #!/bin/bash
    #SBATCH -A b1042
    #SBATCH -p genomicsguestA
    #SBATCH -t 48:00:00
    #SBATCH -N 1
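The lock-file idea mentioned above can be illustrated with a generic pattern. This is not EagleC's actual implementation, only a sketch of how concurrent jobs can claim work items through atomic file creation. The function name ``try_claim``, the task names, and the hidden-file naming scheme are hypothetical.

```python
import os
import tempfile


def try_claim(lock_dir: str, task: str) -> bool:
    """Return True if this process was the first to create the lock for `task`."""
    lock_path = os.path.join(lock_dir, f".{task}.lock")
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists,
        # so only one job can succeed for a given task.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


with tempfile.TemporaryDirectory() as workdir:
    first = try_claim(workdir, "chr1_chr2")   # first job claims the matrix
    second = try_claim(workdir, "chr1_chr2")  # a second job finds it taken
    other = try_claim(workdir, "chr3_chr3")   # a different matrix is still free
    print(first, second, other)
```

With this pattern, several identical jobs can be submitted at once, and each job simply skips any matrix whose lock file already exists.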
