NeoLoopFinder
A computation framework for genome-wide detection of enhancer-hijacking events from chromatin interaction data in re-arranged genomes
Install / Use
/learn @XiaoTaoWang/NeoLoopFinderREADME
Neo-loop Finder
.. image:: https://codeocean.com/codeocean-assets/badge/open-in-code-ocean.svg :target: https://codeocean.com/capsule/8407443/tree/v1 .. image:: https://static.pepy.tech/personalized-badge/neoloop?period=total&units=international_system&left_color=black&right_color=orange&left_text=Downloads :target: https://pepy.tech/project/neoloop
Although recent efforts have shown that structural variations (SVs) can disrupt the 3D genome organization and induce enhancer-hijacking, no computational tools exist to detect such events from chromatin interaction data, such as Hi-C. Here, we develop NeoLoopFinder, a computational framework to identify the chromatin interactions induced by SVs, such as inter-chromosomal translocations, large deletions, and inversions. Our framework can automatically reconstruct local Hi-C maps surrounding the breakpoints, normalize copy number variation and allele effects, and capture local optimal signals. We applied NeoLoopFinder in Hi-C data from 50 cancer cell lines and primary tumors and identified tens of recurrent genes associated with enhancer-hijacking in different cancer types. To validate the algorithm, we deleted hijacked enhancers by CRISPR/Cas9 and showed that the deletions resulted in the reduction of the target oncogene expression. In summary, NeoLoopFinder is a novel tool for identifying potential tumorigenic mechanisms and suggesting new diagnostic and therapeutic targets.
Citation
Wang, X., Xu, J., Zhang, B., Hou, Y., Song, F., Lyu, H., Yue, F. Genome-wide detection of enhancer-hijacking events from chromatin interaction data in re-arranged genomes. Nat Methods. 2021.
Installation
NeoLoopFinder and all the dependencies can be installed using either mamba <https://mamba.readthedocs.io/en/latest/installation.html>_
or pip <https://pypi.org/project/pip/>_::
$ conda config --add channels r
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
$ conda config --set channel_priority strict
$ mamba create -n neoloop cooler matplotlib pyensembl pybigwig intervaltree rpy2 r-mgcv scikit-learn=1.1.2 joblib=1.1.0 "pomegranate<=0.14.8"
$ mamba activate neoloop
$ pip install -U neoloop TADLib
Overview
neoloop-finder is distributed with 9 scripts. You can learn the basic usage of each script
by typing command [-h] in a terminal window, where "command" is one of the following
script names:
-
calculate-cnv
Calculate the copy number variation profile from Hi-C map using a generalized additive model with the Poisson link function
-
segment-cnv
Perform HMM segmentation on a pre-calculated copy number variation profile.
-
plot-cnv
Plot genome-wide CNV profiles and segments.
-
correct-cnv
Remove copy number variation effects from cancer Hi-C.
-
simulate-cnv
Simulate CNV effects on a normal Hi-C. The inputs are the Hi-C matrix of a normal cell in .cool format, the Hi-C matrix of a cancer cell in .cool format, and the CNV segmentation file of the same cancer cell in bedGraph format.
-
assemble-complexSVs
Assemble complex SVs. The inputs are a list of simple SVs and the Hi-C matrix of the same sample.
-
neoloop-caller
Identify neo-loops across SV breakpoints. The required inputs are the output SV assemblies from
assemble-complexSVsand the corresponding Hi-C map in .cool format. -
neotad-caller
Identify neo-TADs. The inputs are the same as
neoloop-caller. -
searchSVbyGene
Search SV assemblies by gene name.
Tutorial
This tutorial will cover the overall pipeline of NeoLoopFinder <https://www.nature.com/articles/s41592-021-01164-w/figures/1>.
Given a Hi-C map in .cool/.mcool <https://cooler.readthedocs.io/en/latest/schema.html#multi-resolution>
format and an SV list in the same sample, NeoLoopFinder starts with the inference of
the genome-wide copy number variation (CNV) profile and remove the CNV effects from
Hi-C. Then it resolves complex SVs and reconstructs local Hi-C matrices surrounding SV
breakpoints. And finally, it detects chromatin loops on each SV/complex SV assembly,
including both loops in the regions not affected by SVs and loops across the breakpoints.
Copy number inference from Hi-C map
.. note::
If the chromosome names in your .cool files do not have the "chr" prefix,
please make sure to add the "chr" prefix using add_prefix_to_cool.py <https://raw.githubusercontent.com/XiaoTaoWang/NeoLoopFinder/master/scripts/add_prefix_to_cool.py>_
before you run calculate-cnv (issue #1 <https://github.com/XiaoTaoWang/NeoLoopFinder/issues/1>).
Also make sure you have run cooler balance on your cool files before
you run correct-cnv (issue #8 <https://github.com/XiaoTaoWang/NeoLoopFinder/issues/8>).
First, let's download a processed Hi-C dataset in SK-N-MC (a neuroepithelioma cell line)::
$ wget -O SKNMC-MboI-allReps-filtered.mcool -L https://www.dropbox.com/s/tuhhrecipkp1u8k/SKNMC-MboI-allReps-filtered.mcool?dl=0
The downloaded ".mcool" file contains contact matrices at multiple resolutions. To list all
individual cool URIs within it, execute the cooler ls command below::
$ cooler ls SKNMC-MboI-allReps-filtered.mcool
SKNMC-MboI-allReps-filtered.mcool::/resolutions/5000
SKNMC-MboI-allReps-filtered.mcool::/resolutions/10000
SKNMC-MboI-allReps-filtered.mcool::/resolutions/25000
SKNMC-MboI-allReps-filtered.mcool::/resolutions/50000
SKNMC-MboI-allReps-filtered.mcool::/resolutions/100000
SKNMC-MboI-allReps-filtered.mcool::/resolutions/250000
SKNMC-MboI-allReps-filtered.mcool::/resolutions/500000
SKNMC-MboI-allReps-filtered.mcool::/resolutions/1000000
SKNMC-MboI-allReps-filtered.mcool::/resolutions/2500000
SKNMC-MboI-allReps-filtered.mcool::/resolutions/5000000
To infer the genome-wide CNV profile at a specific resolution, just run calculate-cnv using the cool URI at that resolution as input. For example, the following command will calculate the CNV profile at the 25kb resolution::
$ calculate-cnv -H SKNMC-MboI-allReps-filtered.mcool::resolutions/25000 -g hg38 \
-e MboI --output SKNMC_25k.CNV-profile.bedGraph
Here the -g parameter indicates the reference genome you used for mapping
your Hi-C data, which currently supports hg38, hg19, mm10, and mm9.
And the "-e" parameter indicates the restriction enzyme used in your
Hi-C experiment, which currently supports HindIII, MboI, DpnII, BglII,
Arima, and uniform, where uniform may be specified when the genome was
cutted using a sequence-independent/uniform-cutting enzyme
(please refer to issue 24 <https://github.com/XiaoTaoWang/NeoLoopFinder/issues/24>_).
The inferred CNV values for each 25kb bin will be reported in the bedGraph format as follows::
$ head SKNMC_25k.CNV-profile.bedGraph
chr1 0 25000 0.3622223616602325
chr1 25000 50000 0.16018489189648388
chr1 50000 75000 0.6700770894724766
chr1 75000 100000 0.29407421138399936
chr1 100000 125000 0.7064836696780397
chr1 125000 150000 0.18356628377821504
chr1 150000 175000 0.008115191530591481
chr1 175000 200000 1.9345786937265874
chr1 200000 225000 1.1066640487666337
chr1 225000 250000 0.0
Since the raw CNV profiles are usually relatively noisy, the next step is to identify CNV segments from the original signals::
$ segment-cnv --cnv-file SKNMC_25k.CNV-profile.bedGraph --binsize 25000 \
--ploidy 2 --output SKNMC_25k.CNV-seg.bedGraph --nproc 4
Here the --ploidy parameter indicates the ploidy or on average how many chromosome
copies are there in your sample's cell nucleus. For example, in our analysis,
we set this parameter to 2 for diploid/pseudodiploid cells, 3 for triploid/hypotriploid
cells, 4 for hypotetraploid cells, and 5 for hypopentaploid cells. This information
is usually obtained from karyotyping, but if you are not sure about it for your samples,
you can safely set it to 2.
So how does the inferred CNV look like? For this job, you can use the plot-cnv command::
$ plot-cnv --cnv-profile SKNMC_25k.CNV-profile.bedGraph \
--cnv-segment SKNMC_25k.CNV-seg.bedGraph \
--output-figure-name SKNMC_25k.CNV.genome-wide.png \
--dot-size 0.5 --dot-alpha 0.2 --line-width 1 --boundary-width 0.5 \
--label-size 7 --tick-label-size 6 --clean-mode
.. image:: ./images/SKNMC_25k.CNV.genome-wide.png :align: center
If you want to zoom into specific chromosomes, you can specify the chromosome labels
on the command using the -C parameter::
$ plot-cnv --cnv-profile SKNMC_25k.CNV-profile.bedGraph \
--cnv-segment SKNMC_25k.CNV-seg.bedGraph \
--output-figure-name SKNMC_25k.CNV.bychrom.png \
--dot-size 1.5 --dot-alpha 0.3 --line-width 1.5 --boundary-width 1 \
--label-size 7 --tick-label-size 6 --maximum-value 3 \
--minimum-value -5 -C 3 4 5 6 7 8
.. image:: ./images/SKNMC_25k.CNV.bychrom.png :align: center
Note that most key parameters of the CNV segmentation algorithm is now tunable since
v0.4.1, so if you are not satisfied with the segmentation outputted by the default
parameters, it would always be a good idea to tune those parameters yourself to find
the best solution (see an example here issue #3 <https://github.com/XiaoTaoWang/NeoLoopFinder/issues/3#issuecomment-1261176468>_).
At the end of this section, let's compute the CNV profiles and CNV segments at 10kb and 5kb resolutions as well::
$ calculate-cnv -H SKNMC-MboI-allReps-filtered.mcool::resolutions/10000 -g hg38 \
-e MboI --output SKNMC_10k.CNV-profile.bedGraph
$ segment-cnv --cnv-file SKNMC_10k.CNV-profile.bedGraph --binsize 10000 \
--ploidy 2 --output SKNMC_
