HapHiC
HapHiC: a fast, reference-independent, allele-aware scaffolding tool based on Hi-C data

HapHiC is an allele-aware scaffolding tool that uses Hi-C data to scaffold haplotype-phased genome assemblies into chromosome-scale pseudomolecules. Unlike ALLHiC, another allele-aware scaffolder, HapHiC can achieve this without the need for reference genomes. Our evaluations indicate that HapHiC outperforms other Hi-C scaffolding tools with higher tolerance to low contig N50, low Hi-C sequencing depth, and various types of assembly errors. Additionally, HapHiC is super-fast and also suitable for haplotype-collapsed diploid and allopolyploid genome assemblies.
Features:
- [x] Chromosome-level scaffolding of haplotype-phased assemblies without reference genomes
- [x] Efficient correction of chimeric contigs (misjoins) with little impact on contig N50
- [x] Much higher tolerance to chimeric contigs, collapsed contigs, and switch errors
- [x] Improved performance in chromosome assignment of contigs
- [x] Improved performance in ordering and orientation of contigs
- [x] Super-fast and memory-efficient
- [x] Able to order and orient contigs without prior knowledge of the number of chromosomes (quick view mode)
- [x] Able to utilize phasing information from hifiasm with varying confidence levels
- [x] Extensive compatibility and user-friendly interface: supports chromap; provides a built-in one-command pipeline; able to produce highly customizable vector graphics for contact maps
Recent updates:
- Version 1.0.7 (2025.03.28): HapHiC now supports longer contigs (< 2^63-1 bp). However, upstream and downstream tools (e.g., bwa, Juicebox) may not yet support contigs longer than 2^31-1 bp.
- Version 1.0.6 (2024.08.26): There is no longer a need to manually set the scale factor in Juicebox. In addition, the saved `.review.assembly` file can now be parsed correctly by Juicebox.
- Version 1.0.5 (2024.07.05): Improved stability in ordering and orientation of contigs through a comparison of fast sorting and ALLHiC optimization.
- Version 1.0.4 (2024.07.03): Add a `haphic refsort` command for ordering and orienting whole scaffolds according to a reference genome.
- Version 1.0.3 (2024.03.21): Add support for the pairs format used in chromap.
- Version 1.0.2 (2023.12.08): We have introduced a `haphic plot` command for Hi-C contact map visualization.
- Version 1.0.1 (2023.11.30): Improved AGP output by incorporating a YaHS-style `scaffolds.raw.agp` for compatibility with the Juicebox visualization method suggested by YaHS.
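Regarding the contig length limit mentioned for version 1.0.7, the following is a minimal sketch for spotting contigs that exceed 2^31-1 bp (2,147,483,647 bp) before running upstream or downstream tools. It assumes a FASTA index (`.fai`) is available, for example one created with `samtools faidx`. The helper name `long_contigs` is hypothetical and not part of HapHiC.

```shell
# List contigs longer than 2^31-1 bp, given a FASTA index (.fai).
# In a .fai file, column 1 is the sequence name and column 2 its length in bp.
# `long_contigs` is a hypothetical helper, not part of HapHiC.
long_contigs() {
    awk -v limit=2147483647 '$2 + 0 > limit { print $1 "\t" $2 }' "$1"
}

# Usage (assumes asm.fa.fai was created with `samtools faidx asm.fa`):
# long_contigs asm.fa.fai
```

If the function prints nothing, no contig in the index exceeds the 2^31-1 bp threshold.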
Terminology: To ensure conciseness and clarity, we use the term "contigs" to refer to the fragmented genome sequences in the input assembly, although they could be either contigs or scaffolds in actuality.
Table of contents
- Installation
- Quick start
- Go through the pipeline step by step
- Examples
- Work with hifiasm
- Quick view mode
- Juicebox curation
- Visualization
- Order and orient whole scaffolds using a reference genome
- Frequently asked questions (FAQs)
- Problems and bug reports
- Citing HapHiC
- Reproducibility
<span id="installation">Installation</span>
HapHiC has been tested and validated on servers running Linux, equipped with either Intel Xeon, AMD EPYC, or Hygon C86 CPUs.
# (1) Download HapHiC from GitHub
$ git clone https://github.com/zengxiaofei/HapHiC.git
# (2) Resolve dependencies
# We strongly recommend using conda to install dependencies. If you prefer manual installation, refer to HapHiC/conda_env/create_conda_env_py310.sh
# We have also included additional environments for Python 3.11 and 3.12 in the directory HapHiC/conda_env/
$ conda env create -f HapHiC/conda_env/environment_py310.yml
# Activate the HapHiC conda environment
$ conda activate haphic # or: source /path/to/conda/bin/activate haphic
# (3) Check whether all dependencies are correctly installed
$ /path/to/HapHiC/haphic check
# (4) Show all available commands and help message
$ /path/to/HapHiC/haphic -h
[!NOTE]
Please note that the Bioconda version of HapHiC is NOT officially maintained and has known issues that can cause the pipeline to abort. To ensure a successful installation, please set up the HapHiC Conda environment strictly in accordance with the approach above.
<span id="quick_start">Quick start</span>
<span id="align">Align Hi-C data to the assembly</span>
First, you need to prepare a BAM file by aligning Hi-C data to the assembly. Here is the way that we recommend:
# (1) Align Hi-C data to the assembly, remove PCR duplicates and filter out secondary and supplementary alignments
$ bwa index asm.fa
$ bwa mem -5SP -t 28 asm.fa /path/to/read1_fq.gz /path/to/read2_fq.gz | samblaster | samtools view - -@ 14 -S -h -b -F 3340 -o HiC.bam
# (2) Filter the alignments with MAPQ 1 (mapping quality ≥ 1) and NM 3 (edit distance < 3)
$ /path/to/HapHiC/utils/filter_bam HiC.bam 1 --nm 3 --threads 14 | samtools view - -b -@ 14 -o HiC.filtered.bam
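The `-F 3340` mask in step (1) is a sum of SAM flag bits. As an illustration, the sketch below decodes a decimal mask into the bits it sets, using the bit values defined in the SAM specification. The function name `decode_sam_flag` and the short labels are our own; `samtools flags 3340` gives an equivalent breakdown if samtools is installed.

```shell
# Print the names of the SAM flag bits that are set in a decimal mask.
# Bit values follow the SAM specification; the labels are informal.
# `decode_sam_flag` is a hypothetical helper, not part of HapHiC or samtools.
decode_sam_flag() {
    mask=$1
    for entry in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                 16:reverse 32:mate_reverse 64:read1 128:read2 \
                 256:secondary 512:qc_fail 1024:duplicate 2048:supplementary
    do
        bit=${entry%%:*}
        name=${entry#*:}
        if [ $(( mask & bit )) -ne 0 ]; then
            echo "$name"
        fi
    done
}

decode_sam_flag 3340
# prints: unmapped, mate_unmapped, secondary, duplicate, supplementary (one per line)
```

That is, 3340 = 4 + 8 + 256 + 1024 + 2048, so `-F 3340` drops reads that are unmapped, have an unmapped mate, or are secondary, duplicate, or supplementary alignments.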
[!NOTE]
- Here, `asm.fa` can be haplotype-collapsed contigs (e.g., `p_ctg` in hifiasm), haplotype-phased unitigs (e.g., `p_utg` in hifiasm), or one or more sets of haplotype-resolved contigs (e.g., `hap*.p_ctg` in hifiasm). In addition, `asm.fa` may also be scaffolds output by other scaffolders.
- You can prepare the BAM file according to your own preferences or requirements, but DO NOT sort it by coordinate. If your BAM file is already sorted by coordinate, you need to re-sort it by read name (`samtools sort -n`).
- We DO NOT recommend the Juicer pipeline for Hi-C read alignment, particularly for haplotype-phased assemblies.
- samblaster, which is used to mark PCR duplicates, is a separate download.
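To check whether an existing BAM file declares coordinate sorting, one option is to inspect the `SO:` tag on the `@HD` header line. The sketch below reads a SAM header from standard input and reports that tag. The helper name `bam_sort_order` is hypothetical, and a missing or stale tag does not prove the actual read order, so treat the result as a hint only.

```shell
# Report the sort order declared in a SAM/BAM header read from stdin.
# The @HD line may carry an SO: tag (unknown, unsorted, queryname, coordinate).
# `bam_sort_order` is a hypothetical helper, not part of HapHiC or samtools.
bam_sort_order() {
    awk -F '\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
        }
        END { if (!found) print "undeclared" }
    '
}

# Usage (requires samtools):
# samtools view -H HiC.filtered.bam | bam_sort_order
```

If this reports `coordinate`, re-sort the file by read name with `samtools sort -n` before passing it to HapHiC.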
<span id="pipeline">Run HapHiC scaffolding pipeline</span>
(i) One-line command. HapHiC provides a one-line command haphic pipeline to execute the entire scaffolding pipeline. The required parameters are:
- `asm.fa`, your genome assembly file in FASTA format.
- `HiC.filtered.bam`, the BAM file prepared in the previous step (the `.pairs` file output by chromap is also acceptable since version 1.0.3).
- `nchrs`, the number of chromosomes present in the assembly, and also the expected number of output scaffolds.
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs
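When the pipeline is launched from a script, it can help to validate the three required arguments first. The following is a minimal sketch under our own assumptions: the helper `haphic_pipeline_cmd` and the `HAPHIC` variable are hypothetical, and the function only prints the command rather than running it.

```shell
# Validate the three required arguments, then print the command to run.
# `haphic_pipeline_cmd` and the HAPHIC variable are hypothetical conveniences,
# not part of HapHiC. Adjust HAPHIC to your installation path.
HAPHIC=${HAPHIC:-/path/to/HapHiC/haphic}

haphic_pipeline_cmd() {
    asm=$1; hic=$2; nchrs=$3
    [ -r "$asm" ] || { echo "error: cannot read assembly: $asm" >&2; return 1; }
    [ -r "$hic" ] || { echo "error: cannot read Hi-C alignments: $hic" >&2; return 1; }
    case $nchrs in
        ''|*[!0-9]*|0)
            echo "error: nchrs must be a positive integer: $nchrs" >&2
            return 1 ;;
    esac
    echo "$HAPHIC pipeline $asm $hic $nchrs"
}

# Usage: print the command, inspect it, then run it yourself.
# haphic_pipeline_cmd asm.fa HiC.filtered.bam 24
```

Printing instead of executing keeps the sketch side-effect free; the value 24 in the usage line is only a placeholder for your own chromosome number.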
(ii) Restriction site. The default restriction site is GATC (MboI/DpnII). You can modify this using the --RE parameter. If you are unsure or if your Hi-C library was constructed without restriction enzymes (REs), it is acceptable to leave it as the default.
# For HindIII
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --RE "AAGCTT"
# For Arima two-enzyme chemistry
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --RE "GATC,GANTC"
# For Arima four-enzyme chemistry
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --RE "GATC,GANTC,CTNAG,TTAA"
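The site strings above can be collected in a small lookup so that scripts refer to a library type instead of a raw sequence. This is a sketch only: the function `re_sites` and the labels `mboi`, `dpnii`, `hindiii`, `arima2`, and `arima4` are our own hypothetical names, while the site strings are the ones listed above.

```shell
# Map a Hi-C library label to the value passed to --RE.
# The labels are hypothetical; the site strings are those listed in this README.
re_sites() {
    case $1 in
        mboi|dpnii) echo "GATC" ;;
        hindiii)    echo "AAGCTT" ;;
        arima2)     echo "GATC,GANTC" ;;
        arima4)     echo "GATC,GANTC,CTNAG,TTAA" ;;
        *)          echo "error: unknown library type: $1" >&2; return 1 ;;
    esac
}

# Usage:
# /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --RE "$(re_sites arima2)"
```

Unknown labels return a non-zero status rather than silently falling back to the default site.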
(iii) Contig correction. To correct misjoined contigs based on Hi-C linking information, use --correct_nrounds to enable assembly correction and set the number of correction rounds. For example:
# Typically, two rounds of assembly correction are enough
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --correct_nrounds 2
[!NOTE]
Nowadays, assemblers like hifiasm rarely produce misjoins when using high-quality long reads. Therefore, this parameter is usually unnecessary and may even reduce assembly contiguity. Even if there are a few assembly errors, they can be manually corrected in Juicebox by breaking the misjoins.
(iv) Switch error. If your input assembly is haplotype-phased and has a high switch error rate (often introduced by assemblers when the sequence divergence between haplotypes is very low), use --remove_allelic_links to remove Hi-C links between allelic contigs, thereby increasing tolerance to such errors. The value should be the ploidy of the assembly. For example:
# For haplotype-phased assemblies of autotetraploids, set the parameter to 4
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --remove_allelic_links 4
[!NOTE]
If your input assembly is haplotype-phased and the Hi-C reads are aligned using other methods like chromap, we also recommend including this parameter to mitigate the adverse effects of incorrect mapping.
(v) Performance. Use --threads to set the number of threads for BAM file reading, and --processes to create multiple processes for contig ordering and orientation. For example:
$ /path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --threads 8 --processes 8
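On shared machines, a script may want to avoid requesting more workers than there are CPUs. The sketch below caps a requested count at the number of online CPUs reported by `getconf`. The helper `cap_to_cpus` is hypothetical, and using the same capped value for both `--threads` and `--processes` is our assumption, not a recommendation from HapHiC.

```shell
# Cap a requested worker count at the number of online CPUs.
# `cap_to_cpus` is a hypothetical helper, not part of HapHiC.
cap_to_cpus() {
    requested=$1
    # Fall back to 1 if the CPU count cannot be determined.
    cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$requested" -gt "$cpus" ]; then
        echo "$cpus"
    else
        echo "$requested"
    fi
}

n=$(cap_to_cpus 8)
# Print the resulting command instead of running it.
echo "/path/to/HapHiC/haphic pipeline asm.fa HiC.filtered.bam nchrs --threads $n --processes $n"
```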
Parameters
For more information, run haphic pipeline --help.
Final outputs
- 01.cluster/corrected_asm.fa: The corrected contigs in FASTA format. This file is generated only when assembly correction is enabled.
- 04.build/scaffolds.agp: A SALSA-style [AGP file](ht
