GCI
A program for assessing the T2T genome continuity
Install / Use
/learn @yeeus/GCIREADME
Genome Continuity Inspector
About
Genome Continuity Inspector (GCI) is an assembly assessment tool for high-quality genomes (e.g. T2T genomes). After stringently filtering the alignments generated by mapping long reads (PacBio HiFi and/or Oxford Nanopore long reads) back to the genome assembly, GCI will report potential assembly issues and also a score to quantify the continuity of assembly. Moreover, GCI provides many useful utility scripts, either filtering alignments in the GCI way or plotting curated alignment depth to manually detect assembly quality in particular regions.

Contents
Requirements
- canu (optional, but wanted for trio-binning long reads)
- seqkit (optional, but wanted for trio-binning long reads, if phased by HiC)
- minimap2 (optional, but wanted for mapping)
- winnowmap (optional, but wanted for mapping)
- veritymap (optional, for mapping)
- samtools (for sam/bam processing)
- paftools.js (for converting sam to paf)
As for GCI, it requires:
- python3.x (tested in python3.10)
- pysam (stable version)
- biopython (stable version)
- numpy (stable version)
- matplotlib (stable version)
- bamsnap (optional, for plotting in utility
filter_bam.py)
Bioconda
Thanks to @SwiftSeal, now you can install GCI with bioconda.
conda install bioconda::gci
Parameters
python GCI.py --help
usage: GCI.py [-r FILE] [--hifi [...]] [--nano [...]] [--chrs] [-R FILE] [-ts INT] [-dp FLOAT] [-t INT] [-d PATH] [-o STR] [-mq INT] [--mq-cutoff INT] [-ip FLOAT] [-op FLOAT]
[-cp FLOAT] [-fl INT] [-p] [-dmin FLOAT] [-dmax FLOAT] [-ws INT] [-it STR] [-f] [-h] [-v]
A program for assessing the T2T genome
Input/Output:
-r FILE, --reference FILE
The reference file
--hifi [ ...] PacBio HiFi reads alignment files (at least one bam file)
--nano [ ...] Oxford Nanopore long reads alignment files (at least one bam file)
--chrs A list of chromosomes separated by comma
-R FILE, --regions FILE
Bed file containing regions
Be cautious! If both specify `--chrs` and `--regions`, chromosomes in regions bed file should be included in the chromosomes list
-ts INT, --threshold INT
The threshold of depth to be reported as issues [0]
-dp FLOAT, --dist-percent FLOAT
The distance between the candidate gap intervals for combining in chromosome units [0.005]
-t INT, --threads INT
Number of threads [1]
-d PATH The directory of output files [.]
-o STR, --output STR Prefix of output files [GCI]
Filter Options:
-mq INT, --map-qual INT
Minium mapping quality for alignments [30]
--mq-cutoff INT The cutoff of mapping quality for keeping the alignment [50]
(only used when inputting more than one alignment files)
-ip FLOAT, --iden-percent FLOAT
Minimum identity (num_match_res/len_aln) of alignments [0.9]
-op FLOAT, --ovlp-percent FLOAT
Minimum overlapping percentage of the same read alignment if inputting more than one alignment files [0.9]
-cp FLOAT, --clip-percent FLOAT
Maximum clipped percentage of the alignment [0.1]
-fl INT, --flank-len INT
The flanking length of the clipped bases [15]
Plot Options:
-p, --plot Visualize the finally filtered whole genome (and regions if providing the option `-R`) depth [False]
-dmin FLOAT, --depth-min FLOAT
Minimum depth in folds of mean coverage for plotting [0.1]
-dmax FLOAT, --depth-max FLOAT
Maximum depth in folds of mean coverage for plotting [4.0]
-ws INT, --window-size INT
The window size when plotting [50000]
-it STR, --image-type STR
The format of the output images: png or pdf [png]
Other Options:
-f, --force Force rewriting of existing files [False]
-h, --help Show this help message and exit
-v, --version Show program's version number and exit
Examples:
python GCI.py -r ref.fa --hifi hifi.bam hifi.paf ... --nano nano.bam nano.paf ...
Usage
- (For haplotype-resolved genome) Prepare trio-binned HiFi / ONT reads (if parental HiFi / ONT sequencing data are available, please skip this step)
# we recommend using canu for trio-binning if parental NGS reads are available
canu -haplotype \
-p $prefix -d $dictionary \
genomeSize=3g \
maxThreads=$threads \
-haplotypePat $pat \
-haplotypeMat $mat \
-pacbio-hifi $hifi \ ## binning ONT reads with -nanopore $ont
useGrid=false
# because there would be unknown reads that couldn't be reliably binned, we suggest combining them with binned reads
cat ${canu_mat.fa.gz} ${canu_unknown.fa.gz} > ${canu_mat.final.fa.gz}
cat ${canu_pat.fa.gz} ${canu_unknown.fa.gz} > ${canu_pat.final.fa.gz}
# If parental NGS reads are not accessible (e.g., the genome was phased by HiC),
# one can map HiFi / ONT reads to combined genomes to select the primary (& supplementary) alignments,
# which could be used for downstream analysis
minimap2 -t $threads -ax map-hifi <(cat $hap1_asm $hap2_asm) $hifi > minimap2.hifi.sam
paftools.js sam2paf -p minimap2.hifi.sam | cut -f 1,6 > minimap2.hifi.list
for hap in hap1 hap2; do
grep "$hap" minimap2.hifi.list | cut -f1 > minimap2.hifi.$hap.list
seqkit grep -f minimap2.hifi.$hap.list $hifi -o hifi.$hap.fq
done
- Map HiFi and/or ONT reads to assemblies (using minimap2 and winnowmap)
# minimap2
minimap2 -t $threads -ax map-hifi $mat_asm $mat_hifi > ${mat.minimap2.hifi.sam} ## mapping ONT reads with -ax map-ont
samtools view -@ $threads -Sb ${mat.minimap2.hifi.sam} | samtools sort -@ $threads -o ${mat.minimap2.hifi.bam}
samtools index ${mat.minimap2.hifi.bam} ## this is necessary!!!
## If providing paf file, we recommend using paftools to convert sam to paf
paftools.js sam2paf ${mat.minimap2.hifi.sam} | sort -k6,6V -k8,8n > ${mat.minimap2.hifi.paf} ## please sort the paf file because our program don't automatically sort the file by the targets names!
# winnowmap
meryl count k=15 output $mat_merylDB $mat_asm
meryl print greater-than distinct=0.9998 $mat_merylDB > $mat_repetitive_k15.txt
winnowmap -W $mat_repetitive_k15.txt -ax map-pb $mat_asm $mat_hifi > ${mat.winnowmap.hifi.sam}
samtools view -@ $threads -Sb ${mat.winnowmap.hifi.sam} | samtools sort -@ $threads -o ${mat.winnowmap.hifi.bam}
samtools index ${mat.winnowmap.hifi.bam}
paftools.js sam2paf ${mat.winnowmap.hifi.sam} | sort -k6,6V -k8,8n > ${mat.winnowmap.hifi.paf}
- Filter the mapping files and get the genome continuity index
We recommend to input only one alignment file per software (minimap2 and winnowmap) using the same set of long reads. Importantly, GCI needs at least one bam file for one type of long reads, which means you'll get errors when providing only paf files.
# Before this, make sure you've generated the index file (.bai) for bam files
# We recommend to input one bam and one paf file produced by two softwares (for example, one bam file from winnowmap and one paf file from minimap2)
# which would be theoretically faster and consume less memory but at the cost of lower sensitivity relative to all bams
# PDF is recommended because PNG file may lose some details though GCI will output png files by default
python GCI.py -r ref.fa --hifi hifi.bam hifi.paf --nano ont.bam ont.paf -t 8 -p -it pdf ...
Note: as #21 said, in GCI, -mq/--map-qual is not just a “keep more reads → cover more bases” switch. Lowering -mq (e.g., to 0) brings in many low-MAPQ alignments, which are typically multi-mapping reads from repetitive/low-complexity regions. Therefore, for uniformity analysis, using a strict MQ threshold (e.g., 30-60) is often more interpretable.
Test data
You can first download the test files from zenodo
tar zxf example.tar.gz
# Then you can run GCI on the test data (it will take some seconds)
python GCI.py -r example/MH63.fasta \
--hifi example/MH63_winnowmap_hifi.subsample.bam example/MH63.minimap2_hifi.subsample.paf \
-d my_test -o MH63 -f
# And you will get output files my_test/MH63.depth.gz, my_test/MH63.0.depth.bed, and my_test/MH63.gci
# which will be the same as files provided in folder example
Outputs
only providing one type of reads
- ${dictionay}/
- (${prefix}.gaps.bed) ## the positions of Ns in the assembly
- ${prefix}.depth.gz ## the gzipped whole-genome depth file
- ${prefix}.${threshold}.depth.bed ## the merged depth file in bed format
- ${
Related Skills
node-connect
342.0kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
84.7kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
342.0kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
84.7kCommit, push, and open a PR
