SpecImmune
Accurate Typing of Diverse Immune-Related Gene Families from Long-Read Sequencing Data. It can handle HLA, KIR, IG, TCR, CYP gene families. It supports both WGS and amplicon data. It supports PacBio and Nanopore data.
A Scalable Framework for Comprehensive Typing of Polymorphic Immune Genes from Long-Read Data
SpecImmune is a bioinformatics software tool designed to accurately type five key immune-related gene families—HLA, KIR, IG, TCR, and CYP—from long-read sequencing data. These genes are critical for human immune functions and drug metabolism, but their genetic complexity makes them difficult to decode using traditional short-read sequencing methods. SpecImmune leverages the advantages of long-read sequencing technologies, such as Nanopore and PacBio, to provide highly accurate genotyping of these gene families.
Key features of SpecImmune include:
- Accurate Typing of Immune-Related Genes: SpecImmune types HLA, KIR, IG, TCR, and CYP genes with high accuracy by assigning long reads to specific loci and selecting the best-matching alleles from a reference database.
- Broad Compatibility: It supports whole-genome sequencing (WGS) and targeted amplicon sequencing data from long-read platforms such as ONT and PacBio.
- Superior Performance: SpecImmune outperforms existing tools such as SpecHLA, HLA*LA, and Pangu in typing accuracy, particularly for HLA and CYP genes. It is also the only tool capable of typing KIR and germline IG/TCR from long-read data.
- Consensus Sequence Reconstruction: It bins reads to alleles and reconstructs consensus sequences, yielding high-quality haplotype sequences for each typed gene.
- Visualization of Results: SpecImmune provides IGV-like visual reports that let users inspect novel variants and the confidence of typing results, making findings easier to interpret and validate.
- Efficient and User-Friendly: SpecImmune is computationally efficient enough to run on personal computers, which makes it convenient to use in clinical settings.
- Easy to extend to other genes: Detailed instructions are provided for extending SpecImmune to type other genes.
Quick start
Install
First, create the env with conda or mamba, and activate the env.
Use conda/mamba
git clone git@github.com:deepomicslab/SpecImmune.git
cd SpecImmune/
conda env create -n SpecImmune -f environment.yml
conda activate SpecImmune
After creating and activating the env, install dysgu:
pip install --no-deps dysgu==1.6.2
Second, make the software in bin/ executable.
chmod +x -R bin/*
Use Docker
cd docker/ and see detailed instructions there.
Database construction
Third, build the allele database. You can build a database for all gene families, or just the ones you need. For HLA, KIR, and IG/TCR:
python scripts/make_db.py -o ./db -i HLA
python scripts/make_db.py -o ./db -i KIR
python scripts/make_db.py -o ./db -i IG_TR
For CYP, download the complete PharmVar database from PharmVar, unzip it, merge the alleles of all CYP loci into a single fasta file, and provide the path to that fasta file to SpecImmune:
find pharmvar-* -type f -name "*.fasta" -exec cat {} + > CYP.all.fasta ## replace it with your local pharmvar file
python scripts/make_db.py -o ./db -i CYP --CYP_fa CYP.all.fasta
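If `find` is not available, the merge step can be sketched in Python. This is a minimal illustration, not part of SpecImmune: the function name `merge_fasta` and its arguments are hypothetical, and it assumes the unzipped PharmVar download sits under a single directory.

```python
from pathlib import Path


def merge_fasta(search_root: str, output_fasta: str) -> int:
    """Concatenate every *.fasta file under search_root into output_fasta.

    Hypothetical helper mirroring the `find ... -exec cat` command.
    Returns the number of files merged.
    """
    out_path = Path(output_fasta).resolve()
    # Sort for a reproducible record order, and skip the output file itself
    # in case it lives under search_root from an earlier run.
    sources = sorted(
        p for p in Path(search_root).rglob("*.fasta")
        if p.is_file() and p.resolve() != out_path
    )
    with open(out_path, "w") as out:
        for src in sources:
            text = src.read_text()
            out.write(text)
            # A file without a trailing newline would otherwise glue the
            # next header onto the last sequence line.
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(sources)
```

Calling `merge_fasta("path/to/unzipped_pharmvar", "CYP.all.fasta")` would then produce the file passed to `--CYP_fa`.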
When running SpecImmune, pass the path to db/ with the --db parameter.
For IG/TCR and CYP typing, the no-alt hg38 reference is required. It can be downloaded at no_alt_hg38, or you can generate it yourself.
Run & test
Run SpecImmune with
python3 scripts/main.py -h
To test the installation, go to the test/ folder, run SpecImmune with the provided scripts, and check the results.
Note:
- SpecImmune now supports Linux and Windows WSL systems.
- For short-read data, please use SpecHLA.
Basic Usage
Main functions
| Scripts | Description |
| --- | --- |
| scripts/ExtractReads.sh | Extract gene-region-related reads from enrichment-free data. |
| scripts/make_db.py | Construct the dependent database. |
| scripts/main.py | Typing with Nanopore or PacBio data. |
| evaluation/ | Scripts for evaluating the performance and real-data analyses. |
| simulation/ | Generate simulated data. |
Extract gene-region-related reads
For enrichment-free data, first extract the reads from the target gene region; otherwise typing will be slow. Map the reads to the whole hg38 reference (chromosome names should look like chr1, chr2, ..., and the reference should contain alternative contigs and alleles), then use ExtractReads.sh to extract the reads:
Usage: ExtractReads.sh -s <sample_id> -i <input_bam_or_cram> -g <gene_class> -o <output_directory> [-r <reference>]
-s Sample ID or gene ID (required)
-i Input BAM or CRAM file mapped to hg38 (required)
-g Gene class, one of: HLA, KIR, CYP, IG_TR (required)
-o Output directory (required)
-r Reference file (required if input is CRAM)
Note:
The whole hg38 reference should contain alternative contigs and alleles so that as many reads as possible are retained. For example, it should contain many different HLA alleles.
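When read extraction is scripted for many samples, it helps to check the arguments before launching the shell script. The sketch below builds the argument list from the usage text above and enforces the rule that -r is required for CRAM input. The function name, the `script` default path, and the suffix-based CRAM detection are assumptions made for illustration.

```python
from pathlib import Path
from typing import List, Optional

# Gene classes accepted by -g, as listed in the usage text.
GENE_CLASSES = {"HLA", "KIR", "CYP", "IG_TR"}


def extract_reads_command(
    sample_id: str,
    input_alignment: str,
    gene_class: str,
    out_dir: str,
    reference: Optional[str] = None,
    script: str = "scripts/ExtractReads.sh",  # assumed location in the repo
) -> List[str]:
    """Return the argv list for ExtractReads.sh (hypothetical wrapper)."""
    if gene_class not in GENE_CLASSES:
        raise ValueError(f"gene class must be one of {sorted(GENE_CLASSES)}")
    # Assumption: a CRAM input is recognised by its file suffix.
    is_cram = Path(input_alignment).suffix.lower() == ".cram"
    if is_cram and reference is None:
        raise ValueError("-r <reference> is required when the input is CRAM")
    cmd = [
        "bash", script,
        "-s", sample_id,
        "-i", input_alignment,
        "-g", gene_class,
        "-o", out_dir,
    ]
    if reference is not None:
        cmd += ["-r", reference]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`.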
Typing
HLA Typing
Perform four-field HLA typing by
python3 SpecImmune/scripts/main.py \
-r <fastq> \
-j <threads> \
-i HLA \
-n <sample_id> \
-o <outdir> \
--db SpecImmune/db \
-y <datatype>
Example cmd:
python3 SpecImmune/scripts/main.py -n $sample -o $outdir -j 15 -y pacbio -i HLA -r $fq --db ../db/
Perform full-resolution HLA typing with long-read RNA data
python3 SpecImmune/scripts/main.py \
-r <fastq> \
-j <threads> \
-i HLA \
-n <sample_id> \
-o <outdir> \
--db SpecImmune/db \
--seq_tech rna \
--RNA_type traditional
KIR Typing
python3 SpecImmune/scripts/main.py \
-r <fastq> \
-j <threads> \
-i KIR \
-n <sample_id> \
-o <outdir> \
--db SpecImmune/db \
-y <datatype>
Example cmd:
python3 SpecImmune/scripts/main.py -n $sample -o $outdir -j 10 -y pacbio -i KIR -r $fq --hete_p 0.2
CYP Typing
python3 SpecImmune/scripts/main.py \
-r <fastq> \
-j <threads> \
-i CYP \
-n <sample_id> \
-o <outdir> \
--hg38 <no_alt_ref> \
--db SpecImmune/db \
-y <datatype>
Example cmd:
python3 SpecImmune/scripts/main.py --hg38 $ref -n $sample -o $outdir -j 10 -y nanopore -i CYP -r $fq --align_method_1 minimap2
IG&TCR Typing
python3 SpecImmune/scripts/main.py \
-r <fastq> \
-j <threads> \
-i IG_TR \
-n <sample_id> \
-o <outdir> \
--db SpecImmune/db \
-y <datatype> \
--hg38 <no_alt_ref>
Example cmd:
python3 SpecImmune/scripts/main.py --hg38 $ref -n $sample -o $outdir -j 10 -y pacbio -i IG_TR -r $fq
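The typing commands for the four gene classes differ mainly in the -i value and in whether --hg38 is supplied (needed for CYP and IG_TR). For batch runs, a small helper can assemble the argument list and catch a missing reference early. This is a sketch only: `typing_command` is a hypothetical name, and it uses just the flags documented in this README.

```python
from typing import List, Optional

GENE_CLASSES = {"HLA", "KIR", "CYP", "IG_TR"}
NEEDS_HG38 = {"CYP", "IG_TR"}  # these classes need the no-alt hg38 reference
READ_TYPES = {"nanopore", "pacbio", "pacbio-hifi"}


def typing_command(
    fastq: str,
    sample_id: str,
    out_dir: str,
    gene_class: str = "HLA",
    read_type: str = "pacbio",
    threads: int = 5,
    db: Optional[str] = None,
    hg38: Optional[str] = None,
    main_script: str = "SpecImmune/scripts/main.py",
) -> List[str]:
    """Return the argv list for one main.py typing run (hypothetical wrapper)."""
    if gene_class not in GENE_CLASSES:
        raise ValueError(f"gene class must be one of {sorted(GENE_CLASSES)}")
    if read_type not in READ_TYPES:
        raise ValueError(f"read type must be one of {sorted(READ_TYPES)}")
    if gene_class in NEEDS_HG38 and hg38 is None:
        raise ValueError(f"{gene_class} typing requires --hg38 <no_alt_ref>")
    cmd = [
        "python3", main_script,
        "-r", fastq,
        "-n", sample_id,
        "-o", out_dir,
        "-i", gene_class,
        "-y", read_type,
        "-j", str(threads),
    ]
    if db is not None:
        cmd += ["--db", db]
    if hg38 is not None:
        cmd += ["--hg38", hg38]
    return cmd
```

Each returned list can be run with `subprocess.run(cmd, check=True)`, for example inside a loop over sample IDs.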
Using DeepVariant for Small Variant Calling
SpecImmune uses longshot as the default tool for small variant calling. To switch to deepvariant, ensure you have set up the Singularity environment as described below.
Add the following options to the command for any module:
--snv_tool deepvariant
--dv_sif <path_to_deepvariant_sif>
Prerequisites
Install Singularity and SquashFS Tools
To use deepvariant for small variant calling, install Singularity and SquashFS tools:
conda install -c conda-forge singularity
conda install -c conda-forge squashfs-tools
DeepVariant Setup
Follow the official DeepVariant instructions to download the desired version of the DeepVariant Singularity image. For example:
# Set the version of DeepVariant
BIN_VERSION="1.8.0"
# Pull the DeepVariant Singularity image
singularity pull docker://google/deepvariant:"${BIN_VERSION}"
This will create a Singularity image file named deepvariant_${BIN_VERSION}.sif.
Example of using DeepVariant:
python3 SpecImmune/scripts/main.py \
-r <fastq> \
-j <threads> \
-i HLA \
-n <name> \
-o <outdir> \
--align_method_1 minimap2 \
-y <datatype> \
--db <db> \
--snv_tool deepvariant \
--dv_sif ../deepvariant_${BIN_VERSION}.sif
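In a scripted pipeline, the two DeepVariant options can be generated from the image version so that the .sif name stays consistent with the `singularity pull` step. The helper below is a hypothetical sketch. It assumes the image keeps the default name deepvariant_<version>.sif and optionally checks that the file exists before the run starts.

```python
from pathlib import Path
from typing import List


def deepvariant_options(
    bin_version: str,
    sif_dir: str = ".",
    check_exists: bool = True,
) -> List[str]:
    """Return the extra main.py options that switch SNV calling to DeepVariant.

    Hypothetical helper: assumes the image file is named
    deepvariant_<bin_version>.sif and is stored in sif_dir.
    """
    sif = Path(sif_dir) / f"deepvariant_{bin_version}.sif"
    if check_exists and not sif.is_file():
        raise FileNotFoundError(f"DeepVariant image not found: {sif}")
    return ["--snv_tool", "deepvariant", "--dv_sif", str(sif)]
```

The returned options are appended to the argument list of any typing command.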
Commands
Full arguments can be seen in
usage: python3 main.py -h
Typing with only long-read data.
Required arguments:
-r Long-read fastq file. PacBio or Nanopore. (default: None)
-n Sample ID (default: None)
-o The output folder to store the typing results. (default: ./output)
-i HLA,KIR,CYP,IG_TR (default: HLA)
Optional arguments:
-j Number of threads. (default: 5)
-k The mean depth in a window lower than this value will be masked by N, set 0 to avoid masking (default: 5)
-y Read type, [nanopore|pacbio|pacbio-hifi]. (default: pacbio)
--db Database folder, which can be obtained by scripts/make_db.py (default: /data4/wangxuedong/test_specimmune/SpecImmune/scripts/../db/)
--hg38 No-alt hg38 Reference fasta file, used by IG_TR and CYP typing (default: None)
-f, --first_run Set False for rerun (default: True)
--min_identity Minimum alignment identity to assign a read to an allele. (default: 0.85)
--hete_p Minor haplotype frequency lower than this value is regarded as homology in best-matched allele pair selection. (default: 0.3)
--candidate_allele_num
Maintain this number of alleles for best-matched allele pair selection. (default: 200)
--min_read_num Min support read number for each locus. (default: 2)
--max_read_num Max support re
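To make the meaning of -k and --hete_p more concrete, the sketch below illustrates the two rules as worded in the help text. It is not SpecImmune's implementation: the function names, the non-overlapping window layout, the window size, and the use of read counts as haplotype frequency are all assumptions made for illustration.

```python
from typing import List


def mask_low_depth(
    sequence: str,
    depths: List[int],
    min_mean_depth: float = 5,  # corresponds to -k
    window: int = 20,           # assumed window size, not taken from SpecImmune
) -> str:
    """Replace each window whose mean depth is below the threshold with N."""
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    if min_mean_depth <= 0:
        return sequence  # -k 0 disables masking
    masked = []
    for start in range(0, len(sequence), window):
        segment = sequence[start:start + window]
        chunk = depths[start:start + window]
        mean_depth = sum(chunk) / len(chunk)
        masked.append("N" * len(segment) if mean_depth < min_mean_depth else segment)
    return "".join(masked)


def is_homozygous(reads_hap1: int, reads_hap2: int, hete_p: float = 0.3) -> bool:
    """Treat the locus as homozygous when the minor haplotype is too rare.

    Assumption: haplotype frequency is estimated from supporting read counts.
    """
    total = reads_hap1 + reads_hap2
    if total == 0:
        raise ValueError("no supporting reads")
    minor_fraction = min(reads_hap1, reads_hap2) / total
    return minor_fraction < hete_p
```

Under these assumptions, a lower --hete_p (such as the 0.2 used in the KIR example) makes the heterozygous call more permissive, because a rarer minor haplotype is still accepted.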
