SpecImmune

Accurate Typing of Diverse Immune-Related Gene Families from Long-Read Sequencing Data. It can handle HLA, KIR, IG, TCR, CYP gene families. It supports both WGS and amplicon data. It supports PacBio and Nanopore data.

A Scalable Framework for Comprehensive Typing of Polymorphic Immune Genes from Long-Read Data

SpecImmune is a bioinformatics software tool designed to accurately type five key immune-related gene families—HLA, KIR, IG, TCR, and CYP—from long-read sequencing data. These genes are critical for human immune functions and drug metabolism, but their genetic complexity makes them difficult to decode using traditional short-read sequencing methods. SpecImmune leverages the advantages of long-read sequencing technologies, such as Nanopore and PacBio, to provide highly accurate genotyping of these gene families.

Key features of SpecImmune include:

  1. Accurate Typing of Immune-Related Genes
    SpecImmune can type HLA, KIR, IG, TCR, and CYP genes with high accuracy by categorizing long reads to specific loci and selecting the best-matching alleles from a reference database.

  2. Broad Compatibility
    It supports whole-genome sequencing (WGS) and targeted amplicon sequencing data from various long-read sequencing platforms like ONT and PacBio.

  3. Superior Performance
    SpecImmune outperforms existing tools such as SpecHLA, HLA*LA, and Pangu in typing accuracy, particularly for HLA and CYP genes. It is also the only tool capable of typing KIR and germline IG/TCR from long-read data.

  4. Consensus Sequence Reconstruction
    It bins reads to alleles and reconstructs consensus sequences, ensuring high-quality haplotype sequences for each typed gene.

  5. Visualization of Results
    SpecImmune provides IGV-like visual reports, allowing users to inspect novel variants and the confidence of typing results, which makes findings easier to interpret and validate.

  6. Efficient and User-Friendly
    SpecImmune is computationally efficient, making it suitable for use on personal computers, enabling convenient use in clinical settings.

  7. Easy to Extend to Other Genes
    Detailed instructions are provided for extending SpecImmune to type other genes.

Quick start

Install

First, create the environment with conda or mamba and activate it.

Use conda/mamba

git clone git@github.com:deepomicslab/SpecImmune.git
cd SpecImmune/
conda env create -n SpecImmune -f environment.yml
conda activate SpecImmune

After creating and activating the env, install dysgu:

pip install --no-deps dysgu==1.6.2

Second, make the software in bin/ executable.

chmod +x -R bin/*

Use Docker

See the detailed instructions in the docker/ directory.

Database construction

Third, build the allele database. You can build a database for all gene families, or just the ones you need. For HLA, KIR, and IG/TCR:

python scripts/make_db.py -o ./db  -i HLA

python scripts/make_db.py -o ./db  -i KIR

python scripts/make_db.py -o ./db  -i IG_TR

For CYP, download the complete PharmVar database from PharmVar, unzip it, merge the alleles of all CYP loci into a single FASTA file, and provide the path to that file to SpecImmune:

find pharmvar-* -type f -name "*.fasta" -exec cat {} + > CYP.all.fasta ## replace it with your local pharmvar file
python scripts/make_db.py -o ./db  -i CYP --CYP_fa CYP.all.fasta
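Where `find` is inconvenient, the merge step can also be done in Python. The following is a minimal sketch, not part of SpecImmune: the function name `merge_fasta` is hypothetical, and it assumes a single unzipped PharmVar directory whose allele files end in `.fasta`.

```python
from pathlib import Path


def merge_fasta(root: Path, out_path: Path, pattern: str = "*.fasta") -> int:
    """Concatenate every FASTA file under `root` into `out_path`.

    Returns the number of files merged. (Hypothetical helper for illustration.)
    """
    out_resolved = out_path.resolve()
    # Collect the inputs first, skipping the output file in case it lives under `root`.
    sources = sorted(
        p for p in root.rglob(pattern)
        if p.is_file() and p.resolve() != out_resolved
    )
    with out_path.open("w") as out:
        for src in sources:
            text = src.read_text()
            out.write(text)
            # Add a newline if the file lacks one, so the next header starts on its own line.
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(sources)
```

The resulting file can then be passed to `scripts/make_db.py` through `--CYP_fa`, as in the command above.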

When running SpecImmune, pass the path to db/ with the --db parameter.

IG/TCR and CYP typing require the no-alt hg38 reference, which can be downloaded at no_alt_hg38. You can also generate it yourself.

Run & test

Run SpecImmune with

python3 scripts/main.py -h

To test the installation, go to the test/ folder, run SpecImmune with the provided scripts, and check the results.

Note:

  • SpecImmune supports Linux and Windows WSL systems.
  • For short-read data, please use SpecHLA.

Basic Usage

Main functions

| Scripts | Description |
| --- | --- |
| scripts/ExtractReads.sh | Extract gene-region-related reads from enrichment-free data. |
| scripts/make_db.py | Construct the dependent database. |
| scripts/main.py | Typing with Nanopore or PacBio data. |
| evaluation/ | Scripts for evaluating the performance and real-data analyses. |
| simulation/ | Generate simulated data. |

Extract gene-region-related reads

For enrichment-free data, first extract the reads from the gene regions; otherwise, typing will be slow. Map the reads to the full hg38 reference (chromosome names should look like chr1, chr2, ..., and the reference should include alternative contigs and alleles), then extract the reads with ExtractReads.sh:

Usage: ExtractReads.sh -s <sample_id> -i <input_bam_or_cram> -g <gene_class> -o <output_directory> [-r <reference>]
  -s  Sample ID or gene ID (required)
  -i  Input BAM or CRAM file mapped to hg38 (required)
  -g  Gene class, one of: HLA, KIR, CYP, IG_TR (required)
  -o  Output directory (required)
  -r  Reference file (required if input is CRAM)

Note:

  • The full hg38 reference should contain alternative contigs and alleles so that as many reads as possible are retained. For example, it should contain many different HLA alleles.
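When extraction is scripted over many samples, it can help to validate the arguments before launching anything. The sketch below only assembles the argument list from the usage text above. The function name `extract_reads_cmd`, the default script path, and the choice to invoke the script through `bash` are assumptions made for illustration.

```python
from pathlib import Path
from typing import List, Optional

# Gene classes accepted by -g, as listed in the usage text.
GENE_CLASSES = {"HLA", "KIR", "CYP", "IG_TR"}


def extract_reads_cmd(
    sample_id: str,
    input_path: str,
    gene_class: str,
    outdir: str,
    reference: Optional[str] = None,
    script: str = "scripts/ExtractReads.sh",  # assumed location of the script
) -> List[str]:
    """Build (but do not run) an ExtractReads.sh command line."""
    if gene_class not in GENE_CLASSES:
        raise ValueError(f"gene class must be one of {sorted(GENE_CLASSES)}")
    # The usage text states that -r is required when the input is CRAM.
    if Path(input_path).suffix.lower() == ".cram" and reference is None:
        raise ValueError("a reference (-r) is required for CRAM input")
    cmd = ["bash", script, "-s", sample_id, "-i", input_path,
           "-g", gene_class, "-o", outdir]
    if reference is not None:
        cmd += ["-r", reference]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the inputs are in place.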

Typing

HLA Typing

Perform four-field HLA typing by

python3 SpecImmune/scripts/main.py \
        -r <fastq> \
        -j <threads> \
        -i HLA \
        -n <sample_id> \
        -o <outdir> \
        --db SpecImmune/db \
        -y <datatype> 

Example cmd:

python3 SpecImmune/scripts/main.py -n $sample -o $outdir -j 15 -y pacbio -i HLA -r $fq --db ../db/ 

Perform full-resolution HLA typing with long-read RNA data

python3 SpecImmune/scripts/main.py \
        -r <fastq> \
        -j <threads> \
        -i HLA \
        -n <sample_id> \
        -o <outdir> \
        --db SpecImmune/db  \
        --seq_tech rna \
        --RNA_type traditional

KIR Typing

python3 SpecImmune/scripts/main.py \
        -r <fastq> \
        -j <threads> \
        -i KIR \
        -n <sample_id> \
        -o <outdir> \
        --db SpecImmune/db \
        -y <datatype> 

Example cmd:

python3 SpecImmune/scripts/main.py -n $sample -o $outdir -j 10 -y pacbio -i KIR -r $fq --hete_p 0.2
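The example above lowers `--hete_p` to 0.2. According to the argument help, a locus whose minor haplotype frequency falls below this value is treated as a single-allele call during best-matched allele pair selection. The sketch below illustrates that kind of thresholding only. It is not SpecImmune's implementation: the function name `call_zygosity` is hypothetical, and using per-allele read counts as the frequency estimate is an assumption.

```python
def call_zygosity(reads_a: int, reads_b: int, hete_p: float = 0.3) -> str:
    """Classify a locus from the reads supporting two candidate alleles.

    Illustrative only: the minor allele's share of reads is compared with hete_p.
    """
    total = reads_a + reads_b
    if total == 0:
        raise ValueError("no reads support either allele")
    minor_freq = min(reads_a, reads_b) / total
    # Below the threshold, the minor allele is treated as noise.
    return "homozygous" if minor_freq < hete_p else "heterozygous"
```

With 75 and 25 supporting reads, the minor frequency is 0.25: the default threshold of 0.3 gives a homozygous call, while 0.2 keeps the locus heterozygous.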

CYP Typing

python3 SpecImmune/scripts/main.py \
        -r <fastq> \
        -j <threads> \
        -i CYP \
        -n <sample_id> \
        -o <outdir> \
        --hg38 <no_alt_ref> \
        --db SpecImmune/db \
        -y <datatype>

Example cmd:

python3 SpecImmune/scripts/main.py --hg38 $ref -n $sample -o $outdir -j 10 -y nanopore -i CYP -r $fq --align_method_1 minimap2

IG&TCR Typing

python3 SpecImmune/scripts/main.py \
        -r <fastq> \
        -j <threads> \
        -i IG_TR \
        -n <sample_id> \
        -o <outdir> \
        --db SpecImmune/db \
        -y <datatype> \
        --hg38 <no_alt_ref>

Example cmd:

python3 SpecImmune/scripts/main.py --hg38 $ref -n $sample -o $outdir -j 10 -y pacbio -i IG_TR -r $fq
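The four typing commands differ mainly in `-i` and in whether `--hg38` is needed. A small wrapper can encode that rule when batching samples. This is a sketch under stated assumptions: the function name `typing_cmd` and the default script path are hypothetical, while the flags and allowed values are taken from the commands and help text in this README.

```python
from typing import List, Optional

GENE_CLASSES = {"HLA", "KIR", "CYP", "IG_TR"}
READ_TYPES = {"nanopore", "pacbio", "pacbio-hifi"}
# CYP and IG_TR typing use the no-alt hg38 reference.
NEEDS_HG38 = {"CYP", "IG_TR"}


def typing_cmd(
    fastq: str,
    sample_id: str,
    outdir: str,
    gene_class: str,
    read_type: str = "pacbio",
    threads: int = 5,
    db: Optional[str] = None,
    hg38: Optional[str] = None,
    main_script: str = "SpecImmune/scripts/main.py",  # assumed checkout location
) -> List[str]:
    """Build (but do not run) a main.py typing command."""
    if gene_class not in GENE_CLASSES:
        raise ValueError(f"-i must be one of {sorted(GENE_CLASSES)}")
    if read_type not in READ_TYPES:
        raise ValueError(f"-y must be one of {sorted(READ_TYPES)}")
    if gene_class in NEEDS_HG38 and hg38 is None:
        raise ValueError(f"{gene_class} typing needs --hg38 <no_alt_ref>")
    cmd = ["python3", main_script, "-r", fastq, "-n", sample_id,
           "-o", outdir, "-i", gene_class, "-y", read_type, "-j", str(threads)]
    if db is not None:
        cmd += ["--db", db]
    if hg38 is not None:
        cmd += ["--hg38", hg38]
    return cmd
```

Building the list separately from running it keeps the validation testable without the SpecImmune environment.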

Using DeepVariant for Small Variant Calling

SpecImmune uses longshot as the default tool for small variant calling. To switch to deepvariant, ensure you have set up the Singularity environment as described below.

Add the following options to the command for any module:

--snv_tool deepvariant
--dv_sif <path_to_deepvariant_sif>

Prerequisites

Install Singularity and SquashFS Tools

To use deepvariant for small variant calling, install Singularity and SquashFS tools:

conda install -c conda-forge singularity
conda install -c conda-forge squashfs-tools

DeepVariant Setup

Follow the official DeepVariant instructions to download the desired version of the DeepVariant Singularity image. For example:

# Set the version of DeepVariant
BIN_VERSION="1.8.0"

# Pull the DeepVariant Singularity image
singularity pull docker://google/deepvariant:"${BIN_VERSION}"

This will create a Singularity image file named deepvariant_${BIN_VERSION}.sif.

Example of using DeepVariant:

python3 SpecImmune/scripts/main.py \
        -r <fastq> \
        -j <threads> \
        -i HLA \
        -n <name> \
        -o <outdir> \
        --align_method_1 minimap2 \
        -y <datatype> \
        --db <db> \
        --snv_tool deepvariant \
        --dv_sif ../deepvariant_${BIN_VERSION}.sif
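A missing image file is easier to catch before the run starts than partway through it. The following sketch derives the image name from the version string, following the `deepvariant_${BIN_VERSION}.sif` naming above, and returns the two extra options. The function name `deepvariant_args` and the `image_dir` parameter are hypothetical.

```python
from pathlib import Path
from typing import List


def deepvariant_args(bin_version: str, image_dir: str = ".") -> List[str]:
    """Return the extra main.py options for DeepVariant, checking the image exists."""
    sif = Path(image_dir) / f"deepvariant_{bin_version}.sif"
    if not sif.is_file():
        raise FileNotFoundError(f"DeepVariant image not found: {sif}")
    return ["--snv_tool", "deepvariant", "--dv_sif", str(sif)]
```

The returned options can be appended to any typing command line.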

Commands

Full arguments can be seen in

usage: python3 main.py -h

Typing with only long-read data.

Required arguments:
  -r                  Long-read fastq file. PacBio or Nanopore. (default: None)
  -n                  Sample ID (default: None)
  -o                  The output folder to store the typing results. (default: ./output)
  -i                  HLA,KIR,CYP,IG_TR (default: HLA)

Optional arguments:
  -j                  Number of threads. (default: 5)
  -k                  The mean depth in a window lower than this value will be masked by N, set 0 to avoid masking (default: 5)
  -y                  Read type, [nanopore|pacbio|pacbio-hifi]. (default: pacbio)
  --db                Database folder, which can be obtained by scripts/make_db.py (default: /data4/wangxuedong/test_specimmune/SpecImmune/scripts/../db/)
  --hg38              No-alt hg38 Referece fasta file, used by IG_TR and CYP typing (default: None)
  -f, --first_run   Set False for rerun (default: True)
  --min_identity      Minimum alignment identity to assign a read to an allele. (default: 0.85)
  --hete_p            Minor haplotype frequency lower than this value is regarded as homology in best-matched allele pair selection. (default: 0.3)
  --candidate_allele_num 
                        Maintain this number of alleles for best-matched allele pair selection. (default: 200)
  --min_read_num      Min support read number for each locus. (default: 2)
  --max_read_num      Max support re
