ScHLAcount
Count HLA alleles in single-cell RNA-seq data
Install / Use
/learn @10XGenomics/ScHLAcountREADME
scHLAcount
Overview
scHLAcount allows you to count molecules in single-cell RNA-seq data for class I genes HLA-A, B, and C; and class II genes DPA1, DPB1, DRA1, DRB1, DQA1, and DQB1 using a personalized reference genome. You can either use provided HLA types determined by alternative methods or call HLA types with this tool then quantify against those calls. See the Using scHLAcount section for more details.

Uses
scHLAcount can be used to look at allele specific expression of HLA genes. It can also be used to evaluate loss of heterozygosity by overlaying cell-specific counts onto an expression based t-SNE projection and looking for clusters with complete loss of one haplotype. General loss of HLA expression can also be evaluated with scHLAcount, and performs better at this task than default Cell Ranger, particularly in the case where the sample has HLA haplotypes that are diverged from the reference.
scHLAcount DEV
HLA genotyping and allele-specific expression for single-cell RNA sequencing
USAGE:
sc_hla_count [FLAGS] [OPTIONS] --bam <FILE> --cell-barcodes <FILE>
FLAGS:
--use-exact-count If specified, will use exact alignment to allele sequences to count moleucles (very slow!)
-h, --help Prints help information
--primary-alignments If specified, will use primary alignments only
--unmapped If specified, will also use unmapped reads for genotyping
-V, --version Prints version information
OPTIONS:
-b, --bam <FILE> Cellranger BAM file
-c, --cell-barcodes <FILE> File with cell barcodes to be evaluated
-f, --fasta-cds <FILE> Multi-FASTA file with CDS sequence of each allele [default: ]
-g, --fasta-genomic <FILE> Multi-FASTA file with genomic sequence of each allele [default: ]
-d, --hladb-dir <PATH> Directory of the IMGT-HLA database [default: ]
-i, --hla-index <FILE> debruijn_mapping pseudoalignment index file constructed from IMGT-HLA database [default: ]
--log-level <log_level> Logging level [default: error] [possible values: info, debug, error]
-o, --out-dir <OUTPUT_DIR> [default: hla-typer-results]
--pl-tmp <PSEUDOALIGNER_TMP> Directory to write the pseudoaligner temporary files generated [default: pseudoaligner_tmp]
-r, --region <STRING> Samtools-format region string of reads to use [default: 6:28510120-33480577]
Limitations
While scHLAcount can determine HLA haplotypes given a HLA database like the one at IMGT, our testing has shown that alternative tools such as arcasHLA perform better at HLA genotyping. Therefore, we recommend that you use either alternative methods or arcasHLA to determine your genotypes before using scHLAcount to assign allele specific counts in your single cell RNA-seq dataset.
We have determined that the best results for genotyping and allele-specific counting are found with 5' GEX data. There is a much stronger coverage bias towards the end of the transcript in 3' GEX data, which poses a problem for genotyping and molecule counting of class I genes because most of the variable sites between these three paralogs are contained in exons 2 and 3, which are at the 5' end of the transcript. The following figure shows the coverage profile for 5', 3'v2 and 3'v3 GEX assays, normalized to 0 and 1 for the minimum and maximum coverage seen in the region, respectively, for each assay.

Installation
scHLAcount has automatically generated downloadable binaries for generic linux and Mac OSX under the releases page. The linux binaries are expected to work on our supported Operating Systems.
Compiling from source
scHLAcount is a standard Rust executable project, that works with stable Rust >=1.13.
If you need to compile from source, install Rust, then type cargo build --release from within the directory containing the scHLAcount source code. The executable will appear at target/release/sc_hla_count. As usual it's important to use a release build to get good performance.
Testing
If you have compiled scHLAcount from source, you can run the tiny test dataset by typing the command cargo test --release from wthin the directory containing the scHLAcount source code.
The test data files in the test/ folder also provide a simple example of the inputs and outputs for scHLAcount.
Support
scHLAcount is provided as an open-source tool for use by the community. Although we cannot provide full support for the software please submit a GitHub Issue if you have any problems, questions or comments. We would also be happy to consider Pull Requests that fix bugs or provide enhancements.
Scripts in the /paper directory show how to reproduce results from our manuscript and are not supported.
Using scHLAcount
Case 1: You have HLA genotypes for some or all class I / class II genes
Other Requirements: samtools
- Download the the IMGT/HLA database, available at Github or FTP. You only need the
hla_gen.fastaandhla_nuc.fastafiles, but you can download the whole database if you choose. - Use
samtools faidxto index thehla_gen.fastaandhla_nuc.fastafiles. - Create a file of the known genotypes, at most two per gene, with one genotype on each line. Follow the template at
paper/sample_gt.txt. - We strongly recommend that if genotypes are unknown for any of the genes, you put the reference genome allele for those genes in the known genotypes file. Alleles represented in the GRCh38 primary assembly are listed below:
A*03:01:01:01
B*07:02:01:01
C*07:02:01:01
DQA1*01:02:01:01
DQB1*06:02:01:01
DRB1*15:01:01:01
DPA1*01:03:01:01
DPB1*04:01:01:01
- If the indexed IMGT/HLA database files are not in the current directory, edit the
prepare_reference.shfile to point to these files. - Run
prepare_reference.sh known_genotypes.txtto get your custom referencescds.fastaandgen.fasta. The samtools command will fail if the coding and genomic sequence of all alleles specified are not present in the database! If multiple alleles are present that match the provided level of specificity of the genotype, one will be chosen arbitrarily. - Run scHLAcount with the custom references as
-fand-gparameters. Do not use the-iand-dparameters.
Case 2: You do not have HLA genotypes
- Download the the IMGT/HLA database, available at Github or FTP. You only need the files
hla_gen.fasta,hla_nuc.fasta, andAllele_status.txtbut you can download the whole database if you choose. - The directory containing these files should be provided as the
-dparameter to scHLAcount. - Run scHLAcount with the
-dparameter. Do not use the-f,-g, or-iparameters. - If you run the program again and want to skip building the index, just specify the file
hla_nuc.fasta.idxas the-iparameter. This file is located in the pseudoaligner temporary folder specified by the--pl-tmpparameter. - If you run the program again on and want to skip calculating the genotypes, you can use the
pseudoaligner_nuc.faandpseudoaligner_gen.faas the-fand-gparameters. These files are located in the pseudoaligner temporary folder specified by the--pl-tmpparameter.
Outputs
scHLAcount produces genome matrices in the same Market Exchange format that Cell Ranger uses. This is a sparse matrix format that can be read by common packages. Column labels are the cell barcodes included in the cell barcode input file (specified with --cell-barcodes).
Genotyping Algorithm Details
If HLA genotypes are not available, scHLAcount provides a preliminary implementation of a genotyping algorithm similar to the one demonstrated on bulk RNA-seq data in arcasHLA (Orenbuch et al., Bioinformatics 2019) and HLApers (Aguiar et al., PLoS Genetics 2019). Although 5' and 3' Gene Expression read coverage is highly skewed towards one end of the transcript (see Limitations) we have found that combining reads from all cells has enough coverage along the length of the transcript to achieve similar results to those demonstrated in bulk RNA-seq.
We use the Allele_status.txt metadata file from the IMGT/HLA database to select the full-length coding sequences of alleles from hla_nuc.fasta for genes HLA-A, -B, -C, DPA1, DPB1, DRA1, DRB1, DQA1, and DQB1 in the IMGT HLA database that also have complete genomic sequences available. We exclude null alleles (with suffix of N) because by definition, we would never call these genotypes from RNA sequencing data. A colored deBruijn graph of these alleles is constructed using k-mer size 24. We build on the Rust debruijn_mapping crate. From the aligned BAM file, all reads aligned to the MHC region (default is GRCh38 coordinates 6:28510120-33480577) are pseudo-aligned to the graph, and the set of alleles to which they align (''equivalence class'') is reported, if the length of the alignment is at least 40 bases, with up to 2 mismatches permitted outside the initial seed. Expectation maximization (EM) is performed on the equivalence class counts, with the accelerated implementation SquareM of Varadhan and Roland (Scandinavian J of Statistics, 2008).
For each of the eight HLA genes included in the graph, we rank the alleles of that ge
Related Skills
node-connect
351.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
110.6kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
351.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
351.2kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
