SkillAgentSearch skills...

ScHLAcount

Count HLA alleles in single-cell RNA-seq data

Install / Use

/learn @10XGenomics/ScHLAcount
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

scHLAcount

Overview

scHLAcount allows you to count molecules in single-cell RNA-seq data for class I genes HLA-A, B, and C; and class II genes DPA1, DPB1, DRA1, DRB1, DQA1, and DQB1 using a personalized reference genome. You can either use provided HLA types determined by alternative methods or call HLA types with this tool then quantify against those calls. See the Using scHLAcount section for more details.

Workflow Figure

Uses

scHLAcount can be used to look at allele specific expression of HLA genes. It can also be used to evaluate loss of heterozygosity by overlaying cell-specific counts onto an expression based t-SNE projection and looking for clusters with complete loss of one haplotype. General loss of HLA expression can also be evaluated with scHLAcount, and performs better at this task than default Cell Ranger, particularly in the case where the sample has HLA haplotypes that are diverged from the reference.

scHLAcount DEV

HLA genotyping and allele-specific expression for single-cell RNA sequencing

USAGE:
    sc_hla_count [FLAGS] [OPTIONS] --bam <FILE> --cell-barcodes <FILE>

FLAGS:
        --use-exact-count       If specified, will use exact alignment to allele sequences to count moleucles (very slow!)
    -h, --help                  Prints help information
        --primary-alignments    If specified, will use primary alignments only
        --unmapped              If specified, will also use unmapped reads for genotyping
    -V, --version               Prints version information

OPTIONS:
    -b, --bam <FILE>                    Cellranger BAM file
    -c, --cell-barcodes <FILE>          File with cell barcodes to be evaluated
    -f, --fasta-cds <FILE>              Multi-FASTA file with CDS sequence of each allele [default: ]
    -g, --fasta-genomic <FILE>          Multi-FASTA file with genomic sequence of each allele [default: ]
    -d, --hladb-dir <PATH>              Directory of the IMGT-HLA database [default: ]
    -i, --hla-index <FILE>              debruijn_mapping pseudoalignment index file constructed from IMGT-HLA database [default: ]
        --log-level <log_level>         Logging level [default: error]  [possible values: info, debug, error]
    -o, --out-dir <OUTPUT_DIR>           [default: hla-typer-results]
        --pl-tmp <PSEUDOALIGNER_TMP>    Directory to write the pseudoaligner temporary files generated [default: pseudoaligner_tmp]
    -r, --region <STRING>               Samtools-format region string of reads to use [default: 6:28510120-33480577]

Limitations

While scHLAcount can determine HLA haplotypes given a HLA database like the one at IMGT, our testing has shown that alternative tools such as arcasHLA perform better at HLA genotyping. Therefore, we recommend that you use either alternative methods or arcasHLA to determine your genotypes before using scHLAcount to assign allele specific counts in your single cell RNA-seq dataset.

We have determined that the best results for genotyping and allele-specific counting are found with 5' GEX data. There is a much stronger coverage bias towards the end of the transcript in 3' GEX data, which poses a problem for genotyping and molecule counting of class I genes because most of the variable sites between these three paralogs are contained in exons 2 and 3, which are at the 5' end of the transcript. The following figure shows the coverage profile for 5', 3'v2 and 3'v3 GEX assays, normalized to 0 and 1 for the minimum and maximum coverage seen in the region, respectively, for each assay.

GEX Coverage Figure

Installation

scHLAcount has automatically generated downloadable binaries for generic linux and Mac OSX under the releases page. The linux binaries are expected to work on our supported Operating Systems.

Compiling from source

scHLAcount is a standard Rust executable project, that works with stable Rust >=1.13.

If you need to compile from source, install Rust, then type cargo build --release from within the directory containing the scHLAcount source code. The executable will appear at target/release/sc_hla_count. As usual it's important to use a release build to get good performance.

Testing

If you have compiled scHLAcount from source, you can run the tiny test dataset by typing the command cargo test --release from wthin the directory containing the scHLAcount source code.

The test data files in the test/ folder also provide a simple example of the inputs and outputs for scHLAcount.

Support

scHLAcount is provided as an open-source tool for use by the community. Although we cannot provide full support for the software please submit a GitHub Issue if you have any problems, questions or comments. We would also be happy to consider Pull Requests that fix bugs or provide enhancements.

Scripts in the /paper directory show how to reproduce results from our manuscript and are not supported.

Using scHLAcount

Case 1: You have HLA genotypes for some or all class I / class II genes

Other Requirements: samtools

  1. Download the the IMGT/HLA database, available at Github or FTP. You only need the hla_gen.fasta and hla_nuc.fasta files, but you can download the whole database if you choose.
  2. Use samtools faidx to index the hla_gen.fasta and hla_nuc.fasta files.
  3. Create a file of the known genotypes, at most two per gene, with one genotype on each line. Follow the template at paper/sample_gt.txt.
  4. We strongly recommend that if genotypes are unknown for any of the genes, you put the reference genome allele for those genes in the known genotypes file. Alleles represented in the GRCh38 primary assembly are listed below:
A*03:01:01:01
B*07:02:01:01
C*07:02:01:01
DQA1*01:02:01:01
DQB1*06:02:01:01
DRB1*15:01:01:01
DPA1*01:03:01:01
DPB1*04:01:01:01
  1. If the indexed IMGT/HLA database files are not in the current directory, edit the prepare_reference.sh file to point to these files.
  2. Run prepare_reference.sh known_genotypes.txt to get your custom references cds.fasta and gen.fasta. The samtools command will fail if the coding and genomic sequence of all alleles specified are not present in the database! If multiple alleles are present that match the provided level of specificity of the genotype, one will be chosen arbitrarily.
  3. Run scHLAcount with the custom references as -f and -g parameters. Do not use the -i and -d parameters.

Case 2: You do not have HLA genotypes

  1. Download the the IMGT/HLA database, available at Github or FTP. You only need the files hla_gen.fasta, hla_nuc.fasta, and Allele_status.txt but you can download the whole database if you choose.
  2. The directory containing these files should be provided as the -d parameter to scHLAcount.
  3. Run scHLAcount with the -d parameter. Do not use the -f, -g, or -i parameters.
  4. If you run the program again and want to skip building the index, just specify the file hla_nuc.fasta.idx as the -i parameter. This file is located in the pseudoaligner temporary folder specified by the --pl-tmp parameter.
  5. If you run the program again on and want to skip calculating the genotypes, you can use the pseudoaligner_nuc.fa and pseudoaligner_gen.fa as the -f and -g parameters. These files are located in the pseudoaligner temporary folder specified by the --pl-tmp parameter.

Outputs

scHLAcount produces genome matrices in the same Market Exchange format that Cell Ranger uses. This is a sparse matrix format that can be read by common packages. Column labels are the cell barcodes included in the cell barcode input file (specified with --cell-barcodes).

Genotyping Algorithm Details

If HLA genotypes are not available, scHLAcount provides a preliminary implementation of a genotyping algorithm similar to the one demonstrated on bulk RNA-seq data in arcasHLA (Orenbuch et al., Bioinformatics 2019) and HLApers (Aguiar et al., PLoS Genetics 2019). Although 5' and 3' Gene Expression read coverage is highly skewed towards one end of the transcript (see Limitations) we have found that combining reads from all cells has enough coverage along the length of the transcript to achieve similar results to those demonstrated in bulk RNA-seq.

We use the Allele_status.txt metadata file from the IMGT/HLA database to select the full-length coding sequences of alleles from hla_nuc.fasta for genes HLA-A, -B, -C, DPA1, DPB1, DRA1, DRB1, DQA1, and DQB1 in the IMGT HLA database that also have complete genomic sequences available. We exclude null alleles (with suffix of N) because by definition, we would never call these genotypes from RNA sequencing data. A colored deBruijn graph of these alleles is constructed using k-mer size 24. We build on the Rust debruijn_mapping crate. From the aligned BAM file, all reads aligned to the MHC region (default is GRCh38 coordinates 6:28510120-33480577) are pseudo-aligned to the graph, and the set of alleles to which they align (''equivalence class'') is reported, if the length of the alignment is at least 40 bases, with up to 2 mismatches permitted outside the initial seed. Expectation maximization (EM) is performed on the equivalence class counts, with the accelerated implementation SquareM of Varadhan and Roland (Scandinavian J of Statistics, 2008).

For each of the eight HLA genes included in the graph, we rank the alleles of that ge

Related Skills

View on GitHub
GitHub Stars63
CategoryDevelopment
Updated3mo ago
Forks12

Languages

TeX

Security Score

95/100

Audited on Jan 8, 2026

No findings