
HLAminer

⛏ HLA predictions from NGS shotgun data


HLAminer (c) 2011-present

Derivation of HLA (Human Leukocyte Antigen) class I and II predictions from DNA/RNA sequencing datasets

*This manual assumes that you have a working knowledge of Unix, and some shell and Perl scripting experience.

CONTENTS


  1. SYNOPSIS
  2. LICENSE
  3. OVERVIEW
  4. DESCRIPTION
  5. INSTALL
  6. COMMANDS AND OPTIONS
  7. PREDICTING FROM LONG (NANOPORE/PACBIO) READS
  8. REFERENCE SEQUENCE FOR LONG-READ ALIGNMENTS
  9. DATABASES
  10. AUTHORS
  11. CITING
  12. FULL LICENSE

SYNOPSIS <a name=synopsis></a>


HLAminer is a pipeline for predicting Human Leukocyte Antigen (HLA) signatures from shotgun sequence data (i.e. whole genome, whole transcriptome/RNA-Seq, exome), at the group and allele resolution. It supports predictions from a variety of DNA sequencing technologies, including those from Illumina, MGI, PacBio and Oxford Nanopore.
Predictions are derived either from targeted sequence assembly or from direct sequence alignments.

For quick tests on Illumina RNA-seq data:

  1. Copy ./test-demo/, e.g. cp -rf test-demo foo
  2. In folder "foo", edit the patient.fof file to point to your NGS RNAseq data. Ensure all paths are valid.
  3. For HLA Predictions by Targeted Assembly of Shotgun Reads, execute ./HLAminer/foo/HPTASRrnaseq.sh. For HLA Predictions by Read Alignment, execute ./HLAminer/foo/HPRArnaseq.sh
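
Step 2 asks that every path in patient.fof be valid. The check can be sketched in Python as below. This is only an illustration: it assumes patient.fof lists one read file path per line (the format is not spelled out in this section), and the function name `missing_paths` is hypothetical and not part of HLAminer.

```python
from pathlib import Path


def missing_paths(fof_file):
    """Return the entries of a file-of-filenames that are not existing files.

    Assumption: one path per line; blank lines are ignored.
    """
    missing = []
    for line in Path(fof_file).read_text().splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if not Path(entry).is_file():
            missing.append(entry)
    return missing
```

Calling `missing_paths("foo/patient.fof")` before launching the shell scripts would return an empty list when every listed file is present.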

LICENSE <a name=license></a>


HLAminer Copyright (c) 2011-present Canada's Michael Smith Genome Science Centre. All rights reserved.

TASR Copyright (c) 2010-present Canada's Michael Smith Genome Science Centre. All rights reserved.

SSAKE Copyright (c) 2006-present Canada's Michael Smith Genome Science Centre. All rights reserved.

Due to the clinical implications of HLAminer, the code is now released under the BC Cancer Agency software license agreement (academic use). Details of the license can be found at the bottom of this readme file.

For commercial licensing options, please contact Patrick Rebstein prebstein@bccancer.bc.ca

Software components of HLAminer (e.g. TASR) are still distributed under the terms of the GNU General Public License.

OVERVIEW <a name=overview></a>


Derivation of HLA class I and class II predictions from shotgun sequence datasets (HLAminer) by:

  1. Targeted Assembly of Shotgun Reads (HPTASR)
  2. Read Alignment (HPRA)

BEST SHORT READ RESULTS ARE OBTAINED WITH HPTASR WITH READS 100bp AND UP (IDEALLY 150bp). IT WILL WORK WITH SHORTER READS (50bp) BUT 4-digit HLA ALLELE PREDICTIONS MAY BE AMBIGUOUS

This clip summarizes the pipeline: https://www.youtube.com/watch?v=j-g8Geh5ST8&list=LL&index=110

DESCRIPTION <a name=description></a>


The HLA prediction by targeted assembly of short sequence reads (HPTASR) method performs targeted de novo assembly of HLA NGS reads and aligns the resulting contigs to reference HLA alleles from the IMGT/HLA sequence repository, using commodity hardware with standard specifications (<2GB RAM, 2GHz). Putative HLA types are inferred by mining and scoring the contig alignments, and an expect value is determined for each. The method is accurate, simple and fast to execute and, for transcriptome data, requires low depth of coverage.

Known HLA class I/class II reference sequences available from the IMGT/HLA public repository are read by TASR using default options (Warren and Holt 2011) to create a hash table of all possible 15 nt words (k-mers) from these reference sequences. Note that this parameter is customizable, and larger k values will yield predictions with increased specificity (at the possible expense of sensitivity). Subsequently, NGS data sets are interrogated for the presence of one of these k-mers (on either strand) at the 5’ or 3’ start. Whenever an HLA word is identified, the read is recruited as a candidate for de novo assembly. Upon de novo assembly of all recruited reads, a set of contigs is generated. Only sequence contigs at least 200 nt in length are considered for further analysis, as longer contigs better resolve HLA allelic variants. Reciprocal BLASTN alignments are performed between the contigs and all HLA allelic reference sequences. HPTASR mines the alignments, scoring each possible HLA allele identified, and computes and reports an expect value (E-value) based on the chance of contigs characterizing given HLA alleles and, reciprocally, the chance of reference HLA alleles aligning best to certain assembled contig sequences.
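
The read recruitment and contig length filter described above can be illustrated with a short Python sketch. It is not TASR's implementation: a Python set stands in for the hash table, all function names are hypothetical, and "at the 5’ or 3’ start" is read here as "the first or last k bases of the read", which is an assumption.

```python
K = 15  # word length given in the text; customizable in the real pipeline

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_kmer_index(references, k=K):
    """Collect every k-mer present in the reference sequences."""
    index = set()
    for ref in references:
        ref = ref.upper()
        for i in range(len(ref) - k + 1):
            index.add(ref[i:i + k])
    return index


def is_recruited(read, index, k=K):
    """True if either end of the read, on either strand, is a reference k-mer."""
    read = read.upper()
    if len(read) < k:
        return False
    for strand in (read, reverse_complement(read)):
        if strand[:k] in index or strand[-k:] in index:
            return True
    return False


def keep_contigs(contigs, min_len=200):
    """Keep only contigs of at least min_len nt (200 nt in the text)."""
    return [c for c in contigs if len(c) >= min_len]
```

Raising `k` shrinks the chance that an unrelated read shares a word with the references, which is the specificity/sensitivity trade-off the paragraph mentions.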

The HLA prediction from direct read alignment (HPRA) method is conceptually simpler and faster to execute, since paired reads are aligned up-front to reference HLA alleles. Alignments from the HPTASR and HPRA methods are processed by the same software (HLAminer.pl) to derive HLA-I predictions by scoring and evaluating the probability of each candidate bearing alignments.

What's new in version 1.4?


  1. Ability to stream the (.sam) output of modern read aligners directly into HLAminer
  2. Initial support for predicting HLA types from long nanopore reads such as those from Oxford Nanopore Technologies
  3. Better information/sub-routine/date tracking in HLAminer

What's new in version 1.3?


  1. A more concise HLA allele summary in HLAminer_HPTASR.csv and HLAminer_HPRA.csv (the associated .log is unchanged and lists all predictions)
  2. Keeps the top two [highest-scoring by HLA group] predictions per gene, and only the 'P' designated allele when the summary includes HLA sequences reported to have the same antigen binding domain. For the original output, refer to the HLAminer_v1-2.pl included in the ./bin directory
  3. A prediction example from MCF-7 PacBio RNA-seq reads is also provided
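
The "top two per gene" summary can be pictured with the following Python sketch. It is one possible reading of the rule, not HLAminer's code: the input shape (pairs of allele name and score), the function name, and the choice to keep the best-scoring allele within each HLA group before ranking groups are all assumptions. The 'P' designation handling is not covered.

```python
from collections import defaultdict


def top_two_per_gene(predictions):
    """Keep the two best-scoring HLA groups per gene.

    predictions: iterable of (allele_name, score) pairs, where allele_name
    looks like "A*02:01" (gene "A", group "A*02"). Hypothetical input shape.
    """
    # Best-scoring allele seen for each group, e.g. "A*02"
    best_by_group = {}
    for allele, score in predictions:
        group = allele.split(":")[0]
        if group not in best_by_group or score > best_by_group[group][1]:
            best_by_group[group] = (allele, score)

    # Gather the group representatives under their gene
    by_gene = defaultdict(list)
    for allele, score in best_by_group.values():
        by_gene[allele.split("*")[0]].append((allele, score))

    # Rank by score and keep at most two per gene
    return {
        gene: sorted(hits, key=lambda hit: hit[1], reverse=True)[:2]
        for gene, hits in by_gene.items()
    }
```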

What's new in version 1.2?


  1. Updated all HLA sequence databases
  2. Corrected the shell script that downloads HLA sequences to reflect the change of location at EBI (i.e. fasta sub folder)
  3. Added support for predictions from direct alignment of single-end reads

INSTALL <a name=install></a>


<pre>
1. Download and decompress the tar ball
gunzip HLAminer_v1-4.tar.gz
tar -xvf HLAminer_v1-4.tar
2. Make sure you see the following directories:
./bin
./databases
./docs
./test-demo
3. Read the docs in the ./docs/ folder
4. Change/Add/Adjust the perl shebang line of each .pl and .sh script in the ./bin/ folder as needed
</pre>
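
The directory check in step 2 can be scripted. The Python sketch below is a convenience illustration only; the function name `missing_dirs` is hypothetical and the directory names are taken from the list above.

```python
from pathlib import Path

# Directories the install instructions say should be present
EXPECTED_DIRS = ("bin", "databases", "docs", "test-demo")


def missing_dirs(root):
    """Return the expected directories that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Running `missing_dirs(".")` from the unpacked folder should return an empty list if the archive was extracted completely.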

<pre>
From direct Read Alignment (HPRA, faster but less accurate):
HPRArnaseq_classI.sh
HPRArnaseq_classI-II.sh
HPRAwgs_classI.sh
HPRAwgs_classI-II.sh
-and for single end reads-
HPRArnaseq_classI_SE.sh
HPRArnaseq_classI-II_SE.sh
HPRAwgs_classI_SE.sh
HPRAwgs_classI-II_SE.sh

From Targeted Assembly (HPTASR, longer but more accurate):
HPTASRrnaseq_classI.sh
HPTASRrnaseq_classI-II.sh
HPTASRwgs_classI.sh
HPTASRwgs_classI-II.sh
</pre>

*Running HPTASRwgs(rnaseq)_classI-II.sh will take longer than HPTASRwgs(rnaseq)_classI.sh, due to the reciprocal BLAST step. You may remove this step from the former (and HLAminer.pl command) to speed things up. However, this step is helpful in weeding out spurious alignments to HLA references. That said, if you're solely interested in HLA-I, you have the option to run the latter set of scripts [HPTASRwgs(rnaseq)_classI.sh].

Also, in the ncbiBlastConfig2-2-XX.txt files (bin and test-demo directories), you may adjust the number of threads and number of reported alignments to speed things up. The options have different names depending on the blast version; refer to the blast manual, e.g.

<pre>
v2.2.22
option:description
-a:threads
-v:number of descriptions
-b:number of alignments

v2.2.28
-num_threads:threads
-max_target_seqs:number of hit sequences to report (when output is 5/xml)
</pre>

In our hands, a few tests show that blast 2.2.22 may be faster than blast+ (2.2.28) while producing accurate results. HLAminer (Warren et al. 2012) was thoroughly tested with 2.2.22.

NCBI blast may be downloaded from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/ -or- ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/

HLAminer.pl parseXMLblast.pl

  1. You must install perl module Bio::SearchIO to use HPTASR
  2. Edit the fullpath location of bwa and other software dependencies in the shell scripts in the ./bin/ folder, as needed
  3. For your convenience, ncbi blastall and formatdb have been placed in the ./bin/ folder and executed from the following shell scripts:

<pre>
NAME,PROCESS,NGS DATA TYPE,PREDICTIONS
HPRArnaseq_classI.sh,Paired read alignment,RNAseq (transcriptome),HLA-I A,B,C genes
HPRArnaseq_classI-II.sh,Paired read alignment,RNAseq (transcriptome),HLA-I A,B,C and HLA-II DP,DQ,DR genes

HPRAwgs_classI.sh,Paired read alignment,Exon capture (exome) and WGS (genome),HLA-I A,B,C genes
HPRAwgs_classI-II.sh,Paired read alignment,Exon capture (exome) and WGS (genome),HLA-I A,B,C and HLA-II DP,DQ,DR genes

HPTASRrnaseq_classI.sh,Targeted assembly of sequence reads,RNAseq (transcriptome),HLA-I A,B,C genes
HPTASRrnaseq_clas
</pre>
