LOFTEE (Loss-Of-Function Transcript Effect Estimator)

Loss-of-function pipeline (inspired by MacArthur et al., 2012, published in Karczewski et al., 2020).

A VEP plugin to identify LoF (loss-of-function) variation.

Currently assesses variants that are:

Stop-gained
Splice site disrupting
Frameshift variants

Note: the master branch does not work with GRCh38. Please use the grch38 branch.

Filters

LOFTEE implements a set of filters, listed below. A LoF variant that fails any of them is deemed "low-confidence" (LC); variants that pass all of them are labeled "high-confidence" (HC).

For stop-gained and frameshift variants, LOFTEE removes:

Variants that are near the end of the transcript (based on the 50 bp rule, with modifications as described in the Karczewski et al., 2020 supplement)
Variants that land in an exon with non-canonical splice sites around it (i.e. intron does not start with GT and end with AG)
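
As a rough illustration of the first filter, the classical 50 bp rule treats a premature stop as likely to escape nonsense-mediated decay when it falls in the last exon or within the final 50 bp upstream of the last exon-exon junction. The sketch below implements only that textbook rule, not LOFTEE's modified version; the function name and the exon-length input are assumptions made for illustration.

```python
def near_transcript_end(exon_lengths, pos, window=50):
    """Return True if `pos` is in the last exon or within `window` bp
    upstream of the last exon-exon junction.

    exon_lengths: exon lengths in transcript order (hypothetical input).
    pos: 1-based position along the concatenated exons.
    """
    total = sum(exon_lengths)
    if not 1 <= pos <= total:
        raise ValueError("position outside transcript")
    # Last base before the final exon-exon junction.
    last_junction = total - exon_lengths[-1]
    return pos > last_junction - window


# Exons of 100, 200 and 150 bp: the last junction follows position 300,
# so positions 251 and above count as near the end.
print(near_transcript_end([100, 200, 150], 251))
print(near_transcript_end([100, 200, 150], 250))
```

A single-exon transcript has no junction, so every position is reported as near the end under this simplification.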

For splice-site variants, LOFTEE removes:

Variants that only affect splicing of UTRs
Variants that are not predicted to disrupt a donor site because they change a non-canonical GC donor into the canonical GT (GC -> GT)
Variants where MaxEntScan does not predict an effect on splicing
Variants that are "rescued" by nearby, in-frame splice sites (max_scan_distance determines distance from original splice site where rescue splice sites can occur; default = 15 bp)
Variants in small introns (min_intron_size; default = 15 bp; only relevant to older versions of Gencode)

For all variants, LOFTEE removes:

Variants where the purported LoF allele is the ancestral state (across primates)
Variants in incomplete transcripts (only relevant to older versions of Gencode)

Flags

LOFTEE implements a series of flags in addition to the above filters. Flagged variants should be treated with caution, particularly when doing genome-wide scans of LoF variation. However, they largely relate to the properties of individual transcripts or exons, so domain knowledge of a given gene will typically outperform these flags.

For stop-gained and frameshift variants, LOFTEE flags:

Variants in genes with only a single exon
Variants in exons that do not have the evolutionary signature of a protein-coding gene based on PhyloCSF
Variants where no exon number is indicated (apparently because the variant overlaps an intron)

For splice-site variants, LOFTEE flags:

Variants in NAGNAG sites (acceptor sites rescued by in-frame acceptor site)
Variants that fall in an intron with a non-canonical splice site (i.e. intron does not start with GT and end with AG).

Predictions of splice-altering variants

LOFTEE also makes predictions of other splice (OS) variants that may cause LoF by disrupting normal splicing patterns.

For variants that occur in the extended (but not essential) splice sites, LOFTEE uses logistic regression models to predict whether the splice site is significantly disrupted.

LOFTEE also uses an SVM model to predict variants that cause LoF by creating de novo donor splice sites leading to a frameshift.

Requirements

VEP
Perl >= 5.10.1
Ancestral sequence (human_ancestor.fa[.gz|.rz])
Samtools (must be on path)
PhyloCSF database (phylocsf.sql) for conservation filters

Usage

LOFTEE is easiest to run when cloned from GitHub and passed to VEP using --dir_plugins (alternatively, move all files in the directory into ~/.vep/Plugins/).

Basic usage:

perl variant_effect_predictor.pl [--other options to VEP] --plugin LoF,loftee_path:/path/to/loftee --dir_plugins /path/to/loftee

Pass additional options to LOFTEE by:

perl variant_effect_predictor.pl [--other options to VEP] --plugin LoF,loftee_path:/path/to/loftee,human_ancestor_fa:/path/to/human_ancestor.fa.gz
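
When many options are set, the `--plugin` value can be assembled programmatically. The following is a minimal sketch that only mirrors the `LoF,key:value,...` pattern shown in the commands above; the helper name `build_lof_plugin_arg` is hypothetical and not part of LOFTEE or VEP.

```python
def build_lof_plugin_arg(options):
    """Join LOFTEE options into the 'LoF,key:value,...' string for --plugin."""
    parts = ["LoF"]
    for key, value in options.items():
        # Commas separate options, so a comma inside a value would be ambiguous.
        if "," in str(value):
            raise ValueError(f"comma not allowed in value for {key!r}")
        parts.append(f"{key}:{value}")
    return ",".join(parts)


plugin_arg = build_lof_plugin_arg({
    "loftee_path": "/path/to/loftee",
    "human_ancestor_fa": "/path/to/human_ancestor.fa.gz",
})
print(plugin_arg)
# LoF,loftee_path:/path/to/loftee,human_ancestor_fa:/path/to/human_ancestor.fa.gz
```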

Options:

loftee_path

Path to loftee directory. Default is the current working directory. Note: Your PERL5LIB should also contain this path.

min_intron_size

Minimum intron size (in bp), below which a variant should be filtered. Default: 15.

fast_length_calculation

The Ensembl API can calculate transcript length in two ways: one approximate (fast; usually within 3 bp of the correct length) and one exact (slow). Default: fast.

human_ancestor_fa

Location of the human_ancestor.fa file (the associated index files are required). For samtools 0.1.19 and older, it is available for download here: http://www.broadinstitute.org/~konradk/loftee/human_ancestor.fa.rz and http://www.broadinstitute.org/~konradk/loftee/human_ancestor.fa.rz.fai. Courtesy of Javier Herrero and the 1000 Genomes Project (source: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/). samtools 1.x uses bgzipped inputs for samtools faidx; those downloads are available here: https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz, https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.fai, https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.gzi. If this option is set to 'false', the ancestral allele will not be checked and the corresponding filter will not be applied.

conservation_file

The required SQL database (gzip) can be downloaded here. Alternatively, download the source file here and load it into MySQL with the schema available here. This route also requires loading the GERP base and exon files into the same database, as gerp_bases and gerp_exons respectively. You will then need a [loftee] entry in your ~/.my.cnf (create the file if it does not exist) that looks like:

<pre>
[loftee]
host=your_mysql_host
user=your_mysql_user
password=your_mysql_pass
database=your_mysql_db
</pre>
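
To check that such an entry is well-formed before running VEP, the section can be read with Python's standard `configparser`. This is only a sketch: the function name is hypothetical, LOFTEE itself reads the file from Perl, and real `~/.my.cnf` files may contain MySQL-specific directives that `configparser` does not understand.

```python
import configparser


def read_loftee_mysql_settings(cnf_text):
    """Return the key/value pairs of the [loftee] section as a dict."""
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(cnf_text)
    if not parser.has_section("loftee"):
        raise KeyError("no [loftee] section found")
    return dict(parser["loftee"])


example_cnf = """
[loftee]
host=your_mysql_host
user=your_mysql_user
password=your_mysql_pass
database=your_mysql_db
"""

settings = read_loftee_mysql_settings(example_cnf)
missing = {"host", "user", "password", "database"} - settings.keys()
print("missing keys:", sorted(missing))
```

To check a real file, pass its contents, for example `pathlib.Path("~/.my.cnf").expanduser().read_text()`.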

check_complete_cds

The Ensembl API contains a "Complete CDS" annotation indicating that both a start and a stop codon have been identified for the transcript. Unfortunately, checking it requires Ensembl database access, which severely decreases performance, so this option is disabled by default.

get_splice_features

Flag indicating whether or not to write splice prediction features to LoF_info field. Default: 1.

donor_disruption_cutoff

The minimum cutoff on DONOR_DISRUPTION_PROB (computed from logistic regression model) used to predict a DONOR_DISRUPTION LoF. Default: 0.98.

acceptor_disruption_cutoff

The minimum cutoff on ACCEPTOR_DISRUPTION_PROB (computed from logistic regression model) used to predict an ACCEPTOR_DISRUPTION LoF. Default: 0.99.

donor_disruption_mes_cutoff

If no conservation_file is specified, then LOFTEE cannot use the logistic regression model to compute DONOR_DISRUPTION_PROB. Instead, it will predict donor disruption using only the impact of the variant on the splice site’s MES score. In this case, donor_disruption_mes_cutoff is the minimum cutoff used to predict DONOR_DISRUPTION. Default: 6 (i.e. the variant must lower the MES score of the splice site by at least 6 to activate DONOR_DISRUPTION).

acceptor_disruption_mes_cutoff

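The equivalent MES-only cutoff for variants affecting the acceptor site, used to predict ACCEPTOR_DISRUPTION when no conservation_file is specified. Default: 7.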
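
The interplay between the probability cutoffs and the MES fallback cutoffs can be summarized as a single decision. The sketch below is an illustration of the rules as described in these options, not LOFTEE's Perl implementation; the function and argument names are assumptions, and the defaults shown are the donor values.

```python
def predicts_disruption(prob=None, mes_drop=None, prob_cutoff=0.98, mes_cutoff=6.0):
    """Decide whether a splice-site disruption is predicted.

    prob: model probability, or None when no conservation_file is available.
    mes_drop: amount by which the variant lowers the site's MES score.
    """
    if prob is not None:
        # Logistic regression probability is available: use it alone.
        return prob >= prob_cutoff
    if mes_drop is None:
        raise ValueError("need either prob or mes_drop")
    # Fallback: rely only on the change in MES score.
    return mes_drop >= mes_cutoff


# Donor site with a model probability.
print(predicts_disruption(prob=0.985))
# Acceptor site without a conservation file, using the acceptor defaults.
print(predicts_disruption(mes_drop=6.5, prob_cutoff=0.99, mes_cutoff=7.0))
```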

max_scan_distance

The maximum distance (in bp) from the disrupted donor or acceptor splice site where LOFTEE will look for "rescue" splice sites. Default: 15.

donor_rescue_cutoff

The minimum cutoff on RESCUE_DONOR_MES (i.e. the highest MES score out of all in-frame donor splice sites within max_scan_distance bp of the original splice site) used to activate the RESCUE_DONOR filter. Default: 8.5.

acceptor_rescue_cutoff

The minimum cutoff on RESCUE_ACCEPTOR_MES used to activate the RESCUE_ACCEPTOR filter. Default: 8.5.
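
The rescue logic described by max_scan_distance and the two rescue cutoffs can be sketched as follows. This assumes that "in-frame" means the candidate site is offset from the original site by a multiple of 3 bp, and that candidates arrive as hypothetical (offset, MES score) pairs; neither detail is taken from LOFTEE's code.

```python
def best_rescue_mes(candidates, max_scan_distance=15):
    """Return the highest MES among in-frame candidate sites in range, or None.

    candidates: iterable of (offset_bp, mes_score), with offset_bp measured
    from the original splice site (hypothetical representation).
    """
    in_range = [
        mes
        for offset, mes in candidates
        if offset != 0 and abs(offset) <= max_scan_distance and offset % 3 == 0
    ]
    return max(in_range) if in_range else None


def is_rescued(candidates, cutoff=8.5, max_scan_distance=15):
    """True if some in-frame site within range scores at or above the cutoff."""
    best = best_rescue_mes(candidates, max_scan_distance)
    return best is not None and best >= cutoff


# The +4 site is out of frame and the +18 site is too far away,
# so only the +3 site (MES 9.0) is considered.
sites = [(3, 9.0), (4, 12.0), (18, 10.0)]
print(best_rescue_mes(sites), is_rescued(sites))
```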

exonic_denovo_only

If this flag is set to true, LOFTEE will only look for de novo donor splice sites occurring in the exon. Default: 1.

weak_donor_cutoff

Minimum MES of the annotated donor site for LOFTEE to consider any potential de novo donor alternatives. This is necessary because instances of annotated sites with very low MES scores lead to the false prediction of many de novo donor-creating variants. Default: -4.

max_denovo_donor_distance

The maximum distance from the original donor splice site where LOFTEE will look for de novo donor splice sites. Default: 200.

denovo_donor_cutoff

The minimum cutoff on DE_NOVO_DONOR_PROB (computed from SVM model) used to predict a DE_NOVO_DONOR LoF. Default: 0.995.

Output

The output is the standard VEP output, or standard VEP VCF if --vcf is passed to VEP. For those unfamiliar with VEP's VCF output, the annotations are written to the CSQ attribute of the INFO field. Here, a comma-separated list of consequences, corresponding to each transcript-(alternate)allele pair, is written with each entry as a pipe-delimited set of annotations. With more alleles and transcripts (and especially with the --everything flag), this will inevitably make for some very long INFO fields that are difficult to parse by eye.

See src/read_vep_vcf.py for a barebones example of a parsing script, or the section below on Parsing the VEP/LoF VCF for some tips and tricks.
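
The comma-then-pipe structure of the CSQ attribute can be unpacked in a few lines. The sketch below is not the repository's script; the function names are hypothetical, and the header and INFO strings are made-up examples in which the LoF and LoF_filter columns stand in for the annotations LOFTEE adds.

```python
def csq_fields_from_header(header_line):
    """Extract field names from the 'Format: A|B|C' part of the CSQ header line."""
    fmt = header_line.split("Format: ", 1)[1]
    return fmt.rstrip('">').strip().split("|")


def parse_csq(info_field, csq_fields):
    """Return one dict per transcript-allele entry in the CSQ attribute."""
    for item in info_field.split(";"):
        if item.startswith("CSQ="):
            entries = item[len("CSQ="):].split(",")
            return [dict(zip(csq_fields, entry.split("|"))) for entry in entries]
    return []


# Illustrative header and INFO values, not real VEP output.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: Allele|Consequence|Gene|LoF|LoF_filter">')
info = "AC=1;CSQ=A|stop_gained|GENE1|HC|,A|missense_variant|GENE1||"

fields = csq_fields_from_header(header)
records = parse_csq(info, fields)
high_confidence = [r for r in records if r["LoF"] == "HC"]
print(len(records), len(high_confidence))
```

Reading the field order from the header rather than hard-coding it keeps the parser working when VEP is run with a different set of options.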

From VEP, a VCF line may look like:

<pre> 1 1178848 rs115005664 G
