neoepiscope

neoepiscope is peer-reviewed open-source software for predicting neoepitopes from DNA sequencing (DNA-seq) data. Where most neoepitope prediction software confines attention to neoepitopes arising from at most one somatic mutation, often just an SNV, neoepiscope uses assembled haplotype output of HapCUT2 to also enumerate neoepitopes arising from more than one somatic mutation. neoepiscope also takes into account frameshifting from indels and permits personalizing the reference transcriptome using germline variants.

Read our paper in Bioinformatics

Note

neoepiscope v0.2.x has a critical bug where homozygous variants are not phased with heterozygous variants. Please update to the most recent version.

License

neoepiscope is licensed under the MIT license. See LICENSE for more details.

Portions of neoepiscope---specifically, segments of code in transcript.py, bowtie_index.py, and download.py---are taken from Rail-RNA, which is copyright (c) 2015 Abhinav Nellore, Leonardo Collado-Torres, Andrew Jaffe, James Morton, Jacob Pritt, José Alquicira-Hernández, Christopher Wilks, Jeffrey T. Leek, and Ben Langmead and licensed under the MIT License.

Support

For support, email hellopdxgx@gmail.com.

Installing neoepiscope

neoepiscope is compatible with Python 3.6 and higher. To install, run

pip install neoepiscope

Note: if this fails on macOS 10.15 (Catalina) or newer, the required pysam installation may be unable to find the C compiler. To solve this, try either 1) running xcode-select --install or 2) installing pysam via conda (e.g. conda install -c bioconda pysam), then run pip install neoepiscope again.

To download compatible reference annotation files (hg19, GRCh38, and/or mouse mm9) and link installations of relevant optional software (e.g. netMHCpan) to neoepiscope, use our download functionality. Run the command:

neoepiscope download

and respond to the prompts as relevant for your needs.

To make sure that the software is running properly, clone this repository, and from within it run:

python setup.py test

Using neoepiscope

Preparing reference files (for those using references other than human hg19 or GRCh38 or mouse mm9)

If you aren't using human hg19 or GRCh38 or mouse mm9 reference builds from our download functionality, you will need to download and prepare your own annotation files. Before calling any neoepitopes, run neoepiscope in index mode to prepare dictionaries of transcript data used in neoepitope prediction:

neoepiscope index -g <GTF> -d <DIRECTORY TO HOLD PICKLED DICTIONARIES>

Options:

-g, --gtf path to GTF file

-d, --dicts path to write pickled dictionaries

Ensure proper ordering of VCF

To call neoepitopes from somatic mutations, ensure that the column with data for the tumor sample in your VCF file precedes the column with data from a matched normal sample. If it does not, run neoepiscope in swap mode to produce a new VCF:

neoepiscope swap -i <INPUT VCF> -o <SWAPPED VCF>

Options:

-i, --input path to input VCF

-o, --output path to swapped VCF
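
Before deciding whether swap mode is needed, it can help to check the sample column order in the VCF header. The following is a minimal sketch, not part of neoepiscope: the helper names and the sample names TUMOR and NORMAL are assumptions for illustration, and the check relies only on the standard VCF layout in which sample columns follow the nine fixed columns of the #CHROM header line.

```python
import io


def sample_columns(vcf_lines):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # The first nine columns are fixed (CHROM through FORMAT);
            # every column after them is a sample.
            return fields[9:]
    raise ValueError("no #CHROM header line found")


def tumor_precedes_normal(vcf_lines, tumor, normal):
    """True if the tumor sample column comes before the normal one."""
    samples = sample_columns(vcf_lines)
    return samples.index(tumor) < samples.index(normal)


# Hypothetical VCF in which the normal sample is listed first.
example_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR\n"
    "1\t12345\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\n"
)
needs_swap = not tumor_precedes_normal(example_vcf, "TUMOR", "NORMAL")
print("run neoepiscope swap first" if needs_swap else "column order is fine")
```

With a real file, the same functions can be given an open file handle in place of the in-memory example.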

Add germline variation (optional)

If you would like to include germline variation in your neoepitope prediction, merge your somatic and germline VCFs for a sample prior to phasing variants:

neoepiscope merge -g <GERMLINE VCF> -s <SOMATIC VCF> -o <MERGED VCF>

Options:

-g, --germline path to germline VCF

-s, --somatic path to somatic VCF

-o, --output path to write merged VCF

-t, --tumor-id tumor ID (matching sample in tumor BAM file's read group field)

If you plan to use GATK's ReadBackedPhasing for haplotype phasing (see below), make sure to specify a tumor ID using the -t flag. It should match the sample name in the header of your tumor BAM file (the SM value in the read group field).
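
To find the value to pass with -t, you can read the SM fields from the @RG lines of the tumor BAM header. The sketch below is an illustration rather than neoepiscope code: it assumes the header has already been obtained as text (for example with samtools view -H), and the function name and sample names are hypothetical.

```python
def read_group_samples(sam_header_text):
    """Collect the distinct SM (sample) values from @RG header lines."""
    samples = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                sample = field[3:]
                if sample not in samples:
                    samples.append(sample)
    return samples


# Hypothetical header text with two read groups from the same sample.
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:248956422\n"
    "@RG\tID:lane1\tSM:patient1_tumor\tPL:ILLUMINA\n"
    "@RG\tID:lane2\tSM:patient1_tumor\tPL:ILLUMINA\n"
)
tumor_ids = read_group_samples(header)
print(tumor_ids)
```

If more than one sample name comes back, the BAM mixes samples and the tumor ID should be chosen deliberately rather than taken from the first read group.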

Predict haplotype phasing

Next, run HapCUT2 with your merged or somatic VCF and your tumor BAM file (make sure to use --indels 1 when running extractHAIRS if you wish to predict neoepitopes resulting from insertions and deletions). Before calling neoepitopes, prep your HapCUT2 output to include unphased mutations as their own haplotypes and flag germline variants if relevant:

neoepiscope prep -v <VCF> -c <HAPCUT2 OUTPUT> -o <ADJUSTED HAPCUT OUTPUT>

Options:

-v, --vcf path to VCF file used to generate HapCUT2 output

-c, --hapcut2-output path to original HapCUT2 output

-o, --output path to output file

-p, --phased flag input VCF as phased with GATK ReadBackedPhasing

Alternatively, you may perform phasing using GATK's ReadBackedPhasing on your merged or somatic VCF. If you phased variants with GATK instead of HapCUT2, make sure to use the -p flag when running neoepiscope prep to format your output:

neoepiscope prep -v <VCF> -o <ADJUSTED HAPCUT OUTPUT> -p

You may also predict neoepitopes without phasing by preparing your merged or somatic VCF:

neoepiscope prep -v <VCF> -o <ADJUSTED HAPCUT OUTPUT>
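
The three prep variants above differ only in whether -c or -p is supplied. As a rough sketch for scripting them, the helper below assembles the argument list from the flags documented in this section. The function name and file paths are hypothetical, and treating -c and -p as mutually exclusive is an assumption drawn from the three usages shown, not a documented rule.

```python
import shlex


def prep_command(vcf, output, hapcut2_output=None, gatk_phased=False):
    """Build a `neoepiscope prep` argument list from the documented flags."""
    if hapcut2_output and gatk_phased:
        # Assumption: the README shows -c and -p only in separate commands.
        raise ValueError("use either HapCUT2 output (-c) or GATK phasing (-p)")
    cmd = ["neoepiscope", "prep", "-v", vcf]
    if hapcut2_output:
        cmd += ["-c", hapcut2_output]
    cmd += ["-o", output]
    if gatk_phased:
        cmd.append("-p")
    return cmd


# Hypothetical file names for each of the three workflows.
variants = [
    prep_command("merged.vcf", "prepped.txt", hapcut2_output="hapcut2.out"),
    prep_command("merged.vcf", "prepped.txt", gatk_phased=True),
    prep_command("merged.vcf", "prepped.txt"),
]
for cmd in variants:
    print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once neoepiscope is installed; printing first makes it easy to confirm which workflow is about to run.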

Neoepitope prediction

Finally, call neoepitopes:

neoepiscope call -b <GENOME BUILD> -c <PREPPED HAPCUT2 OUTPUT> [options]

Options:

-x, --bowtie-index path to bowtie index of reference genome

-d, --dicts path to directory containing pickled dictionaries generated in index mode

-b, --build which genome build to use (human hg19 or GRCh38 or mouse mm9; overrides -x and -d options)

-c, --merged-hapcut2-output path to HapCUT2 output adjusted by neoepiscope prep

-v, --vcf path to VCF file used to generate HapCUT2 output

-o, --output path to output file

-f, --fasta produce an additional FASTA output file

-k, --kmer-size kmer size for neoepitope prediction (default 8-11 amino acids)

-p, --affinity-predictor software to use for MHC binding predictions (default MHCflurry v1 with rank and affinity scores)

-a, --alleles alleles to use for MHC binding predictions

-n, --no-affinity do not run binding affinity predictions, overrides the -p and -a options

-g, --germline how to handle germline mutations (by default includes as background variation)

-s, --somatic how to handle somatic mutations (by default includes for neoepitope enumeration)

-e, --rna-edits path to directory containing REDIportal-formatted RNA edits file

-u, --upstream-atgs handling of translation from upstream start codons: "novel" only (default), "all", "none", or "reference" only

-i, --isolate isolate mutations - disables phasing of mutations which share a haplotype

--nmd enumerate neoepitopes from nonsense mediated decay transcripts

--pp enumerate neoepitopes from polymorphic pseudogene transcripts

--igv enumerate neoepitopes from IG V transcripts

--trv enumerate neoepitopes from TR V transcripts

--allow-nonstart enumerate neoepitopes from transcripts without annotated start codons

--allow-nonstop enumerate neoepitopes from transcripts without annotated stop codons

--rna-bam path to paired end RNA-seq alignment file

--transcript-counts path to file containing per-transcript read counts

--tpm-threshold minimum transcript TPM required to retain neoepitope
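
To make the -k size range concrete, the toy sketch below enumerates every peptide window of the requested lengths and keeps those absent from the unmutated peptide. It is an illustration only and does not reproduce neoepiscope's internal logic: the function names and the short peptides are hypothetical, and it works on amino acid strings directly rather than on transcripts.

```python
def enumerate_kmers(peptide, min_size=8, max_size=11):
    """Yield every window of the peptide with length min_size..max_size."""
    for size in range(min_size, max_size + 1):
        for start in range(len(peptide) - size + 1):
            yield peptide[start:start + size]


def candidate_neoepitopes(reference, mutated, min_size=8, max_size=11):
    """Return k-mers of the mutated peptide that the reference lacks."""
    normal = set(enumerate_kmers(reference, min_size, max_size))
    novel = []
    for kmer in enumerate_kmers(mutated, min_size, max_size):
        if kmer not in normal and kmer not in novel:
            novel.append(kmer)
    return novel


# Hypothetical peptides differing by one substitution (D -> W),
# with a small k so the result is easy to inspect by eye.
toy = candidate_neoepitopes("ACDEFG", "ACWEFG", min_size=3, max_size=3)
print(toy)
```

Only windows that overlap the substituted residue survive the comparison, which is why larger k values produce more candidate peptides per mutation.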

Using the --build option requires use of our download functionality to procure and index the required reference files for human hg19, human GRCh38, and/or mouse mm9. If using an alternate genome build, you will need to download your own bowtie index and GTF files for that build and use the neoepiscope index mode to prepare them for use with the --dicts and --bowtie-index options.

Haplotype information should be included using -c /path/to/haplotype/file. This should be in the form of HapCUT2 output, generated either from your somatic VCF or from a merged germline/somatic VCF made with our neoepiscope merge functionality. The HapCUT2 output should be adjusted using our neoepiscope prep functionality to ensure that mutations that lack phasing data are still included in analysis.

If you wish to extract variant allele frequency information from your somatic VCF to be output with relevant epitopes, include the path to the somatic VCF you used to create your merged VCF using -v /path/to/VCF.
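
The reference and input options described above can be combined in a small wrapper. This is a sketch under stated assumptions rather than an official interface: the function name and file paths are hypothetical, the build string "hg19" is assumed to be an accepted value for -b, and only flags listed in the options above are used.

```python
import shlex


def call_command(prepped_hapcut2, build=None, dicts=None,
                 bowtie_index=None, vcf=None, output=None):
    """Build a `neoepiscope call` argument list from the documented flags."""
    cmd = ["neoepiscope", "call"]
    if build is not None:
        # -b overrides -x and -d, so they are left out when a build is given.
        cmd += ["-b", build]
    elif dicts is not None and bowtie_index is not None:
        cmd += ["-x", bowtie_index, "-d", dicts]
    else:
        raise ValueError("give a build, or both dicts and bowtie_index")
    cmd += ["-c", prepped_hapcut2]
    if vcf is not None:
        cmd += ["-v", vcf]  # somatic VCF, used for variant allele frequencies
    if output is not None:
        cmd += ["-o", output]  # without -o, results go to standard out
    return cmd


# Hypothetical paths; "hg19" is assumed to be a valid -b value.
cmd = call_command("prepped.txt", build="hg19",
                   vcf="somatic.vcf", output="neoepitopes.txt")
print(shlex.join(cmd))
```

For an alternate genome build, pass dicts and bowtie_index instead of build so that the -d and -x options point at the files prepared in index mode.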

To specify the output file, use -o /path/to/output_file. If no output file is specified, the output will be written to standard out. By default, only data on neoepitopes is output in the file. By using the --fasta option, an additional f
