Seq2Neo
Seq2Neo: a comprehensive pipeline for cancer neoantigen immunogenicity prediction
Overview
Neoantigens derived from somatic DNA alterations are ideal cancer-specific targets. However, not all somatic DNA mutations give rise to immunogenicity in cancer cells, and efficient tools for predicting neoepitope immunogenicity are still urgently needed. Here we present the Seq2Neo pipeline, a one-stop solution for predicting neoepitope features from raw sequencing data. It supports neoantigens derived from different types of genomic DNA alterations, including point mutations, insertions and deletions, and gene fusions. Importantly, a convolutional neural network (CNN) based model has been trained to predict neoepitope immunogenicity, and this model shows improved performance over currently available tools on independent datasets.
Installation
Seq2Neo runs on Linux operating systems such as CentOS (recommended). It is open-source software released under the Academic Free License (AFL) v3.0.
Conda
We strongly recommend installing with the conda command line, as this resolves dependencies automatically. The package page is https://anaconda.org/liuxslab/seq2neo.
- Firstly, install Anaconda or Miniconda (recommended), and set the channels in the ~/.condarc file like this:
    channels:
      - conda-forge
      - bioconda
      - menpo
      - main
      - r
      - msys2
      - pytorch
      - pytorch-lts
      - simpleitk
    show_channel_urls: true
  You can replace these with Tsinghua mirrors or others.
- Secondly, run the following commands on your Linux system to create a new environment named Seq2Neo (or any other name you like), and then activate it:
    conda create -n Seq2Neo
    conda activate Seq2Neo
- Thirdly, install the package with the following conda command:
    conda install -c liuxslab seq2neo
- Finally, install the following packages manually, because permission restrictions or other reasons prevent them from being installed automatically:
- Annovar == latest (ANNOVAR website, openbioinformatics.org)
- HLAHD == 1.4.0 (HLA-HD, kyoto-u.ac.jp)
- netCTLpan == 1.1.b (NetCTLpan 1.1, Services, DTU Health Tech)
- netMHCpan == 4.1.b (NetMHCpan 4.1, Services, DTU Health Tech)
- STAR-Fusion == 1.10.1 (STAR-Fusion/STAR-Fusion: STAR-Fusion codebase, github.com)
Follow the corresponding official instructions to install these packages on your system.
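Before running the pipeline, it can help to confirm that the manually installed tools are reachable from the shell. The following is a minimal sketch, not part of Seq2Neo: the helper name check_tools is hypothetical, the executable names (annotate_variation.pl, hlahd.sh, netCTLpan, netMHCpan, STAR-Fusion) are assumed from each tool's usual distribution, and whether Seq2Neo expects them on PATH is also an assumption.

```shell
# Report which of the given executables cannot be found on PATH.
# Returns 0 when all are present, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumptions based on each tool's usual distribution.
if check_tools annotate_variation.pl hlahd.sh netCTLpan netMHCpan STAR-Fusion; then
    echo "all manually installed tools found"
else
    echo "install the tools reported above before running seq2neo"
fi
```

If a tool is installed but reported as missing, adding its installation directory to PATH is usually enough.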
Docker
We also provide a Docker image (liuxslab/seq2neo on Docker Hub) that contains all package dependencies. You need to install Docker on your system in advance. The command docker pull liuxslab/seq2neo:latest then pulls the latest seq2neo image to your local machine. Put the resource files required by BWA, Mutect2, and the other tools in one folder, resource_files, organized into sub-folders such as bqsr_resource, mutect2_resource, starfusion_resource, and ref_genome (see the section "The module of whole"). Then execute the following commands to start a Docker container and activate the Seq2Neo conda environment, which includes seq2neo and its dependencies:
docker run -it -v /path/to/resource_files:/home/resource_files liuxslab/seq2neo:latest /bin/bash
cd /home/ # enter home directory so you can find the binding resource files
conda activate Seq2Neo
In the Seq2Neo environment you can run seq2neo commands; please refer to the section "The module of whole" below.
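The resource_files layout mentioned in this section can be prepared on the host before starting the container. The sketch below is only illustrative: the function name make_resource_layout and the RESOURCE_ROOT variable are hypothetical, while the four sub-folder names and the docker command are taken from the text above.

```shell
# Create the resource folder layout described in this section under a chosen root.
# Sub-folder names come from the text; the function name is hypothetical.
make_resource_layout() {
    layout_root="$1"
    for sub in bqsr_resource mutect2_resource starfusion_resource ref_genome; do
        mkdir -p "$layout_root/$sub" || return 1
    done
}

# Example: build the layout in ./resource_files (or in $RESOURCE_ROOT if set),
# then print the docker command from this section with the absolute host path.
root="${RESOURCE_ROOT:-./resource_files}"
make_resource_layout "$root"
abs_root=$(cd "$root" && pwd)
echo "docker run -it -v $abs_root:/home/resource_files liuxslab/seq2neo:latest /bin/bash"
```

The absolute path is resolved because the host path in the docker run command above is written as an absolute path.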
Pip (not recommended)
You can install the stable release of Seq2Neo with:
pip install Seq2Neo
However, you must then install all of the dependencies manually. The following software and packages should be installed in advance:
- bamtools=2.5.1
- bwa=0.7.17
- fastp=0.23.2
- perl=5.26.2=h470a237_0
- samtools=1.15.1
- star=2.7.8a
- tpmcalculator=0.0.4
- vcftools=0.1.16
- bowtie2 == 2.3.5
- gatk == 4.2.5
Then, you should also install the packages listed for manual installation at the end of the Conda section.
Usage
Seq2Neo consists of three modules: whole, download, and immuno. The whole module runs the entire pipeline and contains several subprocesses. The download module downloads a specified version of the human reference genome (hg19 / hg38) from GATK and indexes it. The immuno module predicts the immunogenicity score of specified peptides and MHCs:
usage: seq2neo [-h] {whole,immuno,download} ...
A pipeline from sequence to neoantigen prediction
positional arguments:
{whole,immuno,download}
whole Run whole pipeline(Seq2Neo) with fastq/bam/sam/vcf file
immuno Run immunogenicity prediction with specified peptides and MHCs
download downloading human reference genome from GATK and building indexes
optional arguments:
-h, --help show this help message and exit
Thanks for using Seq2Neo
The module of whole
How to download necessary reference files
You need to download the necessary reference files before running Seq2Neo:
- Download the three BQSR known-sites files used to recalibrate base quality scores. Put these files in a directory such as bqsr_resource; the index files are needed to speed up Seq2Neo. The commands are as follows:
    mkdir bqsr_resource && cd bqsr_resource
    prefix=ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/
    wget ${prefix}dbsnp_146.hg38.vcf.gz
    wget ${prefix}dbsnp_146.hg38.vcf.gz.tbi
    wget ${prefix}1000G_phase1.snps.high_confidence.hg38.vcf.gz
    wget ${prefix}1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
    wget ${prefix}Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    wget ${prefix}Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
- Download the hg38 datasets of annovar via the following commands:
    cd /path/to/annovar
    perl annotate_variation.pl --downdb --webfrom annovar --buildver hg38 refGene humandb/
- Download the reference files needed to call Mutect2. Put these files in a directory such as mutect2_resource; the index files are needed to speed up Seq2Neo. The commands are as follows:
    mkdir mutect2_resource && cd mutect2_resource
    prefix=ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/
    wget ${prefix}af-only-gnomad.hg38.vcf.gz
    wget ${prefix}af-only-gnomad.hg38.vcf.gz.tbi
    wget ${prefix}GetPileupSummaries/small_exac_common_3.hg38.vcf.gz
    wget ${prefix}GetPileupSummaries/small_exac_common_3.hg38.vcf.gz.tbi
    prefix=https://storage.googleapis.com/gatk-best-practices/somatic-hg38/
    wget ${prefix}1000g_pon.hg38.vcf.gz
    wget ${prefix}1000g_pon.hg38.vcf.gz.tbi
- Download the AGFusion database and the pyensembl reference genome; we select release 95, the maximum release, for download:
    pyensembl install --species homo_sapiens --release 95
    agfusion download -g hg38 --release 95
- Download the genome library of STAR-Fusion (1.10.1), used to call gene fusions, via the following commands:
    ref=GRCh38_gencode_v37_CTAT_lib_Mar012021.plug-n-play.tar.gz
    wget https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/__genome_libs_StarFv1.10/${ref}
    tar -zxvf ${ref}
  The compressed genome library is about 31 G in size. Researchers in China can download it at a higher speed by using tools such as Thunder (see the Thunder official website).
- Download the human reference genome and build indexes via the following commands:
    mkdir ref_genome && cd ref_genome
    seq2neo download --build hg38 --dir .
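Because the steps above download each .vcf.gz file together with its .tbi index, a quick check that no index was skipped can save a failed run later. The following is a minimal sketch under stated assumptions: the function name missing_indexes is hypothetical, and it only looks for a .tbi file next to each .vcf.gz in the directory names used above.

```shell
# List every *.vcf.gz in a directory that has no matching *.vcf.gz.tbi index.
# Prints the offending paths and returns 1 if any are found, 0 otherwise.
missing_indexes() {
    dir="$1"
    rc=0
    for vcf in "$dir"/*.vcf.gz; do
        [ -e "$vcf" ] || continue    # the glob matched nothing
        if [ ! -e "$vcf.tbi" ]; then
            echo "$vcf"
            rc=1
        fi
    done
    return "$rc"
}

# Check the two resource directories created in the steps above, if present.
for d in bqsr_resource mutect2_resource; do
    if [ -d "$d" ]; then
        missing_indexes "$d" || echo "index files missing in $d" >&2
    fi
done
```

Run it from the directory that contains bqsr_resource and mutect2_resource; any path it prints is a VCF whose index still needs to be downloaded.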
How to run
Suppose you have the following files: tumor RNA-seq and WES data, normal WES data, a VCF, and the corresponding sam and sort_bam files. You can then run Seq2Neo to obtain potential neoantigens in different situations. The following are some examples:
- You have tumor DNA, tumor RNA, and normal DNA fastq files:
    seq2neo whole --data-type fastq --ref Homo_sapiens_assembly38.fasta --normal-dna normal_dna_1.fastq normal_dna_2.fastq --tumor-dna tumor_dna_1.fastq tumor_dna_2.fastq --tumor-rna tumor_rna_1.fastq tumor_rna_2.fastq --normal-name normal_name --tumor-name tumor_name --annovar-db-dir annovar/humandb/ --known-site-dir bqsr_resource/ --mutect2-dataset-dir mutect2_resource/ --genome-lib-dir GRCh38_gencode_v37_CTAT_lib_Mar012021.plug-n-play/ctat_genome_lib_build_dir/ --agfusion-db agfusion.homo_sapiens.95.db --pon 1000g_pon.hg38.vcf.gz --len 8 9 10 11 --threadN 15 --java-options '"-Xmx50G"' --hlahd-dir hlahd.1.4.0/ --out out/
- You have tumor DNA and tumor RNA fastq files:
    seq2neo whole --data-type without-normal-dna --ref Homo_sapiens_assembly38.fasta --tumor-dna tumor_dna_1.fastq tumor_dna_2.fastq --tumor-rna tumor_rna_1.fastq tumor_rna_2.fastq --tumor-name tumor_name --annovar-db-dir annovar/humandb/ --known-site-dir bqsr_resource/ --mutect2-dataset-dir mutect2_resource/ --genome-lib-dir GRCh38_gencode_v37_CTAT_lib_Mar012021.plug-n-play/ctat_genome_lib_build_dir/ --agfusion-db agfusion.homo_sapiens.95.db --pon 1000g_pon.hg38.vcf.gz --len 8 9 --threadN 15 --java-options '"-Xmx50G"' --hlahd-dir hlahd.1.4.0/ --out out/
- You have tumor DNA and normal DNA fastq files:
    seq2neo whole --data-type without-tumor-rna --ref Homo_sapiens_assembly38.fasta --normal-dna normal_dna_1.fastq normal_dna_2.fastq --tumor-dna tumor_dna_1