ERVcaller
ERVcaller is a tool designed to accurately detect and genotype non-reference unfixed endogenous retroviruses (ERVs) and other transposable elements (TEs) in the human genome using next-generation sequencing (NGS) data. We evaluated the tool using both simulated and real benchmark whole-genome sequencing (WGS) datasets. ERVcaller can accurately detect TE insertions of any length, particularly ERVs. It allows the use of a TE reference library regardless of sequence complexity, such as the entire RepBase database. It is easy to install and use from the command line.
ERVcaller v1.4
Introduction
ERVcaller is a tool designed to accurately detect and genotype non-reference unfixed endogenous retroviruses (ERVs) and other transposable elements (TEs) in the human genome using next-generation sequencing (NGS) data. We evaluated the tool using both simulated and benchmark whole-genome sequencing (WGS) datasets. ERVcaller is capable of accurately detecting various TE insertions of any length, particularly ERVs. It can be applied to both paired-end and single-end WGS, WES, or targeted DNA sequencing data. It supports FASTQ or BAM file(s) generated by different aligners (only BWA and Bowtie were tested). In addition, ERVcaller can detect insertion breakpoints at single-nucleotide resolution. It accepts either consensus TE sequences or a TE library containing many TE sequences as the reference, such as the entire RepBase database. Thus, ERVcaller can be used to detect insertions from highly mutated or novel TE sequences. It is easy to install and use from the command line. Complementary to ERVcaller, other bioinformatics tools designed to detect large deletions may be used to detect TEs that are present in the human reference genome but absent from the tested samples.
We have also published a book chapter that provides a step-by-step guide to using ERVcaller and other tools to characterize polymorphic TE insertions in human populations.
• Xun Chen, Guillaume Bourque, and Clement Goubert (2023): Genotyping of Transposable Element Insertions Segregating in Human Populations Using Short-Read Realignments, Transposable Elements: Methods and Protocols, Methods in Molecular Biology, vol. 2607, https://doi.org/10.1007/978-1-0716-2883-6_4
Installation
Extract the latest ERVcaller installer
$ tar vxzf ERVcaller_v.1.4.tar.gz
Installing dependent software
Install the following software separately and make each tool available in the default search path (for example, by using the Linux command “export” or by adding it to your .bashrc).
• BWA-0.7.10: http://bio-bwa.sourceforge.net/bwa.shtml
• Samtools-1.6 (or later than 1.2): http://www.htslib.org/doc/samtools.html
• R-3.3.2 (or higher): https://www.r-project.org/
• SE_MEI (Modified version included in the Scripts folder of the ERVcaller installer)
Preparing the references
Human reference genome (hg38 by default; if BAM file(s) are used as input, use the same build that the reads were aligned to)
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
$ gunzip hg38.fa.gz
$ bwa index hg38.fa
TE reference library. Use either a TE reference provided with the ERVcaller installer (i.e., the TE consensus sequences consisting of one Alu, LINE1, SVA, and HERV-K consensus sequence each; the human TE library containing 23 TE sequences; and the ERV library extracted from the Repbase database) or a user-defined TE reference library.
$ cd user_installed_full_path/Database/
$ bwa index TE_consensus.fa
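Before running ERVcaller, it can help to confirm that both references were indexed completely. The following is a minimal sketch, assuming the five index files that `bwa index` normally writes next to the FASTA file (.amb, .ann, .bwt, .pac, .sa); the function names are hypothetical helpers and are not part of ERVcaller.

```shell
# Return 0 if all five BWA index files exist (and are non-empty) next to the FASTA.
bwa_index_complete() {
  for ext in amb ann bwt pac sa; do
    [ -s "$1.$ext" ] || return 1
  done
  return 0
}

# Print the index status of each reference given as an argument.
report_index_status() {
  for ref in "$@"; do
    if bwa_index_complete "$ref"; then
      printf '%s: index complete\n' "$ref"
    else
      printf '%s: index missing or incomplete, run: bwa index %s\n' "$ref" "$ref"
    fi
  done
}

# File names follow the commands above; adjust the paths to your own layout.
report_index_status hg38.fa TE_consensus.fa
```

The check only looks at file presence and size, so it will not detect an index built from a different version of the FASTA file.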
Running ERVcaller
Make the installed dependent tools available in the default search path
$ export PATH=$PATH:$HOME/bwa-master/
$ export PATH=$PATH:$HOME/samtools-1.6/
$ export PATH=$PATH:$HOME/SE-MEI/
$ export PATH=$PATH:$HOME/R/
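After exporting the paths, a quick check can confirm that the tools are actually reachable. This is a minimal sketch: the helper `missing_tools` is hypothetical, and the executable names (`bwa`, `samtools`, `R`, `perl`) are assumptions. The SE_MEI executable is left out because its name is not given here.

```shell
# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Executable names are assumptions; add or rename entries to match your install.
missing=$(missing_tools bwa samtools R perl)
if [ -n "$missing" ]; then
  # $missing is intentionally unquoted so that each tool is printed on its own line.
  printf 'Not found on PATH: %s\n' $missing
else
  printf 'All dependencies found on PATH\n'
fi
```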
Print help page
$ perl user_installed_full_path/ERVcaller_v1.4.pl
ERVcaller: running command line
$ perl user_installed_path/ERVcaller_v1.4.pl -i sample_ID -f .bam -H hg38.fa -T TE_consensus.fa -S 20 -BWA_MEM -t No._threads
Detecting TE insertions using a BAM file as input
$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .bam -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM
Detecting TE insertions using paired-end FASTQ files as input
$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .fq.gz -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM
Detecting TE insertions using multiple BAM files as input
$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .list -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM -m
Detecting and genotyping TE insertions using a BAM file as input
$ perl user_installed_path/ERVcaller_v.1.4.pl -i TE_seq -f .bam -H hg38.fa -T TE_consensus.fa -I folder_of_input_data -O folder_for_output_files -t 12 -S 20 -BWA_MEM -G
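When a folder holds many BAM files, the per-sample command can be generated rather than typed by hand. The sketch below is a dry run that only prints commands, using the flags from the genotyping example above. It assumes that the sample ID passed to -i is the BAM file name without the .bam suffix. The function name is a hypothetical helper, and the folder names are the placeholders used above.

```shell
# Print (do not run) one ERVcaller command per BAM file found in a folder.
print_ervcaller_commands() {
  input_dir=$1
  output_dir=$2
  threads=$3
  for bam in "$input_dir"/*.bam; do
    [ -e "$bam" ] || continue   # skip the unexpanded pattern when no BAM file exists
    sample_id=$(basename "$bam" .bam)
    printf 'perl user_installed_path/ERVcaller_v.1.4.pl -i %s -f .bam -H hg38.fa -T TE_consensus.fa -I %s -O %s -t %s -S 20 -BWA_MEM -G\n' \
      "$sample_id" "$input_dir" "$output_dir" "$threads"
  done
}

# Arguments: input folder, output folder, number of threads.
print_ervcaller_commands folder_of_input_data folder_for_output_files 12
```

Review the printed lines first; they can then be redirected to a file and executed, or submitted to a job scheduler one line per sample.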
Output file
Output for each sample
ERVcaller writes an output VCF file (VCFv4.2) when the run completes. All annotations are listed below:
##fileformat=VCFv4.2
##fileDate=2019121
##source=ERVcaller_v.1.4
##reference=file:hg38.fa
##ALT=<ID=INS:MEI:HERVK,Description="HERVK insertion">
##INFO=<ID=TSD,Number=2,Type=String,Description="NUCLEOTIDE,LEN, Nucleotides and length of the Target Site Duplication (NULL for unknown)">
##INFO=<ID=INFOR,Number=6,Type=String,Description="NAME,START,END,LEN,DIRECTION,STATUS; NULL for unknown values. Status of detected TE: 0 = Inconsistent direction for the supporting reads; 1 = One breakpoint detected by only chimeric and/or improper reads without split reads; 2 = Only one breakpoint is detected and covered by split reads; 3 = Two breakpoints detected, and both of them are not covered by split reads; 4 = Two breakpoints detected, and one of them are not covered by split reads; 5 = Two breakpoints detected, and both of them are covered by split reads;">
##INFO=<ID=CR,Number=1,Type=Integer,Description="Number of chimeric and improper reads support the TE insertion">
##INFO=<ID=SR,Number=1,Type=String,Description="Number of split reads support TE insertion and the breakpoint">
##INFO=<ID=GTF,Number=1,Type=String,Description="If the detected TE insertions genotyped">
##INFO=<ID=GR,Number=1,Type=Float,Description="The ratio of the number of reads support TE insertions versus the total number of reads at this TE insertion location">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype quality (Phred transformed)">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype likelihood">
##FORMAT=<ID=DPI,Number=1,Type=Integer,Description="The number of reads support TE insertions">
##FORMAT=<ID=DPN,Number=1,Type=Integer,Description="The number of reads support non-TE insertions">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TE_seq
chr1 5617379 . T <INS_MEI:HERV> . . TSD=NULL,NULL;INFOR=HERVK,1,7831,7831,+,4;CR=64;SR=3;GTF=YES;GR=1.000 GT:GQ:GL:DPN:DPI 1/1:40:0,0,1:0:67
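To illustrate how the INFO and sample columns fit together, the sketch below pulls the TE name, the detection status (sixth INFOR value), the CR and SR counts, and the genotype out of the example record above. It assumes tab-separated columns, as in standard VCF, and the function name `summarize_record` is a hypothetical helper.

```shell
# The example record from above, rewritten with explicit tab separators (assumed).
record=$(printf 'chr1\t5617379\t.\tT\t<INS_MEI:HERV>\t.\t.\tTSD=NULL,NULL;INFOR=HERVK,1,7831,7831,+,4;CR=64;SR=3;GTF=YES;GR=1.000\tGT:GQ:GL:DPN:DPI\t1/1:40:0,0,1:0:67')

# Read VCF lines on stdin and print one tab-separated summary line per record.
summarize_record() {
  awk -F'\t' '
    !/^#/ {
      split("", val)                      # clear values left over from the previous record
      n = split($8, info, ";")            # INFO column: KEY=VALUE pairs separated by ";"
      for (i = 1; i <= n; i++) {
        split(info[i], kv, "=")
        val[kv[1]] = kv[2]
      }
      split(val["INFOR"], infor, ",")     # NAME,START,END,LEN,DIRECTION,STATUS
      split($10, sample, ":")             # first sample column; GT is its first field
      printf "%s:%s\tTE=%s\tstatus=%s\tCR=%s\tSR=%s\tGT=%s\n", \
        $1, $2, infor[1], infor[6], val["CR"], val["SR"], sample[1]
    }'
}

summary=$(printf '%s\n' "$record" | summarize_record)
printf '%s\n' "$summary"
```

The same function can read a whole file, for example `summarize_record < sample.vcf`, where `sample.vcf` stands for any single-sample ERVcaller output; header lines starting with `#` are skipped.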
Merging multiple samples
Create a file containing the sample list
Combine multiple samples using a provided list of consensus TE loci
$ perl user_installed_path/Scripts/Combine_VCF_files.pl -l sample_list -c 1KGP.TE.sites.vcf -o Output_merged.vcf
Combine multiple samples without providing a list of consensus TE loci
$ perl user_installed_path/Scripts/Combine_VCF_files.pl -l sample_list -o Output_merged.vcf
Calculate the number of reads supporting non-insertions at the consensus TE loci for each sample (filtering low-quality TE loci out of the combined VCF file before running this script is recommended)
$ perl user_installed_path/Scripts/Calculate_reads_among_nonTE_locations.pl -i Output_merged.vcf -S sampleID -o output.nonTE -b bamFile.bam -s paired-end -l length_insertsize -L std_insertsize -r read_length -t threads
Distinguish missing genotypes and non-insertion genotypes at the consensus TE loci to get the final genotypes for all samples
$ cat *.nonTE >nonTE_allsamples
$ perl user_installed_path/Scripts/Distinguish_nonTE_from_missing_genotype.pl -n nonTE_allsamples -v Output_merged.vcf -o Output_merged-final.vcf
FAQ
How to install dependent tools?
You can follow the links listed below for information about downloading and/or installing all the dependent tools except the modified SE_MEI which is already included with ERVcaller:
• BWA-0.7.10: http://bio-bwa.sourceforge.net/bwa.shtml
• Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
• Samtools-1.6 (or later than 1.2): http://www.htslib.org/doc/samtools.html
• R: https://www.r-project.org/
How to set the shell environment variables for the installed dependent tools?
You can set the variables temporarily by running the Linux “export” command each time before you run ERVcaller, or you can add the command to your shell profile file (e.g., .bashrc) for long-term use. Do this for every tool listed above except R, whose path is usually set during installation. For example:
$ export PATH=$PATH:/home/Tools/samtools/
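For long-term use, the export line can be appended to the profile file only if it is not already there, so repeated setup runs do not add duplicates. This is a minimal sketch with a hypothetical helper, `add_path_to_profile`. It is demonstrated on a scratch file rather than on a real .bashrc.

```shell
# Append "export PATH=$PATH:<dir>" to a profile file unless that exact line exists.
add_path_to_profile() {
  tool_dir=$1
  profile=$2
  line="export PATH=\$PATH:$tool_dir"
  touch "$profile"
  grep -qxF -e "$line" "$profile" || printf '%s\n' "$line" >> "$profile"
}

# Demonstrated on a scratch file; pass ~/.bashrc as the second argument for real use.
demo_profile=$(mktemp)
add_path_to_profile /home/Tools/samtools/ "$demo_profile"
add_path_to_profile /home/Tools/samtools/ "$demo_profile"   # second call adds nothing
cat "$demo_profile"
```

Changes to a profile file take effect in new shells, or in the current shell after running `source` on the file.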
Where to get the human reference genome?
You can download hg38 here: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/. It is recommended that the file hg38.fa.gz is downloaded and unzipped for reference indexing.
Can we use other TE references we collected ourselves?
Yes, you can. You should be able to use any TE reference sequences specific to your research.
Where can I find test data?
You can find the test input data under the ERVcaller_v.1.4/test/ folder. There is example input data in both BAM and FASTQ format for testing.
There is also an example VCF output file in the folder: ERVcaller_v.1.4/test/example_output/
Where can I find more information about the output format?
You can find the full information here: https://samtools.github.io/hts-specs/VCFv4.2.pdf.
Which parameters were used to produce the example test output file?
The following command line was used to produce the example file:
$ perl ERVcaller_v.1.4.pl -i TE_seq -f .bam -H hg38.fa -T TE_consensus.fa -G
How to speed up ERVcaller?
You can use “-t <threads>” to use multi-thread computing. You can skip the
