# xTea

Comprehensive TE insertion identification with WGS/WES data from multiple sequencing technologies.

xTea (comprehensive transposable element analyzer) is designed to identify TE insertions from paired-end Illumina reads, barcode linked-reads, long reads (PacBio or Nanopore), or hybrid data from different sequencing platforms, and takes whole-exome sequencing (WES) or whole-genome sequencing (WGS) data as input.

## Download

- Short reads (Illumina and Linked-Reads)
  - 1.1 Latest version:

    ```
    git clone https://github.com/parklab/xTea.git
    ```

  - 1.2 Cloud binary version:

    ```
    git clone --single-branch --branch release_xTea_cloud_1.0.0-beta https://github.com/parklab/xTea.git
    ```

- Long reads (PacBio or Nanopore):

  ```
  git clone --single-branch --branch xTea_long_release_v0.1.0 https://github.com/parklab/xTea.git
  ```

- De novo TE insertion (trio data as input; check the xTea-trioML branch for more details):

  ```
  git clone --single-branch --branch xTea-trioML https://github.com/parklab/xTea.git
  ```

- Mosaic TE insertion (high-depth WGS data as input; check the xtea_mosaic branch for more details):

  ```
  git clone --single-branch --branch xtea_mosaic https://github.com/parklab/xTea.git
  ```

- Pre-processed repeat library used by xTea (this library is used for both short and long reads):

  ```
  wget https://github.com/parklab/xTea/raw/master/rep_lib_annotation.tar.gz
  ```

- Gene annotation files are downloaded from GENCODE; decompressed gff3 files are required.
  - For GRCh38 (or hg38), gff3 files are downloaded and decompressed from https://www.gencodegenes.org/human/release_33.html
  - For GRCh37 (or hg19), gff3 files are downloaded and decompressed from https://www.gencodegenes.org/human/release_33lift37.html
  - For CHM13v2, gff3 files are downloaded from https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13.draft_v2.0.gene_annotation.gff3
  - Or use the latest release.
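The repeat library tarball must be decompressed before its directory can be passed to xTea (via the `-l` option described below). A minimal sketch of the extraction step, demonstrated on a small locally created stand-in archive so it runs without the large download; the flags are the same for the real `rep_lib_annotation.tar.gz`:

```shell
# For the real library the command would be:
#   tar -zxvf rep_lib_annotation.tar.gz
# Build a tiny stand-in archive with a similar layout (hypothetical contents):
mkdir -p rep_lib_demo/consensus
echo ">LINE1" > rep_lib_demo/consensus/LINE1.fa
tar -czf rep_lib_demo.tar.gz rep_lib_demo
rm -r rep_lib_demo

# Extract it; -z = gzip, -x = extract, -f = archive file
tar -zxf rep_lib_demo.tar.gz
ls rep_lib_demo/consensus   # LINE1.fa
```

The GENCODE gff3 files are handled the same way, except they are plain gzip files, so `gunzip file.gff3.gz` is enough.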
## Dependencies

- bwa (version 0.7.17 or later, which supports the required -o option), which can be downloaded from https://github.com/lh3/bwa
- samtools (version 1.0 or later), which can be downloaded from https://github.com/samtools
- minimap2 (for long reads only), which can be downloaded from https://github.com/lh3/minimap2
- wtdbg2 (for long reads only), which can be downloaded from https://github.com/ruanjue/wtdbg2
- Python 2.7+/3.6+
- For the following packages, only a conda-based installation is shown. You may also install them in other ways, such as with pip.
  - pysam (https://github.com/pysam-developers/pysam, version 0.12 or later):

    ```
    conda config --add channels r
    conda config --add channels bioconda
    conda install pysam -y
    ```

  - sortedcontainers:

    ```
    conda install sortedcontainers -y
    ```

  - numpy, scikit-learn, and pandas:

    ```
    conda install numpy scikit-learn=0.18.1 pandas -y
    ```

  - DF21 (used to replace scikit-learn, for which several users have reported version incompatibilities):

    ```
    pip install deep-forest
    ```

Note: bwa and samtools need to be added to the $PATH.
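Since bwa and samtools must be on $PATH, a quick check before generating jobs can save a failed run. A minimal sketch; `check_tool` is a hypothetical helper, not part of xTea:

```shell
# Prints "<name>: found" if the executable is on $PATH, "<name>: MISSING" otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: MISSING"
    fi
}

# Tools the short-read pipeline needs; add minimap2 and wtdbg2 for long reads.
for tool in bwa samtools python; do
    check_tool "$tool"
done
```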
## Install

- Use conda

  xtea is a bioconda package. To install, first make sure the bioconda channel has been added:

  ```
  conda config --add channels defaults
  conda config --add channels bioconda
  conda config --add channels conda-forge
  ```

  Then install xtea (while creating a new environment):

  ```
  conda create -n your_env xtea=0.1.6
  ```

  Or install directly via:

  ```
  conda install -y xtea=0.1.6
  ```

- Install-free

  If the dependencies have already been installed, install-free mode is recommended: one can directly run the downloaded Python scripts.
## Run xTea

1. Input

   - A sample id list file, e.g. a file named `sample_id.txt` with content as follows (each line represents one unique sample id):

     ```
     NA12878
     NA12877
     ```

   - A file listing the alignments:

     - An Illumina bam/cram file (sorted and indexed) list, e.g. a file named `illumina_bam_list.txt` with content as follows (two columns separated by a space or tab: sample-id bam-path):

       ```
       NA12878 /path/na12878_illumina_1_sorted.bam
       NA12877 /path/na12877_illumina_1_sorted.bam
       ```

     - A 10X bam/cram file (sorted and indexed; see BarcodeMate regarding barcode-based indices) list, e.g. a file named `10X_bam_list.txt` with content as follows (three columns separated by a space or tab: sample-id bam-path barcode-index-bam-path):

       ```
       NA12878 /path/na12878_10X_1_sorted.bam /path/na12878_10X_1_barcode_indexed.bam
       NA12877 /path/na12877_10X_1_sorted.bam /path/na12877_10X_1_barcode_indexed.bam
       ```

     - A case-ctrl bam/cram file list (three columns separated by a space or tab: sample-id case-bam-path ctrl-bam-path):

       ```
       DO0001 /path/DO001_case_sorted.bam /path/DO001_ctrl_sorted.bam
       DO0002 /path/DO002_case_sorted.bam /path/DO002_ctrl_sorted.bam
       ```
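The input files above are plain text, so they can be generated directly in the shell. A minimal sketch using the example sample IDs; the BAM paths are placeholders, not real files:

```shell
# One unique sample id per line:
cat > sample_id.txt <<'EOF'
NA12878
NA12877
EOF

# Two whitespace-separated columns: sample-id bam-path (placeholder paths):
cat > illumina_bam_list.txt <<'EOF'
NA12878 /path/na12878_illumina_1_sorted.bam
NA12877 /path/na12877_illumina_1_sorted.bam
EOF
```

The sample ids in the first column of the bam list must match the ids in `sample_id.txt`.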
2. Run the pipeline from a local cluster or machine

   2.1 Generate the running script (in install-free mode, use the full path of the downloaded `bin/xtea` instead).

   Run on a cluster or a single node (by default `xtea` assumes the reference genome is GRCh38 or hg38; for hg19 or GRCh37, please use `xtea_hg19`; for CHM13, please use `gnrt_pipeline_local_chm13.py`).

   - Here, the slurm system is used as an example. If using LSF, replace `--slurm` with `--lsf`. For clusters other than slurm or LSF, users must adjust the generated shell script header accordingly. Users must also adjust the number of cores (`-n`) and the memory (`-m`) accordingly; in general, each core requires 2-3G of memory. For very high-depth bam files, the runtime (set with `-t`) may need to be longer.

   - Note that `--xtea` is a required option that points to the exact folder containing the python scripts.

   - Using only Illumina data:

     ```
     xtea -i sample_id.txt -b illumina_bam_list.txt -x null -p ./path_work_folder/ -o submit_jobs.sh -l /home/rep_lib_annotation/ -r /home/reference/genome.fa -g /home/gene_annotation_file.gff3 --xtea /home/ec2-user/xTea/xtea/ -f 5907 -y 7 --slurm -t 0-12:00 -q short -n 8 -m 25
     ```

   - Using only 10X data:

     ```
     xtea -i sample_id.txt -b null -x 10X_bam_list.txt -p ./path_work_folder/ -o submit_jobs.sh -l /home/ec2-user/rep_lib_annotation/ -r /home/ec2-user/reference/genome.fa -g /home/gene_annotation_file.gff3 --xtea /home/ec2-user/xTea/xtea/ -y 7 -f 5907 --slurm -t 0-12:00 -q short -n 8 -m 25
     ```

   - Using hybrid data of 10X and Illumina:

     ```
     xtea -i sample_id.txt -b illumina_bam_list.txt -x 10X_bam_list.txt -p ./path_work_folder/ -o submit_jobs.sh -l /home/ec2-user/rep_lib_annotation/ -r /home/ec2-user/reference/genome.fa -g /home/gene_annotation_file.gff3 --xtea /home/ec2-user/xTea/xtea/ -y 7 -f 5907 --slurm -t 0-12:00 -q short -n 8 -m 25
     ```

   - Using case-ctrl mode:

     ```
     xtea --case_ctrl --tumor -i sample_id.txt -b case_ctrl_bam_list.txt -p ./path_work_folder/ -o submit_jobs.sh -l /home/ec2-user/rep_lib_annotation/ -r /home/ec2-user/reference/genome.fa -g /home/gene_annotation_file.gff3 --xtea /home/ec2-user/xTea/xtea/ -y 7 -f 5907 --slurm -t 0-12:00 -q short -n 8 -m 25
     ```

   - Working with long reads (non case-ctrl; for more detailed steps, please check the xTea_long_release_v0.1.0 branch):

     ```
     xtea_long -i sample_id.txt -b long_read_bam_list.txt -p ./path_work_folder/ -o submit_jobs.sh --rmsk ./rep_lib_annotation/LINE/hg38/hg38_L1_larger_500_with_all_L1HS.out -r /home/ec2-user/reference/genome.fa --cns ./rep_lib_annotation/consensus/LINE1.fa --rep /home/ec2-user/rep_lib_annotation/ --xtea /home/ec2-user/xTea_long/xtea_long/ -f 31 -y 15 -n 8 -m 32 --slurm -q long -t 2-0:0:0
     ```
   Parameters:

   Required:
   - `-i`: sample id list (one sample id per line);
   - `-b`: Illumina bam/cram file list (sorted and indexed; one file per line);
   - `-x`: 10X bam file list (sorted and indexed; one file per line);
   - `-p`: working directory, where the results and temporary files will be saved;
   - `-l`: repeat library directory (the directory containing the decompressed files from "rep_lib_annotation.tar.gz");
   - `-r`: reference genome fasta/fa file;
   - `-y`: type of repeats to process (1-L1, 2-Alu, 4-SVA, 8-HERV; sum the numbers corresponding to the repeat types to process several of them; for example, to run L1 and SVA only, use `-y 5`). Each repeat type is processed separately, but some of the early processing steps are shared among repeat types. Thus, when analyzing a large cohort, it is highly recommended, to improve efficiency (and save money on the cloud), to run the tool on one repeat type first and on the rest in a second run: for example, first use `-y 1`, then use `-y 6`;
   - `-f`: steps to run (5907 means run all the steps);
   - `--xtea`: full path of the xTea/xtea folder (or the xTea_long_release_v0.1.0 folder for the long reads module), where the python scripts reside;
   - `-g`: gene annotation file in gff3 format;
   - `-o`: generated running script under the working folder.

   Optional:
   - `-n`: number of cores (default 8; should be an integer);
   - `-m`: maximum memory in GB (default 25; should be an integer);
   - `-q`: cluster partition name;
   - `-t`: job runtime;
   - `--flklen`: flanking region length;
   - `--lsf`: add this option if using an LSF cluster (by default the slurm scheduler is assumed);
   - `--tumor`: indicates the tumor sample in a case-ctrl pair;
   - `--purity`: tumor purity (by default 0.45);
   - `--blacklist`: blacklist file in bed format; listed regions will be filtered out;
   - `--slurm`: run using the slurm scheduler; generates a script header fit for this scheduler.

   The following cutoffs will be automatically set based on read depth (and also purity in the case of a tumor sample).
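The `-y` value is a bit flag, so the number for any combination of repeat types is simply the sum of the individual flags. Shell arithmetic makes this explicit:

```shell
# Flags per the README: 1 = L1, 2 = Alu, 4 = SVA, 8 = HERV
L1=1; ALU=2; SVA=4; HERV=8

echo $((L1 + SVA))               # 5  -> run L1 and SVA only (-y 5)
echo $((L1 + ALU + SVA + HERV))  # 15 -> run all four types (-y 15)
```

This is why the cohort-splitting advice above works: `-y 1` runs L1 first, and `-y 6` (Alu + SVA) covers the remaining short-read defaults in a second pass.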
