TEtrimmer
TEtrimmer: a novel tool to automate manual curation of transposable elements
Introduction
Many tools have been developed for the discovery and annotation of transposable elements (TEs). However, constructing a high-quality TE consensus library still requires manual curation, which is time-consuming and demands an in-depth understanding of TE biology.
TEtrimmer is software designed to automate the manual curation of TEs. The input can be a TE library from de novo TE discovery tools, such as EDTA and RepeatModeler2, or a TE library from a closely related species. For each input consensus sequence, TEtrimmer automatically performs a BLASTN search, sequence extraction, extension, multiple sequence alignment (MSA), MSA clustering, MSA cleaning, TE boundary definition, and TE classification. TEtrimmer also provides a graphical user interface (GUI) for inspecting and improving the predicted TEs, which helps bring the consensus library to the quality of manual curation.
Installation
TEtrimmer can be installed by 1. Conda, 2. Singularity, or 3. Docker.
1. Conda (Many thanks to HangXue)
You have to install miniconda on your computer in advance.
We highly recommend installation with mamba, as it is much faster.
# Create new conda environment
conda create --name TEtrimmer python=3.10 samtools=1.22.1
# Install mamba
conda install -c conda-forge mamba
# Activate new environment
conda activate TEtrimmer
# Install TEtrimmer
mamba install bioconda::tetrimmer
# Display the options of TEtrimmer
TEtrimmer --help
# If you encounter "ClobberError" or "ClobberWarning", you can ignore it and wait for the installation to finish.
# The error or warning may look like this:
ClobberError: This transaction has incompatible packages due to a shared path.
packages: bioconda/osx-64::blast-2.5.0-boost1.64_2, bioconda/osx-64::rmblast-2.14.1-hd94f91d_0
path: 'bin/blastx'
# The bioconda::tetrimmer package includes the TEtrimmer source code, but the version inside may be outdated.
# If you want to run the latest version of TEtrimmer via the bioconda::tetrimmer environment
# Clone the new version of TEtrimmer from Github
git clone https://github.com/qjiangzhao/TEtrimmer.git
# Run the cloned TEtrimmer inside the bioconda::tetrimmer environment
conda activate TEtrimmer
python <your_path_to_cloned_TEtrimmer_folder_which_contain_TEtrimmer.py>/TEtrimmer.py --help
Alternatively, see the required dependencies in TEtrimmer_dependencies,
or install with Conda via the .yml file:
# Clone the github repository for TEtrimmer.
git clone https://github.com/qjiangzhao/TEtrimmer.git
# Install mamba
conda install -c conda-forge mamba
# Install TEtrimmer by the "yml" file
mamba env create -f <path to/TEtrimmer_env.yml>
Here is the provided TEtrimmer_env.yml
2. Singularity
# Download and generate TEtrimmer "sif" file
singularity pull docker://quay.io/biocontainers/tetrimmer:1.5.4--hdfd78af_0
# Other versions of the docker image can be found from https://quay.io/repository/biocontainers/tetrimmer?tab=tags
# Run TEtrimmer based on sif file
# If <your_path_to_store_PFAM_database> doesn't contain PFAM database
# TEtrimmer can automatically download PFAM to <your_path_to_store_PFAM_database>
singularity exec --writable-tmpfs \
--bind <your_path_contain_genome_file>:/genome \
--bind <your_path_contain_input_TE_library_file>:/input \
--bind <your_output_path>:/output \
--bind <your_path_to_store_PFAM_database>:/pfam \
<your_path_contain_sif_file>/tetrimmer_1.5.4--hdfd78af_0.sif \
TEtrimmer \
-i /input/<TE_library_name.fasta> \
-g /genome/<genome_file_name.fasta> \
-o /output \
--pfam_dir /pfam \
-t 20 --classify_all
# The Singularity image includes the TEtrimmer source code, but the version inside may be outdated.
# If you want to run the latest version of TEtrimmer via the singularity image
# Clone the new version of TEtrimmer from Github
git clone https://github.com/qjiangzhao/TEtrimmer.git
singularity exec --writable-tmpfs \
--bind <your_path_to_cloned_TEtrimmer_folder_which_contain_TEtrimmer.py>:/TEtrimmer_cloned \
--bind <your_path_contain_genome_file>:/genome \
--bind <your_path_contain_input_TE_library_file>:/input \
--bind <your_output_path>:/output \
--bind <your_path_to_store_PFAM_database>:/pfam \
<your_path_contain_sif_file>/tetrimmer_1.5.4--hdfd78af_0.sif \
python /TEtrimmer_cloned/TEtrimmer.py \
-i /input/<TE_library_name.fasta> \
-g /genome/<genome_file_name.fasta> \
-o /output \
--pfam_dir /pfam \
-t 20 --classify_all
# You might get the following error when running TEtrimmer v1.5.4 with the Docker image:
TE Aid error for <your_sequence> with error /TEtrimmer/TE-Aid-master/TE-Aid: line 288: join: command not found
rm: can't remove 'header': No such file or directory
# This won't affect your final results. I will fix this in the next version.
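The `.sif` path passed to `singularity exec` has to match the file that `singularity pull` actually wrote. By default, `singularity pull` names the image `<name>_<tag>.sif`. The following is a minimal sketch of that naming rule, so the expected file name can be computed from the URI; `sif_name_from_uri` is a hypothetical helper, not part of TEtrimmer or Singularity.

```shell
# Sketch: derive the default .sif file name that "singularity pull" writes
# for a docker:// URI. sif_name_from_uri is a hypothetical helper name.
sif_name_from_uri() {
    local ref name tag
    # Drop everything up to the last "/", leaving "name:tag"
    ref="${1##*/}"
    case "$ref" in
        *:*) name="${ref%%:*}"; tag="${ref#*:}" ;;
        *)   name="$ref"; tag="latest" ;;   # no tag given: assume "latest"
    esac
    printf '%s_%s.sif\n' "$name" "$tag"
}

sif_name_from_uri docker://quay.io/biocontainers/tetrimmer:1.5.4--hdfd78af_0
# prints tetrimmer_1.5.4--hdfd78af_0.sif
```

If the printed name differs from the file name used in the `singularity exec` command, adjust the command to point at the file that exists on disk.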
3. Docker
# Download TEtrimmer docker image
docker pull quay.io/biocontainers/tetrimmer:1.5.4--hdfd78af_0
docker run -it --name TEtrimmer -v <bind_your_path>:/data quay.io/biocontainers/tetrimmer:1.5.4--hdfd78af_0
# You can then run TEtrimmer inside the TEtrimmer container
# Please note: running TEtrimmer via Docker is slower than via Conda or Singularity.
4. Or simply copy and paste the following commands into your terminal to install
mamba create -n TEtrimmer 'python=3.10' aliview 'bedtools>=2.31.1' bioconductor-biostrings blast cd-hit emboss ghostscript hmmer iqtree mafft nseg perl pfam_scan r-base r-rcpp recon repeatmasker repeatmodeler samtools trf pip perl-moose -c conda-forge -c bioconda
pip install biopython click dataclasses dill joblib matplotlib multiprocess numpy pandas plotly pypdf2 regex requests seaborn scikit-learn tk urllib3
cpan IPC::Run
cpan install Moose
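After any of the installation routes above, it may be worth confirming that the external tools are reachable from the active environment before starting a long run. The following is a minimal sketch; `check_tools` is a hypothetical helper, and the executable names in the example call are assumptions based on the packages listed above, not a list taken from TEtrimmer itself.

```shell
# Sketch: report which of the named executables are missing from PATH.
# check_tools is a hypothetical helper, not part of TEtrimmer.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    # Return 0 if every tool was found, 1 otherwise
    return "$missing"
}

# Example call (executable names are assumptions):
# check_tools blastn mafft cd-hit-est bedtools samtools hmmscan RepeatMasker
```

The function prints one line per missing tool to standard error and returns a non-zero status if anything is missing, so it can also be used as a guard at the top of a job script.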
Runtime test
We evaluated the runtime performance of TEtrimmer on genomes of four organisms, i.e., D. melanogaster, D. rerio, O. sativa, and B. hordei. For each genome, the analysis was executed three times using a compute node allocated via SLURM with 48 CPU cores (Intel Xeon 8468 Sapphire) and 140 GB of RAM. Runtime and output size were recorded for each repetition, and the mean and standard deviation were calculated across the three runs.
TEtrimmer runs considerably slower under the Windows Subsystem for Linux (WSL).
| Species | Genome size (Mbp) | EDTA input: TE number | EDTA input: runtime (h) | EDTA input: output folder size (GB) | RepeatModeler2 input: TE number | RepeatModeler2 input: runtime (h) | RepeatModeler2 input: output folder size (GB) |
|---|---|---|---|---|---|---|---|
| B. hordei | 124 | 996 | 0.92 ± 0.049 | 2.30 | 818 | 0.83 ± 0.040 | 2.10 |
| D. melanogaster | 144 | 819 | 0.66 ± 0.067 | 0.92 | 480 | 0.66 ± 0.046 | 0.96 |
| D. rerio | 1,679 | 8,631 | 4.95 ± 0.225 | 15.10 | 3,504 | 2.32 ± 0.066 | 5.50 |
| O. sativa | 373 | 10,404 | 3.30 ± 0.200 | 7.30 | 2,334 | 1.31 ± 0.090 | 2.50 |
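To summarize repeated runs on your own hardware in the same "mean ± standard deviation" form, a short awk calculation is enough. The sketch below assumes the sample standard deviation (n - 1 denominator); the text above does not state which estimator was used. `mean_sd` is a hypothetical helper, and the runtimes in the example call are made-up values, not measurements.

```shell
# Sketch: mean and sample standard deviation of the runtimes given as arguments.
# mean_sd is a hypothetical helper; the n - 1 denominator is an assumption.
mean_sd() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk '
        { sum += $1; sumsq += $1 * $1; n++ }
        END {
            mean = sum / n
            # Sample variance; a single value has no spread, so report 0
            var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
            if (var < 0) var = 0   # guard against tiny negative rounding errors
            printf "mean=%.2f sd=%.3f\n", mean, sqrt(var)
        }'
}

# Made-up runtimes in hours, for illustration only
mean_sd 1.0 2.0 3.0
# prints mean=2.00 sd=1.000
```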
Test
# To see all options
TEtrimmer --help
or
# To see all options
python <path to TEtrimmer>/TEtrimmer.py --help
- Download the test files test_input.fa and test_genome.fa.
TEtrimmer --input_file <path to test_input.fa> \
--genome_file <path to test_genome.fa> \
--output_dir <output directory> \
--num_threads 20 \
--classify_all
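Before launching a run, it can help to confirm that each FASTA file is readable and to see how many sequences it holds, since the number of input consensus sequences largely determines the runtime. The following is a minimal sketch that counts header lines; `count_fasta_records` is a hypothetical helper, not a TEtrimmer command.

```shell
# Sketch: count the sequences in a FASTA file by counting ">" header lines.
# count_fasta_records is a hypothetical helper, not part of TEtrimmer.
count_fasta_records() {
    if [ ! -r "$1" ]; then
        echo "cannot read: $1" >&2
        return 1
    fi
    # grep -c exits with status 1 when the count is 0 but still prints "0",
    # so the "|| true" keeps the function's exit status at 0 in that case
    grep -c '^>' "$1" || true
}

# Example calls (paths are placeholders):
# count_fasta_records <path to test_input.fa>
# count_fasta_records <path to test_genome.fa>
```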
Inputs
- Genome file: The genome sequence in FASTA format (.fa or .fasta).
- TE consensus library: TEtrimmer takes as input the TE consensus library produced by de novo TE annotation tools such as
RepeatModeler or EDTA. For this reason, you have to run RepeatModeler or other TE annotation software first.
# The TEtrimmer package already includes RepeatModeler. Below is an example command for running RepeatModeler.
# Build genome database index files
BuildDatabase -name <genome_file_database_name> <genome_file.fa>
# Run RepeatModeler
RepeatModeler -database <genome_file_database_name> \
-threa

