MCHelper

MCHelper: An automatic tool to curate transposable element libraries

Introduction
Installation
- Linux/Windows
- MacOS
Testing
Usage
Docker
Recovering
Inputs
Outputs
Citation

Introduction

The number of species with high-quality genome sequences continues to increase, in part due to the scaling up of multiple large-scale biodiversity sequencing projects. While the need to annotate genic sequences in these genomes is widely acknowledged, the parallel need to annotate transposable element sequences, which have been shown to alter genome architecture, rewire gene regulatory networks, and contribute to the evolution of host traits, is becoming ever more evident. However, accurate genome-wide annotation of transposable element sequences is still technically challenging. Several de novo transposable element identification tools are now available, but manual curation of the libraries produced by these tools is needed to generate high-quality genome annotations. Manual curation is time-consuming, and thus impractical for large-scale genomic studies, and it lacks reproducibility. In this work, we present the Manual Curator Helper tool, MCHelper, which automates the TE library curation process. By leveraging MCHelper's fully automated mode with the outputs from two de novo transposable element identification tools, RepeatModeler2 and REPET, in fruit fly, rice, and zebrafish, we show a substantial improvement in the quality of the transposable element libraries and genome annotations. MCHelper libraries are less redundant, with up to 54% reduction in the number of consensus sequences, have up to 11.4% fewer false positive sequences, and also have up to ~45% fewer “unclassified/unknown” transposable element consensus sequences. Genome-wide transposable element annotations were also improved, including larger unfragmented insertions.

Installation

Linux/Windows

On Windows systems, you need a working installation of Windows Subsystem for Linux (WSL) version 2, with the Poppler package (sudo apt-get install poppler-utils) and the Qt package (sudo apt-get install qtbase5-dev) installed.

It is recommended to install the dependencies in an Anaconda environment.

git clone https://github.com/gonzalezlab/MCHelper.git

Then, locate the file named "MCHelper.yml" inside the MCHelper folder and create the environment from it:

conda env create -f MCHelper/MCHelper.yml

Now, unzip all the databases needed by MCHelper:

cd MCHelper/db
unzip '*.zip'
conda activate MCHelper
makeblastdb -in allDatabases.clustered_rename.fa -dbtype nucl

Then, download the Pfam database released by the REPET group and rename it:

wget https://urgi.versailles.inrae.fr/download/repet/profiles/ProfilesBankForREPET_Pfam35.0_GypsyDB.hmm.tar.gz

tar xvf ProfilesBankForREPET_Pfam35.0_GypsyDB.hmm.tar.gz
mv ProfilesBankForREPET_Pfam35.0_GypsyDB.hmm Pfam35.0.hmm

And that's it. You have now installed MCHelper.
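As a quick sanity check after the steps above, the following minimal sketch lists any database files that appear to be missing. It assumes the commands were run inside MCHelper/db, so that Pfam35.0.hmm ends up there, and that makeblastdb wrote the usual nucleotide index files (.nhr, .nin, .nsq) next to the FASTA file. The function name missing_db_files is only illustrative and is not part of MCHelper.

```python
from pathlib import Path

# Files the installation steps are expected to leave in MCHelper/db.
# Assumption: makeblastdb creates .nhr/.nin/.nsq index files for a
# nucleotide database; newer BLAST+ releases may add further files.
EXPECTED_DB_FILES = [
    "allDatabases.clustered_rename.fa",
    "allDatabases.clustered_rename.fa.nhr",
    "allDatabases.clustered_rename.fa.nin",
    "allDatabases.clustered_rename.fa.nsq",
    "Pfam35.0.hmm",
]


def missing_db_files(db_dir):
    """Return the expected files that are absent or empty in db_dir."""
    db_path = Path(db_dir)
    missing = []
    for name in EXPECTED_DB_FILES:
        candidate = db_path / name
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Run from the folder that contains the MCHelper clone.
    missing = missing_db_files("MCHelper/db")
    if missing:
        print("Missing or empty files:", ", ".join(missing))
    else:
        print("All expected database files are present.")
```

An empty result only means the files exist and are not empty; it does not validate their contents.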

MacOS

These installation instructions have been tested on MacOS with M1/M2 architectures (Apple Silicon, arch: arm64). Therefore, these instructions are not compatible with MacOS on Intel Core processors.

Set up Rosetta.

Download and install iTerm (or, if it is already installed, duplicate it and rename the copy to, for example, iTerm_X86_64).
Right-click the iTerm icon (or iTerm_X86_64 if you renamed it), select the option Get Info, and check the box "Open using Rosetta".
Open the new iTerm terminal (or iTerm_X86_64).
Verify the architecture with uname -m. It should print x86_64.

Install Mambaforge. Using the same iTerm (or iTerm_X86_64) terminal configured earlier, download the Mambaforge script and install it:

wget https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-MacOSX-x86_64.sh 
chmod +x Mambaforge-23.11.0-0-MacOSX-x86_64.sh

./Mambaforge-23.11.0-0-MacOSX-x86_64.sh

Follow the prompts, install, and initialize conda.

Install the MCHelper conda environment using the Mac-specific YML file (MCHelper_Mac.yml), again from the Rosetta iTerm (or iTerm_X86_64):

git clone https://github.com/gonzalezlab/MCHelper.git

conda env create -f MCHelper/MCHelper_Mac.yml

conda activate MCHelper_Mac

Download and rename the TRF binary for MacOS:

cd MCHelper/tools
wget https://github.com/Benson-Genomics-Lab/TRF/releases/download/v4.09.1/trf409.macosx

rm -f trf409.linux64
mv trf409.macosx trf409.linux64
chmod +x trf409.linux64
cd -

Now, unzip all the databases needed by MCHelper:

cd MCHelper/db
unzip '*.zip'
makeblastdb -in allDatabases.clustered_rename.fa -dbtype nucl

Then, download the Pfam database released by the REPET group and rename it:

wget https://urgi.versailles.inrae.fr/download/repet/profiles/ProfilesBankForREPET_Pfam35.0_GypsyDB.hmm.tar.gz

tar xvf ProfilesBankForREPET_Pfam35.0_GypsyDB.hmm.tar.gz
mv ProfilesBankForREPET_Pfam35.0_GypsyDB.hmm Pfam35.0.hmm

And that's it. You have now installed MCHelper.

Testing

To test MCHelper, we provide example inputs and the expected results (located in Test_dir/) so that you can compare them with your own outputs. To check that MCHelper is running properly, follow these steps.

First, activate the Anaconda environment, if it is not already active:

conda activate MCHelper

Then, make sure you are in the main folder (the one where MCHelper.py is located) and unzip the D. melanogaster genome:

unzip Test_dir/repet_input/Dmel_genome.zip -d Test_dir/repet_input/

The next step is to download and format the host genes from BUSCO:

wget https://busco-data.ezlab.org/v4/data/lineages/diptera_odb10.2020-08-05.tar.gz
mv diptera_odb10.2020-08-05.tar.gz Test_dir/repet_input/ 
cd Test_dir/repet_input/
tar xvf diptera_odb10.2020-08-05.tar.gz
cat diptera_odb10/hmms/*.hmm > diptera_odb10.hmm
cd -
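The concatenation step above can also be done from Python, which makes it easy to confirm how many profiles ended up in the merged file. This is a minimal sketch, not part of MCHelper: the function names are illustrative, and the profile count assumes the HMMER3 text format, where each profile record ends with a line containing only "//".

```python
from pathlib import Path


def concatenate_hmms(hmm_dir, output_file):
    """Concatenate every *.hmm file in hmm_dir into output_file.

    Returns the number of profile files that were merged.
    """
    hmm_files = sorted(Path(hmm_dir).glob("*.hmm"))
    with open(output_file, "wb") as merged:
        for hmm_file in hmm_files:
            merged.write(hmm_file.read_bytes())
    return len(hmm_files)


def count_profiles(hmm_file):
    """Count profile records, assuming each one ends with a '//' line."""
    with open(hmm_file) as handle:
        return sum(1 for line in handle if line.strip() == "//")
```

For the test data, this would be called from Test_dir/repet_input/ as concatenate_hmms("diptera_odb10/hmms", "diptera_odb10.hmm"). If every input file holds one profile, count_profiles("diptera_odb10.hmm") should return the same number.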

Now, run the MCHelper script:

mkdir Test_dir/repet_output_own

python3 MCHelper.py -r A -t 8 -i Test_dir/repet_input/ -o Test_dir/repet_output_own -g Test_dir/repet_input/Dmel_genome.fasta --input_type repet -b Test_dir/repet_input/diptera_odb10.hmm -a F -n Dmel

This test takes REPET's output and performs the curation automatically, using the default values for most parameters. To run the test on a toy library with fewer sequences, execute:

unzip Test_dir/fasta_input/Dmel_genome.zip -d Test_dir/fasta_input/

mkdir Test_dir/fasta_output_own

python3 MCHelper.py -r A -t 8 -l Test_dir/fasta_input/TE_lib_toy.fa -o Test_dir/fasta_output_own -g Test_dir/fasta_input/Dmel_genome.fna --input_type fasta -b Test_dir/repet_input/diptera_odb10.hmm -a F
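One simple way to compare a run against the input library is to count consensus sequences before and after curation. The sketch below is a hedged illustration: fasta_ids and library_reduction are hypothetical helper names, and the curated library path must be supplied by you, since the output file names are not listed in this section.

```python
def fasta_ids(path):
    """Return the sequence identifiers (first word of each header line)."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                ids.append(header.split()[0] if header else "")
    return ids


def library_reduction(raw_library, curated_library):
    """Return (raw_count, curated_count, percent_reduction)."""
    raw_count = len(fasta_ids(raw_library))
    curated_count = len(fasta_ids(curated_library))
    if raw_count == 0:
        raise ValueError("raw library contains no sequences")
    reduction = 100.0 * (raw_count - curated_count) / raw_count
    return raw_count, curated_count, reduction
```

For example, library_reduction("Test_dir/fasta_input/TE_lib_toy.fa", curated_path) reports how many consensus sequences were removed or merged, where curated_path is a placeholder for the curated FASTA file in your output directory.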

Usage

Be sure you have activated the anaconda environment:

conda activate MCHelper

Then, execute MCHelper with default parameters. For REPET input (see Testing for a practical example):

python3 MCHelper.py -i path/to/repet_output -o path/to/MCHelper_output -g path/to/genome -n repet_name_project --input_type repet -b path/to/reference_genes.hmm -a F

For fasta input:

python3 MCHelper.py -l path/to/TE_library_in_fasta -o path/to/MCHelper_output -g path/to/genome --input_type fasta -b path/to/reference_genes.hmm -a F

To see the full help documentation run:

python3 MCHelper.py --help

The full list of parameters includes:

-h, --help: Show this help message and exit.
-r MODULE, --module MODULE: Module of curation [A, C, U, T, E, M]. Required*
-i INPUT_DIR, --input INPUT_DIR: Directory with the files required to do the curation (REPET output directory). Required*
-g GENOME, --genome GENOME: Genome used to detect the TEs. Required*
-o OUTPUTDIR, --output OUTPUTDIR: Path to the output directory. Required*
--te_aid TE_AID: Do you want to use TE-aid? [Y or N]. Default=Y.
-a AUTOMATIC: Level of automation: F: fully automated, S: semi-automated, M: fully manual. Default=F.
-n PROJ_NAME: REPET project name. Required for repet input*
-t CORES: Number of cores used to execute some steps in parallel. Default=all available cores.
-m REF_LIBRARY_UNCLASSIFIED_MODULE: Path to the sequences to be used as references in the unclassified module.
-v VERBOSE: Verbose? [Y or N]. Default=N.
--input_type INPUT_TYPE: Input type: fasta or repet.
-l USER_LIBRARY: User-defined library to be used with input type fasta.
-b BUSCO_LIBRARY: Reference/BUSCO genes to filter out TEs (HMM format required).
-z MINBLASTHITS: Minimum number of blast hits to process an element.
-c MINFULLLFRAGMENTS: Minimum number of full-length fragments to process an element. Default=1.
-s PERC_SSR: Maximum length covered by single repetitions (as a percentage between 0 and 100) allowed for a TE not to be removed. Default=60.
-e EXT_NUCL: Number of nucleotides to extend each side of the element. Default=500.
-x NUM_ITE: Number of iterations to extend the elements. Default=16.
-k clust_algorithm: Clustering algorithm: cd-hit or meshclust. Default=cd-hit.
--version: Show program's version number and exit.
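When MCHelper is launched from a script, it can help to check the flag combination before starting a long run. The following is a minimal sketch using only the flags documented above. The function build_mchelper_command and its argument names are hypothetical, and the default module "A" simply mirrors the commands in the Testing section.

```python
import shlex


def build_mchelper_command(input_type, output_dir, genome, module="A",
                           automation="F", input_dir=None, project_name=None,
                           library=None, busco_hmm=None, cores=None):
    """Assemble an MCHelper.py argument list from the documented flags."""
    if input_type not in ("repet", "fasta"):
        raise ValueError("input_type must be 'repet' or 'fasta'")
    if automation not in ("F", "S", "M"):
        raise ValueError("automation must be F, S or M")
    if module not in ("A", "C", "U", "T", "E", "M"):
        raise ValueError("module must be one of A, C, U, T, E, M")

    cmd = ["python3", "MCHelper.py", "-r", module, "-o", output_dir,
           "-g", genome, "--input_type", input_type, "-a", automation]

    if input_type == "repet":
        # REPET input needs the REPET output directory and project name.
        if not input_dir or not project_name:
            raise ValueError("repet input needs input_dir (-i) and project_name (-n)")
        cmd += ["-i", input_dir, "-n", project_name]
    else:
        # FASTA input needs the user-defined library.
        if not library:
            raise ValueError("fasta input needs library (-l)")
        cmd += ["-l", library]

    if busco_hmm:
        cmd += ["-b", busco_hmm]
    if cores is not None:
        cmd += ["-t", str(cores)]
    return cmd


if __name__ == "__main__":
    example = build_mchelper_command(
        "fasta", "path/to/MCHelper_output", "path/to/genome",
        library="path/to/TE_library_in_fasta",
        busco_hmm="path/to/reference_genes.hmm",
    )
    print(shlex.join(example))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with paths that contain spaces.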

MCHelper can be run in three different modes: fully automatic (F), semi-automatic (S), and manual (M). The mode is selected with the parameter -a [F, S or M]. Note that the fully automatic mode makes all the decisions for you and, at the end, generates separate outputs for curated and non-curated sequences. In contrast, the semi-automatic mode runs the structural check and lets the user inspect the consensus sequences that do not meet the structural requirements. The manual mode does not run the structural check and se
