Mcclintock
Meta-pipeline to identify transposable element insertions using next generation sequencing data
McClintock: <sub><sup>A meta-pipeline to identify transposable element insertions using short-read whole genome sequencing data</sup></sub>
<a name="started"></a> Getting Started
# INSTALL (Requires Conda and Mamba to be installed)
git clone git@github.com:bergmanlab/mcclintock.git
cd mcclintock
mamba env create -f install/envs/mcclintock.yml --name mcclintock
conda activate mcclintock
python3 mcclintock.py --install
python3 test/download_test_data.py
# RUN
python3 mcclintock.py \
-r test/sacCer2.fasta \
-c test/sac_cer_TE_seqs.fasta \
-g test/reference_TE_locations.gff \
-t test/sac_cer_te_families.tsv \
-1 test/SRR800842_1.fastq.gz \
-2 test/SRR800842_2.fastq.gz \
-p 4 \
-o /path/to/output/directory
Table of Contents
- Getting Started
- Introduction
- Installing Conda/Mamba
- Installing McClintock
- McClintock Usage
- McClintock Input
- McClintock Output
- Run Examples
- Citation
- License
<a name="intro"></a> Introduction
Many methods have been developed to detect transposable element (TE) insertions from short-read whole genome sequencing (WGS) data, each of which has different dependencies, run interfaces, and output formats. McClintock provides a meta-pipeline to reproducibly install, execute, and evaluate multiple TE detectors, and to generate output in standardized formats. A description of the original McClintock 1 pipeline and an evaluation of the original six TE detectors on the yeast genome can be found in Nelson, Linheiro and Bergman (2017) G3 7:2763-2778. A description of the re-implemented McClintock 2 pipeline, the reproducible simulation system, and an evaluation of 12 TE detectors on the yeast genome can be found in Chen, Basting, Han, Garfinkel and Bergman (2023) Mobile DNA 14:8. The TE detectors currently included in McClintock 2 are:
- ngs_te_mapper - Linheiro and Bergman (2012)
- ngs_te_mapper2 - Han et al. (2021)
- PoPoolationTE - Kofler et al. (2012)
- PoPoolationTE2 - Kofler et al. (2016)
- RelocaTE - Robb et al. (2013)
- RelocaTE2 - Chen et al. (2017)
- RetroSeq - Keane et al. (2012)
- TEBreak - Schauer et al. (2018)
- TEFLoN - Adrion et al. (2017)
- TE-locate - Platzer et al. (2012)
- TEMP - Zhuang et al. (2014)
- TEMP2 - Yu et al. (2021)
<a name="conda"></a> Installing Conda/Mamba via Miniforge
McClintock is written in Python3, uses the Snakemake workflow system, and is designed to run on Linux operating systems. Installation of software dependencies for McClintock and its component methods is automated by Conda, so a working installation of Conda (and its reimplementation Mamba) is required to install McClintock. Conda/Mamba can be installed via the Miniforge installer.
wget -O Miniforge3.sh "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3.sh -b -p "${HOME}/conda"
source "${HOME}/conda/etc/profile.d/conda.sh"
source "${HOME}/conda/etc/profile.d/mamba.sh"
conda init
conda init requires you to close and open a new terminal before it takes effect.
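Once the new terminal is open, it can be worth confirming that both executables are visible on PATH before continuing. The following is a minimal sketch and not part of McClintock: the function name missing_tools is hypothetical, and it relies only on shutil.which from the Python standard library.

```python
import shutil


def missing_tools(tools=("conda", "mamba")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("conda and mamba are both available")
```

An empty result suggests the Miniforge installation is usable from the current shell; anything listed as missing usually means the terminal has not been restarted since conda init was run.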
<a name="install"></a> Installing McClintock
After installing and updating Conda/Mamba, McClintock and its component methods can be installed by: 1. cloning the repository, 2. creating the Conda environment, and 3. running the install script.
Clone McClintock Repository
git clone git@github.com:bergmanlab/mcclintock.git
cd mcclintock
Create McClintock Conda Environment
mamba env create -f install/envs/mcclintock.yml --name mcclintock
- This installs the base dependencies needed to run the main McClintock script (Snakemake, Python3, BioPython) into the mcclintock Conda environment.
Activate McClintock Conda Environment
conda activate mcclintock
- This adds the dependencies installed in the McClintock Conda environment to the environment PATH so that they can be used by the McClintock scripts.
- This environment must <ins>always</ins> be activated prior to running any of the McClintock scripts.
- NOTE: Activating a Conda environment with conda activate myenv sometimes fails when it is run from a script submitted to a queueing system. This can be fixed by sourcing the Conda setup script and then activating the environment inside the submitted script, as shown below:
CONDA_BASE=$(conda info --base)
source ${CONDA_BASE}/etc/profile.d/conda.sh
conda activate mcclintock
- For more on Conda: see the Conda User Guide
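Because a job script can silently run without the environment active, a wrapper script may want to check this before launching McClintock. The sketch below is illustrative only: the helper names active_conda_env and require_env are hypothetical, and it assumes the environment was created by name (as in the command above), so that Conda sets the CONDA_DEFAULT_ENV variable to mcclintock on activation.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active Conda environment, or None if unset."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


def require_env(expected="mcclintock", environ=None):
    """Raise RuntimeError unless `expected` is the active Conda environment."""
    active = active_conda_env(environ)
    if active != expected:
        raise RuntimeError(
            f"Conda environment '{expected}' is not active (found: {active!r})"
        )
```

Calling require_env() at the top of a wrapper script makes a missing activation fail immediately with a clear message, rather than later with a missing-dependency error.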
Install McClintock Component Methods
- To install all of the component methods and create their associated conda environments, use the following command:
python3 mcclintock.py --install
- If you only want to install specific methods to save space and time, you can specify method(s) using the -m flag:
python3 mcclintock.py --install -m <method1>,<method2>
- NOTE: If you re-run either the full installation or installation of specific methods, the installation script will do a clean installation and remove previously installed components.
- If you want to install missing methods to an already existing McClintock installation, you can use the --resume flag:
python3 mcclintock.py --install --resume
- NOTE: If you use the --resume flag when installing specific method(s) with -m, the installation script will only install the specified method(s) if they haven't previously been installed. Do not use the --resume flag if you want to do a clean installation of a specific method.
<a name="run"></a> McClintock Usage
Running the complete McClintock pipeline requires a fasta reference genome (option -r), a set of TE consensus/canonical sequences present in the organism (option -c), and fastq paired-end sequencing reads (options -1 and -2). If only single-end fastq sequencing data are available, they can be supplied using option -1 alone; however, only the TE detectors that handle single-end data will be run. Optionally, if a detailed annotation of TE sequences in the reference genome has been performed, a GFF file with annotated reference TEs (option -g) and a tab-delimited "taxonomy" file linking annotated insertions to their TE family (option -t) can be supplied. Example input files are included in the test directory.
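The required and optional inputs described above map onto the command line as in the following sketch. It is illustrative rather than part of McClintock: the function name run_command and its parameter names are hypothetical, only the options named in this section are used, and the file paths in the example are the test files from Getting Started.

```python
import shlex


def run_command(reference, consensus, first, second=None,
                locations=None, taxonomy=None, proc=1, out="."):
    """Build a McClintock run command as an argument list.

    reference, consensus, first: required inputs (-r, -c, -1).
    second:    second fastq for paired-end data (-2); omit for single-end.
    locations: optional GFF of annotated reference TEs (-g).
    taxonomy:  optional tab-delimited TE family file (-t).
    """
    cmd = ["python3", "mcclintock.py",
           "-r", reference, "-c", consensus, "-1", first]
    if second is not None:
        cmd += ["-2", second]
    if locations is not None:
        cmd += ["-g", locations]
    if taxonomy is not None:
        cmd += ["-t", taxonomy]
    cmd += ["-p", str(proc), "-o", out]
    return cmd


if __name__ == "__main__":
    # Single-end run: only -1 is supplied, so -2 is left out of the command.
    print(shlex.join(run_command(
        "test/sacCer2.fasta",
        "test/sac_cer_TE_seqs.fasta",
        "test/SRR800842_1.fastq.gz",
        proc=4,
        out="/path/to/output/directory",
    )))
```

Building the command as a list keeps each path a single argument even if it contains spaces, and makes the paired-end versus single-end distinction a matter of passing or omitting second.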
##########################
## Required ##
##########################
-r, --reference REFERENCE
A reference genome sequence in fasta format
-c, --consensus CONSENSUS
The consensus sequences of the TEs for the species in
fasta format
-1, --first FIRST
The path of the first fastq file from paired end read
sequencing or the fastq file from single read
sequencing
##########################
## Optional ##
##########################
-h, --help show this help message and exit
-2, --second SECOND
The path of the second fastq file from a paired end
read sequencing
-p, --proc PROC The number of processors to use for parallel stages of
the pipeline [default = 1]
-o, --out OUT An output folder for the run. [default = '.']
-m, --methods METHODS
A comma-delimited list containing the software you
want the pipeline to use for analysis. e.g. '-m
relocate,TEMP,ngs_te_mapper' will launch only those
three methods. If this option is not set, all methods
will be run [options: ngs_te_mapper, ngs_te_mapper2,
relocate, relocate2, temp, temp2, retroseq,
popoolationte, popoolationte2, te-locate, teflon,
coverage, trimgalore, map_reads, tebreak]
-g, --locations LOCATIONS
The locations of known TEs in the re
