McClintock: A meta-pipeline to identify transposable element insertions using short-read whole genome sequencing data

<a name="started"></a> Getting Started

# INSTALL (Requires Conda and Mamba to be installed)
git clone git@github.com:bergmanlab/mcclintock.git
cd mcclintock
mamba env create -f install/envs/mcclintock.yml --name mcclintock
conda activate mcclintock
python3 mcclintock.py --install
python3 test/download_test_data.py

# RUN
python3 mcclintock.py \
    -r test/sacCer2.fasta \
    -c test/sac_cer_TE_seqs.fasta \
    -g test/reference_TE_locations.gff \
    -t test/sac_cer_te_families.tsv \
    -1 test/SRR800842_1.fastq.gz \
    -2 test/SRR800842_2.fastq.gz \
    -p 4 \
    -o /path/to/output/directory

Getting Started
Introduction
Installing Conda/Mamba
Installing McClintock
McClintock Usage
McClintock Input
McClintock Output
Run Examples
Citation
License

<a name="intro"></a> Introduction

Many methods have been developed to detect transposable element (TE) insertions from short-read whole genome sequencing (WGS) data, each with different dependencies, run interfaces, and output formats. McClintock provides a meta-pipeline to reproducibly install, execute, and evaluate multiple TE detectors, and to generate results in standardized output formats. A description of the original McClintock 1 pipeline and an evaluation of the original six TE detectors on the yeast genome can be found in Nelson, Linheiro and Bergman (2017) G3 7:2763-2778. A description of the re-implemented McClintock 2 pipeline, the reproducible simulation system, and an evaluation of 12 TE detectors on the yeast genome can be found in Chen, Basting, Han, Garfinkel and Bergman (2023) Mobile DNA 14:8. The TE detectors currently included in McClintock 2 are among the methods listed for the -m option under McClintock Usage.

<a name="conda"></a> Installing Conda/Mamba via Miniforge

McClintock is written in Python3, leveraging the Snakemake workflow system, and is designed to run on Linux operating systems. Installation of software dependencies for McClintock and its component methods is automated by Conda, so a working installation of Conda (and its reimplementation Mamba) is required to install McClintock. Conda/Mamba can be installed via the Miniforge installer.

wget -O Miniforge3.sh "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3.sh -b -p "${HOME}/conda" 
source "${HOME}/conda/etc/profile.d/conda.sh"
source "${HOME}/conda/etc/profile.d/mamba.sh"
conda init

conda init requires you to close the terminal and open a new one before it takes effect.
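
After opening a new terminal, it can help to confirm that both tools are discoverable before moving on. The following is a minimal sketch, not part of McClintock: the helper name find_tools is hypothetical, and it only checks that executables named conda and mamba resolve on PATH, which is assumed to be the case once conda init has taken effect.

```python
import shutil


def find_tools(names=("conda", "mamba")):
    """Map each tool name to its resolved path, or None if it is not on PATH.

    Hypothetical helper for illustration; it does not verify tool versions.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        # A missing tool usually means the shell has not been restarted yet.
        status = path if path else "not found on PATH (try opening a new terminal)"
        print(f"{name}: {status}")
```

If either tool is reported as missing, re-opening the terminal or re-sourcing the profile scripts shown above is the first thing to try.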

<a name="install"></a> Installing McClintock

After installing and updating Conda/Mamba, McClintock and its component methods can be installed by: 1. cloning the repository, 2. creating the Conda environment, and 3. running the install script.

Clone McClintock Repository

git clone git@github.com:bergmanlab/mcclintock.git
cd mcclintock

Create McClintock Conda Environment

mamba env create -f install/envs/mcclintock.yml --name mcclintock

This installs the base dependencies needed to run the main McClintock script (Snakemake, Python3, BioPython) into the mcclintock Conda environment.

Activate McClintock Conda Environment

conda activate mcclintock

This adds the dependencies installed in the McClintock Conda environment to the environment PATH so that they can be used by the McClintock scripts.
This environment must <ins>always</ins> be activated prior to running any of the McClintock scripts.
NOTE: Activating a Conda environment with conda activate myenv sometimes fails inside a script submitted to a queueing system. This can be fixed by sourcing Conda's shell setup in the script before activating the environment, as shown below.

CONDA_BASE=$(conda info --base)
source "${CONDA_BASE}/etc/profile.d/conda.sh"
conda activate mcclintock

For more on Conda, see the Conda User Guide.

Install McClintock Component Methods

To install all of the component methods and create their associated conda environments, use the following command:

python3 mcclintock.py --install

If you only want to install specific methods to save space and time, you can specify method(s) using the -m flag:

python3 mcclintock.py --install -m <method1>,<method2>

NOTE: If you re-run either the full installation or installation of specific methods, the installation script will do a clean installation and remove previously installed components.
If you want to install missing methods to an already existing McClintock installation, you can use the --resume flag:

python3 mcclintock.py --install --resume

NOTE: If you use the --resume flag when installing specific method(s) with -m, the installation script will only install the specified method(s) if they haven't previously been installed. Do not use the --resume flag if you want to do a clean installation of a specific method.
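
The interaction between --install, -m, and --resume described above can be summarized in a short sketch. This is an illustration only: install_command is a hypothetical helper that assembles the argument list and does not run anything. The method names in the usage lines are taken from the -m option list in the usage help and are assumed to be valid for installation as well.

```python
import shlex


def install_command(methods=None, resume=False):
    """Build the argument list for a McClintock installation.

    methods: optional list of method names, joined with commas for -m.
    resume:  if True, add --resume so already installed methods are kept;
             if False, the install script performs a clean installation.
    """
    cmd = ["python3", "mcclintock.py", "--install"]
    if methods:
        cmd += ["-m", ",".join(methods)]
    if resume:
        cmd.append("--resume")
    return cmd


if __name__ == "__main__":
    # Clean installation of everything.
    print(shlex.join(install_command()))
    # Install only two methods, skipping any that are already installed.
    print(shlex.join(install_command(["temp2", "teflon"], resume=True)))
```

The returned list can be passed to subprocess.run from inside the activated mcclintock environment, or printed and pasted into a shell.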

<a name="run"></a> McClintock Usage

Running the complete McClintock pipeline requires a fasta reference genome (option -r), a set of TE consensus/canonical sequences present in the organism (option -c), and fastq paired-end sequencing reads (options -1 and -2). If only single-end fastq sequencing data are available, they can be supplied using option -1 alone; however, only the TE detectors that handle single-end data will be run. Optionally, if a detailed annotation of TE sequences in the reference genome has been performed, a GFF file with annotated reference TEs (option -g) and a tab-delimited "taxonomy" file linking annotated insertions to their TE family (option -t) can be supplied. Example input files are included in the test directory.
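
The required and optional inputs described above map onto the command line roughly as in the sketch below. It is a hedged illustration rather than part of McClintock: build_run_command is a hypothetical helper, it only assembles arguments using the options named in this section (-r, -c, -1, -2, -g, -t, -p, -o), and it does not check that the files exist.

```python
import shlex


def build_run_command(reference, consensus, first, second=None,
                      locations=None, taxonomy=None, proc=1, out="."):
    """Assemble a McClintock run command as an argument list.

    reference, consensus, first: required inputs (-r, -c, -1).
    second:    second fastq file for paired-end data (-2); omit for single-end.
    locations: optional GFF of annotated reference TEs (-g).
    taxonomy:  optional tab-delimited TE family file (-t).
    """
    cmd = ["python3", "mcclintock.py",
           "-r", reference, "-c", consensus, "-1", first]
    if second is not None:
        cmd += ["-2", second]
    if locations is not None:
        cmd += ["-g", locations]
    if taxonomy is not None:
        cmd += ["-t", taxonomy]
    cmd += ["-p", str(proc), "-o", out]
    return cmd


if __name__ == "__main__":
    # Paired-end run using the example files from the test directory.
    cmd = build_run_command(
        "test/sacCer2.fasta",
        "test/sac_cer_TE_seqs.fasta",
        "test/SRR800842_1.fastq.gz",
        second="test/SRR800842_2.fastq.gz",
        locations="test/reference_TE_locations.gff",
        taxonomy="test/sac_cer_te_families.tsv",
        proc=4,
        out="/path/to/output/directory",
    )
    print(shlex.join(cmd))
```

Leaving out second produces a single-end command, in which case only the detectors that support single-end data will run.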

##########################
##       Required       ##
##########################
  -r, --reference REFERENCE
                        A reference genome sequence in fasta format
  -c, --consensus CONSENSUS
                        The consensus sequences of the TEs for the species in
                        fasta format
  -1, --first FIRST
                        The path of the first fastq file from paired end read
                        sequencing or the fastq file from single read
                        sequencing

##########################
##       Optional       ##
##########################
  -h, --help            show this help message and exit
  -2, --second SECOND
                        The path of the second fastq file from a paired end
                        read sequencing
  -p, --proc PROC       The number of processors to use for parallel stages of
                        the pipeline [default = 1]
  -o, --out OUT         An output folder for the run. [default = '.']
  -m, --methods METHODS
                        A comma-delimited list containing the software you
                        want the pipeline to use for analysis. e.g. '-m
                        relocate,TEMP,ngs_te_mapper' will launch only those
                        three methods. If this option is not set, all methods
                        will be run [options: ngs_te_mapper, ngs_te_mapper2, 
                        relocate, relocate2, temp, temp2, retroseq, 
                        popoolationte, popoolationte2, te-locate, teflon, 
                        coverage, trimgalore, map_reads, tebreak]

  -g, --locations LOCATIONS
                        The locations of known TEs in the re
