Jitterbug
Jitterbug is a bioinformatics tool that predicts insertion sites of transposable elements in a sample sequenced with short paired-end reads, relative to an assembled reference.
Summary
This set of programs allows one to identify transposable element insertions (TEI) in a sequenced sample with respect to an assembled reference.
The main script jitterbug.py performs this analysis, using a .bam file of mapped reads and the annotation of TEs in the reference.
Additional modules are provided to filter and plot these results, compare tumor/normal pairs, and evaluate predictions on simulated data.
Citation
If you use this software in your work, please cite:
Jitterbug: somatic and germline transposon insertion detection at single-nucleotide resolution Elizabeth Hénaff, Luís Zapata, Josep M. Casacuberta, and Stephan Ossowski
BMC Genomics. 2015; 16: 768. PMCID: PMC4603299
Installation
Dependencies:
Jitterbug requires the following to run:
Python (https://www.python.org/) version 2.7 or later (Jitterbug has not been tested with Python 3)
Python modules: pysam (https://github.com/pysam-developers/pysam), pybedtools (https://pythonhosted.org/pybedtools/), and psutil (https://pypi.python.org/pypi/psutil)
For the companion scripts to plot results and process tumor/normal pairs, you will need:
matplotlib (http://matplotlib.org/), matplotlib-venn (https://pypi.python.org/pypi/matplotlib-venn), and numpy (http://www.numpy.org/)
All of these modules can be installed through pip (https://pypi.python.org/pypi/pip), so you can run:
pip install <module_name>
for each of the modules listed above.
Note on memory usage
If you are dealing with large input files (e.g. the human genome rather than Arabidopsis) you may want to use the --pre_filter option to pre-select discordant reads with samtools. This writes an additional file to disk, but uses less memory at runtime.
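To illustrate what such a pre-filter keeps, here is a minimal pure-Python sketch of the discordant-read criterion (mapped reads not flagged as part of a proper pair). The helper names are hypothetical and this is only an illustration of the SAM FLAG logic; the actual --pre_filter step delegates to samtools:

```python
# Sketch (not Jitterbug's actual code) of a discordant-read pre-filter:
# keep mapped reads whose pair is NOT flagged as a proper pair.
PROPER_PAIR = 0x2   # SAM FLAG bit: read mapped in proper pair
UNMAPPED    = 0x4   # SAM FLAG bit: read itself is unmapped

def is_discordant(flag):
    """A read is a discordant-pair candidate if it is mapped
    but not part of a properly aligned pair."""
    return not (flag & UNMAPPED) and not (flag & PROPER_PAIR)

def pre_filter(sam_lines):
    """Yield header lines plus alignment lines whose FLAG marks them as discordant."""
    for line in sam_lines:
        if line.startswith('@'):          # keep header lines as-is
            yield line
            continue
        flag = int(line.split('\t')[1])   # FLAG is the 2nd SAM column
        if is_discordant(flag):
            yield line
```

Running this over a name- or position-sorted SAM stream yields only the reads Jitterbug would inspect for insertion evidence, which is why the pre-filtered file is much smaller than the input.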
Specific requirements: Jitterbug makes use of a bundled module called y_serial (included). This module requires that the specified output folder be readable and writable by all users, so remember to set, for example, chmod -R 777 your_output_folder/ or it will crash.
USE CASE 1: predict TEI in a single sample
STEP 1.1: run jitterbug.py to identify the candidate TE insertions:
example usage: run with bam file, default everything, write to present directory
jitterbug.py sample.bam te_annot.gff3
example usage: run with bam file, write to specified directory with specified prefix. Parallelize: use 8 threads, separating the genome into 50 Mbp bins
jitterbug.py --numCPUs 8 --bin_size 50000000 --output_prefix /path/to/my/dir/prefix sample.bam te_annot.gff3
example usage: your gff annotation carries unique identifier tags (as in the hg19 and TAIR10 annotations provided in the data/ folder) and you want those reported in the final output
jitterbug.py --TE_name_tag Name --numCPUs 8 --bin_size 50000000 --output_prefix /path/to/my/dir/prefix sample.bam te_annot.gff3
full list of options:
usage: jitterbug.py [-h] [-v] [--pre_filter] [-l LIB_NAME] [-d SDEV_MULT]
[-o OUTPUT_PREFIX] [-n NUMCPUS] [-b BIN_SIZE] [-q MINMAPQ]
[-t TE_NAME_TAG] [-s CONF_LIB_STATS] [-c MIN_CLUSTER_SIZE]
[--disc_reads_bam DISC_READS_BAM]
mapped_reads TE_annot
positional arguments:
mapped_reads reads mapped to reference genome in bam format, sorted
by position
TE_annot annotation of transposable elements in reference
genome, in gff3 format
optional arguments:
-h, --help show this help message and exit
-v, --verbose print more output to the terminal
--pre_filter pre-filter reads with samtools, and save intermediate
filtered read subset
-l LIB_NAME, --lib_name LIB_NAME
sample or library name, to be included in final gff
output
-d SDEV_MULT, --sdev_mult SDEV_MULT
use SDEV_MULT*fragment_sdev + fragment_length when
calculating insertion intervals. Best you don't touch
this.
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
prefix of output files. Can be
path/to/directory/file_prefix
-n NUMCPUS, --numCPUs NUMCPUS
number of CPUs to use
-b BIN_SIZE, --bin_size BIN_SIZE
If parallelized, size of bins to use, in bp. If
numCPUs > 1 and bin_size == 0, will parallelize by
entire chromosomes
-q MINMAPQ, --minMAPQ MINMAPQ
minimum read mapping quality to be considered
-t TE_NAME_TAG, --TE_name_tag TE_NAME_TAG
name of tag in TE annotation gff file to use to record
inserted TEs
-s CONF_LIB_STATS, --conf_lib_stats CONF_LIB_STATS
tabulated config file that sets the values to use for
fragment length and sdev, read length and sdev. 4 tab-
delimited lines of key value; keys are: fragment_length,
fragment_length_SD, read_length,
read_length_SD
-c MIN_CLUSTER_SIZE, --min_cluster_size MIN_CLUSTER_SIZE
min number of both fwd and rev reads to predict an
insertion
--disc_reads_bam DISC_READS_BAM
for debug. Use as input bam file of discordant reads
only (generated by running with --pre_filter), and
skip the step of perusing the input bam for discordant
reads
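For reference, a --conf_lib_stats file following the format above might look like this (tab-separated; the numeric values here are invented for illustration):

```
fragment_length	300
fragment_length_SD	30
read_length	100
read_length_SD	5
```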
This will generate the following output files:
<output_prefix>.TE_insertions_paired_clusters.gff3
Annotation in gff3 format of the predicted insertions, with the 9th column's tags describing characteristics of the predictions
<output_prefix>.read_stats.txt
Stats collected on the library: length and standard deviations of the fragments and reads
<output_prefix>.filter_config.txt
Config file with reasonable defaults for the subsequent (highly recommended) filtering step. These defaults are calculated as a function of the library characteristics
<output_prefix>.run_stats.txt
Stats collected about the run: calculation time, number of processors used
<output_prefix>.TE_insertions_paired_clusters.supporting_clusters.table
Table with information on the reads that support each cluster. This table is intended to be easily manipulated with standard *NIX tools, for example to extract the anchor and mate reads' sequences for assembly and primer design
The table has the following format:
Insertion lines: one per predicted insertion site, corresponding to a pair of overlapping clusters, one fwd, one rev
I cluster_pair_ID lib chrom start end num_fwd_reads num_rev_reads fwd_span rev_span best_sc_pos_st best_sc_pos_end sc_pos_support
Cluster lines (two per insertion, one fwd and one rev):
C cluster_pair_ID lib direction start end chrom num_reads span
Read lines (fwd reads constitute the fwd clusters, rev reads the rev clusters). A read's status can be "anchor" (the reads that constitute the cluster) or "mate" (the anchors' mates, which map to a TE):
R cluster_pair_ID lib direction interval_start interval_end chrom status bam_line
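As an example of such manipulation, a short Python sketch (a hypothetical helper, not part of Jitterbug) that pulls the mate reads supporting one predicted insertion out of this table, assuming the fields are tab-separated:

```python
# Extract the bam lines of "mate" reads (which map to a TE) for one
# cluster_pair_ID from a supporting_clusters.table. R-line layout, as
# documented above:
# R cluster_pair_ID lib direction interval_start interval_end chrom status bam_line
def mate_reads_for_cluster(table_lines, cluster_pair_id):
    mates = []
    for line in table_lines:
        # split at most 8 times so the embedded (itself tab-separated)
        # bam line stays intact as the 9th field
        fields = line.rstrip('\n').split('\t', 8)
        if (len(fields) == 9 and fields[0] == 'R'
                and fields[1] == cluster_pair_id
                and fields[7] == 'mate'):
            mates.append(fields[8])
    return mates
```

The returned bam lines can then be written out and converted to fasta for assembly or primer design.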
STEP 1.2: filter results
We recommend you filter the results generated by the previous step:
- to eliminate insertions with poor support
- to eliminate insertions which overlap with Ns in your reference
The first step selects high-confidence insertions based on a set of metrics (see figure Supp2A in the related publication). Depending on your application, you may want more relaxed filtering criteria (to recover as many true insertions as possible, at the cost of more false positives) or stricter ones (fewer true positives, but with higher confidence). The filter_config.txt file output in the first step is automatically generated with reasonable defaults for the given sequencing library. These defaults are:
2 < cluster size < 5 * coverage
2 < span < mean fragment length
mean read length < interval length < 2 * (isize_mean + 2 * isize_sdev - (rlen_mean - rlen_sdev))
2 < softclipped support < 5 * coverage
pick consistent: name annotation
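Taken together, the default bounds amount to a predicate along the lines of the following sketch. This is a hypothetical passes_filters helper with assumed parameter names, shown only to make the thresholds concrete; the actual filtering is performed by jitterbug_filter_results_func.py driven by filter_config.txt:

```python
# Illustrative sketch of the kind of test the filtering step applies,
# using the default bounds listed above. Parameter and key names are
# assumptions for illustration, not Jitterbug's internals.
def passes_filters(pred, coverage, frag_len_mean, isize_mean, isize_sdev,
                   rlen_mean, rlen_sdev):
    """pred: dict with cluster_size, span, interval_len, softclipped_support."""
    max_interval = 2 * (isize_mean + 2 * isize_sdev - (rlen_mean - rlen_sdev))
    return (2 < pred['cluster_size'] < 5 * coverage
            and 2 < pred['span'] < frag_len_mean
            and rlen_mean < pred['interval_len'] < max_interval
            and 2 < pred['softclipped_support'] < 5 * coverage)
```

Loosening or tightening these bounds in filter_config.txt shifts the trade-off between sensitivity (more true insertions recovered, more false positives) and precision.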
The second step eliminates false positives that are due to a poorly assembled region. Our experience is that repetitive sequences tend to be more difficult to assemble, and N islands in a draft assembly are often unassembled transposons. For that reason, insertions spanning Ns are likely not insertions in the sample, but absence in the reference sequence of a TE common to both the reference and the sample.
example:
jitterbug_filter_results_func.py -g sample.TE_insertions_paired_clusters.gff3 -c sample.filter_config.txt -o sample.TE_insertions_paired_clusters.filtered.gff3
intersectBed -a sample.TE_insertions_paired_clusters.filtered.gff3 -b N_annot.gff3 -v > sample.TE_insertions_paired_clusters.filtered.noNs.gff3
(intersectBed is part of the BedTools suite, which you can download at https://github.com/arq5x/bedtools2)
usage: jitterbug_filter_results_func.py [-h] [-g GFF] [-c CONFIG] [-o OUTPUT]
optional arguments:
-h, --help show this help message and exit
-g GFF, --gff GFF file in gff3 format of TEI generated by Jitterbug
-c CONFIG, --config CONFIG config file with filtering parameters, generated by Jitterbug
-o OUTPUT, --output OUTPUT name of output file
USE CASE 2: looking for somatic insertions in a tumor/normal pair
Included in this package is a module that performs the comparison of an ND/TD (normal/tumor) pair of samples.
STEP 2.1: run Jitterbug on ND and TD samples separately