Jitterbug

Jitterbug is a bioinformatic software that predicts insertion sites of transposable elements in a sample sequenced by short paired-end reads with respect to an assembled reference.

Summary

This set of programs allows one to identify transposable element insertions (TEI) in a sequenced sample with respect to an assembled reference.

The main script jitterbug.py performs this analysis, using a .bam file of mapped reads and the annotation of TEs in the reference.

Additional modules are provided to filter and plot these results, compare tumor/normal pairs, and evaluate predictions from simulated data.

Citation

If you use this software in your work, please cite:

Elizabeth Hénaff, Luís Zapata, Josep M. Casacuberta, and Stephan Ossowski. Jitterbug: somatic and germline transposon insertion detection at single-nucleotide resolution. BMC Genomics. 2015; 16: 768. PMCID: PMC4603299

Installation

Dependencies:

Jitterbug requires the following to run:

Python (https://www.python.org/) version 2.7 or greater (Jitterbug has not been tested with Python3)

Python modules:

  • pysam (https://github.com/pysam-developers/pysam)
  • pybedtools (https://pythonhosted.org/pybedtools/)
  • psutil (https://pypi.python.org/pypi/psutil)

For the companion scripts to plot results and process tumor/normal pairs, you will need:

  • matplotlib (http://matplotlib.org/)
  • matplotlib-venn (https://pypi.python.org/pypi/matplotlib-venn)
  • numpy (http://www.numpy.org/)

All of these modules can be installed with pip (https://pypi.python.org/pypi/pip), so you can run:

pip install <module_name>

to install each of the above mentioned modules.
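Before a long run it can be useful to check that the modules are importable. The following is a minimal sketch, not part of Jitterbug: the function name missing_modules is hypothetical, and matplotlib_venn is assumed to be the import name of the matplotlib-venn package.

```python
from __future__ import print_function
import importlib

# Import names of the modules listed above (assumption: matplotlib-venn
# is imported as "matplotlib_venn").
CORE_MODULES = ["pysam", "pybedtools", "psutil"]
COMPANION_MODULES = ["matplotlib", "matplotlib_venn", "numpy"]


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for label, names in [("core", CORE_MODULES), ("companion", COMPANION_MODULES)]:
        absent = missing_modules(names)
        print(label, "modules missing:", ", ".join(absent) if absent else "none")
```

Any module reported as missing can then be installed with pip as described above.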

Note on memory usage

If you are dealing with large input files (e.g. the human genome rather than Arabidopsis), you may want to use the --pre_filter option to pre-select discordant reads with samtools. This writes an additional file to disk but uses less memory at runtime.

Specific requirements: Jitterbug uses a module called y_serial (included), which requires that the specified output folder be readable and writable by all users. Remember to set the permissions, for example with chmod -R 777 your_output_folder/, or Jitterbug will crash.
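If you prepare the output folder from a Python wrapper rather than the shell, the same permissions can be set with the standard library. This is a sketch only; the function name prepare_output_dir is hypothetical and not part of Jitterbug.

```python
import os


def prepare_output_dir(path):
    """Create `path` if needed and give all users read/write/execute access.

    Mirrors `chmod -R 777 path`: the directory itself and everything
    already inside it are set to mode 777.
    """
    if not os.path.isdir(path):
        os.makedirs(path)
    os.chmod(path, 0o777)
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            os.chmod(os.path.join(root, name), 0o777)
    return path
```

Mode 777 is very permissive, so it is best applied to a dedicated output folder rather than to a shared parent directory.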

USE CASE 1: predict TEI in a single sample

STEP 1.1: run jitterbug.py to identify the candidate TE insertions:

example usage: run with bam file, default everything, write to present directory

jitterbug.py sample.bam te_annot.gff3

example usage: run with bam file, write to specified directory with specified prefix. Parallelize: use 8 threads, separating by 50 Mbp bins (50000000 bp)

jitterbug.py --numCPUs 8  --bin_size 50000000 --output_prefix /path/to/my/dir/prefix sample.bam te_annot.gff3

example usage: you have added nice unique identifier tags to your gff annotation (like in the hg19 and TAIR10 annotations provided in the data/ folder) and you want those to be reported in the final output.

jitterbug.py --TE_name_tag Name --numCPUs 8  --bin_size 50000000 --output_prefix /path/to/my/dir/prefix sample.bam te_annot.gff3
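As a rough illustration of what --bin_size means for parallelization, the sketch below counts how many bins of a given size are needed to cover a chromosome. The function name count_bins and the 250,000,000 bp chromosome length are hypothetical, and Jitterbug's own binning may differ in detail; the only behavior taken from the option description is that a bin size of 0 parallelizes by entire chromosomes.

```python
def count_bins(chrom_length, bin_size):
    """Number of work units needed to cover one chromosome.

    A bin_size of 0 is treated as 'whole chromosome', i.e. a single unit.
    """
    if bin_size <= 0:
        return 1
    # Ceiling division: a partial bin at the chromosome end still counts.
    return (chrom_length + bin_size - 1) // bin_size


if __name__ == "__main__":
    chrom_length = 250000000  # hypothetical chromosome length in bp
    print(count_bins(chrom_length, 50000000))  # 50,000,000 bp bins -> 5
    print(count_bins(chrom_length, 0))         # whole chromosome -> 1
```

With --bin_size 50000000 as in the commands above, a chromosome of this size is split into only a handful of bins, so larger genomes benefit most from the extra CPUs.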

full list of options:

usage: jitterbug.py [-h] [-v] [--pre_filter] [-l LIB_NAME] [-d SDEV_MULT]
                    [-o OUTPUT_PREFIX] [-n NUMCPUS] [-b BIN_SIZE] [-q MINMAPQ]
                    [-t TE_NAME_TAG] [-s CONF_LIB_STATS] [-c MIN_CLUSTER_SIZE]
                    [--disc_reads_bam DISC_READS_BAM]
                    mapped_reads TE_annot

positional arguments:
  mapped_reads          reads mapped to reference genome in bam format, sorted
                        by position
  TE_annot              annotation of transposable elements in reference
                        genome, in gff3 format

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         print more output to the terminal
  --pre_filter          pre-filter reads with samtools, and save intermediate
                        filtered read subset
  -l LIB_NAME, --lib_name LIB_NAME
                        sample or library name, to be included in final gff
                        output
  -d SDEV_MULT, --sdev_mult SDEV_MULT
                        use SDEV_MULT*fragment_sdev + fragment_length when
                        calculating insertion intervals. Best you don't touch
                        this.
  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        prefix of output files. Can be
                        path/to/directory/file_prefix
  -n NUMCPUS, --numCPUs NUMCPUS
                        number of CPUs to use
  -b BIN_SIZE, --bin_size BIN_SIZE
                        If parallelized, size of bins to use, in bp. If
                        numCPUs > 1 and bin_size == 0, will parallelize by
                        entire chromosomes
  -q MINMAPQ, --minMAPQ MINMAPQ
                        minimum read mapping quality to be considered
  -t TE_NAME_TAG, --TE_name_tag TE_NAME_TAG
                        name of tag in TE annotation gff file to use to record
                        inserted TEs
  -s CONF_LIB_STATS, --conf_lib_stats CONF_LIB_STATS
                        tabulated config file that sets the values to use for
                        fragment length and sdev, read length and sdev. 4 tab-
                        deliminated lines: key value, keys
                        are:fragment_length, fragment_length_SD, read_length,
                        read_length_SD
  -c MIN_CLUSTER_SIZE, --min_cluster_size MIN_CLUSTER_SIZE
                        min number of both fwd and rev reads to predict an
                        insertion
  --disc_reads_bam DISC_READS_BAM
                        for debug. Use as input bam file of discordant reads
                        only (generated by running with --pre_filter), and
                        skip the step of perusing the input bam for discordant
                        reads
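
The -s/--conf_lib_stats option expects a four-line, tab-delimited key/value file. A minimal sketch of writing one, assuming one key, a tab, and a value per line; the helper name write_lib_stats is hypothetical and not part of Jitterbug:

```python
def write_lib_stats(path, fragment_length, fragment_length_sd,
                    read_length, read_length_sd):
    """Write the four key/value lines named in the -s option description."""
    rows = [
        ("fragment_length", fragment_length),
        ("fragment_length_SD", fragment_length_sd),
        ("read_length", read_length),
        ("read_length_SD", read_length_sd),
    ]
    with open(path, "w") as handle:
        for key, value in rows:
            handle.write("{0}\t{1}\n".format(key, value))
```

For example, write_lib_stats("lib_stats.txt", 300, 30, 100, 0) writes a file for a made-up library with 300 bp fragments and 100 bp reads, which can then be passed to jitterbug.py with -s lib_stats.txt.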

Running jitterbug.py generates the following output files:

<output_prefix>.TE_insertions_paired_clusters.gff3

Annotation in gff3 format of the predicted insertions, with the 9th column's tags describing characteristics of the predictions

<output_prefix>.read_stats.txt

Stats collected on the library: length and standard deviations of the fragments and reads

<output_prefix>.filter_config.txt

Config file with reasonable defaults for the subsequent highly recommended filtering step. These defaults are calculated as a function of the library characteristics

<output_prefix>.run_stats.txt

Stats collected about the run: calculation time, number of processors used

<output_prefix>.TE_insertions_paired_clusters.supporting_clusters.table

Table with information on the reads that support each cluster. This table is intended to be easily manipulated with standard *NIX tools in order to, for example, extract the sequences of the anchor and mate reads for assembly and primer design

The table follows the following format:

Insertion lines: one per predicted insertion site, corresponding to a pair of overlapping clusters, one fwd, one rev

I       cluster_pair_ID lib     chrom   start   end     num_fwd_reads   num_rev_reads   fwd_span        rev_span        best_sc_pos_st  best_sc_pos_end sc_pos_support

Cluster lines (two per insertion, one fwd and one rev):

C       cluster_pair_ID lib     direction       start   end     chrom   num_reads       span

Read lines (fwd reads constitute the fwd clusters, rev reads the rev clusters). A read's status can be "anchor", for the reads that constitute the cluster, or "mate", for the anchors' mates, which map to a TE:

R       cluster_pair_ID lib     direction       interval_start  interval_end    chrom   status  bam_line
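
The three line types can also be read from Python. The sketch below assumes the table is tab-delimited and that the bam_line field is the last column and may itself contain tabs, which is why the split is capped. The function name parse_table_line is hypothetical; the field names are the ones listed above.

```python
FIELDS = {
    "I": ["cluster_pair_ID", "lib", "chrom", "start", "end",
          "num_fwd_reads", "num_rev_reads", "fwd_span", "rev_span",
          "best_sc_pos_st", "best_sc_pos_end", "sc_pos_support"],
    "C": ["cluster_pair_ID", "lib", "direction", "start", "end",
          "chrom", "num_reads", "span"],
    "R": ["cluster_pair_ID", "lib", "direction", "interval_start",
          "interval_end", "chrom", "status", "bam_line"],
}


def parse_table_line(line):
    """Turn one I, C or R line into a dict keyed by the column names."""
    line = line.rstrip("\n")
    kind = line.split("\t", 1)[0]
    names = FIELDS[kind]  # raises KeyError for an unknown line type
    # Cap the split so that everything after the last named column
    # (e.g. the tabs inside bam_line) stays in one field.
    values = line.split("\t", len(names))[1:]
    record = dict(zip(names, values))
    record["type"] = kind
    return record
```

All values are returned as strings; convert start, end and the read counts to int as needed. Collecting the bam_line of every R record whose status is "mate" is one way to gather the TE-mapping reads of a given cluster_pair_ID.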

STEP 1.2: filter results

We recommend you filter the results generated by the previous step

  • to eliminate insertions with poor support
  • to eliminate insertions which overlap with Ns in your reference

The first step selects high-confidence insertions based on a set of metrics (see figure Supp2A in the related publication). Depending on your application, you might want more relaxed filtering criteria, to recover as many true insertions as possible at the cost of more false positives (FP), or stricter criteria, which yield fewer true positives (TP) but with higher confidence. The filter_config.txt file output in the first step is automatically generated with reasonable defaults for the given sequencing library. These defaults are:

  • 2 < cluster size < 5*coverage
  • 2 < span < mean fragment length
  • mean read length < interval length < 2*(isize_mean + 2*isize_sdev - (rlen_mean - rlen_sdev))
  • 2 < softclipped support < 5*coverage
  • pick consistent: name annotation
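
To see what these defaults look like for a particular library, the bounds can be written out as a small function. This is a sketch under assumptions, not Jitterbug's own code: it reads the products as 5*coverage and 2*isize_sdev, takes "mean fragment length" to be isize_mean, and uses the hypothetical function name default_filter_bounds. The "pick consistent" setting is not modeled.

```python
def default_filter_bounds(coverage, isize_mean, isize_sdev, rlen_mean, rlen_sdev):
    """Return (lower, upper) exclusive bounds for each filtering metric."""
    interval_upper = 2 * (isize_mean + 2 * isize_sdev - (rlen_mean - rlen_sdev))
    return {
        "cluster_size": (2, 5 * coverage),
        "span": (2, isize_mean),
        "interval_length": (rlen_mean, interval_upper),
        "softclipped_support": (2, 5 * coverage),
    }


def passes(value, bounds):
    """True if lower < value < upper, matching the strict inequalities above."""
    lower, upper = bounds
    return lower < value < upper
```

For a made-up library with 30x coverage, 300 bp fragments (standard deviation 30) and 100 bp reads (standard deviation 0), the interval length upper bound is 2*(300 + 60 - 100) = 520 and the cluster size must lie between 2 and 150.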

The second step eliminates false positives that are due to a poorly assembled region. Our experience is that repetitive sequences tend to be more difficult to assemble, and N islands in a draft assembly are often unassembled transposons. For that reason, a predicted insertion spanning Ns is likely not an insertion in the sample, but a TE that is present in both the reference and the sample and is simply missing from the reference sequence.

example:

jitterbug_filter_results_func.py -g sample.TE_insertions_paired_clusters.gff3 -c sample.filter_config.txt -o sample.TE_insertions_paired_clusters.filtered.gff3

intersectBed -a sample.TE_insertions_paired_clusters.filtered.gff3 -b N_annot.gff3 -v > sample.TE_insertions_paired_clusters.filtered.noNs.gff3

(intersectBed is part of the BedTools suite, which you can download at https://github.com/arq5x/bedtools2)
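
The effect of the -v flag, which keeps only the entries of the first file that overlap nothing in the second, can be sketched in plain Python. This is an illustration of the logic only, not a replacement for BedTools: the function names are hypothetical, and intervals are assumed to be (chrom, start, end) tuples with 1-based inclusive coordinates as in gff3.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def drop_overlapping(insertions, n_regions):
    """Keep the insertions that do not overlap any N region on the same chromosome."""
    kept = []
    for chrom, start, end in insertions:
        hit = any(
            chrom == n_chrom and overlaps(start, end, n_start, n_end)
            for n_chrom, n_start, n_end in n_regions
        )
        if not hit:
            kept.append((chrom, start, end))
    return kept
```

This comparison is quadratic in the number of intervals, which is fine for a quick check on a filtered result set; for whole-genome annotations intersectBed remains the practical choice.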

usage: jitterbug_filter_results_func.py [-h] [-g GFF] [-c CONFIG] [-o OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -g GFF, --gff GFF     file in gff3 format of TEI generated by Jitterbug
  -c CONFIG, --config CONFIG
                        config file with filtering parameters, generated by
                        Jitterbug
  -o OUTPUT, --output OUTPUT
                        name of output file

USE CASE 2: looking for somatic insertions in a tumor/normal pair

Included in this package is a module that compares a normal/tumor (ND/TD) pair of samples.

STEP 2.1: run Jitterbug on ND and TD samples separately
