TEPIC (version 2.2)

TEPIC offers workflows for the prediction and analysis of Transcription Factor (TF) binding sites including:

TF affinity computation in user provided regions
Computation of continuous and discrete TF-gene scores
Linear regression analysis to infer potential key regulators (INVOKE)
Logistic regression analysis to suggest regulators related to gene expression differences between samples (DYNAMITE)

A graphical overview of the TEPIC workflows is shown below. Blue font indicates input, black italic font indicates output.

News

06.02.2020: Our recently introduced TEPIC-extension supporting regulatory sites from chromatin capture experiments to estimate the influence of enhancers on genome-wide gene expression is now published in BMC Epigenetics and Chromatin.

08.07.2019: We present a novel feature to include TFBS in regulatory sites determined by chromatin conformation capture data. Using an extended feature space representation, the INVOKE model can investigate the regulatory influence of TFs bound to promoters and enhancers separately.

10.10.2018: TEPIC 2.0 is now published in Bioinformatics.

13.08.2018: In addition to the gene-centric annotation, the functionality for transcript based annotation has been added.

21.05.2018: A new collection of TF motifs is included. They are available in the folder PWMs/2.1.

03.05.2018: INVOKE now supports computing an F-test to judge the importance of individual features.

17.08.2017: TEPIC TF-gene scores can now be binarised using background regions provided by the user.

21.06.2017: TEPIC TF-gene scores can be binarised using a TF- and tissue-specific affinity cutoff. This can be combined with the dynamic networks learner DREM to build gene regulatory networks that make appropriate use of time-specific epigenetics data. Further details on this new feature are available in the description.

13.06.2017: DYNAMITE, our workflow for learning differential TF regulation, is now included in the repository.

09.06.2017: Version 2.0 of TEPIC is available. With version 2 of TEPIC, we introduced new features:

We extended the set of PSEMs.
TF affinities are computed using a C++ implementation of TRAP.
Affinities can be normalised by peak length during TF-gene score computation.
The length of the PSEMs can be considered in the normalisation.
We introduced features for peak length and peak counts.
Scaling can be performed in two ways: either as originally proposed in the TEPIC manuscript, by directly multiplying peak signal and TF affinities, or by generating a separate signal feature.

Further, the repository now includes the code required to learn linear models from TF-gene scores to predict gene expression. For further details, please see the INVOKE section.

Introduction

TEPIC offers a variety of workflows to compute and analyse TF binding site (TFBS) predictions. The core functionality is the fast and efficient annotation of user provided candidate regions with TF affinities using TRAP (1). These predictions are aggregated to TF-gene scores using a window-based assignment strategy.
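The window-based assignment can be pictured with a minimal Python sketch. It is not TEPIC's implementation: the function name, the input tuples, the default window size, and the choice to assign a region by its midpoint to a window centred on the TSS are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_tf_gene_scores(regions, genes, window=3000):
    """Sum region-level TF affinities into TF-gene scores.

    regions: list of (chrom, start, end, {tf_name: affinity})
    genes:   list of (gene_id, chrom, tss)
    window:  total window size in bp, assumed here to be centred on the TSS
    """
    half_window = window / 2
    scores = {gene_id: defaultdict(float) for gene_id, _, _ in genes}
    for chrom, start, end, affinities in regions:
        midpoint = (start + end) / 2
        for gene_id, gene_chrom, tss in genes:
            # A region contributes only if it lies on the same chromosome
            # and its midpoint falls inside the window around the TSS.
            if gene_chrom == chrom and abs(midpoint - tss) <= half_window:
                for tf_name, affinity in affinities.items():
                    scores[gene_id][tf_name] += affinity
    return {gene_id: dict(tf_scores) for gene_id, tf_scores in scores.items()}
```

In this sketch, a region far from every TSS, or on a chromosome without annotated genes, simply contributes to no gene.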

Within this aggregation, TEPIC offers exponential decay (2) and scaling of TF region scores using an epigenetic signal, e.g. the signal of an open chromatin assay. While computing TF-gene scores, TEPIC can normalise for region length (optionally correcting for the length of the binding motifs as well) and can produce separate features for peak length, peak count and/or peak signal. These features can be used in downstream applications, e.g. to determine the influence of chromatin accessibility on gene expression without considering detailed information on TF binding. In addition to the continuous TF affinities, TEPIC offers a discrete assignment of TFs to genes using a TF-specific affinity threshold derived from random genomic sequences with characteristics (GC content, length) similar to those of the provided regions. Further details on the score computation are provided in the description.
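As a rough illustration of how decay and signal scaling could combine, the following Python sketch weights each peak by an exponential function of its distance to the TSS and optionally multiplies the affinity by the peak signal. The function names, the input layout, and the decay constant of 5000 bp are assumptions for this example, not values taken from the TEPIC code.

```python
import math


def decay_weight(distance, decay_constant=5000.0):
    """Exponential decay weight for a peak at the given distance (bp) from the TSS."""
    return math.exp(-abs(distance) / decay_constant)


def tf_gene_score(peaks, tss, decay=True, scale_by_signal=False,
                  decay_constant=5000.0):
    """Aggregate the affinities of one TF over the peaks assigned to one gene.

    peaks: list of (midpoint, affinity, signal) tuples (hypothetical layout)
    """
    total = 0.0
    for midpoint, affinity, signal in peaks:
        contribution = affinity
        if scale_by_signal:
            # Scaling by direct multiplication of peak signal and TF affinity.
            contribution *= signal
        if decay:
            # Peaks further away from the TSS contribute less.
            contribution *= decay_weight(midpoint - tss, decay_constant)
        total += contribution
    return total
```

With decay disabled and no scaling, the score reduces to the plain sum of affinities in the window.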

TEPIC TF-gene scores can be used in several downstream applications to assess the regulatory role of TFs. Three applications are directly supported:

Using a linear regression analysis to highlight potential key TFs by predicting gene expression within a sample of interest (MachineLearningPipelines/INVOKE)
Suggesting regulators that might be linked to gene-expression differences between samples using a logistic regression classifier (MachineLearningPipelines/DYNAMITE)
Generating input for DREM to infer time-point specific transcriptional regulators from temporal epigenetics data (MachineLearningPipelines/EPIC-DREM)

Details on the models are provided in the respective subfolders as well as in the description. Here, we provide a brief description of the core functionality of TEPIC, the computation of TF-gene scores.

Installing TEPIC

To run TEPIC the following packages/software must be installed:

Python (minimum version 2.7)
bedtools: Installation instructions for bedtools can be found here. Please make sure to add the bedtools installation to your path.
A C++ compiler supporting OpenMP, needed for the parallel implementation of TRAP.

To compile the C++ version of TRAP and to install any missing R packages for downstream processing, execute the script Code/compile_TRAP_install_R_packages.sh.

To use the script findBackground, which is necessary to compute TF-specific affinity thresholds, the following Python libraries are required:

numpy
scipy
twobitreader
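Before a first run, it can help to check that the requirements listed above are available. The following is a small, hypothetical helper (not part of TEPIC) that reports command-line tools missing from the PATH and Python libraries that cannot be found; the function names are made up for this sketch.

```python
import importlib.util
import shutil


def missing_tools(tools=("bedtools",)):
    """Return the command-line tools that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules=("numpy", "scipy", "twobitreader")):
    """Return the Python libraries that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for tool in missing_tools():
        print("Not on PATH:", tool)
    for name in missing_modules():
        print("Python library not installed:", name)
```

The check only confirms that a tool or library can be located; it does not verify versions or OpenMP support in the compiler.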

Using TEPIC

To start TEPIC, run the script TEPIC.sh, located in the folder Code.

./TEPIC.sh

The following parameters are required to compute TF affinities in user defined regions:

-g The reference genome in plain (uncompressed) FASTA format with Ensembl-style chromosome names (i.e., without "chr" prefix). If a "chr" prefix is present, use the -j option.
-b Regions the user wants to be annotated; chromosome names must be compatible with those in the reference genome file.
-o Prefix of the output files.
-p File containing position specific energy matrices (PSEMs) (see next section for details).
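Whether the -j option is needed depends on the chromosome names in the reference FASTA file. A minimal sketch for checking this, assuming that inspecting the first header line is sufficient (the function name is hypothetical and not part of TEPIC):

```python
def has_chr_prefix(fasta_lines):
    """Return True if the first FASTA header starts with 'chr'.

    fasta_lines: any iterable of lines, e.g. an open file handle.
    Only the first header is inspected; mixed naming is not detected.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            return line[1:].startswith("chr")
    # No header found at all: treat as "no prefix".
    return False
```

A typical use would be to open the genome file, pass the handle to has_chr_prefix, and add -j to the TEPIC call only if it returns True.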

To additionally compute TF-gene scores, the following argument must be specified:

-a Genome annotation file (gtf). All genes contained in this file will be annotated. The file must have the original format provided by GENCODE; gzipped files are not supported.
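The required arguments can be assembled programmatically, for example when TEPIC is called from a pipeline. The sketch below builds the argument list from the options described above; the helper name and all file names are placeholders, and only the flags -g, -b, -o, -p, -a and -j from this README are used.

```python
def build_tepic_command(genome, regions, prefix, psem,
                        annotation=None, chr_prefix=False):
    """Build the argument list for a TEPIC.sh call.

    annotation: optional gtf file (-a); if given, TF-gene scores are computed.
    chr_prefix: set to True if the reference genome uses a "chr" prefix (-j).
    """
    command = ["./TEPIC.sh", "-g", genome, "-b", regions,
               "-o", prefix, "-p", psem]
    if annotation is not None:
        command += ["-a", annotation]
    if chr_prefix:
        command.append("-j")
    return command


# Example with placeholder file names; run from within the Code folder, e.g.
# subprocess.run(command, check=True, cwd="Code")
command = build_tepic_command("genome.fa", "peaks.bed", "sample1",
                              "psems.PSEM", annotation="annotation.gtf")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.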

Additional command arguments are:

-w Size of the window around the TSS of genes.
-d Signal of the open chromatin assay in bedGraph (bg) format. Used to compute the average per-peak signal within the regions specified in -b.
-e Boolean controlling exponential decay (default TRUE).
-n Indicates that the file in -b contains the average signal in the peaks in the specified column. In this case the -d option is not required to obtain scaled TF affinities.
-c Number of cores used within TRAP.
-f A gtf file containing genes of interest. Only regions contained in the file specified by the -b option that are within the window specified by the -w option around these genes will be annotated. The file must have the original format provided by GENCODE; gzipped files are not supported.
-y Flag indicating whether the entire gene body should be annotated with TF affinities. A window of half the size of the -w option will be additionally considered upstream of the gene's TSS.
-l Flag to be set if affinities should not be normalised by peak length.
-u Flag to be set if peak features for peak length and peak counts should not be generated.
-q Parameter to be set if only peak features should be generated (default FALSE).
-x If -d or -n is used together with this flag, the original (Decay-)Scaling formulation of TEPIC is used to compute gene-TF scores.
-m Path to a tab delimited file containing the length of the used PSEMs. This is incorporated in normalising peak length.
-z Flag indicating that the output of TEPIC should be zipped.
-k Path to a file containing background regions provided by the user. This option cannot be used together with the -r option.
-r Path to a 2bit representation of the reference genome. This is required to compute a TF-specific affinity threshold as well as a binary and sparse TF-gene interaction list. This option cannot be used together with the -k option.
-v p-value cutoff used to determine the affinity threshold from which a binary score for TF binding is derived (default 0.05).
-i Maximum number of minutes spent per chromosome to find matching random regions (default 3).
-j Flag indicating that the reference genome contains a chr prefix.
-t Flag indicating that the annotation should be based on transcripts, not on genes.
-h Loop file containing chromatin contacts. Only intrachromosomal contacts are supported.
-s Loop window used to link a gene to a chromatin loop (default 5000).
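To make the role of the -v option more concrete, the sketch below derives an affinity threshold for one TF from affinities computed on background regions and then binarises scores against it. This is an empirical-quantile illustration under our own assumptions; the procedure actually used by TEPIC may differ, and both function names are hypothetical.

```python
def affinity_threshold(background_affinities, p_value=0.05):
    """Pick a threshold so that at most a fraction p_value of the
    background affinities lies strictly above it."""
    if not background_affinities:
        raise ValueError("background_affinities must not be empty")
    ranked = sorted(background_affinities, reverse=True)
    # Number of background values allowed to exceed the threshold.
    allowed = int(len(ranked) * p_value)
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def binarise(affinities, threshold):
    """Return 1 for affinities strictly above the threshold, else 0."""
    return [1 if affinity > threshold else 0 for affinity in affinities]
```

A smaller p-value yields a higher threshold and therefore a sparser set of predicted TF-gene interactions.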

Output

Depending on the used arguments, TEPIC produces files containing:

TF affinities for all user specified regions (Prefix_Affinity.txt). These files are always generated.
Scaled TF affinities for all user specified regions (Prefix_Scaled_Affinity.txt). This is only generated if the -d or
