Tfit

Transcription fit (or Tfit) implements a finite mixture model to identify sites of bidirectional or divergent transcription in nascent transcription assays such as Global Run-On (GRO) and Precision Run-on (PRO) followed by sequencing data. Tfit is separated by two modules: (1) bidir and (2) model. Output from these modules can be exported to any genome browser, bed file examples are provided below.

Alt text

The bidir module will compute local likelihood statistics given a fixed template mixture model (estimated either from promoter associated transcription) or specified explicitly by the user (the former is encouraged). This method is fast and will finish in about 30 minutes on a single node single CPU machine. The output will be a bed file corresponding to areas of possible bidirectional transcription. This output is discussed heavily in later sections

The model module will compute full MLE estimates of the mixture model at user specified regions of the genome provided as a bed file. This bed file may be the output from the bidir module. It is recommended that users fit MLE estimates to the output of the bidir module as this will decrease false positives. Two files will output from this module: (A) a <job_name>_K_models_MLE.tsv and a <job_name>_divergent_classifications.bed both containing information regarding the location, spreading, pausing probability and strand probability of bidirectional events. This output is discussed heavily in later sections. Computation time of this module depends on model complexity bounds. If the user is profiling only for bidirectional transcripts, it is recommended that -mink and -maxk flags be set to 1. In this case, computation take will take around 2-3 hours on a single node, single CPU machine.

Please note that for both the bidir and model module computation time will decrease linearly with the number of available CPUs. Computation time is discussed throughout the below sections.

##System Requirements and Makefile Transcription Fit (Tfit) is written in the C/C++ program language that requires GNU compilers 4.7.3 or greater. Tfit uses the popular openMPI framework to perform massive parallelization via multithreading on multiple core, single node systems or multiple core, multiple node compute clusters.

After cloning this repo, please change directory into /where/you/clone/this/repo/Tfit/src/ and run make. Note: That if you are a University of Colorado user and are running Pando Compute cluster through the BioFrontiers institute you need to module load both GNU compilers and mpich wrapper, i.e. module load gcc_4.9.2, module load mpich_3.1.4!

$ cd  /where/you/clone/this/repo/Tfit/src/

$ make clean

$ make

If the program compiled successfully you should see the below output.

-------------------
GCC version: 5.1.0
...
successfully compiled
-------------------

If your program, did not compile properly it is likely that you do not have the correct dependencies. The three significant dependencies are listed below.

c++11 (this ships with most recent versions of GCC, please visit https://gcc.gnu.org/install/)
openmp (this ships with most recent versions of GCC, please visit https://gcc.gnu.org/install/)
MPI (this needs to installed and configured and serves as a wrapper for GCC, please visit https://www.open-mpi.org/faq/)

In short, the make file requires the path to mpic++ (install and config openMPI) to be in your PATH. Installing and configuring your gcc compilers will likely be cause for a headache. I found these sites useful

https://www.open-mpi.org
https://www.cyberciti.biz/faq/howto-apple-mac-os-x-install-gcc-compiler/

##Running and Installing Through Docker Container Running Tfit via a Docker container requires that Docker is installed and running. In Windows and Mac OSX you must run the Docker/Tfit script from the Docker Quickstart Terminal (https://www.docker.com/products/docker-toolbox).

To run Tfit via docker, simply run the Docker/Tfit script with the same arguments and options as you would use when running the standard Tfit program. This script will check if you have the biofrontiers/fstitch_tfit:latest image and download it if you do not. It will then pass the arguments you provide to the Tfit program inside the container and output the results and logs to the location you specify.

In order to ensure you are using the latest version of the container, you may run the following command before executing the Tfit script:

docker pull biofrontiers/fstitch_tfit:latest

Advanced users may want to build their own Docker images using the provided Dockerfile. To do so, run the following command from within the Docker directory:

docker build -t [image name] .

And alter the Docker/Tfit script, changing the REPOSITORY and TAG variables so that they match your built image.

##Bidir Module The bidir module scans across the genome for areas resembling bidirectional transcription by comparing a fixed template mixture model (user provided parameters or parameters estimated from promoter regions) to a noise model (uniform distribution) by a Likelihood ratio score (LLR).

The critical input parameters below:

| Flag | Type | Description | |------|------|-------------| | -i | </path/to/BedGraphFile_forward strand> |BedGraph File from above” | -j | </path/to/BedGraphFile_forward strand> |BedGraph File from above” | -ij| </path/to/BedGraphFile_joint_strand> |(if -i and -j are not specified) BedGraph File from above but reverse strand reads are specified by negative coverage values and forward strand reads are specified by positive coverage values, NOTE: either -ij is specified or both -i and -j are specified. An example of this combined joint bedgraph file is below. | -N | string | job_name, simple a string | -o | </path/to/output/directory> | where files will output | -log_out | </path/to/logoutput/directory> | As the program runs this file will be updated with progress

The non-critical input parameters are listed below, these all have default settings.

| Flag | Type | Description | |------|------|-------------| | -tss | </path/to/bedfile/of/promoter/locations/ | (promoter locations are provided for hg19 and mm10 in the annotations/ directory of this repo, it is recommended to optimize your template density function by promoter or TSS associated regions | -chr | string | where the bidir module will only run on specified chromosome (default is "all") | -bct | numerical | this is the LLR threshold, the default and recommended is 1

If TSS is not provided, the user can manually enter the template parameters of interest.

| Flag | Type | Description | |------|------|-------------| | -lambda | numerical | this is the entry length parameter for the EMG density function (default = 200 bp) | -sigma | numerical | this is the variance parameter for the EMG density function (default = 10 bp) | -pi | numerical | this is the strand bias parameter for the EMG density function (default = 0.5) | -w | numerical | this is the pausing probability parameter for the EMG density function (default = 0.5)

In brief, the template mixture model is parameterized by -lambda (entry length or amount of skew), -sigma (variance in loading, error), -pi (strand bias, probability of forward strand data point) and -w (pausing probability, how much bidirectional signal to elongation/noise signal). Neighboring genomic coordinates where the LLR exceeds some user defined threshold (-bct flag) are joined and are returned as a bed file (chrom[tab]start[tab]stop[newline]). An example of a bed file is provided below:

Alt text

Perhaps the most important user input file is the "BedGraph" File corresponding to the forward and reverse strand. This file is simple (chrom[tab]start[tab]stop[tab]coverage[newline]). Importantly, start and stop should be integer valued. Although coverage can be a float, we recommend this to be an integer as well, i.e. normalization by millions mapped for example may lead to numerical degeneracies in the optimization routine. An example bedgraph file is listed below.

Alt text

Example of joint forward and reverse strand bedgraph file:

Alt text

After the bidir model finishes a bed file will appear in the user specified output directory called: [-N]_prelim_bidir_hits.bed. This file will contain genomic intervals of interest where divergent transcription is likely occurring. This file may be used for downstream analysis or taken at face value. Last, but not least, the bidir module can be invoked as below:

$ Tfit bidir \<list of parameters and flags\>

Example output from the bidir module,i.e. [-N]_prelim_bidir_hits.bed, is provided below.

Alt text

##Model Module and List of Parameters Unlike the "bidir" module which utilizes an average or template version of the mixture model to scan the entire genome quickly, the "model" module will attempt to find (by maximum likelihood estimation, MLE) the best set of parameters (sigma,lambda, pi, w) on a per region of interest basis. Such a routine is especially valuable if it is believed that pausing or strand bias is changing following experimental perturbation. In addition, running the model module on the prelim_bidir_hits.bed file will greatly decrease the number of false positives as the MLE estimates will more accurately reflect the underlying structure of the region of interest rather than a static template model.

In short, MLE estimates are computed by the EM algorithm which is a convergent optimization method found commonly in gaussian mixture mod

Tfit

Install / Use

README

Tfit