OmicsIntegrator

This repository is the working directory for the Garnet-Forest bundle of python scripts for analyzing diverse forms of 'omic' data in a network context.

OmicsIntegrator has moved. See OmicsIntegrator2. This codebase is not maintained.

Omics Integrator is a package designed to integrate proteomic data, gene expression data and/or epigenetic data using a protein-protein interaction network. It comprises two modules, Garnet and Forest.

Contact: Amanda Kedaigle [mandyjoy@mit.edu]

Reference:

Tuncbag N*, Gosline SJC*, Kedaigle A, Soltis AR, Gitter A, Fraenkel E. Network-Based Interpretation of Diverse High-Throughput Datasets through the Omics Integrator Software Package. PLoS Comput Biol 12(4): e1004879. doi:10.1371/journal.pcbi.1004879.

For a step-by-step protocol for running this software: Kedaigle A, Fraenkel E. Discovering altered regulation and signaling through network-based integration of transcriptomic, epigenomic and proteomic tumor data. Cancer Systems Biology: Methods in Molecular Biology, 2018.

System Requirements:

Python 2.6 or 2.7 (3.x version currently untested) and the dependencies below. We recommend that users without an existing Python environment install Anaconda (https://www.continuum.io/downloads) to obtain Python 2.7 and the following required packages:

numpy: http://www.numpy.org/
scipy: http://www.scipy.org/
matplotlib: http://matplotlib.org/
Networkx: http://networkx.github.io

msgsteiner package (version 1.3): code, license
Boost C++ library: http://www.boost.org
Cytoscape for viewing results graphically (tested on versions 2.8-3.2): http://www.cytoscape.org
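
Before installing, it can help to check which of the Python packages listed above are already importable. The following is a minimal sketch and not part of Omics Integrator: the helper name missing_packages is hypothetical, and the import names are assumed to match the package names above (with Networkx imported as networkx).

```python
import importlib


def missing_packages(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # The package is absent (or broken) in this Python environment.
            missing.append(name)
    return missing
```

Calling missing_packages(["numpy", "scipy", "matplotlib", "networkx"]) returns an empty list when all four packages are available, and otherwise the names that still need to be installed.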

Features

Maps gene expression data to transcription factors using chromatin accessibility data
Identifies proteins in the same pathway as hits using protein interaction network
Integrates numerous high throughput data types to determine testable biological hypotheses

Installation:

Omics Integrator is a collection of Python scripts and data files, so it can be installed easily on most systems. Steps 1 through 4 are required only for Forest; if you will only be running Garnet, you may skip to step 5.

1. Boost is pre-installed on many Linux distributions. If your operating system does not include Boost, follow the Boost getting started guide for instructions on how to download the library and extract files from the archive. To use the Homebrew package manager for Mac simply type brew install boost to install the library.
2. Download msgsteiner-1.3.tgz from http://staff.polito.it/alfredo.braunstein/code/msgsteiner-1.3.tgz (license)
3. Unpack files from the archive: tar -xvf msgsteiner-1.3.tgz
4. Enter the msgsteiner-1.3 subdirectory and run make

   See this advice on compiling the C++ code if you encounter problems and this advice regarding compilation issues on OS X.
   Make a note of the path to the compiled msgsteiner file that was created, which you will use when running Forest.
   In Linux, use readlink -f msgsteiner in the msgsteiner-1.3 subdirectory to obtain the path.

5. Download the Omics Integrator package: OmicsIntegrator-0.3.1.tar.gz
6. Unpack files from the archive: tar -xvzf OmicsIntegrator-0.3.1.tar.gz
7. Make sure you have all the requirements using the pip tool by entering the directory and typing: pip install -r requirements.txt

Some users have reported errors when using this command to install matplotlib. To fix, install matplotlib independently (http://matplotlib.org) or use Anaconda as indicated above.

Now Omics Integrator is installed on your computer and can be used to analyze your data.
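
The installation steps ask you to note the path to the compiled msgsteiner file and suggest readlink -f for this on Linux. On systems where that command is unavailable, the same absolute path can be obtained from Python. This is a minimal sketch; the helper name resolve_executable is hypothetical and not part of the package.

```python
import os


def resolve_executable(path):
    """Return the absolute, symlink-resolved path of an executable file."""
    full_path = os.path.realpath(path)
    if not os.path.isfile(full_path):
        raise ValueError("No such file: " + full_path)
    if not os.access(full_path, os.X_OK):
        raise ValueError("File is not executable: " + full_path)
    return full_path
```

Running resolve_executable("msgsteiner") from inside the msgsteiner-1.3 subdirectory would return the full path to pass to Forest, or raise an error if compilation did not produce an executable.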

Examples

We provide many scripts and files to showcase the various capabilities of Omics Integrator. To run them:

Download the example files
Unpack by typing tar -xvzf OmicsIntegratorExamples.tar.gz in the dist directory.
Move the unpacked files into the example directory.

For specific details about the examples, check out the README file in the example directory.

Running garnet.py

Garnet is a script that runs a series of smaller scripts to map epigenetic data to genes and then scan the genome to determine the likelihood of a transcription factor binding the genome near that gene.

Usage: garnet.py [configfilename]

Options:
  -h, --help            show this help message and exit
  --outdir=OUTDIR       Name of directory to place garnet output. DEFAULT:none
  --utilpath=ADDPATH    Destination of chipsequtil library, Default=../src
  -s SEED, --seed=SEED  An integer seed for the pseudo-random number
                        generators. If you want to reproduce exact results,
                        supply the same seed. Default = None.
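
When Garnet is launched from another script, the usage above can be turned into an argument list. The sketch below only assembles the command, using the --outdir and --seed options shown in the help text. The function name build_garnet_command, the interpreter name python, and the file name garnet.cfg are illustrative assumptions.

```python
def build_garnet_command(config_file, outdir=None, seed=None,
                         python="python", script="garnet.py"):
    """Assemble the argument list for a Garnet run."""
    command = [python, script]
    if outdir is not None:
        command.append("--outdir=" + outdir)
    if seed is not None:
        # Supplying the same seed is what makes results reproducible.
        command.append("--seed=" + str(seed))
    # The configuration file is positional, so it is added without a flag.
    command.append(config_file)
    return command


command = build_garnet_command("garnet.cfg", outdir="garnet_output", seed=2)
```

The resulting list can be passed to subprocess.call(command) from the directory that contains garnet.py.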

Unlike Forest, the Garnet configuration file is a positional argument and must not be preceded by --conf=. The configuration file should take the following format:

garnet input

[chromatinData]
#these files contain epigenetically interesting regions
bedfile = bedfilecontainingregions.bed
fastafile = fastafilemappedusinggalaxytools.fasta
#these two files are provided in the package
genefile = ../../data/ucsc_hg19_knownGenes.txt
xreffile = ../../data/ucsc_hg19_kgXref.txt
#distance to look from transcription start site
windowsize = 2000

[motifData]
#motif matrices to be used, data provided with the package
tamo_file = ../../data/matrix_files/vertebrates_clustered_motifs.tamo
#settings for scanning
genome = hg19
numthreads = 4
doNetwork = False
tfDelimiter = .

[expressionData]
expressionFile = tabDelimitedExpressionData.txt
pvalThresh = 0.01
qvalThresh =

[regression]
#for generating and saving regression plots
savePlot=False
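
A file in this format can be checked before a run by reading it with Python's standard INI parser. This is a sketch under assumptions: it uses the Python 3 configparser module (named ConfigParser in Python 2), the helper parse_garnet_config is hypothetical, and nothing here describes how garnet.py itself reads the file.

```python
import configparser

# A shortened copy of the example configuration above.
EXAMPLE_CONFIG = """
[chromatinData]
bedfile = bedfilecontainingregions.bed
windowsize = 2000

[motifData]
genome = hg19
numthreads = 4
doNetwork = False

[expressionData]
pvalThresh = 0.01
qvalThresh =
"""


def parse_garnet_config(text):
    """Parse Garnet-style configuration text into a dict of sections."""
    parser = configparser.ConfigParser()
    # Keep option names such as pvalThresh case-sensitive.
    parser.optionxform = str
    parser.read_string(text)
    return {section: dict(parser.items(section))
            for section in parser.sections()}


settings = parse_garnet_config(EXAMPLE_CONFIG)
window = int(settings["chromatinData"]["windowsize"])
# An empty entry such as "qvalThresh =" is read as an empty string.
qval = settings["expressionData"]["qvalThresh"] or None
```

All values are returned as strings, so numeric settings such as windowsize need an explicit conversion.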

Chromatin Data

Many BED-formatted (bedfile) and FASTA-formatted (fastafile) files are included in the examples/ directory. The bedfile can also be output from MACS (with a .xls extension) or GPS/GEM (with a .txt extension). To use your own epigenetic data, convert it to BED format, upload the BED file to http://usegalaxy.org, select Fetch Alignments/Sequences from the left menu, and click Extract Genomic DNA. This produces a FASTA-formatted file that works with Garnet. We have provided gene (genefile) and xref (xreffile) annotations for both hg19 and mm9; these files can be downloaded from http://genome.ucsc.edu/cgi-bin/hgTables if needed. The windowsize parameter sets the maximum distance from a transcription start site at which an epigenetic event is still considered associated with that gene. The 2 kb value shown in the example configuration is a very conservative setting.
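
To illustrate what windowsize controls, the sketch below tests whether a BED region falls within a given distance of a transcription start site (TSS). Measuring from the region midpoint is an assumption made for this example, not a description of Garnet's actual rule, and both function names are hypothetical.

```python
def parse_bed_line(line):
    """Split one BED line into (chromosome, start, end); extra columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2])


def within_window(region_start, region_end, tss, windowsize=2000):
    """Return True if the region midpoint lies within windowsize bases of the TSS."""
    midpoint = (region_start + region_end) // 2
    return abs(midpoint - tss) <= windowsize


chrom, start, end = parse_bed_line("chr1\t10500\t10700\tpeak1")
near = within_window(start, end, tss=12000)  # midpoint 10600, distance 1400
far = within_window(start, end, tss=15000)   # midpoint 10600, distance 4400
```

With the default window of 2000, the first TSS is close enough to be associated with the region and the second is not; a larger windowsize would accept both.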

motifData

We provide motif data in the proper TAMO format, so the user only needs to enter the genome used. The default numthreads is 4, but the user can change this depending on the processing power of their machine. Setting doNetwork to True creates a NetworkX object mapping transcription factors to genes, which is required input for the SAMNet algorithm. tfDelimiter is an internal parameter that tells Garnet how to handle cases in which several transcription factors map to the same binding motif.

expressionData

If the user has expression data to evaluate, provide a tab-delimited file under expressionFile. The file should have two columns: the first containing the gene name and the second containing the log fold change of that gene in a particular condition. We recommend including only those genes whose change in expression is statistically significant. The p-value (pvalThresh) or q-value (qvalThresh) threshold is used to select only those transcription factors whose correlation with expression falls below the provided threshold.
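
The two-column layout can be validated with a few lines of Python before running Garnet. This is a minimal sketch: the function name read_expression and the gene names are hypothetical, and skipping blank lines and lines starting with # is an assumption of this example rather than documented Garnet behavior.

```python
import io


def read_expression(handle):
    """Read gene / log-fold-change pairs from a two-column tab-delimited file."""
    changes = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are ignored
        gene, value = line.split("\t")[:2]
        # float() raises ValueError if the second column is not numeric.
        changes[gene] = float(value)
    return changes


example = io.StringIO("GENE_A\t1.8\nGENE_B\t-2.3\n")
changes = read_expression(example)
```

For a real file, pass an open file object instead of the in-memory example, for instance open("tabDelimitedExpressionData.txt").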

regression

Linear regression plots are placed in a subdirectory named regression_plots if savePlot=True in the configuration file.

Garnet output

Garnet produces a number of intermediate files that enable you to better interpret your data or re-run a sub-script that may have failed. All files are placed in the directory provided by the --outdir option of the garnet script.

1. events_to_genes.fsa: This file contains the regions of the fastafile provided in the configuration file that are within the specified distance of a transcription start site.
2. events_to_genes.xls: This file contains each region, the epigenetic activity in that region, and the relationship of that region to the closest gene.
3. events_to_genes_with_motifs.txt: This contains the raw transcription factor scoring data for each region in the FASTA file.
4. events_to_genes_with_motifs.tgm: This contains the transcription factor binding matrix scoring data mapped to the closest gene.
5. events_to_genes_with_motifs_tfids.txt: Names of transcription factors (the columns) of the matrix.
6. events_to_genes_with_motifs_geneids.txt: Names of genes (the rows) of the matrix.
7. events_to_genes_with_motifs.pkl: A pickled Python file containing a dictionary that holds files 4-6 (under the keys tgm, tfs, and genes, respectively) as well as a delim key that describes what delimiter was
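
Assuming the pickle holds a dictionary with the keys named above, it can be inspected from Python as sketched here. The helper name load_garnet_pickle is hypothetical. A pickle written by Python 2 may need pickle.load(handle, encoding="latin1") when read under Python 3.

```python
import pickle


def load_garnet_pickle(path):
    """Load the pickled dictionary and return its tgm, tfs and genes entries."""
    with open(path, "rb") as handle:
        data = pickle.load(handle)
    return data["tgm"], data["tfs"], data["genes"]
```

For example, tgm, tfs, genes = load_garnet_pickle("events_to_genes_with_motifs.pkl") would give the scoring matrix together with its column (transcription factor) and row (gene) names. Only load pickle files that you produced yourself, since unpickling can execute arbitrary code.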
