OmicsIntegrator
This repository is the working directory for the Garnet-Forest bundle of Python scripts for analyzing diverse forms of 'omic' data in a network context.
OmicsIntegrator has moved. See OmicsIntegrator2. This codebase is not maintained.
Omics Integrator is a package designed to integrate proteomic data, gene expression data and/or epigenetic data using a protein-protein interaction network. It comprises two modules, Garnet and Forest.
Contact: Amanda Kedaigle [mandyjoy@mit.edu]
Copyright (c) 2015 Massachusetts Institute of Technology All rights reserved.
Reference:
Network-Based Interpretation of Diverse High-Throughput Datasets through the Omics Integrator Software Package Tuncbag N<sup>*</sup>, Gosline SJC<sup>*</sup>, Kedaigle A, Soltis AR, Gitter A, Fraenkel E. PLoS Comput Biol 12(4): e1004879. doi:10.1371/journal.pcbi.1004879.
For a step-by-step protocol for running this software: Discovering altered regulation and signaling through network-based integration of transcriptomic, epigenomic and proteomic tumor data Kedaigle A, and Fraenkel E. Cancer Systems Biology: Methods in Molecular Biology, 2018.
System Requirements:
- Python 2.6 or 2.7 (3.x version currently untested) and the dependencies below. We recommend that users without an existing Python environment install Anaconda (https://www.continuum.io/downloads) to obtain Python 2.7 and the following required packages:
- numpy: http://www.numpy.org/
- scipy: http://www.scipy.org/
- matplotlib: http://matplotlib.org/
- NetworkX: http://networkx.github.io
- Boost C++ library: http://www.boost.org
- Cytoscape for viewing results graphically (tested on versions 2.8-3.2): http://www.cytoscape.org
Features
- Maps gene expression data to transcription factors using chromatin accessibility data
- Identifies proteins in the same pathway as hits using protein interaction network
- Integrates numerous high throughput data types to determine testable biological hypotheses
Installation:
Omics Integrator is a collection of Python scripts and data files, so it can be easily installed on any system. Steps 1 through 4 are only required for Forest, and you may skip to step 5 if you will only be running Garnet.
1. Boost is pre-installed on many Linux distributions. If your operating system does not include Boost, follow the Boost getting started guide for instructions on how to download the library and extract files from the archive. To use the Homebrew package manager for Mac, simply type brew install boost to install the library.
2. Download msgsteiner-1.3.tgz from http://staff.polito.it/alfredo.braunstein/code/msgsteiner-1.3.tgz (license)
3. Unpack files from the archive: tar -xvf msgsteiner-1.3.tgz
4. Enter the msgsteiner-1.3 subdirectory and run make
   - See this advice on compiling the C++ code if you encounter problems and this advice regarding compilation issues on OS X.
   - Make a note of the path to the compiled msgsteiner file that was created, which you will use when running Forest.
   - In Linux, use readlink -f msgsteiner in the msgsteiner-1.3 subdirectory to obtain the path.
5. Download the Omics Integrator package: OmicsIntegrator-0.3.1.tar.gz
6. Unpack files from the archive: tar -xvzf OmicsIntegrator-0.3.1.tar.gz
7. Make sure you have all the requirements using the pip tool by entering the directory and typing: pip install -r requirements.txt
   - Some users have reported errors when using this command to install matplotlib. To fix, install matplotlib independently (http://matplotlib.org) or use Anaconda as indicated above.
Now Omics Integrator is installed on your computer and can be used to analyze your data.
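As a quick sanity check after installation, the sketch below tries to import each Python dependency listed under System Requirements and reports any that fail. The helper name missing_packages is illustrative and not part of Omics Integrator; the check only covers the Python packages, not Boost, msgsteiner, or Cytoscape.

```python
# Import names of the Python dependencies listed under System Requirements.
REQUIRED = ["numpy", "scipy", "matplotlib", "networkx"]


def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_packages(REQUIRED)
    if absent:
        print("Missing packages: " + ", ".join(absent))
    else:
        print("All required Python packages are importable.")
```

If anything is reported missing, installing it with pip or through Anaconda, as described above, should resolve it.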
Examples
We provide many scripts and files to showcase the various capabilities of Omics Integrator. To run them:
- Download the example files
- Unpack by typing tar -xvzf OmicsIntegratorExamples.tar.gz in the dist directory.
- Move the unpacked files into the example directory.
For specific details about the examples, check out the README file in the example directory.
Running garnet.py
Garnet is a script that runs a series of smaller scripts to map epigenetic data to genes and then scan the genome to determine the likelihood of a transcription factor binding the genome near that gene.
Usage: garnet.py [configfilename]
-s SEED, --seed=SEED An integer seed for the pseudo-random number
generators. If you want to reproduce exact results,
supply the same seed. Default = None.
Options:
-h, --help show this help message and exit
--outdir=OUTDIR Name of directory to place garnet output. DEFAULT:none
--utilpath=ADDPATH Destination of chipsequtil library, Default=../src
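For scripted runs, the usage text above can be assembled into a command line programmatically. The sketch below is a minimal illustration: the helper build_garnet_command, the interpreter name python, and the example file names are assumptions, while the --outdir and --seed flags and the trailing configuration file come from the usage text.

```python
def build_garnet_command(config_path, outdir=None, seed=None, script="garnet.py"):
    """Build an argument list for running garnet.py.

    The configuration file is appended last as a bare positional argument.
    """
    cmd = ["python", script]
    if outdir is not None:
        cmd.append("--outdir=" + outdir)
    if seed is not None:
        cmd.append("--seed=" + str(seed))
    cmd.append(config_path)
    return cmd


# Hypothetical file and directory names, for illustration only.
command = build_garnet_command("my_garnet.cfg", outdir="garnet_out", seed=7)
print(" ".join(command))
```

The resulting list can be passed to subprocess.check_call(command) from the directory that contains garnet.py.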
Unlike Forest, the Garnet configuration file is a positional argument and must not
be preceded with --conf=. The configuration file should take the following format:
garnet input
[chromatinData]
#these files contain epigenetically interesting regions
bedfile = bedfilecontainingregions.bed
fastafile = fastafilemappedusinggalaxytools.fasta
#these two files are provided in the package
genefile = ../../data/ucsc_hg19_knownGenes.txt
xreffile = ../../data/ucsc_hg19_kgXref.txt
#distance to look from transcription start site
windowsize = 2000
[motifData]
#motif matrices to be used, data provided with the package
tamo_file = ../../data/matrix_files/vertebrates_clustered_motifs.tamo
#settings for scanning
genome = hg19
numthreads = 4
doNetwork = False
tfDelimiter = .
[expressionData]
expressionFile = tabDelimitedExpressionData.txt
pvalThresh = 0.01
qvalThresh =
[regression]
#for generating and saving regression plots
savePlot=False
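The file above follows the common INI layout, so it can be inspected with Python's standard configparser module before a run. The sketch below is written for Python 3 and is independent of the package; how garnet.py itself parses the file is not shown in this README, and the helper read_settings is hypothetical. It reads a shortened copy of the example and converts a few values to native types, treating the empty qvalThresh as unset.

```python
import configparser

# Shortened copy of the example configuration shown above.
EXAMPLE = """
[chromatinData]
bedfile = bedfilecontainingregions.bed
windowsize = 2000

[motifData]
genome = hg19
numthreads = 4
doNetwork = False
tfDelimiter = .

[expressionData]
expressionFile = tabDelimitedExpressionData.txt
pvalThresh = 0.01
qvalThresh =

[regression]
savePlot=False
"""


def read_settings(text):
    """Parse a Garnet-style configuration string into typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    qval = parser.get("expressionData", "qvalThresh")
    return {
        "windowsize": parser.getint("chromatinData", "windowsize"),
        "numthreads": parser.getint("motifData", "numthreads"),
        "doNetwork": parser.getboolean("motifData", "doNetwork"),
        "pvalThresh": parser.getfloat("expressionData", "pvalThresh"),
        # An empty value is treated here as "no threshold set".
        "qvalThresh": float(qval) if qval else None,
        "savePlot": parser.getboolean("regression", "savePlot"),
    }


settings = read_settings(EXAMPLE)
print(settings)
```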
Chromatin Data
Many BED-formatted (bedfile) and FASTA-formatted (fastafile) files are
included in the examples/ directory. bedfile can also be output from MACS
(with a .xls extension) or GPS/GEM (with a .txt extension).
To use your own epigenetic data, convert it to BED format, upload the
BED file to http://usegalaxy.org, select Fetch Alignments/Sequences from the left
menu, and click Extract Genomic DNA. This will produce a FASTA-formatted file
that will work with Garnet. We have provided gene (genefile) and xref
(xreffile) annotations for both hg19 and mm9; these files can be downloaded
from http://genome.ucsc.edu/cgi-bin/hgTables if needed. The windowsize
parameter sets the maximum distance from a transcription start site at which
an epigenetic event is still considered associated with that gene. The 2 kb
value used in the example configuration is a very conservative choice.
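To make the role of windowsize concrete, here is a minimal sketch of a distance test between a region and a transcription start site (TSS). It assumes the distance is measured from the region's midpoint, which is an illustrative choice; this README does not state how Garnet measures it, and the function within_window is hypothetical.

```python
def within_window(region_start, region_end, tss, windowsize=2000):
    """Return True if the region midpoint lies within `windowsize` bases of the TSS."""
    midpoint = (region_start + region_end) // 2
    return abs(midpoint - tss) <= windowsize


# A region centered 600 bases from the TSS is kept with a 2000-base window ...
print(within_window(10500, 10700, tss=10000))
# ... while one centered 5100 bases away is not.
print(within_window(15000, 15200, tss=10000))
```

Raising windowsize in the configuration file corresponds to passing a larger value here, so more distant regions are associated with each gene.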
motifData
We provide motif data in the proper TAMO format; the user just needs to enter
the genome used. The default numthreads is 4, but the user can alter this
depending on the processing power of their machine. doNetwork will create a
NetworkX object mapping transcription factors to genes, which is required input
for the SAMNet algorithm. tfDelimiter is an
internal parameter that tells Garnet how to handle cases in which many
transcription factors map to the same binding motif.
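One way to picture tfDelimiter and doNetwork together: if several transcription factors that share a motif are written as a single label joined by the delimiter, splitting the label yields one factor-to-gene edge per factor. The sketch below works under that assumption only. The label format, the gene and factor names, and the helper tf_gene_edges are all illustrative, and it returns a plain edge list instead of the NetworkX object Garnet creates.

```python
def tf_gene_edges(label_to_genes, delimiter="."):
    """Expand delimiter-joined factor labels into (factor, gene) pairs."""
    edges = []
    for label, genes in sorted(label_to_genes.items()):
        for factor in label.split(delimiter):
            for gene in genes:
                edges.append((factor, gene))
    return edges


# Hypothetical motif label shared by two factors, mapped to one gene.
edges = tf_gene_edges({"FOXA1.FOXA2": ["ALB"]})
print(edges)
```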
expressionData
If the user has expression data to evaluate, provide a tab-delimited file under
expressionFile. The file should have two columns, the first containing the name
of the gene and the second containing the log fold change of that gene in a
particular condition. We recommend only including those genes whose change in
expression is statistically significant. P-value (pvalThresh) or Q-value
(qvalThresh) thresholds will be used to select only those transcription factors
whose correlation with expression falls below the provided threshold.
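The two-column layout described above can be checked with a few lines of Python before handing the file to Garnet. This sketch is written for Python 3 and assumes there is no header row; the gene names and values are made up, and read_expression is a hypothetical helper, not part of the package.

```python
import csv
import io


def read_expression(handle):
    """Read gene name and log fold change pairs from a tab-delimited stream."""
    changes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        changes[row[0]] = float(row[1])
    return changes


# Made-up example content: gene name, then log fold change.
example = "GENE_A\t1.8\nGENE_B\t-2.3\n"
changes = read_expression(io.StringIO(example))
print(changes)
```

For a real file, pass an open file object in place of the io.StringIO wrapper.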
regression
Linear regression plots are placed in a subdirectory named regression_plots if
savePlot=True in the configuration file.
Garnet output
Garnet produces a number of intermediate files that enable you
to better interpret your data or re-run a sub-script that may have failed. All
files are placed in the directory provided by the --outdir option of the
garnet script.
1. events_to_genes.fsa: This file contains the regions of the fastafile provided in the configuration file that are within the specified distance to a transcription start site.
2. events_to_genes.xls: This file contains each region, the epigenetic activity in that region, and the relationship of that region to the closest gene.
3. events_to_genes_with_motifs.txt: This contains the raw transcription factor scoring data for each region in the fasta file.
4. events_to_genes_with_motifs.tgm: This contains the transcription factor binding matrix scoring data mapped to the closest gene.
5. events_to_genes_with_motifs_tfids.txt: Names of transcription factors (or columns) of the matrix.
6. events_to_genes_with_motifs_geneids.txt: Names of genes (or rows) of the matrix.
7. events_to_genes_with_motifs.pkl: A pickled Python file containing a dictionary that holds files 4-6 (under the keys tgm, tfs, and genes, respectively) as well as a delim key that describes what delimiter was
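The pickled dictionary can be opened with Python's standard pickle module. The sketch below writes a small stand-in dictionary with the four keys named above and reads it back. The values are made up, and the matrix is a plain nested list with genes as rows and factors as columns, following the row and column descriptions in the list; the real object may use a different matrix type. A file written under Python 2 may also need encoding="latin1" passed to pickle.load when read from Python 3.

```python
import os
import pickle
import tempfile

# Stand-in for Garnet's output; keys come from the description above, values are made up.
demo = {
    "tgm": [[0.0, 1.2], [0.7, 0.0]],  # rows are genes, columns are factors
    "tfs": ["TF_A", "TF_B"],
    "genes": ["GENE_A", "GENE_B"],
    "delim": ".",
}


def load_garnet_pickle(path):
    """Load the dictionary stored in a Garnet-style .pkl file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


path = os.path.join(tempfile.mkdtemp(), "events_to_genes_with_motifs.pkl")
with open(path, "wb") as handle:
    pickle.dump(demo, handle)

data = load_garnet_pickle(path)
row = data["genes"].index("GENE_B")
col = data["tfs"].index("TF_A")
score = data["tgm"][row][col]
print(score)
```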