Goatools
Python library to handle Gene Ontology (GO) terms
GOATOOLS: A Python library for Gene Ontology analyses
| | |
| ------- | --------------------------------------------------------------------- |
| Authors | Haibao Tang (tanghaibao) |
| | DV Klopfenstein (dvklopfenstein) |
| | Brent Pedersen (brentp) |
| | Fidel Ramirez (fidelram) |
| | Aurelien Naldi (aurelien-naldi) |
| | Patrick Flick (patflick) |
| | Jeff Yunes (yunesj) |
| | Kenta Sato (bicycle1885) |
| | Chris Mungall (cmungall) |
| | Greg Stupp (stuppie) |
| | David DeTomaso (deto) |
| | Olga Botvinnik (olgabot) |
| Email | tanghaibao@gmail.com |
| License | BSD |
How to cite
[!TIP] GOATOOLS is now published in Scientific Reports!
Klopfenstein DV, ... Tang H (2018) GOATOOLS: A Python library for Gene Ontology analyses Scientific reports
- GO Grouping: Visualize the major findings in a gene ontology enrichment analysis (GOEA) more easily with grouping. A detailed description of GOATOOLS GO grouping is found in the manuscript.
- Compare GO lists: Compare two or more lists of GO IDs using goatools compare_gos, which can be used with or without grouping.
- Stochastic GOEA simulations: One of the findings resulting from our simulations is: Larger study sizes result in higher GOEA sensitivity, meaning fewer truly significant observations go unreported. The code for the stochastic GOEA simulations described in the paper is found here

Contents
This package contains a Python library to:

- Process over- and under-representation of certain GO terms, based on Fisher's exact test, with numerous multiple-testing correction routines, including locally implemented routines for Bonferroni, Sidak, Holm, and false discovery rate. Also included are multiple test corrections from statsmodels: FDR Benjamini/Hochberg, FDR Benjamini/Yekutieli, Holm-Sidak, Simes-Hochberg, Hommel, FDR 2-stage Benjamini-Hochberg, FDR 2-stage Benjamini-Krieger-Yekutieli, FDR adaptive Gavrilov-Benjamini-Sarkar, Bonferroni, Sidak, and Holm.
- Process the obo-formatted file from the Gene Ontology website. The data structure is a directed acyclic graph (DAG) that allows easy traversal from leaf to root.
- Read GO Association files:
  - GAF (GO Annotation File)
  - GPAD (Gene Product Association Data)
  - NCBI's gene2go file
  - id2gos format. See example
- Print descendants count and/or information content for a list of GO terms
- Get parents or ancestors for a GO term with or without optional relationships, including printing details about a GO ID's parents
- Get children or descendants for a GO term with or without optional relationships
- Compare two or more lists of GO IDs
- Group GO terms for easier viewing
- Map GO terms (or protein products with multiple associations to GO terms) to GOslim terms (analogous to the map2slim.pl script supplied by geneontology.org)
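Several of the items above come down to walking the GO DAG upward (parents, ancestors) or downward (children, descendants). The following is a minimal sketch of that traversal on a toy graph. It does not use the GOATOOLS API: the term IDs are made up, and the names `PARENTS`, `get_ancestors`, and `get_descendants` are hypothetical, chosen only for illustration.

```python
# Toy ontology: each term maps to the set of its direct parents.
# The IDs are invented; real terms would come from go-basic.obo.
PARENTS = {
    "GO:root": set(),
    "GO:A": {"GO:root"},
    "GO:B": {"GO:root"},
    "GO:C": {"GO:A", "GO:B"},
    "GO:D": {"GO:C"},
}


def get_ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def get_descendants(term, parents):
    """Return every term that has `term` among its ancestors."""
    return {t for t in parents if term in get_ancestors(t, parents)}


if __name__ == "__main__":
    print(sorted(get_ancestors("GO:D", PARENTS)))
    print(sorted(get_descendants("GO:A", PARENTS)))
```

Because a term can have more than one parent, the traversal tracks visited terms so that shared ancestors are collected only once.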
Installation
Make sure your Python version is >= 3.7, and download an .obo file of the most current GO:
wget http://current.geneontology.org/ontology/go-basic.obo
or the .obo file for the most current GO Slim terms (e.g. generic GOslim):
wget http://current.geneontology.org/ontology/subsets/goslim_generic.obo
PyPI
pip install goatools
To install the development version:
pip install git+https://github.com/tanghaibao/goatools.git
Bioconda
conda install -c bioconda goatools
Dependencies
When installing via PyPI or Bioconda as described above, all dependencies are automatically downloaded. Alternatively, you can manually install:
- For statistical testing of GO enrichment:
  - scipy.stats.fisher_exact
  - statsmodels (optional) for access to a variety of statistical tests for GOEA
- To plot the ontology lineage, install one of these two options:
  - Graphviz, for graph visualization.
  - pygraphviz, Python binding for communicating with Graphviz
  - pydot, a Python interface to Graphviz's Dot language.
Cookbook
run.sh contains example cases using the installed goatools CLI.
Find GO enrichment of genes under study
See examples in find_enrichment
The goatools find_enrichment command takes as arguments files containing:
- gene names in a study
- gene names in population (or other study if --compare is specified)
- an association file that maps a gene name to a GO category.
Please look at the tests/data folder to see examples of how to make these
files. When ready, the command looks like:
goatools find_enrichment --pval=0.05 --indent data/study \
data/population data/association
The command can filter on the significance of (e)nrichment or (p)urification. It can report various multiple-testing-corrected p-values as well as the false discovery rate.
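To make "multiple testing corrected p-values" concrete, here is a minimal pure-Python sketch of four textbook adjustments: Bonferroni, Sidak, Holm, and Benjamini-Hochberg FDR. These are the standard formulas, not the GOATOOLS or statsmodels implementations, and the function names are hypothetical.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]


def sidak(pvals):
    """Sidak adjustment: 1 - (1 - p)^n."""
    n = len(pvals)
    return [1.0 - (1.0 - p) ** n for p in pvals]


def holm(pvals):
    """Holm step-down adjustment, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = min(1.0, (n - rank) * pvals[idx])
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


def fdr_bh(pvals):
    """Benjamini-Hochberg step-up adjustment, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = n - pos  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvals[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    example = [0.01, 0.04, 0.03, 0.005]
    for func in (bonferroni, sidak, holm, fdr_bh):
        print(func.__name__, func(example))
```

Bonferroni, Sidak, and Holm control the family-wise error rate, while Benjamini-Hochberg controls the false discovery rate and is usually less conservative when many GO terms are tested.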
The "e" in the "Enrichment" column means "enriched" - the concentration
of the GO term in the study group is significantly higher than that in
the population. The "p" stands for "purified" - significantly lower
concentration of the GO term in the study group than in the population.
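The sketch below illustrates how an "e" or "p" label and an uncorrected p-value could be derived for a single GO term. It assumes the study genes are a subset of the population and computes a two-sided Fisher's exact test from the hypergeometric distribution using only the standard library. The function names are hypothetical, and the label rule (comparing the study ratio with the population ratio) is an assumption about how the column is filled in, not code taken from GOATOOLS.

```python
from math import comb


def fisher_exact_two_sided(study_count, study_n, pop_count, pop_n):
    """Two-sided Fisher's exact p-value for one GO term.

    study_count of study_n study genes carry the term, and
    pop_count of pop_n population genes carry it.
    """
    def prob(k):
        # Hypergeometric probability of seeing k annotated genes in the study.
        return (comb(pop_count, k) * comb(pop_n - pop_count, study_n - k)
                / comb(pop_n, study_n))

    lowest = max(0, study_n - (pop_n - pop_count))
    highest = min(study_n, pop_count)
    observed = prob(study_count)
    # Sum all outcomes that are no more likely than the observed one;
    # the small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(k) for k in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-7))


def enrichment_label(study_count, study_n, pop_count, pop_n):
    """Return 'e' if the term is more frequent in the study, else 'p'."""
    return "e" if study_count / study_n > pop_count / pop_n else "p"


if __name__ == "__main__":
    args = (3, 5, 4, 20)  # 3 of 5 study genes vs 4 of 20 population genes
    print(enrichment_label(*args), fisher_exact_two_sided(*args))
```

The label only gives the direction of the difference. Whether a term is reported as significant depends on the p-value after correction.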
Important note: by default, goatools find_enrichment propagates counts
to all the parents of a GO term. As a result, users may find terms in
the output that are not present in their association file. Use
--no_propagate_counts to disable this behavior.
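The effect of count propagation can be illustrated with a toy example: every gene annotated to a term is also counted for all of that term's ancestors, which is why terms absent from the association file can appear in the output. This is a sketch on an invented graph, not the GOATOOLS implementation. The IDs, gene names, and function names are hypothetical.

```python
# Toy ontology: each term maps to the set of its direct parents.
PARENTS = {
    "GO:root": set(),
    "GO:A": {"GO:root"},
    "GO:B": {"GO:root"},
    "GO:C": {"GO:A"},
}


def get_ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate_counts(assoc, parents):
    """Add all ancestors of each annotated term to each gene's term set."""
    propagated = {}
    for gene, terms in assoc.items():
        expanded = set(terms)
        for term in terms:
            expanded |= get_ancestors(term, parents)
        propagated[gene] = expanded
    return propagated


if __name__ == "__main__":
    assoc = {"gene1": {"GO:C"}, "gene2": {"GO:B"}}
    print(propagate_counts(assoc, PARENTS))
```

In this example gene1 is annotated only to GO:C, but after propagation it also counts toward GO:A and GO:root.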
Write GO hierarchy
goatools wr_hier: Given a GO ID, write the hierarchy below (default) or above (--up) the given GO.
Plot GO lineage
goatools go_plot:
- Plots user-specified GO term(s) up to root
- Multiple user-specified GOs
- User-defined colors
- Plot relationships (-r)
- Optionally plot children of user-specified GO terms

goatools plot_go_term can plot the lineage of a certain GO term, by:
goatools plot_go_term --term=GO:0008135
This command will plot the following image.

Some users prefer to style the graph themselves. Use the --gml option to
generate GML output, which can then be used in external graph editing
software like Cytoscape. The following image is produced by importing
the GML file into Cytoscape using yFiles orthogonal layout and solid
VizMapping. Note that the GML reader plugin may need to be downloaded
and installed in the plugins folder of Cytoscape:
goatools plot_go_term --term=GO:0008135 --gml

Map GO terms to GOslim terms
See goatools map_to_slim for usage. As arguments it takes the gene ontology files:
- the current gene ontology file go-basic.obo
- the GOslim file to be used (e.g. goslim_generic.obo or any other GOslim file)
The script either maps one GO term to its GOslim terms, or protein products with multiple associations to all their GOslim terms.
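As a rough illustration of the mapping, the sketch below finds the GOslim terms for a single GO term on a toy graph. It is a simplified reading of the map2slim idea, not the GOATOOLS code. All slim terms among the term's ancestors (or the term itself) are collected, and the "direct" ones are those not already implied by a more specific slim term. The IDs and the names `SLIM` and `map_to_slim` are hypothetical.

```python
# Toy ontology: each term maps to the set of its direct parents.
PARENTS = {
    "GO:root": set(),
    "GO:A": {"GO:root"},
    "GO:B": {"GO:A"},
    "GO:C": {"GO:B"},
    "GO:X": {"GO:root"},
    "GO:D": {"GO:C", "GO:X"},
}
# Toy slim subset (a real one would come from goslim_generic.obo).
SLIM = {"GO:root", "GO:A", "GO:X"}


def get_ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def map_to_slim(term, parents, slim):
    """Return (direct, all) slim terms for `term`."""
    all_slim = (get_ancestors(term, parents) | {term}) & slim
    covered = set()
    for slim_term in all_slim:
        # A slim term is covered if a more specific slim term leads to it.
        covered |= get_ancestors(slim_term, parents) & all_slim
    return all_slim - covered, all_slim


if __name__ == "__main__":
    direct, everything = map_to_slim("GO:D", PARENTS, SLIM)
    print(sorted(direct), sorted(everything))
```

For a protein product with several associated GO terms, the same function could be applied to each term and the resulting sets merged.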
To determine the GOslim terms for a single GO term, you can use the
