SkillAgentSearch skills...

Genome

Python library and scripts for retrieval and storage of genomics data in HDF5 format

Install / Use

/learn @gmcvicker/Genome
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Introduction

This repo contains a Python library, genome, for retrieval and storage of genomics data in HDF5 format. It is a lightweight wrapper over PyTables.

This repo contains the genome library and a set of utility scripts for reading and writing HDF5 files.

The purpose of this software is to:

  • Provide a simple abstraction over PyTables for retrieval of genomics data
  • Make it simple to convert other genomics data formats to HDF5

This library was inspired by and is loosely based on the Genomedata software by Michael Hoffman.

This document summarizes how to setup this software system, how to retrieve data from HDF5 files, and how to convert other file formats into HDF5. If you have any questions or run into any difficulty, please don't hesitate to contact me!

HDF5 and PyTables

HDF5 is a binary format designed for rapid access to huge amounts of data. It is widely-adopted and platform independent. Software libraries to access HDF5 file are available in C, C++, Java, Python, and Fortran90 (and possibly others). PyTables is a very nice high-level Python interface for reading and writing HDF5 files. I have written a python library and a number of scripts that use PyTables to create and read HDF5 files.

Advantages to using HDF5

  • Efficient storage of very large datasets
  • Very fast data retrieval
  • Fast random access to compressed files
  • Nice, simple Python interface (PyTables) that integrates with NumPy
  • APIs in several languages: C, C++, Java, Fortran
  • Binary files work on big- and little-endian machines

Disadvantages:

  • Files are not human readable
  • PyTables does not support write-concurrency (only one process at a time can safely write to an HDF5 file)
  • Unlike MySQL and PostgreSQL, The free version of PyTables does not support indexing (the commercial version does)
  • Using PyTables introduces several software dependencies: PyTables, NumPy, HDF5, zlib

Setup

This repo has three components: a small python library genome, a small C library libgenome and set of python scripts for data import and manipulation.

Dependencies

genome depends on PyTables, which depends on HDF5(http://www.hdfgroup.org/HDF5/), NumPy and several other packages. See the detailed PyTables installation guide. This guide for installing PyTables under Mac OSX.

genome additionally depends on:

  • Cython
  • ArgParse (for Python2.6 or older)
  • pysam (for reading BAM files only)
  • Scons (for compiling, would like to replace with standard build scripts)

Getting the source code

The first step is to obtain the source code from git. Let's assume you want to clone this repository into a directory named ~/src:

	mkdir ~/src
	cd ~/src
	git clone https://github.com/gmcvicker/genome

If at a later point you want to retrieve the latest code updates you can do the following:

	cd ~/src/genome
	git fetch
	git merge origin/master

Setting environment variables

To use the code you need to update set several shell environment variables. I recommend setting these variables in your ~/.bashrc or ~/.profile files using something like the following:

# update your python path by adding $HOME/src/genome/python/lib to the end
# this tells python where to find the genome library 
export PYTHONPATH=$PYTHONPATH:$HOME/src/genome/python/lib

# update LD_LIBRARY_PATH by adding $HOME/src/genome/c/lib/ 
# this is so the libgenome C library can be dynamically loaded by scripts that need it
export LD_LIBRARY_PATH=$HOME/src/genome/c/lib/:$LD_LIBRARY_PATH

# set some flags that are needed for compiling the C library
export CFLAGS=-I$HOME/src/genome/c/lib/
export LDFLAGS=-L$HOME/src/genome/c/lib/

# specify the location of 'database' where the HDF5 files that you want to use are
export GENOME_DB=/data/share/genome_db/

# scripts will use this genome assembly by default 
export GENOME_ASSEMBLY=hg19

Make sure that you remember to set these variables after adding them to your .bashrc for the first time. You can login again, or do source ~/.bashrc

Remember that when you submit jobs to a compute cluster (e.g. using SGE's qsub), they run in "batch mode" and may not execute your ~/.bashrc. To ensure that your jobs have the correct environment variables set, you should be able to pass a flag to your cluster submission command (e.g. the -V flag to qsub).

Compiling the C library and building python extensions

I use a build system called SCons to compile my code. To compile the libgenome C library type the following:

cd ~/src/genome/c
scons

The genome library uses Cython to create C extensions for python. To build these extensions do the following:

# build C extensions that are part of genome library
cd ~/src/genome/python/lib/genome
python setup.py build_ext --inplace

# build C extensions that are needed by import scripts
cd ~/src/genome/python/script/db/
python setup.py build_ext --inplace

This may give some warnings, which are OK. If you get errors that indicate the build failed, this might be because the LD_LIBRARY_PATH, CFLAGS, LDFLAGS environment variables are set incorrectly, or because the genome C library is not properly compiled.

Importing Data

A number of scripts can be used to load data into the HDF5 files. Most of these scripts will take a name for a new track. The data will then be written to file with the name of the track (and a .h5 postfix) under the directory pointed to by the GENOME_DB environment variable.

Subdirectories are automatically created for tracks. For example, if you create a new track called dnase/new_dnase_track this will create a new HDF5 file named $GENOME_DB/$GENOME_ASSEMBLY/dnase/new_dnase_track.h5. These scripts will report an error if you try to write over an existing track with the same name. If such a file exists you will need to remove it manually.

The scripts to convert files into HDF5 are located in genome/python/script/db. The create_track.py script is the most flexible, and it can take several different file formats as input. The following summarizes the scripts can be used to import different types of data.

Note: Many of these scripts require the argparse module. argparse is part of python2.7, but if you are using python2.6 or earlier then you will have to install argparse on your system.

Note: The scripts that read BAM files depend on pysam.

Creating a new database

A new database needs only one track, the chromosome track (which is actually more a table than a track). This can be created using the load_chr.py script described below. It may also be desirable to load a genome sequence track. An example of loading a genome sequence from fasta files is given in the create_track.py section below.

load_chr.py

Creates a table of chromosomes in the database. This table contains chromosome names, lengths and some other information. This is the only table that is required for the libary to work, and it only needs to be created once. If you want to create a new database (e.g. for another species or on a different machine) you will have to use this script to create a new table, or copy an existing chromosome.h5 file. The load_chr.py script takes a chromInfo.txt.gz file as input. These can be downloaded from the UCSC genome browser.

create_track.py

Takes fasta, wiggle, bedgraph, xb or tab-delimited text files as input. Data are stored in a track containing a 1D array for each chromosome. The imported data can currently be stored as any of the following data types: int8, uint8, int16, float32. Currently create_track.py expects fasta, wiggle, bedgraph and text files to be split into separate chromosomes with filenames that contain the chromosome name (e.g. chr1.wig.gz, chr2.wig.gz, etc.). The provided C program genome/c/program/split_wig_chrs.c can be used to split wiggle files up in this way.

Here is an example of how to load sequence data into the database (in this case for Drosophila melanogaster):

python create_track.py --assembly dm3 --format fasta --dtype uint8 seq  ~/data/Dmel/ucsc/dm3/seq/chr*.fa.gz

load_bed.py

Reads features from a BED file and stores them in a HDF5 file. Data imported this way are stored in a table with several columns (not as a 1D array).

load_bam_read_depth.py

Reads BAM or SAM files and stores read depths in a track as a 1D array of unsigned 16 bit integers (uint16s) for each chromosome. Discards strand information but could easily be modified to store forward and reverse strands separately. Requires the pysam python library. BAM file must first be sorted and indexed using samtools.

load_bam_5prime_ends.py

Reads BAM or SAM files and stores counts of 5' ends of reads as a 1D array of unsigned 16 bit integers (uint16s) for each chromosome. Forward and reverse strands are stored as separate tracks. Requires the pysam python library. BAM file must first be sorted and indexed using samtools. Example of use (note any number of bam files can be specified):

python load_bam_5prime_ends.py --assembly hg19 \
  	/my_tracks/fwd_read_counts \
    /my_tracks/rev_read_counts \
    wgEncodeDataRep1.bam   wgEncodeDataRep2.bam

load_bam_left_ends.py

Reads BAM or SAM files an

Related Skills

View on GitHub
GitHub Stars27
CategoryDevelopment
Updated6mo ago
Forks12

Languages

Python

Security Score

67/100

Audited on Sep 29, 2025

No findings