Pyensembl
Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
<a href="https://pypi.python.org/pypi/pyensembl/">
<img src="https://img.shields.io/pypi/v/pyensembl.svg?maxAge=1000" alt="PyPI" />
</a>
PyEnsembl
PyEnsembl is a Python interface to Ensembl reference genome metadata such as exons and transcripts. PyEnsembl downloads GTF and FASTA files from the Ensembl FTP server and loads them into a local database. PyEnsembl can also work with custom reference data specified using user-supplied GTF and FASTA files.
Example Usage
from pyensembl import EnsemblRelease
# release 77 uses human reference genome GRCh38
data = EnsemblRelease(77)
# will return ['HLA-A']
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
# get all exons associated with HLA-A
exon_ids = data.exon_ids_of_gene_name('HLA-A')
Installation
You can install PyEnsembl using pip:
pip install pyensembl
This should also install any required packages such as datacache.
Before using PyEnsembl, run the following command to download and install Ensembl data:
pyensembl install --release <list of Ensembl release numbers> --species <species-name>
For example, pyensembl install --release 75 76 --species human will download and install all
human reference data from Ensembl releases 75 and 76.
Alternatively, you can create the EnsemblRelease object from inside a Python
process and call ensembl_object.download() followed by ensembl_object.index().
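The in-process route can be sketched as follows. This is a minimal sketch that assumes only the download() and index() methods named above; prepare_genome and install_release are hypothetical helper names, not part of PyEnsembl.

```python
def prepare_genome(genome):
    """Download the reference files for `genome`, then build its local index.

    `genome` is expected to be an EnsemblRelease (or any object exposing
    download() and index()). This helper is hypothetical, not part of PyEnsembl.
    """
    genome.download()  # fetch GTF and FASTA files into the cache
    genome.index()     # parse the downloaded files into the local database
    return genome


def install_release(release, species="human"):
    """Create an EnsemblRelease and run download() and index() on it."""
    # Imported lazily so this module can be loaded without PyEnsembl installed.
    from pyensembl import EnsemblRelease

    return prepare_genome(EnsemblRelease(release=release, species=species))
```

Under these assumptions, calling install_release(75) should play roughly the same role as running pyensembl install --release 75 --species human.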
Cache Location
By default, PyEnsembl caches files in a pyensembl subdirectory of the
platform-specific cache folder.
You can override this default by setting the environment variable PYENSEMBL_CACHE_DIR
to your preferred cache location:
export PYENSEMBL_CACHE_DIR=/custom/cache/dir
or
import os
os.environ['PYENSEMBL_CACHE_DIR'] = '/custom/cache/dir'
# ... PyEnsembl API usage
Usage tips
List installed genomes
To see which genomes PyEnsembl has already downloaded and indexed, run:
pyensembl list
Or equivalently do this in Python:
from pyensembl.shell import collect_all_installed_ensembl_releases
collect_all_installed_ensembl_releases()
Load genome in Python
Here's an example Python snippet that loads fly genome data from Ensembl release 100:
from pyensembl import EnsemblRelease
data = EnsemblRelease(release=100, species='drosophila_melanogaster')
Data structures
Gene
# `data` is the EnsemblRelease object loaded above
gene = data.gene_by_id(gene_id='FBgn0011747')
Transcript
transcript = gene.transcripts[0]
Protein information
transcript.protein_id
transcript.protein_sequence
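As a hedged illustration of working with these attributes, the sketch below gathers them into a plain dictionary. The helper summarize_transcript is hypothetical. It also assumes a transcript_id attribute and that protein_id and protein_sequence may be None (for example, on non-coding transcripts); the stand-in object uses made-up values so the code runs without any Ensembl data installed.

```python
from types import SimpleNamespace


def summarize_transcript(transcript):
    """Collect a few fields from a transcript-like object into a plain dict.

    Assumes the object exposes transcript_id, protein_id and protein_sequence;
    the last two are treated as possibly missing or None.
    """
    sequence = getattr(transcript, "protein_sequence", None)
    return {
        "transcript_id": transcript.transcript_id,
        "protein_id": getattr(transcript, "protein_id", None),
        "protein_length": len(sequence) if sequence else 0,
    }


# Stand-in object with made-up values, not real Ensembl identifiers.
example = SimpleNamespace(
    transcript_id="TX_EXAMPLE", protein_id="P_EXAMPLE", protein_sequence="MKT"
)
print(summarize_transcript(example))
```

With real data, the same helper could be applied to a transcript taken from gene.transcripts.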
Non-Ensembl Data
PyEnsembl also supports arbitrary genomes: you can specify local file paths or remote URLs for both Ensembl and non-Ensembl GTF and FASTA files. (Warning: GTF formats can vary, and handling of non-Ensembl data is still very much in development.)
For example:
from pyensembl import Genome
data = Genome(
    reference_name='GRCh38',
    annotation_name='my_genome_features',
    # annotation_version=None,
    gtf_path_or_url='/My/local/gtf/path_to_my_genome_features.gtf',  # Path or URL of GTF file
    # transcript_fasta_paths_or_urls=None,  # List of paths or URLs of FASTA files containing transcript sequences
    # protein_fasta_paths_or_urls=None,  # List of paths or URLs of FASTA files containing protein sequences
    # cache_directory_path=None,  # Where to place downloaded and cached files for this genome
)
# parse GTF and construct database of genomic features
data.index()
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
API
The EnsemblRelease object has methods for looking up any of the annotation
features gene_name, gene_id, transcript_name, transcript_id, and exon_id
from one another, as well as the location of these genomic elements
(contig, start position, end position, strand).
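To illustrate how such lookups can be chained, here is a minimal sketch that starts from a locus and collects transcript and exon IDs for each gene found there. The helper describe_locus is hypothetical. It uses gene_names_at_locus and exon_ids_of_gene_name from the example usage earlier, and assumes transcript_ids_of_gene_name follows the same naming pattern; check this against your installed version.

```python
def describe_locus(genome, contig, position):
    """Map each gene name at a locus to its transcript and exon IDs.

    `genome` is expected to be an EnsemblRelease or Genome object.
    This helper is hypothetical, not part of PyEnsembl.
    """
    summary = {}
    for gene_name in genome.gene_names_at_locus(contig=contig, position=position):
        summary[gene_name] = {
            # assumed to exist alongside exon_ids_of_gene_name
            "transcript_ids": genome.transcript_ids_of_gene_name(gene_name),
            "exon_ids": genome.exon_ids_of_gene_name(gene_name),
        }
    return summary
```

With release 77 installed, describe_locus(EnsemblRelease(77), contig=6, position=29945884) would be expected to return a single entry keyed by 'HLA-A', matching the earlier example.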
