======
xenoGI
======

Code for reconstructing genome evolution in clades of microbes.

Requirements
------------

* NCBI blast+

  We need the ``blastp`` and ``makeblastdb`` executables (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/).

* MUSCLE V5 (https://www.drive5.com/muscle/). For creating protein or DNA alignments.

* FastTree (http://www.microbesonline.org/fasttree/). For making gene trees.

* GeneRax (https://github.com/BenoitMorel/GeneRax). For making (species tree aware) gene trees. This is optional but recommended.

* Python 3

* Python package dependencies

  * Biopython (http://biopython.org/). This is for parsing genbank files and can be installed using pip: ``pip3 install biopython``

  * Parasail (https://github.com/jeffdaily/parasail). This is an optimized alignment library, used in calculating scores between proteins. It can also be installed using pip: ``pip3 install parasail``

  * Numpy (http://www.numpy.org/). ``pip3 install numpy``

  * Scipy (https://www.scipy.org/). ``pip3 install scipy``

  (The pip you use needs to correspond to a version of Python 3. In some cases it may just be called ``pip`` instead of ``pip3``.)

* Additional dependencies

  If you make use of the ``makeSpeciesTree`` flag or ``xlMode.py``, you will also need the following:

  * ASTRAL (https://github.com/smirarab/ASTRAL/).

* Comments on platforms

  xenoGI is developed on Linux. The docker image (linked below) is the easiest way to run on Mac and Windows.
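Before installing xenoGI itself, it can help to confirm that the requirements listed above are visible from your Python 3 environment. The following is a minimal sketch, not part of xenoGI: it looks for ``blastp`` and ``makeblastdb`` on your ``PATH`` and checks that the four Python packages can be imported. The function names are made up for this illustration, and a missing executable on ``PATH`` is not necessarily a problem, since xenoGI is told where executables live through its parameter file.

```python
import importlib.util
import shutil

# Executable names taken from the requirements list.
REQUIRED_EXECUTABLES = ["blastp", "makeblastdb"]
# Import names of the Python packages (Biopython is imported as "Bio").
REQUIRED_MODULES = ["Bio", "parasail", "numpy", "scipy"]


def missing_executables(names):
    """Return the executables that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_modules(names):
    """Return the modules that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Missing executables:", missing_executables(REQUIRED_EXECUTABLES))
    print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

Run it with the same ``python3`` that your ``pip3`` installs into. An empty list on both lines means everything in the sketch was found.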

Installation
------------

Via pip::

  pip3 install xenoGI

(You will separately need to install blast+, MUSCLE, FastTree, and optionally GeneRax and ASTRAL.)

Via docker. For some instructions on using docker, go here:

https://hub.docker.com/r/ecbush/xenogi

Using docker, xenoGI gets run within a virtual machine. This is nice because you don't have to worry about all the dependencies above (they're provided in our image). This does come at some cost in terms of performance.

Citation
--------

If you use xenoGI in a publication, please cite one or more of the following:

Bush EC, Clark AE, DeRanek CA, Eng A, Forman J, Heath K, Lee AB, Stoebel DM, Wang Z, Wilber M, Wu H. xenoGI: reconstructing the history of genomic island insertions in clades of closely related bacteria. BMC Bioinformatics. 19(32). 2018.

Liu J, Mawhorter R, Liu I, Santichaivekin S, Bush E, Libeskind-Hadas R. Maximum Parsimony Reconciliation in the DTLOR Model. BMC Bioinformatics. 22(394). 2021.

Liu N, Gonzalez TA, Fischer J, Hong C, Johnson M, Mawhorter R, Mugnatto F, Soh R, Somji S, Wirth JS, Libeskind-Hadas R, Bush EC. xenoGI 3: using the DTLOR model to reconstruct the evolution of gene families in clades of microbes. BMC Bioinformatics. 24(1). 2023.

How to use
----------

An example/ directory is included in this repository.

The sections below give some instructions about how to run xenoGI on this example. You can use this to make sure you've installed it properly and so forth. The github repository also contains a TUTORIAL which you can run through after completing the README.

The basic method works on a set of species with known phylogenetic relationships. In the example, these species are: E. coli K12, E. coli ATCC 11775, E. fergusonii and S. bongori. In cases where you don't know the species tree, xenoGI has methods to help you reconstruct it.

Required files
~~~~~~~~~~~~~~

The working directory must contain:

* A parameter file. In the provided ``example/`` directory this is called ``params.py``.

* A newick format tree representing the relationships of the strains. In the example this is called ``example.tre``. Note that branch lengths are not used in xenoGI, and ``example.tre`` does not contain branch lengths. Also note that internal nodes should be given names in this tree. In ``example.tre`` we label them ``s0``, ``s1``, etc. The parameter ``speciesTreeFN`` in ``params.py`` has the path to this tree file. If a strain tree is not available, xenoGI has some accessory methods, described below, to help obtain one.

* A subdirectory of sequence files. In the example, this is called ``ncbi/``. Contained in this subdirectory will be genbank (gbff) files for the species. The parameter ``genbankFilePath`` in ``params.py`` has the path to these files.
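Since the tree file must name its internal nodes, a quick check before running can save a failed run. The sketch below is an illustration only and not xenoGI code: it counts closing parentheses in a newick string that are not followed by a label. The tree strings and the function name are hypothetical, and the check is deliberately simple, so it will not handle quoted labels that contain punctuation.

```python
import re


def count_unnamed_internal_nodes(newick):
    """Count internal nodes (closing parentheses) that lack a label."""
    # Remove all whitespace so line breaks in the file do not matter.
    text = "".join(newick.split())
    # An unnamed internal node is a ')' followed directly by a separator
    # (',', ')', ':' or ';') or by the end of the string.
    return len(re.findall(r"\)(?=[,):;]|$)", text))


if __name__ == "__main__":
    named = "((A,B)s1,C)s0;"   # hypothetical tree with named internal nodes
    unnamed = "((A,B),C);"     # same topology, internal nodes unnamed
    print(count_unnamed_internal_nodes(named))
    print(count_unnamed_internal_nodes(unnamed))
```

A result of zero suggests every internal node carries a name. To check a real file, read its contents into a string and pass that to the function.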

Naming of strains
~~~~~~~~~~~~~~~~~

The system needs a way to connect the sequence files to the names used in the tree.

In the example, the sequence files have names corresponding to their assembly accession number from ncbi. We connect these to the human readable names in ``example.tre`` using a mapping given in the file ``ncbiHumanMap.txt``. This file has two columns, the first giving the name of the genbank file, and the second giving the name for the strain used in the tree file. In ``params.py`` the parameter ``fileNameMapFN`` is set to point to this file.

Note that the strain names should not contain any dashes, spaces, commas or special characters.

Another approach is to change the names of the sequence files to match what's in the tree. If you do this, then you should set ``fileNameMapFN = None`` in ``params.py``. (This is not necessary in the example, which is already set to run the other way).
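The two-column mapping and the naming rule can be checked with a few lines of Python. This is a sketch under stated assumptions rather than xenoGI's own parser: it assumes the columns are separated by whitespace, and it reads "special characters" as anything other than letters, digits and underscores. The file name in the sample data is invented for the illustration.

```python
import re

# Assumption: letters, digits and underscores are the safe characters.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def parse_name_map(text):
    """Parse two whitespace-separated columns: genbank file name, strain name."""
    mapping = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, got {len(fields)}"
            )
        file_name, strain = fields
        mapping[file_name] = strain
    return mapping


def invalid_strain_names(mapping):
    """Return strain names containing characters outside the safe set."""
    return sorted(
        name for name in mapping.values() if not SAFE_NAME.fullmatch(name)
    )


if __name__ == "__main__":
    # Hypothetical mapping content; the real file is ncbiHumanMap.txt.
    sample = "GCF_000000001.1_genomic.gbff E_coli_K12\n"
    print(invalid_strain_names(parse_name_map(sample)))
```

A strain name with a space in it would show up here as a line with three columns, which is why the parser raises an error instead of guessing.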

Pointing xenoGI to various executables


Before running xenoGI you'll have to ensure that it knows where various executables are. Edit ``params.py`` using a text editor such as emacs, vim, nano, Visual Studio Code etc. You should edit the following to give the absolute (full) path to the directory where the ``blastp`` and ``makeblastdb`` executables reside::

  blastExecutDirPath = '/usr/bin/'

(Change '/usr/bin/' to correspond to the right location on your system).

Also make sure that the absolute paths to MUSCLE and FastTree are correct in ``params.py`` (the parameters ``musclePath`` and ``fastTreePath``). If you intend to use GeneRax to make species tree aware gene trees, then you also need to set ``geneRaxPath``. (The default parameter file is set to use GeneRax, so unless you change the ``useGeneRaxToMakeSpeciesTrees`` parameter, described below, you'll need to supply a ``geneRaxPath``).

If you will be using the makeSpeciesTree functionality, then you will also need to specify ``astralPath`` and ``javaPath``.
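A wrong path in ``params.py`` is an easy mistake to make, so it may be worth checking the paths before starting a long run. The sketch below is an assumption-laden illustration and not how xenoGI itself reads its parameters: it loads ``params.py`` with the standard library's ``runpy.run_path`` and tests that each configured path points at an existing file. The parameter names come from the text above; the function name is hypothetical. Parameters that are absent are simply skipped, since GeneRax and ASTRAL are optional.

```python
import os
import runpy

# Parameter names as given in the text. Apart from MUSCLE and FastTree,
# these are only needed for GeneRax or the makeSpeciesTree functionality.
FILE_PARAMS = ["musclePath", "fastTreePath", "geneRaxPath", "astralPath", "javaPath"]
BLAST_PROGRAMS = ["blastp", "makeblastdb"]


def find_path_problems(params):
    """Return a list of messages describing paths that do not exist."""
    problems = []
    blast_dir = params.get("blastExecutDirPath")
    if blast_dir is None:
        problems.append("blastExecutDirPath is not set")
    else:
        for program in BLAST_PROGRAMS:
            full_path = os.path.join(blast_dir, program)
            if not os.path.isfile(full_path):
                problems.append(f"{full_path} does not exist")
    for name in FILE_PARAMS:
        path = params.get(name)
        if path is not None and not os.path.isfile(path):
            problems.append(f"{name}: {path} does not exist")
    return problems


if __name__ == "__main__" and os.path.exists("params.py"):
    # runpy executes the file and returns its top-level names as a dict.
    for problem in find_path_problems(runpy.run_path("params.py")):
        print(problem)
```

Run it from within the working directory. No output means every path that is set in the parameter file points at a file.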

Running the code
~~~~~~~~~~~~~~~~

If you install via pip, then you should have an executable script in your path called xenoGI.

You run the code from within the working directory. To run the example, you would cd into the ``example/`` directory. You will need to ensure that the ``params.py`` parameters file contains the correct path to the directory with the blastp and makeblastdb executables in it, as well as the MUSCLE and FastTree executables. Then, the various steps of xenoGI can be run all at once like this::

  xenoGI params.py runAll

They can also be run individually::

  xenoGI params.py parseGenbank
  xenoGI params.py runBlast
  xenoGI params.py calcScores
  xenoGI params.py makeFamilies
  xenoGI params.py makeIslands
  xenoGI params.py refine
  xenoGI params.py printAnalysis
  xenoGI params.py createIslandBed

If for some reason you don't want to install via pip, then you can download the repository and run the code like this::

  python3 path-to-xenoGI-github-repository/xenoGI-runner.py params.py runAll

(In this case you will have to make sure all the python package dependencies are satisfied.)
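If you prefer to drive the individual steps from a script, for example to resume after a failure, something like the following would work. It is a sketch only: it assumes the ``xenoGI`` script is on your path (as after a pip install) and that it exits with a non-zero status when a step fails, which is not stated above. The helper names and the ``start_at`` argument are inventions for this illustration.

```python
import subprocess

# The individual steps, in the order listed above.
STEPS = [
    "parseGenbank", "runBlast", "calcScores", "makeFamilies",
    "makeIslands", "refine", "printAnalysis", "createIslandBed",
]


def build_commands(params_file="params.py", steps=STEPS, start_at=None):
    """Return one argument list per step, optionally resuming at start_at."""
    selected = list(steps)
    if start_at is not None:
        selected = selected[selected.index(start_at):]
    return [["xenoGI", params_file, step] for step in selected]


def run_steps(commands):
    """Run each command in order; check=True raises on a non-zero exit."""
    for command in commands:
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

From within the working directory, ``run_steps(build_commands(start_at="makeFamilies"))`` would rerun everything from family formation onward.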

What the steps do
~~~~~~~~~~~~~~~~~

* ``parseGenbank`` runs through the genbank files and produces input files that are used by subsequent code. This step pulls out every CDS feature that has a ``/translation`` tag. The fields that are recorded (if present) are ``locus_tag``, ``protein_id``, ``product`` (that is, the gene description), and chromosomal coordinates, as well as the protein sequence. If the parameter ``dnaBasedGeneTrees`` is True, the DNA sequence for each gene is kept as well.
  
* ``runBlast`` does an all vs. all protein blast of the genes in these strains. The number of processes it will run in parallel is specified by the ``numProcesses`` parameter in the parameter file. Before running a particular comparison, runBlast checks to see if the output file for that comparison already exists (e.g. from a previous run). If so it skips the comparison.
  
* ``calcScores`` calculates similarity and synteny scores between genes in the strains. It is also (mostly) parallelized.
  
* ``makeFamilies`` calculates gene families using blast, FastTree, GeneRax (optionally), and a customized variant of the DTL reconciliation algorithm called DTLOR. This approach considers synteny in the family formation process.

* ``makeIslands`` groups families according to their origin, putting families with a common origin together as islands. It is partly parallelized.

* ``refine`` reconsiders certain families in light of the output of makeIslands. In particular, this step looks at cases where there are multiple most parsimonious reconciliations, and chooses the reconciliation that is most consistent with neighboring families. It then re-runs makeIslands.
  
* ``printAnalysis`` produces a number of analysis/output files intended for the end user.

* ``createIslandBed`` produces bed files for each genome.
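The skip-if-present behaviour described for ``runBlast`` is what makes it safe to rerun after an interruption. The idea can be sketched as follows. This is not xenoGI's implementation: the output file naming pattern is hypothetical, and "all vs. all" is read here as every ordered pair of strains including a strain against itself, which is an assumption.

```python
import itertools
import os


def pending_comparisons(strains, out_dir):
    """List the comparisons whose output file does not exist yet."""
    pending = []
    # product(..., repeat=2) yields every ordered pair, including self pairs.
    for query, subject in itertools.product(strains, repeat=2):
        # Hypothetical naming scheme, used only for this illustration.
        out_path = os.path.join(out_dir, f"{query}_vs_{subject}.out")
        if not os.path.exists(out_path):
            pending.append((query, subject, out_path))
    return pending
```

Only the comparisons returned by such a function would need to be handed out to the parallel worker processes; anything already on disk from a previous run is left alone.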

Locus families and locus islands
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A brief illustration will allow us to define some terminology used in xenoGI's output. The basic goal of xenoGI is to group genes with a common origin and map them onto a phylogenetic tree.

Consider a clade of three species: (A,B),C. In this group, A and B are most closely related, and C is the outgroup. Gene a in species A has an ortholog b in species B. These two genes have high synteny, but have no ortholog in C. We call a and b a *locus family* because they are descended from a common ancestor, and occur in the same syntenic location.
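To make the terminology concrete, here is a toy sketch of the three-species example. It represents a locus family as a mapping from strain to gene and finds the most recent common ancestor of the strains that carry it, which is one simple way to decide where on the tree the family belongs. This is purely illustrative: the internal node names ``s0`` and ``s1``, the class and the functions are hypothetical, and xenoGI itself places families using the DTLOR reconciliation mentioned earlier, not this shortcut.

```python
from dataclasses import dataclass

# Toy species tree ((A,B)s1,C)s0 stored as child -> parent.
PARENT = {"A": "s1", "B": "s1", "s1": "s0", "C": "s0", "s0": None}


@dataclass
class LocusFamily:
    name: str
    genes: dict  # strain name -> gene name


def path_to_root(node):
    """Return the node followed by all of its ancestors up to the root."""
    path = [node]
    while PARENT[node] is not None:
        node = PARENT[node]
        path.append(node)
    return path


def mrca(strains):
    """Return the most recent common ancestor of the given strains."""
    paths = [path_to_root(strain) for strain in strains]
    common = set(paths[0]).intersection(*paths[1:])
    # The first shared node on the way to the root is the MRCA.
    for node in paths[0]:
        if node in common:
            return node


family = LocusFamily(name="lf1", genes={"A": "a", "B": "b"})
print(mrca(family.genes))  # prints s1, the ancestor of A and B
```

Because the family is present in A and B but absent from C, it maps to ``s1`` and not to the root ``s0``.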

When a genomic island inserts as a part of a horizontal transfer event, it typically brings in multiple locus families at the same time. xenoGI will attempt to group these into a *locus island*.