IMPORTANT: Progressive Cactus has moved here:

https://github.com/ComparativeGenomicsToolkit/cactus

This version (a) is actively maintained and developed, and (b) supports cloud computing platforms by using Toil in place of JobTree.

Progressive Cactus Manual

v0.0 by Glenn Hickey (hickey@soe.ucsc.edu)

Progressive Cactus is a whole-genome alignment package.

Installation

Requirements

git
gcc 4.2 or newer
python 2.7
wget
ping
64-bit processor and build environment
150GB+ of memory on at least one machine when aligning mammal-sized genomes; less memory is needed for smaller genomes.
Parasol or SGE for cluster support.
750 MB of disk space

Instructions

Installing

IMPORTANT NOTE: Progressive Cactus does not presently support installation into paths that contain spaces. Until this is resolved, you can use a softlink as a workaround: ln -s "path with spaces" "installation path without spaces"

In the parent directory of where you want Progressive Cactus installed:

git clone https://github.com/glennhickey/progressiveCactus.git
cd progressiveCactus
git pull
git submodule update --init
make

It is also convenient to add the location of progressiveCactus/bin to your PATH environment variable. To run the included tools (e.g. hal2maf) in the submodules/ directory structure, first source progressiveCactus/environment to load the installed environment.

If any errors occur during the build process, you are unlikely to be able to use the tool. Please submit a GitHub issue so we can help: doing so helps not only you but also others who wish to use the tool.

Note that all dependencies are also built and included in the submodules/ directory. This increases the size and build time but greatly simplifies installation and version management. The installation does not create or modify any files outside the progressiveCactus/ directory.

Updating the distribution

To update a progressiveCactus installation, run the following:

cd progressiveCactus
git pull
git submodule update --init
make ucscClean
make

This will update the installation and all the submodules it contains.

Using the progressiveCactus environment

To avoid incompatibilities between Python versions and the other libraries it depends on, progressiveCactus creates a virtual environment that must be loaded to use any of the tools in the package except the aligner. Loading this environment temporarily modifies your session's PATH, PYTHONPATH, and other environment variables so that the tools are easier to use.

To load this environment, run source environment, or, for non-bash shells, . environment in the main progressiveCactus directory.

To disable the environment, run deactivate. It's necessary to disable the environment before rebuilding progressiveCactus.

Running the aligner

runProgressiveCactus.sh

The aligner is run using the bin/runProgressiveCactus.sh script in the installation directory. Details about the command line interface can be obtained as follows:

bin/runProgressiveCactus.sh --help

Usage: runProgressiveCactus.sh [options] <seqFile> <workDir> <outputHalFile>

Required arguments

<seqFile>

Text file containing the locations of the input sequences as well as their phylogenetic tree. The tree will be used to progressively decompose the alignment by iteratively aligning sibling genomes to estimate their parents in a bottom-up fashion. If the tree is not specified, then a star-tree will be assumed (a single root with all leaves connected to it) and all genomes will be aligned together at once. The file is formatted as follows:

NEWICK tree (optional)
name1 path1
name2 path2
...
nameN pathN

An optional * can be placed at the beginning of a name to specify that its assembly is of reference quality. This implies that it can be used as an outgroup for sub-alignments. If no genomes are marked in this way, all genomes are assumed to be of reference quality. The star should only be placed on the name-path lines and not inside the tree.

The tree, if specified, must be on a single line. All leaves must be labeled and these labels must be unique. Labels should not contain any spaces.
Branch lengths that are not specified are assumed to be 1.
Lines beginning with # are ignored.
Sequence paths must point to either a FASTA file or a directory containing 1 or more FASTA files.
Sequence paths must not contain spaces.
Sequence paths that are not referred to in the tree are ignored.
Leaves in the tree that are not mapped to a path are ignored.
Each name / path pair must be on its own line.
Paths must be absolute.

Example:

  # Sequence data for progressive alignment of 4 genomes
  # human, chimp and gorilla are flagged as good assemblies.  
  # since orang isn't, it will not be used as an outgroup species.
 (((human:0.006,chimp:0.006667):0.0022,gorilla:0.008825):0.0096,orang:0.01831);
 *human /data/genomes/human/human.fa
 *chimp /data/genomes/chimp/
 *gorilla /data/genomes/gorilla/gorilla.fa
 orang /cluster/home/data/orang/
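The format rules above can be checked before starting a long run. The following is a minimal sketch, not part of Progressive Cactus: parse_seq_file is a hypothetical helper, and it assumes that the tree line (if present) begins with "(" and that blank lines may be skipped, neither of which is stated above.

```python
import posixpath


def parse_seq_file(text):
    """Parse seqFile text into (tree, genomes, problems).

    genomes maps name -> (path, is_reference_quality).
    Hypothetical helper for illustration only.
    """
    tree = None
    genomes = {}
    problems = []
    for raw in text.splitlines():
        line = raw.strip()
        # Lines beginning with # are ignored; skipping blank lines is an assumption.
        if not line or line.startswith("#"):
            continue
        # Assumption: the optional NEWICK tree is the first data line and starts with "(".
        if tree is None and not genomes and line.startswith("("):
            tree = line
            continue
        fields = line.split()
        if len(fields) != 2:
            problems.append("expected 'name path' with no spaces in either: " + line)
            continue
        name, path = fields
        starred = name.startswith("*")  # * marks a reference-quality assembly
        name = name.lstrip("*")
        if name in genomes:
            problems.append("duplicate name: " + name)
        if not posixpath.isabs(path):
            problems.append("path is not absolute: " + path)
        genomes[name] = (path, starred)
    # If no genome is starred, all are assumed to be of reference quality.
    if genomes and not any(star for _, star in genomes.values()):
        genomes = {n: (p, True) for n, (p, _) in genomes.items()}
    return tree, genomes, problems
```

Calling parse_seq_file(open(path).read()) on the example above should return an empty problems list, with orang as the only genome not flagged as reference quality. The sketch does not check that the paths exist or that the tree labels match the names.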

The sequences for each species are named by their fasta headers. To avoid ambiguity, the first word of each header must be unique within its genome. Additionally, by default we check that the header is alphanumeric. We do this to ensure compatibility with visualisation tools, e.g. the UCSC browser. To disable this behaviour, remove the first preprocessor tag from the config.xml file that you use.
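To catch header problems before launching an alignment, a check along these lines may help. This is a sketch under stated assumptions, not the check that Progressive Cactus itself performs: check_fasta_headers is a hypothetical helper, and it assumes "alphanumeric" means that the first word of each header contains only the characters A-Z, a-z and 0-9.

```python
import re


def check_fasta_headers(lines):
    """Return a list of problems found in the header lines of one genome.

    Hypothetical helper; the built-in check may accept a different character set.
    """
    seen = set()
    problems = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence data, not a header
        words = line[1:].split()
        first = words[0] if words else ""
        # The first word of each header must be unique within its genome.
        if first in seen:
            problems.append("duplicate sequence name: " + first)
        seen.add(first)
        # Assumption: only the first word is required to be alphanumeric.
        if not re.match(r"^[A-Za-z0-9]+$", first):
            problems.append("non-alphanumeric sequence name: " + repr(first))
    return problems
```

The function accepts any iterable of lines, so an open FASTA file can be passed directly and is read one line at a time.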

<workDir>

Working directory for the cactus aligner. It will be created if it doesn't exist. If an incomplete alignment is found in this directory for the same input data, Progressive Cactus will attempt to continue it (i.e. skip any ancestral genomes that were successfully reconstructed previously). If this behavior is undesired, either erase the working directory or use the --overwrite option to restart from scratch.

When running on a cluster, <workDir> must be accessible by all nodes.

<outputHalFile>

Location of the output alignment in HAL (Hierarchical ALignment) format. This is a compressed file that can be accessed via the HAL Tools.

Resuming existing jobs

If Progressive Cactus detects that some sub-alignments in the working directory have already been successfully completed, it will skip them by default. For example, if the last attempt crashed when aligning the human-chimp ancestor to gorilla, then rerunning will not recompute the human-chimp alignment. To force re-alignment of already-completed subalignments, use the --overwrite option or erase the working directory.

Progressive Cactus will always attempt to rerun the HAL exporter after alignment is completed, even if the alignment has not changed.

General Options

--configFile=CONFIGFILE

Location of the Progressive Cactus configuration file in XML format. The default configuration file can be found in progressiveCactus/submodules/cactus/cactus_progressive_config.xml. These parameters are currently undocumented, so modify at your own risk.

--legacy

Align all genomes at once. This is consistent with the original version of Cactus that this package was designed to replace.

--autoAbortOnDeadlock

Abort automatically, by deleting the jobTree folder, when the jobTree monitor suspects a deadlock. This guarantees no trailing ktservers, but it is still dangerous to use until deadlocks can be detected more robustly.

--overwrite

Re-align nodes in the tree that have already been successfully aligned.

JobTree Options and Running on the Cluster

Running with more threads on a single machine

If you're running on a single machine, you can give your alignment run additional threads by supplying the --maxThreads <N> option to the aligner. The default is 4, so if you're running anything sizable, you'll definitely want to increase this!
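One way to pick a thread count is to match the number of CPUs on the machine. The sketch below assembles the argument list for the aligner; build_command is a hypothetical helper, the file names in the usage note are placeholders, and using one thread per CPU is an assumption rather than a recommendation from this manual.

```python
import multiprocessing


def build_command(seq_file, work_dir, out_hal, max_threads=None):
    """Build the argument list for bin/runProgressiveCactus.sh.

    Hypothetical helper. If max_threads is not given, it falls back to
    the number of CPUs reported by the operating system (an assumption).
    """
    if max_threads is None:
        max_threads = multiprocessing.cpu_count()
    return [
        "bin/runProgressiveCactus.sh",
        "--maxThreads", str(max_threads),
        seq_file, work_dir, out_hal,
    ]
```

The resulting list can be passed to subprocess.check_call from the installation directory, for example subprocess.check_call(build_command("/data/seqs.txt", "/data/work", "/data/out.hal")).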

Running on a cluster batch system

Currently, the cluster systems Parasol and Sun GridEngine are supported. PBS/Torque support has stalled. If you're interested in using PBS/Torque, let us know.

Hopefully, your cluster setup has at least one beefy machine with lots of RAM, and several additional compute nodes, which may have less RAM and/or compute power. In this case, you'll want to run progressiveCactus so that it runs the initial alignment (blast) and alignment refinement (bar) stages, which are highly parallelizable, on the cluster, and keep the cactus DB on a central server. A decent starting point for options to provide to the aligner is:

--batchSystem <clusterSystem> --bigBatchSystem singleMachine --defaultMemory 8589934593 --bigMemoryThreshold 8589934592 --bigMaxMemory 893353197568 --bigMaxCpus 25 --maxThreads 25 --retryCount 3

where <clusterSystem> is either parasol or gridengine.
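The memory values in the options above are easier to adapt once converted to familiar units. The sketch below assumes they are byte counts (the source does not say so explicitly, but the values are consistent with that reading) and uses hypothetical helper names.

```python
GIB = 1024 ** 3  # bytes in one gibibyte

# Values copied from the suggested options above.
suggested = {
    "--defaultMemory": 8589934593,
    "--bigMemoryThreshold": 8589934592,
    "--bigMaxMemory": 893353197568,
}


def gib_to_bytes(gib):
    """Convert a size in GiB to bytes (hypothetical helper)."""
    return int(gib * GIB)


def bytes_to_gib(n):
    """Convert a byte count to GiB (hypothetical helper)."""
    return n / float(GIB)


# --bigMemoryThreshold is exactly 8 GiB and --bigMaxMemory is exactly 832 GiB.
# --defaultMemory is one byte above the threshold.
for option, value in sorted(suggested.items()):
    print("%s = %d bytes (%.2f GiB)" % (option, value, bytes_to_gib(value)))
```

To adapt the options to a machine with less RAM, compute the new value with gib_to_bytes, for example gib_to_bytes(256) for --bigMaxMemory on a 256 GiB server.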

For more details, please see the JobTree Manual.

Computation Time & Memory Usage

This code is under constant development and combines many different algorithms, which makes a static assessment of computation time and memory usage difficult. To illustrate the performance of progressiveCactus in practice, the following is the output of jobTreeStats for a run aligning 5 mammalian genomes:

[benedict@hgwdev tempProgressiveCactusAlignment]$ jobTreeStats --jobTree ./jobTree --pretty --sortCategory=time --sortField=total --sortReverse
Batch System: parasol
Default CPU: 1  Default Memory: 8.0G
Job Time: 30s  Max CPUs: 9.22337e+18  Max Threads: 25
Total Clock: 11m6s  Total Runtime: 20h11m51s
Slave
    Count |
