Satsuma2

Satsuma2 is an optimised version of Satsuma, a tool to reliably align large and complex DNA sequences providing maximum sensitivity (to find all there is to find), specificity (to only find real homology) and speed (to accomodate the billions of base pairs in vertebrate genomes). Satsuma2 adresses these issues through three novel strategies:

cross-correlation, implemented via fast Fourier transformation.
a match scoring scheme that eliminates almost all false hits.
an asynchronous "battleship"-like search that enables fast whole-genome alignment.

Satsuma2 also interfaces with MizBee, a multi-scale synteny browser for exploring conservation relationships in comparative genomics data (http://www.cs.utah.edu/~miriah/mizbee/Overview.html).

Satsuma2 is implemented in C++ on Linux.

Licensing

Satsuma2 is free software: you can redistribute it and/or modify it under the terms of the Lesser GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Lesser GNU General Public License for more details.

Citing Satsuma2

Please contact us to discuss how you can cite Satsuma2.

Installation

Download the source code from https://github.com/bioinfologics/satsuma2.git and compile it using CMake v3.3+. To run, Satsuma2 requires GCC v5.2+. The binaries are generated in the bin/ directory.

NOTE: if you encounter the error "... undefined reference to `pthread_create'" during compilation, add the flag -pthread to CMakeLists.txt, i.e. change:

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -lpthread -std=c++14 -O3 -w") to: set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -lpthread -pthread -std=c++14 -O3 -w")

We are experimenting with providing pre-compiled binaries which can be downloaded from the releases section. Once downloaded, you should be able to run these directly.

Quick start

Set the SATSUMA2_PATH environment variable to point to the directory containing the binaries.
Configure satsuma_run.sh to match your job submission system (LSF, PBS, SLURM etc.)
Run SatsumaSynteny2 with default parameters (uses one single-threaded slave);

./SatsumaSynteny2 -q query.fa -t target.fa -o output_dir

You can test the installation using the test dataset provided (dog vs. human) by running the test_SatsumaSynteny2 script.

Running Satsuma2

As Satsuma2 calls other executables (HomologyByXCorr, MergeXCorrMatches etc.), you need to set an environment variable to tell the software where to find the binaries;

export SATSUMA2_PATH=/path/to/binaries

Running SatsumaSynteny2 with no command-line arguments shows the available parameters;

$ ./SatsumaSynteny2

./SatsumaSynteny2: Wrapper around high-sensitivity alignments.

Available arguments:

-q<string> : query fasta sequence
-t<string> : target fasta sequence
-o<string> : output directory
-l<int> : minimum alignment length (def=0)
-t_chunk<int> : target chunk size (def=4096)
-q_chunk<int> : query chunk size (def=4096)
-sl_mem<int> : memory requirement for slaves (Gb) (def=100)
-do_refine<bool> : refinement steps (def=0)
-min_prob<double> : minimum probability to keep match (def=0.99999)
-cutoff<double> : signal cutoff (def=1.8)
-prob_table<bool> : approximate match prob using a table lookup in slaves (def=0)
-min_matches<int> : minimum matches per target to keep iterating (def=20)
-m<int> : number of jobs per block (def=4)
-slaves<int> : number of processing slaves (def=1)
-threads<int> : number of working threads per processing slave (def=1)
-km_mem<int> : memory required for kmatch (Gb) (def=100)
-km_sync<bool> : run kmatch jobs synchronously (def=1)
-seed<string> : loads seeds and runs from there (kmatch files prefix) (def=)
-min_seed_length<int> : minimum length for kmatch seeds (after collapsing) (def=24)
-max_seed_kmer_freq<int> : maximum frequency for kmatch seed kmers (def=1)
-old_seed<string> : loads seeds and runs from there (xcorr*data) (def=)
-pixel<int> : number of blocks per pixel (def=24)
-nofilter<bool> : do not pre-filter seeds (slower runtime) (def=0)
-dups<bool> : allow for duplications in the query sequence (def=0)
-dump_cycle_matches<bool> : dump matches on each cycle (for debug/testing) (def=0)

Required parameters are query FASTA (-q), target FASTA (-t) and output directory (-o). The query and target sequences are chunked (based on the -t_chunk and -q_chunk parameters) then KMatch is used to detect aligning regions between chunks. The number of chunks generated depends on the length of your query and target sequences. The amount of memory reserved for KMatch can be modified using the -km_mem parameter which defaults to 100Gb.

SatsumaSynteny2 despatches slave processes to compare the chunks which run asynchronously. The number of slaves, threads per slave and memory limit per slave are specified using the -slaves, -threads and -sl_mem parameters. The default is to run one single-threaded slave using 100Gb of memory. Slaves can be run on a single machine or submitted via a job submission system such as LSF, PBS or SLURM and the satsuma_run.sh file is used by SatsumaSynteny2 to start the slaves. Before running SatsumaSynteny2, you need to modify this file to suit your HPC environment by commenting out the lines you don't need with #. You will also need to change 'QueueName' to a queue that exists on your system. For example, to run on SLURM your file should look like this;

#!/bin/sh

# Script for starting Satsuma jobs on different job submission environments
# One section only should be active, ie. not commented out

# Usage: satsuma_run.sh <current_path> <kmatch_cmd> <ncpus> <mem> <job_id> <run_synchronously>
# mem should be in Gb, ie. 100Gb = 100

# no submission system, processes are run locally either synchronously or asynchronously
#if [ "$6" -eq 1 ]; then
#  eval "$2"
#else
#  eval "$2" &
#fi

##############################################################################################################
## For the sections below you will need to change the queue name (QueueName) to one existing on your system ##
##############################################################################################################

# qsub (PBS systems)
#echo "cd $1; $2" | qsub -V -qQueueName -l ncpus=$3,mem=$4G -N $5

# bsub (LSF systems)
#mem=`expr $4 + 1000`
#bsub -o ${5}.log -J $5 -n $3 -q QueueName -R "rusage[mem=$mem]" "$2"

# SLURM systems
echo "#!/bin/sh" > slurm_tmp.sh
echo srun $2 >> slurm_tmp.sh
sbatch -p QueueName -c $3 -J $5 -o ${5}.log --mem ${4}G slurm_tmp.sh

Notes

If SatsumaSynteny2 is run without a submission system, KMatch jobs will be launched synchronously in order to keep memory requirements low. If you have plenty of memory available you can opt to run the KMatch jobs asynchronously (-km_sync 0). KMatch requires a lot of memory and multiple KMatch processes running at the same time may cause SatsumaSynteny2 to abort if not enough memory is available.
The parameters -km_mem and -sl_mem are only applied when using a job submission system. We strongly recommend using a job submission system to run SatsumaSynteny2 which allows more control of the resource requirements of this software.
If the output directory is not empty, SatsumaSynteny2 will not overwrite any files but exit with an error message.
Idling processes self-terminate after two minutes. The overall alignments will still complete, but using fewer processes.
If alignment runs locally but not on the server farm, check whether processes on the farm can communicate via TCP/IP.
Currently, the entire sequences are loaded into RAM by each process. For comparison of large genomes, we strongly recommend to make sure that the CPUs have enough RAM available (~ the size of both genomes in bytes).

Parameter choice, execution and data preparation

The default parameters should work well for most genomes.
SatsumaSynteny2 runs most efficiently on either multi-processor machines or on clusters that are tightly coupled (fast access to files shared by the control process and the slaves)
Especially for larger genomes, we recommend leaving one CPU dedicated to the control process SatsumaSynteny2.
For larger genomes (>1Gb), we recommend using one chromosome of one genome as the query sequence and the entire other genome as the target sequence, and process alignments one query chromosome at a time.
To include large-scale duplications in the query sequence (in addition to the target sequence), use the option –dups.
If using the option –nofilter, the number of initial searches (-ni) should be higher than the number of processes (-n) to ensure that subsequent processes have sufficient seeds. Note that initial searches will be queued to a number of processes specified by -n.
When many processes search a tight space, the number of pixels per CPU (-m) should be small (e.g. ‘–m 1’ as in the sample script/data set) to avoid unbalanced load (i.e. some processes get all the pixels while others are starved, since they overlap). However, a small value for –m increases inter-process communication, which should be a consideration when deploying hundreds of processes.

Making SatsumaSynteny2 converge (a temporary note)

Given a new and more exhaustive convergence model, which is still under active development, Satsuma2 may fail to converge into a single final result, and rather enter an iteration cycle, where lots of small (or not so small) changes are made to the general alignment search strategy. Instead of hiding this behaviour under a fixed cutoff after a number of iterations, we have chosen to expose it, and allow the user to examine and choose the intermediate result that best suits the biological question.

For this r

Satsuma2

Install / Use

README

Satsuma2

Licensing

Citing Satsuma2

Installation

Quick start

Running Satsuma2

Notes

Parameter choice, execution and data preparation

Making SatsumaSynteny2 converge (a temporary note)