ABySS

ABySS is a de novo sequence assembler intended for short paired-end reads and genomes of all sizes.

Installation
Dependencies
- Dependencies for linked reads
- Optional dependencies
Compiling ABySS from source
Before starting an assembly
Modes
- Bloom filter mode
- MPI mode (legacy)
Examples
Optimizing the parameters k and kc
Running ABySS on a cluster
Using the DIDA alignment framework
Assembly Parameters
ABySS programs
Export to SQLite Database
- Database parameters
- Helper programs
Citation
Related Publications
Support
Authors

Installation

Install ABySS using Conda (recommended)

If you have the Conda package manager (Linux, macOS) installed, run:

conda install -c bioconda -c conda-forge abyss

Or you can install ABySS in a dedicated environment:

conda create -n abyss-env
conda activate abyss-env
conda install -c bioconda -c conda-forge abyss

Install ABySS using Homebrew

If you have the Homebrew package manager (Linux, macOS) installed, run:

brew install abyss

Install ABySS on Windows

Install Windows Subsystem for Linux, from which you can run the Conda or Homebrew installation.

Dependencies

Dependencies for linked reads

ARCS for scaffolding.
Tigmint for correcting assembly errors.

These can be installed through Conda:

conda install -c bioconda arcs tigmint

Or Homebrew:

brew install brewsci/bio/arcs brewsci/bio/links-scaffolder

Optional dependencies

pigz for parallel gzip.
samtools for reading BAM files.
zsh for reporting time and memory usage.

Conda:

conda install -c bioconda samtools
conda install -c conda-forge pigz zsh

Homebrew:

brew install pigz samtools zsh

Compiling ABySS from source

Compiling ABySS from source requires a C++ compiler that supports OpenMP, such as GCC.

The following libraries are required:

Conda:

conda install -c conda-forge boost openmpi
conda install -c bioconda google-sparsehash btllib

It is also helpful to install the compilers Conda package, which automatically passes the correct compiler flags to use the available Conda packages:

conda install -c conda-forge compilers

Homebrew:

brew install boost open-mpi google-sparsehash

Compiling ABySS with Boost 1.51.0 or 1.52.0 fails because those versions contain a bug. Later versions of Boost compile without error.

To compile, run the following:

./autogen.sh
mkdir build
cd build
../configure --prefix=/path/to/abyss
make
make install

You may also pass the following flags to the configure script:

--with-boost=PATH
--with-mpi=PATH
--with-sqlite=PATH
--with-sparsehash=PATH
--with-btllib=PATH

Where PATH is the path to the directory containing the corresponding dependencies. This should only be necessary if configure doesn't find the dependencies by default. If you are using Conda, PATH would be the path to the Conda installation. SQLite and MPI are optional dependencies.

The above steps install ABySS at the provided path, in this case /path/to/abyss. Not specifying --prefix would install in /usr/local, which requires sudo privileges when running make install.
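
If the dependencies come from an active Conda environment, the same prefix can be passed to each flag. The following is a minimal sketch, assuming that conda activate has set the CONDA_PREFIX environment variable; conda_configure_flags is a hypothetical helper, not part of ABySS:

```shell
# conda_configure_flags is a hypothetical helper (not an ABySS program).
# It prints one --with-* flag per dependency, all pointing at the prefix in $1.
conda_configure_flags() {
    prefix=$1
    flags=
    for dep in boost sparsehash btllib; do
        # Separate flags with a single space, without a trailing one.
        flags="$flags${flags:+ }--with-$dep=$prefix"
    done
    echo "$flags"
}

# Assumption: CONDA_PREFIX is set by an activated Conda environment;
# otherwise a placeholder path is printed.
conda_configure_flags "${CONDA_PREFIX:-/path/to/conda}"
```

The printed flags can then be appended to the ../configure command shown above.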

ABySS requires a modern compiler such as GCC 6 or greater. If you have multiple versions of GCC installed, you can specify a different compiler:

../configure CC=gcc-10 CXX=g++-10

While OpenMPI is assumed by default, you can switch to LAM/MPI or MPICH using one of:

../configure --enable-lammpi
../configure --enable-mpich

The default maximum k-mer size is 192. At compile time it may be decreased to reduce memory usage, or increased to allow larger values of k. This value must be a multiple of 32 (e.g. 32, 64, 96, 128):

../configure --enable-maxk=160
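
Because the value must be a multiple of 32, a desired largest k can be rounded up before configuring. This is a small sketch; round_up_maxk is a hypothetical helper and 150 is an illustrative value:

```shell
# round_up_maxk is a hypothetical helper (not an ABySS program).
# It rounds the k-mer size in $1 up to the nearest multiple of 32.
round_up_maxk() {
    k=$1
    echo $(( (k + 31) / 32 * 32 ))
}

# Largest k planned for this build (illustrative value).
maxk=$(round_up_maxk 150)
echo "../configure --enable-maxk=$maxk"   # prints ../configure --enable-maxk=160
```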

If you encounter compiler warnings that are not critical, you can allow the compilation to continue:

../configure --disable-werror

To run ABySS, its executables should be found in your PATH environment variable. If you installed ABySS in /opt/abyss, add /opt/abyss/bin to your PATH:

PATH=/opt/abyss/bin:$PATH
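
To confirm that the executables are visible after updating PATH, a quick check along these lines may help. have_tool is a hypothetical helper; abyss-pe and abyss-fac are the programs used in the examples in this README:

```shell
# have_tool is a hypothetical helper: it succeeds if the command in $1 is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether each program can be found.
for tool in abyss-pe abyss-fac; do
    if have_tool "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing"
    fi
done
```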

Before starting an assembly

ABySS stores temporary files in TMPDIR, which is /tmp by default on most systems. If your default temporary disk volume is too small, set TMPDIR to a larger volume, such as /var/tmp or your home directory.

export TMPDIR=/var/tmp
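
Before a large assembly it can be worth checking how much space the temporary volume has. A minimal sketch using the standard df utility follows; tmp_free_kb is a hypothetical helper:

```shell
# tmp_free_kb is a hypothetical helper. It prints the free space, in kilobytes,
# of the volume holding the directory in $1 (default: TMPDIR, else /tmp).
tmp_free_kb() {
    dir=${1:-${TMPDIR:-/tmp}}
    # df -P uses the portable one-line-per-filesystem layout; -k reports
    # 1024-byte units. Column 4 of the second line is the available space.
    df -Pk "$dir" | awk 'NR == 2 { print $4 }'
}

echo "Free space in ${TMPDIR:-/tmp}: $(tmp_free_kb) KB"
```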

Modes

Bloom filter mode

The recommended mode of running ABySS is the Bloom filter mode. Specifying the Bloom filter memory budget with the B parameter enables this mode, which can reduce memory consumption tenfold compared to the MPI mode. B may be specified with the unit suffixes 'k' (kilobytes), 'M' (megabytes), or 'G' (gigabytes). If no units are specified, bytes are assumed. Internally, the Bloom filter assembler allocates most of the memory budget (B * 8/9) to a Counting Bloom filter, and the remaining B/9 to another Bloom filter that is used to track k-mers that have previously been included in contigs.
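
The split between the two filters is simple arithmetic. As an illustration, the sketch below computes both shares for a budget given in bytes; bloom_split is a hypothetical helper and the budget of 9000000000 bytes is chosen only to give round numbers:

```shell
# bloom_split is a hypothetical helper (not an ABySS program).
# Given a memory budget B in bytes, it prints the share for the Counting
# Bloom filter (B * 8/9) and for the k-mer tracking Bloom filter (B/9).
# Shell arithmetic is integer-only, so both results are rounded down.
bloom_split() {
    b=$1
    echo "counting=$(( b * 8 / 9 )) tracking=$(( b / 9 ))"
}

bloom_split 9000000000   # prints counting=8000000000 tracking=1000000000
```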

A good value for B depends on a number of factors, but primarily on the genome being assembled. A general guideline is:

- P. glauca (~20 Gbp): B=500G
- H. sapiens (~3.1 Gbp): B=50G
- C. elegans (~101 Mbp): B=2G

For other genome sizes, the value for B can be interpolated. Note that there is no downside to using a larger than necessary B value, except for the memory required. To make sure you have selected a suitable B value, inspect the standard error log of the assembly process and ensure that the reported FPR value under Counting Bloom filter stats is 5% or less. This requires verbosity level 1, set with the v=-v option.
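
One way to interpolate is sketched below. It assumes linear interpolation between the three guideline points, rounds up to a whole gigabyte, and clamps to the smallest or largest guideline value outside their range; these are illustrative choices rather than an ABySS rule, and estimate_b is a hypothetical helper:

```shell
# estimate_b is a hypothetical helper (not an ABySS program).
# Input: genome size in Mbp. Output: a starting value for B in gigabytes.
# Assumptions: linear interpolation between the guideline points
# (101 Mbp -> 2G, 3100 Mbp -> 50G, 20000 Mbp -> 500G), clamped at both ends.
estimate_b() {
    awk -v g="$1" 'BEGIN {
        g = g + 0
        n = split("101 3100 20000", size, " ")
        split("2 50 500", budget, " ")
        if (g <= size[1]) { print budget[1]; exit }
        if (g >= size[n]) { print budget[n]; exit }
        for (i = 1; i < n; i++) {
            if (g <= size[i + 1]) {
                b = budget[i] + (g - size[i]) * (budget[i + 1] - budget[i]) / (size[i + 1] - size[i])
                c = int(b)
                if (c < b) c++   # round up to a whole gigabyte
                print c
                exit
            }
        }
    }'
}

echo "B=$(estimate_b 1000)G"   # starting point for a 1 Gbp genome
```

Any value obtained this way is only a starting point and should still be confirmed against the reported FPR.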

MPI mode (legacy)

This mode is legacy and we do not recommend running ABySS with it. To run ABySS in the MPI mode, you need to specify the np parameter, which specifies the number of processes to use for the parallel MPI job. Without any MPI configuration, this will allow you to use multiple cores on a single machine. To use multiple machines for assembly, you must create a hostfile for mpirun, which is described in the mpirun man page.

Do not run mpirun -np 8 abyss-pe. To run ABySS with 8 processes, use abyss-pe np=8. The abyss-pe driver script will start the MPI process itself, like so: mpirun -np 8 ABYSS-P.

The paired-end assembly stage is multithreaded, but must run on a single machine. The number of threads to use may be specified with the parameter j. The default value for j is the value of np.

Examples

Assemble a small synthetic data set

wget http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/1.3.4/test-data.tar.gz
tar xzvf test-data.tar.gz
abyss-pe k=25 name=test B=1G \
	in='test-data/reads1.fastq test-data/reads2.fastq'

Calculate assembly contiguity statistics:

abyss-fac test-unitigs.fa test-contigs.fa test-scaffolds.fa

Assembling a paired-end library

To assemble paired reads in two files named reads1.fa and reads2.fa into contigs in a file named ecoli-contigs.fa, run the command:

abyss-pe name=ecoli k=96 B=
