ABySS
ABySS is a de novo sequence assembler intended for short paired-end reads and genomes of all sizes.
Please cite our papers.
Contents
- Installation
- Dependencies
- Compiling ABySS from source
- Before starting an assembly
- Modes
- Examples
- Optimizing the parameters k and kc
- Running ABySS on a cluster
- Using the DIDA alignment framework
- Assembly Parameters
- ABySS programs
- Export to SQLite Database
- Citation
- Related Publications
- Support
- Authors
Installation
Install ABySS using Conda (recommended)
If you have the Conda package manager (Linux, macOS) installed, run:
conda install -c bioconda -c conda-forge abyss
Or you can install ABySS in a dedicated environment:
conda create -n abyss-env
conda activate abyss-env
conda install -c bioconda -c conda-forge abyss
Install ABySS using Homebrew
If you have the Homebrew package manager (Linux, macOS) installed, run:
brew install abyss
Install ABySS on Windows
Install Windows Subsystem for Linux, from which you can run the Conda or Homebrew installation.
Dependencies
Dependencies for linked reads
These can be installed through Conda:
conda install -c bioconda arcs tigmint
Or Homebrew:
brew install brewsci/bio/arcs brewsci/bio/links-scaffolder
Optional dependencies
Conda:
conda install -c bioconda samtools
conda install -c conda-forge pigz zsh
Homebrew:
brew install pigz samtools zsh
Compiling ABySS from source
Compiling ABySS from source requires the Autoconf and Automake build tools, which the autogen.sh step uses, and a C++ compiler that supports OpenMP, such as GCC.
The required libraries can be installed through Conda or Homebrew.
Conda:
conda install -c conda-forge boost openmpi
conda install -c bioconda google-sparsehash btllib
It is also helpful to install the compilers Conda package, which automatically passes the correct compiler flags to use the available Conda packages:
conda install -c conda-forge compilers
Homebrew:
brew install boost open-mpi google-sparsehash
Compiling ABySS with Boost 1.51.0 or 1.52.0 fails because those versions contain a bug. Later versions of Boost compile without error.
To compile, run the following:
./autogen.sh
mkdir build
cd build
../configure --prefix=/path/to/abyss
make
make install
You may also pass the following flags to the configure script:
--with-boost=PATH
--with-mpi=PATH
--with-sqlite=PATH
--with-sparsehash=PATH
--with-btllib=PATH
Where PATH is the path to the directory containing the corresponding dependencies. This should only be necessary if configure doesn't find the dependencies by default. If you are using Conda, PATH would be the path to the Conda installation. SQLite and MPI are optional dependencies.
The above steps install ABySS at the provided path, in this case /path/to/abyss.
Not specifying --prefix would install in /usr/local, which requires
sudo privileges when running make install.
ABySS requires a modern compiler such as GCC 6 or greater. If you have multiple versions of GCC installed, you can specify a different compiler:
../configure CC=gcc-10 CXX=g++-10
OpenMPI is assumed by default, but you can switch to LAM/MPI or MPICH using one of:
../configure --enable-lammpi
../configure --enable-mpich
The default maximum k-mer size is 192. It can be changed at compile time, either decreased to reduce memory usage or increased. The value must be a multiple of 32 (e.g. 32, 64, 96, 128):
../configure --enable-maxk=160
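As a minimal sketch of the multiple-of-32 rule, the helper below rounds a desired k up to the next valid value for --enable-maxk. The function name round_up_maxk is hypothetical and not part of ABySS.

```shell
# Hypothetical helper (not part of ABySS): round a desired k up to the
# next multiple of 32, as --enable-maxk requires.
round_up_maxk() {
    echo $(( ($1 + 31) / 32 * 32 ))
}
```

For example, `../configure --enable-maxk="$(round_up_maxk 150)"` would pass 160 to configure.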
If you encounter compiler warnings that are not critical, you can allow the compilation to continue:
../configure --disable-werror
To run ABySS, its executables should be found in your PATH environment variable. If you
installed ABySS in /opt/abyss, add /opt/abyss/bin to your PATH:
PATH=/opt/abyss/bin:$PATH
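One way to confirm that the PATH change took effect is to look up the executables by name. This is a sketch only: check_on_path is a hypothetical helper, not something ABySS ships, and it relies on the standard `command -v` lookup.

```shell
# Hypothetical helper (not part of ABySS): report whether each named
# program can be found on PATH. Returns non-zero if any is missing.
check_on_path() {
    missing=0
    for prog in "$@"; do
        if command -v "$prog" >/dev/null 2>&1; then
            echo "found: $prog"
        else
            echo "missing: $prog" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_on_path abyss-pe ABYSS-P abyss-fac` after updating PATH should list all three as found if the installation directory is set correctly.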
Before starting an assembly
ABySS stores temporary files in TMPDIR, which is /tmp by default on most systems. If your default temporary disk volume is too small, set TMPDIR to a larger volume, such as /var/tmp or your home directory.
export TMPDIR=/var/tmp
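Before changing TMPDIR, it may help to compare the free space on the candidate volumes. The sketch below assumes a POSIX `df`; tmp_free_kib is a hypothetical helper name, and how much space an assembly needs is not something it can tell you.

```shell
# Hypothetical helper (not part of ABySS): print the free space, in KiB,
# on the volume holding the given directory. Defaults to $TMPDIR, then /tmp.
tmp_free_kib() {
    dir=${1:-${TMPDIR:-/tmp}}
    # df -P prints a stable POSIX table; column 4 of the data row is "Available".
    df -Pk "$dir" | awk 'NR == 2 { print $4 }'
}
```

For example, compare `tmp_free_kib /tmp` with `tmp_free_kib /var/tmp` and export TMPDIR to whichever volume has more room.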
Modes
Bloom filter mode
The recommended mode of running ABySS is the Bloom filter mode. Specifying
the Bloom filter memory budget with the B parameter enables this mode, which can
reduce memory consumption tenfold compared to the MPI mode. B may be specified
with the unit suffixes 'k' (kilobytes), 'M' (megabytes), or 'G' (gigabytes). If no
unit is specified, bytes are assumed. Internally, the Bloom filter assembler allocates
most of the memory budget (B * 8/9) to a Counting Bloom filter and the remaining
B/9 to a second Bloom filter that tracks k-mers that have already been included
in contigs.
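To make the split concrete, the sketch below divides a budget given in bytes into the 8/9 and 1/9 shares described above. It assumes B is the total that gets divided between the two filters; bloom_split is a hypothetical helper, not an ABySS program, and it uses integer arithmetic, so the shares are rounded.

```shell
# Hypothetical helper (not part of ABySS): split a total budget B, in
# bytes, into the Counting Bloom filter share (8/9) and the share for the
# k-mer tracking Bloom filter (1/9).
bloom_split() {
    total=$1
    tracking=$(( total / 9 ))
    counting=$(( total - tracking ))
    echo "counting=$counting tracking=$tracking"
}
```

For example, `bloom_split 9000` prints `counting=8000 tracking=1000`.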
A good value for B depends on a number of factors, but primarily on the
genome being assembled. A general guideline is:
P. glauca (~20Gbp): B=500G
H. sapiens (~3.1Gbp): B=50G
C. elegans (~101Mbp): B=2G
For other genome sizes, the value for B can be interpolated. Note that
there is no downside to using a larger-than-necessary B value, except for
the memory required. To make sure you have selected a suitable B value,
inspect the standard error log of the assembly process and ensure that the
reported FPR (false positive rate) value under Counting Bloom filter stats
is 5% or less. This requires verbosity level 1, set with the v=-v option.
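The interpolation can be sketched as follows. This assumes piecewise-linear interpolation between the three guideline points, which is one reasonable reading and not necessarily what the authors intend. The helper estimate_b_gb is hypothetical; it takes the genome size in Mbp and clamps sizes outside the guideline range to the nearest point.

```shell
# Hypothetical helper (not part of ABySS): estimate B, in gigabytes, for a
# genome size given in Mbp, by piecewise-linear interpolation between the
# guideline points (101 Mbp, 2G), (3100 Mbp, 50G) and (20000 Mbp, 500G).
estimate_b_gb() {
    awk -v g="$1" 'BEGIN {
        n = split("101 3100 20000", size, " ")   # genome sizes in Mbp
        split("2 50 500", mem, " ")              # guideline B in GB
        if (g <= size[1]) { print mem[1]; exit } # clamp below the range
        if (g >= size[n]) { print mem[n]; exit } # clamp above the range
        for (i = 1; i < n; i++) {
            if (g <= size[i + 1]) {
                frac = (g - size[i]) / (size[i + 1] - size[i])
                printf "%.0f\n", mem[i] + frac * (mem[i + 1] - mem[i])
                exit
            }
        }
    }'
}
```

Treat the result as a starting point only: the FPR reported in the assembly log remains the real check, and rounding B up costs nothing but memory.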
MPI mode (legacy)
This mode is legacy and we do not recommend running ABySS with it.
To run ABySS in the MPI mode, you need to specify the np parameter,
which specifies the number of processes to use for the parallel MPI job.
Without any MPI configuration, this will allow you to use multiple cores
on a single machine. To use multiple machines for assembly, you must create
a hostfile for mpirun, which is described in the mpirun man page.
Do not run mpirun -np 8 abyss-pe. To run ABySS with 8 threads, use
abyss-pe np=8. The abyss-pe driver script will start the MPI
process, like so: mpirun -np 8 ABYSS-P.
The paired-end assembly stage is multithreaded, but must run on a
single machine. The number of threads to use may be specified with the
parameter j. The default value for j is the value of np.
Examples
Assemble a small synthetic data set
wget http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/1.3.4/test-data.tar.gz
tar xzvf test-data.tar.gz
abyss-pe k=25 name=test B=1G \
in='test-data/reads1.fastq test-data/reads2.fastq'
Calculate assembly contiguity statistics:
abyss-fac test-unitigs.fa test-contigs.fa test-scaffolds.fa
Assembling a paired-end library
To assemble paired reads in two files named reads1.fa and
reads2.fa into contigs in a file named ecoli-contigs.fa, run the
command:
abyss-pe name=ecoli k=96 B=
