ReSeq

A more realistic simulator for genomic DNA sequences from Illumina machines that achieves a k-mer spectrum similar to that of the original sequences.

Abstract
Requirements
Installation
Bioconda
Quick start examples
Apply errors and qualities directly to sequences
Parameter
File Formats
FAQ
Included libraries
Publication

<a name="abstract"></a>Abstract

Even though sequencing biases and errors have been deeply researched to adequately account for them, comparison studies, e.g. for error correction, assembly or variant calling, face the problem that synthetic datasets resemble the real output of high-throughput sequencers only in very limited ways. As a result, the performance of programs estimated on simulated data is optimistic compared to real data. Therefore, comparison studies are often based on real data. However, this approach has its own difficulties, since the ground truth is unknown and can only be estimated, which introduces its own biases and circularity towards easy solutions and the methods used.

Real Sequence Reproducer (ReSeq) narrows the gap between simulated and real data evaluations by adequately reproducing key statistics of real data, like the coverage profile, systematic errors and the k-mer spectrum. When these characteristics are translated into new synthetic computational experiments (i.e. simulated data), the performance can be more accurately estimated. Combining our simulator and real data gives two valuable perspectives on the performance of tools to minimize biases.

<a name="requirements"></a>Requirements

| Requirement | Ubuntu/Debian | CentOS | Manual installation | Comments | Tested version |
|---------------------------|------------------------------------|--------|---------------------|----------|----------------|
| Linux system | | | | | |
| Compiler supporting C++14 | sudo apt install build-essential | sudo yum install gcc gcc-c++ glibc-devel make | | On CentOS 7 the g++ compiler is too old to support C++14, so in addition to the yum command you need to install a newer version, for example by following this guide (https://linuxhostsupport.com/blog/how-to-install-gcc-on-centos-7/). If the standard install path /usr/local/ was used, afterwards the CXX variable has to be set to /usr/local/bin/c++, the CC variable to /usr/local/bin/gcc, /usr/local/lib64/ has to be added to your LD_LIBRARY_PATH and /usr/local/bin to your PATH before installing boost. | 7.2.1 20171019 |
| ZLIB | sudo apt-get install zlib1g-dev | sudo yum install zlib-devel | | | |
| BZip2 | sudo apt-get install libbz2-dev | sudo yum install bzip2-devel | | | |
| Python 3 | | | | | 3.6.11 |
| Python libraries | sudo apt-get install python3-dev | sudo yum install python3-devel.x86_64 | | | |
| Git | sudo apt-get install git | sudo yum install git | | | |
| CMake | sudo apt-get install cmake | (too old version) | https://cmake.org/install/ | The newest versions (starting 3.16) require sudo apt-get install libssl-dev or sudo yum install openssl-devel | 3.5.1 |
| Boost C++ libraries | sudo apt-get install libboost-all-dev | (version not working) | https://www.boost.org/doc/libs/1_71_0/more/getting_started/unix-variants.html | Only the download and extraction in section 1 and the library builds in section 5 are strictly needed. If you set a prefix, you need to set BOOST_ROOT to this prefix before the installation process below, or you will get boost library errors at the cmake and make steps. If you manually installed g++, run ./b2 without sudo so that the environment variables CXX and CC are found. | 1.67.0 |
| SWIG | sudo apt-get install swig | (too old version) | http://www.swig.org/Doc4.0/Preface.html | If you set a prefix you need to add prefix/bin to your PATH variable | 3.0.8 |

<a name="installation"></a>Installation

To install to the standard folder /usr/local or to keep everything in the build folder:

cd /where/you/want/to/build/ReSeq
git clone https://github.com/schmeing/ReSeq.git
cd ReSeq
mkdir build
cd build
cmake ..
make

To install to a different folder, the same steps apply, but the cmake .. line has to be replaced with:

cmake -DCMAKE_INSTALL_PREFIX=/where/you/want/to/install/ReSeq/ ..

The executable file will afterwards be /where/you/want/to/build/ReSeq/ReSeq/build/bin/reseq and can be added to the PATH variable or copied to the desired place.

Alternatively, ReSeq can be installed to the standard folder /usr/local or the previously defined folder by:

make install

To test the installation run:

reseq test

Some useful Python scripts can be found in /where/you/want/to/install/ReSeq/ReSeq/python or, after an installation, in /usr/local/bin or /where/you/want/to/install/ReSeq/bin/.

<a name="conda"></a>Bioconda

ReSeq can also be installed automatically via anaconda/miniconda (https://docs.conda.io/projects/continuumio-conda/en/latest/user-guide/install/index.html) with the following command:

conda install -c bioconda -c conda-forge reseq

However, updates will not be as frequent and the option to switch to the devel branch to get the most recent bugfixes is missing.

<a name="quickstart"></a>Quick start examples

To create simulated data similar to real data, you first need to map the real data to a reference, for example with bowtie2:

bowtie2-build my_reference.fa my_reference
bowtie2 -p 32 -X 2000 -x my_reference -1 my_data_1.fq -2 my_data_2.fq | samtools sort -m 10G -@ 4 -T _tmp -o my_mappings.bam -

To run the full simulation pipeline (Stats creation, Probability estimation, Simulation) execute:

reseq illuminaPE -j 32 -r my_reference.fa -b my_mappings.bam -1 my_simulated_data_1.fq -2 my_simulated_data_2.fq

The following three commands do the same, one command per step, so that you can, for example, run only the simulation the second time you simulate from the same real data:

reseq illuminaPE -j 32 -r my_reference.fa -b my_mappings.bam --statsOnly
reseq illuminaPE -j 32 -s my_mappings.bam.reseq --stopAfterEstimation
reseq illuminaPE -j 32 -R my_reference.fa -s my_mappings.bam.reseq --ipfIterations 0 -1 my_simulated_data_1.fq -2 my_simulated_data_2.fq

In order to add variation (to simulate diploid genomes or populations) the parameter -V needs to be added:

reseq illuminaPE -j 32 -r my_reference.fa -b my_mappings.bam -V my_variation.vcf -1 my_simulated_data_1.fq -2 my_simulated_data_2.fq

reseq illuminaPE -j 32 -R my_reference.fa -s my_mappings.bam.reseq -V my_variation.vcf --ipfIterations 0 -1 my_simulated_data_1.fq -2 my_simulated_data_2.fq
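When the three steps are driven from a script, it can help to build the command lines in one place. The following is a minimal sketch, not part of ReSeq: the helper name `reseq_steps` and its parameters are hypothetical, and it uses only the flags shown in the commands above. It assumes that the stats file is named after the BAM file with `.reseq` appended, as in the examples, and that `-V` is only needed in the simulation step.

```python
import shlex


def reseq_steps(reference, bam, out1, out2, threads=32, vcf=None):
    """Return the three reseq command lines (stats, estimation, simulation).

    Hypothetical helper: it only assembles the arguments shown in the
    quick start examples and does not run anything itself.
    """
    stats = bam + ".reseq"  # assumption: stats file name follows the examples
    base = ["reseq", "illuminaPE", "-j", str(threads)]

    simulate = base + ["-R", reference, "-s", stats]
    if vcf is not None:
        simulate += ["-V", vcf]  # optional variation for the simulation step
    simulate += ["--ipfIterations", "0", "-1", out1, "-2", out2]

    return [
        base + ["-r", reference, "-b", bam, "--statsOnly"],
        base + ["-s", stats, "--stopAfterEstimation"],
        simulate,
    ]


if __name__ == "__main__":
    steps = reseq_steps("my_reference.fa", "my_mappings.bam",
                        "my_simulated_data_1.fq", "my_simulated_data_2.fq")
    for command in steps:
        # Print the command; to execute it, pass the list to
        # subprocess.run(command, check=True) instead.
        print(shlex.join(command))
```

Keeping the commands as argument lists avoids shell quoting problems with unusual file names, and rerunning only the last list repeats the simulation without recomputing the statistics.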

To run a simulation with tiles, the tile information needs to stay in the read names after the mapping. This means there must not be a space before it, as is often the case for read archive data. To replace the space with an underscore on the fly, the reseq-prepare-names.py script is provided. In this case run the mapping like this:

bowtie2 -p 32 -X 2000 -x my_reference -1 <(reseq-prepare-names.py my_data_1.fq my_data_2.fq) -2 <(reseq-prepare-names.py my_data_2.fq my_data_1.fq) | samtools sort -m 10G -@ 4 -T _tmp -o my_mappings.bam -
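To illustrate the idea behind the renaming, here is a simplified sketch. It is not the shipped reseq-prepare-names.py, which takes both read files and may handle more cases. The function name `protect_tile_info` is hypothetical, and the sketch assumes plain four-line FASTQ records and that only the first space in the header has to be replaced.

```python
def protect_tile_info(fastq_lines):
    """Yield FASTQ lines with the first space of each header replaced by '_'.

    Assumption: every record has exactly four lines (header, sequence,
    separator, qualities), so headers sit at line indices 0, 4, 8, ...
    """
    for index, line in enumerate(fastq_lines):
        if index % 4 == 0:
            # Only touch the header, so the mapper keeps the text after the
            # space as part of the read name.
            line = line.replace(" ", "_", 1)
        yield line
```

Because the function is a generator, it can be applied to an open file handle and written straight to standard output, which is what makes the on-the-fly use with process substitution possible.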

For best results it is always advised to create your own profiles from a dataset very closely matching the desired sequencer, chemistry, fragmentation, adapters, PCR cycles, etc. Furthermore, training on the same or a closely related species is best to be sure that the necessary profile space (e.g. range of GC content) is well populated. However, in many situations there is no specific case that should be simulated, but a wide variety of datasets is important. Under this condition finding good datasets is tedious work and recreating the same profile from a given dataset does not help to ensure the quality of the simulated data. Therefore, this repository is designed to provide high-quality profiles with detailed information on the original datasets. Whenever possible, method benchmarks should be performed on the simulated and the original dataset to verify that the simulation created realistic conditions for the particular use case.

<a name="errormodel"></a>Apply errors and qualities directly to sequences

In case you cannot use the coverage model, you can directly provide sequences to ReSeq, which will be converted to reads. This includes adding qualities as well as InDel and substitution errors and cutting the sequence to the read length. If sequences are shorter than the read length ReSeq automatically adds an adapter.

reseq seqToIllumina -j2 -i my_sequences.fa -o my_simulated_reads.fq -s my_stats_profile.reseq

For this to work, all necessary information needs to be provided to ReSeq's error and quality model. Therefore, each input sequence in the fasta file must have the following form:

>{sequence id} {template segment};{fragment length};{error tendencies};{error rates}
{sequence to convert}

{sequence id}: The desired sequence id. It can contain spaces. The final read description in the output fastq will be: @{sequence id} {cigar} E{number of errors in read}

{template segment}: 1 for first reads or 2 for second reads.

{sequence to convert}: Sequence to which errors and qualities will be added. It may only contain A, C, G or T. Ns are not permitted, since a conversion should be performed in a consistent manner for all reads stemming from a given position in the reference (see reseq replaceN).
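Since only A, C, G and T are permitted, it may be worth checking the input fasta before passing it to seqToIllumina. The following is a minimal sketch and not part of ReSeq: the function name `check_sequences` is hypothetical, headers are treated as opaque text, and lowercase bases are reported as invalid, which is a strict reading of the rule above.

```python
def check_sequences(fasta_lines):
    """Return a list of (header, invalid characters) for offending records.

    Assumption: only the uppercase letters A, C, G and T are valid, so Ns,
    lowercase bases and any other characters are reported.
    """
    allowed = set("ACGT")
    problems = []
    header = None
    invalid = set()
    for raw_line in fasta_lines:
        line = raw_line.strip()
        if line.startswith(">"):
            # A new record starts: store the findings of the previous one.
            if header is not None and invalid:
                problems.append((header, sorted(invalid)))
            header, invalid = line[1:], set()
        elif line:
            invalid |= set(line) - allowed
    if header is not None and invalid:
        problems.append((header, sorted(invalid)))
    return problems
```

An empty result means every sequence passed the check. Records that contain Ns are better fixed with reseq replaceN than by hand, for the consistency reason given above.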
