Badread is a long-read simulator tool that makes – you guessed it – bad reads! It can imitate many kinds of problems one might encounter in real long-read sets: chimeras, low-quality regions, systematic basecalling errors and more.

Badread does not try to be the best at imitating real reads (though it's not too bad, see this comparison between Badread and other long-read simulators). Rather, it is intended to give users control over the quality of its simulated reads. I made Badread to test tools that take long reads as input: with it, you can increase the rate of different types of read problems and see what effect each one has.

Badread is published in the Journal of Open Source Software. If you use it in your research, please cite this manuscript:

Wick RR. Badread: simulation of error-prone long reads. Journal of Open Source Software. 2019;4(36):1316. doi:10.21105/joss.01316.

Requirements
Installation
Quick usage
Method
Detailed usage
Contributing
License

Requirements

Badread runs on macOS and Linux. It may not work natively on Windows (I haven't tried) but can be run using the Windows Subsystem for Linux. It requires Python 3.6 or later.

To install Badread you'll need pip and Git. It also uses a few Python packages (Edlib, NumPy, SciPy and Matplotlib) but these should be taken care of by the installation process.

Installation

Install from source

You can install Badread using pip, either from a local copy:

git clone https://github.com/rrwick/Badread.git
pip3 install ./Badread
badread --help

Or directly from GitHub:

pip3 install git+https://github.com/rrwick/Badread.git
badread --help

If these installation commands aren't working for you (e.g. an error message like Command 'pip3' not found or command 'gcc' failed with exit status 1) then check out the installation issues page on the wiki.

Run without installation

Badread can also be run directly from its repository by using the badread-runner.py script, no installation required:

git clone https://github.com/rrwick/Badread.git
Badread/badread-runner.py -h

If you run Badread this way, it's up to you to make sure that all necessary Python packages are installed.
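
If you go the no-installation route, a quick check like the following can tell you which packages are missing before you run anything. This is only a sketch: the import names (edlib, numpy, scipy, matplotlib) are the usual ones for the packages listed under Requirements, and the helper names are made up for this example rather than being part of Badread.

```python
import importlib.util
import sys

# Import names for the packages listed under Requirements.
REQUIRED = ("edlib", "numpy", "scipy", "matplotlib")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_new_enough(minimum=(3, 6)):
    """Check the interpreter against the minimum version from Requirements."""
    return sys.version_info >= minimum


# Example:
#   print("Python OK:", python_is_new_enough())
#   print("Missing packages:", missing_packages() or "none")
```

Anything it reports as missing can be installed with pip in the usual way.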

Quick usage

If you need a reference genome to try out Badread, you can download this file which is an assembly of the Klebsiella pneumoniae SGH10 genome – a nasty hypervirulent strain (read more about it here).

Badread's default settings correspond to Oxford Nanopore R10.4.1 reads of mediocre quality:

badread simulate --reference ref.fasta --quantity 50x \
    | gzip > reads.fastq.gz
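
To get a feel for how much data a given --quantity asks for, here is a rough sketch of the arithmetic. It assumes a relative depth such as 50x means 50 times the total reference length, and that absolute values take a K, M or G suffix (only M appears in the usage text, so K and G are assumptions). The function name is hypothetical and this is not Badread's own parser.

```python
def target_bases(quantity, reference_size):
    """Convert a --quantity style string into a total number of bases.

    quantity: e.g. "50x" (relative depth) or "250M" (absolute amount).
    reference_size: total length of the reference sequences in bases.
    """
    text = quantity.strip().lower()
    if text.endswith("x"):
        # Relative depth: multiply by the reference size.
        return int(round(float(text[:-1]) * reference_size))
    multipliers = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    if text and text[-1] in multipliers:
        # Absolute amount with a size suffix (K and G are assumed here).
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    # Plain number of bases.
    return int(text)


# Example: 50x of a 5 Mbp genome is 250 Mbp of reads.
#   target_bases("50x", 5_000_000)
```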

To simulate older Oxford Nanopore reads (R9.4.1, worse basecalling):

badread simulate --reference ref.fasta --quantity 50x \
    --error_model nanopore2020 --qscore_model nanopore2020 --identity 90,98,5 \
    | gzip > reads.fastq.gz

To simulate PacBio HiFi reads:

badread simulate --reference ref.fasta --quantity 50x \
    --error_model pacbio2021 --qscore_model pacbio2021 --identity 30,3 \
    | gzip > reads.fastq.gz
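
The --identity option takes either three values (mean, max and stdev for a beta distribution, e.g. 90,98,5) or two values (mean and stdev for a normal distribution of qscores, e.g. 30,3). The sketch below shows one way such values could be turned into percent identities. The beta parameters come from a standard method-of-moments fit on a distribution scaled to the maximum, and the qscore conversion uses the usual Phred relationship. Both are assumptions for illustration, as are the function names and the floor at Q0, so Badread's own sampling may differ in detail.

```python
import random


def beta_identity(mean, maximum, stdev, rng):
    """Draw a percent identity from a beta distribution scaled to [0, maximum]."""
    m = mean / maximum    # mean on the 0-1 scale
    s = stdev / maximum   # stdev on the 0-1 scale
    common = m * (1.0 - m) / (s * s) - 1.0
    if common <= 0:
        raise ValueError("stdev is too large for this mean and maximum")
    alpha, beta = m * common, (1.0 - m) * common
    return maximum * rng.betavariate(alpha, beta)


def qscore_identity(mean_q, stdev_q, rng):
    """Draw a qscore from a normal distribution and convert it to percent identity."""
    q = max(rng.gauss(mean_q, stdev_q), 0.0)   # assumed floor at Q0
    return 100.0 * (1.0 - 10.0 ** (-q / 10.0))


# Example:
#   rng = random.Random(0)
#   beta_identity(90, 98, 5, rng)   # three-value form
#   qscore_identity(30, 3, rng)     # two-value form, centred on Q30 (99.9%)
```

The two-value form is handy for very accurate reads: a few qscore units either side of Q30 all sit above 99%, which is awkward to express with a mean and stdev in percent.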

Very bad reads:

badread simulate --reference ref.fasta --quantity 50x --glitches 1000,100,100 \
    --junk_reads 5 --random_reads 5 --chimeras 10 --identity 80,90,6 --length 4000,2000 \
    | gzip > reads.fastq.gz
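
The --length values are a mean and a standard deviation. As a rough illustration of what a setting like 4000,2000 implies, the sketch below samples fragment lengths from a gamma distribution with that mean and stdev. The choice of a gamma distribution is an assumption here (the usage text only says mean and stdev), and the function names are hypothetical.

```python
import random


def gamma_parameters(mean, stdev):
    """Convert a mean and stdev into gamma shape and scale parameters."""
    shape = (mean / stdev) ** 2
    scale = stdev ** 2 / mean
    return shape, scale


def sample_fragment_lengths(mean, stdev, count, seed=None):
    """Sample fragment lengths (at least 1 bp each) from the assumed gamma distribution."""
    rng = random.Random(seed)
    shape, scale = gamma_parameters(mean, stdev)
    # random.gammavariate takes shape (alpha) and scale (beta); its mean is alpha * beta.
    return [max(1, round(rng.gammavariate(shape, scale))) for _ in range(count)]


# Example:
#   lengths = sample_fragment_lengths(4000, 2000, count=10, seed=1)
```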

Pretty good reads:

badread simulate --reference ref.fasta --quantity 50x --glitches 10000,10,10 \
    --junk_reads 0.1 --random_reads 0.1 --chimeras 0.1 --identity 20,3 \
    | gzip > reads.fastq.gz

Very good reads:

badread simulate --reference ref.fasta --quantity 50x --error_model random \
    --qscore_model ideal --glitches 0,0,0 --junk_reads 0 --random_reads 0 --chimeras 0 \
    --identity 30,3 --length 40000,20000 --start_adapter_seq "" --end_adapter_seq "" \
    | gzip > reads.fastq.gz
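
Whichever command you use, it can be worth a quick look at what came out. The sketch below reads a gzipped FASTQ file and reports the read lengths and N50. It assumes each record is exactly four lines (header, sequence, plus line, qualities), and the function names are made up for this example.

```python
import gzip


def read_lengths(fastq_gz_path):
    """Return the length of every read in a gzipped four-line-per-record FASTQ file."""
    lengths = []
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:   # second line of each record is the sequence
                lengths.append(len(line.strip()))
    return lengths


def n50(lengths):
    """Return the N50: the read length at which half of the total bases are reached."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


# Example:
#   lengths = read_lengths("reads.fastq.gz")
#   print(len(lengths), "reads,", sum(lengths), "bases, N50", n50(lengths))
```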

Method

Badread simulates reads by mimicking the process of real sequencing: breaking the DNA into fragments, adding adapters and then reading the fragments into nucleotide sequences.

Here is an overview of how Badread makes each of its reads:

1. Use the fragment length distribution to choose a length for the read.
2. Choose a type of fragment:
   - Most will be fragments of sequence from the reference FASTA. These are equally likely to come from either strand, and can loop around circular references. If there are multiple reference sequences with different depths, then the likelihood of the fragment coming from each sequence is proportional to that sequence's depth.
   - Depending on the settings, some fragments may also be junk or random sequence.
3. Add adapter sequences to the start and end of the fragment, based on the adapter settings.
4. As determined by the chimera rate, there is a chance that Badread will make another fragment and concatenate it onto the current fragment (possibly with adapter sequences in between, possibly not).
5. Add glitches to the fragment, based on the glitch settings.
6. Choose a percent identity for the read using the read identity distribution.
7. 'Sequence' the fragment by adding errors until it has the target percent identity.
   - Errors are chosen using the error model and are added at random positions in the read.
   - This step performs periodic alignments between the original fragment and the error-added sequence, so Badread can track the read's actual identity. This allows it to be precise (if Badread is aiming for a 91.5% identity read, it will be very close to 91.5% identity) but slow. If you find that Badread is too slow, check out the wiki page on running it in parallel.
8. Generate quality scores for each base using the qscore model.
9. Output the read and quality in FASTQ format.
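
To make the 'sequencing' step more concrete, here is a much-simplified sketch of the idea: add errors at random positions, realign every so often to measure the actual identity, and stop once the target is reached. It is not Badread's code. Errors here are uniformly random substitutions, insertions and deletions rather than draws from an error model, identity is approximated as 1 minus the edit distance over the longer sequence length, and a plain Python edit distance stands in for a proper aligner, so it is only practical for short fragments. All names are hypothetical.

```python
import random

BASES = "ACGT"


def edit_distance(a, b):
    """Levenshtein distance using the standard two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        current = [i]
        for j, base_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (base_a != base_b)))  # match/substitution
        previous = current
    return previous[-1]


def identity(fragment, read):
    """Approximate identity (0-1) between the original fragment and the read."""
    longest = max(len(fragment), len(read), 1)
    return 1.0 - edit_distance(fragment, read) / longest


def add_one_error(read, rng):
    """Apply one random substitution, insertion or deletion at a random position."""
    pos = rng.randrange(len(read))
    kind = rng.choice(("substitution", "insertion", "deletion"))
    if kind == "substitution":
        new_base = rng.choice([b for b in BASES if b != read[pos]])
        return read[:pos] + new_base + read[pos + 1:]
    if kind == "insertion":
        return read[:pos] + rng.choice(BASES) + read[pos:]
    return read[:pos] + read[pos + 1:]


def sequence_fragment(fragment, target_identity, rng, max_rounds=1000):
    """Add errors until the read's measured identity drops to the target."""
    read = fragment
    for _ in range(max_rounds):   # guard against targets that cannot be reached
        if len(read) <= 1:
            break
        measured = identity(fragment, read)   # the periodic alignment
        if measured <= target_identity:
            break
        # Estimate how many more errors are needed, add them, then realign.
        shortfall = (measured - target_identity) * len(fragment)
        for _ in range(max(1, int(shortfall))):
            read = add_one_error(read, rng)
    return read
```

The repeated alignment is what makes the result land close to the target, and it is also where the time goes, which matches the precise-but-slow trade-off described above.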

Detailed usage

Command line

usage: badread simulate --reference REFERENCE --quantity QUANTITY [--length LENGTH]
                        [--identity IDENTITY] [--error_model ERROR_MODEL]
                        [--qscore_model QSCORE_MODEL] [--seed SEED] [--start_adapter START_ADAPTER]
                        [--end_adapter END_ADAPTER] [--start_adapter_seq START_ADAPTER_SEQ]
                        [--end_adapter_seq END_ADAPTER_SEQ] [--junk_reads JUNK_READS]
                        [--random_reads RANDOM_READS] [--chimeras CHIMERAS] [--glitches GLITCHES]
                        [--small_plasmid_bias] [-h] [--version]

Generate fake long reads

Required arguments:
  --reference REFERENCE           Reference FASTA file (can be gzipped)
  --quantity QUANTITY             Either an absolute value (e.g. 250M) or a relative depth (e.g. 25x)

Simulation parameters:
  Length and identity and error distributions

  --length LENGTH                 Fragment length distribution (mean and stdev, default: 15000,13000)
  --identity IDENTITY             Sequencing identity distribution (mean,max,stdev for beta
                                  distribution or mean,stdev for normal qscore distribution, default:
                                  95,99,2.5)
  --error_model ERROR_MODEL       Can be "nanopore2018", "nanopore2020", "nanopore2023", "pacbio2016",
                                  "pacbio2021", "random" or a model filename (default: nanopore2023)
  --qscore_model QSCORE_MODEL     Can be "nanopore2018", "nanopore2020", "nanopore2023", "pacbio2016",
                                  "pacbio2021", "random", "ideal" or a model filename (default:
                                  nanopore2023)
  --seed SEED                     Random number generator seed for deterministic output (default:
                                  different output each time)

Adapters:
  Controls adapter sequences on the start and end of reads

  --start_adapter START_ADAPTER   Adapter parameters for re
