
fqzcomp v4.6

James Bonfield, Wellcome Trust Sanger Institute 2011 - 2013.

This is the 4th variant of an experimental code to compress fastq data. It has a number of caveats.

  1. The fastq must consist of [ACGTN], or [0123.]. Anything else is considered as an "N" or ".".

  2. Sequence and quality should be on one line only.

  3. All bases called as N will be assumed to have quality 0. This can cause disagreements after decompression. Arguably this is a feature: how can you have anything other than zero confidence that "N" is the correct call?

Although the program usage and documentation refer to .fqz files, this is not compatible with the previous two variants using that format extension. (However the magic number has been amended to include the appropriate format version code so they can detect which files are which.)

It has been tested on Illumina, 454, PacBio and SOLiD data files. However there may be specific caveats with these formats, as some break the assumptions listed above. See below for suggested compression parameters for each machine type.

Usage

The program reads from stdin and writes to stdout, so it can be used as part of a pipeline.

To compress: fqzcomp [options] < a.fastq > a.fqz

To uncompress: fqzcomp -d < a.fqz > a.fastq

Additionally in v4.0 onwards it can read/write to files instead.

fqzcomp -s5 -q3 a.fastq a.fqz
fqzcomp -d a.fqz a.fastq

Options

Compression options are as follows.

-s <level> Specifies the size of the sequence context. Increasing this will improve compression on large data sets, but each increment in level quadruples the memory used by the sequence compression step. Furthermore, setting it too high may harm compression on small files.

Specifying a "+" after the level will use two context sizes (e.g. -s5+):
a small fixed size and the <level> specified. This greatly helps
compression of small files and also slightly helps compression of
larger files.

Defaults to -s3.
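As a rough illustration of that quadrupling (a sketch under the assumption that each -s increment adds one more base, i.e. 2 bits, of context; the actual table sizes are implementation details not spelled out here):

```python
def n_contexts(level: int) -> int:
    """Hypothetical count of distinct sequence contexts for a given
    -s level, assuming each level adds one base of context and each
    base contributes 2 bits (4 possibilities). Memory therefore
    quadruples per increment, as the option description says."""
    return 4 ** level

for level in range(3, 7):
    print(f"-s{level}: {n_contexts(level):>12,} contexts")
```

This is why -s levels beyond the size of your data set stop paying off: most contexts are never seen, yet their counters still occupy memory.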

-b Use both strands when updating sequence contexts. This has no memory overhead, but it slows down compression and gives only a small gain in compression ratio. Off by default.

Do not use this with SOLiD data files.

-e By default the sequence contexts use 8-bit counters for memory efficiency. This forces 16-bit counters. It has a very small gain in compression ratio, but doubles sequence compression memory usage.
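One common way to implement such bounded counters (a generic sketch of the technique, not fqzcomp's actual code) is to halve all counts whenever one would overflow, so recent statistics dominate:

```python
def update_counts(counts: list[int], symbol: int, max_count: int = 255) -> None:
    """Increment the count for `symbol`; if it would exceed the 8-bit
    limit, halve every count (keeping each >= 1) so the model adapts
    to recent data. A 16-bit counter (-e) would raise max_count to
    65535, adapting more slowly but estimating probabilities more
    precisely -- the small ratio gain the option describes."""
    counts[symbol] += 1
    if counts[symbol] > max_count:
        for i in range(len(counts)):
            counts[i] = max(1, counts[i] // 2)

counts = [250, 10, 3, 1]
for _ in range(10):
    update_counts(counts, 0)
print(counts)  # all counts stay within the 8-bit limit
```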

-q <level> Specifies the degree of quality compression, with a value from -q1 to -q3.

-q1
    Uses the previous quality value and the maximum of the two
    previous to that as a context for predicting the next value.
    Combined this adds 12 bits of context.

-q2
    In addition to -q1, this extends the context by adding a
    single bit to indicate if the 2nd and 3rd previous qualities
    are identical to each other, as well as using 3 bits of
    context for the running-delta (a measure of how variable a
    string of quality values is). This is the default level.

-q3
    As per -q2, but also adds 4 bits worth of context holding the
    position within the sequence.

-Q <distance> Requests that quality values should be stored in an approximate fashion. This is a lossy compressor.

All quality values are stored within +/- <distance> from their
actual value, but not necessarily always the same distance.

Defaults to -Q0 (lossless).

-n <level> Controls the name compression level, from -n1 to -n2.

-n1 uses a very simple string based compressor that takes into
account the previous byte on this name and the same byte position
in the previous name, with some basic logic to try and keep
current and previous names approximately in alignment.

-n2 breaks the name down into tokens based on non-alphanumeric
characters and attempts to encode using numeric deltas where
appropriate. It typically compresses better on regular names
produced by (for example) Illumina machines, but may behave poorly
on other data. You should test which performs best for your data
set.

Defaults to -n2.
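The tokenising idea behind -n2 can be sketched as follows (illustrative Python, not the actual encoder; the regex and the delta scheme here are assumptions):

```python
import re

def tokenise(name: str) -> list[str]:
    """Split a read name into alternating alphanumeric and separator
    tokens, in the spirit of the -n2 tokeniser described above."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", name)

def numeric_deltas(prev: str, curr: str):
    """For token pairs that are both numeric, yield the delta; on
    regular Illumina-style names these deltas are tiny (often 0 or 1)
    and compress far better than the raw digit strings."""
    for p, c in zip(tokenise(prev), tokenise(curr)):
        if p.isdigit() and c.isdigit():
            yield int(c) - int(p)

print(tokenise("@HWI-ST1213:151:C1DTNACXX:3:1101:1224:2148"))
print(list(numeric_deltas("@SRR1234.100", "@SRR1234.101")))  # [1]
```

On irregular names the token streams of consecutive reads stop lining up, which is the "may behave poorly on other data" case noted above.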

-P Disables all multi-threading. This has no impact on the data format produced.

The default threading used is very minor, typically using under
200% CPU, and considerably less at high -s levels.

To uncompress, simply use -d. The compression parameters are saved in the output file so the decompression program knows what options to repeat during decompression.

Caveats

The decompressed output may not perfectly match the original data. Sometimes this intentionally generates checksum failures, although these can be glossed over by using -X on the decompressor.

Known issues are:

  1. Names in the "+" line are ignored. Eg:

  @SRR1234
  ACGTACGTACGTGGGA
  +SRR1234
  %^$&^!&((&&&&&

becomes

  @SRR1234
  ACGTACGTACGTGGGA
  +
  %^$&^!&((&&&&&

(This does not generate a checksum failure.)

  2. A strict policy is applied on the meaning of quality value 0. Specifically, any non-N base call must have a quality >0 and any N base call must have a quality of 0. These rules are enforced during compression, giving checksum failures on decompression. (Community question: should the checksum be computed after enforcing the rule, making the changes silent?)

For example:

  @SRR1234
  ACGTACGTACGTNGGA
  +SRR1234
  ^%%%$%%^!%%$$$"&

becomes

  @SRR1234
  ACGTACGTACGTNGGA
  +SRR1234
  ^%%%$%%^"%%$!$"&

The quality for base 9 (A) has been changed from ! (0) to " (1). The quality for base 13 (N) has been changed from $ to ! (0).
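The policy can be expressed as a small normalisation pass (a sketch of the rules as stated, assuming Phred+33 ASCII qualities; this is not the actual implementation):

```python
def enforce_qual_policy(seq: str, qual: str, offset: int = 33) -> str:
    """Apply the stated rules: N calls get quality 0; non-N calls
    with quality 0 are bumped up to quality 1."""
    out = []
    for base, q in zip(seq, qual):
        score = ord(q) - offset
        if base == "N":
            score = 0          # N must carry zero confidence
        elif score == 0:
            score = 1          # a real call cannot have quality 0
        out.append(chr(score + offset))
    return "".join(out)

print(enforce_qual_policy("ACGTACGTACGTNGGA", '^%%%$%%^!%%$$$"&'))
# -> ^%%%$%%^"%%$!$"&
```

Applied to the example record, this reproduces the modified quality string shown above.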

Algorithms

Many of the algorithms are derived from fqzcomp-3.0 and in turn from fqzcomp-2.1, although the primary parsing code has been completely rewritten since version 2. Each type of data gets encoded using its own model.

Names are compared against the previous name, with a certain amount of wobble to match up colons and spaces where the intervening text is of different length (essentially a poor man's name alignment). This means the context for encoding a name symbol is the character at the same position within the previous name. There is also dedicated prefix and suffix matching to help speed up encoding and reduce space further.

Sequences are converted into 2-bit values and the last N bases are used as the context for encoding the next base. The length of this context is the primary effect of changing the level from -s1 to -s9. Note it is not expected that you will have sufficient memory to cope with -s9; practically, going beyond -s6 may pose difficulties.
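The 2-bit rolling context can be sketched like this (a toy illustration; the real model's table indexing and N handling may differ):

```python
def rolling_contexts(seq: str, k: int):
    """Yield (context, base) pairs where the context packs the
    previous k bases as 2-bit codes -- the kind of index an order-k
    sequence model uses to look up its frequency tables."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    mask = (1 << (2 * k)) - 1      # keep only the last k bases
    ctx = 0
    for b in seq:
        two_bit = code.get(b, 0)   # anything else treated as A here
        yield ctx, two_bit
        ctx = ((ctx << 2) | two_bit) & mask

print(list(rolling_contexts("ACGT", 2)))  # [(0, 0), (0, 1), (1, 2), (6, 3)]
```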

N bases are replaced with the most likely of A,C,G,T - whichever takes the least space to encode - and are replaced by Ns during decoding by noting that the quality is zero. (This is enforced.)

Qualities use an order-2 model based on the previous two quality values. This can be augmented (-q2) with an additional context holding the approximate amount of variance along the sequence. By summing the differences between each quality and the next we arrive at a figure indicating whether the sequence has smooth behaviour (a gradual reduction in quality along the length of the read) or highly variable behaviour with lots of dips in quality.

It was found that this delta summation had better discrimination of quality distributions than purely the position along the sequence. However, on very large data sets the approximate position along the sequence can also be added to the context by specifying -q3. Note that this uses more memory and is slower.
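The running-delta statistic described above amounts to something like the following (an illustrative sketch, assuming Phred+33 ASCII qualities):

```python
def quality_delta(qual: str, offset: int = 33) -> int:
    """Sum of absolute differences between adjacent quality values --
    the 'variability' figure described above. Smooth reads give small
    sums; reads with many dips give large ones. A model would bucket
    this into a few bits of context rather than use the raw sum."""
    scores = [ord(c) - offset for c in qual]
    return sum(abs(a - b) for a, b in zip(scores, scores[1:]))

smooth = "IIIIHHHHGGGGFFFF"   # gradual decline along the read
noisy  = "I#I#I#I#I#I#I#I#"   # constant dips in quality
print(quality_delta(smooth), quality_delta(noisy))  # 3 570
```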

Normally quality is encoded losslessly, but the -Q option permits encoding approximate qualities. With -Q 2 we could encode quality 35 as 33, 34, 35, 36 or 37. We choose (with approximations for speed) the quality value from that set which requires the fewest bits to encode and emit that instead. This was found to be significantly better at reducing storage volumes than simply degrading the quality into a series of N-wide bins, while still closely approximating the shape of a quality profile and preserving any dips and spikes.
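The -Q selection can be sketched as: for each quality, pick whichever candidate within +/- distance the model currently rates most probable, i.e. cheapest to encode. A toy version using a plain frequency table as a stand-in for the adaptive model:

```python
def lossy_quality(q: int, distance: int, freq: dict[int, int]) -> int:
    """Choose the value in [q - distance, q + distance] that the model
    (here: a simple frequency table) rates most likely, breaking ties
    toward the true value. Fewer distinct, more predictable symbols
    mean fewer bits -- unlike fixed N-wide binning, dips and spikes
    survive because each value stays within +/- distance of the truth."""
    lo, hi = max(0, q - distance), q + distance
    return max(range(lo, hi + 1), key=lambda v: (freq.get(v, 0), -abs(v - q)))

freq = {33: 100, 34: 5, 35: 20, 36: 5, 37: 80}
print(lossy_quality(35, 2, freq))  # 33: the common value wins
print(lossy_quality(35, 0, freq))  # 35: -Q0 is lossless
```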

The entropy encoding is based on a range coder taken from coders6c2 from http://ctxmodel.net/rem.pl?-37. This range coder was written by Eugene Shelwien, but it looks probable that it derived from Dmitry Subbotin's range coder in PPMd. The modelling code is API compatible with the copy from Eugene Shelwien's coders6c2 (which looks in turn to be derived from Dmitry Shkarin's PPMd code), but the functions themselves have been completely rewritten to better fit the stationary probability models used in this data.

Platform recommendations

Different platforms give very different profiles of qualities and sequences. While all the compression parameters are valid, you may find that attempting to overcompress gives no gains (or even marginally harms compression) while still incurring additional CPU time. The space/time tradeoff will therefore differ per platform.

Illumina: -n2 -s7+ -b -q3, or for small files -n2 -s5 -q2

The bulk of development was done against Illumina files as these
were the test set for the SequenceSqueeze competition. It should
work fine with the defaults.

454: -n1 -s7+ -b -q2, or -n1 -s5 -b -q2 for smaller files

Tested to work OK, but -n1 gives a better name compression ratio
than -n2. (I have more work to do with -n2 it seems.)  For large
files try "-n1 -s7+ -b -q3", with smaller -s for small files.

PacBio: -n2 -s4+ -q1, or -n2 -s3 -q1 for smaller files

Tested to work mostly fine, but various non-N bases get turned
into Ns due to them having quality 0. Solution: replace quality 0
with quality 1 to prevent this (tr '!' '"'). How to fix this in
the code is an open question.

PacBio doesn't compress too well due to significantly higher
variability in the quality values (close to 3.7 bits per value).
Increasing -q has no real gains and in fact can harm compression.
No findings