% shrink(1) | Compression program

NAME

SHRINK - An interface to the Shrink Compression Library

SYNOPSES

shrink -h

shrink [-lrvemz] -c [in.file] [out.file]

shrink [-lrvemz] -d [in.file] [out.file]

shrink [-lremz] string

DESCRIPTION

Project:     Shrink: Small Compression Library
Maintainer:  Richard James Howe
License:     The Unlicense
Email:       howe.r.j.89@gmail.com
Website:     <https://github.com/howerj/shrink>

This project provides [LZSS][], [LZP][], and [RLE][] compression routines, neither of which do any allocation internally apart from on the stack.

The project provides the routines as a library and as a utility program. The library is suitable for inclusion in an [embedded][] project or product. Other CODECs have been added, some experimental, such as [Move-To-Front][] (meant as an aid to other CODECs) and [Elias-Gamma][] encoding.

OPTIONS

File de/compression utility, default is compress with LZSS, can use RLE. If "out.file" is not given output is written to standard out, if "in.file" and "out.file" are not given input is taken from standard in and output to standard out.

-- stop processing arguments
-t run built in self tests, zero is pass
-h print help and exit
-v verbose
-c compress
-d decompress
-l use LZSS
-r use Run Length Encoding
-e use Elias-Gamma Encoding
-z use LZP
-m use Move-To-Front Encoding
-H add hash to output, implies -v
-s # hex dump encoded string instead of file I/O

RETURN CODE

This program returns zero on success and non-zero on failure.

BUILDING

You will need [GNU Make][] and a [C][] compiler that is capable of compiling [C99][]. Type:

make

This will build a library ('libshrink.a') and an executable ('shrink', or 'shrink.exe' on Windows).

To execute a series of tests (this will require '[cmp][]' to be installed and on the [PATH][]):

make test

Which will also build the library and execute if not, and compress and decompress some files.

Running

For a full list of commands, after building, consult the build in help by giving the '-h' option:

./shrink -h

Example, compressing then decompressing a file:

./shrink -c file.txt file.smol
./shrink -d file.smol file.big

The command can be run as a filter:

./shrink < file.txt > file.smol

There is not too much to it.

C API and library integration

The [C][] [API][] is minimal, it provides just three function, a single data structure and an enumeration to select which [CODEC][] is used. All functions return negative on failure and zero on success. All function assert their inputs so long as [NDEBUG][] was not defined when the library was compiled.

The function shrink can be used with shrink_t data structure to redirect the compressors input and output arbitrarily. It requires that the user provides their own [fgetc][] and [fputc][] equivalent functions as the functions pointers get and put respectively. get returns the same as [fgetc][] and put returns the same as [fputc][]. get is passed *in and put is passed *out, as one would expect.

typedef struct {
	int (*get)(void *in);
	int (*put)(int ch, void *out);
	void *in, *out;
	size_t read, wrote;
} shrink_t;

int shrink(shrink_t *io, int codec, int encode);

The [CODEC][] will continue to consume input with get until and [EOF][] is returned or there is an error (an error could include being unable to output a byte or invalid input).

The read and wrote fields contain the number of bytes read in by get and written by put, they do not need to be updated by the [API][] user.

A common use of any compression library is encoding blocks bytes in memory, as such the common example is provided for with the function shrink_buffer. Internally it uses shrink with some internally defined callbacks for get and put. Before execution *outlength should contain the length of _*out_ and after it contains the number of bytes written to the output (and zero if shrink_buffer return an error).

int shrink_buffer(int codec, int encode, const char *in,
	size_t inlength, char *out, size_t *outlength);

codec and encode are both used in the same way as shrink.

The function shrink_tests executes a series of built in self tests that checks that the basic functionality of the library is correct. If the [NDEBUG][] macro is defined at library compile time then this function always returns success and the test functionality (and the associated code) is removed. If the tests are compiled in then a return value of zero indicates the tests completed successfully, and a negative value indicates failure.

int shrink_tests(void);

The library has minimal dependencies, just some memory related functions (specifically [memset][], [memmove][], [memchr][], and if tests are compiled in then [memcmp][] and [strlen][] are also used). [assert][] is also used. If you have to hand roll your own memory functions in an [embedded][] system, do not implement these functions naively - look into how optimize these functions, they will improve the performance of this library (and the rest of your system).

The [CODEC][] is especially sensitive to the speed at which [memchr][] matches characters, the [musl][] C library shows that optimizing these very simple string functions such as [memchr][] is non-trivial.

When the [LZSS][] [CODEC][] is used a fairly large buffer (~4KiB depending on options) is allocated on the stack. The [RLE][] [CODEC][] uses comparatively negligible resources. This large stack allocation may cause problems in embedded environments. There are two solutions to this problem, either decreasing the sliding window dictionary size (and hence decreasing the size of the stack allocation and decreasing compression efficiency) or by defining the macro USE_STATIC at compile time. This will change the allocation to be of a static storage duration, which means the [LZSS][] [CODEC][] will not be [reentrant][] nor [thread safe][].

Further customization of the library can be done by editing to [shrink.c][]. This is not usually ideal, however the library is tiny and the configurable parameters are macros at the beginning of [shrink.c][]. Parameters that can be configured include the lookahead buffer size, the dictionary windows size, and the minimal match length of the [LZSS][] [CODEC][], the [RLE][] [CODEC][] is also configurable, the only parameter is the run length however.

The test driver program [main.c][] contains an example of stream redirection that reads from a [FILE][] handle. This is not part of the library itself as the [FILE][] set of functionality is not usually available in embedded systems.

Whilst every effort was made to ensure program correctness the library is written in an unsafe language, do not return the [CODEC][] on untrusted input.

It is a shame that [C][] lacks a facility for [coroutines][] built into the language and available in a fully portable way, as they are a very easy way to make code non-blocking (useful in an cooperating multitasking environment) but they are also the easiest way to make programs that can chain partially produced output and feed it into consumer, for example chaining the output of the [RLE][] [CODEC][] into the input of the [LZSS][] [CODEC][]. [C][] does not offer this facility and the only way to implement this start/stop functionality portably would complicate the internals of the library greatly, so it has not implemented, perhaps later.

There are many more ways of improving this library; more CODECs, improved speed, etcetera. They will not be implemented as the idea of this library is simplicity and a small size.

CODECS

There are two CODECS available, [RLE][] (Run Length Encoding) and [LZSS][] (a type of dictionary encoding). The former is best for data that has lots of repetitions (such as sparse files contain many zeros), whereas [LZSS][] can be used with arbitrary (compressible) data. The performance of [LZSS][] is pretty poor in terms of its compression ratio, and the performance of the compressor code itself could be improved in terms of its speed. Both CODECS are small.

RLE

The Run Length Encoder works on binary data, it encodes successive runs of bytes as a length and the repeated byte up to runs of 127, or until a differing byte is found. This is then emitted as two bytes; the length and the byte to repeat. Input data with no runs in it is encoded as the length of the data that differs plus that data, up to 128 bytes can be encoded this way. If the first byte is larger than 128 then a run of literal data is encoded, if it is less, then it encodes a run of data.

The encoded data looks like this, for a run of literals:

 Command Byte       Data Byte(s)
.------------.--------------------------.
| 128-255    | Literal Data ....        |
.------------.--------------------------.

And this for a repetition of characters:

 Command Byte Data Byte
.------------.---------.
| 0-127      | Byte    |
.------------.---------.

An error occurs if a command byte is not follow by the requisite number of data bytes. How the encoding and decoding works should be trivial - examine the input for sequences of characters greater than the break-even point and output the repetition command, else output a literal run. [LZSS][] requires more explanation.

The number of repetitions has 2 added to it, so it can encode between 2 and 129 characters. There is no point in encoding run lengths less than or equal to 2, so this is used to boost the maximum run length.

LZSS

[LZSS][] belongs to the [dictionary compression][] family of lossless compression algorithms. They work by building up a dictionary and replacing strings with smaller references to those strings in the dictionary. The difference between all of the [dictionary compression][] algorithms is in the details of how matches are encoded and how things are evicted from the dictionary (mostly).

[LZSS][] encodes the data into references into the dynamically gen

Shrink

Install / Use

README