
Pucrunch

pucrunch, an Optimizing Hybrid LZ77 RLE Data Compression Program for C64/C128/VIC-20/Plus4

Install / Use

/learn @mist64/Pucrunch
About this skill

Quality Score: 0/100
Supported Platforms: Universal

README

<html><head> <title>Lossless Data Compression Program: Hybrid LZ77 RLE</title> <link rev=made href="mailto:a1bert&#64;iki.fi"> <link rel=parent href="index.html"> </head> <body> This page is <A HREF="http://www.iki.fi/a1bert/Dev/pucrunch/"> http://www.iki.fi/a1bert/Dev/pucrunch/</A><BR> Other C64-related stuff -&gt; <A HREF="http://www.iki.fi/a1bert/Dev/"> http://www.iki.fi/a1bert/Dev/</A><BR> <HR> Pucrunch found elsewhere in random order:<BR> <A HREF="http://www.smspower.org/maxim/smssoftware/pucrunch.html">Maxim's World of Stuff - SMS/Meka stuff</A> <BR> <A HREF="http://www.devrs.com/gb/asmcode.php#asmcomp">GameBoy Dev'rs - Compression</A> <BR> <A HREF="http://members.iinet.net.au/~freeaxs/gbacomp/">Headspin's Guide to Compression, Files Systems Screen Effects and MOD Players for the Gameboy Advance </A> <BR> <A HREF="http://hiwaay.net/~jfrohwei/gameboy/libs.html">Jeff Frohwein's GameBoy Libs</A> <BR> <A HREF="http://www.dwedit.org/crunchyos/manual/crnchyos.html">CrunchyOS - an assembly shell for the TI83+ and TI83+SE calculators</A> <BR> <HR> First Version 14.3.1997, Rewritten Version 17.12.1997<BR> Another Major Update 14.10.1998<BR> Last Updated 22.11.2008<BR> Pasi Ojala, <A HREF="http://www.iki.fi/a1bert/"><EM>a1bert&#64;iki.fi</EM></A><BR> <H2>An Optimizing Hybrid LZ77 RLE Data Compression Program, aka</H2> <H2>Improving Compression Ratio for Low-Resource Decompression</H2>

Short:<BR> Pucrunch is a hybrid LZ77 and RLE compressor. It uses an Elias Gamma Code for lengths, a mixture of Gamma Code and linear encoding for the LZ77 offset, and ranked RLE bytes indexed by the same Gamma Code. The decompressor uses no extra memory.

<P> <A HREF="#Progs">Programs</A> - Source code and executables <HR> <H2>Introduction</H2>

Since I started writing demos for the C64 in 1989 I have always wanted to program a compression program. I had a lot of ideas, but never the time, urge, or the necessary knowledge and patience to create one. In retrospect, most of the ideas I had then were simply bogus ("magic function theory", as Mark Nelson nicely puts it). But years passed, I gathered more knowledge, and finally got an irresistible urge to realize my dream.

<P> The nice thing about the delay is that I don't need to write the actual compression program to run on a C64 anymore. I can write it in portable ANSI-C code and just program it to create files that uncompress themselves when run on a C64. Running the compression program outside of the target system provides at least the following advantages. <UL> <LI> I can use portable ANSI-C code. The compression program can be compiled to run on a Unix box, Amiga, PC etc. And I have all the tools to debug the program and gather profiling information to see why it is so slow :-) <LI> The program runs much faster than on a C64. If it is still slow, there is always multitasking to allow me to do something else while I'm compressing a file. <LI> There is 'enough' memory available. I can use all the memory I possibly need and use every trick possible to increase the compression ratio, as long as the decompression remains possible and feasible on a C64. <LI> Large files can be compressed as easily as shorter files. Most C64 compressors can't handle files larger than around 52-54 kilobytes (210-220 disk blocks). <LI> Cross-development is easier because you don't have to transfer a file to a C64 just to compress it. <LI> It is now possible to use the same program to handle VIC20 and C16/+4 files as well. Previously it hasn't been possible to compress VIC20 files at all; at least I don't know of any other VIC20 compressor. <P> </UL> <P> <H4>Memory Refresh and Terms for Compression</H4> <DL> <DT> Statistical compression <DD> Uses the uneven probability distribution of the source symbols to shorten the average code length. Huffman code and arithmetic code belong to this group. By giving a short code to the symbols occurring most often, the number of bits needed to represent the symbols decreases. Think of the Morse code, for example: the characters you need more often have shorter codes, and it takes less time to send the message.
<P> <DT> Dictionary compression <DD> Replaces repeating strings in the source with shorter representations. These may be indices into an actual dictionary (Lempel-Ziv 78) or pointers to previous occurrences (Lempel-Ziv 77). As long as it takes fewer bits to represent the reference than the string itself, we get compression. LZ78 is a lot like the way BASIC substitutes tokens for keywords: one-byte tokens expand to whole words like PRINT#. LZ77 replaces repeated strings with (length,offset) pairs, thus the string <TT>VICIICI</TT> can be encoded as <TT>VICI(3,3)</TT> -- the repeated occurrence of the string <TT>ICI</TT> is replaced by a reference. <P> <DT> Run-length encoding <DD> Replaces repeating symbols with a single occurrence of the symbol and a repeat count. For example, assemblers have a <TT>.zero</TT> directive or equivalent to fill a number of bytes with zero without needing to list them all in the source code. <P> <DT> Variable-length code <DD> Any code where the length of a code word is not explicitly known but depends on the bit values themselves. Some kind of end marker or length count must be provided to make the code a prefix code (uniquely decodable). For ASCII (or Latin-1) text you know you get the next letter by reading a full byte from the input. A variable-length code requires you to read part of the data to know how many bits to read next. <P> <DT> Universal codes <DD> Universal codes are used to encode integer numbers without needing to know the maximum value in advance. Smaller integer values usually get shorter codes. Different universal codes are optimal for different distributions of the values. Universal codes include the Elias Gamma and Delta codes, the Fibonacci code, and the Golomb and Rice codes. <P> <DT> Lossless compression <DD> Lossless compression algorithms are able to exactly reproduce the original contents, unlike lossy compression, which omits details that are not important or not perceivable by the human sensory system.
This article only talks about lossless compression. <P> </DL> <P> My goal in the pucrunch project was to create a compression system in which the decompressor uses minimal resources (both memory and processing power) and still achieves the best possible compression ratio. A nice bonus would be if it outperformed every other compression program available. These understandably opposing requirements (minimal resources and a good compression ratio) rule out most of the state-of-the-art compression algorithms and in effect leave only RLE and LZ77 to be considered. Another goal was to learn something about data compression, and that goal at least has been satisfied. <P> I started by developing a byte-aligned LZ77+RLE compressor/decompressor and then added a Huffman backend to it. The Huffman tree takes 384 bytes and the code that decodes the tree into an internal representation takes 100 bytes. I found out that while the Huffman code gained about 8% in my 40-kilobyte test files, the gain was reduced to only about 3% after accounting for the extra code and the Huffman tree. <P> Then I started a more detailed analysis of the LZ77 offset and length values and the RLE values and concluded that I would get better compression by using a variable-length code. I used a simple variable-length code and scrapped the Huffman backend code, as it didn't increase the compression ratio anymore. This version became pucrunch. <P> Pucrunch does not use byte-aligned data and is a bit slower than the byte-aligned version because of this, but it is much faster than the original version with the Huffman backend attached. And pucrunch still does very well compression-wise, beating even LhA, Zip, and GZip in some cases. But let's not get too far ahead of ourselves. <P> To improve the compression ratio for LZ77, only a few options remain.
We can improve the encoding of literal bytes (bytes that are not compressed), we can reduce the number of literal bytes we need to encode, and we can shorten the encoding of RLE and LZ77. In the algorithm presented here all of these improvement areas are addressed both collectively (one change affects more than one area) and one at a time. <P> <OL> <LI> By using a variable-length code we can gain compression for even 2-byte LZ77 matches, which in turn reduces the number of literal bytes we need to encode. Most LZ77 variants require 3-byte matches to get any compression, because they use so many bits to identify the length and offset values that the code becomes longer than the original bytes would've taken. <P> <LI> By using a new literal byte tagging system, which distinguishes uncompressed and compressed data efficiently, we can reduce the number of extra bits needed to make this distinction (the encoding overhead for literal bytes). This is especially important for files that do not compress well. <P> <LI> By using RLE in addition to LZ77 we can shorten the encoding for long byte runs and at the same time set a convenient upper limit on the LZ77 match length. The upper limit performs two functions: <UL> <LI> we only need to encode integers in a specific range <LI> we only need to search strings shorter than this limit (if we find a string long enough, we can stop there) </UL> Short byte runs are compressed either using RLE or LZ77, whichever gets the better result. <P> <LI> By doing statistical compression (more frequent symbols get shorter representations) on the RLE bytes (in this case symbol ranking) we can gain compression for even 2-byte run lengths, which in turn reduces the number of literal bytes we need to encode. <P> <LI> By carefully selecting which string matches and/or run lengths to use we can take advantage of the variable-length code. It may be advantageous to compress a string as two shorter matches instead of one longer match. </OL>
View on GitHub

GitHub Stars: 64
Forks: 17
Category: Development
Languages: C
Updated: 1mo ago

Security Score: 80/100 (audited Feb 16, 2026; no findings)