gztool

GZIP files indexer, compressor and data retriever. Create small indexes for gzipped files and use them for quick and random data extraction. No more waiting when the end of a 10 GiB gzip is needed!

See Installation for Ubuntu, the Release page for executables for your platform, and Compilation in case you want to compile the tool.

A magic file is also provided so that the Linux file command correctly identifies gztool's index files: append it to your /etc/magic file (or overwrite that file if it is empty), or append/copy it to your home directory as ~/.magic (note the leading dot in the name).

Considerations

Please note that creating the initial complete index for an already gzip-compressed file (-i) still takes as long as decompressing the whole file.
Once created, the index reduces access time to the gzip file data.

Nonetheless, gztool creates the index interleaved with the extraction of data (-b), so in practice no time is wasted. If data extraction or index creation is stopped at any moment, gztool will reuse the partial index left on disk on the next run over the same data, so time consumption is always minimized.

gztool can also monitor a growing gzip file (for example, a log created by rsyslog directly in gzip format) and generate the index on-the-fly, which in practice reduces the index creation time to zero. See the -S (Supervise) option.
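
Following a growing file is feasible because a gzip stream can be decompressed incrementally: each newly appended piece is processed once and earlier bytes are never read again. The sketch below only illustrates that idea with Python's zlib module on a single gzip member. It is not gztool's implementation, it records offsets rather than a usable index, and the class name GrowingGzipReader is hypothetical.

```python
import gzip
import zlib


class GrowingGzipReader:
    """Decompress a gzip stream piece by piece, remembering how far it got."""

    def __init__(self):
        self._inflater = zlib.decompressobj(31)  # 31 = expect a gzip header
        self.compressed_seen = 0     # bytes of the .gz handed over so far
        self.uncompressed_seen = 0   # bytes of output produced so far

    def feed(self, new_bytes):
        """Process only the bytes appended since the previous call."""
        output = self._inflater.decompress(new_bytes)
        self.compressed_seen += len(new_bytes)
        self.uncompressed_seen += len(output)
        return output


# Simulate a log that reaches the disk in three bursts.
original = b"".join(b"event %d\n" % i for i in range(50000))
gz = gzip.compress(original)
third = len(gz) // 3
bursts = [gz[:third], gz[third:2 * third], gz[2 * third:]]

reader = GrowingGzipReader()
collected = b""
for burst in bursts:
    collected += reader.feed(burst)  # earlier bytes are never re-read
    print(reader.compressed_seen, reader.uncompressed_seen)

print(collected == original)
```

A real monitor would additionally wait between reads when no new data is available and would store proper index points as it goes.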

Index size is about 0.33% of a compressed gzip file size if the index is created after the file was compressed, or 10-100 times smaller (just a few Bytes or kiB) if gztool itself compresses the data (-c).
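
The text does not say why an index built while compressing is so much smaller. One plausible mechanism, offered here as an assumption and not as a description of gztool's internals, is that a compressor can discard its history at chosen points (Z_FULL_FLUSH in zlib), so that decompression can restart there without any previously decompressed data being stored. A minimal sketch with Python's zlib module:

```python
import zlib

part1 = b"".join(b"log line %d\n" % i for i in range(20000))
part2 = b"".join(b"log line %d\n" % i for i in range(20000, 40000))

# Write a gzip stream (wbits=31). Z_FULL_FLUSH ends the current block on a
# byte boundary AND resets the compressor's history, so nothing written
# after this point refers back to data before it.
compressor = zlib.compressobj(6, zlib.DEFLATED, 31)
head = compressor.compress(part1) + compressor.flush(zlib.Z_FULL_FLUSH)
tail = compressor.compress(part2) + compressor.flush()
gz = head + tail

# The access point is just a pair of offsets: no saved data is required.
access_point = (len(head), len(part1))  # (compressed offset, uncompressed offset)

# Resume with a raw inflater (wbits=-15): the gzip header sits at the start
# of the file, not at the access point.
inflater = zlib.decompressobj(-15)
recovered = inflater.decompress(gz[access_point[0]:])

print(recovered == part2)
print(zlib.decompress(gz, 31) == part1 + part2)  # still an ordinary gzip file
```

The trade-off of this approach is a slightly worse compression ratio, because matches cannot cross a reset point.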

Note that the size of the index depends on the span between index points on the uncompressed stream. By default it is 10 MiB: this means that when retrieving randomly located data, only 10/2 = 5 MiB of uncompressed data must be decompressed on average, no matter the size of the gzip file, which is a fairly low value!
The span between index points can be adjusted with the -s (span) option (the minimum is -s 1, or 1 MiB).
For example, a span of -s 20 will create indexes half the size, and -s 5 will create indexes twice as large.
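
The trade-off between span, index size and average decompression work can be put into numbers. The sketch below is only a back-of-the-envelope estimate: it treats the 0.33% figure quoted above as exact for the default 10 MiB span (index created after compression) and assumes that index size is inversely proportional to the span, as the -s 20 and -s 5 examples suggest. The function name estimate_index is hypothetical and not part of gztool.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

DEFAULT_SPAN_MIB = 10            # default span between index points
INDEX_RATIO_AT_DEFAULT = 0.0033  # ~0.33% of the compressed size (figure from the text)


def estimate_index(compressed_bytes, span_mib=DEFAULT_SPAN_MIB):
    """Return (approx. index size in bytes, average MiB decompressed per random read)."""
    if span_mib < 1:
        raise ValueError("the minimum span is 1 MiB")
    # Assumption: index size is inversely proportional to the span.
    index_bytes = compressed_bytes * INDEX_RATIO_AT_DEFAULT * DEFAULT_SPAN_MIB / span_mib
    # A random position lies, on average, half a span after the previous index point.
    average_decompressed_mib = span_mib / 2
    return index_bytes, average_decompressed_mib


for span in (5, 10, 20):
    size, avg = estimate_index(10 * GIB, span)
    print(f"-s {span:>2}: index ~{size / MIB:5.1f} MiB, ~{avg:g} MiB decompressed per random read")
```

A larger span saves disk space at the cost of more decompression per random read; a smaller span does the opposite.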

Background

By default, gzip-compressed files cannot be accessed randomly: reading a byte at position N requires decompressing the gzip file from the beginning up to byte N.
Nonetheless Mark Adler, the author of zlib, provided years ago a cryptic file named zran.c that creates an "index" of "windows" filled with 32 kiB of uncompressed data at different positions along the un/compressed file, which can be used to initialize the zlib library and make it behave as if the compressed data began there.
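
The role of a "window" can be illustrated with Python's zlib module. This is a simplified sketch of the idea, not a port of zran.c: here the stream is produced by the example itself and a Z_SYNC_FLUSH is used to obtain a byte-aligned block boundary, whereas zran.c finds access points (at arbitrary bit positions) in files it did not compress. The dictionary layout of index_point is hypothetical.

```python
import zlib

WINDOW_SIZE = 32 * 1024  # deflate back-references reach at most 32 KiB back

# Two chunks of sample text; the second resembles the first, so its
# compressed form can refer back to earlier data.
part1 = b"".join(b"log line %d\n" % i for i in range(20000))
part2 = b"".join(b"log line %d\n" % i for i in range(20000, 40000))

# Compress as one raw deflate stream (wbits=-15). Z_SYNC_FLUSH ends the
# current block on a byte boundary WITHOUT resetting the compressor's history.
compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
head = compressor.compress(part1) + compressor.flush(zlib.Z_SYNC_FLUSH)
tail = compressor.compress(part2) + compressor.flush()
stream = head + tail

# An index point: where to resume in both streams, plus the window, i.e.
# the last 32 KiB of uncompressed data that precede that position.
index_point = {
    "compressed_offset": len(head),
    "uncompressed_offset": len(part1),
    "window": part1[-WINDOW_SIZE:],
}

# Resume at the index point, priming zlib with the window as a preset
# dictionary instead of decompressing `head` again.
inflater = zlib.decompressobj(-15, index_point["window"])
recovered = inflater.decompress(stream[index_point["compressed_offset"]:])

print(recovered == part2)
```

Each index point therefore costs up to 32 kiB before compression, which is why the number of points (and hence the span) dominates the index size.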

gztool builds upon zran.c to provide a useful command line tool.
Also, some optimizations and brand new behaviours have been added:

gztool can correctly read incomplete gzip-concatenated-files (using -p), that is, a gzip composed of a concatenation of gzip files, some of which are not correctly terminated. This can happen, for example, when using rsyslog's veryRobustZip omfile option and the logging process is abruptly terminated and then restarted.
gztool can store line numbering information in the index (use only if the source data is text!), and retrieve data from a specific line number using -L. (Using -[xXz] when creating the index selects Unix new line format (default), old Mac new line format, or no line information, respectively.)
gztool can Supervise a still-growing gzip file (for example, a log created by rsyslog directly in gzip format) and generate the index on-the-fly, which in practice reduces the index creation time to zero. See -S.
Extraction of data and index creation are interleaved, so no time is wasted on index creation.
Index files are reusable, so index creation can be stopped at any time and the index reused and/or completed later.
A new index file format has been created from scratch to store the index.
The span between index points is raised by default from 1 MiB to 10 MiB, and can be adjusted with -s (span).
Windows are compressed in the index file.
Windows are not loaded in memory unless they're needed, so the application memory footprint is fairly low (< 1 MiB).
gztool can compress files (-c) and at the same time generate an index that is about 10-100 times smaller than if the index is generated after the file has already been compressed with gzip.
Compatible with bgzip files (short-uncompressed complete-gzip-block sizes).
Compatible with complete gzip-concatenated-files (aka gzip members).
Compatible with rsyslog's veryRobustZip omfile option (variable-short-uncompressed complete-gzip-block sizes).
Data can be provided from/to stdin/stdout.
gztool can be used to remotely retrieve just a small part of a bigger gzip compressed file and successfully decompress it locally. See this stackexchange thread. Just note that the gztool index file must also be available.
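
To make the terms "gzip members" and "incomplete" concrete, the following sketch walks over a concatenation of gzip members with Python's zlib module and reports whether each one was properly terminated. It illustrates the file layout only and says nothing about how gztool's -p option works; the function name read_members is hypothetical.

```python
import gzip
import zlib


def read_members(blob):
    """Return a list of (data, complete) pairs, one per gzip member in blob."""
    members = []
    while blob:
        inflater = zlib.decompressobj(31)     # 31 = gzip wrapper
        data = inflater.decompress(blob)
        members.append((data, inflater.eof))  # eof stays False if the member was cut short
        blob = inflater.unused_data           # bytes that follow this member, if any
    return members


first = gzip.compress(b"first member\n" * 1000)
second = gzip.compress(b"second member\n" * 1000)

# A complete concatenation: both members end properly.
complete = read_members(first + second)
print([ok for _, ok in complete])

# An incomplete one: the last member lost its 8-byte trailer.
truncated = read_members(first + second[:-8])
print([ok for _, ok in truncated])
```

This simple loop only copes with a member that is cut short at the very end of the file. A member left unterminated in the middle, followed by a fresh one, needs more careful handling than shown here.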

Installation

gztool is available directly via apt-get install gztool in Debian 11 and later, and in Ubuntu 20.10 (Groovy Gorilla) and later.

In Ubuntu, using my repository:

  sudo add-apt-repository ppa:roberto.s.galende/gztool
  sudo apt-get update
  sudo apt-get install gztool

See the Release page for executables for your platform, including Windows. If none fits your needs, gztool is very easy to compile: see the next sections.

Compilation

The libz.a archive library is needed in order to compile gztool: the package providing it is zlib1g-dev (this may vary on your system):

$ sudo apt-get install zlib1g-dev

$ gcc -O3 -o gztool gztool.c -lz -lm

If you wish you can use autoconf to check the dependencies, build and test gztool:

$ autoreconf && ./configure && make check

This will produce a binary in gztool.

Compilation in Windows

Compilation in Windows is possible using gcc for Windows and compiling the original zlib code to obtain the needed archive library libz.a.
Please, note that executables for different platforms are provided on the Release page.

Download gcc for Windows: mingw-w64
Install it and add the path to gcc.exe to your Windows PATH
Download the zlib code and compile it with your new gcc: zlib
The previous step generates the file libz.a that you need to compile gztool: copy gztool.c to the directory where you compiled zlib, and run:

gcc -static -O3 -I. -o gztool gztool.c libz.a -lm

Usage

  gztool (v1.8.0)
  GZIP files indexer, compressor and data retriever.
  Create small indexes for gzipped files and use them
  for quick and random-positioned data extraction.
  No more waiting when the end of a 10 GiB gzip is needed!
  //github.com/circulosmeos/gztool (by Roberto S. Galende)

  $ gztool [-[abLnsv] #] [-[1..9]AcCdDeEfFhilpPqrRStTwWxXzZ|u[cCdD]] [-I <INDEX>] <FILE>...

  Note that actions `-bcStT` proceed to an index file creation (if
  none exists) INTERLEAVED with data flow. As data flow and
  index creation occur at the same time there's no waste of time.
  Also you can interrupt actions at any moment and the remaining
  index file will be reused (and completed if necessary) on the
  next gztool run over the same data.

 -[1..9]: Factor of compression to use with `-[c|u[cC]]`, from
     best speed (`-1`) to best compression (`-9`). Default is `-6`.
 -a #: Await # seconds between reads when `-[ST]|Ec`. Default is 4 s.
 -A: Modifier for `-[rR]` to indicate the range of bytes/lines in
     absolute values, instead of the default incremental values.
 -b #: extract data from indicated uncompressed byte position of
     gzip file (creating or reusing an index file) to STDOUT.
     Accepts '0', '0x', and suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).
 -C: always create a 'Complete' index file, ignoring possible errors.
 -c: compress a file like with gzip, creating an index at the same time.
 -d: decompress a file like with gzip.
 -D: do not delete original file when using `-[cd]`.
 -e: if multiple files are indicated, continue on error (if any).
 -E: end processing on first GZIP end of file marker at EOF.
     Nonetheless with `-c`, `-E` waits for more data even at EOF.
 -f: force file overwriting if index file already exists.
 -F: force index creation/completion first, and then action: if
     `-F` is not used, index is created interleaved with actions.
 -h: print brief help; `-hh` prints this help.
 -i: create index for indicated gzip file (For 'file.gz' the default
     index file name will be 'file.gzi'). This is the default action.
 -I string: index file name will be
