Gztool
extract random-positioned data from gzip files with no penalty, including gzip tailing like with 'tail -f' !
gztool
GZIP files indexer, compressor and data retriever. Create small indexes for gzipped files and use them for quick and random data extraction. No more waiting when the end of a 10 GiB gzip is needed!
See Installation for Ubuntu, the Release page for executables for your platform, and Compilation in case you want to compile the tool.
Also, a magic file is provided so that the Linux file command correctly identifies gztool's index files: append it to your /etc/magic file (or overwrite that file if it is empty), or append/copy it to your home directory as ~/.magic (note the leading dot in the name).
Considerations
- Please note that the initial creation of a complete index for an already gzip-compressed file (-i) still takes as long as a complete decompression of the file.
Once created, the index reduces access time to the gzip file data.
Nonetheless, gztool creates the index interleaved with the extraction of data (-b), so in practice no time is wasted. If data extraction or index creation is stopped at any moment, gztool will reuse the index written so far on the next run over the same data, so time consumption is always minimized.
Also, gztool can monitor a growing gzip file (for example, a log created by rsyslog directly in gzip format) and generate the index on the fly, which in practice reduces the index creation time to zero. See the -S (Supervise) option.
- Index size is about 0.33% of the compressed gzip file size if the index is created after the file was compressed, or 10-100 times smaller (just a few bytes or kiB) if gztool itself compresses the data (-c).
Note that the size of the index depends on the span between index points on the uncompressed stream, which is 10 MiB by default: when retrieving randomly located data, only 10/2 = 5 MiB of data must be decompressed on average, no matter the size of the gzip file, which is a fairly low value!
The span between index points can be adjusted with the -s (span) option (the minimum is -s 1, or 1 MiB).
For example, a span of -s 20 will create indexes half the size, and -s 5 will create indexes twice as large.
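The trade-off between index size and access cost can be estimated with a few lines of shell. This is only a back-of-the-envelope sketch: it assumes the approximate 0.33% figure and the inverse scaling with span quoted above, and the function name estimate_index and the 10 GiB file size are made up for illustration.

```shell
#!/bin/sh
# estimate_index GZIP_SIZE_MIB SPAN_MIB
# Prints: estimated index size (MiB) and average data decompressed per access (MiB).
# Assumption: index is ~0.33% of the gzip size at the default 10 MiB span,
# and scales inversely with the span.
estimate_index() {
    awk -v gz="$1" -v s="$2" 'BEGIN { printf "%.1f %.1f\n", gz * 0.0033 * 10 / s, s / 2 }'
}

gz_mib=10240   # hypothetical 10 GiB gzip file, expressed in MiB
for span in 5 10 20; do
    set -- $(estimate_index "$gz_mib" "$span")
    echo "span=$span MiB: index ~$1 MiB, average decompression $2 MiB"
done
```

For the hypothetical 10 GiB file this gives roughly 67.6, 33.8 and 16.9 MiB of index for spans of 5, 10 and 20 MiB. Real index sizes will vary with the data.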
Background
By default, gzip-compressed files cannot be accessed at random: reading a byte at position N requires decompressing the gzip file from the beginning up to byte N.
Nonetheless Mark Adler, the author of zlib, provided years ago a cryptic file named zran.c that creates an "index" of "windows" filled with 32 kiB of uncompressed data at different positions along the un/compressed file, which can be used to initialize the zlib library and make it behave as if compressed data begin there.
gztool builds upon zran.c to provide a useful command line tool.
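To make the cost of sequential access concrete, here is a small sketch that uses only standard tools (seq, gzip, tail, head). The file name sample.gz and the offset are placeholders. Without an index, everything before the requested offset is inflated and discarded. The final step uses only the -b option described in Usage, and is skipped if gztool is not installed.

```shell
#!/bin/sh
# Build a sample gzip of roughly 14 MB of uncompressed text (sample.gz is a placeholder name).
seq 1 2000000 | gzip -c > sample.gz

# Without an index: inflate from the start, skip $offset bytes, keep the next 64.
offset=10000000
gzip -dc sample.gz | tail -c +"$((offset + 1))" | head -c 64

# With gztool (if available): -b creates or reuses an index and extracts from that position.
if command -v gztool >/dev/null 2>&1; then
    gztool -b "$offset" sample.gz | head -c 64
else
    echo "gztool not found in PATH; skipping indexed extraction" >&2
fi
```

On a file this small the difference is negligible; it becomes noticeable when the requested position is far into a multi-GiB file.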
Also, some optimizations and brand new behaviours have been added:
- gztool can correctly read incomplete gzip-concatenated-files (using -p), that is, a gzip composed of a concatenation of gzip files, some of which are not correctly terminated. This can happen, for example, when using rsyslog's veryRobustZip omfile option and the process that is logging is abruptly terminated and then restarted.
- gztool can store line numbering information in the index (use only if source data is text!), and retrieve data from a specific line number using -L. (Using -[xXz] when creating the index selects Unix new line format (default), old Mac new line format, or no line information, respectively.)
- gztool can Supervise a still-growing gzip file (for example, a log created by rsyslog directly in gzip format) and generate the index on the fly, which in practice reduces the index creation time to zero. See -S.
- extraction of data and index creation are interleaved, so there's no waste of time for the index creation.
- index files are reusable, so index creation can be stopped at any time and the index reused and/or completed later.
- an ex novo index file format has been created to store the index
- span between index points is raised by default from 1 MiB to 10 MiB, and can be adjusted with -s (span).
- windows are compressed in the index file
- windows are not loaded in memory unless they're needed, so the application memory footprint is fairly low (< 1 MiB)
- gztool can compress files (-c) and at the same time generate an index that is about 10-100 times smaller than if the index is generated after the file has already been compressed with gzip.
- Compatible with bgzip files (short-uncompressed complete-gzip-block sizes)
- Compatible with complete gzip-concatenated-files (aka gzip members)
- Compatible with rsyslog's veryRobustZip omfile option (variable-short-uncompressed complete-gzip-block sizes)
- data can be provided from/to stdin/stdout
- gztool can be used to remotely retrieve just a small part of a bigger gzip compressed file and successfully decompress it locally. See this stackexchange thread. Just note that the gztool index file must also be available.
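A few of the behaviours listed above can be tried on throwaway data. The sketch below is illustrative only: all file names are placeholders, the gztool calls use only options described in this README (-i, -L, -c, -D), and they are skipped if gztool is not installed. The first part builds a file made of two complete gzip members using plain gzip.

```shell
#!/bin/sh
# Two complete gzip members concatenated into one file (placeholder names).
seq 1 50000      | gzip -c >  members.gz
seq 50001 100000 | gzip -c >> members.gz

# Plain gzip reads the concatenation as one continuous stream of lines.
gzip -dc members.gz | wc -l

if command -v gztool >/dev/null 2>&1; then
    # Index the concatenated file, then extract starting at a given line number.
    gztool -i members.gz
    gztool -L 75000 members.gz | head -n 3

    # Compress and index in one pass; -D keeps the original file.
    seq 1 100000 > numbers.txt
    gztool -c -D numbers.txt
else
    echo "gztool not found in PATH; skipping gztool steps" >&2
fi
```

Supervising a growing file with -S is not included here because it keeps running until interrupted.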
Installation
- gztool is available directly with apt-get install gztool in Debian 11 and later, and in Ubuntu Groovy Gorilla (20.10) and above.
- In Ubuntu, using my repository:
sudo add-apt-repository ppa:roberto.s.galende/gztool
sudo apt-get update
sudo apt-get install gztool
- See the Release page for executables for your platform, including Windows. If none fit your needs, gztool is very easy to compile: see the next sections.
Compilation
The zlib archive library (libz.a) is needed in order to compile gztool: the package providing it is zlib1g-dev (this may vary on your system):
$ sudo apt-get install zlib1g-dev
$ gcc -O3 -o gztool gztool.c -lz -lm
If you wish you can use autoconf to check the dependencies, build and
test gztool:
$ autoreconf && ./configure && make check
This will produce a binary in gztool.
Compilation in Windows
Compilation in Windows is possible using gcc for Windows and compiling the original zlib code to obtain the needed archive library libz.a.
Please, note that executables for different platforms are provided on the Release page.
- Download gcc for Windows: mingw-w64
- Install it and add the path for gcc.exe to your Windows PATH
- Download zlib code and compile it with your new gcc: zlib
- The previous step generates the file libz.a that you need to compile gztool: copy gztool.c to the directory where you compiled zlib, and run:
gcc -static -O3 -I. -o gztool gztool.c libz.a -lm
Usage
gztool (v1.8.0)
GZIP files indexer, compressor and data retriever.
Create small indexes for gzipped files and use them
for quick and random-positioned data extraction.
No more waiting when the end of a 10 GiB gzip is needed!
//github.com/circulosmeos/gztool (by Roberto S. Galende)
$ gztool [-[abLnsv] #] [-[1..9]AcCdDeEfFhilpPqrRStTwWxXzZ|u[cCdD]] [-I <INDEX>] <FILE>...
Note that actions `-bcStT` proceed to an index file creation (if
none exists) INTERLEAVED with data flow. As data flow and
index creation occur at the same time there's no waste of time.
Also you can interrupt actions at any moment and the remaining
index file will be reused (and completed if necessary) on the
next gztool run over the same data.
-[1..9]: Factor of compression to use with `-[c|u[cC]]`, from
best speed (`-1`) to best compression (`-9`). Default is `-6`.
-a #: Await # seconds between reads when `-[ST]|Ec`. Default is 4 s.
-A: Modifier for `-[rR]` to indicate the range of bytes/lines in
absolute values, instead of the default incremental values.
-b #: extract data from indicated uncompressed byte position of
gzip file (creating or reusing an index file) to STDOUT.
Accepts '0', '0x', and suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).
-C: always create a 'Complete' index file, ignoring possible errors.
-c: compress a file like with gzip, creating an index at the same time.
-d: decompress a file like with gzip.
-D: do not delete original file when using `-[cd]`.
-e: if multiple files are indicated, continue on error (if any).
-E: end processing on first GZIP end of file marker at EOF.
Nonetheless with `-c`, `-E` waits for more data even at EOF.
-f: force file overwriting if index file already exists.
-F: force index creation/completion first, and then action: if
`-F` is not used, index is created interleaved with actions.
-h: print brief help; `-hh` prints this help.
-i: create index for indicated gzip file (For 'file.gz' the default
index file name will be 'file.gzi'). This is the default action.
-I string: index file name will be
