<img src="plots/logo.png" alt="logo" width="50"/> index4j

This repository is a Java library developed by Dynatrace that implements the FM-Index succinct data structure. The FM-Index takes advantage of the relationship between the suffix array and the Burrows-Wheeler transform to enable both compression and fast queries.

This index is specifically suited for compressing and querying text files without requiring prior tokenization. This means that you can compress a text file and then query it for arbitrary substrings without the need to fully decompress it, i.e., only the compressed size is required to be in memory. Furthermore, query performance depends on the number of located matches and total decompressed bytes, rather than whether the query is aligned with a particular token.

In addition, this repository also contains further data structures for working with bit vectors and integer sequences in Java.

First steps

To add a dependency on index4j using Maven, use the following:

<dependency>
    <groupId>com.dynatrace.index4j</groupId>
    <artifactId>indices</artifactId>
    <version>0.3.1</version>
</dependency>

To add the dependency using Gradle:

implementation group: 'com.dynatrace.index4j', name: 'indices', version: '0.3.1'

The current minimum Java version should be 17.

Supported data structures

Fm-Index: supports count, locate, extract and extractUntilBoundary (along with its left and right variations) with arbitrary patterns while compressing your data. In the case of logs, roughly 20% of the input size is required to store both data and index.
RrrVector (Raman, Raman & Rao): compressed bit vectors which both compress and enables rank and select queries.
SuffixArray: supports querying arbitrary patterns on input text requiring O(n logn) time and O(5n) space.
Wavelet FBB: Fast wavelet trees that compress and support rank and select queries on arbitrary sequences.
Burrows-Wheeler transform: enables calculating the BWT of inputs as well as compute redundancy metrics.

How-to

Using the FM-Index

The following examples show how to make use of the provided FM-Index implementation. Note that both compression and query speed are controlled by the sample rate parameter (find more details here). In essence, a larger sample rate results in better compression but slower query performance, and viceversa. Note that "larger" here means "larger distance" between samples, and therefore, actually translates into fewer samples.

Creating an FM-Index

The following snippet creates an FM-Index for a char array text using a sample rate of 32 and enabling the extraction/decompression of the index.

char[] text = ...
FmIndex fmi = new FmIndexBuilder()
        .setSampleRate(32)
        .setEnableExtraction(true)
        .build(text);

To get some statistics, we can do the following:

System.out.println(fmi.getAlphabetLength());                            // how many symbols
System.out.println(fmi.getInputLength());                               // the length of the input `text`
int size = Serialization.writeToByteArray(FmIndex::write, fmi).length;  // actual compressed size
System.out.println((size * 100 / fmi.getInputLength()) + "%)");         // relative size to the input

Note that the original input string is no longer required to perform any query at all, and in fact the whole input can be retrieved back by decompressing or extracting between 0 and |text|.

Counting occurrences of a pattern

Count how many times the pattern DEBUG (with spaces before and after) is in the input data.

char[] pattern = " DEBUG ".toCharArray();
System.out.println(fmi.count(pattern));

Locating the positions of an occurrence

Using the previous pattern, find at most 10 matches and their position in the original input.

int[] locations = new int[10];
int found = fmi.locate(pattern, 0, pattern.length, locations, 10); // find max. 10 matches
System.out.println(found); // print number of matches
System.out.println(Arrays.toString(locations)); // print the locations of the matches

Extracting key-value pairs

Using the previously found matches and locations, extract the right-most string containing the match and until a comma ',' is found.

char[] destination = new char[100]; // maximum length per extraction
for (int i = 0; i < found; i++) {
    int length = fmi.extractUntilBoundaryRight(locations[i] - 1, destination, 0, ','); // extract previous locations until finding a comma
    System.out.println(new String(destination, 0, length)); // print the extracted string
}

Extracting whole records

Extracting the full string, left and right from every match until a line separator '\n' is found.

char[] destination = new char[512]; // maximum length per extraction 
for (int i = 0; i < found; i++) {
    int length = fmi.extractUntilBoundary(locations[i], destination, 0, '\n');
    System.out.println(new String(destination, 0, length));
}

Serializing

All data structures (except the naive WaveletTree) can be serialized and deserialized via the SerializationReader and the SerializationWriter, respectively.

You can do so as shown below:

FmIndex fmi = new FmIndexBuilder().build(...);
byte[] serialized    = Serialization.writeToByteArray(FmIndex::write, fmi);
FmIndex deserialized = Serialization.readFromByteArray(FmIndex::read, serialized);

Using RRR Bit vectors

Similarly to the FM-Index, RRR bit vectors also make use of the sample rate. A higher sample rate results in using less space but at the expense of slower rank and access queries. In particular, a sample rate of e.g. 32 will result in saving a prefix sum every 32 bits, as well as saving the position of offsets every 32 bits. Therefore, in the worst case, up to 32/BLOCK_SIZE blocks need to be traversed before finding the rank or access answer.

Creating RRR Bit vectors

We have two options to create RRR bit vectors: either through an array of integers, which we interpret as a sequence of bits (e.g. if the int value is 2 then we have a bit sequence of 010...000000) or creating it through an already existing BitVector from the sux4j library. For example, using a sequence of integers and a sample rate of 32:

int[] array = new Random().ints(1000).toArray();
RrrVector rrr = new RrrVector(array, 32);

Or from a BitVector:

int lengthInBits = 1000;
BitVector bv = LongArrayBitVector.getInstance().length(lengthInBits);
// Fill in some values
Random random = new Random();
for (int i = 0; i < lengthInBits; i++) {
    bv.set(i, random.nextBoolean());
}
// Create it
RrrVector rrr = new RrrVector(bv, 32);

Ranking zeros or ones

The following snippet shows how to count the number of 1's or 0's until a given position (exclusive).

int[] array = ...
RrrVector rrr = new RrrVector(array, 32);
System.out.println(rrr.rankOnes(10)); // Number of 1's until position 10
System.out.println(rrr.rankZeroes(50)); // Number of 0's until position 50

Access an arbitrary position

Get the value of the bit at position i:

int[] array = ...
RrrVector rrr = new RrrVector(array, 32);
System.out.println(rrr.access(7)); // Get the bit value as a boolean (true or false) of the bit sequence at position 7

Serializing

Similar to the FM-Index, we can serialize and deserialize our RRR bit vectors as follows:

int[] array = ...
RrrVector rrr = new RrrVector(array, 32);
byte[] serialized      = Serialization.writeToByteArray(RrrVector::write, rrr);
RrrVector deserialized = Serialization.readFromByteArray(RrrVector::read, serialized);

Using Wavelet trees

Currently, two implementations of the wavelet tree are supported. However, only the Fixed-block boosting wavelet tree should be used since its performance and memory consumption is much better than the vanilla wavelet tree implementation.

The Fixed-block boosting wavelet tree (FBB-WT in advance) also uses the sample rate parameter since it is composed of multiple RRR bit vectors.

Creating FBB-WT

The FBB-WT can be built from either a sequence of chars or shorts. In any case, both are interpreted as symbols and should be mapped to a sequence {0,1,2, ..., n} such that v_{i+1} = v_{i} + 1 for all i. This allows for a larger alphabet, up to 32,768 different symbols, whereas otherwise if there is a single symbol with a value above 32,768 it will fail. The following snippet shows how to create an FBB-WT without such mapping:

char[] text = "This is an example".toCharArray();
WaveletFixedBlockBoosting wavelet

Index4j

Install / Use

README

<img src="plots/logo.png" alt="logo" width="50"/> index4j

Content

First steps

Supported data structures

How-to

Using the FM-Index

Creating an FM-Index

Counting occurrences of a pattern

Locating the positions of an occurrence

Extracting key-value pairs

Extracting whole records

Serializing

Using RRR Bit vectors

Creating RRR Bit vectors

Ranking zeros or ones

Access an arbitrary position

Serializing

Using Wavelet trees

Creating FBB-WT