Index4j
Dynatrace FM-Index library
Install / Use
/learn @dynatrace-oss/Index4jREADME
<img src="plots/logo.png" alt="logo" width="50"/> index4j
This repository is a Java library developed by Dynatrace that implements the FM-Index succinct data structure. The FM-Index takes advantage of the relationship between the suffix array and the Burrows-Wheeler transform to enable both compression and fast queries.
This index is specifically suited for compressing and querying text files without requiring prior tokenization. This means that you can compress a text file and then query it for arbitrary substrings without the need to fully decompress it, i.e., only the compressed size is required to be in memory. Furthermore, query performance depends on the number of located matches and total decompressed bytes, rather than whether the query is aligned with a particular token.
In addition, this repository also contains further data structures for working with bit vectors and integer sequences in Java.
Content
- First steps
- Supported data structures
- How-to and example use cases
- Benchmarks
- Details about the data structures
- About this project
First steps
To add a dependency on index4j using Maven, use the following:
<dependency>
<groupId>com.dynatrace.index4j</groupId>
<artifactId>indices</artifactId>
<version>0.3.1</version>
</dependency>
To add the dependency using Gradle:
implementation group: 'com.dynatrace.index4j', name: 'indices', version: '0.3.1'
The current minimum Java version should be 17.
Supported data structures
- Fm-Index: supports
count,locate,extractandextractUntilBoundary(along with its left and right variations) with arbitrary patterns while compressing your data. In the case of logs, roughly 20% of the input size is required to store both data and index. - RrrVector (Raman, Raman & Rao): compressed bit vectors which both compress and enables rank and select queries.
- SuffixArray: supports querying arbitrary patterns on input text requiring
O(n logn)time andO(5n)space. - Wavelet FBB: Fast wavelet trees that compress and support rank and select queries on arbitrary sequences.
- Burrows-Wheeler transform: enables calculating the BWT of inputs as well as compute redundancy metrics.
How-to
Using the FM-Index
The following examples show how to make use of the provided FM-Index implementation. Note that both compression and
query speed are controlled by the sample rate parameter (find more details here). In essence, a larger
sample rate results in better compression but slower query performance, and viceversa. Note that "larger" here means
"larger distance" between samples, and therefore, actually translates into fewer samples.
Creating an FM-Index
The following snippet creates an FM-Index for a char array text using a sample rate of 32 and enabling
the extraction/decompression of the index.
char[] text = ...
FmIndex fmi = new FmIndexBuilder()
.setSampleRate(32)
.setEnableExtraction(true)
.build(text);
To get some statistics, we can do the following:
System.out.println(fmi.getAlphabetLength()); // how many symbols
System.out.println(fmi.getInputLength()); // the length of the input `text`
int size = Serialization.writeToByteArray(FmIndex::write, fmi).length; // actual compressed size
System.out.println((size * 100 / fmi.getInputLength()) + "%)"); // relative size to the input
Note that the original input string is no longer required to perform any query at all, and in fact the whole input
can be retrieved back by decompressing or extracting between 0 and |text|.
Counting occurrences of a pattern
Count how many times the pattern DEBUG (with spaces before and after) is in the input data.
char[] pattern = " DEBUG ".toCharArray();
System.out.println(fmi.count(pattern));
Locating the positions of an occurrence
Using the previous pattern, find at most 10 matches and their position in the original input.
int[] locations = new int[10];
int found = fmi.locate(pattern, 0, pattern.length, locations, 10); // find max. 10 matches
System.out.println(found); // print number of matches
System.out.println(Arrays.toString(locations)); // print the locations of the matches
Extracting key-value pairs
Using the previously found matches and locations, extract the right-most string containing the match and until a comma ',' is found.
char[] destination = new char[100]; // maximum length per extraction
for (int i = 0; i < found; i++) {
int length = fmi.extractUntilBoundaryRight(locations[i] - 1, destination, 0, ','); // extract previous locations until finding a comma
System.out.println(new String(destination, 0, length)); // print the extracted string
}
Extracting whole records
Extracting the full string, left and right from every match until a line separator '\n' is found.
char[] destination = new char[512]; // maximum length per extraction
for (int i = 0; i < found; i++) {
int length = fmi.extractUntilBoundary(locations[i], destination, 0, '\n');
System.out.println(new String(destination, 0, length));
}
Serializing
All data structures (except the naive WaveletTree) can be serialized and deserialized
via the SerializationReader and
the SerializationWriter, respectively.
You can do so as shown below:
FmIndex fmi = new FmIndexBuilder().build(...);
byte[] serialized = Serialization.writeToByteArray(FmIndex::write, fmi);
FmIndex deserialized = Serialization.readFromByteArray(FmIndex::read, serialized);
Using RRR Bit vectors
Similarly to the FM-Index, RRR bit vectors also make use of the sample rate. A higher sample rate results in using less
space but at the expense of slower rank and access queries. In particular, a sample rate of e.g. 32
will result in saving a prefix sum every 32 bits, as well as saving the position of offsets
every 32 bits. Therefore, in the worst case, up to 32/BLOCK_SIZE blocks need to be traversed before finding
the rank or access answer.
Creating RRR Bit vectors
We have two options to create RRR bit vectors: either through an array of integers, which we interpret as a sequence
of bits (e.g. if the int value is 2 then we have a bit sequence of 010...000000) or creating it through an already
existing BitVector from the sux4j library. For example, using a sequence of integers and a sample rate of 32:
int[] array = new Random().ints(1000).toArray();
RrrVector rrr = new RrrVector(array, 32);
Or from a BitVector:
int lengthInBits = 1000;
BitVector bv = LongArrayBitVector.getInstance().length(lengthInBits);
// Fill in some values
Random random = new Random();
for (int i = 0; i < lengthInBits; i++) {
bv.set(i, random.nextBoolean());
}
// Create it
RrrVector rrr = new RrrVector(bv, 32);
Ranking zeros or ones
The following snippet shows how to count the number of 1's or 0's until a given position (exclusive).
int[] array = ...
RrrVector rrr = new RrrVector(array, 32);
System.out.println(rrr.rankOnes(10)); // Number of 1's until position 10
System.out.println(rrr.rankZeroes(50)); // Number of 0's until position 50
Access an arbitrary position
Get the value of the bit at position i:
int[] array = ...
RrrVector rrr = new RrrVector(array, 32);
System.out.println(rrr.access(7)); // Get the bit value as a boolean (true or false) of the bit sequence at position 7
Serializing
Similar to the FM-Index, we can serialize and deserialize our RRR bit vectors as follows:
int[] array = ...
RrrVector rrr = new RrrVector(array, 32);
byte[] serialized = Serialization.writeToByteArray(RrrVector::write, rrr);
RrrVector deserialized = Serialization.readFromByteArray(RrrVector::read, serialized);
Using Wavelet trees
Currently, two implementations of the wavelet tree are supported. However, only the Fixed-block boosting wavelet tree should be used since its performance and memory consumption is much better than the vanilla wavelet tree implementation.
The Fixed-block boosting wavelet tree (FBB-WT in advance) also uses the sample rate parameter since it is composed of multiple RRR bit
vectors.
Creating FBB-WT
The FBB-WT can be built from either a sequence of chars or shorts. In any case, both are interpreted as symbols and should
be mapped to a sequence {0,1,2, ..., n} such that v_{i+1} = v_{i} + 1 for all i. This allows for a larger alphabet,
up to 32,768 different symbols, whereas otherwise if there is a single symbol with a value above 32,768 it will fail.
The following snippet shows how to create an FBB-WT without such mapping:
char[] text = "This is an example".toCharArray();
WaveletFixedBlockBoosting wavelet
