# hash4j
hash4j is a Java library by Dynatrace that includes various non-cryptographic hash algorithms and data structures that are based on high-quality hash functions.
## Content
- Hash algorithms
- Similarity hashing
- Approximate distinct counting
- File hashing
- Consistent hashing
- Benchmark results
- Contribution FAQ
## Hash algorithms
hash4j currently implements the following hash algorithms:
- Murmur3: 32-bit and 128-bit
- Wyhash: final version 3 and final version 4
- Komihash: version 4.3 (compatible with 4.7) and version 5.0 (compatible with 5.10 and 5.28)
- FarmHash: farmhashna and farmhashuo
- PolymurHash 2.0
- XXHash: XXH32, XXH64, XXH3_64bits, XXH3_128bits
- Rapidhash v3
- ChibiHash v2
- MetroHash: 64-bit and 128-bit
All hash functions are thoroughly tested against the native reference implementations as well as against other libraries such as Guava Hashing, Zero-Allocation Hashing, Apache Commons Codec, or crypto (see CrossCheckTest.java).
### Usage
The interface allows direct hashing of Java objects in a streaming fashion without first mapping them to byte arrays. This minimizes memory allocations and keeps the memory footprint of the hash algorithm constant regardless of the object size.
```java
class TestClass {
  int a = 42;
  long b = 1234567890L;
  String c = "Hello world!";
}

TestClass obj = new TestClass(); // create an instance of some test class
var hasher = Hashing.komihash5_0(); // create a hasher instance (can be static)

// variant 1: hash object by passing data into a hash stream
long hash1 = hasher.hashStream().putInt(obj.a).putLong(obj.b).putString(obj.c).getAsLong(); // gives 0x90553fd9c675dfb2L

// variant 2: hash object by defining a funnel
HashFunnel<TestClass> funnel = (o, sink) -> sink.putInt(o.a).putLong(o.b).putString(o.c);
long hash2 = hasher.hashToLong(obj, funnel); // gives 0x90553fd9c675dfb2L

// create a hash stream instance (can be static or thread-local)
var hashStream = Hashing.komihash5_0().hashStream();

// variant 3: allocation-free by reusing a pre-allocated hash stream instance
long hash3 = hashStream.reset().putInt(obj.a).putLong(obj.b).putString(obj.c).getAsLong(); // gives 0x90553fd9c675dfb2L

// variant 4: allocation-free and using a funnel
long hash4 = hashStream.resetAndHashToLong(obj, funnel); // gives 0x90553fd9c675dfb2L
```
More examples can be found in HashingDemo.java.
## Similarity hashing
Similarity hashing algorithms compute hash signatures of sets that allow estimating the similarity of sets without access to the original sets. The following algorithms are currently available:
- MinHash
- SuperMinHash
- SimHash
- FastSimHash: A fast implementation of SimHash using a bit hack (see this blog post)
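The basic SimHash construction can be sketched in a few lines. The following is a self-contained toy, not hash4j's implementation: each element's 64-bit hash casts a +1/−1 vote per bit position, and the signature takes the majority vote. FastSimHash produces the same kind of signature faster by counting the votes in a bit-parallel fashion instead of looping over all 64 positions per element.

```java
// Toy 64-bit SimHash (illustration only, NOT hash4j's implementation):
// each element hash votes +1 (bit set) or -1 (bit clear) per bit position;
// the signature bit is set wherever the vote is positive.
static long simHash(long[] elementHashes) {
  int[] counters = new int[64];
  for (long h : elementHashes)
    for (int i = 0; i < 64; i++)
      counters[i] += ((h >>> i) & 1L) != 0 ? 1 : -1;
  long signature = 0;
  for (int i = 0; i < 64; i++)
    if (counters[i] > 0) signature |= 1L << i;
  return signature;
}
```

If every element hash is identical, the vote is unanimous and the signature equals that hash; similar sets produce signatures with small Hamming distance.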
### Usage
```java
ToLongFunction<String> stringHashFunc = s -> Hashing.komihash5_0().hashCharsToLong(s);

Set<String> setA = IntStream.range(0, 90000).mapToObj(Integer::toString).collect(toSet());
Set<String> setB = IntStream.range(10000, 100000).mapToObj(Integer::toString).collect(toSet());
// intersection size = 80000, union size = 100000
// => exact Jaccard similarity of sets A and B is J = 80000 / 100000 = 0.8

int numberOfComponents = 1024;
int bitsPerComponent = 1;
// => each signature will take 1 * 1024 bits = 128 bytes

SimilarityHashPolicy policy =
    SimilarityHashing.superMinHash(numberOfComponents, bitsPerComponent);
SimilarityHasher simHasher = policy.createHasher();

byte[] signatureA = simHasher.compute(ElementHashProvider.ofCollection(setA, stringHashFunc));
byte[] signatureB = simHasher.compute(ElementHashProvider.ofCollection(setB, stringHashFunc));

double fractionOfEqualComponents = policy.getFractionOfEqualComponents(signatureA, signatureB);

// this formula estimates the Jaccard similarity from the fraction of equal components
double estimatedJaccardSimilarity =
    (fractionOfEqualComponents - Math.pow(2., -bitsPerComponent))
        / (1. - Math.pow(2., -bitsPerComponent)); // gives a value close to 0.8
```
See also SimilarityHashingDemo.java.
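The bias correction at the end of the snippet can be isolated into a small helper (a sketch for illustration, not a hash4j API method): with $b$ bits per component, the components of two signatures still collide with probability $2^{-b}$ even for unrelated elements, so the raw fraction of equal components overestimates the Jaccard similarity and must be corrected.

```java
// Sketch of the bias correction used above (not part of hash4j's API):
// with b bits per component, non-matching components still collide with
// probability 2^-b, which this formula subtracts out and renormalizes.
static double estimateJaccard(double fractionOfEqualComponents, int bitsPerComponent) {
  double collisionProbability = Math.pow(2., -bitsPerComponent);
  return (fractionOfEqualComponents - collisionProbability) / (1. - collisionProbability);
}
```

For example, with 1 bit per component a measured fraction of 0.9 yields (0.9 - 0.5) / 0.5 = 0.8.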
## Approximate distinct counting
Counting the number of distinct elements exactly requires space that grows linearly with the count. There are, however, algorithms that need much less space by counting only approximately. The space-efficiency of such algorithms can be compared by means of the storage factor, which is defined as the state size in bits multiplied by the squared relative standard error of the estimator:
$\text{storage factor} := (\text{relative standard error})^2 \times (\text{state size})$.
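To make the definition concrete, here is a small worked example (a sketch: the constants are those quoted for HyperLogLog's default estimator further down, and the precision $p = 12$ is an arbitrary choice):

```java
// Worked example of the storage-factor definition, using the HyperLogLog
// constants from the description below (p = 12 is an arbitrary choice):
int p = 12;
int m = 1 << p;                                        // 4096 registers
double stateSizeInBits = 6.0 * m;                      // 6-bit registers -> 24576 bits
double relativeStandardError = 1.039 / Math.sqrt(m);   // ~ 1.6 %
double storageFactor = relativeStandardError * relativeStandardError * stateSizeInBits;
// storageFactor ~= 6.477, matching the asymptotic value 18 ln 2 - 6
```

Note that the storage factor is independent of $p$: doubling the number of registers doubles the state size but halves the squared relative error.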
This library implements two algorithms for approximate distinct counting:
- HyperLogLog: This implementation uses 6-bit registers. The default estimator, which is an improved version of the original estimator, leads to an asymptotic storage factor of $18 \ln 2 - 6 = 6.477$. Using the definition of the storage factor, the corresponding relative standard error is roughly $\sqrt{\frac{6.477}{6 m}} = \frac{1.039}{\sqrt{m}}$. The state size is $6m = 6\cdot 2^p$ bits, where the precision parameter $p$ also defines the number of registers as $m = 2^p$. Alternatively, the maximum-likelihood estimator can be used, which achieves a slightly smaller asymptotic storage factor of $6\ln(2)/(\frac{\pi^2}{6}-1)\approx 6.449$ corresponding to a relative error of $\frac{1.037}{\sqrt{m}}$, but has a worse worst-case runtime performance. For non-distributed data streams, the martingale estimator can be used, which gives slightly better estimation results, since the asymptotic storage factor is only $6\ln 2 = 4.159$. This gives a relative standard error of $\sqrt{\frac{6\ln 2}{6m}} = \frac{0.833}{\sqrt{m}}$. The theoretically predicted estimation errors have been empirically confirmed by simulation results.
- UltraLogLog: This algorithm is described in detail in this paper. As for HyperLogLog, a precision parameter $p$ defines the number of registers $m = 2^p$. However, since UltraLogLog uses 8-bit registers to enable fast random accesses and updates of the registers, $m$ is also the state size in bytes. The default estimator leads to an asymptotic storage factor of 4.895, which corresponds to a 24% reduction compared to HyperLogLog and a relative standard error of $\frac{0.782}{\sqrt{m}}$. Alternatively, if performance is not an issue, the slower maximum-likelihood estimator can be used to obtain a storage factor of $8\ln(2)/\zeta(2,\frac{5}{4}) \approx 4.631$ corresponding to a 28% reduction and a relative error of $\frac{0.761}{\sqrt{m}}$. If the martingale estimator can be used, the storage factor will be just $5 \ln 2 = 3.466$, yielding an asymptotic relative standard error of $\frac{0.658}{\sqrt{m}}$. These theoretical formulas again agree well with the simulation results.
Both algorithms share the following properties:
- Constant-time add operations
- Allocation-free updates
- Idempotency: adding an item that was already inserted never changes the internal state
- Mergeability, even for data structures with different precisions
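These properties all follow from the register-maximum structure that both sketches share. The following stripped-down toy (illustration only; hash4j's actual implementations additionally provide estimators, compact 6-bit register packing for HyperLogLog, and more) shows the mechanism:

```java
// Toy max-register sketch (illustration only, NOT hash4j's implementation).
// Shows why adds are constant-time and idempotent, and why two sketches can
// be merged into one equivalent to a sketch built from the combined input.
class ToyMaxRegisterSketch {
  static final int P = 8;                      // precision: 2^P = 256 registers
  final byte[] registers = new byte[1 << P];

  void add(long hash) {                        // constant-time, allocation-free
    int idx = (int) (hash >>> (64 - P));       // top P bits select a register
    int rank = Math.min(Long.numberOfLeadingZeros(hash << P) + 1, 64 - P + 1);
    if (rank > registers[idx]) registers[idx] = (byte) rank; // max => idempotent
  }

  void mergeFrom(ToyMaxRegisterSketch other) { // element-wise max => order-independent
    for (int i = 0; i < registers.length; i++)
      if (other.registers[i] > registers[i]) registers[i] = other.registers[i];
  }
}
```

Adding the same hash again can only re-apply the same maximum, so the state never changes; merging takes the element-wise maximum, which is exactly the state the sketch would have reached had it processed both inputs itself.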
