Strong (well-distributed and unpredictable) hashes:

Portable implementation of SipHash
HighwayHash, a 5x faster SIMD hash with security claims

Quick Start

To build on a Linux or Mac platform, simply run make. For Windows, we provide a Visual Studio 2015 project in the msvc subdirectory.

Run benchmark for speed measurements. sip_hash_test and highwayhash_test ensure the implementations return known-good values for a given set of inputs.

64-bit SipHash for any CPU:

    #include "highwayhash/sip_hash.h"
    using namespace highwayhash;
    HH_ALIGNAS(16) const HH_U64 key2[2] = {1234, 5678};
    char in[8] = {1};
    return SipHash(key2, in, 8);

64, 128 or 256 bit HighwayHash for the CPU determined by compiler flags:

    #include "highwayhash/highwayhash.h"
    using namespace highwayhash;
    HH_ALIGNAS(32) const HHKey key = {1, 2, 3, 4};
    char in[8] = {1};
    HHResult64 result;  // or HHResult128 or HHResult256
    HHStateT<HH_TARGET> state(key);
    HighwayHashT(&state, in, 8, &result);

64, 128 or 256 bit HighwayHash for the CPU on which we're currently running:

    #include "highwayhash/highwayhash_target.h"
    #include "highwayhash/instruction_sets.h"
    using namespace highwayhash;
    HH_ALIGNAS(32) const HHKey key = {1, 2, 3, 4};
    char in[8] = {1};
    HHResult64 result;  // or HHResult128 or HHResult256
    InstructionSets::Run<HighwayHash>(key, in, 8, &result);

C-callable 64-bit HighwayHash for the CPU on which we're currently running:

#include "highwayhash/c_bindings.h"
const uint64_t key[4] = {1, 2, 3, 4};
char in[8] = {1};
return HighwayHash64(key, in, 8);

Printing a 256-bit result in a hexadecimal format similar to sha1sum:

HHResult256 result;
printf("%016"PRIx64"%016"PRIx64"%016"PRIx64"%016"PRIx64"\n",
     result[3], result[2], result[1], result[0]);

Introduction

Hash functions are widely used, so it is desirable to increase their speed and security. This package provides two 'strong' (well-distributed and unpredictable) hash functions: a faster version of SipHash, and an even faster algorithm we call HighwayHash.

SipHash is a fast but 'cryptographically strong' pseudo-random function by Aumasson and Bernstein [https://www.131002.net/siphash/siphash.pdf].

HighwayHash is a new way of mixing inputs which may inspire new cryptographically strong hashes. Large inputs are processed at a rate of 0.24 cycles per byte, and latency remains low even for small inputs. HighwayHash is faster than SipHash for all input sizes, with 5 times higher throughput at 1 KiB. We discuss design choices and provide statistical analysis and preliminary cryptanalysis in https://arxiv.org/abs/1612.06257.

Applications

Unlike prior strong hashes, these functions are fast enough to be recommended as safer replacements for weak hashes in many applications. The additional CPU cost appears affordable, based on profiling data indicating C++ hash functions account for less than 0.25% of CPU usage.

Hash-based selection of random subsets is useful for A/B experiments and similar applications. Such random generators are idempotent (repeatable and deterministic), which is helpful for parallel algorithms and testing. To avoid bias, it is important that the hash function be unpredictable and indistinguishable from a uniform random generator. We have verified the bit distribution and avalanche properties of SipHash and HighwayHash.

64-bit hashes are also useful for authenticating short-lived messages such as network/RPC packets. This requires that the hash function withstand differential, length extension and other attacks. We have published a formal security analysis for HighwayHash. New cryptanalysis tools may still need to be developed for further analysis.

Strong hashes are also important parts of methods for protecting hash tables against unacceptable worst-case behavior and denial of service attacks (see "hash flooding" below).

128 and 256-bit hashes can be useful for verifying data integrity (checksums).

SipHash

Our SipHash implementation is a fast and portable drop-in replacement for the reference C code. Outputs are identical for the given test cases (messages between 0 and 63 bytes).

Interestingly, it is about twice as fast as a SIMD implementation using SSE4.1 (https://goo.gl/80GBSD). This is presumably due to the lack of SIMD bit rotate instructions prior to AVX-512.

SipHash13 is a faster but weaker variant with one mixing round per update and three during finalization.

We also provide a data-parallel 'tree hash' variant that enables efficient SIMD while retaining safety guarantees. This is about twice as fast as SipHash, but does not return the same results.

HighwayHash

We have devised a new way of mixing inputs with SIMD multiply and permute instructions. The multiplications are 32x32 -> 64 bits and therefore infeasible to reverse. Permuting equalizes the distribution of the resulting bytes.

The internal state is quite large (1024 bits) but fits within SIMD registers. Due to limitations of the AVX2 instruction set, the registers are partitioned into two 512-bit halves that remain independent until the reduce phase. The algorithm outputs 64 bit digests or up to 256 bits at no extra cost.

In addition to high throughput, the algorithm is designed for low finalization cost. The result is more than twice as fast as SipTreeHash.

We also provide an SSE4.1 version (80% as fast for large inputs and 95% as fast for short inputs), an implementation for VSX on POWER and a portable version (10% as fast). A third-party ARM implementation is referenced below.

Statistical analyses and preliminary cryptanalysis are given in https://arxiv.org/abs/1612.06257.

Versioning and stability

Now that 21 months have elapsed since their initial release, we have declared all (64/128/256 bit) variants of HighwayHash frozen, i.e. unchanging forever.

SipHash and HighwayHash are 'fingerprint functions' whose input -> hash mapping will not change. This is important for applications that write hashes to persistent storage.

Speed measurements

To measure the CPU cost of a hash function, we can either create an artificial 'microbenchmark' (easier to control, but probably not representative of the actual runtime), or insert instrumentation directly into an application (risks influencing the results through observer overhead). We provide novel variants of both approaches that mitigate their respective disadvantages.

profiler.h uses software write-combining to stream program traces to memory with minimal overhead. These can be analyzed offline, or when memory is full, to learn how much time was spent in each (possibly nested) zone.

nanobenchmark.h enables cycle-accurate measurements of very short functions. It uses CPU fences and robust statistics to minimize variability, and also avoids unrealistic branch prediction effects.

We compile the 64-bit C++ implementations with a patched GCC 4.9 and run on a single idle core of a Xeon E5-2690 v3 clocked at 2.6 GHz. CPU cost is measured as cycles per byte for various input sizes:

Algorithm | 8 | 31 | 32 | 63 | 64 | 1024 ---------------- | ----- | ---- | ---- | ---- | ---- | ---- HighwayHashAVX2 | 7.34 | 1.81 | 1.71 | 1.04 | 0.95 | 0.24 HighwayHashSSE41 | 8.00 | 2.11 | 1.75 | 1.13 | 0.96 | 0.30 SipTreeHash | 16.51 | 4.57 | 4.09 | 2.22 | 2.29 | 0.57 SipTreeHash13 | 12.33 | 3.47 | 3.06 | 1.68 | 1.63 | 0.33 SipHash | 8.13 | 2.58 | 2.73 | 1.87 | 1.93 | 1.26 SipHash13 | 6.96 | 2.09 | 2.12 | 1.32 | 1.33 | 0.68

SipTreeHash is slower than SipHash for small inputs because it processes blocks of 32 bytes. AVX2 and SSE4.1 HighwayHash are faster than SipHash for all input sizes due to their highly optimized handling of partial vectors.

Note that previous measurements included the initialization of their input, which dramatically increased timings especially for small inputs.

CPU requirements

SipTreeHash(13) requires an AVX2-capable CPU (e.g. Haswell). HighwayHash includes a dispatcher that chooses the implementation (AVX2, SSE4.1, VSX or portable) at runtime, as well as a directly callable function template that can only run on the CPU for which it was built. SipHash(13) and ScalarSipTreeHash(13) have no particular CPU requirements.

AVX2 vs SSE4

When both AVX2 and SSE4 are available, the decision whether to use AVX2 is non-obvious. AVX2 vectors are twice as wide, but require a higher power license (integer multiplications count as 'heavy' instructions) and can thus reduce the clock frequency of the core or entire socket(!) on Haswell systems. This partially explains the observed 1.25x (not 2x) speedup over SSE4. Moreover, it is inadvisable to only sporadically use AVX2 instructions because there is also a ~56K cycle warmup period during which AVX2 operations are slower, and Haswell can even stall during this period. Thus, we recommend avoiding AVX2 for infrequent hashing if the rest of the application is also not using AVX2. For any input larger than 1 MiB, it is probably worthwhile to enable AVX2.

SIMD implementations

Our x86 implementations use custom vector classes with overloaded operators (e.g. const V4x64U a = b + c) for type-safety and improved readability vs. compiler intrinsics (e.g. const __m256i a = _mm256_add_epi64(b, c)). The VSX implementation uses built-in vector types alongside Altivec intrinsics. A high-performance third-party ARM implementation is mentioned below.

Dispatch

Our instruction_sets dispatcher avoids running newer instructions on older CPUs that do not support them. However, intrinsics, and therefore also any vector classes that use them, require (on GCC < 4.9 or Clang < 3.9) a compiler flag that also allows the compiler to generate code for that CPU. This means the intrinsics must be placed in separate

Highwayhash

Install / Use

README