# SMHasher3

Mirror of the SMHasher3 repo from the official GitLab site. Submit all issues and PRs there, not on GitHub!
```
  _____ __  __ _    _           _             ____
 / ____|  \/  | |  | |         | |           |___ \
| (___ | \  / | |__| | __ _ ___| |__   ___ _ __ __) |
 \___ \| |\/| |  __  |/ _` / __| '_ \ / _ \ '__|__ <
 ____) | |  | | |  | | (_| \__ \ | | |  __/ |  ___) |
|_____/|_|  |_|_|  |_|\__,_|___/_| |_|\___|_| |____/
=======================================================
```
## Test Results
If you are interested in the latest hash test results (currently from the latest SMHasher3 release), they are in the `results/` directory.
## Summary
SMHasher3 is a tool for testing the quality of hash functions in terms of their distribution, collision, and performance properties. It constructs sets of hash keys, passes them through the hash function to test, and analyzes their outputs in numerous ways. It also does some performance testing of the hash function.
SMHasher3 is based on the SMHasher fork maintained by Reini Urban, which is in turn based on the original SMHasher by Austin Appleby. The commit history of both of those codebases up to their respective fork points is contained in this repository.
The major differences from rurban's fork are:
- Fixes for several critical bugs
- Several new tests and test methods
- Significant performance increases
- P-value reporting for all supported tests
- Detailed reporting on hashes when test failures occur
- Better statistical foundations for some tests
- Overhauled hash implementations for greater consistency
Additional significant changes include:
- Many fixes to threaded testing and hashing
- More consistent testing across systems and configurations
- More consistent and human-friendlier reporting formats
- Common framework code explicitly sharable across all hashes
- Flexible metadata system for both hashes and their implementations
- Major progress towards full big-endian support
- Support of more hash seed methods (64-bit seeds and ctx pointers)
- Ability to supply a global seed value for testing
- Test of varying alignments and buffer tail sizes during speed tests
- Refactored code to improve maintainability and rebuild times
- Reorganized code layout to improve readability
- Compilation-based platform probing and configuration
- Consistent code formatting
- More explicit license handling
- Fully C++11-based implementation
## Current status
As of 2025-10-16, I consider SMHasher3 to have been fully released.
From this point, the plan is to have two branches: "main" and "dev". The main branch will have new hashes and updated hashes added to it as I am able. The dev branch will have those changes added to it also. Feature development will happen only on the dev branch, and those changes will occasionally get added to main, when some chunk of functionality is complete.
There won't be explicit release versioning. Instead, the version string has been updated to include the commit date of the last commit.
This code is regularly compiled and run successfully on Linux x64, ARM, and PowerPC using both gcc and clang. Importantly, I do not have the ability to test on Mac or Windows environments. It has been compiled successfully with MSVC and clang-cl in the past, and efforts are made to ensure this remains the case, but some things may slip through. The goal is to support all of the above, and while the CMake files Should Just Work(tm), MSVC in particular has its own ideas about some corners of the various specs. Reports of success or failure are therefore appreciated, as are patches to make things work.
## How to build
```
mkdir build
cd build
cmake ..
```

or

```
CC=mycc CXX=mycxx CXXFLAGS="-foo bar" cmake ..
```

as needed for your system, followed by

```
make -j4
```

or

```
make -j4 all test
```
## How to use
- `./SMHasher3 --tests` will show all available test suites
- `./SMHasher3 --list` will show all available hashes and their descriptions
- `./SMHasher3 <hashname>` will test the given hash with the default set of test suites (which is called "All" and is most, but not literally all, of them)
- `./SMHasher3 <hashname> --extra --notest=Speed,Hashmap` will test the given hash with the default set of test suites excluding the Speed and Hashmap tests, with each test suite that runs using an extended set of tests
- `./SMHasher3 <hashname> --ncpu=1` will test the given hash with the default set of test suites, using only a single thread
- `./SMHasher3 --help` will show many other usage options
Note that a hashname specified on the command-line is looked up via case-insensitive search, so you do not have to precisely match the names given from the list of available hashes. Even fuzzier name matching is planned for future releases.
If SMHasher3 found a usable threading implementation during the build, then
the default is to assume --ncpu=4, which uses up to 4 threads to speed up
testing. Not all test suites use threading. While all included hashes are
thread-safe as of this writing, if a non-thread safe hash is detected then
threading will be disabled and a warning will be given. If no usable
threading library was found, then a warning will be given if a --ncpu=
value above 1 was used.
## Adding a new hash
To add a new hash function to be tested, either add the implementation to an existing
file in hashes/ (if related hashes are already there), or copy hashes/EXAMPLE.cpp
to a new filename and then add it to the list of files in hashes/Hashsrc.cmake.
Many more details can be found in hashes/README.addinghashes.md.
## P-value reporting
This section has been placed near the front of the README because it is the most important and most visible new feature for existing SMHasher users.
The tests in the base SMHasher code had a variety of metrics for reporting results, and those metrics often did not take the test parameters into account well (or at all), leading to results that were questionable and hard to interpret. For example, the Avalanche test reports on the worst result over all test buckets, but tests with longer inputs have more buckets. This was not part of the result calculation, and so longer inputs naturally get higher percentage biases (on average) even with truly random hashes. In other words, a bias of "0.75%" on a 32-bit input was not the same as a bias of "0.75%" on a 1024-bit input. This is not to call out the Avalanche test specifically; many tests exhibited some variation of this problem.
To address these issues, SMHasher3's tests compare aspects of the distribution of hash values from the hash function under test against those from a hypothetical true random number generator, and summarize the result in the form of a p-value.
P-values are probabilities: they are numbers between 0 and 1. Their values are approximately the probability of a true RNG producing a test result that was at least as bad as the observed result from the hash function. Smaller p-values would indicate worse hash results.
However, these p-values quite often end up being very small values near zero, even in cases of good results. Reporting them in their decimal form, or even in scientific notation, would probably not be very useful, and could be very difficult to compare or interpret just by looking at them.
In SMHasher3, these p-values are reported by a caret symbol (^) followed by the p-value expressed in negative powers of two. For example, if it is determined that a true RNG would be expected to produce the same or a worse result with a probability of 0.075, then SMHasher3 would compute that the p-value is about 2^-3.737. It would then round the exponent towards zero, discard the sign (since probabilities are never greater than 1, the exponent is always negative), and finally report the p-value as "^ 3".
Therefore, smaller p-values (which indicate worse test results) result in larger numbers when reported using caret notation. You can think of the values in caret notation as indicating how improbable, and thus worse, the test result was. For example, "^50" could be interpreted as "there is, at best, only a 1 in 2^50 chance that an RNG would have produced a result as bad as the hash did".
The p-value computations only care about the likelihood of bad results (e.g. more collisions than an RNG would produce). Test results that are better than a typical RNG result but would still be outliers from a purely statistical point-of-view, such as seeing no or very few collisions when at least some would be expected, do not produce extreme p-values. In statistics terms, the p-values are one-tailed when appropriate, instead of always being two-tailed.
The p-value computations also take into account how many tests are being summarized, which can lead to unintuitive results. As an example, here are some lines from a single batch of test keys:
```
Keyset 'Sparse' - 256-bit keys with up to 3 bits set - 2796417 keys
Testing all collisions (high 32-bit) - Expected 910.2, actual 989 (1.087x) (^ 7)
Testing all collisions (high 20..38 bits) - Worst is 32 bits: 989/910 (1.087x) (^ 3)
```
The middle line reports ^7 for seeing 989 collisions when 910 were expected, and the last line reports ^3 for what seems like the same result. This is due to the fact that the middle line is reporting that as the result of a single test, and the last line is reporting that as the worst result over 19 tests. It's much more likely to see a result at least that bad if you have 19 tries to get it than if you just had 1 try, and so the improbability is much lower. Indeed, 19 is around 2^4, and the first reported result is about 4 powers of 2 worse than the second (7 - 3), as expected.
A true RNG would generally have about twice as many ^4 results as ^5 results, twice as many ^3 results as ^4 results, and so on.