Benchmarking non-cryptographic hash functions in Rust

Dr. Timofey Prodanov
4 min read · Oct 4, 2023


Hashing is a well-known problem in computer science: one needs to transform variable-size input into a fixed-size output. A hash function must always produce the same output for the same input, but the reverse does not hold: when the output is smaller than the input, collisions (the same output for different inputs) are inevitable, although they should occur as rarely as possible.

There are two types of hash algorithms: cryptographic and non-cryptographic. Cryptographic hash functions are used for security purposes, such as protecting passwords, and their main requirement is to be irreversible in practice (it should be impossible to find an input that produces a required hash value within a human lifetime). Non-cryptographic functions, on the other hand, are used inside various data structures, most notably hash tables; they aim to be as fast as possible while keeping the collision rate low.

There are several benchmarks of cryptographic hash functions, some of which can be found here and here. SMHasher is a great resource for benchmarking both types of functions; however, it is specific to C++, does not stratify hashing speed by input size, and, interestingly, shows different results compared to this article.

Here, I examined the following hash functions, available in Rust:

  • SipHash 1-3 (default in Rust) and SipHash 2-4,
  • ahash (the hash function is not fixed: different computers and different library versions may produce different values for the same input),
  • FNV,
  • fxhash,
  • seahash,
  • wyhash and wyhash2,
  • xxHash,
  • highway, metro (64- and 128-bit), murmur2 (64-bit) and murmur3 (128-bit), T1HA, city, spooky and farm, all from the fasthash crate.

All hash functions above implement the Hasher trait and therefore produce 64-bit outputs (the output is truncated for 128-bit hashers). You can find the benchmarking code on GitHub. First, I examined hashing bandwidth (key size × number of keys / hashing time) for inputs of different sizes (between 4 and 4096 bytes).
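To make the measurement concrete, below is a minimal sketch of the kind of bandwidth calculation involved. It is not the benchmark from the repository (which should be preferred for real measurements); DefaultHasher here merely stands in for any type implementing Hasher + Default, and the key generation is an arbitrary placeholder.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::time::Instant;

/// Hash `n` keys of `key_len` bytes each and return the bandwidth in bytes per second.
fn bandwidth<H: Hasher + Default>(key_len: usize, n: usize) -> f64 {
    // Deterministic, pseudo-random-looking keys; good enough for a sketch.
    let keys: Vec<Vec<u8>> = (0..n)
        .map(|i| (0..key_len).map(|j| ((i * 31 + j * 7) % 256) as u8).collect())
        .collect();
    let start = Instant::now();
    let mut sink = 0u64;
    for key in &keys {
        let mut hasher = H::default();
        hasher.write(key);
        sink ^= hasher.finish();
    }
    let elapsed = start.elapsed().as_secs_f64();
    std::hint::black_box(sink); // keep the compiler from discarding the loop
    (key_len * n) as f64 / elapsed
}

fn main() {
    // DefaultHasher is SipHash 1-3; any other `Hasher + Default` type can be swapped in.
    let bw = bandwidth::<DefaultHasher>(64, 1_000_000);
    println!("~{:.1} MB/s", bw / 1e6);
}
```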

Figure 1. Hashing bandwidth (higher is better) for 15 hash functions. City, farm and spooky are not shown, as their bandwidth is similar to (and slightly worse than) that of T1HA.

Note that for most methods the bandwidth was lowest at the smallest input sizes. This is because reaching the same bandwidth with smaller keys requires computing more hash values in the same amount of time (twice as many when comparing 4-byte and 8-byte keys). Overall, wyhash2 showed the highest speed at almost all input sizes, with fxhash overtaking it at 32- and 64-byte inputs.

However, speed is not the only important property of a hashing algorithm. Another relevant characteristic is randomness: ideally, hash values for two very similar inputs should be completely different. Following this idea, I estimated hash function randomness by generating a random input, flipping one of its bits, and counting how many output bits changed. In the ideal case, half of the output bits (32 out of 64) flip on average.
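This kind of bit-flip test can be sketched roughly as follows; the exact input generation and averaging used for the figures below may differ, and DefaultHasher is again only a placeholder for the hashers under test.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Flip each input bit in turn and count how many of the 64 output bits change,
/// averaged over all flipped bits. An ideal hash function scores close to 32.
fn avg_flipped_bits(input: &[u8]) -> f64 {
    let hash = |bytes: &[u8]| {
        let mut h = DefaultHasher::new();
        h.write(bytes);
        h.finish()
    };
    let base = hash(input);
    let mut buf = input.to_vec();
    let n_bits = input.len() * 8;
    let mut total = 0u32;
    for bit in 0..n_bits {
        buf[bit / 8] ^= 1u8 << (bit % 8);               // flip one input bit
        total += (hash(&buf[..]) ^ base).count_ones();  // differing output bits
        buf[bit / 8] ^= 1u8 << (bit % 8);               // restore the original byte
    }
    total as f64 / n_bits as f64
}

fn main() {
    println!("average flipped output bits: {:.2}", avg_flipped_bits(b"some random input"));
}
```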

Figure 2. Average number of bits that differ between two hash values whose inputs differ in exactly one bit. Shown for FNV, fxhash and wyhash2; lines for all other hash algorithms stay very close to 32 bits.

Imperfect hash randomness may lead to an increased number of collisions. Specifically, I found the following configuration that forces fxhash and wyhash2 to generate colliding hash values: I randomly generated an alphanumeric prefix of a given size and then appended every possible 6-character hex suffix. In total there are 2²⁴ hex suffixes, far fewer than the 2⁶⁴ available hash values. Nevertheless, fxhash and wyhash2 produced between 9% and 94% collisions! This shows that these two algorithms achieve their high speed by almost ignoring the last bytes of the input. All other hash functions produced zero collisions.
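The experiment can be reproduced along these lines. This sketch uses a fixed, illustrative prefix instead of a randomly generated one, and DefaultHasher instead of the hashers actually tested; swapping in fxhash or wyhash2 is what exposes the collisions.

```rust
use std::collections::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hash `prefix` followed by every 6-character hex suffix (16^6 = 2^24 keys)
/// and return the fraction of keys whose 64-bit hash value was already seen.
fn collision_rate(prefix: &str) -> f64 {
    const HEX: &[u8; 16] = b"0123456789abcdef";
    let n = 1usize << 24;
    let mut seen = HashSet::with_capacity(n);
    let mut collisions = 0usize;
    for i in 0..n {
        let mut key = prefix.as_bytes().to_vec();
        for shift in (0..6).rev() {
            key.push(HEX[(i >> (4 * shift)) & 0xf]); // encode `i` as 6 hex digits
        }
        let mut h = DefaultHasher::new();
        h.write(&key);
        if !seen.insert(h.finish()) {
            collisions += 1;
        }
    }
    collisions as f64 / n as f64
}

fn main() {
    // "collection-42-" is just an illustrative prefix.
    println!("{:.2}% collisions", 100.0 * collision_rate("collection-42-"));
}
```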

Figure 3. Percentage of collisions for fxhash and wyhash2 for random alphanumeric prefixes and all possible 6-letter hex suffixes.

Such collisions can easily happen in real life: imagine a list of IDs that all belong to the same collection and therefore share the same prefix. The suffix length does not have to be 6, and characters other than hex digits can also lead to collisions. Therefore, I strongly advise against using fxhash and wyhash2 to hash anything longer than 16 bytes. Note that fxhash has poor randomness even for 8–16 byte inputs (see Figure 2).

In general, I recommend using ahash, but note that it is only suitable for in-memory use. If you plan to store hash values across different runs and different machines, it is safer to use the original wyhash. For very large inputs (≥ 1 kilobyte), wyhash and metrohash overtake ahash. For small inputs (up to 16 bytes, such as integers up to u128), it is probably safe to use wyhash2: it is faster than ahash, has high randomness, and its values can be stored between applications. These results will likely carry over to other programming languages, although additional benchmarking would not hurt.
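If you go with ahash, plugging it in takes very little code. A minimal sketch is shown below; the dependency version is only an example.

```rust
// Cargo.toml: ahash = "0.8"   (version shown only as an example)
use ahash::{AHashMap, RandomState};
use std::collections::HashMap;

fn main() {
    // AHashMap is a drop-in replacement for std's HashMap that uses ahash by default.
    let mut counts: AHashMap<String, u32> = AHashMap::new();
    *counts.entry("key".to_string()).or_insert(0) += 1;

    // Alternatively, keep std's HashMap and only swap the build-hasher.
    let mut map: HashMap<&str, u32, RandomState> = HashMap::with_hasher(RandomState::new());
    map.insert("other", 1);

    println!("{:?} {:?}", counts.get("key"), map.get("other"));
}
```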

You can find bandwidth, randomness and percentage of collisions for all 18 hashers in this table.
