Choosing a hash function to solve a data sharding problem

Miro Engineering · Jul 22, 2021

Artem Smirnov, Backend Engineer

At Miro, we work with PostgreSQL database sharding; we apply different approaches, depending on the business requirements. Recently, we had to shard new databases, and we chose a new approach to sharding, based on consistent hashing.

During the implementation phase, a crucial question we asked ourselves was: “Which implementation of the non-cryptographic hash function should we choose and use?” In this article, we’ll describe the criteria and the comparison algorithm that we developed and used to determine the best implementation.

About our architectural approach

Many database products (for example: MongoDB, Redis, and others) use consistent hashing for sharding; our implementation is going to be very similar.

Let’s suppose that as an input we have a set of entities with selected string-value sharding keys. Using the hash function, we get a hash code — a fixed-length string — for these keys. We determine the corresponding slot for the entity by running a modulo operation on the hash. The number of slots and the mapping between the entities and the slots are fixed. We also need to store the mapping between the shards and the ranges of slots — we can store this information in a configuration file.
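
As a rough sketch of how this lookup works (the slot count, the CRC32 stand-in for the hash function, and the hard-coded shard ranges are all illustrative values, not our production configuration):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class SlotRouter {
    // The slot count is fixed up front; 4096 is just an illustrative value.
    private static final int SLOT_COUNT = 4096;

    /** Maps a sharding key to its slot: hash the key, then take the modulo. */
    public static int slotFor(String shardingKey) {
        CRC32 crc = new CRC32();                       // stand-in for the chosen hash function
        crc.update(shardingKey.getBytes(StandardCharsets.UTF_8));
        long hash = crc.getValue();                    // unsigned 32-bit value held in a long
        return (int) (hash % SLOT_COUNT);
    }

    /** The slot-to-shard ranges live in configuration; this lookup is a stand-in. */
    public static String shardFor(int slot) {
        if (slot < 2048) return "shard-1";             // e.g. slots [0, 2048) -> shard-1
        return "shard-2";                              // e.g. slots [2048, 4096) -> shard-2
    }
}
```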

This approach offers several benefits:

  • Uniform distribution of entities across shards.
  • Determines the correspondence between entities and shards without additional storage, keeping resource usage to a minimum.
  • Enables adding new shards to the cluster.

There are also some disadvantages:

  • Some search operations may be inefficient if they need to send requests to all shards.
  • Resharding is complex.

Requirements

A crucial element in the decision process is selecting the Java implementation for the hash function.

The function takes as input a key that is a string of up to 256 characters. The result is the corresponding hash code, an unsigned integer of up to 4 bytes. In practice, we’re going to compare implementations that generate 2-byte and 4-byte hash codes.
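
In Java terms, the contract looks roughly like the sketch below (the interface name is ours, purely for illustration); masking with 0xFFFFFFFFL is one way to treat the signed int that Java returns as the unsigned value we actually need:

```java
/** Maps a sharding key (a string of up to 256 characters) to an unsigned hash code. */
interface ShardingHashFunction {

    /** The raw 32-bit (or 16-bit) hash code, as a signed Java int. */
    int hash(String key);

    /** The same code interpreted as an unsigned value, suitable for the modulo step. */
    default long unsignedHash(String key) {
        return hash(key) & 0xFFFFFFFFL;
    }
}
```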

Comparison criteria

Let’s consider four common criteria for comparing hash function implementations:

  1. Speed: the function should execute quickly on any input.
  2. Distribution type of the results: it’s very important that the function’s output hash codes are uniformly distributed.
  3. Resistance to collisions (of the first and second kind).
  4. Compliance with the avalanche effect: every output bit should depend on every input bit, for any input data.

For our task, we’re going to focus only on the first two criteria: the first one, because the hash calculation operation will be used very frequently; the second one, because it’s extremely important that the data is evenly distributed across shards.

The third criterion isn’t relevant to our use case, since no one can attack the function by exploiting its characteristics.

And if the fourth criterion isn’t met, we’d only get isolated outliers from a uniform distribution, which don’t matter for our purposes.

Implementations

We’re going to review the most popular Java implementations of non-cryptographic hash functions:

  1. DJB2 (32-bit)
  2. SDBM (32-bit)
  3. LoseLose (32-bit)
  4. FNV-1/FNV-1a (32-bit)
  5. CRC16 (16-bit)
  6. Murmur2/Murmur3 (32-bit).
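
To give a sense of how simple most of these functions are, here are textbook-style sketches of two of them, djb2 and FNV-1a, written over the UTF-8 bytes of the key (our own transcription, not necessarily the exact library code we benchmarked):

```java
import java.nio.charset.StandardCharsets;

public final class SimpleHashes {

    /** djb2: start from 5381, then hash = hash * 33 + byte for every input byte. */
    public static int djb2(String key) {
        int hash = 5381;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash = ((hash << 5) + hash) + (b & 0xFF);   // hash * 33 + b
        }
        return hash;
    }

    /** FNV-1a (32-bit): XOR each byte into the hash, then multiply by the FNV prime. */
    public static int fnv1a(String key) {
        int hash = 0x811C9DC5;                          // 32-bit FNV offset basis
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;                         // 32-bit FNV prime
        }
        return hash;
    }
}
```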

Testing

Input data

We’re using the following datasets as input:

  1. A set of real data, consisting of 216,553 English words.
  2. A set of synthetic data, consisting of randomly generated UTF-8 characters.

In both test sets, we’re going to have groups of strings of specific lengths (number of characters): “2”, “4”, “8”, “16”, “32”, “64”, “128”, and “256”.

Metrics

To compare different criteria, we’re using the following metrics:

  1. We’re going to measure speed in ops/ms (the number of operations per millisecond).
  2. We’re going to measure uniform distribution with Pearson’s chi-squared test. To do this, we have to introduce and test a hypothesis about the type of distribution of the results. Such a metric is binary; therefore, to visually assess how uniform the distribution of hash codes of each implementation is, we’re going to generate histograms of relative frequencies for each series of tests.

Tools

Evaluating the speed of work

To evaluate the speed of work, we’re going to use load tests and the JMH library. The general scheme of a test iteration is as follows:

We’re grouping the words from each test set by length, with the maximum length being 256 characters. Then, with each iteration, we’re passing words from each group to the input of the hash function with the same probability.

We’re using the following settings for benchmarks:

  • Number of warm-up iterations: 50
  • Number of measurement iterations: 100
  • Mode: throughput
  • Memory limits: -Xms1G, -Xmx8G
  • Profiler: GCProfiler, to estimate memory consumption
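
For illustration, a stripped-down benchmark with these settings might look like the sketch below; the TestData loader is hypothetical, SimpleHashes.fnv1a from the earlier sketch stands in for the implementation under test, and the complete test code is linked right after:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)              // report results as ops/ms
@Warmup(iterations = 50)
@Measurement(iterations = 100)
@Fork(value = 1, jvmArgs = {"-Xms1G", "-Xmx8G"})    // memory limits from the list above
@State(Scope.Benchmark)
public class HashFunctionBenchmark {

    // One group of words per run, selected by string length.
    @Param({"2", "4", "8", "16", "32", "64", "128", "256"})
    int wordLength;

    String[] words;

    @Setup
    public void setup() {
        // TestData.loadGroup is a hypothetical loader for the word group of this length.
        words = TestData.loadGroup(wordLength);
    }

    @Benchmark
    public int hashWord() {
        // Pick a word from the group with equal probability and hash it.
        String word = words[ThreadLocalRandom.current().nextInt(words.length)];
        return SimpleHashes.fnv1a(word);
    }

    // GCProfiler is attached when launching the benchmarks,
    // e.g. via new OptionsBuilder().addProfiler(GCProfiler.class) in the runner.
}
```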

You can view the complete test code here.

Evaluating the distribution of the results

To check whether the output values of the function meet our expectations, we’re going to test, at the α = 0.05 significance level, the hypothesis that the sample of results is uniformly distributed. To verify this, we’re going to use Pearson’s chi-squared test.

The algorithm for testing the hypothesis is as follows:

1. Split the sample into partial intervals: determine their number with Sturges’ rule (approximately $1 + \log_2 n$ intervals for a sample of size $n$), and their length with the rule of equal-interval grouping.

2. For each interval, calculate its characteristics: average value, frequencies, and relative frequencies.

3. Calculate the sample mean $\bar{x}$, the standard deviation $\sigma$, and the theoretical frequencies $n'_i = n \cdot p_i$, where $n$ is the number of elements in the sample and $p_i$ is the probability of a random variable falling into a partial interval. In our case it is $p_i = \frac{h}{b - a}$, where $h$ is the (equal) length of the intervals, and the parameters $a$ and $b$ are estimated as $a = \bar{x} - \sqrt{3}\,\sigma$ and $b = \bar{x} + \sqrt{3}\,\sigma$.

4. Now we can proceed with the calculation of Pearson’s chi-squared statistic, according to the formula $\chi^2 = \sum_i \frac{(n_i - n'_i)^2}{n'_i}$, where $n_i$ are the empirical frequencies obtained from the sample, and $n'_i$ are the theoretical frequencies found by the formulas above.

5. We determine the critical value $\chi^2_{cr}(\alpha, k)$ from the table of critical points of the $\chi^2$ distribution, for the given significance level α and the number of degrees of freedom $k$.

6. If $\chi^2 < \chi^2_{cr}$, we accept the hypothesis; otherwise, we reject it.
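
To make the procedure concrete, here’s a simplified Java sketch of the statistic calculation, with the endpoints a and b estimated from the sample moments as above (an illustration only; it skips the degrees-of-freedom bookkeeping and the table lookup):

```java
/**
 * Pearson's chi-squared statistic for the uniform-distribution hypothesis,
 * following the steps above: equal-length intervals over [a, b], empirical
 * frequencies counted from the sample of hash codes.
 */
public final class ChiSquared {

    public static double statistic(long[] hashes, int intervals) {
        int n = hashes.length;

        // Sample mean and standard deviation.
        double mean = 0;
        for (long value : hashes) mean += value;
        mean /= n;
        double var = 0;
        for (long value : hashes) var += (value - mean) * (value - mean);
        double sigma = Math.sqrt(var / n);

        // Endpoints of the uniform distribution, estimated from the moments.
        double a = mean - Math.sqrt(3) * sigma;
        double b = mean + Math.sqrt(3) * sigma;
        double h = (b - a) / intervals;

        // Empirical frequencies n_i per interval.
        int[] observed = new int[intervals];
        for (long value : hashes) {
            int i = (int) ((value - a) / h);
            if (i < 0) i = 0;
            if (i >= intervals) i = intervals - 1;
            observed[i]++;
        }

        // Theoretical frequencies n'_i = n * p_i, with p_i = h / (b - a).
        double expected = n * (h / (b - a));

        // chi^2 = sum of (n_i - n'_i)^2 / n'_i over all intervals.
        double chi2 = 0;
        for (int count : observed) {
            chi2 += (count - expected) * (count - expected) / expected;
        }
        return chi2;
    }
}
```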

The code that calculates Pearson’s chi-squared test and the probabilistic characteristics of the samples is here.

The general scheme of the test iteration is similar to the scheme in the previous section, and it looks like this:

We’re going to group the words from each test set by length, with a maximum length of 256 characters. Then, we’re going to create input test samples of different sizes (16384, 8192, 4096, 2048, and 1024), drawing words into each sample from each group with equal probability.

We’re going to pass all the elements of each group as the input of the hash function; this will produce output samples consisting of integer hash codes. After that, according to the algorithm above, we’re going to run Pearson’s chi-squared test to determine whether it satisfies the uniform distribution hypothesis.
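
Tying the sketches above together, the check for one sample then reduces to something like this (criticalValue comes from the chi-squared table for the chosen significance level and degrees of freedom):

```java
import java.util.List;

public final class DistributionCheck {

    /**
     * Hashes every word in the input sample and tests the uniform-distribution
     * hypothesis on the resulting hash codes. SimpleHashes and ChiSquared are
     * the sketches from earlier in this article.
     */
    public static boolean isUniform(List<String> inputSample, int intervals, double criticalValue) {
        long[] hashes = inputSample.stream()
                .mapToLong(word -> SimpleHashes.fnv1a(word) & 0xFFFFFFFFL)
                .toArray();
        return ChiSquared.statistic(hashes, intervals) < criticalValue;
    }
}
```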

You can view the complete test code here.

Results

Evaluating the speed of work

Let’s consider the speed of work (the number of operations per millisecond) for various implementations, depending on the length of the input strings.

In the range of 2 to 8 characters:

Diagram

You can see that in this range almost all algorithms work at the same speed, with loseLose slightly ahead of the rest, and only crc16 and sdbm standing out as obvious outliers.

In the range from 16 to 256 characters:

Diagram

The murmur2 function is a clear favorite, with murmur3 slightly behind it; on this sample, crc16 and sdbm still stand out as outliers.

Evaluating the distribution of the results

Let’s have a look at the results table related to Pearson’s chi-squared test:

As you can see, the crc16, murmur2, and murmur3 implementations satisfy Pearson’s chi-squared test of uniform distribution for almost all samples.

Now, let’s examine the histograms of relative frequencies, in the context of different samples.

The first three histograms below refer to loseLose, djb2, and sdbm, which failed the test. The distribution on these histograms is far from uniform; it looks more like a geometric pattern:

Diagram
Diagram
Diagram

FNV-1 and FNV-1a also failed the test; their histograms are quite similar, and their distributions are slightly closer to a normal one:

Diagram
Diagram

Now, let’s have a look at the top three winners:

Diagram
Diagram
Diagram

Except for some bursts, crc16, murmur2, and murmur3 satisfy Pearson’s chi-squared test, which is consistent with the characteristics of their relative frequency histograms.

Conclusion

Based on the tests we ran, let’s wrap up and choose the most suitable Java implementation of a non-cryptographic hash function, according to our two criteria: speed of work and uniformity of distribution.

Speed of work. The murmur2/murmur3 functions produce the best running times for input strings longer than 8 characters.

Uniform distribution hypothesis. We can distinguish three functions for which the uniform distribution hypothesis is satisfied on most data sets: crc16, murmur2, and murmur3. Their relative frequency histograms confirm the uniform distribution of their results.

Therefore, the murmur2/murmur3 implementations are the best choice, based on these criteria.

Join our team!

Would you like to be an Engineer at Miro? Check out opportunities to join the Engineering team.
