Which map is faster and is there an alternative to Judy?

Published in Badoo Tech · Aug 7, 2017 · 11 min read
[Image: scene from Top Gear: USA (Episode 2)]

For our most in-demand services at Badoo we use the C programming language and sometimes C++. These services often store hundreds of gigabytes of data in memory and process hundreds of thousands of requests per second. It is important for us to use not only appropriate algorithms and data structures, but also performant implementations of them.

Since the very start we have used Judy for implementing associative arrays (maps). Judy has a C interface and lots of advantages. We even use a wrapper for it in PHP, since for PHP versions up to 7.0 Judy is vastly superior in terms of memory use compared with the built-in maps.

However, time passes and it is quite a number of years since the most recent release of Judy — so now is the right time to look at alternatives.

My name is Marko. I am a systems programmer at Badoo working in the ‘Platform’ team. My colleagues and I have done some research into alternatives for Judy. We have come to some conclusions and now we have decided to share these with you.

Associative array

If you know what an associative array, a tree and a hash table are, then feel free to skip this section. For everyone else, here is a brief introduction to the subject.

An associative array is an abstract data type which allows you to obtain or save a value based on a key. It is abstract, because it doesn’t tell you anything about the implementation. And indeed, with a little imagination, you can think of dozens of different implementations which might be appropriate to varying degrees.

As an example, we can take the banal linked list. In the case of the ‘put’ function we go through the whole linked list from start to finish to ensure that we don’t yet have such an element and, if we don’t, then we add it at the end of the list; in the case of the ‘get’ function we simply go through the list from start to finish searching for the required element. The algorithmic complexity for searching for and adding an element to such an associative array is O(n) — which isn’t very good.

A simple but slightly more suitable implementation to consider is a plain array whose elements are always kept sorted by key. A search in such an array can be performed using binary search, with algorithmic complexity O(log(n)); however, adding an element may require copying large blocks of memory, because space has to be freed up for the new element at one strictly defined position.
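
As a rough illustration of this approach, here is a minimal C++ sketch built on std::vector and std::lower_bound. The SortedArrayMap type and its method names are purely illustrative, not code from our services:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// A toy ordered map on top of a sorted std::vector (illustrative names only).
// get() is O(log n); put() finds the slot in O(log n), but inserting into the
// middle may shift (copy) every element that follows it.
struct SortedArrayMap {
    typedef std::pair<uint32_t, uint64_t> Item;
    std::vector<Item> items; // always kept sorted by key

    static bool key_less(const Item& item, uint32_t key) { return item.first < key; }

    void put(uint32_t key, uint64_t value) {
        std::vector<Item>::iterator it =
            std::lower_bound(items.begin(), items.end(), key, key_less);
        if (it != items.end() && it->first == key) {
            it->second = value;                 // key already present: overwrite
        } else {
            items.insert(it, Item(key, value)); // may move everything after `it`
        }
    }

    const uint64_t* get(uint32_t key) const {
        std::vector<Item>::const_iterator it =
            std::lower_bound(items.begin(), items.end(), key, key_less);
        return (it != items.end() && it->first == key) ? &it->second : nullptr;
    }
};
```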

If we look for how associative arrays are implemented in practice, then the most likely answer is some sort of tree or hash table.

Hash table

A hash table is a data structure which implements an associative array. Inside it you can find a two-stage process.

In the first stage, our key is fed into a hash function. This is a function which takes in an array of bytes (our key), and usually gives out some number. At the second stage, this number is used to search for a value.

How? There are lots of options. The first option that comes to mind is to use this number as an index in a normal array where our values are located. But this option will not work for at least two reasons:

1. The set of values the hash function can produce is most probably much larger than the size of the array we are prepared to keep in memory.

2. Hash functions have collisions: two different keys can yield the same resulting number.

Both these problems are usually resolved by allocating an array of limited size, each element of which is a pointer to another array (a bucket). The output of our hash function is then mapped onto the size of this first array.

How? Once again, there are lots of options: for example, by dividing by the array size and taking the remainder, or simply by taking a few of the lower bits of the number in question if our array size is a power of two. For the processor, taking the remainder of a division is a much slower operation than masking off the lower bits, so preference is usually given to array sizes which are powers of two.
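
A small C++ sketch of both mappings (the function names are illustrative, not taken from any of the libraries discussed here):

```cpp
#include <cassert>
#include <cstdint>

// Two ways of mapping a hash value onto a bucket index (illustrative).
uint64_t bucket_by_modulo(uint64_t hash, uint64_t num_buckets) {
    // Works for any table size, but integer division/modulo is relatively slow.
    return hash % num_buckets;
}

uint64_t bucket_by_mask(uint64_t hash, uint64_t num_buckets) {
    // Only valid when num_buckets is a power of two: keep the lower bits.
    assert(num_buckets != 0 && (num_buckets & (num_buckets - 1)) == 0);
    return hash & (num_buckets - 1);
}
```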

I have described just one of the most common ways of implementing hash tables; in reality, a huge number of variants exist. And, depending on how collisions are handled, we obtain different levels of performance. In terms of algorithmic complexity, hash tables give us O(1) on average and O(n) in the worst case.

Trees

With trees, it is all pretty simple. For implementing associative arrays, various binary trees are used, such as, for example, the red-black tree. If the tree is balanced, we obtain O(log n) per operation.

Some implementations of trees, such as Google's btree, take the modern memory hierarchy into account and store keys in batches (several per node) for more effective use of processor caches.

Trees allow you to iterate over the elements in ascending or descending order of keys, which is often a very important requirement when choosing an implementation. Hash tables are not able to do this.
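
To make the difference concrete, here is a minimal C++ example using std::map, which is typically implemented as a red-black tree (the snippet only illustrates ordered traversal and is not part of the benchmark):

```cpp
#include <cstdint>
#include <cstdio>
#include <map>

int main() {
    // std::map is typically a red-black tree: iteration visits keys in order.
    std::map<uint32_t, const char*> m;
    m[42] = "answer";
    m[7] = "seven";
    m[100] = "hundred";

    // Prints 7, 42, 100: ascending key order, something a hash table
    // such as std::unordered_map does not guarantee.
    for (const auto& kv : m) {
        std::printf("%u -> %s\n", kv.first, kv.second);
    }
    return 0;
}
```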

What are we looking at?

Things are not, of course, black-and-white when it comes to the world of associative arrays: every implementation has its advantages and drawbacks.

Various implementations may differ not only in terms of performance, but also in terms of the range of methods they provide. For example, some implementations allow you to go through the elements in ascending or descending order of keys, while others do not. And this limitation is related to the internal architecture and not to the author’s predilections.

Indeed, the performance can vary significantly even depending on how you use the associative array. Are you mainly writing to it or reading? Are you reading keys randomly or in a certain order?

Gone are the days when Knuth's big O was enough to compare two algorithms. The real world is much more complicated. The modern hardware our programs run on has a memory hierarchy, and access times to its various levels differ enormously; taking this into account can speed an algorithm up several-fold, and ignoring it can slow it down to the same degree.

Another criterion which cannot be neglected is memory use: what is the overhead of the chosen implementation compared with the theoretical minimum (key size + value size, multiplied by the number of elements), and is such an overhead acceptable in our program?

Pauses, maximum delay (latency) and responsiveness also have an important role to play. Some implementations of associative arrays require an immediate, expensive internal restructuring when an inconvenient element is added, while others 'spread' these delays out over time.

Potential candidates

Judy

Judy (or Judy Arrays) is a structure invented by Douglas Baskins, which implements ordered associative arrays. The implementation is done in C and is very complicated. The author tried to take account of current architecture and to make maximum use of the processor’s caches.

Judy is a tree, not a hash table. And it is not simply a tree, but what is known as a radix tree. For us users, this means that using Judy we get key compression and the possibility of going through the elements in ascending or descending key order.

Judy offers several variants, differing in the types of key and value:

● Judy1 maps a word-sized integer key to a bit; it is convenient to use for large, sparse bitmaps.

● JudyL maps a word-sized integer key to a word (or pointer). It is the basic type and the one we most frequently use for the Badoo services.

● JudySL maps a null-terminated string key to a word; it is convenient to use for strings.

● JudyHS maps an arbitrary array of bytes to a word.
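
As an illustration of how JudyL is used, here is a minimal sketch assuming the usual <Judy.h> macros (JLI for insert, JLG for get, JLFA for freeing the whole array). It is only a sketch of the C interface, callable from C or C++, and not code from our services:

```cpp
#include <Judy.h>    // libjudy; link with -lJudy
#include <cstdio>

int main() {
    Pvoid_t judy = (Pvoid_t) NULL;   // an empty JudyL array (word -> word)
    Word_t* pvalue;

    // Insert: JLI gives us a pointer to the value slot for index 42.
    JLI(pvalue, judy, 42);
    *pvalue = 1000;

    // Lookup: JLG sets pvalue to NULL if the index is absent.
    JLG(pvalue, judy, 42);
    if (pvalue != NULL) {
        std::printf("42 -> %lu\n", (unsigned long) *pvalue);
    }

    // Free the whole array when we are done with it.
    Word_t bytes_freed;
    JLFA(bytes_freed, judy);
    return 0;
}
```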

std::unordered_map

unordered_map is implemented in C++ in the form of a template, so anything can act as a key and value. If non-standard types are used, you will have to define the function for hashing these types.

This is a hash table, so the function of going through the keys in ascending or descending order is not available.
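
As a small illustration of the previous point about non-standard key types, here is a hedged C++ sketch; UserId and UserIdHash are made-up names for this example:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// A non-standard key type needs its own equality operator and a hash functor.
struct UserId {
    uint32_t shard;
    uint32_t id;
    bool operator==(const UserId& other) const {
        return shard == other.shard && id == other.id;
    }
};

struct UserIdHash {
    std::size_t operator()(const UserId& u) const {
        // Pack both fields into one 64-bit value and hash that.
        return std::hash<uint64_t>()((uint64_t(u.shard) << 32) | u.id);
    }
};

int main() {
    std::unordered_map<UserId, std::string, UserIdHash> users;
    users[UserId{1, 42}] = "marko";
    // Note: iteration over `users` follows bucket order, not key order.
    return 0;
}
```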

google::sparse_hash_map

As you can tell from the name, sparse hash was developed at Google and has a minimal overhead — just two bits per value. It is also implemented in C++ and made in the form of a template (and this, in my view, is without doubt the advantage of implementation in C++).

Among other things, sparse hash is able to save its state to disk and load it back.

google::dense_hash_map

Besides sparse hash, Google also made a version called dense hash. It is notably faster (particularly for gets), but this comes at the cost of high memory use. Also, since dense_hash_map stores its data in flat arrays, it periodically requires expensive restructuring. Incidentally, Konstantin Osipov talked about these problems and other idiosyncrasies of hash tables very well in his HighLoad++ talk; I recommend studying it.

If the quantity of data which you are storing in an associative array is not large and if its size will not vary significantly, dense hash should be the best option for you.

Like sparse hash, dense hash is a hash table and a template. This means, I repeat, that you will not be able to iterate over the keys in ascending order, but you do get the option of using any types as keys and values.

And again, just like its cousin sparse_hash_map, dense_hash_map is able to save its state to disk.
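
For a feel of the interface, here is a minimal usage sketch of both Google maps. It assumes the headers from the google-sparsehash package and the documented requirement to reserve special 'empty'/'deleted' key values; the concrete sentinel values below are arbitrary and must simply never occur among real keys:

```cpp
#include <cstdint>
#include <sparsehash/dense_hash_map>    // from the google-sparsehash package
#include <sparsehash/sparse_hash_map>

int main() {
    // dense_hash_map keeps its entries in a flat array and requires a key
    // value that never occurs in real data to mark empty slots (and a second
    // one for deleted slots if erase() is going to be used).
    google::dense_hash_map<uint32_t, uint64_t> dense;
    dense.set_empty_key(0xFFFFFFFFu);
    dense.set_deleted_key(0xFFFFFFFEu);
    dense[42] = 1000;

    // sparse_hash_map trades some speed for a much smaller per-entry overhead;
    // it only needs a deleted key, and only if erase() is used.
    google::sparse_hash_map<uint32_t, uint64_t> sparse;
    sparse.set_deleted_key(0xFFFFFFFFu);
    sparse[42] = 1000;
    sparse.erase(42);
    return 0;
}
```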

spp::sparse_hash_map

And, finally, the last candidate, from one of the Google developers: sparsepp (spp). The author took sparse hash as a basis and reworked it, achieving minimal overhead: one bit per entry.

The same applies again: hash table and template. Implementation in C++.
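
spp is designed as a drop-in replacement for std::unordered_map, so a usage sketch looks almost identical; the include path assumes a standard installation of the header-only sparsepp library:

```cpp
#include <cstdint>
#include <sparsepp/spp.h>   // header-only sparsepp; the include path may differ

int main() {
    // spp::sparse_hash_map follows the std::unordered_map interface,
    // so no empty or deleted sentinel keys need to be reserved.
    spp::sparse_hash_map<uint32_t, uint64_t> map;
    map[42] = 1000;
    return map.find(42) != map.end() ? 0 : 1;
}
```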

Those are the candidates. Now to test them to see what they are capable of.

Two more factors

However, first I will point out that there are two further things which will affect performance and memory use.

Hash function

All the candidates in our research are hash tables and use hash functions.

Hash functions have quite a lot of properties, each of which may be important — to a greater or lesser extent — depending on the usage in question. For example, in cryptography it is important for minimal change in incoming data to produce maximum possible change in the corresponding result.

However, it is speed which is of most interest to us. The performance of hash functions can vary a lot, and speed often depends on the quantity of data fed into them. When a long string of bytes is fed into a hash function, it may make sense to use SIMD instructions to exploit the capabilities of modern processors more fully.

In addition to the standard hash function from C++, I will see how xxhash and t1ha cope.
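
Swapping the hash function in any of the C++ candidates comes down to supplying a functor as the table's hash template parameter. A hedged sketch with xxhash, assuming its standard XXH64 function (the functor name is made up for this example):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <xxhash.h>   // from the xxhash library; link with -lxxhash

// A functor wrapping XXH64 so it can be plugged in as the Hash template
// parameter of any of the hash tables discussed above.
struct XXHash32Key {
    std::size_t operator()(uint32_t key) const {
        return static_cast<std::size_t>(XXH64(&key, sizeof(key), 0 /* seed */));
    }
};

int main() {
    std::unordered_map<uint32_t, uint64_t, XXHash32Key> map;
    map[42] = 1000;
    return 0;
}
```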

Memory allocation

A second factor which will definitely affect our results is the memory allocator. Speed of operation and the volume of memory used depend on it directly. The standard allocator from libc is a jack of all trades: good for everything, but ideal for nothing.

As an alternative, I will consider jemalloc. Even though the test does not exercise one of its main advantages, namely multi-threaded workloads, I still expect things to run faster with this allocator.

Results

The results of the random-lookup test over ten million elements are given below. The two parameters which interest me most are the time taken to run the test and the maximum memory used (the RSS of the process at its high point).

As a key I used a 32-bit integer, and as a value a pointer (64 bits). These are the key and value sizes we use most often.

I haven’t shown the results of other tests, which you can find in the repository, as I was most interested in random get and memory use.

The fastest associative array proved to be dense hash from Google. Next came spp and then std::unordered_map.

However, if we look at memory, we can see that dense_hash_map uses more memory than the other implementations. In fact, that is what the documentation states. Moreover, if our test included concurrent reads and writes, we would probably run into stop-the-world rehashing. Thus, in applications where responsiveness matters and a large number of entries are written, dense hash cannot be used.

I also looked at alternative implementations of hash functions. However, it turned out that the standard std::hash is the fastest.

A colleague happened to notice that, for 32- and 64-bit keys, std::hash is basically just an identity function: the number fed in and the number coming out are the same; in other words, the function does nothing at all. xxhash and t1ha, for their part, compute an honest hash.
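
This is easy to check: on GCC with libstdc++ (the toolchain used here), the following snippet prints the key back unchanged. The standard permits but does not require this behaviour, so treat it as an observation about this particular implementation:

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>

int main() {
    // With GCC/libstdc++, std::hash of an integral type returns the value
    // itself; other standard library implementations may behave differently.
    uint64_t key = 123456789;
    std::size_t h = std::hash<uint64_t>()(key);
    std::printf("key = %llu, hash = %zu\n", (unsigned long long) key, h);
    return 0;
}
```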

In terms of performance, jemalloc is useful: practically all implementations become faster.

However, in terms of memory use jemalloc is not so good.

Source code and hardware

I put the source code for the benchmarks on github. If you want, you can replicate my tests or expand them.

I performed the tests on my own home desktop computer:

● Ubuntu 17.04

● Linux 4.10.0

● Intel® Core(TM) i7–6700K CPU @ 4.00GHz

● 32 GiB DDR4 memory

● GCC 6.3.0

● Compiler flags: -std=c++11 -O3 -ffast-math

Brief conclusions and concluding remarks

Our tests have shown the strong potential of spp as a replacement for Judy. spp proved to be faster for random gets and its overhead in terms of memory is not too large.

We have made a simple wrapper so that this C++ library can be used from our C code, and in the near future we shall try spp out on one of our most in-demand services.

In terms of a purely philosophical conclusion I would say that each of the implementations we have considered has its advantages and disadvantages and a choice should be made based on the task at hand. Understanding the internal structure of the implementations in question should help you to find the right solution for your particular task. It is this understanding which often makes the difference between simply being an engineer and being a more proficient one.

Marko Kevac, Systems programmer.
