Composite key in HashMaps
In my last post I talked about the problems of using an incorrect hash function when you put an object with a composite key in a Java
HashMap, but I was left with a question: which data structure is better for indexing those objects?
Continuing with the same example, I will talk about products and stores, and I will use their identifiers to form the map key. The proposed data structures are:
- A single map with a key containing its indexes:
HashMap<Tuple<Integer, Integer>, MyObject>, which I will call TupleMap.
- A nested map:
HashMap<Integer, HashMap<Integer, MyObject>>, which I will call DoubleMap.
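As a minimal sketch of the two candidates (Tuple is a hypothetical key class here, and the put/get helpers are illustrative, not the benchmark code from the repository):

```java
import java.util.HashMap;
import java.util.Map;

public class MapStructures {

    // Hypothetical composite key; equals/hashCode (auto-generated by the
    // record) are mandatory for any class used as a HashMap key.
    record Tuple(int productId, int storeId) { }

    // TupleMap: a single map indexed by a composite key object
    static final Map<Tuple, String> tupleMap = new HashMap<>();

    // DoubleMap: nested maps, one level per key component
    static final Map<Integer, Map<Integer, String>> doubleMap = new HashMap<>();

    static void put(int productId, int storeId, String value) {
        tupleMap.put(new Tuple(productId, storeId), value);
        doubleMap.computeIfAbsent(productId, k -> new HashMap<>())
                 .put(storeId, value);
    }

    static String getFromTupleMap(int productId, int storeId) {
        // Each lookup needs a Tuple instance just to query the map
        return tupleMap.get(new Tuple(productId, storeId));
    }

    static String getFromDoubleMap(int productId, int storeId) {
        // Two lookups, but no extra key object is allocated
        Map<Integer, String> byStore = doubleMap.get(productId);
        return byStore == null ? null : byStore.get(storeId);
    }

    public static void main(String[] args) {
        put(1, 10, "product 1 in store 10");
        System.out.println(getFromTupleMap(1, 10));  // prints "product 1 in store 10"
        System.out.println(getFromDoubleMap(1, 10)); // prints "product 1 in store 10"
    }
}
```

Note the asymmetry: TupleMap allocates a key object on every access, while DoubleMap pays for an extra map lookup instead.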
To resolve all doubts and draw conclusions, I will measure:
- Memory consumed by indexing a collection of objects
- The time needed to randomly store that collection of objects
- The time needed to recall, also randomly, all the elements of the collection
TL;DR: this post is a bit drier than the last one, so I'll save you the trouble of reading the whole thing:
- DoubleMap is more memory efficient, consuming 30% less than TupleMap
- In large collections, DoubleMap is about 30% faster at indexing; in small collections it is considerably faster.
- In large collections, DoubleMap and TupleMap have similar fetching performance, while in small collections DoubleMap is significantly faster.
All the source code needed to reproduce the tests is in this GitHub repository: https://github.com/jerolba/hashcode-map.
In this case, I will use the hash function that generates the fewest collisions and minimizes memory consumption, so my benchmarks won't be distorted by it, and the TupleMap version will be in an optimal (and optimistic) position: it never has to deal with collisions, and therefore spends less memory and CPU.
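A composite key with few collisions can be sketched like this. The actual Tuple class and hash function live in the benchmark repository; this is an assumption-laden version where multiplying the product id by a constant larger than the store-id range (here 100,000, assuming ids stay below it) keeps every (product, store) pair in a distinct hash:

```java
// Hypothetical composite key with a collision-free hash, assuming
// productId and storeId both stay below 100,000.
public final class Tuple {
    private final int productId;
    private final int storeId;

    public Tuple(int productId, int storeId) {
        this.productId = productId;
        this.storeId = storeId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Tuple)) return false;
        Tuple other = (Tuple) o;
        return productId == other.productId && storeId == other.storeId;
    }

    @Override
    public int hashCode() {
        // Distinct for every pair within the assumed id ranges, so
        // HashMap never needs to chain or treeify colliding entries.
        return productId * 100_000 + storeId;
    }
}
```

With 10,000 products and 500 stores this stays well within int range and produces zero hash collisions, which is exactly the optimistic scenario described above.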
If we use an object as the primary key, how much memory will those key instances consume? And if we use nested HashMaps, how much will the overhead of the extra maps penalize us?
If we randomly fill a map with 10,000 products and 500 stores, we get the following chart of memory consumption, taking into account only the classes involved in the maps:
On average, the map with a key object (Tuple) consumes 50% more memory, using 299 MB compared to the 193 MB of DoubleMap.
Looking at the histogram of the objects in memory we see that the instances of the Tuple class are using 114 MB and collisions are not taking place because instances of the TreeMap type do not appear:
while in the DoubleMap version the extra HashMap instances are consuming only half a megabyte, and the biggest difference lies in the space used in the node arrays:
Therefore, if using this type of map forces you to create new instances of the object that represents the primary key, I would choose a data structure such as HashMap<A, HashMap<B, MyObject>>.
How long does it take to create a large collection in each case? Does the number of elements of each type have much influence?
In order not to bore you with the details of the benchmark I summarize it in a single chart where it shows the time (milliseconds) needed to insert a random collection of products and stores according to different total numbers of products and stores:
On average, keeping all the information in a single map carries a time penalty of at least 40%.
Even though I don't show the charts here (the data is at the end of the post), increasing the number of elements in either of the two variables (products or stores) increases the execution time linearly.
How long does it take to access the value associated with a composite key? Is there a penalty for having to look up in two maps? In the TupleMap version, will it take longer if I have to instantiate a new Tuple object for each query?
As before, I will summarize in a single chart the benchmark of randomly querying all the values of a large collection, for different total numbers of products and stores:
Although DoubleMap’s execution time is always slightly below that of TupleMap, we can consider that they have a very similar performance, and therefore the access time should not condition the choice of one data structure over another.
Surprisingly, having to create a Tuple instance for each access does not hurt performance, and even improves it slightly in large collections (JVM optimizations are inscrutable).
The problems I usually face involve large collections, but the benchmark results show that with small collections the DoubleMap implementation performs much better.
To visualize it better, here are two charts: one where the collection has 1,000 products and one where it has 2,000:
TupleMap's times are between 2 and 6 times worse. Without having analyzed the internal behavior of the JVM/CPU/memory, my intuition is that the smaller data set improves locality of reference and causes fewer problems with the cache lines.
I also analyze it by looking at two charts for different numbers of products:
The TupleMap version's times are between 50% and 150% worse. Again, I don't dare to say exactly why, but I still believe it is related to differences in the cache usage pattern.
Despite the results obtained, in this case I think it is hard to draw clear conclusions from microbenchmarking alone. The behavior of the data structures can vary between production code and benchmark code.
In production code, between one map access and the next, your application does many other things that affect what is available in the caches, generating an access pattern completely different from the benchmark's.
Personally, I stick with the idea of consuming less memory by not instantiating the Tuple class and using nested HashMaps directly. To avoid ugly code, you can create an abstraction that encapsulates the behavior of the double map.
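Such an abstraction could look like this hypothetical wrapper (a sketch, not a class from the repository), which hides the two levels behind a composite-key API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper over nested HashMaps: client code works with a
// composite key (keyA, keyB) and never sees the two map levels.
public class DoubleMap<A, B, V> {

    private final Map<A, Map<B, V>> map = new HashMap<>();

    /** Associates value with (keyA, keyB); returns the previous value, if any. */
    public V put(A keyA, B keyB, V value) {
        return map.computeIfAbsent(keyA, k -> new HashMap<>()).put(keyB, value);
    }

    /** Returns the value for (keyA, keyB), or null if absent. */
    public V get(A keyA, B keyB) {
        Map<B, V> inner = map.get(keyA);
        return inner == null ? null : inner.get(keyB);
    }

    /** Removes and returns the value for (keyA, keyB), or null if absent. */
    public V remove(A keyA, B keyB) {
        Map<B, V> inner = map.get(keyA);
        return inner == null ? null : inner.remove(keyB);
    }
}
```

Client code then reads like a flat map with two key parameters, e.g. `products.put(productId, storeId, myObject)`, while keeping the memory profile of the nested structure.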
Working with two levels of HashMaps does not seem to be a performance problem, and is even faster, especially in relatively small collections.
Using DoubleMap we also forget about the problem of having to choose a hash function that minimizes collisions, because the key is distributed between the two levels of HashMaps.
The source code needed to run the benchmarks is in the GitHub repository, but so that you can see the raw data behind the charts, I include the results below. This way you can also do your own analysis and draw your own conclusions.