Things every engineer should know: Hashmaps

Doogal Simpson
The Startup
Published in
11 min readMay 15, 2020

--

TL:DR

They map keys to values.

Hashmaps are probably the most commonly used implementation of the concept of a map. They allow arbitrary objects to be associated with other arbitrary objects.

This can be very useful for doing things like grouping or joining data together by some common attribute.

The underlying structure does the following:

  • Calculate the hash of the key
  • Mod the hash to give an integer between 0 and the size of the array of buckets
  • Iterate through the linked list in the selected bucket, comparing equality of keys
  • When / if keys match, return the value associated with that key.

Basic structure

Naming

I’m calling these things hashmaps, but depending on the language or context they can have different names.

In C# or Python, the rough equivalent is a Dictionary.

In Javascript, there is an in built Map class that can be used, however, the Javascript object can be used in a way that is similar to a hashmap. Under the hood, Javascript isn’t using a hashmap because it can do some optimizations.

Java has a class called HashMap, it also has HashTable, which is basically the same thing. However, HashMap is generally recommended over HashTable.

If you see something called dictionary / hashmap / hashtable, you can think of them as the data structure described in this story (mostly).

Hash function

As the name might suggest, hashmaps do a thing called ‘hashing’. This is an operation that takes an arbitrary object and gives back an integer. The way hash functions work could fill out it’s own story and probably a lot more.

For the purposes of this story, a hash function can be thought of as a black box that does something like this:

Integer hash = hashFunction(someArbitraryObject);

So why is this useful? Well, if you can transform any object into an integer, you can use that integer to navigate to the index of an array. This is what hashmaps do, the basic steps are as follows:

  • Take key and value
  • Calculate the hash of the key
  • Use hash to find index of an array
  • Store key and value in array

There is a slight complication here. A hash function can return a hash that has a value anywhere between the max negative and positive values for an integer. We do not want to have an array with billions of elements in it as that would use too much space. We also do not have the concept of a negative index in an array.

Ideally, we’d like the index to be somewhere between 0 and some reasonable size for an array, say, 100.

So how do we get our hashed integer, with a value that can be anything from negative several billion, to positive several billion, to map to the range 0 to 100?

To do this, we need 2 functions, absolute and modulus.

The absolute function will make any negative number a positive one, so -34671 becomes +34671.

The modulus function is used to calculate the remainder of the number with respect to another. For example, if we are using 100 as the number we want to modulus by, then:

  • 0 modulus 100 = 0
  • 1 modulus 100 = 1
  • 2 modulus 100 = 2
  • 3 modulus 100 = 3
  • 47 modulus 100 = 47
  • 99 modulus 100 = 99

until we hit 100. At 100, the modulus causes the returned value to loop back to 0.

So,

  • 100 modulus 100 = 0
  • 101 modulus 100 = 1
  • 102 modulus 100 = 2
  • 147 modulus 100 = 47
  • 199 modulus 100 = 99

and so on until we get to 200. At 200 the returned value loops back to 0 and starts counting back up to 99 until the value we want to modulus reaches 300, then the returned value goes back to 0 and repeat, and repeat, and repeat.

So the modulus function can be used to map any (positive) integer into a range between 0 and the value we want to modulus by.

Given an absolute function and a modulus function we can transform our hashed integer to an integer in a range that we want:

Integer sizeOfArray = array.size();// hashedInteger is somewhere between -BigNumber -> +BigNumber
Integer hashedInteger = hashFunction(key);
// absInteger is somewhere between 0 -> +BigNumber
Integer absInteger = absolute(hashedInteger);
// modInteger is somewhere between 0 -> sizeOfArray
Integer modInteger = modulus(absInteger, sizeOfArray);

Once we have our integer in a range that is the size of our array we need to store our key and value in that array.

Array of buckets

So the other part of our hashmap is the ‘Array of Buckets’. This is an array that contains other lists. These other lists are usually linked lists, you can find out more about linked lists in my story here.

We take our hashed, absolute, modulo integer and use that to select the index of the array of buckets. We then take our key and value and store them in the bucket. The bucket is a linked list, we store the key and value together at the end of this list.

The key and value pair are known as an ‘entry’. An entry is a simple wrapper object that contains both the key and the value.

Inserting a thing

We now have a number of things:

  • We can take an arbitrary object and hash / abs / mod it to an integer that fits in our array
  • We know that the array stored linked lists of entries.
  • Entries contain key — value pairs.

We can combine these to describe how a key — value pair is stored in a hashmap:

  • Given a key — value pair
  • Calculate the hash of the key
  • Transform that hash to be between 0 and the size of our array
  • Find the linked list at the index of the transformed hash
  • Add the key — value pair as an entry to the end of the linked list

and you’re done.

The key — value is stored and can then be queried. Before we move on to querying a hashmap, it is worth discussing the time complexity of storing an element in a hashmap.

The short answer is that it is O(1), or constant time. Which is to say, as the hashmap gets larger, the time taken to store a key — value pair does not change. This is because the various operations that make up the storing of a thing in a hashmap are all O(1) time complexity:

  • Calculating a hash is O(1) (or at least it should be, again, complicated subject, not going to go into detail here).
  • Abs / Mod are O(1)
  • Getting an element from an array is O(1). I have another story on array lists here. Arrays and array lists are not the same thing, but they share a lot of time complexity characteristics.
  • Adding an element to the end of a linked list is O(1). This is discussed on my story on linked lists.

As none of these operations get slower the larger the hashmap grows, the overall operation of inserting into a hashmap also does not get slower as it grows.

Getting a thing

Our hashmap is now full of key — value pairs, it would be useful if we could get our value if we provide a key. The way this works is as follows:

  • Given a key (but, no value)
  • Calculate the hash of the key
  • Transform that hash to be between 0 and the size of our array
  • Find the linked list at the index of the transformed hash
  • Start moving through each entry in the linked list
  • For each entry, ask if the input key and the key held in the entry are equal
  • If they are not equal, move on to the next entry
  • If they are equal, return the value from the entry

Through these steps we can take our key, and map it to a value. The constraint on a key is that you need to be able to ask if two keys are equal. In Java, this is not a problem as every object implements an .equals() method.

What is the time complexity of this operation? That is a more complicated question than the case of inserting into a map. The time complexity is very dependent on how many entries you have to iterate through in the linked list before you find the entry with the matching key.

There are 2 extreme cases:

  • All of the entries have been put in one bucket
  • All of the entries have been perfectly evenly distributed across all available buckets

All of the Entries have been put in one bucket

A possible situation is that all of the keys were hashed / abs / modded to the same array index. In this case, all entries are added to the same linked list and to find the entry that matches our key will take O(N) (or, linear) time. The reason that finding an element in a linked list is O(N) is discussed here.

A linear time complexity means that as the hashmap gets larger, the time taken to find an element in it also gets larger in a way that is proportional to the size of the hashmap. If the hashmap grows by 100 times, the time taken will also grow by roughly 100 times.

This situation is not ideal, so how did we get here?

This is the outcome of having a ‘bad’ hashing function. Technically a hash function could be implemented to always return 0, regardless of what key was passed to it, this would map all entries to the same bucket and is a very bad hashing function.

Ideally, we’d want a ‘good’ hashing function, one that returns evenly distributed hash integers that will all be transformed into evenly distributed indices in our array.

All of the entries have been perfectly evenly distributed

The desired situation when getting data from a hashmap is that each of the entries have been spread evenly across the buckets. This leads to the situation where each linked list contains the same number of entries.

This is the best possible situation for a hashmap because each entry can be retrieved in a roughly equal amount of time and that amount of time is the smallest it can be.

The time complexity for retrieving an item from a hashmap with evenly distributed entries is O(N / k), where, N is the number of entries held by the hashmap, and k is the number of buckets that the map has.

What hashmaps are good for

Mapping arbitrary values

As the name suggests, they are good at mapping things. You can think of an array as a way of mapping integers 0 to N to arbitrary values. This is great if you are dealing in integers, but not as useful otherwise.

Maps allow you to map from an arbitrary object to another arbitrary object. It might be a String, Integer or an object with multiple fields.

As long as the key can be used to produce a hashcode, and it can be checked for equality against other keys, then the object can be used in a map.

In Java, every object needs to implement the .hashcode() and .equals() methods, so every object in Java can be used in a map.

A common use case is for doing joins between lists of objects. For example, say we had an object like:

public BookDTO {
private String id;
private String name;
}

And let’s say we had 2 lists of BookDTOs: “Books that a user liked” and “Books that are on sale”. The logic that the business is after is they want to show books that the user liked that are on sale to the user. To do this, we need to find all the books that are common to both lists.

One way of doing this would be to go through every book the user liked, and for each one, go through all the books that are on sale. This would not be ideal because it involves going through every possible combination.

An alternative would be to make 2 maps, each mapping ID to BookDTO, out of the 2 lists. Then, go through each of the books a user liked, using that ID to pull out any books that are on sale from the other map. Using this method, we do one pass through the map and can join any relevant books without having to try out every combination.

Grouped information

Grouping information by some common attribute is a pretty common thing to do. For example, say we wanted to count the number of times a String occurs in an arbitrary list of strings.

One way of representing the output of that would be as a map of String to Integer. The code could be written as follows:

public Map<String, Integer> countItems(List<String> items) {  Map<String, Integer> stringToCount = new HashMap<>();
for (String item in items) {
if (!stringToCount.containsKey(item)) {
stringToCount.put(item, 0);
}
Integer currentCount = stringToCount.get(item);
Integer newCount = currentCount + 1;
stringToCount.put(item, newCount);
}
return stringToCount;
}

So for the input:

["Foo", "Bar", "Foo", "Baz"]

The output would be:

{
"Foo": 2,
"Bar": 1,
"Baz": 1
}

This is a relatively simple example, but if you are processing transactions and decide you want to group all of a user’s transactions by user, then representing them as a map of User ID to List[Transaction] can be a useful way of achieving that grouping.

What hashmaps are not good for

Storing ordered information

Hashmaps are not great if you want to store information in an ordered way. Typically, hashmaps do not give any guarantees on the order in which their entries can be iterated in, so you cannot rely on the the order items were added to a hashmap having any bearing on the order in which they are iterated.

There some counterexamples to this. In Java, there is the LinkedHashMap class that allows iteration of entries in insertion order. There are also flavours of sorted maps that allow keys with explicit ordering to be iterated in order.

Storing repeated information

The key in a hashmap defines the identity for a value. If you try to store multiple values for the same key, the values will be overwritten. This is different to a list where you can repeatedly add the same object multiple times.

The easy way around this is to change the value being stored to be a list of the object that is being inserted multiple times.

Summary

Hashmaps are a very useful data structure for mapping some arbitrary data some other arbitrary data.

They rely on a good hash function, the modulus function and a list of “buckets”.

Some standard use cases for maps are joining 2 collections of data, or grouping data together by some common key.

About the author

Hi, I’m Doogal, I’m a Tech Lead that has spent a bunch of years learning software engineering from several very talented people and these stories are my way of trying to pay it forward.

During my time as a Tech Lead I have mentored a lot of new software engineers and I have found there is often a situation where engineers don’t know what they don’t know. So this “Things every software engineer should know” series is a cheatsheet of the information I’d give myself in my first year of doing software.

Software is a large subject and the golden rule is that the answer to any question can start with “It depends, …”, as a result, the information in these stories is not complete. It is an attempt to give the bare essential information, so as you read these stories, please keep in mind that the rabbit hole goes deeper than the subject matter displayed here.

I can be found on Facebook, LinkedIn or pokeprice.io.

--

--

Doogal Simpson
The Startup

Technical Lead at pokeprice.io, spent some years learning software and now trying to pay it forward.