Why Should the Length of Your Hash Table Be a Prime Number?

And other hash basics

Madeline Corman
The Startup
5 min readAug 7, 2020

--

A collection of stored data…

Every thorough data structures and algorithms course will cover the hash table data structure and, by extension, hash functions. In reviewing data structures recently, I came across the notion of reducing collisions by making the length of your hash table a prime number. Due to the limited scope of the course, the author did not go into much detail as to why this works and encouraged some self-research if so inclined. It turns out I am so inclined and I wanted to get to the bottom of this seemingly magic fix. To provide a little context, we will first briefly go over hash tables, hash functions, what qualities make a good hash function, and finally how a hash table of prime number length reduces collisions.

Hash Tables

With most languages featuring a built-in version of a hash table, they are an extremely useful and common data structure. They are known as dictionaries in Python, objects in JavaScript, Maps in Java, Go, and Scala, and hashes in Ruby. Hash tables are primarily used to store data in key-value pairs. With the ability to quickly locate data using its associated key, hash tables are an excellent option for data access, insertion, and removal. This is a marked improvement over arrays that, while providing quick access using indices, can have costly time complexities when adding and removing elements.

In most cases, utilizing a language’s built-in hash function is probably the best option, however, they can be modeled from scratch using an array. In this case, we would provide the key corresponding to the data we wish to access. This key must be transformed into an index where the key-value pair is stored and then using the index the desired data is returned. This is where hash functions come in to play.

hash functions

In general, hash functions take an input of any size and return an output of a fixed size; it could be a short string or an integer. These functions are ‘one-way’ meaning we cannot construct the original input by working backward from the output. As a result, hash functions are often used in cryptography.

To illustrate, let’s say we are using a hash table to store data relating to a collection of books with keys corresponding to the books’ ISBNs and the values of the books’ titles. Our hash function would take the ISBN as an argument and return an index in which the data related to that ISBN could be found. Using this index we can look up and return the book title.

There are 3 hallmarks of a good hash function (though maybe not a cryptographically secure one):

  1. The function is fast. Access, insertion, and deletion can be performed in constant, O(1), time.
  2. The function does not tend to cluster multiple outputs at certain indices. Outputs are uniformly distributed along the hash table.
  3. The function is deterministic. For a given input, the same output index is returned every time.

A Word on Collisions

Point number two in the above section is key to reducing collisions. Collisions occur when multiple key-value pairs can be found at a given index in the hash table. Returning to the book data analogy, let's say the data relating to ‘The Fellowship of the Ring’ and ‘A Clockwork Orange’ both reside at the index of 3 as a result of a sub-optimal hash function. We call the hash function with the ISBN of ‘A Clockwork Orange’, the hash function identifies that the data is found at the third index, but returns ‘The Fellowship of the Ring’ instead!

Collisions are more common with small hash tables, however, even with large hash tables and excellent hash functions, collisions are inevitable. The techniques of separate chaining and linear probing can go a long way in reducing collisions, but we will be focusing on a third technique: using prime numbers.

Why do Prime Numbers Work?

Generally, hash functions calculate an integer value from the key. To ensure this integer value is within the length of the hash table, a handy trick is to calculate the modulo of the integer and table length. The result will range somewhere from 0 to the length-1.

function hash(key, length=15) {
let total = 0
for(let char of key) {
// map "a" to 1, "b" to 2, "c" to 3, etc.
let value = char.charCodeAt(0) - 96
total = (total + value) % length
}
return total
}
// => 13

In the case above where the default length is set to 15, integers that are multiples of 3 or multiples of 5, but not multiples of 15, will be hashed into indexes of 3 and 5, respectively. For example, keys that produce integers of 0, 15, 30, etc. will be assigned index 0, keys that produce integers of 3, 18, 33, etc. will be assigned index 3, and keys that produce integers of 5, 20, 35, etc. will be assigned index 5. Collision central.

Put another way, every integer that shares a common factor with the length will be hashed into an index that is a multiple of this factor.

In the case of non-random data, a hash table of a prime number length will produce the most wide-spread distribution of integers to indices. In the case above, patterns arose for each factor of the length, 15. As a result, it benefits us to choose a length that has the least number of factors. Enter prime numbers. They famously are only divisible by 1 and themselves. Thus, choosing to set your hash table length to a large prime number will greatly reduce the occurrence of collisions.

Closing Thoughts

Hash tables are incredibly useful data structures, providing fast access to stored data given a key. The best implementations of hash tables distribute keys uniformly and are deterministic in nature. While collisions will likely be an issue in even the best hash tables, we can reduce their occurrence using a combination of a hash table of prime number length coupled with methods like separate chaining or linear probing.

--

--

Madeline Corman
The Startup

Full-stack software engineer working with primarily Ruby and JavaScript frameworks.