Hash tables

A Brief Introduction

Mark Maher
Mar 11, 2019
Hash table illustration by Jorge Stolfi is licensed under CC BY-SA 3.0

What is a hash table?

A hash table is a data structure that stores data as key/value pairs with the help of one or more hash functions. Hash tables are fast: they allow for data lookup (retrieval), insertion, and deletion in constant time, at least on average. This makes them an ideal data structure for things like storing an address book, indexing records in a database, or representing an object in a programming language (JavaScript uses hash tables to implement its objects!).
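As a quick sketch of the address-book idea, here is JavaScript's built-in Map, which is typically implemented as a hash table under the hood; the names and addresses are purely illustrative:

```typescript
// JavaScript's built-in Map is typically backed by a hash table:
// insertion, lookup, and deletion are constant time on average.
const addressBook = new Map<string, string>();

addressBook.set("Ada Lovelace", "12 Analytical Way"); // insertion
addressBook.set("Alan Turing", "99 Enigma Lane");

console.log(addressBook.get("Ada Lovelace")); // lookup: "12 Analytical Way"

addressBook.delete("Alan Turing"); // deletion
console.log(addressBook.has("Alan Turing")); // false
```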

How do hash tables work?

A hash table consists of an array of slots (often called buckets) and a hash function. The hash function transforms a key (often a string) into an array index, typically using modular arithmetic. Hash functions are deterministic, which means that the same input will always return the same output, but a good one should also distribute indices uniformly across the array and minimize collisions.
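For illustration, here is a deliberately naive hash function that sums character codes and uses the modulo operator to keep the result within the array's bounds. Real hash functions mix bits far more thoroughly, but the deterministic, modular idea is the same:

```typescript
// A toy hash function: sum the character codes of the key, then use the
// modulo operator to map the sum to a valid array index.
// Deterministic: the same key always yields the same index.
function hash(key: string, arraySize: number): number {
  let total = 0;
  for (let i = 0; i < key.length; i++) {
    total += key.charCodeAt(i);
  }
  return total % arraySize;
}

console.log(hash("cat", 16)); // always the same index for "cat" (8 here)
console.log(hash("dog", 16)); // a different index, though collisions are always possible
```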

Drawbacks?

With constant time insertion, deletion, and lookup, it can be tempting to assume that a hash table is the right choice for nearly any data need, but like anything, hash tables have their strengths and their weaknesses. For starters, even with constant-time access, computing a hash takes real work; in some scenarios, particularly with very small data sets, the linear-time performance of an array or linked list can beat a hash table. Furthermore, a bad hash function will severely diminish the efficiency of any hash table: if the function returns some indices disproportionately more than others, it causes extra collisions and slower access times. Some collisions are mathematically inevitable, however, and how those collisions are handled has a considerable impact on a hash table's performance.

Hash table illustration by Jorge Stolfi is licensed under CC BY-SA 3.0

Handling collisions

What is a hash table to do when two or more keys are assigned to the same index? One solution is simply to find the next available index. This approach is called open addressing. When storing data, if the assigned index is occupied, a consistent probing rule is applied to search for an unoccupied index; the rule can be as simple as checking the index immediately adjacent (linear probing). This is a fairly efficient way of handling collisions unless the hash table's load factor (the number of items currently stored divided by the total number of slots in the array) gets too high.
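A minimal sketch of open addressing with linear probing might look like the following. It assumes a fixed-size table that never fills up completely; a real implementation would resize once the load factor got too high:

```typescript
type Entry = { key: string; value: string };

// Open addressing with linear probing: if the hashed slot is occupied by
// a different key, step to the next slot (wrapping around) until a free
// slot, or the same key, is found.
class LinearProbingTable {
  private slots: (Entry | null)[];

  constructor(private size: number) {
    this.slots = new Array(size).fill(null);
  }

  private hash(key: string): number {
    let total = 0;
    for (const ch of key) total += ch.charCodeAt(0);
    return total % this.size;
  }

  set(key: string, value: string): void {
    let index = this.hash(key);
    // Probe adjacent indices; assumes the table is never completely full.
    while (this.slots[index] !== null && this.slots[index]!.key !== key) {
      index = (index + 1) % this.size;
    }
    this.slots[index] = { key, value };
  }

  get(key: string): string | undefined {
    let index = this.hash(key);
    while (this.slots[index] !== null) {
      if (this.slots[index]!.key === key) return this.slots[index]!.value;
      index = (index + 1) % this.size;
    }
    return undefined;
  }
}
```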

A second chance

Another option is to have a second-choice hash function that consistently produces different results from the first. When storing or accessing a value, the hash table checks the index produced by the first hash function, then moves on to the second if that slot is already in use. This reduces the chance of collisions, but adds work to every insertion and lookup.
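Here is a rough sketch of that idea, using two made-up hash functions. A real implementation would still need a fallback, such as probing, for the case where both slots are taken:

```typescript
// First-choice hash: sum of character codes, reduced modulo the table size.
function hashA(key: string, size: number): number {
  let total = 0;
  for (const ch of key) total += ch.charCodeAt(0);
  return total % size;
}

// Second-choice hash: a different mix so the two functions rarely agree.
function hashB(key: string, size: number): number {
  let total = 0;
  for (const ch of key) total = (total * 31 + ch.charCodeAt(0)) % size;
  return total;
}

// Prefer the first-choice slot; fall back to the second if it is taken.
function findSlot(key: string, slots: ({ key: string } | null)[]): number {
  const first = hashA(key, slots.length);
  if (slots[first] === null || slots[first]!.key === key) return first;
  return hashB(key, slots.length);
}
```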

Buckets of fun

Another method of handling collisions is to have each slot in the array point to another collection in memory, called a bucket, so that colliding keys simply share a bucket. Because, in theory, each bucket holds only a small fraction of the overall dataset, accessing data is still fast even though the bucket itself must be searched linearly. The buckets do not need to be arrays; they can be linked lists, binary trees, or nearly any other data structure that best suits the needs of the dataset.
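A minimal sketch of this bucket approach (often called separate chaining), using plain arrays as buckets for simplicity:

```typescript
type Pair = { key: string; value: string };

// Separate chaining: every slot holds a bucket of entries, and colliding
// keys simply share a bucket, which is searched linearly.
class ChainedTable {
  private buckets: Pair[][];

  constructor(size: number) {
    this.buckets = Array.from({ length: size }, (): Pair[] => []);
  }

  private hash(key: string): number {
    let total = 0;
    for (const ch of key) total += ch.charCodeAt(0);
    return total % this.buckets.length;
  }

  set(key: string, value: string): void {
    const bucket = this.buckets[this.hash(key)];
    const existing = bucket.find((pair) => pair.key === key);
    if (existing) existing.value = value;
    else bucket.push({ key, value });
  }

  get(key: string): string | undefined {
    return this.buckets[this.hash(key)].find((pair) => pair.key === key)?.value;
  }
}
```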
