Hash Crash: The Basics of Hash Tables

“If I were stranded on a desert island and could only take one data structure with me, it would be a hash table.”
- Peter Van Der Linden, Author of Deep C Secrets

If you have programming experience but don’t know what a hash table is, chances are you use them regularly and don’t even know it. They are so useful and so practical that many programming languages have a hash table construct built directly in to the syntax.

They are sometimes referred to as associative arrays (PHP), or dictionaries (Python), or hashes (PERL).

If you are using a language that does not have built-in hash tables, then you must create one. The good news is, by creating your own hash table, you can tailor it to meet your specific needs with optimal efficiency.

In this article I try to provide a basic understanding of hash tables. At the end I provide a full implementation of a hash table in it’s simplest form, in plain C.

What is a Hash Table?

A hash table is a data structure that allows you to keep a list of key-value pairs (i.e. records), and it provides quick and efficient access to them.

A core component of any hash table implementation is the hashing function. When applied to a key, the hashing function yields an index of an array where a record can be stored or retrieved.

Hash Tables are Built on Arrays

A quick and dirty solution for keeping a list of key-value pairs is to store them in an array. To access a value for a given key, you would use a standard array loop, checking each element in the array until a matching key is found.

All that looping degrades performance as the size of the array grows. For many, this is unacceptable.

This array-based solution is not really worth exploring in great detail. But since a hash table is built on top of it, let’s define the array so that we can later use it for our hash table implementation.

In C, the definition of this array would look something like this:

#define M 100
typedef struct {
int key;
int value;
} *Record;
Record record_table[M];

The above code defines an array called record_table. It has the capacity to hold 100 elements, each containing a pointer to a record, each containing a key-value pair.

As stated earlier, the standard operations (add, search, and delete) require excess looping, an unacceptable performance issue. A hash table solves this issue by applying a hash function to the keys of the records, which then yields the array index at where the record can be stored or retrieved.

Defining the Hash Function

The hashing function we use depends on the data type of the key. In this article I use an integer data type for our keys since we can typically hash simple integer keys with just a basic math operation. Strings and other types of compound keys require greater attention.

One of the most commonly used methods for hashing integers is to choose a table size that is a prime number (referred to as M), and for each key, compute the remainder when dividing it by M.

This sort of function is called a modular hash function. In C, it looks like this:

#define M 97
int hash(int key) {
return (key % M);
}

Using this hash function, inserting a record is a simple matter of applying the hash function to the key and storing it at the resulting index.

record_table[ hash(my_record->key) ] = my_record;

Searching is equally as easy. The need to perform excess looping is virtually eliminated.

my_record = record_table[ hash(search_key) ];

Hash Table Collisions

One common concern with hash tables is what is referred to as a collision.

If the range of key values is larger than the size of our hash table, which is usually always the case, then we must account for the possibility that two different records with two different keys can hash to the same table index.

There are a few different ways to resolve this issue. In hash table vernacular, this solution implemented is referred to as collision resolution.

The type of collision resolution I’ll use in this article is called linear probing. It is the easiest to implement but requires that we know in advance the approximate number of records that will be put in the hash table. This is because it relies on a table that is at least 50% empty to maintain good performance. The extra unused space is usually an acceptable trade-off.

With linear probing, when we hash to a place in the table that is already occupied, we check the next position in the table until we find one that is available. As long as our table is at least 50% empty, available positions should be plentiful and this process will be quick.

Inserting

Using our previous hash function and table definitions, an insert function that uses linear probing should look something like this:

void insert(Record r) {
int i = hash(r->key);
while (record_table[i] != NULL) {
i = (i + 1) % M;
}
record_table[i] = r;
}

This function takes a Record as it’s argument. It then hashes its key and checks the resulting table index to see if it is occupied. If it is occupied, it probes forward until it finds one that is available. It then stores the record in that location.

Searching

Searching a hash table with linear probing is a similar process. We start at the index produced by the hash function and probe forward until we find the desired key. If we hit an unoccupied slot in the table the search stops and we return a value indicating a search miss.

Record search(int key) {
int i = hash(key);
while (record_table[i] != NULL) {
if (key == record_table[i]->key) {
return record_table[i];
} else {
i = (i + 1) % M;
}
}
return NULL;
}

Deleting

Deleting a record from a hash table involves a combination of search and insert functions. We search for the record and delete it by freeing up the table index that it is stored at. Next we probe forward and re-insert each subsequent records in the table until we reach the next unoccupied slot.

void delete(int key) {
int i = hash(key);
Record r;
   while (record_table[i] != NULL) {
if (key == record_table[i]->key) {
break;
} else {
i = (i + 1) % M;
}
}
   if (record_table[i] == NULL) {
return;
}
   record_table[i] = NULL;
   for (i = (i + 1) % M; record_table[i] != NULL; i = (i + 1) % M) {
r = record_table[i];
record_table[i] = NULL;
insert(r);
}
}

Bringing It All Together

In the below code I’ve pulled together all of the hash table functionality described above into a single module comprised of the files hashtable.h and hashtable.c. You can use these files with your C program to implement basic hash table functionality. You may wish to change the table size parameter to suit your requirements. If you modify the data type of the keys, you will need to redefine the hash function to work with that data type.

As an example, I’ve written a driver program in main.c that demonstrates basic usage of the hash table functionality.

To run this code you will need a C compiler. I use gcc. Download the files and compile it with the following command.

$ gcc main.c hashtable.c -o hashtest

Then run it with the following command:

$ ./hashtest

Thanks for reading and I hope you enjoyed the article. For more:

My Other Articles

My Current Software Projects: