Hash Tables
A hash table is a data structure used to implement an associative array, a structure that can map keys to values. This kind of data structure increases our efficiency to near O(1), or constant time. Hash tables combines the random access nature of arrays with the dynamic ability of a linked list.
Hash tables are very poor at sorting, so we only want to use it if sorting is irrelevant for that data. We will start with a simple illustration, and slowly build upon it step by step.
What is a Hash Table primarily composed of?
A hash table can be broken down into basically two things:
1. A hash function or a mapping function (returns a non-negative integer known as hash code, for simplicity we’ll call it index).
2. An array (the actual table where the data to be searched is stored)
Likewise, to get a value of a key, the illustration would be:
Let’s get started with the basic idea behind Hash Tables
The basic idea for a hash table is that we take a key, value pair; we run that key through the hash function; it processes the key, and yields a hash code (a non-negative integer). Then with that number (hash code) we store the data at that location in the array.
Hash Function
You don’t want to reinvent the wheel by writing your own hash function, there are plenty of efficient algorithms you can find on the internet. Here we will only study a basic hash function to simplify the concept and to fully understand the logic behind hash tables. A simple hash function would be a function which adds all the ASCII values of the input string, converts the generated sum to a range between 0 to size, and yields a number in that range.
A simple c language example would be:
unsigned int hash(const char *key, unsigned int size)
{
unsigned int i;
unsigned int hash = 0; /*loop through all the chars*/
for ( i=0 ; key && key[i] ; i++ )
{
/*add all the ASCIIs*/
hash += key[i];
}
/*convert hash code to fit it inside the range*/
hash = hash % size;
return hash;
}
What makes up a good hash function?
- Uses only the data being hashed
- Uses all of the data being hashed
- Is deterministic
- Uniformly distributes data
- Generates very different hash codes for similar data
Hash Tables Data Structure
Our Hash table structure and linked list structure would look something like this:
/*
* key: Points to the key string
* The key is unique in the HashTable
*
* value : The value corresponding to the key
* Value does not possess uniqueness.
*
* next : Points to the next node of that linked list
*/
struct List
{
char *key;
char *value;
struct List *next;
};
/*
* size : Total number of buckets
*
* buckets: Each bucket contains a linked list.
* Points to individual linked lists.
*/
struct HashTable
{
unsigned int size;
List **buckets;
};
Let’s select previously mentioned size of our Hash Table to be equal to 5. This is how our initialized hash table without any data would look like:
Now let’s take a few examples to store data in our hash table. First let’s assume a key, value pair of “key1” and “value1”.
- key=“key1” and size=5 is passed to the hash function, and let’s assume the function returns a value of 3.
- index=3, key=“key1” and size=5 are passed to the hash table’s save function, which will store the data at that index or in that bucket. So now our new updated hash table will look like:
Note that keys are unique, so the save function for the hash tables should first look in that bucket if the key already exists, if it does, it should replace the old value of that key with the new value; only if the key doesn’t mach any key existing in that bucket, it should create a new node in the beginning of the linked list. Now let’s populate the hash table to have a better picture of it:
If we were to use a simple array with the hash function, the resulting hash codes or indexes would have produced collisions, (different data can produce same hash code). The above method using linked lists, avoids collisions by a method called chaining. Other methods to avoid collisions include re-hashing, linear probing (results in clustering), quadratic probing, etc.