HASH TABLE AND HASHING SCHEMES

Guptaharshita · Jun 27, 2023

Most computer programs and applications need fast data structures. The performance of a data structure is determined by the complexity of its common operations, so any data structure that offers amortized constant-time complexity for several of its main operations is highly noteworthy. One such family of data structures is the hash table.

A hash table is a data structure that stores data in an associative manner (key-value pairs). A hash table lets us store something like a phone book, where we record the association between a unique key and its value, so that an item can be found quickly by looking at the location computed from its key.

WHY HASH TABLE OVER ARRAY

Arrays can be seen as a mapping that associates each data item with an integer index. This works when the keys are small integers or have a small, finite range. In many situations, however, we want to key the data by something other than a small integer, such as a string, a structure, or a large integer value.

Consider the phone book example, where every user has a unique 10-digit phone number. Creating an array with one slot per possible 10-digit number would waste memory, because the data is very sparse and the array would be mostly unused. Therefore, we use a “Hash Table”: hashing converts complicated keys like strings or large numbers into small array indices, and the data is stored in the resulting structure. The idea of hashing is to distribute the data uniformly across the array so that a value can be found in constant time on average, i.e. with O(1) time complexity, which makes lookup (searching for a key) fast.

COMPONENTS OF HASH TABLE

  1. Array, where the data is stored.
  2. Hash Function, a deterministic mathematical function that takes an input (data or object) and maps it to a fixed-size output, typically called a hash value or hash code.
  3. Hashing Scheme, a systematic approach for handling collisions, which occur when the hash function produces the same hash value (and hence the same index) for multiple different inputs.

HASH FUNCTION

If we have an array that can hold N values, then the hash function transforms a key into an index in the range [0, N-1].

Properties of Hash Function :

a) It should be deterministic: equal keys must produce the same hash value. For example, randomInteger(0, 100) will not work as a hash function, because the same key would produce a different value every time.

b) It should distribute the keys uniformly, to reduce collisions.

The most common hash function approach is “Modular Hashing”: for an integer key, compute its remainder when divided by the array size N. We choose the array size to be a prime number. If we choose a composite size, say 10, many keys land at the same index: 20, 30, 40, … all hash to 0, which makes the hash function poor, as it does not distribute the keys uniformly.
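
As a quick illustration, here is a minimal sketch of modular hashing in Python. The table size of 97 and the multiplier 31 for strings are illustrative choices, not prescribed values:

    TABLE_SIZE = 97  # a prime table size spreads keys more uniformly than a composite one

    def hash_int(key):
        # Modular hashing for integers: the remainder modulo the table size.
        return key % TABLE_SIZE

    def hash_str(key):
        # Fold the characters of a string into an integer, then reduce modulo the table size.
        h = 0
        for ch in key:
            h = (h * 31 + ord(ch)) % TABLE_SIZE  # 31 is a conventional small multiplier
        return h

    # Deterministic: the same key always yields the same index in [0, TABLE_SIZE - 1].
    print(hash_int(20), hash_int(30), hash_int(40))  # distinct indices with a prime size
    print(hash_str("alice"), hash_str("bob"))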

HASHING SCHEME

Consider the example above, where multiple keys produced the same hash value, resulting in a collision. Different hashing schemes can be used to deal with collisions.

  1. Static Hashing Scheme: used when we have an upper bound on the number of entries to be stored in the table. Every index can contain at most one value. Types of static hashing schemes are:
    - Linear Probing
    - Cuckoo Hashing
  2. Dynamic Hashing Scheme: works on a dynamic data set, i.e. the size of the hash table grows and shrinks with the input. Types of dynamic hashing are:
    - Chained Hashing
    - Extendible Hashing
    - Linear Hashing

STATIC HASHING SCHEME : LINEAR PROBING

The hash table is a circular buffer (ring structure). A collision is resolved by sequentially searching for the next free slot.

Insertion: when inserting a key whose index is already occupied (same hash code), a collision occurs, and we look at successive slots until a free slot (or a tombstone, discussed further below) is found.

                        rehash(key) = (n + k) % tablesize

Here n is the initial hash code and k = 1, 2, 3, … counts the probe attempts.

Consider the image below, where the data to be inserted is “C” with a hash code of “2”, and a collision occurs. Following linear probing, we locate the next free slot, at index 4, and insert “C” there.

Linear Probing Insertion

Lookup: similar to insertion, search the slots sequentially until the value is found or an empty slot is reached.

Delete: consider the example below, where the value we want to delete is “C”, with a hash code of “2”. After deleting the element “C”, its slot is marked as empty (set to NULL).

Linear Probing Removal

After deleting “C”, a lookup of “D” (also with hash code “2”) will fail, because lookup terminates when an empty slot is found. To solve this, we use a “Tombstone”: a dummy marker that flags the slot as previously deleted instead of clearing it out. It tells lookup not to terminate early but to keep searching ahead, and the slot can be reused by the insert algorithm.

Linear probing gives good cache performance, since the slots are scanned in linear order, but a problem arises when long runs of occupied slots build up around the same hash code: lookup degrades to O(n) time complexity with no added advantage. This issue is called clustering.

The worst-case time complexity is O(n); the best case is O(1), when the value is found on the first probe.
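
To make the three operations concrete, here is a minimal sketch of a linear-probing table in Python with tombstones. The table size, the TOMBSTONE sentinel, and the class name are illustrative; a production version would also resize the table and check for an existing key before reusing a tombstone:

    TOMBSTONE = object()  # marks a slot as "previously deleted", so lookups scan past it

    class LinearProbingTable:
        def __init__(self, size=11):
            self.size = size
            self.slots = [None] * size  # None = never used; (key, value) = occupied

        def _probe(self, key):
            # Yield slot indices starting at the hash code, wrapping around the ring.
            start = hash(key) % self.size
            for i in range(self.size):
                yield (start + i) % self.size

        def insert(self, key, value):
            for idx in self._probe(key):
                slot = self.slots[idx]
                if slot is None or slot is TOMBSTONE or slot[0] == key:
                    self.slots[idx] = (key, value)  # free slot, reusable tombstone, or update
                    return
            raise RuntimeError("table is full")

        def lookup(self, key):
            for idx in self._probe(key):
                slot = self.slots[idx]
                if slot is None:  # a truly empty slot terminates the search
                    return None
                if slot is not TOMBSTONE and slot[0] == key:
                    return slot[1]
            return None

        def delete(self, key):
            for idx in self._probe(key):
                slot = self.slots[idx]
                if slot is None:
                    return
                if slot is not TOMBSTONE and slot[0] == key:
                    self.slots[idx] = TOMBSTONE  # tombstone, not None, keeps later keys reachable
                    return

    t = LinearProbingTable()
    t.insert(2, "C")
    t.insert(13, "D")    # 13 % 11 == 2, so "D" collides with "C" and probes to the next slot
    t.delete(2)
    print(t.lookup(13))  # still found: the tombstone does not terminate the scan early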

To avoid clustering, there are different probing methods:

QUADRATIC PROBING: The search sequence, starting from the initial hash code n, is as follows:

                rehash(key) = (n + k^2) % tablesize

This creates larger gaps in the probe sequence. Quadratic probing avoids the primary clustering of linear probing, although keys with the same initial hash code still follow the same probe sequence, so some clustering remains.

DOUBLE HASHING: With double hashing, another hash function is used to determine the step size.

                    h2(key) ≠ 0, and h2 ≠ h1
                    rehash(key) = (n + k * h2(key)) % tablesize

Double hashing reduces clustering more effectively, but it spends extra time computing two hash functions.
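
The three probe sequences can be compared side by side. In this sketch the table size of 11 and the second hash function h2 are illustrative; as above, n is the initial hash code and k counts the collision attempts:

    TABLE_SIZE = 11  # an illustrative prime table size

    def linear_probe(n, k):
        return (n + k) % TABLE_SIZE            # fixed step of 1: adjacent slots

    def quadratic_probe(n, k):
        return (n + k * k) % TABLE_SIZE        # growing gaps: +1, +4, +9, ...

    def h2(key):
        return 7 - (key % 7)                   # illustrative second hash; never returns 0

    def double_hash_probe(n, k, key):
        return (n + k * h2(key)) % TABLE_SIZE  # key-dependent step size breaks up clusters

    n = 2  # initial hash code
    print([linear_probe(n, k) for k in range(4)])           # [2, 3, 4, 5]
    print([quadratic_probe(n, k) for k in range(4)])        # [2, 3, 6, 0]
    print([double_hash_probe(n, k, 24) for k in range(4)])  # [2, 6, 10, 3]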

STATIC HASHING SCHEME : CUCKOO HASHING

Collisions are handled by evicting the existing value and moving it to the other hash table. The scheme gets its name from the cuckoo, which pushes an egg out of the nest to make room for itself.

It is implemented using two arrays of equal size with two different hash functions, one associated with each array.

A new element is inserted into the first hash table. If a collision occurs, the existing element is kicked out and inserted into the second hash table using the second hash function. If this also causes a collision, that element is likewise kicked out and inserted back into the first hash table. If the number of displacements reaches a threshold, or the displacements form a cycle, rehashing takes place: two new hash functions are chosen and all values are reinserted.

Cuckoo Hashing

As in the image above, there are two hash tables, h1 and h2. The first element “a” has a hash value corresponding to slot “3” and collides with element “b”. It therefore kicks out “b” and makes space for itself. When “b” is inserted into the second hash table, it collides with element “c”. This process continues until every element finds a collision-free slot.

Rehashing is a linear operation, so the worst-case insertion time is O(n), but the average time complexity is O(1). For lookup, a key is stored in either the first or the second array, so at most two probes are needed; hence lookup and removal are O(1). There is no need for tombstones as in linear probing.

Though cuckoo hashing provides O(1) average time complexity, it uses two hash tables, hence double the space of linear probing, and it does not provide good cache performance.
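
Here is a minimal sketch of cuckoo insertion and lookup in Python. The table size, the two hash functions, and the displacement threshold MAX_KICKS are illustrative; a complete implementation would, on failure, pick new hash functions and reinsert everything:

    SIZE = 11
    MAX_KICKS = 8  # displacement threshold before we assume a cycle

    def h1(key):
        return key % SIZE

    def h2(key):
        return (key // SIZE) % SIZE  # an illustrative, independent second hash function

    table1 = [None] * SIZE
    table2 = [None] * SIZE

    def insert(key):
        for _ in range(MAX_KICKS):
            i = h1(key)
            if table1[i] is None:
                table1[i] = key
                return True
            key, table1[i] = table1[i], key  # evict the resident; it must move to table 2
            j = h2(key)
            if table2[j] is None:
                table2[j] = key
                return True
            key, table2[j] = table2[j], key  # evict again; the loser goes back to table 1
        return False  # threshold reached: the caller should rehash with new functions

    def lookup(key):
        # At most two probes: a key can only live at h1(key) in table1 or h2(key) in table2.
        return table1[h1(key)] == key or table2[h2(key)] == key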

DYNAMIC HASHING SCHEME : CHAINED HASHING

Collisions are handled by storing all keys with the same hash code at the same index in a linked list; the node in the array holds the head of the linked list. Unlike the static schemes, a single slot may hold several keys at once, at the cost of extra memory for the chains.

Chained Hashing

The hash table never fills up; we can keep adding entries as required. However, long chains increase the time complexity to O(n), since a full traversal of the linked list may be required.

Here, the procedures for inserting, searching, and deleting are the same as for a linked list. The only difference is that we first compute the hash code of the key; the pointer saved at that index is the head of the linked list.
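
A minimal sketch of chained hashing in Python, using a Python list per slot in place of an explicit linked list (the table size and class name are illustrative):

    class ChainedHashTable:
        def __init__(self, size=11):
            # Each slot holds a chain of (key, value) pairs; a list stands in for the linked list.
            self.buckets = [[] for _ in range(size)]

        def _chain(self, key):
            return self.buckets[hash(key) % len(self.buckets)]

        def insert(self, key, value):
            chain = self._chain(key)
            for i, (k, _) in enumerate(chain):
                if k == key:
                    chain[i] = (key, value)  # key already present: update in place
                    return
            chain.append((key, value))       # otherwise append to the chain

        def lookup(self, key):
            for k, v in self._chain(key):
                if k == key:
                    return v
            return None  # the chain at this index does not contain the key

        def delete(self, key):
            chain = self._chain(key)
            chain[:] = [(k, v) for (k, v) in chain if k != key]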

DYNAMIC HASHING SCHEME : EXTENDIBLE HASHING

It is a variant of the chained hashing approach in which we split buckets instead of letting the linked lists grow forever. Here the hash function is specific: it uses either the first or the last few bits of the binary representation of the key. With extendible hashing, the number of bits used for the mapping changes over time, and the table size changes accordingly. The hash value selects an entry in a directory, which holds a pointer to the address of a bucket (a kind of bag in which items with common properties are saved).

Global Depth (associated with the directory): the number of bits of the hash value used to index the directory.

Local Depth (associated with buckets): used to decide, in relation to the global depth, whether to split a bucket. It is always less than or equal to the global depth.

Let us understand using an example. We assume the keys have 4 bits, and the first 1 bit is used for the directory; therefore the directory has two entries, 0 and 1, each pointing to its own bucket. Both the global depth and the local depths are 1, and the bucket capacity is 2.

Initial Extendible Hashing: global depth = 1, local depth = 1, bucket capacity = 2

Suppose the data to be inserted is “1100”; because its first bit is 1, its hash code is “1”. The bucket for “1” is already full, and since both the global and local depth are 1, the global-depth bits are not enough to distinguish the search values of the overflowing bucket. Thus directory doubling occurs (the first two bits of the hash value are now considered), the bucket is split, and the data is rehashed.

Extendible Hashing Directory Doubling (when global depth = local depth)

Now suppose the data to be inserted is “0001” and “0110” (as in the image below). Both go to the first bucket and overflow it, so the bucket splits and the data is reshuffled according to the hash function. Here the directory is not doubled, because the bucket’s local depth is less than the global depth.

Extendible Hashing Bucket Split (local depth < global depth)

Deletion is similar. If a bucket becomes empty, it is merged with its buddy bucket. Multiple deletions can cause the directory to halve in size, decreasing the global depth.

Its lookup time complexity is O(1), because we only need to search one data block, but a substantial amount of work is done when the directory is doubled, which also causes the reshuffling of data.
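
The sketch below mirrors the example above: 4-bit integer keys indexed by their leading bits, with a bucket capacity of 2. The class and method names are illustrative, and deletion/merging is omitted:

    BUCKET_CAPACITY = 2

    class Bucket:
        def __init__(self, local_depth):
            self.local_depth = local_depth
            self.items = []

    class ExtendibleHashTable:
        def __init__(self):
            self.global_depth = 1
            self.directory = [Bucket(1), Bucket(1)]  # indexed by the first global_depth bits

        def _index(self, key):
            # Hash function: the first global_depth bits of a 4-bit key.
            return key >> (4 - self.global_depth)

        def insert(self, key):
            bucket = self.directory[self._index(key)]
            if len(bucket.items) < BUCKET_CAPACITY:
                bucket.items.append(key)
                return
            # Overflow. If local depth equals global depth, double the directory:
            # each old entry becomes two adjacent entries pointing at the same bucket.
            if bucket.local_depth == self.global_depth:
                self.directory = [b for b in self.directory for _ in (0, 1)]
                self.global_depth += 1
            # Split the overflowing bucket on its next prefix bit.
            bucket.local_depth += 1
            new_bucket = Bucket(bucket.local_depth)
            bit = self.global_depth - bucket.local_depth
            for i, b in enumerate(self.directory):
                if b is bucket and (i >> bit) & 1:
                    self.directory[i] = new_bucket
            # Rehash the displaced keys plus the new one; may split again recursively.
            pending, bucket.items = bucket.items + [key], []
            for k in pending:
                self.insert(k)

        def lookup(self, key):
            return key in self.directory[self._index(key)].items

    t = ExtendibleHashTable()
    for k in (0b1010, 0b1111, 0b1100):  # the third key overflows the "1" bucket
        t.insert(k)
    print(t.global_depth)    # 2: the directory doubled
    print(t.lookup(0b1100))  # True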

DYNAMIC HASHING SCHEME : LINEAR HASHING

Linear hashing is a dynamic scheme in which one bucket is split at a time. Unlike extendible hashing, it does not use a bucket directory structure, and the bucket that overflows is not necessarily the one that splits. Overflows are handled by creating a chain (linked list) of pages under the overflowing bucket. The hash function changes dynamically according to the size of the hash table.

Consider the example: the hash table initially has 4 (m) buckets with a capacity of 2, and a pointer p (starting at 0) that points to the bucket to be split whenever the split policy is violated.

Split Policy: a split is triggered when the load factor (total number of elements / total capacity) surpasses a certain ratio (for example, 75%).

Initial Linear Hashing: m = 4, p = 0, h0(k) = k % 4

The data “19” should be inserted into bucket 3, but the bucket is full, so an overflow bucket is added. After adding the new element, the load factor exceeds 75%, which causes bucket 0 to be split (rehashed) into two buckets: the original bucket 0 and a new bucket m (4), with the new hash function h1(k) = k % 8. The pointer p also increments to 1, i.e. bucket 1 will be the next to split if the split policy is violated again.

Linear Hashing after inserting 19

As in the image above, an overflow bucket is added, forming a chain at bucket 3. The values previously in bucket 0 are rehashed according to the new hash function h1 (4 % 8 = 4, 8 % 8 = 0). A crucial property of h1 is that search values originally mapped by h0 to some bucket j must be remapped either to bucket j or to bucket j + m.

Deletions cause the buckets to shrink. Buckets that have been split can be recombined if the load factor falls below the threshold; the operation is the reverse of splitting.
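
A compact sketch matching the example above: m = 4 initial buckets, capacity 2, and a 75% load-factor split policy. Names and the chain-as-list representation are illustrative, and shrinking is omitted:

    class LinearHashTable:
        def __init__(self, m=4, capacity=2, max_load=0.75):
            self.m = m            # buckets at the start of a round; h_i(k) = k % (m * 2**i)
            self.level = 0        # completed rounds of splitting
            self.p = 0            # the next bucket to split
            self.capacity = capacity
            self.max_load = max_load
            self.buckets = [[] for _ in range(m)]  # overflow pages are just longer lists
            self.count = 0

        def _address(self, key):
            addr = key % (self.m * 2 ** self.level)
            if addr < self.p:  # this bucket was already split this round,
                addr = key % (self.m * 2 ** (self.level + 1))  # so use the next hash function
            return addr

        def insert(self, key):
            self.buckets[self._address(key)].append(key)
            self.count += 1
            if self.count / (len(self.buckets) * self.capacity) > self.max_load:
                self._split()

        def _split(self):
            # Always split bucket p, regardless of which bucket actually overflowed.
            self.buckets.append([])  # the new bucket's index is p + m * 2**level
            keys, self.buckets[self.p] = self.buckets[self.p], []
            next_size = self.m * 2 ** (self.level + 1)
            for k in keys:
                self.buckets[k % next_size].append(k)  # k stays at p or moves to p + m * 2**level
            self.p += 1
            if self.p == self.m * 2 ** self.level:  # round complete: the table has doubled
                self.p = 0
                self.level += 1

        def lookup(self, key):
            return key in self.buckets[self._address(key)]

    t = LinearHashTable()
    for k in (0, 4, 8, 3, 7, 11, 19):  # 19 % 4 == 3: bucket 3 overflows, and bucket 0 splits
        t.insert(k)
    print(t.p, len(t.buckets))  # 1 5 : the pointer advanced and bucket 4 was added
    print(t.lookup(19))         # True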

EXTENDIBLE HASHING V/S LINEAR HASHING

Linear hashing suits applications that require low memory overhead, since extendible hashing needs the global directory structure. Insertions and deletions are efficient with linear hashing, but as the load factor increases and clustering occurs, performance may degrade due to increased collisions and long overflow chains. With extendible hashing, on the other hand, performance remains stable even as the load factor increases.

CONCLUSION

In conclusion, hash tables are powerful and efficient data structures that enable quick access and retrieval of information based on unique keys. They offer constant-time average-case complexity for key operations, such as insertion, deletion, and search, making them ideal for applications that require fast data access. Hash tables provide a balance between memory usage and performance by utilizing a hash function to distribute data across an array of buckets or slots. However, it’s important to carefully design the hash function to minimize collisions and ensure a uniform distribution of data. Additionally, hash tables are versatile and find applications in various domains, including database systems, caching, symbol tables, and more.
