Understanding Hashing

Shashi Bhushan
Nov 10, 2020
Hashing Mechanism

The motivation for using hashing is to reduce the time complexity of searching for a particular element in a collection to O(1) in the best case. Because of the pigeonhole principle, given large enough test data, it will always be the case that two or more elements compute to the same hash. This situation is called a collision, and because of collisions the search time will not always be O(1). Let's see the process of hashing.

Hash Function

Given an object, this function computes its hash. It's also called a prehash in some texts. It basically gives you a representation of an object as a string or a number.

The way a hashing mechanism works is that you take an array as a backing field in your hashing class. When you want to insert an element into your hash-based data structure, you first compute the hash of the element using the hash function. Let's say the hash function is ((i % 6) + 6) % 6 + 1, i.e. it gives a number between 1 and 6 for any positive or negative number. Based on the type of hashing you are using, you'll save your element in the array accordingly.
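Here is a minimal sketch of that hash function in Java (the class and method names are illustrative, not from any library):

```java
public class HashDemo {
    // Maps any int, positive or negative, to a bucket index between 1 and 6.
    // ((i % 6) + 6) % 6 normalizes negative remainders into 0..5; + 1 shifts to 1..6.
    static int bucketIndex(int i) {
        return ((i % 6) + 6) % 6 + 1;
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex(10));  // prints 5
        System.out.println(bucketIndex(-7));  // prints 6
    }
}
```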

Types of Hashing

  • Closed Addressing

In this scheme, you'll use a linked-list-like data structure to save all the elements that compute to the same hash. This means your array is just an array of pointers or references to these linked lists. "Closed addressing" means that the address of your element, once determined by the hash function, won't change.

Link Chaining in Closed Addressing

The array entry where your element is saved is called a bucket. Since multiple entries can share the same bucket number (i.e. the hash function returns the same output for them), a collision is not a problem here: I'll simply save all entries sharing the same bucket as a list.

But another problem I could face is: what if all the entries land in the same bucket, or only a few buckets hold all the entries? In that case, you might use a balanced tree instead of a linked list to save the entries in each bucket. Refer to the HashMap implementation after Java 8 for an example of this.
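Below is a minimal sketch of closed addressing with chaining, assuming integer elements; ChainedHashSet and its methods are made-up names for illustration:

```java
import java.util.LinkedList;

// A minimal hash set using closed addressing (chaining).
// All elements that hash to the same bucket live in that bucket's linked list.
public class ChainedHashSet {
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashSet(int capacity) {
        buckets = new LinkedList[capacity];
        for (int b = 0; b < capacity; b++) {
            buckets[b] = new LinkedList<>();
        }
    }

    // Normalize the hash into a valid bucket index (0 .. capacity - 1),
    // handling negative keys the same way as the hash function above.
    private int bucketOf(int key) {
        return ((key % buckets.length) + buckets.length) % buckets.length;
    }

    public void insert(int key) {
        LinkedList<Integer> bucket = buckets[bucketOf(key)];
        if (!bucket.contains(key)) {
            bucket.add(key);  // a collision just extends the bucket's list
        }
    }

    public boolean contains(int key) {
        return buckets[bucketOf(key)].contains(key);
    }
}
```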

  • Open Addressing

In this scheme, no pointers or references are used. Your backing array is an array of elements, and the hash function gives the address where you need to save the element. Open addressing means that the final address where your element is saved could be different from what you initially got from the hash function. The intuition behind this scheme is that the space you gain by not having to manage references allows you to save more entries in your hash-based data structure. Obviously, a bigger array means less chance of collisions, but there could still be a few collisions in this scheme. Hence, there are primarily three strategies to resolve collisions.

Handling Collisions

The way to handle collisions is to define a sequence of positions that should be searched when a collision happens. So for each position, you'll have one unique list of positions that you'll search one by one, and you'll use the first position in that list that is empty.

This is called probing. When defining a probing sequence, you also need to make sure that you only check positions within the bounds of your bucket array. If there are 7 buckets and I'm searching sequentially for the next position, I'll wrap around to the first bucket after I reach the end (the 7th position).
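The usual way to get this wrap-around is the modulo operator; a small sketch, assuming 0-based bucket indices:

```java
// Position of the i-th probe after the home slot, wrapping back to the
// start of the table instead of running past the last bucket.
static int probe(int home, int i, int capacity) {
    return (home + i) % capacity;  // e.g. capacity 7, home 6, i 1 -> 0
}
```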

There are three ways to define such a list of probes:

  • Linear Probing

If your hash function gives an address that is already filled, simply go to the next available position in the array. You can think of it as jumping i positions forward from the initially calculated position (the value of the hash function), where i goes from 1 to the length of the array. At each position, you'll check whether it is empty or not.
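A sketch of linear-probing insertion, assuming an Integer[] table where null marks an empty slot (the method name is illustrative):

```java
// Linear probing: starting at the home slot, step forward one position
// at a time, wrapping around, until an empty slot is found.
static boolean insertLinear(Integer[] table, int key) {
    int home = ((key % table.length) + table.length) % table.length;
    for (int i = 0; i < table.length; i++) {
        int pos = (home + i) % table.length;
        if (table[pos] == null) {  // first empty slot wins
            table[pos] = key;
            return true;
        }
    }
    return false;  // probed every slot: the table is full
}
```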

In the worst case, you might end up traversing the whole array and finding that no space is available. This problem is called primary clustering: clusters of data form in your data structure, which makes finding a search or insert position in that area difficult. In the worst case, such a cluster spans the whole array.

Also, you need to handle how you delete entries. Suppose the elements at positions 4, 5, and 6 all have hash 4. Since it's linear probing, you'll have searched for the next available slot after 4 when inserting them. Now, if you delete the element at position 5 and then search for the element at position 6, the search will begin at position 4, since that element's hash is 4. You still need to make sure that the search for the element at position 6 succeeds, even though position 5 is empty now.
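One common fix, sketched here, is a tombstone: deleting marks the slot with a sentinel instead of emptying it, so later searches keep probing past it. The DELETED sentinel and the method names are assumptions of this sketch:

```java
// Tombstone for deleted slots. A search must probe past it (the chain of
// probes stays intact), but an insert could reuse it.
static final Integer DELETED = Integer.MIN_VALUE;

static boolean containsLinear(Integer[] table, int key) {
    int home = ((key % table.length) + table.length) % table.length;
    for (int i = 0; i < table.length; i++) {
        int pos = (home + i) % table.length;
        if (table[pos] == null) return false;      // a real gap: stop probing
        if (table[pos].equals(DELETED)) continue;  // tombstone: keep probing
        if (table[pos] == key) return true;
    }
    return false;
}

static void deleteLinear(Integer[] table, int key) {
    int home = ((key % table.length) + table.length) % table.length;
    for (int i = 0; i < table.length; i++) {
        int pos = (home + i) % table.length;
        if (table[pos] == null) return;            // key is not present
        if (!table[pos].equals(DELETED) && table[pos] == key) {
            table[pos] = DELETED;                  // mark, don't empty
            return;
        }
    }
}
```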

  • Quadratic probing

Unlike linear probing, you'll jump i² positions this time. Even if there are clusters of data in some part of the array, you'll be making jumps of 1, 4, 9, 16, and so on forward from the initially calculated position. Hence, clustering won't be as much of an issue, but every once in a while you'll be unlucky enough to land only on filled positions.

This probing sequence is also bounded by the length of the entire array. If you've probed the entire array once and didn't find any open position, the array is full and new elements can't be inserted.

The problem with quadratic probing is that the offsets you search are fixed: if a position is occupied, I'll always search 1, 4, 9, etc. positions forward. As in the given example, if all elements had hash 4, then the probing sequence I need to search with each insertion grows. This is called secondary clustering.
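The probe formula itself is a one-line change from linear probing; a sketch, again assuming 0-based indices:

```java
// Quadratic probing: the i-th probe lands i*i slots past the home slot,
// so offsets grow as 1, 4, 9, 16, ... instead of 1, 2, 3, 4, ...
static int quadraticProbe(int home, int i, int capacity) {
    return (home + i * i) % capacity;
}
```

One caveat worth knowing: for some table sizes the quadratic sequence revisits slots before covering the whole array, which is one reason implementations often pick a prime capacity.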

  • Double Hashing

As noted above, quadratic probing is prone to the secondary clustering problem. The solution is to use a hash function to find the offset as well. Even if the primary hash for every element gives 4, the secondary hash that I compute will be different for different elements, hence a different probing sequence for each element. This works well in most cases.

I need to make sure, though, that the second hash never gives 0. If there's a collision at position 4 and I keep checking position 4 + 0, I'm never going to get closer to resolving the collision, am I?
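A sketch of double hashing, where hash2 follows a common textbook form, 1 + (key mod (capacity - 1)); that choice is an assumption of this sketch, made so the step can never be 0:

```java
// Second hash: returns a step size in 1 .. capacity - 1, never 0, so
// every probe actually moves to a new slot (assumes capacity >= 2).
static int hash2(int key, int capacity) {
    int m = capacity - 1;
    return 1 + ((key % m) + m) % m;
}

// Double hashing: the offset between probes comes from the second hash,
// so two keys with the same home slot still get different probe sequences.
static int doubleHashProbe(int home, int i, int key, int capacity) {
    return (home + i * hash2(key, capacity)) % capacity;
}
```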

Well, this was a primer on hashing. Hope this was helpful. Thanks for reading.
