Know About Hashing!

Swapnil Kant
Published in Javarevisited
11 min read · Dec 23, 2020

Hello Readers,

I am back with my new article. Today I will be discussing an interesting topic in data structures: Hashing. Hashing is a very important concept in data structures; it helps you map meaningful data to keys for quick access, or, you may say, for a better arrangement of data in a table or record.

Table 1.1: Showing a simple Database

Now, let us suppose that we have to keep a record of the names of all the candidates (shown above in Table 1.1) who have taken admission in a particular college, and store their data in an organized manner in a table for future use. For easy identification, each candidate needs a unique identifier that is different for every student. The application id (key) is unique for every candidate and serves as the best medium to fetch the name of a particular student from the table, even if two candidates share the same name. This makes it easy for a user to search, insert, and delete a particular candidate's entry in the table. So from the above example, we see that each student's name is mapped into the table using a key, which may or may not be distinct for everyone (it depends upon the hash function). Now, whenever the user wants information about a particular student, he/she may use the key to get it without searching through the whole table one entry at a time.

Now, for maintaining the desired record of students as mentioned above, we may proceed by any of the following methods:

  1. By creating a linked list to store the names of the candidates at its nodes, which requires a worst-case time complexity of O(n) for searching the list and O(n) for the insertion and deletion operations.
  2. By creating an array to store the names of the candidates in its blocks, which requires a worst-case searching time of O(log n) when the data in the array is kept sorted; insertion and deletion operations become costly because elements have to be shifted while inserting or deleting.
  3. By creating a table that stores the names of the candidates mapped directly by their key value, so that the search, insertion, and deletion operations are done in a worst-case time complexity of O(1), where the application id serves as the unique key for each candidate's name. The only problem with this type of structure is the space required to store a large number of records. For example, if the application id is n digits long, we need O(m * 10^n) space for the table, where 'm' is the size of a pointer to a record. Another problem is that an integer variable in a programming language may not be able to store n digits (a small sketch of this direct-address approach follows below).
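
As a rough illustration of the third approach, here is a minimal sketch (in Java, with hypothetical names and a made-up 5-digit key space) of a direct-address table indexed by the full application id. It makes the space problem obvious: the array must be as large as the whole key space, even if only a handful of candidates exist.

```java
// A minimal direct-address table sketch (hypothetical example).
// Every possible 5-digit application id gets its own slot, so the
// array has 100000 entries even if only a few of them are used.
public class DirectAddressTable {
    private static final int KEY_SPACE = 100_000; // all 5-digit ids
    private final String[] names = new String[KEY_SPACE];

    public void insert(int applicationId, String name) {
        names[applicationId] = name;          // O(1)
    }

    public String search(int applicationId) {
        return names[applicationId];          // O(1), may be null
    }

    public void delete(int applicationId) {
        names[applicationId] = null;          // O(1)
    }

    public static void main(String[] args) {
        DirectAddressTable table = new DirectAddressTable();
        table.insert(22234, "Asha");   // hypothetical candidates
        table.insert(17240, "Rahul");
        System.out.println(table.search(22234)); // Asha
    }
}
```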

What Is Hashing?

Hashing is defined as the technique of mapping a large data value to a small, unique value (key) in a hash table for optimal and quick storing, deleting, and searching operations. For example, in the table above, we can search for a candidate's name in O(1) time by directly using their key value.
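
As a quick illustration of that idea, the sketch below uses Java's built-in HashMap (the ids and names are made up for the example) to map application ids to candidate names and fetch one in expected O(1) time.

```java
import java.util.HashMap;
import java.util.Map;

public class CandidateLookup {
    public static void main(String[] args) {
        // Map application id (key) -> candidate name (value).
        Map<Integer, String> candidates = new HashMap<>();
        candidates.put(22234, "Asha");   // hypothetical entries
        candidates.put(17240, "Rahul");
        candidates.put(30051, "Meera");

        // Lookup by key is expected O(1); no full-table scan needed.
        System.out.println(candidates.get(17240)); // Rahul
    }
}
```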

Terminologies Related To Hashing

The various terminologies related to hashing are:-

  1. Hash Table: A hash table is an array of buckets or slots that stores data values and maps keys to them for optimal and quick storing, deleting, and searching operations. The hash table basically uses a hash function to turn key values into indexes of the table.
  2. Hash Function: A hash function is a mathematical function that ideally yields a distinct value for every individual key (which is difficult to achieve in practice) within a limited range, so as to limit the size of the hash table. The limited range stays under the total size of the hash table, so the outputs are exactly the index values of the hash table. The advantages of using a hash function are quick access to data values and an even distribution of keys across the hash table. A hash function that maps every key to a different slot index is known as a perfect hash function.
  3. Key: In hashing, a key is defined as the mapping index under which a data value is stored; it is used to get the data value stored at the particular index of the hash table produced by the hash function (a short sketch of these terms follows below).
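
To make those terms concrete, here is a small sketch: a toy hash function (a plain modulo, chosen only for illustration) and the slot indexes it produces for a few hypothetical application ids.

```java
public class HashBasics {
    static final int TABLE_SIZE = 10; // number of buckets/slots in the hash table

    // A toy hash function: maps any integer key to an index in [0, TABLE_SIZE).
    static int hash(int key) {
        return key % TABLE_SIZE;
    }

    public static void main(String[] args) {
        int[] keys = {22234, 17240, 30051}; // hypothetical application ids
        for (int key : keys) {
            System.out.println("key " + key + " -> slot " + hash(key));
        }
    }
}
```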

Building Hash Functions

How the hash function is formed is up to the programmer; a hash function should be designed in such a way as to reduce collisions and distribute the keys evenly across the hash table for a given set of keys, even when the table is operating at a high load factor.

Following are some common methods of obtaining a hash function:

  1. Folding: Folding is a method of forming a hash function in which the key value is divided evenly into parts and a basic arithmetic operation such as addition, subtraction, multiplication, or division is performed on them to obtain a number; the last two digits of the number so obtained are then taken and used to hash the given data value.
  2. Truncation: Truncation is a method of forming a hash function in which certain digit positions of the key value are selected according to the algorithm provided, and the digits at those positions are used to hash the given data value.

  3. Modular Arithmetic: Modular arithmetic is a method of forming a hash function in which the index is obtained using the modulo operation together with the size of the hash table, so that H(X) = k mod l, where k is the integer key value and l is the size of the hash table (a short sketch of these three methods follows below).
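
Below is a minimal sketch of the three construction methods for a 6-digit key and a table of 100 slots; the digit groupings, the example key, and the table size are assumptions made only for the example.

```java
public class HashFunctionMethods {
    static final int TABLE_SIZE = 100; // assumed: table of 100 slots (indexes 00-99)

    // Folding: split the key into groups of digits, add the parts,
    // and keep the last two digits of the sum as the index.
    static int foldingHash(int key) {        // e.g. 123456 -> 12 + 34 + 56 = 102 -> 02
        int sum = 0;
        while (key > 0) {
            sum += key % 100;   // take two digits at a time
            key /= 100;
        }
        return sum % TABLE_SIZE;
    }

    // Truncation: keep only certain digit positions of the key;
    // here (arbitrarily) the two leading digits of a 6-digit key.
    static int truncationHash(int key) {     // e.g. 123456 -> 12
        return key / 10_000;
    }

    // Modular arithmetic: H(X) = k mod l, with l = table size.
    static int modularHash(int key) {        // e.g. 123456 mod 100 = 56
        return key % TABLE_SIZE;
    }

    public static void main(String[] args) {
        int key = 123456; // hypothetical key
        System.out.println("folding    -> " + foldingHash(key));
        System.out.println("truncation -> " + truncationHash(key));
        System.out.println("modular    -> " + modularHash(key));
    }
}
```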

Collision In Hashing

Fig 2.2: Showing a case of collision during hashing

Now, suppose the given information (shown above in Fig 2.2) has to be stored in a limited-memory table with less space. For this, we hash all the application ids of the candidates to smaller values so that the data can be stored in less memory. Suppose the hash function used for this operation is defined as H(X) = application no. of the candidate % 2. From the above table we can see that the application numbers 22234 and 17240 yield the same index value from the hash function. This case, where a hash function yields the same index value for two different data values, is known as a collision. It is the most common problem every programmer faces while designing a hash function. More generally, let X1, X2, ..., Xn be the keys of the data values in the table; if, according to the hash function, we have H(X1) = H(X2) = … = H(Xn), then X1, X2, ..., Xn are known as synonyms. A collision may also be defined as a conflict in the position of two data values with different keys in the hash table. The above diagram illustrates the collision clearly, where two different data values are to be stored at the same index.
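
As a tiny demonstration of that collision, the snippet below applies the same H(X) = X % 2 function from the example to the two application numbers and shows that they land on the same index.

```java
public class CollisionDemo {
    // The same toy hash function used in the example above: H(X) = X % 2.
    static int hash(int applicationNo) {
        return applicationNo % 2;
    }

    public static void main(String[] args) {
        int a = 22234;
        int b = 17240;
        System.out.println("H(" + a + ") = " + hash(a)); // 0
        System.out.println("H(" + b + ") = " + hash(b)); // 0
        // Both keys hash to index 0 -> a collision: 22234 and 17240 are synonyms.
    }
}
```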

Handling Of Collisions In Hashing

Collisions in hashing can be handled through the following methods:

Fig 3.3: Showing an example of Linear Open Addressing
  1. Linear Open Addressing: Now, let us take an example where we have a set of integers, say S = {10, 20, 21, 15, 17, 4, 30, 40, 31}, and we want to store these numbers in a hash table with the hash function defined as H(X) = X mod 10. From the above table we can see the places occupied by the integers in the hash table under this hash function. Taking a closer look, we can see that 40, even though it is completely divisible by 10, is kept in slot two of bucket one; this is because the previous three integers are also completely divisible by 10, so no slot is left at its home position (a situation often referred to as an overflow condition) after those integers have been stored. In that case we keep traversing the table until we find a free slot close to the home index given by the hash function. Similarly, if there were more elements in the set, each element would be mapped according to the hash function and stored in the next free slot whenever its own bucket is full, while all other integers are stored exactly where the hash function places them when there is no collision. Hence, linear open addressing is one of the simplest and most common ways to resolve collisions in hashing (a code sketch follows after this list). The following operations are performed in a linear open-addressed hash table:
    i. Searching: If there are 'm' slots in the hash table and 'n' keys have been inserted into it, then the load factor is α = n / m < 1, and the expected number of probes for a search is at most 1 / (1 - α).
    ii. Inserting: The insertion operation in a hash table works by checking the nearest empty slot, moving on to the next slot whenever the bucket at which the element should be placed is already occupied. With 'm' slots and 'n' keys, the load factor is again α = n / m < 1, and the expected number of probes for an insertion is at most 1 / (1 - α).
    iii. Deletion: The deletion process in an open-addressed hash table is tricky, because simply emptying a slot breaks the probe sequence: when we later search for a data value and reach an empty slot, the search stops, and the data values stored after that empty slot are never reached, so we do not get the correct result. (In practice, deleted slots are therefore usually marked with a special "deleted" flag rather than emptied.) The expected probe count is again governed by the load factor α = n / m < 1 and is roughly 1 / (1 - α).
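
Here is a minimal sketch of a linear-probing (open-addressed) hash table in Java for the integer set above; the fixed table size of 10, the tombstone marker for deletions, and the flat slot layout (rather than the bucket-and-slot layout of the figure) are assumptions made only for the sketch.

```java
import java.util.Arrays;

public class LinearProbingTable {
    private static final int SIZE = 10;                       // assumed table size
    private static final Integer DELETED = Integer.MIN_VALUE; // tombstone marker
    private final Integer[] slots = new Integer[SIZE];

    private int hash(int key) {
        return key % SIZE;                           // H(X) = X mod 10
    }

    // Insert by probing forward from the home slot until a free slot is found.
    public boolean insert(int key) {
        int home = hash(key);
        for (int i = 0; i < SIZE; i++) {
            int probe = (home + i) % SIZE;
            if (slots[probe] == null || slots[probe].equals(DELETED)) {
                slots[probe] = key;
                return true;
            }
        }
        return false;                                // table is full (overflow)
    }

    // Search by probing forward; stop at a never-used (null) slot.
    public boolean search(int key) {
        int home = hash(key);
        for (int i = 0; i < SIZE; i++) {
            int probe = (home + i) % SIZE;
            if (slots[probe] == null) return false;  // never used -> not present
            if (slots[probe].equals(key)) return true;
        }
        return false;
    }

    // Delete by replacing the key with a tombstone so later searches still work.
    public void delete(int key) {
        int home = hash(key);
        for (int i = 0; i < SIZE; i++) {
            int probe = (home + i) % SIZE;
            if (slots[probe] == null) return;
            if (slots[probe].equals(key)) {
                slots[probe] = DELETED;
                return;
            }
        }
    }

    public static void main(String[] args) {
        LinearProbingTable table = new LinearProbingTable();
        for (int key : new int[]{10, 20, 21, 15, 17, 4, 30, 40, 31}) {
            table.insert(key);
        }
        System.out.println(Arrays.toString(table.slots));
        System.out.println("search 40: " + table.search(40));
    }
}
```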

Different Types Of Collision Resolution Techniques With Open Addressing

There are different methods to resolve collisions with open addressing, namely:

  1. Rehashing or Double Hashing: Rehashing, as the name suggests, means hashing the hashed value again and again until an empty slot is found in the hash table. Linear addressing is inefficient in the sense that the stored data values tend to get clustered very closely together in the table, which reduces the efficiency of the insertion, deletion, and searching operations. To address this issue we have the rehashing method, where the previously hashed value is rehashed and the resulting index is checked in the table; if that index is occupied by another data value, the new hashed value is hashed again, and so on, until an empty slot is found in the table. In double hashing specifically, a second hash function determines the step size of the probe sequence.
  2. Quadratic Probing: This is a method where probing is done on the basis of a quadratic function of the probe number, so the step grows quadratically with each attempt. The probe sequence used is H(X, i) = (H(X) + i^2) % m, where 'm' is the number of slots in the hash table and i = 0, 1, 2, … is the probe number. During the insertion of a data value, if the slot at which the data is to be stored is occupied, then the next slot given by the probe sequence is checked, and so on. Quadratic probing does not necessarily remove clustering from the hash table despite the repeated probing, although it avoids the primary clustering of linear probing.
  3. Random Probing: Random probing, as the name suggests, is a method of probing where a pseudo-random number generator is used to generate the sequence of offsets followed to find a vacant space in the hash table to store the data value. This generator, however, has to produce the same sequence of integers for a given key every time, so that searches can retrace the path taken during insertion (a sketch of these probe sequences follows below).
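
To make the probe sequences concrete, here is a small sketch that prints the first few slots examined for one key under quadratic probing and under double hashing; the table size of 11 and the secondary hash function are assumptions chosen only for the example.

```java
public class ProbeSequences {
    static final int M = 11;                    // assumed (prime) table size

    static int h1(int key) {
        return key % M;                         // primary hash function
    }

    static int h2(int key) {
        return 1 + (key % (M - 1));             // assumed secondary hash, never 0
    }

    public static void main(String[] args) {
        int key = 40;                           // hypothetical key

        // Quadratic probing: slot on the i-th attempt is (h1(key) + i^2) mod M.
        System.out.print("quadratic: ");
        for (int i = 0; i < 5; i++) {
            System.out.print((h1(key) + i * i) % M + " ");
        }
        System.out.println();

        // Double hashing: slot on the i-th attempt is (h1(key) + i * h2(key)) mod M.
        System.out.print("double:    ");
        for (int i = 0; i < 5; i++) {
            System.out.print((h1(key) + i * h2(key)) % M + " ");
        }
        System.out.println();
    }
}
```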

Chaining: Since the storage of data values in the hash tables described so far is non-dynamic and requires the size to be fixed in advance, it can be a very inefficient use of space: overflow can occur while many slots sit unused, so there is a lot of memory waste along with inefficient searching, insertion, and deletion operations. To maintain efficient storage space for data values, a method called Chaining or Open Hashing is used, where the data values that hash to the same bucket are stored in a singly linked list. The lists are dynamic in nature, which reduces the space lost to unoccupied slots in the hash table, although it requires extra space for the links between nodes. Below is the diagrammatic representation of hashing using the chaining technique for the same example explained above.

Fig 4.4: Showing a case of chaining to resolve collision during hashing

The following operations are performed in a chained hash table:

  1. Searching: The searching operation performed on the hash table does not depend upon the size of the hash table and takes a best-case time complexity of O(1), because the data values are not all stored in one single bucket but are distributed across the table. The worst-case time complexity for searching is O(n), where 'n' is the length of the chain; this is the case when all the data values in the table end up stored under one bucket and the data value to be searched for is at the last node of the chain. On average, the time complexity of the searching operation is better than with the linear open addressing method.
  2. Insertion: The insertion operation in the hash table takes the same worst-case time complexity of O(n), where 'n' is the length of the chain, which occurs when the data value has to be stored at the end of a single bucket's chain; the best-case time complexity of O(1) occurs when the data element can be linked in right after the bucket.
  3. Deletion: The deletion operation in the hash table takes the same worst-case time complexity of O(n), where 'n' is the length of the chain, which occurs when the data value to be deleted is at the end of a single bucket's chain; the best-case time complexity of O(1) occurs when the data element to be deleted is right after the bucket, with the links of the chain being maintained on removal (a chaining sketch follows below).
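
As a final sketch, here is a minimal chained hash table in Java for the same integer set; it keeps one singly linked list per bucket, and the table size of 10 is an assumption matching H(X) = X mod 10.

```java
import java.util.LinkedList;

public class ChainedHashTable {
    private static final int SIZE = 10;                  // assumed number of buckets
    @SuppressWarnings("unchecked")
    private final LinkedList<Integer>[] buckets = new LinkedList[SIZE];

    public ChainedHashTable() {
        for (int i = 0; i < SIZE; i++) {
            buckets[i] = new LinkedList<>();             // one chain per bucket
        }
    }

    private int hash(int key) {
        return key % SIZE;                               // H(X) = X mod 10
    }

    public void insert(int key) {
        buckets[hash(key)].add(key);                     // O(1): append to the chain
    }

    public boolean search(int key) {
        return buckets[hash(key)].contains(key);         // O(chain length)
    }

    public void delete(int key) {
        buckets[hash(key)].remove(Integer.valueOf(key)); // O(chain length)
    }

    public static void main(String[] args) {
        ChainedHashTable table = new ChainedHashTable();
        for (int key : new int[]{10, 20, 21, 15, 17, 4, 30, 40, 31}) {
            table.insert(key);
        }
        // 10, 20, 30, 40 all collide at bucket 0 and end up in the same chain.
        System.out.println("bucket 0 chain: " + table.buckets[0]);
        System.out.println("search 31: " + table.search(31));
    }
}
```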

So, this was all about hashing and its concepts; particularly for interviews and online tests, this article should be sufficient to get you going!

Keep learning and keep growing and also keep exploring more!

All the very best!

For more interesting and informative articles and tips, follow me on Medium and LinkedIn.

Other Data Structure Articles you may like



Hi, I am Swapnil Kant, an avid programmer, and a full-time learner! One who is highly interested in Algorithm Optimization and Development