Hashing-Part 2

Sunilkathuria · Published in In Computing World · 6 min read · May 12, 2024

By the end of this article, you will have a fair understanding of how to choose a hash function appropriate to your needs.
In the first part of this article, we wrote a simple hash function and saw how it fares against the characteristics of a good hash function.

To showcase this, we will implement a hash table containing 30K common English words. The following hash function is used to generate the hash values.
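The original snippet is not reproduced here. As a stand-in, a minimal sketch in the same spirit (the actual `hash_v1` from part 1 may differ) could simply sum the character codes of the input:

```python
def hash_v1(text: str) -> int:
    """Toy hash (assumed implementation): sum of the character codes.

    Fast and simple, but weak: anagrams such as "listen" and "silent"
    collide, and small input changes barely change the output.
    """
    return sum(ord(ch) for ch in text)

print(hash_v1("listen"), hash_v1("silent"))  # anagrams collide
```

A function like this is enough to drive a hash table, but, as the article discusses below, its weaknesses show up in the characteristics table.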

The table below shows how the hash function fares against the common characteristics of a hash function.

Testing the efficacy of the hash function

The plan is to use the hash value generated by this function to compute the index of the bucket each word belongs to.

At first glance, the first three characteristics ("one-way," "avalanche effect," and "collision resistance") raise suspicion. Though the results are not good, we can still use this hash function to manage the data in a hash table. The following points will help you make a better decision.

How well is the data distributed among the buckets?

Since we have the data available, the first test we should do is to check whether the data is distributed evenly across the given buckets.

To calculate the index of a bucket, we generate the hash value by passing the word to the hash function and take it modulo the number of buckets, five in this case. This gives an index ranging from 0 to 4.

The following code snippet shows the distribution of 30K English words from a text file in a hash table with five buckets. It only counts the number of words each bucket will contain.
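The snippet itself appears as an image in the original. A sketch of the counting logic, assuming the toy `hash_v1` shown earlier and a word list (the article loads 30K words from a text file), might look like:

```python
from collections import Counter

def hash_v1(text: str) -> int:
    # Toy hash (assumed): sum of the character codes.
    return sum(ord(ch) for ch in text)

def bucket_distribution(words, bucket_count=5):
    """Count how many words land in each bucket: index = hash % bucket_count."""
    counts = Counter(hash_v1(w) % bucket_count for w in words)
    return [counts[i] for i in range(bucket_count)]

# In the article the words come from a 30K-word text file, e.g.:
# words = open("words.txt").read().split()
words = ["apple", "banana", "cherry", "date", "elderberry", "fig", "grape"]
print(bucket_distribution(words))
```

The counts per bucket are all we need here; the table never actually stores the words for this test.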

Distribution of 30K words in 5 buckets using the hash function hash_v1

The above distribution shows that bucket #2 has the most elements and bucket #4 the fewest.
Whether this skewness is acceptable is subjective. We cannot decide whether this distribution is suitable just by looking at the numbers above. Another aspect to consider is how evenly the data in this hash table is accessed and searched.

What if we do not have the actual data? How do we test the distribution?
In this case, if we know the format of the data, we can generate random data in that format and test the distribution with it.

What if the distribution is not even?
In this case, we can experiment with different numbers of buckets and try to find one that gives the most even distribution.

One more test

We run another test of the effectiveness of the data distribution, this time using five million randomly generated strings.
This serves two purposes. First, it boosts confidence that the hash function is adequate. Second, you can adapt this code to generate strings in different formats and test how well they distribute.

The output of the function "data_distribution_with_random_data()"
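The function itself is shown as an image in the original. A sketch of what `data_distribution_with_random_data()` might look like, assuming the toy `hash_v1` and a smaller default count than the article's five million for brevity:

```python
import random
import string

def hash_v1(text: str) -> int:
    # Toy hash (assumed): sum of the character codes.
    return sum(ord(ch) for ch in text)

def data_distribution_with_random_data(n=100_000, bucket_count=5, length=8):
    """Hash n random lowercase strings and report the per-bucket counts."""
    counts = [0] * bucket_count
    for _ in range(n):
        s = "".join(random.choices(string.ascii_lowercase, k=length))
        counts[hash_v1(s) % bucket_count] += 1
    return counts

print(data_distribution_with_random_data())
```

Changing the alphabet or the string length lets you mimic whatever input format your real data will have.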

Keeping an eye

Imbalance — Even though we test a hash function thoroughly before using it, we still need to gauge its performance as new data is constantly added. New data should not create a hot spot in the hash table. A hot spot arises from uneven data distribution among the buckets, creating an imbalance.

Frequently accessed locations — We also need to monitor for performance degradation, which happens when a large share of searches and modifications hit just a few buckets.

Corrective measures

Resize buckets — One way to address these problems is to resize the table by increasing or decreasing the number of buckets.
The most significant disadvantage of this approach is that every index must be recalculated, which can be time-consuming.
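Resizing means allocating a new bucket array and rehashing every stored element into it. A sketch, assuming the toy `hash_v1`:

```python
def hash_v1(text: str) -> int:
    # Toy hash (assumed): sum of the character codes.
    return sum(ord(ch) for ch in text)

def resize(buckets, new_count):
    """Rehash every stored word into a table with new_count buckets.

    The full pass over all stored data is the costly step mentioned above.
    """
    new_buckets = [[] for _ in range(new_count)]
    for bucket in buckets:
        for word in bucket:
            new_buckets[hash_v1(word) % new_count].append(word)
    return new_buckets
```

Every element is touched exactly once, so the cost grows linearly with the amount of data in the table.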

An alternative approach — We can also use consistent hashing to manage data reallocation to different indexes. However, this article does not discuss consistent hashing.

Different hash function — Another option is to switch to a hash function that causes fewer collisions.

Storing data using key-value pair

Now, we change the scenario a little. Instead of storing the data in a hash table, we wish to store the 30K words as "key: value" pairs, where the hash value of each word is the key used to store and locate it.

In this case, we need a hash function that is very good at collision resistance. Our hash function, hash_v1, will fail miserably here. The option is to use another hash function with a lower collision probability when storing the data.

Note: with any hash function, there is always a possibility of collision.

One option is to use commercially or freely available hash functions to generate hash values. Some of these are MD5 (Message-Digest Algorithm), SHA-2 (Secure Hash Algorithm 2), RIPEMD (RIPE Message Digest), and MurmurHash (a non-cryptographic hash function).

Python provides several of these hash functions via the hashlib module. The following code snippet enumerates the hash functions Python supports.
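The snippet is an image in the original; using hashlib's standard attributes, it presumably looks something like:

```python
import hashlib

# Algorithms guaranteed to exist on every Python build, and the larger
# set exposed by the local OpenSSL installation.
print(sorted(hashlib.algorithms_guaranteed))
print(sorted(hashlib.algorithms_available))
```

`algorithms_guaranteed` is the portable set; `algorithms_available` may include extras that differ between machines.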

The following code snippet shows the usage of the sha256 function.
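Again the snippet is an image in the original; a minimal usage example with hashlib's sha256:

```python
import hashlib

# Hash a byte string and print the hex-encoded digest.
digest = hashlib.sha256(b"hello").hexdigest()
print(digest)  # 64 hex characters = 256 bits
```

Note that hashlib functions take bytes, not str, so text must be encoded (e.g. `"hello".encode()`) before hashing.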

A closer look at collisions

For a hash function, a collision occurs when two different input values produce an identical hash value, or digest.
A digest is a fixed-size sequence of bytes. For example, SHA-256 generates a 256-bit hash value, which means it can produce 2²⁵⁶ unique hash values.
Because this output space is limited while the input space is unlimited (any number of input strings), a collision is always possible. The more input values are hashed, the higher the collision probability.
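To see why a limited output range guarantees collisions, you can shrink the output space artificially. This sketch truncates SHA-256 to a single byte (only 256 possible digests), so a collision appears almost immediately:

```python
import hashlib

def tiny_hash(text: str) -> int:
    """SHA-256 truncated to 8 bits: only 256 possible digests."""
    return hashlib.sha256(text.encode()).digest()[0]

seen = {}
for i in range(1000):
    s = f"input-{i}"
    h = tiny_hash(s)
    if h in seen:
        print(f"collision: {seen[h]!r} and {s!r} both hash to {h}")
        break
    seen[h] = s
```

With 256 possible outputs, the pigeonhole principle forces a collision within 257 distinct inputs, and in practice one usually shows up much sooner. The same logic applies to a full 256-bit digest; the output space is just astronomically larger.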

Collisions had little impact on our case study of storing common English words in a hash table. Yet they can still affect other situations, requiring a different implementation strategy (changing the bucket count, using a different hash function, identifying a different input value, etc.).

When data is stored as key: value pairs, collisions force the use of collision resolution techniques, which can degrade performance.

In the case of cryptographic operations, it can have a huge impact on security.

Summary

The analysis above provides a good guideline for evaluating available hash functions and choosing a suitable one.

  • Based on the usage, select and test an appropriate hash function.
  • Test your implementation before using it in a production environment.
  • Keep an eye on hot spots and imbalance. Avoid frequent bucket resizing; if required, choose a new hash function.

Finally, you can find the hash table implementation in the source code shared on GitHub.

References

Source code in the article — https://github.com/incomputingworld/hahsing

Did you enjoy this article? Please clap it and share it with your friends.
If you have found this article helpful, please subscribe to read more articles.
Feel free to comment!!
