What is Hash Index?

Humberto Villalta
7 min readApr 12, 2023

--

1. Overview

Hash indexes are widely used data structures. It is possible to see hash indexes in a lot of databases thanks to their outstanding time complexity of O(1). Instead of searching for a key along a table, the hash index searches the data through its hash function. Therefore, this approach ensures fast search and insertion into the index. Hashing is taking a piece of information (a string) and making it into a pointer for quick access. As an example, when Ctrl+F-ing, we look for the word in a text instead of reading dozens of pages. Thus, the main benefit of using hash indexes it because of their fast performance.

A hash index is most commonly used in data management where the corresponding column possesses a high cardinality. This is because the hash function is going to make a random deterministic value for each record. So it will have better performance if the values in the corresponding column have fewer repeated values.

2. Hash Indexing Structure

In simple terms, hash indexing is comprised of:

  1. Keys
  2. Hash function
  3. Hash table

2.1 Keys

It is the key itself before passing through the “encryption” or hashing process. The fingerprint obtained from the hash function is then going to be saved and managed by a hash table as shown in the picture above.

2.2 Hash Functions

Hash functions are mathematical functions that are used to convert the key of each piece of data into an index in the hash table. The hash functions are in charge of giving a unique pointer to the keys that are being inputted. This function hashes the key and then assigns the pointer (or address in the disk) to a bucket. Meaning that when searching that key, this hash function is going to hash the key again and immediately search for the key without any table scanning “O(1)”.

2.3 Hash Table

Hash tables are used to manage the values but also the index or “pointer” generated from the hash functions. The hash table is used to efficiently store and retrieve the data. However, hash functions can produce the same pointer for two different keys. This is called “hash collision”. Because of this issue, hash indexes can end up obtaining a time complexity of O(n) instead of the “ideal” O(1). The O(n) reason is that the key that is having the same pointer as the other key needs to be reassigned.

Also, hash indexes do not support various operators. In the case of PostgreSQL, it only supports the equality operator. This is why hash indexes can be used in conjunction with other index types, such as B-tree or GiST in PostgreSQL instead of just by themselves.

3. Hash Collisions

Hash collisions occur when multiple keys are hashed to the same slot in the hash table, and this can lead to decreased performance in hash indexes. The primary reason hash collisions occur is due to a poor hashing algorithm or when there are more items to hash than there are slots available.

However, modern hashing algorithms, including most used in practice, attempt to distribute resulting hashes as evenly as possible over the entire output space, thus minimizing collisions. This is achieved by using mathematical functions that convert the key of each piece of data into an index in the hash table. The hash function is in charge of giving a unique pointer to the keys that are being inputted. It hashes the key and then assigns the pointer (or address in the disk) to a bucket.

In addition, hash indexes are most commonly used in data management where the corresponding column possesses a high cardinality. This is because the hash function is going to make a random deterministic value for each record. So it will have better performance if the values in the corresponding column have fewer repeated values.

Overall, a good hashing algorithm is essential for efficient hash indexing, as it ensures that hash collisions are minimized and that searching and insertion into the index are fast. There are two representative methods used for handling collisions:

3.1 Open Addressing

In open addressing, all elements are stored in the hash table itself. So at any point, the size of the table must be greater than or equal to the number of keys. This approach is also known as closed hashing and this entire procedure is based on probing.

  1. Linear Probing: The algorithm simply looks for the next available slot in the hash table and places the collided e=key there. If that slot is also occupied, The algorithm continues searching for the next available slot until an empty slot is found.
  2. Quadratic Probing: The algorithm searches for slots in a more spaced-out manner. When a collision occurs, the algorithm looks for the next slot using an equation that involves the original hash value and a quadratic function. If that slot is also occupied, the algorithm increments the value of the quadratic function and tries again.
  3. Double Hashing: The algorithm uses a second hash function to determine the next slot to check when a collision occurs. The algorithm calculates a hash value using the original hash function, then uses the second hash function to calculate an offset. The algorithm then checks the slot which is the sum of the original hash value and the offset. If that slot is occupied, the algorithm increments the offset and tries again.

3.2 Separate Chaining

The idea behind separate chaining is to implement the array as a linked list called a chain. The linked list data structure is used to implement this technique. When multiple elements are hashed into the same slot index, then these elements are inserted into a single-linked list which is known as a chain. In conclusion, if two elements have the same hash value, we will store those two values in the same cell as a linked list.

4. Hash Index Algorithm Types

The main hash functions used for hashing are SHA, MD-5, RIPEMD, and Whirlpool algorithms

4.1 SHA

SHA stands for secure hashing algorithm. SHA is a modified version of MD-5 and is used for hashing data and certificates. A hashing algorithm shortens the input data into a smaller form that cannot be understood by using bitwise operations, modular additions, and compression functions. Hashing is similar to encryption, the only difference between hashing and encryption is that hashing is one-way, meaning once the data is hashed, the resulting hash digest cannot be cracked unless a brute force attack is used.

4.2 MD-5

An MD-5 hash is created by taking a string of any length and encoding it into a 128-bit fingerprint. Encoding the same string using the MD-5 algorithm will always result in the same 128-bit hash output. MD-5 hashes are commonly used with smaller strings when storing passwords, credit card numbers, or other sensitive data in databases such as MySQL. This tool provides a quick and easy way to encode an MD5 hash from a simple string of up to 256 characters in length.

MD-5 hashes are also used to ensure the data integrity of files. Because the MD-5 hash algorithm always produces the same output for the same given input, users can compare a hash of the source file with a newly created hash of the destination file to check that it is intact and unmodified. However, since hackers already know how to decode the algorithm, it is not so safe to use now for highly sensitive data.

4.3 RIPEMD

RIPEMD (RACE Integrity Primitives Evaluation Message Digest) is a group of hash functions based on MD-4 which makes it a kind of weak hash function that can only work well with 32-bit processors. There are 4 types of RIPEMD algorithms:

  1. RIPEMD-128
  2. RIPEMD-160
  3. RIPEMD-256
  4. RIPEMD-320

4.4 Whirlpool

Whirlpool is a cryptographic hash function developed in 2000. It is a block cipher hash function designed after a square block cipher. It takes less than 2²⁵⁶ bits length input and converts it into a 512-bit hash. Every block cipher in a whirlpool is an 8*8 matrix. The state of the function changes in every round by using four operations:

5. Conclusion

Hash indexes are composed of keys, a hash function, and a hash table. Because the values generated from the hash function can have a collision with a bucket in the hash table, addressing this collision with open addressing and separate chaining can help to avoid this issue. Also, it is important to note that choosing the right hashing algorithm can help to prevent this issue further. Some of the most used hashing algorithms are SHA, MD-5, RIPEMD, and Whirlpool. Below, there is an image of how a hash function should generate a fingerprint. It is possible to note that the value of “Heaven” and “heaven” differ a lot. This is a great example of how hash functions should work.

References

  1. Hash What? Understanding Hash Indexes
  2. MD5 Hash Generator
  3. What is SHA? What is SHA used for?
  4. RIPEMD Hash Function
  5. Whirlpool Hash Function in Python
  6. Understanding hash indexes
  7. Hashing Algorithm Overview: Types, Methodologies & Usage
  8. A Definitive Guide to Learn The SHA-256 (Secure Hash Algorithms)
  9. Open Addressing Collision Handling Technique in Hashing
  10. Separate Chaining Collision Handling Technique in Hashing

--

--

Humberto Villalta

I am a BigData Engineer who has a lot of interest in the research and development of data structures for the optimization of NoSQL Databases.