The Magic of Hashing for Efficient Data Retrieval

YASSINE KODAD
8 min read · Sep 9, 2023


In this post, I aim to discuss the importance and benefits of hashing in data retrieval, explain the hashing process, explore the most popular hashing algorithms, and delve into the various types of databases and systems that harness the power of hashing.

When dealing with large datasets, the straightforward approach of searching for data elements in an array using linear search becomes impractical. The time complexity of linear search is O(n), where ’n’ represents the size of the array. For large datasets, this can lead to unacceptably slow search times. Additionally, arrays are not well-suited for other operations like insertion, deletion, or sorting without incurring significant time and space costs. This is where hashing comes into play as a powerful technique to address these challenges and enable efficient data retrieval.

Hashing allows for efficient data retrieval by transforming data into a fixed-size hash code using a hash function. This code serves as an index or pointer to the actual data. Hash functions are designed to distribute data uniformly across a fixed number of buckets or slots. This means that, on average, you can access data in O(1) time complexity, making it incredibly fast, even for large datasets.
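In Python, for example, the built-in dict is a hash table, so a lookup by key never scans the whole collection. A small sketch with made-up data:

```python
# Python's dict is a hash table: average O(1) lookup by key.
phone_book = {f"person{i}": f"555-{i:04d}" for i in range(100_000)}

# One hash computation locates the bucket -- no scan over 100,000 entries.
print(phone_book["person4242"])  # 555-4242
```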

A small phone book as a hash table

The hashing process:

Input Data: The process begins with the input data, which can be of any length and content: a password, a file, a message, or any other information you want to process. Examples of input data in data retrieval systems (e.g., in a DBMS) include employee IDs (“EMP12345”), product SKUs (“SKU98765”), ISBN numbers (“ISBN-13: 978-0-123456-78-9”), and location coordinates.

Hash Function: A hash function is a mathematical algorithm designed to take the input data and produce a fixed-size output, which is the hash value. The characteristics of a good hash function include:

  • Deterministic: For the same input, the hash function will always produce the same hash value.
  • Fast to Compute: Hashing should be a relatively fast operation.
  • Fixed Output Length: The hash function generates a hash value of a fixed size, regardless of the input size.
  • Avalanche Effect: A small change in the input should result in a significantly different hash value.
  • Pre-image Resistance: It should be computationally infeasible to reverse the hash and find the original input data.
  • Collision Resistance: It should be computationally infeasible to find two different inputs that produce the same hash value.
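The first three properties are easy to observe with Python's hashlib (SHA-256 is chosen here for illustration; the inputs are made up):

```python
import hashlib

def sha256_hex(s: str) -> str:
    """Hex digest of a UTF-8 string, using SHA-256 for illustration."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

# Deterministic: the same input always produces the same hash value.
print(sha256_hex("EMP12345") == sha256_hex("EMP12345"))  # True

# Fixed output length: 64 hex characters (256 bits), whatever the input size.
print(len(sha256_hex("a")), len(sha256_hex("a" * 100_000)))  # 64 64
```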

Hashing Process: The input data is fed into the hash function as a parameter. The hash function processes the data through a series of mathematical operations and transformations, and the result is a hash value that represents the input. Common building blocks include bitwise operations, the modulus operation, bit shifts, addition and subtraction, multiplication and division, logical operations, rotations, and mixing functions, often organized into rounds. Cryptographic designs add further machinery such as padding, message scheduling, initialization vectors (IVs), salts, and compression functions.
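As a toy illustration of a few of these building blocks (a bit shift, addition, and a final modulus), here is a djb2-style string hash; real cryptographic hash functions are far more elaborate:

```python
def djb2_bucket(data: str, table_size: int = 1024) -> int:
    """Toy djb2-style hash: bit shift + addition per character, modulus at the end."""
    h = 5381
    for ch in data:
        # (h << 5) + h is h * 33; the mask keeps the value within 32 bits.
        h = ((h << 5) + h + ord(ch)) & 0xFFFFFFFF
    return h % table_size  # modulus maps the hash onto a fixed-size table

print(djb2_bucket("EMP12345"))
```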

Hash Value: The hash value is typically a sequence of bytes, usually written in hexadecimal, and it appears as a seemingly random string of characters. It has a fixed length regardless of the size or content of the input data. For example, a 32-bit hash might be written as “0x7a83b51f” or “a0e8f356”, a 64-bit hash as “c2f10e85dab473fc”, and a SHA-256 digest is always 64 hexadecimal characters, whether the input is one byte or one gigabyte.

Here’s a simple example of hashing in Python using the hashlib library:
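(A minimal sketch using SHA-256 from the standard library; the keys are the illustrative ones from earlier.)

```python
import hashlib

# Hash a few of the example keys from above (illustrative inputs).
for key in ["EMP12345", "SKU98765", "Hello, World!"]:
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    print(f"{key!r} -> {digest}")  # always 64 hex characters
```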

The most popular hashing algorithms:

  • MD5 (Message Digest 5): MD5 is a widely known and used cryptographic hash function. However, it is considered weak and vulnerable to collision attacks, making it unsuitable for security-critical applications. It produces a 128-bit (16-byte) hash value.
  • SHA-1 (Secure Hash Algorithm 1): SHA-1 was widely used for data integrity and security purposes, but it is now deprecated due to vulnerabilities. It produces a 160-bit (20-byte) hash value.
  • SHA-256 (Secure Hash Algorithm 256-bit): Part of the SHA-2 family, SHA-256 is widely used for secure data hashing and cryptographic applications. It produces a 256-bit (32-byte) hash value. SHA-256 is considered secure and is commonly used in blockchain technology.
  • SHA-3 (Secure Hash Algorithm 3): SHA-3 is the latest member of the Secure Hash Algorithm family and is designed to provide a higher level of security. It comes in various output lengths, including SHA3-256 (256 bits) and SHA3-512 (512 bits).
  • SHA-512 (Secure Hash Algorithm 512-bit): SHA-512, also part of the SHA-2 family, produces a 512-bit (64-byte) hash value. It is often used when a longer hash is required for added security.
  • bcrypt: bcrypt is a cryptographic key derivation function designed for securely hashing passwords. It incorporates a salt and a work factor, making it resistant to brute-force and rainbow table attacks.
  • Argon2: Argon2 is another key derivation function designed to securely hash passwords. It is the winner of the Password Hashing Competition and is known for its memory-hardness properties, making it resistant to certain attacks.
  • CRC32 (Cyclic Redundancy Check): CRC32 is a non-cryptographic hash function used for error-checking in data transmission. It is fast but not suitable for security-related applications due to its vulnerability to collisions.
  • MurmurHash: MurmurHash is a non-cryptographic hash function known for its speed and simplicity. It is often used in applications where speed is more important than cryptographic security.
  • XXHash: XXHash is another non-cryptographic hash function that is known for its high speed and efficiency. It is often used in data processing and compression applications.
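Several of these algorithms ship with Python's hashlib, so the digest sizes listed above can be checked directly (MD5 and SHA-1 appear here for comparison only, not as a recommendation):

```python
import hashlib

data = b"The Magic of Hashing"
for name in ["md5", "sha1", "sha256", "sha512", "sha3_256"]:
    h = hashlib.new(name, data)
    # digest_size is in bytes: 16 for MD5, 20 for SHA-1, 32 for SHA-256, ...
    print(f"{name:8s} {h.digest_size * 8:4d}-bit  {h.hexdigest()[:16]}...")
```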

Collision Handling

Collision handling refers to the strategies and techniques used to address the situation where two distinct pieces of data produce the same hash code when hashed. Hash functions map data from a potentially large domain into a fixed-size range (the hash code), and due to this compression, it’s possible for different inputs to generate the same hash code. This phenomenon is known as a collision. To manage these scenarios effectively and preserve data integrity while maintaining optimal performance, hash tables implement collision resolution techniques. Two common methods for collision handling are chaining and open addressing.

https://commons.wikimedia.org/wiki/File:Hash_table_5_0_1_1_1_1_0_SP.svg
  • Chaining: Chaining involves creating a linked list (or another data structure like a dynamic array) at each hash table bucket. When a collision occurs, new key-value pairs are simply appended to the list in the corresponding bucket.
https://en.wikipedia.org/wiki/Hash_table#
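A minimal chaining table might look like the following sketch (illustrative, not production code):

```python
class ChainedHashTable:
    """Each bucket holds a list of (key, value) pairs; collisions append to it."""

    def __init__(self, size: int = 8):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # empty bucket or collision: append

    def get(self, key):
        for k, v in self._bucket(key):   # walk the chain in this bucket
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("EMP12345", "Alice")
table.put("EMP67890", "Bob")
print(table.get("EMP12345"))  # Alice
```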

  • Open Addressing: Open addressing resolves a collision by searching for the next available slot within the hash table itself. Various probing techniques can be used, such as linear probing (checking the next slot), quadratic probing, or double hashing.

https://en.wikipedia.org/wiki/Open_addressing
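And a corresponding sketch of open addressing with linear probing (again illustrative; a real implementation would also resize the table and handle deletions with tombstones):

```python
class LinearProbingTable:
    """Open addressing: on collision, scan forward to the next free slot."""
    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, size: int = 8):
        self.keys = [self._EMPTY] * size
        self.values = [None] * size

    def put(self, key, value):
        i = hash(key) % len(self.keys)
        for _ in range(len(self.keys)):              # probe at most every slot
            if self.keys[i] is self._EMPTY or self.keys[i] == key:
                self.keys[i], self.values[i] = key, value
                return
            i = (i + 1) % len(self.keys)             # linear probing: next slot
        raise RuntimeError("table full (a real table would resize)")

    def get(self, key):
        i = hash(key) % len(self.keys)
        for _ in range(len(self.keys)):
            if self.keys[i] is self._EMPTY:
                raise KeyError(key)                  # hit a gap: key absent
            if self.keys[i] == key:
                return self.values[i]
            i = (i + 1) % len(self.keys)
        raise KeyError(key)

t = LinearProbingTable()
t.put("SKU98765", 42)
print(t.get("SKU98765"))  # 42
```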

Different types of databases and systems (DBMS) that use hashing:

Databases use hashing in different ways and for various purposes, including indexing, fast data retrieval, and data integrity. Below, we explore a selection of databases and database management systems (DBMS) that leverage hashing techniques to achieve these objectives:

  • MySQL: MySQL’s InnoDB storage engine organizes data in B-tree indexes with a clustered primary key, but it also maintains an adaptive hash index, building hash entries for frequently accessed index pages to speed up lookups.
  • Oracle Database: Oracle Database utilizes hashing techniques for its hash clusters, which are designed to improve the performance of data retrieval for specific types of queries.
  • Microsoft SQL Server: SQL Server provides a HASHBYTES function that can be used to generate hash values for data, which is useful for data integrity checks.
  • MongoDB: MongoDB uses hashed indexes for sharding data across multiple nodes in a cluster. This technique ensures even data distribution and efficient data retrieval in a distributed database environment.
  • Cassandra: Cassandra, a NoSQL database, employs consistent hashing to distribute data evenly across nodes in a cluster, ensuring load balancing and fault tolerance.
  • HBase: HBase, a distributed and scalable NoSQL database, uses hashing to partition data across region servers for efficient storage and retrieval.
  • Redis: Redis, an in-memory data store, uses hashing to manage its key-value pairs efficiently, enabling fast data access based on keys.
  • Elasticsearch: Elasticsearch uses hash-based data structures like inverted indexes to facilitate fast full-text search and indexing.
  • DynamoDB (Amazon Web Services): DynamoDB utilizes consistent hashing to distribute data across multiple nodes in its distributed database service, ensuring high availability and scalability.
  • Couchbase: Couchbase, a NoSQL database, uses hash-based sharding to distribute data across nodes in a cluster, optimizing data distribution and retrieval.
  • Riak: Riak, a distributed NoSQL database, employs consistent hashing to distribute data and ensure fault tolerance in a distributed environment.
  • Ethereum (Blockchain): Ethereum, a blockchain-based distributed ledger, uses hashing extensively for various purposes, including mining, creating blocks, and securing smart contracts.
  • Content Delivery Networks (CDNs): CDNs like Akamai use hashing techniques to determine the optimal server for content delivery based on the user’s location, reducing latency and improving content delivery speed.
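The consistent-hashing idea behind systems like Cassandra and DynamoDB can be sketched in a few lines (node names and keys are made up; real systems add virtual nodes and replication):

```python
import bisect
import hashlib

def ring_position(item: str) -> int:
    # Map any string onto a fixed circular keyspace via a hash.
    return int(hashlib.sha256(item.encode()).hexdigest(), 16) % (2**32)

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((ring_position(n), n) for n in nodes)
positions = [p for p, _ in ring]

def node_for(key: str) -> str:
    # Walk clockwise to the first node at or after the key's position,
    # wrapping around to the start of the ring if necessary.
    i = bisect.bisect_left(positions, ring_position(key)) % len(ring)
    return ring[i][1]

for key in ["EMP12345", "SKU98765", "user:42"]:
    print(key, "->", node_for(key))
```

Because each key's position on the ring is fixed, adding or removing a node only moves the keys in that node's arc, rather than reshuffling everything.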

A simple, regular sequence such as the Fibonacci sequence is not suitable for cryptographic hashing, primarily due to its lack of the “avalanche effect,” a crucial property in cryptographic security. This effect ensures that even small changes in input data produce significantly different hash values. Please refer to the image below.

SHA-1 demonstrates a strong avalanche effect

Note: a predictable, evenly spaced distribution of outputs is problematic for cryptographic hashing: if the pattern of hash values can be anticipated, an attacker can exploit it. You can find the image below for reference.

The Fibonacci sequence exhibits an even distribution

Hash functions often use randomness or pseudo-randomness to generate hash values with desirable properties, such as the avalanche effect and resistance to collision attacks. Consequently, cryptographic hash functions need techniques to introduce variation and unpredictability into hash values, which are essential for robust security.
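This avalanche behavior can be measured directly: changing even one character of the input should flip roughly half of the output bits. A quick check with SHA-256:

```python
import hashlib

def digest_int(data: bytes) -> int:
    """SHA-256 digest interpreted as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

a = digest_int(b"Hello, World!")
b = digest_int(b"Hello, World.")  # one character changed

flipped = bin(a ^ b).count("1")   # count the output bits that differ
print(f"{flipped} of 256 output bits differ")  # close to half, as expected
```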

In the context of efficiently finding data elements in large datasets, hashing is a key technique that overcomes the limitations of traditional data structures like arrays. Hashing, particularly through hash tables, offers a powerful solution for efficient data retrieval, insertion, and deletion, making it an essential tool in database management and algorithm design for handling large datasets. Understanding hashing and its applications is fundamental for developing computationally efficient methods to query and manipulate data.
