Let’s do hashing!

Building a data structure that can be searched rather quickly…

Ashutosh Kumar
Dreams On Fire!
6 min readJun 29, 2020

--

The concept of hashing refers to a data structure that made the search possible in O(1) time. For this, we got a hash table — it is a collection of items that are stored in such a way as to make it easy to find them later. Each position of a hash table often called a slot and it holds one item each and each slot has an integer value.

A typical hash-table

For instance, we can see the above hash table has 11 slots(mostly the size of hash tables are always prime numbers, we will see later why!) and slot holds one value(None as all are empty for now).

The mapping between an item and the slot where that item belongs in the hash table is called the hash function. It takes any item from the collection and returns an integer value of the slot in which the item is to be stored.

Is it that much easy for Hash Function?

The foremost thing to understand here is that there is no such terms known as “perfect hash function”. What we all care about is performance efficiency! Let’s check it out, why?

Let start with a hash function and see what are the complications. So we got a remainder method to map items in a hash table. Like the collection of items is 44,57,60,98,55,114,26 and we have a hash table of slot size 11 as shown in the above diagram. The hash function will work like h(item)=item%11

Hence the slots for items will be 44-slot 0, 57-slot 2, 60-slot 5, 98-slot 10, 55-slot 0, 114-slot 3, 26-slot 4 [slot size=11, starting from 0 to 10].

Do you see the problem here?-Same mapping is done to multiple items like 44 and 55 are allotted to the same slot 0. This is known as a collision.

Our goal is to create a hash function that minimizes the number of collisions, is easy to compute, and evenly distributes the items in the hash table. There are a number of common ways to extend the simple remainder method. How?

  • Folding method: It is for constructing hash functions that begin by dividing the item into equal-size pieces (the last piece may not be of equal size). These pieces are then added together to give the resulting hash value. For example, if our item was phone number 436–555–4601, we would take the digits and divide them into groups of 2 (43,65,55,46,01). After the addition, 43+65+55+46+0143+65+55+46+01, we get 210. If we assume our hash table has 11 slots, then we need to perform the extra step of dividing by 11 and keeping the remainder. In this case, 210 % 11 is 1, so the phone number 436–555–4601 hashes to slot 1.
  • Mid-Square method: We first square the item, and then extract some portion of the resulting digits. For example, if the item were 44, we would first compute 44²=1,936. By extracting the middle two digits, 93, and performing the remainder step, we get 5 (93 % 11). There could be many ways literally.

The important thing to keep in mind that the hash function should be efficient so that it does become the dominant part in the storage and search process as it will defeat the purpose of hashing.

How collisions are addressed?

For collisions in the slots of hash tables, we use algorithms like linear probing, which searches for the next empty slot if a collision occurs. But the problem is we may face clustering. This means that if many collisions occur at the same hash value, a number of surrounding slots will be filled by the linear probing resolution. This will have an impact on other items that are being inserted next. Like if you have allotted a slot for an item and then the next item is also allotted to the same slot then clustering occurs in one place.

So we can use quadratic probing instead of a linear one as it will map slots with a jump of more than 1, not the next slot. Hence, randomness will increase in distribution, and therefore hashing will be efficient than before.

Chaining methods are also used in case of collisions. Each slot is connected to a chain and items allotted to that slot will go in the chain further and thus creating several different chains of each slot.

Let’s take an overview of Hash Function through cryptographic perspective!

  • Hash functions transform large input(variable size) to small fixed output(every information has a fixed size hash value).
  • Deterministic and efficient computation, given the input.
  • The output is uniformly distributed(mapping of hash functions in hash table).
  • The input is called as message and output can be called a hash, fingerprint, or digest.

Insecure Hash Functions

The distribution of items in a hash table should be improved by circular shifting message blocks by different amounts before encrypting it using XOR. If there will be a regularity in similar message blocks, then it will give a non-uniform output. The attacker here can generate a similar hash function by combining crafted code with the original input.

What does the Cryptographic Hash Function require?

  • Preimage resistance: The hash function should be a one-way function to make it difficult to find an input that maps to a given hash output. Like for any output h’, it is computationally infeasible to find a y such that h(y) = h’.
  • Collision resistance: For any x, it should be computationally infeasible to find y≠x with h(y)=h(x).
  • The output of the hash function h should be pseudo-random and exhibits Avalanche Effect.

Brute-Force Attack on Hash Function

The security of a hash function is not dependent on the encryption algorithm instead it depends on the bit-length of the hash function[h(n)]. On average it will take 2^(n-1) computations for brute force attack. That means even 128-bit hash isn’t secure enough, so presently we use a 224-bit hash.

The Structure of the Cryptographic Hash Function

Iterative Hash structure

Here, the message1, 2,……, n are input blocks and ‘f’ is the hash function applied on each block of message. If ‘f’ is collision-resistant, then so is the hash and they are to support the variable-length inputs from the message block.

  • The pseudo-randomness in the output hash does not allow the attacker to correlate the hash value and the password itself.
  • The Avalanche effect does not allow the use of heuristics(enabling a person to learn something) for the guessing of input. Thus, even after knowing part of the password, the attacker will not be able to use this knowledge to increase the effectiveness of the attack.
  • The pre-image resistance is the main requirement of crypto hash function because it does not allow an attacker to pick up a password with which he/she can authenticate, even if hashes are leaked.

So as far as the basic science of Hashing is concerned, we now have a great deal of information about that. The next topic will be the hash chains and how bitcoins use hash for its security. Stay tuned!

~Ashutosh

--

--