Hash Function: Origin story, the linchpin of the information age
A child of the information age:
Unlike most functions, hash functions were not born in the field of mathematics. They were pure innovations from the field of computing in the dawn of the information age. Today, it sees use in many core technologies. In databases, digital signatures, error checking. Its invention was marked a fundamental shift in how we saw computers.
Post cold war
The cold war is over but cold war thinking survives — Joseph Rotblat
Following World War II . Computers built during the war for weapons calculations and cryptography remained. Cold War tensions and the space race ensured continued funding for them. Fundamentally, computers only of the day only did one thing: crunch numbers.
The visionary:
As we may think — Vannevar Bush 1945
Consider a future device … in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory. — Vannevar Bush
A visionary essay written by Bush, illustrated the failure of us as a species to grasp all of scientific knowledge at once. As science had grown considerably, Bush opined that it can no longer be a vast store of knowledge we hold in our heads. There needed to be a system by which we can query and consult the collective scientific knowledge.
In his essay Bush urged us to find a better way to retrieve information. He suggested we used microfilm (cutting age technology at the time) to shrink down books, so we can search for information quicker… oh how he would have loved to see the search engines we have today.
The manager of the information retrieval research division:
Shorty after becoming manager of the information retrieval at IBM. he was involved with the problem of searching for chemical compounds recorded in coded form. As with the time, his initial solutions involved some use of punch cards. But as he was given more problems regarding information retrieval and as electronic memory started to surmount the use of punch cards (Thank you magnetic tapes) Luhn had to find a more logical way to index information.
Too many numbers:
During that period, identification numbers like credit card and Social Security numbers were becoming increasingly significant in both public and private spheres. However, these numbers posed challenges as they were hard to remember and prone to transcription errors or intentional falsification. There was a need for a swift method to ascertain the validity of an ID number. Hans Peter Luhn filed for a U.S. patent on a “Computer for verifying numbers”
Luhn’s handheld computer did that, using a checksum algorithm he developed. For a 10-digit number, the computer would perform the following steps:
- Double every second digit
- If any result is 10 or greater, add up the digits of that result to get a single-digit number (for example, “16” would become 1 + 6 = 7)
- Add up all 10 digits of the new number
- Multiply by 9
- Take the last digit of that result
The result is an algorithm that takes a large number and turns it into a smaller number. Thus, the first hash function was born… but at this point in time it is not obvious how to use such an algorithm to improve information retrieval… after all, all this does is verify credit card numbers…
Buckets the answer was buckets!
Challenged with more information retrieval tasks at IBM and glancing at his stationary bucket on his desk (I don’t know if that’s true).
Luhn penned an internal memo at IBM proposing a method to expedite searches by organizing information into “buckets.” To illustrate; consider the task of searching for a specific individual’s telephone number within a database. If presented with the 10-digit number 314–159–2652, a computer could traditionally scan through the list sequentially, which could be time-consuming in a database containing millions of entries.
Luhn’s innovative concept involved categorizing each entry into a numbered bucket using the following process: The phone number’s digits were paired (e.g., 31, 41, 59, 26, 52), and the paired digits were summed (resulting in 4, 5, 14, 8, 7). A new number was then generated, comprising each single-digit result or, in the case of a double-digit result, only the last digit (resulting in 45487). Subsequently, the original phone number, along with the associated name or address, would be placed into a bucket labeled with the derived number, 45487.
Here with the phone numbers, Luhn invented another hash function without knowing it. By mapping big numbers into smaller one and grouping all the large numbers that produced the same small number he has solved an indexing problem.
This bucketing solution is similar to how we would, catergorize clothes in our wardrobe; say we do so by colour. Instead of looking through a single pile of mixed clothing to look for a particular pink shirt. We only look for the pink shirt in the pink shelf of our wardrobe, thus decreasing our search time.
Luhn’s approach
Luhn further developed this method of indexing and retrieval. Using a mathematical operation to group numbers, we can reduce search times significantly. Luhn continued to apply his “bucket method” to other data types, like text, and articles. By first encoding texts into numbers, then performing some math operations to group them. Today, we still use the same ideas. we use hash function indexing in search engines, and for other use cases like for guaranteeing integrity in blockchain and digital signatures.