The Origins of Hashing

In 1953, IBM had a problem: its computers searched too slowly. Let’s say you wanted to look up who had the telephone number 0207 930 9000 in an electronic database. With unsorted data, the computer of the day would run through its database of numbers, checking each entry until it found a match. It would be a bit like going into a library that didn’t have the Dewey Decimal system: the most methodical way to find the book you want would be to start at the beginning and work through the shelves until you found it. With a large database, this could take a considerable length of time. In fact, a human running a manual process might be faster, which was not a message that would help IBM sell more machines.

A solution to this problem came in the form of an internal memo circulated by IBM research engineer Hans Peter Luhn. Though largely self-taught, Luhn was a prolific inventor. By his death in 1964, he had had a varied career spanning textile production, computer science and linguistics, securing over 80 patents along the way. His interests were broad, apparently including mixology: he designed and patented the Cocktail Oracle, a tool that helped you identify which cocktails you could make from the ingredients in your kitchen. Where Luhn really excelled, however, was information storage and retrieval.

In his memo, Luhn suggested that computers put information into “buckets”. This grouping, he theorised, would make information easier for a computer to find. But how should the numbers be grouped? The phone number I gave above is one for MI5. If I knew this, I might find it useful to group the number with other national security agency contacts. But that grouping is no help to you or your computer if all you know is the number itself: you wouldn’t know whether to start searching in the national security agency group or the farmyards of London group. Instead, we need to put the telephone number in a bucket that is uniquely determined by something both parties know, and the only thing we both know is the telephone number itself.

Let’s break the number down into pairs where we can: 02 07 93 09 00 0. Now let’s add the digits of each pair together: 2 7 12 9 0 0. One pair still gives us two digits, so we’ll just take the last digit, leaving 272900. This phone number and its associated records can now be assigned to bucket 272900. Other phone numbers may land in bucket 272900 too: 0207 840 9000, for instance, or 0207 662 7550. But far fewer numbers will be allocated to this bucket than exist in the database as a whole. It would also be fewer than if we were to group numbers by, say, their first five digits. Using this hashing and bucketing process, the computer can find the records associated with a phone number much, much faster. (This example is inspired by IEEE Spectrum’s account of The Birth of Hashing.)
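The pair-and-sum scheme above is simple enough to sketch in a few lines of Python. This is an illustrative reconstruction of the example, not Luhn’s actual implementation; the function name `bucket` is my own.

```python
def bucket(phone_number: str) -> str:
    """Assign a phone number to a bucket: pair up the digits,
    sum each pair, and keep only the last digit of each sum."""
    digits = [int(d) for d in phone_number if d.isdigit()]
    # Group the digits into pairs; a trailing odd digit stands alone.
    groups = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    # Sum each group; taking the sum modulo 10 keeps just its last digit.
    return "".join(str(sum(g) % 10) for g in groups)

print(bucket("0207 930 9000"))  # 272900
print(bucket("0207 662 7550"))  # 272900 — a different number, same bucket
```

Numbers that share a bucket, like the two above, are exactly the collisions the article mentions: the bucket narrows the search, and the computer then only has to check the handful of entries inside it.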

This process, now known as a “hash table”, was a big success. Hash tables have since been developed and improved, and are incredibly useful wherever random data storage and retrieval are required. We don’t know for sure that Luhn was the first to use this technique, but he is commonly cited as such.

Hashing, the act of chopping up information of any size and outputting something of a fixed size, has become its own field in cryptography. It’s particularly useful because, if you use the same mathematical process, the same content will produce the same output each time it’s run (e.g. you can run the bucketing process on 0207 930 9000 and each time you’ll get the same output: 272900).

In this way, hashing can act almost like a digital fingerprint for content, which can be extremely useful. If you’re curious, check out What is hashing? or Why Cryptographers Enjoy Passwords with a Pinch of Salt.