Hash Functions for Data Mining

Dan Isaza
Weekly Data Science
5 min read · Jun 20, 2018

If you work in a technical field, chances are you’ve heard of hash functions. They’re used extensively in cryptography and data structures — you might have encountered them in public-key cryptography or when learning about hash tables. Non-technical folks have likely heard of them in the context of Bitcoin mining.


Hash functions — it turns out — are incredibly useful for many things, including data mining and machine learning. This post is intended to be a quick introduction to the kinds of hash functions that are commonly used for data mining. Many algorithms for analyzing shopping trends, text documents, and even genetic information rely on the concepts discussed below.


What is a hash function?

A hash function is a deterministic function that maps inputs of arbitrary sizes to outputs of a fixed size. That means that there are an infinite number of possible inputs but only a finite number of possible outputs. We call these outputs hashes, hash values, or digests.

For example, imagine a function that takes in non-empty strings containing the ASCII characters a-z and maps each string to its first character. There’s a glaring issue with this hash function that we’ll discuss later, but for now let’s use it as our example. (See if you can identify the issue.)

A pretty contrived example…
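In Python, the toy function might be sketched like this (the name `first_char_hash` is mine, not a standard function):

```python
def first_char_hash(s: str) -> str:
    """Map a non-empty string of a-z characters to its first character."""
    if not s or not all("a" <= c <= "z" for c in s):
        raise ValueError("expected a non-empty string of characters a-z")
    return s[0]

print(first_char_hash("apple"))     # "a"
print(first_char_hash("aardvark"))  # "a" -- a collision with "apple"
print(first_char_hash("zebra"))     # "z"
```

Arbitrarily long strings all land in the same 26 one-character buckets.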

We can see that there are many possible inputs — an infinite number, in fact — and only 26 possible outputs.

Let’s use this example to explore some of the common properties of the hash functions we’ll encounter in data mining applications.

Common Properties of Hash Functions in Data Mining

  • They have a fixed and finite number of possible outputs, which is smaller than the number of possible inputs. This has several implications. First, it means that multiple inputs can hash to the same output (the way “abc” and “apple” both hash to the letter “a” in our example). We call that a collision. Second, it means that if you were to hash all of your data and store the counts of the hashes, you would likely be storing much less information than if you stored the counts of items in the dataset (depending on your data and which hash function you choose). For example, keeping a count of how many times each word in a book occurs would take much more memory than keeping track of how many words in the book start with each letter of the alphabet.


  • Most hash functions cannot be reversed. Given a particular output, it’s not possible to determine what the original input was that generated that output. In our example, if we were told that some input hashed to the output “a”, it would be impossible for us to say whether that input had been “apple”, “aardvark”, “abc”, or some other string starting with “a”.
  • Hash functions are deterministic. A particular input will always generate the same output. This rule is true for hash functions regardless of the discipline in which they’re being used.
  • Outputs should all have an equal probability of occurring. This is important because it governs how collisions are distributed: we typically want all inputs to have equal probabilities of colliding. As you may have guessed, this is the glaring issue with our example function. The probability of a given output is the probability that an input string starts with that particular letter, and since we don’t know anything about the distribution of input strings, it’s not safe to assume our outputs will all be equally likely. If, for example, we’re choosing strings from an English dictionary, far more words start with ‘t’ than with ‘x’.

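We can see the non-uniformity of our toy function concretely by tallying its outputs over a small sample of words (the word list here is made up for illustration):

```python
from collections import Counter

# A small, made-up sample standing in for dictionary words.
words = ["the", "table", "tiger", "tree", "time",
         "train", "test", "xylophone", "apple", "aardvark"]

# Our toy hash: map each word to its first character.
counts = Counter(w[0] for w in words)
print(counts)  # 't' dominates; most of the 26 buckets are empty
```

Seven of the ten words land in the ‘t’ bucket, so a collision there is far more likely than in the ‘x’ bucket.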

Common hash functions and an alternative to our example

One very common hash function is Message Digest 5, or MD5 for short. The Secure Hash Algorithms (SHA) are another widely used family of hash functions. But be very careful: many hash functions, including MD5 and several of the SHA functions, have been found to be unsafe for cryptographic use. For data mining, however, this doesn’t concern us.

To fix the non-uniformity problem we observed with our example, we could simply use the MD5 function and keep the first N characters of the MD5 hash. MD5 is designed so that each character of the digest is equally likely to take any of its possible values, which satisfies our uniformity condition. To tune the expected number of collisions, we can simply choose a different value of N.
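A sketch of this approach using Python’s standard `hashlib` module (the default N=2 here is arbitrary):

```python
import hashlib

def truncated_md5(s: str, n: int = 2) -> str:
    """Hash a string with MD5 and keep the first n hex characters."""
    digest = hashlib.md5(s.encode("utf-8")).hexdigest()  # 32 hex characters
    return digest[:n]

print(truncated_md5("apple"))     # always the same 2 hex characters
print(truncated_md5("apple", 5))  # a longer prefix: fewer expected collisions
```

With N hex characters there are 16^N possible outputs, so increasing N by one multiplies the number of buckets by 16.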

What’s the expected number of collisions?

Test your knowledge of hash functions by trying to solve the problem below.

Let’s assume:

  • We pick 20,000 unique words
  • We hash them according to the method described above, using the MD5 hash and keeping the first N characters of the resulting hash string. These first N characters become the final hash value that we’ll use.

Questions:

  • What is the expected number of collisions in the hash value that consists of all zeroes when N=1? N=2? N=5?
  • Why might we want to avoid picking N=5?

Note: MD5 hashes are written as 32-character hexadecimal strings, so each character takes one of 16 possible values (0-9 and a-f).
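If you want to sanity-check your answer empirically, a sketch like the following counts how many items land in the all-zeroes bucket (the synthetic `word0`, `word1`, … strings are a stand-in of mine, since the exercise doesn’t fix a particular word list):

```python
import hashlib

def truncated_md5(s: str, n: int) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()[:n]

words = [f"word{i}" for i in range(20_000)]  # stand-in for 20,000 unique words

for n in (1, 2, 5):
    bucket = "0" * n
    hits = sum(1 for w in words if truncated_md5(w, n) == bucket)
    # Under uniform hashing we'd expect about 20_000 / 16**n items per bucket.
    print(f"N={n}: {hits} words hash to {bucket!r} (expected ~{20_000 / 16 ** n:.2f})")
```

Comparing the observed counts with 20,000 / 16^N should make the answer to the last question apparent.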

What can I do with Hash Functions?

There are some really cool applications that I’ll be writing about in the future, including Bloom Filters, Locality Sensitive Hashing, and the PCY Algorithm — an improvement to the Apriori Algorithm, which I wrote about in The Intuition Behind the Apriori Algorithm. Keep an eye out for these posts in the future.

Are you excited about data science / machine learning / data mining? Is there anything in particular you’d like to see me write about? Let me know in the comments below!


Stanford Math & CS | VP of Engineering at Clever Real Estate | (he/him pronouns)