ML Security Pro Tips: Understanding MinHash in a Security Context

Melanie Beck
AI/ML at Symantec
Feb 25, 2019 · 13 min read

As a recently-hired ML Research Engineer in Symantec’s Center for Advanced Machine Learning (CAML), I’ve become familiar with the MinHash technique as a method for document summarization. In this article I’ll detail what I’ve learned from a practitioner’s standpoint, and describe the pros and cons of this classic algorithm as it pertains to privacy, security and ML.

MinHash is a specific type of Locality Sensitive Hashing (LSH), a class of algorithms that are extremely useful and popular tools for measuring document similarity. MinHash is time- and space-efficient with implementations available in most commonly used programming languages. It has a long history across a wide variety of information retrieval use cases, including:

  • Near-duplicate detection: identifying and removing duplicates from vast collections of documents, webpages or other files.
  • Clustering: grouping like documents or files together
  • Fingerprinting: determining unique content
  • Large-scale search: efficiently retrieving content related to a query document

So how can MinHash be used in machine learning? And equally importantly: how secure is MinHash? Let’s find out.

What is MinHash?

In case you’ve been under a rock for twenty years (as I have), I’ll start with a brief history and conceptual review of the algorithm. MinHash was developed in 1997 during the Dot-com boom to address a common problem faced by early search engines: the existence of multiple near-duplicate websites. How could one identify whether two webpages were nearly identical and remove them from search results? Andrei Broder had the answer and we’ll walk through his algorithm step by step.

Since the goal of MinHash is to find similar documents let’s start with a definition of “similar.” It turns out that many really great ones already exist, and one in particular works well with documents: the Jaccard Index or Jaccard Similarity Coefficient. Jaccard similarity is a statistic that measures how close two sets are in an intuitive fashion using the idea of intersection over union. This is easily explained with a toy example.

Suppose we have two documents:
A: “There are four lights.”
B: “There are five lights.”
We can visualize these as two overlapping sets of words, with “there,” “are,” and “lights” in the intersection and “four” and “five” outside it.

The Jaccard Index for these two documents is J(A,B) = 0.6. How did we compute that? It’s the ratio of the number of shared terms (intersection) to the total number of unique terms (union): J(A,B) = |A ∩ B| / |A ∪ B| = 3/5 = 0.6.

The higher the similarity value, the more terms are shared between documents and the more semantically close we expect the documents to be. (Note: the converse does not hold; that is, two documents about the same topic using completely different words will not necessarily be detected as similar.) This works just fine until you’re faced with millions or billions of documents. Breaking each one down into its respective words, computing their Jaccard Indexes, and performing pair-wise comparisons between each document and every other document in your sample becomes computationally prohibitive.
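
To make the computation concrete, here is a minimal sketch in Python, treating each document as a set of lowercase words (periods dropped, as in the shingling example later on):

def jaccard(doc_a, doc_b):
    # Jaccard similarity: intersection over union of the two word sets.
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

print(jaccard("There are four lights", "There are five lights"))  # 0.6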

Enter MinHash, a method for estimating the Jaccard Index, whose key idea is to find the number of overlapping segments by creating a document “fingerprint.” There are two standard implementations. In both cases, we start by running a sliding window of size k over the document in a process called “shingling.” A shingle is the collection of characters that each window “sees” as it slides through the document. For each shingle we generate hashes, a subset of which are used as the desired fingerprint.

Many hash functions

In this algorithm variant, we perform hashing by passing each shingle through h different hash functions. Specifically, we’ll take the first hash function, apply it to all document shingles, and select the minimum resulting value as the first component of the document fingerprint. Then we’ll repeat this for each of the remaining h-1 hash functions, resulting in an h-element MinHash signature.
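
As a rough illustrative sketch of this variant (my own reconstruction, not code from the original paper), a common trick is to simulate h different hash functions by salting a single hash with h different seeds:

import hashlib

def shingles(text, k=5):
    # All k-character windows of the document.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, h=64, k=5):
    # One minimum hash value per simulated hash function gives an h-element signature.
    signature = []
    for seed in range(h):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text, k)
        ))
    return signature

def signature_similarity(sig_a, sig_b):
    # The fraction of matching components estimates the Jaccard similarity of the shingle sets.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

print(signature_similarity(minhash_signature("There are four lights"),
                           minhash_signature("There are five lights")))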

Single hash function

Computing many unique hash functions can be costly in its own right. An alternative variant uses a single hash function. Here, we hash all of our document’s shingles and keep just the n smallest values, resulting in an n-element MinHash signature. As an example, let’s process our sample documents under this implementation using the md5 hash and check that our similarity estimate makes sense.

First, we create our shingles. Here k=5 and white space is represented by underscores.
A: “There are four lights” →
[‘There’, ‘here_’, …, ‘e_are’, ‘are_f’,…, ‘_four’, ‘our_l’,…,‘light’, ‘ights’]
B: “There are five lights” →
[‘There’, ‘here_’, …, ‘e_are’, ‘are_f’,…, ‘_five’, ‘ive_l’,…,‘light’, ‘ights’]

How many shingles did we make? That’s computed as the total number of characters in a document minus the shingle size, plus one. So we have 21 - 5 + 1 = 17 shingles.

The number of shingles in a document is based on the number of characters it contains and the shingle size.
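
That formula is easy to verify directly (underscores added only for display, to mirror the lists above):

text = "There are four lights"  # 21 characters, no trailing period
k = 5
shingle_list = [text[i:i + k].replace(" ", "_") for i in range(len(text) - k + 1)]
print(len(shingle_list))  # 21 - 5 + 1 = 17
print(shingle_list[:3])   # ['There', 'here_', 'ere_a']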

Next, we hash each shingle to get:
A: [60e46a, 409f25, 11a972, daf874, f77b37, bd014f, d7b02e, …]
B: [60e46a, 409f25, 11a972, daf874, f77b37, bd014f, d7b02e, …]

Note that the md5 hash function always outputs a 32-character hexadecimal string, but without loss of generality I’ve kept just the first six hex digits. Also notice that most of the hex strings are identical between the two documents since most of the shingles are identical — the same string will always map to the same hash value.

Equivalently, we can express these hexadecimal strings as decimals:
A: [396870, 264690, 72343, 896903, 1013683, 774164, 883458, …]
B: [396870, 264690, 72343, 896903, 1013683, 774164, 883458, …]

Now let’s take the n=10 lowest values as our fingerprint for each document. We can compute the MinHash similarity as M(A,B) = (number of values in common) / n. For our example with n=10, we get 0.5. Not bad! That’s quite close to the true Jaccard similarity of 0.6.
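
Putting the single-hash variant together as a small sketch (using full md5 values rather than the truncated ones displayed above, so the exact estimate you get may differ slightly from 0.5):

import hashlib

def shingles(text, k=5):
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_fingerprint(text, n=10, k=5):
    # Hash every shingle once and keep only the n smallest values.
    values = {int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles(text, k)}
    return set(sorted(values)[:n])

def minhash_similarity(doc_a, doc_b, n=10, k=5):
    # Estimated similarity: number of fingerprint values in common divided by n.
    a = minhash_fingerprint(doc_a, n, k)
    b = minhash_fingerprint(doc_b, n, k)
    return len(a & b) / n

print(minhash_similarity("There are four lights", "There are five lights"))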

Why does this work?

The beauty lies in randomness and probability. As an analogy, MinHash can be thought of as taking the union of the two sets, shuffling them into a random order, and then selecting the first value in the new random order (representing the minimum hash). Let’s see how this might work using our running example with words rather than shingles for clarity.

We’ll randomly shuffle the union of our words; the shared words here are “there,” “are,” and “lights”:
[‘five’, ‘are’, ‘there’, ‘four’, ‘lights’]

Q: If we select only a single MinHash component (n=1), what is the probability that one of the shared words ends up first on that list?
A: It’s just the number of shared words (3) divided by the total number of words (5) = 3/5 — the Jaccard Similarity.

Q: If we want n MinHash components, how many of them will be shared words, on average?
A: This is equal to the number of components, n, times the probability of a shared word: (3/5)*n.

Now we can compute the expected value of the MinHash similarity:

M(A, B) = (3/5)*n / n = 3/5 = J(A, B), the Jaccard Similarity!
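
If you’d rather see this empirically, here’s a tiny simulation of the shuffle analogy: pick a random element of the union and check whether it’s one of the shared words.

import random

a = {"there", "are", "four", "lights"}
b = {"there", "are", "five", "lights"}
union, shared = list(a | b), a & b

trials = 100_000
hits = sum(random.choice(union) in shared for _ in range(trials))
print(hits / trials)             # hovers around 3/5
print(len(shared) / len(union))  # exactly 0.6, the Jaccard similarity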

What if we took all n=17 shingles? We’d get a similarity estimate of 0.588, illustrating that larger n yields a better estimate of similarity since you’re keeping more of the document in your representation. However, there are some drawbacks to keeping more shingles. One issue is that the cost of storing these document summaries increases, which may be a problem depending on your application. There are also security concerns, which we’ll look at in a later section.

We now have fixed length document summary vectors that closely estimate the Jaccard Similarity. So how does this fit into machine learning?

MinHash in ML

MinHash is not typically thought of as “machine learning” but that doesn’t mean it’s worthless in this realm. The MinHash output is essentially a set of randomly selected document features, as opposed to deterministically designed features more commonly used in machine learning. MinHash is not the only document summarization method available, either: alternatives include the canonical Bag-of-Words, TF-IDF (a weighted version of Bag-of-Words), and (all the rage these days) word embeddings like Word2Vec, GloVe, and FastText. MinHash is simpler than popular alternatives, but does it hold up to these other approaches?

Spoiler Alert: yes, surprisingly well.

Vector Representation

In order to train a classic ML classifier on MinHash features, we need a representation it will recognize. We’ll continue with our running example, in which a single hash function (md5) was applied to every shingle. While the hex hashes presented above work fine in LSH, they won’t fly for traditional ML algorithms. Instead, we convert them into bit vectors. Each hex digit encodes one of 16 = 2⁴ possible values, and md5 outputs 32-character hashes, so a bit vector with one position per possible hash value would be (2⁴)³² or 340282366920938463463374607431768211456 0’s and 1’s long! Whoa!

Let’s do a concrete example. In our documents above, each 5-character shingle was converted to a 32-character md5 hash and we truncated those to 6 hex characters each. According to our formula above, this will result in a bit array that is (2⁴)⁶ = 16,777,216 bits long!
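
Those dimensionalities are easy to double-check:

print(16 ** 32)  # 340282366920938463463374607431768211456 possible full md5 values
print(16 ** 6)   # 16777216 possible values after truncating to 6 hex digits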

But this representation may seem familiar to those who know Bag-of-Words. In BoW, a document is represented as a large, sparse vector of bits whose length is the size of the vocabulary. Each position corresponding to a word that occurs in the document is set to 1 (in some implementations, repeats are counted to create a true histogram of words).

A typical vocabulary size is, say, 50K English words: {‘a’, ‘aardvark’, … ‘zap’, ‘zebra’}. In our running example, we can compute a BoW representation for our documents wherein a 1 is located in the spots corresponding to the terms “are”, “four” (or “five”), “lights”, and “there”.

BoW vectors for our documents are 50K elements long. Most of the values are 0 except for at the locations in which document words are found in the vocabulary.

In MinHash, the “vocabulary size” is instead the total number of possible hash values, which we already computed to be 16,777,216. The 1’s are located in the positions corresponding to the retained hash values. For example, if n=3:

A: “There are four lights” → [ 101228, 1157490, 1539676]
B: “There are five lights” → [ 844631, 1157490, 2802746]

And their corresponding bit vectors would be:

MinHash vectors for the n=3 lowest hash values. These vectors are more than 16M elements long consisting mostly of 0s except for at the locations of document shingle hash values.
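
Since these vectors are almost entirely zeros, a sparse representation is the natural way to hand them to an ML library. A minimal sketch (assuming scipy; the n=3 values are the illustrative ones above):

import numpy as np
from scipy.sparse import csr_matrix

dim = 16 ** 6  # one position per possible truncated hash value
doc_a = [101228, 1157490, 1539676]  # n=3 smallest hash values for document A
doc_b = [844631, 1157490, 2802746]  # n=3 smallest hash values for document B

def to_bit_vector(values, dim):
    # A 1 x dim sparse row with 1s at the positions of the retained hash values.
    rows = np.zeros(len(values), dtype=int)
    cols = np.array(values, dtype=int)
    data = np.ones(len(values), dtype=np.int8)
    return csr_matrix((data, (rows, cols)), shape=(1, dim))

print(to_bit_vector(doc_a, dim).shape)  # (1, 16777216), with just three nonzero entries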

We are now armed with numerical document summary vectors fit for ML algorithms, so let’s see how MinHash performs in the wild.

Time for a Real Example

To do a real world test we’ll need a sample of documents with a rich vocabulary and, ideally, a clear goal — like classification. The Enron Ham/Spam emails are a standard corpus for classifying spam vs. non-spam. This corpus consists of about 33K emails nearly evenly split between real (ham) and spam emails. We’ll treat each email as a document, taking care to remove metadata like subject headers and To/From fields. This ensures that our model focuses on the text of the email body without overfitting to these summary data.

We’ll process each email with MinHash, as well as with several popular pre-trained word embeddings including GloVe, Word2Vec, and FastText. While a detailed explanation would require its own blog post, I’ll briefly describe the concept here. Word embeddings are more sophisticated than the BoW approach. Instead of a bit, each word is itself represented by a vector capturing its semantic relationships to other words (for example, man is to king as woman is to _____ (queen)), something that MinHash is not designed for. Each of these published word embeddings consists of a vocabulary in which each word has been assigned its own unique vector. Most of these vectors are 300 elements long, though this varies (specifically, the GloVe embeddings have been published in sizes ranging from 50 to 300).

One way to apply a word embedding to an email is to first break it up into individual words and, for each word, look up its corresponding semantic vector in the word embedding vocabulary. We can then average together all of the word vectors to obtain a document summary vector. This isn’t the only way to apply word embeddings but it’s straightforward so let’s run with it.
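
A sketch of that averaging step, assuming embeddings is a dict-like map from word to a 300-element numpy vector (how you load the pre-trained GloVe/Word2Vec/FastText vectors depends on the library you use):

import numpy as np

def embed_document(text, embeddings, dim=300):
    # Average the vectors of all in-vocabulary words; skip out-of-vocabulary words.
    vectors = [embeddings[word] for word in text.lower().split() if word in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)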

To generate a MinHash vector for each email, I used a single md5 hash function on 5-character shingles. Each md5 output is truncated to 5 hex digits and the n=100 smallest values are retained. After converting to bit vectors they are 1,048,576 elements long and contain exactly 100 1’s.
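
Here is roughly what that per-email processing looks like as a sketch (my reconstruction of the settings described, not the exact code used):

import hashlib
import numpy as np
from scipy.sparse import csr_matrix

def email_to_minhash_bits(text, k=5, hex_digits=5, n=100):
    # 5-character shingles -> md5 -> keep the first 5 hex digits -> n smallest values.
    shingle_set = {text[i:i + k] for i in range(len(text) - k + 1)}
    values = sorted({int(hashlib.md5(s.encode()).hexdigest()[:hex_digits], 16)
                     for s in shingle_set})[:n]
    dim = 16 ** hex_digits  # 1,048,576 possible positions
    data = np.ones(len(values), dtype=np.int8)
    rows = np.zeros(len(values), dtype=int)
    return csr_matrix((data, (rows, np.array(values, dtype=int))), shape=(1, dim))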

Using a standard 80/20 train/test split, here are the classification results on both a Linear SVM and a Random Forest. No extensive hyperparameter tuning has been performed: this is straight out of the box binary classification with default parameters. (Note: correctly doing RF with MinHash features requires some special finagling, which is beyond the scope of this article.)

Comparing binary document classification for the Enron Ham/Spam corpus using either MinHash or pre-trained word embeddings to summarize the documents. These word embeddings are typically 300-dimensional arrays except where noted for GloVe. Tested on sklearn’s Random Forest and Linear SVM.
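
For reference, a bare-bones version of that out-of-the-box setup might look like the sketch below, where emails and labels are hypothetical variables holding the preprocessed email bodies and their ham/spam labels, email_to_minhash_bits is the sketch from above, and the Random Forest finagling mentioned earlier is omitted:

from scipy.sparse import vstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X = vstack([email_to_minhash_bits(body) for body in emails])  # sparse MinHash feature matrix
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)

for model in (LinearSVC(), RandomForestClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))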

Woo-ee! Who would have thought that MinHash would come anywhere close to what is currently considered the state-of-the-art in text representation? There’s a lot we could discuss about the pros and cons of pre-trained word embeddings (like how their dimensionality affects efficacy), but that’s a topic for another day. While this is by no means conclusive, it certainly demonstrates that MinHash isn’t an unreasonable approach to document summarization. That said, while MinHash can provide reasonable performance for document classification, it wouldn’t fare as well in other natural language processing tasks such as question answering, machine translation, or named-entity recognition, all of which rely on deeper semantic interpretation of language.

How Secure is MinHash?

MinHash seems great but how secure is it?

Let’s imagine an attacker whose goal is to reconstruct some or all of the original document using only the MinHash signatures and some knowledge of the hash function. This attack is of concern in scenarios where the summarized documents are private, such as data loss prevention or email use cases. In these applications, a company may want to build models over private corpora of documents. They might reasonably use MinHash as a way to create a storable summary, one which they hope will remain private. Having an irreversible summary of a document is important for legal compliance with privacy regulations such as GDPR, as well as being the right thing for customers in the event of a data breach. If a summary is reversible, however, then the document should be considered vulnerable to attack.

Without performing a detailed cryptanalysis of MinHash, let’s consider some practical guidelines for effective use.

Shingle size

When evaluating the usefulness and security of a given MinHash implementation the shingle size is the first item to consider. If k is very large then the document summary is not robust to minor changes, such as the addition or removal of white space, and it’s difficult to compare near matches. By contrast, if k is extremely small then the saved hashes will cover very little of the document and not carry enough context to be meaningful. (For an extreme example, consider a single character hash.)

Always use as many hash functions as are computationally feasible.

From a security perspective, the shorter the shingle, the easier it is to brute-force reverse the text, particularly if the number of hash functions used is small, or one. A k-character shingle for, say, English, has approximately 2.62*k bits of entropy; for example, a 5-character shingle has about 13–14 bits of entropy. Thus, an attacker could build a rainbow table — a pre-computed table for reversing cryptographic hash functions — consisting of all plausible 5-character English shingles, e.g. {“_the_”, “and_a”, “it_is”, …}. Such a table would only cost about 6.8 MB of storage space — easy! (Folks, this is also why your passwords should be much longer than 5 characters and should include special characters when possible!) However, if we increase the shingle size, the attacker’s task becomes much more challenging as the number of plausible variations grows quickly. For example, for k=8, the storage size of the corresponding rainbow table is nearly a petabyte! If we now include several hash functions (instead of one), the attacker will need a rainbow table for each one — further increasing the challenge of reconstruction.

Smaller shingles can lead to more robust comparisons but are also easier to reconstruct.

Number of hashes

We should also consider the number of hashes, n, to keep when forming the signature. When n is small, we have a smaller, less representative sample of a document’s unique hashes, making it harder to determine the similarity between two documents. Imagine if I only kept one hashed shingle per document — comparison would be a challenge! However, when n is very large we are keeping more and more shingles, which represent more of the document. This allows for a more robust comparison (recall in our example that our similarity estimate improved when we kept all 17 shingles).

Consider the trade-off between document coverage and document reconstructability.

While keeping more shingles will help your application compute better similarity estimates, it also works to an attacker’s advantage, since they now have more information to work from. This is particularly troublesome for short documents like emails. For example, I might decide to retain the n=20 smallest hash values for each document in my corpus. Now, if my example documents A and B are included in that corpus, n=20 is larger than the number of possible shingles (17 at k=5), and thus every one of the document’s shingles is part of its fingerprint.

Specifically, the attacker now has a sequence alignment problem, a problem common in bioinformatics for identifying similar sequences of DNA. When the shingles are long and the attacker has a large number of them, many characters will overlap, making the attacker’s task easier. In addition, the attacker has supplementary information about the vocabulary and its word frequencies in the target language, so even if only a fragment of a word is available, it can be used to infer the missing piece.

Though this attack is labor-intensive, a motivated attacker might still find it worthwhile to steal and reconstruct particular emails with key sensitive information. We recommend always using as many hash functions as are computationally feasible for your application. Additionally, consider how important accurate comparisons are to the efficacy of your application and whether it’s possible to increase the shingle size or reduce the number of stored hashes to cover less of each document.

Conclusion

MinHash is a tried and true algorithm for measuring document similarity. It is time- and space-efficient and can be used with a variety of machine learning algorithms, making it highly versatile. However, these representations are not guaranteed to be private or secure, since it is possible to brute-force reverse engineer a document depending on the chosen implementation. If security is high on your priority list, be especially careful how you implement MinHash and take precautions to select an appropriate shingle size and number of MinHash values so as to maximize protection. There’s another option that we’ve been working on here at CAML related to Differential Privacy, a mathematical framework that provides provable privacy guarantees, but that discussion will have to wait for another blog.

Acknowledgements

I’d like to acknowledge Aleatha Parker-Wood, Andrew B. Gardner, and the entire Symantec CAML team for help in preparing this.
