Explaining LSH, MinHash and SimHash
In this article, we’ll talk about Locality Sensitive Hashing.
Using brute force to compare all possible pairs gives the exact nearest neighbors, but it does not scale at all.
- LSH can also be viewed as a sequence of functions applied one after another.
- LSH tries to place documents into buckets so that similar documents land in the same bucket with high probability, while dissimilar documents tend to land in different buckets.
So our goal is to find near-duplicate documents. This is done in the following steps:
- Shingling
- Min Hashing
- Locality-sensitive hashing
Shingling
Shingling can be thought of as tokenizing text. However, it differs from ordinary tokenization in that it splits the text into overlapping character sequences rather than words. For example, consider a document such as:
Life is very good at Carbon Consulting!
Shingling it with 2-shingles results in:
{‘Li’, ‘if’, ‘fe’, ‘e ‘, ‘ i’, ‘is’, ‘s ‘, ‘ v’, …, ‘in’, ‘ng’, ‘g!’}
Similar documents are more likely to have similar shingles.
Also, reordering paragraphs or changing a few words does not have much effect on the shingle set.
→ In practice, the shingle size is usually chosen as 8 or 10, because smaller shingles tend to appear in most documents and are therefore not discriminative.
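As a minimal sketch of character-level shingling (the function name and parameters are illustrative, not from the original article):

```python
def shingles(text: str, k: int = 2) -> set:
    """Return the set of character-level k-shingles of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

doc = "Life is very good at Carbon Consulting!"
print(shingles(doc, k=2))  # {'Li', 'if', 'fe', 'e ', ' i', 'is', ...}
```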
After shingling, we can compute the similarity of two documents as the Jaccard similarity of their shingle sets. But comparing every pair of documents this way is not scalable, since it requires O(n²) comparisons for n documents.
Also, the shingle–document matrix produced by shingling is very sparse, which causes memory overhead. To address both problems, hashing can be used.
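For completeness, a minimal Jaccard similarity sketch over two small shingle sets (the example sets are made up for illustration):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# 2-shingle sets of the strings "abcd" and "abce"
s1 = {"ab", "bc", "cd"}
s2 = {"ab", "bc", "ce"}
print(jaccard(s1, s2))  # 2 shared / 4 total = 0.5
```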
Hashing
The choice of hash function is tightly linked to the similarity metric we are using. For Jaccard similarity, the appropriate scheme is MinHashing.
Min Hashing
.----------.------.------.------.
| Shingles | Doc1 | Doc2 | Doc3 |
:----------+------+------+------:
| Shingle1 | 0 | 0 | 1 |
:----------+------+------+------:
| Shingle2 | 1 | 1 | 0 |
:----------+------+------+------:
| Shingle3 | 0 | 1 | 1 |
'----------'------'------'------'
Assume we have ended up with a matrix like the one above after the shingling process. MinHashing relies on permutations of the rows, but to explain how it works, I'll first describe it without permutations.
For our ‘basic’ MinHashing, for each document we find the first row whose value is 1 and record its index. So our output is as follows:
basic_minH(Doc1) = 2  # Index of Shingle2 (indexes start from 1)
basic_minH(Doc2) = 2
basic_minH(Doc3) = 1
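As a minimal sketch, assuming each document's column is given as a plain list of 0/1 values (names are illustrative):

```python
def basic_minhash(column):
    """Return the 1-based index of the first row whose value is 1."""
    for index, value in enumerate(column, start=1):
        if value == 1:
            return index
    return None  # the document contains no shingles at all

doc1, doc2, doc3 = [0, 1, 0], [0, 1, 1], [1, 0, 1]
print(basic_minhash(doc1), basic_minhash(doc2), basic_minhash(doc3))  # 2 2 1
```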
Using a permutation, the table can be augmented as follows:
.--------------.------.------.------.
| Permutations | Doc1 | Doc2 | Doc3 |
:--------------+------+------+------:
| 3 | 0 | 0 | 1 |
:--------------+------+------+------:
| 1 | 1 | 1 | 0 |
:--------------+------+------+------:
| 2 | 0 | 1 | 1 |
'--------------'------'------'------'
We can think of the permutation as a relabeling of the row indexes. The value 3 in the first row means that the first row's index is now 3, not 1. The MinHash value of a document is then the smallest new index among the rows where the document has a 1. Based on this permutation, the new output is as follows:
m(Doc1) = 1  # Doc1 has a 1 only in Shingle2, whose new index is 1
m(Doc2) = 1  # Doc2 has 1s in Shingle2 and Shingle3 (new indexes 1 and 2), min is 1
m(Doc3) = 2  # Doc3 has 1s in Shingle1 and Shingle3 (new indexes 3 and 2), min is 2
So after the first permutation, our signature matrix looks like this:
.-------.------.------.------.
| Perm  | Doc1 | Doc2 | Doc3 |
:-------+------+------+------:
| 3 1 2 |  1   |  1   |  2   |
'-------'------'------'------'
After also applying the permutations [1 2 3], [2 3 1] and [3 2 1], the signature matrix becomes:
.-------.------.------.------.
| Perm  | Doc1 | Doc2 | Doc3 |
:-------+------+------+------:
| 3 1 2 |  1   |  1   |  2   |
:-------+------+------+------:
| 1 2 3 |  2   |  2   |  1   |
:-------+------+------+------:
| 2 3 1 |  3   |  1   |  1   |
:-------+------+------+------:
| 3 2 1 |  2   |  1   |  1   |
'-------'------'------'------'
So the signatures of Doc1, Doc2 and Doc3 are (1, 2, 3, 2), (1, 2, 1, 1) and (2, 1, 1, 1), respectively.
The estimated similarity of Doc1 and Doc2 is the fraction of signature rows in which they agree: 2/4 = 0.5, which matches their Jaccard similarity of 1/2 (they share Shingle2 out of the two shingles they cover in total).
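A minimal sketch of MinHashing with explicit permutations, assuming the toy matrix and permutations above (the helper name is illustrative):

```python
# Shingle-document matrix: rows = shingles, columns = Doc1, Doc2, Doc3
matrix = [
    [0, 0, 1],  # Shingle1
    [1, 1, 0],  # Shingle2
    [0, 1, 1],  # Shingle3
]

# Each permutation lists the new (1-based) index of every row
permutations = [
    (3, 1, 2),
    (1, 2, 3),
    (2, 3, 1),
    (3, 2, 1),
]

def minhash_signature(matrix, permutations, doc):
    """MinHash per permutation: smallest new index among rows where the doc has a 1."""
    return [
        min(perm[row] for row, cells in enumerate(matrix) if cells[doc] == 1)
        for perm in permutations
    ]

for doc in range(3):
    print(f"Doc{doc + 1}:", minhash_signature(matrix, permutations, doc))
# Doc1: [1, 2, 3, 2]
# Doc2: [1, 2, 1, 1]
# Doc3: [2, 1, 1, 1]
```

In practice, explicit permutations are replaced by several random hash functions applied to the shingles, since permuting billions of rows is not feasible.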
Now we can move on to Locality Sensitive Hashing.
Locality Sensitive Hashing
Steps:
- Divide the signature matrix into b bands, each having r rows. For example, selecting b = 2 and r = 2 for our signature matrix gives:
Bands   Doc1   Doc2   Doc3
------------------------------
band1     1      1      2
          2      2      1
------------------------------
band2     3      1      1
          2      1      1
- For each band, hash its portion of each column into a hash table with k buckets.
Bucket 1
Doc1,band1 {1,2}
Doc2,band1 {1,2}
---- o0o ---- o0o ---- o0o ---- o0o ----
Bucket 2
Doc3,band1 {2,1}
---- o0o ---- o0o ---- o0o ---- o0o ----
Bucket 3
Doc1,band2 {3,2}
---- o0o ---- o0o ---- o0o ---- o0o ----
Bucket 4
Doc2,band2 {1,1}
Doc3,band2 {1,1}
- If two documents share at least one bucket, they are candidates for similarity. Here, Doc1 and Doc2 are candidates (they collide in band1), and Doc2 and Doc3 are candidates (they collide in band2).
- Tune the b and r values to catch most of the similar pairs but few non-similar ones. For a pair with signature similarity s, the probability of becoming candidates (sharing at least one band) is 1 − (1 − s^r)^b, so increasing r makes the filter stricter, while increasing b makes it more permissive.
- Instead of brute forcing over all documents, compute the actual similarity only for the candidate documents and return the result. (A small sketch of this banding procedure follows this list.)
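As a minimal sketch of the banding step under the assumptions above (b = 2, r = 2, a plain Python dict standing in for the hash table with k buckets; all names are illustrative, not a prescribed implementation):

```python
from collections import defaultdict
from itertools import combinations

# MinHash signatures from the example above (one list per document)
signatures = {
    "Doc1": [1, 2, 3, 2],
    "Doc2": [1, 2, 1, 1],
    "Doc3": [2, 1, 1, 1],
}

b, r = 2, 2  # number of bands and rows per band

# Hash each band's portion of each signature into a bucket
buckets = defaultdict(set)
for doc, sig in signatures.items():
    for band in range(b):
        portion = tuple(sig[band * r:(band + 1) * r])
        buckets[(band, portion)].add(doc)

# Documents sharing a bucket become candidate pairs
candidates = set()
for docs in buckets.values():
    for pair in combinations(sorted(docs), 2):
        candidates.add(pair)

print(candidates)  # {('Doc1', 'Doc2'), ('Doc2', 'Doc3')} (order may vary)
```

Keying the buckets by (band, portion) keeps each band in its own hash table, so identical portions from different bands cannot collide.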
Another Hashing Method: SimHashing
It’s patented by Google!
- Instead of MinHashing, after shingling the corpus / text / data, hash each feature with a non-cryptographic hash function. The key difference is that cryptographic hash functions (like MD5, SHA, etc.) try to magnify the change between two inputs: a small change in the input yields a large change in the output. Similarity hashing methods, on the other hand, try to minimize this effect, so that two similar inputs produce hashes that are very close together.
- After hashing is completed, take the i-th bit of every feature hash and average them; this is done for every bit position i.
- If the average for a position is < 0.5, write 0 for that bit; if it is ≥ 0.5, write 1. Concatenating these bits gives the document's fingerprint.
As can be seen, the contribution of the features does not vanish, while the data shrinks to a single fixed-length fingerprint.
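A minimal sketch of this majority-vote SimHash; the article does not specify the hash function, so a simple FNV-1a hash is assumed here as the non-cryptographic feature hash, along with a 64-bit fingerprint:

```python
def fnv1a_64(data: str) -> int:
    """A simple non-cryptographic 64-bit hash (FNV-1a)."""
    h = 0xcbf29ce484222325
    for byte in data.encode("utf-8"):
        h = ((h ^ byte) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def shingles(text: str, k: int = 2) -> set:
    """Character-level k-shingles of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def simhash(features, bits: int = 64) -> int:
    """Majority vote per bit position over the feature hashes."""
    counts = [0] * bits
    for feature in features:
        h = fnv1a_64(feature)
        for i in range(bits):
            counts[i] += (h >> i) & 1       # 1 if bit i is set, else 0
    fingerprint = 0
    for i in range(bits):
        if counts[i] * 2 >= len(features):  # average >= 0.5 -> write 1
            fingerprint |= 1 << i
    return fingerprint

a = simhash(shingles("Life is very good at Carbon Consulting!"))
b = simhash(shingles("Life is really good at Carbon Consulting!"))
print(bin(a ^ b).count("1"))  # Hamming distance: small for similar texts
```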
Acknowledgments
I want to thank Mr. Meysam Asgari-Chenaghlu for reviewing this article and helping with its creation.