Explaining LSH, MinHash and SimHash
In this article, we’ll talk about Locality Sensitive Hashing.
Using brute force to compare all possible pairs gives the exact nearest neighbors, but it does not scale at all.
- LSH can also be viewed as a sequence of functions applied one after another.
- LSH tries to place documents into buckets so that similar documents land in the same bucket with high probability, while dissimilar documents tend to land in different buckets.
So our goal is to find near-duplicate documents. This is done in the following steps:
- Shingling
- Min Hashing
- Locality-sensitive hashing
Shingling
Shingling can be thought of as tokenizing text. However, it differs from ordinary tokenization in that it splits the text into overlapping character sequences rather than words. For example, consider a document such as:
Life is very good at Carbon Consulting!
Shingling it with 2-shingles results in:
{‘Li’, ‘if’, ‘fe’, ‘e ‘, ‘ i’, ‘is’, ‘s ‘, ‘ v’, …, ‘in’, ‘ng’, ‘g!’}
Similar documents are more likely to have similar shingles.
Also, reordering paragraphs or changing a few words does not have much effect on the shingle set.
→ In practice, the shingle size is usually chosen as 8 or 10, because smaller shingles tend to appear in most documents and are therefore not discriminative.
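As a minimal sketch of character-level shingling (the function name and parameters are illustrative, not from the original article):

```python
def shingles(text: str, k: int = 2) -> set:
    """Return the set of character-level k-shingles of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

doc = "Life is very good at Carbon Consulting!"
print(shingles(doc, k=2))  # {'Li', 'if', 'fe', 'e ', ' i', 'is', ...}
```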
After shingling, we can compute the similarity of two documents as the Jaccard similarity of their shingle sets. But comparing every pair of documents this way is not scalable, since it requires O(n²) comparisons for n documents.
Also, the shingle–document matrix produced by shingling is very sparse, which causes memory overhead. To address both problems, hashing can be used.
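For completeness, a minimal Jaccard similarity sketch over two small shingle sets (the example sets are made up for illustration):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# 2-shingle sets of the strings "abcd" and "abce"
s1 = {"ab", "bc", "cd"}
s2 = {"ab", "bc", "ce"}
print(jaccard(s1, s2))  # 2 shared / 4 total = 0.5
```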
Hashing
The choice of hash function is tightly linked to the similarity metric we are using. For Jaccard similarity, the appropriate scheme is MinHashing.
Min Hashing
.----------.------.------.------.
| Shingles | Doc1 | Doc2 | Doc3 |
:----------+------+------+------:
| Shingle1 | 0 | 0 | 1 |
:----------+------+------+------:
| Shingle2 | 1 | 1 | 0 |
:----------+------+------+------:
| Shingle3 | 0 | 1 | 1 |
'----------'------'------'------'
Assume we have ended up with a matrix like the one above after the shingling process. MinHashing relies on permutations of the rows, but to explain how it works, I'll first describe it without permutations.
For our ‘basic’ MinHashing, for each document we find the first row whose value is 1 and record its index. So our output is as follows:
basic_minH(Doc1) = 2  # Index of Shingle2 (indexes start from 1)
basic_minH(Doc2) = 2
basic_minH(Doc3) = 1
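As a minimal sketch, assuming each document's column is given as a plain list of 0/1 values (names are illustrative):

```python
def basic_minhash(column):
    """Return the 1-based index of the first row whose value is 1."""
    for index, value in enumerate(column, start=1):
        if value == 1:
            return index
    return None  # the document contains no shingles at all

doc1, doc2, doc3 = [0, 1, 0], [0, 1, 1], [1, 0, 1]
print(basic_minhash(doc1), basic_minhash(doc2), basic_minhash(doc3))  # 2 2 1
```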
Using a permutation, the table can be augmented as follows:
.--------------.------.------.------.
| Permutations | Doc1 | Doc2 | Doc3 |
:--------------+------+------+------:
| 3 | 0 | 0 | 1 |
:--------------+------+------+------:
| 1 | 1 | 1 | 0 |
:--------------+------+------+------:
| 2 | 0 | 1 | 1 |
'--------------'------'------'------'
We can think of the permutation as a relabeling of the row indexes. The value 3 in the first row means that the first row's index is now 3, not 1. The MinHash value of a document is then the smallest new index among the rows where the document has a 1. Based on this permutation, the new output is as follows:
m(Doc1) = 1  # Doc1 has a 1 only in Shingle2, whose new index is 1
m(Doc2) = 1  # Doc2 has 1s in Shingle2 and Shingle3 (new indexes 1 and 2), min is 1
m(Doc3) = 2  # Doc3 has 1s in Shingle1 and Shingle3 (new indexes 3 and 2), min is 2
So after the first permutation, our signature matrix looks like this:
.-------.------.------.------.
| Perm  | Doc1 | Doc2 | Doc3 |
:-------+------+------+------:
| 3 1 2 |  1   |  1   |  2   |
'-------'------'------'------'
After also applying the permutations [1 2 3], [2 3 1] and [3 2 1], the signature matrix becomes:
.-------.------.------.------.
| Perm  | Doc1 | Doc2 | Doc3 |
:-------+------+------+------:
| 3 1 2 |  1   |  1   |  2   |
:-------+------+------+------:
| 1 2 3 |  2   |  2   |  1   |
:-------+------+------+------:
| 2 3 1 |  3   |  1   |  1   |
:-------+------+------+------:
| 3 2 1 |  2   |  1   |  1   |
'-------'------'------'------'
So the signatures of Doc1, Doc2 and Doc3 are (1, 2, 3, 2), (1, 2, 1, 1) and (2, 1, 1, 1), respectively.
The estimated similarity of Doc1 and Doc2 is the fraction of signature rows in which they agree: 2/4 = 0.5, which matches their Jaccard similarity of 1/2 (they share Shingle2 out of the two shingles they cover in total).
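A minimal sketch of MinHashing with explicit permutations, assuming the toy matrix and permutations above (the helper name is illustrative):

```python
# Shingle-document matrix: rows = shingles, columns = Doc1, Doc2, Doc3
matrix = [
    [0, 0, 1],  # Shingle1
    [1, 1, 0],  # Shingle2
    [0, 1, 1],  # Shingle3
]

# Each permutation lists the new (1-based) index of every row
permutations = [
    (3, 1, 2),
    (1, 2, 3),
    (2, 3, 1),
    (3, 2, 1),
]

def minhash_signature(matrix, permutations, doc):
    """MinHash per permutation: smallest new index among rows where the doc has a 1."""
    return [
        min(perm[row] for row, cells in enumerate(matrix) if cells[doc] == 1)
        for perm in permutations
    ]

for doc in range(3):
    print(f"Doc{doc + 1}:", minhash_signature(matrix, permutations, doc))
# Doc1: [1, 2, 3, 2]
# Doc2: [1, 2, 1, 1]
# Doc3: [2, 1, 1, 1]
```

In practice, explicit permutations are replaced by several random hash functions applied to the shingles, since permuting billions of rows is not feasible.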
Now we can move on to Locality Sensitive Hashing.
Locality Sensitive Hashing
Steps:
- Divide the signature matrix into b bands, each having r rows. For example, selecting b = 2 and r = 2 for our signature matrix gives:
Bands   Doc1   Doc2   Doc3
------------------------------
band1     1      1      2
          2      2      1
------------------------------
band2     3      1      1
          2      1      1
- For each band, hash its portion of each column into a hash table with k buckets.
Bucket 1
Doc1,band1 {1,2}
Doc2,band1 {1,2}
---- o0o ---- o0o ---- o0o ---- o0o ----
Bucket 2
Doc3,band1 {2,1}
---- o0o ---- o0o ---- o0o ---- o0o ----
Bucket 3
Doc1,band2 {3,2}
---- o0o ---- o0o ---- o0o ---- o0o ----
Bucket 4
Doc2,band2 {1,1}
Doc3,band2 {1,1}
- If two documents share at least one bucket, they are candidates for similarity. Here, Doc1 and Doc2 are candidates (they collide in band1), and Doc2 and Doc3 are candidates (they collide in band2).
- Tune the b and r values to catch most of the similar pairs but few non-similar ones. For a pair with signature similarity s, the probability of becoming candidates (sharing at least one band) is 1 − (1 − s^r)^b, so increasing r makes the filter stricter, while increasing b makes it more permissive.
- Instead of brute forcing over all documents, compute the actual similarity only for the candidate documents and return the result. (A small sketch of this banding procedure follows this list.)
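As a minimal sketch of the banding step under the assumptions above (b = 2, r = 2, a plain Python dict standing in for the hash table with k buckets; all names are illustrative, not a prescribed implementation):

```python
from collections import defaultdict
from itertools import combinations

# MinHash signatures from the example above (one list per document)
signatures = {
    "Doc1": [1, 2, 3, 2],
    "Doc2": [1, 2, 1, 1],
    "Doc3": [2, 1, 1, 1],
}

b, r = 2, 2  # number of bands and rows per band

# Hash each band's portion of each signature into a bucket
buckets = defaultdict(set)
for doc, sig in signatures.items():
    for band in range(b):
        portion = tuple(sig[band * r:(band + 1) * r])
        buckets[(band, portion)].add(doc)

# Documents sharing a bucket become candidate pairs
candidates = set()
for docs in buckets.values():
    for pair in combinations(sorted(docs), 2):
        candidates.add(pair)

print(candidates)  # {('Doc1', 'Doc2'), ('Doc2', 'Doc3')} (order may vary)
```

Keying the buckets by (band, portion) keeps each band in its own hash table, so identical portions from different bands cannot collide.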
Another Hashing Method: SimHashing
It’s patented by Google!
- Instead of MinHashing, after shingling the corpus / text / data, hash each feature with a non-cryptographic hash function. The key difference is that cryptographic hash functions (like MD5, SHA, etc.) try to magnify the change between two inputs: a small change in the input yields a large change in the output. Similarity hashing methods, on the other hand, try to minimize this effect, so that two similar inputs produce hashes that are very close together.
- After hashing is completed, take the i-th bit of every feature hash and average them; this is done for every bit position i.
- If the average for a position is < 0.5, write 0 for that bit; if it is ≥ 0.5, write 1. Concatenating these bits gives the document's fingerprint.
As can be seen, the contribution of the features does not vanish, while the data shrinks to a single fixed-length fingerprint.
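A minimal sketch of this majority-vote SimHash; the article does not specify the hash function, so a simple FNV-1a hash is assumed here as the non-cryptographic feature hash, along with a 64-bit fingerprint:

```python
def fnv1a_64(data: str) -> int:
    """A simple non-cryptographic 64-bit hash (FNV-1a)."""
    h = 0xcbf29ce484222325
    for byte in data.encode("utf-8"):
        h = ((h ^ byte) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def shingles(text: str, k: int = 2) -> set:
    """Character-level k-shingles of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def simhash(features, bits: int = 64) -> int:
    """Majority vote per bit position over the feature hashes."""
    counts = [0] * bits
    for feature in features:
        h = fnv1a_64(feature)
        for i in range(bits):
            counts[i] += (h >> i) & 1       # 1 if bit i is set, else 0
    fingerprint = 0
    for i in range(bits):
        if counts[i] * 2 >= len(features):  # average >= 0.5 -> write 1
            fingerprint |= 1 << i
    return fingerprint

a = simhash(shingles("Life is very good at Carbon Consulting!"))
b = simhash(shingles("Life is really good at Carbon Consulting!"))
print(bin(a ^ b).count("1"))  # Hamming distance: small for similar texts
```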
Acknowledgments
I want to thank Mr. Meysam Asgari-Chenaghlu for reviewing this article and helping with its creation.