Locality Sensitive Hashing for Fast Search in High Dimension Data.

sid dhuri
Geek Culture
Published in
8 min readMar 25, 2021
image credits: Christopher Burns @ Unsplash

Real-world problems

In many real-world problems we have to process and analyse large amounts of text data. eg: Text Mining, Spam filtering, Product Recommendation, Online Advertising, etc..

Such data is usually characterised by high dimensionality For eg: The Google news model is trained on a dataset of about 100 billion words. The model contains 300-dimensional vectors for 3 million words and phrases.

In some cases we are analysing troves of documents to identify similar items in high dimension data for eg.:

  • Web Search and Text Mining
  • Document classification
  • Plagiarism
  • Chatbots

A common issues across all such problems is to find near neighbours in high dimensional space.In some cases, we are interested in a set of nearest neighbours rather than an exact match.

We will look at Locality Sensitive Hashing (LSH) a technique that can be used in high dimension data for finding nearest neighbours in large scale databases with near constant look up time.

LSH is a versatile algorithm that finds its application in myriad problems, including:

  • Near-duplicate detection: LSH is commonly used to de-duplicate large quantities of documents, webpages, and other files.
  • Genome-wide association study: Biologists often use LSH to identify similar gene expressions in genome databases.
  • Large-scale image search: Google used LSH along with PageRank to build their image search technology VisualRank.
  • Audio/video fingerprinting: In multimedia technologies, LSH is widely used as a fingerprinting technique for A/V data.
  • Fraud Detection: Uber uses LSH to quickly detect platform abuse, from spam to fake accounts and payment fraud

Representing documents as numerical vectors

Vector spaces are fundamental in many applications in NLP. If we want represent document so that we can apply mathematical models for analysing the content, we need to encode it as a vector where the dimensions are the words or n-grams that form the document.

Let us understand with an example. consider below statements and a query term. The statements are referred as documents hereafter.

  • Document 1: Cat chases rat
  • Document 2: Dog chases cat
  • Query: Who chases rat

After preprocessing the documents we represent them as vectors of words

  • Document 1: (cat, chases, rat)
  • Document 2: (dog, chases, cat)
  • Query: (who, chases, rat)

The relevant document to retrieve will be the one that has the greater similarity score between.

similarity score (Document1, Query) > similarity score (Document2, Query)

We can represent the above document vectors in numerical format as Term Document Matrix.

There are some additional steps such as Stemming, lemmatisation, TF-IDF when creating a term document matrix, that we will skip to keep things simple.

Now imagine if we wanted to map real world data, for example entire wikipedia corpus has about 3 billion words, and the Google news dataset has over 100 billion words. You can find both these data sets at https://code.google.com/archive/p/word2vec/

The number of dimensions will blow up considerably making it computationally infeasible or inefficient to compare your query against every possible document to find the most relevant documents.

The Curse of Dimensionality

The curse of dimensionality refers to the problems that occur when classifying, organising and analysing high dimension data that do not occur in low dimensional spaces, specifically the issues of data sparsity and proximity of data.

  • As we increase the number of dimensions in which these points are embedded, the average distance between points keeps increasing
  • As we increase the vocabulary the matrix gets increasingly sparse as most documents don’t contain most terms.

Instead of searching for exact results, one solution to address the curse of dimensionality problem is to look for approximate results. In many applications where 100% accuracy is not needed, searching for results that are close enough is much faster than searching for exact results

Locality Sensitive Hashing

Locality-sensitive hashing (LSH) is a set of techniques that dramatically speed up search-for-neighbours or near-duplication detection on data. To understand the algorithm lets first understand what we mean by hashing and being locality-sensitive.

What is hashing

A traditional use for hash functions is in hash tables. As a reminder, the hash functions used in a hash table are designed to map a piece of data to an integer that can be used to look in a particular bucket within the hash table to retrieve or delete that element.

You can think of hash function as a function that takes data of arbitrary sizes and maps it to a fixed value. The values returned are known as hash values or even hashes.

The diagram above shows an example of a very simple hash function which takes a vector and returns a hash value. Then to assign this hash value to a bucket we take the modulus value.

Now given an input, you don’t have to compare it to all the other examples, you can just compare it to all the values in the same hash_bucket that input has been hashed to.

When hashing documents you want similar documents to be hashed to the same bucket. One of the LSH techniques used in comparing documents is Cosine similarity.

What is locality-sensitive

The meaning of a word depends on the context or “locality” of the word in which it is being used.

E.g. the word “apple” refers to the fruit if words like “peel” or “pie” are also in the locality. But would refer to a consumer technology if words such as “iphone” or “computer” are nearby. But, ‘in the locality’ means one thing in a low-dimension settings, and another thing in high-dimensional settings.

Locality-sensitive hash functions are designed such that hash value collisions are more likely for two input values that are close together than for inputs that are far apart. There are different implementations of LSH functions for different data types and for different definitions of being close together.

There are different Locality Sensitive Hashing function depending on the type of data:

  1. Bit sampling LSH (Hamming distance)
  2. MinHashing LSH (Jaccard similarity)
  3. Euclidean and Manhattan LSH (Euclidean (L2) and Manhattan (L1) distances)
  4. Clustering or K-Mean LSH (learns the hash functions via K-Means)
  5. Signed Random Projections (Cosine similarity)

We will take a closer look at Signed Random Projections LSH for cosine similarity in a while.

Documents that have the similar content and context will have the same hash value thus mapping to the same or nearby hash bucket.

Motivation for LSH

Suppose we need to find near-duplicate documents among N=1 million documents.

Naively, we’d have to compute pairwise similarities for every pair of documents i.e, N(N-1)/2 ≈ 5*10¹¹ comparisons. Even with the most advanced processors this will take days to compute.

We can speed up computation by reducing the candidates we need to compute similarity for. This can be achieved by hashing document vectors into buckets.

  • Given documents D1 and D2
  • If we can find a hash function h such that:
  • if sim(D1,D2) is high, then with high probability h(D1) = h(D2)
  • if sim(D1,D2) is low, then with high probability h(D1) ≠ h(D2)
  • Then we could hash documents into buckets, and expect that “most” pairs of near duplicate documents would hash into the same bucket and then we have set of candidate pairs of docs in each bucket to see if they are really similar.
Dividing vector space into hash buckets to reduce the number of candidate pairs to compare

Cosine distance for document similarity

Cosine similarity is a metric used to determine how similar the documents are irrespective of their size.

As simplified representation of Documents and Query vectors in 2D vector space

Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.

In our context the vector space is composed of the words in the vocabulary and the vectors are the numerical representation of these documents.

Cosine between vectors can be computed using the following formula.

The cosine similarity function is expressed as:

𝐴 and 𝐵 represent the word vectors and 𝐴𝑖 or 𝐵𝑖 represent index i of that vector. & Note that if A and B are identical, you will get 𝑐𝑜𝑠(𝜃)=1.

  • Otherwise, if they are the total opposite, meaning, 𝐴=−𝐵, then you would get 𝑐𝑜𝑠(𝜃)=−1.
  • If you get 𝑐𝑜𝑠(𝜃)=0, that means that they are orthogonal (or perpendicular).
  • Numbers between 0 and 1 indicate a similarity score.
  • Numbers between -1–0 indicate a dissimilarity score.

LSH for Cosine Similarity

Signed random projections (SRP) outputs binary values, but SRP is sensitive to the angular distance between vectors.

Instead of the typical buckets we have been using, you can think of clustering the points by deciding whether they are above or below the line. Now as we go to higher dimensions (say n-dimensional vectors), you would be using planes instead of lines.

Signatures for Cosine Distance

Pick some number of random vectors, and hash your data for each vector. The result is a signature (sketch) of +1’s and –1’s for each data point

Hash signature for two document vectors

This function essentially cuts the space in half and assigns points in one half to 1 and the other half to -1. The decision boundary is based on the projection vectors, which are orthogonal to the hyperplane boundary.

The signature can then be amplified by using a AND and OR constructions or a hash function.

The picture below provides intuition for why Signed Random Projection is sensitive to angles.

The probability that h(v1)≠h(v2) is the probability that a randomly chosen line falls between V1 and V2.

If we randomly pick an angle, there is a θ/π chance that it falls between the two points and separates them.

Finally we can identify approximately similar documents that are in the same bucket, thus reducing number of candidate pairs for comparison and the dimensionality of the problem.

Following the above process, we can list the steps involved in using LSH for similarity search in a large text corpus.

  1. Map all the documents into the vector space model
  2. Construct LSH hash tables to bucket these documents into groups.
  3. For a new query, construct the feature vector and compute the hash for the query
  4. Based on the hash value identify the candidate pairs in the same hash bucket as the query.
  5. For every candidate pair calculate the distance metric between document and query
  6. Finally results with the minimum(threshold)distance are the most relevant documents to the query.

Conclusion

We have developed the intuition for Locality sensitive hashing and a high level understanding of the maths behind the algorithm.

We can see how LSH can be a highly effective technique in searching or querying large scale textual data. LSH can work in high dimension data effectively reducing dimensions and speeding up search in constant time.

LSH is used by many real world applications to query or search through large databases at scale.

References:

  1. https://web.stanford.edu/class/cs345a/slides/05-LSH.pdf
  2. https://web.stanford.edu/class/cs345a/slides/05-LSH.pdf
  3. http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10-slides.pdf
  4. https://www2.cs.duke.edu/courses/fall15/compsci590.4/slides/compsci590.04_fall15_lec12.pdf
  5. https://eng.uber.com/lsh/
  6. https://randorithms.com/2019/09/19/Visual-LSH.html

--

--

sid dhuri
Geek Culture

I am data scientist by trade. I love to write about data science, marketing and economics. I founded Orox.ai a marketing ai, analytics and automation platform.