Learning to Hash — Finding the Needle in the Haystack with AI

Sean Moran
15 min read · Jan 11, 2024

Overview

In this tutorial-based introductory article you will learn how to develop a data-driven nearest neighbour search model for quickly matching data-points in large datasets. The model will be learnt from data to maximise the number of relevant matches between our queries and the items in a collection, while avoiding an exhaustive search through the entire dataset. Sounds too good to be true? Please read on!

Finding the Needle in the Haystack. Image generated by Author using DALL-E.

🧐 Introduction

This article gives a practical introduction to the usefulness of machine learning approaches to nearest neighbour search. This field is known as Learning to Hash or Semantic Hashing.

This article is aimed at anyone who wants to learn more about:

  • Nearest neighbour search and its importance in Computer Science
  • Approximate methods for nearest neighbour search that can speed up search time dramatically
  • Locality Sensitive Hashing (LSH), a popular algorithm for approximate nearest neighbour search
  • How we can boost nearest neighbour search effectiveness using machine learning

A full notebook with the experiments can be found on Google Colab.

👌 Background

Nearest neighbour search is the common Computer Science task of finding the most similar data-points for a query in a database. This is a fundamental matching operation that has found wide applicability across many fields from Bioinformatics, to Natural Language Processing (NLP) and Computer Vision. Nearest neighbour search is also the underlying process supporting the popular Vector Databases used for Retrieval Augmented Generation (RAG) with Large Language Models (LLMs).

Interesting and successful application areas span all of the fields mentioned above, and the list is almost endless!

In this article we show how to use approximate nearest neighbour (ANN) methods to speed up the search process [2]. ANN methods based on hashing allow sub-linear time retrieval of nearest neighbours. To understand the benefit, compare the sub-linear curve in the diagram below to the linear one: sub-linear time grows much more slowly as the dataset size (x-axis) increases.

Comparing Linear and Sub-Linear Time. Image by author.

Hashing algorithms work by generating similar binary hashcodes for semantically similar data-points, as illustrated in the diagram below. These similarity-preserving hashcodes can then be used to index the data-points (images, documents etc.) into the buckets of hashtables. Similar data-points should end up in the same hashtable bucket if we have an effective hash function.

The entire process is illustrated in the diagram below:

An overview of hashing for nearest neighbour search. Image by author.

Diagram taken from the PhD thesis of Sean Moran.

Given a query, such as the image of the tiger in the example above, we can search for similar data-points by generating a hashcode for the query and comparing the query only to the data-points that collide with it in the same hashtable bucket (or buckets, if we have multiple independent hashtables). Usually the number of data-points in the colliding bucket(s) is much smaller than the total number of data-points in the entire dataset, yielding a search time that is substantially better than a simple brute-force search. The only downside is that these models might not return the closest nearest neighbour(s) every time (that is, we give up a degree of accuracy), which is usually an acceptable trade-off in practice.
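To make this concrete before we dive into the real dataset, here is a minimal, self-contained sketch of the idea using toy random data (every name and number here is illustrative, not the tutorial's real setup):

import numpy as np
from collections import defaultdict

# Toy sketch of hashing-based nearest neighbour search
rng = np.random.default_rng(0)
dataset = rng.standard_normal((1000, 64))   # 1,000 data-points, 64-dim
hyperplanes = rng.standard_normal((64, 8))  # 8 random hyperplanes = 8-bit codes

def hashcode(x):
    # Sign of the projection onto each hyperplane, as a hashable tuple
    return tuple(x.dot(hyperplanes) >= 0)

# Index every data-point into the bucket keyed by its hashcode
table = defaultdict(list)
for i, x in enumerate(dataset):
    table[hashcode(x)].append(i)

# Query: score only the points that collide in the query's bucket
query = rng.standard_normal(64)
candidates = table[hashcode(query)]
ranked = sorted(candidates, key=lambda i: np.linalg.norm(dataset[i] - query))
print(ranked[:10])  # approximate top-10 nearest neighbours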

In this article we explore a published learning to hash model and compare its performance on image retrieval to Locality Sensitive Hashing (LSH). Specifically, we dive deep on the Graph Regularised Hashing (GRH) model, a simple but empirically effective supervised hashing model for learning to hash. The model has subsequently been extended to cross-modal hashing (i.e. searching over both images and text).

🧠 Learning to Hash — Optimising Retrieval Effectiveness with AI

There are many methods of generating hashcodes, ranging from data-independent methods that produce the hashcodes using a random function with special properties, to more recent methods that learn the hashcodes from data.

The go-to data-independent model is Locality Sensitive Hashing (LSH) [1], and there are many articles written on this method. A good video introduction to LSH can be found here. In this article we will instead focus on learning the hash functions with AI, a field also known as Semantic Hashing or Learning to Hash, and compare performance against LSH [1] on the benchmark task of image retrieval.

There are many open-source learning to hash models to choose from, with a comprehensive list and living literature review here. In this article we will use the Graph Regularised Hashing (GRH) model of Moran and Lavrenko [3], an intuitive and empirically effective supervised hashing model that was later extended to cross-modal hashing [4]. We choose this model because it makes the key principles of the learning process easy to see without getting too caught up in mathematical details. Other popular models are the Supervised Hashing with Kernels model [4] and this deep hashing method [5]. For a recent survey of the field, please see this paper [6].

Preliminaries

We will develop a version of the hashing model in Python 3. The model will be trained on the publicly available CIFAR-10 dataset. Our benchmark will be retrieval effectiveness compared to Locality Sensitive Hashing (LSH, with Gaussian sign random projections), using the precision at 10 metric and semantic nearest neighbour evaluation.

Our first step as always is to instantiate a virtual environment for Python3 to hold our code and configuration (requirements.txt can be found here):

python3 -m venv ./hashing_tutorial
source hashing_tutorial/bin/activate
pip3 install -r requirements.txt

We can then retrieve and pre-process the CIFAR-10 dataset as follows:

import os

import requests
import scipy.io
from sklearn.preprocessing import Normalizer

# Download the CIFAR-10 GIST features and save them locally
url = 'https://www.dropbox.com/s/875u1rkva9iffpj/Gist512CIFAR10.mat?dl=1'
response = requests.get(url)
with open(os.path.join("./", "Gist512CIFAR10.mat"), 'wb') as f:
    f.write(response.content)

# Load the features, L2-normalise each row and mean-centre the columns
mat = scipy.io.loadmat('./Gist512CIFAR10.mat')
data = mat['X']
data = Normalizer(norm='l2').fit_transform(data)
data = data - data.mean(axis=0)
classes = mat['X_class']

The above code will download the CIFAR-10 dataset pre-processed into GIST features and save it to the current directory. To ensure that similarities are computed properly, it is important to L2-normalise and mean-centre the data before we start the indexing process.
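As a quick optional sanity check (reusing the data array from the snippet above), you can confirm that the pre-processing took effect:

import numpy as np

# Column means should be ~0 after centring; rows were unit-norm before
# centring, so their norms should still be close to 1
print(np.abs(data.mean(axis=0)).max())   # ~0
print(np.linalg.norm(data, axis=1)[:5])  # close to 1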

If you would like to skip ahead, you can run the entire code as follows:

python3 hashing_tutorial.py

This code will download the CIFAR-10 dataset, train the GRH model and evaluate the hashing retrieval effectiveness on the CIFAR-10 image dataset.

Implementation — Locality Sensitive Hashing (LSH)

We will now generate 16 random hyperplanes (= a 16-bit hashcode) and project one image onto them, generating its sign random projection LSH hashcode:

import numpy as np

num_classes = 10
n_vectors = 16  # number of hyperplanes = number of hashcode bits
dim = 512
np.random.seed(0)
random_vectors = np.random.randn(dim, n_vectors)
print(random_vectors)
print('dimension:', data[0,:].shape)
bin_indices_bits = data[0,:].dot(random_vectors) >= 0
print(bin_indices_bits)
# [False True False True False True False False False False True True True False False False]

The last line of code prints out the hashcode assigned to this image. Images with the exact same hashcode will collide in the same hashtable bucket. We would like these colliding images to be semantically similar, that is, to have the same class label.

We now convert the boolean representation above into an integer representation that will denote the bin indices:

# https://wiki.python.org/moin/BitwiseOperators
# x << y is the same as multiplying x by 2 ** y
powers_of_two = 1 << np.arange(n_vectors - 1, -1, step=-1)
print(powers_of_two)
# [32768 16384 8192 4096 2048 1024 512 256 128 64 32 16 8 4 2 1]
bin_indices = bin_indices_bits.dot(powers_of_two)
print(bin_indices)
# 21560

The example image will hash into the hashtable bucket with index 21560. Now we will hash the entire dataset using matrix operations:

bin_indices_bits = data.dot(random_vectors) >= 0
print(bin_indices_bits.shape)
bin_indices = bin_indices_bits.dot(powers_of_two)
print(bin_indices.shape)

bin_indices now contains 60,000 bin indices, one for each of the 60,000 images in the CIFAR-10 dataset. We can now insert these images into a hashtable and inspect the duplicates:

from collections import defaultdict

table = defaultdict(list)
for idx, bin_index in enumerate(bin_indices):
    table[bin_index].append(idx)

for bucket, images in table.items():
    if len(images) > 1:
        print(images)

This code will print out the buckets of the hashtable that contain at least two images, together with the associated IDs (that is, row numbers in the original .mat file) of the images in each bucket. The average number of colliding images in each bin is ~2, with a maximum of 68 colliding images.
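These occupancy statistics can be verified in a couple of lines (reusing the table built above):

import numpy as np

# Bucket occupancy statistics for the hashtable built above
sizes = [len(images) for images in table.values()]
print(np.mean(sizes), max(sizes))  # ~2 on average, 68 at most

Next we will inspect some of the buckets to gain an understanding of the quality of the hashing with LSH: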

# We take this bucket and inspect the images:
# [39378, 39502, 41761, 42070, 50364]
print(classes.shape)
print(classes[:,39378]) # 7
print(classes[:,39502]) # 8
print(classes[:,41761]) # 8
print(classes[:,42070]) # 4
print(classes[:,50364]) # 9

On this particular example we can see that LSH does fairly poorly: of the five colliding images, only two are semantically related (class 8, ship). We will inspect another bucket before moving on:

# We take this bucket and inspect the images:
# [42030, 42486, 43090, 47535, 50134, 50503]
print(classes.shape)
print(classes[:,42030]) # 4
print(classes[:,42486]) # 4
print(classes[:,43090]) # 4
print(classes[:,47535]) # 1
print(classes[:,50134]) # 1
print(classes[:,50503]) # 4

In this particular case we see that LSH performs well, with the majority of the colliding images coming from the same class label (4, deer). We will now quantitatively evaluate the retrieval performance of LSH in a more rigorous manner.

Evaluation — Locality Sensitive Hashing (LSH)

We now quantify the semantic retrieval effectiveness of LSH more formally, using the precision at 10 metric as the search radius is varied. Precision at 10 measures how many of the 10 retrieved nearest neighbours for a query are of the same class as the query. Firstly we split the dataset into a set of queries, a training dataset used to learn any parameters, and a held-out database against which we perform retrieval:

from sklearn.model_selection import train_test_split
np.random.seed(0)
data_temp, data_query, labels_temp, labels_query = train_test_split(data, classes[0,:], test_size=0.002, random_state=42)
data_database, data_train, labels_database, labels_train = train_test_split(data_temp, labels_temp[:], test_size=0.02, random_state=42)

This code will give 120 random queries that we will use alongside the LSH search index to find nearest neighbours. The database consists of 58682 images, and the training dataset contains 1198 images.

Partitioning the dataset into training, validation and test. Image by author.

To prevent overfitting and measure performance, we maintain a held-out database that we perform retrieval against using the set of 120 queries. The training dataset is used to learn any parameters required by the AI models.
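A quick shape check (reusing the variables from the split above) confirms the partition sizes quoted earlier:

print(data_query.shape)     # (120, 512)
print(data_database.shape)  # (58682, 512)
print(data_train.shape)     # (1198, 512)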

We now index the database portion with LSH, thereby creating our hashtable:

bin_indices_bits = data_database.dot(random_vectors) >= 0
bin_indices = bin_indices_bits.dot(powers_of_two)
table = defaultdict(list)
for idx, bin_index in enumerate(bin_indices):
    table[bin_index].append(idx)

To search for nearest neighbours we apply a Hamming radius based search:

Different evaluation paradigms for Hashing. Image by author.

Hamming radius based search for a radius of zero is shown in Figure (b) in the above diagram (taken from the PhD thesis of Sean Moran).

In a nutshell, this search methodology looks not only in the colliding bin but also in nearby bins that differ from it by a certain number of bits, up to a specified maximum radius. We can use the itertools combinations function to enumerate all the bins that differ from the current bin by a given number of bits, up to a maximum radius of 10 bits. As well as returning neighbours in the same bin, we also return neighbours from these nearby bins.

import time
from itertools import combinations

import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

max_search_radius = 10
topn = 10
precision_history = {i: [] for i in range(max_search_radius + 1)}
time_history = {i: [] for i in range(max_search_radius + 1)}

for query_image, query_label in zip(data_query, labels_query):
    bin_index_bits = np.ravel(query_image.dot(random_vectors) >= 0)
    candidate_set = set()
    for search_radius in range(max_search_radius + 1):
        start = time.time()
        # Augment the candidate set with images from bins within the search radius
        n_vectors = bin_index_bits.shape[0]
        for different_bits in combinations(range(n_vectors), search_radius):
            index = list(different_bits)
            alternate_bits = bin_index_bits.copy()
            alternate_bits[index] = np.logical_not(alternate_bits[index])
            nearby_bin = alternate_bits.dot(powers_of_two)
            if nearby_bin in table:
                candidate_set.update(table[nearby_bin])
        # Sort candidates by their true distances from the query
        candidate_list = list(candidate_set)
        if candidate_list:
            candidates = data_database[candidate_list]
            ground_truth = labels_database[candidate_list]
            distance = pairwise_distances(candidates, query_image.reshape(1, -1), metric='cosine').flatten()
            nearest_neighbors = pd.DataFrame({'id': candidate_list, 'class': ground_truth, 'distance': distance}).sort_values('distance').reset_index(drop=True)
            candidate_set_labels = nearest_neighbors['class'][:topn]
            precision = list(candidate_set_labels).count(query_label) / topn
            precision_history[search_radius].append(precision)
        end = time.time()
        elapsed_time = end - start
        time_history[search_radius].append(elapsed_time)

mean_time = [np.mean(time_history[i]) for i in range(max_search_radius + 1)]
print(mean_time)
mean_precision = [np.mean(precision_history[i]) for i in range(max_search_radius + 1)]
print(mean_precision)

The above code produces a mean precision@10 of 0.30 for a radius of 2: on average, given a list of 10 returned images, 30% will be relevant to the query. As we gradually increase the Hamming radius we improve the quality of the retrieval, at the expense of checking many more candidate nearest neighbours.
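If you would like to visualise the precision and query-time curves yourself (as in the figures below), a minimal matplotlib sketch, run after the evaluation loop above, could look like this:

import matplotlib.pyplot as plt

radii = range(max_search_radius + 1)

# Mean precision@10 at each Hamming radius
plt.plot(radii, mean_precision, marker='o')
plt.xlabel('Hamming radius')
plt.ylabel('Mean precision@10')
plt.show()

# Mean query time at each Hamming radius
plt.plot(radii, mean_time, marker='o')
plt.xlabel('Hamming radius')
plt.ylabel('Mean query time (s)')
plt.show()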

We will show how this can be boosted to 0.40 mean precision@10 by learning the hashing hyperplanes, rather than generating the hyperplanes randomly (as we did with LSH).

Precision for LSH at varying Hamming Radii. Image by author.

As the Hamming radius increases from 0 to 10 we retrieve more images from the database into our candidate set, and this leads to a corresponding increase in query time, which approaches the time of a standard brute-force search (~53 seconds) once the candidate set grows to cover the entire database.

Query time for LSH for varying Hamming Radii. Image by author.

Implementation — Graph Regularised Hashing (GRH)

We now investigate how learning the hyperplanes (that is, learning to hash) can afford a much higher level of retrieval effectiveness. To recap, we will develop the supervised learning to hash model Graph Regularised Hashing.

Our first step is to use the training dataset to construct an adjacency matrix that GRH will use as its supervisory signal for learning the hashing hyperplanes. If images i and j share the same class label then adjacency_matrix[i,j]=1, and adjacency_matrix[i,j]=0 otherwise. In Python we can construct this adjacency matrix from the class label vector:

adjacency_matrix=np.equal.outer(labels_train, labels_train).astype(int)
row_sums = adjacency_matrix.sum(axis=1)
adjacency_matrix = adjacency_matrix / row_sums[:, np.newaxis]
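To see what np.equal.outer produces, here is a toy illustration with three hypothetical labels (before the row normalisation):

import numpy as np

toy_labels = np.array([0, 0, 1])  # images 0 and 1 share a class
print(np.equal.outer(toy_labels, toy_labels).astype(int))
# [[1 1 0]
#  [1 1 0]
#  [0 0 1]]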

The next step is to implement the two-step Graph Regularised Hashing (GRH) model of Moran and Lavrenko, which is reminiscent of the expectation maximisation (EM) algorithm. The following slides are taken from the talk here:

Overview of the GRH AI model for Learning to Hash. Image by author.

The first step is Graph Regularisation:

Double clicking on Step 1 of the GRH AI model. Image by author.

In the first step the adjacency matrix is multiplied with the matrix of hashcodes of the training dataset images. This multiplication has the effect of adjusting the hashcodes of the training images so that semantically similar images end up with more similar hashcodes.
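As a rough numerical sketch of this step (my notation, not the paper's exact formulation): with S a row-normalised adjacency matrix and B the matrix of ±1 hashcodes, the refined codes are the sign of a blend of B with its neighbourhood average S·B. A deliberately large alpha is used here purely to make the bit flip visible:

import numpy as np

# Three images that all share a label; self excluded from S in this sketch
S = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
B = np.array([[ 1.0, 1.0],
              [-1.0, 1.0],
              [-1.0, 1.0]])   # 2-bit hashcodes as +1/-1
alpha = 0.75  # strong pull towards the neighbourhood average
B_refined = np.where(alpha * S.dot(B) + (1 - alpha) * B >= 0, 1, -1)
print(B_refined)  # the first image's disagreeing bit flips to -1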

The second step is Data Space Partitioning:

Double clicking on Step 2 of the GRH AI model. Image by author.

In the second step these refined hashcodes are used to update the hyperplanes: a Support Vector Machine (SVM) is learnt per hash bit, using that bit across the training images as the target. GRH takes the LSH hyperplanes in random_vectors as an initialisation point and iteratively updates them to make them more effective for hashing. The entire GRH model is implemented below:

from sklearn.svm import LinearSVC

n_iter = 2    # number of iterations of GRH
alpha = 0.25  # how much to update the hashcodes based on the supervisory information
svclassifier = LinearSVC()  # one linear SVM is fit per hash bit

for i in range(n_iter):
    # Step 1: Graph Regularisation -- smooth the hashcodes over the adjacency matrix
    bin_indices_bits = (data_train.dot(random_vectors) >= 0).astype(int)
    bin_indices_bits[bin_indices_bits == 0] = -1
    bin_indices_bits_refined = np.matmul(adjacency_matrix, bin_indices_bits.astype(float))
    bin_indices_bits_refined = (bin_indices_bits_refined >= 0).astype(int)
    bin_indices_bits_refined[bin_indices_bits_refined <= 0] = -1
    bin_indices_bits = alpha * bin_indices_bits_refined + (1 - alpha) * bin_indices_bits
    bin_indices_bits = (bin_indices_bits >= 0).astype(int)
    bin_indices_bits[bin_indices_bits <= 0] = -1

    # Step 2: Data Space Partitioning -- re-learn one hyperplane per bit
    grh_hyperplanes = random_vectors.copy()
    for j in range(n_vectors):
        if abs(sum(bin_indices_bits[:, j])) == data_train.shape[0]:
            # In case all bits are the same we generate a new random vector
            random_vector = np.random.randn(dim, 1)
            grh_hyperplanes[:, j] = random_vector[:, 0]
        else:
            hyperplane = svclassifier.fit(data_train, bin_indices_bits[:, j]).coef_
            grh_hyperplanes[:, j] = np.array(hyperplane).flatten()

    random_vectors = grh_hyperplanes.copy()

In the above code, we parametrise GRH with 2 iterations and an alpha of 0.25. The iterations parameter is the number of times we repeat the two steps of GRH: hashcode refinement with the adjacency matrix, followed by adjustment of the hyperplanes based on those updated hashcodes. After 2 iterations, the matrix random_vectors contains a set of hyperplanes that have been refined, that is, made more effective for hashing-based nearest neighbour search, based on the supervisory information encapsulated in the adjacency matrix. We can plug these hyperplanes into the code from earlier in this tutorial to evaluate their effectiveness via a hashtable bucket-based evaluation at various Hamming radii.
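Concretely, a minimal way to do this is to rebuild the hashtable over the database exactly as before, now with the learned hyperplanes in random_vectors; the Hamming-radius evaluation loop can then be reused unchanged:

# Re-index the database with the GRH-refined hyperplanes
bin_indices_bits = data_database.dot(random_vectors) >= 0
bin_indices = bin_indices_bits.dot(powers_of_two)
table = defaultdict(list)
for idx, bin_index in enumerate(bin_indices):
    table[bin_index].append(idx)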

The following diagram illustrates the nearest neighbour relationships in a toy example, with images as nodes and edges connecting semantic nearest neighbours. The LSH-generated hashcodes are shown alongside each image:

Nearest neighbour relationships and initialised bits for a toy example. Image by author.

The diagram below illustrates how the adjacency matrix is used to update the hashcodes, with the first bit of image C flipping to a -1 to be more similar to the images above and to the left of it. In contrast, image E has its second bit flipped to become more similar to the hashcodes from the images below and to the left of it.

Regularising the bits with the adjacency matrix. Image by author.

A hyperplane is then learnt for the first bit by using the first bit of every image’s hashcode as the target. In this toy example (image below) the hyperplane partitions the data space horizontally, assigning images above the line a -1 in the first bit of their hashcode, and images below the line a 1.

Learning the hashing hyperplane for 1 bit. Image by author.

Evaluation — Graph Regularised Hashing (GRH)

Now that we have gained a good understanding of how the GRH AI model works, we will evaluate the GRH hashcodes using the same methodology as we did for LSH. We find improved retrieval effectiveness, particularly at lower Hamming radii:

Significantly improved retrieval effectiveness for the AI approach versus LSH. Image by author.

The benefits of GRH on this dataset, and for a hashcode length of 16 bits, can be observed in the low Hamming radius regime (5 or lower). For example, at Hamming radius 0, GRH obtains ~0.25 mean precision@10, whereas LSH obtains ~0.1 mean precision@10. Query times for the two methods are broadly similar (~0.5 seconds). The query time curve for GRH at increasing Hamming radii is shown below. As can be observed, for some Hamming radii we pay a small price in terms of query time for the additional boost in effectiveness.

Minor impact on query time for the AI approach versus LSH. Image by author.

🤝 Conclusions

In this tutorial we implemented a data independent approach for hashing-based nearest neighbour search called Locality Sensitive Hashing (LSH). We evaluated this algorithm for quality and speed. To improve retrieval effectiveness we introduced Learning To Hash, a field of study that applies machine learning to improve the quality of retrieved nearest neighbours. We then showed an improvement in the quality of nearest neighbours on an image retrieval use-case when applying a learning to hash AI model.

To learn to hash, we applied a Support Vector Machine (SVM) to learn the hashing hyperplanes for a model called Graph Regularised Hashing (GRH). A benefit of the GRH AI model, aside from its simplicity and effectiveness, is that it is agnostic to the learning algorithm: we could use a deep neural network if we wish to learn a more accurate data-space partitioning, or a passive-aggressive classifier if we want a lightweight learning method that can be adapted online, e.g. in a streaming scenario.
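For instance, a lightweight online learner could be swapped in for the SVM with a one-line change, since scikit-learn's linear classifiers all expose coef_ after fitting (a sketch, untested in this pipeline):

from sklearn.linear_model import PassiveAggressiveClassifier

# Drop-in replacement for the SVM in the GRH training loop
svclassifier = PassiveAggressiveClassifier(max_iter=1000)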

In this tutorial we explored a single-hashtable implementation of LSH and GRH, and increased the number of relevant items retrieved by multi-probing nearby buckets. Other implementations of LSH forgo multi-probing within a single hashtable and instead use multiple independent hashtables to find more relevant items. This is a useful study for a future article.
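As a taster, a rough sketch of the multi-table idea (reusing the variables defined earlier in this tutorial; illustrative only) could look like this:

# Sketch: several independent hashtables, each with its own hyperplanes.
# At query time we take the union of the colliding buckets.
n_tables = 4
tables, projections = [], []
for _ in range(n_tables):
    proj = np.random.randn(dim, n_vectors)
    codes = (data_database.dot(proj) >= 0).dot(powers_of_two)
    t = defaultdict(list)
    for i, code in enumerate(codes):
        t[code].append(i)
    tables.append(t)
    projections.append(proj)

# Union of candidates across all tables for a single query
query = data_query[0]
candidates = set()
for t, proj in zip(tables, projections):
    candidates.update(t.get((query.dot(proj) >= 0).dot(powers_of_two), []))
print(len(candidates))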

References
