Billion-Scale Similarity Search: Addressing the Memory Constraints
We have previously written several blog posts about approximate nearest neighbor (ANN) similarity search, with a particular focus on the billion-scale similarity search problem that many large e-commerce and social media companies face.
The challenge arises because companies such as eBay, Facebook, and Pinterest need to quickly and efficiently surface the most relevant recommendations (candidates) during the candidate generation stage of their similarity search pipelines, and they need to do so in a way that scales to billions of items.
For example, in a presentation given last year, Pinterest explained that during their candidate generation stage, they need to quickly funnel down from billions of images in their visual search pipeline to the roughly 10,000 most relevant images when providing recommendations to their users.
In a blog post, Facebook described a challenge similar to Pinterest's. Figure 1 below shows how Facebook needs to funnel down from billions of candidates to a much smaller number of relevant candidates. In Facebook's example, the goal is to provide the most relevant recommendations to Instagram users.
Another example is eBay, which mentioned in a post that their “computer vision algorithm sifts through more than half a billion images and eBay’s 1.4 billion listings to find and show you the most relevant listings that are visually similar.”
In all of these examples, ANN similarity search is used in the candidate generation stage to quickly funnel down from billions of recommendations (candidates) to a much smaller number of relevant candidates.
The items to be searched against in these billion-scale databases are represented by high-dimensional feature vectors, which means that they require significant computation and memory resources. One way to address the computation and memory constraints is by compressing the high-dimensional feature vectors.
In this blog post, we will briefly introduce a model that compresses the feature vectors by learning to map a large-scale dataset from the original feature representation space to binary hash codes in the Hamming space. This allows for fast and efficient approximate nearest-neighbor similarity search and helps address the memory constraint by providing a compressed representation of the original feature vectors.
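To see why binary codes help, consider a minimal sketch (not the paper's implementation, and with illustrative sizes chosen by us): a 128-bit code occupies 16 bytes per item versus 512 bytes for a 128-dimensional float32 vector, and Hamming distance reduces to an XOR followed by a popcount.

```python
# Illustrative sketch: binary codes shrink memory and make distance computation cheap.
import numpy as np

rng = np.random.default_rng(0)

n_items, n_bits = 100_000, 128                  # illustrative database size and code length
db_codes = rng.integers(0, 2, size=(n_items, n_bits), dtype=np.uint8)
db_packed = np.packbits(db_codes, axis=1)       # (n_items, 16) uint8 -> 16 bytes per item

query = rng.integers(0, 2, size=n_bits, dtype=np.uint8)
q_packed = np.packbits(query)

# Hamming distance = number of differing bits = popcount(xor)
xor = np.bitwise_xor(db_packed, q_packed)
hamming = np.unpackbits(xor, axis=1).sum(axis=1)

top_k = np.argsort(hamming)[:10_000]            # funnel down to the 10,000 nearest codes
print(db_packed.nbytes / 1e6, "MB for all codes;", hamming[top_k[0]], "bits to nearest item")
```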
Model Overview
Our model learns a binary hash that preserves local similarity by using a novel combination of a loss function and sampling method.
The model is based on a simple neural network and a training scheme that aims to preserve the locality relations between the original data points. It does this by aligning the Hamming-space similarity distribution with the cosine-based similarity distribution of the original space.
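As a rough sketch of what such a "simple neural network" could look like (this architecture is our assumption, not the paper's exact design), a small MLP maps a feature vector to n_bits continuous outputs in (-1, 1) via tanh during training, and the sign of those outputs gives the binary hash code at inference time.

```python
# Hedged sketch of a hash encoder: tanh outputs during training, hard bits at inference.
import torch
import torch.nn as nn

class HashEncoder(nn.Module):
    def __init__(self, in_dim: int = 128, n_bits: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_bits),
            nn.Tanh(),            # relaxed, differentiable stand-in for binary bits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)        # continuous codes used during training

    @torch.no_grad()
    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x) > 0    # hard 0/1 bits used for indexing and search

encoder = HashEncoder(in_dim=128, n_bits=64)
codes = encoder.encode(torch.randn(8, 128))   # (8, 64) boolean hash codes
```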
We introduce a loss function that converts the relational similarities in the original cosine space and in the new Hamming space into probability distributions, and then minimizes the KL divergence between them. As a result, the distances of the original cosine space are preserved in the new Hamming space.
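The sketch below captures our reading of that idea, not the paper's exact formulation: the query's cosine similarities to a set of reference points are turned into a probability distribution, the same is done with a relaxed Hamming distance between their codes, and the KL divergence between the two distributions is minimized. The names `query_feat`, `ref_feats`, `query_code`, and `ref_codes` are hypothetical placeholders.

```python
# Hedged sketch of a locality-preserving KL loss between cosine and Hamming similarities.
import torch
import torch.nn.functional as F

def locality_kl_loss(query_feat, ref_feats, query_code, ref_codes, temperature=1.0):
    # Target distribution from cosine similarity in the original feature space.
    cos_sim = F.cosine_similarity(query_feat.unsqueeze(0), ref_feats, dim=1)
    p = F.softmax(cos_sim / temperature, dim=0)

    # Predicted distribution from a soft Hamming distance between tanh codes in (-1, 1):
    # for hard +/-1 bits, (n_bits - dot) / 2 equals the exact Hamming distance.
    n_bits = query_code.shape[0]
    soft_hamming = (n_bits - ref_codes @ query_code) / 2.0
    q = F.softmax(-soft_hamming / temperature, dim=0)

    # KL(P || Q); kl_div expects log-probabilities as its first argument.
    return F.kl_div(q.log(), p, reduction="sum")
```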
We also introduce a simple data sampling method that represents the database with randomly generated proxies, which serve as reference points for the query points drawn from the training set.
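A minimal sketch of that sampling idea as we understand it (the exact proxy construction is an assumption on our part): draw random proxy vectors to stand in for the database, and during training compare each query only against these proxies rather than against every database point.

```python
# Hedged sketch: random proxies as reference points for training queries.
import torch

def make_random_proxies(n_proxies: int, dim: int, seed: int = 0) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed)
    proxies = torch.randn(n_proxies, dim, generator=g)
    return torch.nn.functional.normalize(proxies, dim=1)   # unit-norm reference points

proxies = make_random_proxies(n_proxies=256, dim=128)
# In a training step, `proxies` would play the role of `ref_feats` in the loss sketch
# above, with their encoded bits as `ref_codes`.
```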
Figure 2 below shows the training model.
The Results
Experimenting with three publicly available standard ANN benchmarks (Sift1M, Deep1M, and BigANN1M, where Deep1M and BigANN1M consist of the first million vectors of Deep1B and BigANN, respectively), our model provides a significant improvement over other binary hashing methods, achieving gains of between 7% and 17%.
Unlike other methods, which report accuracy improvements only for small code sizes (up to 128 bits), our model performs well at both low (64-bit) and high (768-bit) code sizes, offering increased accuracy when more resources are available and flexibility in the choice of ANN strategy. No pre-computations or graphs are required for the training process.
To get the details about the model, please refer to the full paper, Hamming Space Locality Preserving Neural Hashing for Similarity Search, or attend the virtual BayLearn 2020 Bay Area Machine Learning Symposium where the author of the paper will give a virtual poster presentation.