How to Build a Visual Search Engine

James Han
Published in Analytics Vidhya · Jun 24, 2021

Introduction

Visual search is a type of similarity search where the entities to be searched are images or objects in images. Common use cases include searching for similar products on a shopping website and searching for rare scenarios to train a self-driving car. What makes visual search difficult and an active field of research is that users are usually not interested in the visual similarity of the results, but in their semantic similarity. This means that if I search with the image of a dog, I want my results to be other images of dogs, not the images that happen to share the most pixels of the same color.

How Visual Search Works

In order to build a visual search engine, we must first build an index. An index is a data structure that search engines use to map search queries to search results. The search engine traverses this index to find the results most similar to a search query.

Since visual search is a type of similarity search, we have to define how similarity is measured between the subjects of interest, which in this case are images. We do this by converting images into vectors of numerical values that represent different features, called embedding vectors. This is commonly done with hand-crafted heuristics or with computer vision feature extraction models. Similarity can then be defined as the inner product between these embedding vectors.

At search time, the search engine converts a seed image into an embedding vector and traverses the index to find the embedding vectors most similar to it (i.e. those with the largest inner product with the seed’s embedding vector). Exact search scales linearly with the number of entities in the index, which is impractical for a visual search problem with billions of entities. Therefore, we will focus on approximate nearest neighbor search (ANNS) instead, which offers a good tradeoff between latency and accuracy.
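For reference, here is a minimal sketch of exact search by inner product over a toy NumPy index; the array sizes and the embeddings themselves are made up, but it shows the linear scan that ANNS methods are designed to avoid.

```python
import numpy as np

# Toy index: 10,000 entities, each represented by a 128-dimensional embedding.
rng = np.random.default_rng(0)
index_embeddings = rng.normal(size=(10_000, 128)).astype(np.float32)
query_embedding = rng.normal(size=128).astype(np.float32)

# Exact search: compute the inner product against every entity, O(N * D),
# then keep the 5 entities with the largest scores.
scores = index_embeddings @ query_embedding
top_k = np.argsort(-scores)[:5]
print(top_k, scores[top_k])
```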

Building a visual search engine involves a few components, and we’ll go into more detail about each of them in the following sections:

  1. Embedding extraction
  2. Indexing workflow
  3. Search service

Embedding Extraction

Imagine a D-dimensional space where every dimension represents a feature that can be extracted from an image, e.g. the presence of a straight line or the number of white pixels. An image can be represented as a point in this D-dimensional space, and the distance between two points can be interpreted as how similar (or how different) two images are. This D-dimensional space is called the embedding space, and the D-dimensional vector of each point’s coordinates in it is called the embedding vector.

The quality of visual search largely depends on how embeddings are extracted from images. One way to extract these embeddings is by manually defining heuristics, which may be feasible for domain-specific visual search but is very difficult to generalize and improve over time. A more scalable approach is to use computer vision models for embedding extraction. A convolutional neural network (CNN) can be used to extract features from images, and the activations at any given layer can be interpreted as an embedding vector of the image. Such a CNN can also be obtained easily from a pre-trained model, such as an object classification or object detection model.

Take a CNN model used for object classification, for example: its last layer takes a vector of numerical values representing an image and assigns it a class. This means that the vector contains enough information about the underlying image to tell what type of object it is. If we feed an image through this CNN, we can take the input vector to that last layer as the embedding vector of the image. This simple approach is effective for search problems that generally look for a certain class of objects (e.g. searching for apples among a bunch of fruit images), but more work is needed to train a better embedding model for finer-grained problems (e.g. searching for a specific style of chair).
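As an illustration of this approach, here is a minimal sketch that takes the pooled activations just before the classification layer of a pre-trained ResNet-50 as the embedding vector. It assumes PyTorch and a recent torchvision (the weights enum API), and the choice of ResNet-50 is arbitrary.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained classification model with its final classification layer removed,
# keeping everything up to and including global average pooling.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
embedder = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

# Preprocessing that matches what the pre-trained model expects.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_embedding(image_path: str) -> torch.Tensor:
    """Return a 2048-dimensional embedding vector for one image."""
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)        # shape: (1, 3, 224, 224)
    with torch.no_grad():
        embedding = embedder(batch).flatten(1)    # shape: (1, 2048)
    return embedding.squeeze(0)
```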

Indexing Workflow

In order to build a visual search engine, we need an index of images to search over, which means we need a workflow that inserts images into a searchable index. This workflow usually involves the following steps:

  1. Find images to index
  2. Preprocess images
  3. Extract embeddings from images
  4. Compress embeddings
  5. Insert embeddings into the index

Find Images to Index

Which images to include in the search index largely depends on the problem we’re trying to solve through visual search. Take the self-driving use case, for example: we could use visual search to find rare scenarios in order to label them and train better perception models. In this case, we want to index images of objects on the road, such as different types of vehicles, pedestrians, and traffic signs.

We can obtain images from the cameras attached to the self-driving cars, but those images may not be good candidates for indexing because far too much information is packed inside a single camera image. What we want to do instead is extract image patches of individual objects from those images, which we can do using an object detection model. We can further filter these objects by the classes assigned by the object detection model, which helps reduce noise in the search results and increase precision.
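One possible sketch of this patch-extraction step, using a pre-trained torchvision Faster R-CNN as the object detector; the score threshold and the set of classes to keep are illustrative, and a real self-driving pipeline would use its own perception model.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO-trained detector as a stand-in for a domain-specific perception model.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
).eval()

KEEP_LABELS = {1, 3, 10}  # illustrative COCO ids: person, car, traffic light

def extract_patches(image_path: str, score_threshold: float = 0.7):
    """Crop one image patch per confidently detected object of interest."""
    image = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        detections = detector([to_tensor(image)])[0]
    patches = []
    for box, label, score in zip(detections["boxes"],
                                 detections["labels"],
                                 detections["scores"]):
        if score >= score_threshold and int(label) in KEEP_LABELS:
            x1, y1, x2, y2 = (int(v) for v in box.tolist())
            patches.append(image.crop((x1, y1, x2, y2)))
    return patches
```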

Preprocess Images

If necessary, images should be preprocessed to conform to the input requirements of the embedding extraction model. This could include resizing, color correction, geometric transformations, etc. This step can be computationally expensive, so it should be avoided when possible.
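If the embedding model comes from torchvision, the matching preprocessing can be pulled directly from the pre-trained weights rather than hand-tuned; a small sketch, assuming a recent torchvision with the weights enum API:

```python
import torchvision.models as models

# The pre-trained weights bundle the exact preprocessing they were trained with
# (resize, center crop, tensor conversion, normalization), so we can reuse it.
weights = models.ResNet50_Weights.IMAGENET1K_V2
preprocess = weights.transforms()

# batch = preprocess(pil_image).unsqueeze(0)  # ready to feed into the embedder
```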

Extract Embeddings from Images

This step uses the embedding extraction methods described in the previous section. The goal is to extract embedding vectors from input images so that we can store them in the index instead of storing the images themselves.

Compress Embeddings

Searching through an index of raw embedding vectors is costly in terms of both search time and storage space. It is common to compress the embedding vectors into a more compact representation that fits more entities into memory and results in faster search. There are two common families of compression methods: binary-based and PQ-based. The two are complementary: binary-based compression is faster at computing distances, while PQ-based compression achieves higher accuracy.

Binary-based compression converts an embedding vector into a binary string. The conversion is designed so that the Hamming distance between two binary strings is a monotonic function of the distance between the original embedding vectors. In other words, binary-based compression preserves the relative distances between entities in an index.
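A toy sketch of one simple binary scheme, sign binarization after a random projection, together with the Hamming distance computation; real systems typically learn the projection, so treat the random hyperplanes here as a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
D, BITS = 2048, 256
projection = rng.normal(size=(D, BITS)).astype(np.float32)  # random hyperplanes

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Map float embeddings of shape (N, D) to packed binary codes (N, BITS // 8)."""
    bits = (embeddings @ projection) > 0
    return np.packbits(bits, axis=1)

def hamming_distances(query_code: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one packed code and every code in the index."""
    xor = np.bitwise_xor(codes, query_code)
    return np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per row

codes = binarize(rng.normal(size=(10_000, D)).astype(np.float32))
query_code = binarize(rng.normal(size=(1, D)).astype(np.float32))
nearest = np.argsort(hamming_distances(query_code, codes))[:5]
```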

Product quantization (PQ) is a method that compresses an embedding vector into a short code known as a PQ code, and search is conducted on this code instead of on the original embedding vector. This brings a few advantages:

  1. A much larger index can now fit into memory (~1 billion entities)
  2. The distance calculation between an embedding vector and a PQ code is more efficient than that between two embedding vectors, while still being a good estimate of the Euclidean distance between the two original embedding vectors.
  3. It works well with other indexing data structures, which we will discuss below.

PQ works as follows:

  1. For each of the N entities to be indexed, cut its D-dimensional embedding vector into M sub-vectors
  2. For every m in {1, …, M}, run k-means clustering across all N of the sub-vectors corresponding to the m-th sub-vector of every entity
  3. Each sub-vector is assigned an integer identifier corresponding to its nearest cluster, and the identifiers are concatenated for each embedding vector, thus encoding each embedding vector as an M-dimensional vector of cluster identifiers (the PQ code)

The mapping from the sub-vectors’ cluster centroids to their integer identifiers is called a codebook.
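A toy sketch of the training and encoding steps above, using scikit-learn’s KMeans; the choices of M = 8 sub-vectors and K = 256 centroids (so each identifier fits in one byte) are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(embeddings: np.ndarray, M: int = 8, K: int = 256):
    """Train one K-means codebook per sub-vector; D must be divisible by M."""
    sub_dim = embeddings.shape[1] // M
    return [
        KMeans(n_clusters=K, n_init=4).fit(embeddings[:, m * sub_dim:(m + 1) * sub_dim])
        for m in range(M)
    ]

def encode_pq(embeddings: np.ndarray, codebooks) -> np.ndarray:
    """Encode each embedding as M cluster identifiers (its PQ code)."""
    sub_dim = embeddings.shape[1] // len(codebooks)
    codes = [
        codebook.predict(embeddings[:, m * sub_dim:(m + 1) * sub_dim])
        for m, codebook in enumerate(codebooks)
    ]
    return np.stack(codes, axis=1).astype(np.uint8)  # shape: (N, M)
```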

Insert Embeddings Into the Index

The index is a searchable data structure that contains the embeddings of every entity that can be returned as search results. Instead of embeddings, PQ-codes can also be inserted into this data structure to reduce both the latency and the memory requirement of running the search service.

A k-dimensional tree (k-d tree) is a binary tree where every node is a point in a k-dimensional space. The entire k-dimensional space is iteratively split by each non-leaf node of the tree, dividing it into smaller and smaller subspaces. All of the points are inserted into the tree at once, and each split is chosen as the median of the points along a particular axis of the k-dimensional space. At search time, the k-d tree is traversed to find the leaf node representing the subspace that contains the query, and the tree is then traversed back up recursively to find the K nearest neighbors to the query by also searching the nearby subspaces.

The k-d tree is an exact nearest neighbor search algorithm, and it performs well when the embedding space has low dimensionality. However, its search complexity approaches linear in the number of entities when the embedding space has high dimensionality.
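For low-dimensional embeddings, a k-d tree is readily available off the shelf; a minimal sketch using SciPy’s cKDTree on made-up 8-dimensional points:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.normal(size=(100_000, 8))   # low-dimensional embeddings (k = 8)
tree = cKDTree(points)                   # splits on the median along each axis

query = rng.normal(size=8)
distances, indices = tree.query(query, k=5)   # exact 5 nearest neighbors
```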

One state-of-the-art approach to ANNS uses a graph data structure known as the navigable small world (NSW). This is a graph whose vertices are the stored entities and whose edges link similar entities. Insertion into the index and running a search query are almost identical processes: both use the same method to find the nearest neighbors of an entity, and insertion additionally adds a new vertex to the graph with edges connecting it to those nearest neighbors. Both insertion and search can be parallelized, and the data structure can be distributed.

The insertion and search algorithm for navigable small world works as follows:

  1. Start from any arbitrary vertex
  2. Use a greedy search algorithm to traverse the graph, moving at each step to the neighbor closest to the query, and stop when the current node is closer to the query than any of its neighbors
  3. Repeat this K times, each time choosing a random starting point, and only pick nodes that have not been visited before
  4. Now that a set of K nearest neighbors is found, keep iterating and replacing the farthest of the K nodes with the newly found nearest neighbor
  5. Stop when the set of K nodes cannot be improved in an iteration
  6. At search time, simply return the K nearest neighbors. At insertion time, add the query node to the graph and add edges from that node to the K nearest neighbors
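In practice, this graph-based approach is usually consumed through a library such as hnswlib, which implements the hierarchical variant (HNSW). A minimal sketch, with made-up data and illustrative parameters:

```python
import numpy as np
import hnswlib

rng = np.random.default_rng(0)
dim, num_entities = 128, 100_000
embeddings = rng.normal(size=(num_entities, dim)).astype(np.float32)

# "ip" = inner-product similarity; M and ef_construction trade off build time,
# memory, and recall.
index = hnswlib.Index(space="ip", dim=dim)
index.init_index(max_elements=num_entities, M=16, ef_construction=200)
index.add_items(embeddings, np.arange(num_entities))   # insertion

index.set_ef(64)   # search-time accuracy/latency knob
labels, distances = index.knn_query(embeddings[:1], k=10)
```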

Search Service

Search Algorithm

Product quantization with inverted indexing is another widely used state-of-the-art approach to ANNS, and it is the one we’ll focus on here. At search time, the distance between the embedding vector of the query image and the PQ codes in the index can be efficiently calculated using asymmetric distance computation (ADC), which approximates the distances between the original embedding vectors. On its own, this method still uses a linear scan, so it does not scale well to a large index.

To handle a much larger (billion-scale) index, a search system with inverted indexing is commonly used. At indexing time, the N entities are put into J buckets, and a representative vector is computed for each bucket. Instead of performing PQ on each embedding vector directly, we perform PQ on the residual vector between each embedding vector and its bucket’s representative vector.

At search time, we select the nearest buckets for the query vector in a process called coarse quantization, and then we perform distance calculations using ADC. Since we only have to search through the PQ codes in the selected buckets, search time no longer scales linearly with the number of entities in the index.

Using this method, called the inverted file system with asymmetric distance computation (IVFADC), a search time of 10–100 milliseconds can be achieved for an index of one billion entities.
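IVFADC is available off the shelf, for example as IndexIVFPQ in the Faiss library. A minimal sketch with made-up data and illustrative parameters (the bucket count, sub-vector count, and nprobe are all tuning knobs):

```python
import numpy as np
import faiss

d, nlist, m, nbits = 128, 256, 16, 8   # dim, buckets, PQ sub-vectors, bits per id
rng = np.random.default_rng(0)
xb = rng.normal(size=(100_000, d)).astype(np.float32)   # database embeddings
xq = rng.normal(size=(5, d)).astype(np.float32)         # query embeddings

coarse_quantizer = faiss.IndexFlatL2(d)                  # finds the nearest buckets
index = faiss.IndexIVFPQ(coarse_quantizer, d, nlist, m, nbits)
index.train(xb)    # learns the bucket centroids and the PQ codebooks
index.add(xb)      # stores PQ codes of the residuals, grouped by bucket

index.nprobe = 16  # number of buckets to scan at search time
distances, ids = index.search(xq, 10)
```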

Service Infrastructure

The visual search service can be deployed as a simple web service with a single API endpoint to perform search, as sketched after the list below. The request to this endpoint needs to provide one or more seed images and bounding boxes, as well as any customized configurations for the search. Some infrastructure problems to consider include:

  • Memory constraints and loading time for the index
  • Preprocessing and postprocessing of images
  • Scalability in terms of the number of entities in the index
  • Scalability in terms of the number of results to return
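A minimal sketch of such an endpoint using Flask; the embedding extractor and the index here are dummy placeholders standing in for the components built in the earlier sections.

```python
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Dummy placeholders: a real service would load the trained embedding model and
# the pre-built ANNS index into memory at startup.
rng = np.random.default_rng(0)
index_embeddings = rng.normal(size=(10_000, 128)).astype(np.float32)

def extract_embedding(image_bytes: bytes) -> np.ndarray:
    # Hypothetical stand-in for preprocessing + the CNN embedder.
    return rng.normal(size=128).astype(np.float32)

@app.route("/search", methods=["POST"])
def search():
    seed = request.files["image"].read()            # seed image upload
    top_k = int(request.form.get("top_k", 10))      # per-request configuration
    query = extract_embedding(seed)
    scores = index_embeddings @ query               # stand-in for the ANNS lookup
    ids = np.argsort(-scores)[:top_k]
    return jsonify({"ids": ids.tolist(), "scores": scores[ids].tolist()})

if __name__ == "__main__":
    app.run(port=8080)
```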

Further Readings

Visual search is a broad field with a lot of active research going on. From the computer vision algorithms used to compute embeddings to the ANNS algorithms used to search through an index, the state of the art is constantly evolving. This article tries to cover a little bit of everything involved in building a visual search engine, but it is not an exhaustive guide to all the approaches out there. For a deeper understanding of visual search, I encourage you to read research papers in the following areas:

  • Computer vision (especially object detection)
  • Neural network embeddings
  • Approximate nearest neighbor search
