Slyce was born from a simple vision: What if you could find and buy any product you see, using the technology of visual search on a photograph? Of course, the task of visual product search is born of a more ambitious challenge: Given a query image, how do you determine the most relevant images from a universe of images and presenting them in decreasing order of relevancy?
In the case of a traditional search engine, for example, the universe would be all images that are available on the internet. For a social network, it would be all images that were posted or pinned on its website (or app). In our case we elected to focus on solving visual search for retailers, narrowing the universe to be the entire catalog of products that a given retailer sells.
The challenges in visual search are of a different nature than text search. In text search, the relevancy between a query text and list of documents (i.e the search universe) is essentially whether a document has the query text (score = 1+) or not(score = 0) and if present, how many occurrences of it a document contains (score ≥ 1) . Hence, in text search, the higher the score, the higher the relevancy. In that respect, the relationship between any two text documents can be thought of one-dimensional.
In visual search, however, a query image might be related to other images in the search universe across a multitude of categories. For example, a query image might be related to other images in the “color” category, so score in this category represents the “greenness” of the images. Similarly, images could be related via the “pattern” category and the score in this category could represent the “starness” of the images. In fact, there could several categories in which the images could be related to each other. Hence, the relationships in images are multidimensional.
The most intuitive way to build a visual search system that would to extract the images related to the query image in all these dimensions and rank them appropriately would be to:
- First, define all the different ways in which a query image might be related to the product images — color, material, pattern, shape, etc. i.e. we define what filters we use to identify the images. There might be a ”color” filter, a “pattern” filter, a “gender” filter etc.
- Second, for each of the relationships, using the filters we identified earlier, we will sift through all product images. For example, we could use a “color” filter that identifies all “green” images in the product catalog and returns a “greenness” score for each of them. These two steps together accomplish the retrieval task. i.e they output an unordered list of images retrieved, that are relevant to query image in the categories we defined.
- Now that we have retrieved all product images that are related to query image in multiple dimensions, with scores for each dimension, all we need is a way to rank these retrieved images based upon how relevant they are, to the query image. If we assign appropriate weights to each of these dimensions so that more weight is given to the more appropriate categories, we will end up with a list of all relevant images with each image assigned its own score. For example, if you deem the “color” dimension to be more important than “pattern”, set the “color” dimension weight to be relatively higher. In this case, when the decision is made whether to rank “green” images without stars or “non-green” images with stars higher, the system assigns a higher rank to “green” images with no “star.” This accomplishes the “ranking.”
That’s it, believe it or not, these three steps are the key building blocks of building a visual search system. Let us dig deeper into each of these blocks and understand how to build them.
Step 1: Define the Filters
While it sounds simple and intuitive to explicitly state all possible ways a query image might be related to product images, it also demands expertise and knowledge of the search domain we are tackling. For example, if the search domain is fashion like the images shown, the categories such as color, pattern, material etc. could be relevant. But if the search problem is in a different domain, such as healthcare CT scan images, the color might not be the most important category anymore. The categories, in which the images are related to each other, are entirely dependent on the search domain and there is no one solution for all domains. Therefore, defining those filters demands domain knowledge and expertise. These demands have slowed down innovation and discouraged disruption in certain domains.
Fortunately, we live in the age of Deep Learning (DL). With the advent of this new class of algorithms, the art and the need of selecting features has largely faded in favor of letting the algorithms define and learn features by themselves. Say, goodbye to domain expertise.
Step 2: Retrieve Images from Search Universe Using the Defined Filters
Once we define our desired filters, for each filter, we need to build a system that sifts through an entire image catalog and divides the images into groups or classes. For example, for a color classifier, the classes could be green, red, blue, etc.
Such a system can be easily built using a special type of Deep Neural Network (DNN) called a classification network or simply a classifier. This type of network, when fed an image as input, is able to give out a score representing the likelihood that the image belongs to a particular class. In the following example, the network is able to determine that the input image has a higher likelihood of belonging to “brown” class than the “green” class etc by passing this image through a series of layers.
To understand how this is accomplished, let us digress a bit and do a quick primer on Deep Learning(DL), which is a subset of machine learning. Deep learning gets its name due to its use of DNNs that are constructed as a series of layers stacked on top of each other. A DNN is similar to the network of neurons in a brain.
Images stored digitally are a matrix of pixel values. In order to make determinations about the content of an image we need to look for specific features in the pixel values. Each layer looks for a particular feature, yielding a higher value for areas where the feature is present and a lower value for areas where the feature is missing. The layers individually learn a distinct set of features based on each previous layer’s output. The further you advance into the neural network, the more complex the features the layers can recognize, since they aggregate and recombine features from the previous layer. For example, a network may progress from detecting blobs and edges to noses and eyes to cheeks and faces etc.
By using these feature detecting layers, a DNN transforms the images into feature vectors or embeddings. These embeddings represent the features of the images in high dimensional space, also called embedding space. Thus, deep learning can take a million images and cluster them in embedding space according to their similarities: red images in one corner, green images in another, and in a third, blue images.
Once extracted from an image, a feature vector can be used to make inferences about the features directly or fed into another layer(s) that can transform it further to fit a particular task. For the task of color classification as an example, we could add another set of layers, commonly known as classification or softmax heads, that create boundaries between different classes. The result of this would be a network that extracts features from an image and infers the color of an object in an image by looking at the deep features in the image’s embedding. The network infers the color class depending on which side of a boundary the embedding falls in.
That was a lot of work, but we laid the foundation for what is about to come next.
Step-3: Rank the retrieved images
In the final step, a ranking algorithm weighs the extracted features to rank the relevant images in decreasing order of relevancy to the query image. If the weights assigned to these features are static it would not work well. For a given query image certain features, such as color or pattern, are more important. For other images, other features such as gender or material might be more important. Hence, the weights assigned to these features should be dynamic and factor in the query image. This is fairly straightforward to accomplish using a DNN as well where the inputs are the unordered list of relevant images from each of the filters as well as scores from them and the output is an ordered list of all these relevant images.
To extend this even further, if both retrieval and ranking steps are essentially a combination of DNNs, can we replace all of them with one giant end to end deep learning? Yes, of course, if we are smart in our approach.
We will need to employ an end-to-end deep learning network for visual search that gets an image as input and outputs a relevancy score for all products in the search catalog.
Let us start with the classification network we are familiar with and see if we can use it as an end-to-end network. Here, the number of the output classes would be the total number of products. Though doing this is theoretically possible using classification networks, they are a few practical issues that make them unsuitable for image search.
Issue 1: Data Requirements
Neural network based classifiers have proven to be successful in many applications, often performing with near perfect accuracy. A key ingredient to their success, that is often overlooked, is the large amount of data necessary to train them. In an ideal scenario multiple images for each class, often about 100, are required. This is impractical when dealing with millions of classes (or products).
Issue 2: When the Number of Products Change
Consider the scenario where we have a product catalog and have designed a classification network with a query image as input and nodes representing each product as output. We have also gathered the required data and trained the network to achieve high performance. What if we want to add another product? Classification networks have a set number of classes based on their architecture, so we would need to modify the network to add a class.
The embedding space would need to be modified to accommodate the new class as the previous training will have fully occupied the embedding spaces with the prior classes. New classification boundaries need to be determined as shown in the following figures.
Every time a new product is introduced the models needs to re-trained, a time-consuming and expensive venture, which makes difficult classification networks for visual search difficult in practice.
Issue 3: Expansion Size Constraints
In order to increase the number of classes identified by a classification network, the network itself must be expanded. This expansion increases the number of connections in the network and hence the network complexity and size. The expansion also significantly boosts the training time and the amount of data required. These changes place a practical limitation on the maximum number of distinct classes that a classification network can identify. The largest networks that have been successfully trained have about 20,000 classes and about 300 million training images (Xception), that is about 15 images per class.
Pre-trained Feature Extractor and Nearest Neighbor Searches
A simple way to look at the outputs of a neural network that would allow for new classes to be added without retraining and expanding network size is taking a classification network pre-trained by ImageNet such as ResNet or MobileNet and removing the classification layer of the network. Then, we could use the rest of the network to generate embeddings.
First, map all product images into embedding space. Then the query image is mapped into the same space and the distance between the query image and the product images in this space is used as a metric to identify products that are relevant. The closer the product is to the query image in this embedding space, the higher its relevance is.
While the fundamental structure of the network and nearest neighbor search is sound and it addresses the limitation of the classification networks we discussed earlier, the way that the network was trained (using classification) introduces an issue.
Classification networks are trained to be discriminative, which means as long as the last but one layer of the network is able project the products into the embedding space such that a boundary can be placed between distinct groups, the network is considered trained. While this grouping implies that similar items are close together, it does not explicitly force this.
For example, consider an image of product D as query image, which is mapped into embedding space as shown below. When the classification head is present aka with the classification boundary, the query image is “correctly” identified as “Product D”. However, when we remove the boundary and use the distance between the embedding as metric, the nearest neighbor of the query image would be “Product C” and is miscategorized. Therefore, it is possible to have good discrimination between classes without the members of the class being close in distance.
The other issue with pre-trained ImageNet model is with the training data itself. ImageNet training data has about 1000 classes and some of them are listed here.
- Class A: ‘jersey, T-shirt, tee shirt’,
- Class B: ‘sweatshirt’,
- Class C: ‘jean, blue jean, denim’
A deep network that is trained on this dataset will be able to group ‘t-shirts’ as one cluster and ‘jeans’ as another. But the visual search solution requires the ability not only to differentiate between ‘t-shirts’ and ‘jeans’ categories, but also within these categories between different types of ‘t-shirts’ and different types of ‘jeans’.
In summary, what is needed is a deep feature extractor that is able to cluster similar products together and differentiate between dissimilar products with finer granularity. This network should be trained such that the distance between similar products is explicitly optimized since we want to employ the distance metric to retrieve similar products directly from the search universe. This is what we will look at in the next part of this series.