How Slyce Solves Visual Search — Part 2

Hareesh Kolluru
Published in slyce-engineering
Sep 23, 2019

In the last post we discussed the construction of a visual search system. Here we will explore the core technology behind Slyce’s visual search engine: deep metric learning. Deep metric learning is a technique that suits visual search well, addressing many of the issues discussed previously and providing us with an end-to-end framework for visual search.

Deep Metric Learning

Deep metric learning, also known as distance metric learning, is a type of similarity learning algorithm. Metric learning algorithms optimize the distances between embeddings in an embedding space, so that similar images are closer together and dissimilar images are further apart. The primary difference between metric learning and other deep learning algorithms is that metric learning explicitly trains for this geometry: similar images are clustered tightly together, and dissimilar images are separated by at least a certain margin specified during training. These constraints create embeddings that are optimal for nearest neighbor searches, which is exactly what we need for visual search.

A properly trained metric learning network takes images, creates embeddings from the images, and places the embeddings in embedding space such that similar images (from the same product) are close to each other.

Images clustered in a metric learning embedding space
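To make the idea concrete, here is a minimal sketch of the property a trained network should exhibit. The embeddings and product names below are made up for illustration; real embeddings typically have hundreds of dimensions:

    import numpy as np

    # Hypothetical embeddings a trained network might produce (3-D for readability).
    red_shoe_a = np.array([0.9, 0.1, 0.0])  # first photo of a red shoe
    red_shoe_b = np.array([0.8, 0.2, 0.1])  # second photo of the same shoe
    blue_bag   = np.array([0.0, 0.1, 0.9])  # photo of a different product

    def distance(a, b):
        # Euclidean distance between two embeddings.
        return np.linalg.norm(a - b)

    print(distance(red_shoe_a, red_shoe_b))  # ~0.17: same product, close together
    print(distance(red_shoe_a, blue_bag))    # ~1.27: different products, far apart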

Training and Inference

Deep metric learning networks are trained in a similar way to deep neural network based classifiers. Examples are fed into the network, and the result the network generates is compared to an ideal result. This comparison, called a loss function, is what the network tries to minimize. Minimizing the loss pushes the network’s results closer to the ideal result, and thus the network “learns”. The difference between classifiers and metric learning lies in what the networks are minimizing: metric learning uses a distinct loss function with two distinct objectives:

  • two images from the same product should be close together in embedding space.
  • two images from different products should be far apart in embedding space.

Triplet loss is the primary loss function used in metric learning, and it is what we will use to formalize the above requirements. It can be seen as a baseline, but it is conceptually the most important loss to understand and a good place to start when creating a metric learning system. Beyond triplet loss, there are quite a few other loss functions, which we will discuss in a different post.

In a triplet loss function, loss is calculated by looking at three examples from your dataset: an anchor, a positive, and a negative example.

Triplet loss on two positive faces (Obama) and one negative face (Macron) (Source: https://omoindrot.github.io/triplet-loss)

For some distance metric on the embedding space, the loss of a triplet (a, p, n) is:

        Loss = max(distance(a,p)−distance(a,n)+margin,0)

We minimize this loss, which pushes distance(a,p) toward 0 and distance(a,n) to be greater than distance(a,p)+margin. As soon as the distance between the anchor and the negative is larger than the distance between the anchor and the positive by the specified margin, the loss becomes zero.
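This formula translates almost directly into code. Below is a minimal sketch using Euclidean distance and an arbitrary margin; in practice, deep learning frameworks provide batched implementations such as PyTorch’s torch.nn.TripletMarginLoss:

    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=0.2):
        # Loss = max(distance(a, p) - distance(a, n) + margin, 0)
        d_ap = np.linalg.norm(anchor - positive)  # anchor-to-positive distance
        d_an = np.linalg.norm(anchor - negative)  # anchor-to-negative distance
        return max(d_ap - d_an + margin, 0.0)     # zero once the margin is satisfied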

Here is a visual demonstrating how the anchor, positive, and negative embeddings are pushed and pulled to minimize the triplet loss function.

An example of how the anchor and positive embeddings are pulled together, and the anchor and negative embeddings are pushed apart, during triplet loss minimization.

Training

The first step of training a metric learning network is data preparation. You must gather a large number of images from the domain you would like to make inferences about, that is, a set of examples with similar qualities. Next, this set of images must be divided into mutually exclusive groups based on the properties of the images. For example, if you wanted to create a metric learning network for apparel, you could divide the images by the individual article of clothing in each image, or simply by the individual products themselves.

Once you have the images grouped, they are fed into the network and embeddings are generated. The distances between the generated embeddings are calculated. The network learns to minimize the distance between embeddings from the same group and maximize the distance between embeddings from disparate groups by optimizing a loss function as described above. This process is repeated until the network is sufficiently trained. The end result is that the embeddings generated by the network cluster “similar” items together. The effects of this process can be seen below.

Embedding space throughout training
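As a simplified sketch of what one pass of this training loop might look like in PyTorch, where model (the embedding network) and triplet_loader (an assumed data loader yielding anchor, positive, and negative image batches) are hypothetical stand-ins:

    import torch

    # Assumed to exist: model maps a batch of images to embeddings, and
    # triplet_loader yields (anchor, positive, negative) image batches.
    loss_fn = torch.nn.TripletMarginLoss(margin=0.2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for anchor_imgs, positive_imgs, negative_imgs in triplet_loader:
        a = model(anchor_imgs)     # embeddings of anchor images
        p = model(positive_imgs)   # embeddings of same-product images
        n = model(negative_imgs)   # embeddings of different-product images
        loss = loss_fn(a, p, n)    # pulls a and p together, pushes a and n apart
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()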

Inference

When it comes time to make an inference with a metric learning network, we employ an approach similar to training. All product images are fed through the network to create an index of pre-computed embeddings. When a new image comes through the network, its embedding is generated and compared to those in the index. The embeddings closest in distance to the new embedding are returned. This process is illustrated below, where a target image is identified as “Product C” because its nearest neighbors belong to “Product C”.

Inference via Nearest Neighbor Search
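Here is a minimal sketch of this lookup, assuming the catalog embeddings have already been stacked into a numpy array. At catalog scale, a brute-force search like this would typically be replaced by an approximate nearest neighbor library such as FAISS:

    import numpy as np

    def nearest_products(query_embedding, index_embeddings, product_ids, k=5):
        # index_embeddings: (N, D) array of pre-computed catalog embeddings
        # product_ids:      the product label for each of the N rows
        # query_embedding:  (D,) embedding of the query image
        distances = np.linalg.norm(index_embeddings - query_embedding, axis=1)
        nearest = np.argsort(distances)[:k]  # indices of the k closest embeddings
        return [(product_ids[i], distances[i]) for i in nearest]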

In conclusion, the process of using a deep metric learning network can be summarized as:

  1. Feed images through the network and collect embeddings
  2. Form triplets (anchor, positive, negative) from these embeddings and use them to calculate the loss
  3. Update the weights in the neural network to “learn” from the loss
  4. Repeat steps 1–3 until network is trained
  5. Once trained, do a nearest neighbor search to find similar embeddings

Advantages

Now that we have discussed the basics of deep metric learning, we will explore some of the reasons it is an attractive option for creating a visual search engine. Because metric learning creates embeddings optimized for nearest neighbor search, it has some distinct advantages over other search architectures. One such advantage is that classes do not need to be explicitly defined in the network architecture, which means classes with only a few training examples can still be added to the system. More data can also be added to a metric learning model without necessarily needing to re-train the network.

Adding new images of existing products results in new embeddings close to the original clusters of those products.

Another advantage of metric learning is that, unlike classification networks, the size of the neural network used in a metric learning system does not depend on the number of classes you are dividing the data into. Adding new classes (products) simply involves generating new embeddings and adding them to the embedding space. If the network has learned to cluster features well, new products will form clusters independent of existing clusters. This means you can train a metric learning network that works on millions of classes without an extraordinarily large network architecture, and a new product can be added to the system without retraining the neural network. This can be seen below, where new product images added to the embedding space form a new product cluster.

New products form new clusters
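In code, adding a product to the hypothetical index sketched earlier could be as simple as embedding the new images with the frozen network and appending them. Here model, index_embeddings, and product_ids are the assumed names from the earlier sketches, and “Product D” is a made-up label:

    import numpy as np
    import torch

    # No retraining required: embed the new product's images with the already
    # trained network and append the embeddings to the existing index.
    # new_product_images is an assumed batch of images for the new product.
    with torch.no_grad():
        new_embeddings = model(new_product_images).cpu().numpy()  # (M, D)

    index_embeddings = np.vstack([index_embeddings, new_embeddings])
    product_ids = product_ids + ["Product D"] * len(new_embeddings)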

Conclusion

In the past two articles we have explored the makeup of Slyce’s visual search system. We started by looking at the intuitive building blocks of visual search: defining search features, creating filters to search for matches across the predefined features, and ranking the returned objects. While these blocks provided a good conceptual base, there were some issues with building a visual search platform on this basic framework.

From this framework, we examined how limitations like the need for large amounts of training data, the difficulty of adding new products to a trained system, and the lack of simple expandability clashed with our desire for a flexible system that could easily meet the needs of the clients we serve.

From there we considered a different way of getting results from a deep neural network: extracting features and doing a nearest neighbor search. While this approach addresses the problems of the initial approach, it introduces new problems of its own. Simply extracting features using a pre-trained ImageNet model will not work well, due both to the discriminative nature of the model and to the classes on which the model was trained. To get embeddings better suited to a nearest neighbor search, we need a different approach: deep metric learning.

Deep metric learning alleviates the issues introduced by the basic embedding extraction approach by directly optimizing the embeddings so that similar images are closer together and disparate images are farther apart.

To train a deep metric learning model, we start with a classification model, remove the classification head of the network, and replace it with an embedding output. With this model, we take images from different groupings, for example T-shirts and jeans, and feed them through the network to get their embeddings. Using these embeddings and a loss function like the triplet loss described above, we train the network by changing its parameters to better cluster the jeans and T-shirts. This process is repeated until the network is sufficiently trained.
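As a sketch of that modification, assuming a torchvision ResNet-50 backbone, an arbitrary 128-dimensional embedding size, and the common (but not obligatory) convention of L2-normalizing the output:

    import torch
    import torchvision

    # Start from a pre-trained classification model...
    model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    # ...and swap its classification head for an embedding output.
    model.fc = torch.nn.Linear(model.fc.in_features, 128)

    def embed(images):
        # L2-normalize so distances in embedding space are well-behaved.
        return torch.nn.functional.normalize(model(images), p=2, dim=1)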

The fully trained network now powers the visual search engine. A catalog of products is fed through the network and embeddings are generated to map the searchable products into embedding space. When results from a query image are needed, the image is fed through the network and its embedding is calculated. This query embedding is compared to the embeddings in the embedding space and its nearest neighbors are returned as search results. This metric learning approach provides a streamlined and expandable method for visual search.

Contributors — Jake Buglione, Jay Patel, YingYing, Mike Matza.
