Have you ever found yourself searching for a piece of furniture but unsure how to describe it in exact words? Perhaps a sofa you found on Pinterest or home decor you stumbled upon on Instagram?
This post explains the visual search project we did for Hayneedle, a home furnishings and decor retailer that is part of our larger Walmart family of brands. My team was tasked with creating a search experience that would allow customers to search for products purely through images instead of words.
Images provide a richer set of information when building a search query for certain categories like fashion, home, and furniture — where visual appeal is very important to the customer. For these categories, snapping a picture of the specific furnishing or decor piece you’re looking for leads to better and more relevant results.
A picture is worth a thousand words
Examples of Visual Search Use Cases
- You see a celebrity in a magazine and want to find and buy the same dress they’re wearing.
- You find a lamp in a store you like and want to find one in a different color.
- You find a sectional couch in a store, but the price is a little over budget. You want to find similar alternatives that are more affordable.
These use cases show why a visual search experience is worth building: an image of a product can carry far richer information than a textual description of it.
To further prove the point, let’s get interactive and see visual search in action.
Submit a photo of a furniture product you’d like to find through this link on your cell phone.
The search engine should return similar-looking product results — pretty cool, right?
Now, I’m going to step back and show you how our team solved this problem from the ground up. I’m going to define exactly what visual search is and how we created an engine for it (NOTE: We will provide the actual solution at the end of the how section).
Visual search broadly refers to the ability to use visual information to run queries and retrieve search results.
How do you search for an image in a database of images?
Actually, we don’t — not in the literal sense at least! We cannot search using raw images because:
- There would be too much data in the raw image.
- There would be too much redundancy, as most raw images are made up of large swaths of similar or identical pixels.
- Raw images aren’t scale-, transformation-, or rotation-invariant. If you scale, transform, or rotate an image, you get a different set of pixels.
- Raw images aren’t robust to noise or changes in light exposure: a small change in lighting drastically changes the raw image’s pixels.
Essentially, searching for a set of pixels is incredibly difficult.
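To make the invariance problem concrete, here is a small sketch on a toy grayscale image (the image, the one-pixel shift, and the brightness offset are all made up for illustration). It counts how many pixel positions change under a tiny translation and a slight lighting change:

```python
# Toy 4x6 grayscale "image": a bright vertical bar on a dark background.
image = [
    [0, 0, 255, 255, 0, 0],
    [0, 0, 255, 255, 0, 0],
    [0, 0, 255, 255, 0, 0],
    [0, 0, 255, 255, 0, 0],
]


def shift_right(img, pixels=1, fill=0):
    """Translate the image to the right, padding the left edge with `fill`."""
    return [[fill] * pixels + row[:-pixels] for row in img]


def brighten(img, amount):
    """Simulate a lighting change by adding `amount` to every pixel (clamped at 255)."""
    return [[min(255, p + amount) for p in row] for row in img]


def fraction_changed(a, b):
    """Fraction of pixel positions whose values differ between two same-sized images."""
    pairs = [(pa, pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(pa != pb for pa, pb in pairs) / len(pairs)


shifted = shift_right(image)    # same object, moved by one pixel
brighter = brighten(image, 10)  # same object, slightly brighter scene

print(f"changed by shift:    {fraction_changed(image, shifted):.2f}")
print(f"changed by lighting: {fraction_changed(image, brighter):.2f}")
```

The object in the picture is identical in all three cases, yet a large share of the raw pixel values no longer match, which is why an exact pixel comparison is a poor search key.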
In the next section, we will discuss some of the history and technical details of how visual search is done.
Before we begin, it is important to understand the following two concepts:
- object detection — detecting that there is an object in an image or video
- and object recognition — identifying what that object is in the image or video
In visual search, we try to recognize objects in images by producing embeddings — vector representations of the images — and then comparing them to vectors in a database of vectors.
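As a rough illustration of what "comparing vectors" means, the sketch below computes cosine similarity between embeddings. The four-dimensional vectors are invented for the example (real embeddings have hundreds or thousands of dimensions), and cosine similarity is only one possible measure, not necessarily the one used in our system:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings: a query photo of a sofa and two catalog images.
query_sofa = [0.9, 0.1, 0.3, 0.0]
catalog_sofa = [0.8, 0.2, 0.4, 0.1]
catalog_lamp = [0.0, 0.9, 0.1, 0.7]

sim_sofa = cosine_similarity(query_sofa, catalog_sofa)
sim_lamp = cosine_similarity(query_sofa, catalog_lamp)

print(f"query vs. catalog sofa: {sim_sofa:.3f}")
print(f"query vs. catalog lamp: {sim_lamp:.3f}")
```

Visually similar products should end up with embeddings that point in nearly the same direction, so the sofa scores much higher than the lamp here.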
Feature extraction tries to overcome the limitations of using the raw values of the input data. It removes redundant and non-informative features while preserving the useful information in fewer dimensions. It can be considered a combination of feature engineering (coming up with a number of attributes using domain-specific knowledge) and dimensionality reduction (removing redundant information).
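A deliberately simple example of a hand-engineered feature is an intensity histogram. The sketch below is a toy illustration (it is not SIFT or SURF, and the images and bin count are made up): it reduces twelve pixel values to four numbers, and the result does not change when the object moves by a pixel.

```python
def intensity_histogram(img, bins=4, max_value=255):
    """Count pixels per intensity bin and normalize so the bins sum to 1."""
    counts = [0] * bins
    total = 0
    for row in img:
        for p in row:
            # Map the pixel value to a bin index in [0, bins - 1].
            idx = min(p * bins // (max_value + 1), bins - 1)
            counts[idx] += 1
            total += 1
    return [c / total for c in counts]


# A bright bar on a dark background, and the same bar moved one pixel right.
image = [
    [0, 0, 255, 255, 0, 0],
    [0, 0, 255, 255, 0, 0],
]
shifted = [
    [0, 0, 0, 255, 255, 0],
    [0, 0, 0, 255, 255, 0],
]

features = intensity_histogram(image)
shifted_features = intensity_histogram(shifted)

print(features)
print(features == shifted_features)
```

The price of this kind of feature is that it throws away all spatial structure, which is the kind of trade-off that more sophisticated descriptors try to handle better.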
People have come up with various feature extraction methods to describe objects in images. These approaches have worked in different domains pretty nicely and have produced a lot of papers in computer vision.
We decided to use feature extraction for our Hayneedle project.
SIFT (Scale Invariant Feature Transform)
If there is one paper that profoundly changed research in computer vision before deep learning took over, it is the SIFT paper. It defined a way to extract features that describe objects in a scale-invariant manner (through what is called a key point detector). Essentially, through the method described in the paper, we can single out features that are actually useful to describe the object in an image.
The SIFT paper and a lot of derivative work following it attempted to engineer different ways of extracting features that describe objects or images in scale-, rotation-, and transformation-invariant ways.
The SURF paper is another good piece in a similar line of work. However, the SURF paper and many follow-up papers always injected some knowledge about the images they operate on (edge and image structure) and did not adapt very well to different image domains.
So, we thought: There must be a better way to define feature vectors from images in a much more dynamic manner.
Here be dragons
History of Deep Learning
Deep Learning as a phrase was coined in 1986 by Rina Dechter. You can learn more about its history here.
There were successful applications of Convolutional Neural Networks (CNNs), such as the recognition of handwritten zip codes in 1989, and of MPCNNs (Max-Pooling Convolutional Neural Networks) in 2007. However, none of the deep learning methods at the time could match the effectiveness and accuracy of SIFT and SURF feature detection. Therefore, the computer vision community did not widely adopt deep learning methods.
But then, a breakthrough happened.
ILSVRC 2012 ImageNet Competition
SuperVision won the ILSVRC 2012 ImageNet competition with an error rate of only 16% (the runner-up had a 26% error rate) and described the winning network in the AlexNet paper. This paper significantly improved the state of the art and marked the beginning of a new deep learning era. It also set the stage for most computer vision researchers to start developing neural network architectures for a variety of tasks, which made deep learning research flourish.
AlexNet is the ImageNet-winning convolutional neural network that the SuperVision team competed with.
The implementation looks like the following in PyTorch:
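What follows is a minimal sketch rather than the original snippet. It is modeled on the single-GPU layout popularized by torchvision, so the channel widths, the adaptive pooling layer, and the default of 1000 classes are assumptions taken from that layout, and they differ in details from the network described in the paper.

```python
import torch
from torch import nn


class AlexNet(nn.Module):
    """AlexNet-style CNN: five conv layers followed by a dropout classifier."""

    def __init__(self, num_classes=1000, dropout=0.5):
        super().__init__()
        # Feature extractor: convolutions with ReLU and max pooling in between.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        # Fixes the spatial size so the classifier input is always 256 * 6 * 6.
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        # Classifier: fully connected layers regularized with dropout.
        self.classifier = nn.Sequential(
            nn.Dropout(p=dropout),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=dropout),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)  # (batch, 256 * 6 * 6)
        return self.classifier(x)


# Forward pass on a random batch of one 224x224 RGB image.
model = AlexNet(num_classes=10).eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)
```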
AlexNet in PyTorch
The neural network architecture has five convolutional layers, and between those convolutional layers there are ReLU (Rectified Linear Unit) and max pool layers. ReLUs add non-linearity to the architecture, and max pool layers reduce dimensionality.
At the end of the network, there is a classifier component that uses dropout to reduce overfitting and outputs class labels.
After the results and findings of the AlexNet paper were published, deep learning received more and more attention in the research community, who found more and more applications for it. In particular, the computer vision domain started utilizing deep learning models for various tasks such as classification and recognition.
The AlexNet paper became one of the most referenced papers (over 30,000 reads) in machine learning and computer vision. For the first time in a while, it showed significant improvement for solving large-scale image classification challenges over prior methods.
This neural network architecture not only solved the feature extraction portion of the “how to do visual search” problem, but also solved the feature selection portion. Feature selection refers to finding and identifying the relevant subset of features out of total features.
Essentially, this neural network architecture could consume labeled images during training and learn the weights. The resulting system can either classify or produce feature vectors for new images.
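The "classify or produce feature vectors" idea can be sketched with a toy network. `TinyNet`, its layer sizes, and the `embed` method are hypothetical names made up for this example, and the network is untrained. The point is only that the same backbone feeds both outputs:

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Toy stand-in for a trained CNN: a feature backbone plus a classifier head."""

    def __init__(self, embedding_dim=8, num_classes=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 4, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # (batch, 4, 1, 1)
            nn.Flatten(),             # (batch, 4)
            nn.Linear(4, embedding_dim),
        )
        self.head = nn.Linear(embedding_dim, num_classes)

    def embed(self, x):
        """Feature vector taken just before the classifier head."""
        return self.backbone(x)

    def forward(self, x):
        """Class scores computed from the same feature vector."""
        return self.head(self.embed(x))


model = TinyNet().eval()
images = torch.randn(2, 3, 32, 32)  # two random "images"

with torch.no_grad():
    embeddings = model.embed(images)  # used for similarity search
    class_scores = model(images)      # used for classification

print(embeddings.shape, class_scores.shape)
```

For visual search, the classifier head is effectively a training aid: what gets stored and compared is the embedding.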
There is now a variety of neural network architectures — our team actually experimented with multiple models and decided to use the PNasNet5 model.
Jet Visual Search
Following AlexNet, researchers started designing different types of neural network architectures. By creating wider and deeper neural networks, they started getting better and better at solving different problems. However, this also required much manual work in architecture design, and not all architectures are straightforward to optimize.
Because of this, some researchers looked into how to search automatically for the best possible neural network architecture. This is a form of hyperparameter optimization known as neural architecture search.
This line of work resulted in different types of algorithms to find optimal neural network architectures. One of the promising papers from this line of research was NASNet, which searched for a neural network architecture using reinforcement learning.
PNasNet5 extends the original work of NASNet through this paper and shows how to find a new neural network architecture progressively.
The following snippet shows PNasNet5 implementation in PyTorch.
PNasNet5 Model in PyTorch
For our system, we used a pre-trained TensorFlow model from Google, and we retrained it using our catalog so that its predictions were more relevant.
Now that we’ve covered the neural network architecture, we’ll focus on our serving platform. The model is served by APIs, and the following is the API architecture.
We have three main APIs that are separated by functionality. The architecture consists of the following components:
1. TEFA (TensorFlow Serving API)
This API layer provides embeddings, or vector representations of the images. Its 99th-percentile latency is 856 ms. It is a light HTTP wrapper on top of TensorFlow Serving. The whole endpoint is a few lines of Python:
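The sketch below is not the production endpoint. It only illustrates the shape of such a wrapper, with the web framework left out. The JSON field names (`image`, `embedding`), the base64 encoding of the image, and the `predict_fn` callable that stands in for the call to the model server are all assumptions made for this example.

```python
import base64
import json


def handle_embedding_request(body, predict_fn):
    """Decode a JSON request, run the model, and return a JSON response.

    `body` is the raw HTTP request body. `predict_fn` is a hypothetical
    callable that forwards image bytes to the model server and returns
    the embedding as a sequence of floats.
    """
    payload = json.loads(body)
    image_bytes = base64.b64decode(payload["image"])  # hypothetical field name
    embedding = predict_fn(image_bytes)
    response = {"embedding": [float(x) for x in embedding]}
    return json.dumps(response).encode("utf-8")


# Usage with a fake model so the example runs without a model server.
def fake_predict(image_bytes):
    return [len(image_bytes) / 10.0, 0.5, 0.25]


request_body = json.dumps(
    {"image": base64.b64encode(b"fake-jpeg-bytes").decode("ascii")}
)
response = json.loads(handle_embedding_request(request_body, fake_predict))
print(response)
```

Keeping the model call behind a plain callable makes the wrapper easy to test and leaves the transport to the model server (HTTP or gRPC) as a separate decision.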
2. VISA (Visual Search API):
This API layer searches for an embedding across a database of embeddings. We use FAISS for our image database, and it’s very fast (latency: 60 to 70 ms)! For a given image embedding, it returns the indices of the closest image embeddings:
Here, FAISS searches for the embeddings closest to the original anchor embedding. After finding them, we use the indices returned by FAISS to look up the corresponding products and pass their product information to Hayneedle (very similar to vector similarity comparison).
3. CUFA (Customer Facing API)
CUFA is our orchestration layer, which fetches the embedding for an image from TEFA and submits that embedding to VISA to search the FAISS index.
Now, I will explain the indexing portion of the FAISS.
Each image in the catalog is first converted into a feature vector. These feature vectors are then stacked into a feature matrix. FAISS makes it efficient to compute the similarity between a given embedding (N×1) and the catalog embeddings (an M×N matrix).
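To show what this comparison computes, here is a brute-force sketch in NumPy instead of FAISS. It is conceptually what an exact (flat) L2 index does, without any of the optimizations that make FAISS fast at scale. The three-dimensional embeddings and the `product_ids` list are made up for the example.

```python
import numpy as np


def build_catalog_matrix(embeddings):
    """Stack per-image embeddings (each of length N) into an (M, N) float32 matrix."""
    return np.asarray(embeddings, dtype=np.float32)


def search(catalog, query, k=2):
    """Return indices and squared L2 distances of the k closest catalog rows."""
    diffs = catalog - query                    # broadcasts (M, N) - (N,)
    distances = np.sum(diffs * diffs, axis=1)  # one distance per catalog image
    nearest = np.argsort(distances)[:k]
    return nearest, distances[nearest]


# Hypothetical catalog: row i of the matrix belongs to product_ids[i].
product_ids = ["sofa-1", "lamp-7", "sofa-2", "rug-3"]
catalog = build_catalog_matrix([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])

query = np.array([1.0, 0.05, 0.0], dtype=np.float32)
indices, distances = search(catalog, query)

# Map the returned row indices back to products.
matches = [product_ids[i] for i in indices]
print(matches)
```

The search itself only ever returns row indices, so a separate lookup from index to product is always needed before results can be shown to a customer.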
Production readiness broadly describes whether a software system is ready for use by customers. We want the system to be stable, maintainable, scalable, resilient and well-documented. Further, we want to ensure we have proper monitoring and alerting ready so that we can react to failures.
We use NewRelic to monitor the API layer. NewRelic’s APM functionality is very helpful for visibility purposes on the API layer, and it requires minimal code changes.
As for alerting, we use Splunk, which provides further visibility on our indexing pipeline.
NewRelic’s distributed tracing allows us to trace a request across the three APIs. CUFA’s response time is the sum of the network time and the response times of the other two APIs. Since we need an embedding before making a search, the calls to TEFA and VISA must be made sequentially. In this particular trace, the embedding generation is slow. VISA is generally pretty fast.
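The sequential dependency can be sketched as follows. This is an illustration only: `embed_fn` and `search_fn` are hypothetical callables standing in for the TEFA and VISA calls, and the sleeps in the fake services are arbitrary numbers, not our measured latencies.

```python
import time


def search_by_image(image_bytes, embed_fn, search_fn):
    """Call the embedding service, then the search service, timing each hop."""
    timings = {}

    start = time.perf_counter()
    embedding = embed_fn(image_bytes)  # must finish before the search can start
    timings["embed_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    results = search_fn(embedding)
    timings["search_ms"] = (time.perf_counter() - start) * 1000

    # The calls are sequential, so the latencies add up.
    timings["total_ms"] = timings["embed_ms"] + timings["search_ms"]
    return results, timings


# Fake services so the example runs on its own.
def fake_embed(image_bytes):
    time.sleep(0.02)  # pretend the model is the slow part
    return [0.1, 0.2, 0.3]


def fake_search(embedding):
    time.sleep(0.005)  # pretend the index lookup is fast
    return ["product-1", "product-2"]


results, timings = search_by_image(b"fake-image", fake_embed, fake_search)
print(results)
print({name: round(ms, 1) for name, ms in timings.items()})
```

Because neither call can be overlapped with the other, any speed-up in embedding generation translates directly into a faster overall response.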
CUFA also logs each request and response to Kafka, which aims to provide a feedback loop to our model so that it can learn from user-submitted images and continuously improve.
CUFA’s 99th-percentile latency is 985 ms.
Our use case is that a user submits an image (taken in real life or from the Internet) to search and potentially buy on our eCommerce home furnishing and decor website Hayneedle.com.
Pics or it didn’t happen, am I right?
Try It For Yourself
If you didn’t already try out our search engine, give it a go through this link. Please note that it needs to be accessed on a mobile device, or you can switch a desktop browser to mobile mode.
All you have to do is submit an image, and voila! You should get back relevant and similar looking furniture!
We take pride in the work we did to create a visual search engine. However, we also recognize that we have a lot of room for improvement. Here’s some future work that we plan to do:
- We do not need to communicate large raw image bytes over the wire in JSON. We can use binary or protobuf format to decrease payload size and speed up serialization and deserialization times between services.
- We do not need to communicate over HTTP and can instead use gRPC, since TensorFlow model serving is already using gRPC. This would remove the overhead of the HTTP layer in front of the TensorFlow model server. Unfortunately, we can currently only register HTTP services on our deployment platform, so we need to wrap the gRPC services in HTTP ones to make them available to the rest of our services.
- We currently do not do any object detection on images. If we were to detect objects and crop them from the images, it would result in better search results. We would be able to eliminate the background and other portions of the image that aren’t relevant to the search for the object.
- We have two to three hops between client and server, which are needed to generate an embedding. It would be nice if we could generate embeddings on the device and submit those embeddings for search, improving the server-side response time. TensorFlow Mobile already supports this on iOS and Android. TensorFlow.js can be used for browser-side embedding generation.
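As a rough illustration of the first item in the list above, the sketch below compares the size of an image sent as raw bytes with the same image embedded in JSON. It assumes the bytes are base64-encoded inside the JSON body, which is a common approach but an assumption here, and it uses random bytes as a stand-in for a real photo.

```python
import base64
import json
import os

# Stand-in for roughly 300 KB of JPEG data.
raw_image = os.urandom(300_000)

# JSON cannot carry raw bytes, so the image is base64-encoded into a string.
json_payload = json.dumps(
    {"image": base64.b64encode(raw_image).decode("ascii")}
).encode("utf-8")
binary_payload = raw_image

overhead = len(json_payload) / len(binary_payload)
print(f"JSON payload:   {len(json_payload)} bytes")
print(f"binary payload: {len(binary_payload)} bytes")
print(f"size ratio:     {overhead:.2f}")
```

Base64 turns every three bytes into four characters, so the JSON body is about a third larger before any serialization cost is counted.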
Of course, we’re always looking to improve, so feedback is welcome. You can ping me via Twitter @bugraa.
Lastly, we’re always searching for more people to join our team and solve interesting problems like optimizing visual search. Interested? Come #WorkPurple, and apply to open positions on our Jet Tech Careers page!