Convolutional Neural Networks Recommender
We will build a recommender with transfer learning, Spotify’s Annoy, and PyTorch that returns visually similar products across 240K images in 2ms
This is part of my Modern Visual RecSys series; feel free to check out the rest of the series at the end of the article.
We will be using a subset of DeepFashion data open-sourced by Liu Z. et al., The Chinese University of Hong Kong. Our data consists of 280K fashion images across 46 categories. You can download the data from their website.
What is Convolution?
Convolution is not a new technique. In essence, we apply a kernel to every pixel in the image to achieve a goal, usually to blur, sharpen or detect edges/objects. For each pixel, we take the elementwise product of the kernel and the pixel's neighborhood, then sum the result to get a single number.
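To make the arithmetic concrete, here is a minimal sketch for a single output pixel. The patch values are made up, and the kernel is a commonly used 3x3 sharpen kernel chosen only for illustration; `apply_kernel` is a hypothetical helper name.

```python
# A 3x3 neighborhood of grayscale pixel values (made-up numbers).
patch = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]

# A commonly used 3x3 sharpen kernel (illustrative choice).
kernel = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]


def apply_kernel(patch, kernel):
    """Multiply matching entries of patch and kernel, then sum to one number."""
    return sum(
        p * k
        for patch_row, kernel_row in zip(patch, kernel)
        for p, k in zip(patch_row, kernel_row)
    )


output_pixel = apply_kernel(patch, kernel)
print(output_pixel)  # 50
```

The center pixel contributes 5 * 50 = 250 and its four direct neighbors subtract 20 + 40 + 60 + 80 = 200, leaving 50.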
Let us walk through an example with the image kernels tool developed by Victor Powell.
We can see that every pixel of the image has a color value associated with it, where white = 255 and black = 0.
Next, we will pick the kernel. The kernel can be of any size, but the size affects the cost: each output pixel requires one multiplication per kernel entry. Sobel is a ubiquitous edge detection algorithm with built-in smoothing, which makes it less susceptible to noise. Notice how there are different kinds of Sobel (top, bottom, left, right); as their names suggest, these kernels are designed to pick up specific components of the image.
As you can see from the animation, we are essentially moving a 3x3 kernel across the image, generating new scores and assigning them to the output image. You will notice that after applying the bottom Sobel, only parts of the output image are highlighted in white; these white sections are the bottom edges detected by the bottom Sobel.
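The sliding-window process can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the kernel weights are the commonly quoted bottom Sobel values, the 5x5 image is invented, there is no padding, and scores are clamped to 0-255 the way an image viewer would display them. `convolve` is a hypothetical helper, not part of the image kernels tool.

```python
BOTTOM_SOBEL = [
    [-1, -2, -1],
    [0, 0, 0],
    [1, 2, 1],
]


def convolve(image, kernel):
    """Slide the kernel over every position where it fully fits (no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Elementwise product of the kernel and the window, then sum.
            score = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            # Clamp to the displayable grayscale range.
            row.append(max(0, min(255, score)))
        output.append(row)
    return output


# Toy image: dark (0) at the top, bright (255) in the bottom two rows.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255],
]

edges = convolve(image, BOTTOM_SOBEL)
for row in edges:
    print(row)
```

Only the windows that straddle the dark-to-bright transition score above zero. Flipping the image vertically (bright on top) produces negative scores that clamp to black, which is why this kernel responds to one edge direction only.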
Since each kernel specializes in detecting one aspect of the image, you can imagine us stacking up different kernels to formulate a comprehensive strategy. That is indeed the case, and a collection of kernels is called a filter. In a CNN, we can even stack multiple layers of filters, with each filter assigned a specific task.
If you are interested, you should try out the tool yourself with different types of kernels. CNN is a fascinating model as it combines the power of convolution and neural networks. There are many different architectures, but they generally consist of a combination of convolution, subsampling, activation and fully connected layers, as noted by Algobeans. You can learn more about kernels and CNN under the additional resources section.
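As a rough illustration of how those four building blocks fit together in PyTorch, here is a minimal sketch. `TinyCNN` is a hypothetical toy model: the layer sizes and the 64x64 input are arbitrary assumptions, and the 46 outputs simply mirror the 46 categories in the dataset description.

```python
import torch
from torch import nn


class TinyCNN(nn.Module):
    """Hypothetical toy model; layer sizes are arbitrary illustrative choices."""

    def __init__(self, num_classes: int = 46):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: 16 learned filters
            nn.ReLU(),                                   # activation
            nn.MaxPool2d(2),                             # subsampling: halves height and width
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Fully connected layer: 32 channels of 16x16 feature maps to class scores.
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        return self.classifier(x)


model = TinyCNN()
batch = torch.randn(4, 3, 64, 64)  # 4 random stand-in images, 3 channels, 64x64 pixels
logits = model(batch)
print(logits.shape)  # torch.Size([4, 46])
```

Unlike the hand-picked Sobel kernel, the weights inside each `nn.Conv2d` are learned during training.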
Why CNN for Visual Recommendations?
Now is an excellent time to review what we learned back in part 1 of this series. Sander from Spotify designed a CNN with filters to detect the different types of music based on their frequency patterns. CNN opens a brand new way of recommending music that is intuitive, as it is based on analyzing and understanding the structure of music. Machines lack the natural ability to comprehend and appreciate music; CNNs help bridge the gap.
The power of CNN is its ability to break down a complex visual problem into layers of filters — quite often we can visualize these filters to gain an intuition of what the model is trying to learn.
Thus, our goal is to build a CNN that can recommend items based on visual similarity with the input image. CNN can be applied across a wide variety of visual problems, and I have collected a list of great articles below. Note that in the next chapter we will adapt our CNN flow to identify clusters of X-ray images with similar severity in infection.
- Searching for Visually Similar Artworks by SensiLab, Monash
- Finding Familiar Faces (in Anime Characters) with a Tensorflow Object Detector, Pytorch Feature Extractor, and Spotify’s Annoy by Michael Sugimura
- Fastai — Image Similarity Search — Pytorch Hooks & Spotify’s Annoy by Abhik Jha
- Similar Images Recommendations using FastAi and Annoy by Gautham Kumaran
Transfer Learning: Leverage Pre-Trained Deep CNN
For most real-world deployments, we do not train a CNN from scratch. Organizations like Microsoft Research have released state-of-the-art, large-scale, pre-trained deep CNN (DCNN) models over the years, and we should leverage their work by training on top of their baseline models. This is known as transfer learning.
One of the standard pre-trained DCNNs is ResNet. Deeper networks have the potential to better represent the input function. The problem with deep networks is the vanishing gradient problem, as we need to multiply small numbers repeatedly during backpropagation. ResNet solves this problem with the identity shortcut connection that skips one or more layers, allowing us to construct very deep networks that generalize well over a variety of problems. See the further readings section for more details on ResNet.
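A toy calculation shows why repeated multiplication is a problem and how the shortcut helps. This is a deliberately simplified scalar model, not how ResNet is implemented: it assumes every layer has the same local gradient of 0.1, and both function names are hypothetical.

```python
def gradient_through_plain_layers(local_grad: float, depth: int) -> float:
    """Chain rule for f(f(...f(x))): multiply the local gradient `depth` times."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_grad
    return grad


def gradient_through_residual_layers(local_grad: float, depth: int) -> float:
    """Each residual layer computes x + f(x), so its local gradient is 1 + f'(x)."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + local_grad
    return grad


plain = gradient_through_plain_layers(0.1, depth=20)
residual = gradient_through_residual_layers(0.1, depth=20)
print(f"plain: {plain:.1e}, residual: {residual:.2f}")
```

With 20 plain layers the gradient shrinks to roughly 1e-20, which is effectively zero. With the identity path, every layer contributes a 1 to its local gradient, so the product no longer collapses towards zero. Real networks use matrices rather than scalars, but the intuition carries over.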
Approximate Nearest Neighbors with Annoy
If we only have a small corpus of images to search, simple distance metrics like cosine similarity will work. In real-world deployments, such as e-commerce, we usually have millions of images to compare with each other, and it will be impractical for the API to execute pair-wise comparisons across every single image. Annoy (Approximate Nearest Neighbors Oh Yeah) was created by Erik Bernhardsson at Spotify and has an easy-to-use API that can be integrated into our PyTorch workflow.
More importantly, it helps us find the closest neighbors without the need to calculate the pair-wise distance across every single image.
If you are interested in learning more about Annoy, do check out Erik’s article under further readings.
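For comparison, the exhaustive search that works for a small corpus can be sketched in plain Python. The item names and 3-dimensional embeddings are hypothetical; real image embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, corpus, top_k=2):
    """Score the query against every item, so cost grows linearly with corpus size."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in corpus.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical embeddings for four products.
corpus = {
    "striped_sweater": [0.9, 0.1, 0.0],
    "plain_sweater": [0.8, 0.3, 0.1],
    "blue_jeans": [0.1, 0.9, 0.2],
    "leggings": [0.0, 0.8, 0.5],
}

print(most_similar([1.0, 0.2, 0.0], corpus))  # ['striped_sweater', 'plain_sweater']
```

Every query touches every item in the corpus, which is the cost that becomes impractical at millions of images.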
Reviewing Results from the Recommender
Let us take a look at the results of the recommender. For example, we observe that some items are easier to recommend, such as this striped sweater.
White jeans are a little harder; we seem to end up with a mix of leggings, black pants and blue jeans.
This…interesting outfit, meanwhile, results in a very diverse set of recommendations. It seems challenging to match complex colors, layers, and outfits.
What have we learned
In this chapter, we explored the use of CNN in recommendations. We used a couple of advanced techniques here, but with modern tools like ResNet, Fastai, and Annoy, we can deploy a powerful recommender that generates new recommendations instantly.
Explore the rest of Modern Visual RecSys Series
Modern Visual RecSys: How does a recommender work? [Foundational]
In this series of articles, I will introduce modern approaches to visual recommender systems. We begin with a case…
Modern Visual RecSys: How to Design a Recommender? [Foundational]
For this chapter, I will introduce the RecSys Design Framework with a case study of Amazon.
Modern Visual RecSys: Intro to Visual RecSys [Core]
We will explore the “hello world” data for visual models, the FashionMNIST dataset from Zalando with PyTorch…
Modern Visual RecSys: COVID-19 Case Study with CNN [Pro]
We will cluster COVID-19 X-ray images based on severity with our CNN RecSys flow using transfer learning, Spotify’s…
Building a Personalized Real-Time Fashion Collection Recommender [Pro]
We will make use of transfer learning, approximate nearest neighbors, and embeddings centroid detection in PyTorch to…
Temporal Fashion Recommender [Pro]
Building a Recommender That Evolves with Seasons
The Future of Visual Recommender Systems: Four Practical State-Of-The-Art Techniques [Pro]
The future of visual RecSys is an exciting one. Let us explore some of the most cutting edge techniques and ideas that…
- Foundational: general knowledge and theories, minimum coding experience needed.
- Core: more challenging materials with code.
- Pro: Difficult materials and code, with production-grade tools.
Further Readings
- Types of Convolution Kernels : Simplified
- Image convolution examples
- Convolutional Neural Networks (CNNs) explained by deeplizard (Video)
- Convolutional Neural Networks (CNN) Introduction
- A Beginner’s Guide To Understanding Convolutional Neural Networks
- Let’s Build a Fashion-MNIST CNN, PyTorch Style
- PoshNet: You personal virtual closet
- An Overview of ResNet and its Variants
- Review: Inception-v4 — Evolved From GoogLeNet, Merged with ResNet Idea (Image Classification)
- A Simple Guide to the Versions of the Inception Network
- Erik: Annoy: Nearest neighbors and vector models — part 2 — algorithms and data structures