Target image on left, recommendations generated by our model on the right. Outfits from DeepFashion, open-source by Liu Z. et al.

Modern RecSys

Convolutional Neural Networks Recommender

We will build a recommender with transfer learning, Spotify’s Annoy, and PyTorch that returns visually similar products across 240K images in 2ms

Kai Xin Thia · Published in Analytics Vidhya · 7 min read · Mar 20, 2020


This is part of my Modern Visual RecSys series; feel free to check out the rest of the series at the end of the article.

The Data

We will be using a subset of the DeepFashion data open-sourced by Liu Z. et al. of The Chinese University of Hong Kong. Our data consists of 280K fashion images across 46 categories. You can download the data from their website.

Furthermore, the team has released an updated version with additional data. You will need to fill out a Google form to gain access to it.

What is Convolution?

Convolution is not a new technique. In essence, we apply a kernel to every pixel in the image to achieve a goal, usually to blur, sharpen, or detect edges/objects. For each pixel, we take the elementwise product with the kernel, then sum the result to get a single number.
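Here is a minimal sketch of that operation in NumPy, assuming a grayscale image and a “valid” convolution with no padding (the function name convolve2d is my own; note that, like most CNN libraries, this slides the kernel without flipping it):

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the image: elementwise product, then sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1  # "valid" mode: no padding, output shrinks
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Elementwise product with the kernel, summed to a single number
            output[i, j] = np.sum(patch * kernel)
    return output
```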

Let us walk through an example with the Image Kernels tool developed by Victor Powell.

The base image. Source: Image Kernels by Victor Powell

We can see that every pixel of the image has a color value associated with it, where white = 255 and black = 0.

Picking the kernel. Source: Image Kernels by Victor Powell

Next, we pick the kernel. The kernel can be of any size, though a small kernel will, of course, take longer to scan through a large image. Sobel is a ubiquitous edge detection algorithm with built-in smoothing, which makes it less susceptible to noise. Notice that there are different kinds of Sobel kernels (top, bottom, left, right) and, true to their names, these kernels are designed to pick up specific components of the image.

This is convolution. Source: Image Kernels by Victor Powell

As you can see from the animation, we are essentially moving a 3x3 kernel across the image, generating new scores and assigning them to the output image. You will notice that after applying the bottom Sobel, only parts of the output image are highlighted in white; these white sections are the bottom edges detected by the bottom Sobel.
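As a hypothetical follow-up to the convolve2d sketch above, applying the bottom Sobel kernel looks like this (the toy image is random noise, just to show the shapes):

```python
import numpy as np

# Bottom Sobel: positive weights on the bottom row, negative on the top,
# so it responds strongly where the image is brighter below than above.
bottom_sobel = np.array([
    [-1, -2, -1],
    [ 0,  0,  0],
    [ 1,  2,  1],
])

image = np.random.rand(8, 8)             # toy grayscale "image" of random noise
edges = convolve2d(image, bottom_sobel)  # reuses the sketch above
print(edges.shape)                       # (6, 6): valid convolution shrinks the output
```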

Since each kernel specializes in detecting one aspect of the image, you can imagine stacking up different kernels to formulate a comprehensive strategy. That is indeed the case: a collection of kernels is called a filter. In CNNs, we can even stack multiple layers of filters, with each filter designated a specific task.

If you are interested, you should try out the tool yourself with different types of kernels. CNN is a fascinating model as it combines the power of convolution and neural networks. There are many different architectures, but they generally consist of a combination of convolution, subsampling, activation, and full connectedness, as noted by Algobeans and sketched below. You can learn more about kernels and CNNs under the additional resources section.
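As a rough illustration (the layer sizes here are my own, not from the article), those four components map onto a tiny PyTorch model like this:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution: 16 learned 3x3 filters
    nn.ReLU(),                                    # activation
    nn.MaxPool2d(2),                              # subsampling: halve the resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # a second filter bank
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 46),                  # full connectedness: 46 categories
)

x = torch.randn(1, 3, 64, 64)  # a batch of one 64x64 RGB image
print(model(x).shape)          # torch.Size([1, 46])
```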

Why CNN for Visual Recommendations?

Closeup of filters 14, 242, 250 and 253. Source: Recommending music on Spotify slides by Sander Dieleman

Now is an excellent time to review what we learned back in part 1 of this series. Sander from Spotify designed a CNN with filters to detect different types of music based on their frequency patterns. CNNs open up a brand new way of recommending music that is intuitive, as it is based on analyzing and understanding the structure of music. Machines lack the natural ability to comprehend and appreciate music; CNNs help bridge the gap.

The power of CNNs lies in their ability to break down a complex visual problem into layers of filters; quite often, we can visualize these filters to gain an intuition of what the model is trying to learn.

CNN based recommendation. Source: PoshNet by Summer Yuan

Thus, our goal is to build a CNN that can recommend items based on visual similarity with the input image. CNNs can be applied across a wide variety of visual problems, and I have collected a list of great articles below. Note that in the next chapter, we will adapt our CNN flow to identify clusters of X-ray images with similar severity of infection.

Seriously-Infected X-ray scan with 36 most similar scans generated by our model. Source: COVID-19 image data collection by Joseph Cohen

Transfer Learning: Leverage Pre-Trained Deep CNN

Benchmark Analysis of Representative Deep Neural Network Architectures (2018). Source: arxiv by S. Bianco et al.

For most real-world deployments, we do not train a CNN from scratch. Organizations like Microsoft Research have released state-of-the-art, large-scale, pre-trained deep CNN (DCNN) models over the years, and we should leverage their work by training on top of their baseline models. This is known as transfer learning.
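A minimal transfer-learning sketch with torchvision’s pre-trained ResNet-18, assuming our 46 DeepFashion categories as the new target; the article’s actual code uses fastai, so treat this as a plain-PyTorch illustration:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)  # weights pre-trained on ImageNet
for param in model.parameters():
    param.requires_grad = False           # freeze the pre-trained backbone

# Replace the final fully connected layer with a fresh head for our task;
# only this layer will be trained.
model.fc = nn.Linear(model.fc.in_features, 46)
```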

ResNet Architectures. 18-layer ResNet is an excellent baseline to test model while 152-layer is a great general-purpose model. Source: Deep Residual Learning for Image Recognition by He K. et al.

One of the standard pre-trained DCNNs is ResNet. Deeper networks have the potential to better represent the input function. The problem with deep networks is the vanishing gradient: backpropagation multiplies many small numbers together, so the gradients in the early layers shrink toward zero. ResNet solves this problem with identity shortcut connections that skip one or more layers, allowing us to construct very deep networks that generalize well over a variety of problems. See the further readings section for more details on ResNet.
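A minimal sketch of the identity shortcut, assuming same-shape input and output so the skip connection is a plain addition (real ResNet blocks also use batch normalization and strided/projection shortcuts):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        # Identity shortcut: add the input back, so the gradient always has
        # a direct path around the convolutions during backpropagation.
        return self.relu(out + x)
```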

Approximate Nearest Neighbors with Annoy

If we only have a small corpus of images to search, simple distance metrics like cosine similarity will work. In real-world deployments, such as e-commerce, we usually have millions of images to compare, and it is impractical for the API to execute pair-wise comparisons across every single image. Annoy (Approximate Nearest Neighbors Oh Yeah), built by Erik Bernhardsson at Spotify, offers an easy-to-use API that can be integrated into our PyTorch workflow.

More importantly, it helps us find closest neighbors without the need to calculate pair-wise distance across every single image.
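A hedged sketch of the Annoy API, using random vectors as stand-ins for real image embeddings; the 512 dimensions, tree count, and file name are illustrative assumptions:

```python
import random
from annoy import AnnoyIndex

dim = 512                           # e.g. the size of a DCNN feature vector
index = AnnoyIndex(dim, 'angular')  # angular distance ~ cosine similarity
for i in range(1000):               # random vectors in place of real embeddings
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
index.build(50)                     # 50 trees: more trees = better recall, larger index
index.save('fashion.ann')           # the saved index is memory-mapped from disk

neighbors = index.get_nns_by_item(0, 10)  # 10 approximate neighbors of item 0
```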

If you are interested in learning more about Annoy, do check out Erik’s article under further readings.

The Code

Please refer to https://towardsdatascience.com/building-a-personalized-real-time-fashion-collection-recommender-22dc90c150cb

Reviewing Results from the Recommender

Let us take a look at the results of the recommender. We observe that some items are easier to recommend, such as this striped sweater.

Target image on left, recommendations generated by our model on the right. Outfits from DeepFashion, open-source by Liu Z. et al.

White jeans are a little harder; we seem to end up with a mix of leggings, black pants, and blue jeans.

Target image on left, recommendations generated by our model on the right. Outfits from DeepFashion, open-source by Liu Z. et al.

This…interesting outfit results in a very diverse set of recommendations. It seems challenging to match complex colors, layers, and outfits.

Target image on left, recommendations generated by our model on the right. Outfits from DeepFashion, open-source by Liu Z. et al.

What have we learned

In this chapter, we explored the use of CNNs in recommendations. We used a couple of advanced techniques here, but with modern tools like ResNet, fastai, and Annoy, we can deploy a powerful recommender that generates new recommendations instantly.

Explore the rest of Modern Visual RecSys Series

Series labels:

  • Foundational: general knowledge and theories, minimum coding experience needed.
  • Core: more challenging materials with code.
  • Pro: difficult materials and code, with production-grade tools.

Further Readings
