Robo Bill Cunningham: Shazam for Fashion With Deep Neural Networks
Our story started when one of us was gifted a rather interesting piece of clothing:
With such a unique item at hand, one question immediately came to mind:
- What do people even wear with this?
Stumped as we were, we figured there must be thousands of images online of savvy people wearing this type of shirt with the right pants, jacket, shoes, bag, etc. Unfortunately, we couldn’t find a great search tool out there to perform a query like, “Given this particular shirt, find me pictures of people wearing it with other clothing.” But seeing the potential utility of a tool that could execute queries like this, we set out to build a “fashion visual search engine.”
The Mind of Bill Cunningham
There’s a documentary out there about famed fashion street photographer Bill Cunningham. In it they recount a number of instances that exemplify Bill’s acute ability to spot nuanced commonalities across many clothing pieces. In one such instance, they mention how fashion designers hated it when they would create a dress one season only to watch it showcased in one of Bill’s centerfolds, juxtaposed against a very similar-looking dress made a decade earlier.
Without a doubt, Bill Cunningham has an incredible ability to discern clothing. One may wonder how he got that way. On top of being quite gifted, someone like Bill must also have taken notice of a lot of outfits throughout his 60-year career as a photographer. How many outfits? Assuming Bill works every day of the year (which isn’t a bad approximation) and shoots 10 outfits an hour for 8 hours a day, that works out to 10 × 8 × 365 × 60 ≈ 1.75 million, well over a million.
Here’s the motivating question: if we presented the same number of clothing images to an artificial neural network, could it learn to see the world of fashion the way Bill Cunningham does? Said in a less sensationalized way, what we’re proposing is training a neural network to recognize clothing in images and find us visually similar pieces. Accomplishing this would be a good start towards creating our paisley shirt outfit-finder.
So let’s jump into it. The first thing we did was gather image data. Lots of image data. Since Robo Bill Cunningham has to recognize an article of clothing in as many ways as it would normally appear in the world, our training images had to contain clothing depicted as such: worn, unworn, held, folded, rotated, in front of a tree, under terrible lighting, etc.
It was a gargantuan data curation task, but after several months and thousands of hours of work, we ended up collecting and annotating millions of clothing images from various retail and social media websites, all hand-labeled and hand-cropped by skilled interns recruited from FIT and Parsons.
Find Me Similar Products, Not Images
Now, with all this data painstakingly compiled, how do we build a visual search engine? Specifically, we want to create a function that embeds an image of a product into a metric space in which distance represents product similarity. This differs from image similarity, where we also care about the aforementioned ways in which the product is presented in the image. We create this function by training our neural network to output vector representations of images, which we can then look up quickly in a search tree.
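To make the lookup step concrete, here is a minimal numpy sketch. It assumes the network already hands us an embedding per image; the catalog vectors here are random stand-ins, and the search is brute-force cosine similarity rather than the actual search tree, purely for clarity:

```python
import numpy as np

# Stand-in catalog: one (fake) embedding per indexed product image.
rng = np.random.default_rng(42)
catalog = rng.normal(size=(1000, 128))
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)  # unit-normalize rows

def nearest_products(query_vec, k=5):
    """Indices of the k catalog images closest to the query in cosine distance."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = catalog @ q              # cosine similarity against every item
    return np.argsort(-scores)[:k]    # highest similarity first

# A slightly perturbed copy of item 7 should retrieve item 7 first.
query = catalog[7] + 0.01 * rng.normal(size=128)
print(nearest_products(query))
```

A production system would replace the brute-force scan with the search tree (or an approximate nearest-neighbor index), but the interface is the same: vector in, nearest product IDs out.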
We first train our neural net to perform a classification task — i.e. Which clothing categories are present in this image? Since we want our baby to grow into a true fashionista, we got fairly specific coming up with categories: dolman sleeves, surplice necklines, ruched detailing, epaulets, you name it.
By training our neural net to perform this task, we’ve basically created a function that takes in an image and outputs a vector representation in the form of a probability distribution over a set of categories. So far we’ve done it in such a way that similar images end up having similar distributions. This is close to what we want, and in the case of image search it’s precisely what we’d want. However, our focus is on product search, which means we’d want images depicting similar products to end up having similar probability distributions.
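Concretely, the “vector representation” at this stage is just a softmax over category scores. Here is a tiny sketch of that output head; the category names and logit values are made up for illustration:

```python
import numpy as np

# A few of our (real) fine-grained categories, with invented network scores.
categories = ["dolman sleeve", "surplice neckline", "ruched detailing", "epaulet"]
logits = np.array([2.0, 0.5, -1.0, 0.1])   # hypothetical raw outputs

def softmax(z):
    """Turn raw scores into a probability distribution over categories."""
    e = np.exp(z - np.max(z))              # subtract max for numerical stability
    return e / e.sum()

probs = softmax(logits)
print(dict(zip(categories, probs.round(3))))
```

Two images whose distributions are close (by, say, Euclidean or KL distance) are the ones the search engine treats as similar.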
In other words, we need to train our neural net to understand that a Burberry scarf worn around the neck and a Burberry scarf laid out on a bed are indeed the same product. We do this by showing it many pairs of images of products presented in various ways, labeling whether or not the items in each pair are the same.
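We haven’t spelled out the exact loss above, but one standard way to learn from such same/not-same pairs is a contrastive loss, which pulls embeddings of the same product together and pushes different products at least a margin apart. A minimal numpy sketch (the margin and vectors are illustrative):

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_product, margin=1.0):
    """Penalize distance for same-product pairs, closeness for different ones."""
    d = np.linalg.norm(emb_a - emb_b)
    if same_product:
        return d ** 2                      # same item: any separation is penalized
    return max(0.0, margin - d) ** 2       # different items: penalized if too close

# Two nearly identical embeddings, e.g. the same scarf worn vs. laid flat.
a = np.array([0.10, 0.90])
b = np.array([0.12, 0.88])
print(contrastive_loss(a, b, same_product=True))    # small: pair is already close
print(contrastive_loss(a, b, same_product=False))   # large: pair should be far apart
```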
With this piece in place, we’re ready to build a visual search engine. Here are shots of our boy Robo Billy in action.
Cartoon drawings and non-clothing input images seem to work as well…
Mapping the World of Fashion
Now for some fun stuff. Like a fashion cartographer, our neural net has the ability to map out the world of clothing products:
With the world of fashion products mapped out, we can do some interesting experiments. For instance, for any two products, we can figure out a “path” between them. That is to say, what string of similar-looking products allows us to morph between two given products?
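One plausible way to compute such a path (we’re not claiming this is the only method) is a greedy walk through embedding space: from the current product, hop to whichever of its nearest neighbors is closest to the destination. A sketch over random stand-in embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
catalog = rng.normal(size=(500, 16))   # fake product embeddings

def path_between(start, goal, k=10, max_hops=30):
    """Greedy chain of similar-looking products from `start` toward `goal`."""
    path = [start]
    current = start
    while current != goal and len(path) <= max_hops:
        d_cur = np.linalg.norm(catalog - catalog[current], axis=1)
        neighbors = np.argsort(d_cur)[1:k + 1]        # k nearest, excluding self
        d_goal = np.linalg.norm(catalog[neighbors] - catalog[goal], axis=1)
        current = int(neighbors[np.argmin(d_goal)])   # hop toward the destination
        if current in path:                           # stuck in a cycle: give up
            break
        path.append(current)
    return path

print(path_between(0, 1))
```

Each intermediate index is a product that looks similar to its neighbors in the chain, which is exactly the “morph” effect.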
Billy Dreams of Stilettos
If you’ve been following the tech world lately, you’ve likely heard about Google’s DeepDream project. In a nutshell, DeepDream is a way to use deep neural nets to automagically transform photos into “hallucinogenic” imagery.
This works because the filters of a neural net that comb through the images are made up of neurons that activate when they encounter certain visual features. At the bottom of the network, there are neurons that activate when they see edges; the next layer may in turn use these to form neurons that activate when they see fur; the one after that, puppies and kittens. DeepDream essentially makes incremental transformations to images in a way that tries to maximize the activation of neurons in a chosen layer.
The cool thing is that this works when those features aren’t even present in the photo, so that when the neural net sees a part of the photo that remotely resembles one of the features it was trained to look for (like fur or puppies), DeepDream will keep transforming it to look like that feature.
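To make the mechanics concrete, here’s a toy numpy version of that gradient-ascent loop. A single linear “neuron” stands in for a layer of a deep net (real DeepDream backpropagates through the full network via a framework’s autodiff), but the update rule is the same in spirit:

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.normal(size=64)      # flattened toy "image"
filter_w = rng.normal(size=64)   # the "neuron" we want to excite

def activation(img):
    return float(filter_w @ img) # toy activation: a single linear filter

start_act = activation(image)
for _ in range(50):
    grad = filter_w                              # d(activation)/d(image) for a linear neuron
    image += 0.1 * grad / np.linalg.norm(grad)   # small step uphill on the image

print(start_act, activation(image))              # activation climbs with every step
```

Swap the linear filter for a chosen layer of a trained clothing network and the same loop nudges a photo toward whatever pumps, brogues, or henleys the layer faintly sees in it.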
So… what happens when we use DeepDream on a neural net that was trained not to look for puppies or birds or cars or houses, but instead for pumps and brogues and clutches and henleys? In the words of DJ Khaled, let’s see!
This one above is actually quite remarkable. Notice how it caught on to the t-shirt that’s only partially visible under her jacket but was able to fill in the rest of the sleeves and neckline, effectively removing the jacket. Her shirt even became more of a crop top. Likewise, her pants transformed into shorts and her purse into something that resembles a bucket bag.
What Do People Wear With This?
Now onto outfit search. For a given item, what outfits do people normally create with it? From an implementation standpoint, what we’d like to do is index a whole bunch of blogger-quality outfit images by the items contained within them. For each item, we’ll then “vectorize” it in the same manner as before so we can perform fast lookups of the containing outfit.
Locating items within a photo requires training another neural net to draw bounding boxes around relevant clothing products. Fortunately, we had saved the hundreds of thousands of image crops our interns made, which we reused to train an item locator. Once the boxes are drawn and the items cropped out, we’re left with the same product search problem as before.
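Putting the pieces together, the outfit index might look roughly like the sketch below. The detector and embedder are assumed to exist already (here random vectors stand in for their outputs); each detected item’s vector simply points back at the outfit photo it came from:

```python
import numpy as np

rng = np.random.default_rng(3)
item_vecs = []     # one embedding per detected, cropped item
item_outfit = []   # which outfit photo each item came from

def index_outfit(outfit_id, item_embeddings):
    """Register every item found in one outfit photo."""
    for vec in item_embeddings:
        item_vecs.append(vec / np.linalg.norm(vec))
        item_outfit.append(outfit_id)

def outfits_containing(query_vec, k=3):
    """Outfit photos whose items best match the query item."""
    mat = np.vstack(item_vecs)
    scores = mat @ (query_vec / np.linalg.norm(query_vec))  # cosine similarity
    best = np.argsort(-scores)[:k]
    return [item_outfit[i] for i in best]

# Fake data: 100 outfit photos with 3 detected items each.
for oid in range(100):
    index_outfit(oid, rng.normal(size=(3, 32)))

# Querying with (a slightly perturbed) item 5 should surface its outfit, photo 1.
q = item_vecs[5] + 0.01 * rng.normal(size=32)
print(outfits_containing(q))
```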
Finally, we can go back to the original problem. What goes with that paisley shirt? Here’s what Robo Bill Cunningham thinks.