emoji2costume: How Warby Parker Used word2vec to Recommend Halloween Costumes
My company loves Halloween. Every year, we have a company-wide Halloween party where everyone comes to work dressed up in their costumes. The costumes are usually pretty great and end up all over our social media. Customers and friends have often asked us to come up with costume ideas for them, both in person and online.
This year, we wanted to help even more people, so my team decided to build a program that could recommend a costume to a user over SMS. Someone suggested we use emojis as the inputs, and we all loved the idea. Thus, we had a vague understanding of the system we were to build: something that could map a string of emojis to a Halloween costume.
Emojis to Costumes?
There are a couple of possible solutions to this problem, each with its pros and cons. One easy solution would be to hand-classify emojis (e.g. “😂” could be put into a “funny” bucket), similarly hand-classify costumes, and then recommend a random costume that matches the class. This would be very easy to implement, but would require a lot of upfront manual labor to determine the classes, especially if we wanted to allow more than one emoji input.
Another solution would be to make a quiz, in which the emojis would intelligently filter down the list of answers. Again, this could work well if the quiz were well designed, but crafting it would also be a lot of upfront work.
A third solution would be to use machine learning to intelligently parse the emoji inputs as well as the costumes, then recommend the costume most similar to the semantic meaning of the emojis. This would be the most flexible solution (no manual classifying, and arbitrary-length inputs would be allowed), but would be much more complex to implement. It would also be more fun.
Naturally, we chose option three.
To make a simple prototype, we needed a way to extract the semantic meaning from a list of words and compare its meaning to a list of emojis. Luckily, for words, we can accomplish something like this through the word2vec model.
word2vec is a machine learning model, first developed at Google, that learns the semantic meanings of words. It is a trained neural network that tries to predict a missing word given a window of surrounding words. To do this, the model must learn where words “fit in” and where they don’t, and learning a numeric representation of each word helps with the task.
You can download a pre-trained word2vec model that transforms a word into a 300-dimensional floating-point vector. For example:
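A minimal sketch of what the lookup looks like. The tiny random table below is just a stand-in for the real pre-trained model (in practice you would load Google’s vectors with gensim’s `KeyedVectors.load_word2vec_format`):

```python
import numpy as np

# Toy stand-in for the pre-trained model; a real lookup would load the
# GoogleNews vectors via gensim.models.KeyedVectors.load_word2vec_format
rng = np.random.default_rng(0)
model = {w: rng.standard_normal(300) for w in ["scary", "spooky", "happy"]}

vec = model["scary"]   # one 300-dimensional float vector per word
print(vec.shape)       # (300,)
```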
As you can see, the word “scary” is transformed into a series of numbers. These 300 numbers encode the contextual meaning of “scary”. In other words, these 300 numbers describe where the word “scary” fits into the semantic landscape of English. A (pretty neat) consequence of this is that if you think of these 300 numbers as a point in a 300-dimensional space, synonyms of “scary” will be closer to this point than other words will be! So, if we can similarly encode an emoji, maybe we can find emojis and costumes that are “synonyms” of each other, and recommend them.
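The “synonyms land nearby” property can be illustrated with hand-built toy vectors (these numbers are made up for illustration, not real word2vec output, and use 3 dimensions instead of 300):

```python
import numpy as np

# Hand-crafted toy vectors: "spooky" is deliberately placed near "scary",
# "happy" far away, mimicking what word2vec learns from context
scary  = np.array([1.0, 0.0, 0.2])
spooky = np.array([0.9, 0.1, 0.3])
happy  = np.array([-1.0, 0.8, -0.5])

def dist(a, b):
    return float(np.linalg.norm(a - b))  # Euclidean distance

print(dist(scary, spooky) < dist(scary, happy))  # True: the synonym is closer
```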
The remaining pieces
Amazingly, a word2vec-like model for emojis already exists, and it’s called emoji2vec (what a world we live in). Using this model, we can encode an emoji to a 300-dimensional vector for use with word2vec. This was the a-ha moment when we realized we might actually be able to do this.
Using word2vec and emoji2vec, we were able to translate both words and emojis into a shared semantic space. The next step was to match them together. This can be accomplished by computing the Euclidean distance between the representation of the input emoji and the representations of all of the costumes, and picking the costume with the smallest distance. To compute this quickly, I used a k-d tree to store the vectors, which can be queried for the nearest point in O(log n) time.
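A sketch of that lookup using SciPy’s `cKDTree` (the random vectors and placeholder names below stand in for the real costume encodings):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
costume_vecs = rng.standard_normal((900, 300))        # one row per costume
costume_names = [f"costume_{i}" for i in range(900)]  # placeholder names

tree = cKDTree(costume_vecs)              # build once, query in O(log n)
query = rng.standard_normal(300)          # the encoded emoji input
distance, index = tree.query(query, k=1)  # nearest costume by Euclidean distance
recommendation = costume_names[index]
```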
Careful readers may have noticed one detail I’ve skimmed over: if word2vec operates on one word at a time, and emoji2vec operates on one emoji at a time, how can this method be extended to allow for multi-word costumes and multi-emoji inputs? We can use word2vec and emoji2vec to encode each of the individual inputs, but we need some way to collapse these into one vector so that we can do the lookup. The approach I took was to take the vector average. Therefore, the model is really matching the average input to the closest average costume (where the “average” costume in this sense is the average of all of the words in that costume’s description). There are more sophisticated approaches, but this seemed like a good place to start.
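The averaging step itself is a one-liner (the toy vectors below stand in for real word2vec/emoji2vec output):

```python
import numpy as np

rng = np.random.default_rng(0)
emoji_vecs = rng.standard_normal((3, 300))  # one 300-d vector per input emoji

query_vec = emoji_vecs.mean(axis=0)  # collapse 3 x 300 into a single 300-d vector
print(query_vec.shape)               # (300,)
```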
Proof of concept
We knew we would want to write our own costume list eventually, but as a proof of concept of the model, I wrote a script to scrape costume names and descriptions from a Halloween costume website, collecting about 900 costumes. The first step was to tokenize the words in each description. This got rid of punctuation and capitalization and turned the description into a list of words. Each word was then fed through word2vec, producing a matrix with one 300-dimensional row per word. This matrix was then collapsed into a single 300-length vector by averaging.
This process was repeated for all costumes, so at the end each costume had a 300-dimensional vector associated with it. These served as the points we matched against.
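The encoding pipeline above can be sketched as follows (a simple regex tokenizer and a toy embedding table; the real lookups go through the pre-trained word2vec model):

```python
import re
import numpy as np

rng = np.random.default_rng(1)
vocab = ["giant", "skull", "costume", "glow", "dark"]
word_vecs = {w: rng.standard_normal(300) for w in vocab}  # toy word2vec stand-in

def encode_costume(description):
    # Tokenize: lowercase and strip punctuation, keeping only word characters
    tokens = re.findall(r"[a-z']+", description.lower())
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vecs, axis=0)  # n_words x 300 -> one 300-d point

point = encode_costume("Giant Skull Costume!")
print(point.shape)  # (300,)
```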
The figure below shows the process for the example emoji string “👽🤖”. In the first step, each emoji (“👽” and “🤖”) is separately fed through emoji2vec, producing a matrix of size 2 x 300. This matrix is then collapsed to a 1 x 300 vector by averaging across the rows.
The model then takes this vector and finds the closest point among all of the costumes. It then looks up the costume name associated with that point; in this case, “Transformers” is returned. A Transformer is quite literally a robot alien, so we might be onto something with this whole approach! Below are some more example outputs.
😨😱 -> "Saw" Movie Costume
💪 -> Inflatable Muscles Costume
💀💤 -> Giant Skull Costume
Creating a production emoji model
Next, we asked employees for help coming up with novel costume ideas and ended up with a list of over 300 costumes! For each costume, we also asked for a list of tags. These tags served as the costume “description” for the model, and contained any words that came to mind for the given costume name. For example, a “Gondolier” costume might be tagged with “person”, “water”, “boat”, and “row”.
We landed on a user flow that involved us collecting three emojis. Before going live, we tested the model with a wide range of inputs.
One of my colleagues noticed an issue right off the bat: the recommendations weren’t very diverse. In fact, it seemed that the same two or three costumes were being recommended for a huge range of emoji inputs. With a few days to go before launch, we couldn’t go back to the drawing board. It would have been easy to remove the problem costumes from the list, or to recommend a random costume whenever the closest costume wasn’t very similar, but we wanted to investigate the issue in the model and fix it without cheapening the results.
Reverting to the mean
I had a hunch that averaging all of the words in the costume descriptions was the culprit. The descriptions contained many words, often covering a wide variety of topics. The more words in a description, the more the average pulls the resulting vector toward the overall mean, making it behave like a generic “average” vector.
This can cause many costumes to “bunch” together, leaving large holes in the space. So, when the model searches for the closest match to an input (three emojis), there is a good chance the resulting vector will fall into one of these holes, and the closest costume won’t really be close at all; it will just happen to be the costume at the edge of the hole.
The solution is to spread the costumes out in this space. One way to do this is to emphasize the words in a description that are unique to that costume and downplay the words that are not distinctive. For example, the tag “person” may not help distinguish a “Gondolier” from a “high school principal”, but the tags “water”, “boat”, and “row” are much more helpful.
In natural language processing, one way to accomplish this is to weight each term using tf-idf, or term frequency–inverse document frequency. The idea is that rarer words are more distinctive to a document. So, if the description for “Gondolier” uses both the words “person” and “boat”, the prevalence of each among the other descriptions is calculated. Since “boat” is rarer across the other costume descriptions, tf-idf gives “boat” a higher weight than “person” in the description for “Gondolier”.
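A hand-rolled sketch of the idf half of the weighting, over toy tag lists (the costume names and tags below are illustrative); the resulting weights then replace the plain mean with a weighted vector average:

```python
import math
import numpy as np

# Toy tag lists: "person" appears in every description, "boat" in only one
docs = {
    "Gondolier":             ["person", "water", "boat", "row"],
    "High school principal": ["person", "school", "suit"],
    "Vampire":               ["person", "cape", "fangs"],
}

N = len(docs)
df = {}  # document frequency: how many descriptions contain each tag
for tags in docs.values():
    for w in set(tags):
        df[w] = df.get(w, 0) + 1
idf = {w: math.log(N / c) for w, c in df.items()}
# Rare tags get high weight; ubiquitous tags like "person" get weight ~0

# Encode "Gondolier" with an idf-weighted average instead of a plain mean
rng = np.random.default_rng(0)
vecs = {w: rng.standard_normal(300) for w in df}  # toy word2vec stand-in
tags = docs["Gondolier"]
weights = [idf[w] for w in tags]
weighted = np.average([vecs[w] for w in tags], axis=0, weights=weights)
```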
Weighting the costume description tags did the trick, and the recommendations are now much more varied. Below is an example of what is happening when you query with “🎸🏈😎”:
Machine learning research is advancing rapidly. This project would have been cutting-edge research five years ago, yet it is now achievable in a few days of work. It will be exciting to see what will be simple to assemble in a few years’ time.
You can text 68848 until Halloween is over and see what the Warby Parker Halloween Costume recommends to you! 🎉