Your Computer Dresses You Funny

Fashion is hard for machines. So we set up an eight-week deep-learning AI makeover to put the chic into geek.

One of the fuzziest instructions a woman must decode in life usually comes embossed on a fancy invitation inspiring equal parts joy and terror: “black-tie attire.” It’s crystal clear what that means for men, but endlessly ambiguous for women. You won’t know if you hit the mark until it’s too late. Can a machine do better?

At Kip we design AI-enhanced search tools, and earlier this year we decided to give it a try: how well could we train a machine to find “black tie” outfits on the Internet suitable for women to wear to formal events, while avoiding the plain old “black tie” accessories usually associated with off-the-rack men’s clothes?

We gave ourselves eight weeks. Our goal was to surface the most relevant fashion products possible — a result that could have valuable commercial applications in e-commerce. Beyond improving outcomes, we also aimed to develop a reliable process to train a machine to teach itself to correctly identify and categorize difficult words and images, and make useful connections between them.

If successful, we’d not only develop a more nuanced lexicon for our search engine; we’d have an adaptable AI better able to interpret the real intentions of users. For this reason, we called the process “learning the language of other people’s desires.”

The Devil Wears Prada (2006)

If you work on hard problems, maybe you think fashion is shallow. Machine learning? Cool. Fashion? Sorry, that’s just clothes. In fact, when you work in AI, there’s a lot more to fashion than meets the eye: it’s one of the best domains available for training computers to become self-learning systems, a holy grail for building machines that may one day think like people do.

We love fashion for a couple of reasons. First, it produces a vast corpus of data that’s freely available for anyone to use. Data sets come in huge variety, from magazines, books and fashion dictionaries to bloggers, forums and social media, and there’s plenty of action in the form of runway videos, vloggers, online retail catalogues and more. Most of the data is fragmented, but openly accessible.

While breadth of data is important for building robust machine-learning tools, so is the pace at which new types of data enter the system. If data types remain too static, it’s hard to test the resilience and adaptability of software. Since new clothes and trends appear all the time, fashion provides plenty of novelties and challenges for assessing self-learning AI performance.

To set up the experiment we needed:

1. A range of search platforms as a comparative yardstick
2. A semantic search query to act as a control test

To make the test as fair as possible, we only chose companies and startups that advertised machine-learning or artificial-intelligence capabilities related to retail or fashion. We settled on three that met the requirements.

The second question was harder. “Semantic” in simple terms means “human understandable.” We had to find a question that a computer would misread or misinterpret unless it demonstrated a form of human-like intelligence. We considered using emotive language, e.g. “cute jacket.” However, “cute” was too subjective and required continuous training by the same user to get tailored results.

After some trial and error we settled on “black tie women” as our phrase.

‘black tie’ accessory vs ‘Black Tie’ event dress-code

The most common human language use of “black tie” is as an event dress-code where people have to wear formal evening wear. Most untrained computers, though, would read it as “black-colored tie accessory.”

We added “women” to indicate that our preferred results were for the dress code and not the object. A woman’s black-colored tie accessory would be an outlier query compared to a man’s black-colored tie accessory. This would help the machine understand that we were referring to the dress-code and not the item.
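This disambiguation can be framed as a tiny intent-classification problem: given the words around “black tie,” decide whether the query means the dress code or the accessory. Below is a minimal sketch with scikit-learn — not our production system, and the training phrases are invented for illustration:

```python
# Sketch: classify a query as "dress-code" vs "accessory" intent
# from its co-occurring words. Training phrases are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

queries = [
    "black tie event gown women",         # dress-code
    "black tie dress code formal women",  # dress-code
    "formal evening wear women",          # dress-code
    "black silk tie men",                 # accessory
    "skinny black necktie men",           # accessory
    "black tie clip accessory men",       # accessory
]
labels = ["dress-code"] * 3 + ["accessory"] * 3

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(queries, labels)

# "women" co-occurs only with dress-code examples, so the model
# leans toward the dress-code reading of the ambiguous query.
print(clf.predict(["black tie women"])[0])
```

A real system learns these associations from far more data, but the mechanism is the same: surrounding words shift the probability mass between the two senses.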

Boot Camp for a Machine

Now we had defined our goal: to create a machine that could tell the difference between the “black tie” dress code and the “black tie” accessory, understand the semantic query, and return the correct response.

We started with basic search, and the results were so bad we couldn’t use them at all: we’d get garbage back, or nothing. We couldn’t even rank the results. The scale of work seemed enormous. Where do you even start? What can you possibly accomplish in eight weeks?

Before writing a single line of code we set aside time to study the problem. We created a team chat channel for discussing neural networks and began keeping track of all the white papers, links and references we could find within a shared doc. For an entire week, we devoted ourselves to research, not building or making anything, but just reading everything we could get our hands on and discussing it.

AI is still in the very early stages and we were constantly amazed and grateful for the openness of the community. We actively started tracking the #machinelearning tag on social media, reaching out and forming a brain trust with other companies to share research. We got in contact with scientists and other specialists in the field. If we didn’t know something and there wasn’t a white paper about it, we could just contact these experts with a short, highly specific question.

Once we were ready, we used what’s known as a convolutional neural network (CNN) to read all the incoming product inventory and convert the images to a set of tags. A CNN works by extracting tiny visual clues from an image and tagging them bit by bit, identifying small details that match to templates indicating, for example, an “eye” or an “ear,” and combining them into “face;” then working up from there into the largest possible whole. Once the CNN identifies a sufficient number of the parts, it puts them all together and assigns the image a general tag that subsumes the rest, say “cat,” or “skirt.”
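The “tiny visual clues” step can be illustrated with a single convolution filter. The sketch below (NumPy only, on an invented 6×6 image) slides a hand-written vertical-edge kernel across the image and max-pools the response — the same extract-then-aggregate motion a real CNN repeats across many layers, with thousands of learned filters instead of one hard-coded one:

```python
import numpy as np

# Toy 6x6 "image": dark left half, bright right half.
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# Hand-written vertical-edge detector; a real CNN *learns* its kernels.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

def conv2d_valid(x, k):
    """Slide kernel k over x with no padding ('valid' convolution)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

feat = np.maximum(conv2d_valid(img, kernel), 0.0)  # ReLU activation

# 2x2 max pooling: keep the strongest response in each patch.
pooled = feat.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(feat)    # strongest response (3.0) along the dark-to-bright edge
print(pooled)  # 2x2 summary: "there is a vertical edge here"
```

Stacking many such filter-plus-pool stages is what lets the network work up from edges to “eye” to “face” to the whole-image tag.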

Using an idea first proposed by Google last year, we then fed the results into a recurrent neural network (RNN) to incorporate word-classification and parse the searcher’s question using natural language processing. The resulting “composite recurrent network” creates objects that we can use to teach the AI and improve the best matching results.

Simplified architecture with fewer crazy arrows than usual
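The recurrent half of that pipeline can be sketched as a plain Elman-style RNN reading the query one word at a time and folding it into a fixed-size state vector. Here random weights and stand-in embeddings replace trained parameters, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 16

# Stand-in word embeddings for the query tokens; a trained system
# would look these up in a learned embedding table.
query = ["black", "tie", "women"]
embeddings = {w: rng.standard_normal(embed_dim) for w in query}

# Elman RNN parameters (random here; learned in practice).
W_xh = rng.standard_normal((hidden_dim, embed_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for word in query:  # read the query left to right
    h = np.tanh(W_xh @ embeddings[word] + W_hh @ h + b_h)

# h now summarizes the whole query; downstream it would be matched
# against the CNN's image tags to rank candidate products.
print(h.shape)  # (16,)
```

Because each step mixes the new word with the running state, the final vector encodes word order and context — which is exactly what lets “women” reshape the meaning of “black tie.”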

To train the software we used about a million images with tags and sentences attached, pulled from ImageNet — the benchmark database for machine vision — and ran the tuning cycle about 80 times. It turns out that one of the hardest things about building AI systems today is hardware optimization and compatibility. Getting your GPUs to talk to your software is a big problem: about half of our cycles went to debugging, a laborious process given the general lack of documentation. We frequently resorted to searching online forums, and in one instance only found the answer to our problem after scanning for solutions in Chinese.

Once we had the frameworks set up, we started the training. Fashion comprises several domains: apparel terminology, pop-culture references, dress codes and etiquette. We did the initial pre-processing, then conducted supervised training in several layers of the process, with each data domain building on the previous specialization via transfer learning.
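The transfer-learning idea — reuse what an earlier domain taught the network, train only a new layer for the next domain — can be sketched in a few lines. Below, a frozen random projection stands in for the pretrained layers and a logistic head is fit by gradient descent; the data and all weights are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Pretrained" feature extractor from the previous domain: frozen.
W_base = rng.standard_normal((16, 4))

# Tiny labeled set for the new domain (invented, roughly separable).
X = rng.standard_normal((40, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

feats = np.tanh(X @ W_base.T)  # frozen transfer features

# Train only the new head (logistic regression) on top.
w, b = np.zeros(16), 0.0

def loss():
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

first = loss()
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    w -= 0.1 * feats.T @ (p - y) / len(y)   # update head only;
    b -= 0.1 * np.mean(p - y)               # W_base never changes
final = loss()
print(first, final)  # head improves while the base stays frozen
```

Chaining such stages — each new domain's head trained on top of the previous specialization — is the layered process described above.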

We trained quite specifically on “black tie women” and tested that very heavily. We finished with an accuracy of 8 out of 10 correct results, compared with uniformly poor results from all three of our competitive yardsticks — a strong validation of the machine learning process.
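One way to read that score is as precision at 10: of the top 10 results returned, 8 were judged relevant. A small helper makes the metric explicit (the relevance labels below are hypothetical, not our actual judgments):

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results judged relevant (1) vs not (0)."""
    top = relevance[:k]
    return sum(top) / len(top)

# Hypothetical judgments for a "black tie women" top 10:
# 1 = a dress-code result, 0 = a stray accessory.
judged = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
print(precision_at_k(judged, 10))  # 0.8
```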

Our machine is still very much in beta. Our search results are not yet as good as we want them to be, except in the few areas we’ve been training in. “Black tie women” was our test of how far we could improve on the initial boosting and supervision stage.

Despite the issues, this experiment demonstrates exactly what eight weeks of highly supervised, domain-specific training can do. While there is still a lot more work to be done, and a huge gulf between machine and human intelligence, the results are encouraging.

Building towards an AI-powered personal shopper — that’s our dream. It would combine visual data with language processing so that you can just talk to the AI to find what you’re looking for.

(Here’s a screen capture of a recent test with Slack:)

If you’re curious to learn more, check out our site at Kip or give us a shout on Twitter.

Resources

Here are some of the most useful resources we found when starting out. We hope you find them equally useful!

https://www.kaggle.com/c/word2vec-nlp-tutorial/details/what-is-deep-learning

https://www.reddit.com/r/machinelearning

http://nlp.stanford.edu/projects/glove/

https://cs231n.github.io/convolutional-networks/

https://docs.google.com/presentation/d/1UeKXVgRvvxg9OUdh_UiC5G71UMscNPlvArsWER41PsU/preview?pli=1&slide=id.gc2fcdcce7_216_0

http://vision.is.tohoku.ac.jp/~kyamagu/research/paperdoll/

https://twitter.com/deeplearning4j

https://twitter.com/kipsearch

https://twitter.com/FastForwardLabs

https://twitter.com/ML_toparticles

https://twitter.com/stanfordnlp

https://twitter.com/MetaMindIO

https://twitter.com/drfeifei