Deep learning for fashion attributes

Published in ASOS Tech Blog, Sep 6, 2018

Written by Saúl Vargas and Fabio Daolio, based on the KDD 2018 paper Product Characterisation Towards Personalisation: Learning Attributes From Unstructured Data To Recommend Fashion Products

The Data Science team at ASOS uses Machine Learning to improve the customer experience. We contribute to the global data science community by publishing our research and by sharing the outcomes of our work at conferences. This article describes our most recent paper, presented at SIGKDD, one of the most prestigious conferences in data science.

Business case

ASOS is a global e-commerce company that creates and curates clothing and beauty products for fashion-loving 20-somethings. At any moment in time, our catalogue contains around 85,000 products, with 5,000 new items being introduced every week. Over the years, this amounts to more than one million unique styles. For each of these products, different divisions within the company produce and consume different product attributes. These attributes are often manually curated and there could be cases in which information is missing or wrongly labelled. However, this incomplete information still carries a lot of potential value for the business.

The ability to have a systematic and quantitative characterisation of our products is key for the company to make data-driven decisions across a set of problems, including personalisation, demand forecasting, range planning and logistics. We show how to predict a consistent and complete set of product attributes, and we illustrate how this enables us to personalise the customer experience by providing more relevant product recommendations.

Solution outline

Our model extracts attribute values from product images and textual descriptions. To do this, we rely on the success that Deep Neural Networks have in classifying text and images. Below, we briefly discuss the main model components.

Image processing

For each product, four high-quality images are produced by the studio team. As fashion is predominantly a visual business, visual features are at the core of many data science products. Training good image models from scratch can take weeks. As we use image features for many applications, to minimise computational costs we have implemented a centralised visual feature generation pipeline. This uses pre-trained Convolutional Neural Networks (CNNs) to extract product representations from images.

Images are processed as matrices of pixel RGB values. Picture from ai.stanford.edu

Text processing

CNNs were originally applied to images, which are treated as matrices of pixel colour values. It is also possible to apply convolutions to other types of matrices, in particular to paragraphs of text, which can be turned into matrices using a technique called word2vec. As the name suggests, this transforms words into vectors, which can be stacked on top of each other to form a matrix. After that, similarly to how we process images to produce product representations, we can also produce representations from text descriptions.

Convolution over text: the leftmost layer is a lookup for word vectors, where each line corresponds to a different word. Picture adapted from Y. Kim’s Convolutional Neural Networks For Sentence Classification.
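The idea of convolving over a stack of word vectors can be sketched in a few lines of NumPy. This is a minimal illustration, not our production model: the toy vocabulary, the random vectors standing in for trained word2vec embeddings, the filter sizes and the function names text_to_matrix and conv_max_pool are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; random vectors stand in for trained word2vec embeddings.
vocab = ["black", "skinny", "jeans", "with", "ripped", "knees"]
embedding_dim = 4
word_vectors = {word: rng.normal(size=embedding_dim) for word in vocab}


def text_to_matrix(text):
    """Stack one word vector per known token: shape (n_words, embedding_dim)."""
    tokens = [t for t in text.lower().split() if t in word_vectors]
    return np.stack([word_vectors[t] for t in tokens])


def conv_max_pool(matrix, filters):
    """Slide each filter over windows of consecutive words, then max-pool.

    matrix:  (n_words, embedding_dim), with n_words >= window
    filters: (n_filters, window, embedding_dim)
    returns: (n_filters,), one feature per filter
    """
    n_filters, window, _ = filters.shape
    n_positions = matrix.shape[0] - window + 1
    activations = np.empty((n_positions, n_filters))
    for i in range(n_positions):
        patch = matrix[i:i + window]  # (window, embedding_dim)
        # Dot product of every filter with the current window of words.
        activations[i] = np.tensordot(filters, patch, axes=([1, 2], [0, 1]))
    activations = np.maximum(activations, 0.0)  # ReLU
    return activations.max(axis=0)  # max over word positions


description = "Black skinny jeans with ripped knees"
matrix = text_to_matrix(description)
filters = rng.normal(size=(3, 2, embedding_dim))  # 3 filters, each spanning 2 words
text_features = conv_max_pool(matrix, filters)
print(matrix.shape, text_features.shape)
```

Max-pooling over word positions gives a fixed-length feature vector regardless of how long the description is, which is what makes the output usable as a product representation.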

Multi-modal fusion

The image and the text representations are then simply concatenated together within a neural network, which is trained to predict the product attributes. This is a straightforward and common way to fuse the different inputs that works well in practice.
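As a rough sketch of fusion by concatenation, assuming hypothetical feature sizes (the real dimensions are not given in this article) and a single dense layer on top:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
image_features = rng.normal(size=(5, 8))  # 5 products, 8 image features each
text_features = rng.normal(size=(5, 3))   # the same 5 products, 3 text features each

# Fusion by concatenation: one row per product, image and text side by side.
fused = np.concatenate([image_features, text_features], axis=1)

# One dense layer with ReLU on top of the fused vector (weights are random here;
# in a real network they would be learned together with the rest of the model).
hidden_dim = 6
weights = rng.normal(size=(fused.shape[1], hidden_dim)) * 0.1
bias = np.zeros(hidden_dim)
product_representation = np.maximum(fused @ weights + bias, 0.0)

print(fused.shape, product_representation.shape)
```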

Multi-task learning

Our primary focus was to design a solution to deal with missing labels at scale. We could have chosen to build a separate model for each attribute, but we would have had to maintain multiple models in production. Independent models would also be oblivious to the correlations between attribute values, and they would only work well for common attributes, where there is enough training data. Alternatively, we could have built a single model to predict all attributes at once. However, few products are fully annotated and there would not be enough data to train such a model. For these reasons, we chose to cast attribute prediction as a multi-task learning problem. This means training a neural network for each attribute but sharing most of the parameters between networks.

Schematic view of the multi-task attribute prediction network; more details are in our KDD paper.

In practice, we create a collection of training sets, one per attribute, so that each dataset is fully annotated. We then build a collection of models, one per attribute, where all these models share the majority of their parameters, except for the final, attribute-specific layers. These models are trained on their respective datasets; the weight-sharing scheme allows each model to learn from all available data (implicit data augmentation) and promotes an internal representation that generalises to multiple targets (implicit regularisation).

Evolution of test scores per attribute during training, at intervals of 1,000 stochastic gradient descent steps.
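The weight-sharing scheme can be sketched as follows. This is a toy version under stated assumptions, not the network from the paper: the data is random, the attribute names colour and pattern are hypothetical, the shared part is a single tanh layer, and alternating one mini-batch per attribute is just one possible training schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 products with 10 fused features each.
n_products, n_features, hidden_dim = 200, 10, 16
features = rng.normal(size=(n_products, n_features))

# Labels per attribute; -1 marks a missing annotation.
n_classes = {"colour": 3, "pattern": 2}
labels = {}
for name, k in n_classes.items():
    y = rng.integers(0, k, size=n_products)
    y[rng.random(n_products) < 0.5] = -1  # drop roughly half of the labels
    labels[name] = y

# One fully annotated training set per attribute.
datasets = {name: (features[y >= 0], y[y >= 0]) for name, y in labels.items()}

# Shared parameters (used by every attribute) and attribute-specific heads.
params = {"shared": rng.normal(size=(n_features, hidden_dim)) * 0.1}
heads = {name: rng.normal(size=(hidden_dim, k)) * 0.1 for name, k in n_classes.items()}


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def sgd_step(name, x, y, lr=0.1):
    """One gradient step on the cross-entropy loss of a single attribute."""
    rows = np.arange(len(y))
    hidden = np.tanh(x @ params["shared"])
    probs = softmax(hidden @ heads[name])
    loss = -np.log(probs[rows, y]).mean()

    # Backpropagate through the head and the shared layer.
    grad_logits = probs.copy()
    grad_logits[rows, y] -= 1.0
    grad_logits /= len(y)
    grad_head = hidden.T @ grad_logits
    grad_hidden = grad_logits @ heads[name].T
    grad_shared = x.T @ (grad_hidden * (1.0 - hidden ** 2))

    heads[name] -= lr * grad_head        # only this attribute's head is updated
    params["shared"] -= lr * grad_shared  # the shared layer learns from every task
    return loss


# Alternate between attributes, sampling a mini-batch from each dataset.
losses = {name: [] for name in datasets}
for step in range(200):
    for name, (x_all, y_all) in datasets.items():
        batch = rng.choice(len(y_all), size=32, replace=False)
        losses[name].append(sgd_step(name, x_all[batch], y_all[batch]))
```

Because every step updates the shared layer, a product labelled only for one attribute still contributes to the representation used by all the others.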

Product recommendations

There are many applications for augmented product attributes, and one of the most important is personalisation through recommendations.

The cold-start problem

Recommender systems are one of the tools we offer customers to help them discover products. Our ability to make accurate product recommendations relies on our knowledge of the products in our catalogue. Importantly, one of the most useful sources of information about our products is how our customers interact with them. The rationale is that, if two products are often purchased by the same customers, they are similar in some way. The set of algorithmic approaches relying on this type of customer-product interaction is known as collaborative filtering and has proven to be a very powerful way of creating personalised recommendations.

Collaborative filtering algorithms, however, suffer from an important weakness: when a new product is added, it takes some time before we obtain a large enough number of customer interactions to recommend it. This issue is known as product cold-start and, in a scenario such as ours, where 5,000 new products are introduced every week, it creates an important blind spot; we cannot assist customers who are browsing the newest products in our catalogue.

A hybrid approach

One solution to the cold-start problem uses a customised content-aware model that is able to leverage customer-product interactions together with the augmented product attributes. This hybrid approach builds on several state-of-the-art advances in recommender systems; it not only handles new products, but also enhances the recommendations that customers receive overall.

Our approach creates an embedding of products, i.e. a representation of all the products in our catalogue in a high-dimensional vector space. In this vector space, products with similar styles and attributes will be closer than unrelated ones. When producing personalised recommendations, our algorithm also assigns a vector to every customer. The items with the highest inner product with the customer vector are the recommended ones. The position of products and customers in this space is determined not only by the customer-product interactions (the collaborative filtering approach we discussed earlier), but also by the augmented product attributes. This ensures that newly added products are positioned correctly in the space and can be recommended to the right customers.
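Scoring by inner product can be sketched like this. The embeddings below are random placeholders (in practice they would be learned), and the recommend function and its parameters are hypothetical names used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings; in practice these would be learned from data.
n_products, dim = 1000, 32
product_vectors = rng.normal(size=(n_products, dim))
customer_vector = rng.normal(size=dim)


def recommend(customer, products, k=5):
    """Return the indices of the k products with the highest inner product."""
    scores = products @ customer                 # one score per product
    top_k = np.argpartition(-scores, k - 1)[:k]  # the k best, in arbitrary order
    return top_k[np.argsort(-scores[top_k])]     # sorted by score, best first


recommended = recommend(customer_vector, product_vectors, k=5)
print(recommended)
```

Using np.argpartition avoids sorting the whole catalogue when only a handful of recommendations is needed.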

Example

As a qualitative illustration of how the augmented attributes help with recommendations, we consider two examples. In both, we have a seed product for which we seek alternatives (in essence the ‘You might also like’ functionality currently available when a customer looks at a product on ASOS). In our product embedding, this means looking at the closest products in the vector space. We have produced two different embeddings: one that does not use the augmented product attributes (a pure collaborative filtering approach) and one from our hybrid method, which relies on product content to improve the embedded representations.
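Looking up the closest products to a seed can be sketched as a nearest-neighbour search. The product names and three-dimensional vectors below are made up for illustration, and cosine similarity is an assumption: the article does not say which distance is used.

```python
import numpy as np

# Hypothetical 3-dimensional product embeddings, for illustration only.
names = [
    "black skinny jeans",
    "blue skinny jeans",
    "holdall bag",
    "weekend bag",
    "long-sleeved shirt",
]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.3],
    [0.1, 0.8, 0.4],
    [0.5, 0.0, 0.9],
])


def most_similar(seed_index, vectors, k=2):
    """Indices of the k products closest to the seed by cosine similarity."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = unit @ unit[seed_index]
    similarity[seed_index] = -np.inf  # never return the seed itself
    return np.argsort(-similarity)[:k]


seed = names.index("holdall bag")
alternatives = [names[i] for i in most_similar(seed, vectors, k=2)]
print(alternatives)
```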

For the first example, we consider one pair of our high-selling black skinny jeans:

As we can see, there is no evident difference in quality between the two lists of alternative products. In both cases, the returned replacements seem to be very similar to the seed product. This result is not a surprise and, indeed, it is a good example of how relying on vast amounts of customer interactions (when available) can be very powerful.

Things look more interesting when we consider a holdall bag that had just been added to the catalogue:

As we can see, the first method (pure collaborative filtering) produces counter-intuitive recommendations. There is no apparent connection between the bag and long-sleeved shirts. This happens because, in the absence of information about a product, the collaborative model often defaults to highly popular products. However, the attribute-aware method (bottom row) is able to overcome the lack of customer interactions and suggests similar-looking bags.

As we have just seen, the attribute-aware recommendations help with the product cold-start problem. An objective, quantitative experiment further verifies that this model is better overall at predicting customer purchases based on previous behaviour, which we use as a proxy for the quality of recommendations. If you want to know more, please refer to our paper.

Saúl is a Data Scientist at ASOS. He uses Machine Learning to help customers discover clothes they will love. Fabio is a Data Scientist at ASOS. He uses Machine Learning to understand what our customers love about their clothes.