Automated outfit generation with deep learning

Elaine Bettaney
Nov 11, 2020 · 5 min read

The AI team at ASOS uses Machine Learning to improve the customer experience. We contribute to the global data science community by publishing our research and by sharing the outcomes of our work at conferences. This article describes work from our recent research paper Fashion Outfit Generation for E-commerce published in ECML PKDD 2020.

Business case

Combining items of clothing into an outfit is a major task in fashion retail. Customers want to know ‘What shoes will go with this dress?’, ‘What can I wear to a party?’ or ‘Which items should I add to my wardrobe for summer?’. Answering these questions requires an understanding of style, which encompasses a broad range of properties including colour, shape, pattern and fabric. It may also incorporate current fashion trends, customers’ style preferences and an awareness of the context in which the outfits will be worn. In the growing world of fashion e-commerce it’s becoming increasingly important to be able to fulfil these needs in a way that is scalable, automated and ultimately personalised.


We developed a machine learning model which is capable of completing an outfit based on a given seed product. Here we give an overview of our model and some of the challenges we faced.

We consider an outfit to be a set of fashion items which match stylistically and can be worn together. In order for the outfit to work, each item must be compatible with all other items. Our aim is to create a model which embeds each item in a latent style space such that for any two items the dot product (a measure of similarity) of their embeddings reflects their compatibility.

We use a deep neural network to learn embeddings for each item. All products in our ASOS catalogue have associated images, text descriptions and categorical attribute data and our neural network combines information from each of these sources to create the item embeddings.

Architecture of the item embedder network
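As a rough sketch of this fusion step, the following combines per-modality feature vectors with a single dense layer. The feature dimensions (apart from the 256-dimensional output mentioned later in this article) and the weights are illustrative stand-ins, not the real network's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature sizes -- only the 256-dimensional output is from the article.
VISUAL_DIM, TEXT_DIM, CATEGORY_DIM, EMBED_DIM = 512, 100, 50, 256

# Hypothetical projection weights; in the real network these are learned.
W = rng.normal(scale=0.01, size=(VISUAL_DIM + TEXT_DIM + CATEGORY_DIM, EMBED_DIM))
b = np.zeros(EMBED_DIM)

def embed_item(visual, text, category):
    """Fuse per-modality features into a single item embedding."""
    fused = np.concatenate([visual, text, category])
    return np.tanh(fused @ W + b)  # one dense layer as a stand-in

item = embed_item(rng.normal(size=VISUAL_DIM),
                  rng.normal(size=TEXT_DIM),
                  rng.normal(size=CATEGORY_DIM))
print(item.shape)  # (256,)
```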

Visual embedding

Our ASOS product images are predominantly full body shots of clothing on models, meaning that whole outfits are visible in each image. Training the network directly on these images would leak information about the other items in the outfit which wouldn’t be available to the live system. It was therefore necessary to localise the target item within the image.

Image features are extracted using VGGNet, a publicly available pre-trained deep convolutional network widely used for this purpose. We then adopt an approach based on Class Activation Mapping (CAM), producing a heatmap for each image which we use to weight the original features, localising them to the most relevant areas.
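The weighting step can be sketched as follows, with a synthetic heatmap standing in for a real CAM output and random values standing in for VGG features:

```python
import numpy as np

# Toy stand-in for a VGG feature map: channels x height x width.
features = np.random.default_rng(1).random((512, 7, 7))

# A CAM heatmap has one weight per spatial location, highlighting the
# target item; here it is synthetic rather than produced by a real CAM.
heatmap = np.zeros((7, 7))
heatmap[2:5, 2:5] = 1.0   # pretend the item occupies the centre
heatmap /= heatmap.sum()  # normalise to a spatial distribution

# Weighted average pooling: spatial positions contribute in proportion
# to the heatmap, localising the pooled feature to the item.
localised = (features * heatmap).sum(axis=(1, 2))
print(localised.shape)  # (512,)
```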

Title and description embedding

Product titles typically contain important information, such as the brand and colour. Similarly, our text descriptions contain details such as the item’s fit, design and material. To extract information from these we utilise pre-trained embeddings created by an existing application described in Deep learning for fashion attributes. The fashion attribute model is trained to predict multiple properties, including pattern, style and use/occasion, that are highly relevant to our context, and hence these embeddings work well for our model.

Category embedding

The product category provides fine grained detail on the product type, e.g. ‘Day dresses’, ‘Woven tops’, ‘Casual trousers’. We use the popular GloVe method for word embeddings to embed the categories, taking the mean vector across all words in the category name.
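With toy word vectors standing in for real pre-trained GloVe embeddings, the mean-pooling looks like this:

```python
import numpy as np

# Toy word vectors standing in for pre-trained GloVe embeddings.
glove = {
    "casual":   np.array([0.9, 0.1, 0.0]),
    "trousers": np.array([0.2, 0.8, 0.3]),
}

def embed_category(name, vectors):
    """Embed a category name as the mean of its word vectors."""
    words = name.lower().split()
    return np.mean([vectors[w] for w in words], axis=0)

print(embed_category("Casual trousers", glove))  # [0.55 0.45 0.15]
```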

Outfit scoring

We quantify pairwise compatibility between products by taking the dot product between their item embeddings. Outfit compatibility is then calculated as the sum over pairwise dot products for all pairs of items in the outfit. The output of this is passed through a sigmoid function to ensure it is in the range [0,1].
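A minimal sketch of this scoring rule, using toy 2-d embeddings rather than our 256-dimensional ones:

```python
import numpy as np
from itertools import combinations

def outfit_score(embeddings):
    """Sum of pairwise dot products, squashed into [0, 1] by a sigmoid."""
    total = sum(np.dot(a, b) for a, b in combinations(embeddings, 2))
    return 1.0 / (1.0 + np.exp(-total))

# Items pointing the same way in style space score higher together.
compatible   = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
incompatible = [np.array([1.0, 0.0]), np.array([-0.9, 0.1])]
print(outfit_score(compatible) > outfit_score(incompatible))  # True
```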

Model training

Our model is trained end to end using a binary classification task. Each training example consists of one outfit made up of a set of ASOS products. We utilise an internal Buy the Look (BTL) dataset which provides almost 600,000 outfits curated by our ASOS stylists. The dataset is taken from ASOS product description pages, so every product in our catalogue appears once as a hero product. Each outfit is therefore made up of a hero product and a variable number of styling products (e.g. a dress may be styled with a pair of shoes and a bag).

We train the model to discriminate between real BTL outfits and negative samples. We generate negative samples by taking a BTL outfit and replacing each styling product with a randomly selected one of the same category.
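A sketch of this negative sampling step, with a hypothetical catalogue and made-up product ids:

```python
import random

# Hypothetical catalogue grouped by category; product ids are made up.
catalogue = {
    "Shoes": ["shoe_1", "shoe_2", "shoe_3"],
    "Bags":  ["bag_1", "bag_2"],
}

def negative_sample(outfit, catalogue, rng):
    """Replace each styling product with a random item of the same
    category, keeping the hero product fixed."""
    negatives = {}
    for category, product in outfit["styling"].items():
        alternatives = [p for p in catalogue[category] if p != product]
        negatives[category] = rng.choice(alternatives)
    return {"hero": outfit["hero"], "styling": negatives}

rng = random.Random(0)
real = {"hero": "dress_1", "styling": {"Shoes": "shoe_1", "Bags": "bag_1"}}
fake = negative_sample(real, catalogue, rng)
print(fake["hero"])  # dress_1 -- the hero product is unchanged
```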

Outfit generation

Once trained, our model can be used to generate new outfits. Outfits of any length can be generated by sequentially adding items and re-scoring. Each outfit starts with a seed product from our catalogue. We then define an outfit template, which is a set of product types with which to complete the outfit. Our aim is to find the set of items of appropriate product types which maximise the outfit score.

An exhaustive search of every possible combination of products scales combinatorially with the number of product types and cannot be computed within a reasonable time. We therefore use a beam search, which is more computationally efficient, illustrated in the diagram below for a template of {Skirts, Tops, Shoes, Bags}.
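A minimal beam search over a two-slot template, with toy 2-d embeddings (the catalogue, template and values are illustrative only):

```python
import numpy as np
from itertools import combinations

def outfit_score(embeddings):
    """Sum of pairwise dot products passed through a sigmoid."""
    total = sum(np.dot(a, b) for a, b in combinations(embeddings, 2))
    return 1.0 / (1.0 + np.exp(-total))

def beam_search(seed, template, catalogue, beam_width=3):
    """Fill each product type in the template in turn, keeping only the
    `beam_width` highest-scoring partial outfits after every step."""
    beams = [[seed]]
    for product_type in template:
        candidates = [outfit + [item]
                      for outfit in beams
                      for item in catalogue[product_type]]
        candidates.sort(key=outfit_score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # highest-scoring complete outfit

catalogue = {
    "Tops":  [np.array([1.0, 0.2]), np.array([-1.0, 0.2])],
    "Shoes": [np.array([0.9, 0.1]), np.array([-0.8, 0.4])],
}
seed = np.array([1.0, 0.0])  # e.g. a skirt chosen as the seed product
best = beam_search(seed, ["Tops", "Shoes"], catalogue)
print(len(best))  # 3: the seed plus one item per template slot
```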

Beam search for outfit generation. Starting with a seed product (highlighted in yellow) each product type in the template is filled sequentially by finding the products from the catalogue which when added to the outfit give the highest outfit score. After each step the number of outfits retained is reduced to the beam width (set to 3). The outfits retained after each step are highlighted in green with the highest scoring outfit in dark green.

Style space

We can visualise our style space using a t-SNE plot. This reduces our 256-dimensional item embeddings to two dimensions which can then be easily visualised. An example containing a sample of our womenswear shoes and dresses is shown below. While similar items have similar embeddings, we can also see that compatible items of different product types have similar embeddings. Rather than dresses and shoes being completely separate in style space, these product types overlap, with casual dresses having similar embeddings to casual shoes and occasion dresses having similar embeddings to occasion shoes.

t-SNE representation of some dresses and shoes projected into style space. The blue box highlights an area containing occasion wear while the red box contains casual day wear.


The ASOS Tech Blog

A collective effort from ASOS's Tech Team, driven and directed by our writers. Learn about our engineering, our culture, and anything else that's on our mind.

Written by Elaine Bettaney

Elaine is a Data Scientist at ASOS. She uses Machine Learning to help customers discover clothes they will love.
