Deep Learning Our Way Through Fashion Week

What have artificial intelligence and autoencoding got to do with 3,000 runway images?

Alejandro Giacometti
Inside EDITED
13 min read · Aug 8, 2017


AI, deep learning — or whatever we decide to call this new wave of neural network applications — has many industries racing to design the latest and most intelligent learning systems in an attempt to find solutions to industry-specific problems.

Fashion retail is no different; at EDITED we’re interested in what these technologies can teach us about our industry and the unique data it produces.

While we’ve got retail data pretty covered, there’s a huge amount of imagery in fashion to play with.

So we created a specific kind of neural network — called a convolutional variational autoencoder — to investigate a dataset of runway photos from London Fashion Week (LFW).

The network synthesises information from the photos and then represents them as a set of numbers that can be manipulated. That allowed us to analyse the collection in super interesting ways.

Fashion Week in pictures

We wanted to analyse a set of ~3,000 photos from the London Fashion Week Spring 2017 shows. These photos depict a model centred in the frame, wearing a particular design.

Most of the photos were taken in a runway show; the background behind the model in the photos is often quite simple. Some shows, however, are staged elsewhere, and so the backgrounds of the photos can be busier; they might include other people or objects.

Each designer has a completely different setting and a defined brand identity for their show, with each look being unique — this should offer a varied dataset within each show, as well as from designer to designer.

Fashion week shows are probably the best representation of strong design identity within fashion, as designers use these events to reveal their interpretation of the season.

An example set of photos from the London Fashion Week Spring 2017 shows.

The photos in the dataset include pieces from 67 designers participating in LFW. The number of photos per designer varies: some designers have just one photo, whereas others have many.

Turning photos into numbers

You might be wondering what an autoencoder is…

An autoencoder is a special type of neural network that attempts to understand common characteristics of a dataset in order to represent it — or encode it — in an efficient manner.

Its purpose is to find a representation of a dataset in a reduced dimension. That’s done by training a symmetric neural network made of two parts: an encoder and a decoder.

We call it symmetric because the encoder and the decoder mirror each other: one is the reverse of the other.

The encoder reduces the information from each multidimensional input — in our case, the runway photos — to a limited set of dimensions.

The decoder expands from those limited dimensions to the data’s original size — it produces a version of the image which is reconstructed from the encoding.

The aim of the autoencoder is to perform this reduction and reconstruction whilst making sure that as much detail from the original image is preserved as possible.

Diagram of the fashion week autoencoder, made up of an encoder and a decoder.

A regular autoencoder is a neural network which learns from the data to produce an encoding that represents latent variables — in this case it attempts to capture inherent characteristics of the original photos.

But these encoded representations are hard to understand because they do not follow any kind of structure. So we used a kind of autoencoder called a variational autoencoder, which forces the encodings to follow a normal distribution during training.

This is helpful because normal distributions are well understood and they follow a continuum. This representation better enables us to understand the limits of the encoded space and more easily manipulate it.

But fashion, being so visual, means we need to go one step further.

Because our data is composed of images, we use a convolutional neural network, which picks up higher order patterns in the images by applying two-dimensional operations.

The network learns not only from individual pixel intensity values, but also considers larger areas of the image, and can therefore extract insights from properties such as geometry and patterns, as well as color.

So, in short, a convolutional variational autoencoder is a kind of neural network that attempts to learn from higher order features of images and represent them in a set of normally distributed latent variables.
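
For the technically curious, here is a minimal sketch of what such a network could look like in Keras. The image size, layer widths and optimiser are illustrative assumptions rather than the exact architecture we trained on the LFW photos; only the 16-dimensional latent space matches the model described in the next section.

```python
# A minimal convolutional VAE sketch in tf.keras.
# Image size, layer widths and optimiser are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 16            # the 16 latent dimensions used below
IMG_SHAPE = (128, 96, 3)   # assumed (height, width, channels)

# Encoder: convolutions reduce each photo to a mean and a log-variance
# for each latent dimension.
enc_in = layers.Input(shape=IMG_SHAPE)
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(enc_in)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Flatten()(x)
z_mean = layers.Dense(LATENT_DIM)(x)
z_log_var = layers.Dense(LATENT_DIM)(x)
encoder = tf.keras.Model(enc_in, [z_mean, z_log_var], name="encoder")

# Decoder: expands a 16-dimensional encoding back to image size.
dec_in = layers.Input(shape=(LATENT_DIM,))
x = layers.Dense(32 * 24 * 64, activation="relu")(dec_in)
x = layers.Reshape((32, 24, 64))(x)
x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
dec_out = layers.Conv2DTranspose(3, 3, padding="same", activation="sigmoid")(x)
decoder = tf.keras.Model(dec_in, dec_out, name="decoder")


class VAE(tf.keras.Model):
    """Ties encoder and decoder together with the VAE training objective."""

    def __init__(self, encoder, decoder, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def train_step(self, data):
        with tf.GradientTape() as tape:
            z_mean, z_log_var = self.encoder(data)
            # Reparameterisation trick: sample z ~ N(z_mean, exp(z_log_var)).
            eps = tf.random.normal(shape=tf.shape(z_mean))
            z = z_mean + tf.exp(0.5 * z_log_var) * eps
            recon = self.decoder(z)
            # Reconstruction term: preserve as much of the original image as possible.
            recon_loss = tf.reduce_mean(
                tf.reduce_sum(tf.square(data - recon), axis=[1, 2, 3]))
            # KL term: push the encodings towards a standard normal distribution.
            kl_loss = -0.5 * tf.reduce_mean(
                tf.reduce_sum(
                    1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
            loss = recon_loss + kl_loss
        grads = tape.gradient(loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"loss": loss, "recon": recon_loss, "kl": kl_loss}


# `images` is assumed to be a float array of shape (N, 128, 96, 3) in [0, 1].
# vae = VAE(encoder, decoder)
# vae.compile(optimizer="adam")
# vae.fit(images, epochs=50, batch_size=32)
```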

Entering ‘the encoded space’

Next, we trained an autoencoder to reduce the set of runway photos into a space made up of 16 dimensions. Using the encoder half of the autoencoder, each photo in the dataset can be represented in these dimensions; we called this representation the encoded space.

The ~3,000 runway photo dataset represented in the encoded space made up of 16 latent dimensions. Each column in this figure represents one single image, each row represents a single latent dimension and its value is represented in color.

Every single photo in the dataset is encoded into a set of 16 values — this representation of the encoded space describes the entire dataset in much less space.

Each of the 16 values represents a latent dimension, or an inherent characteristic of the photos. The autoencoder has learned from the variation in geometry, color, the model’s pose, and any other details from the photos in the dataset, and has condensed these features into 16 latent dimensions.
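
As a rough sketch of that encoding step, reusing the encoder and the images array from the Keras sketch above:

```python
# Encode every runway photo into the 16-dimensional space, assuming the
# `encoder` and `images` array from the VAE sketch above.
import numpy as np
import matplotlib.pyplot as plt

z_mean, z_log_var = encoder.predict(images, batch_size=32)
encodings = np.asarray(z_mean)          # shape: (~3000, 16)

# Visualise the whole dataset as a heatmap: one column per photo,
# one row per latent dimension.
plt.imshow(encodings.T, aspect="auto", cmap="viridis")
plt.xlabel("photo")
plt.ylabel("latent dimension")
plt.show()
```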

Because this is a variational autoencoder, we have forced these dimensions to be normally distributed — we can visualise those distributions via histograms.

Histograms of the latent dimensions in the encoded space. Each dimension roughly follows a normal distribution.
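
A quick sketch of how those histograms could be drawn from the encodings array above:

```python
# One histogram per latent dimension, assuming the `encodings` array
# of shape (N, 16) computed above.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(4, 4, figsize=(10, 8))
for dim, ax in enumerate(axes.flat):
    ax.hist(encodings[:, dim], bins=40)
    ax.set_title(f"latent dim {dim}")
fig.tight_layout()
plt.show()
```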

Our hope is that the features captured by these reduced dimensions map a visual aspect of the photos that we can recognise.

Additionally, by having each photo represented by these dimensions, we can do some interesting calculations, like finding the ‘average look’ of a designer’s collection or a measure of how similar, or different, one design is from another.

Putting the photos back together again

The decoder half of the autoencoder, on the other hand, takes the 16 dimensions representing a single runway photo and reconstructs a version of the original photo.

For example, we can use the decoder to reconstruct a few images from the dataset:

An example set of runway images reconstructed from the 16 latent dimensions using the decoder.

These reconstructed images do not maintain every detail from the original, but preserve those features that the autoencoder has learned from.
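
A sketch of that reconstruction step, reusing the decoder, images and encodings from above:

```python
# Reconstruct a few photos from their 16-dimensional encodings, assuming
# the `decoder`, `images` and `encodings` defined above.
import matplotlib.pyplot as plt

n = 6
recon = decoder.predict(encodings[:n])

fig, axes = plt.subplots(2, n, figsize=(2 * n, 5))
for i in range(n):
    axes[0, i].imshow(images[i])   # original photo
    axes[0, i].axis("off")
    axes[1, i].imshow(recon[i])    # reconstruction from its encoding
    axes[1, i].axis("off")
plt.show()
```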

The encoding dimensions are nothing but numbers. We understand their distribution — we know what is an acceptable value for each one of these dimensions to have — so there is nothing stopping us from manipulating them and constructing new images from new encodings.

We can even create images from picking encodings at random!

Reconstructed runway photos from encodings made up of random numbers using the decoder.

These reconstructed images look like runway photos, however, they never existed in the original dataset. These looks weren’t shown on the runway; they’re completely new.
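
Because training pushed the encodings towards a standard normal distribution, one way to sketch this is to draw random 16-dimensional vectors from that distribution and decode them:

```python
# Decode encodings drawn at random from a standard normal distribution,
# assuming the `decoder` defined above.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
random_codes = rng.normal(size=(6, 16))   # 6 made-up encodings
fakes = decoder.predict(random_codes)

fig, axes = plt.subplots(1, 6, figsize=(12, 3))
for ax, img in zip(axes, fakes):
    ax.imshow(img)
    ax.axis("off")
plt.show()
```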

Generating random runway photos is interesting enough, but we can further manipulate encodings for a particular purpose. For example, we can investigate the mid-point between two runway photos by reconstructing an image from the average of their encodings.

Two reconstructed runway photos from our set (extremes), and a made-up photo reconstructed from the average of the encodings of the two originals (middle).

The photo in the middle looks like it could belong in the dataset: it depicts a model in a dress, centred in the frame. It is interesting that the shape of the dress is somewhere in between the first and second original dresses, and the background is grey like the originals. Note however that this image is not simply a faded version of the two originals, but looks like a model wearing a dress.
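
A sketch of that mid-point reconstruction, reusing the encodings and decoder from above (the photo indices are arbitrary):

```python
# Reconstruct the mid-point between two runway photos by averaging their
# encodings, assuming `encodings` and `decoder` from above.
import numpy as np

a, b = encodings[10], encodings[200]                  # two arbitrary photos
midpoint = (a + b) / 2.0
blend = decoder.predict(midpoint[np.newaxis, :])[0]   # reconstructed mid-point image
```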

We don’t need to use a strict average of the two images; in fact, you can see the transformation from the first dress to the second by creating a linear progression from the encoding of the first image to the encoding of the second.

Two reconstructed runway photos from our set (extremes), and a set of reconstructed photos from discrete points in a linear progression in the encoded space.

Here we can see the long, narrower dress turn into the wider one. We can also see a progression in color, in the shapes in the background, and even in the pose of the model. Still, every stage of the transformation depicts a model in a dress.

These two designs initially look quite different but share some commonalities which create an interesting progression. They both use color blocking and layering, plus the walk and styling of the model is very similar. You can see how smooth the transition from the left to right is.

We can also visualise this kind of transition in the form of an animation; we can observe the transformation of one dress to another, the change in pose of the model and even the appearance or disappearance of objects in the background.
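
A sketch of how such an animation could be put together, assuming the encodings and decoder from above and using imageio to write a GIF:

```python
# Interpolate linearly between the encodings of two photos and save the
# reconstructed frames as a GIF, assuming `encodings` and `decoder` from above.
import numpy as np
import imageio

start, end = encodings[10], encodings[200]            # two arbitrary photos
steps = np.linspace(0.0, 1.0, num=30)
codes = np.stack([(1 - t) * start + t * end for t in steps])
frames = decoder.predict(codes)                       # shape: (30, H, W, 3)

frames_uint8 = (np.clip(frames, 0, 1) * 255).astype(np.uint8)
imageio.mimsave("transition.gif", list(frames_uint8), duration=0.1)
```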

Reconstructed transitions between pairs of runway photos by different designers.

Depending on whether you choose similar or dissimilar images, you can see the commonalities come together or the differences highlighted.

Boiling an entire show down to one image

So, if we can reconstruct images from the transition between the encodings of two designs, we can also visualise the entire collection of one designer by reconstructing a single look from their average encoding. It will take every look of that designer and produce one summarising image of the collection.

Reconstructed images from the average encoding per designer.
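
A sketch of how those per-designer summaries could be computed, assuming the encodings and decoder from above plus a designers list with one label per photo:

```python
# Reconstruct one summary image per designer from the average of that
# designer's encodings. Assumes `encodings`, `decoder` and a `designers`
# list with one label per photo.
import numpy as np

designers = np.asarray(designers)
summaries = {}
for name in np.unique(designers):
    mean_code = encodings[designers == name].mean(axis=0)
    summaries[name] = decoder.predict(mean_code[np.newaxis, :])[0]
```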

These reconstructions of averages are way more interesting! They allow us to immediately understand the main themes of each show, and we can use this to deconstruct the look into its defining parts. It looks like some designers have a definite visual style, for example: Margaret Howell, Antonio Berardi and Emilia Wickstead.

Margaret Howell is renowned for its relaxed, understated style, so it’s not surprising that the brand’s visual identity comes through so clearly. A crumpled knee-length trench in muted tones, with the model’s hands in pockets, creates a real impression of the look the brand is famous for.

Antonio Berardi is defined by a short, tight dress with the blues and purples from the show coming through.

Emilia Wickstead is perhaps the most representative visually. You get the impression of a lighter, flowing fabric in a loose simple style with even the light floral pattern coming through.

Mapping an entire fashion week

The 16 encoding dimensions describe the position of each image within the encoded space. Every single runway photo has a specific position in this space, and thus we can measure the distance from the position of one photo to another. Essentially, this shows how similar or dissimilar two images, or two whole collections, are from one another.

In order to better visualise this space in two dimensions, we used a technique called t-SNE. This creates a two-dimensional map of the encoded space where the distance between each photo is reasonably preserved for visual examination.

Runway photo encodings mapped to two dimensions using t-SNE.
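
A sketch of that map using scikit-learn’s t-SNE, assuming the encodings and designers arrays from above (the perplexity value is an arbitrary choice):

```python
# Map the 16-dimensional encodings to 2-D with t-SNE and color by designer.
# Assumes `encodings` (N, 16) and `designers` (N labels) from above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(encodings)

designers = np.asarray(designers)
fig, ax = plt.subplots(figsize=(10, 10))
for name in np.unique(designers):
    mask = designers == name
    ax.scatter(coords[mask, 0], coords[mask, 1], s=8, label=name)
ax.set_xticks([])
ax.set_yticks([])
plt.show()
```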

We color-coded each photo, represented by a dot, according to its designer. In this map, we can see that there are some designers whose photos are clustered tightly together in a group. This means that the autoencoder has determined that their photos have similar characteristics, and that photos from other designers don’t share those characteristics. For example, Julien Macdonald, Versus and Bora Aksu.

These collections have a tight visual identity.

At Versus the looks are quite alike. The palette is dark and the shapes are structured and short, with defined waists. Uniformity makes this and Julien Macdonald stand out.

Interestingly, Bora Aksu uses an unusual number of different textures and techniques including lace, pleating and ruffles within the one collection. The color palette is varied and the cuts differ greatly from one another — yet despite these differences, the autoencoder is able to pinpoint Bora Aksu’s singular look.

There are some clusters in the map where multiple designers are mixed together and some designers whose pictures are spread out all over the map. This means that the autoencoder has not found many common characteristics between their looks.

For our next trick, the machine detects trends

We can home in on which photos the autoencoder has detected as having similar characteristics.

Pairs of runway photos from a few different designers that are closest to each other in the encoded space.
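
A sketch of how those pairs could be found, assuming the encodings and designers arrays from above: for each designer, we look for the two of their photos that sit closest together in the encoded space.

```python
# For each designer, find the pair of their own photos that are closest
# in the encoded space. Assumes `encodings` and `designers` from above.
import numpy as np
from scipy.spatial.distance import pdist, squareform

designers = np.asarray(designers)
closest_within = {}
for name in np.unique(designers):
    idx = np.flatnonzero(designers == name)
    if len(idx) < 2:
        continue                                   # some designers have a single photo
    dists = squareform(pdist(encodings[idx]))      # pairwise Euclidean distances
    np.fill_diagonal(dists, np.inf)                # ignore self-distances
    i, j = np.unravel_index(np.argmin(dists), dists.shape)
    closest_within[name] = (idx[i], idx[j])
```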

Choosing one pairing per designer, the similarities here seem to relate mostly to the styling of the show and the models, combined with the poses they are in. We would expect each designer’s show to have very similar looks dotted throughout in order to create a strong identity for their season.

The autoencoder detects highly nuanced aspects of the designs. At Erdem, it has identified the trim which runs V-shaped across the bodice of both dresses, despite one being white lace and the other floral print. At Burberry and Versus one model is female and the other male, and yet the autoencoder spots the masculine silhouette of both, and the layering.

We can also see similarities across designers by finding pairs of photos that are not from the same designer but are closest to each other in the encoded space.

Pairs of runway photos from different designers that are closest to each other in the encoded space.
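
A sketch of how those cross-designer pairs could be found, again assuming the encodings and designers arrays from above:

```python
# Find the closest pairs of photos that come from *different* designers.
# Assumes `encodings` and `designers` from above.
import numpy as np
from scipy.spatial.distance import pdist, squareform

designers = np.asarray(designers)
dists = squareform(pdist(encodings))                      # (N, N) distance matrix

# Ignore same-designer pairs and the lower triangle (each pair counted once).
mask = (designers[:, None] == designers[None, :]) | np.tri(len(dists), dtype=bool)
dists[mask] = np.inf

# The ten closest cross-designer pairs of photos.
closest = [np.unravel_index(k, dists.shape)
           for k in np.argsort(dists, axis=None)[:10]]
```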

This is an incredibly interesting look at trends in shape, material and pattern created within each season. The selected looks give you an overview of the seasonal trends and automatically show the parallels between designers. We can see that in the long coats from Anya Hindmarch and Burberry, the loose-fitting tea dresses at Paul Smith and Simone Rocha and the monochromatic prints at Joseph and Molly Goddard.

The way the autoencoder has paired similar looks reveals the key seasonal trends in shape, material and pattern.

The autoencoder detects some unlikely but interesting comparisons. We can see connections based on ruffles, volume in coats and shoulders as well as deconstructed looks.

For example, Roberts Wood and Anya Hindmarch typically have very different styles, but this pairing uncovers strangely similar design elements.

Seeing Ashish alongside Mulberry, for the metallic fabric choice, is unexpected. The designers share little in design outlook, yet Mulberry has recently been moving in a more modern direction, which has been picked up here.

Revealing brand identity

We can also take the distances between every pair of runway photos by the same designer in the encoded space, to get an understanding of how cohesive a designer’s pieces are.

Designers that have consistency in their looks will likely have shorter distances between their photos, whereas designers with a very diverse set of looks will have more variation in the distances between their photos.

Encoded space distance between every pair of runway photos for each designer. Highlighted at the bottom are three examples of designers with cohesive brand identities, and two whose designs have higher variance.
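
A sketch of how those per-designer distance distributions could be computed and plotted, assuming the encodings and designers arrays from above:

```python
# Distribution of pairwise encoded-space distances within each designer's
# collection. Assumes `encodings` and `designers` from above.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

designers = np.asarray(designers)
names = [n for n in np.unique(designers) if (designers == n).sum() > 1]
per_designer = [pdist(encodings[designers == n]) for n in names]

fig, ax = plt.subplots(figsize=(14, 5))
ax.boxplot(per_designer, labels=names)
plt.xticks(rotation=90)
ax.set_ylabel("distance between pairs of photos")
fig.tight_layout()
plt.show()
```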

Some designers, like Simone Rocha, Roberts Wood and Sharon Wauchob, have shorter distances between their photos; we can say that their brand identities are very cohesive.

These shows have a definite style: the colors, poses and shapes of the garments, as well as the backgrounds, are similar across the photos.

Other designers’ photos are spread out across the map, with more variance in the distances between them. The collections that the autoencoder deemed to be very varied include Edeline Lee and Emilio de la Morena.

We can see why the distance between these photos is wider. Sure, Edeline Lee’s backgrounds are very different, but the designs are quite diverse as well. Emilio de la Morena’s is even more interesting — the background seems to be uniform, but the garment designs are strikingly different from one another.

It’s not often that a designer’s runway show has as much variety as this. It’s possible that a human analyst wouldn’t have spotted the properties of the photos and reconstructed images highlighted by the autoencoder. However, the indiscriminate nature of the autoencoder’s approach to analysing data brings these points to the forefront.

Experimenting with deep learning methods for analysis creates a huge opportunity to explore and extract quantifiable insights from datasets which at first glance might not seem quantifiable.

It’s fun to play around with this imagery, but the underlying industry impact is vast. AI could help a buyer detect the most iconic pieces from a designer’s offering, reveal the season’s leading trend stories in real time or help a retailer create cohesive assortments pairing unique designers.

This is just the beginning.

Alejandro Giacometti is a data scientist at EDITED with a specific interest in digital humanities and image science. At the weekend, when his brain isn’t figuring out the future of the fashion industry, he enjoys baking bread. If you’re interested in his bread, or EDITED’s data science, come and work with us.

Sophie Coy is a data manager at EDITED, she came from fashion but is now also fluent in data. Too often she makes big Asos orders, not often enough she samples Alejandro’s sourdough.
