AI-generated fashion: how we used generative adversarial networks for e-commerce

Ivona Tautkute
Tooploox AI
Mar 9, 2020 · 8 min read


The multimodal heaviness of search

The internet is growing and it is getting harder and harder to find what you are looking for. Even though we have sophisticated visual search engines and can search our photo collections for pictures of “my cat and my grandpa together”, there are still times when you want to shout at the screen because it simply cannot understand what you mean.

Sometimes the goal of product search is broader than simply finding visually similar pictures. Perhaps you don’t like the results you see. Perhaps you would like to add some information about the brand, material or color shade. One way to do this is with filters based on attributes or tags. However, we do not want to restrict the search engine to a predefined set of attributes. On the contrary, our goal is to let the user express their intent in a natural-language query, such as: “show me shoes similar to these but in black leather”.

The rise of GAN-based methods, especially conditional ones, inspired us to try applying them to multimodal search. By multimodal, we mean a search query composed of an image and text, as in the picture above. The idea is the following: a user has an image and a piece of additional text information that is not visually present in the picture. Both are fed to a multimodal GAN that generates a synthetic image representing the object the user is looking for. Finally, the generated image is used to perform a visual search in the product database and find the best-matching products.

Illustration of how our multimodal retrieval with generative adversarial networks works
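To make the flow concrete, here is a minimal sketch of the retrieval pipeline in Python (PyTorch-style, not our actual implementation): every component, from the encoders to the generator, is a placeholder.

```python
import torch

def multimodal_search(query_img, query_text, encoders, generator, embed, catalogue_emb, k=5):
    """Hypothetical end-to-end query: encode, generate, then nearest-neighbour search."""
    img_emb = encoders["image"](query_img)        # e.g. ResNet features of the query photo
    txt_emb = encoders["text"](query_text)        # e.g. fastText embedding of the extra text
    synthetic = generator(img_emb, txt_emb)       # synthetic image of "what the user meant"
    dists = torch.cdist(embed(synthetic), catalogue_emb)  # distances to all catalogue items
    return dists.argsort(dim=1)[:, :k]            # indices of the k best-matching products
```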

By adding this extra step with a synthetic image, the user can verify whether the search engine understood their multimodal intent correctly. This matters because it helps to distinguish between cases where there really are no similar shirts with shorter sleeves in the shop’s database and cases where the model simply does not understand what a cold-shoulder shirt is.

Example query with SynthTriplet GAN

Give me some data

To verify the idea, we need training data that contains relative captions, that is, not regular text descriptions but descriptions of the differences between two products. The training data we used comes from the Fashion-IQ challenge and already has such labels (human annotators from English-speaking countries manually answered the question “what is the difference between these two similar images?”), e.g. “shirt A is similar to B but has longer sleeves and a different collar”. The dataset contains more than 20K relative captions in three product categories (shirts, tops & tees, dresses). It is a good start and we can get our models running!

Fashion IQ dataset sample

It’s all about GAN

To sum up, on the technical side we want to build a GAN that can generate images from multimodal input while preserving meaningful distances in the embedding space, so that efficient visual search is possible.

Source: https://medium.com/@jonathan_hui

GANs (Generative Adversarial Networks) are the deep learning models responsible for the buzz around strikingly realistic images of people who do not actually exist. Under the hood, a generator and a discriminator play a zero-sum game in which they try to fool each other: the generator creates fake images from a noise (latent) vector, trying to follow the distribution of real images, while the discriminator does its best to distinguish fake images from real ones. Both improve in alternating steps, getting better and better at their respective jobs of “forger” and “policeman”.

Conditional GANs are even more interesting. By feeding class labels to the discriminator together with the real images (and the generated synthetic ones), the generator is forced to create images that correspond to the provided label. As a result, after many hours of training, the generator produces synthetic images conditioned on that label.
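As a rough illustration (an assumption on our side, not the exact architecture from the paper), conditioning usually boils down to concatenating the condition vector with the generator’s noise input and with the discriminator’s image input:

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Toy conditional generator: noise + condition in, flattened image out."""
    def __init__(self, noise_dim=128, cond_dim=64, img_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + cond_dim, 512), nn.ReLU(),
            nn.Linear(512, img_dim), nn.Tanh(),
        )

    def forward(self, z, cond):
        # The condition is simply concatenated with the noise vector.
        return self.net(torch.cat([z, cond], dim=1))

class CondDiscriminator(nn.Module):
    """Toy conditional discriminator: it sees the image together with the condition."""
    def __init__(self, cond_dim=64, img_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + cond_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),
        )

    def forward(self, img, cond):
        # Images that do not match the condition should get low scores.
        return self.net(torch.cat([img, cond], dim=1))
```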

In our case, the input we want to condition on is multimodal: it consists of both text and image. Before feeding them to the GAN, we transform them into one-dimensional vectors using image and text encoders. The simplest way to encode an image is to use a pre-trained ResNet and treat one of its last layers as a feature embedding. For encoding text, we used the text representation library fastText with pre-trained English word vectors.
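A quick sketch of what these encoders can look like in practice. The specific choices (ResNet-50 penultimate features, fastText’s built-in sentence vectors) are assumptions for illustration, not necessarily the configuration we used:

```python
import torch
import torchvision.models as models
import fasttext
import fasttext.util

# Image encoder: a pre-trained ResNet-50 with the classification head replaced
# by an identity, so the 2048-d pooled features become the embedding.
resnet = models.resnet50(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval()

def encode_image(batch):           # batch: (N, 3, 224, 224), ImageNet-normalized
    with torch.no_grad():
        return resnet(batch)       # (N, 2048)

# Text encoder: pre-trained English fastText vectors, pooled into a sentence vector.
fasttext.util.download_model('en', if_exists='ignore')
ft = fasttext.load_model('cc.en.300.bin')

def encode_text(caption):
    return torch.from_numpy(ft.get_sentence_vector(caption))   # (300,)
```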

Our main goal is to optimize for product retrieval, so we should use a loss that directly shapes distances in the embedding space. Happily, triplet loss is all we need. By providing the network with positive and negative examples, we can push the distance between the anchor and the positive example to be smaller than the distance between the anchor and the negative one. How do we choose the triplets? The image generated from the multimodal query (source image plus text) is the anchor, the target image is the positive example and the source image is the negative example.
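In code, this triplet assignment looks roughly like the sketch below (placeholder names, using PyTorch’s built-in triplet margin loss):

```python
import torch

triplet_loss = torch.nn.TripletMarginLoss(margin=1.0)

def retrieval_loss(embed, generated_img, source_img, target_img):
    anchor = embed(generated_img)    # image generated from the multimodal query
    positive = embed(target_img)     # the product the user is actually looking for
    negative = embed(source_img)     # the image the user started from
    return triplet_loss(anchor, positive, negative)
```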

The overall architecture is summarised in the following diagram:

We call this network SynthTriplet GAN

So hard to train you

The generator and discriminator are where most of the experiments and hyper-parameter fiddling take place. Training a GAN is a tricky task, an art in itself, and often involves a lot of “magic tricks” to avoid the dreaded mode collapse. After doing some research, we tried out a couple of the documented strategies. Some of them drastically improved the quality of the generated images, while others did not seem to work for our use case.

[++] Wasserstein GAN with gradient penalty (WGAN-GP)

The original Wasserstein GAN clips the weights to enforce the Lipschitz constraint on the critic. It is a simple but suboptimal solution, as it can cause vanishing gradients (if the clipping window is too small) or make the critic hard to optimize (if it is too large). Instead of clipping, WGAN-GP adds a penalty on the critic’s gradient norm whenever it deviates from 1 (a differentiable function is 1-Lipschitz if and only if its gradients have norm at most 1 almost everywhere). It definitely improved our results over a plain conditional GAN.
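For reference, the gradient penalty itself is only a few lines. This is the standard WGAN-GP formulation (conditioning inputs omitted, image tensors assumed to be (N, C, H, W)), not code lifted from our repository:

```python
import torch

def gradient_penalty(critic, real, fake, device="cpu"):
    # Sample points on straight lines between real and fake images.
    alpha = torch.rand(real.size(0), 1, 1, 1, device=device)
    interpolated = (alpha * real + (1 - alpha) * fake).requires_grad_(True)

    scores = critic(interpolated)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interpolated,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    # Push the critic's gradient norm towards 1 on those points.
    return ((grad_norm - 1) ** 2).mean()

# Typical critic loss with the penalty weighted by 10, as in the WGAN-GP paper:
# d_loss = fake_scores.mean() - real_scores.mean() + 10.0 * gradient_penalty(critic, real, fake)
```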

[+] Label smoothing

This method prevents the discriminator from becoming too confident in its classifications and is very easy to implement. For our experiments, we switched the labels of real images from 1 to a random number between 0.9 and 1, and training improved a bit.
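In code this is just a different target tensor for the real samples; a tiny sketch, with the 0.9–1 range mirroring what we did:

```python
import torch

def real_targets(batch_size, device="cpu"):
    # Random targets in [0.9, 1.0) instead of hard 1s for real images.
    return 0.9 + 0.1 * torch.rand(batch_size, 1, device=device)

def fake_targets(batch_size, device="cpu"):
    return torch.zeros(batch_size, 1, device=device)
```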

[+++] Two-time scale update rule (TTUR)

A paper published at NeurIPS 2017 used the theory of stochastic approximation to show that GANs trained with a two time-scale update rule converge to a local Nash equilibrium. In this technique, the learning rate is set separately for the discriminator and the generator. Super simple and super useful.
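In practice, TTUR amounts to two optimizers with different learning rates. The values below are common defaults from the TTUR paper, not necessarily the ones we settled on, and the two modules are stand-ins for the real networks:

```python
import torch
import torch.nn as nn

generator = nn.Linear(128, 64 * 64 * 3)        # stand-in for the real generator
discriminator = nn.Linear(64 * 64 * 3, 1)      # stand-in for the real critic

# The core of TTUR: a separate (typically larger) learning rate for the discriminator.
g_optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
d_optimizer = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.9))
```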

[-] SAGAN

Self-Attention GAN with spectral normalization on the convolutional kernels has gained a lot of buzz as a way to get better and more stable training. Unfortunately, when applied to our multimodal GAN, the positive effects were negligible.

Let it train

Generator learns to mimic the distribution of real images

When everything is set up, it’s time for the fun part: watching noise become realistic images. Apart from tracking the quality of the generated images, which steadily improved, we monitored retrieval recall, since that was our actual main goal. To evaluate it, we computed embedding vectors for all images in the validation set, found the vectors closest to each query and measured Recall@K for K in 10, 50, 100 and 1000. It is reassuring to see that as training progresses, the generated images get better at helping retrieve the target images.
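The evaluation itself is straightforward; here is a sketch of Recall@K computed from precomputed embeddings (variable names are illustrative):

```python
import torch

def recall_at_k(query_emb, gallery_emb, target_idx, ks=(10, 50, 100, 1000)):
    # query_emb:   (Q, D) embeddings of the generated query images
    # gallery_emb: (N, D) embeddings of all validation images
    # target_idx:  (Q,)   index of the correct product for each query
    dists = torch.cdist(query_emb, gallery_emb)     # (Q, N) pairwise distances
    ranks = dists.argsort(dim=1)                    # gallery indices, closest first
    hits = ranks == target_idx.unsqueeze(1)         # (Q, N) boolean hit matrix
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```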

Below is a sample of results for the working model and a clothing database. The first column shows the input image and the additional text from the user, the second shows the product the user is looking for. The third displays the synthetic image generated by SynthTriplet GAN from both the image and the text, and the following columns show the closest matches from the database that are presented to the user. What is nice is that the retrieved matches correspond semantically to the user’s intentions.

Example retrieval results using SynthTriplet GAN

Walking on embedding

Having trained the model, we also explored what happens when we move through the embedding space and generate images for interpolated latent vectors. What we notice is that the generated images change smoothly.
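The experiment is simple to reproduce conceptually: linearly blend two condition embeddings and generate an image at every step. A sketch, assuming the generator and text encoder from the earlier snippets:

```python
import torch

def interpolate_queries(generator, source_img_emb, text_emb_a, text_emb_b, steps=8):
    images = []
    for t in torch.linspace(0.0, 1.0, steps):
        cond = (1 - t) * text_emb_a + t * text_emb_b    # blended text condition
        images.append(generator(source_img_emb, cond))  # generate one image per step
    return images
```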

In the example below, we interpolate from the text query “yellow and has long sleeve” to the text query “is red and has short sleeves”. What we can see in the picture is that the text query affects each input image in the first column differently.

In another example, we interpolate between blue and yellow. What is nice about the model is that it preserves some general features of the input image such as shape, texture or graphic.

Some interesting results for interpolation between “has zebra pattern” and “is black with a white logo”:

_________________________________________________________________

If you want to know more, check out my videos below from the PyData LA and Applied Machine Learning Days conferences. Also, you can find all the technical details in our paper on arXiv.

________________________________________________________________

This research has been sponsored by Tooploox Research Grant.
