One Size Fits All — An Approach to Calculate Universal Clothing Sizes

May 10, 2021

Authors: Josef Feigl & Leonard Judt (Otto Group data.works GmbH)

tl;dr: We present a way to rank all customer purchases by size and to learn a new, more universal size measure. This new size can be used to compare clothing item sizes across the whole fashion assortment. The results can be reproduced using the Jupyter notebook https://gist.github.com/jfeigl-ottogroup/79559d3f8105b389534dfc6876fddb33

1. Introduction

People come in all shapes and sizes. This makes it difficult to find individually fitting clothes online without trying them on first. The displayed product size helps a lot, although it is often just a rough estimate. Even knowing our size for one item of clothing does not mean that other items in the same size will fit us as well. To make things even worse, clothing sizes also come in all variants. Just like the average customer, we at the Otto Group continuously face the challenge of having to compare different types of measurements for clothing: How does a 44 from a German brand compare to an XL from a US-American brand? Or t-shirts in 42 to sweaters in 42?

A lot of effort has been put into systems that help to mitigate this issue. Simple size comparison tables are well known, and even complex fit prediction models are widely used today.

Fig. 1: Size comparison table for women’s dresses and suits

We want to share a different approach: one that tries to find more fine-grained size information, a more universal size measure that helps us to compare clothes and their displayed shop sizes.

Fig. 2 & 3: Dresses with the same displayed size (left) can be cut differently and have a differing universal size (right).

2. Data

To compare clothes and their shop sizes, we started with the purchase history of all customers. Our data set consists of customers, the products they bought and the sizes in which they kept these products. Let’s create a tiny dummy data set using the following snippet:

Fig. 4: Dummy data with purchases
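A minimal version of such a snippet could look like the following; the column names are our choice for illustration:

```python
import pandas as pd

# one row per product a customer bought and kept
# (column names are illustrative)
purchases = pd.DataFrame(
    {
        "customer": ["A", "A", "B", "B", "B"],
        "product": ["X", "Y", "Z", "X", "Y"],
        "size": [40, 42, 16, 44, 44],
    }
)
print(purchases)
```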

Here, customer A has bought and kept two products: product X in size 40 and product Y in size 42. His buying profile therefore consists of the two product-size tuples (X, 40) and (Y, 42). We will use this notation a lot.

3. Main Idea

Let’s assume we have two pieces of clothing and their shop sizes (X, 42) and (Y, 44). Currently, we cannot really say for sure whether (X, 42) < (Y, 44), because sizes are difficult to compare (as explained above). Our main goal is to learn a universal size mapping S for all product-size tuples that allows us to make better size comparisons across the whole clothing assortment.

If product X in size 42 is smaller or tighter for most customers than product Y in size 44, our mapping S should yield: S(X, 42) < S(Y, 44).

We should emphasize that we want to find size relations that are valid for most customers. As people come in all shapes and sizes, it is most likely impossible to find relations that are always correct for all customers.

To solve this problem and create an extensive product-size mapping S, we extracted pairwise item samples from our data set, turned the task into a pairwise ranking problem, and solved it using a simple pytorch model.

4. Pairwise Ranking

To turn our data set into a training set with multiple pairwise ranking samples, let’s first have a look at the single customer profile of customer B from our dummy data. This customer has bought three clothing items:
(Z, 16), (X, 44) and (Y, 44).

  • Assumption 1: The first underlying assumption for our approach is that two pieces of clothing bought and kept by the same customer should have a similar clothing size. For our single customer B, we can therefore conclude that e.g. (X, 44) = (Y, 44), because he bought these two items and kept them.
    This general assumption is, of course, not correct in many cases, as people also buy products for their partners and children, for example. However, this noise should even out when using a vast amount of historical data, and for the majority of profiles we can assume that kept products have a similar size.
  • Assumption 2: The second assumption in place is that for the same clothing item, a higher clothing size indicates an overall larger size. Thus, we can safely assume that (Y, 44) < (Y, 46) or (X, 42) < (X, 44), as these relations are given by the manufacturer and should always be true.

With these two assumptions, we can create the following cascade:
(X, 42) < (X, 44) = (Y, 44) < (Y, 46)

And, therefore, the following pairwise ranked tuples:
(X, 44) < (Y, 46)
(X, 42) < (Y, 44)

Of course, we could create many more samples, but these two are usually the most difficult to learn, as they are the closest together.

These two pairwise ranked samples are extracted from just one customer profile. Applying the same procedure to all customer profiles of a typical real-world data set of purchases could easily create billions of such pairwise ranked samples.

For our use case, it was neither necessary nor memory-efficient to extract all possible samples in order to fit a sufficiently good model. Therefore, we created a sampling function with four steps (a sketch follows the list):

  1. Select two random product-size tuples from each customer profile.
  2. Equate both samples (see Assumption 1).
  3. Create ranking cascade (see Assumption 2).
  4. Select pairwise samples closest together.
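A sketch of such a sampling function could look like this. It assumes a uniform step of 2 between neighbouring sizes, which is a simplification; real sizing systems differ:

```python
import random

SIZE_STEP = 2  # simplifying assumption: neighbouring sizes differ by 2

def sample_pairs(profiles):
    """Returns up to two (smaller, larger) ranked pairs per customer profile."""
    pairs = []
    for profile in profiles:  # profile: list of (product, size) tuples
        if len(profile) < 2:
            continue
        # 1. select two random product-size tuples from the profile
        (p1, s1), (p2, s2) = random.sample(profile, 2)
        # 2. equate both samples: (p1, s1) = (p2, s2)        (Assumption 1)
        # 3. cascade: (p1, s1 - step) < (p1, s1) = (p2, s2) < (p2, s2 + step)
        #                                                     (Assumption 2)
        # 4. keep only the two pairwise samples closest together
        pairs.append(((p1, s1 - SIZE_STEP), (p2, s2)))
        pairs.append(((p1, s1), (p2, s2 + SIZE_STEP)))
    return pairs

# customer A: (X, 40), (Y, 42); customer B: (Z, 16), (X, 44), (Y, 44)
profiles = [[("X", 40), ("Y", 42)], [("Z", 16), ("X", 44), ("Y", 44)]]
print(sample_pairs(profiles))
```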

This way, we can create two random pairwise ranked samples per customer profile with each call of this sampling function. For each training iteration of our model, we created a new set of samples using this function. One batch of such samples using our dummy data could look like this (instead of using tuples, we concatenated product and size):

Fig. 5: Pairwise samples. Tuples in column B are always larger than tuples in column A.

Our sampling function can easily be extended to include a binary target column for each pairwise tuple. We created pairwise ranked triplets (A, B, T), where A and B are product-size tuples and T is a binary target that signals whether B is larger than A.

Fig. 6: Pairwise samples with target. The target column indicates if the tuple in column B is larger than the tuple in column A.

Both approaches are equivalent. We used the second variant to make the output of our model more easily interpretable.
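Extending the sampling sketch from above, a random swap can decide which variant of each pair is produced:

```python
def to_triplets(pairs):
    """Turns ordered (smaller, larger) pairs into (A, B, T) triplets."""
    triplets = []
    for smaller, larger in pairs:
        if random.random() < 0.5:
            triplets.append((smaller, larger, 1))  # T=1: B is larger than A
        else:
            triplets.append((larger, smaller, 0))  # T=0: B is smaller than A
    return triplets

print(to_triplets(sample_pairs(profiles)))
```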

5. Pytorch model

Let’s first have a look at our simple model:

Fig. 7: Our ranking model to learn universal size embeddings.

For each training triplet (A, B, T), we predict the probability that product-size tuple B is larger than product-size tuple A. For our model, we interpret the universal size mapping S we are looking for as a 1-dimensional embedding. Thus, we make use of pytorch’s Embedding layer. The output of our model is then given by: Sigmoid(S(B)-S(A))

The inner term S(B)-S(A) should ideally become very large if B is larger than A, which is also equivalent to T=1. If, on the other hand, A is larger than B, then the embedding difference S(B)-S(A) should become negative.

We use a sigmoid activation function to turn the embedding difference into a probability, answering the question: What’s the probability that B is larger than A? This also allows us to simply use the binary cross-entropy loss as our training loss function.

We initialize all embeddings with zero. Let’s imagine a number line: all product-size tuples start at zero, and with each training triplet the larger product-size tuple gets moved to the right and the smaller product-size tuple to the left. Ideally, all product-size tuples should wiggle into place and converge to a more or less stable state.
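A minimal re-implementation of the model described above could look like this; the class and attribute names are our choice for the sketch and may differ from the notebook:

```python
import torch
import torch.nn as nn

class UniversalSizeModel(nn.Module):
    """Learns one scalar universal size S per product-size tuple."""

    def __init__(self, n_tuples):
        super().__init__()
        # 1-dimensional embedding: the universal size mapping S
        self.size = nn.Embedding(n_tuples, 1)
        nn.init.zeros_(self.size.weight)  # all tuples start at zero

    def forward(self, a, b):
        # a, b: integer ids of the product-size tuples in a training pair
        # output: probability that tuple b is larger than tuple a
        return torch.sigmoid(self.size(b) - self.size(a)).squeeze(-1)
```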

We use skorch for the actual training of the model. skorch was developed by colleagues of ours and has proven to work very well for a wide range of machine learning problems. It allows us to concentrate on the modeling aspect of the problem and simplifies the actual training of the model.

To make use of our sampling function, we create a new large batch of training samples for each model iteration. This can be done using the IterableDataset class in pytorch:

Fig. 8: We are sampling new ranked training samples during each iteration.
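Reusing sample_pairs and to_triplets from the sketches above, a compact way to wire everything together could look like this; hyperparameters are illustrative, and depending on the skorch version minor adjustments may be needed:

```python
import torch
import torch.nn as nn
from torch.utils.data import IterableDataset
from skorch import NeuralNet

class PairSampler(IterableDataset):
    """Draws a fresh set of ranked training triplets on every pass."""

    def __init__(self, profiles, tuple_ids, samples_per_epoch=1024):
        self.profiles = profiles
        self.tuple_ids = tuple_ids  # maps (product, size) -> integer id
        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        return self.samples_per_epoch

    def __iter__(self):
        produced = 0
        while produced < self.samples_per_epoch:
            # new random samples on every pass (see sketches above)
            for a, b, t in to_triplets(sample_pairs(self.profiles)):
                if produced == self.samples_per_epoch:
                    return
                yield (
                    {"a": self.tuple_ids[a], "b": self.tuple_ids[b]},
                    torch.tensor(float(t)),
                )
                produced += 1

# ids for every product-size tuple that can occur, including the
# neighbouring sizes created by the cascade
sizes_by_product = {}
for profile in profiles:
    for p, s in profile:
        sizes_by_product.setdefault(p, set()).update({s - SIZE_STEP, s, s + SIZE_STEP})
tuple_ids = {
    t: i
    for i, t in enumerate(
        sorted((p, s) for p, sizes in sizes_by_product.items() for s in sizes)
    )
}

net = NeuralNet(
    UniversalSizeModel,
    module__n_tuples=len(tuple_ids),
    criterion=nn.BCELoss,  # binary cross-entropy, matching the sigmoid output
    max_epochs=20,
    lr=0.1,
    train_split=None,  # no validation split in this sketch
)
net.fit(PairSampler(profiles, tuple_ids), y=None)
```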

6. Let’s train

We monitor two metrics during the training:

  • Accuracy: Did we correctly predict the target?
  • Monotonicity: For a single product X we should be able to learn that e.g.
    S(X, 40) < S(X, 42) < S(X, 44) < S(X, 46).
    We monitor the ratio of products for which this monotonicity condition holds true. Ideally, this metric should approach 1 (a sketch of this check follows the list).
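One way to compute the monotonicity score from the learned mapping could look like this; universal_sizes is assumed to be a dict from product-size tuple to learned scalar (see the extraction sketch below):

```python
def monotonicity_score(universal_sizes):
    """Fraction of products whose learned sizes increase with the shop size.

    universal_sizes: dict mapping (product, size) -> learned scalar S
    """
    products = {p for p, _ in universal_sizes}
    monotone = 0
    for product in products:
        # learned values, ordered by the displayed shop size
        values = [
            universal_sizes[(p, s)]
            for p, s in sorted(universal_sizes)
            if p == product
        ]
        if all(a < b for a, b in zip(values, values[1:])):
            monotone += 1
    return monotone / len(products)
```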

For our dummy data, the training process looks like this (the full training log can be found in the notebook):

After the training, we can extract the learned embeddings:
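Continuing the sketches above, the learned universal sizes are simply the weights of the embedding layer; net.module_ is the fitted module inside the skorch net:

```python
# extract the learned universal size for every product-size tuple
weights = net.module_.size.weight.detach().squeeze(-1)
universal_sizes = {t: weights[i].item() for t, i in tuple_ids.items()}
for (product, size), value in sorted(universal_sizes.items(), key=lambda kv: kv[1]):
    print(product, size, round(value, 3))
```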

As we can see, our monotonicity criterion holds true for X and Y but not for Z. However, the results differ a lot with each run, because we are using such a tiny dummy dataset. We achieved a monotonicity score of about 90% using our large real-world data set with millions of purchases and thousands of products.

7. Use cases

Our universal size approach helped us to find an arrangement of all sizes that is valid across all articles. Use cases are plentiful:

  • When filtering a product list for size, we can now aggregate products of different measurement types but similar actual size.
  • It helps us to identify brands that use a different cut than their counterparts.
  • We can easily identify products that turn out bigger or smaller than their displayed size suggests. This helps our customers to select the right size.

8. Improvements and remarks

There is still room for improvement.

  • One single universal size mapping might be misleading when comparing, say, shoes and upper-body clothes. Depending on the use case, one might need to train multiple models for different clothing assortments. Similarly, one could train two different embedding spaces for women’s and men’s fashion.
  • The monotonicity score can also be improved by adding single product training triplets such as ((X, 42), (X, 44), 1) or ((Y, 40), (Y, 38), 0).
  • A pytorch model might be overkill. One could easily skip pytorch and derive the update rules for such a model by hand. This might even be faster in the end. However, for the modeling process, which is often trial-and-error, we enjoyed the flexibility that pytorch and skorch gave us.

9. Jupyter Notebook

To recreate these results, you can use our gist: https://gist.github.com/jfeigl-ottogroup/79559d3f8105b389534dfc6876fddb33
