How we use Machine Learning to impact user experience
By Cenk Bircanoglu, Data Scientist
Everything started with the need to improve the user experience.
We wanted to make it easier and faster for users in one of our marketplaces to create an ad, so we decided to create a Machine Learning model. We created, tested and implemented the model in just a few months and now have 90–95% accuracy on our main labels. Most importantly, our users are now able to create ads and automatically fill the required fields simply by uploading an image.
The model is a Convolutional Neural Network for a multi-label classification task covering Brand, Model, Color, Body Type and Year.
It is currently in production and you can try it out for yourself by creating an ad at coches.net!
Keep reading to find out how we did it.
To give a bit of context: in Adevinta, Cognition is a team that works on Computer Vision and Natural Language Processing projects to help marketplaces across multiple topics, such as improving the ad creation process and moderating ads.
What the marketplace wanted was clear: to improve the user experience and make it easier and faster for users to upload an image of their car. Sounds simple! But how to achieve it was a mystery… We knew we had to reduce the number of fields we asked the user to fill in and focus on the ones that would help predict the rest. We started to ask ourselves: which data provided by the user when they upload an ad contains enough information to predict the remaining parameters?
In general, the information stored in our marketplaces can be classified into four different categories:
- Categorical i.e. category or offer type
- Numerical i.e. price or number of kilometers
- Text-based i.e. ad description or title
- Image-based i.e. the photos attached to the ad
From these categories, two hold the most descriptive information: text and images. Anything related to the item can be described in a more comprehensive way with text, but it may require a lot of time and effort from the user. Sometimes the user doesn’t know precisely how to describe with words the product they’re trying to sell. But with an image, every user can clearly describe the item (in our case the car) they’re talking about. A picture is worth a thousand words! So we decided to focus on the images the user would upload to extract the other pieces of information about the item.
It goes without saying that from a user’s perspective, it’s also beneficial to only use an image to create an ad as it’s easier and faster. We were clearly heading in the right direction to assist with the marketplace’s needs.
With that in mind, we started the Car Classifier project, which can extract from a single image the Brand, Model, Body Type, Color and Year fields, and automatically fill those values for the user. This feature would accelerate the ad creation process and make it pretty cool for the end-users, as they’d be less likely to make mistakes when filling in the fields.
Defining the problem
As Data Scientists, before starting any implementation we always aim to have a proper definition of the problem. At first glance, it seemed like a classification problem, which is relatively easy to solve. If you have a high quality dataset, it’s normally easy to obtain good results by using Deep Learning and, more specifically, Convolutional Neural Networks (CNN). However, when we thought about our problem in-depth it had some tricky parts, and we decided we could instead define it as a multi-label or multi-class classification problem. Furthermore, our case had some specificities as there were some relationships between the labels. For example, there was a hard constraint between Brand and Model and a soft relationship between Model and Body Type.
We had to keep in mind that our model should handle these relationships, at least the strong ones. For example, it wasn’t acceptable to predict the Brand as BMW and the Model as A3 for the same image, as there is no A3 model in the BMW range. To be more precise, here is a list of the pairs which have, or may have, some kind of relationship:
- Brand — Model
- Model — Body Type
- Model — Year
- Brand — Body Type (not a direct relation)
- Brand — Year
We needed to decide which of these relationships to manage explicitly, and then create mechanisms to force the predicted labels to respect them. As we mentioned earlier, we wanted to strictly enforce the Brand — Model relation in our results. However, we didn’t need to do the same for the other labels, because valid combinations could exist in the real world that our dataset contained no example of. For instance, there could be a car out there with a specific Model and Body Type pairing that we had never seen.
About the dataset
Before going into the details of modeling, we’d like to first introduce the dataset and give some useful details which impacted the modeling.
Our marketplaces are the biggest, or among the biggest, in their countries. As a result, we had access to a huge image dataset with a lot of variation in it. In this project, we obtained a dataset with hundreds of Brands, thousands of Models, dozens of Colors and Body Types and a wide range of car ages (some were quite new and some were 50+ years old). As expected, there were millions of car images we could train and test our model on. As a Data Scientist, it was really enjoyable to experiment with this kind of dataset, as the variation of the images and categories made the problem harder but more interesting at the same time.
To better understand the problem, we had to investigate the label sets. Brand, Model, Body Type and Color labels are discrete; in other words, they can be handled as categorical variables. When it comes to Year, things get a bit tricky, as it can be approached as categorical or numerical, and both are valid and possible to implement. However, the two approaches have their pros and cons, so we had to list them and think carefully before doing anything else.
If we choose to consider the Year as categorical then our problem would be simpler as the other labels are categorical and our ambition is to use only one model to predict five different labels at once. However, if we choose to consider the Year as numerical, we have to somehow find a way to handle both categorical and numerical labels with one loss function. It’s a solvable problem but trickier than accepting the Year labels as categorical. This point is explained in more detail in the next sections as it can affect the modeling and performance of the model.
Possible model implementations
Before deciding if Year was categorical or numerical, we wanted to list down the possible implementations of the solution by using a CNN architecture.
CNNs are state-of-the-art in the world of Computer Vision applications. Combined with our expertise in that domain, we knew we’d use a CNN architecture and decided to go with a version called ResNet. As ResNet wasn’t originally implemented for multi-label image classification, we needed to make some changes to the architecture. We therefore replaced the last layer of the original ResNet with five different branches, as shown in the figure below.
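In code, the idea looks roughly like this. This is a minimal sketch, not the production model: a small convolutional trunk stands in for the ResNet backbone, and the label counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative label counts (assumptions, not the real vocabulary sizes)
NUM_BRANDS, NUM_MODELS, NUM_BODY_TYPES, NUM_COLORS, NUM_YEARS = 4, 10, 3, 5, 6

class MultiHeadCarClassifier(nn.Module):
    def __init__(self, feature_dim=32):
        super().__init__()
        # Stand-in backbone; in practice this would be a ResNet trunk
        # with its final classification layer removed
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, feature_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Five branches, one per label, all sharing the same image features
        self.heads = nn.ModuleDict({
            "brand": nn.Linear(feature_dim, NUM_BRANDS),
            "model": nn.Linear(feature_dim, NUM_MODELS),
            "body_type": nn.Linear(feature_dim, NUM_BODY_TYPES),
            "color": nn.Linear(feature_dim, NUM_COLORS),
            "year": nn.Linear(feature_dim, NUM_YEARS),
        })

    def forward(self, x):
        features = self.backbone(x)
        return {name: head(features) for name, head in self.heads.items()}

net = MultiHeadCarClassifier()
out = net(torch.randn(2, 3, 64, 64))  # one logits tensor per label
```

Because every branch reads the same shared features, the trunk is trained by the gradients of all five labels at once.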
Although the architecture was able to predict five different labels from one image, we couldn’t be sure the model would learn the relationship between the Brand and the Model. So we added an explicit connection between those two labels, and only those two. The final version of the model is shown below.
The implementation of the masking code block is also given below.
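As a minimal sketch of the masking idea (the brand-to-model table below is a toy assumption, not our real vocabulary): logits of Models that don’t belong to the predicted Brand are pushed to minus infinity, so an incompatible Model can never win.

```python
import torch

# brand_to_models[b] = indices of the models belonging to brand b (toy data)
brand_to_models = {0: [0, 1], 1: [2, 3, 4]}
num_brands, num_models = 2, 5

# Precompute a (num_brands, num_models) compatibility mask
mask = torch.zeros(num_brands, num_models)
for brand, models in brand_to_models.items():
    mask[brand, models] = 1.0

def mask_model_logits(brand_logits, model_logits):
    # Pick the most likely brand per image, then suppress incompatible models
    predicted_brand = brand_logits.argmax(dim=1)   # (batch,)
    batch_mask = mask[predicted_brand]             # (batch, num_models)
    return model_logits.masked_fill(batch_mask == 0, float("-inf"))

brand_logits = torch.tensor([[2.0, 0.1]])                 # brand 0 wins
model_logits = torch.tensor([[0.5, 0.2, 3.0, 0.1, 0.4]])  # model 2 has top raw score
masked = mask_model_logits(brand_logits, model_logits)
# After masking, only models 0 and 1 (brand 0's models) remain eligible
```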
Is the Year categorical or numerical?
We wanted to use one model with five different branches as a final layer, so we had to find a way to organise the loss function(s), optimiser(s) and so on… In every Machine Learning approach, the architecture can have a strong influence, but in the end the performance also depends on the team’s decisions. In our case, we had to make lots of decisions, such as how to connect the multiple outputs, how to train the model with multiple outputs and what the loss functions would be. We’ve listed the possible options and their consequences in the following paragraphs.
Before implementing the model or preprocessing scripts, we needed to make some more changes to the problem definition regarding the possible implementations of the model.
We tackled the problem with two different approaches, as listed below. We had to bear in mind that each approach has side effects which we’ll come back to later.
- Going with the classification of the Brand, Model, Color, and Body Type labels and the regression problem for the Year, which seemed the most natural
- Considering each label as an instance of the classification problem
There were other decisions we had to make, such as defining the optimisation algorithm for the selected model. Again we had two options:
- Using one optimiser for all labels
- Using separate optimisers for each label
Here are the pros and cons of using one optimiser or multiple optimisers for our model:
- Using one optimiser would be faster and cheaper in training time
- By using one optimiser for all labels, we’d have to use one loss function to wrap the losses (as we’re training five different labels, we’d have five different losses — one for each label)
- The effectiveness of using multiple optimisers is not guaranteed as it’s quite uncommon, so we’d have to analyse the impact of this approach on the training (especially the impact on the common layers)
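To make the two options concrete, here is a sketch with a toy two-head model; the layer sizes, head names and learning rate are assumptions, not our production values.

```python
import torch
import torch.nn as nn

# Toy stand-in for the multi-branch model: a shared trunk with two heads
# (layer sizes and head names are illustrative assumptions)
class ToyMultiHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Linear(8, 4)
        self.heads = nn.ModuleDict({
            "brand": nn.Linear(4, 3),
            "model": nn.Linear(4, 5),
        })

net = ToyMultiHead()

# Option A: one optimiser over all parameters, driven by one wrapped loss
single_opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Option B: a separate optimiser per label branch; the shared trunk would
# then receive competing update schedules, which is part of why this setup
# is uncommon and needed a careful impact analysis
per_head_opts = {name: torch.optim.Adam(head.parameters(), lr=1e-3)
                 for name, head in net.heads.items()}
```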
Before experimenting with multiple optimisers or one optimiser, we also wanted to reduce the number of possibilities. Indeed, using one optimiser forced us to think about how we were going to wrap the losses.
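The two loss-wrapping options can be sketched as follows. This is a hedged reconstruction: the per-label weights and the label names are assumptions, and we simply keep the weights equal.

```python
import torch
import torch.nn.functional as F

# Per-label loss weights (assumption: all kept equal, as in our final setup)
w = {"brand": 1.0, "model": 1.0, "body_type": 1.0, "color": 1.0, "year": 1.0}

def loss_year_as_regression(outputs, targets):
    # Option 1: cross-entropy for four labels, regression (MSE) for Year.
    # The MSE term can sit on a very different scale than the cross-entropies.
    loss = sum(w[k] * F.cross_entropy(outputs[k], targets[k])
               for k in ("brand", "model", "body_type", "color"))
    loss = loss + w["year"] * F.mse_loss(outputs["year"].squeeze(1),
                                         targets["year"].float())
    return loss

def loss_year_as_classification(outputs, targets):
    # Option 2 (the one we chose): a weighted sum of five cross-entropies
    return sum(w[k] * F.cross_entropy(outputs[k], targets[k])
               for k in ("brand", "model", "body_type", "color", "year"))
```

With the second option, a single optimiser can simply minimise the one wrapped loss.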
In the code above, we’ve shown how we formalised the loss function for both cases and reviewed the Year as both a categorical label and a numerical label. In the first case, the Year label is handled as a regression problem and in the second one, it’s handled as classification.
When we dive deeper into the loss functions listed above, it’s clear that using a weighted sum of classification losses alone is simpler than mixing classification losses with a regression loss. Why? Because the regression loss can sit on a very different scale from the classification losses, which would affect training time and performance. That’s why we decided to treat the Year label as a categorical variable and chose the second loss function.
We used some multiplier constants in the loss function formulas as weights. To keep it simple and avoid side effects, we kept all the values the same. In theory, playing with these constants can affect the results. For example, if the weight of the Brand loss is bigger than the others, the Brand loss will be dominant and the model will optimise for the Brand more than for the other labels.
Final model and performance
To reach the final model, we did lots of experiments with many variations — some of which we’ve listed above. Some approaches were discarded even before running an experiment on the dataset, but some needed deeper investigation and detailed numerical analyses.
To do this kind of numerical analysis, we used various metrics, approaches and ablation studies, but I’m only going to mention the accuracy metric — and I’ll give a range rather than a precise metric. With the final model, we reached more than 95% accuracy on the Brand and 90% accuracy on the Color, Model, and Body Type. The accuracy of the Year label is slightly lower as it’s the hardest one to predict — even for car experts.
As we achieved great results, we started to use this feature in coches.net, one of our marketplaces in Spain. As of today, it has been in production for more than six months and according to the marketplace’s surveys, it’s clear that users like the feature. It has helped speed up and automate the process as users no longer need to manually fill in the form when publishing an ad.