Learnings from building machine learning products at scale
From Jeremy Chamoux and Jordi Mas, Engineering Managers; Victor Codina, Data Scientist; and Sacha Verrier, Data Engineer
Adevinta started the journey of Machine Learning (ML) many years ago, when it was still part of Schibsted. The ML initiatives were taken both in our marketplaces and within the Global Teams. Over time, we invested in training our people, built capabilities to speed up the development of ML solutions, and established data governance practices.
Today we’re sharing some of our learnings from building ML products at scale. These products are built by the Global Teams and used by many of our marketplaces as multi-tenant solutions.
Today, these capabilities are key in developing our ML products:
- a Kubernetes cluster (see the article) that we use to run our prediction services
- an in-house experimentation platform called Houston
- Unicron (Common Runtime Environment) which allows distributed execution of training jobs
- a privacy broker to manage privacy take-out and deletion requests from users
In this article, we focus on the experience of two teams: Personalisation, which builds Recommender Systems (RS) serving 800 million recommendations per month, and Cognition, which specialises in understanding images, processing 220 million images per month.
Data collection and quality
As a Global Team, collecting high-quality data is a critical and continuous part of our ML projects as our data requirements keep evolving. The more advanced our products become, the more data we need to offer the best product functionality.
To integrate our product with a marketplace, we first need to carry out data ingestion. Within Adevinta, we have an in-house tracker and share a common schema — a key component in ensuring we have high-quality data. Schemas are data contracts that specify, in a standardised way, how we expect data to be shared with our teams. They set expectations on data formats and also let us define a common language, independent of any specific marketplace.
Many marketplaces use other tracking systems with their own schemas, where the data needs to be transformed in order to be consumed. However, each ML product has different data needs and therefore different data quality requirements. For example, in a personalised product, the user field is critical, whereas in image recognition, it isn’t.
To automate expectations on data quality, we use an in-house data validation solution (similar to Great Expectations) performed on the consumer side. With this solution, we’re able to:
- drastically reduce our lead time to data quality issue detection
- get alerts right after a data change
- detect issues before they impact our algorithms
The data quality checks are automated, easy to define and accompanied with visual dashboards that allow teams to easily identify and report issues to marketplaces. Unfortunately, these checks don’t spot all the issues. For example, categorical value inconsistency across platforms isn’t covered, e.g. España vs Espana.
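The categorical inconsistency mentioned above is exactly the kind of issue a normalisation-based check could catch. As a minimal sketch (this is an illustration, not our in-house validation solution), accent-stripping with the standard library is enough to flag spellings that collapse to the same value:

```python
import unicodedata
from collections import defaultdict

def strip_accents(value: str) -> str:
    """Reduce a string to its base characters, e.g. 'España' -> 'Espana'."""
    return "".join(
        c for c in unicodedata.normalize("NFKD", value)
        if not unicodedata.combining(c)
    )

def find_inconsistent_categories(values):
    """Group raw categorical values that collapse to the same normalised form."""
    groups = defaultdict(set)
    for v in values:
        groups[strip_accents(v).lower()].add(v)
    # Keep only groups where more than one raw spelling was observed
    return {k: sorted(vs) for k, vs in groups.items() if len(vs) > 1}

countries = ["España", "Espana", "France", "Deutschland"]
print(find_inconsistent_categories(countries))
# {'espana': ['Espana', 'España']}
```

A check like this could run alongside the schema-level validations, flagging candidate duplicates for a human to review rather than merging them automatically.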
Overall, data collection remains painful, time-consuming and hard to fully automate. It’s an ongoing process that we continuously aim to improve.
Compliance to privacy regulations
Since the introduction of the privacy regulations by the European Union, followed by several other regions across the world, Adevinta has been following the principle of Privacy by Design, which means that all our systems need to be privacy compliant when they are designed. While the new privacy regulations are beneficial for users, they introduce additional complexity to teams developing ML solutions.
Our Global data platform offers out-of-the-box privacy features to all the data passing through the Adevinta data lake. Our privacy broker allows teams to easily activate GDPR compliance in a short amount of time using a web user interface. Teams are still responsible for making sure all local regulations are respected, which sometimes leads to designing and implementing extra processes (e.g. managing consent to comply with local regulations).
At Adevinta, marketplaces (data controllers) are responsible for users’ data. They share this data through explicit data sharing agreements with the Global Teams (data processors) that build functionality for multiple tenants. As data processors, we cannot change the purpose of the data processing unilaterally without directions from data controllers.
GDPR regulation also prevents different sites from exchanging personal data as well as building ML models without an appropriate legal basis. When creating ads, users upload data (e.g. title, images…), which is then used by the marketplace to improve their products. The problem is, users usually don’t give their consent to other marketplaces and we, as data processors, can’t change this agreement, which means any solution built based on this data should only be delivered to the marketplace from which the data comes. As a result, when we build ML models, we need to train models with the specific data for each marketplace in each country.
There are several solutions to this problem:
- Building a pipeline that automates our model building process, since we need to produce a different model per marketplace and use case.
- Using synthetic datasets to train our Deep Learning models. For example, instead of using real images of cars to locate a car plate, or even read it, we believe that we can generate these images from 3D models of cars.
- Using open-source datasets to build models that are available online, bearing in mind that not all datasets are allowed to be used commercially.
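The first solution — one model per marketplace and use case, with no data mixed across data controllers — amounts to fanning a single pipeline out over a configuration matrix. A minimal sketch (marketplace names and dataset paths are hypothetical):

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class TrainingJob:
    marketplace: str   # the data controller the model is trained for
    use_case: str      # e.g. recommendations, image tagging
    dataset_path: str  # training data scoped to that marketplace only

def plan_jobs(marketplaces, use_cases):
    """One isolated training job per (marketplace, use case) pair,
    so no model ever mixes data across data controllers."""
    return [
        TrainingJob(m, u, f"s3://datalake/{m}/{u}/train")
        for m, u in product(marketplaces, use_cases)
    ]

jobs = plan_jobs(["marketplace-a", "marketplace-b"], ["recsys", "image-tagging"])
print(len(jobs))  # 4
```

The point of the sketch is the isolation guarantee: each job reads only the dataset of one marketplace, so the resulting model can legally be served back only to that marketplace.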
Complexity of the field
With hundreds of research papers published every month, computer vision and Recommender Systems are two of the fastest evolving fields. Facebook, Google and plenty of other big industry players constantly produce new Deep Learning-based approaches that we need to understand, analyse and adapt to our marketplaces' challenges.
In the context of computer vision, the variety of specialisations makes the task even bigger. Classification, segmentation and object detection are all subjects that would bring value to our marketplaces, with the ability to extract relevant information from images. There is an endless amount of information that can be extracted from images, and the approaches and techniques behind them are endless too (e.g. CNNs, transformers, self-supervised learning, semi-supervised learning etc.).
Similarly, the ML ecosystem is still developing and finding the right tool can be challenging. Engineers need to constantly adapt, test, fail and frequently replace infrastructure. A few years ago, Spark was running on Hadoop clusters and Luigi was the go-to orchestrator. Now Spark runs on Kubernetes and Argo is much simpler to set up for most pipelines. Meanwhile, dozens of other orchestration and ML lifecycle tools (e.g. MLflow, Kubeflow, …) have been launched by different companies. This concrete example can be extended to all other aspects of ML products, as described here.
Gap between offline and online algorithmic evaluations
Predicting the business value that an algorithmic improvement will have can be challenging. Often, significant algorithmic improvements based on offline accuracy metrics don't turn into higher business value according to online A/B test results. This is mostly because of the existing gap between the two evaluation settings.
Offline accuracy metrics are calculated based on how well the algorithms predict the user-item interactions in a predefined test dataset. This has the advantage of being a fast evaluation method, making it especially useful for model selection and tuning, where we need to compare thousands of algorithm configurations. But this method comes with drawbacks, such as biases in the historical interaction data and the difficulty of ensuring reproducibility across experiments.
On the other hand, in A/B testing, business metrics are calculated based on how the algorithms perform in the online setting and are therefore closer to predicting the actual business impact. Here the main limitation is the experiment duration, as it takes weeks to complete each experiment and analyse the results.
In order to bridge the gap between the offline and online metrics, the Personalisation team is working on two different areas:
- Improving the offline setting and extending the set of metrics to go beyond simple accuracy
- Exploring precise user engagement metrics that are more suitable for measuring the true business value that the algorithms bring, which ideally should be further correlated with the offline metrics
In the future, we plan to explore new methods to get unbiased metric estimates and use Reinforcement Learning techniques to simulate A/B testing in an offline manner.
Learnings from building Recommender Systems
Recommender Systems are one of the most successful applications of ML in the industry. They are an effective way to increase user engagement and potentially the revenues of online companies.
One problem specific to this domain is the cold-start problem, which happens when the recommender is not able to recommend an item because there are too few user interactions. In the online classifieds domain, this is a very common occurrence. Classified ads are highly volatile as the ad is only available until it’s sold and therefore can have a short shelf life with only a few interactions compared to other domains. It’s also a highly dynamic platform, with hundreds of new ads published every hour. Solving this problem requires sophisticated hybrid algorithms that use content data when interaction data is limited. This requires high expertise in the Recommender System field as most of the existing state-of-the-art recommender algorithms need to be adapted or extended so that they can perform well under cold-start conditions.
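One common way to make an algorithm hybrid, as described above, is to blend a collaborative score with a content-based one, shifting the weight towards content when an ad has few interactions. This is a simplified sketch of the general idea, not our production algorithm; the threshold and the linear blend are illustrative assumptions:

```python
def hybrid_score(collab_score, content_score, n_interactions, threshold=20):
    """Blend collaborative and content-based scores: the fewer interactions
    an ad has, the more weight its content similarity gets (cold start)."""
    w = min(n_interactions / threshold, 1.0)  # trust the collaborative signal gradually
    return w * collab_score + (1 - w) * content_score

# A freshly published ad: the score relies almost entirely on content similarity
fresh = hybrid_score(collab_score=0.9, content_score=0.6, n_interactions=2)

# A mature ad: the collaborative signal dominates
mature = hybrid_score(collab_score=0.9, content_score=0.6, n_interactions=50)
```

In practice the blending can be learned rather than hand-tuned, but the shape of the solution is the same: content data fills in where interaction data has not yet accumulated.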
Finally, scaling a Recommender System to serve recommendations for millions of users and items requires a large and complex infrastructure. Complexity is even higher for Adevinta compared to other online companies as we provide a generic service that serves multiple marketplaces, each with subtle differences and use cases.
The infrastructure of a Recommender System can be divided into batch and online computation. Batch computation is especially suitable for time-consuming processes that process large amounts of data, such as data preprocessing and model training, which are commonly implemented in Spark to benefit from parallelism. Spark jobs can become memory intensive and it’s hard to optimise the resources efficiently when you have different site or product configurations.
In contrast, online computation (inference) is suitable for cold-start recommendations as we can exploit item content data at query time, allowing us to adapt recommendations to the user’s context. However, due to the response time restriction, real-time calculations need to be very lightweight and highly optimised.
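The split between the two settings can be sketched as: the batch side precomputes item vectors, and the online side does only a cheap scoring pass at query time. The embeddings and item IDs below are invented for illustration:

```python
# Precomputed at batch time: item vectors produced by the trained model
item_embeddings = {
    "ad_1": [0.1, 0.9, 0.0],
    "ad_2": [0.8, 0.1, 0.3],
    "ad_3": [0.2, 0.7, 0.1],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recommend(user_vector, k=2):
    """Online step: a single pass of dot products over precomputed vectors,
    light enough to fit inside a tight response-time budget."""
    ranked = sorted(item_embeddings,
                    key=lambda i: dot(user_vector, item_embeddings[i]),
                    reverse=True)
    return ranked[:k]

print(recommend([0.0, 1.0, 0.2]))  # ['ad_1', 'ad_3']
```

At real scale the exhaustive scan would be replaced by an approximate nearest-neighbour index, but the division of labour is the same: all the expensive work happens in batch, leaving only lightweight arithmetic for the online path.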
Learnings from building visual recognition models
Our main focus is the images of second-hand items that our users publish for sale on our marketplaces. Seeing the product in the best possible way is essential for users to decide whether to buy an item, but the image is also a gold mine of information that eases the transaction process. From it, we can extract information that speeds up the ad insertion process, allowing users to create ads that will attract buyers faster.
The most challenging aspect of serving visual recognition predictions is that we have to do it with low latency at scale for different tenants with different marketplaces. A lot of effort is needed to optimise the models, which sometimes means accuracy has to be sacrificed to improve latency. We found that quantisation and pruning are the most efficient techniques to help us achieve this.
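In its simplest magnitude-based form, pruning just zeroes out the smallest weights so the model gets sparser and cheaper to execute. This toy sketch illustrates the principle on a plain weight matrix; a real deployment would use a framework's pruning utilities and sparse kernels:

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights — a toy version of the
    pruning used to shrink models and cut inference latency."""
    flat = sorted(abs(w) for row in weights for w in row)
    cutoff = flat[int(len(flat) * sparsity)]  # magnitude below which weights are dropped
    return [[0.0 if abs(w) < cutoff else w for w in row] for row in weights]

layer = [[0.05, -0.80, 0.01],
         [0.60, -0.02, 0.30]]
pruned = magnitude_prune(layer, sparsity=0.5)  # half the weights become exact zeros
```

Quantisation follows the same trade-off from the other direction: instead of removing weights, it stores and computes them at lower precision (e.g. 8-bit integers instead of 32-bit floats), again exchanging a little accuracy for latency.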
Generally our inference latency can't exceed 500ms. You could argue that with the latest AWS inference chipsets this shouldn't be much of a problem, but as an example, segmenting an item in a full-resolution image means predicting whether each of 12 million pixels belongs to the foreground or the background. Combined with the millions of parameters that need to be computed for a single forward pass through the network, it ends up being a highly challenging task. Then you have the wonderful journey of optimisation: vectorisation, operator fusion, pruning, distributed computation and memory footprint optimisation.
At Adevinta, we regularly discuss how we can improve the agility to build data products. We’re going to be working on this across the organisation to improve the data processes which should help to solve some of the challenges discussed.
With data governance, there are two areas that need improvement: treating datasets as products, and treating data producers as owners, so that domain experts are able to produce the highest quality data. Improving these areas will better our products whilst empowering our teams.
In the area of platformisation, we'd like to continue improving our home-grown ML platform to help teams build and deploy ML models faster, reduce common pains and facilitate the sharing of best practices and models.