Machine Learning in Real Time at adidas

José Luis Alcalá Ramos · Published in adidoescode · Nov 24, 2023 · 6 min read

Finally, after a year of work, we can proudly say: “We started to ride the wave of real-time”.

Photo by Gian Luca Pilia on Unsplash

In this post, we are going to “take you by the hand” and show you what we did to move our batch process into a real-time endpoint in AWS SageMaker.

Use case details

In order to minimize user returns as much as possible, we came up with the idea of developing a Machine Learning system that helps our buyers select the appropriate size for apparel or footwear products.

The process may seem simple: just create a dataset with purchases and product features, and train a machine learning model. The issue arises when you have a large number of customers, a wide array of products, and multiple markets: combined, these variables make the number of predictions to generate enormous.

We found ourselves generating hundreds of millions of predictions weekly, many of which were never used. Not only was this approach inefficient, it also consumed a significant amount of time, resources, and money.

Given all of these facts, we made the decision to move the existing batch inference to a real-time endpoint.

Now it’s time to look into the architecture of the solution.

Architecture

In order to create a fully-fledged enterprise endpoint, we need the following components: a custom domain, some security measures, an API gateway to aggregate the different endpoints under the same domain, one endpoint per model, and an online feature store for low-latency retrieval of features.

The following diagram covers the previously mentioned requirements:

Real time endpoint architecture
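To make the request path more tangible, here is a minimal sketch of the last hop of such a call once it has passed the custom domain and the API gateway: invoking a SageMaker real-time endpoint with boto3. The endpoint name and payload fields are illustrative, not our actual contract.

```python
import json
import boto3

# Hypothetical endpoint name and payload; in our setup this call sits behind
# the API gateway shown in the diagram above.
runtime = boto3.client("sagemaker-runtime")

payload = {"customer_id": "123", "article_id": "ABC-001", "market": "DE"}

response = runtime.invoke_endpoint(
    EndpointName="size-recommendation-de",  # assumed endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

prediction = json.loads(response["Body"].read())
print(prediction)
```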

That was the easy part: just depicting the components we need in the system to enable the real-time capability. Now, let's take a look at how the system needs to evolve to support capabilities such as a feature store, experimentation, and shadow deployments.

From an architectural standpoint, this is what the pipelines look like:

CI/CD process for Machine Learning projects in SageMaker

Let’s dive deep into the pipelines.

Training pipeline

Our system is primarily based on SageMaker Pipelines. The main goal of these pipelines is to gather all the data from our data Lakehouse, process it and train the model. We divide the whole process into two different pipelines. The first one is responsible for all the preprocessing of the data, while the second one is responsible for the training.

SageMaker preprocessing pipeline

The preprocessing pipeline has various parameters, but the main one defines the market of the model we will be creating. The pipeline is divided into two main steps: the first, based on Spark, gathers all the data from the Lakehouse, and the second filters that data according to our internal business rules. Once those two steps are finished, we trigger the next pipeline using a Lambda step, passing the training pipeline all the data produced here together with the parameters received.
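As a rough illustration (not our exact code), this is roughly what such a two-step preprocessing pipeline with a final Lambda trigger could look like with the SageMaker Pipelines SDK; script names, ARNs, and instance types are placeholders.

```python
from sagemaker.lambda_helper import Lambda
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.spark.processing import PySparkProcessor
from sagemaker.workflow.lambda_step import LambdaStep
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

session = PipelineSession()
role = "arn:aws:iam::123456789012:role/sagemaker-execution"  # placeholder

# Main pipeline parameter: the market we are building the model for.
market = ParameterString(name="market", default_value="DE")

# Step 1: Spark job that gathers the raw data from the Lakehouse.
spark = PySparkProcessor(
    base_job_name="lakehouse-extract",
    framework_version="3.1",
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
extract = ProcessingStep(
    name="ExtractFromLakehouse",
    step_args=spark.run(submit_app="extract.py", arguments=["--market", market]),
)

# Step 2: filter the extracted data according to internal business rules.
sklearn = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
apply_rules = ProcessingStep(
    name="ApplyBusinessRules",
    step_args=sklearn.run(code="filter.py"),
    depends_on=[extract],
)

# Step 3: Lambda step that starts the training pipeline, forwarding the
# prepared data location and the parameters this pipeline received.
trigger_training = LambdaStep(
    name="TriggerTrainingPipeline",
    lambda_func=Lambda(
        function_arn="arn:aws:lambda:eu-west-1:123456789012:function:start-training"
    ),
    inputs={"market": market},
    depends_on=[apply_rules],
)

pipeline = Pipeline(
    name="preprocessing-pipeline",
    parameters=[market],
    steps=[extract, apply_rules, trigger_training],
    sagemaker_session=session,
)
```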

SageMaker training pipeline

This second pipeline executes the training process. Once the model is trained, we register it in the SageMaker model registry so we can keep track of all the model versions we generate. Because we are continuously improving our code and releasing new versions of the project, we also tag our code repository with the model version returned by the registry; this way, we always know which code version produced each model version. After registering the model and tagging the repository, we evaluate the model to check whether the newly created version meets the minimum standards defined by our data science team. If it does not, the model is rejected and the process stops here, leaving the endpoint as it was. Even though recently added articles will not receive recommendations in that case, we guarantee that the articles that have been performing well so far will continue to do so. If the model meets the requirements, the process continues and the Jenkins deployment pipeline is executed.
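A standalone sketch of the register-and-tag idea, using plain boto3 rather than the pipeline steps that actually perform it in our project; the model package group, container image, and artifact path are placeholders.

```python
import subprocess
import boto3

sm = boto3.client("sagemaker")

# Register the freshly trained model in the model registry.
package = sm.create_model_package(
    ModelPackageGroupName="size-recommendation-de",
    InferenceSpecification={
        "Containers": [{
            "Image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/size-model:latest",
            "ModelDataUrl": "s3://ml-artifacts/size-model/model.tar.gz",
        }],
        "SupportedContentTypes": ["application/json"],
        "SupportedResponseMIMETypes": ["application/json"],
    },
    ModelApprovalStatus="PendingManualApproval",
)

# The versioned package ARN ends with the version number assigned by the
# registry; we reuse it to tag the code repository, so every model version
# maps back to the exact code that produced it.
model_version = package["ModelPackageArn"].split("/")[-1]
subprocess.run(["git", "tag", f"model-v{model_version}"], check=True)
subprocess.run(["git", "push", "origin", f"model-v{model_version}"], check=True)
```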

This whole process runs every week with a fire-and-forget approach: we have scheduled one event per market, so every week these pipelines are executed for each of the markets we operate in. Given the constant daily flow of new customers and articles, this recurring process is essential to keep our models up to date. As a result, we can offer our customers the best possible recommendations.
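One way to wire up this weekly, per-market trigger is a scheduled EventBridge rule per market that starts the preprocessing pipeline with the market as a parameter. The market codes, ARNs, and cron expression below are assumptions made for illustration.

```python
import boto3

events = boto3.client("events")

MARKETS = ["DE", "US", "JP"]  # illustrative market list
PIPELINE_ARN = "arn:aws:sagemaker:eu-west-1:123456789012:pipeline/preprocessing-pipeline"
ROLE_ARN = "arn:aws:iam::123456789012:role/events-to-sagemaker"

for market in MARKETS:
    rule_name = f"weekly-preprocessing-{market.lower()}"

    # One cron rule per market, firing once a week.
    events.put_rule(
        Name=rule_name,
        ScheduleExpression="cron(0 3 ? * MON *)",
        State="ENABLED",
    )

    # EventBridge can start a SageMaker pipeline directly as a target,
    # passing the market as a pipeline parameter.
    events.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": f"preprocessing-{market.lower()}",
            "Arn": PIPELINE_ARN,
            "RoleArn": ROLE_ARN,
            "SageMakerPipelineParameters": {
                "PipelineParameterList": [
                    {"Name": "market", "Value": market},
                ],
            },
        }],
    )
```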

Deployment pipeline

SageMaker deployment pipeline

Once the model is created and registered in the model registry, the next main step is to deploy it to the SageMaker endpoint. But before deploying the new model, we need to update the feature store with all the new customer and article data. The features for each registered user and for each article available online are calculated in advance and stored in the feature store. This way, anytime the endpoint receives a request, it gets the data from the feature store before asking the model for a prediction. The preprocessing and storage of this data is done in a SageMaker pipeline, which is responsible for generating the features and filling the online feature store with them. Now we are ready to deploy (or update) the endpoint.
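At request time, the lookup against the online feature store can be as simple as the sketch below; the feature group names and the flattening logic are assumptions made for illustration.

```python
import boto3

# Online feature store client; the feature group names are illustrative and
# are filled weekly by the feature-engineering pipeline described above.
fs_runtime = boto3.client("sagemaker-featurestore-runtime")

def fetch_features(customer_id: str, article_id: str) -> dict:
    """Low-latency lookup of precomputed features before calling the model."""
    customer = fs_runtime.get_record(
        FeatureGroupName="customer-features",
        RecordIdentifierValueAsString=customer_id,
    )
    article = fs_runtime.get_record(
        FeatureGroupName="article-features",
        RecordIdentifierValueAsString=article_id,
    )

    # Flatten both records into a single feature dictionary for the model.
    features = {}
    for record in (customer.get("Record", []), article.get("Record", [])):
        for feature in record:
            features[feature["FeatureName"]] = feature["ValueAsString"]
    return features
```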

At this point, the feature store is ready and updated with all the data, and we have our newest model registered in the model registry. As with the rest of our infrastructure, we use Terraform and Terragrunt to deploy the endpoint. Knowing the version of the model created in the previous SageMaker pipeline, we can deploy that model to the endpoint. However, we always deploy the newest model as a shadow endpoint: for a given period, part of the incoming traffic is also directed to the new model. When that period ends, we compare the results from the endpoint that has been running since last week with those from the shadow endpoint running the new model. If the shadow endpoint performs better than the running one and no errors were found, we promote the shadow model and delete the previous one.
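In practice we drive this through Terraform, but the underlying SageMaker feature is the shadow variant on the endpoint configuration. Below is a minimal boto3 sketch of the idea, with illustrative model and endpoint names.

```python
import boto3

sm = boto3.client("sagemaker")

# New endpoint configuration: last week's model keeps serving live traffic,
# while the newly registered model runs as a shadow variant.
sm.create_endpoint_config(
    EndpointConfigName="size-recommendation-de-v42",
    ProductionVariants=[{
        "VariantName": "live",
        "ModelName": "size-model-v41",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 2,
        "InitialVariantWeight": 1.0,
    }],
    ShadowProductionVariants=[{
        "VariantName": "shadow",
        "ModelName": "size-model-v42",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 2,
        # Weight relative to the live variant: here, roughly half of the live
        # requests are mirrored to the shadow model.
        "InitialVariantWeight": 0.5,
    }],
)

# Point the existing endpoint at the new configuration. Responses still come
# from the live variant; the shadow variant's predictions are only logged so
# they can be compared later.
sm.update_endpoint(
    EndpointName="size-recommendation-de",
    EndpointConfigName="size-recommendation-de-v42",
)
```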

Conclusions

As you may guess, the journey wasn't easy. One of the parts we struggled with the most was the IaC: we usually create a full environment per branch, and the combination of Terraform and SageMaker pipelines was tricky. We also had to think carefully about the point at which it makes sense to create the endpoint, since that decision has a lot of implications.

The journey has been amazing, and it wouldn't have been possible without Nerea Ayestarán (who also wrote part of this article), David Fustero, Paul Zieringer, and David Castellanos.

The views, thoughts, and opinions expressed in the text belong solely to the author, and do not represent the opinion, strategy or goals of the author’s employer, organization, committee or any other group or individual.
