Powering Glovo’s Machine Learning with Real-Time Data, part I: introduction.
This article is part of a series co-written with Pablo Barbero, Meghna, and Enrique Fernandez about Machine Learning feature engineering in production environments. You can find the articles focusing on the Data Science and Engineering aspects here and here.
In Glovo’s Routing team, we are tasked with developing Machine Learning (ML) models that estimate the duration of every significant stage of a Glovo order. For instance, we have models for predicting the preparation time of a restaurant, the travel time of a courier, and the total delivery time, among others. These predictions serve different use cases. They improve the efficiency of our operations: we use our ML models to choose which courier and store deliver an order. They also improve the experience of our partners and customers: when do we notify a partner about a new order? What ETAs do we show our customers and partners?
At Routing, we use ML to optimize our operations and to provide accurate estimations to partners and customers.
Something that’s worked really well for our team has been having a mixture of software engineers (SWE), data analysts (DA), and data scientists (DS). As the models we work on are critical for our business, they need to be deployed with very low latency and high availability, scaling to millions of predictions every minute. Having a multidisciplinary team has been crucial to achieving that.
In this article, we provide an overview of a real-time aggregation component that was jointly implemented by our software engineers and data scientists. This is the first article of a series: in upcoming posts, we will go more in-depth about its DS and SWE aspects.
Modeling our partners’ and couriers’ behaviors.
The first models we developed relied significantly on features that capture historical properties. For instance, we calculate the mean preparation time of a restaurant in the last few weeks, the average speed of a courier, how many orders each partner processed last month, etc. These aggregations provide valuable information to our ML models but fail to capture what’s actually happening in the present. For instance, one of our partners could usually take 8 minutes to prepare an order. But if something changes on a given day (say FC Barcelona is playing Real Madrid and the restaurant is full of customers and Glovo orders), then we can expect higher preparation times.
Orders can quickly accumulate during El Clasico
To complement the patterns captured by these historical aggregations, we started using real-time aggregations. As an example, we look at the average preparation time of a restaurant in the last few minutes, the number of orders that are waiting to be picked up, etc. These features turned out to be very important for capturing changes in our operations.
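To make the idea of a real-time aggregation concrete, here is a minimal Python sketch of a sliding-window mean, such as the average preparation time of a restaurant over the last few minutes. It is an illustration only, not the implementation described in this series: the class name `SlidingWindowMean`, the 10-minute window, and the assumption that samples arrive in timestamp order are all hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class SlidingWindowMean:
    """Mean of the values observed in the last `window_seconds` seconds.

    Hypothetical helper for illustration; assumes samples are added
    in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._samples: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> None:
        """Record one observation, e.g. a preparation time in minutes."""
        self._samples.append((timestamp, value))
        self._total += value

    def mean(self, now: float) -> Optional[float]:
        """Return the mean over the window ending at `now`, or None if empty."""
        cutoff = now - self.window_seconds
        # Evict samples that are older than the window.
        while self._samples and self._samples[0][0] < cutoff:
            _, old_value = self._samples.popleft()
            self._total -= old_value
        if not self._samples:
            return None
        return self._total / len(self._samples)


# Usage: timestamps in seconds, preparation times in minutes.
window = SlidingWindowMean(window_seconds=600)
window.add(0, 8.0)
window.add(300, 12.0)
window.add(540, 16.0)
print(window.mean(now=600))  # all three samples are in the window: 12.0
print(window.mean(now=900))  # the sample at t=0 has expired: 14.0
```

Keeping a running total means each query only pays for the samples it evicts, which matters when the same feature is read far more often than it is updated.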
Coming up with a scalable, robust, and accurate solution for calculating these real-time aggregations was a challenge. When training the ML models we can rely on our DB to compute them; however, we also need to efficiently calculate them at inference time and make sure these values match the ones we calculated for training the models.
The first solution we implemented relied on SQL queries for calculating these aggregations in near real-time and caching them with a short time to live (10 minutes). Whenever we required a feature, we would check if it had been calculated in the last 10 minutes. If so, we would use that value; otherwise, we would execute a SQL query to calculate it and then cache it. Note that this means a feature value could be up to 10 minutes old by the time it was used.
A view of our previous real-time feature engine using SQL to calculate the aggregations.
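The cache-or-query flow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the original code: `TTLFeatureCache` is a hypothetical name, the `compute` callable stands in for the SQL query, and an in-process dictionary stands in for whatever cache was actually used.

```python
import time
from typing import Callable, Dict, Hashable, Tuple

TTL_SECONDS = 600.0  # 10 minutes, as described in the text


class TTLFeatureCache:
    """Return a cached feature if it is fresh enough, otherwise recompute it."""

    def __init__(
        self,
        compute: Callable[[Hashable], float],  # stand-in for the SQL query
        ttl: float = TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute
        self._ttl = ttl
        self._clock = clock
        # key -> (time the value was computed, value)
        self._entries: Dict[Hashable, Tuple[float, float]] = {}

    def get(self, key: Hashable) -> float:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            # Fresh enough: the value may be up to `ttl` seconds old.
            return entry[1]
        # Missing or expired: run the expensive computation and cache it.
        value = self._compute(key)
        self._entries[key] = (now, value)
        return value


# Usage with a dummy computation in place of a real query.
cache = TTLFeatureCache(compute=lambda restaurant_id: 8.0)
print(cache.get("restaurant-1"))  # computed on the first call
print(cache.get("restaurant-1"))  # served from the cache on the second call
```

The sketch makes the staleness trade-off visible: a longer `ttl` means fewer queries against the DB, but older feature values at inference time.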
This solution was the fastest way to get started, but it had several drawbacks: 1) to avoid contention on our DB we could only calculate these features every 10 minutes, so they were not as fresh and accurate as we wanted; 2) SQL works well for simple aggregations, but implementing and maintaining more complex transformations can be painful; and 3) we were duplicating logic in different programming languages, so the features computed for training and inference sometimes didn’t match.
Our first solution was great as a first iteration, but soon we switched to a more scalable design. We ditched our SQL aggregations in favor of an event-driven architecture:
- Every time a relevant business event happens we publish a data-rich event on AWS Kinesis. For instance, we can have OrderCreated and OrderPicked events with the relevant fields. For the OrderCreated event, we might be interested in publishing its id, products, creation_time, etc. For the OrderPicked event, we would be interested in knowing who the courier was and the time the order was picked up.
- These events can be consumed by a number of AWS Lambda functions that calculate the features. The OrderCreated and OrderPicked events we mentioned before could be useful for calculating the number of orders that have still not been picked up in a restaurant: every time one of them happens, we can recalculate the feature for the relevant restaurant.
- The results of these features are then stored in a Redis cache and retrieved when making requests to our ML models.
Our event-driven architecture: we use AWS Kinesis and AWS Lambda functions to process events that carry the data needed for the transformations. A Redis store holds any intermediate data and the resulting feature values.
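As a rough illustration of the steps above, the following Python sketch updates the "orders waiting to be picked up" feature each time an event arrives. It makes several assumptions that are not in the original design: the event is a plain dictionary with hypothetical `type` and `restaurant_id` fields (only `id` is mentioned in the text), and an in-memory `InMemoryStore` stands in for Redis so that the example runs without any AWS or Redis dependency.

```python
from collections import defaultdict
from typing import Dict, Set


class InMemoryStore:
    """Stand-in for the Redis store: ids of pending orders per restaurant."""

    def __init__(self) -> None:
        self.pending: Dict[str, Set[str]] = defaultdict(set)


def handle_event(event: dict, store: InMemoryStore) -> int:
    """Update the pending-orders feature for one restaurant and return it.

    `type` and `restaurant_id` are hypothetical field names. Using a set
    keeps the update idempotent if the same event is delivered twice.
    Limitation: an OrderPicked that arrives before its OrderCreated is
    ignored, so that order would stay pending in this sketch.
    """
    pending = store.pending[event["restaurant_id"]]
    if event["type"] == "OrderCreated":
        pending.add(event["id"])
    elif event["type"] == "OrderPicked":
        pending.discard(event["id"])
    return len(pending)


# Usage: replay a small stream of events for one restaurant.
store = InMemoryStore()
events = [
    {"type": "OrderCreated", "id": "o1", "restaurant_id": "r1"},
    {"type": "OrderCreated", "id": "o2", "restaurant_id": "r1"},
    {"type": "OrderPicked", "id": "o1", "restaurant_id": "r1"},
]
for event in events:
    feature_value = handle_event(event, store)
print(feature_value)  # one order (o2) is still waiting: 1
```

Because the feature is recomputed on every event rather than on a timer, the stored value reflects the latest known state of the restaurant when a prediction request reads it.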
The new system greatly improved the way we work with real-time information. Not only did it improve our ML model KPIs by providing fresher and more accurate features, but it also enables our DS to quickly come up with new transformations and modify the pre-existing ones autonomously. The loss of our ML models decreased by up to 20% when using these features. As we deprecate the old system, we also expect our DB load to decrease significantly.
This is only a very brief overview of the new system. There are many details you might be wondering about:
- If you are a Data Scientist, you might be interested in knowing: how do we avoid duplicating the code for training and inference? How do we ensure that the calculated features are the same in both environments? What types of aggregations have been the most useful? What has the impact been on our KPIs? We will answer these questions in the DS article.
- If you are a Software Engineer, you might be wondering how we designed a system that allows DS to independently create and maintain real-time features, and how we leveraged AWS Lambda, Redis, and Step Functions to allow our system to scale to thousands of requests.
We will cover these aspects in more detail in two future articles focusing on the Data Science and Engineering aspects of our solution.