Powering Glovo’s Machine Learning with Real-Time Data. Part 2: Real-Time Data Aggregators
This article is part of a series co-written with Francisco Rodriguez, Meghna and Enrique Fernandez about Machine Learning feature engineering in production environments. You may find the other articles here and here.
In this post, we focus on the problem of building real-time features for our online estimators in a scalable, accurate, and robust way. We will tell you how we ensure that our training and inference pipelines are equivalent, what types of aggregations have given the best results and how this has improved our models.
Glovo is a food and grocery delivery three-sided marketplace that connects consumers, stores, and riders.
As you may know (if you read the first article of our series), Machine Learning estimators are a key component of Glovo’s real-time operations for three reasons:
- They are key to optimizing the efficiency of our marketplace, by feeding our smart logistics engine with accurate time estimates. Increasing efficiency has a major impact on courier experience, as it directly determines their earnings.
- They help us provide an outstanding customer experience, allowing us to show accurate ETAs (expected time of arrival) in the customer app and decreasing order delivery times by reducing unnecessary waiting times.
- They also help us provide an excellent partner experience, by showing accurate ETAs of our couriers in the partner applications. Note that we are using the terms partner and store interchangeably throughout this article.
We, data scientists, are responsible for the full life cycle of these estimators: researching, developing, releasing to production, and monitoring them. These are two of the most important estimators we use:
- Estimated Time to Prepare (eTP), which predicts the time needed for an order to be ready for pickup, counted from the moment the order is dispatched to the store. This estimation is used to select the appropriate courier so that the courier’s waiting time at the store and the probability of the order being ready before the courier arrives are both minimal.
- Wall ETA, which sets the customer’s expectations on the time range in which an order from a store is expected to arrive (see Fig 1). An accurate estimation here is crucial for an outstanding customer experience.
We currently use tree-based models for these two estimators, each of them fed with dozens of features. Among the most important are the real-time aggregated features, which extract information from the most recent orders. These features enable our Machine Learning models to better capture the dynamics of the city in real-time and the sudden changes that may occur in the operations of the stores. For example, the feature number of orders in the store queue may capture an unexpected bottleneck in a store’s operations, which leads to higher preparation times.
In production, the engineers on our team opted for a solution that ran queries on the client side (our Java codebase) every 10 minutes against the production database and cached the results to pass to the model at each call. This 10-minute window is somewhat arbitrary, but it seemed a good initial compromise between not overloading the database and having near real-time features.
To mimic at training time how features are computed in production, we built a set of custom transformations in our Python training repository that emulate this behavior: features are computed at 10-minute intervals, i.e. after a feature is computed, incoming orders do not refresh its value until the 10-minute window is over.
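A pandas sketch of this emulation might look as follows. Column names (`created_at`, `dispatched_at`, `accepted_at`) and the queue-size feature are illustrative assumptions, not Glovo's actual schema:

```python
import pandas as pd

def cached_feature_values(orders: pd.DataFrame, window_minutes: int = 10) -> pd.DataFrame:
    """Emulate the production cache: the feature is recomputed only at fixed
    10-minute boundaries, so every order inside a window sees the value
    computed at the window start (hypothetical column names)."""
    orders = orders.sort_values("created_at").copy()
    # Snap each order's timestamp down to the start of its 10-minute window
    orders["window_start"] = orders["created_at"].dt.floor(f"{window_minutes}min")

    # Feature as of a window start: earlier orders dispatched but not yet accepted
    def queue_size_at(ts):
        return ((orders["dispatched_at"] < ts) & (orders["accepted_at"] >= ts)).sum()

    # One computation per window, shared by all orders in that window
    cache = {ts: queue_size_at(ts) for ts in orders["window_start"].drop_duplicates()}
    orders["orders_in_queue"] = orders["window_start"].map(cache)
    return orders
```

All orders falling in the same 10-minute window receive the same cached value, mirroring the staleness of the original production setup.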
This first solution showed the impact that real-time aggregated features could provide. However, it had some significant drawbacks:
- Building new features was very time-consuming and required excessive coordination between engineers and data scientists, which raised scalability concerns.
- Putting an excessive load on the database increased the latency of other queries running at the same time.
- It required implementing each feature twice (in Python for training time and in Java for inference time), which proved error-prone and very expensive to maintain.
- It wasn’t possible to compute complex features in Java (e.g. Fourier transforms and others provided by the tsfresh package), since Java has no data-transformation packages equivalent to Python’s.
- It lagged 10 minutes behind real-time events, which proved important on certain occasions, for instance when there is a sudden spike of demand at a store.
Overcoming the Challenge
The Routing engineers addressed these challenges with an event-driven service: a dedicated service that computes, in real time, the aggregated features the Machine Learning models consume in production. This required an extra investment of the Routing engineers’ time, but it gives us a lot of flexibility and independence to iterate fast on new features, and lets us use our favorite language for ML: Python.
Features computed in production are currently not stored anywhere, for two reasons. On the one hand, if the production environment has a problem or a latency spike, the error or feature inaccuracy would propagate to the training pipeline. On the other hand, even if features were stored, experimenting with new features would be very slow, as it would require either deploying the transformations to production and waiting until we have enough data, or building ad hoc training transformations for each new feature. Therefore, feature transformations currently live in two independent repositories: one for the training pipeline and another for the inference pipeline.
To reduce this duplicated boilerplate code, which eventually leads to inefficiencies and bugs, we developed a package of transformers, called aggregators, which produce the same feature values at any given time in both the training and the inference pipelines.
Beyond reducing boilerplate, this package provides some additional advantages:
- A controlled way to improve our aggregators or change their interfaces to cover new use cases without facing incompatibility issues, following the semver convention. For example, a new model may require a redesign of our aggregators and their interfaces that breaks backward compatibility; the new and old models can then use different versions of the package until the old models are adapted to the new aggregators’ interface.
- Testing and locking the dependencies, using poetry, so that we ensure the aggregators are robust to the potentially diverging package requirements of different models.
- Running tests automatically at each code change or new version release, using GitHub Actions and pytest.
- Encouraging the adoption of software engineering best practices, such as test-driven development, which pays off in production-ready code.
- Creating a feature catalog and exposing the catalog and the aggregators to the whole Glovo organization, which we expect will foster collaboration between different Data Science teams.
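For instance, a model repository could pin a compatible version of the aggregators package with a poetry caret constraint (package name and version numbers here are hypothetical):

```toml
# pyproject.toml of a model repository (hypothetical package and versions).
# A caret constraint accepts compatible releases only: any 1.x >= 1.4,
# but never 2.0, which under semver may break the aggregators' interface.
[tool.poetry.dependencies]
python = "^3.9"
ml-aggregators = "^1.4"
```

A model that needs the redesigned interface simply bumps its own constraint to `"^2.0"` when it is ready to migrate.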
Packaging the Aggregators
Aggregators abstract the summarization of a sequence of events belonging to the same aggregation dimension into a single value, using a given aggregation function.
An example might be illustrative. Assume we want to develop a feature that counts (aggregation function) the number of orders placed at a store (aggregation dimension) that have already been dispatched to the store, i.e. sent to the store’s app (condition on event 1), but not yet accepted by the store (condition on event 2). Let’s call this feature number of orders in the store queue, and assume we want its value when order N is activated. Observing the image, we should count from order 0 to order i-1, i.e. the feature value should be i. Note that we don’t include order i, because it had not yet been dispatched to the partner at sample time, nor order -1, because it had already been accepted. All this contextual information (aggregation function, dimension, and events), which we call the feature specs, is provided to the Aggregator constructor.
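The queue-count feature above can be sketched as a plain function over (dispatch, acceptance) timestamp pairs. This is an illustrative sketch, not Glovo's implementation:

```python
def orders_in_store_queue(orders, sample_time):
    """Count orders already dispatched to the store but not yet accepted at
    `sample_time`. Each order is a (dispatched_at, accepted_at) pair, where
    accepted_at is None while the store has not accepted the order.
    (Illustrative sketch, not the actual production code.)"""
    return sum(
        1
        for dispatched_at, accepted_at in orders
        if dispatched_at is not None
        and dispatched_at <= sample_time          # event 1 has happened
        and (accepted_at is None or accepted_at > sample_time)  # event 2 has not
    )
```

Evaluated at the activation time of order N, this counts exactly the orders between the two events, as in the figure.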
The Aggregators are a high-level, standardized building block for turning sequential raw data into meaningful features that feed Machine Learning estimators. Each aggregator exposes an interface whose constructor arguments are the feature specs. As you can imagine, the feature specs are enough to fully and uniquely specify a feature and how to compute it in training and inference. Therefore, they greatly reduce the complexity of experimenting with and productionizing any new feature that matches one of the aggregator architectures we have built.
All Aggregators perform aggregations in two environments:
- Training: for usage in the training pipeline of an estimator (a value is generated for each sample in the input dataset based on a sample time)
- Inference: for usage in the inference pipeline by an estimator in production (a single value is generated based on the timestamp of the inference request)
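A minimal sketch of what such a dual-environment aggregator might look like, assuming pandas DataFrames and illustrative class, method, and column names (not the actual package interface):

```python
import pandas as pd

class StatsBetweenEvents:
    """Aggregate an attribute of samples sitting between two events across a
    dimension (hypothetical re-creation of the interface described above)."""

    def __init__(self, function, variable, dimension, event_start, event_end):
        self.function = function        # e.g. "sum", "count", "mean"
        self.variable = variable        # attribute column to aggregate
        self.dimension = dimension      # e.g. "store_id"
        self.event_start = event_start  # timestamp column of event 1
        self.event_end = event_end      # timestamp column of event 2

    def _value_at(self, df, key, sample_time):
        # Samples of this dimension key where event 1 happened but event 2 did not
        mask = (
            (df[self.dimension] == key)
            & (df[self.event_start] <= sample_time)
            & (df[self.event_end].isna() | (df[self.event_end] > sample_time))
        )
        return df.loc[mask, self.variable].agg(self.function)

    def transform(self, df, mode, key=None, sample_time=None):
        if mode == "training":
            # One value per sample, computed as of that sample's own time
            return df.apply(
                lambda row: self._value_at(df, row[self.dimension], row["sample_time"]),
                axis=1,
            )
        if mode == "inference":
            # A single value as of the inference request timestamp
            return self._value_at(df, key, sample_time)
        raise ValueError(mode)
```

Both modes funnel through the same `_value_at` core, which is what makes training/inference equivalence testable.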
These two independent transformations need to produce the same values when used on equivalent datasets. Therefore, we have designed and implemented a battery of unit tests that simulate the training and inference environments and ensure the transformations give the same values. Our CI/CD pipeline executes these tests every time a PR to master is submitted and for each release, and makes sure the test coverage remains above 95%.
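A unit test in this style can be sketched as follows, using a toy last-N-mean aggregator with hypothetical column names (not the package's actual tests):

```python
import pandas as pd

def last_n_mean(df, n, sample_time):
    """Shared core: mean `speed` over the last n orders delivered strictly
    before sample_time (hypothetical column names)."""
    delivered = df[df["delivered_at"] < sample_time].sort_values("delivered_at")
    return delivered["speed"].tail(n).mean()

def transform(df, mode, n=3, sample_time=None):
    if mode == "training":
        # One value per sample, as of each row's own sample_time
        return df["sample_time"].map(lambda ts: last_n_mean(df, n, ts))
    # Inference: a single value at the request timestamp
    return last_n_mean(df, n, sample_time)

def test_training_matches_inference():
    """Each training value must equal the inference value at the same instant."""
    df = pd.DataFrame({
        "delivered_at": [1.0, 2.0, 3.0, 4.0],
        "speed": [10.0, 20.0, 30.0, 40.0],
        "sample_time": [2.5, 3.5, 4.5, 5.0],
    })
    for ts, expected in zip(df["sample_time"], transform(df, "training")):
        assert transform(df, "inference", sample_time=ts) == expected
```

pytest would pick up `test_training_matches_inference` automatically on every PR.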
On top of that, the package provides a feature catalog. It names and defines the features used in production via their feature specs. It is enough to import a feature (an instantiated aggregator) and call its transform method with either `training` or `inference` as its argument to produce the feature values. This feature catalog creates synergies in the company, since it allows the rest of the data scientists and analysts to use the features, or define new ones, and share them easily with other teams.
We have defined three types of aggregators so far, which have enabled us to productionize hundreds of features.
- StatsLastNSamples: a function aggregates an attribute of the last N samples with one completed event across a dimension. For instance, the average (function) delivery speed (variable) of a courier (dimension) in the last N delivered (event) orders.
- StatsBetweenEvents: a function aggregates an attribute of those samples that are between event e1 and event e2 across a dimension, e.g. sum (function) of order amount (variable) for all orders made to a store (dimension) dispatched (e1) but not picked (e2) by a courier.
- StatsAfterEvent: a function aggregates an attribute of those samples with a completed event (e1) in the last X minutes across a dimension, e.g. sum (function) of products (variable) in all orders to a store (dimension) activated (e1) in the last 15 minutes.
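As an illustration, the three aggregator types above could be instantiated from feature specs along these lines (the `FeatureSpec` class and its field names are our own sketch, not the package's):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class FeatureSpec:
    """Illustrative feature specs: enough information to uniquely define a
    feature in both training and inference (field names are hypothetical)."""
    aggregator: str               # which aggregator type computes the feature
    function: str                 # aggregation function, e.g. "mean", "sum"
    variable: str                 # the attribute being aggregated
    dimension: str                # aggregation dimension, e.g. "store_id"
    events: Tuple[str, ...]       # the event(s) conditioning the samples
    window: Optional[int] = None  # last N samples, or last X minutes

# The three example features from the list above, as concrete specs
courier_speed = FeatureSpec("StatsLastNSamples", "mean", "delivery_speed",
                            "courier_id", ("delivered",), window=10)
queue_amount = FeatureSpec("StatsBetweenEvents", "sum", "order_amount",
                           "store_id", ("dispatched", "picked"))
recent_products = FeatureSpec("StatsAfterEvent", "sum", "n_products",
                              "store_id", ("activated",), window=15)
```

Since each spec fully determines its feature, adding a new feature of an existing type amounts to writing one more such declaration.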
Conclusions and impact
Among our top features are some generated with the tsfresh package, which are used to model the behavior of couriers. They have reduced our error metrics by up to 20%, which has ultimately led to big improvements in the efficiency of our platform.
Time to market
Another important impact is the great reduction in the time to market of new features, which now take as little as a fifth of the time. This has been possible thanks to the flexibility provided by the aggregators. Most new aggregated features can be extracted simply by defining the right set of feature specs at instantiation time; the aggregator can then be used as a transformer to extract the features, for experimentation or for production. Before, we had to coordinate with the Routing engineers so that they could implement the features in SQL and Java in the backend, something most data scientists cannot do, as we lack backend engineering skills.
Maintenance is now less time-consuming, and changes are much less error-prone. This is thanks to having packaged the ML Aggregators with a very complete set of tests that run for every new PR to master and every package version released.
Now it is time for you to share your thoughts with us, ask questions or give us feedback so that we can learn something new. And if you found these challenges interesting, come work with us!