Bayesian modelling for cross-city learning at Glovo

Javier Mas Adell
The Glovo Tech Blog
9 min read · Apr 30, 2021

This article describes work done by multiple people from the forecasting team. In particular, Pablo Portillo Garrigues played an important role in implementing some of this work, and Kolja Kleineberg contributed through multiple discussions and to the writing of the article itself.

The problem

Glovo is a three-sided marketplace composed of couriers, customers, and partners. Balancing the interests of all sides of our platform is at the core of most strategic decisions taken at Glovo. To balance those interests optimally, we need to understand quantitatively the relationship between the main KPIs that represent the interests of each side.

Let’s take the example of balancing customer experience and courier earnings. Our customers are keen on getting their food (or anything else) as quickly as possible. To them, for the same number of orders, more couriers means quicker deliveries. On the other side, couriers are interested in earning as much as possible. To them, fewer couriers means more orders per courier, hence higher earnings. We measure this balance through the following KPIs. User experience is measured via average customer delivery time and percentage of cancellations. The efficiency of our operations, which is directly connected to the earnings per hour of couriers, is represented by the utilization rate of the fleet (UR) and the number of orders delivered per hour, also called efficiency.

Having a precise understanding of how these KPIs relate to each other is necessary to get a good picture of how close to optimal the operations on the platform are. Following the previous example, we know that for a given number of orders, having more couriers leads to lower efficiency, but also to a better user experience. However, adding couriers exhibits diminishing returns in terms of delivery time improvements, because eventually most couriers are idle. Hence, finding the sweet spot that provides both a good user experience and a high utilization rate is crucial for the sustained success of our marketplace.

In this article, we describe the approach we’ve taken at Glovo to understand these relationships, as well as the path that has taken us to that approach.

The first iteration: understanding the relationship between UR and user experience through polynomial models

Let’s start by clearly defining the problem:

What is the expected daily average delivery time, percentage of cancellations and efficiency conditional on the UR?

From this problem statement, we see we have three targets: delivery time, percentage of cancellations and efficiency, and we want to know the relationship between the UR and each of these targets. Our first approach to model the relationship was a linear regression with a second-degree polynomial:
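Schematically, for a given target y (delivery time, percentage of cancellations or efficiency), the regression is fit independently for every city and target, using the same α, β and β' notation that appears again in the multilevel model further down:

y = α + β · UR + β' · UR² + ε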

We wanted a simple model that was easy to understand and not prone to overfitting. The polynomial model gave us just that. Below you can see what this model looks like for one of the main cities where Glovo operates:

Polynomial models for a large-sized city. In this case, the models concur with our domain knowledge: a convex relationship between UR and delivery time and between UR and cancellations, and a concave relationship between UR and efficiency.

Note that we apply some post-processing to the delivery time and cancellations models so that the prediction doesn’t bounce back up at low values of the UR, as a raw second-degree polynomial would necessarily do. As you can see, the polynomial model works well for this city. The relationship between UR and delivery time is convex and the slope increases for high values of UR, just as in the relationship between UR and cancellations. The relationship between UR and efficiency is linear until it flattens out (roughly around the UR value where the delivery time starts increasing non-linearly because of the lack of couriers).
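The exact post-processing isn’t spelled out here, but one simple way to get this behaviour, assuming the fitted curve is convex (β' > 0), is to hold predictions flat below the parabola’s vertex. A minimal sketch in Python (function name and coefficients are ours, purely for illustration):

```python
import numpy as np

def clamp_below_vertex(ur, alpha, beta, beta_prime):
    """Second-degree polynomial prediction, held flat below the parabola's vertex.

    An illustrative guess at the post-processing, not Glovo's actual implementation.
    Assumes a convex fit (beta_prime > 0), as for delivery time and cancellations.
    """
    ur = np.asarray(ur, dtype=float)
    vertex = -beta / (2.0 * beta_prime)      # UR value where the parabola bottoms out
    ur_clamped = np.maximum(ur, vertex)      # below the vertex, predict the minimum value
    return alpha + beta * ur_clamped + beta_prime * ur_clamped ** 2

# Example with made-up coefficients: predictions for low UR values stay at the vertex level.
print(clamp_below_vertex([0.1, 0.4, 0.8], alpha=40.0, beta=-10.0, beta_prime=25.0))
```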

Unfortunately, for every one of these relationships, we could show a myriad of plots where the model violated our domain knowledge. Take these three, each one from a different city:

Polynomial models for three different-sized cities. The models exhibit a very weak relationship between UR and delivery time, a concave relationship between UR and cancellations and a linear relationship between UR and efficiency. We know all these models to be wrong.

In these examples, the model results suggest that having very high URs leads to an ever-increasing efficiency and that it has no impact on the user experience. Actually, the second plot suggests that in this particular city a higher UR means we get fewer cancelled orders!

The reason why these models do not represent the underlying relationships is that Glovo’s operations are highly volatile: the data contain strong seasonalities, trends and extraordinary days that lead to unstable city-level models. The problem was more severe in smaller cities, but it was also present in some larger ones. The figures shown come from Glovo cities of different sizes.

Typically, you want to add more data to be able to learn some of these patterns, but we were concerned that older data, less representative of the current situation, would dilute the importance of more recent data, precisely due to the fast-changing nature of Glovo’s operations.

The second iteration: a hierarchical Bayesian model

We don’t need to evaluate those models on out-of-sample data: we know they’re wrong just by visually inspecting them. Why? Because we have a strong theoretical understanding of the underlying relationships. We know:

  • What these relationships should look like: convex for delivery time and cancellations, and concave for efficiency. We even know that the relationship between UR and efficiency is strictly linear for as long as the delivery times are not affected.
  • That the models in different cities should be similar.

So, why not use that domain knowledge to inform our modelling? That’s exactly what Bayesian modelling allows us to do. In a nutshell, Bayesian modelling is based on two ideas:

  • All parameters of your model can be modelled as random variables that follow a certain probability distribution.
  • You do inference on the model parameters by inspecting the posterior distribution, which can be obtained by applying Bayes’ theorem.
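Writing θ for the model parameters and D for the observed data, that theorem is simply:

p(θ | D) = p(D | θ) · p(θ) / p(D)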
Just Bayes’ theorem.

But what is a posterior distribution? A posterior distribution is a probabilistic description of a random variable after evaluating the existing data (likelihood) and the pre-existing knowledge (prior). That is, the posterior distribution represents a compromise between the data and our prior knowledge.

In our Bayesian model, we keep the second-degree polynomial as our way of modelling all these relationships, but we add some tweaks. First, we cluster cities geographically. Then, we define the model so that every parameter of the polynomial regression for every single city follows a normal distribution whose mean and variance are common to all the cities of the cluster. That is, the cluster-level distribution acts as the prior for the city-level parameters. In turn, every parameter at the cluster level comes from a common normal distribution, which we assume to have a known mean and variance: our hyper-priors. This is a common formulation of hierarchical linear regression, and we can represent such a model graphically as in the following image:

Graphical representation of our multilevel model. Variance parameters have been omitted for simplicity. α, β and β’ represent the second-degree polynomial parameters presented above. The subscript denotes the parameter’s level. Unobserved random variables are represented as white circles, observed variables as shaded circles and fixed parameters as shaded squares.
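To make this concrete, here is a minimal sketch of how such a model could be written in PyMC3, shown for a single cluster and a single target such as delivery time. The library choice, variable names, prior values and synthetic data are ours for illustration; they are assumptions rather than the production implementation.

```python
import numpy as np
import pymc3 as pm

# Synthetic data for one cluster of cities (purely illustrative).
rng = np.random.default_rng(42)
n_cities, n_days = 5, 60
city_idx = np.repeat(np.arange(n_cities), n_days)          # which city each observation belongs to
ur = rng.uniform(0.3, 0.9, size=n_cities * n_days)          # daily utilization rate
y = 40 - 10 * ur + 25 * ur**2 + rng.normal(0, 2, ur.size)   # e.g. daily average delivery time

with pm.Model() as hierarchical_model:
    # Cluster level: shared means and spreads, with fixed hyper-prior values.
    mu_alpha = pm.Normal("mu_alpha", mu=0.0, sigma=20.0)
    mu_beta = pm.Normal("mu_beta", mu=0.0, sigma=20.0)
    mu_beta2 = pm.Normal("mu_beta2", mu=0.0, sigma=20.0)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sigma=5.0)
    sigma_beta = pm.HalfNormal("sigma_beta", sigma=5.0)
    sigma_beta2 = pm.HalfNormal("sigma_beta2", sigma=5.0)

    # City level: one set of polynomial coefficients per city,
    # drawn from the cluster-level distributions above.
    alpha = pm.Normal("alpha", mu=mu_alpha, sigma=sigma_alpha, shape=n_cities)
    beta = pm.Normal("beta", mu=mu_beta, sigma=sigma_beta, shape=n_cities)
    beta2 = pm.Normal("beta2", mu=mu_beta2, sigma=sigma_beta2, shape=n_cities)

    # Second-degree polynomial in UR, one curve per city.
    mu = alpha[city_idx] + beta[city_idx] * ur + beta2[city_idx] * ur**2
    sigma_obs = pm.HalfNormal("sigma_obs", sigma=5.0)
    pm.Normal("y_obs", mu=mu, sigma=sigma_obs, observed=y)

    trace = pm.sample(1000, tune=1000, target_accept=0.9)
```

In this sketch, alpha, beta and beta2 play the role of α, β and β' for each city, and their posteriors are pulled towards the cluster-level means.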

The resulting posterior of our city-level parameters (α, β and β’ in this case) will represent a compromise between the city-level data and the cluster-level prior, which in turn represents a compromise between the parameters from all cities and the hyper-prior. If the city-level data shows a clear relationship between the UR and our target, that relationship will prevail in the resulting posterior; otherwise, the prior will pull the model in its direction. We can see that compromise very clearly in the following images:

Polynomial and multilevel model for the previously-shown cities. The multilevel model concurs with our domain knowledge in all cases.

Every figure contains a representation of the polynomial model, the multilevel model at the city level, and a hypothetical model with cluster-level parameters — which is the prior of the multilevel model at the city level. These are the same cities we showed before. On the top row, we have the city where the polynomial model was performing well. On the bottom row, we show the cities where the polynomial model was performing poorly. We see two things:

  • The multilevel model corrects the polynomial model in the cities where the polynomial model wasn’t working well. The resulting functions look the way we would expect them to, thanks to the pull of the cluster prior.
  • The multilevel model is almost identical to the polynomial one in the city where the polynomial was working well: the cluster does not pull the model towards itself. This is particularly obvious for delivery time and efficiency, where the model remains close to the polynomial prediction and far from the cluster prior.

An extra benefit of multilevel models is that they work quite well when certain groups have few observations. Take the example of a city where Glovo just started operating. For the first days, the cluster prior will have a strong influence on the final prediction, and as new data comes in, the likelihood part of the posterior will gain importance.

Polynomial and multilevel models for a city where Glovo just started operating. With little data, the multilevel model already does a good job.

Hierarchical models as a way of regularizing our model

In machine learning, we typically describe model performance as a trade-off between bias and variance. A model with low bias and high variance is a model that can explain the training data with high accuracy but whose parameters vary wildly with slight changes in the input data, a phenomenon known as overfitting. In the figure below, you can see that even for a model as simple as a polynomial regression, that is exactly what is happening.

Polynomial parameters, 95% bootstrapped confidence intervals (left) and city-level parameters from the multilevel model with 95% probability intervals (right) for multiple cities. The dashed line (cluster average) is the average of all city-level parameters from the same cluster from the multilevel model. The multilevel model parameters have much less variance.

On the left side, you have the parameters’ means and their corresponding 95% confidence intervals (bootstrapped) for the polynomial model. On the right side, you have the same plot for the multilevel model, except with the probability intervals derived from the posterior of the parameters. The dashed line is the average parameter of the multilevel model. What we can see is that the variance of the polynomial model is multiple orders of magnitude higher than the variance of the multilevel model. This means that we can use a multilevel approach to effectively shrink the model parameters towards their prior mean and prevent overfitting, while still allowing for large differences between city-level models — as we saw in the preceding plots.
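This shrinkage has a simple closed form in the textbook normal-normal case, which gives a good intuition for what the multilevel model does to each coefficient (the notation below is ours, not the article’s). For a city with n observations, observation noise σ² and cluster-level variance τ², the posterior mean is a precision-weighted average of the estimate from the city’s own data and the cluster mean:

posterior city estimate ≈ w · (estimate from the city’s data alone) + (1 − w) · (cluster mean),   where   w = (n/σ²) / (n/σ² + 1/τ²)

The more data a city has (larger n), or the more cities genuinely differ (larger τ²), the closer w gets to 1 and the less the cluster prior matters.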

Final words and next steps

Bayesian methods helped us tackle the modelling problems we were facing due to the inherent heterogeneity and volatility of Glovo’s operations. Bayesian modelling can prove particularly useful when (1) the data scientist has a strong theoretical understanding of the model and (2) the dimensionality of the problem is limited. The second point is related to the first one, as it is difficult to have a strong understanding of a very complex model. Besides that, scalability is still one of the challenges posed by Bayesian modelling. Even in a problem like the one we’ve described, where the low-level models were rather simple (a second-degree polynomial with just one covariate), the computational cost of this model was several times higher than that of the previous model, although in our case that wasn’t a problem.

As next steps, some things we’d like to try are:

  1. Different non-linear transformations of the UR. Second-degree polynomials are too rigid for some of these relationships.
  2. A distribution that is more robust to outliers for modelling our target (e.g. Student’s t); a sketch of what that swap could look like follows this list.
  3. A data-based method to estimate the city cluster assignments.
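As an illustration of the second point, here is a minimal, self-contained sketch (again in PyMC3, with toy data and priors of our own choosing, and without the hierarchy for brevity) of a polynomial regression whose Gaussian likelihood is replaced by a Student’s t, so that a handful of extraordinary days pull the fit around much less:

```python
import numpy as np
import pymc3 as pm

# Toy data with a few outliers standing in for "extraordinary days".
rng = np.random.default_rng(0)
ur = rng.uniform(0.3, 0.9, size=200)
y = 40 - 10 * ur + 25 * ur**2 + rng.normal(0, 2, ur.size)
y[:5] += 30  # a handful of unusually bad days

with pm.Model() as robust_model:
    alpha = pm.Normal("alpha", mu=0.0, sigma=20.0)
    beta = pm.Normal("beta", mu=0.0, sigma=20.0)
    beta2 = pm.Normal("beta2", mu=0.0, sigma=20.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    nu = pm.Gamma("nu", alpha=2.0, beta=0.1)  # degrees of freedom of the Student's t

    mu = alpha + beta * ur + beta2 * ur**2
    # Heavy-tailed likelihood: outliers are down-weighted compared to a Normal.
    pm.StudentT("y_obs", nu=nu, mu=mu, sigma=sigma, observed=y)

    trace = pm.sample(1000, tune=1000, target_accept=0.9)
```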

If you think these problems are interesting or you think you can contribute to improving our current solutions, come join us!
