Partner Health: measuring marketplace health as a first step to improve it
To improve something, you first need to be able to measure it. In this blog post we go through the process of creating a new metric for Glovo, from understanding the business problem to building the frameworks that let teams use the metric and run experiments with it. We named the metric Partner Health; it measures how well orders are distributed across partners. Partner Health is now part of the company’s 2022 OKRs (Objectives and Key Results), which means it sets the direction of several initiatives across the company that aim to make Glovo a more balanced marketplace. This is in line with our social impact strategy: we want to empower small and medium businesses to grow in the digital world and so safeguard cities’ local commerce.
A three-sided marketplace
You might know Glovo only from the user perspective, as an app you can order food and other products from. Behind the app is a complex three-sided marketplace built on three pillars, Customers, Partners and Couriers, with Ads currently being added as a fourth. For Glovo to be a successful business, it is key to maintain the health and efficiency of the overall marketplace, and therefore of each pillar. The first step is to be able to measure the health of each of them:
- Customers are the people who want to order something through the app or web. We are interested in understanding whether customers can find what they are looking for and how satisfied they are with Glovo overall. General marketplace metrics such as Conversion Rate, Clickthrough Rate, user frequency or churn are very useful for understanding the health of this pillar.
- Couriers are people using Glovo to earn income by delivering orders from Partners to Customers. This pillar is based on the logistics behind the app: shorter and more efficient rides increase the number of deliveries a courier can make in an hour.
- Partners are stores using Glovo as a sales channel. This pillar is trickier: we could try to measure the engagement or churn of the stores, but these are lagging indicators that we can’t measure until it is too late in the partner lifecycle. We need to tackle the main reason behind partner churn, which is that many partners don’t get enough orders to stay satisfied with Glovo. For this reason, we must be able to measure how well or badly we are distributing orders across partners.
Our team, Search & Discovery, is behind how users discover content in the app; we own rankings, recommendations, filters and more. Because discovery strongly influences how orders are later distributed across partners, we were given the challenge of coming up with a metric that would let Glovo understand and measure the health of the Partners pillar. This metric should make it possible to set company-wide objectives and to run experiments.
Coming up with the Partner Health metric
The business problem is clear: Glovo has to provide all partners with enough orders for them to be satisfied with the app. Given the huge month-over-month increase in new partners joining Glovo, this challenge is becoming both more important and harder to solve. Some initial brainstorming resulted in several candidate metrics for monitoring the health of the Partners pillar and measuring the success of initiatives that try to improve it:
- Percentage of partners with more than X weekly impressions (enough visibility to get them orders)
- Percentage of partners with more than Y weekly orders (enough orders to avoid churn)
- Percentage of partners with 0 orders
- Percentage of orders coming from the biggest N partners in a city
While these metrics were able to represent the business problem well, they presented many pitfalls that made clear the need to find a better way to measure this. The pitfalls were related to the fixed thresholds that didn’t scale to different cities or countries, lack of granularity, and volatility. All these pitfalls made it hard to set up OKRs with them and made experiments take too long.
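To make the candidate metrics and the fixed-threshold pitfall concrete, here is a minimal sketch. The function name, the threshold and the order counts are all hypothetical, chosen only for illustration. The two cities have the same relative distribution of orders, with the second one scaled by ten.

```python
def threshold_metrics(orders_per_partner, min_orders, top_n):
    """Candidate metrics computed from weekly order counts (one per partner)."""
    n_partners = len(orders_per_partner)
    total_orders = sum(orders_per_partner)
    ranked = sorted(orders_per_partner, reverse=True)
    return {
        # Share of partners with more than `min_orders` weekly orders.
        "pct_above_threshold": 100 * sum(o > min_orders for o in orders_per_partner) / n_partners,
        # Share of partners with no orders at all.
        "pct_zero_orders": 100 * sum(o == 0 for o in orders_per_partner) / n_partners,
        # Share of orders going to the biggest `top_n` partners.
        "pct_orders_top_n": 100 * sum(ranked[:top_n]) / total_orders,
    }


# Hypothetical weekly orders per partner in two cities of different size.
small_city = [0, 3, 5, 8, 40]
large_city = [0, 30, 50, 80, 400]

small = threshold_metrics(small_city, min_orders=10, top_n=1)
large = threshold_metrics(large_city, min_orders=10, top_n=1)
print(small)
print(large)
```

With these made-up numbers the share of partners above the threshold moves from 20% to 80% purely because of city size, while the shape of the distribution is identical. This is the scaling problem that motivates a threshold-free metric.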
To address these pitfalls, we proposed a metric based on the Gini coefficient, a widely used statistic for measuring inequality of income or wealth within a country or social group. We called this metric “Partner Health”. It ranges from 0% (complete inequality, one partner gets all orders) to 100% (perfect equality, all partners get the same number of orders). Note that this is the reverse of the Gini coefficient’s own scale, where 0 means perfect equality.
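For reference, one common way to write this down is shown below. The post does not state the exact formula, so treating Partner Health as the complement of the Gini coefficient is an assumption, chosen because it matches the 0% to 100% scale just described. Here x_i is the number of orders partner i receives and n is the number of partners.

```latex
G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \lvert x_i - x_j \rvert}{2\, n \sum_{i=1}^{n} x_i},
\qquad
\mathrm{PH} = 1 - G
```

Under this definition, a single partner taking all orders gives G = (n - 1)/n, so Partner Health equals 1/n rather than exactly 0%; it approaches 0% as the number of partners grows.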
The main advantages of the proposed Partner Health metric are:
- No need for further assumptions or setting up thresholds, allowing us to scale the metric to any city or country.
- If small partners get more orders, we observe an increase in Partner Health.
- Robust and stable (much more than the other proposed metrics)
- High correlation between Partner Health and Partner Churn
An important note here is that our goal is not to arrive at a situation where all partners get the same number of orders (Partner Health = 100%, perfect fairness). Some partners need more orders than others to stay satisfied with Glovo, and some small partners would saturate if their number of orders increased too much. Partner performance also has a direct impact on Couriers and Customers, so sending too many orders to low-performing partners (those with high cancellation rates, bad ratings or long waiting times for couriers) would put the other two Glovo pillars at risk.
Partner Health with an example
Time to see Partner Health in action! Imagine a city has 4 partners: a pizza place, a chicken place, a sushi place and a burger place. We are going to see how Partner Health changes as we redistribute the orders each partner gets:
In scenario A, the pizza place gets 100 orders and the chicken place gets 50. The Partner Health is 41%, relatively close to 0, because we are far from an equal distribution of orders. In scenarios B and C, the computed Partner Health increases as orders are distributed more equally.
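The example can be replayed with a few lines of Python. This is a sketch under two assumptions: Partner Health is computed as 1 minus the Gini coefficient, and the order counts are placeholders. Only the 100 and 50 orders of scenario A come from the text above; the sushi and burger counts in A and all of scenarios B and C are hypothetical.

```python
def gini(values):
    """Gini coefficient via the mean absolute difference between all pairs."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one partner and one order")
    abs_diffs = sum(abs(a - b) for a in values for b in values)
    return abs_diffs / (2 * n * total)


def partner_health(orders_per_partner):
    """Assumed definition: complement of the Gini coefficient (1 = equal)."""
    return 1 - gini(orders_per_partner)


# Orders for (pizza, chicken, sushi, burger). Only the first two numbers of
# scenario A are taken from the post; every other value is hypothetical.
scenarios = {
    "A": [100, 50, 0, 0],
    "B": [70, 50, 20, 10],
    "C": [40, 40, 35, 35],
}

for name, orders in scenarios.items():
    print(f"Scenario {name}: Partner Health = {partner_health(orders):.1%}")
```

With these placeholder counts scenario A comes out at about 41.7%, in the neighbourhood of the 41% quoted above, and the value rises in B and C as the orders are spread more evenly.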
The code and statistics behind it
Once the metric was clear, we needed to set up clear documentation and develop frameworks that would allow teams to measure the current state and evolution of Partner Health in a region of interest, and use it in experimentation to understand what initiatives helped improve it.
Computing Partner Health in Python is straightforward using the Gini function implemented in the pysal library. The function is applied to the distribution of the number of orders each physical store receives in a given timeframe. Further filters on region, vertical and more can be applied to obtain the Partner Health of a specific domain.
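As an illustration of the computation just described, here is a minimal sketch. The post relies on the Gini function from pysal but does not show the call, so this sketch substitutes a small pure-Python Gini to stay dependency-free. The data layout (`stores`, `order_store_ids`), the field names and the `1 - Gini` definition are assumptions, not Glovo's actual schema.

```python
from collections import Counter


def gini(values):
    """Gini coefficient computed from the values sorted in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        raise ValueError("need at least one store and one order")
    weighted = sum(rank * x for rank, x in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def partner_health(stores, order_store_ids, city=None, vertical=None):
    """Partner Health (assumed to be 1 - Gini) for an optional city/vertical."""
    selected = [
        s["store_id"]
        for s in stores
        if (city is None or s["city"] == city)
        and (vertical is None or s["vertical"] == vertical)
    ]
    counts = Counter(order_store_ids)
    # Stores without orders in the timeframe count as 0 instead of vanishing.
    return 1 - gini([counts[store_id] for store_id in selected])


# Hypothetical stores and the store id of each order in the chosen timeframe.
stores = [
    {"store_id": 1, "city": "BCN", "vertical": "food"},
    {"store_id": 2, "city": "BCN", "vertical": "food"},
    {"store_id": 3, "city": "BCN", "vertical": "groceries"},
    {"store_id": 4, "city": "MAD", "vertical": "food"},
]
order_store_ids = [1] * 6 + [2] * 2 + [3] * 4 + [4] * 3

print(partner_health(stores, order_store_ids))
print(partner_health(stores, order_store_ids, city="BCN", vertical="food"))
```

One design point worth noting: the list of stores is kept separate from the orders so that stores with zero orders still enter the distribution. Dropping them would overstate how evenly orders are spread.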
For the experimentation part we developed a framework based on the article “Bootstrapping the Gini coefficient of inequality”. The main idea is to resample with replacement at customer level, since the customer is the unit of randomization used in our experiments. From the bootstrap we can plot the distribution of Partner Health in the test variant (PH_test), Partner Health in the control (PH_control), and the differences PH_test-PH_control. The distribution of PH_test-PH_control gives us the mean and the confidence interval of the difference: if the confidence interval doesn’t include 0, the test reached statistical significance.
In this example, we got a 95% confidence interval of [0.31, 0.6]. We therefore reject the null hypothesis (no variant effect) at the 5% level and conclude that there is a difference in PH between test and control (positive, and therefore better/healthier).
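The bootstrap procedure can be sketched as follows. This is not the framework used at Glovo: the helper names, the simulated data, the percentile interval and the `1 - Gini` definition of Partner Health are all assumptions made for the sake of a runnable example.

```python
import random
from collections import Counter


def gini(values):
    """Gini coefficient computed from the values sorted in ascending order."""
    ordered = sorted(values)
    n, total = len(ordered), sum(ordered)
    weighted = sum(rank * x for rank, x in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def bootstrap_partner_health(orders_by_customer, stores, n_boot, rng):
    """Resample customers with replacement; recompute Partner Health each time."""
    customers = list(orders_by_customer)
    samples = []
    for _ in range(n_boot):
        counts = Counter()
        for customer in rng.choices(customers, k=len(customers)):
            counts.update(orders_by_customer[customer])
        samples.append(1 - gini([counts[s] for s in stores]))
    return samples


def percentile_interval(values, alpha=0.05):
    """Simple percentile confidence interval from bootstrap samples."""
    ordered = sorted(values)
    lower = ordered[int(alpha / 2 * len(ordered))]
    upper = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
    return lower, upper


def simulate_customers(n_customers, store_weights, rng, orders_each=3):
    """Hypothetical data: each customer orders from stores drawn by weight."""
    stores = list(range(len(store_weights)))
    return {
        c: rng.choices(stores, weights=store_weights, k=orders_each)
        for c in range(n_customers)
    }


rng = random.Random(42)
stores = list(range(5))
control = simulate_customers(300, [20, 1, 1, 1, 1], rng)  # one dominant store
test = simulate_customers(300, [1, 1, 1, 1, 1], rng)      # evenly spread orders

ph_control = bootstrap_partner_health(control, stores, 200, rng)
ph_test = bootstrap_partner_health(test, stores, 200, rng)
diffs = [t - c for t, c in zip(ph_test, ph_control)]

lower, upper = percentile_interval(diffs)
print(f"mean difference: {sum(diffs) / len(diffs):.3f}")
print(f"95% interval: [{lower:.3f}, {upper:.3f}]")
print("significant" if lower > 0 or upper < 0 else "not significant")
```

Resampling whole customers, rather than individual orders, keeps all orders of one customer together, which respects the unit of randomization mentioned above.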
If we want to use Partner Health to analyze experiments, we should also be able to set up MDEs with it (Minimum Detectable Effect, used to understand how much data we need to draw the right conclusions from an analysis). For this we mainly want to check that we have enough data to keep the False Positive Rate below 5% (the alpha of the experiment) and the False Negative Rate below 20% (beta, or 1 - power of the experiment).
False Positive Rate
We want to make sure that if there is no change between test and control, we don’t falsely conclude that there was one. For this, we proposed a simulation that runs many tests on randomly split data, so we can quantify the share of statistically significant results we would get (and make sure it stays below 5%).
False Negative Rate
We also want to make sure that if there really is a change between test and control, we are able to detect it. For this, we proposed a similar simulation that quantifies the share of non-significant results we would get when test and control do differ (and make sure it stays below 20%).
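Both simulations can be sketched together. Everything here is hypothetical: the store weights, sample sizes and function names are made up, the data is simulated rather than taken from real orders, and Partner Health is again assumed to be 1 minus the Gini coefficient. The False Positive Rate uses a random split of a single pool of customers; the False Negative Rate uses two pools drawn from different store weights, which stands in for a real effect.

```python
import random
from collections import Counter

N_STORES = 5


def partner_health(customers):
    """Assumed Partner Health (1 - Gini) from a list of per-customer orders."""
    counts = Counter()
    for orders in customers:
        counts.update(orders)
    ordered = sorted(counts[s] for s in range(N_STORES))
    weighted = sum(rank * x for rank, x in enumerate(ordered, start=1))
    return 1 - (2 * weighted / (N_STORES * sum(ordered)) - (N_STORES + 1) / N_STORES)


def simulate(n_customers, weights, rng):
    """Each simulated customer places 3 orders, stores drawn by weight."""
    return [rng.choices(range(N_STORES), weights=weights, k=3) for _ in range(n_customers)]


def is_significant(control, test, rng, n_boot=100, alpha=0.05):
    """Customer-level bootstrap; significant if the interval excludes 0."""
    diffs = sorted(
        partner_health(rng.choices(test, k=len(test)))
        - partner_health(rng.choices(control, k=len(control)))
        for _ in range(n_boot)
    )
    lower = diffs[int(alpha / 2 * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower > 0 or upper < 0


def false_positive_rate(weights, n_customers, rng, n_sims=30):
    """A/A setting: split one pool at random, count significant results."""
    hits = 0
    for _ in range(n_sims):
        pool = simulate(2 * n_customers, weights, rng)
        rng.shuffle(pool)
        hits += is_significant(pool[:n_customers], pool[n_customers:], rng)
    return hits / n_sims


def false_negative_rate(control_weights, test_weights, n_customers, rng, n_sims=30):
    """A/B setting with a known effect: count non-significant results."""
    misses = 0
    for _ in range(n_sims):
        control = simulate(n_customers, control_weights, rng)
        test = simulate(n_customers, test_weights, rng)
        misses += not is_significant(control, test, rng)
    return misses / n_sims


rng = random.Random(7)
skewed, flatter = [10, 3, 2, 1, 1], [6, 4, 3, 2, 2]
fpr = false_positive_rate(skewed, 100, rng)
fnr = false_negative_rate(skewed, flatter, 100, rng)
print(f"False Positive Rate: {fpr:.0%}, False Negative Rate: {fnr:.0%}")
```

With only 30 simulations the two rates are rough estimates. Repeating the exercise for several sample sizes and effect sizes is one way to find the amount of data needed to keep them below the 5% and 20% targets.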
Wrapping it up
At Glovo, we want to make sure we have the right way to measure our efforts to give every partner a chance to succeed on our platform. For this, we needed a scalable and reliable way to evaluate order distribution that also correlates with other challenges, such as partner churn.
In this post we have walked through the process of defining the Partner Health metric and setting up frameworks for it. This metric enables us to measure the health of order distribution, set goals to improve it, and determine the impact of different experiments and initiatives. As an example, in our last ranking experiment (which modified the logic used to sort the stores shown on the restaurant wall), we observed a +200 bps (2 percentage points) change in Partner Health, which showed that our work was going in the right direction.
All this work was possible thanks to David Barrero, Mehdi Bennaceur, Carolina Romero and Filipe Mencarini. Special thanks too to the Glovo Data Science community for all the feedback provided.