Maximising the Success of our Product Ranking Model through Backtesting

Grace Chia · Published in foodpanda.data · May 29, 2023

Imagine it’s midnight and you’re feeling a little peckish. You decide to order some food through a delivery app. However, after selecting the first restaurant that appears, you’re presented with an overwhelming list of more than 50 menu items. With so many food options, it can be difficult to decide what to order. This is a pain point that many customers experience on a daily basis.

Figure 1: An example of how the products are displayed on a restaurant’s menu within the foodpanda app

The objective of foodpanda’s product ranking model is two-fold:

(1) Simplify the customer ordering experience by sorting the menu items in a way that is more relevant to users. This means prioritising the products that are most likely to be ordered, resulting in a more efficient ordering process.

(2) Increase the revenue of our restaurant partners by sorting the menu in a way that highlights the most popular items.

Our key success metrics are:

(1) Menu Conversion Rate, i.e. the proportion of customers who successfully ordered at least one item out of all customers who viewed the menu.

(2) Average Food Value (AFV), which is the average basket value of an order, excluding delivery fee.

Why do we need backtesting?

Tech companies often conduct A/B tests to evaluate the effectiveness of ranking algorithms. However, experiments are time-consuming and not well suited to quick iteration. With these drawbacks in mind, we have designed a backtesting framework that shortens the model development cycle, allowing us to evaluate and identify the best ranking model before rolling it out for A/B testing.

How does it work?

After a model is developed, it is evaluated against a static dataset of historical orders and benchmarked against a baseline, i.e. the model currently live in production. Because the data is entirely historical, iteration at the model-building stage is fast: we do not need to wait a few weeks for an experiment to conclude before analysing the results and tweaking the model.

While the quality of ranking models is often subjective, one way to assess them is through user feedback. We used implicit feedback as a measure of product relevance, with the assumption that if a user orders a particular product, that product is more relevant than the products that were not ordered.

Mean Average Precision (MAP) and Discounted Cumulative Gain (DCG) were chosen as our key evaluation metrics because they best align with our business objectives of improving Menu Conversion Rate and AFV.

While we will not delve into the mathematical explanation behind these metrics in this article, you are encouraged to explore these resources on MAP and DCG.

Mean Average Precision (MAP)

Our product ranking models are evaluated using Mean Average Precision at K (MAP@K). MAP is used for tasks with binary relevance and measures the relevance of the top K items in the menu. In the context of our project, the relevance of the kᵗʰ item is either not ordered (rel(k) = 0) or ordered (rel(k) = 1).

Figure 2.1: Formula for AP@K
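
Conventionally, AP@K can be written as follows, where m is the number of products the user ordered and P(k) is the precision at cut-off k (this is the standard formulation; the exact notation in Figure 2.1 may differ slightly):

```latex
AP@K = \frac{1}{m} \sum_{k=1}^{K} P(k) \cdot \mathrm{rel}(k)
```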

To illustrate, let’s imagine that we have two models that have ranked the first 5 menu items as follows. The user orders 2 products (m=2), Product IDs 123 and 124.

Figure 2.2: An example of two models that rank 5 items differently in a menu

This is how we would calculate AP@5 for Models A and B:
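
Here is a minimal Python sketch of the calculation. Model A places the two ordered products (123 and 124) in the top two positions, as described below; Model B's layout is assumed purely for illustration, since the exact ranking in Figure 2.2 is not reproduced here.

```python
def average_precision_at_k(ranked_ids, ordered_ids, k=5):
    """AP@K for binary relevance: an item is relevant (rel = 1) if it was ordered."""
    hits = 0
    precision_sum = 0.0
    for position, product_id in enumerate(ranked_ids[:k], start=1):
        if product_id in ordered_ids:
            hits += 1
            precision_sum += hits / position  # precision at this cut-off, counted only at hits
    m = min(len(ordered_ids), k)
    return precision_sum / m if m else 0.0

ordered = {123, 124}
model_a = [123, 124, 125, 126, 127]  # ordered items in positions 1 and 2
model_b = [125, 123, 126, 124, 127]  # hypothetical: ordered items in positions 2 and 4

print(average_precision_at_k(model_a, ordered))  # 1.0
print(average_precision_at_k(model_b, ordered))  # (1/2 + 2/4) / 2 = 0.5
```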

Model A represents the perfect scenario, where AP is 1, implying that the model has ranked all the true positives at the top of the menu. We can then obtain MAP@K by averaging the AP@K values over a set of orders.

Discounted Cumulative Gain (DCG)

MAP evaluates how well a model ranks on conversion, but it doesn’t take into account the secondary objective of our menu ranking project, which is to improve AFV. As such, we can use Discounted Cumulative Gain at K (DCG@K) to assess how well the model ranks on product price. DCG is particularly useful when there are multiple levels of relevance. In this case, we can use product price as relevance, rel(k).

Figure 3.1: Formula for DCG@K
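
A common definition of DCG@K, with a logarithmic discount on position, is the following (variants exist, but this linear-gain form is consistent with the example below):

```latex
DCG@K = \sum_{k=1}^{K} \frac{\mathrm{rel}(k)}{\log_2(k + 1)}
```
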
Figure 3.2: An example of two models that rank 5 differently priced items on a menu

Here’s how we can calculate the DCG@5 for Models A and B:

Now, let’s imagine a user who ordered 2 items in the top 5 positions in the menu. While both Models A and B rank equally well on Average Precision (AP@5 = 0.7), Model B outperforms Model A on DCG@5. This is because Model B ranks the pricier item ($14) in the 1st position and the $10 item in the 5th position, while Model A does the opposite. Thus, DCG rewards the model for front-loading the pricier products that are ordered.
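
To put rough numbers to this, here is a short sketch assuming rel(k) is the price of the item if it was ordered and 0 otherwise, with the ordered items sitting in positions 1 and 5 (consistent with AP@5 = 0.7 for both models):

```python
import math

def dcg_at_k(relevances, k=5):
    """DCG@K with each gain discounted by log2(position + 1)."""
    return sum(rel / math.log2(position + 1)
               for position, rel in enumerate(relevances[:k], start=1))

# Relevance per position: price of the ordered item, 0 if the item was not ordered.
model_a = [10, 0, 0, 0, 14]  # $10 item in position 1, $14 item in position 5
model_b = [14, 0, 0, 0, 10]  # $14 item in position 1, $10 item in position 5

print(round(dcg_at_k(model_a), 2))  # ~15.42
print(round(dcg_at_k(model_b), 2))  # ~17.87, Model B front-loads the pricier ordered item
```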

Although the total food value ordered would be the same for both models ($24), Model B is the preferred model because a user is more likely to see, and order, the more expensive item when it sits at the top of the menu than when it is much further down.

Conclusion

There are several ways that we could improve our evaluation methodology, one of which is to customise the discounting logic in DCG to better suit our use case. For instance, if the app typically displays five items in each scroll, we could assign the same “gain” to each set of five items instead of a discounted “gain” for every item.
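
As a rough sketch of that idea (the per-scroll grouping and log base here are assumptions for illustration, not our production logic), the discount could be shared by all items within the same scroll of five rather than applied per position:

```python
import math

def scrolled_dcg_at_k(relevances, k, items_per_scroll=5):
    """DCG variant where every item within the same scroll shares one discount,
    instead of each position receiving its own discount."""
    total = 0.0
    for position, rel in enumerate(relevances[:k], start=1):
        scroll = (position - 1) // items_per_scroll + 1  # 1st scroll, 2nd scroll, ...
        total += rel / math.log2(scroll + 1)             # same discount within a scroll
    return total
```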

Backtesting can be a valuable tool in the model development cycle, providing insights into the potential success of a new algorithm while saving time. However, it is important to keep in mind that backtesting does not replace the need for experiments. In our case, since the models were evaluated retrospectively on orders where users saw the baseline ranking, there could be a position bias: products placed higher by the baseline were more likely to be seen and ordered, which affects the conversion rate. Therefore, after refining the models using metrics such as MAP@K and DCG@K, it is crucial to conduct A/B experiments to evaluate the true impact of the models in a real-world setting.
