Cash is king: how our real-time Machine Learning models minimize the risk of payments in cash

Daniel Cañueto
The Glovo Tech Blog
10 min read · Oct 21, 2021
Photo by cilfa from PxHere

Why do we need a Machine Learning model?

Every day, Glovo's Fraud & Payments team deals with customers, couriers, and partners who try to defraud the company or whose behaviour puts a successful order delivery at risk. Working with them, you get to enjoy funny stories, like the time a courier tried to steal a Subway order from the Fraud & Payments Director of the very company he was delivering for:

Slack screenshot of the moment a courier tried to defraud our Fraud & Payments Director

A significant share of all fraud and risk comes from customers. Minimizing it is vital because food delivery is a low-margin business. Why? Consider a typical 20€ order. Let's say that 13€ go to the restaurant, 5€ go to the courier, and 2€ (a 10% margin) go to us. What happens if, for example, a courier cannot deliver an order because the customer has entered a wrong address? We still need to pay the 18€ to the restaurant and the courier for their part in the deal. So, every order that ends up unpaid (because it could not be delivered or because the customer refused to pay for it) can wipe out the margin of nine successful orders.

To reduce the risk caused by customers, we charge cancellation fees when the unpaid order is the customer's fault. But how do we charge these fees if the payment method is cash? Consider the difference between an order paid in cash and an order paid with a credit card. When the payment method is a credit card, we can charge the fee for an unpaid order automatically. However, when the payment method is cash, we can only charge the customer at the moment of delivery (this is called Cash on Delivery, or COD). If, for example, the customer rejects the delivery or gives a wrong address, we cannot charge the fee. We can't send the courier back to the customer like a debt collector until the customer agrees to pay. And attempts to recover the cancellation fee on the customer's future orders are only partially successful.

Because of this, COD is a much riskier payment method than card payments. But abandoning it would hurt our business: COD is selected as the payment method for more than 30% of all orders, and more than 70% in some countries! Currently, almost 60% of our new users place their first order using cash as a payment method. Allowing our customers to pay in cash is important for expansion into different countries and segments.

To reduce the negative side-effects of COD orders, we can block the cash payment method for orders whose risk is high enough to justify blocking them. While many SaaS companies offer effective fraud-prevention tools, they mostly focus on card fraud (e.g., chargebacks, card skimming), so their methods might not be optimized for the nuances of COD (e.g., in COD, there is no card information to process).

During our research, we found that COD challenges are under-discussed, and we hope that this article helps fill this gap. Here, we will explain how we developed an in-house real-time Machine Learning (ML) model to prevent risky cash orders. This model needed to:

  • be deployed at the necessary scale (millions of orders, millions of decisions),
  • have the lowest possible latency (so that the customer experience is not hurt), and
  • be effective at preventing unpaid orders, of course.

How did we develop our ML model?

Project setup

Before the development of this ML model, the Fraud & Payments team had done a lot of necessary preliminary work. They had already:

  • developed a Risk Engine with different business rules encoded in it,
  • developed a database and backend that fed the Risk Engine the necessary customer, order, and store features to deploy these business rules (here is an explanation of a related approach), and
  • collected a snapshot of these feature values at every order checkout during the previous 7 months (sketched below).
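To make the snapshot idea concrete, here is a minimal sketch (with hypothetical field names, not Glovo's actual schema) of persisting feature values at checkout, so that training later uses exactly the values the Risk Engine saw at decision time:

```python
# Minimal sketch of "snapshot at checkout": persist the feature values the
# Risk Engine sees at decision time, so later training cannot leak
# information from after the order was placed. All names are hypothetical.
import json
import time

def snapshot_features_at_checkout(order_id: int, features: dict, log_path: str) -> None:
    """Append one row per checkout with the feature values as seen right now."""
    row = {"order_id": order_id, "checkout_ts": time.time(), **features}
    with open(log_path, "a") as f:
        f.write(json.dumps(row) + "\n")

# The backend would call something like this at every checkout:
snapshot_features_at_checkout(
    order_id=42,
    features={"customer_unpaid_orders_90d": 1, "order_value_eur": 20.0},
    log_path="checkout_feature_snapshots.jsonl",
)
```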

Thanks to all this effort, prototyping the ML model was actually “easy”: we had a dataset of millions of orders without any data leakage (as the data had been collected at the checkout point), and we had the necessary backend system to feed the Risk Engine with our model predictions. In addition, we had the domain expertise of the Fraud team to guide us on the most promising features, the most appropriate data preprocessing steps, and the most important metrics.

This process followed the Data Science Hierarchy of Needs: we only started thinking about ML models when we reached the previous levels.

Source: Monica Rogati

Prototyping

While developing the ML model, we first needed to show that the benefits could be great enough to compensate for the effort and risk of putting it into production. Starting with a Proof of Concept (POC), we engaged with different experts and stakeholders to understand the best features, the best possible target labelling, the best selection of historical data, and the preferred properties of the final data product. Next, we tried several algorithms and hyperparameter combinations and compared the results against an existing baseline. After that, we presented the results, the risks, and the required effort to different stakeholders.

In the end, our ML model prototype consisted of different types of features about:

  • the untrustworthy behaviour of potential fraudsters,
  • the order properties (some orders are more prone to failure or have telltale signs of organized fraud),
  • the store properties (fraud is sometimes organized by them), and
  • additional input from a third party.

During prototyping, we faced several challenges in matching the desired properties and performance of the data product. When tackling them, we found that being creative in our solutions while staying empathetic towards the stakeholders and their needs can really make a difference. Here are some examples of these challenges and how we faced them (the sketch after this list illustrates all three ideas):

  • We applied sample weighting to adapt the training dataset and target labelling as well as possible to the business requirements and to the class imbalance (unpaid orders are only a small percentage of all orders).
  • We developed a 0–100 score that quantified the different types of risk with the desired data distribution. Thanks to this type of score, the Fraud team can choose different score thresholds for different verticals/countries to adjust better to business priorities.
  • Some of the most important features are high-cardinality categorical features (>20,000 categories). We relied on LightGBM's capacity to handle these types of features effectively. Still, most categories inside these features have no predictive power; they only add noise and latency to training and prediction. So, we gave the algorithm a nudge by removing these non-predictive categories (following certain rules) before model training.
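To make these ideas concrete, here is a minimal, self-contained sketch on synthetic data. The feature names, weighting scheme, pruning threshold, score mapping, and hyperparameters are all illustrative assumptions, not our production choices:

```python
# Illustrative sketch of the three ideas above, on synthetic data.
import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
orders = pd.DataFrame({
    "store_id": rng.integers(0, 2_000, n).astype(str),  # high-cardinality feature
    "order_value": rng.gamma(2.0, 10.0, n),
    "is_unpaid": (rng.random(n) < 0.03).astype(int),    # strong class imbalance
})

def prune_rare_categories(s: pd.Series, min_count: int = 20) -> pd.Series:
    """Collapse categories with little support into 'other', so they stop
    adding noise and latency to training and prediction."""
    counts = s.value_counts()
    keep = counts[counts >= min_count].index
    return s.where(s.isin(keep), "other").astype("category")

orders["store_id"] = prune_rare_categories(orders["store_id"])

X = orders[["store_id", "order_value"]]
y = orders["is_unpaid"]

# Sample weighting: upweight the rare positive class (unpaid orders).
weights = np.where(y == 1, (1 - y.mean()) / y.mean(), 1.0)

model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X, y, sample_weight=weights)  # LightGBM handles category dtype natively

# Map probabilities to a 0-100 score via percentile ranks, so the Fraud team
# can set different thresholds per vertical/country on a stable scale.
proba = model.predict_proba(X)[:, 1]
risk_score = 100 * pd.Series(proba).rank(pct=True)
```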

Productionizing

After prototyping our ML model, we needed to productionize it while respecting certain constraints (real-time serving, <200 ms latency, sufficiently frequent model retraining). We met these requirements thanks to the infrastructure and tools provided by our Machine Learning Platform (MLP); you can find nice overviews of how models are productionized at Glovo here and here.

The different steps of the model pipeline were orchestrated with Jenkins. Examples of some of those steps include:

  • automatic weekly training of models with up-to-date data,
  • manual deployment of the updated model if it shows better performance than the model currently in production (a sketch of this check follows this list), and
  • monitoring of the model’s performance.
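As an illustration of that second step, a hypothetical champion/challenger check could look like the following; the metric (average precision, which summarizes the precision-recall curve) and the promotion margin are assumptions, not our exact criteria:

```python
# Hypothetical champion/challenger check: promote the retrained model only
# if it beats the production model on held-out data by some margin.
from sklearn.metrics import average_precision_score

def should_promote(challenger, champion, X_val, y_val, min_gain: float = 0.01) -> bool:
    """Compare average precision (a precision-recall summary) on validation data."""
    ap_new = average_precision_score(y_val, challenger.predict_proba(X_val)[:, 1])
    ap_old = average_precision_score(y_val, champion.predict_proba(X_val)[:, 1])
    return ap_new >= ap_old + min_gain
```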

Before deploying the model, we also ran different types of tests and CI/CD checks, including:

  • unit tests to reduce the risk of introducing bugs,
  • integration tests to ensure that the training and production environments matched, so that predictions in production behaved as expected,
  • load tests to verify that the model could scale in production (a toy example follows this list),
  • style and type checks that enforce consistent code quality (flake8, black, isort, mypy), and
  • checks ensuring that the ML projects were isolated from each other.
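As a toy example of the latency side of those load tests, assuming the model is served over HTTP (the endpoint URL, payload, and sample size are placeholders):

```python
# Toy latency check: hit a (hypothetical) model endpoint repeatedly and
# verify that the p99 latency stays within the 200 ms budget mentioned above.
import time
import numpy as np
import requests

URL = "http://localhost:8080/predict"   # placeholder model API endpoint
PAYLOAD = {"order_id": 42}              # placeholder request body

latencies_ms = []
for _ in range(200):
    t0 = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=1.0)
    latencies_ms.append((time.perf_counter() - t0) * 1000)

p99 = np.percentile(latencies_ms, 99)
assert p99 < 200, f"p99 latency {p99:.0f} ms exceeds the 200 ms budget"
```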

Once this test phase passed, we would build and deploy our model using Spinnaker. During the deployment pipeline, we would use Kubernetes to scale the model API to the requested concurrency (i.e., serving 1 request/second is not the same as serving 30 requests/second) and latency requirements. Lastly, once deployed, we would monitor technical performance and potential errors with Datadog. Every time a new model version needs to be deployed to prevent model drift, all these tests, checks, and deployment steps are performed again.

A high-level overview of the pipeline is shown below:

Simplified schema of the model pipeline

All these tools and steps might be overwhelming for a Data Scientist, especially if they [1] do not come from a Computer Science background. But thanks to the tools and libraries provided by the MLP team, Data Scientists at Glovo can own the end-to-end model pipeline without deep software-engineering experience.

What happened after we deployed our ML model?

Our model showed a precision-recall relationship several times better than the previous baseline's:

As a result, we have been able to reduce unpaid orders coming from the riskiest types of customers by 40%.

But still, deploying is only half the battle, and models need to be actively monitored. In our case, two interesting feedback loops appeared after the model deployment.

Why are precision-recall metrics worsening?

When we deployed the model, the precision-recall metrics started off very strong but quickly worsened. Interestingly, the baseline model showed a similar pattern:

So, our model was still several times better than the baseline, but what was happening to the overall performance? The most important fraudsters were the ones we identified with the highest precision. As we blocked them from abusing current and new user IDs, they lost the incentive to create new ones. So, they stopped appearing in our data, and not enough new fraudsters appeared to replace them. As a result, the unpaid orders that remained were harder to predict with high precision.

To avoid this distortion in the metrics, the general recommendation is to create a holdout group to whom our ML model is not applied. Thanks to this group, we can analyze the metrics on isolated data where these types of fraudsters are still present [2]. But there is a problem: a successful holdout group would need to isolate a random subset of people, not just their accounts. When assigning each account to the holdout or the normal group, we must ensure that all the accounts of a fraudster go to the same group. If not, a fraudster who creates 40 accounts might have one account allowed in the holdout group and the other 39 blocked in the remaining group. As a result, no fraudster would have an incentive to keep creating IDs even after the creation of a holdout group. Future network-based models should help us connect different accounts to a single fraudster and address this limitation.
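Here is a minimal sketch of deterministic group assignment. It hashes a stable linking key (a hypothetical device fingerprint; in practice, the linking signal would come from those network-based models) so that every account sharing that key lands in the same group:

```python
# Deterministic holdout assignment: all accounts that share the same
# linking key always land in the same group.
import hashlib

def assign_group(link_key: str, holdout_pct: int = 5) -> str:
    digest = hashlib.sha256(link_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket in [0, 100)
    return "holdout" if bucket < holdout_pct else "model"

# A fraudster's 40 accounts sharing one fingerprint all get a single group:
print(assign_group("device-fingerprint-123"))
```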

How do we treat the orders blocked by the ML model?

When retraining the ML model, the orders blocked by the model are likely the ones with the most interesting fraud- or risk-related patterns, but because they were blocked, we do not know what their actual outcome would have been. We can't simply remove these orders from the retraining dataset: this would cause survivorship bias. Neither can we rely exclusively on the holdout group to train models, as the sample size would be too small. And of course, we can't ask the company to allow all fraud for a while just to obtain a large, unbiased dataset: just think of the headlines and the financial loss!

In the end, the most sensible option was to add blocked orders to the training set (even if we do not know what their outcome would have been) and label them as reasonably as possible. While feedback loops inevitably appeared in the shape of a gradual distortion of the score distribution (see in the figure below how a peak of high scores appears several weeks after deployment), we identified and developed ways to minimize them. (This article has some clever ideas on how to label partially labelled data.)
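Purely as an illustration (our actual labelling and weighting rules differ), a retraining set could keep blocked orders with an assumed positive label and a reduced sample weight, which dampens how strongly the model reinforces its own past decisions:

```python
# Illustrative handling of blocked orders at retraining time: keep them,
# assume the risky label the model acted on, but downweight them so the
# feedback loop is dampened. The 0.5 weight is a placeholder.
import numpy as np
import pandas as pd

def build_labels_and_weights(df: pd.DataFrame) -> tuple[pd.Series, np.ndarray]:
    y = df["is_unpaid"].astype(int).copy()
    w = np.ones(len(df))
    blocked = df["was_blocked"].to_numpy(dtype=bool)
    y[blocked] = 1      # assumed label: the outcome was never observed
    w[blocked] = 0.5    # placeholder downweighting
    return y, w
```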

In a nutshell, you don’t want to find yourself entangled with these feedback loops, so:

  • keep monitoring models after their deployment, and
  • budget some time for model maintenance (even if you might be busy building cool new stuff).

Let’s sum up!

Cash is a risky payment method because it can be difficult to make customers pay the appropriate penalty fees. Those challenges, however, are not much discussed on Fraud Detection forums. We hope that the details of how we developed and deployed an in-house, low-latency, real-time ML model that prevents risky cash orders can be of use to other fraud-fighters out there.

Yes, it’s fun to develop and apply smart ideas when prototyping and productionizing ML models. But this model would not have been possible without the prior domain expertise and infrastructure built by other teams (the Fraud & Payments backend engineers and analysts, and the Machine Learning Platform team). Training ML models is just the icing on the cake of the Data Science hierarchy of needs.

And yes, it is not much fun to have to keep monitoring and actively maintaining these ML models, but the odds are high that **** will happen. So, it is better to proactively minimize the model's risks and weak points, paying special attention to the particular traits of the domain.

Thanks to Jelena Aleksić, David Morera, Geo Jolly, Pablo Giner, Pablo Barbero, Andy Kreek, Adrià Salvador and Florian Jensen for their insightful feedback; and to Patrick Sánchez for the fruitful collaboration.

[1] Well, we.

[2] If this type of fraud were not only impacting us but also couriers or partners, then the analysis of KPIs would not be enough cause to allow these fraudsters to place orders.


Daniel Cañueto

Data torturer at Glovo Fintech Risk team. A bad model is better than no model.