Choosing the right KPIs to evaluate your models

Amir Dolev
Riskified Tech


Riskified’s Chargeback Guarantee product helps online merchants prevent fraudulent transactions at checkout, while improving approval rates and reducing friction for the end customer.

Our models classify each order as fraud or not fraud, based on the order’s data and the customer’s current and past behavior. We have dozens of ML models, each of which needs to make highly accurate decisions in real time, serving some of the biggest global retailers. Measuring the classifiers’ performance and communicating it both internally and externally is one of our most important goals as a department and as a company.

As Director of the Data Science Chargeback Guarantee group at Riskified, I lead the research and development of the model performance KPIs & control groups together with our DS “Chargeback Algo” team.

In this two-part blog series I’d like to share the lessons we’ve learned while establishing our model KPIs and control groups.

Part one deals with choosing the right KPIs to assess your models, and part two will review the control group infrastructure we’ve created and the lessons we learned while establishing our KPI measurement process.

I hope that you, as data scientists, can use these insights and methodologies to improve your models’ evaluation processes, which are an integral part of any ML pipeline.

Three steps to establishing your KPIs

Sooner or later, every data scientist dealing with a classification problem comes across the Confusion Matrix:

The four conditions in the Confusion Matrix are used to evaluate classifications against the “ground truth”. There are roughly 15 different metrics that can be extracted from the Confusion Matrix in order to evaluate an ML classifier.
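To make this concrete, here’s a minimal sketch using scikit-learn and made-up labels (1 = fraud, 0 = not fraud) that extracts the four cells and a couple of the metrics derived from them:

```python
# Minimal sketch: the four Confusion Matrix cells and two derived metrics,
# computed with scikit-learn on made-up labels (1 = fraud, 0 = not fraud).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 0, 1, 0, 0, 1, 0]   # ground truth
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]   # model decisions (1 = decline)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall correct decisions
fpr = fp / (fp + tn)                         # share of good orders declined

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} fpr={fpr:.2f}")
```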

Using many metrics simultaneously to evaluate your model is a common solution that usually raises more questions than answers. It’s true that each metric shows a different flavor of the model’s performance, but in order to enable efficient decision making, hard choices need to be made!

The following is a suggested three-step process for establishing the relevant KPIs and communicating them in a way that is tightly coupled to the product’s business model.

Step 1: Build the Confusion Matrix in business terms

First, we want to give business context to the different “conditions” in the Confusion Matrix. This should be fairly simple, and it allows you to discuss the data presented in the Confusion Matrix more fluently with decision makers and nail down the right KPIs.

This is what Riskified’s Confusion Matrix looks like:

Riskified’s Chargeback Guarantee business model is very intuitive — each order is either approved or declined by our model. We collect an approval fee for each order we approve, and if an approved order turns out to be fraudulent (a “chargeback”), we refund the merchant for the order amount. Declines that are actually not fraud are considered “wrong declines”.
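To make the mapping concrete, here’s a rough sketch of how each cell could translate into a business outcome under a chargeback-guarantee model. The fee rate and the helper function are hypothetical, purely for illustration:

```python
# Rough sketch: translating Confusion Matrix cells into business outcomes
# under a chargeback-guarantee model. The fee rate and amounts are hypothetical.
APPROVAL_FEE_RATE = 0.005  # hypothetical fee collected on each approved order

def order_outcome(is_fraud: bool, approved: bool, amount: float) -> dict:
    """Return the Confusion Matrix cell and revenue impact of one decision."""
    if approved and not is_fraud:
        return {"cell": "true negative", "label": "good approval",
                "revenue": APPROVAL_FEE_RATE * amount}
    if approved and is_fraud:
        return {"cell": "false negative", "label": "chargeback",
                "revenue": APPROVAL_FEE_RATE * amount - amount}  # fee minus refund
    if not approved and is_fraud:
        return {"cell": "true positive", "label": "fraud prevented", "revenue": 0.0}
    return {"cell": "false positive", "label": "wrong decline", "revenue": 0.0}

print(order_outcome(is_fraud=True, approved=True, amount=120.0))
```

Summing this outcome over a period of orders turns the same matrix into a revenue view, which is what makes the conversation with decision makers so much easier.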

Step 2: Write down your model’s business goals

More often than not, a model can be used for different purposes and outcomes, and it all comes down to the business model. Usually the questions are which kind of mistake the business wants to minimize, and whether there are constraints that need to be taken into account while operating the model.

In Riskified’s case, we want a very low number of chargebacks. Most companies aim for a very low chargeback rate (~0.05–0.4%), and all need to remain under the credit card scheme’s watch list limit (around 1%). We also have to account for the importance of wrong declines and customer satisfaction for the merchants that work with us.

So our model goals are to (a) prevent a high share of the fraud and (b) keep false declines to a minimum.

Step 3: Choose the right KPIs for you

Once we understand the business terms and how they relate to the model performance metrics, as well as what the product needs the model to do, we can approach the KPI selection.

I won’t go into all of the different options for metrics based on the Confusion Matrix, as you can find and review them online. For us it made sense to progress with two KPIs that complement each other and reflect the business goals: Recall & Precision.

Recall (True Positive / Condition Positive) represents the share of fraud the model blocks, and Precision (True Positive / Predicted Condition Positive) represents the model’s decline accuracy. Together they best show the overall performance of the model, from both the fraud prevention and the customer satisfaction perspectives.
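In code, with scikit-learn and the same kind of made-up labels as above (fraud encoded as the positive class), the two KPIs can be computed directly from the model’s decisions:

```python
# Quick sketch: Recall and Precision with scikit-learn (1 = fraud = positive class).
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 0, 1, 0, 0, 1, 0]   # ground truth (1 = fraud)
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]   # model decisions (1 = decline)

recall = recall_score(y_true, y_pred)        # TP / (TP + FN): share of fraud blocked
precision = precision_score(y_true, y_pred)  # TP / (TP + FP): decline accuracy
print(f"recall={recall:.2f}, precision={precision:.2f}")
```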

Precision-Recall Curve vs. ROC

KPIs based on the Confusion Matrix, such as Precision and Recall, give a point estimate of the model’s performance at a given threshold. Since these metrics represent a tradeoff, you usually also want a metric that is not threshold-dependent. It’s therefore common to plot the metrics as a curve across all possible thresholds, which allows flexibility in choosing the “most desirable” operating point for the model.

The most common curve for describing a model is the ROC curve, which shows the tradeoff between Sensitivity (Recall; TPR = True Positive / Condition Positive) and the False Positive Rate (FPR = False Positive / Condition Negative).

The intuitive idea behind the ROC curve is that it shows the tradeoff between coverage (Recall) and mistake rate (FPR). A random classifier produces a straight diagonal line between the two metrics, which can be viewed as the baseline. From the ROC curve we can derive the area under the curve (AUC), a single, threshold-independent metric that represents the model’s quality.
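As a quick sketch, assuming a model that outputs fraud scores (the labels and scores below are made up), the ROC curve and its AUC can be computed with scikit-learn like this:

```python
# Minimal sketch: ROC curve and AUC from predicted fraud scores (made-up data).
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]                          # 1 = fraud
y_score = [0.9, 0.1, 0.6, 0.4, 0.2, 0.05, 0.8, 0.3, 0.15, 0.5]   # model scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)               # threshold-independent summary
print(f"ROC AUC = {auc:.2f}")
```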

To the layman, this sounds great — a model with high recall will block most fraud, and a model with low FPR will not make a lot of mistakes. This fits the business case perfectly. But in fact, the ROC can sometimes be misleading and not the right choice to compare classifiers.

The main issue with the ROC curve is that both Recall and FPR are hardly affected by the level of class imbalance. If the cost of misclassifying the smaller (usually the “positive”) class is important to the stakeholders, for example in fraud detection or medical research, the ROC curve is not the right way to go.

Enter the PR curve and the PRAUC metric. Here the baseline is the positive class’s share of the data. The TPR (Recall) moves to the X axis, and the more interesting Precision metric replaces the less intuitive False Positive Rate on the Y axis.

The PRAUC is more sensitive to the balance between classes and will show the precision of the model on the positive class more accurately.
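The PR curve and its area can be sketched the same way; here average precision is used as the PRAUC estimate, and the baseline is simply the positive-class prevalence (same made-up data as above):

```python
# Minimal sketch: PR curve and average precision (a common PRAUC estimate).
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_score = [0.9, 0.1, 0.6, 0.4, 0.2, 0.05, 0.8, 0.3, 0.15, 0.5]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
prauc = average_precision_score(y_true, y_score)
baseline = sum(y_true) / len(y_true)   # positive-class prevalence = PR baseline
print(f"PRAUC ~ {prauc:.2f}, baseline = {baseline:.2f}")
```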

The concept of partial PRAUC

Leaving you with one last insight on PRAUC: try to consider the relevant decision area, and focus on it when evaluating models. It’s very common to see different models dominate different areas of the curve. The “green” model in the example below might have a higher PRAUC than the “purple” model, but is it really a better model?

In an attempt to answer these types of questions, we decided to communicate a “Partial” PRAUC, averaging the precision over the thresholds in the area of highest business relevance. Again, this is a direct result of tying the evaluation metrics to the model’s business requirements.
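One possible way to sketch such a partial PRAUC is to average the precision only over the part of the curve the business actually operates in; the recall band below is made up for illustration:

```python
# Illustrative sketch: averaging precision only over a business-relevant
# recall band (the 0.6-1.0 band here is made up for the example).
from sklearn.metrics import precision_recall_curve

y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_score = [0.9, 0.1, 0.6, 0.4, 0.2, 0.05, 0.8, 0.3, 0.15, 0.5]

precision, recall, _ = precision_recall_curve(y_true, y_score)

lo, hi = 0.6, 1.0                         # recall band of business interest
mask = (recall >= lo) & (recall <= hi)
partial_prauc = precision[mask].mean() if mask.any() else float("nan")
print(f"partial PRAUC (recall {lo}-{hi}) = {partial_prauc:.2f}")
```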

Wrapping up and what’s next

I hope that I’ve managed to convey the importance of thoroughly thinking about your model’s evaluation metrics and connecting them to the business case and terminology. For Riskified, this process strengthened the connection between the growth, product, analytics and data science teams. Being able to show a graph describing the model’s performance and receiving understanding smiles instead of confused ones is a true joy.

Those of you who have gone through this post looking for answers on how to find these elusive false positives and negatives, fear not! My next post will cover the efforts we’ve taken in order to estimate our models’ mistakes in the most accurate and cost effective way.
