
Beyond the ROC AUC: Toward Defining Better Performance Metrics

Julien Bohne
Oct 31, 2018 · 7 min read

Data Science is a powerful tool to create new services and improve business operations. Its list of potential applications is virtually infinite, including demand forecasting, store-location optimization, industrial-process improvement and marketing personalization. Common to all of them is the subject of this article: the measurement of performance.

The success of a project depends to a large extent on finding the right performance metrics at an early stage. These metrics are used not only to evaluate the project at its conclusion, but also to set a target and to drive the choices, big and small, made throughout the project.

Any bias, such as data acquired under different conditions or discrepancies in class distributions, will affect the relevance of the performance evaluation. So will an insufficient amount of test data: the number of significant digits you can report in the error rates depends on the number of actual tests that each digit represents (look up the famous n=30 rule of thumb).

This article is focused on the process of choosing performance metrics. We will not discuss the need for appropriate data for performance evaluation. All test data should reflect operational data as closely as possible.

Three Key Performance-Metric Properties

There are three main properties to consider when defining the performance metrics, presented here in decreasing order of importance:

1. Relevance: The set of performance measures chosen should reflect all the aspects of the business problem. This keeps the link between “the evaluation of the project’s output is good” and “business value is created” as tight as possible. Choosing an incorrect or incomplete set of performance metrics can lead to situations in which a project is deemed a “success” based on its performance evaluation but is, in fact, completely unusable.

2. Understandability: The metrics should lend themselves to clear interpretation. They should make it as easy as possible to answer the question: “Is a value X for the indicator Y a good result?”

3. Compactness: The fewer values in the performance indicators, the better. This property is often in contradiction with the previous two properties. Building aggregated metrics is a way to obtain compact metrics, but such metrics can reduce relevance and understandability.

In this article, we will focus on two data science projects in which input data is used to produce a performance score. The score is then compared to a threshold in order to make a decision.

· Project 1: Churn. A company wants a list of its highly likely-to-churn customers to serve as the basis of a preemptive action. A performance score is computed for each customer. If the score is higher than a given threshold, the customer is put on the action list.

· Project 2: Facial recognition. An automatic boarding-control gate is installed at an airport to permit or deny boarding. A similarity score is computed between each passenger’s face and his or her passport photo. If the score is higher than a given threshold, the passenger is granted access to the airplane.

Global Performance Measures are Situation Dependent

Many performance measures are derived from the ROC (receiver operating characteristic) curve that most data scientists are familiar with. (Other graphs based on type I and type II errors include DET curves, Precision-Recall, and FAR/FRR graphs). Often, the information in the ROC curve is summarized by:

· The Equal Error Rate (EER), which corresponds to the error rate at which the False Negative Rate equals the False Positive Rate, or

· The Area Under the Curve (AUC), which corresponds to the area under the ROC curve.

These measures provide a good indication of the global performance of the methods generating the scores. They are not useful, however, for distinguishing between methods that are comparatively better at low or high False Positive Rates.
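For readers who want to reproduce these summaries, here is a minimal sketch of one common way to compute them with scikit-learn. The array names y_true (1 for the positive class, e.g. a churner or a genuine passenger) and scores are illustrative assumptions, not part of the projects discussed here.

```python
# A minimal sketch of computing AUC and EER from a labeled test set.
# `y_true` holds 1 for the positive class and 0 otherwise; `scores` holds
# the corresponding decision scores. Both names are assumptions.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def auc_and_eer(y_true, scores):
    auc = roc_auc_score(y_true, scores)        # area under the ROC curve
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr                            # False Negative Rate
    idx = np.argmin(np.abs(fpr - fnr))         # operating point where FPR ~ FNR
    eer = (fpr[idx] + fnr[idx]) / 2.0          # Equal Error Rate (approximate)
    return auc, eer
```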

For example, Methods a and b, whose ROC curves are shown in Figure 1, have the same EER (17%) and the same AUC (0.91). However, they exhibit very different behaviors, which can have a significant impact on their use. Consider Project 1: Churn.

· Imagine that the preemptive action to be taken for customers on the list has a low cost, such as a phone call to the customer. The company will be willing to accept numerous false positive errors in return for maximizing the number of highly likely-to-churn customers it reaches. In this scenario, Method a is better because the company will take this low-cost action to reach nearly all (95%) of its potential future churners, in exchange for calling only 26% of probable non-churners. Should it choose Method b, the company would have to call 60% of probable non-churners. By choosing Method a, the company can focus its efforts on the most appropriate target audience.

· What if, on the other hand, the action has a much higher cost, such as giving a generous discount to potential churners? In this scenario, the company will want to strongly limit the number of discounts it gives out to non-target customers. By choosing Method b, the company can reach 74% of highly likely churners while giving unnecessary discounts to only 3% of probable non-churners. Should the company choose Method a, it would reach less than a third (30%) of probable churners.

The point of these examples is that, instead of relying on EER or AUC, a better way to select the best-performing method is to fix the True Positive Rate or the False Positive Rate according to the business constraint, then choose the method that optimizes the other. The portion of the ROC curve that matters in a given situation depends entirely on the actual business cost (in money, security, or comfort) of a false positive error and of a false negative error.
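In practice, this amounts to scanning the ROC curve for the best operating point that respects the business constraint. The sketch below, again with hypothetical y_true and scores arrays, fixes a maximum acceptable False Positive Rate (the 5% budget is purely illustrative) and returns the threshold and True Positive Rate achieved there. Comparing methods then simply means comparing the TPR each one achieves under the same FPR budget.

```python
# A sketch of fixing the operating point from the business constraint:
# keep the False Positive Rate below `max_fpr` and take the best True
# Positive Rate available there. The 5% budget is an arbitrary example.
import numpy as np
from sklearn.metrics import roc_curve

def operating_point_at_fpr(y_true, scores, max_fpr=0.05):
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    within_budget = fpr <= max_fpr           # candidate operating points
    idx = np.argmax(tpr[within_budget])      # best TPR among them
    return (thresholds[within_budget][idx],
            tpr[within_budget][idx],
            fpr[within_budget][idx])
```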

Another way for the company to meet its goal might be to minimize a weighted sum of the False Positive Rate and the False Negative Rate, with the weights derived from the relative costs of the two error types. Because this second solution is often harder to interpret, the former is generally preferred.
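For completeness, a sketch of that weighted-cost alternative is shown below. The cost values are hypothetical and would come from the business case, for example the cost of an unnecessary discount versus the value lost when a churner is missed.

```python
# A sketch of choosing the threshold that minimizes a weighted sum of the
# two error rates. `cost_fp` and `cost_fn` are hypothetical business costs.
import numpy as np
from sklearn.metrics import roc_curve

def threshold_minimizing_cost(y_true, scores, cost_fp=1.0, cost_fn=5.0):
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    weighted_cost = cost_fp * fpr + cost_fn * fnr   # weighted error rates
    idx = np.argmin(weighted_cost)
    return thresholds[idx], weighted_cost[idx]
```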

Threshold and Performance Stability Across Data Subsets

Selecting the most appropriate performance measurement method at the operating point of interest is good, but not sufficient. We also need to be able to set the threshold to reach this operating point. This calibration is fairly easy when the data is homogeneous, but reality is often more complex than that.

Consider Project 2: Facial recognition. In this use case, a false positive represents a security threat: someone would be able to board an airplane by faking his or her identity. As such, the threshold should be set so that the False Positive Rate corresponds to an acceptable value. The problem is that most face-recognition algorithms behave differently when analyzing people of different genders, ethnicities, or age groups. Given this inconsistency, it is quite likely that even if the exact same system is installed in different countries, it will return different False Positive Rates for members of these different groups. This discrepancy cannot be detected by looking at global performance on a test set. We need to evaluate performance independently for each group, using the same threshold, and then observe the distribution of the False Positive Rates. The variance of this distribution should also be included in the set of performance metrics of interest.

The consequences of a false negative in facial recognition may be somewhat milder than those of a false positive. When someone is denied boarding at an automatic gate, airline staff have to check the traveler’s passport in person. This may seem like a minor inconvenience, but consider what would happen if the system had very different False Negative Rates for members of specific groups. Imagine the rejection rate averages 2% overall but jumps to 30% for Asian women or black teenagers. Affected groups would understandably accuse the system of being racist and demand its deactivation. It is therefore worth observing the distribution of False Negative Rates across the different groups, as well as the difference between the global and group-based False Negative Rates.
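One way to monitor both effects is sketched below: apply the single global threshold, compute the False Positive Rate and False Negative Rate separately for each group, and report how much the per-group rates spread. The groups array and the function name are hypothetical, not part of any real system described here.

```python
# A sketch of per-group error rates at a fixed global threshold. `y_true`,
# `scores` and `groups` are hypothetical, aligned NumPy arrays; `groups`
# holds a demographic label for each test sample.
import numpy as np

def per_group_error_rates(y_true, scores, groups, threshold):
    accepted = scores >= threshold
    rates = {}
    for g in np.unique(groups):
        in_group = groups == g
        negatives = in_group & (y_true == 0)
        positives = in_group & (y_true == 1)
        rates[g] = {
            "fpr": accepted[negatives].mean(),     # impostors wrongly accepted
            "fnr": (~accepted[positives]).mean(),  # genuine travelers rejected
        }
    fpr_spread = np.std([r["fpr"] for r in rates.values()])  # spread across groups
    return rates, fpr_spread
```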

In this case, as in most business cases, it is critical not just to choose the best method for measuring performance, but also to set thresholds that will achieve the intended positive outcomes without generating unintended consequences.

Key Take-Aways

· It is crucial to the success of any project that time be taken at the start to find performance metrics that can be used to guide choices throughout the project, and then to aid analysis during project post-mortems.

· The usefulness of performance evaluation metrics is highly dependent on the specifics of the use case. What may be optimal for one situation may be less so for another.

· The design of performance evaluation metrics is a broad topic. The two examples in this article illustrate just some of the factors that must be considered when choosing a performance metric.
