Data Science in Business

Eike Germann
Published in Eliiza-AI
May 12, 2020


Precision and Recall — a hypothetical case study

Introduction

In my blog post on imbalanced data, I introduced some ways to measure the effectiveness of predictions for rare events, specifically the metrics precision and recall. Now, these terms are common jargon in the data science community, but can feel confusing to someone just stepping into the field.

I have put together an imaginary case study that illustrates how these concepts can be applied in a business context to capture and weigh up competing business priorities.

Credit Card Fraud

According to the Australian Payments Network’s report, fraudulent transactions made up less than 0.1% of credit card transactions in 2018 but amounted to a value of more than $574 million. Fraud is clearly a rare event, which makes it an ideal case for applying the concepts of precision and recall.

The situation

Let’s take an imaginary bank; we’ll call it ExampleBank.

Credit card fraud is an excellent example of imbalanced data with competing priorities.

As part of their growth strategy, ExampleBank have entered into a new partnership with a major credit card company. Their analysis shows that the proportion of fraudulent transactions in recent years has been about 0.2%, almost double the national average.

Lucy is the business strategist at ExampleBank in charge of asset security. She is dissatisfied with the results of the rules-based model ExampleBank is using to prevent credit card fraud. She has heard a lot about machine learning and wants to find out whether machine learning could help ExampleBank get better results.

Lucy is a business strategist at ExampleBank. (Image: Anna Shvets)

ExampleBank has a data science unit, so Lucy invites two of their data scientists, Ehsan and Janet, to discuss possible solutions.

Ehsan and Janet are data scientists working for ExampleBank (Image: Rebranded Cities)

Initial consultation

To begin, Lucy has a few questions about the performance of the current model.

“So, the fraud defense unit tells me the model has an accuracy of 98%. If that’s true, why are we not catching all of those fraudsters?”

Ehsan and Janet explain.

“We have put together a more detailed report on the current model,” Ehsan says, advancing a slide.
“It details exactly what happens. The accuracy of the model might be 98%, but its precision is about 20% and its recall is about 52%.”

Janet points at the line with precision.
“This means that out of all the cases the current model classifies as fraud, only about 20% are actually fraud. And this,” she shifts down to the line with recall, “means that the model only catches about half of the fraud cases that occur.”

“Wait,” Lucy holds up a hand. “It catches only half? But it’s 98% accurate? How can that be?”

“Accuracy also counts how many valid transactions the model classifies correctly, and that number is really large,” Ehsan replies.

Janet explains the different impact of what is measured by precision and recall (Image: Rebranded Cities)

“Yes,” Janet agrees. “Accuracy is like a headline. It sums things up, and that can miss detail.”
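Ehsan’s point can be checked with a few lines of Python. The sketch below uses hypothetical confusion-matrix counts, chosen only so that they reproduce the three figures on his slide. They are not ExampleBank data and do not reflect its actual fraud rate. They show how a large number of correctly classified valid transactions dominates the accuracy figure.

```python
# Hypothetical counts, picked only to reproduce the figures quoted above.
true_positives = 104      # fraud correctly flagged
false_positives = 416     # valid transactions wrongly flagged
false_negatives = 96      # fraud that slipped through
true_negatives = 24_984   # valid transactions correctly passed

total = true_positives + false_positives + false_negatives + true_negatives

# Accuracy counts every correct decision, including the many valid transactions.
accuracy = (true_positives + true_negatives) / total
# Precision: of everything flagged as fraud, how much really was fraud?
precision = true_positives / (true_positives + false_positives)
# Recall: of all real fraud, how much did the model flag?
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy:  {accuracy:.0%}")
print(f"precision: {precision:.0%}")
print(f"recall:    {recall:.0%}")
```

With these counts, the true negatives alone make up more than 97% of all transactions, which is why accuracy stays high even though nearly half of the fraud is missed.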

Lucy nods.
“Okay. I see how that works. But how do we get more detail? I don’t want to do a whole lot of math when looking at a result or presenting my findings to the board.”

“We’ll look at what the model does from a different angle,” Janet explains while Ehsan prepares another slide.
“A precision of 20% means that for each fraud, we also falsely suspect 4 customers of fraud.”

Graphics by NounProject, angry by Adrien Coquet and gangster by wira wianda

“And a recall of 52% means that we catch about half of the fraud cases out there.”

Graphics by NounProject, fraud by Bartama
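Both readings follow directly from the definitions of precision and recall. As a minimal sketch (the helper names are made up for this illustration), the two percentages can be turned into the “per case” statements Janet uses:

```python
def false_alarms_per_caught_fraud(precision: float) -> float:
    """Honest customers flagged for every real fraud that is flagged."""
    return (1 - precision) / precision


def frauds_caught_per_ten(recall: float) -> float:
    """Out of ten real fraud cases, how many the model flags."""
    return 10 * recall


alarms = false_alarms_per_caught_fraud(0.20)
caught = frauds_caught_per_ten(0.52)

print(f"false alarms per caught fraud: {alarms:.1f}")
print(f"fraud cases caught out of 10:  {caught:.1f}")
```

With the current model’s figures, this gives 4 false alarms per caught fraud and roughly 5 of every 10 fraud cases caught.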

“Okay,” Lucy agrees. “I’ve got to hand it to you, that’s a lot more solid information than 98% accuracy. But I still don’t really get it. Why can’t we just catch all the fraud and not bother any customers?”

Ehsan sighs.
“I wish we could. We could just suspect everyone of fraud, and that would catch all the fraudsters, but we probably would not have any customers left after that.”

“What Ehsan means,” Janet says, “is that the two aren’t as closely related as it seems. That’s why we use two separate metrics and why the two numbers are so different. Each metric counts a different type of mistake: whether we falsely suspect someone or let something slip through.”

“If you can put a dollar value on angry customers, though, we can compare those two as well,” Ehsan adds. “Then we can compare them by cost.”

Lucy’s face lights up.
“That’s the number I need. I’ll get in touch with the risk team and get you that dollar value. If you can rate the effectiveness of the model in dollars, I can understand it. And I can present it to the board.”

Candidate Models

A few weeks later, Lucy meets up with Ehsan and Janet again.

During that time, Ehsan and Janet have tested a variety of model types on the data gathered by ExampleBank’s fraud defense division. Lucy has requested details on the risk analysis of the customer experience with fraud interventions and forwarded the information to Ehsan and Janet.

While Ehsan sets up the presentation, Janet summarises their work of the past weeks.
“I think we’ve got something promising for you,” she says. “We’ve taken the report from the risk analysis team and put together a comparison of some candidate models.”

Ehsan brings up the first slide.

Estimated costs for missing a fraud case and upsetting a customer at ExampleBank

“The report outlined an expected risk between $900 and $1500 for an angry customer affected by fraud interventions, which they average to $1200. We found the average value of a fraud transaction in the last year to be $298, so we can conservatively round this up to $300. In other words, preserving the customer experience is 4 times as important as catching a fraudulent transaction.”

He moves on to the next slide.

Ehsan and Janet have prepared two candidate models with very different attributes

“There is a combined metric, the F-beta score, a weighted relative of the better-known F1-score, that gives us a single value for how well each model balances these priorities.”

Lucy frowns.
“Wait. Those are just numbers. I thought we were going to express everything in dollars?”

“We still are,” Janet explains. “The F-beta score is the F1-score with a weight added. We base that weight on the dollar value per fraud divided by the dollar value per customer, which is just a number. The score tells us how well a model balances the two values, and higher is better.”
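For reference, the weighted score Janet describes matches the standard F-beta formula, where P is precision, R is recall and β is the weight. Taking β to be the cost ratio from the risk report is an assumption, but it is consistent with the scores quoted in this case study:

```latex
F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R},
\qquad
\beta = \frac{\$300}{\$1200} = 0.25
```

With β = 1 this reduces to the familiar F1-score. A β below 1, as here, weights precision more heavily than recall.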

“Okay,” Lucy nods. “That makes sense. Carry on.”

Ehsan points at the slide.
“Our base model has a precision of 20% and a recall of 52%, which means its weighted F-score comes to 0.20751. That’s the number we’ve got to beat, or we’re losing money somewhere.”

“Yes, I remember,” Lucy says. “It gets half the fraud, but has to go through four customers to get to a fraud case.”

“Exactly,” Ehsan confirms. “The first new variant we built has a similar precision at 22%, and its recall is 68%. It has a weighted F-score of 0.22912.”

Lucy frowns again. “That seems like it’s not doing very much. What else have you got?”

“Our second new variant has a precision of 13%, and we managed to get the recall up to 82% on this one.”

“That sounds promising,” Lucy says. “That’d sure catch a lot more fraud.”

“That’s true,” Janet agrees. “It’s got a weighted F-score of only 0.13677, though. That means overall it would perform worse than even the model we are currently using.”
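The three scores can be reproduced with a short Python sketch. It assumes the score is the standard F-beta formula with β set to the cost ratio of $300 to $1200. The function name and model labels are illustrative.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Assumed weight: cost of a missed fraud divided by cost of an upset customer.
BETA = 300 / 1200

# (precision, recall) pairs as quoted in the meeting.
models = {
    "current rules-based model": (0.20, 0.52),
    "first new variant": (0.22, 0.68),
    "high-recall variant": (0.13, 0.82),
}

scores = {name: f_beta(p, r, BETA) for name, (p, r) in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.5f}")
```

Rounded to five decimals, these match the figures Ehsan and Janet quote: 0.20751, 0.22912 and 0.13677.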

Lucy is trying to decide on the right model to pursue with her team. (Image: Anna Shvets)

Lucy stares at the screen, trying to make sense of the numbers.
“I don’t get it. Why is that? It’s got like 30 points more recall, so it should catch way more fraud cases. The difference in precision is so small.”

“It’s because of the cost,” Janet explains. “Instead of 5 out of 10, it will now catch 8 out of 10 fraud cases, but out of every 10 suspected cases, about 9 will be honest customers. Overall, it would falsely suspect more than twice as many customers as the current model, and each of those cases is more expensive than a missed fraud.”
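Janet’s comparison can be made concrete with a small sketch. Assuming a hypothetical batch of 100 real fraud cases (the batch size and the function name are illustrative choices), the number of honest customers flagged follows from precision and recall:

```python
def falsely_suspected(precision: float, recall: float, fraud_cases: int = 100) -> float:
    """Expected honest customers flagged while `fraud_cases` real frauds occur."""
    caught = recall * fraud_cases
    # For every caught fraud, (1 - precision) / precision honest customers are flagged.
    return caught * (1 - precision) / precision


current = falsely_suspected(0.20, 0.52)
high_recall = falsely_suspected(0.13, 0.82)
ratio = high_recall / current

print(f"current model:     {current:.0f} honest customers flagged")
print(f"high-recall model: {high_recall:.0f} honest customers flagged")
print(f"ratio:             {ratio:.1f}x")
```

Under these assumptions, the high-recall model flags roughly 2.6 times as many honest customers as the current one while catching 30 more of the 100 fraud cases.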


Lucy frowns at the screen for a bit longer, then nods.
“Yes. Yes, I see now. Okay, that sounds to me like we’ve got something to work with.”

The three agree that the team will continue working and report regularly on its findings.

Implementation

Lucy, Ehsan and Janet meet again after the development period for the proof of concept has concluded. Ehsan and Janet have prepared a few slides to help Lucy report the results of the project to her supervisor.

Ehsan can be proud of the latest model. (Image: Rebranded Cities)

“So, what’s the latest result of the model development?”

Ehsan and Janet both smile.

“We managed to get both the precision and the recall up a little, to 23% and 76% respectively,” Janet says, and Ehsan adds:
“That means the weighted F-score is now up from 0.22912 to 0.23984.”

Lucy lets her eyes wander over the figures on the screen.
“Good work! And this means the customer experience has not gotten worse, right?”


“Yes,” Ehsan confirms.

“And we catch almost 50% more fraud cases than before?”


“Correct.”
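Lucy’s two questions can be checked against the quoted figures. The sketch below assumes that “before” refers to the current rules-based model (20% precision, 52% recall) and compares it with the final model. The variable names are illustrative.

```python
current = {"precision": 0.20, "recall": 0.52}
final = {"precision": 0.23, "recall": 0.76}

# Relative increase in the share of fraud cases that get caught.
recall_uplift = final["recall"] / current["recall"] - 1

# Honest customers among every 10 alerts raised by each model.
honest_per_ten_before = 10 * (1 - current["precision"])
honest_per_ten_after = 10 * (1 - final["precision"])

print(f"extra fraud caught:          {recall_uplift:.0%}")
print(f"honest per 10 alerts before: {honest_per_ten_before:.1f}")
print(f"honest per 10 alerts after:  {honest_per_ten_after:.1f}")
```

This gives an uplift of about 46% in fraud caught, close to the 50% Lucy mentions, and slightly fewer honest customers per 10 alerts (7.7, down from 8).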

Lucy smiles.
“Great. You’ve given me some great tools to show the performance of the model. I’m really chuffed about what we’ve achieved. I’ll add your data to my presentation and let you know how we go.”

The board is impressed to see tangible, real-world effects as part of the description of the model performance and applauds Lucy for a well-prepared presentation.

The board is impressed with Lucy’s effort and wants her project to proceed. (Image: Anna Shvets)

A week later, Lucy receives the confirmation that the project has the green light to proceed into productionisation.

Conclusion

The project finishes as a great success. Lucy is commended for helping ExampleBank realise their growth strategy, and she continues to have a great working relationship with the data science team. She keeps her presentations as a memento of her success, along with a quick note in her personal files on how to interpret precision and recall.

Lucy’s note with a graphic from the NounProject: money value by Eucalyp

“Accuracy isn’t everything. Always bring it down to the dollar value,” she writes down, with a little set of scales drawn next to it. “If I can understand it, I can explain it.”

She puts the notes aside and turns to her laptop. Time to get on to the next project.
