String Similarity for Fraud Prevention using Levenshtein Distance
Background: Fraud — a $24 billion industry
According to DataProt, in 2018, Credit Card fraud cost the global economy $24.3 billion — yes, billion. Almost everybody has either been a victim of credit card fraud, or knows someone who has.
But who foots the bill?
Many people think there is a kind of ‘fraud insurance’ that banks tap into to reimburse cardholders; however, in 99% of cases that cost falls on the merchant who fulfilled the order. That’s where merchant Fraud Prevention comes in, and where most of the work is done.
I’ve been an e-commerce Fraud Prevention leader for five years now, and as time has gone on I have found myself more and more in need of custom solutions for problems that traditional tools don’t solve. Coming from a non-technical background, that was a huge challenge.
So, I started learning Python, and the results have been incredible!
Here’s one example.
The problem
One of those problems I mentioned earlier relates to how “similar” an Email address on an order is to the Name that was provided. I’ve always suspected intuitively that a mismatched name and email (eg. John Smith, abbey.miller@<domain>) was a small red flag (especially for a new customer), but I’ve never had a way to prove it, much less detect it.
Enter the Levenshtein Distance
And more specifically, the python-Levenshtein package.
Simply put, the Levenshtein Distance is a method for counting the number of single-character edits (insertions, deletions and substitutions) required to transform one string into another. At the most basic level it returns an integer: how many edits were needed to turn string A into string B.
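To make that concrete, here is a minimal pure-Python sketch of the classic dynamic-programming version (the python-Levenshtein package does the same thing in C, much faster):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3 (k->s, e->i, +g)
```

The textbook example: “kitten” becomes “sitting” in three edits.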
Installation
pip install python-Levenshtein
Code
I created a small function that takes an email address and a name as inputs, removes the email domain, strips out numbers and punctuation using some regex, then calculates the Levenshtein Distance as a proportion of the length of the email address (to control for variations in string length). The returned value will be used as a feature in a machine learning model and in analytics.
import re
import Levenshtein

def compare_email_name(email, name):
    # Lowercase the name and the local part of the email (before the '@')
    lower_name = name.lower()
    lower_email = email.split('@')[0].lower()
    # Replace punctuation with spaces (hyphen goes last in the character
    # class so it isn't read as a range)
    nopunc_email = re.sub(r'[!@#$%^&*()=+.,-]', ' ', lower_email)
    # Strip digits, then trim leading/trailing whitespace
    nonum_email = re.sub(r'[0-9]+', '', nopunc_email).strip()
    # Edit distance as a proportion of the full email address length
    distance = round(Levenshtein.distance(lower_name, nonum_email) / len(email), 1)
    return distance
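If you’d rather avoid the C-extension dependency, the same function can be sketched with a small pure-Python edit-distance helper. This is a self-contained stand-in, not the package itself, and the example addresses (example.com) are made up:

```python
import re

def edit_distance(a, b):
    # Simple DP Levenshtein: insertions, deletions, substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def compare_email_name(email, name):
    lower_name = name.lower()
    local = email.split('@')[0].lower()
    nopunc = re.sub(r'[!@#$%^&*()=+.,_-]', ' ', local)
    nonum = re.sub(r'[0-9]+', '', nopunc).strip()
    return round(edit_distance(lower_name, nonum) / len(email), 1)

# A matching name scores low; a mismatch lands near or above the flag zone
print(compare_email_name('john.smith@example.com', 'john smith'))    # 0.0
print(compare_email_name('abbey.miller@example.com', 'john smith'))  # 0.4
```

The first pair needs zero edits, so the ratio is 0.0; the mismatched pair needs edits amounting to roughly 40% of the email’s length.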
Results!
A quick disclaimer: while built to simulate a real transaction dataset, the dataset was synthetic.
What I found was actually much more impactful than I expected! Below is the Fraud rate (in SEK in the case of this dataset), grouped by the values returned by the function above.
What I took away from the plot above is that any name-to-email transformation requiring more edits than 40% of the length of the original email address is much more likely (like, a lot more likely) to be fraudulent.
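Under that (assumed) 0.4 cut-off, the ratio can double as a simple binary red-flag feature alongside the raw value. The threshold constant and function name here are my own illustration, and the cut-off should be tuned on your own data:

```python
FLAG_THRESHOLD = 0.4  # cut-off suggested by the plot; tune per dataset

def email_name_flag(distance_ratio):
    """Binary red-flag feature derived from the name/email Levenshtein ratio."""
    return 1 if distance_ratio > FLAG_THRESHOLD else 0

print(email_name_flag(0.2))  # 0 -- name and email look related
print(email_name_flag(0.6))  # 1 -- likely mismatch, worth extra review
```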
In summary:
There’s a lot more work to put into this: refining the calculation of the output metric, testing it on real data, and getting this variable into a productionised model. But I think this is a great starting point. Give it a try and let me know your thoughts!
P.s. this has been my first ever post — if you’ve gotten this far, thanks so much for reading!