Machine Learning to the Insurers’ Rescue in Fraud Detection

Find out how machine learning can empower fraud detection in insurance. Learn about ML use cases, how ML-based fraud detection works, and how to build it.

Alexander Barinov
Intelliarts AI
13 min read · Jul 5, 2022



Insurance fraud has been a challenge in the sector since… Always. Still, the problem has become more urgent today due to the growth in global cybercrime and more sophisticated fraudulent schemes. The Covid-19 outbreak has also created a perfect storm for corporate fraud, placing extra financial strain on businesses and forcing companies to move online.

Meanwhile, old-school fraud detection methods such as rule-based systems are no longer enough on their own. Companies need something more intelligent, and machine learning can be a great solution here. In this article, you will read how insurers can turn machine learning into a powerful technological weapon and fight back against fraud more efficiently.

Is your company prepared for increasing fraud risks?

The Insurance Information Institute reports $38 to $83 billion in yearly losses from insurance fraud. And this excludes health insurance fraud, which costs the US an additional $68 billion according to the National Health Care Anti-Fraud Association and is the most costly type of fraud in the insurance industry.

Fraud costs in insurance

That’s a lot of money at stake. Yet the heavy financial burden is not the only toll on the insurance business: fraud also brings poor customer experience, reduced loyalty, a damaged company reputation, and operational failures.

Traditional fraud detection methods

In the past, fraud detection was left to insurance fraud investigators, who had to go through new claims manually. Their only weapons were a few facts and lots of intuition. This approach could not provide consistent, high-quality checks, and manual fraud detection was also expensive and time-consuming.

The situation improved when rule-based systems appeared. This approach operates on a set of “rules”, or conditions, that raise a warning once potential fraud is detected. The rules could relate to unusual transaction types, suspicious timestamps, or account numbers. In other words, the system looks for red flags to recognize fraud and automatically block it.
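To make the idea concrete, here is a minimal sketch of a rule-based check in Python. The claim fields (`amount`, `submitted_hour`, `account_id`), the thresholds, and the blocklist are hypothetical and chosen only for illustration; a real rule set would come from the insurer’s fraud analysts.

```python
from dataclasses import dataclass

# Hypothetical blocklist of account numbers previously tied to fraud.
BLOCKED_ACCOUNTS = {"ACC-0042", "ACC-1337"}


@dataclass
class Claim:
    claim_id: str
    amount: float        # claimed amount (currency unit is an assumption)
    submitted_hour: int  # hour of day, 0-23
    account_id: str


def red_flags(claim: Claim) -> list[str]:
    """Return the names of all rules the claim triggers."""
    flags = []
    if claim.amount > 50_000:        # illustrative threshold
        flags.append("unusually_large_amount")
    if claim.submitted_hour < 5:     # filed between midnight and 5 a.m.
        flags.append("suspicious_timestamp")
    if claim.account_id in BLOCKED_ACCOUNTS:
        flags.append("blocked_account")
    return flags


claim = Claim("C-1", amount=72_000, submitted_hour=3, account_id="ACC-0001")
print(red_flags(claim))
```

Each rule is a fixed condition written by a person, which is exactly why such a system only catches the patterns someone has already thought of.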

Fraud detection

As great an anti-fraud toolkit as rule-based systems are, their “black and white” logic doesn’t always work well. Their most critical limitation is the inability to detect new fraud schemes and patterns. But there are other drawbacks too:

  • Blind spots: As fraudsters become smarter and new schemes evolve, more blind spots appear in the insurer’s fraud detection system, i.e. areas that the rules haven’t covered yet. This makes the system less effective over time and places an extra burden on the insurer’s fraud analyst team, which has to keep expanding the rules.
  • False positives: The more rules the company adds, the more it risks increasing the false positive rate, which results in blocking genuine customers and valid claims. For instance, if the insurer limits claims from a risky region, it will lose at least some genuine customers from this location.
  • Only simple cases: Rule-based systems rarely notice more complex fraud cases because the rules are limited to what their human authors can anticipate.

Machine learning to the rescue in fraud detection

Machine learning (ML) has been the next big step forward in fraud detection. The idea is to use algorithms that analyze large, complex datasets, seek patterns, learn, and improve from this experience. Here are a few of the most popular reasons why insurers are opting for ML-based fraud detection:

  • Speed: Think of ML-based fraud detection as having several teams of fraud investigators at your disposal, working through thousands of claims in real time and with high precision. ML can reduce the time spent on fraud detection by 70%. This isn’t surprising if we consider that an ML-based fraud detection solution can work 24/7 and analyze large amounts of data in the blink of an eye.
  • Accuracy: Unlike rule-based systems, which are broad and catch only high-probability fraud claims, ML solutions spot non-intuitive behavior easily. According to Capgemini, an ML-based fraud detection system can increase accuracy by 90% by picking up the subtlest evidence of abnormal behavior.
  • Efficiency: An ML model usually detects fraud at early stages. For example, a neural network can perform more complex analyses, such as examining how much time the customer spends filling in the claim forms, how many pages they browse, and whether they are copy-pasting the info.
  • Scalability: While more data can become a problem for a rule-based approach, machine learning fraud detection thrives on large datasets. Additional data is one more opportunity for an ML model to learn and discern patterns of valid and fraudulent claims. Besides, feeding in more data allows ML models to keep up with the latest scams and fraud methods.
Advantages of ML over traditional fraud detection approaches

Here is one more important thing to mention. Although machine learning is a huge upgrade to insurance fraud detection, it doesn’t mean that a company should replace its rules entirely. ML can work just as well as a complement to your legacy system as it does as a standalone solution.

Machine learning fraud detection use cases

How exactly can insurance companies apply machine learning to fraud detection? Here are a few ideas to get you started.

Fake claims

This is probably the most common use case for fraud detection using machine learning. Here ML takes advantage of semantic analysis, which makes it possible to analyze almost any type of data:

  • Structured
  • Unstructured
  • Table-type

Simply put, ML algorithms analyze claims-related files submitted by insurance agents, clients, police, and other stakeholders, looking for inconsistencies in the provided evidence. There is a good chance that ML will find these discrepancies, since textual data contains many hidden clues and ML systems are good at detecting them.

In their case study “Leveraging deep learning with LDA-based text analytics to detect automobile insurance fraud”, Wang and Xu tested ML for detecting fraud in automobile insurance claims. The scholars used three different ML models, and all of them achieved at least 75% accuracy in fraud detection.

Duplicate claims

Most insurers think of duplicate claims as “exact matches”. These are easy to spot, and you don’t even need a smart solution for fraud detection in this scenario. However, a duplicate request for payment is sometimes not so evident. Rather than deliberate fraud, it can be a mistake, like when a client resubmits a claim that wasn’t paid within the agreed time or when they want to add an extra modifier. In this scenario, ML-backed algorithms are useful because they can notice subtle inconsistencies and inform fraud analysts.
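As a rough illustration of catching duplicates that are not exact matches, the sketch below compares claim descriptions with Python’s standard `difflib.SequenceMatcher`. This is plain string similarity rather than a trained ML model, and the claim records, the same-policy condition, and the 0.8 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical claim records: (claim_id, policy_id, free-text description).
claims = [
    ("C-101", "P-1", "Rear bumper damage after parking lot collision on 12 May"),
    ("C-102", "P-1", "Rear bumper damaged after parking-lot collision, 12 May"),
    ("C-103", "P-2", "Water damage in kitchen caused by burst pipe"),
]


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def near_duplicates(records, threshold=0.8):
    """Pairs of claims on the same policy whose descriptions are similar."""
    pairs = []
    for (id_a, pol_a, text_a), (id_b, pol_b, text_b) in combinations(records, 2):
        if pol_a == pol_b and similarity(text_a, text_b) >= threshold:
            pairs.append((id_a, id_b))
    return pairs


print(near_duplicates(claims))
```

Flagged pairs would go to a fraud analyst for review rather than being rejected automatically, since a resubmission can be an honest mistake.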

Upcoding in medical billing

This is a type of health insurance fraud where a healthcare provider bills for a more expensive service than the one actually provided, inflating the medical bill to charge more to the patient and their insurance company. With the help of digital analysis based on Benford’s Law, even a simple rule-based fraud detection system can reveal this type of fraud. However, ML can upgrade your rule-based system and, for example, add image recognition to digitize documents and classify them easily.
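Benford’s Law says that in many naturally occurring sets of numbers the leading digit d appears with probability log10(1 + 1/d), so about 30% of values start with 1 and fewer than 5% start with 9. The sketch below measures how far a set of billed amounts deviates from that distribution. The sample data is synthetic, the helper names are hypothetical, and the deviation level that should count as suspicious is something an insurer would have to calibrate on its own data.

```python
import math
from collections import Counter
from typing import Optional


def first_digit(amount: float) -> Optional[int]:
    """Leading non-zero digit of an amount rounded to cents, or None for zero."""
    for ch in f"{abs(amount):.2f}":
        if ch in "123456789":
            return int(ch)
    return None


def benford_expected(d: int) -> float:
    """Expected share of leading digit d under Benford's Law."""
    return math.log10(1 + 1 / d)


def benford_deviation(amounts) -> float:
    """Mean absolute deviation between observed and expected digit shares."""
    digits = [d for d in map(first_digit, amounts) if d is not None]
    if not digits:
        raise ValueError("no non-zero amounts to analyze")
    counts = Counter(digits)
    return sum(
        abs(counts[d] / len(digits) - benford_expected(d)) for d in range(1, 10)
    ) / 9


# Illustrative data only: powers of two follow Benford's Law closely, while
# the second list mimics bills clustered just under a 1,000 limit.
natural_bills = [2 ** n for n in range(1, 101)]
suspicious_bills = [900 + n for n in range(100)]

print(benford_deviation(natural_bills))
print(benford_deviation(suspicious_bills))
```

The test is most meaningful for amounts that span several orders of magnitude; it is a screening signal, not proof of upcoding.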

Overstated repair costs

In auto insurance, ML-based fraud detection can help to search for inconsistencies in car repair costs. This is a classification task in machine learning: the model classifies data in repair claims and can reveal hidden correlations in claim records or even in the decisions of insurance agents, clients, and repair service providers. For instance, it might show that an auto repair service charges an extra fee to the clients of a particular agent.

Others

There are a few other machine learning use cases in insurance fraud. How about adding image recognition to detect fraud at the personal identification stage? Or an insurance company can use an ML model to check medical receipts and bills to find links between a healthcare practitioner and a specific patient.

How an ML fraud detection system works

Before we move on to building an ML-based fraud detection model, let’s explore how it works. Imagine we want to check whether an insurance claim is fraudulent.

Thanks to ML, an insurance company receives a risk score for each of its claims on a scale from 1 to 100. The higher the score, the higher the probability of fraud. The system then has to decide whether to block the claim, send it on for review, or allow it. This decision depends on the thresholds chosen in advance for each of these actions.
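The decision step can be sketched in a few lines. The threshold values (40 for review, 80 for blocking) and the function name are assumptions for illustration; in practice the insurer picks thresholds based on how many false positives and missed fraud cases it can tolerate.

```python
def route_claim(risk_score: int, review_at: int = 40, block_at: int = 80) -> str:
    """Map a 1-100 risk score to an action using two hypothetical thresholds."""
    if not 1 <= risk_score <= 100:
        raise ValueError("risk score must be between 1 and 100")
    if risk_score >= block_at:
        return "block"
    if risk_score >= review_at:
        return "review"
    return "allow"


for score in (12, 55, 93):
    print(score, route_claim(score))
```

Lowering `review_at` sends more claims to human analysts, which catches more fraud at the cost of more manual work.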

If we want to improve these final results, we can feed the model as much relevant data as possible. For example, in healthcare insurance, we can use:

  • Personal info: Age, gender, and location
  • Claims data: Claims history, claims amount, minor vs. major claims
  • Hospital-related info: Length of stay, admission reasons, hospital status
  • Policy data: Plan type, direct vs. agency registration

The benefit of ML algorithms over rule-based systems is that ML can work with different types of data simultaneously, and more good-quality data generally improves the accuracy of its outcomes.

Building an insurance fraud detection model

Generally speaking, the ML process in fraud detection includes five big steps:


1. Input data

Data plays the most critical role in building a fraud detection solution using machine learning. Most insurers would use historical datasets of their insurance claim info as a backbone of their data.

The quantity and quality of data dictate how accurate the outcomes of the model will be. Although the general rule claims that the more data, the better, the insurer should still make sure that the quality of data is good.

Moreover, with supervised machine learning, a critical part of data preparation is dividing the data into valid and fraudulent claims and labeling them accordingly.
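Once claims are labeled, part of the data is usually set aside for testing. Because fraudulent claims tend to be rare, it helps to split each label group separately so that both parts keep the same fraud share. The sketch below shows one way to do this with the standard library; the record layout, the `is_fraud` key, and the 20% test share are assumptions for the example.

```python
import random


def stratified_split(records, label_key="is_fraud", test_share=0.2, seed=42):
    """Split labeled records into train and test sets, label group by label group."""
    rng = random.Random(seed)
    by_label = {}
    for record in records:
        by_label.setdefault(record[label_key], []).append(record)

    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_share))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Synthetic labeled claims: every tenth claim is marked as fraudulent.
claims = [{"claim_id": i, "is_fraud": i % 10 == 0} for i in range(100)]
train, test = stratified_split(claims)
print(len(train), len(test), sum(c["is_fraud"] for c in test))
```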


2. Create features

Features are characteristics of claims that help separate fraudulent insurance activity from valid claims. To some extent, they are based on the same principles that fraud investigators rely on when making their decisions.

For example, the following features could be good indicators of insurance fraud:

  • The date of the claim, e.g. when a claim is made shortly after the inception of the policy
  • The claimant is or has become unemployed
  • The documents provided by the claimant have inaccuracies, e.g. there are signs of alterations in dates, amounts, or descriptions
  • The applicant left some questions unanswered in the claim, such as about income or other insurance carried
  • The claimant has made insurance claims multiple times in their life
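The indicators above can be turned into numeric features that a model can consume. This is a minimal sketch; the `RawClaim` fields and feature names are hypothetical, and the count of suspected document alterations is assumed to come from a separate document review step.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RawClaim:
    policy_start: date
    claim_date: date
    claimant_employed: bool
    income_answer: Optional[str]  # None if the question was left unanswered
    prior_claims: int
    document_alterations: int     # suspected alterations found during review


def build_features(claim: RawClaim) -> dict:
    """Convert one raw claim into a flat dictionary of numeric features."""
    return {
        "days_since_inception": (claim.claim_date - claim.policy_start).days,
        "is_unemployed": int(not claim.claimant_employed),
        "has_document_alterations": int(claim.document_alterations > 0),
        "unanswered_income": int(claim.income_answer is None),
        "prior_claims": claim.prior_claims,
    }


example = RawClaim(date(2022, 1, 1), date(2022, 1, 15), False, None, 3, 2)
features = build_features(example)
print(features)
```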

3. Choose the model

Different machine learning algorithms are used to build models in insurance fraud. Put simply, an ML algorithm is a procedure for solving complex problems, much like a mathematical equation or even a recipe. The idea is to use the insurer’s data, described by labels and features, to learn how to draw conclusions, e.g. fraud vs. not fraud.

Below we briefly describe the algorithms that are most popular in fraud detection:

  • Logistic regression, which estimates the probability of a binary outcome (fraud vs. not fraud) from a weighted combination of input features and works well with structured data. In fraud detection, it becomes more sophisticated as the number of variables and the size of the dataset grow.
  • Decision trees, which automate the creation of rules for classification and regression tasks. This algorithm has a tree structure and, in essence, is a set of rules learned from examples of insurance fraud.
  • Random forest, which combines many decision trees to improve classification or regression performance. This technique smooths out the errors that could occur in a single decision tree and thus usually achieves better accuracy.
  • Support Vector Machines (SVMs), which find a hyperplane that divides data into two categories with the widest possible margin. The algorithm is especially useful for complex, high-dimensional data.
  • K-Nearest Neighbors (KNN), which classifies a record according to the labels of the most similar records, i.e. the data points closest to it.
  • Neural networks and deep neural networks, which are suitable for capturing non-linear relations between the records. They can learn and uncover patterns in a way loosely inspired by the human brain. Deep neural networks use more layers than ordinary neural networks, which lets them model more complex patterns, although it does not by itself guarantee more accurate results.
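To show how one of these algorithms turns labeled examples into a prediction, here is a bare-bones K-Nearest Neighbors classifier written with the standard library. The toy training set and its two features (days since policy inception, number of prior claims) are made up for the example; a real project would more likely use a library implementation and scale the features first so that no single feature dominates the distance.

```python
import math
from collections import Counter

# Toy labeled claims: (days_since_inception, prior_claims) -> label.
training = [
    ((5, 4), "fraud"), ((10, 3), "fraud"), ((7, 5), "fraud"),
    ((300, 0), "valid"), ((450, 1), "valid"), ((200, 0), "valid"),
]


def knn_predict(point, labeled, k=3):
    """Label a point by majority vote among its k nearest labeled neighbors."""
    nearest = sorted(labeled, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict((8, 4), training))
print(knn_predict((350, 0), training))
```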

4. Train, evaluate, and fine-tune

Train the algorithm

When the algorithm is chosen, the learning part begins. First, an insurer can train the algorithm using historical data, a so-called training set. It’s important to have enough data to feed the model so it can learn the difference between fraudulent and valid claims and customer behavior better.

Patience and experimentation are required from ML engineers at this stage. At some point, the model needs to be tested on data it has not seen during training. The engineers show the model new insurance claims, and it has to judge them based on the valid and fraudulent claims it has seen before. Based on the results, the engineers tune parameters and improve the model.

This process should include as many iterations as needed so the fraud detection model provides the most accurate fraud score.
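One simple example of such an iteration is tuning the score threshold at which a claim is treated as fraudulent. The sketch below tries every threshold on a small, made-up validation set and keeps the one with the best F1 score (the harmonic mean of precision and recall). Using F1 as the target is an assumption; an insurer may weight missed fraud and false alarms differently.

```python
# Hypothetical validation set: (model risk score 1-100, true label, 1 = fraud).
validation = [(92, 1), (85, 1), (70, 0), (65, 1), (40, 0), (35, 0), (30, 1), (10, 0)]


def f1_at(threshold, scored):
    """F1 score when every claim scoring at or above the threshold is flagged."""
    tp = sum(1 for score, label in scored if score >= threshold and label == 1)
    fp = sum(1 for score, label in scored if score >= threshold and label == 0)
    fn = sum(1 for score, label in scored if score < threshold and label == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Try every possible threshold; max() keeps the lowest one among ties.
best_threshold = max(range(1, 101), key=lambda t: f1_at(t, validation))
print(best_threshold, f1_at(best_threshold, validation))
```

The same loop-and-compare pattern applies to other parameters, such as the number of trees in a random forest or the number of neighbors in KNN.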

5. Detect fraud

The final stage of building an ML-based fraud detection system is the actual prediction. This is when the insurer’s ML model is ready for practical application and can differentiate valid claims from fraudulent ones.

How can you tell that the model is working? Again, insurance companies have to feed the model new claims data for which the outcomes are already known and compare the model’s predictions with those outcomes. If the model works correctly, it’s ready for deployment in the insurer’s live environment.
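The comparison with known outcomes can be summarized with a confusion matrix plus precision and recall. These are more informative than plain accuracy when fraud is rare, because a model that never flags anything can still look accurate. The data below is invented, and the 0.6 minimums used as a deployment gate are placeholder assumptions.

```python
def evaluate(y_true, y_pred):
    """Confusion-matrix counts plus precision and recall (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


known_outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
model_predictions = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]

report = evaluate(known_outcomes, model_predictions)
# Placeholder acceptance criteria; real ones depend on the insurer's costs.
ready = report["precision"] >= 0.6 and report["recall"] >= 0.6
print(report, ready)
```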


FAQ about fraud detection using machine learning

1. Do I even need machine learning for fraud detection?

The answer here depends on your business model. In some cases, a rule-based system is enough for fraud detection, and an ML-powered model is only a complementary solution. In others, ML is a necessary upgrade to the insurer’s anti-fraud toolset to save costs and stay profitable.

A good way to find out whether you need ML is to start with the following questions:

  • What are fraud losses vs. costs of advancing data analytics in your organization?
  • Does fraud burden your current and future operations a lot?
  • Is fraud affecting your company’s reputation and/or customer experience?

2. Where will I get the data? Will it be enough?

Most insurance businesses, as well as insurtechs, have vast repositories of existing data. This could be historical claims, policy data, and other records, which are ideal for building an ML model for fraud detection. Additionally, insurers usually have a steady stream of new claims and application info, which you can use.

As for the amount of data, there is no exact quantity that is required to train an ML model. As said, the more, the better. You can still follow a rule of thumb: you need at least ten times as many data instances as there are features.

3. I don’t have historical data; what should I do?

Even if you don’t have enough data, no worries. Our experienced data scientists can help you with data collection. We understand that this is the most important step that will impact the final ML results, so we’re ready to assist wherever possible.

4. Do I need real-time detection in insurance fraud?

Real-time functionality is the preferred option in fraud detection. Still, this can differ depending on the industry and a particular company. In the insurance business, it’s important to check the likelihood of a fraudulent claim right after it was filed (real-time). Clients don’t want to wait too long for the claim to be processed.

At the same time, once a claim has been flagged as potentially fraudulent, the deeper investigation doesn’t have to happen immediately. This way, an insurer can reduce the computing resources and overall costs of fraud monitoring analytics.

5. How do I know that I need to retrain the model?

Intelliarts AI usually works with full-cycle projects. This means we not only build and deploy the model but also configure the process of monitoring it. Thus, you will be able to notice when the data changes or any serious data drift happens and can ask us to retrain the model.
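One common way to quantify data drift for a single numeric feature is the Population Stability Index (PSI), which compares the distribution seen at training time with the distribution in recent data. The sketch below is a generic illustration, not a description of Intelliarts’ monitoring setup; the bin count, the small floor for empty bins, and the frequently quoted 0.25 alert level are all assumptions.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clip out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic claim amounts: recent claims are shifted upwards by 500.
training_amounts = list(range(100, 1100, 10))
recent_amounts = [amount + 500 for amount in training_amounts]

print(psi(training_amounts, training_amounts))
print(psi(training_amounts, recent_amounts))
```

A PSI near zero means the feature looks the same as it did during training, while a large value suggests the model may be scoring claims unlike those it learned from and could need retraining.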

6. Is my legacy system a barrier to ML-based fraud detection?

Although legacy systems usually hinder ML implementation, especially in insurance digitalization, they shouldn’t stop you from implementing machine learning. Basically, this just adds extra hours to the deadline and makes the task more complicated. A skillful ML team can still help you create the most powerful ML fraud detection model and build it on top of your existing solution.

By the way, feel free to ask more questions in the comments. We’ll be glad to answer them.

The bottom line

To some extent, fraud detection resembles an arms race in which insurers devise new and more sophisticated ways to combat fraud. Meanwhile, their opponents, the fraudsters, build new scams and schemes as fast as they can to slip past the insurer’s fraud detection system.

Machine learning is a game-changing technology that can bring fraud detection in your insurance company to a new level. Aside from automatic fraud detection, ML also delivers great speed, high accuracy, and deeper insight to insurers. Besides, an ML-based fraud detection system handles heavy workloads efficiently.

Thinking about implementing fraud detection using machine learning? We at Intelliarts AI are ready to give your company a hand and build a fully-fledged ML solution for you.


Alexander Barinov
Intelliarts AI

R&D enthusiast in the field of Data Science and Machine Learning with vast experience in software engineering. Helps companies gain more value from their data.