Healthcare Fraud Detection With Python

The importance of exploratory data analysis

Nov 7 · 9 min read
Photo by Ani Kolleshi on Unsplash

This April, a $1.2 billion Medicare scheme took advantage of hundreds of thousands of seniors in the US. In reality, this is just a small sliver of the billions of dollars healthcare fraud costs both consumers and insurance providers annually.

Healthcare fraud can come from many different directions. Some people might think of the patient who pretends to be injured, but actually, much of fraud is caused by providers(as in the New York Times article above).

Providers often have financial incentives for increasing performing unnecessary surgeries or claiming work they never even did. This leads to many different flavors of fraud that can all be difficult to detect on a claim-by-claim basis.

For example, fraud from healthcare providers could include:

  • Upcoding
  • Medically unnecessary procedures
  • Kickbacks
  • Providing services with nurses and staff that should be provided by doctors

These four methods of fraud are often effective for several reasons. First, there are so many claims it can be hard for claims processors to discover them before paying them. So often these fraudulent claims will be paid before getting caught. Another good reason is that sometimes the cost of adjudicating the claims might be greater than the claims’ value themselves.

That makes it difficult for insurance providers to rationalize spending money on creating methods to capture these bad behaviors.

This is why, before investing hundreds of thousands of dollars into your first fraud detection system, you should first analyze your claims from multiple directions to get an idea where fraud could be coming from.

This is where the exploratory data analysis step comes into play.

What Is Exploratory Data Analysis?

When you first start to analyze data, your goal will be to get a good sense of the data set.

This is known as exploratory data analysis. In particular, if your company follows the OSEMN (Obtain, Scrub, Explore, Model, and iNterpret) data science process, then this is the step.

Using this process can help provide clarity to the management of your progress.

The purpose of this step is to become familiar with the data as well as to drive future analysis. Generally, this step has a combination of analyzing data sets for skew, trends, making charts, etc.

It’s not about structure or process but instead meant to bring out possible insights through a flow state.

How you approach this step depends on how you work best. For instance, our preference is to think of questions we want to answer about the data set and then go about answering said questions.

For example, in our analysis today we will be looking at the healthcare fraud data set from The data set is focused on fraud and providing insights into which providers are likely to have fraudulent claims.

So our questions will be based on looking into what could support the case of fraud for these providers and why it is worth it for our business providers to invest in our project.

Here are some example questions:

  1. Does age play a role in which claims are fraudulent?
  2. Month over month, are there any patterns of when fraud occurs?
  3. Do fraudulent providers make more per claim than non-fraudulent providers?
  4. Do fraudulent providers make more per patient than non-fraudulent providers (e.g., per member per month, or PMPM)

These questions can help frame and guide our analysis so we don’t spend too much time wandering without a purpose.

What Is the Purpose of EDA?

The purpose of this EDA step is to provide support for later more in-depth analysis. In a recent post, we discussed the concept of agile data science, not so much as a strict process but as a framework. This is one of those steps where, when you are doing the analysis, you can bring up points that are interesting using charts and metrics that might help move your business case along.

Note: In this example, we have already joined all the data sets together for easy use. You can find the code for that here.

For example, in our questions above we are looking to support the idea that it is worth looking into fraudulent providers.

So let’s look at how they play out. We are running this all in because it is easy to spin up a VM and run this analysis as well as to share it.

In this first part, we look at age. It’s usually a great place to start because it is a natural place you might see some patterns in the data.

So we can use the histogram function in Pandas to analyze this.

However, once we look at it, it seems to break down at a pretty even distribution. This means there is a pretty similar sample across both sets of data.

The next question we wanted to answer was focused on spending. First, we wanted to look at spending in general. Initially, when analyzing the gross amount, nothing sticks out, as seen in the charts below.

Total Monthly Spending for Fraud Claims vs. Total Monthly Spending for Non-Fraud Claims

Overall, the months seem to line up, except that the total amounts month over month seem to be much higher on the fraud side. So we wanted to look into this.

A better way we can look at this is this. First, let’s look at the average claim cost per month.

Per Claim Costs for Fraud Claims vs. Per Claim Costs for Non-Fraud Claims

We can see here that there is a drastic difference in the average cost per claim. In the case of providers that are likely to commit fraud, they often charge two times what the non-fraud providers charge. (It would require more analysis into what the claims were to support this).

Another great metric used in healthcare is PMPM. This stands for . This is a great metric to see how much a patient is costing per month.

So instead of looking at the average claim costs, we will look at the average patient cost per month. Technically, we should be looking at this by calculating whether or not a patient has valid coverage for the month.

However, due to the data set, we don’t really have that specific data. So, for now, we are using the proxy of the patient’s ID. It’s not perfect, but it is what we will use for now as seen in the code below.

Per Patient Per Month Costs for Fraud Claims vs. Per Patient Per Month Costs for Non-Fraud Claims

Looking at this, you will notice that an insurance provider that is likely to have fraudulent claims also charges two times per patient more than the non-fraudulent providers.

Now, why is it important that we have done this exploratory analysis before diving into model development?

The reason is that this provides a solid business case to sell to your stakeholders for why you would like to invest further in this project.

You already have a business reason that would intrigue any business partner. Based on the monthly spend charts, your provider could be saving upwards of $750,000 USD a month, or several million dollars a year, if you were able to crack down on this insurance fraud.

In addition, you are already seeing some tendencies of fraudulent providers.

But you can’t stop analyzing the data just yet.

Always Get More Support

Now, as a data scientist or analyst, you will want further supporting evidence to continue down this avenue.

This means bringing in other angles from this data that can further support the point of the providers costing your insurance company far more than is required.

Here are a few ways we can do so.

Let’s first start by looking at the overall count per physician of claims they had in a year. Let’s take a look at what the breakdown looks like, comparing fraud to non-fraudulent claims.

Top 20 Attending Physicians Involved in Healthcare Fraud vs. Top 20 Attending Physicians Not Involved in Healthcare Fraud

What you will notice is that there is a drastic difference in the number of claims done by the physicians to providers where there is a likelihood of fraud vs. our non-fraud physicians.

In addition, physician PHY330576 seems to be doing a much larger number of claims compared to even his peers at the fraudulent providers. This would be worth digging into.

There could be a business reason for why this physician provides so many more claims. Perhaps they handle procedures that are very small and easy to do, and that could just be a confounding factor.

Again, hard to say. But, this still further supports the idea that fraudulent providers are providing or claiming to provide extra services that are not needed.

Let’s take one last look at this from another angle.

Instead of monthly breakdowns, let’s try analyzing the average number of claims a physician provides per day. If we analyze the number of claims done by physicians on a daily basis depending on whether the provider is fraudulent or not fraudulent, what do we find?

Looking at the two charts, we can see there is a much larger number of claims that exceed three or more claims per day in the fraudulent providers vs. the non-fraudulent providers.

In addition, when you further look into it, you will find that fraudulent providers have 15% of claims with three or more claim ids in a day compared to 3% for non-fraudulent providers.

As you can see, the fraudulent providers are claiming much more in the way of claims per day.

This is highly suspect and would be a great place to start analyzing data.

From here your goal as an analyst would be to analyze what types of claims have three or more claims per day. This might give you a pattern of behavior.

However, we will stop our analysis for now.

Before we go on, we want to point out a nifty feature that helped us during our analysis. is automatically integrated with Git. This means that, as you are working on answering these various questions, if you accidentally change something in your code and don’t remember what it was, then you can easily roll back. This came in handy because you’re not even seeing all the charts we developed.

Thus, having the ability to roll back and see if there were snippets of code that made more sense was very helpful!

Exploratory Analysis Can Save Time and Increase Buy-In

In the original analysis on Kaggle, they tried to develop a model right away without really finding a target population.

Here we have a possible population (physicians that provide three or more claims per day) that we might want to target. This would again be brought up in a meeting with stakeholders. But it goes to show why EDA is important.

It’s not always about going headfirst into the model. Sometimes it is about first developing solid support into what populations might be worth looking at.

In this case, there is value in analyzing the three or more claims per day as that seems to be a factor.

From here you would want to see what procedures or diagnoses are included in these cases as that might further provide information into what is going on.

We do hope this gave you valuable insight into why EDA is important. It helps you get a better understanding of the data while at the same time providing support that you can offer your business partners.

Better Programming

Advice for programmers.

Thanks to Zack Shapiro


Written by

#Data #Engineer, Strategy Development Consultant and All Around Data Guy #deeplearning #machinelearning #datascience #tech #management

Better Programming

Advice for programmers.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade