Healthcare Provider Fraud Detection Analysis using Machine Learning

Build a binary classification model based on the claims filed by the provider along with Inpatient data, Outpatient data, Beneficiary details to predict Healthcare Provider Fraud.

Anik Manik
Analytics Vidhya
17 min read, Feb 27, 2021


Table of Contents:
1. Introduction
2. Types of Healthcare Provider Fraud
3. Business Problem
4. ML Formulation
5. Business Constraints
6. Dataset Column Analysis
7. Performance metric
8. Exploratory Data Analysis
9. Existing Approaches and Improvements in my model
10. Data Preprocessing
11. Machine Learning Models
12. Final Data pipeline
13. Future work
14. LinkedIn and GitHub Repository
15. Reference

1. Introduction:

What is Healthcare Fraud?
Fraud is defined as any deliberate and dishonest act committed with the knowledge that it could result in an unauthorized benefit to the person committing it, or to someone else who is similarly not entitled to the benefit. Healthcare fraud is one such type. Here we will analyze and detect “Healthcare Provider Fraud”, where the provider fills in all the details and files a claim on behalf of the beneficiary. Provider fraud is one of the biggest problems Medicare currently faces. Healthcare fraud is often an organized crime in which groups of providers, physicians, and beneficiaries act together to file fraudulent claims. Under U.S. legislation, an insurance company must pay a legitimate healthcare claim within 30 days, which leaves very little time for a proper investigation. Insurance companies are the institutions most vulnerable to these bad practices, and according to the government, total Medicare spending has increased sharply due to fraud in Medicare claims.

2. Types of Healthcare Provider Fraud:

Healthcare fraud and abuse take many forms. Some of the most common types of frauds by providers are:
a) Billing for services that were not provided.
b) Duplicate submission of a claim for the same service.
c) Misrepresenting the service provided.
d) Charging for a more complex or expensive service than was actually provided.
e) Billing for a covered service when the service actually provided was not covered.

3. Business Problem:

Statistics show that 15% of total Medicare expenses are attributable to fraudulent claims. Insurance companies are the institutions most affected by these bad practices, and insurance premiums keep rising because of them.
Our objective is to predict whether a provider is potentially fraudulent, along with a probability score for that provider’s fraudulent activity, and to find the reasons behind it in order to prevent financial loss.
Depending on the probability score and the reasons flagged, the insurance company can accept or deny the claim, or set up an investigation of that provider.
We also want to identify the important features that explain why a provider is potentially fraudulent. For example, a high claim amount for a patient with a low risk score is suspicious.
Beyond the financial loss, protecting the healthcare system is also a great concern, so that it can provide quality, safe care to legitimate patients.

4. ML Formulation:

Build a binary classification model, based on the claims filed by each provider along with inpatient data, outpatient data, and beneficiary details, to predict whether the provider is potentially fraudulent.

5. Business Constraints:

a) The cost of misclassification is very high. Both False Negatives and False Positives should be as low as possible. If a fraudulent provider is predicted as non-fraudulent (False Negative), it means a huge financial loss for the insurer; if a legitimate provider is predicted as fraudulent (False Positive), it costs money to investigate and damages the agency’s reputation.

b) Model interpretability is very important because the agency or insurer must be able to justify flagging the fraudulent activity and may need to set up a manual investigation. It should not be a black-box prediction.

c) The insurer must pay the claim amount for legitimate claims within 30 days, so there are no strict low-latency constraints. Still, prediction should not take more than a day, because the agency may need to set up an investigation depending on the model’s output.

6. Dataset Column Analysis:

Source of Data: The dataset is given on Kaggle's website. Please find the link below.

Train-1542865627584.csv:
It consists of provider IDs and a label indicating whether each provider is potentially fraudulent. Provider ID is the primary key of this table.

Test-1542969243754.csv:
It consists of only the provider IDs. We need to predict whether these providers are potentially fraudulent.

Outpatient Data (Train and Test):
It consists of the claim details for the patients who were not admitted to the hospital and only visited it. Important columns are explained below.

BeneID: It contains the unique id of each beneficiary i.e patients.
ClaimID: It contains the unique id of the claim submitted by the provider.
ClaimStartDt: It contains the date when the claim started in yyyy-mm-dd format.
ClaimEndDt: It contains the date when the claim ended in yyyy-mm-dd format.
Provider: It contains the unique id of the provider.
InscClaimAmtReimbursed: It contains the amount reimbursed for that particular claim.
AttendingPhysician: It contains the id of the Physician who attended the patient.
OperatingPhysician: It contains the id of the Physician who operated on the patient.
OtherPhysician: It contains the id of the Physician other than AttendingPhysician and OperatingPhysician who treated the patient.
ClmDiagnosisCode: It contains codes of the diagnosis performed by the provider on the patient for that claim.
ClmProcedureCode: It contains the codes of the procedures of the patient for treatment for that particular claim.
DeductibleAmtPaid: It contains the deductible amount paid by the patient, equal to the total claim amount minus the reimbursed amount.

Inpatient Data (Train and Test):

It consists of the claim details for the patients who were admitted to the hospital, so it has 3 extra columns: admission date, discharge date, and diagnosis group code.

AdmissionDt: It contains the date on which the patient was admitted into the hospital in yyyy-mm-dd format.
DischargeDt: It contains the date on which the patient was discharged from the hospital in yyyy-mm-dd format.
DiagnosisGroupCode: It contains a group code for the diagnosis done on the patient.

Beneficiary Data (Train and Test): This data contains beneficiary KYC details like DOB, DOD, Gender, Race, health conditions (Chronic disease if any), State, Country they belong to, etc. Columns of this dataset are explained below.

BeneID: It contains the unique id of the beneficiary.
DOB: It contains the Date of Birth of the beneficiary.
DOD: It contains the Date of Death of the beneficiary if the beneficiary is dead; otherwise it is null.
Gender, Race, State, Country: They contain the gender, race, state, and country of the beneficiary.
RenalDiseaseIndicator: It indicates whether the patient has pre-existing kidney disease.
ChronicCond_*: The columns starting with “ChronicCond_” indicate whether the patient has that particular pre-existing disease. Together they also contribute to the patient’s risk score.
IPAnnualReimbursementAmt: It contains the annual reimbursement amount for hospitalization (inpatient care).
IPAnnualDeductibleAmt: It contains the annual deductible amount paid by the patient for hospitalization.
OPAnnualReimbursementAmt: It contains the annual reimbursement amount for outpatient visits.
OPAnnualDeductibleAmt: It contains the annual deductible amount paid by the patient for outpatient visits.

7. Performance metric:

As healthcare fraud datasets are highly imbalanced (very few fraud cases), accuracy is not a proper metric. An important initial step is to plot the confusion matrix and then check the misclassifications, i.e. FP and FN. A False Negative is a case the model predicts as legitimate that is actually fraudulent; a False Positive is a case the model flags as fraudulent that is actually legitimate.

So, the performance metrics are:
a) Confusion Matrix: It is the table where TP, FP, TN, FN counts will be plotted. From this table, we can visualize the performance of the model.

b) F1 Score: It is the harmonic mean of precision and recall.
F1 Score = 2(Precision * Recall)/(Precision + Recall)
where Precision = TP/(TP+FP) and Recall = TP/(TP+FN). Because the F1 score combines both precision and recall, it is an appropriate metric for this problem.
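The formulas above can be checked with a short sketch; the counts below are illustrative, not taken from the dataset.

```python
# Precision, Recall and F1 computed directly from confusion-matrix counts,
# matching the formulas above.
def f1_from_counts(tp, fp, fn):
    precision = tp / (tp + fp)   # TP / (TP + FP)
    recall = tp / (tp + fn)      # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)

print(f1_from_counts(80, 10, 20))  # precision 8/9, recall 4/5, F1 = 16/19 ≈ 0.8421
```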

c) AUC Score: AUC stands for Area Under the ROC (Receiver Operating Characteristic) Curve. The ROC plots the TPR against the FPR at different thresholds. The area under the curve depends on the ranking of the predicted probability scores, not on their absolute values.

d) FPR and FNR: As the cost of misclassification is very high, we need to check the FPR and FNR separately; both should be as low as possible.

8. Exploratory Data Analysis:

Distribution of Class Labels (Provider Data):

Observation:

This is a highly imbalanced dataset. There are 10% fraudulent providers and 90% non-fraudulent providers.

Distribution of Gender (Beneficiary Data):

Observation:

The ratio of genders in beneficiary data is Gender_0 : Gender_1 = 57% : 43%.

Distribution of State (Beneficiary Data):

Observation:

  1. The top 20 states in terms of the beneficiary count are shown in the above picture.
  2. States with codes 5, 10, 45, 33, and 39 are the top 5 states.
  3. 8.7% of the beneficiaries belong to state 5.

Distribution of Country (Beneficiary Data):

Observation:

  1. The top 20 countries in terms of the beneficiary count are shown in the above picture.
  2. Countries with codes 200, 10, 20, 60, and 0 are the top 5 countries.
  3. 2.85% of the beneficiaries belong to country code 200.

Distribution of Race (Beneficiary Data):

Observation:

  1. Race 1 has the highest beneficiary count.
  2. 85% of the beneficiaries belong to race 1.
  3. There is no race 4 in the dataset.

Distribution of Patient Risk Score (Beneficiary Data):

Observation:

  1. The distribution of patient risk scores is right-skewed.
  2. Most patients have risk scores of 2, 3, 4, or 5.
  3. Very few patients have risk scores of 9, 10, 11, or 12.

Annual Reimbursement Amount (Inpatient and Outpatient):

Annual Reimbursement Amount — Inpatient(Left) and Outpatient(Right)

Observation:

  1. The total annual reimbursement amount for inpatient care is 507,162,970 versus 179,876,080 for outpatient care; the inpatient amount is about 2.8 times the outpatient amount.
  2. There are some outliers in both distributions, as both have long, flat tails at high values.

Annual Deductible Amount (Inpatient and Outpatient):

Annual Deductible Amount — Inpatient(Left) and Outpatient(Right)

Observation:

  1. The total annual deductible amount for inpatient care is 55,401,242 and for outpatient care is 52,335,131.
  2. Both datasets contain some outliers with high values.

Attending Physician (Inpatient and Outpatient):

Attending Physician — Inpatient(Left) and Outpatient(Right)

Observation:

  1. PHY422134, PHY341560, PHY315112, PHY411541, PHY431177 are the top 5 attending physicians for inpatient data and PHY422134, PHY341560, PHY315112, PHY411541, PHY431177 are the top 5 for outpatient data in terms of the number of patient visits.
  2. Physician PHY422134 treated 1% of all inpatients and physician PHY330576 treated 0.5% of all outpatients.

Operating Physician (Inpatient and Outpatient):

Operating Physician — Inpatient(Left) and Outpatient(Right)

Observation:

  1. PHY429430, PHY341560, PHY411541, PHY352941, PHY314410 are the top 5 operating physicians for inpatient data and PHY330576, PHY424897, PHY314027, PHY423534, PHY357120 are the top 5 for outpatient data in terms of the number of patients operated on.
  2. Physician PHY429430 operated on 0.56% of all inpatients and physician PHY330576 operated on 0.08% of all outpatients.

Other Physician (Inpatient and Outpatient):

Other Physician — Inpatient(Left) and Outpatient(Right)

Observation:

  1. PHY416093, PHY333406, PHY429929, PHY423728, PHY361563 are the top 5 other physicians for inpatient data and PHY412132, PHY341578, PHY338032, PHY337425, PHY347064 are the top 5 other physicians for outpatient data in terms of the number of patient visits.

Procedure Code (Inpatient and Outpatient):

Procedure Code — Inpatient(Left) and Outpatient(Right)

Observation:

  1. 4019, 9904, 2714, 8154, and 66 are the top 5 procedure codes for inpatient data, and 9904, 3722, 4516, 2744, and 66 are the top 5 for outpatient data in terms of the number of procedures performed.
  2. Procedure 4019 accounts for 6.5% of all inpatient procedures and procedure 9904 for 7.35% of all outpatient procedures.

Diagnosis Code (Inpatient and Outpatient):

Diagnosis Code — Inpatient(Left) and Outpatient(Right)

Observation:

  1. 4019, 2724, 25000, 41401, and 4280 are the top 5 diagnosis codes for inpatient data, and 4019, 25000, 2724, V5869, and 4011 are the top 5 for outpatient data in terms of the number of diagnoses made.
  2. Diagnosis code 4019 accounts for 4.3% of all inpatient diagnoses and 4.65% of all outpatient diagnoses.

Distribution of Inpatient Outpatient in Final Dataset

Observation:

  1. There are fewer claims in the inpatient data than in the outpatient data.
  2. Even though inpatient claims are fewer, the share of fraudulent claims is higher in inpatient data (57.8%) than in outpatient data (36.5%). This is because the per-claim reimbursement amount for inpatient care is much higher (about 35 times, per the earlier calculation) than for outpatient care.

Insurance Claim Amount reimbursed in Final Data:

Histogram (Left) and Box-Plot(Right)

Observation:

  1. The 25th and 50th percentiles of the claim amount reimbursed are very low.
  2. The 75th percentile of the insurance claim amount reimbursed is higher for fraudulent claims than for legitimate ones.

Bivariate Analysis:

Scatter Plot of Patient Age vs Claim_Period:

Observation:

  1. From the scatter plot we can see that when a patient’s age is below 60 years and the claim period is more than 20, the probability that the claim is fraudulent is high.

Scatter Plot of Patient Age vs InscClaimAmtReimbursed:

Observation:

  1. From the scatter plot of patient age vs InscClaimAmtReimbursed, I can observe that if the patient’s age is below 60 years and the claim amount is above 60000, the claim tends to be fraudulent.
  2. If the patient’s age is above 88 years and the claim amount is above 60000, the probability of fraud is also high.

Scatter Plot of IP_OP_TotalReimbursementAmt vs InscClaimAmtReimbursed:

Observation:

  1. If InscClaimAmtReimbursed > 10000 and IP_OP_TotalReimbursementAmt > 120000, the chance that the claim is fraudulent is high.

Scatter Plot of IP_OP_AnnualDeductibleAmt vs InscClaimAmtReimbursed:

Observation:

  1. If IP_OP_AnnualDeductibleAmt < 5000 and InscClaimAmtReimbursed > 600000, the chance that the claim is fraudulent is high.

9. Existing Approaches and Improvements in my model:

Existing approaches mainly used off-the-shelf models from scikit-learn and had no proper strategy for the data imbalance. In my solution, I have introduced oversampling using SMOTE to handle the class imbalance, and along with the first-cut models I have used custom ensemble models to get better performance.

10. Data Preprocessing:

First, create some features in the individual datasets.

Calculate the patient’s age from the DOB; if the DOD is available, use it as the reference date, otherwise use the maximum date available in the data.

Create a separate column indicating whether the patient is dead.

Calculate Claim Duration and Hospitalization Duration:

If the number of days claimed for inpatient treatment is more than the number of days hospitalized, it is suspicious, so I am adding this as a feature column.
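The feature-engineering steps above can be sketched with pandas; the tiny frames below are illustrative stand-ins for the real CSVs, but the column names follow the dataset.

```python
import pandas as pd

# Illustrative beneficiary and inpatient-claim rows (column names as in the dataset).
bene = pd.DataFrame({
    "DOB": pd.to_datetime(["1940-05-01", "1955-03-10"]),
    "DOD": pd.to_datetime(["2009-06-01", pd.NaT]),
})
claims = pd.DataFrame({
    "ClaimStartDt": pd.to_datetime(["2009-01-01"]),
    "ClaimEndDt": pd.to_datetime(["2009-01-20"]),
    "AdmissionDt": pd.to_datetime(["2009-01-02"]),
    "DischargeDt": pd.to_datetime(["2009-01-10"]),
})

# Age: use DOD as the reference date when present, otherwise the latest date in the data.
ref_date = bene["DOD"].fillna(bene["DOD"].max())
bene["Age"] = ((ref_date - bene["DOB"]).dt.days / 365.25).round().astype(int)

# Flag whether the beneficiary is dead.
bene["WhetherDead"] = bene["DOD"].notna().astype(int)

# Claim duration vs. hospitalization duration; a claim longer than the stay is suspicious.
claims["ClaimDuration"] = (claims["ClaimEndDt"] - claims["ClaimStartDt"]).dt.days
claims["HospitalizationDuration"] = (claims["DischargeDt"] - claims["AdmissionDt"]).dt.days
claims["ExtraClaimDays"] = (claims["ClaimDuration"] - claims["HospitalizationDuration"]).clip(lower=0)
```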

Merge All the Datasets: We have 4 different datasets, interconnected by foreign keys. I need to merge them on those keys to get an overall dataset. Below is a brief overview of the dataset.

Overall Representation of The Dataset
1. Merge inpatient and outpatient data on their common columns.
2. Merge beneficiary details with the combined inpatient/outpatient data on BeneID.
3. Merge provider details with the previously merged data on Provider ID.
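These merge steps can be sketched as follows; the tiny frames stand in for the real CSVs, and the `IsInpatient` flag is an illustrative helper column.

```python
import pandas as pd

# Illustrative stand-ins for the four datasets.
inpatient = pd.DataFrame({"BeneID": ["B1"], "ClaimID": ["C1"], "Provider": ["P1"],
                          "InscClaimAmtReimbursed": [26000]})
outpatient = pd.DataFrame({"BeneID": ["B2"], "ClaimID": ["C2"], "Provider": ["P1"],
                           "InscClaimAmtReimbursed": [300]})
beneficiary = pd.DataFrame({"BeneID": ["B1", "B2"], "Gender": [1, 2]})
provider = pd.DataFrame({"Provider": ["P1"], "PotentialFraud": ["Yes"]})

# 1. Stack inpatient and outpatient rows on their common columns,
#    keeping a flag for which set each claim came from.
inpatient["IsInpatient"] = 1
outpatient["IsInpatient"] = 0
all_claims = pd.concat([inpatient, outpatient], ignore_index=True)

# 2. Merge beneficiary details on BeneID.
all_claims = all_claims.merge(beneficiary, on="BeneID", how="left")

# 3. Merge the provider labels on Provider.
full = all_claims.merge(provider, on="Provider", how="left")
```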

Once merging is done, create new features based on the merged data.

Create a new feature “total reimbursement amount” for inpatient and outpatient.

Create new features using ‘groupby’ and taking the mean or other aggregates.

Since providers fill in and submit the claims, they are the entity most directly associated with fraudulent activity. So I group by provider and take the mean of the reimbursed amount, deductible, etc. If the average claim amount or claim period is high for a provider, that is suspicious.

Beneficiaries are also associated with fraudulent activity, so group the data frame by Beneficiary ID and take the mean. A high average claim amount for a beneficiary is suspicious.

Physicians are also associated with fraudulent activity, so group by AttendingPhysician, OperatingPhysician, and OtherPhysician and take the mean. High average amounts for a physician are suspicious.

Group by each diagnosis code to combine the patients who had the same tests, and take the average cost, etc.

Group by each procedure code to combine the patients who went through the same procedure, and take the average cost, etc.

Sometimes providers act together with physicians and beneficiaries, and sometimes particular diagnoses and procedures are also involved. So create additional features by grouping on the provider ID combined with these columns and taking counts.
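The group-by features described above can be sketched with pandas `transform`; the frame and the `PerKeyAvg_` / `ClmCount_` feature names below are illustrative, while the column names follow the merged dataset.

```python
import pandas as pd

# Illustrative slice of the merged dataset.
df = pd.DataFrame({
    "Provider": ["P1", "P1", "P2"],
    "BeneID": ["B1", "B2", "B1"],
    "AttendingPhysician": ["PHY1", "PHY1", "PHY2"],
    "InscClaimAmtReimbursed": [26000, 300, 4000],
})

# Mean reimbursed amount per provider, per beneficiary and per attending physician,
# broadcast back to every claim row with transform("mean").
for key in ["Provider", "BeneID", "AttendingPhysician"]:
    df[f"Per{key}Avg_InscClaimAmtReimbursed"] = (
        df.groupby(key)["InscClaimAmtReimbursed"].transform("mean")
    )

# Claim counts for a provider combined with another entity.
df["ClmCount_Provider_BeneID"] = (
    df.groupby(["Provider", "BeneID"])["InscClaimAmtReimbursed"].transform("count")
)
```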

Remove the columns which are no longer required.

Convert the type of Gender and Race to categorical and do one-hot encoding.

Our objective is to predict healthcare provider fraud, so group by provider and take the sum to create one row of features per provider.

Now standardize the data using StandardScaler.
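The encoding and scaling steps above can be sketched as follows; the small frame is illustrative, and only the numeric column is standardized.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative frame with categorical and numeric columns.
df = pd.DataFrame({"Gender": [1, 2, 1], "Race": [1, 2, 3],
                   "TotalReimbursed": [26000.0, 300.0, 4000.0]})

# Convert Gender and Race to categorical via one-hot encoding.
df = pd.get_dummies(df, columns=["Gender", "Race"])

# Standardize the numeric column to zero mean and unit variance.
scaler = StandardScaler()
df[["TotalReimbursed"]] = scaler.fit_transform(df[["TotalReimbursed"]])
```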

11. Machine Learning Models

Now our dataset is ready. We will try different models, first-cut models along with ensemble models, and validate the performance of each. Based on performance on the validation data, we will pick the best one for deployment.

First, define some helper functions to validate each model by drawing the ROC curve and the confusion matrix.
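A minimal sketch of such a helper; instead of plotting, this version returns the scores that would be drawn (confusion matrix, AUC, F1, FPR, FNR), which are the metrics chosen in section 7. The function name `validate` is illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, f1_score

def validate(y_true, y_prob, threshold=0.5):
    """Report the metrics from section 7 for one model's predicted probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "confusion": [[tn, fp], [fn, tp]],   # counts to plot as a confusion matrix
        "auc": roc_auc_score(y_true, y_prob),
        "f1": f1_score(y_true, y_pred),
        "fpr": fp / (fp + tn),               # False Positive Rate
        "fnr": fn / (fn + tp),               # False Negative Rate
    }
```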

Below are the different approaches that we will follow in this case study.

Approach 1:
a. Split the data into Train and Validation (80:20)
b. Oversample the training data using SMOTE so that the majority:minority ratio becomes 80:20, 75:25, 65:35, and 50:50.
c. Train Logistic Regression, Decision Tree, Support Vector Classifier, and Naive Bayes on each of these 4 oversampled datasets and pick the best model based on the performance scores.

After evaluating the above models across all the oversampled datasets, Logistic Regression performed best with an oversampling ratio of 80:20, with an AUC score of 0.9508.

AUC and Confusion Matrix for LR with all features

Approach 2:
In this approach, we calculate feature importance using a Random Forest model. Feature importance is computed from the decrease in Gini impurity when a variable is chosen to split a node. Below is the code snippet to find feature importance.
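A sketch of this step on synthetic data; the feature names are placeholders, and the 0.001 cut-off is the one used in this study.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the per-provider feature matrix.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
feature_names = [f"f{i}" for i in range(X.shape[1])]

# Fit a Random Forest; feature_importances_ is the Gini-based importance,
# normalized so that the importances sum to 1.
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)

# Keep only features with importance above the cut-off.
selected = importances[importances > 0.001].index.tolist()
```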

20 top important features
20 least important features

Keep the features whose importance is greater than 0.001; I found 161 such features. Now train the ML models using only these features. I trained Logistic Regression and Random Forest on the important features. Below are the observations.

Comparison of model with all features and important features.

AUC and Confusion Matrix of LR model for Important Features:

AUC and Confusion Matrix for LR with important features

Observation:
1. Logistic Regression vs Random Forest with all the features: Using RF, the F1 score increased over the LR model with a small decrease in AUC. At first glance, the RF model appears to perform better than LR. But looking at the confusion matrix, the False Negative count (predicted not-fraud but actually fraud) is higher for RF, which is very dangerous in our case. Considering all the scores, LR performs better than RF.
2. After filtering to the important features, there is no real improvement in model performance for either LR or RF. The F1 score increased, but so did the False Negatives, and in our case reducing False Negatives matters more than reducing False Positives. So the models perform better with all features than with only the top important ones.
3. Considering AUC, F1 score, and FNR together, the Logistic Regression model is the best model for this healthcare provider fraud detection problem.

Approach 3: Build a custom ensemble model with base models and a meta-model, and see if we can achieve better performance.

Steps to be followed:
a. Split the whole dataset into train and validation sets (80:20).
b. Split the 80% train set into D1 and D2 (50:50).
c. From D1, do sampling with replacement to create d1,d2,d3….dk(k samples).
d. Create ‘k’ base learners (a mix of DT, LR, SVC, NB, RF, and XGBClassifier) with low bias and high variance, and train one on each of the k samples.
e. Pass the D2 set to each of the k models to get k predictions for D2.
f. Using these k predictions, create a new dataset; its target values are already known. Train a meta-model on this new dataset.
g. Evaluate using the 20% validation data: pass the validation set to each base model to get ‘k’ predictions, build a new dataset from them, and pass it to the meta-model for the final prediction. Compute the model’s performance score from the predicted and original labels.
h. Once the final model is ready it can be deployed.

Custom Ensemble ML Model

Code sample for custom ensemble model:

Define a function to randomly sample data from D1 dataset:

Function to create ‘k’ sampled dataset:

Function to create ‘k’ base learners:

Train ML ensemble model for different values of ‘K’ and find best ‘K’
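Steps a-g can be sketched end to end as follows; the data is synthetic, k and the base-learner mix are illustrative, and the meta-model is a Random Forest as in the final result.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9], random_state=42)

# a. 80:20 train/validation split; b. split the train set into D1 and D2 (50:50).
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_d1, X_d2, y_d1, y_d2 = train_test_split(X_tr, y_tr, test_size=0.5, random_state=42)

# c + d. k bootstrap samples of D1, each fitted by one base learner.
k = 6
base_factories = [lambda: DecisionTreeClassifier(),
                  lambda: LogisticRegression(max_iter=1000),
                  lambda: GaussianNB()]
base_models = []
for i in range(k):
    idx = rng.integers(0, len(X_d1), len(X_d1))      # sampling with replacement
    base_models.append(base_factories[i % len(base_factories)]().fit(X_d1[idx], y_d1[idx]))

# e + f. Base predictions on D2 become the meta-model's training features.
meta_X = np.column_stack([m.predict(X_d2) for m in base_models])
meta_model = RandomForestClassifier(random_state=42).fit(meta_X, y_d2)

# g. Evaluate: pass the validation set through the base models, then the meta-model.
meta_X_val = np.column_stack([m.predict(X_val) for m in base_models])
val_pred = meta_model.predict(meta_X_val)
```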

Observation:

Model performance comparison for Meta Model

Random Forest worked the best as meta-model using 50 different combinations of base learners. Test AUC = 0.9631

Performance of ensemble model with RF as a meta-model

Approach 4: StackingCVClassifier from mlxtend

Since hyperparameter tuning was not done for the base learners in the previous model, with StackingCVClassifier from “mlxtend” both the base models and the meta-model can be tuned. Then train the StackingCVClassifier with the best parameters to get the final prediction. Below is a code sample.

As the performance of the StackingCVClassifier was not better than that of the custom ensemble model, I will not use it as the final model.

12. Final Data Pipeline:

As we saw, the custom ensemble model with RF as the meta-model worked best for this healthcare provider fraud detection problem, so we use it in the final model pipeline. Please find the code snippet for the complete pipeline.

Final Data Pipeline

Final prediction received from the pipeline.

Final prediction along with the original labels

13. Future Work:

This problem can also be approached with deep learning techniques. Using a deep multilayer network, we can try different activation functions (ReLU, Leaky ReLU) and dropout to prevent overfitting. As this is a binary classification problem, we can use either sigmoid or softmax in the final layer.

14. LinkedIn and GitHub Repository:

LinkedIn: https://www.linkedin.com/in/anik-manik-aa1594a4/
GitHub: https://github.com/anikmanik04/healthcare-provider-fraud-detection

15. References:

  1. A survey on statistical methods for health care fraud detection
    https://cpb-us-w2.wpmucdn.com/sites.gatech.edu/dist/4/216/files/2015/09/p70-Statistical-Methods-for-Health-Care-Fraud-Detection.pdf
  2. The Detection of Medicare Fraud Using Machine Learning Methods with Excluded Provider Labels
    https://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS18/paper/download/17617/16814
  3. Machine Intelligence & Data Mining in Healthcare Fraud Detection
    https://www.roselladb.com/healthcare-fraud-detection.htm#:~:text=Healthcare%20fraud%20detection%20involves%20account,feasible%20by%20any%20practical%20means.
  4. Predicting Healthcare Fraud in Medicaid: A Multidimensional Data Model and Analysis Techniques for Fraud Detection
    https://www.sciencedirect.com/science/article/pii/S2212017313002946
  5. www.appliedaicourse.com
