Building AI for vetting medical insurance claims V1

Shunling Guo (Shirley) · Published in Curacel · Jan 31, 2020

Automated flagging of false medical insurance claims

Background

False claims are a universal problem for all insurance companies, with health insurance dominating the market (Healthcare Fraud Abuse review).

False claims can result from human mistakes or from intentional acts, which is called fraud. Either way, they cause insurance companies to lose money by paying for what they should not.

In Curacel’s database, using a sample dataset of 51,883 claims, false claims make up ~4.4% of the total claims but account for 10% of the total claim amount, which is up to 0.8 million dollars.

In the US, according to the Department of Justice, more than $2.8 billion was recovered from fraudulent claims in 2018 (Department of Justice 2018 annual report).

To combat false and fraudulent claims, insurance companies hire claims adjusters to process insurance claims. According to the US Bureau of Labor Statistics, over a third of a million claims adjusters are paid more than 20 billion dollars annually, and it takes an average of 40 days to process a claim, according to both Curacel’s data and US data.

Claim processing is labour-intensive and time-consuming: an experienced claims adjuster can process about 50 claims per day, and to limit human error, insurance companies need two or more people to process the same claim.

In our project, to process the ~52K claims from the last two years, the company may need to pay over 0.17 million dollars according to local wage levels (check salary here): assuming two people process each of the 51K claims, each claims processor handles 12K claims per year and is paid $20K per year, so the cost is 51K / 12K × 2 × $20K ≈ $170K. And since these 52K claims are just a small portion of all insurance claims, the potential profit from automating claim processing is much bigger for insurance companies.
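As a quick sanity check, here is that arithmetic as a small Python sketch, using the assumed figures quoted above:

```python
# Rough manual-processing cost estimate, using the assumed figures above.
total_claims = 51_000                  # claims over the last two years
claims_per_adjuster_per_year = 12_000
adjusters_per_claim = 2                # two people review each claim to limit errors
salary_per_year = 20_000               # USD, local wage level

manual_cost = total_claims / claims_per_adjuster_per_year * adjusters_per_claim * salary_per_year
print(f"Estimated manual processing cost: ${manual_cost:,.0f}")   # ≈ $170,000
```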

Data

To automate claim processing, the first step is to automatically flag the false claims.

We consider false claims to be the positive class (1), and legal claims to be the negative class (0).

Legal claims can be auto-approved, while false claims need further review to be adjusted or rejected, and claims that turn out to be false positives can then be approved. So this is a binary classification problem. To solve it, we need to see whether false claims show different patterns from legal claims, to gain confidence that a machine learning algorithm can mine the classification rules.

We first queried Curacel’s MySQL database and got 9 tables with over 1 million rows and over 50 attributes.

Total data in the MySQL database

The ‘care’ table contains all annotations of care, explaining the content associated with each care_id.

‘care_type’ is a short annotation of care type_id.

‘claim_comment’ contains all review comments for some of the claims.

‘comment’ contains further comments in addition to those in ‘claim_comment’.

‘claim_diagnose’ is a relational table between claim_id and diagnosis_id.

‘claim’ contains most of the claim information, as shown below.

Attributes in Claim table

‘claim_item’ contains the one-to-one mapping of care information for each claim, including the claim_id corresponding to each care_id, the care cost (‘amount’), the quantity of the care item, and whether the care is approved or not.

‘diagnose’ contains detailed annotations for each diagnosis_id, including the name and type of the diagnosis and its ICD code (International Classification of Diseases).

‘provider_tariff’ contains the tariff amount for each care in each claim.

When we think about which features might be important, we can first think about what makes a claim. A claim is filed because a patient (enrollee_id) had a disease and went to a doctor (provider_id), and the doctor gave a diagnosis (diagnosis_id) and a treatment (care_id). The care_id and diagnosis_id are sufficient to tell whether the care legitimately meets the needs of the patient, so the annotation tables ‘care’ and ‘diagnose’ might not be useful for building the learning model. The ‘care_type’ table is also not needed.

Therefore, we joined the following tables for exploratory data analysis: ‘claim_item’, ‘claim’, ‘claim_diagnose’, and ‘provider_tariff’. We labelled entries as legal claims when ‘hmo_approved’ is 1 (manually approved; -1 indicates rejected and 0 means not yet processed) and the claim amount equals the approved amount, because sometimes the first claims adjuster approved a claim (hmo_approved = 1) but someone later spotted a mistake and adjusted the amount without changing the hmo_approved value. Only entries fulfilling both conditions are considered legal claims and labelled ‘0’; the rest are flagged as ‘1’, meaning further review or rejection is required. After dropping missing values and abnormal values (e.g. ‘qty’ < 0), we get a clean table of 700K+ entries with 13 attributes.
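A minimal pandas sketch of this join-and-label step is below; the table variables are assumed to be DataFrames loaded from the corresponding MySQL tables, and the exact merge keys are assumptions that may differ from the real schema:

```python
import pandas as pd

# claim_item, claim, claim_diagnose, provider_tariff are assumed to be DataFrames
# read from the corresponding MySQL tables; the merge keys are assumptions.
df = (claim_item
      .merge(claim, left_on="claim_id", right_on="id", suffixes=("", "_claim"))
      .merge(claim_diagnose, on="claim_id", how="left")
      .merge(provider_tariff, on="care_id", how="left"))

# Legal claim (class 0): manually approved AND claimed amount equals approved amount.
legal = (df["hmo_approved"] == 1) & (df["amount"] == df["approved_amount"])
df["label"] = (~legal).astype(int)   # 1 = flag for further review / rejection

# Drop missing values and abnormal values (e.g. negative quantities).
df = df.dropna()
df = df[df["qty"] >= 0]
```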

DataFrame snapshot

Exploratory Data Analysis

To select features that show distinct patterns between the two classes, we first explored all correlations and distributions of the features. ‘care_id’ is a central feature, because it determines whether the care is legitimately associated with the ‘diagnosis’ and whether the care quantity (‘qty’), cost (‘amount’), and tariffs are legitimate under the insurance policy (‘hmo_id’) and provider.

After checking the correlation plot of care against other features (Fig1.1), we find some correlation pairs that are distinctive between false (class 1) and legal (class 0) claims, shown as pure, heavily orange spots in the correlation plots. The darkness of a dot indicates the frequency of the correlation pair (the more cases, the darker), and the colour indicates the claim class (orange for 1, blue for 0). A pure orange dot indicates a pair that appears only in the false-claim class, a pure blue dot indicates a pair that appears only in the legal-claim class, and otherwise the dot is a blend of orange and blue.

Fig1.1 Example of correlation plot of care and other features

Alternatively, we can visualize the above association with a correlation heatmap (Fig1.2). We encoded the probability difference between false and legal claims ((number of false − number of legal) / total) for each value pair between two categorical features. The red dots indicate specific value pairs that tend to have a higher probability of being a false claim.

Fig1.2 Heatmap correlation of features ‘Care’ and ‘Provider’. Red circles are examples of value pairs that only appear in false claims.
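A rough sketch of how this encoding could be computed with pandas and seaborn is shown below; the column names (‘care_id’, ‘provider_id’, ‘label’) follow the text, and the rest is illustrative rather than the exact code used:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# For each (care_id, provider_id) value pair, encode
# (number of false - number of legal) / total, then plot it as a heatmap.
# In practice this is only readable for a subset of frequent values.
def class_diff_heatmap(df, feat_a="care_id", feat_b="provider_id"):
    counts = df.groupby([feat_a, feat_b, "label"]).size().unstack(fill_value=0)
    diff = (counts.get(1, 0) - counts.get(0, 0)) / counts.sum(axis=1)
    sns.heatmap(diff.unstack(feat_b), cmap="coolwarm", center=0)
    plt.title(f"False-vs-legal probability difference: {feat_a} x {feat_b}")
    plt.show()
```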

We then checked the feature distributions to show the distribution differences clearly. We used normalized cumulative probability curves (Fig2) to accumulate any difference between the two classes for a given feature. A feature is distinctive between the two classes when its two cumulative curves do not overlap, and the more separated the curves, the bigger the distribution difference. These are all interesting and potentially important features.

Fig2. Example of distribution plot of features.
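A minimal sketch of such a normalized cumulative probability curve, assuming the labelled DataFrame from above and an example feature such as ‘amount’:

```python
import numpy as np
import matplotlib.pyplot as plt

# Normalized cumulative probability curve of one feature, drawn per class.
def plot_cumulative(df, feature="amount"):
    for cls, color in [(0, "tab:blue"), (1, "tab:orange")]:
        values = np.sort(df.loc[df["label"] == cls, feature].to_numpy())
        cum_prob = np.arange(1, len(values) + 1) / len(values)
        plt.plot(values, cum_prob, color=color, label=f"class {cls}")
    plt.xlabel(feature)
    plt.ylabel("cumulative probability")
    plt.legend()
    plt.show()
```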

Feature Engineering

For ‘enrollee_id’, we noticed in some comments that the reason for rejecting a claim was that the patient’s payments were maxed out. Therefore, instead of using ‘enrollee_id’ as a feature, we engineered a ‘cumulative claim amount’ and ‘cumulative claim count’ for each enrollee, which also makes it easier to handle new patients at test time. The ‘vetted_at’ feature, which records when the claim was vetted, would not be available for future test cases, because if we use machine learning to process claims, a claim won’t be vetted by humans before being fed to the model; therefore we dropped this feature. The ‘created_at’ feature will not generalize to new data because each date is unique, so we kept only the month to retain some of its information, although this might not be a useful feature since we didn’t see a repeated monthly pattern between the two classes in the time series plots (Fig3) of average claim amount and total claim counts.

Fig3. Time series plot of average claim amount for the two classes.
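A sketch of this feature engineering step is shown below, assuming the labelled table from above and that ‘created_at’ is parseable as a datetime; the engineered column names follow the text:

```python
import pandas as pd

# df is the labelled table from the previous step.
df = df.sort_values("created_at")

# Cumulative claim amount and claim count per enrollee, up to and including each claim.
grp = df.groupby("enrollee_id")
df["enrollee_cum_claim_amount"] = grp["amount"].cumsum()
df["enrollee_cum_claim_count"] = grp.cumcount() + 1

# Keep only the month of 'created_at'; drop features unavailable at prediction time.
df["month"] = pd.to_datetime(df["created_at"]).dt.month
df = df.drop(columns=["enrollee_id", "vetted_at", "created_at"])
```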

After feature engineering, we get the following clean data ready to feed the model.

Model selection

We chose models based on the nature of the input data: 1) it is highly imbalanced; 2) the categorical features are very high-dimensional (each has tens of thousands of unique values, so one-hot encoding would blow up the feature dimension); 3) the one-to-one correlations/mappings between feature pairs provide the most meaningful rules for the model to learn. For example, there are legal rules mapping care_id to diagnosis_id, and any pair that doesn’t comply with the rules would be problematic. Taking all this into consideration, simple models would not be sufficient to do a good job. Therefore, we chose the XGBoost (for further reading, check here) and LightGBM (for further reading, check here) algorithms for modeling, because both use decision trees as base models to handle categorical data; if the binning boundaries of the trees are well set, they can learn unique feature pairs without confusion, and the gradient boosting algorithm enhances the positive feedback and selects the best tree model with the best splitting strategy. We also chose Naive Bayes as a baseline model, which can handle both imbalanced and categorical data but cannot correlate features.

Modeling

To handle the imbalanced classes, there are five common strategies: 1. choose the right metrics (recall and specificity, roc_auc, f1_score); 2. upsample the minority class; 3. downsample the majority class; 4. SMOTE (Synthetic Minority Over-sampling Technique); 5. add weight to the minority class. We chose roc_auc as the main metric for model tuning since it best describes the model’s learning ability, and we discuss how to balance recall and specificity later. F1_score is an overall reflection of recall and precision; these are all valuable metrics to check. We also chose upsampling instead of downsampling or SMOTE, because we need as much data as possible to provide rules, and we are not sure whether synthetic data would be legitimate.
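A minimal sketch of upsampling the minority class with scikit-learn, applied to the training split only (‘train_df’ is an assumed name for the training DataFrame with the ‘label’ column):

```python
import pandas as pd
from sklearn.utils import resample

# Upsample the minority (false-claim) class in the training split only.
train_major = train_df[train_df["label"] == 0]
train_minor = train_df[train_df["label"] == 1]

train_minor_up = resample(train_minor,
                          replace=True,                 # sample with replacement
                          n_samples=len(train_major),   # match the majority class size
                          random_state=42)

train_balanced = pd.concat([train_major, train_minor_up]).sample(frac=1, random_state=42)
```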

We first used the imbalanced data as input, but for XGBoost and LightGBM we set a class weight of 25 (since the majority class is about 22 times larger than the minority class) to enhance the positive minority-class signal and prevent the model from ignoring the minority class during learning.
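In both libraries one common way to do this is the scale_pos_weight parameter; a sketch is below, assuming ‘X_train’ / ‘y_train’ are the prepared training features and labels (not necessarily the exact code used):

```python
import lightgbm as lgb
import xgboost as xgb

# Weight the positive (minority) class; the ~22:1 imbalance motivates a weight of 25.
lgb_clf = lgb.LGBMClassifier(scale_pos_weight=25, n_estimators=500)
xgb_clf = xgb.XGBClassifier(scale_pos_weight=25, n_estimators=500)

lgb_clf.fit(X_train, y_train)
xgb_clf.fit(X_train, y_train)
```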

We first saw that the Naive Bayes model (NB, pink line) learned almost nothing, with a roc_auc score (the closer to 1, the better the model learns the rules) of 0.54 (compared with the 0.5 baseline of a dummy learner), indicating that feature correlation is essential for the model to learn the rules. XGBoost and LightGBM reached above 0.76 roc_auc (Fig4, left), and further tuning the tree depth from 5 to 25 improved roc_auc to 0.8 at a tree depth of 11 (Fig4, middle, red line).

Fig4. ROC_AUC curves for model tuning. Left: model selection. Middle: model fine tuning. Right: feature contribution.

We then upsampled the training data to balance the minority class, set the class weight to 1:1, and re-tuned the LightGBM model. Although increasing the tree depth resulted in some overfitting (the training score is much better than the testing score), it also resulted in higher testing scores, so we chose the model with the highest testing score as the best model, with a maximum depth of 15.

Further tuning (to understand it further, check here):

We then fine-tuned the following parameters with grid search (a sketch follows the list):

  1. ‘bagging_frequency’, i.e. how many trees are trained on each subsample of the data. For our data, 1 was best, meaning each subsample needs only one tree to learn it.
  2. ‘bagging_fraction’, the fraction of data used in each iteration. Optimized to 0.6.
  3. ‘class_weight’, which adds more weight to enhance the minority-class signal. Optimized to 25.
  4. ‘colsample_bytree’, the fraction of features selected for each iteration. Optimized to 0.8.
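Here is a sketch of this grid search using LightGBM’s scikit-learn API; note that ‘bagging_frequency’ and ‘bagging_fraction’ correspond to the subsample_freq and subsample aliases, and the grids shown are illustrative, not the exact ones we used:

```python
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb

# Illustrative grids only; subsample_freq/subsample map to the core LightGBM
# parameters bagging_freq/bagging_fraction.
param_grid = {
    "subsample_freq": [1, 5, 10],                                   # 'bagging_frequency'
    "subsample": [0.4, 0.6, 0.8],                                   # 'bagging_fraction'
    "class_weight": [{0: 1, 1: 10}, {0: 1, 1: 25}, {0: 1, 1: 50}],  # minority-class weight
    "colsample_bytree": [0.6, 0.8, 1.0],
}

search = GridSearchCV(
    estimator=lgb.LGBMClassifier(max_depth=15, n_estimators=500),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)   # X_train/y_train: assumed training split
print(search.best_params_, search.best_score_)
```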

Finally, we got the best model, with a roc_auc score of 0.929 (Fig4, middle, black line).

By checking the feature importance of the model (Fig5), we found that the patient (enrollee_cum_claim_count and enrollee_cum_claim_amount), care (care_id), and diagnosis (diagnosis_id) features are the top 3 most important features, which makes sense.

Fig5. Feature Importance

We then wanted to know how important these features are by dropping each feature and seeing how much the model is affected. Without the ‘enrollee’ features, roc_auc dropped by 10%, with recall dropping 17% and specificity dropping 2.7%. Dropping diagnosis resulted in roc_auc dropping 2%, with recall dropping 6.7% but specificity increasing 2.4%. What does that mean?
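A sketch of this drop-one-feature check is below; ‘model_factory’ and the train/test splits are assumed names for illustration, not the exact code used:

```python
from sklearn.metrics import roc_auc_score, recall_score

# Drop-one-feature ablation: retrain without a group of features and compare
# roc_auc, recall, and specificity against the full model.
def ablate(feature_prefix, model_factory, X_train, y_train, X_test, y_test):
    cols = [c for c in X_train.columns if not c.startswith(feature_prefix)]
    model = model_factory().fit(X_train[cols], y_train)
    proba = model.predict_proba(X_test[cols])[:, 1]
    pred = (proba >= 0.5).astype(int)
    return {
        "roc_auc": roc_auc_score(y_test, proba),
        "recall": recall_score(y_test, pred),                    # sensitivity
        "specificity": recall_score(y_test, pred, pos_label=0),  # recall of class 0
    }

# e.g. ablate("enrollee", lambda: lgb.LGBMClassifier(max_depth=15), ...)
```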

Insights

Recall/sensitivity versus specificity: in the cases above, a drop in recall results in losses from failing to catch some false claims and paying what should not be paid, while a drop in specificity results in more false positives, which cost money to review. In this Curacel project, dropping recall by 1% results in roughly $8,000 of loss (1% of the $0.8 million in total false claims), while dropping specificity by 1% results in roughly $850 of loss to pay people to review the flagged claims (1% of 51K claims at about $1.67 per claim, according to local wages). Therefore, a point of sensitivity costs about 9.4× more than a point of specificity.

Since we can change the class weight to favour one class over the other, should we tilt it? According to the above calculation, we should tilt toward sensitivity if gaining 1% of sensitivity costs no more than a 9.4% drop in specificity; otherwise, we should go in the other direction. Following this rule, we fine-tuned the model and found that a weight of 25 toward the positive class gives the best model.
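The break-even arithmetic behind this decision, using the figures above:

```python
# Break-even arithmetic for the recall vs. specificity trade-off (figures from the text).
total_false_claim_amount = 800_000   # USD of false claims in the sample
total_claims = 51_000
review_cost_per_claim = 1.67         # USD per claim, local wage level

cost_per_pct_recall = 0.01 * total_false_claim_amount                    # ≈ $8,000
cost_per_pct_specificity = 0.01 * total_claims * review_cost_per_claim   # ≈ $850

print(cost_per_pct_recall / cost_per_pct_specificity)   # ≈ 9.4
```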

Deliverable: in the end, our model delivers a solution that saves more than $80K for the insurance company by automatically processing the 52K insurance claims and flagging the false ones (leaving 10.7% false negatives that cost about $80K and 3.6% false positives that cost about $3K, according to the formulas above).

There is room to further improve the model:

  1. In our model, many claims contain multiple diagnoses and care items that are not matched one-to-one; we could find other ways to map the relationship between care and diagnosis to strengthen the rules.
  2. More data. If we had enough data to cover all the rules in claim processing, we could push the accuracy to almost 100%.

Jupyter Notebooks

For all related code for data processing and modeling, please check my Link to Github.
