Auto Insurance Fraud Prediction

Neel Roy · Published in Analytics Vidhya · 5 min read · Sep 15, 2020

Dear Readers, this is my very first article on Medium. It is about auto insurance fraud prediction. Fraud prediction datasets are usually imbalanced, with far more legitimate claims than fraudulent ones.

Problem Statement:

These days, many insurance companies deal with fraudulent claims. Fraud can occur at different stages: at the time of filing the proposal, or at the time of the claim, such as staging an accident or claiming pre-existing damage. Fraud is committed for personal gain. The dataset I worked on is what is called an imbalanced dataset, with legitimate claims being far more common than fraudulent claims. According to the FBI, non-health insurance fraud costs an estimated $40 billion per year, which raises the premiums for the average U.S. family by between $400 and $700 annually.

About the Dataset:

The dataset has 1000 observations and 39 features. It contains information about claims from 01-Jan-2015 to 01-Mar-2015 in the states of Ohio, Indiana, and Illinois. The data does not mention the insurance company, so we do not know whether it comes from a single insurer or multiple insurers. The obvious drawback of this dataset is that it has only 1000 observations.

EDA (Exploratory Data Analysis)

The given dataset has 1000 observations and 39 features, with the column fraud_reported being the dependent variable (the variable that we wish to predict). The dependent variable has 753 non-fraudulent cases and 247 fraudulent cases.

No. of Fraudulent vs Non-Fraudulent Cases.
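As a quick sketch (the file name insurance_claims.csv and the column name fraud_reported are assumptions about the raw data), the class balance can be checked with pandas:

```python
import pandas as pd

# File and column names are assumed; adjust to your copy of the data
df = pd.read_csv("insurance_claims.csv")

print(df.shape)  # expected: (1000, 39)

# Target balance: 753 non-fraudulent ('N') vs 247 fraudulent ('Y') cases
print(df["fraud_reported"].value_counts())
```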

Correlations among variables

There was no significant correlation among the variables, except between months_as_customer and age (0.92), and among total_claim_amount, injury_claim, property_claim, and vehicle_claim. Since the total claim is the sum of the injury, property, and vehicle claims, I dropped injury_claim, property_claim, and vehicle_claim.

Correlation Matrix
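A minimal sketch of the correlation check and the column drop described above (column names are assumed to match the public version of this dataset):

```python
import pandas as pd

df = pd.read_csv("insurance_claims.csv")  # file name assumed, as above

# Correlation matrix over the numeric columns only
corr = df.select_dtypes(include="number").corr()
print(corr.loc["months_as_customer", "age"])  # ~0.92, as reported above

# total_claim_amount = injury_claim + property_claim + vehicle_claim,
# so the three components are dropped as redundant
df = df.drop(columns=["injury_claim", "property_claim", "vehicle_claim"])
```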

Cross Tabulation and Visualization against the Dependent Variable.

To see how each variable affected the dependent variable, I did a cross tabulation between the dependent variable and each independent variable. After that I ran a chi-square test to check whether they were dependent or not. Chi-square is a non-parametric test of whether two variables are independent. Some of the most common observations made post EDA were as follows (a sketch of the test appears after the list):

  1. The number of fraud cases was highest in the state of South Carolina, followed by New York.
  2. There was a significant relationship between hobbies and fraud reported. People whose hobbies are chess and CrossFit are more likely to commit fraud.
Fraudulent vs Non-Fraudulent Cases for each Hobby.

3. There was a significant relationship between the authority contacted and fraudulent cases. Except in six cases, all the fraudulent cases had contacted an authority (each authority accounted for roughly 25% of them).

Authorities contacted for each of the fraudulent cases.

4. The average claim amount of fraudulent claims was about $10,000 higher than that of non-fraudulent claims.

5. 90% of the fraudulent cases involved either a multi-vehicle or a single-vehicle collision.

Incident Type vs Fraudulent Cases.

6. If the collision was a rear collision, 42% of the cases were fraudulent, followed by about 27% each for front and side collisions.

Collision Type vs Fraudulent Cases.

7. In the cases where fraud was reported, 71% had property damage associated with them.
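As promised after the list: a minimal sketch of the cross tabulation and chi-square independence test used above, via scipy's chi2_contingency (the column names are assumptions):

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("insurance_claims.csv")  # file name assumed

# Cross tabulation of one independent variable against the target
table = pd.crosstab(df["insured_hobbies"], df["fraud_reported"])

# Chi-square test of independence: a small p-value means we reject
# independence, i.e. the variable is related to fraud_reported
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

The same crosstab-plus-test can be repeated for each categorical feature to produce the observations above.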

Missing Value Treatment.

There were missing values in the dataset, in the columns collision_type (178), property_damage (360), and police_report_available (343). Dropping them was not a feasible option because there are so few observations in the data; had there been many more observations, we could have dropped them. Since the missing values are categorical in nature, we could replace them with the most frequent (mode) value, but there is a risk attached to that: it introduces bias. So instead I used a different concept called MICE imputation (you can read about it here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/).
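For reference, the mode-based imputation discussed above (which the MICE results are cross-checked against later) might look like this; that missing entries are encoded as '?' in the raw file is an assumption:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("insurance_claims.csv")

# Assumption: missing entries are encoded as '?' in the raw file
df = df.replace("?", np.nan)

# Replace each categorical gap with the column's most frequent value
for col in ["collision_type", "property_damage", "police_report_available"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```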

MICE stands for Multivariate Imputation by Chained Equations. As the name suggests, instead of replacing missing values using univariate techniques, you use multivariate techniques. There are actually three kinds of missing values: missing at random, missing completely at random, and missing not at random (you can read about them here: https://www.displayr.com/different-types-of-missing-data/). In our case the missing values were missing at random, since they were related to the dependent variable. In Python, the MICE equivalent is the iterative imputer. Here I used random forest and linear regression to impute the missing values. But remember that before you use the iterative imputer, you must convert categorical data into numerical data; also, MICE works much better in R. Once I had imputed the missing values and cross-checked against the earlier mode-based imputation, I found that the results were the same, so I was convinced my result was accurate.
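A sketch of scikit-learn's IterativeImputer, the MICE equivalent mentioned above, with a random forest as the estimator. Categoricals are integer-encoded first, as the text requires; the encoding scheme and the hyperparameters here are illustrative assumptions:

```python
import numpy as np
import pandas as pd
# this experimental import must come before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("insurance_claims.csv").replace("?", np.nan)

# Iterative imputation needs numeric input: encode categoricals as
# integer codes, keeping genuinely missing entries as NaN (code -1)
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes.replace(-1, np.nan)

# Each column with missing values is modelled on the other columns,
# in rounds, until the imputations stabilise
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=100),
                           max_iter=10, random_state=42)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Imputed category codes come back as floats and would need rounding to the nearest valid code; swapping RandomForestRegressor for LinearRegression gives the linear-regression variant also tried here.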

Model Building

During the model building stage, I used 5 algorithms to build my classification model. Those 5 models were:

  1. XGBoost
  2. Logistic Regression
  3. KNN
  4. Random Forest
  5. AdaBoost

Metrics used: F1 score, Recall, AUC score, Precision, Accuracy.

Since my dataset was imbalanced, the most informative metrics were F1 score, recall, and AUC score. F1 score is the harmonic mean of recall and precision. Recall is, out of all the fraudulent cases, how many my model predicted correctly. AUC (Area Under the Curve) tells me how well my model can differentiate between the two classes. I used the PyCaret library to generate my output. The advantage of PyCaret is that you don't need to write a function to get the desired output; you just preprocess your data and supply your dataframe and target variable along with the training and test percentages. This is the output I got.
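A minimal PyCaret sketch of this step (using PyCaret's classification module; the train_size, session_id, and sort metric are illustrative choices):

```python
from pycaret.classification import setup, compare_models

# setup() handles preprocessing and the train/test split internally
clf = setup(data=df_imputed, target="fraud_reported",
            train_size=0.8, session_id=42)

# Train the candidate models and rank them; with an imbalanced target,
# sorting by Recall (or F1/AUC) is more telling than Accuracy
best = compare_models(sort="Recall")
```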

The best model I got was XGBoost, with a recall of 61.99 (meaning roughly 6 out of every 10 fraudulent cases were correctly identified), an F1 score of 63.55, and an AUC score of 85.51, indicating that about 86% of the time my model was able to separate the two classes.

The test set got an accuracy of 82%, an AUC of 87.71, a recall of 64.52, a precision of 63.49, and an F1 of 64. Since the metrics on the test set and the training set were similar, there was no overfitting; both had high accuracy and were similar on the other metrics, indicating little to no bias.
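Scoring the hold-out set, continuing the PyCaret sketch above, is a single call (again assuming PyCaret's classification API):

```python
from pycaret.classification import predict_model

# With no data argument, predict_model scores the hold-out set created
# by setup() and prints Accuracy, AUC, Recall, Precision and F1 for it
holdout = predict_model(best)
```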

PS: The results might change depending on the size of the dataset and the degree of imbalance in the dependent variable (GitHub link: https://github.com/neelcoder/-symmetrical-robot).

If you have any doubts, please do reach out.

LinkedIn Profile(https://www.linkedin.com/in/neel-roy-55743a12a/)
