Predicting Insurance Claim Amount for Damage Vehicle

nikhil sharma
10 min read · Oct 11, 2021


Credit: coverfox

Contents

  1. Overview
  2. Data set
  3. Business Problem
  4. ML formulation of business problem
  5. Business constraints
  6. Performance metric
  7. EDA
  8. Feature Engineering
  9. DL models
  10. ML models
  11. End 2 End Pipeline
  12. Sample Output
  13. Deployment
  14. Future Work
  1. OVERVIEW

Vehicle insurance covers automobiles that use the road. Its main purpose is to provide financial protection against:

  • Physical damage or bodily injury caused to us or a third party (or both) by traffic collisions
  • Liability that could arise from incidents or accidents involving the vehicle

By law, basic insurance is mandatory for every automobile driver and vehicle. Policies come in different shapes and sizes to cover individual requirements, from basic to advanced, and thanks to advances in technology, getting insurance has become hassle free not only for the company but also for customers.

Q. Why do we need vehicle insurance?

Vehicle insurance may include add-on terms that offer financial protection against theft of the vehicle, natural disasters, or damage sustained from events other than traffic collisions, such as keying or colliding with stationary objects.

Maximum or minimum coverage depends on various factors such as the age of the car, its brand and model, and the risk of theft associated with it.

It acts as a shield against financial losses by paying for damage, reduces liability, is more affordable when purchased online, compensates the family in the event of an accidental death, and so on. It comes in two main types: own damage and third-party liability.

2. DATASET

The dataset contains a train set and a test set, with 1399 training samples and 600 test samples respectively. The data contains multiple columns, described below.

data set details

3. BUSINESS PROBLEM

The goal is to automate the damaged-car claims process, which usually involves a person visiting the site for visual inspection and validation of the damaged car. With this automation, insurance can be claimed in a faster, hassle-free manner, which also helps the company improve its service ratings and attract a larger pool of customers with the help of technology.

  • Condition: predict whether the vehicle in the provided image is damaged or not.
  • Amount: based on the condition of the vehicle, predict the insurance claim amount.

4. ML FORMULATION OF THE BUSINESS PROBLEM

  • Predicting the condition of the car is a classification task, which can be solved with classical ML, deep learning, ensembles, or a combination of these.
  • Predicting the insurance amount based on the damage condition is a regression task, which can likewise be solved with classical ML, deep learning, ensembles, or a combination of these.

5. BUSINESS CONSTRAINTS

  • Low-to-medium latency requirement.
  • Misclassification is costly: paying a claim for a car that is predicted damaged but actually is not would be a huge blunder.

6. PERFORMANCE METRIC

  1. F1-micro

It is the harmonic mean of micro-precision and micro-recall.

For car-condition prediction the evaluation metric asked for is micro-F1. Although the task is binary, both false positives and false negatives matter: a false positive (actually not damaged but predicted damaged) is a huge cost for the company, paying a claim for an incident that never happened, while a false negative (actually damaged but predicted not damaged) is still manageable, as it can be sent for further evaluation if required.
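As a quick illustration, micro-F1 can be computed with scikit-learn; the labels below are made up for demonstration, not from the dataset:

```python
from sklearn.metrics import f1_score

# toy labels for illustration: 1 = damaged, 0 = not damaged
y_true = [1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 1, 1, 1]

# micro averaging pools TP/FP/FN over all classes before computing
# precision and recall, so for single-label data it equals accuracy
print(f1_score(y_true, y_pred, average="micro"))  # 0.75 (6 of 8 correct)
```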

2. R-Square

For claim-amount prediction the evaluation metric asked for is r2_score, also known as the coefficient of determination. It tells us about the goodness of fit of the model, i.e., how well the predictors (independent variables) in the regression equation approximate the real/test data points.

The fitted regression equation produces a prediction Yi_hat for each actual value Yi, and R² compares the residuals against the spread of the data:

R² = 1 - (SSRES / SSTOT)

SSRES = Σ (Yi - Yi_hat)² : sum of squared residuals

SSTOT = Σ (Yi - Y_bar)² : total sum of squares

Yi = ith observation of the actual data points

Yi_hat = ith observation of the predicted data points

Y_bar = mean of the actual data points
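The same quantity can be computed by hand and cross-checked against scikit-learn's r2_score; the numbers below are illustrative, not from the dataset:

```python
from sklearn.metrics import r2_score

# illustrative actual vs predicted claim amounts
y_true = [10000.0, 25000.0, 18000.0, 30000.0]
y_pred = [12000.0, 24000.0, 17500.0, 28000.0]

ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
y_bar = sum(y_true) / len(y_true)
ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual)
print(r2_score(y_true, y_pred))  # matches the manual value
```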

7. EDA

Credit: image

ASSUMPTIONS:

  • From the official description/instructions it is not clear what the Amount column represents: an insurance premium, an insurance claim amount, or a sum assured. Since this dataset consists mostly of damaged vehicles (as seen from its distribution), we treat Amount as the insurance claim amount.
  • Since it is a claim amount, a claim cannot exceed the cost of the vehicle or the maximum insurance coverage, so we have to impute the values in the Amount column that violate these bounds.
  • The dataset describes Max coverage as the maximum coverage provided by the insurance company. It is not clear whether add-ons are included in this figure, so for our purposes we assume they are not (if add-ons were included, Max coverage could go up).
  • The unit of the Amount column is not given in the original data, so choosing a unit is left to the reader's discretion.
  1. From the original data source, three files are provided: a train set, a test set, and a submission file.
  • train set shape = (1399,8)
  • test set shape = (600,8)

output

2. Plotting

2.1 checking null value

OBSERVATION

Some data points have null values across the columns, which leaves two choices: drop them or impute them.
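A minimal sketch of the null check with pandas; the column names here are assumptions standing in for the real schema:

```python
import pandas as pd

# tiny stand-in for the train set; real column names may differ
df = pd.DataFrame({
    "Insurance_company": ["A", "B", "B", "A"],
    "Cost_of_vehicle": [25000.0, None, 31000.0, 27000.0],
    "Amount": [4000.0, 5200.0, None, 4800.0],
})

# per-column null counts, the usual first check before deciding drop vs impute
print(df.isnull().sum())
```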

2.2 VEHICLE DISTRIBUTION

OBSERVATION

  1. 92.9% of the data points in the train set are damaged (1) and 7.1% are not damaged (0), which suggests the dataset predominantly contains damaged-vehicle images.

2.3 INSURANCE COMPANY COUNT

OBSERVATION

  1. Each insurance company has a roughly equal share, with B having the highest.

2.4 CONDITION OF VEHICLE IN EACH INSURANCE COMPANY

OBSERVATION

  1. From the above data it is clear that every insurance company has very few not-damaged data points; company ‘B’ has the highest count (140), while ‘BC’, ‘AA’, and ‘RE’ have the lowest (111 each).
  2. This imbalance is in line with the problem statement, since we want to predict the claim amount to be paid based on the damaged condition.

2.5 CORRELATION AMONG FEATURES

OBSERVATION

Cost of vehicle and Min coverage are almost identical in terms of correlation (nearly perfectly correlated with each other).

2.6 IMPUTATION

Since we have very few data points, we impute each feature's null values using the median of that feature within the corresponding insurance company.

for detailed work on each column's imputation, click here 1.5
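The company-wise median imputation can be sketched with a grouped transform; the data and column names are toy assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "Insurance_company": ["A", "A", "A", "B", "B", "B"],
    "Amount": [4000.0, None, 5000.0, 8000.0, 9000.0, None],
})

# fill each company's missing Amount with that company's own median
df["Amount"] = (df.groupby("Insurance_company")["Amount"]
                  .transform(lambda s: s.fillna(s.median())))
print(df["Amount"].tolist())  # A's gap -> 4500.0, B's gap -> 8500.0
```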

2.7 PDF AND CDF OF FEATURES

OBSERVATION

  1. The distribution is right skewed with a long tail on the right side, indicated by the high skewness and kurtosis values.
  2. The long right tail of the PDF means a few samples have a large value of Amount; these samples could affect model training.
  3. There is also some gap between the 99th and 75th percentile values, which may confirm the presence of a few outliers in the data.
  4. The mean and the 50th percentile almost lie on one another, so the effect of any outliers is negligible.
  5. Peaks can be seen in the distribution at various values of Amount, which indicates a multimodal distribution.
  6. The long tail on the right is mainly due to a single observation.

for detailed work on each column's CDF and PDF, click here 1.6

2.8 CHECKING DISTRIBUTION OF CONTINUOUS VARIABLE

QQ-plot: a graphical way of checking whether given data follows a particular distribution (the reference distribution has to be chosen by us).

OBSERVATION

  1. We used the normal distribution as the base; the distribution of Amount is roughly normal, with some outliers.
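A QQ-plot check of this kind can be sketched with scipy; synthetic data stands in for the Amount column:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
amount = rng.normal(loc=5000, scale=800, size=500)  # stand-in for Amount

# probplot pairs theoretical normal quantiles with the ordered sample;
# a fit correlation r near 1 means the points hug the normal line
(osm, osr), (slope, intercept, r) = stats.probplot(amount, dist="norm")
print(round(r, 3))
```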

2.9 BOX PLOT

OBSERVATION

  1. The median of Amount almost coincides with the median within each insurance company.
  2. The 25th percentile of Amount almost coincides with the 25th percentile within each insurance company.
  3. The 75th percentile of Amount almost coincides with the 75th percentile within each insurance company.
  4. Visually, outliers are present, which may affect the rest of the pipeline.

2.10 OUTLIER TREATMENT

OBSERVATION

  1. From the first query we see rows where Amount is much higher than Cost of vehicle, which is not possible for a claim.
  2. The second query shows negative values, which is also not possible, as a claim amount cannot be negative.

So we replace these values with the median of Amount within the respective insurance company.
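A sketch of that outlier treatment, with impossible claims replaced by the company-wise median of the valid rows; data and column names are toy assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "Insurance_company": ["A", "A", "A", "B", "B", "B"],
    "Cost_of_vehicle": [20000.0] * 3 + [30000.0] * 3,
    "Amount": [4000.0, 25000.0, 5000.0, 8000.0, -100.0, 9000.0],
})

# flag impossible claims: above the vehicle's cost, or negative
bad = (df["Amount"] > df["Cost_of_vehicle"]) | (df["Amount"] < 0)

# medians computed from the valid rows only, per insurance company
company_median = df.loc[~bad].groupby("Insurance_company")["Amount"].median()
df.loc[bad, "Amount"] = df.loc[bad, "Insurance_company"].map(company_median)
```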

2.11 PAIR PLOT

It tells us how each variable behaves in the presence of the others; the biplot below (a subset of the pair plot) shows something interesting.

OBSERVATION

  1. Max coverage splits into 2 separable categories: the points greater than 20000 (which include almost all insurance companies) have no damage. This helps us design a feature in the feature-engineering section (whether Max coverage is greater than 20000 or not), i.e., a category of insurance for this dataset.

for detailed work on the EDA, click here section 1.

8. FEATURE ENGINEERING

Coming up with new features based on existing ones to gain more insight into the data.

1. Company count = count of number of times insurance company appears in dataset.

2. Range of coverage = difference between minimum and maximum coverage.

3. Insurance period = 1 if the age of the insurance is greater than the median of the age column, else 0.

4. Low expire = insurance expires within 2 years from now.

5. Medium expire = insurance expires in more than two but less than 5 years from now.

6. High expire = insurance expires more than 5 years from now.

for the other features, click here 2.1
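A few of the engineered features above can be sketched as follows; the column names and the expiry representation are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "Insurance_company": ["A", "B", "B"],
    "Min_coverage": [1000.0, 1200.0, 1100.0],
    "Max_coverage": [20000.0, 25000.0, 22000.0],
    "Expiry_years": [1.0, 3.0, 7.0],  # years until the policy expires
})

# 1. how often each company appears in the data
df["Company_count"] = df["Insurance_company"].map(
    df["Insurance_company"].value_counts())
# 2. spread between minimum and maximum coverage
df["Range_of_coverage"] = df["Max_coverage"] - df["Min_coverage"]
# 4-6. expiry buckets (exact boundary handling is a design choice)
df["Low_expire"] = (df["Expiry_years"] <= 2).astype(int)
df["Medium_expire"] = ((df["Expiry_years"] > 2)
                       & (df["Expiry_years"] < 5)).astype(int)
df["High_expire"] = (df["Expiry_years"] >= 5).astype(int)
```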

9. DL MODELS

In this section, to predict the condition (damaged or not) from the image provided in the data, we experimented with 4 deep learning architectures (VGG19, ResNet50, MobileNet, and a custom architecture) and extracted features from them for transfer learning.

Transfer learning is a machine learning technique that takes the knowledge (patterns) learned while solving one problem and applies it to a different but related problem; for example, knowledge gained while learning to recognize cars can be applied when trying to recognize trucks.

Of the 4 architectures above, MobileNet performs best (the most stable learning) and will be used for final deployment.
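A sketch of the MobileNet transfer-learning setup in Keras; weights=None keeps this snippet self-contained, whereas the real experiments would use pretrained ImageNet weights:

```python
import numpy as np
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# frozen convolutional base as the feature extractor
base = MobileNet(include_top=False, weights=None, input_shape=(224, 224, 3))
base.trainable = False

# small classification head: damaged (1) vs not damaged (0)
x = GlobalAveragePooling2D()(base.output)
out = Dense(1, activation="sigmoid")(x)
model = Model(base.input, out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["binary_accuracy"])

probs = model.predict(np.zeros((1, 224, 224, 3)), verbose=0)
print(probs.shape)  # one probability per image
```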

mobile-net results

OBSERVATION

  1. In the epoch-loss curve, training loss is initially higher than validation loss, i.e., underfitting, which is expected in the early stages. At epoch 9 the validation loss (here the test loss) is more or less equal to the training loss, which is the balance we are looking for; after epoch 9 the model may start to overfit.
  2. In the binary-accuracy plot, training accuracy increases steadily while test accuracy moves haphazardly, perhaps because what the model learns in an epoch does not immediately generalize; more epochs would be needed to say anything definite. At epoch 9, train and test accuracy are balanced.

for detailed deep learning and transfer learning work, click here: deep and transfer learning.

10. ML MODELS

In this section, to predict the Amount, we experimented with 2 algorithms: linear regression (as a baseline) and GBDT (gradient boosted decision trees, slightly more advanced). Both are applied on

  • image-only data (original features + engineered features + the condition predicted by the DL models)
  • transfer-learning data (original features + engineered features + the condition predicted by the DL models + features extracted from the DL models)

Grid search is also applied to both setups to arrive at robust results.

for detailed regression/GBDT ML work, click here: regression learnings.
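The GBDT-plus-grid-search setup can be sketched with scikit-learn; synthetic features stand in for the tabular data, and the parameter grid is only an example:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for original + engineered features + predicted condition
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# grid search over a small GBDT parameter grid, scored by R²
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [2, 3]},
    scoring="r2", cv=3)
grid.fit(X_tr, y_tr)

score = r2_score(y_te, grid.best_estimator_.predict(X_te))
print(grid.best_params_, round(score, 3))
```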

10.1 OVERALL PERFORMANCE OF MODELS

  • A summary of all the DL and ML models experimented with, along with their type and metrics.

CONCLUSION

  1. The baseline linear regression is slightly inferior to GBDT (the slightly more advanced model).
  2. For the final production model we choose the one with the highest R² score, which indicates minimum residuals around the regression line (i.e., almost all the data points nearly fit the regression line); here that is GBDT on the image-only data.

for detailed work on the internals of the deep learning and regression models, click here

11. END 2 END PIPELINE

Considering the best model, along with its cross-validation results, for productionization.

function for predicting condition for image given
final pipeline function to integrate all steps
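The two steps above can be sketched as one hypothetical function; the model objects, signatures, and the 0.5 threshold are assumptions, not the project's actual code:

```python
def predict_claim(image_array, tabular_row, condition_model, amount_model):
    """Return (condition, claim_amount) for a single vehicle."""
    # step 1: condition from the image (1 = damaged, 0 = not damaged)
    condition = int(condition_model(image_array) >= 0.5)
    # step 2: a claim amount only makes sense for a damaged vehicle
    if condition == 0:
        return condition, 0.0
    features = list(tabular_row) + [condition]
    return condition, float(amount_model(features))

# toy stand-ins for the trained condition and amount models
cond, amount = predict_claim(
    image_array=None,
    tabular_row=[25000.0, 1000.0, 20000.0],
    condition_model=lambda img: 0.9,
    amount_model=lambda feats: 4500.0)
print(cond, amount)  # 1 4500.0
```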

for detailed work on the final end-to-end pipeline, click here

12. SAMPLE OUTPUT

sample output 1
sample output 2

13. DEPLOYMENT AND WEB-APP

web app link : to be updated soon

Local system deployment is done via Streamlit.

14. FUTURE WORK

  • try to collect more data manually
  • use a publicly available cars dataset for more variation in the image data
  • employ models such as the XGBoost regressor, CatBoost regressor, and LightGBM
  • employ models such as DenseNet, newer versions of MobileNet, and ResNet for improved results
  • experiment with dropping all rows that have null values
  • incorporate descriptive features of the images as separate columns in the regression task


Thank you for being a patient reader. Please do support by liking this post if you feel the blog has added value.

You can connect with me on LinkedIn.

The entire project code is available in the GitHub Repo.
