Allstate Insurance Claims Severity Prediction

Rupesh Malla
11 min read · Mar 8, 2022


Ref: https://www.kaggle.com/c/allstate-claims-severity/

Table of Contents:

  1. Business Problem
  2. ML formulation
  3. Understanding the Data
  4. Existing approaches / Literature Survey
  5. First cut solution
  6. EDA
  7. Feature Engineering
  8. Understanding Autoencoders
  9. Modeling different Machine Learning architectures/algorithms
  10. Results
  11. Future Work
  12. Profile
  13. References

1. Business Problem

1.1 Overview :

This problem statement comes from a Kaggle recruitment challenge by Allstate Insurance. Allstate is a US insurance company that provides coverage to over 16 million households. The company wants to reduce the complexity of the insurance claims process and make it a worry-free experience for its customers by automating the prediction of claim severity.

1.2 Main Objective :

Allstate wants to shorten the time-consuming claims process and make it easier for people who need insurance coverage to file a claim.

To reduce this complexity, the company has provided a dataset so that machine learning algorithms can be used to accurately predict the cost, and hence the severity, of each claim. This would help claimants cope with the severity of their loss rather than deal with the complex process of submitting paperwork and negotiating with insurance agents.

2. ML formulation

Insurance claims depend on several factors, and with the help of those factors (features) we can build a machine learning model to predict the loss amount, i.e. the severity of a claim. This is a single-output regression problem, as we only have to predict the cost for each given set of features.

3. Understanding the Data

First, after obtaining the data from Kaggle, we import it and check its dimensions.

We see that the train data has dimensions (188318, 132) and the test data has dimensions (125546, 131).

The test set therefore has an id column plus 130 anonymized features, and the train set has one extra column, the target loss.

All the features are anonymized, i.e. categorical features are named cat1, cat2, etc., and continuous features are named cont1, cont2, etc.

There are 14 continuous features and 116 categorical features.
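As a minimal sketch (assuming the standard train.csv and test.csv files from the competition download), the loading and column split looks like this:

```python
import pandas as pd

# Load the competition files and check shapes.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
print(train.shape, test.shape)   # (188318, 132), (125546, 131)

# Split the anonymized columns by type using their naming convention.
cat_cols = [c for c in train.columns if c.startswith("cat")]
cont_cols = [c for c in train.columns if c.startswith("cont")]
print(len(cat_cols), len(cont_cols))  # 116, 14
```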

4. Existing approaches / Literature Survey

Some of the interesting approaches which I have studied and replicated in my work are :

i. Effect of using MAE as loss

https://www.kaggle.com/c/allstate-claims-severity/discussion/24520#140255

About :

This is a discussion thread about the problem of optimizing MAE as the loss.

The Problem :

In XGBoost and other algorithms we can set ‘objective’: ‘reg:linear’, but then the algorithm optimizes MSE. MSE is minimized by the conditional mean, whereas MAE is minimized by the conditional median, which can cause problems when training our model on an MAE-scored task. In addition, MAE (mean absolute error) is not differentiable at 0 (abs(x) has no derivative at x = 0, and its second derivative is 0 everywhere else), so gradient-boosting methods that need a gradient and a Hessian cannot optimize it directly.

Suggested solution :

(fair_obj) To get around the differentiability problem, we use a custom objective function (obj) that supplies custom (grad, hess) values {ref: xgb.train: eXtreme Gradient Boosting Training in xgboost: Extreme Gradient Boosting}. The trick is simply to add a fair_constant to the absolute residuals and derive the gradient and Hessian from the resulting “fair” loss. Values of fair_constant around 0.7 (or 2) are reported as optimal.
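A minimal sketch of such a fair objective, following the formulation in the discussion (the default fair_constant of 0.7 is one of the values mentioned there):

```python
import numpy as np

# Fair loss c^2 * (|x|/c - log(1 + |x|/c)) stands in for MAE: its gradient
# and Hessian exist everywhere, so XGBoost's Newton-style updates work.
def fair_obj(preds, dtrain, fair_constant=0.7):
    labels = dtrain.get_label()
    x = preds - labels
    den = np.abs(x) + fair_constant
    grad = fair_constant * x / den          # first derivative of the fair loss
    hess = fair_constant ** 2 / den ** 2    # second derivative of the fair loss
    return grad, hess

# Usage: xgb.train(params, dtrain, num_boost_round=1000, obj=fair_obj)
```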

ii. 2nd place solution :

About :

This is the solution of Alexey Noskov, the 2nd-place winner in this Kaggle competition. He discusses how and which model architectures gave him good results.

Data Preparations :

The author used different categorical encodings such as lexical, dummy, and Bayesian encodings. The best encoding for tree-based models (like random forests) turned out to be lexical encoding, whereas for non-tree models, transforming the data by clustering, replacing each point with its cluster distances, and applying an RBF function on top performed well.
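As an illustration, one common way to implement lexical encoding for this dataset’s letter-coded categories is to read each code as a base-26 number; this is my reconstruction, not necessarily the author’s exact code:

```python
import pandas as pd

def lexical_encode(col: pd.Series) -> pd.Series:
    # Map each label ('A'..'Z', 'AA', 'AB', ...) to an integer by reading it
    # as a base-26 number (A=1, ..., Z=26, AA=27, ...), so the codes become
    # consecutive integers in their natural generation order.
    def encode(label: str) -> int:
        value = 0
        for ch in label:
            value = value * 26 + (ord(ch) - ord("A") + 1)
        return value
    return col.map(encode)

# Example: lexical_encode(pd.Series(["A", "B", "Z", "AA"])) -> 1, 2, 26, 27
```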

Ml models :

First-level models: these are the single best-performing models without stacking (ensembling). The strongest were XGBoost and a Keras neural net with 4-6 bags. At the first level the author trained more than 70 models, including LightGBM, random forests, etc.

Second-level models: these consisted mostly of XGBoost and neural-net models with different parameters, along with some linear regressions with different target transformations, random forests, and gradient boosting from sklearn, because it can optimize MAE directly.

Third-level models: the author used quantile regression (regression that estimates conditional quantiles, such as the median, rather than the mean) with 8 bags for each fold and fold-averaged predictions. He also grouped all the second-level models by similarity, averaged their predictions to produce 10 additional features for this third-level ensemble, and applied a power correction to some of the features.

iii. Solution by Mariusbo

Introduction :

This code solution tackles the data in a simple way: it performs standard preprocessing steps and then implements fair_obj to optimize an XGBoost model, scoring a very strong 1106.33084 (MAE), better than the first-place solution on the private leaderboard.

Pre-Processing the data :

All the highly skewed variables are transformed using a Box-Cox transformation (which maps skewed distributions toward a normal distribution). All the categorical variables are custom encoded and standard scaled, while all the continuous variables are min-max scaled. Finally, the train and test points are checked for uniqueness, and any rows that appear in both the train and test datasets are removed.
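A rough sketch of the continuous-feature part of this preprocessing (the skew threshold of 0.25 and the column names are assumptions on my part):

```python
import pandas as pd
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

cont_cols = [f"cont{i}" for i in range(1, 15)]

def transform_continuous(df: pd.DataFrame, threshold: float = 0.25) -> pd.DataFrame:
    df = df.copy()
    for col in cont_cols:
        # Only transform columns whose skewness exceeds the threshold.
        if abs(stats.skew(df[col])) > threshold:
            # Box-Cox needs strictly positive inputs, hence the +1 shift.
            df[col], _ = stats.boxcox(df[col] + 1)
    # Min-max scale all continuous columns to [0, 1].
    df[cont_cols] = MinMaxScaler().fit_transform(df[cont_cols])
    return df
```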

Modeling and input of the data :

The data is passed to XGBoost’s train() along with all the parameters, trained over 10 folds, and the predictions from the folds are averaged.
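A minimal sketch of such fold-averaged training, reusing the fair_obj sketched earlier (X, y, X_test are assumed to be NumPy arrays and params a prepared parameter dict):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

def kfold_xgb(X, y, X_test, params, n_folds=10, num_rounds=1000):
    # Train one booster per fold and average their test predictions.
    dtest = xgb.DMatrix(X_test)
    test_preds = np.zeros(X_test.shape[0])
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, _ in kf.split(X):
        dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
        model = xgb.train(params, dtrain, num_boost_round=num_rounds, obj=fair_obj)
        test_preds += model.predict(dtest) / n_folds
    return test_preds
```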

(Several other works are referenced; please check my GitHub repo for a more detailed literature survey.)

5. First cut solution

The research above, from Kaggle and other sources, taught me about different aspects of the data and the problems it poses, namely optimizing for MAE and handling the large number of categorical and continuous variables.

EDA:

Plot the correlations between the variables to find those that are highly correlated with each other. The distribution of the loss should also be observed, along with the distribution of its log transform. Every reference trains the models on a log-transformed target (because the log-transformed loss gives a Q-Q plot that lies almost on the normal line). The skewness of all the variables should be calculated, the skewed ones transformed with Box-Cox, and their distributions checked with Q-Q plots. DABL plots of all the variables help in checking the distribution of every variable with respect to the target, and they also give the top features directly.

Feature Engineering :

The most important features are obtained using a Random Forest regressor, and an autoencoder is built to create additional features on top. The skewed features are transformed with Box-Cox to make their distributions closer to normal. For certain models (for stacking), highly correlated features and the least important features are removed and the score is checked, as some of the solutions suggested.

Models:

As suggested in almost all the references, the plan is to first build a random forest regressor and observe its score, then build multiple models with XGBoost, LightGBM, and sklearn’s gradient boosting (which supports MAE directly), and also train a neural-net model. All of these are then stacked, and additional XGBoost models are built on top of the well-performing initial models to get even better predictions. One of the most crucial parts is using fair_obj for XGBoost and related models to get around the MAE problem discussed above, and using k-fold averaged predictions for each of the best-performing XGBoost models. I would also like to split the models across the different feature-engineering variants mentioned above and experiment with them, and to try CatBoost with only the categorical features to see how it performs.

6. EDA (Exploratory data analysis)

The analysis was done with the help of various plots. Initially, I used a correlation plot on the continuous variables to analyze the correlations between the features in the dataset.
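A sketch of how such a heat map can be produced (assuming the train DataFrame loaded earlier):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heat map of the 14 continuous features.
cont_cols = [c for c in train.columns if c.startswith("cont")]
corr = train[cont_cols].corr()
plt.figure(figsize=(12, 9))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between continuous features")
plt.show()
```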

In the above correlation heat map, we observe that certain continuous variables are very highly correlated, for example (cont11, cont12) with a correlation of 0.99. In linear algorithms such as linear regression, highly correlated variables cause multicollinearity: even small changes in the data can lead to large changes in the fitted model, and there is also a higher risk of overfitting the regression models.

While analyzing the categorical variables, I found that some of them have a very high number of labels.

The above plots show the label distributions of all the categorical variables present in the data frame, and some of the observations are:

  1. The categorical variables cat1 to cat72 have only two labels, A and B, whereas cat73 to cat116 have multiple (3 or more) labels.
  2. Among the two-label categorical variables, the majority class is A in almost every variable (all except 5).
  3. In cat109, a single category has an overwhelming majority over all the other classes.
  4. cat116 and cat113 have a very large number of labels.

To check how the number of labels varies across the categorical variables, I built a count plot.
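A sketch of how such a count of labels per feature can be computed (assuming the train DataFrame):

```python
# Number of unique labels per categorical feature, largest first.
cat_cols = [c for c in train.columns if c.startswith("cat")]
label_counts = train[cat_cols].nunique().sort_values(ascending=False)
label_counts.plot(kind="bar", figsize=(20, 5), title="Number of labels per categorical feature")
```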

The above graph shows the number of labels present in each categorical variable: cat116 has the highest number of labels (326), and cat110 the next highest (131). These variables pose a problem, as encoding them blows up the dimensionality; mapping them to continuous values and treating them numerically in the regression is more appropriate.

Now coming to the target variable, loss, plotting its distribution gives:

The target variable is highly right-skewed because of some outliers. Since many models assume that the target variable is normally (Gaussian) distributed, using the raw target here would impact our model negatively.

So, to bring the skewed variable closer to a normal distribution, I applied a log transform, which resulted in a less skewed plot:
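A sketch of the transform (log1p is my choice here to keep the argument positive; some solutions use a shift such as loss + 200 instead, which the article does not specify):

```python
import numpy as np
import seaborn as sns

# Log-transform the target to reduce the right skew and plot the result.
train["log_loss"] = np.log1p(train["loss"])
sns.histplot(train["log_loss"], kde=True)
```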

Transforming the target variable with log() makes the distribution almost Gaussian, although it becomes slightly left-skewed. Modeling with this data is possible, but further corrections could be applied.

To finish the EDA, I compared the train and test datasets using adversarial validation, to check whether they have the same kind of distribution. If the distributions differ, a model trained on the train data will not predict the test values properly.
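A minimal adversarial-validation sketch under assumed names (train_features, test_features): label the train rows 0 and the test rows 1 and see whether a classifier can tell them apart; a cross-validated ROC-AUC close to 0.5 means the two sets look alike.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation(train_features: pd.DataFrame, test_features: pd.DataFrame) -> float:
    # Stack both sets and label their origin.
    X = pd.concat([train_features, test_features], axis=0)
    y = np.r_[np.zeros(len(train_features)), np.ones(len(test_features))]
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    # AUC ~ 0.5 -> the classifier cannot distinguish train from test.
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```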

With the help of PCA, I plotted the components of both the train and test data:
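A sketch of that projection (train_cont_scaled and test_cont_scaled are assumed to be the min-max-scaled continuous columns):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project both sets onto the first two principal components and overlay them.
pca = PCA(n_components=2).fit(train_cont_scaled)
train_2d = pca.transform(train_cont_scaled)
test_2d = pca.transform(test_cont_scaled)
plt.scatter(train_2d[:, 0], train_2d[:, 1], s=2, alpha=0.3, label="train")
plt.scatter(test_2d[:, 0], test_2d[:, 1], s=2, alpha=0.3, label="test")
plt.legend()
plt.title("PCA projection of train vs test")
plt.show()
```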

We see that the distributions of the train and test data are almost the same, as the two sets overlap each other; hence, while training the models, we will have relatively few issues caused by differences between the train and test distributions.

Conclusions from EDA are :

-> The distribution of loss (the target variable) is highly skewed, so transforming it with log(x) helps bring it close to a normal distribution.

-> Certain variables in the data, both categorical and continuous, are very highly correlated with other variables of the same type (cat with cat, cont with cont).

-> There is a small number of outliers in the data, but given the amount of data they are mostly negligible. They could, however, cause issues for distance-based models like k-NN, because points far from the clusters distort the distances.

-> There are no distribution differences between the train and test data, so a model trained on the train data should generalize to the test data.

-> There are a lot of features, and some of them, such as cat116 and cat110, have very many labels, which might cause dimensionality problems.

7. Feature Engineering

Because of the problems we might face with the high dimensionality and the high number of labels in some of the categorical variables, I calculated the top 30 features using a RandomForestRegressor.

The top 30 most important features according to the Random Forest regressor are:
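A sketch of how such a ranking can be computed (X_encoded is assumed to be the encoded feature matrix as a DataFrame and y_log the log-transformed loss):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Fit a Random Forest and rank features by impurity-based importance.
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X_encoded, y_log)
importances = pd.Series(rf.feature_importances_, index=X_encoded.columns)
top30 = importances.sort_values(ascending=False).head(30)
top30.plot(kind="barh", figsize=(8, 10), title="Top 30 features by importance")
```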

Here we see very high feature importance for cat80, cat79, and the other features shown in the above graph. An important observation is that almost all the categorical features in the top 30 have fewer than 25 labels; the very high-label categories like cat116 and cat110 do not rank highly, partly because of their high dimensionality.

Along with the top 30 features, I built an autoencoder and added its outputs as new features.

I then built a linear regression model as a baseline and compared the results on the plain data against the data augmented with the autoencoder features.
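A hypothetical sketch of such a baseline, fitting on the log target and reporting MAE on the original loss scale (X_features stands for either the plain encoded data or the data with autoencoder features appended):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hold out 20% for validation, fit on log(loss), score MAE on the raw scale.
X_tr, X_val, y_tr, y_val = train_test_split(X_features, y_log, test_size=0.2, random_state=0)
lr = LinearRegression().fit(X_tr, y_tr)
preds = np.expm1(lr.predict(X_val))
print("Baseline MAE:", mean_absolute_error(np.expm1(y_val), preds))
```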

The normal baseline model produced a score of :

1323.67 (Mean Absolute Error)

The baseline model trained on the data with the autoencoder features produced a score of:

1311.46 (Mean Absolute Error)

8. Understanding Autoencoders

Autoencoders use deep learning to compress (encode) the data into a lower-dimensional representation and then decode it back, so the network is forced to retain the most important information. In other words, an autoencoder performs dimensionality reduction while trying to conserve as much information as possible. Since we have many features here, this method helps us extract the most information out of the data.
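A minimal Keras sketch of a dense autoencoder for these tabular features (the layer sizes and bottleneck width are assumptions; the article does not state the exact architecture used):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(n_features: int, bottleneck: int = 32):
    # Encoder: compress the inputs down to a small bottleneck.
    inputs = keras.Input(shape=(n_features,))
    x = layers.Dense(128, activation="relu")(inputs)
    encoded = layers.Dense(bottleneck, activation="relu", name="bottleneck")(x)
    # Decoder: reconstruct the original features from the bottleneck.
    x = layers.Dense(128, activation="relu")(encoded)
    outputs = layers.Dense(n_features, activation="linear")(x)
    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, encoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# After autoencoder.fit(X_scaled, X_scaled, ...), encoder.predict(X_scaled)
# gives the compressed representation that can be appended as extra features.
```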

9. Modeling different Machine Learning architectures/algorithms

After some preprocessing steps like :

  1. Converting skewed continuous variables with Box-Cox
  2. Min-max scaling the continuous variables
  3. Combining the categorical features
  4. Obtaining the autoencoder features

I have built various models like :

  1. Linear Regression
  2. Ridge Regression
  3. Random Forest Regressor
  4. LightGBM
  5. AdaBoost
  6. CatBoost
  7. Custom Model

Custom Ensemble Model

The total training data is divided into two sets, D1 and D2. D1 contains 80% of the training data, and D2 contains the remaining 20%; it is a hold-out set that will later be used to test the performance of the final custom ensemble model. From D1 we sample (with replacement) N different datasets, which are used to train N base regressors (decision trees). A meta-regression model is then trained on the predictions of the N base models and predicts the final loss for a data point. The performance of this meta-model is finally tested on the hold-out set D2.

Source: https://rb.gy/sfaodf
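A sketch of this scheme under assumed hyperparameters (10 decision trees, max depth 8, a Ridge meta-regressor); the article does not specify the exact settings:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def custom_ensemble_mae(X, y, n_models=10, random_state=0):
    # X and y are NumPy arrays. D1 (80%) trains the base models,
    # D2 (20%) is the hold-out set for the final evaluation.
    rng = np.random.RandomState(random_state)
    X_d1, X_d2, y_d1, y_d2 = train_test_split(X, y, test_size=0.2, random_state=random_state)
    d1_preds, d2_preds = [], []
    for _ in range(n_models):
        # Sample D1 with replacement and fit one base decision tree per sample.
        idx = rng.choice(len(X_d1), size=len(X_d1), replace=True)
        tree = DecisionTreeRegressor(max_depth=8).fit(X_d1[idx], y_d1[idx])
        d1_preds.append(tree.predict(X_d1))
        d2_preds.append(tree.predict(X_d2))
    # Meta-regressor trained on the base models' predictions, tested on D2.
    meta = Ridge().fit(np.column_stack(d1_preds), y_d1)
    holdout_pred = meta.predict(np.column_stack(d2_preds))
    return mean_absolute_error(y_d2, holdout_pred)
```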

The scores obtained for each model were:

(All scores in MAE)

10. Results

Although the models did overfit somewhat on Kaggle's private dataset because of over-ensembling, the custom model improved the Kaggle score drastically to:

11. Future Work

  1. Adding more features could improve the score.
  2. Doing some feature engineering on the continuous features could improve the score.
  3. Using different loss functions in XGBoost.
  4. Trying different stacking architectures to reduce the MAE.
  5. Including different feature-engineering approaches.
  6. Hyperparameter tuning using libraries like Optuna.

12. Profile:

Github :

LinkedIn :

13. References :

  1. https://www.kaggle.com/c/allstate-claims-severity/overview
  2. https://www.kaggle.com/c/allstate-claims-severity/discussion/24520#140255
  3. https://www.kaggle.com/sharmasanthosh/exploratory-study-on-ml-algorithms
  4. https://www.kaggle.com/c/allstate-claims-severity/discussion/26427
  5. https://www.kaggle.com/cuijamm/allstate-claims-severity-score-1113-12994
  6. https://www.kaggle.com/chandrimad31/claims-severity-analysis-of-models-in-depth/notebook#Adversarial-Validation-:
  7. https://www.kaggle.com/mariusbo/xgb-lb-1106-33084
  8. https://www.kaggle.com/c/allstate-claims-severity/discussion/26416
  9. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course

Thank you.
