Sumit Gulati
15 min read · Nov 25, 2021

Introduction

The importance of correct risk prediction cannot be overemphasized in the life insurance context. Life insurance companies have to be careful about whom to insure in order to stay financially solvent. While there is no perfect formula to determine insurability, actuarial tables have traditionally been used, but they are quite time consuming. Predictive analytics offers a promising alternative for the risk prediction task.

Prudential is a US insurer that has been in the life insurance business for the last 140 years. The company has observed that, on average, only 40% of the US population has life insurance. This is partly because of the time-consuming process of classifying each individual's risk according to their medical history and background. Prudential wants to make it quicker and less labor intensive for new and existing customers to get a quote while maintaining privacy boundaries.

Business problem

The goal of this problem is to develop a simplified predictive model that accurately determines the risk level of life insurance applicants, so that decisions can be made on their insurance approvals. In our task there are 8 risk levels, with 1 being the lowest and 8 the highest.

ML formulation of the business problem

To solve the business problem using data science, we need to pose it as a classical machine learning problem. First, since the data has a target variable, it is a supervised ML problem. Further, we need to predict the risk level of the insured, and since there are 8 class labels, it is a multi-class classification problem.

In this supervised machine learning problem we will try different feature engineering techniques and different algorithms in order to predict the individual risk level.

Performance Metric

The evaluation metric used for this task is the quadratic weighted kappa (since this is a Kaggle problem, the same metric is used on the competition leaderboard).

Let’s look at what the Cohen’s kappa metric is…

Quadratic Weighted Kappa Metric : a weighted kappa is a metric used to measure the amount of agreement between predictions and actuals. A perfect score of 1.0 is granted when the predictions and actuals are identical, whereas the lowest possible score is -1, given when the predictions are furthest away from the actuals. (More Explanation)
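
For reference, this metric can be computed with scikit-learn's cohen_kappa_score by passing weights="quadratic". A minimal sketch with toy labels (not the competition data):

# Minimal sketch: quadratic weighted kappa with scikit-learn.
from sklearn.metrics import cohen_kappa_score

y_true = [1, 2, 8, 4, 5]   # actual risk levels (1-8)
y_pred = [1, 3, 8, 4, 6]   # predicted risk levels

qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.4f}")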

Objective/Business Constraints

  • No strict low-latency requirement, but the model shouldn’t take more than a minute to predict the risk level.
  • Model interpretability is required. The probability of each predicted risk level is useful, since predicting the correct risk level has a direct impact on the insurer’s portfolio.
  • Predict values as close to the actuals as possible, so as to get a kappa score close to 1.

Data Description

https://www.kaggle.com/c/prudential-life-insurance-assessment/data/

In this dataset we are provided over a hundred variables describing attributes of life insurance applicants. The dataset is anonymized, with the attributes grouped under six heads, namely product info, family info, employment info, general health measurements, medical history, and medical keywords (yes/no). The meaning of the individual attributes under these groups is unknown. There are 127 independent variables, which are either discrete, continuous, or categorical in nature.

  • train.csv — the training set, contains the Response values
  • test.csv — the test set, you must predict the Response variable for all rows in this file
  • Id : A unique identifier associated with an application.
  • Product_Info_1–7 : A set of normalized variables relating to the product applied for
  • Ins_Age : Normalized age of applicant
  • Ht : Normalized height of applicant
  • Wt : Normalized weight of applicant
  • BMI : Normalized BMI of applicant
  • Employment_Info_1–6 : A set of normalized variables relating to the employment history of the applicant.
  • InsuredInfo_1–6 : A set of normalized variables providing information about the applicant.
  • Insurance_History_1–9 : A set of normalized variables relating to the insurance history of the applicant.
  • Family_Hist_1–5 : A set of normalized variables relating to the family history of the applicant.
  • Medical_History_1–41 : A set of normalized variables relating to the medical history of the applicant.
  • Medical_Keyword_1–48 : A set of dummy variables relating to the presence or absence of a medical keyword associated with the application.
  • Response : This is the target variable, an ordinal variable relating to the final decision associated with an application.

Data Preprocessing

Performing a missing-values check on the data, we noticed the following…
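
A quick way to run this check is with pandas. A minimal sketch, assuming the training data is loaded into a DataFrame named train:

# Minimal sketch of the missing-value check on the training data.
import pandas as pd

train = pd.read_csv("train.csv")

missing_pct = train.isnull().mean() * 100                 # percentage of missing values per column
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)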

In the above, we can see that around 13 features in the train dataset have missing values. The Medical_History_10 feature has the highest missing percentage, i.e. 99.09%. So we have to handle these missing values before proceeding further.

There are two strategies we could follow when we encounter a missing value:

  • Removing the rows containing missing values.
  • Replacing the missing values.

We can’t go with the first method because our data has some features with over 90% of their values missing. The second method requires filling each missing value with the mean, median, mode, or simply a predefined value.

But before imputing we will further analyze the pattern of missing values in our data. There are 3 categories of missing data:

  1. MCAR : the value is missing completely at random. The propensity for a data point to be missing has nothing to do with its hypothetical value or with the values of other variables.
  2. MAR : the value is missing at random. The propensity for a data point to be missing is not related to the missing data itself, but it is related to some of the observed data.
  3. MNAR : the value is missing not at random. There are reasons for this; often the missing value depends on its hypothetical value or on another variable’s value.

We can analyze the pattern of missing values and check the dependency of one feature on another, but we can’t check for dependency on some hypothetical value, so the possibility of testing for MNAR is ruled out. We’ll check for MCAR/MAR in our data.

Here, we have used the missingno package, which provides convenient visualizations of missing values.
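
A minimal sketch of how such missing-value plots can be produced with missingno, assuming the training DataFrame is named train:

# Minimal sketch: missing-value visualizations with the missingno package.
import matplotlib.pyplot as plt
import missingno as msno

msno.matrix(train)     # per-row nullity pattern
msno.heatmap(train)    # correlation between the missingness of columns
plt.show()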

The above plot shows the correlation between the missingness of the features. Values close to +1 indicate that the presence of null values in one column is correlated with the presence of null values in another column. Values close to -1 indicate that the presence of null values in one column is anti-correlated with the presence of null values in another column; in other words, when null values are present in one column, data values are present in the other column, and vice versa.

Values close to 0 indicate little to no relationship between the presence of null values in one column and another.

In the above plot, Family_Hist_2 and Family_Hist_3, and Family_Hist_4 and Family_Hist_5, have high negative correlation between them, which means that when one variable is present the other is most likely missing.

Since the pairs Family_Hist_2/Family_Hist_3 and Family_Hist_4/Family_Hist_5 have high negative correlation, which is consistent with MCAR or MAR, we will remove the feature with more missing values from each pair, i.e. Family_Hist_5 and Family_Hist_3.

Apart from these, we will also remove features with more than 30% of their data missing.

Data Splitting

Before working on the data and visualizing it, we split it into train and cross-validation sets in an 80/20 ratio using the train_test_split function from the sklearn module.
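
A minimal sketch of this split; stratifying on Response is an assumption here, not necessarily what the original notebook did:

# Minimal sketch of the 80/20 train / cross-validation split.
from sklearn.model_selection import train_test_split

train_df, cv_df = train_test_split(
    train, test_size=0.2, stratify=train["Response"], random_state=42
)
print("Training data shape:", train_df.shape)
print("Cross Validation data shape:", cv_df.shape)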

Training data shape: (47504, 128)
Cross Validation data shape: (11877, 128)

Exploratory Data Analysis

  • Response Variable

We can see that the distribution of the response variable is imbalanced. Around 32% of the data is classified as the highest risk level, i.e. level 8, while the lowest risk level accounts for around 10%.

The insured with neutral risk, i.e. levels 4 and 5, account for around 2% and 9% respectively, with risk level 3 having the lowest share in our data.

Since the distribution of the response variable is imbalanced, we can apply the SMOTE technique to generate synthetic samples of the minority classes and balance the classes in our data.

  • Multivariate Analysis

After removing the missing-value features we are still left with 117 features. To explore this large dataset we perform multivariate analysis using a seaborn heatmap.

In the above plot we visualize the correlation between the different features (excluding medical history and medical keywords) and the response variable, in order to understand the relationships between them. A green highlighted box indicates high positive correlation, red indicates high negative correlation, and yellow indicates some positive correlation.

We can see that a few feature pairs, shown as boxes in red, have high negative correlation. We can take steps to remove these correlated features, since they cause multicollinearity and in general don’t improve the model. Depending on the model we are fitting, we can use different techniques, such as:

  • Combining highly correlated variables: we can use a PCA-like technique to obtain features that explain most of the variance.
  • Feature reduction: this can be done greedily, removing features that exceed a custom correlation threshold (see the sketch after this list).
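
A minimal sketch of the greedy approach, assuming the training DataFrame train_df; the 0.9 threshold is an illustrative choice, not the value used in the project:

# Minimal sketch: greedy removal of highly correlated features.
import numpy as np

corr = train_df.select_dtypes(include=[np.number]).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # keep upper triangle only

to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("Candidate features to drop:", to_drop)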

In the above plot we can see that the response variable has very high negative correlation with Medical_Keyword_3 and Medical_Keyword_15.

The above plot shows the correlations among the medical history/keyword features and the response variable. There is also high negative correlation between Medical_History_26 and 25, and between Medical_History_26 and 36.

Univariate Analysis

  • Product_Info

In the above plots of the product info features against the response variable, we can see that the data distributions are skewed for almost all products. In the case of Product_Info_1, more than 80% of the data is of type 1; similarly, for Product_Info_7 more than 80% of the data is of type 1 only.

Looking at Product_Info_3, around 85% of the data falls in category 26, while the plot for Product_Info_2 isn’t very clear.

For Product_Info_4 the data is peaked around 0.0, while the distributions with respect to the response variable overlap almost completely across risk levels.

  • Insured Age

In the above plot, it can be seen that the average insured age is lowest for response levels 3 (moderate risk) and 7 (high risk), whereas the highest average age falls under risk level 1 (i.e. the lowest).

When we look at the distribution of insured age on its own, it has a normal-like peak but a large width, giving it a block-like shape; hence the distribution is platykurtic. Also, more than 75% of the normalized ages lie between 0.2 and 0.6.

  • BMI

In the above plot, it can be seen that the average BMI is lowest for response levels 3 (moderate risk) and 7 (high risk), whereas the highest average BMI is for response category 4.

When we look at the distribution of BMI on its own, it is almost normally distributed, and 99% of the normalized BMI values lie between 0.2 and 0.8.

  • Insured_Height

In the above plot, it can be seen that the average height in each response category is more or less the same, and their distributions are also similar.

When we look at the distribution of height on its own, it isn’t quite normal: it is broad relative to its peak, which makes it platykurtic (kurtosis: -0.36269).

  • Weight

In the above plot, it can be seen that the average weight is highest for risk level 4 and lowest for risk levels 7 and 3.

When we look at the distribution of weight on its own, it is more peaked and positively skewed (skewness: 0.7034).

  • Along with the univariate analysis above, we also performed EDA on the same features after binning them. For example, for age the bins were created as young (values < 1st quartile), average (1st quartile ≤ values < 3rd quartile) and old (values > 3rd quartile).
  • Observations after performing feature binning on the Age, Height, BMI and Weight features:
  1. In the case of age, we noticed that the highest risk level, i.e. 8, is mostly made up of insured with average age. Also, the majority of the insured have average height.
  2. In the case of BMI, bins were created as normal, high and low BMI. We noticed that people with high BMI have the lowest counts of insured at risk levels 4 and 8. This supports what we inferred from the correlation matrix, namely that the response variable and BMI are negatively correlated.
  3. In the case of height, bins were created as short, average and tall. We noticed that high-risk insured mostly have average height.
  4. In the case of weight, bins were created as underweight, average and overweight. We noticed that the majority of the insured have average weight, and yet average-weight people appear more risk prone, which shouldn’t be the case.
  • Employment Information

From the above plots, we can infer that, except for Employment_Info_6, 3 and 5, all other employment information features are positively skewed. Employment_Info_3 and 5 are categorical, and we can infer that the distribution of categories is imbalanced in both features.

  • Insured Information

In the above plot we can see that InsuredInfo_1–6, which are all categorical, all seem to have imbalanced category distributions. Not much more can be inferred beyond the category distributions, since the categories and information are anonymized.

Checking for Feature Mapping
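
The helper function embedded in the original notebook is not reproduced here; a minimal sketch of what such a co-occurrence check could look like, using a pandas crosstab (the function name check_mapping is hypothetical):

# Hypothetical sketch of a feature-mapping check: cross-tabulate two
# categorical features to see whether a category of one only ever occurs
# together with a specific category of the other.
import pandas as pd

def check_mapping(df: pd.DataFrame, feat_a: str, feat_b: str) -> pd.DataFrame:
    """Return the co-occurrence counts of the categories of feat_a vs feat_b."""
    return pd.crosstab(df[feat_a], df[feat_b])

mapping = check_mapping(train_df, "Medical_History_1", "Medical_History_2")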

Using such a mapping-check function on different feature combinations, the following observations were made:

  1. Medical_History_1 has a category 223 which occurs only when category 389 of Medical_History_2 is chosen, and vice versa. However, category 223/389 occurs only once, so this analysis isn’t helpful.

Some more EDA was performed on the correlated features; you can check those plots here, under the “Visualizing/Analyzing the correlated features” header.

Data Wrangling

Handling Outliers

A separate dataset was created with the outliers removed, to check later whether removing outliers impacts our model results or not.

Imputing Missing Values

We still have 4 columns left with missing data. We will impute the missing values in these columns of the train data with some strategy, and the same strategy will be used on the test data.

# checking the distribution plots of the remaining null columns, to determine which imputation method fits best

For imputing the missing values we will use the median, since the distributions of all the features with missing values in the training data are skewed. We will follow the same imputation strategy for the test data.

We will use SimpleImputer from the sklearn library with the imputation strategy set to median. This function makes imputation easy.
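
A minimal sketch, fitted on the training split and reused on the cross-validation data; the column names listed are assumed to be the remaining null columns:

# Minimal sketch: median imputation with scikit-learn's SimpleImputer.
from sklearn.impute import SimpleImputer

null_cols = ["Employment_Info_1", "Employment_Info_4",
             "Employment_Info_6", "Medical_History_1"]   # assumed remaining null columns

imputer = SimpleImputer(strategy="median")
train_df[null_cols] = imputer.fit_transform(train_df[null_cols])
cv_df[null_cols] = imputer.transform(cv_df[null_cols])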

Although later, while training models, we found that simply imputing -1 in place of the missing values worked well, since XGBoost handles missing values itself.

Feature Engineering

Several features were engineered based on the visualizations and research done (a sketch of a few of these transformations follows the list).

  1. Binning the Height, Weight, BMI and Age features separately into quantile-based categories; for Age, for example, we defined young, average and old.
  2. Deleting all medical keyword columns from the data and replacing them with their row-wise sum only. This reduces the sparsity of the medical keyword features, since in total we have 48 medical keywords with 0/1 values.
  3. Adding new features that are products of other features, like Age*BMI, Height*Age and Weight*Age.
  4. Defining a new feature with categories high risk and low risk. High risk here means that the insured is overweight or underweight, has high or low BMI, or has high age; 1 denotes high risk and 0 low risk.
  5. After analysis we noticed that Employment_Info_3 category 3 occurs only when Employment_Info_2 category 1 is selected, so we created a separate feature for their co-occurrence and dropped the originals.
  6. Dropping the InsuredInfo_5 and height features, and in their place adding Ht/InsuredInfo_6 as a new feature.
  7. Splitting Product_Info_2 into a character and a number, e.g. D3 into D and 3, A1 into A and 1.
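
A minimal sketch of a few of these transformations, assuming the pandas DataFrame train_df with the original Kaggle column names:

# Minimal sketch of some of the engineered features described above.
import pandas as pd

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Replace the sparse medical keyword dummies with their row-wise sum.
    keyword_cols = [c for c in df.columns if c.startswith("Medical_Keyword_")]
    df["Medical_Keyword_Sum"] = df[keyword_cols].sum(axis=1)
    df = df.drop(columns=keyword_cols)

    # Interaction features.
    df["Age_BMI"] = df["Ins_Age"] * df["BMI"]
    df["Age_Ht"] = df["Ins_Age"] * df["Ht"]
    df["Age_Wt"] = df["Ins_Age"] * df["Wt"]

    # Split Product_Info_2 (e.g. "D3") into its character and number parts.
    df["Product_Info_2_char"] = df["Product_Info_2"].str[0]
    df["Product_Info_2_num"] = df["Product_Info_2"].str[1:].astype(int)

    return df

train_fe = engineer(train_df)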

After engineering the new features, we created another dataset with the outliers removed. So we now have three datasets: the raw one, on which we will train the baseline; one with engineered features; and one with engineered features and outliers removed.

Encoding Data

Now we have to encode the data to make it suitable for modelling. We are already provided with normalized data, so the only thing required is to encode the categorical variables.

For encoding the categorical variables we tried two different techniques on two of the three datasets defined above (leaving out the raw dataset): label encoding and one-hot encoding.
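
A minimal sketch of the two encodings with pandas and scikit-learn; the list of categorical columns shown is an illustrative subset, not the full list used in the project:

# Minimal sketch: one-hot encoding vs. label encoding of categorical columns.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

categorical_cols = ["Product_Info_1", "Product_Info_3",
                    "Employment_Info_2", "InsuredInfo_1"]   # illustrative subset

# One-hot encoding.
train_ohe = pd.get_dummies(train_fe, columns=categorical_cols)

# Label encoding.
train_le = train_fe.copy()
for col in categorical_cols:
    train_le[col] = LabelEncoder().fit_transform(train_le[col].astype(str))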

SMOTE for imbalanced dataset

SMOTE is Synthetic Minority Oversampling Technique.

… SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.

— Page 47, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
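
In practice this can be done with the imbalanced-learn library; a minimal sketch, applied only to the training split (column names assumed as in the Kaggle data):

# Minimal sketch: oversampling the minority risk levels with SMOTE.
from imblearn.over_sampling import SMOTE

X_train = train_ohe.drop(columns=["Id", "Response"])
y_train = train_ohe["Response"]

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
print(y_res.value_counts())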

Modelling

Baseline Model

The random model fitted on the raw data gave a kappa score of 0.00563.

We also fitted different classification models such as KNN, Naive Bayes, Logistic Regression, and an SGD Classifier with log/hinge loss.

Decision Tree

Kappa Score for prediction with Training set: 0.3987740179774045
Kappa Score for prediction with Cross Validation set: 0.3902734002785311

The above model was trained with one-hot encoding, median imputation, and all engineered features included.

Testing for misclassifications.

We can see that the model is most confused between classes 1, 2 and 6, whereas it predicts the other classes somewhat better.

Random Forest

Kappa Score for prediction with Training set: 0.49990379064845103
Kappa Score for prediction with Cross Validation set: 0.4230374551051821

Testing for misclassification

There’s not much improvement in the risk level classifications compared to the decision tree.

LightGBM

Kappa Score for prediction with Training set: 0.5258986411574443
Kappa Score for prediction with Cross Validation set: 0.4768745877006524

The results from LightGBM with median imputation and one-hot encoding are the best so far.
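
A minimal sketch of the kind of LightGBM model evaluated here, assuming X_train/y_train and X_cv/y_cv are the preprocessed train and cross-validation matrices; the hyperparameters are illustrative, not the tuned values:

# Minimal sketch: LightGBM classifier scored with quadratic weighted kappa.
from lightgbm import LGBMClassifier
from sklearn.metrics import cohen_kappa_score

lgbm = LGBMClassifier(n_estimators=500, learning_rate=0.05, random_state=42)
lgbm.fit(X_train, y_train)

print("Train kappa:",
      cohen_kappa_score(y_train, lgbm.predict(X_train), weights="quadratic"))
print("CV kappa:",
      cohen_kappa_score(y_cv, lgbm.predict(X_cv), weights="quadratic"))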

Testing for misclassification

The model is still confused between classes 1, 2 and 6, but LightGBM showed a slight improvement on these classes compared to the previously trained models.

Custom Stacking Classifier

A custom stacking classifier was also tried on the final data. Here are the steps that were followed to train the custom model:

  1. Split the whole data into train and test sets (80:20).
  2. Now split the 80% train set into D1 and D2 (50:50).
  3. From D1, sample with replacement to create d1, d2, d3, …, dk (k samples). Now create k models and train each of them on one of these k samples.
  4. Pass the D2 set to each of these k models; you will get k predictions for D2, one from each model.
  5. Using these k predictions, create a new dataset for D2. Since you already know D2’s corresponding target values, you can train a meta model on these k predictions.
  6. For model evaluation, use the 20% of data kept as the test set. Pass the test set to each of the base models to get k predictions, create a new dataset from them, and pass it to the meta model to get the final prediction. Using this final prediction and the test-set targets, you can calculate the model’s performance score.

Here is a sample of such a setup with LGBM as the base model and LGBM as the meta model.
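
Since the original embedded code is not shown here, the following is a minimal sketch of the procedure described above, assuming pandas inputs; k and the hyperparameters are illustrative:

# Hedged sketch of the custom stacking classifier with LGBM base and meta models.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

def custom_stacking(X, y, k=50, seed=42):
    rng = np.random.RandomState(seed)

    # Steps 1-2: 80/20 train/test split, then split the train set into D1 and D2.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    X_d1, X_d2, y_d1, y_d2 = train_test_split(X_tr, y_tr, test_size=0.5, random_state=seed)

    # Step 3: k bootstrap samples of D1, one base model per sample.
    base_models = []
    for _ in range(k):
        idx = rng.choice(len(X_d1), size=len(X_d1), replace=True)
        model = LGBMClassifier(n_estimators=100, random_state=seed)
        model.fit(X_d1.iloc[idx], y_d1.iloc[idx])
        base_models.append(model)

    # Steps 4-5: base-model predictions on D2 become the meta model's features.
    meta_train = np.column_stack([m.predict(X_d2) for m in base_models])
    meta_model = LGBMClassifier(n_estimators=100, random_state=seed)
    meta_model.fit(meta_train, y_d2)

    # Step 6: evaluate on the held-out 20% test set.
    meta_test = np.column_stack([m.predict(X_te) for m in base_models])
    return cohen_kappa_score(y_te, meta_model.predict(meta_test), weights="quadratic")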

Results for lightgbm model
k: [20, 50, 100, 200, 500]
Kappa score: [0.5482780649814829, 0.5505648648899505, 0.5571184425587261, 0.5591000431228068, 0.556939874179319]

XGB Classifier with Offset

The above XGBoost offset model gave the best score. This model applies an offset technique that uses the fmin_powell function from the SciPy optimize module. fmin_powell minimizes a function using a modified Powell’s method, which only uses function values, not derivatives.

To optimize the test predictions with respect to the train predictions, several offset values were tried.
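
A hedged sketch of the offset idea: learn one additive offset per predicted class on the training predictions by maximizing the quadratic weighted kappa with fmin_powell, then apply the same offsets to the test predictions. The variables train_preds, test_preds and y_train are assumed to come from an already-fitted XGBoost model; details differ from the original notebook:

# Hedged sketch: per-class offsets optimized with scipy's fmin_powell.
import numpy as np
from scipy.optimize import fmin_powell
from sklearn.metrics import cohen_kappa_score

def apply_offsets(preds, offsets):
    # Shift each continuous prediction by the offset of its rounded class,
    # then clip back into the 1-8 label range.
    cls = np.clip(np.round(preds), 1, 8).astype(int) - 1
    shifted = preds + np.asarray(offsets)[cls]
    return np.clip(np.round(shifted), 1, 8).astype(int)

def negative_kappa(offsets, preds, y_true):
    return -cohen_kappa_score(y_true, apply_offsets(preds, offsets), weights="quadratic")

init_offsets = np.zeros(8)
best_offsets = fmin_powell(negative_kappa, init_offsets,
                           args=(train_preds, y_train), disp=False)
test_labels = apply_offsets(test_preds, best_offsets)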

Summarizing Results

Results using the Tree Based Algorithms and Ensemble techniques

Results using the Custom Stacking Classifier

Results obtained using XGB Classifier with Offset

Conclusion

  • The best results were obtained using the XGBoost model with offsetting applied to the test predictions using the train predictions.
  • The final model gives us a kappa score of 0.66937 on the test data. Since this is a Kaggle problem, the top leaderboard score is 0.679, and we tried to get very close to it.

Deployment

The final model weights were saved after training. I have built a Streamlit application that asks for user input and predicts the risk level for it.
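
A hedged, heavily simplified sketch of such an app; the file name final_model.pkl and the two inputs shown are placeholders, since the real app uses the full feature vector:

# Hedged sketch of a minimal Streamlit app for risk-level prediction.
import pickle
import numpy as np
import streamlit as st

model = pickle.load(open("final_model.pkl", "rb"))   # assumed name of the saved model

st.title("Insurance Risk Level Prediction")
age = st.number_input("Normalized age", 0.0, 1.0, 0.4)
bmi = st.number_input("Normalized BMI", 0.0, 1.0, 0.5)

if st.button("Predict"):
    features = np.array([[age, bmi]])                # the real app builds the full feature vector
    st.write("Predicted risk level:", int(model.predict(features)[0]))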

Web App

https://share.streamlit.io/sumit-github08/insurance_risk_classification/app.py/

Future Work

Since the XGBoost model proved to be the best model, I have included the feature importance using xgboost.plot_importance with importance type “weight”. We know that there are three types of importance (weight, gain and cover), but they can contradict each other, so we can include SHAP analysis as part of future work for better feature analysis.
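
A minimal sketch of the importance plot mentioned above, assuming xgb_model is the trained XGBoost model:

# Minimal sketch: plotting 'weight' feature importance from XGBoost.
import matplotlib.pyplot as plt
import xgboost

xgboost.plot_importance(xgb_model, importance_type="weight", max_num_features=20)
plt.show()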
