Life Insurance Application Assessment Prediction

Weichen Lu
Data Science is life
6 min read · Apr 7, 2018

Insurance has become an indispensable part of our lives in recent years, and people are paying more attention to it. There are multiple types of insurance, of which auto insurance and life insurance are the most common. In this article, I will focus on life insurance, since it is the subject of my project.

What is Insurance?

According to Wikipedia, insurance is a means of protection from financial loss. It is a form of risk management, primarily used to hedge against the risk of a contingent, uncertain loss. To put it simply, insurance is a tool that covers losses caused by accidents or other unpredictable events.

How to apply?

The general process of a life insurance application involves four steps. First, you submit the application form, which contains personal information such as your medical history, beneficiary information, etc. Then, a medical exam is scheduled. Next, the insurance company reviews your personal information and medical history in order to draw up an insurance policy for you based on its assessment. Once that is done, you are approved for coverage and the insurance policy is sent to you within days.

Project background

For this project, the data comes from the Prudential Life Insurance Assessment competition on Kaggle. Prudential's challenge is that their application process is antiquated and time-consuming; the goal of this project is to make processing more efficient and less labor-intensive for both new and existing customers.

Data Exploration

The dataset contains training and testing data with 128 features in total. The response variable is an ordinal measure of risk with 8 levels.

Training data: 59,381 rows

Testing data: 19,765 rows (no labels in the response variable)

Features

Data Preprocessing

The first step is to check whether there are any missing values in the dataset. I used Python to extract the columns that contain missing values, along with the percentage of values missing in each.
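As a minimal sketch of this step (assuming the competition's train.csv has been downloaded from Kaggle):

```python
import pandas as pd

# Load the Prudential training data
train = pd.read_csv('train.csv')

# Percentage of missing values per column, keeping only columns with any NaNs
missing_pct = (train.isnull().mean() * 100).round(2)
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```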

Bar plots of the missing-value percentages (training and testing data)

As we can see from the graphs, most of the values in Medical_History_10, Medical_History_24, and Medical_History_32 are missing. There are several ways to handle missing values, such as complete-case analysis, imputation, and the missing-indicator method. In this case, I simply dropped these three columns because of their large percentages of missing values. For the remaining columns, I used the missing-indicator approach: set each missing value to a fixed value, then create an extra dummy variable for the column to indicate whether the original value was missing.
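A sketch of both steps, reusing the train DataFrame from above (the fixed fill value of -1 is my assumption; the post does not say which value was used):

```python
# Drop the three columns that are mostly missing
cols_to_drop = ['Medical_History_10', 'Medical_History_24', 'Medical_History_32']
train = train.drop(columns=cols_to_drop)

# Missing-indicator approach for the remaining columns with NaNs
for col in train.columns[train.isnull().any()]:
    train[col + '_missing'] = train[col].isnull().astype(int)  # dummy indicator
    train[col] = train[col].fillna(-1)                         # fixed fill value
```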

After that, I checked the size of each class to see whether the data is balanced or not.

Bar plot of class sizes

The above graph shows that class 8 has the most data, while class 3 has the least. There are several approaches to dealing with imbalanced data. I used SMOTE, which is an oversampling method: it creates synthetic samples from the minority class instead of simply duplicating existing ones. The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time by a random amount within the difference to the neighbouring instances.
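A minimal sketch using imbalanced-learn's SMOTE implementation, assuming X and y hold the preprocessed training features and the 8-level response:

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority classes with synthetic examples
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X, y)
```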

After SMOTE

Feature Engineering

There is a categorical variable called Product_Info_2 that contains both a character and a number. You could use a label encoder together with one-hot encoding (or pd.get_dummies) to extract features from it. However, I used another method: pandas' factorize. I factorized the column, split it into its character and number parts, and created two additional columns holding the factorized character and number.
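A sketch of this approach (the column names Product_Info_2_char and Product_Info_2_num are my choices):

```python
# Split Product_Info_2 (e.g. 'D3') into its character and number parts,
# then factorize each part into integer codes
train['Product_Info_2_char'] = pd.factorize(train['Product_Info_2'].str[0])[0]
train['Product_Info_2_num'] = pd.factorize(train['Product_Info_2'].str[1])[0]
train['Product_Info_2'] = pd.factorize(train['Product_Info_2'])[0]
```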

After factorization

Drop the original column and rename it.

Moreover, I created a new feature by multiplying BMI by Ins_Age, since I think this interaction is a useful signal for the model to learn.
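In code this is a one-liner (the column name BMI_Age is my choice):

```python
# Interaction feature between body mass index and age
train['BMI_Age'] = train['BMI'] * train['Ins_Age']
```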

After feature creation

The Medical_Keyword columns, 48 in total, are a set of dummy variables indicating the presence or absence of a medical keyword associated with the application. I added a column that sums the counts of all these dummy variables.
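A sketch of the count feature:

```python
# Sum the 48 Medical_Keyword dummy columns into a single count per application
keyword_cols = [c for c in train.columns if c.startswith('Medical_Keyword')]
train['Medical_Keywords_Count'] = train[keyword_cols].sum(axis=1)
```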

Medical_Keywords_Count

Once feature engineering is complete, the data is ready for modeling.

Modeling

The algorithms I used are Random Forest, Logistic Regression, and XGBoost. I used cross-validation on the training data to check model performance, then refit each model on the entire training set and submitted the predictions to Kaggle for assessment.
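A sketch of the comparison loop, scored with quadratic weighted kappa (the hyperparameters shown are placeholders, not the tuned values from this project):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

qwk = make_scorer(cohen_kappa_score, weights='quadratic')

models = {
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'XGBoost': XGBClassifier(),  # recent XGBoost expects labels 0-7, so y may need shifting
}

# 5-fold cross-validation on the training data
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring=qwk)
    print(f'{name}: {scores.mean():.4f}')
```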

Evaluation

The evaluation metric is quadratic weighted kappa. It measures the agreement between two ratings and typically varies from 0 (random agreement) to 1 (complete agreement). $O_{i,j}$ corresponds to the number of applications that received a rating i from rater A and a rating j from rater B; $E_{i,j}$ is the expected number of such applications under chance agreement; N is the number of rating classes.

Weight: $w_{i,j} = \dfrac{(i - j)^2}{(N - 1)^2}$

Formula: $\kappa = 1 - \dfrac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}}$, where $E$ is normalized so that its total matches that of $O$.
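For reference, here is a direct implementation of the formula above; it should agree with scikit-learn's cohen_kappa_score(..., weights='quadratic'):

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, N=8):
    # Observed agreement matrix O (ratings are 1..N)
    O = np.zeros((N, N))
    for i, j in zip(rater_a, rater_b):
        O[i - 1, j - 1] += 1

    # Quadratic weights w, and chance-expected matrix E scaled to O's total
    w = np.array([[(i - j) ** 2 / (N - 1) ** 2 for j in range(N)]
                  for i in range(N)])
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

    return 1 - (w * O).sum() / (w * E).sum()
```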

Evaluation comparison

Evaluation bar plots (before SMOTE)

As we can see from the bar plots above, the purple bars represent the kappa scores from cross-validation and the yellow bars represent the scores after submitting to Kaggle. I implemented three algorithms; XGBoost advanced is the model after hyperparameter tuning, and it gives the best score.

Now, let's check the scores after applying SMOTE.

After SMOTE

The cross-validation score of Random Forest jumps up dramatically while the actual score does not increase significantly, which means the model is overfitting. For the other models, the scores do not differ much. In a nutshell, applying SMOTE to this dataset does not really improve the score; it only makes the models overfit in cross-validation.

Improvement

From the Kaggle discussion boards, applying offsets may improve performance: the predictions generated by XGBoost are shifted by values that increase the score. One possible reason is that the response variable is ordinal. Using reg:linear in XGBoost gives you continuous regression predictions, and there is a solution that optimizes a function to transform these regression predictions into the desired discrete outcomes by applying per-class offsets before rounding.
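A simplified sketch of the offset idea (the helper names are mine, and optimizing all eight offsets at once is a simplification of the per-class search discussed on the forums):

```python
import numpy as np
from scipy.optimize import fmin
from sklearn.metrics import cohen_kappa_score

def apply_offsets(raw_pred, offsets):
    # Shift each continuous prediction by the offset of the class it
    # currently rounds to, then round and clip back into the 1-8 scale
    cls = np.clip(np.round(raw_pred).astype(int), 1, 8)
    shifted = raw_pred + offsets[cls - 1]
    return np.clip(np.round(shifted), 1, 8).astype(int)

def neg_qwk(offsets, raw_pred, y_true):
    # Negative quadratic weighted kappa, since fmin minimizes
    return -cohen_kappa_score(y_true, apply_offsets(raw_pred, offsets),
                              weights='quadratic')

# raw_pred: continuous predictions from XGBoost with reg:linear
# y_true:   true ratings (1-8) on the training data
# offsets = fmin(neg_qwk, np.zeros(8), args=(raw_pred, y_true))
```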

Applying offsets to my predictions gives me a score of 0.66659, which is close to the top of the leaderboard (0.67938).

Conclusion

In the end, I learned a lot from this project, and it gave me a more comprehensive understanding of machine learning techniques.
