Insurance Prediction with Machine Learning

A simple introduction to machine learning.

İlayda İsmailoğlu
IKU Deep Learning
10 min read · Oct 30, 2020


0 or 1?

In this guide, we will learn how to train a machine learning model with DecisionTreeClassifier, RandomForestClassifier, KNeighborsClassifier, and BaggingClassifier. First, we will do exploratory data analysis to understand the data. We will apply label encoding and the K-Means algorithm, and examine the correlation map, what our dataset contains, and whether there are any missing values.

The dataset we will be working on is a health insurance cross-sell prediction dataset.

Context

Our client is an insurance company that has provided health insurance to its customers. Now they need help building a model to predict whether the policyholders (customers) from the past year will also be interested in the vehicle insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimise its business model and revenue.

Data Description

  • id = Unique ID for the customer
  • Gender = Gender of the customer
  • Age = Age of the customer
  • Driving_License = 0 : Customer does not have DL, 1 : Customer already has DL
  • Region_Code = Unique code for the region of the customer
  • Previously_Insured = 1 : Customer already has Vehicle Insurance, 0 : Customer doesn’t have Vehicle Insurance
  • Vehicle_Age = Age of the Vehicle
  • Vehicle_Damage = 1 : Customer got his/her vehicle damaged in the past. 0 : Customer didn’t get his/her vehicle damaged in the past.
  • Annual_Premium = The amount customer needs to pay as premium in the year
  • Policy_Sales_Channel = Anonymized code for the channel of outreach to the customer, i.e. different agents, over mail, over phone, in person, etc.
  • Vintage = Number of days the customer has been associated with the company
  • Response = 1 : Customer is interested, 0 : Customer is not interested

Data Preprocessing

The basic libraries that we will be using are NumPy, pandas, matplotlib, os and seaborn. After importing the libraries, we read the data and take a look at what we have.

Import Datasets
train.head()
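A minimal sketch of these first steps, assuming the Kaggle files are named train.csv and test.csv:

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the train and test files and preview the first rows
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()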

Our dataset looks like this. We have already looked at the data description above. Let's look at the null values to learn more about the dataset:
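A quick way to do this check with pandas (a minimal sketch):

# Check the number of rows and whether any column has missing values
print(train.shape, test.shape)
print(train.isnull().sum())
print(test.isnull().sum())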

We have 381109 rows in the train data and 127037 rows in the test data, and no null values. Most of the real-world data that we get is messy, so we need to clean it before feeding it into a machine learning model, but since we are just starting, we will be working with a clean dataset.

Label Encoding

In machine learning, we usually deal with datasets that contain categorical labels in one or more columns. These labels can be in the form of words or numbers. To make the data understandable and human readable, the training data is often labeled with words. Label encoding refers to converting these labels into numeric form so that they become machine readable.

We replaced some values in the datasets with numerical values, as follows (a code sketch appears after the lists below):

Vehicle Age ->

  • “<1 Year” = 0
  • “1–2 Year” = 1
  • “>2 Year” = 2

Gender ->

  • “Female” = 0
  • “Male” = 1

Vehicle Damage ->

  • “No” = 0
  • “Yes” = 1
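A minimal sketch of this replacement with pandas; the exact category strings are assumptions and may differ slightly in the raw data:

# Map the categorical columns to numeric codes in both train and test sets
mappings = {
    "Vehicle_Age": {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2},
    "Gender": {"Female": 0, "Male": 1},
    "Vehicle_Damage": {"No": 0, "Yes": 1},
}
for df in (train, test):
    for column, mapping in mappings.items():
        df[column] = df[column].map(mapping)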

Now all values are numeric.

train.head()

Correlation
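The correlation map can be drawn with seaborn; a minimal sketch:

# Plot a heatmap of pairwise correlations between the numeric columns
plt.figure(figsize=(12, 8))
sns.heatmap(train.corr(), annot=True, cmap="coolwarm")
plt.show()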

We can see that the most influencing factors for Response are Vehicle_Damage and Previously_Insured, followed by Vehicle_Age and Policy_Sales_Channel.

Overview Of The Data Set

Once you understand the data you have, the next step is to start looking for relationships between data items. This is called exploratory data analysis and it usually focuses on the correlation between variables.

train.Response

The number of 1 values in the Response column is quite low. We found the total number of rows with value_counts().sum(). Comparing the count of 1s to this total, we saw that only about 12% of the values in the dataset are 1.
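A minimal sketch of this check:

# Count the responses and compute the share of positive (1) responses
counts = train["Response"].value_counts()
print(counts)
print("Share of 1s: {:.0%}".format(counts[1] / counts.sum()))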

Most of the customers with response 1 have vehicles between 1 and 2 years old, and their vehicles have been damaged.

Customers who were previously insured tend not to be interested. We can speculate that this is because their previous insurance agreement has not yet expired.

The most used sales channels are 152, 26 and 124. The best channel that results in customer interest is 152.

Model Building

We will use the scikit-learn library to build the models.

import libraries

First, I delete the “id” column as it will not contribute to model training.

I assign the “Response” column to y and the other columns to X.

We will divide our data into 4 variables: X_train and y_train for training, and X_test and y_test for testing the model at the end of training.

The test_size parameter specifies what percentage of the dataset should be reserved for testing, and random_state seeds the random generator so that you always get the same result; the most commonly used value for this is 42.
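A minimal sketch of these steps; the test_size of 0.2 is an assumption, as the article does not state the exact split:

from sklearn.model_selection import train_test_split

# Drop the "id" column, separate the target, and split into train/test sets
train = train.drop("id", axis=1)
X = train.drop("Response", axis=1)
y = train["Response"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)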

K-Means algorithm

Actually, this is a very detailed subject, but in this guide I will only give a brief explanation of the models.

The k-means clustering algorithm is one of the most widely used algorithms. Clustering algorithms are used to group data with similar characteristics in a dataset. The letter “k” in the name of the algorithm indicates the number of clusters: the given n data points are placed into k clusters.

‘k-means++’ selects initial cluster centers for k-means clustering in a smart way to speed up convergence.

We divided our data into 5 clusters with the K-Means algorithm.
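A minimal sketch of the clustering step; running K-Means directly on the encoded features, without scaling, is an assumption:

from sklearn.cluster import KMeans

# Fit K-Means with 5 clusters using the k-means++ initialization
kmeans = KMeans(n_clusters=5, init="k-means++", random_state=42)
train["Cluster"] = kmeans.fit_predict(X)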

Apparently most of the customers are in cluster 2.

I selected the customers in cluster 2 and used ExtraTreesClassifier() to find the most important features, so we can get an idea of why customers gather in cluster 2.
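A minimal sketch of this step; using Response as the target for the importance ranking is an assumption:

from sklearn.ensemble import ExtraTreesClassifier

# Keep only the customers in cluster 2 and rank the feature importances
cluster_2 = train[train["Cluster"] == 2]
X_c2 = cluster_2.drop(["Response", "Cluster"], axis=1)
y_c2 = cluster_2["Response"]

et = ExtraTreesClassifier(random_state=42)
et.fit(X_c2, y_c2)
importances = pd.Series(et.feature_importances_, index=X_c2.columns)
print(importances.sort_values(ascending=False))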

DecisionTreeClassifier

Decision Tree Classifier is a supervised machine learning algorithm in which the data is continuously split according to a certain parameter. It poses a series of carefully crafted questions about the attributes of the test record. Each time it receives an answer, a follow-up question is asked until a conclusion about the class label of the record is reached.

By sending the X_test data to the predict function, we get the model’s predictions.
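A minimal sketch of training the tree and predicting; default parameters are an assumption:

from sklearn.tree import DecisionTreeClassifier

# Train a decision tree on the training split and predict on the test split
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)
dtc_pred = dtc.predict(X_test)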

On this graph we can see how many of our predictions are correct.

  • The bottom right corner is the number of values we guessed to be 1 and are actually 1, so it is True Positive.
  • The bottom left corner is the number of values we guessed to be 0 but are actually 1, so it is False Negative.
  • The top right corner is the number of values we guessed to be 1 but are actually 0, so it is False Positive.
  • The top left corner is the number of values we guessed to be 0 and are actually 0, so it is True Negative.
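The plot described above comes from a confusion matrix; a minimal sketch of how it can be produced:

from sklearn.metrics import confusion_matrix

# Rows are the actual classes (0 on top, 1 on the bottom),
# columns are the predicted classes (0 on the left, 1 on the right)
cm = confusion_matrix(y_test, dtc_pred)
sns.heatmap(cm, annot=True, fmt="d")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()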

You can examine to the table below for more explanation :

After building the model, let’s check its accuracy to see how well it works.

accuracy_score = Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right.

precision_score = It shows how many of the values we guess as Positive are actually Positive.

recall_score = It shows how many of the values that are actually positive we correctly predicted as positive.

f1_score = The F1 Score value shows us the harmonic mean of the Precision and Recall values.
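A minimal sketch of computing these four metrics for the decision tree predictions:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy :", accuracy_score(y_test, dtc_pred))
print("Precision:", precision_score(y_test, dtc_pred))
print("Recall   :", recall_score(y_test, dtc_pred))
print("F1       :", f1_score(y_test, dtc_pred))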

RandomForestClassifier

RandomForestClassifier generates multiple decision trees. When producing a result, the predictions of these decision trees are averaged and the final result is produced.

We created our RandomForestClassifier model and got its predictions.

We checked the score values I explained one by one above for the RandomForestClassifier model.

I’ll do the same for the other models:

KNeighborsClassifier

The K Nearest Neighbors algorithm is a classification algorithm whose purpose is to classify our dataset and then assign data whose class is unknown to the closest class. The number of neighbors to look at is determined by a value K. When a new value arrives, its distance to the nearest K elements is calculated; the Euclidean distance is generally used for this calculation. After the distances are calculated, they are sorted and the value is assigned to the appropriate class.

BaggingClassifier

A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions to form a final prediction.

Let’s compare the accuracy scores of all the models:
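A minimal sketch of the comparison; the default parameters are assumptions and the original kernel may tune them differently:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Fit each model on the training split and compare accuracy on the test split
models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "BaggingClassifier": BaggingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))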

It seems that KNeighborsClassifier has the best accuracy score, so I am going to use this model for submission.csv.

File Submission

Since we wanted to predict the responses of the customers in the test data, we sent the test data to the prediction function instead of X_test. But we do not want to use the “id” column of the test data in the prediction, so we used test.columns[1:].
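A minimal sketch of this step; the submission column names "id" and "Response" are assumptions based on the competition format:

# Predict with the fitted KNeighborsClassifier on the competition test data,
# skipping the "id" column, and write the submission file
knn = models["KNeighborsClassifier"]
test_pred = knn.predict(test[test.columns[1:]])
submission = pd.DataFrame({"id": test["id"], "Response": test_pred})
submission.to_csv("submission.csv", index=False)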

Our submission data is as follows:

Conclusion

We preprocessed our dataset, checked whether there were null values, converted our data to numerical values with label encoding, and took a look at the correlation map. Then we tried to visualize our data to better understand the dataset.

We clustered with the K-means algorithm and used ExtraTreesClassifier to find important features.

Finally, we created four different models using DecisionTreeClassifier, RandomForestClassifier, KNeighborsClassifier and BaggingClassifier. When we compared the accuracy scores, KNeighborsClassifier gave the best rate with 87.14%, and we used this model in our submission file. I hope this has been useful.

References

You can find the complete code in the Kaggle kernel here:

https://www.kaggle.com/ilaydaioglu/insurance-prediction-acc-0-87

And here is the Health Insurance Cross Sell Prediction data link:

https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction
