Creating a Tool for Predicting Health Insurance Charges

Amjad El Baba
5 min read · Apr 6, 2022


Image credit: www.insuranceneighbor.com

Every insurance company needs a reliable way to determine the charges (premiums) of its customers based on their attributes and circumstances. This project was built to address that need, and to draw insights from the company’s data that could boost its work and help it understand how to interact with customers.

So, without any fluff, here is what this post covers:

  • Importing Libraries.
  • Importing Dataset & Data Cleaning.
  • Exploratory Data Analysis (EDA).
  • Model Selection.
  • Model Tuning.
  • Model Building After Choosing the Best Algorithm Acting on Our Model With The Right Hyper Parameter Value.
  • Model Testing.

You can find the dataset used here.

Let’s start!

Importing Libraries

Importing Dataset & Data Cleaning

The data looks like this:

Fig. 1

Data Cleaning

Luckily, the dataset was about 95% clean and organized, but I made a few edits:

  • Mapping the smoker column’s values: “yes” => 1 and “no” => 0.
  • Rounding the charges values to make them easier to read.
  • Exporting the cleaned dataset.
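The three cleaning steps above could look roughly like the sketch below. The sample rows are made up, rounding to two decimals is my assumption (the outputs later in this post show two decimals), and the file name mentioned in the comment is hypothetical.

```python
import io

import pandas as pd

# Tiny made-up sample with the same columns as the insurance dataset.
raw = pd.DataFrame(
    {
        "age": [19, 47, 33],
        "sex": ["female", "male", "male"],
        "bmi": [27.9, 29.8, 22.7],
        "children": [0, 3, 0],
        "smoker": ["yes", "yes", "no"],
        "region": ["southwest", "southwest", "northwest"],
        "charges": [1234.5678, 2345.6712, 3456.789],
    }
)

cleaned = raw.copy()
# Encode the smoker flag as an integer: "yes" -> 1, "no" -> 0.
cleaned["smoker"] = cleaned["smoker"].map({"yes": 1, "no": 0})
# Round charges to two decimals (assumed precision).
cleaned["charges"] = cleaned["charges"].round(2)

# Export as CSV. A real run would pass a file path such as the
# hypothetical "insurance_cleaned.csv" instead of an in-memory buffer.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
print(buffer.getvalue())
```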

Exploratory Data Analysis (EDA)

A very important part of any data science project is telling a story with the data you worked on, and drawing conclusions that help your company make the right, most effective business decisions.

Without a doubt, visuals are the best way to deliver your conclusions. You can check out one of my earlier posts about visuals.

Features Correlation

Fig. 2
Fig. 3

There is a clear correlation between age and charges, and between smoker status and charges.
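For readers who prefer numbers to heatmaps, here is a minimal sketch of how the correlation with charges could be computed with pandas. The rows are made up for illustration, so the values will not match the figures.

```python
import pandas as pd

# Made-up rows for illustration; the real dataset has many more.
df = pd.DataFrame(
    {
        "age": [19, 25, 33, 41, 52, 60],
        "bmi": [27.9, 31.2, 22.7, 29.8, 35.1, 26.4],
        "children": [0, 1, 0, 3, 2, 0],
        "smoker": [1, 0, 0, 1, 0, 1],
        "charges": [17000.0, 3500.0, 4800.0, 27000.0, 11000.0, 39000.0],
    }
)

# Pearson correlation of every feature with charges, strongest first.
corr_with_charges = (
    df.corr()["charges"].drop("charges").sort_values(ascending=False)
)
print(corr_with_charges)
```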

Exploring the Distribution of Sex & Smoker Among the Data

Fig. 4

As you can see, the first plot shows a roughly even split between female and male customers. In contrast, the second plot shows about four times as many non-smokers as smokers.
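The counts behind plots like these can be checked with value_counts. This is a small sketch on made-up rows that I chose to mirror the even sex split and the roughly 4:1 smoker ratio described above.

```python
import pandas as pd

# Made-up rows: 5 female / 5 male, 8 non-smokers / 2 smokers.
df = pd.DataFrame(
    {
        "sex": ["female", "male"] * 5,
        "smoker": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    }
)

sex_counts = df["sex"].value_counts()
smoker_counts = df["smoker"].value_counts()

# Ratio of non-smokers (0) to smokers (1).
non_smoker_ratio = smoker_counts.loc[0] / smoker_counts.loc[1]
print(sex_counts)
print(smoker_counts)
print(f"Non-smokers per smoker: {non_smoker_ratio:.1f}")
```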

Idea To Be Implemented:

We can offer perks or customized offers to non-smokers to take advantage of their large number.

Example:

Offering a free membership at gym “X” could attract more non-smoking customers to your health insurance company.

Visualize Some Features

Fig. 5

Age:

Customers aged 19 to 22 form the largest group, while the other age ranges have roughly equal counts with small differences. Customers aged 19 to 22 are likely to be university students, so we can increase marketing at universities to attract as many young people as we can.

BMI:

BMI follows an approximately normal distribution.

Children:

Customers with no children form the largest group.

Charges:

A small portion of customers have much higher charges than the rest.

Grouping Different Features with Respect to Charges

Output:

sex
female 12569.58
male 13956.73
Name: charges, dtype: float64
smoker
0 8434.26
1 32050.23
Name: charges, dtype: float64
region
northeast 13406.35
northwest 12417.58
southeast 14735.41
southwest 12346.93
Name: charges, dtype: float64

Smokers are charged far more than non-smokers (about 32,050 versus 8,434 on average), which makes sense since smokers pay more for insurance due to the increased health risks of smoking. For sex and region, the averages of the different groups are close to each other.
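Grouped averages like the ones above can be produced with groupby. The sketch below uses made-up rows, so its numbers differ from the real output.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "sex": ["female", "male", "female", "male", "female", "male"],
        "smoker": [0, 1, 0, 0, 1, 0],
        "region": [
            "northeast", "northwest", "southeast",
            "southwest", "northeast", "southeast",
        ],
        "charges": [4000.0, 30000.0, 6000.0, 5000.0, 34000.0, 8000.0],
    }
)

# Average charge per category, one grouping column at a time.
mean_charges = {
    column: df.groupby(column)["charges"].mean().round(2)
    for column in ["sex", "smoker", "region"]
}
for column, means in mean_charges.items():
    print(means)
```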

Discovering How Cost Varies with Respect to Age

Fig. 6

We can conclude that older customers are charged more.

Reporting & Organizing the Summary of the Columns’ Relations with Each Other
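The same trend can be checked without a plot by averaging charges per age band. This is only a sketch: the rows are made up and the band edges are my own choice, not something taken from the original analysis.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "age": [19, 22, 28, 35, 41, 47, 52, 58, 63],
        "charges": [2000.0, 2500.0, 3500.0, 5000.0, 6500.0,
                    8000.0, 10000.0, 12000.0, 14000.0],
    }
)

# Assumed age bands; pd.cut uses right-inclusive intervals by default.
age_band = pd.cut(df["age"], bins=[17, 30, 45, 64],
                  labels=["18-30", "31-45", "46-64"])

mean_by_band = df.groupby(age_band, observed=True)["charges"].mean().round(2)
print(mean_by_band)
```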

Fig. 7

Here we can see how the data is distributed for smokers and non-smokers, and how variations in age and BMI change the cost. There is nothing surprising: an older smoker will be charged a higher cost.

Searching for Outliers

Fig. 8
Fig. 9
Fig. 10
Fig. 11

The charges column has the highest number of outliers.
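Box plots usually flag points that fall more than 1.5 times the interquartile range (IQR) beyond the quartiles. Assuming that rule, a hypothetical count_outliers helper can count them per column. The data below is made up.

```python
import pandas as pd


def count_outliers(values: pd.Series, whisker: float = 1.5) -> int:
    """Count values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return int(((values < lower) | (values > upper)).sum())


# Made-up columns: charges has one extreme value, bmi has none.
df = pd.DataFrame(
    {
        "bmi": [22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0],
        "charges": [1000.0, 2000.0, 3000.0, 4000.0,
                    5000.0, 6000.0, 7000.0, 60000.0],
    }
)

outlier_counts = {column: count_outliers(df[column]) for column in df.columns}
print(outlier_counts)
```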

Machine Learning

Data Preprocessing

We have 2 main steps:

  • Shuffling the data to reduce the bias.
  • One hot encoding (get_dummies) in order to deal with the categorical values.
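A minimal sketch of these two steps, on made-up rows. The random_state value is an arbitrary choice of mine to make the shuffle reproducible.

```python
import pandas as pd

# Made-up rows covering both sexes and all four regions.
df = pd.DataFrame(
    {
        "age": [19, 47, 33, 52],
        "sex": ["female", "male", "male", "female"],
        "bmi": [27.9, 29.8, 22.7, 30.8],
        "children": [0, 3, 0, 1],
        "smoker": [1, 1, 0, 0],
        "region": ["southwest", "northeast", "northwest", "southeast"],
        "charges": [1234.57, 2345.67, 3456.79, 4567.89],
    }
)

# Step 1: shuffle the rows and reset the index.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Step 2: one-hot encode the categorical columns as 0/1 integers.
encoded = pd.get_dummies(shuffled, columns=["sex", "region"], dtype=int)
print(encoded.columns.tolist())
```

The encoded column names (sex_female, sex_male, region_northeast, and so on) are the same ones that appear in the model inputs at the end of this post.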

Train, Test Splitting

Splitting the data into a train set (80%) and a test set (20%) for training and testing our model.
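With scikit-learn, an 80/20 split is a single call to train_test_split. The sketch below uses synthetic features and an arbitrary random_state, both of which are my assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and the charges target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# test_size=0.2 keeps 80% of the rows for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```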

Model Selection

I evaluated three different models using cross_val_score. I chose cross-validation because it shows how each algorithm performs across different train/test splits, and which one gives the highest score.

If you're not familiar with K-Fold Cross Validation, please check my post here.

The three models were:

  • Linear Regression
  • Lasso Regression
  • Random Forest

The Random Forest model clearly outperformed the other two, scoring about 0.81 to 0.86 per fold against about 0.70 to 0.79 for the two linear models:

  • Linear Regression: [0.75229225, 0.78832069, 0.7388152, 0.69943941, 0.7462162]
  • Lasso Regression: [0.75224359, 0.78835653, 0.73883871, 0.69949229, 0.74622792]
  • Random Forest: [0.84178716, 0.86191325, 0.8137814, 0.8308621, 0.82653594]
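A comparison like this one can be sketched as follows. The data is synthetic, the models use default settings, and cv=5 is inferred from the five scores per model above, so treat all of these as assumptions. The scores it prints will not match the ones listed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the insurance features.
rng = np.random.default_rng(0)
n = 200
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n)
X = np.column_stack([age, bmi, smoker])
# Made-up target with a smoker x BMI interaction plus noise.
y = (250 * age + 20000 * smoker + 600 * smoker * (bmi - 30)
     + rng.normal(0, 1500, size=n))

models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(),
    "Random Forest": RandomForestRegressor(random_state=0),
}

# For regressors, cross_val_score reports R^2 per fold by default.
scores = {name: cross_val_score(model, X, y, cv=5)
          for name, model in models.items()}
for name, fold_scores in scores.items():
    print(f"{name}: {fold_scores.round(3)}")
```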

Model Tuning

I tried different values for the n_estimators hyperparameter.

Output:

(100, 0.831846125141681)
(110, 0.833364289735879)
(120, 0.8339154009224192)
(130, 0.8333984367876605)
(140, 0.834614042038114)
(150, 0.8335295861801286)
(160, 0.8340482963171366)
(170, 0.8332182585103117)
(180, 0.8341442666308645)
(190, 0.8333433079839091)

To sum up, we should always prefer whatever costs less computation. Since the scores shown above barely change with the number of estimators (they all fall between about 0.832 and 0.835), I will use n_estimators = 100.
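A sweep that prints (n_estimators, score) pairs like the ones above could look like this sketch. It runs on a small synthetic dataset, and the 3-fold setting and random_state are my assumptions chosen to keep it quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Small synthetic stand-in for the training data.
rng = np.random.default_rng(0)
n = 90
age = rng.integers(18, 65, size=n)
smoker = rng.integers(0, 2, size=n)
X = np.column_stack([age, smoker])
y = 250 * age + 20000 * smoker + rng.normal(0, 1500, size=n)

results = []
for n_estimators in range(100, 200, 10):
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    # Mean cross-validated R^2 for this number of trees.
    mean_score = cross_val_score(model, X, y, cv=3).mean()
    results.append((n_estimators, mean_score))

for pair in results:
    print(pair)
```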

Model Building After Choosing the Best Algorithm Acting on Our Model With The Right Hyper Parameter Value

Model Testing

Output:

array([26852.88])

The above inputs were:

age 47.0
bmi 29.8
children 3.0
smoker 1.0
sex_female 0.0
sex_male 1.0
region_northeast 0.0
region_northwest 0.0
region_southeast 0.0
region_southwest 1.0

The predicted charge is: $26852.88
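To make the build-and-predict step concrete, here is a minimal sketch that fits a Random Forest with n_estimators = 100 and predicts for the customer listed above. The training data is synthetic and the target formula is made up, so the printed value will not equal the prediction shown in this post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data with the same ten feature columns.
rng = np.random.default_rng(0)
n = 200
region = rng.integers(0, 4, size=n)
sex_male = rng.integers(0, 2, size=n)
X = pd.DataFrame(
    {
        "age": rng.integers(18, 65, size=n),
        "bmi": rng.normal(30, 6, size=n).round(1),
        "children": rng.integers(0, 6, size=n),
        "smoker": rng.integers(0, 2, size=n),
        "sex_female": 1 - sex_male,
        "sex_male": sex_male,
        "region_northeast": (region == 0).astype(int),
        "region_northwest": (region == 1).astype(int),
        "region_southeast": (region == 2).astype(int),
        "region_southwest": (region == 3).astype(int),
    }
)
# Made-up target: charges rise with age and smoking, plus noise.
y = 250 * X["age"] + 20000 * X["smoker"] + rng.normal(0, 1000, size=n)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# The customer from this section, with columns in the training order.
customer = pd.DataFrame(
    [{
        "age": 47, "bmi": 29.8, "children": 3, "smoker": 1,
        "sex_female": 0, "sex_male": 1,
        "region_northeast": 0, "region_northwest": 0,
        "region_southeast": 0, "region_southwest": 1,
    }],
    columns=X.columns,
)
prediction = model.predict(customer)
print(f"The predicted charge is: ${prediction[0]:.2f}")
```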

I hope you enjoyed this article. Don't hesitate to leave your thoughts in the comments section.

Check out the code repo and try it yourself.

Thanks for your time, and let’s boost our knowledge!
