Creating a Tool for Predicting Health Insurance Charge

Apr 6, 2022

Every insurance company needs a smooth way to determine the charges (premiums) of its customers based on different aspects and situations. This project was built to solve that problem, and to draw insights from the company’s data that could boost its work and help it understand how it should interact with customers.

So without any fluff, give me a second to outline what this post covers:

Importing Libraries.
Importing Dataset & Data Cleaning.
Exploratory Data Analysis (EDA).
Model Selection.
Model Tuning.
Model Building After Choosing the Best Algorithm Acting on Our Model With The Right Hyper Parameter Value.
Model Testing

You can find the dataset used here.

Let’s start!

Importing Libraries

Importing Dataset & Data Cleaning

Data form is:
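As a minimal sketch, assuming the file has the columns used throughout this post (age, sex, bmi, children, smoker, region, charges), loading and inspecting it could look like the following. The sample rows are hypothetical and stand in for the real CSV file.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the real dataset file.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
21,female,26.4,0,yes,southwest,15820.1234
35,male,31.2,2,no,southeast,5210.5678
52,female,28.9,1,no,northwest,10450.9012
"""

# With the real file this would be pd.read_csv("<path to the dataset>").
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.head())    # first rows of the table
print(df.dtypes)    # column types: numeric vs. text (object)
```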

Data Cleaning

Luckily, the dataset was 95% clean and organized, but I applied some edits:

Changing the smoker column’s values such as: “yes” => 1 & “no” => 0:
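A minimal sketch of this mapping with pandas, using a few hypothetical rows:

```python
import pandas as pd

# Toy rows (hypothetical values) with the original text labels.
df = pd.DataFrame({"smoker": ["yes", "no", "no", "yes"]})

# Replace the text labels with numeric flags: "yes" -> 1, "no" -> 0.
df["smoker"] = df["smoker"].map({"yes": 1, "no": 0})

print(df["smoker"].tolist())
```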

Rounding the charges values in order to have more straightforward values:
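A minimal sketch of the rounding step. Rounding to two decimal places is an assumption based on the two-decimal values shown later in this post, and the rows are hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical values) with long decimal tails.
df = pd.DataFrame({"charges": [16884.92437, 1725.5523]})

# Assumption: keep two decimals, i.e. whole cents.
df["charges"] = df["charges"].round(2)

print(df["charges"].tolist())
```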

Exporting the new cleaned dataset:
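A minimal sketch of the export, assuming a CSV output. The file name is hypothetical, and a temporary directory is used so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Toy cleaned rows (hypothetical values).
df = pd.DataFrame({
    "age": [19, 47],
    "smoker": [1, 0],
    "charges": [16884.92, 8240.59],
})

# Hypothetical output name; index=False avoids writing the row index as a column.
out_path = os.path.join(tempfile.mkdtemp(), "insurance_cleaned.csv")
df.to_csv(out_path, index=False)

# Read the file back to confirm the export worked.
reloaded = pd.read_csv(out_path)
print(reloaded)
```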

Exploratory Data Analysis (EDA)

A very important part of any data science project is knowing how to tell a story with the data you worked on, and how to draw conclusions that help your company take the right and most effective business decisions.

Without a doubt, visuals are the best way to deliver your conclusions. You can check out one of my earlier posts about visuals.

Features Correlation

Obviously, there is a correlation between (age, charges) & (smoker, charges).
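A correlation matrix like this can be sketched with DataFrame.corr(), which computes the Pearson correlation between numeric columns. The rows below are hypothetical, so the numbers will not match the real dataset.

```python
import pandas as pd

# Toy rows (hypothetical values), already cleaned so smoker is 0/1.
df = pd.DataFrame({
    "age": [19, 25, 33, 47, 52, 60],
    "smoker": [0, 1, 0, 1, 0, 1],
    "charges": [1800.0, 17000.0, 4500.0, 26000.0, 10500.0, 31000.0],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# Correlation of each feature with charges, strongest first.
print(corr["charges"].sort_values(ascending=False))
```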

Exploring the Distribution of Sex & Smoker Among the Data

As you can see, the first figure shows a near balance between female and male customers. In contrast, the second figure shows that non-smokers outnumber smokers by roughly 4 to 1.
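The counts behind figures like these can be sketched with value_counts(). The rows are hypothetical and chosen so that the smoker ratio comes out at exactly 4 to 1.

```python
import pandas as pd

# Toy rows (hypothetical values): 5 female, 5 male, 8 non-smokers, 2 smokers.
df = pd.DataFrame({
    "sex": ["female", "male"] * 5,
    "smoker": [0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
})

sex_counts = df["sex"].value_counts()
smoker_counts = df["smoker"].value_counts()

# Number of non-smokers per smoker.
ratio = smoker_counts.loc[0] / smoker_counts.loc[1]

print(sex_counts)
print(smoker_counts)
print(ratio)
```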

Idea To Be Implemented:

We can add some privileges or customized offers for non-smokers to take advantage of their large share of the customer base.

Example:

Adding a free membership offer at gym “X”, which could attract more non-smoking customers to your health insurance company.

Visualize Some Features

Age:

Customers aged between 19 and 22 form the largest group; the remaining age ranges are all roughly the same size, with small differences. Since customers aged 19 to 22 are likely to be university students, we can increase marketing at universities to attract as many young people as we can.

Bmi:

BMI approximately follows a normal distribution.

Children:

Apparently, customers with no children are the largest group.

Charges:

We can see that a small portion of customers have the highest charges.

Grouping Different Features with Respect to Charges

Output:

sex
female    12569.58
male      13956.73
Name: charges, dtype: float64
smoker
0     8434.26
1    32050.23
Name: charges, dtype: float64
region
northeast    13406.35
northwest    12417.58
southeast    14735.41
southwest    12346.93
Name: charges, dtype: float64

Smokers are charged far more than non-smokers, which makes sense: smokers pay more for health insurance because of the increased health risks of smoking. For sex and region, the average charges of the different groups are close to each other.
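Averages like the ones above can be sketched with groupby() followed by mean(). The rows are hypothetical, so the printed values differ from the real output.

```python
import pandas as pd

# Toy rows (hypothetical values), with smoker already encoded as 0/1.
df = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female", "male"],
    "smoker": [0, 1, 0, 0, 1, 0],
    "region": ["northeast", "northwest", "southeast",
               "southwest", "northeast", "southeast"],
    "charges": [3200.0, 21000.0, 4100.0, 5300.0, 33000.0, 6100.0],
})

# Mean charge per group, for each categorical feature in turn.
means = {
    col: df.groupby(col)["charges"].mean().round(2)
    for col in ["sex", "smoker", "region"]
}

for series in means.values():
    print(series)
```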

Discovering How Cost Varies with Respect to Age

We can conclude that older customers are charged more.
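One way to check this trend numerically is to bucket customers into age bands and compare the mean charge per band. This is a sketch with hypothetical rows and band edges, not necessarily how the original plot was produced.

```python
import pandas as pd

# Toy rows (hypothetical values).
df = pd.DataFrame({
    "age": [19, 22, 28, 35, 41, 48, 55, 63],
    "charges": [2000.0, 2500.0, 4000.0, 6000.0,
                7500.0, 9500.0, 12000.0, 14500.0],
})

# Hypothetical age bands: (17, 30], (30, 45], (45, 64].
df["age_band"] = pd.cut(
    df["age"], bins=[17, 30, 45, 64], labels=["18-30", "31-45", "46-64"]
)

# Mean charge per age band.
by_band = df.groupby("age_band", observed=True)["charges"].mean()
print(by_band)
```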

Reporting & Organizing the Summary of Column’s Values Relation Between Each Other

Here we can see clearly how charges are distributed for smokers and non-smokers, and how they change with age and BMI. Nothing surprising: an older smoker will be charged more.

Searching for Outliers

Apparently, charges has the highest number of outliers.
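Outliers can also be counted numerically. The sketch below uses the common 1.5 × IQR rule, the same rule box plots use for their whiskers. Whether the original analysis used this exact rule is an assumption, and the rows are hypothetical.

```python
import pandas as pd


def count_outliers(series: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(mask.sum())


# Toy rows (hypothetical values); the last charge is deliberately extreme.
df = pd.DataFrame({
    "bmi": [22.0, 25.0, 27.0, 29.0, 31.0, 33.0],
    "charges": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 60000.0],
})

counts = {col: count_outliers(df[col]) for col in df.columns}
print(counts)
```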

Machine Learning

Data Preprocessing

We have 2 main steps:

Shuffling the data to reduce the bias.
One hot encoding (get_dummies) in order to deal with the categorical values.
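The two steps above can be sketched as follows. The rows and the random_state value are hypothetical; the resulting column names (sex_female, region_southwest, and so on) follow the get_dummies naming used later in this post.

```python
import pandas as pd

# Toy rows (hypothetical values) covering both sexes and all four regions.
df = pd.DataFrame({
    "age": [19, 33, 47, 60],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 29.8, 25.8],
    "children": [0, 0, 3, 1],
    "smoker": [1, 0, 1, 0],
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "charges": [16884.92, 21984.47, 26852.88, 28923.14],
})

# Step 1: shuffle the rows; the fixed seed only makes the sketch reproducible.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Step 2: one hot encode the remaining text columns.
encoded = pd.get_dummies(shuffled, columns=["sex", "region"])

print(encoded.columns.tolist())
```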

Train, Test Splitting

Splitting the data into a train set (80%) and a test set (20%) for training and testing our model.
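A minimal sketch of the 80/20 split with scikit-learn's train_test_split. The rows and the random_state value are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows (hypothetical values).
df = pd.DataFrame({
    "age": range(20, 30),
    "smoker": [0, 1] * 5,
    "charges": [1000.0 * i for i in range(1, 11)],
})

X = df.drop(columns="charges")  # features
y = df["charges"]               # target

# test_size=0.2 keeps 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```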

Model Selection

I evaluated each candidate model using cross_val_score. I chose cross validation because it scores each algorithm on several different train/test splits, which shows more reliably which one performs best on this data.

If you're not familiar with what K Fold Cross Validation is, please check my post here.

The three models I tried were:

Linear Regression
Lasso Regression
Random Forest

The Random Forest model clearly outperformed the other two in terms of its score.

Linear Regression : [0.75229225, 0.78832069, 0.7388152 , 0.69943941, 0.7462162]
Lasso Regression : [0.75224359, 0.78835653, 0.73883871, 0.69949229, 0.74622792]
Random Forest : [0.84178716, 0.86191325, 0.8137814 , 0.8308621 , 0.82653594]
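A comparison like the one above can be sketched as follows. The data is synthetic (a hypothetical linear relation plus noise), so the scores will not match the ones reported here. Five folds are assumed because five scores are shown per model, and cross_val_score's default scoring for regressors is R².

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical relation between features and charges).
rng = np.random.default_rng(0)
n = 200
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n)
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 3000, size=n)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(),
    "Random Forest": RandomForestRegressor(random_state=0),
}

# One score per fold for each model.
scores = {name: cross_val_score(model, X, y, cv=5) for name, model in models.items()}

for name, fold_scores in scores.items():
    print(name, ":", fold_scores)
```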

Model Tuning

I tried different values for the hyper parameter n_estimators.

Output:

(100, 0.831846125141681)
(110, 0.833364289735879)
(120, 0.8339154009224192)
(130, 0.8333984367876605)
(140, 0.834614042038114)
(150, 0.8335295861801286)
(160, 0.8340482963171366)
(170, 0.8332182585103117)
(180, 0.8341442666308645)
(190, 0.8333433079839091)

To sum up, we should always prefer the option with the lower computational cost. Since there is no significant difference in score across the different numbers of estimators shown above, I will use n_estimators = 100.
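A sweep that produces (n_estimators, score) pairs like the ones above can be sketched as follows. Using the mean 5-fold cross validation score is an assumption, since the post does not say how each score was computed, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical relation between features and charges).
rng = np.random.default_rng(0)
n = 100
age = rng.integers(18, 65, size=n)
smoker = rng.integers(0, 2, size=n)
X = np.column_stack([age, smoker])
y = 250 * age + 20000 * smoker + rng.normal(0, 3000, size=n)

results = []
for n_estimators in range(100, 200, 10):
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    # Assumption: score each setting by its mean cross validation score.
    score = cross_val_score(model, X, y, cv=5).mean()
    results.append((n_estimators, score))

for pair in results:
    print(pair)
```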

Model Building After Choosing the Best Algorithm Acting on Our Model With The Right Hyper Parameter Value

Model Testing

Output:

array([26852.88])

The above inputs were:

age                 47.0
bmi                 29.8
children             3.0
smoker               1.0
sex_female           0.0
sex_male             1.0
region_northeast     0.0
region_northwest     0.0
region_southeast     0.0
region_southwest     1.0

The predicted charge is: $26852.88
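Building the final model and predicting for the inputs listed above can be sketched as follows. The training data is synthetic and hypothetical, so the predicted value will differ from the one reported here; only the input row and n_estimators = 100 come from this post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in training data (hypothetical values).
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n).round(1),
    "children": rng.integers(0, 5, size=n),
    "smoker": rng.integers(0, 2, size=n),
    "sex": rng.choice(["female", "male"], size=n),
    "region": rng.choice(
        ["northeast", "northwest", "southeast", "southwest"], size=n
    ),
})
raw["charges"] = (
    250 * raw["age"] + 300 * raw["bmi"] + 20000 * raw["smoker"]
    + rng.normal(0, 3000, size=n)
)

data = pd.get_dummies(raw, columns=["sex", "region"])
X = data.drop(columns="charges")
y = data["charges"]

# Final model with the chosen hyper parameter value.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# The input row listed in this post.
inputs = {
    "age": 47.0, "bmi": 29.8, "children": 3.0, "smoker": 1.0,
    "sex_female": 0.0, "sex_male": 1.0,
    "region_northeast": 0.0, "region_northwest": 0.0,
    "region_southeast": 0.0, "region_southwest": 1.0,
}
# Reorder the columns to match the training features.
row = pd.DataFrame([inputs])[X.columns]

prediction = model.predict(row)
print(prediction)
```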

I hope you enjoyed this article, and don't hesitate to share your thoughts in the comments section.

Check out the code repo and try it yourself.

Thanks for your time and let’s boost our knowledge!


Written by Amjad El Baba