Health Insurance Cost Prediction

Oct 20, 2023

In this article, I will describe my project in which I use different regression algorithms to predict health insurance costs from people’s personal health data.

I will explain the code step by step:

# python libraries
import numpy as np
import pandas as pd

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

Import Libraries

The first two lines import the NumPy and Pandas libraries. NumPy is used for numerical calculations, while Pandas is used for data analysis and manipulation.

The “ML libraries” section imports the machine learning tools we need from scikit-learn:

train_test_split: Splits the dataset into training and test sets.
StandardScaler: Standardizes the features so that each one has a mean of 0 and a standard deviation of 1.
DecisionTreeRegressor: Implements a decision tree regression model.
RandomForestRegressor: Implements a random forest regression model, an ensemble of decision trees.
LinearRegression: Implements an ordinary least squares linear regression model.
Ridge: Implements Ridge regression, a linear model with an L2 penalty.
Lasso: Implements Lasso regression, a linear model with an L1 penalty.
ElasticNet: Implements Elastic Net regression, a linear model that combines the L1 and L2 penalties.
mean_squared_error, r2_score, mean_absolute_error: Compute the metrics used to evaluate the regression models.

# DATA DISCOVERY

data = pd.read_csv("C:/Users/HAZAL/OneDrive/Masaüstü/Projeler/health_insurance_cost_estimate/insurance.csv")
data.head()
data.describe()
data = data.dropna()

Reading data, examining data and removing missing values (NaN)

The first line reads the CSV file at the specified path ("C:/Users/HAZAL/OneDrive/Masaüstü/Projeler/health_insurance_cost_estimate/insurance.csv") and loads it into a Pandas data frame, which is assigned to a variable named “data”.
The second line displays the first five rows of the data frame using the head() method. This is a quick way to see the structure of the data.
The third line displays a basic statistical summary of the numeric columns using the describe() method: count, mean, standard deviation, minimum, maximum and quartiles.
The fourth line removes rows containing missing values (NaN) from the data frame using the dropna() method. Dropping such rows is a common and simple way to deal with missing values during data cleaning.
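Before dropping rows, it can help to see how many values are actually missing. The sketch below illustrates this on a small made-up frame; the name toy and its values are assumptions for illustration and do not come from insurance.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for insurance.csv (values are made up).
toy = pd.DataFrame({
    "age": [19, 18, np.nan, 33],
    "bmi": [27.9, 33.77, 33.0, np.nan],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47],
})

# Count the missing values in each column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that contains at least one NaN.
cleaned = toy.dropna()
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

If the counts are all zero, dropna() leaves the data unchanged; if many rows would be lost, filling the gaps may be a better option than dropping them.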

column_names = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]
statistics = {}

for column_name in column_names:
    column_stats = data[column_name].value_counts()
    statistics[column_name] = column_stats

for column_name, column_stat in statistics.items():
    print(f"How many times each value occurs in the {column_name} column:\n{column_stat}\n")

Checking whether the data is evenly distributed (balanced/imbalanced)

A list named column_names holds the names of the columns we want to examine.
An empty dictionary named statistics is created to store the frequency counts of each column.
The first loop goes through each column, counts how many times each value occurs with the value_counts() method, and stores the result in statistics.
The second loop prints each column name together with its frequency counts, so we can see at a glance how the values are distributed.
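Raw counts can be hard to compare across columns, so proportions are sometimes easier to read. A minimal sketch, assuming a made-up smoker column (the values below are illustrative only):

```python
import pandas as pd

# Hypothetical column with made-up values, used only to show the idea.
toy = pd.DataFrame({"smoker": ["yes", "no", "no", "no", "yes", "no", "no", "no"]})

# Absolute frequency of each value.
counts = toy["smoker"].value_counts()

# normalize=True returns the share of each value instead of the raw count.
shares = toy["smoker"].value_counts(normalize=True)

print(counts)
print(shares)
```

A share far from an even split suggests that the column is imbalanced.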

data = pd.get_dummies(data, columns=['sex', 'smoker', 'region'], prefix=['sex', 'smoker', 'region'])
data.drop(['sex_female', 'smoker_no'], axis=1, inplace=True)

data.head()

One-hot encoding (dummy variables)

One-hot encoding converts each categorical column into a set of binary indicator columns, one per category. Machine learning models need numeric inputs, so this step makes the categorical columns usable.

The first line uses the pd.get_dummies() function to replace the sex, smoker and region columns with dummy columns.
The second line uses the data.drop() function to remove sex_female and smoker_no. They are redundant, because sex_male and smoker_yes already carry the same information.
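An alternative to dropping the redundant columns by hand is the drop_first parameter of pd.get_dummies(). The sketch below uses a made-up three-row frame. Note that, unlike the code above, it also drops one of the region columns, so it is a variation rather than an exact equivalent.

```python
import pandas as pd

# Hypothetical rows with made-up values, used only for illustration.
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# drop_first=True removes the first category of every encoded column,
# so k categories become k - 1 dummy columns.
encoded = pd.get_dummies(toy, columns=["sex", "smoker", "region"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one dummy per column avoids perfectly correlated inputs, which mainly matters for unpenalized linear regression.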

y = data['charges']  # Dependent variable
x = data.drop(['charges'], axis=1)  # Independent variables

Separating the data into dependent and independent variables

We split the data set into two parts: x (the independent variables such as age, bmi, ...) and y (the dependent variable, charges).
We will try to predict charges from variables such as age, bmi and sex. In other words, the expense depends on these variables.
Therefore, the charges column is the dependent variable and the other columns are the independent variables.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=46)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # learn the mean and std from the training data only
x_test = scaler.transform(x_test)  # reuse the training statistics on the test data

Preparation for the machine learning phase

In the first line, we split the data into two parts: 75% for training and 25% for testing.
We then standardize the features with StandardScaler.
The goal is to put all the values on the same scale: each column is rescaled to have a mean of 0 and a standard deviation of 1. This matters because some columns hold small numbers, while others can be in the thousands or millions.
A common scale matters most for the penalized linear models (Ridge, Lasso, Elastic Net), whose penalties depend on the size of the coefficients. Tree-based models are largely unaffected by it.
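A common way to scale without leaking information from the test set is to fit the scaler on the training split and only apply it to the test split. The sketch below checks the result on small made-up arrays; x_train_toy and x_test_toy are hypothetical names and values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, bmi) rows with made-up values.
x_train_toy = np.array([[19.0, 27.9], [33.0, 22.7], [46.0, 33.4], [62.0, 26.3]])
x_test_toy = np.array([[28.0, 33.0], [55.0, 29.8]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation of each training column.
x_train_scaled = scaler.fit_transform(x_train_toy)

# transform reuses those training statistics; nothing is learned from the test set.
x_test_scaled = scaler.transform(x_test_toy)

print(x_train_scaled.mean(axis=0))  # close to 0 for every column
print(x_train_scaled.std(axis=0))   # close to 1 for every column
```

The scaled test columns will generally not have a mean of exactly 0, which is expected because they are measured against the training statistics.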

# Decision Tree Regressor
tree_regression = DecisionTreeRegressor(random_state=42, max_depth=2) 
tree_regression.fit(x_train, y_train) 
predict_tree_regression = tree_regression.predict(x_test) 

# Random Forest Regressor
random_regression = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=42)  # number of trees = 100
random_regression.fit(x_train, y_train)
predict_random_regression = random_regression.predict(x_test)

# Lasso Regressor
lassoReg = Lasso(alpha=2)
lassoReg.fit(x_train,y_train)
predict_lasso = lassoReg.predict(x_test)

# Elastic Net Regressor
elastic_reg = ElasticNet(random_state=0)
elastic_reg.fit(x_train,y_train)
predict_elastic = elastic_reg.predict(x_test)

# Ridge Regressor
ridge_reg = Ridge()
ridge_reg.fit(x_train,y_train)
predict_ridge = ridge_reg.predict(x_test)

Machine learning and Model Performance

Each algorithm is defined and trained on the training data in turn, and then makes predictions on the test data.

DecisionTreeRegressor:

A decision tree regression model named tree_regression is created. random_state fixes the random seed so that the results are reproducible, and max_depth limits the maximum depth of the tree.
We train the model with the fit() method, passing the training features x_train and the target variable y_train. The model learns to predict y_train from the data in x_train.
predict_tree_regression holds the model's predictions for x_test. Comparing these predictions with y_test later tells us how well the model has learned.

The other regression models follow the same steps as the decision tree.
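Because every model follows the same fit-then-predict pattern, the repetition can be folded into a loop. This is only a sketch on synthetic data from make_regression, not the insurance data; the names models, predictions, X_tr and X_te are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used here instead of the insurance dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=46)

# Same hyperparameters as in the article, collected in one dictionary.
models = {
    "tree": DecisionTreeRegressor(random_state=42, max_depth=2),
    "lasso": Lasso(alpha=2),
    "elastic": ElasticNet(random_state=0),
    "ridge": Ridge(),
}

# Fit each model on the training split and predict on the test split.
predictions = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    predictions[name] = model.predict(X_te)

print({name: pred.shape for name, pred in predictions.items()})
```

Keeping the models in a dictionary also keeps each name attached to its predictions, so the two cannot drift out of order.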

predicts = [predict_tree_regression, predict_random_regression, predict_lasso, predict_elastic, predict_ridge]
algorithm_names = ["Decision Tree Alg.", "Random Forest Alg.", "Lasso Alg.", "Elastic Net Alg.", "Ridge Alg."]

def performance_calculate(predict): 
    mae = mean_absolute_error(y_test, predict)
    mse = mean_squared_error(y_test, predict)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, predict)
    
    data = [mae, mse, rmse, r2]
    return data

Calculating the performance of models

We evaluate how accurate each model is with four metrics: R², Mean Squared Error, Mean Absolute Error and Root Mean Squared Error.

predicts is a list containing the predictions of each regression model trained above.
algorithm_names is a list containing the name of each regression algorithm, in the same order.
A function called performance_calculate is defined. It compares a set of predictions with the actual values in y_test and returns a list of performance metrics.
The following four performance metrics are calculated within the function:

mae (Mean Absolute Error): The mean of the absolute differences between actual values and predictions. A lower MAE indicates better model performance.
mse (Mean Squared Error): The mean of the squared differences between actual values and predictions. A lower MSE indicates better model performance, and large errors are penalized more heavily than in MAE.
rmse (Root Mean Squared Error): The square root of MSE. It is expressed in the same units as the target (here, charges), and lower is better.
r2 (R-squared or coefficient of determination): Measures how much of the variance in the dependent variable is explained by the model. R2 is at most 1, and the closer it is to 1, the better the model fits. On test data it can be negative when the model predicts worse than simply using the mean.

The calculated performance metrics are collected in a list, which is stored in a variable called data.

Finally, the function returns data. These metrics are calculated for each regression model and used to compare the results.
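To make the four metrics concrete, the sketch below computes them by hand with NumPy on a tiny made-up example and checks them against scikit-learn. The arrays y_true and y_pred are hypothetical values chosen so that the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual values and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 370.0])

errors = y_true - y_pred                 # [-10, 10, -30, 30]

mae = np.mean(np.abs(errors))            # (10 + 10 + 30 + 30) / 4 = 20
mse = np.mean(errors ** 2)               # (100 + 100 + 900 + 900) / 4 = 500
rmse = np.sqrt(mse)                      # square root of 500, about 22.36

ss_res = np.sum(errors ** 2)                     # residual sum of squares = 2000
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 50000
r2 = 1 - ss_res / ss_tot                         # 1 - 0.04 = 0.96

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

The R2 formula also shows why the score can fall below zero: whenever the residual sum of squares exceeds the total sum of squares, the result is negative.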

series  = []
metrics = ["Mean Absolute Error(MAE)", "Mean Squared Error(MSE)", "Root Mean Squared Error(RMSE)", "R2"]

for i in predicts:
    data = performance_calculate(i)
    series.append(data)
    
df = pd.DataFrame(data=series, index=algorithm_names, columns=metrics)
pd.set_option('display.colheader_justify', 'center')
print(df.to_string())

Printing Results to Screen

series is defined as an empty list to store the results. We will add the performance results of each algorithm to this list.
metrics is a list containing the names of the calculated performance measures, which become the column headers.
Using a for loop, performance_calculate is called for each set of predictions and returns a list named data holding the four metric values in the order MAE, MSE, RMSE, R2.
Each of these lists is appended to series, so series ends up with one row of results per algorithm.
A Pandas data frame (df) is created with pd.DataFrame. It takes its rows from series, its row labels from algorithm_names and its column labels from metrics.
pd.set_option('display.colheader_justify', 'center') centers the column headers, and print(df.to_string()) prints the whole table.
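Once the results are in a data frame, sorting by one metric makes the comparison easier to read. The sketch below uses placeholder scores and model names that are not results from this project; toy_results, ranked and best_name are hypothetical names.

```python
import pandas as pd

# Placeholder scores for illustration only; these are NOT results from this project.
toy_results = pd.DataFrame(
    {"RMSE": [5.0, 3.0, 4.0], "R2": [0.70, 0.90, 0.80]},
    index=["model_a", "model_b", "model_c"],
)

# Sort so that the model with the highest R2 comes first.
ranked = toy_results.sort_values("R2", ascending=False)
best_name = ranked.index[0]

print(ranked.to_string())
print(f"best by R2: {best_name}")
```

The same call on df would be df.sort_values("R2", ascending=False), since "R2" is one of the column names defined in metrics.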

Output

For an algorithm to be the best, we expect its R² value to be high and its error metrics (MAE, MSE, RMSE) to be low. Looking at the output, Random Forest makes the best predictions among these models.