PyCaret and Health Insurance Cost prediction

Carlos Henrique Fantecelle
9 min read · Nov 17, 2022


When acquiring health insurance, we commonly pay a fixed, relatively low amount of money in return for being covered for potentially high charges during a moment of healthcare need or emergency. Given this, it is important for insurance companies to predict how much customers will cost if such an event arises, so that their business remains feasible.

This is a difficult problem, because it is hard to predict when and how someone will become ill. However, certain aspects of people’s behaviour, habits and medical history might be able to tell us how much these patients will cost the insurance company.

Image credits: pch.vector (www.freepik.com)

In today’s story, we will be looking at a Health Insurance Cost dataset using regression machine learning models in PyCaret. PyCaret is a popular, low-code library that provides a nearly automated way to create data analysis workflows using machine learning. It aims to reduce the time spent coding models, leaving more time for the analyses themselves.

Let us dive into it, then!

For a detailed version of the analysis, see the Jupyter Notebook.
Where to find me: GitHub | LinkedIn

The Data

The data for this project was obtained from Kaggle. There is not much information about it on the page, but it is a simple dataset (only 7 columns) featuring characteristics of the individuals and their insurance charges over the period analysed (unknown), with 1338 observations.

This is how the data looks at first sight:

First 5 entries of our dataset.

Our variables

As mentioned above, the dataset comes with 1338 observations and only 7 columns, which are:

  • age = The age of the individual insurance client.
  • sex = The biological sex.
  • bmi = Body Mass Index, a health measure calculated as weight divided by height squared.
  • children = The number of children the individual has.
  • smoker = Whether the individual smokes or not.
  • region = The region where they live (related to the dataset origin, other information unknown).
  • charges = The incurred charges originating from the specific individual. This is our target variable.

A quick look into a summary of our dataset would give the following:

df.describe() of our dataset

From this alone we can already see that our charges variable contains some outliers. Looking at an exploratory analysis using SweetViz, we can get better insight into our variables (plots adapted for dark colours):

Plots of ‘age’, ‘sex’ and ‘bmi’, respectively.

We can see that our age variable only ranges between 18 and 64. If there were older people in the dataset, it would be a good idea to create an additional column categorising the ages, since we know that older people will often need more health care than younger people. Also, the age of the individuals has a somewhat uniform distribution, with the exception of an increase around the lower end of the age range.

The sex variable, which might also influence our predictions, is well balanced.

Another possible risk factor for health conditions is adiposity. Weight and height alone (the measurements used to calculate the Body Mass Index, or BMI) are not good enough factors to evaluate someone’s adiposity levels, and we should be careful not to fat shame other people in the name of their health (readers can read more about it here). However, historically, BMI and the categories defined by it have been associated with negative health conditions. Thus, here we will classify the data according to the BMI standards and compare how this measurement influences our predictions when dealing with our bmi variable.

Plots of ‘children’, ‘smoker’ and ‘region’, respectively.

The children variable is skewed towards the lower end of the distribution, but it is a potential factor that might be associated with our outcome, since children can get ill more easily through, for example, contact at school.

The region variable is also well balanced.

The smoker variable, on the other hand, is unbalanced, with ~20% smokers in the dataset. Smoking also represents a major health risk factor, and the variable will be left as is since it might be an important predictor.

Plot of ‘charges’.

Our target variable, charges, presents some obvious outliers, but these are of extreme interest when it comes to predicting costs in a health insurance scenario. That said, most of our data is concentrated in values below 20,000.

After this initial inspection, the variables are left as they are, except for the bmi variable, which we discretised according to common standards. Thus, two datasets were created (one with and one without the BMI categories), which were then split into train and test sets to avoid data leakage.
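A minimal sketch of how this step could look with pandas and scikit-learn (the file name, cut-off values and 80/20 split are assumptions; the exact code is in the notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle insurance data (file name assumed)
df = pd.read_csv('insurance.csv')

# Discretise BMI using the commonly used cut-offs
df_cat = df.copy()
df_cat['bmi'] = pd.cut(
    df_cat['bmi'],
    bins=[0, 18.5, 25, 30, float('inf')],
    labels=['underweight', 'normal', 'overweight', 'obese'],
)

# Hold out a test set before any modelling, to avoid data leakage
train_raw, test_raw = train_test_split(df, test_size=0.2, random_state=42)
train_cat, test_cat = df_cat.loc[train_raw.index], df_cat.loc[test_raw.index]
```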

Regression with PyCaret

Since we are trying to predict cost values, we are going to use regression models in our analyses. The first step in our PyCaret pipeline is defining our setup.
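The full setup call is in the notebook; a sketch of what it can look like is below (the preprocessing options shown are illustrative assumptions, not necessarily the exact settings used):

```python
from pycaret.regression import setup

reg_setup = setup(
    data=train_raw,       # or train_cat for the version with BMI categories
    target='charges',     # the variable we want to predict
    session_id=42,        # illustrative seed for reproducibility
    normalize=True,       # illustrative preprocessing choice
)
```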

These settings were used in both datasets, changing only the train data with or without the BMI categories. And just like that, PyCaret handled our data preparation, and with another simple line of code…
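That line is PyCaret’s compare_models(), which cross-validates every available regressor and ranks the results:

```python
from pycaret.regression import compare_models

best_model = compare_models()
```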

…and voilà! PyCaret gives us a comprehensive comparison of how the regression models available performed with our dataset. Let’s see how good they were.

Without BMI categories

The ‘compare_models()’ result for the dataset without BMI categories.

From this we can see that our best model was the Linear Regression. However, these values are not ideal, especially with a 0.6837 R². Let’s see how the model with the BMI information performs.

With BMI categories

The ‘compare_models()’ result for the dataset with the BMI categories.

From the comparisons, the Gradient Boosting Regressor on the dataset with the BMI information achieved the best metrics, including R², in nearly all categories, indicating that this variable still provides us with some predictive power.

With our best model identified, we now must effectively build the model and train it on our data (this step is not done during compare_models()). For comparison purposes, we will also use the second and third best algorithms identified: the CatBoost and Random Forest Regressors.
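In PyCaret this is done with create_model(); a sketch for the three algorithms, using the model IDs from PyCaret’s regression module:

```python
from pycaret.regression import create_model

gbr = create_model('gbr')            # Gradient Boosting Regressor
catboost = create_model('catboost')  # CatBoost Regressor
rf = create_model('rf')              # Random Forest Regressor
```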

The Gradient Boosting Regressor

The Gradient Boosting Regressor, or simply GBR, is a powerful machine learning model used for predictions. It is based on the construction of weak learners, which are improved by adding them together in an ensemble of predictors that minimizes the loss function (1-2).

It basically has three components (1):

  1. A loss function;
  2. Weak learners;
  3. The additive model in which the weak learners are added to minimize the loss function.

In this model, the weak learners are added one by one, while existing ones remain unchanged. This is done through a gradient descent procedure (an iterative method to minimize some function).
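To make the additive idea concrete, here is a small toy sketch (plain scikit-learn, not PyCaret) of boosting with squared-error loss, where each weak learner is a shallow tree fitted to the current residuals, i.e. the negative gradient of the loss:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boosting(X, y, n_estimators=100, learning_rate=0.1):
    """Illustrative additive boosting with squared-error loss."""
    prediction = np.full(len(y), y.mean())      # start from a constant model
    learners = []
    for _ in range(n_estimators):
        residuals = y - prediction              # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # weak learner
        prediction += learning_rate * tree.predict(X)  # add it to the ensemble
        learners.append(tree)
    return learners, prediction
```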

Tuning our model

After our model is created, it is easy to tune the hyperparameters using PyCaret. All we need is:
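A sketch of that call, reusing the gbr object from the create_model() step above:

```python
from pycaret.regression import tune_model

tuned_gbr = tune_model(gbr)   # optimises R² by default for regression
```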

And then, PyCaret will find the hyperparameters that best maximize our model’s performance (the default metric for regression models is the R²). For our model, we got:

Results after tuning our GBR model.

We can see that this barely improved our model, going from 0.8378 to 0.8382. Let us see how the CatBoost and Random Forest regressors fared in this matter.

Trying CatBoost and Random Forest Regressors
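The same one-liner applies to the other two candidates, again reusing the objects created earlier:

```python
tuned_catboost = tune_model(catboost)
tuned_rf = tune_model(rf)
```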

CatBoost regressor

Results after tuning the CatBoost regressor.

Random Forest regressor

Results after tuning the Random Forest regressor.

From these results, we can see that the CatBoost regressor actually outperforms the GBR after tuning the hyperparameters. The main difference between CatBoost and other gradient boosted regressors is that CatBoost produces symmetrical trees, which helps the model by reducing the time needed to perform predictions (3). Since this was the best model, from this point forward we will use CatBoost on our previously split test data.

Let’s see how well CatBoost performed. Just as it is easy to create a machine learning model with PyCaret, it is equally easy to check a model’s performance.
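For the Prediction Error plot shown below, a single call is enough (assuming the tuned CatBoost model from the previous step):

```python
from pycaret.regression import plot_model

plot_model(tuned_catboost, plot='error')   # Prediction Error plot
```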

Prediction Error plot for our tuned CatBoost Regressor model.

From this, we can see that our best fit line lies close to the identity line, as reflected by the R² metric, though it is probably skewed by our outliers. PyCaret has many other diagnostic plots; you can check them here. One of the ways we can evaluate our model is by checking feature importance.
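In PyCaret, the SHAP-based plot below can be produced with interpret_model(), shown here for the tuned CatBoost model (the function relies on the shap package and supports tree-based estimators):

```python
from pycaret.regression import interpret_model

interpret_model(tuned_catboost, plot='summary')   # SHAP summary plot
```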

Model interpretation of feature importance using SHAP values in our Regression model.

SHAP (SHapley Additive exPlanations) values are a measurement, based on game theory, that is used to explain the output of machine learning models (4). They can indicate which “players” in our model contributed the most to the final outcome.

The plot of SHAP values here brings to light what was expected: that smoking status, BMI values and age were the most influential variables for our predictions. The discretisation of BMI values did not improve our model’s predictions substantially, indicating that this strategy was not very useful for this dataset. Let’s now see our model’s predictions and finalise the model for evaluation with the test data.

Finalising the model

In PyCaret, finalising a model has a specific meaning. It combines all our data (including the train and hold-out sets from cross-validation) and generates the model again using the defined parameters. This allows us to make use of all the available data, except for the test data that we extracted at the beginning of the project. After finalising the model, we will test how it predicts on its own data (the same set used for training), before testing it on our test data.
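A sketch of that step, following the variable names used in the earlier examples:

```python
from pycaret.regression import finalize_model, predict_model

final_model = finalize_model(tuned_catboost)
# Without new data, predict_model scores on PyCaret's internal hold-out,
# which the finalised model has already seen during training
predict_model(final_model)
```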

Model evaluation on its own data.

The output here is as expected, yielding a higher R², because this is the same data that we used to build the model. After this step, we will use the model to predict the charges on our test data.
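Predicting on the held-out test set is again a single call (test_cat here refers to the test split with BMI categories from the sketch at the beginning of the project):

```python
from pycaret.regression import predict_model

predictions = predict_model(final_model, data=test_cat)
```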

Model evaluation using the test data

Our R² value remained within the expected range, indicating that our model did not overfit, which is good! After this step, we could save our model for future use. For simplicity, we will skip this step here.
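For reference, saving a PyCaret model would also be a single call (the file name below is just illustrative):

```python
from pycaret.regression import save_model

save_model(final_model, 'insurance_charges_model')
```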

Now, let’s see how different our predictions were from our actual values:

Line plot of actual versus predicted charge values.

We can see that our model did well, but overall it underpredicted the real charges. This indicates that our dataset could benefit from more variables to improve its predictive power, since only three of them were considered highly relevant by our model.

Conclusion

In this post, we have seen how the PyCaret machine learning framework can be used to make the Data Scientist’s life easier. Using this framework substantially decreased the time spent writing code, and it serves as a useful tool for preliminary analyses of data.

Our dataset did not have many important features for our model. Smoker status, age and BMI values were the most powerful variables for our predictions, and we were still able to reach an R² score of ~0.82 on our test data. This demonstrates the power of machine learning algorithms such as the CatBoost regressor. However, with more relevant features, one might be able to achieve even better predictions in the future.

Observations

Some of the code has been omitted for simplification purposes. To learn about the complete PyCaret workflow, check their official page, or check this post’s related notebook in my GitHub portfolio.

Thank you for reading! :)

Detailed analysis: Jupyter Notebook
Where to find me: GitHub | LinkedIn


Carlos Henrique Fantecelle

Transcriptomics researcher working with infectious diseases and aging of the immune system, but also working on developing other Data Science skills.