Predicting Patient treatment costs using Machine Learning.

Regression and EDA on personal health data to determine factors contributing to treatment costs

Thomas George Thomas
Analytics Vidhya
6 min read | Mar 21, 2021


Photo by Kendal on Unsplash


Linear regression is one of the most important algorithms in the supervised learning category of Machine Learning. It is also one of the simplest and most commonly used models for predictive analysis. Using it, we explore a personal health dataset and predict treatment and insurance costs.

What is a Linear Regression?

In the simplest terms, when the relationship between the target and one or more predictors is linear, we can model it with a linear regression. For a single predictor, the model is:

y = b0 + b1 * x

Here the target (y) is the dependent variable and the predictor (x) is the independent variable. b0 and b1 are the intercept and the slope respectively.
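To make the intercept and slope concrete, here is a minimal sketch that estimates b0 and b1 by ordinary least squares with NumPy. The four data points are made up for illustration and are not part of the health dataset.

```python
import numpy as np

# Toy data, made up for illustration: y is exactly 1 + 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Least-squares estimates for the line y = b0 + b1 * x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print("intercept b0:", b0)  # 1.0
print("slope b1:", b1)      # 2.0
```

Because the toy points lie exactly on a line, the estimates recover the intercept and slope exactly; with real, noisy data they would only approximate the underlying trend.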

Why Linear Regression?

  1. Linear regression can be used to determine the strength of the effect that the predictor variable(s) have on the target variable.
  2. It helps us understand how much the target variable changes with changes in the predictor variable(s).
  3. Most importantly, linear regression can be used to get future estimates and help predict trends, provided the relationship really is close to linear.

Types of Linear Regression

Linear Regression can be broadly classified into two categories:

  1. Simple Linear Regression: an equation with one dependent (target) variable and exactly one independent (predictor) variable.
  2. Multiple Linear Regression: an equation with one dependent (target) variable and two or more independent (predictor) variables.

In this use case, we explore simple linear regression in detail.

Data Description

Photo by Alexander Sinn on Unsplash

For predicting health insurance costs, we use Miri Choi’s Medical Cost Personal Datasets hosted on Kaggle. The columns are described as follows:

  • age: age of the primary beneficiary
  • sex: insurance contractor gender, female, male
  • bmi: body mass index, which indicates whether body weight is relatively high or low relative to height
  • children: Number of children covered by health insurance / Number of dependents
  • smoker: Yes/No
  • region: the beneficiary’s residential area in the US.
  • charges: Individual medical costs billed by health insurance

Now, getting our hands dirty with data

Acquiring the Data

Once we download the CSV data, we can import it using read_csv. We then use head() to sample the data.

Viewing the sample data | Image by Author
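A minimal sketch of this step is shown below. So that it runs without the download, it reads a few made-up rows from an in-memory string; the rows only mimic the dataset's columns, and the file name in the comment is an assumption.

```python
import io
import pandas as pd

# Stand-in for the downloaded file: made-up rows with the dataset's column names.
# With the real download you would pass its path instead,
# e.g. pd.read_csv("insurance.csv") (the file name is an assumption).
csv_text = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16900
18,male,33.8,1,no,southeast,1700
28,male,33.0,3,no,southeast,4400
33,male,22.7,0,no,northwest,22000
32,male,28.9,0,no,northwest,3900
46,female,30.1,1,yes,northeast,41000
"""

data = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default
print(data.head())
```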

Preparing the Data

We try to identify numerical and categorical data. We proceed to collect basic descriptive stats using describe(). We try to understand what the data looks like and what it is trying to tell us.

Viewing descriptive statistics of the data | Image by Author

info() is also useful, as it gives us a concise summary of the data that the DataFrame is holding.
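The sketch below shows what these summary calls look like on a small made-up sample with the same columns as the dataset. Using info() for the concise summary is an assumption about which pandas call is meant here.

```python
import pandas as pd

# Made-up sample rows that mimic the dataset's columns (not real records)
data = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 46],
    "sex": ["female", "male", "male", "male", "male", "female"],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 30.1],
    "children": [0, 1, 3, 0, 0, 1],
    "smoker": ["yes", "no", "no", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southeast",
               "northwest", "northwest", "northeast"],
    "charges": [16900.0, 1700.0, 4400.0, 22000.0, 3900.0, 41000.0],
})

# describe() summarizes the numeric columns: count, mean, std, min, quartiles, max
stats = data.describe()
print(stats)

# info() prints column names, non-null counts and dtypes
data.info()
```

Note that describe() skips the text columns by default, which is itself a quick hint about which features are categorical.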

Exploratory Data Analysis (EDA)

EDA is the analytical process in data science where the main characteristics are drawn and summarized by investigating and analyzing the data sets. We take a closer look at all the involved steps as follows:

Feature Engineering

Photo by Alex Knight on Unsplash

Feature engineering is the process of using domain knowledge to extract features from raw data in order to improve model performance.

Features in Machine learning essentially mean columns.

From our prior step of descriptive analysis, we can see that the data comes in two forms: numerical and categorical. We partition the features accordingly:

Differentiating numerical and categorical features | Image by Author
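One way to do this split, sketched on made-up sample rows, is pandas' select_dtypes(). This is an illustrative approach and not necessarily the one used in the original notebook.

```python
import pandas as pd

# Made-up sample rows that mimic the dataset's columns (not real records)
data = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 46],
    "sex": ["female", "male", "male", "male", "male", "female"],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 30.1],
    "children": [0, 1, 3, 0, 0, 1],
    "smoker": ["yes", "no", "no", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southeast",
               "northwest", "northwest", "northeast"],
    "charges": [16900.0, 1700.0, 4400.0, 22000.0, 3900.0, 41000.0],
})

# Numeric dtypes (ints and floats) are the numerical features; everything else is categorical
numerical = data.select_dtypes(include="number").columns.tolist()
categorical = data.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical)
print("categorical:", categorical)
```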

Building a model with categorical data is hard but not impossible. For simplicity, we proceed to convert categorical data into numerical data. For this purpose, we use the One hot encoding technique.

One hot encoding is a technique where we replace the categorical data with binary digits. Each categorical column is split into as many columns as it has distinct values. Each new column is then given a ‘1’ or a ‘0’ depending on whether the row holds that value.

We apply one-hot encoding using get_dummies().

One hot encoding on Categorical data | Image by Author
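As a sketch, here is get_dummies() applied to made-up sample rows. Passing dtype=int is a choice made here so that the new columns show 1/0 rather than True/False.

```python
import pandas as pd

# Made-up sample rows that mimic the dataset's columns (not real records)
data = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 46],
    "sex": ["female", "male", "male", "male", "male", "female"],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 30.1],
    "children": [0, 1, 3, 0, 0, 1],
    "smoker": ["yes", "no", "no", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southeast",
               "northwest", "northwest", "northeast"],
    "charges": [16900.0, 1700.0, 4400.0, 22000.0, 3900.0, 41000.0],
})

# Each categorical column becomes one 0/1 column per distinct value,
# named <column>_<value>, e.g. smoker_yes and smoker_no
encoded = pd.get_dummies(data, columns=["sex", "smoker", "region"], dtype=int)

print(encoded.columns.tolist())
print(encoded[["smoker_no", "smoker_yes"]])
```

The original smoker column disappears and is replaced by smoker_no and smoker_yes, which is where the smoker_yes feature used later comes from.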

After preparing our data, we are ready for the next step. We need to pick the important ‘features’ that will have an impact on the target variable. A simple way to do that is to look at the correlation(s) between the different features. This can be done with data.corr().

Exploring the correlation between the features | Image by Author
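A minimal sketch of the correlation step on made-up rows follows. Only a few of the dataset's columns are included to keep it short, and the resulting numbers say nothing about the real data.

```python
import pandas as pd

# Made-up sample rows with a subset of the dataset's columns (not real records)
data = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 46],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 30.1],
    "smoker": ["yes", "no", "no", "no", "no", "yes"],
    "charges": [16900.0, 1700.0, 4400.0, 22000.0, 3900.0, 41000.0],
})

# corr() needs numeric columns, so encode the categorical one first
encoded = pd.get_dummies(data, columns=["smoker"], dtype=int)

# Pairwise Pearson correlation between all columns
corr = encoded.corr()

# Correlation of every feature with the target, strongest first
print(corr["charges"].sort_values(ascending=False))
```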

We visualize the correlations using a heatmap to better explore the trends.

Heatmap showing the correlation between the features | Image by Author
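A heatmap like this can be sketched with plain matplotlib, as below. The plotting library used for the original figure is not stated, so imshow() here is a stand-in, and the data is again made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows with a subset of the dataset's columns (not real records)
data = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 46],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 30.1],
    "smoker": ["yes", "no", "no", "no", "no", "yes"],
    "charges": [16900.0, 1700.0, 4400.0, 22000.0, 3900.0, 41000.0],
})
corr = pd.get_dummies(data, columns=["smoker"], dtype=int).corr()

# Draw the correlation matrix as a colored grid on a fixed -1..1 scale
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)

# Label both axes with the feature names
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

fig.colorbar(im, ax=ax, label="Pearson correlation")
fig.tight_layout()
# Call fig.savefig("heatmap.png") or plt.show() to view the figure
```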

From this we can make the following observations:

  1. Strong correlation between charges and smoker_yes.
  2. Weak correlation between charges and age.
  3. Weak correlation between charges and BMI.
  4. Weak correlation between BMI and region_southeast.

Since the weak correlations are all below 0.5, we treat those features as insignificant and drop them.

Remember: Correlation doesn’t imply Causation

Here we can see that exactly one predictor, smoker_yes, is strongly correlated with our target variable (charges). This makes for a good use case for simple linear regression.

Exploring the correlation and the trend between charges and smoker_yes:

Graph showing the varying trend for treatment charges of patients | Image by Author

From the graph, the treatment charges range from a minimum of 1122 to a maximum of 63770. A significant number of patients sit near the minimum, while only a few come close to the maximum.

Building the Model

Training the linear regression model | Image by Author

We begin to predict the values of the patient charges using the other features. We build a simple linear regression model using LinearRegression from the sklearn.linear_model package. We split the data set into a training set and a test set. A common choice is to hold out 30% of the dataset for testing using test_size=0.3. We take every column except charges as the predictors and charges as the target variable. We then fit the linear regression model on the training set using fit(). This part is called model fitting. Finally, we check the score of the model on both the training and test sets using score(), which returns R². It comes out to be 79%, which is pretty decent I would say.
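The steps above can be sketched as follows. The data is synthetic (generated with NumPy, not the Kaggle dataset), and the coefficients used to generate it, as well as random_state=42, are arbitrary assumptions, so the printed scores will not match the 79% reported here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Kaggle dataset):
# charges depend on smoker_yes plus random noise
rng = np.random.default_rng(0)
n = 200
smoker_yes = rng.integers(0, 2, size=n)
charges = 8000 + 24000 * smoker_yes + rng.normal(0, 5000, size=n)
data = pd.DataFrame({"smoker_yes": smoker_yes, "charges": charges})

X = data.drop(columns="charges")  # predictors: every column except the target
y = data["charges"]               # target

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training set only; the test set stays unseen
model = LinearRegression()
model.fit(X_train, y_train)

# score() returns R^2 for a regressor
print("train R^2:", model.score(X_train, y_train))
print("test R^2:", model.score(X_test, y_test))
print("intercept:", model.intercept_, "slope:", model.coef_[0])
```

Fitting only on the training rows is what makes the test score meaningful: it measures how the model does on data it has not seen.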

Evaluating the Model

To evaluate our linear regression model, we use R² (also known as the coefficient of determination) and Mean Squared Error as our metrics.

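R² is a statistical measure of how close the observed data are to the fitted regression line, that is, the proportion of the variance in the target that the model explains. It is usually expressed as a percentage between 0% and 100%, although on held-out data it can fall below 0% when the model does worse than simply predicting the mean. Generally speaking, the higher the R², the better the model fits.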

Mean Squared Error (MSE) is the average squared difference between the predicted values and the actual values. The lower the MSE the better the model fits.
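Both metrics are easy to compute by hand. Here is a small worked sketch with made-up actual and predicted values (not taken from the model above):

```python
import numpy as np

# Made-up actual and predicted values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: average of the squared prediction errors
mse = np.mean((y_true - y_pred) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)  # 0.5
print("R^2:", r2)   # 0.9
```

The errors are -1, 0, 1 and 0, so the squared errors sum to 2 and the MSE is 2 / 4 = 0.5. The actual values vary around their mean of 6 with a total sum of squares of 20, so R² is 1 - 2 / 20 = 0.9. In practice, mean_squared_error and r2_score from sklearn.metrics compute the same quantities.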

Model Evaluation results | Image by Author

From the figure, MSE is close to 0 and R² on the test data is a whopping 79%!


Our R² and mean squared error on the training and test data closely match, which suggests that the model is not overfitting and fits the data reasonably well. This supports using the model to predict patient charges based on their personal health data.

I hope that I was able to explain linear regression and the related concepts of EDA, feature engineering and selection, R² and Mean Squared error through this use case effectively. Personal health data is a great example of where linear regression simply works! Thank you for reading!


  1. My code, Regression on Personal Health Data (2020), GitHub
  2. Miri Choi, Medical Cost Personal Datasets (2013), Kaggle
  3. Statistics Solutions, What is Linear Regression (2013)



Thomas George Thomas
Data Analytics Engineering Graduate Student at Northeastern. Ex Senior Data Engineer & IBM Certified Data Scientist.