Regression Model

Shruti Garg
Published in Analytics Vidhya
6 min read, Nov 1, 2019

In this article we will see how a regression model is built in Python. First, let us talk about what regression analysis actually means.

Regression analysis focuses on the relationship between a dependent variable and one or more independent variables, whereas simple linear regression focuses on the relationship between two variables only: one dependent and one independent.

Let’s see the equations of Simple Linear Regression and Multiple Linear Regression.

Simple Linear Regression: y = b0 + b1*x1

Multiple Linear Regression: y = b0 + b1*x1 + b2*x2 + ... + bn*xn

Here, y is the dependent variable, the xi’s are the independent variables and the bi’s are their coefficients. b0 is the intercept, or constant, defined as the mean value of the dependent variable when all the independent variables are set to zero.

We’ll be applying the concept of multiple linear regression to predict Health Insurance Cost. Let’s see the steps to build a model in Python.

1) Import the python libraries

2) Import the data

3) Missing Value Treatment

4) Dummy Creation of Categorical Variables

5) Correlation to select variables

6) Splitting of the data (Training and Testing)

7) Building the model

8) Decile Analysis and Error Testing

The data used here is on health insurance costs. The main aim of the model is to predict the cost of health insurance from a few factors. So, without any further delay, let’s get started.

The data has 7 columns:

  1. age: age of the primary beneficiary
  2. sex: gender of the beneficiary
  3. bmi: Body Mass Index
  4. children: number of children covered by health insurance / number of dependents
  5. smoker: whether the beneficiary smokes
  6. region: the beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
  7. charges: individual medical costs billed by health insurance

The important Python Libraries that need to be imported are:

Now we import the Health Insurance Cost data and take a look at it. The image below shows the first five rows of the data.
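A minimal sketch of this step, assuming the data is a CSV file with the seven columns listed earlier. The rows below are invented for illustration and the file name `insurance.csv` in the comment is hypothetical.

```python
import io
import pandas as pd

# Toy rows invented for illustration only; they are NOT the article's data.
toy_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,22.0,0,no,northeast,3000.0\n"
    "40,male,31.5,2,yes,southeast,38000.0\n"
    "18,male,27.0,0,yes,southwest,16000.0\n"
    "52,female,26.0,1,no,northwest,11000.0\n"
    "33,male,19.0,3,no,northeast,5500.0\n"
    "61,female,35.0,0,yes,southeast,47000.0\n"
)

# With a real file this would be: df = pd.read_csv("insurance.csv")
# ("insurance.csv" is a hypothetical file name).
df = pd.read_csv(toy_csv)

print(df.head())    # first five rows
print(df.shape)     # (number of rows, number of columns)
```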

Our next step will be to explore the data. We need to see whether missing value treatment or outlier treatment is required. Luckily, we don’t have any missing values here, so the data is clean.
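The missing-value check can be sketched as follows. The toy frame is invented and has one gap put in on purpose; the median fill is only one possible treatment and is an assumption, since the article's data needed none.

```python
import numpy as np
import pandas as pd

# Toy frame invented for illustration; one BMI value is deliberately missing.
df = pd.DataFrame({
    "age": [25, 40, 18, 52],
    "bmi": [22.0, np.nan, 27.0, 26.0],
    "charges": [3000.0, 38000.0, 16000.0, 11000.0],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# One simple treatment (an assumption, not the article's method):
# fill numeric gaps with the median of the column.
df_filled = df.fillna(df.median(numeric_only=True))
print(df_filled)
```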

Now, let’s look at the visualizations that can be done from this data.

The above image shows the total insurance charges for males and females.
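The numbers behind such a chart come from a group-by aggregation. A small sketch on invented rows, assuming the column names `sex` and `charges`:

```python
import pandas as pd

# Toy rows invented for illustration; not the article's data.
df = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male"],
    "charges": [3000.0, 38000.0, 16000.0, 11000.0, 5500.0],
})

# Total charges per sex: the quantity a bar chart of this kind would show.
total_by_sex = df.groupby("sex")["charges"].sum()
print(total_by_sex)

# To draw the bars (requires Matplotlib):
# total_by_sex.plot(kind="bar")
```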

No. of Smokers in each Region

The above image shows that there are more male smokers than female smokers.

These are the charges for smokers and non-smokers.

Now we will see the count of 18-year-old male and female smokers.

Surprisingly, we see that some 18-year-olds also smoke.

We have a column for the number of children that people have. So, let’s look at the distribution.

We can see that only a few people have 5 children and many have none.

Now let’s look at another important column of the data, which is the BMI (Body Mass Index). A person with a BMI below 18.5 is said to be underweight. A BMI between 18.5 and 24.9 is ideal, a BMI between 25 and 29.9 is overweight, and a BMI of 30 or above indicates obesity.
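These ranges can be turned into a categorical column with `pd.cut`. The sketch below assumes left-closed bins, so a BMI of exactly 30 counts as obese; the label names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# A few sample BMI values, including the boundaries of each range.
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 41.2])

# Left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_category = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["underweight", "ideal", "overweight", "obese"],
    right=False,
)
print(bmi_category.tolist())
```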

The image below shows the charges for smokers across different BMI values.

Now let’s look at the number of smokers in each BMI category.

We observe that the overweight category has more smokers than any other category.

We have now explored the data visually and it’s time to build the model. We first make dummies for the categorical variables. Once the dummies are made, the categorical variables look like this:

After concatenating the numerical variables we get our final data, which looks like this:
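A sketch of the dummy creation and concatenation on invented rows. `drop_first=True` is an assumption made here to avoid redundant dummy columns; the resulting name `smoker_yes` matches the coefficient name used later in the article.

```python
import pandas as pd

# Toy rows invented for illustration; not the article's data.
df = pd.DataFrame({
    "age": [25, 40, 18, 52],
    "bmi": [22.0, 31.5, 27.0, 26.0],
    "children": [0, 2, 0, 1],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["no", "yes", "yes", "no"],
    "region": ["northeast", "southeast", "southwest", "northwest"],
    "charges": [3000.0, 38000.0, 16000.0, 11000.0],
})

categorical = ["sex", "smoker", "region"]
numerical = ["age", "bmi", "children", "charges"]

# One 0/1 column per category level; drop_first=True (an assumption) drops
# the first level of each variable so the dummies are not redundant.
dummies = pd.get_dummies(df[categorical], drop_first=True, dtype=int)

# Put the numerical columns and the dummies side by side.
final_data = pd.concat([df[numerical], dummies], axis=1)
print(final_data.columns.tolist())
```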

Now, coming back to the definition of Multiple Linear Regression: here our “Y” is charges, since we need to predict the insurance cost, and the other columns are the independent variables (“xi’s”). We took the log of Y so that its distribution is closer to a normal curve. Before building the model we split the data into training and testing sets, taking 70 percent of the data for training and 30 percent for testing.
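The log transform and the 70/30 split might look like the sketch below, assuming scikit-learn's `train_test_split`. The rows are invented and `random_state=42` is an arbitrary choice made here for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows invented for illustration; not the article's data.
final_data = pd.DataFrame({
    "age": [25, 40, 18, 52, 33, 61, 29, 45, 37, 56],
    "bmi": [22.0, 31.5, 27.0, 26.0, 19.0, 35.0, 24.5, 30.1, 28.3, 33.0],
    "smoker_yes": [0, 1, 1, 0, 0, 1, 0, 0, 1, 0],
    "charges": [3000.0, 38000.0, 16000.0, 11000.0, 5500.0,
                47000.0, 4000.0, 9000.0, 21000.0, 12500.0],
})

X = final_data.drop(columns="charges")   # independent variables (xi's)
y = np.log(final_data["charges"])        # log of the dependent variable

# 70 percent of the rows for training, 30 percent for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```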

After splitting the data we need to add the constant (b0), as the definition of Multiple Linear Regression requires. Once this is done we fit our model and get the OLS (Ordinary Least Squares) regression results. Only a few variables are used here, because their feature importance was higher than that of the others.

Now we check the errors of the model.

MAE: Mean Absolute Error

MSE: Mean Squared Error

RMSE: Root Mean Squared Error
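The three measures can be written out directly with NumPy. The helper name `regression_errors` and the sample values are chosen here for illustration.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))   # average absolute error
    mse = np.mean(residuals ** 2)      # average squared error
    rmse = np.sqrt(mse)                # square root of the MSE
    return mae, mse, rmse

# Residuals here are 1, 0 and -2.
mae, mse, rmse = regression_errors([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(mae, mse, rmse)
```

scikit-learn's `mean_absolute_error` and `mean_squared_error` return the same MAE and MSE.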

For Training data we get:

and for Testing data we get:

We say the model is overfitting if the MSE of the training data is much lower than the MSE of the testing data; a training error that is only slightly lower is normal.

Hence, our model is neither overfitting nor underfitting and can be used for predicting the Insurance cost.

We have the coefficients for the xi’s from the OLS regression results, so we put them into the equation to predict the insurance cost.

For example:

For a person aged 35 who smokes and has a BMI of 24, the Health Insurance Cost is given by:

log(y) = b0 + (coef_age)*age + (coef_bmi)*bmi + (coef_smoker_yes)*smoker

Here smoker is 1 for a smoker and 0 otherwise. Since the model was fitted on the log of charges, the right-hand side gives log(y), and y is recovered by taking the exponential. After solving the equation we get the value as 23784. Therefore, this will be the Health Insurance Cost for the above-mentioned person. Similarly, we can get the amount for others.
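The calculation can be sketched as a small function. The coefficients below are hypothetical placeholders, not the article's fitted values, so the result will not match the figure quoted above. Because the model was fitted on the log of charges, the linear combination is exponentiated at the end.

```python
import math

# Hypothetical coefficients for illustration; NOT the article's OLS results.
coef = {"const": 7.0, "age": 0.03, "bmi": 0.01, "smoker_yes": 1.5}

def predict_charges(age, bmi, smoker, coef):
    """Predict the insurance cost from a model fitted on log(charges)."""
    log_charges = (
        coef["const"]
        + coef["age"] * age
        + coef["bmi"] * bmi
        + coef["smoker_yes"] * (1 if smoker else 0)
    )
    # Undo the log transform to get back to the original cost scale.
    return math.exp(log_charges)

cost_smoker = predict_charges(age=35, bmi=24, smoker=True, coef=coef)
cost_non_smoker = predict_charges(age=35, bmi=24, smoker=False, coef=coef)
print(round(cost_smoker, 2), round(cost_non_smoker, 2))
```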

Hence, this is how a Multiple Linear Regression model can be used for prediction.

Thanks for reading:)
