Linear Regression

Predict Insurance Charges using different Linear Regression Models and compare results.

Priyanka Dave
Oct 17 · 6 min read
[Figure: Predict Insurance Charges]

Here I will discuss how Linear Regression works and how we can implement it in different ways to achieve the best accuracy.

Data set overview:

I have taken a health insurance data set for analysis. It contains 1338 samples and 7 features.

Here we want to predict insurance charges using the given features: age, sex, bmi, children, smoker and region.

You can download the data set from here.

[Figure: Data set overview]

Table of contents:

  1. Data visualization, interpretation of visuals and feature selection
  2. Simple Linear Regression
  3. Multiple Linear Regression
  4. Polynomial Regression
  5. Discussion
  6. Conclusion

1. Data visualization, interpretation of visuals and feature selection:

Check the relationships between the features and the target variable, and select the most relevant features.

[Figure: age, bmi and smoker vs. charges]

Here we can see that all three features age, bmi and smoker are correlated with the target variable charges. Charges increase with age and bmi, and the smoker variable clearly divides the data set into two parts.

[Figure: sex vs. charges]

Here we can see that the sex variable does not separate the data set in any meaningful way.

[Figure: region vs. charges]

Similarly, region does not separate the data set either.

Conclusion: age, bmi and smoker are important features for predicting charges, whereas sex and region do not show any prominent pattern in the data set.

Data set after converting categorical variables to numeric:

[Figure: Data set after converting categorical variables to numeric]
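
One way to perform this conversion is pandas one-hot encoding. Below is a minimal sketch; the file name and the use of drop_first=True (so that smoker becomes a single smoker_yes column) are assumptions, not taken from the original code.

import pandas as pd

# Load the raw insurance data (file name is an assumption; adjust to your copy)
df = pd.read_csv('insurance.csv')

# One-hot encode the categorical columns; drop_first avoids redundant columns,
# e.g. smoker -> smoker_yes, sex -> sex_male, region -> three region_* columns
df = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)

print(df.columns.tolist())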

2. Simple Linear Regression:

In simple linear regression there is only one input variable and one output variable.

Equation: y = ax+b

  • a = slope
  • b = intercept
  • x = input
  • y = output

Let's use age as the input variable and charges as the output.

from sklearn import linear_model

# Use age as the single input feature and charges as the target
x = df[['age']]
y = df[['charges']]

lr = linear_model.LinearRegression()
lr_model = lr.fit(x, y)

print('Slope: ', lr_model.coef_)
print('Intercept: ', lr_model.intercept_)

Slope:  [[257.72261867]]
Intercept: [3165.88500606]

Note:

  • Based on the above result, the equation to predict the output variable using age as the input would be:

y = (257.72261867 * x) + 3165.88500606

  • Consider a sample with actual output value 8240.5896 and age = 46, bmi = 33.44, smoker_yes = 0.
  • Let's put the value of age in place of x in the above equation:

(257.72261867 * 46) + 3165.88500606 = 15021.12546488

  • So, considering age as the only input, a 46-year-old person would have to pay about 15021.13 in insurance charges under the Simple Linear Regression model (this prediction is reproduced in code below).
  • Here we can see that the predicted value is almost double the actual value, so the Simple Linear Regression model is not performing well.
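
The same prediction can be reproduced directly from the fitted model; a minimal sketch, reusing lr_model and pandas (pd) from the snippets above:

# Predict the charge for the 46-year-old sample with the simple model
sample = pd.DataFrame({'age': [46]})
print(lr_model.predict(sample))   # ~[[15021.13]], far above the actual 8240.59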

Let's visualize the output of the Simple Linear Regression model:

[Figure: Output of Simple Linear Regression Model]

3. Multiple Linear Regression:

In multiple linear regression there can be multiple input variables and a single output variable.

Equation: y = a1x1 + a2x2 + … + b

  • a1, a2 = slopes (coefficients)
  • b = intercept
  • x1, x2 = inputs
  • y = output

Let's use age, bmi and smoker_yes as input variables and charges as the output.

# Use age, bmi and smoker_yes as input features and charges as the target
x = df[['age', 'bmi', 'smoker_yes']]
y = df[['charges']]

lr = linear_model.LinearRegression()
lr_model = lr.fit(x, y)

print('Slope: ', lr_model.coef_)
print('Intercept: ', lr_model.intercept_)

Slope:  [[  259.54749155   322.61513282 23823.68449531]]
Intercept: [-11676.83042519]

Note:

  • Based on the above result, the equation to predict the output variable using age, bmi and smoker_yes as inputs would be:

y = (259.54749155 * x1) + (322.61513282 * x2) + (23823.68449531 * x3) - 11676.83042519

  • Consider the same sample with actual output value 8240.5896 and age = 46, bmi = 33.44, smoker_yes = 0.
  • Putting age in place of x1, bmi in place of x2 and smoker_yes in place of x3 in the above equation:

(259.54749155 * 46) + (322.61513282 * 33.44) + (23823.68449531 * 0) - 11676.83042519 = 11050.6042276108

  • So, considering age, bmi and smoker_yes as input variables, a 46-year-old person would have to pay about 11050.60 in insurance charges under the Multiple Linear Regression model (see the sketch below).
  • Here we can see that the predicted value is much closer to the actual value, so we can say that the Multiple Linear Regression model performs better than Simple Linear Regression.
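
As before, the same number can be reproduced from the fitted model; a minimal sketch, assuming the multiple-regression lr_model from the snippet above is still in scope:

# Predict the charge for the same sample with the multiple-regression model
sample = pd.DataFrame({'age': [46], 'bmi': [33.44], 'smoker_yes': [0]})
print(lr_model.predict(sample))   # ~[[11050.60]], much closer to the actual 8240.59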

Let's visualize the output of the Multiple Linear Regression model:

[Figure: Output of Multiple Linear Regression Model]

4. Polynomial Regression:

  • Sometimes the trend of the data is not really linear and looks curvy. In such cases we can use Polynomial Regression.
  • The relationship between the input variable x and the output variable y is modeled as an nth-degree polynomial in x (a small example of this feature expansion is sketched below).
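
To see what this feature expansion looks like in practice, here is a minimal sketch using scikit-learn's PolynomialFeatures on two toy features (the numbers are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two samples with two features each
x_toy = np.array([[2, 3],
                  [4, 5]])

# Degree-2 expansion of [a, b]: [1, a, b, a^2, a*b, b^2]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(x_toy))
# [[ 1.  2.  3.  4.  6.  9.]
#  [ 1.  4.  5. 16. 20. 25.]]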

4.1 Find the optimum value for the degree of the polynomial:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

x = df[['age', 'bmi', 'smoker_yes']]
y = df[['charges']]
lr = linear_model.LinearRegression()

# Fit a model for each polynomial degree and record its R2-score
scores = []
degree = list(range(2, 15))
for n in degree:
    pr = PolynomialFeatures(degree=n)
    x_pr = pr.fit_transform(x)
    lr.fit(x_pr, y)
    scores.append(lr.score(x_pr, y))

degree_score_df = pd.DataFrame(list(zip(degree, scores)), columns=['Degree', 'R2-score'])
degree_score_df.set_index('Degree', inplace=True)

degree_score_df.plot()
plt.xlabel('Degree')
plt.ylabel('R2-score')
plt.title('Degree Vs. R2-score')

[Figure: Degree Vs. R2-score]

Here we can see that the R2-score is highest, about 0.86, when degree = 13.

4.2 Use degree = 13 and train the model:

x = df[['age', 'bmi', 'smoker_yes']]
y = df[['charges']]

# Expand the features to a degree-13 polynomial and fit a linear model on them
poly = PolynomialFeatures(degree=13)
x_poly = poly.fit_transform(x)

lr = linear_model.LinearRegression()
lr_model = lr.fit(x_poly, y)
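
To predict for the same sample as before, the raw features must first go through the same PolynomialFeatures transform; a minimal sketch, reusing poly and lr_model from the snippet above:

# The sample must be expanded with the same fitted PolynomialFeatures object
sample = pd.DataFrame({'age': [46], 'bmi': [33.44], 'smoker_yes': [0]})
sample_poly = poly.transform(sample)
print(lr_model.predict(sample_poly))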

Let's visualize the output of the Polynomial Regression model:

[Figure: Output of Polynomial Regression Model]

4.3 Normalize the input data and then train the model:

from sklearn import preprocessing

x = df[['age', 'bmi', 'smoker_yes']]
y = df[['charges']]

# Standardize the features (zero mean, unit variance) before the polynomial expansion
normalized_x = preprocessing.StandardScaler().fit(x).transform(x)

poly = PolynomialFeatures(degree=13)
x_poly = poly.fit_transform(normalized_x)

lr = linear_model.LinearRegression()
lr_model = lr.fit(x_poly, y)
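
The same scale-then-expand-then-fit sequence can also be written as a scikit-learn Pipeline, which keeps all three steps bundled together for later predictions. This is only an alternative sketch, not part of the original code, and it assumes the x and y defined above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Scaler, polynomial expansion and linear model chained into one estimator
model = make_pipeline(StandardScaler(),
                      PolynomialFeatures(degree=13),
                      LinearRegression())
model.fit(x, y)
print(model.score(x, y))   # R2-score on the training data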

Let's visualize the output of the Polynomial Regression model using Normalized-X:

[Figure: Output of Polynomial Regression Model using Normalized-X]

5. Discussion:

Compare results of all 4 models:

Here,

MAE = Mean Absolute Error

  • Used to check how wrong our model is.
  • The mean of the absolute differences between actual and predicted values.
  • The higher the MAE, the worse the model.
  • The goal is to minimize this value.

MSE = Mean Squared Error

  • Used to check how wrong our model is.
  • The mean of the squared differences between actual and predicted values.
  • The higher the MSE, the worse the model.
  • The goal is to minimize this value.

R2-score = Accuracy Metric

  • Used to check how accurate our model is.
  • The higher the R2-score, the better the model.
  • The goal is to maximize this value. The best value is 1 and the worst is 0.
  • All three metrics are available in scikit-learn, as sketched below.
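
A minimal sketch of computing these three metrics with scikit-learn, assuming y holds the actual charges and y_pred holds one model's predictions (for example, lr_model.predict on the corresponding inputs):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print('MAE:     ', mean_absolute_error(y, y_pred))
print('MSE:     ', mean_squared_error(y, y_pred))
print('R2-score:', r2_score(y, y_pred))
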
[Figure: Comparison of Results]

Visualize Comparison:

[Figure: Scatter plots of Actual vs. Predicted values using all 4 models]
[Figure: Dist plots of Actual vs. Predicted values using all 4 models]
[Figure: Scatter plots of errors generated by all 4 models]

6. Conclusion:

Comparing the above models, we conclude that Polynomial Regression with Normalized-X is the best model, giving about 87% accuracy (R2-score).

If we look at the error plots:

  • The Simple Linear Regression error ranges from -10,000 to 50,000.
  • The Multiple Linear Regression error ranges from -10,000 to 30,000.
  • There is not much difference between the error ranges of Polynomial Regression and Polynomial Regression with Normalized-X: both range from -10,000 to 25,000, but the error spread of Polynomial Regression with Normalized-X is smaller than that of plain Polynomial Regression.

I hope this helps you understand the basics of Linear Regression.

You can download the full source code from my GitHub Repository.

Thank You!
