# Linear Regression

## Predict Insurance Charges using different Linear Regression Models and compare results.

Here I will discuss how Linear Regression works and how we can implement it in different ways to achieve the best accuracy.

# Data set overview:

I have taken a health insurance data set for this analysis. It contains 1338 samples and 7 features.

We want to predict insurance charges using the given features: age, sex, bmi, children, smoker and region.
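Since the raw file is not shown here, the schema can be sketched with a few rows in a DataFrame (the age = 46 row is the sample used later in this article; the other rows are made-up placeholders):

```python
import pandas as pd

# Illustrative rows in the same schema as the insurance data set.
# The age = 46 row is the sample discussed later in this article;
# the other values are invented for demonstration only.
df = pd.DataFrame({
    "age": [19, 46, 61],
    "sex": ["female", "male", "female"],
    "bmi": [27.9, 33.44, 29.07],
    "children": [0, 1, 0],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
    "charges": [16884.924, 8240.5896, 13041.921],
})
print(df.shape)  # the full data set has 1338 rows and these same 7 columns
```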

# Table of contents:

1. Data visualization, interpretation of visuals and feature selection
2. Simple Linear Regression
3. Multiple Linear Regression
4. Polynomial Regression
5. Discussion
6. Conclusion

# 1. Data visualization, interpretation of visuals and feature selection:

First, check the relationships between the features and the target variable, and select the most relevant features.

Here we can see that age, bmi and smoker are all correlated with the target variable charges. Charges increase with both age and bmi, and the smoker variable clearly divides the data set into two parts.

The sex variable, on the other hand, does not separate the data set in any way.

Similarly, region does not separate the data set either.

Conclusion: age, bmi and smoker are important features for predicting charges, while sex and region do not show any prominent pattern in the data set.

Data set after converting the categorical variables to numeric:
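One common way to do this conversion is pandas' `get_dummies`; with `drop_first=True` a binary column like smoker becomes a single smoker_yes indicator, matching the feature name used below. A minimal sketch:

```python
import pandas as pd

# Convert the categorical columns to numeric dummy variables.
# drop_first=True keeps one indicator per binary category, so
# "smoker" becomes a single "smoker_yes" column.
df = pd.DataFrame({
    "sex": ["female", "male"],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"],
})
encoded = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True)
print(encoded.columns.tolist())
# ['sex_male', 'smoker_yes', 'region_southwest']
```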

# 2. Simple Linear Regression:

In simple linear regression there is only one input variable and one output variable.

Equation: y = ax+b

• a = slope
• b = intercept
• x = input
• y = output

Let’s use age as the input variable and charges as the output.

from sklearn import linear_model

x = df[['age']]
y = df[['charges']]
lr = linear_model.LinearRegression()
lr_model = lr.fit(x, y)

print('Slope: ', lr_model.coef_)
print('Intercept: ', lr_model.intercept_)

`Slope:  [[257.72261867]]`
`Intercept:  [3165.88500606]`

Note:

• Based on the above result, the equation to predict the output variable using age as the input is:

## y = (257.72261867 * x) + 3165.88500606

• Take a sample whose actual charges are 8240.5896, with
• age = 46, bmi = 33.44, smoker_yes = 0
• Substituting the value of age for x in the above equation:

## (257.72261867 * 46) + 3165.88500606 = 15021.12546488

• So, considering age as the only input, a 46-year-old person would pay 15021.12546488 in insurance charges according to the Simple Linear Regression model.
• The predicted value is almost double the actual value, so we can say that the Simple Linear Regression model is not performing well.
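The arithmetic above can be verified in a couple of lines:

```python
# Reproduce the simple-regression prediction from the fitted
# slope and intercept reported above.
slope = 257.72261867
intercept = 3165.88500606
age = 46
actual = 8240.5896

predicted = slope * age + intercept
ratio = predicted / actual
print(predicted)  # about 15021.13
print(ratio)      # roughly 1.8 -- almost double the actual charge
```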

Let’s visualize the output of the Simple Linear Regression model:
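Since the original plot is not reproduced here, this is a sketch of what it would look like, using the slope and intercept found above; the scatter points are synthetic stand-ins for the real (age, charges) pairs:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt

# Synthetic stand-in data generated around the fitted line; the real
# article plots the actual (age, charges) points from the data set.
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=100)
charges = 257.72261867 * age + 3165.88500606 + rng.normal(0, 6000, size=100)

plt.scatter(age, charges, s=10, label="data")
xs = np.array([18, 64])
plt.plot(xs, 257.72261867 * xs + 3165.88500606, color="red", label="fitted line")
plt.xlabel("age")
plt.ylabel("charges")
plt.title("Simple Linear Regression: charges vs. age")
plt.legend()
plt.savefig("slr_fit.png")
```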

# 3. Multiple Linear Regression

In multiple linear regression there can be multiple input variables and a single output.

Equation: y = a1x1 + a2x2 + … + b

• a1,a2 = slope
• b = intercept
• x1,x2 = input
• y = output

Let’s use age, bmi and smoker_yes as input variables and charges as output.

x = df[['age', 'bmi', 'smoker_yes']]
y = df[['charges']]
lr = linear_model.LinearRegression()
lr_model = lr.fit(x, y)

print('Slope: ', lr_model.coef_)
print('Intercept: ', lr_model.intercept_)

`Slope:  [[  259.54749155   322.61513282 23823.68449531]]`
`Intercept:  [-11676.83042519]`

Note:

• Based on the above result, the equation to predict the output variable using age, bmi and smoker_yes as inputs is:

## y = (259.54749155 * x1) + (322.61513282 * x2) + (23823.68449531 * x3) - 11676.83042519

• Take the same sample whose actual charges are 8240.5896, with
• age = 46, bmi = 33.44, smoker_yes = 0
• Substituting age for x1, bmi for x2 and smoker_yes for x3 in the above equation:

## (259.54749155 * 46) + (322.61513282 * 33.44) + (23823.68449531 * 0) - 11676.83042519 = 11050.6042276108

• So, considering age, bmi and smoker_yes as inputs, a 46-year-old person would pay 11050.6042276108 in insurance charges according to the Multiple Linear Regression model.
• This prediction is much closer to the actual value, so we can say that Multiple Linear Regression performs better than Simple Linear Regression.
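As a quick check, the same prediction can be computed as a dot product of the reported coefficients with the sample:

```python
import numpy as np

# Reproduce the multiple-regression prediction from the fitted
# coefficients and intercept reported above.
coef = np.array([259.54749155, 322.61513282, 23823.68449531])  # age, bmi, smoker_yes
intercept = -11676.83042519
sample = np.array([46, 33.44, 0])

predicted = coef @ sample + intercept
print(predicted)  # roughly 11050.60, much closer to the actual 8240.59
```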

Let’s visualize the output of the Multiple Linear Regression model:

# 4. Polynomial Regression:

• Sometimes the trend of the data is not really linear and looks curved. In such cases we can use Polynomial Regression.
• The relationship between the input variable x and the output variable y is then modeled as an nth-degree polynomial in x.
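To see what this expansion looks like in practice, here is a tiny example of scikit-learn's `PolynomialFeatures` with two inputs and degree 2:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# With degree=2 and two inputs x1, x2, PolynomialFeatures expands each
# sample to [1, x1, x2, x1^2, x1*x2, x2^2].
x = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(x)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```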

# 4.1 Find Optimum Value for degree of polynomial:

from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt

x = df[['age', 'bmi', 'smoker_yes']]
y = df[['charges']]
lr = linear_model.LinearRegression()

scores = []
degree = list(range(2, 15))
for n in degree:
    pr = PolynomialFeatures(degree=n)
    x_pr = pr.fit_transform(x)
    lr.fit(x_pr, y)
    scores.append(lr.score(x_pr, y))

degree_score_df = pd.DataFrame(list(zip(degree, scores)), columns=['Degree', 'R2-score'])
degree_score_df.set_index('Degree', inplace=True)

degree_score_df.plot()
plt.xlabel('Degree')
plt.ylabel('R2-score')
plt.title('Degree Vs. R2-score')

Here we can see that the R2-score is highest (about 0.86) at degree = 13. Note that these scores are computed on the training data itself, so very high degrees risk overfitting; evaluating on a held-out test set would give a more reliable choice.

# 4.2 Use value of degree as 13 and train model:

x = df[['age', 'bmi', 'smoker_yes']]
y = df[['charges']]

poly = PolynomialFeatures(degree=13)
x_poly = poly.fit_transform(x)

lr = linear_model.LinearRegression()
lr_model = lr.fit(x_poly, y)

Let’s visualize the output of the Polynomial Regression model:

# 4.3 Normalize input data and then train model:

from sklearn import preprocessing

x = df[['age', 'bmi', 'smoker_yes']]
y = df[['charges']]

normalized_x = preprocessing.StandardScaler().fit(x).transform(x)

poly = PolynomialFeatures(degree=13)
x_poly = poly.fit_transform(normalized_x)

lr = linear_model.LinearRegression()
lr_model = lr.fit(x_poly, y)

Let’s visualize the output of the Polynomial Regression model using Normalized-X:

# 5. Discussion:

Let’s compare the results of all four models:

Here,

MAE = Mean Absolute Error

• Measures how wrong the model is
• The mean of the absolute differences between actual and predicted values.
• The higher the MAE, the worse the model
• The goal is to minimize this value.

MSE = Mean Squared Error

• Measures how wrong the model is, penalizing large errors more heavily
• The mean of the squared differences between actual and predicted values.
• The higher the MSE, the worse the model
• The goal is to minimize this value.

R2-score = Accuracy Metric

• Measures how accurate the model is
• The higher the R2-score, the better the model.
• The goal is to maximize this value. The best possible value is 1; a value of 0 means the model explains none of the variance, and it can even be negative for very poor models.
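All three metrics are available in `sklearn.metrics`; here is a toy illustration with made-up values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy actual/predicted values, invented purely to illustrate the metrics.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |actual - predicted|
mse = mean_squared_error(y_true, y_pred)   # mean of (actual - predicted)^2
r2 = r2_score(y_true, y_pred)              # 1 - (residual SS / total SS)
print(mae)  # 0.5
print(mse)  # 0.375
print(r2)   # 0.925
```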

Visualize Comparison:

# 6. Conclusion:

Comparing the above models, we conclude that Polynomial Regression with Normalized-X is the best model, giving an R2-score of about 0.87.

If we look at the error plots:

• Simple Linear Regression errors range from -10,000 to 50,000.
• Multiple Linear Regression errors range from -10,000 to 30,000.
• There is not much difference between the error ranges of Polynomial Regression and Polynomial Regression with Normalized-X. Both range from -10,000 to 25,000, but the error spread of Polynomial Regression with Normalized-X is smaller.

I hope this helps you understand the basics of Linear Regression.

Written by

## Priyanka Dave

#### Data Science Enthusiast 