# Linear Regression

## Predict Insurance Charges using different Linear Regression Models and compare results.

Here I will discuss how **Linear Regression** works and how we can implement it in different ways to achieve the best accuracy.

**Data set overview:**

I have taken a health insurance data set for analysis. It contains **1338 samples** and **7 features**.

Here we want to predict **insurance charges** using the given features: **age, sex, bmi, children, smoker and region**.

You will be able to download data from here.

# Table of Contents:

- Data visualization, interpretation of visuals and feature selection
- Simple Linear Regression
- Multiple Linear Regression
- Polynomial Regression
- Discussion
- Conclusion

**1. Data visualization, interpretation of visuals and feature selection:**

Check the relationships between the features and the target variable, and select the most relevant features.

Here we can see that the three features **age, bmi and smoker** are all correlated with the target variable **charges**. Charges increase with age and bmi, and the smoker variable clearly divides the data set into two parts.

Here we can see that the sex variable does not differentiate the data set in any way.

In the same way, region also does not differentiate the data set.

**Conclusion:** age, bmi and smoker are important features for predicting charges, while sex and region do not show any prominent pattern in the data set.

**Data set after converting categorical variables to numeric:**
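A minimal sketch of this conversion step using `pd.get_dummies`, assuming the data set's original column names (the two sample rows below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample rows mirroring the insurance data set's columns.
df = pd.DataFrame({
    "age": [46, 19],
    "sex": ["male", "female"],
    "bmi": [33.44, 27.9],
    "children": [1, 0],
    "smoker": ["no", "yes"],
    "region": ["southeast", "southwest"],
    "charges": [8240.5896, 16884.924],
})

# One-hot encode the categorical columns; drop_first=True avoids
# redundant columns (e.g. keeps smoker_yes but drops smoker_no).
df = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True)
print(df.columns.tolist())
```

This is where the `smoker_yes` column used in the later models comes from.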

**2. Simple Linear Regression:**

In simple linear regression there is only one input variable and one output variable.

**Equation: y = ax+b**

- a = slope
- b = intercept
- x = input
- y = output

Let’s use **age** as an input variable and **charges** as an output.

```python
from sklearn import linear_model

x = df[['age']]
y = df[['charges']]

lr = linear_model.LinearRegression()
lr_model = lr.fit(x, y)

print('Slope: ', lr_model.coef_)
print('Intercept: ', lr_model.intercept_)
```

```
Slope: [[257.72261867]]
Intercept: [3165.88500606]
```

**Note:**

- Based on the above result, the equation to predict the output variable using **age** as input would be:

**y = (257.72261867 * x) + 3165.88500606**

- Take a sample with actual output value **8240.5896**, having age = 46, bmi = 33.44, smoker_yes = 0.
- Putting the value of **age** in place of x in the above equation:

**(257.72261867 * 46) + 3165.88500606 = 15021.12546488**

- So, considering **age** as the only input, a **46-year-old** person would pay an insurance charge of **15021.12546488** according to the **Simple Linear Regression** model.
- Here we can see that the predicted value is almost double the actual value, so we can say that the **Simple Linear Regression** model is not performing well.
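Instead of plugging the coefficients into the equation by hand, the fitted model's `predict` method performs exactly the same computation. A minimal sketch with made-up sample points (the article's `lr_model` was fit on the full data set):

```python
import numpy as np
from sklearn import linear_model

# Made-up (age, charges) pairs purely for illustration.
x = np.array([[18], [30], [46], [60]])
y = np.array([[7800.0], [10900.0], [15000.0], [18600.0]])
lr_model = linear_model.LinearRegression().fit(x, y)

# predict() evaluates y = coef_ * x + intercept_ for us.
manual = lr_model.coef_[0][0] * 46 + lr_model.intercept_[0]
predicted = lr_model.predict(np.array([[46]]))[0][0]
print(predicted, manual)  # the two values match
```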

**Let’s visualize output of Simple Linear Regression Model:**
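One way to produce such a plot is to scatter the actual points and overlay the fitted line. A sketch using synthetic data generated from the slope and intercept found above (in the article, `x` and `y` come from `df`):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn import linear_model

# Synthetic stand-in data around the fitted line y = 257.7x + 3165.9.
rng = np.random.default_rng(0)
x = rng.integers(18, 65, size=(100, 1)).astype(float)
y = 257.7 * x + 3165.9 + rng.normal(0, 3000, size=(100, 1))

lr_model = linear_model.LinearRegression().fit(x, y)

plt.scatter(x, y, alpha=0.5, label="actual charges")
plt.plot(x, lr_model.predict(x), color="red", label="regression line")
plt.xlabel("age")
plt.ylabel("charges")
plt.title("Simple Linear Regression: age vs. charges")
plt.legend()
plt.savefig("simple_lr.png")
```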

# 3. Multiple Linear Regression

In multiple linear regression there can be multiple inputs and single output.

**Equation: y = a1x1 + a2x2 + … + b**

- a1, a2 = slopes (coefficients)
- b = intercept
- x1, x2 = inputs
- y = output

Let’s use **age, bmi** and **smoker_yes** as input variables and **charges** as output.

```python
x = df[['age', 'bmi', 'smoker_yes']]
y = df[['charges']]

lr = linear_model.LinearRegression()
lr_model = lr.fit(x, y)

print('Slope: ', lr_model.coef_)
print('Intercept: ', lr_model.intercept_)
```

```
Slope: [[  259.54749155   322.61513282 23823.68449531]]
Intercept: [-11676.83042519]
```

**Note:**

- Based on the above result, the equation to predict the output variable using **age, bmi and smoker_yes** as inputs would be:

**y = (259.54749155 * x1) + (322.61513282 * x2) + (23823.68449531 * x3) - 11676.83042519**

- Take the same sample with actual output value **8240.5896**, having age = 46, bmi = 33.44, smoker_yes = 0.
- Putting the value of **age** in place of **x1**, **bmi** in place of **x2** and **smoker_yes** in place of **x3** in the above equation:

**(259.54749155 * 46) + (322.61513282 * 33.44) + (23823.68449531 * 0) - 11676.83042519 = 11050.6042276108**

- So, considering **age, bmi** and **smoker_yes** as input variables, a **46-year-old** person would pay an insurance charge of **11050.6042276108** according to the **Multiple Linear Regression** model.
- Here we can see that the predicted value is much closer to the actual value, so we can say that the **Multiple Linear Regression** model performs better than **Simple Linear Regression**.

**Let’s visualize output of Multiple Linear Regression Model:**

# 4. Polynomial Regression:

- Sometimes the trend of the data is not really linear and looks curvy. In this case we can use **Polynomial Regression** methods.
- The relationship between the input variable x and the output variable y is modeled as an nth-degree polynomial in x.
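To see what the polynomial expansion actually does, here is a tiny sketch of scikit-learn's `PolynomialFeatures` on one made-up sample with two features. A degree-2 expansion of [x1, x2] produces the bias term plus all squared and interaction terms:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features [x1, x2] = [2, 3].
x = np.array([[2, 3]])
pr = PolynomialFeatures(degree=2)
# Columns: 1, x1, x2, x1^2, x1*x2, x2^2
print(pr.fit_transform(x))  # [[1. 2. 3. 4. 6. 9.]]
```

The linear regression is then fit on these expanded columns, which is why the model can capture curved trends.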

# 4.1 Find Optimum Value for degree of polynomial:

```python
x = df[['age', 'bmi', 'smoker_yes']]
y = df[['charges']]

lr = linear_model.LinearRegression()
scores = []
degree = list(range(2, 15))

# Fit a model for each degree and record its R2-score.
for n in degree:
    pr = PolynomialFeatures(degree=n)
    x_pr = pr.fit_transform(x)
    lr.fit(x_pr, y)
    scores.append(lr.score(x_pr, y))

degree_score_df = pd.DataFrame(list(zip(degree, scores)), columns=['Degree', 'R2-score'])
degree_score_df.set_index('Degree', inplace=True)
degree_score_df.plot()
plt.xlabel('Degree')
plt.ylabel('R2-score')
plt.title('Degree Vs. R2-score')
```

Here we can see that the R2-score is highest, i.e. **0.86**, when **degree=13**.

**4.2 Use value of degree as 13 and train model:**

```python
x = df[['age', 'bmi', 'smoker_yes']]
y = df[['charges']]

poly = PolynomialFeatures(degree=13)
x_poly = poly.fit_transform(x)

lr = linear_model.LinearRegression()
lr_model = lr.fit(x_poly, y)
```

**Let’s visualize output of Polynomial Regression Model:**

# 4.3 Normalize input data and then train model:

```python
x = df[['age', 'bmi', 'smoker_yes']]
y = df[['charges']]

# Standardize each feature to zero mean and unit variance.
normalized_x = preprocessing.StandardScaler().fit(x).transform(x)

poly = PolynomialFeatures(degree=13)
x_poly = poly.fit_transform(normalized_x)

lr = linear_model.LinearRegression()
lr_model = lr.fit(x_poly, y)
```

**Let’s visualize output of Polynomial Regression Model using Normalized-X:**

**5. Discussion:**

Compare results of all 4 models:

Here,

**MAE = Mean Absolute Error**

- Used to check how wrong our model is.
- The mean of the absolute differences between actual and predicted values.
- The higher the MAE, the worse the model.
- The goal is to minimize this value.

**MSE = Mean Squared Error**

- Used to check how wrong our model is.
- The mean of the squared differences between actual and predicted values.
- The higher the MSE, the worse the model.
- The goal is to minimize this value.

**R2-score = Accuracy Metric**

- Used to check how accurate our model is.
- The higher the R2-score, the better the model.
- The goal is to maximize this value. The best possible value is 1; a value of 0 means the model does no better than always predicting the mean.
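All three metrics are available in `sklearn.metrics`. A minimal sketch with hypothetical actual vs. predicted charges (the values below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs. predicted charges for a few samples.
y_true = np.array([8240.59, 16884.92, 4449.46])
y_pred = np.array([11050.60, 15500.00, 5200.00])

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("R2 :", r2_score(y_true, y_pred))
```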

**Visualize Comparison:**

# 6. Conclusion:

Comparing the above models, we conclude that **Polynomial Regression with Normalized-X** is the best model, giving **87%** accuracy.

If we look at the error plots:

- The **Simple Linear Regression** error ranges from **-10,000 to 50,000**.
- The **Multiple Linear Regression** error ranges from -10,000 to 30,000.
- There is not much difference between the error ranges of **Polynomial Regression** and **Polynomial Regression with Normalized-X**. Both range from -10,000 to 25,000, but in **Polynomial Regression with Normalized-X** the error spread is smaller.

I hope this helps you understand the basics of Linear Regression.

You can download full source code from my GitHub Repository.