Pranjal Pandey
Published in Analytics Vidhya · Jul 22, 2020 · 22 min read

How to proceed from Simple to Multiple and Polynomial Regression in R

Introduction -

This article is a continuation of my first article, in which I showed the complete procedure for performing Simple Linear Regression in detail. Here I take a slightly more complex data set (the Advertising data set) and show how a Multiple Linear Regression model is built and how, using the information obtained from its diagnostic plots, we proceed towards Orthogonal Polynomial Regression and obtain a better model for this data set.

This article consists of the following sections -

  1. Loading Required Libraries
  2. Loading Data set
  3. Exploring Data set
  4. Splitting Data set
  5. Fitting Simple Linear Regression Model
  6. Fitting Multiple Linear Regression Model with Diagnostic Plots and Statistical Tests
  7. Fitting Orthogonal Polynomial Linear Regression Model with Diagnostic Plots and Statistical Tests
  8. Making Predictions
  9. Repeated 10-fold Cross Validation
  10. Conclusion
  11. Task for you (If interested)
  12. Information about Next Article

I am going to use a Kaggle online R notebook for the analysis. You may use any software, such as RStudio or the base R console, to work offline.

1. Loading Required Libraries

It is not mandatory to load all libraries at the beginning, but I am doing so for simplicity.

# Loading required libraries
library(tidyverse) # Pipe operator (%>%) and other commands
library(caret) # Random split of data/cross validation
library(olsrr) # Heteroscedasticity Testing (ols_test_score)
library(car) # Multicollinearity detection (vif)
library(broom) # Diagnostic Metric Table (augment)

2. Loading Data set

First of all, load the data set into your R session. Link to download the data set.

# Loading Data set
data = read.csv("../input/advertising-dataset/advertising.csv" , header = T)

Advertising data set has been successfully loaded in the R-object “data”.

3. Exploring Data set

Let's build some understanding of the given data set as follows -

# Inspection of the first rows of data
head(data)
Output — 1

The above output shows the first rows of the given data set (head() prints six by default). At this stage, just look at the data and note that there are four variables (TV, Radio, Newspaper, Sales) in the data set and all are numeric.

# Inspection of the last rows of data
tail(data)
Output — 2

From the above output, it is clear that there are 200 rows in the data set. One more important point: there is no row containing something like column "Totals". In your own data set there may be a final row holding the totals of each column; such rows are not useful in further analysis or model preparation, so if such a row exists, just remove it from the data, as sketched below.
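For illustration only (this Advertising data has no such row), here is a tiny hypothetical sketch of dropping a final "Totals" row by position; the object toy exists only for this example.

# Toy illustration (hypothetical data frame, not the Advertising data)
toy <- data.frame(TV = c(10 , 20 , 30) , Sales = c(1 , 2 , 3))
toy <- rbind(toy , colSums(toy))   # a "Totals" row like the one described above
toy <- toy[-nrow(toy) , ]          # drop that last row by position
toy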

# Getting Structure of whole data set
str(data)
Output — 3

The above output gives the following information -

  1. There are 200 rows and 4 variables.
  2. The variables are: TV, Radio, Newspaper, Sales.
  3. All are numeric variables.
# Checking Outliers
boxplot(data)
Output — 4

The above plot shows that two outliers are present in the variable “Newspaper”. Just remove these outliers by the following command -

# Removing Outliers
data <- data[-which(data$Newspaper %in% boxplot.stats(data$Newspaper)$out), ]

Again, See the boxplot -

# Again Checking Outliers
boxplot(data)
Output — 5

Now, Outliers have been removed.

# Checking Missing Values
table(is.na(data))
Output — 6

The above output shows that there is no missing value in the given data set.

Deciding the Target and Predictors: which variable to take as the target and which as predictors is usually known in advance, since it depends on what you want to predict. Here I take Sales as the target and the remaining variables as predictors.

We have four numeric variables. Just take a look at the scatter plot matrix of these variables as follows -

# Creating scatter plot matrix 
pairs(data , upper.panel = NULL)
Output — 7

This output shows that -

  1. No or very low linear relationship between TV and Radio variable.
  2. Low linear relationship between TV and Newspaper variable.
  3. Moderate linear relationship between Radio and Newspaper variable.
  4. Clear relationship of Sales with TV and with Radio; the relationship between Newspaper and Sales appears weaker (see the separate plots below).
  5. A small curvilinear relationship is also present between TV and Sales as well as Radio and Sales.
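To put numbers behind these visual impressions, you can also look at the correlation matrix (an optional quick check on the same data object):

# Pairwise correlations between the four numeric variables
round(cor(data) , 2)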

Let's get a closer view, to be more confident about the existing relationships, by plotting separate scatter plots -

# Scatter Plot between TV and Sales
plot(data$TV , data$Sales)
Output — 8

Notice, there is a small curvilinear relationship between TV and Sales.

# Scatter Plot between Radio and Sales
plot(data$Radio , data$Sales)
Output — 9

Notice, there is a curvilinear relationship between Radio and Sales.

# Scatter Plot between Newspaper and Sales
plot(data$Newspaper , data$Sales)
Output — 10

Low linear relationship between Newspaper and Sales variable.

# Scatter Plot between TV and Radio
plot(data$TV , data$Radio)
Output — 11

No linear relationship between TV and Radio variable.

# Scatter Plot between Newspaper and TV
plot(data$TV , data$Newspaper)
Output — 12

No linear relationship between TV and Newspaper variable.

# Scatter Plot between Newspaper and Radio
plot(data$Radio , data$Newspaper)
Output — 13

Moderate linear relationship between Radio and Newspaper variable.

Keep these points in mind; they will help you prepare a better model.

4. Splitting Data Set

Now I am going to split the whole data set into two parts: a train data set and a test data set. We do this because we first train/fit the model on the train data set and then use the test data set to check the performance of the obtained model on data that was not used during training. Splitting is done with the following code -

# Randomly Split the data into training and test set
set.seed(123)
training.samples <- data$Sales %>%
  createDataPartition(p = 0.75, list = FALSE)
train.data <- data[training.samples, ]
test.data <- data[-training.samples, ]

The train and test data sets have been stored in the R objects train.data and test.data respectively. Note that:

  1. We will use the data available in train.data for fitting/training the model.
  2. We will use the data available in test.data to check the performance of model.

5. Fitting Simple Linear Regression

Since we have only three predictors, we may fit three separate simple linear regression models, one for each predictor, i.e.,

  1. Sales ~ TV
  2. Sales ~ Radio
  3. Sales ~ Newspaper

Fit these three models and find the percentage of variance explained by each. This is measured by Adjusted R², obtained in R using the summary() function.

# Fitting Sales ~ TV
sm1 <- lm(Sales ~ TV , data = train.data)

# Take a look on summary of the model
summary(sm1)
Output — 14

From the above output, you must notice that -

  1. Created model is statistically significant since p-value <<< 0.05 (see in the last line of output)
  2. From the coefficients section, it is clear that both coefficients (slope and intercept) are statistically significant since p-value <<< 0.05
  3. This model with TV as predictor explains approximately 81% variability of target (Sales).
  4. Residual standard error for the model is 2.29
# Fitting Sales ~ Radio
sm2 <- lm(Sales ~ Radio , data = train.data)

# Take a look on summary of the model
summary(sm2)
Output — 15

From the above output, you must notice that -

  1. Created model is statistically significant since p-value << 0.05 (see in the last line of output)
  2. From the coefficients section, it is clear that both coefficients (slope and intercept) are statistically significant since p-value << 0.05
  3. This model with Radio as predictor explains approximately 13% of the variability of the target (Sales).
  4. Residual standard error for the model is 4.917
# Fitting Sales ~ Newspaper
sm3 <- lm(Sales ~ Newspaper , data = train.data)

# Take a look on summary of the model
summary(sm3)
Output — 16

From the above output, you must notice that -

  1. Created model is statistically significant since p-value < 0.05 (see in the last line of output)
  2. From the coefficients section, it is clear that both coefficients (slope and intercept) are statistically significant since p-value < 0.05
  3. This model with Newspaper as predictor explains approximately 2% of the variability of the target (Sales).
  4. Residual standard error for the model is 5.21

So far, we have found that the simple linear regression model with TV as predictor explains the most variability of the target (Sales).

Now draw the scatter plot between TV and Sales and add the simple linear regression line to the plot as follows -

# Scatter plot with Simple Linear Regression Line
plot(train.data$TV , train.data$Sales)

# Adding Regression Line
abline(lm(train.data$Sales ~ train.data$TV) , col = "blue")
Output — 17

One problem occurs here: the above plot shows that it is not feasible to predict Sales on the basis of a single predictor alone, because a lot of variability in Sales remains around the line. Also, if we use a single predictor, we completely neglect the effect of the other two predictors on Sales, which may not be realistic. So, why not extend this model?

6. Fitting Multiple Linear Regression with Diagnostic Plot

There are many methods for extending the above simple linear regression model, such as forward selection, backward selection, mixed selection and more. In this article, I will go with forward selection to explore some more concepts; a quick automated cross-check is sketched just below.
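For that automated cross-check, base R's step() function can run forward selection by AIC. This is only a sketch (the names null_model and full_model are introduced here for illustration), and the manual path followed below does not depend on it.

# Automated forward selection, starting from the intercept-only model
null_model <- lm(Sales ~ 1 , data = train.data)
full_model <- lm(Sales ~ TV + Radio + Newspaper , data = train.data)
step(null_model , scope = formula(full_model) , direction = "forward")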

Extending Simple Linear Regression Model using Forward Selection Method -

In this context, what will I do?

I will simply add the predictor Radio to the simple linear regression model Sales ~ TV (why this model? because it already explains a large part of the variability of Sales).

Why include Radio at this stage?

Because, after TV (81%), it explains more variability of Sales (13%) than Newspaper (2%). (The results from the simple linear regressions have been used here.)

So, Fit a Multiple Linear Regression model with two predictors TV and Radio and obtain summary of the model as follows -

# Fitting MLR model with predictors TV and Radio 
mm1 <- lm(Sales ~ TV + Radio , data = train.data)

# Take a look on summary of the model
summary(mm1)
Output — 18

Well, From the above output, notice that -

  1. Created model is statistically significant since p-value <<< 0.05 (see in the last line of output)
  2. From the coefficients section, it is clear that both coefficients (slopes and intercept) are statistically significant since p-value <<< 0.05
  3. This model with TV and Radio as predictors explains approximately 89% of the variability of the target (Sales), which is an improvement over the model with TV alone as predictor.
  4. Residual standard error for the model is 1.715

But we must test whether this improvement is statistically significant.

That is, we want to test the null hypothesis H0: the improvement is not statistically significant (the coefficient of the added predictor is zero) against the alternative hypothesis H1: the improvement is statistically significant.

For this test we use ANOVA (Analysis of Variance), which here amounts to a partial F-test comparing the two nested models. The code is as follows -

# Performing ANOVA to test the above stated null hypothesis
anova(sm1 , mm1)
Output — 19

In the above output, notice the value in the last column of the second row. This value (2.051808e-20) is the p-value for testing the null hypothesis. Since it is far below 0.05, we have sufficient evidence from the data to reject the null hypothesis in favour of the alternative.

That’s why the improvement in Adjusted R-squared is statistically significant.

Hence, Adopt the model Sales ~ 0.05462 TV + 0.10239 Radio at this stage.

Why not extend this model further ?

i.e., Include the third predictor Newspaper also in your multiple linear regression model and see what happens.

# Extending further the MLR including the predictor Newspaper
mm2 <- lm(Sales ~ TV + Radio + Newspaper , data = train.data)

# Take a look on summary of the model
summary(mm2)
Output — 20

From the above output and using the information from previously fitted model, Notice that -

  1. From the coefficients section of the above output, It is clear that Newspaper predictor is not statistically significant for the model due to p-value (0.69) > 0.05
  2. Adjusted R-squared has been reduced from 89.41% to 89.35%.
  3. Residual standard error has been increased from 1.715 to 1.72
  4. Although, the created model is statistically significant since p-value <<< 0.05 (see in the last line of output)

So, we have sufficient evidence from the data not to include Newspaper as a predictor in the model.

Hence, remove it from the model, and we are back to the previously fitted multiple linear regression model already stored in the R object mm1 -

i.e., Sales ~ 0.05462 TV + 0.10239 Radio

We will consider this model for further discussion.

Diagnostic Plots -

To check whether all the assumptions of multiple linear regression are fulfilled, we use different diagnostic plots and tests.

Checking Linearity Assumption -

The residual plot is used to check the first assumption, i.e., linearity between the target and the predictors (TV and Radio jointly), as follows -

# Residual Plot
plot(mm1 , 1)
Output — 21

Notice from the above plot -

  1. Red line is approximately horizontal and linear which indicates that Linearity assumption holds well.
  2. The residuals fluctuate in a random manner inside a band between roughly -4 and +4, which indicates that the fitted model is good for prediction to some extent. Why only to some extent?
  3. Because points 131 and 151 may be potential outliers, since they lie far from the other points. However, large residuals (as for points 131 and 151 here) could also indicate that the variance is not constant (heteroscedasticity) or that the true relationship between target and predictors is nonlinear. These possibilities should be investigated before the points are considered outliers.

Checking Homoscedasticity Assumption -

I am going to use the Score Test, but you may apply other tests as well, e.g., the Breusch-Pagan Test or the Bartlett Test (see the sketch after the test output below).

# Score Test for Heteroscedasticity
ols_test_score(mm1)
Output — 22

From the last line of the above output, it is clear that the p-value is greater than the significance level 0.05. Hence, we fail to reject the null hypothesis and conclude that the variance is homogeneous, i.e., homoscedasticity holds.
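If you would rather use the Breusch-Pagan test mentioned above, olsrr provides it too; the reading is the same (a p-value above 0.05 supports constant variance). A minimal sketch:

# Breusch Pagan Test for heteroscedasticity on the same model
ols_test_breusch_pagan(mm1)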

Checking Auto-correlation Assumption -

The Durbin-Watson Test is used to detect autocorrelation in the errors as follows -

# Checking effect of Auto-correlation
durbinWatsonTest(mm1)
Output — 23

From the above output, it is clear that the p-value (0.166) > 0.05. Hence, we fail to reject the null hypothesis and conclude that there is no autocorrelation between errors, i.e., the errors are uncorrelated.

Checking Multicollinearity -

Generally, the Variance Inflation Factor (VIF) is used to detect multicollinearity. As a rule of thumb, a VIF greater than 5 or 10 indicates multicollinearity.

# Detecting Multicollinearity
vif(mm1)
Output — 24

Note that the variance inflation factors for both predictors are less than 5 (rule of thumb); hence there is no multicollinearity between the predictors.

Checking Normality Assumption -

The Shapiro-Wilk Test is generally used to check the normality assumption for the errors.

# Checking Normality of Errors
shapiro.test(mm1$residuals)
Output — 25

Normality does not hold since p-value < 0.05

Just plot a histogram of the residuals to get an idea of the shape of their distribution -

# Plotting Histogram for Residuals
hist(mm1$residuals)
Output — 26

We see that there is some problem with the left tail. It may be due to data points 131 and 151, as pointed out earlier.

So far we have established that the variance is constant; one remaining possibility is to check whether there is a nonlinear relationship between target and predictors before declaring data points 131 and 151 outliers.

Also, from the scatter plots plotted earlier between the target and the different predictors, we noticed some type of curvilinear relationship. There are many algorithms that deal with curvilinear relationships, but I am going to take a very basic one: Polynomial Regression. (Generally, we use orthogonal polynomials to avoid the multicollinearity problem, as illustrated just below.)
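Here is a quick sketch of why the orthogonal version is preferred, run on the training data already in the session: raw powers of a predictor are almost perfectly correlated with each other, while the columns returned by poly() are uncorrelated by construction.

# Correlation between TV and TV^2 (raw powers) is very high
cor(train.data$TV , train.data$TV^2)

# Columns of the orthogonal polynomial basis are uncorrelated
round(cor(poly(train.data$TV , 2)) , 10)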

So, all these facts point in one direction: why not use Orthogonal Polynomial Regression?

And now we move towards fitting of Orthogonal Polynomial Regression between Sales and predictors TV and Radio.

Why these two predictors only ?

Because we saw that the Newspaper variable was not statistically significant when we fitted the multiple linear regression.

7. Fitting Orthogonal Polynomial Regression with Diagnostic Plot

Now I am going to fit a second-order orthogonal polynomial model in two variables. The R code for fitting it with TV and Radio is as follows -

# Fitting second order orthogonal polynomial model in two variables to avoid multicollinearity
pm1 <- lm(Sales ~ poly(TV , 2) + poly(Radio , 2) + TV:Radio , data = train.data)

# Take a look on summary of the model
summary(pm1)
Output — 27

From the above output, Notice that -

  1. Created model is statistically significant since p-value <<< 0.05 (see in the last line of output)
  2. From the coefficients section, it is clear that all coefficients are statistically significant since p-value <<< 0.05
  3. This second-order orthogonal polynomial model explains 92.58% of the variability of the target (Sales), which is an improvement over the multiple linear regression model with TV and Radio as predictors.
  4. Residual standard error for the model is 1.436

Checking Whether this improvement in Adjusted R-squared is statistically significant -

i.e., we want to test the null hypothesis — H0 : The improvement in Adjusted R-squared is not statistically significant.

Vs the alternative hypothesis — H1 : The improvement in Adjusted R-squared is statistically significant.

# Performing ANOVA to test the above stated null hypothesis
anova(mm1 , pm1)
Output — 28

In the above output, notice the value in the last column of the second row. This value (9.441734e-12) is the p-value for testing the null hypothesis. Since it is far below 0.05, we have sufficient evidence from the data to reject the null hypothesis in favour of the alternative.

That’s why the improvement in Adjusted R-squared is statistically significant.

Hence, Adopt the second order orthogonal polynomial model at this stage.

Why not use a third-order (orthogonal) polynomial regression in two variables?

We have noticed that Adjusted R-squared increased considerably, from 89% to 92.58%. So let's move towards fitting a third-order orthogonal polynomial regression in two variables and see what happens.

# Fitting third order (orthogonal) polynomial model in two variables to avoid multicollinearity
pm2 <- lm(Sales ~ poly(TV , 3) + poly(Radio , 3) + TV:Radio , data = train.data)

# Take a look on summary of the model
summary(pm2)
Output — 29

It is clear from the coefficients section of the above output that the third-order term of the TV predictor is not statistically significant (p-value > 0.05). Hence, don't include this term in the model.

Hence, Fit the model as follows -

# Fitting third order (orthogonal) polynomial model in two variables to avoid multicolinearity but after removing third order of TV predictor
pm3 <- lm(Sales ~ poly(TV , 2) + poly(Radio , 3) + TV:Radio , data = train.data)

# Take a look on summary of the model
summary(pm3)
Output — 30

From the above output, and using the information from the second-order orthogonal polynomial model stored in the R object pm1, notice that -

  1. Created model is statistically significant since p-value <<< 0.05 (see in the last line of output)
  2. From the coefficients section, it is clear that all coefficients are statistically significant since p-value <<< 0.05
  3. This third-order orthogonal polynomial model in two variables, with the third-order TV term removed, explains 92.93% of the variability of the target (Sales), which is an improvement over the second-order orthogonal polynomial model.
  4. Residual standard error for the model is 1.401

Again, Checking Whether this improvement in Adjusted R-squared is statistically significant -

i.e., we want to test the null hypothesis — H0 : The improvement in Adjusted R-squared is not statistically significant.

Vs the alternative hypothesis — H1 : The improvement in Adjusted R-squared is statistically significant.

# Performing ANOVA to test the above stated null hypothesis
anova(pm1 , pm3)
Output — 31

In the above output, notice the value in the last column of the second row. This value (0.004968654) is the p-value for testing the null hypothesis. Since it is below 0.05, we have sufficient evidence from the data to reject the null hypothesis in favour of the alternative.

That’s why the improvement in Adjusted R-squared is statistically significant.

Hence, adopt the third-order orthogonal polynomial model without the third-order TV term at this stage.

We will consider this model for further discussion.

Note: If you want to increase the order of the predictor Radio from 3 to 4, you may do so, but you will see that the coefficient of the fourth-order term of Radio is not statistically significant (try it yourself; a sketch is given below). Hence, you have to remove it and go with the second order of TV and the third order of Radio only. Beyond that you cannot usefully increase the order further.
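A minimal sketch of that experiment (pm_try is a name introduced only for this check):

# Trying a fourth-order term for Radio; its coefficient should turn out non-significant
pm_try <- lm(Sales ~ poly(TV , 2) + poly(Radio , 4) + TV:Radio , data = train.data)
summary(pm_try)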

Diagnostic Plots -

Now, again check whether all the assumptions of linear regression are satisfied.

Checking Linearity Assumption -

# Residual Plot
plot(pm3 , 1)
Output — 32

Notice from the above plot -

  1. Red line is approximately horizontal and linear at Residuals = 0 which indicates that Linearity assumption holds well.
  2. The residuals fluctuate in a random manner inside a band between roughly -4 and +4, which indicates that the fitted model is good for prediction to some extent. Why only to some extent?
  3. Because we again see that point 131 may be a potential outlier, since it lies very far from the other points (note: point 151 now sits with the other points).

Checking Homoscedasticity Assumption -

# Score Test for Heteroscedasticity
ols_test_score(pm3)
Output — 33

Errors have constant variance, p-value > 0.05

Checking Auto-correlation Assumption -

# Checking effect of Auto-correlation
durbinWatsonTest(pm3)
Output — 34

Errors are uncorrelated.

Checking Normality Assumption -

# Checking Normality of Errors
shapiro.test(pm3$residuals)
Output — 35

Errors are normally distributed.

Checking Multicollinearity -

# Detecting Multicollinearity
vif(pm3)
Output — 36

Note that all values in the last column of the above output are less than 5 (rule of thumb); hence there is no multicollinearity.

Removing Observation number 131 from train data set -

Now the only remaining option is to delete observation number 131 from the train data set, since it has a large residual (see the residual plot for pm3), and check whether Adjusted R-squared improves noticeably. If yes, keep it removed; otherwise, include observation number 131 as well.

# Creating Diagnostic metrics Table for model pm3
dm = augment(pm3)

# See the Table
head(dm)
Output — 37

Notice from the above output that -

  1. This table contains information on different diagnostic metrics such as the residuals (column 9), Cook's distance (column 12) and the studentized residuals (column 13), among many others.
  2. I will use the last column of this table to identify and delete observation number 131.
# Checking minimum value of last column (Studentized Residual)
min(dm$.std.resid)
Output — 38

This studentized residual is less than -3 (rule of thumb); hence it indicates an outlier. So let's locate and remove that observation as follows -

# Finding the row index (in the train data) of the observation with this studentized residual
which.min(dm$.std.resid)
Output — 39

The above output indicates that the outlier is at index 98 in train data set. Just check the complete information about that row as follows -

# Info. about 98th row of train data set
train.data[98,]
Output — 40

This is the row we were looking for. We have to remove it from our train data set.

# Removing the outlier row (the one with Sales = 1.6)
train.data1 = train.data %>% filter(Sales != 1.6)

# Checking number of rows in old train data set
nrow(train.data)

# Checking number of rows in new train data set (train.data1)
nrow(train.data1)
Output — 41

One observation has been successfully removed.

Now, again fit the same polynomial model as is stored in pm3 but using the data stored in R-object train.data1 -

# Fitting third order (orthogonal) polynomial model in two variables to avoid multicolinearity but after removing third order of TV predictor using train.data1
pm4 <- lm(Sales ~ poly(TV , 2) + poly(Radio , 3) + TV:Radio , data = train.data1)

# Take a look on summary of the model
summary(pm4)
Output — 42

From the above output, and using the information from the previous polynomial model stored in the R object pm3, notice that -

  1. Created model pm4 is statistically significant since p-value <<< 0.05 (see in the last line of output)
  2. From the coefficients section, it is clear that all coefficients are statistically significant since p-value <<< 0.05
  3. This polynomial model, fitted after removing the outlier, explains 93.21% of the variability of the target (Sales), which is an improvement over the polynomial model stored in the R object pm3.
  4. Residual standard error for the model is 1.347

Note: This time we cannot perform ANOVA to test whether the improvement in Adjusted R-squared is significant, because model pm3 is based on 150 observations and pm4 on only 149. In such a situation, since Adjusted R-squared has increased and the residual standard error has decreased, we may adopt the model stored in pm4; a direct comparison is sketched below.
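For the direct comparison mentioned in the note, you can pull the two quantities straight from the fitted objects (a small sketch using base R accessors):

# Adjusted R-squared before and after removing the outlier
summary(pm3)$adj.r.squared
summary(pm4)$adj.r.squared

# Residual standard error of the two fits
sigma(pm3)
sigma(pm4)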

Now, check all the other assumptions quickly -

# Linearity Assumption
plot(pm4 ,1)

# Homoscedasticity Assumption
ols_test_score(pm4)

# Autocorrelation Assumption
durbinWatsonTest(pm4)

# Normality Assumption
shapiro.test(pm4$residuals)

# Multicolinearity Assumption
vif(pm4)
Output — 43

All assumptions have been satisfied now.

Checking outliers again by creating Diagnostic metric table for model pm4 -

# Creating Diagnostic metric table for model pm4
dm1 = augment(pm4)

# Checking minimum and maximum value of studentized residual
min(dm1$.std.resid)
max(dm1$.std.resid)
Output — 44

The above output shows that studentized residuals are not greater than 3 (rule of thumb) in absolute value. Hence, there are no potential outliers.

Finally, Adopt this model (Stored in R-object pm4) for making predictions.

8. Making Predictions -

Now it's time to make predictions on the test data set (unseen data) and check the performance of the model as follows -

# Making Predictions
prediction = pm4 %>% predict(test.data)

# Checking performance by calculating R2 , RMSE and MAE
data.frame(R2 = R2(prediction, test.data$Sales),
           RMSE = RMSE(prediction, test.data$Sales),
           MAE = MAE(prediction, test.data$Sales))
Output — 45

We obtain R² = 0.9526385, which indicates a very good fit.

However, this result is based on only one test data set, so we cannot be sure that the model will perform equally well on all unseen data. To be more confident, we will use repeated K-fold cross validation to test the performance of the model on different test sets.

This will be done as follows -

9. Repeated 10-fold Cross Validation

Before performing cross validation, remove the observation identified as an outlier, i.e., the row with Sales = 1.6, from the full data set -

# Removing outlier, i.e., the row that contains Sales = 1.6
data <- data %>% filter(Sales != 1.6)

Now, Perform Cross Validation as follows -

# Define training control
set.seed(123)
train.control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)

# Train the model
model_cv <- train(Sales ~ poly(TV , 2) + poly(Radio , 3) + TV:Radio ,
                  data = data, method = "lm",
                  trControl = train.control)

# Summarize the results
print(model_cv)

Great Result !

On average, this orthogonal polynomial regression model (the same formula as stored in the R object pm4) captures 93.69% of the variability in the target (Sales). That is, 93.69% of the variability in Sales is explained by the predictors "TV" and "Radio"; the rest is due to random variation or other, unmeasured causes.
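If you want to pull the averaged resampling metrics out programmatically instead of reading the printed summary, the caret train object stores them in its results element (a small sketch, assuming model_cv from the chunk above):

# Averaged RMSE, Rsquared and MAE over the 10 folds x 3 repeats
model_cv$results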

10. Conclusion -

Finally, let me conclude. I started with fitting a simple linear regression, then showed what problem occurs with that model and how, to get rid of it, we proceed to a multiple linear regression model. After that, I showed how the diagnostics suggest moving towards orthogonal polynomial regression. I have also included different statistical tests, diagnostic plots and diagnostic metrics used in preparing a good basic model for predicting Sales on the basis of the given advertising budgets for TV, Radio and Newspaper.

Further, I want to mention that this is not the end. It is just a start, which is why I said earlier that this is a good basic model for the given data set. There are several more advanced approaches, such as fitting splines (a parametric approach) and many non-parametric algorithms like decision trees, random forests and support vector machines, which can handle curvilinear relationships and may give more accurate results. So do learn and try these advanced algorithms to improve accuracy and deepen your knowledge in this field.

Just do it once yourself for better understanding !

11. Task for you (If interested) -

If you want more practice, I recommend working on this Advertising Data Set. It requires some additional analysis related to residual plots. Before working on it, read up on the concepts of residual analysis from here (pages 18 to 20), then apply simple, multiple and polynomial regression and analyze the diagnostic plots.

12. Information about Next Article -

Save the completely outlier-free data set obtained just before performing cross validation, and also save the split data stored in train.data1 and test.data, before reading my next article, Multicolinearity / Ridge / Lasso / Elastic-Net Regression using R. I will use the same outlier-free data set in the next article to explore more concepts.

# Saving outlier free data set stored in R-object data
write.csv(data , "Outlier Free Data Set.csv" , row.names = F)

# Saving outlier free train data set stored in R-object train.data1
write.csv(train.data1 , "Outlier Free Train Data Set.csv" , row.names = F)

# Saving test data set stored in R-object test.data
write.csv(test.data , "Test Data Set.csv" , row.names = F)

Thanks for Reading my article.

My kaggle Notebook — click

If you find any mistake, just let me know.
