Multiple Regression — The ‘R’ Way!

Vipin Verma
Published in Human Systems Data
Mar 26, 2017

The purpose of multiple regression is to learn about the relationships between several predictors and an outcome variable. The computational problem is to fit a model (a regression equation) to the training data that we have. Let us go through the process using an example data set, the Combined Cycle Power Plant data from the UCI Machine Learning Repository.

The data contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006–2011), when the power plant was set to work with full load. The features are the hourly average ambient variables Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), which are used to predict the net hourly electrical energy output (PE) of the plant. A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the vacuum affects the steam turbine, the other three ambient variables affect the GT performance.

Let us consider the following model where output PE is linearly related to the other four predictors:

fit1 <- lm(PE ~ AT + V + AP + RH, data=powerPlantdata)

Fitting this model in R and calling summary(fit1) gives the following output:

Summary of Regression in R

The R-squared value of 0.9287 indicates that the regression equation explains 92.87% of the variance in the energy output PE. So, the fitted model is as follows:

PE = 454 - 1.97 AT - 0.23 V + 0.06 AP - 0.16 RH
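As a rough sketch of how the numbers above can be pulled out of a fitted model, the snippet below fits the same formula to synthetic data. The data frame here is simulated from the rounded coefficients quoted above with an assumed noise level and assumed predictor ranges. It is not the real UCI data, so its output will not match the summary in the post.

```r
# Synthetic stand-in for the power plant data (NOT the real UCI measurements).
# Predictor ranges and the noise level are assumptions made for illustration.
set.seed(42)
n <- 500
powerPlantdata <- data.frame(
  AT = runif(n, 2, 37),      # ambient temperature
  V  = runif(n, 25, 81),     # exhaust vacuum
  AP = runif(n, 993, 1033),  # ambient pressure
  RH = runif(n, 26, 100)     # relative humidity
)
# Simulate PE from the rounded equation in the post plus random noise.
powerPlantdata$PE <- with(
  powerPlantdata,
  454 - 1.97 * AT - 0.23 * V + 0.06 * AP - 0.16 * RH + rnorm(n, sd = 4.5)
)

# Same model formula as in the post.
fit1 <- lm(PE ~ AT + V + AP + RH, data = powerPlantdata)
fit1_summary <- summary(fit1)

# Named coefficient vector: (Intercept), AT, V, AP, RH.
coefs <- coef(fit1)
# Proportion of the variance in PE explained by the model.
r_squared <- fit1_summary$r.squared

print(round(coefs, 2))
print(round(r_squared, 4))

# Predict PE for one hypothetical set of ambient conditions.
new_obs <- data.frame(AT = 20, V = 50, AP = 1013, RH = 70)
predicted_pe <- predict(fit1, newdata = new_obs)
print(predicted_pe)
```

The prediction is just the fitted equation evaluated at the new values, which is why predict() and a manual sum of coefficient times value agree.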

Let us look at diagnostic plots for this model, specifically the partial-residual plots.

Partial-residual plots generated using crPlots from the ‘car’ package

The plots above show the partial residuals for each of the four predictors, plotted against that predictor. The dotted red line is the straight line fitted by the model (the component line), and the green line is a smooth curve through the partial residuals. A substantial difference between the two lines indicates that the predictor does not have a linear relationship with the dependent variable. Looking at the plots, all four predictors appear to be linearly related to the output. If that were not the case, we would have to transform the predictor with a different function, e.g. sqrt, log, square or cube. For example, suppose the green line and the dotted red line for AP did not overlap and were quite different from each other. That would mean ambient pressure is not linearly related to the energy output, and we would have to try other relations such as AP squared, AP cubed or log(AP) and see whether the model fits better with the new relation.
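To make the idea of a non-linear predictor concrete, here is a minimal sketch on a toy data set. The variables x1, x2 and y and the helper curvature() are hypothetical and invented for this illustration. The sketch uses base R's residuals(..., type = "partial") rather than the car package, and measures how far a lowess smooth of the partial residuals strays from the straight component line.

```r
# Toy data (an assumption for illustration): y is linear in x1
# but logarithmic in x2.
set.seed(1)
n <- 400
toy <- data.frame(x1 = runif(n, 1, 10), x2 = runif(n, 1, 10))
toy$y <- 3 + 2 * toy$x1 + 5 * log(toy$x2) + rnorm(n, sd = 0.3)

# Fit a model that (wrongly) treats both predictors as linear.
fit_linear <- lm(y ~ x1 + x2, data = toy)

# Partial residuals: model residuals plus each term's centred contribution.
# The result is a matrix with one column per model term.
pres <- residuals(fit_linear, type = "partial")

# Hypothetical helper: largest gap between a lowess smooth of the partial
# residuals and the straight least-squares line through them.
curvature <- function(x, pr) {
  smooth <- lowess(x, pr)                # returned sorted by x
  line <- fitted(lm(pr ~ x))[order(x)]   # reorder to match the smooth
  max(abs(smooth$y - line))
}

curv_x1 <- curvature(toy$x1, pres[, "x1"])
curv_x2 <- curvature(toy$x2, pres[, "x2"])
print(c(x1 = curv_x1, x2 = curv_x2))

# Refit with the transformed predictor and compare the fit.
fit_log <- lm(y ~ x1 + log(x2), data = toy)
r2_linear <- summary(fit_linear)$r.squared
r2_log <- summary(fit_log)$r.squared
print(c(linear = r2_linear, log = r2_log))

# If the car package is installed, car::crPlots(fit_linear) draws the
# corresponding partial-residual plots.
```

In this toy setup the gap for x2 should come out clearly larger than for x1, which is the numeric counterpart of the two lines visibly separating in a partial-residual plot.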

Shown above are the scatter plots of the four predictors against the energy output, created using ggplot2. From these plots, temperature seems to be clearly related to the energy output, but the other three relationships are not as evident. So, let us also consider the following three models, each of which leaves out one predictor, and then run an Analysis of Variance (ANOVA) of each of them against the first model to see whether they differ from it.

fit2 <- lm(PE ~ AT + V + AP, data=powerPlantdata)
fit3 <- lm(PE ~ AT + V + RH, data=powerPlantdata)
fit4 <- lm(PE ~ AT + RH + AP, data=powerPlantdata)
ANOVA of the three models with the first one

The large F-values (1437.8, 43.087 and 1031.8) in all three comparisons, together with extremely small p-values, indicate that each three-predictor model differs significantly from the full model. This suggests that we need all four variables in order to predict the energy output well.
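The comparison described above can be sketched with anova() on nested models. As before, the data frame is simulated (assumed predictor ranges and noise level, not the real UCI data), so the F-values it prints will differ from the ones reported in the post. The list names drop_RH, drop_AP and drop_V are labels chosen here for readability.

```r
# Synthetic stand-in for the power plant data (NOT the real UCI measurements).
set.seed(123)
n <- 2000
powerPlantdata <- data.frame(
  AT = runif(n, 2, 37),
  V  = runif(n, 25, 81),
  AP = runif(n, 993, 1033),
  RH = runif(n, 26, 100)
)
powerPlantdata$PE <- with(
  powerPlantdata,
  454 - 1.97 * AT - 0.23 * V + 0.06 * AP - 0.16 * RH + rnorm(n, sd = 4.5)
)

# Full model and the three reduced models from the post.
fit1 <- lm(PE ~ AT + V + AP + RH, data = powerPlantdata)
fit2 <- lm(PE ~ AT + V + AP, data = powerPlantdata)   # leaves out RH
fit3 <- lm(PE ~ AT + V + RH, data = powerPlantdata)   # leaves out AP
fit4 <- lm(PE ~ AT + RH + AP, data = powerPlantdata)  # leaves out V

# Each anova() call tests a reduced model against the full model.
comparisons <- list(
  drop_RH = anova(fit2, fit1),
  drop_AP = anova(fit3, fit1),
  drop_V  = anova(fit4, fit1)
)

# Row 2 of each table holds the F statistic and p-value of the comparison.
f_values <- sapply(comparisons, function(tab) tab$F[2])
p_values <- sapply(comparisons, function(tab) tab$`Pr(>F)`[2])

print(round(f_values, 1))
print(signif(p_values, 3))
```

Because each reduced model drops exactly one predictor, the F statistic of each comparison equals the square of that predictor's t statistic in summary(fit1), so the ANOVA and the coefficient table tell the same story here.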

All the code used in this blog post is available on GitHub: https://github.com/vipin8169/HSE598/blob/master/multiple_regression.R

References:

How To Find Relationship Between Variables, Multiple Regression. Retrieved from http://www.statsoft.com/Textbook/Multiple-Regression

Multiple (Linear) Regression. Retrieved from http://www.statmethods.net/stats/regression.html

R Regression Diagnostics Part 1. Retrieved from https://www.r-bloggers.com/r-regression-diagnostics-part-1/

Combined Cycle Power Plant Data Set. Retrieved from https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

Understanding Analysis of Variance (ANOVA) and the F-test. Retrieved from http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-analysis-of-variance-anova-and-the-f-test
