Multiple Linear Regression — 8

Vivekanandan Srinivasan · Analytics Vidhya · Dec 12, 2019

If you have not read part 7 of the R data analysis series, kindly go through the following article, where we discussed Simple Linear Regression — 7.

The contents of this article are a gist of a couple of books I was introduced to during my IIM-B days.

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in the article are available as an R Markdown file at the GitHub link.

To see all the articles written by me kindly use the link, Vivek Srinivasan.

The logical extension of simple linear regression is multiple regression, which allows for multiple predictors. The idea is still the same; we are still making predictions or inferences about the response, but we now have more information in the form of multiple predictors. The math requires some matrix algebra, but fortunately the lm function handles it with very little extra effort.

In this case, the relationship between the response and the p predictors (p − 1 predictors plus the intercept) is modeled as

Y = Xβ + ε

where Y is the n×1 response vector,

X is the n×p design matrix (n rows; one column for each of the p − 1 predictors plus a column of ones for the intercept),

β is the p×1 vector of coefficients (one for each predictor and the intercept),

and ε is the n×1 vector of normally distributed errors,

with

ε ~ N(0, σ²I)

This seems more complicated than simple regression, but the algebra actually gets easier. The solution for the coefficients is simply written as

β̂ = (XᵀX)⁻¹XᵀY
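As a quick numerical check of that formula, here is a minimal sketch (using the built-in mtcars data purely as a stand-in example) that computes the coefficients via the normal equations and compares them with lm:

# Minimal sketch: compute beta-hat = (X'X)^(-1) X'Y by hand and compare with lm
# (the built-in mtcars data is used here only as a stand-in example)
X <- model.matrix(~ wt + hp, data = mtcars)  # design matrix with intercept column
y <- mtcars$mpg
betaHat <- solve(t(X) %*% X, t(X) %*% y)     # normal-equations solution
betaHat
coef(lm(mpg ~ wt + hp, data = mtcars))       # lm produces the same coefficients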

To see this in action we use New York City condo evaluations for the fiscal year 2011–2012, obtained through NYC Open Data. NYC Open Data is an initiative by New York City to make the government more transparent and work better. The original data were separated by borough with one file each for Manhattan, Brooklyn, Queens, the Bronx, and Staten Island, and contained extra information we will not be using.

So we combined the five files into one, cleaned up the column names and posted it at housing.csv. To access the data, either download it from that URL and use read.table on the now local file, or read it directly from the URL.

housing <- read.table("../data/housing.csv",
                      sep = ",", header = TRUE,
                      stringsAsFactors = FALSE)

A few reminders about what that code does: sep specifies that commas are used to separate columns; header means the first row contains the column names; and stringsAsFactors leaves character columns as they are and does not convert them to factors, which speeds up loading and makes them easier to work with. Looking at the data, we see that we have a lot of columns and some bad names, so we should rename them.
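Before renaming, a quick structural check (a minimal sketch) helps confirm what was loaded:

# Quick look at dimensions, column names and column types after loading
dim(housing)                # number of rows and columns
names(housing)              # the original (messy) column names
str(housing, vec.len = 1)   # compact view of each column's type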

names(housing) <- c("Neighborhood", "Class", "Units", "YearBuilt",
                    "SqFt", "Income", "IncomePerSqFt", "Expense",
                    "ExpensePerSqFt", "NetIncome", "Value",
                    "ValuePerSqFt", "Boro")
head(housing)

For these data, the response is the value per square foot and the predictors are everything else. However, we ignore the income and expense variables, as they are actually just estimates based on an arcane requirement that condos be compared to rentals for valuation purposes. The first step is to visualize the data in some exploratory data analysis. The natural place to start is with a histogram of ValuePerSqFt.

require(ggplot2)
ggplot(housing, aes(x = ValuePerSqFt)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Value per Square Foot")

The bimodal nature of the histogram means there is something left to be explored. Mapping color to Boro and faceting on Boro reveal that Brooklyn and Queens make up one mode and Manhattan makes up the other, while there is not much data on the Bronx and Staten Island.

require(gridExtra)
p1 <- ggplot(housing, aes(x = ValuePerSqFt, fill = Boro)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Value per Square Foot") +
  theme(axis.text.x = element_text(angle = 90))
p2 <- ggplot(housing, aes(x = ValuePerSqFt, fill = Boro)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Value per Square Foot") +
  facet_wrap(~Boro) +
  theme(axis.text.x = element_text(angle = 90))
grid.arrange(p1, p2, nrow = 1, widths = c(3, 5))

Outlier Analysis

Now let us visualize two other important predictors, SqFt and Units, to see if there are any outlier points that may affect our regression analysis.

p3 <- ggplot(housing, aes(x=SqFt)) + geom_histogram()
p4 <- ggplot(housing, aes(x=Units)) + geom_histogram()
grid.arrange(p3, p4, nrow = 1)

It can be seen from the figure above that there are quite a few buildings with an incredibly large number of units and a very large square footage. These may be potential outliers, so let us confirm this using boxplots.

p5 <- ggplot(housing, aes(y=SqFt,x=1)) + geom_boxplot()
p6 <- ggplot(housing, aes(y=Units,x=1)) + geom_boxplot()
grid.arrange(p5, p6, nrow = 1)

It is quite visible from the boxplots that there are a few points with extreme values. Keeping these points in our regression analysis may skew the regression estimates and result in poor predictive capability. After all, regression is essentially a mean estimate of the data, and that estimate gets pulled around by extreme values. So let us see how many such points we need to remove to make the regression analysis better. From the table below, we can see that there are 6 rows with extreme Units and SqFt values, so we will remove those data points and continue our analysis.

housing[housing$Units > 1000, ]
housing <- housing[housing$Units < 1000, ]
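As a quick sanity check (a minimal sketch), we can confirm that the extreme buildings are gone and see how many rows remain:

# Verify the subset: no buildings above the 1,000-unit cutoff should remain
sum(housing$Units > 1000)   # should now be 0
nrow(housing)               # rows remaining after dropping the extreme buildings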

Now that we have viewed our data a few different ways, it is time to start modeling. We already saw from the histograms that accounting for the different boroughs will be important and subsequent histograms and boxplots indicated that Units and SqFt will be important as well.

Fitting the model uses the formula interface in lm. Now that there are multiple predictors, we separate them on the right side of the formula using plus signs (+).

house1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data=housing)
summary(house1)

The first thing to notice is that in some versions of R there is a message warning us that Boro was converted to a factor. This is because Boro was stored as a character, and for modeling purposes character data must be represented using indicator variables, which is how factors are treated inside modeling functions.
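To see what that indicator-variable coding looks like, a minimal sketch using model.matrix shows the columns R builds behind the scenes:

# Inspect the design matrix lm constructs: Boro becomes a set of 0/1 indicator columns
head(model.matrix(ValuePerSqFt ~ Units + SqFt + Boro, data = housing))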

Model Summary and Interpretation

The summary function prints out information about the model, including how the function was called, quantiles for the residuals, coefficient estimates, standard errors and p-values for each variable, and the degrees of freedom, p-value and F-statistic for the model. There is no coefficient for the Bronx because that is the baseline level of Boro, and all the other Boro coefficients are relative to that baseline.

The coefficients represent the effect of the predictors on the response and the standard errors are the uncertainty in the estimation of the coefficients. The t value (t-statistic) and p-value for the coefficients are numerical measures of statistical significance, though these should be viewed with caution, as most modern data scientists do not like to look at the statistical significance of individual coefficients but rather judge the model as a whole.

The model p-value and F-statistic are measures of its goodness of fit. The residual degrees of freedom are calculated as the number of observations minus the number of estimated coefficients. In this example, there are nrow(housing) − length(coef(house1)) = 2613 degrees of freedom.
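A minimal sketch to verify that calculation directly in R:

# Residual degrees of freedom: observations minus estimated coefficients
nrow(housing) - length(coef(house1))
df.residual(house1)   # the same number, as stored on the fitted model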

We explained the importance of the t-value and F-statistic in detail in our last article.

Simple Linear Regression — 7

In short, the t-value measures the significance of individual variables, while the F-statistic measures the overall significance of the model. Since the p-values for the t-statistics and the F-statistic are less than 5 percent (a general rule of thumb), we can safely conclude that the model we built is a good fit and that the predictors used in the model are significant. In addition, we have a fairly decent R-squared value of 0.6034, which indicates that about 60 percent of the total variance is explained by the model.
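If you prefer to pull those numbers out of the model programmatically rather than reading the printed summary, a minimal sketch:

# Extract goodness-of-fit measures from the model summary
house1Summary <- summary(house1)
house1Summary$r.squared       # R-squared (about 0.60 here)
house1Summary$adj.r.squared   # adjusted R-squared
house1Summary$fstatistic      # F-statistic and its degrees of freedom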

Coefficient Analysis

A quick way to grab the coefficients from a model is to either use the coef function or get them from the model using the $ operator on the model object.

coef(house1)
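The $ route mentioned above looks like this (a minimal sketch):

# Pull the same values directly off the model object
house1$coefficients          # the same values returned by coef(house1)
head(house1$fitted.values)   # fitted values are also stored on the model object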

We prefer visualizations over tables of information, and a great way of visualizing regression results is a coefficient plot. Rather than build it from scratch, we use the convenient coefplot package, where each coefficient is plotted as a point, with a thick line representing the one-standard-error confidence interval and a thin line representing the two-standard-error confidence interval. A vertical line indicates 0.

In general, a good rule of thumb is that if the two standard error confidence interval does not contain 0, it is statistically significant.
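Those two-standard-error intervals can also be computed numerically before we draw the plot; a minimal sketch using confint and the summary's coefficient table:

# Roughly 95% confidence intervals for the coefficients
confint(house1, level = 0.95)

# Or build them by hand from the coefficient table
coefTable <- summary(house1)$coefficients
coefTable[, "Estimate"] - 2 * coefTable[, "Std. Error"]   # lower bound
coefTable[, "Estimate"] + 2 * coefTable[, "Std. Error"]   # upper bound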

library(coefplot)
coefplot(house1)

As expected, being located in Manhattan has the largest effect on value per square foot. Surprisingly, the number of units or square feet in a building has little effect on value. This is a model with purely additive terms. Interactions between variables can be equally powerful.

Models With Interaction Terms

To enter them in a formula, separate the desired variables with a * instead of +. Doing so results in the individual variables plus the interaction term being included in the model. To include just the interaction term, and not the individual variables, use : instead. The results of interacting Units and SqFt are

house2 <- lm(ValuePerSqFt ~ Units * SqFt + Boro, data=housing)
house3 <- lm(ValuePerSqFt ~ Units : SqFt + Boro, data=housing)
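To see how the two formulas expand differently, a minimal sketch comparing the coefficient names of the two models:

# Units * SqFt expands to Units + SqFt + Units:SqFt; Units : SqFt keeps only the interaction
names(coef(house2))
names(coef(house3))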

We have fit numerous models, from which we need to pick the “best” one. Model selection and model diagnostics are a different topic altogether, which will be discussed in our next article.

Model Diagnostics And Model Selection — 9

In the meantime, we will visualize the coefficients from multiple models using the handy multiplot function from the coefplot package. The following plot shows a coefficient plot for models house1, house2 and house3.

multiplot(house1, house2, house3)

Predictions

Regression is often used for prediction, which in R is enabled by the predict function. We have multiple models, and our first step would normally be to find out which model fits the data best before using it for prediction. As discussed earlier, we will cover model diagnostics and model selection in our next article. For now, we will take a step back and predict ValuePerSqFt for the test data available in housingNew.csv using our house1 regression model.

housingNew <- read.table("../data/housingNew.csv", sep = ",",
                         header = TRUE, stringsAsFactors = FALSE)

Making the prediction can be as simple as calling predict, although caution must be used when dealing with factor predictors to ensure that they have the same levels as those used in building the model.

housePredict <- predict(house1, newdata = housingNew,
                        se.fit = TRUE, interval = "prediction", level = 0.95)
# Predictions with upper and lower bounds based on standard errors
head(housePredict$fit)
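Regarding the caution about factor levels, a minimal sketch that checks the new data before predicting:

# Any Boro value in the new data that the model never saw would cause predict() to fail
setdiff(unique(housingNew$Boro), unique(housing$Boro))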

In this article, we discussed how to fit a regression model with multiple predictor variables. We also visualized the response and predictor variables to understand them better, added a few interaction terms during model building and discussed how they affect the model, and finally used our base model to generate predictions for a new dataset.

Although our F-statistic and R-squared suggest that the overall fit of the model is good, that alone is not sufficient to conclude that the model we built is robust and complete. There are several other ways to test model performance and robustness, which will be discussed in our next article.

Model Diagnostics And Model Selection — 9

Do share your thoughts and support by commenting and sharing the article among your peer groups.
