# Multiple Linear Regression — 8

If you have not read part 7 of the R data analysis series, kindly go through the article *Simple Linear Regression — 7*, where we discussed simple regression. The contents of this article are a gist of a couple of books I was introduced to during my IIM-B days:

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in the article are available as an R Markdown file at the Github link.

To see all the articles written by me kindly use the link, Vivek Srinivasan.

The logical extension of `simple linear regression` is `multiple regression`, which allows for multiple predictors. The idea is still the same; we are still making predictions or inferences on the response, but we now have more information in the form of multiple predictors. The math requires some matrix algebra, but fortunately the `lm` function is used with very little extra effort.

In this case, the relationship between the response and the `p` predictors (p − 1 predictors and the intercept) is modeled as

$$Y = X\beta + \varepsilon$$

where `Y` is the n×1 response vector, `X` is the n×p matrix (n rows, with p − 1 predictors plus the intercept), `β` is the p×1 vector of coefficients (one for each predictor and the intercept), and `ε` is the n×1 vector of normally distributed errors, with

$$\varepsilon \sim N(0, \sigma^2 I_n)$$

This seems more complicated than simple regression, but the algebra actually gets easier. The solution for the coefficients is simply written as

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$$
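To see that the matrix solution really is what `lm` computes, here is a minimal sketch on synthetic data (the variable names `x1`, `x2` are made up for illustration):

```r
# Compare the normal-equations solution (X'X)^{-1} X'y with coef(lm()).
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # n x p design matrix, intercept first
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations directly

fit <- lm(y ~ x1 + x2)
cbind(manual = as.numeric(beta_hat), lm = as.numeric(coef(fit)))
```

The two columns agree to numerical precision, which is all `lm` is doing under the hood (via a QR decomposition rather than an explicit inverse, for numerical stability).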

To see this in action we use New York City condo evaluations for the fiscal year 2011–2012, obtained through NYC Open Data. NYC Open Data is an initiative by New York City to make the government more transparent and work better. The original data were separated by borough with one file each for Manhattan, Brooklyn, Queens, the Bronx, and Staten Island, and contained extra information we will not be using.

So we combined the five files into one, cleaned up the column names, and posted the result at *housing.csv*. To access the data, either download it from that URL and use `read.table` on the now-local file, or read it directly from the URL.

```r
housing <- read.table("../data/housing.csv", sep = ",",
                      header = TRUE, stringsAsFactors = FALSE)
```

A few reminders about what that code does: `sep` specifies that commas were used to separate columns; `header` means the first row contains the column names; and `stringsAsFactors = FALSE` leaves character columns as they are and does not convert them to `factors`, which speeds up loading time and also makes them easier to work with. Looking at the data, we see that we have a lot of columns and some bad names, so we should rename them.

```r
names(housing) <- c("Neighborhood", "Class", "Units", "YearBuilt",
                    "SqFt", "Income", "IncomePerSqFt", "Expense",
                    "ExpensePerSqFt", "NetIncome", "Value",
                    "ValuePerSqFt", "Boro")
head(housing)
```

For these data, the response is the value per square foot and the predictors are everything else. However, we ignore the income and expense variables, as they are actually just estimates based on an arcane requirement that condos be compared to rentals for valuation purposes. The first step is to visualize the data in some exploratory data analysis. The natural place to start is with a `histogram` of `ValuePerSqFt`.

```r
require(ggplot2)
ggplot(housing, aes(x = ValuePerSqFt)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Value per Square Foot")
```

The `bimodal` nature of the `histogram` means there is something left to be explored. Mapping color to `Boro` and faceting on `Boro` reveal that Brooklyn and Queens make up one mode and Manhattan makes up the other, while there is not much data on the Bronx and Staten Island.

```r
require(gridExtra)
p1 <- ggplot(housing, aes(x = ValuePerSqFt, fill = Boro)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Value per Square Foot") +
  theme(axis.text.x = element_text(angle = 90))

p2 <- ggplot(housing, aes(x = ValuePerSqFt, fill = Boro)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Value per Square Foot") +
  facet_wrap(~Boro) +
  theme(axis.text.x = element_text(angle = 90))

grid.arrange(p1, p2, nrow = 1, widths = c(3, 5))
```

## Outlier Analysis

Now let us visualize two other important predictors, `SqFt` and `Units`, to see if there are any outlier points that may affect our regression analysis.

```r
p3 <- ggplot(housing, aes(x = SqFt)) + geom_histogram()
p4 <- ggplot(housing, aes(x = Units)) + geom_histogram()
grid.arrange(p3, p4, nrow = 1)
```

The figure shows that there are quite a few buildings with an incredible number of units and amount of square footage. These may be potential outliers; let us confirm this using boxplots.

```r
p5 <- ggplot(housing, aes(y = SqFt, x = 1)) + geom_boxplot()
p6 <- ggplot(housing, aes(y = Units, x = 1)) + geom_boxplot()
grid.arrange(p5, p6, nrow = 1)
```

It is quite visible from the `boxplots` that there are a few points with extreme values. Keeping these points in our regression analysis may `skew` our regression estimate and result in poor predictive capability. After all, regression is just a `mean estimate` of the data, which gets pulled around if we have extreme values in the data. So we will see how many such points we need to remove to make the `regression` analysis better. From the table, we can see that there are 6 rows of data with extreme `Units` and `SqFt` values. We will remove those data points and continue our analysis.

```r
housing[housing$Units > 1000, ]
housing <- housing[housing$Units < 1000, ]
```

Now that we have viewed our data a few different ways, it is time to start modeling. We already saw from the `histograms` that accounting for the different boroughs will be important, and subsequent `histograms` and `boxplots` indicated that `Units` and `SqFt` will be important as well.

Fitting the model uses the formula interface in `lm`. Now that there are multiple `predictors`, we separate them on the right side of the formula using plus signs (`+`).

```r
house1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data = housing)
summary(house1)
```

The first thing to notice is that in some versions of `R` there is a message warning us that `Boro` was converted to a `factor`. This is because `Boro` was stored as a `character`, and for modeling purposes `character` data must be represented using indicator variables, which is how `factors` are treated inside modeling functions.
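The indicator-variable coding can be inspected directly with `model.matrix`, which performs the same expansion `lm` does internally. A minimal sketch with a hand-made vector of borough names:

```r
# Character data is coerced to a factor, and the factor becomes one
# indicator (dummy) column per non-baseline level. The first level
# ("Bronx", alphabetically) is absorbed into the intercept.
boro <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")
model.matrix(~ boro)
```

The result has five columns: an intercept plus one 0/1 indicator each for Brooklyn, Manhattan, Queens, and Staten Island, which is exactly why the model summary below shows no `Bronx` coefficient.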

## Model Summary and Interpretation

The `summary` function prints out information about the model, including how the function was called, `quantiles for the residuals`, `coefficient estimates`, `standard errors` and `p-values` for each variable, and the `degrees of freedom`, `p-value` and `F-statistic` for the model. There is no coefficient for the `Bronx` because that is the baseline level of `Boro`, and all the other `Boro` coefficients are relative to that baseline.

The `coefficients` represent the effect of the `predictors` on the `response`, and the `standard errors` are the uncertainty in the estimation of the coefficients. The `t value` (t-statistic) and `p-value` for the coefficients are numerical measures of statistical significance, though these should be viewed with caution, as most modern data scientists do not like to look at the statistical significance of individual coefficients but rather judge the model as a whole.

The model `p-value` and `F-statistic` are measures of its goodness of fit. The residual `degrees of freedom` for the regression is calculated as the number of observations minus the number of coefficients. In this example, there are `nrow(housing) − length(coef(house1)) = 2613` degrees of freedom.
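The same arithmetic can be checked on any fitted model with `df.residual`. A quick sketch on synthetic data (the housing file is not bundled here, so the variables `x` and `g` are made up):

```r
# Residual degrees of freedom = observations minus estimated coefficients.
set.seed(1)
d   <- data.frame(x = rnorm(30), g = rep(c("a", "b", "c"), 10))
d$y <- d$x + rnorm(30)
fit <- lm(y ~ x + g, data = d)

nrow(d) - length(coef(fit))  # 30 observations - 4 coefficients = 26
df.residual(fit)             # same number, as stored by lm
```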

We explained the importance of the t-value and F-statistic elaborately in our last article. In short, the `t-value` measures the significance of individual variables, while the `F-statistic` measures the overall significance of the model. Since the `p-values` for the `t-statistics` and the `F-statistic` are less than 5 percent (a general rule of thumb), we can safely conclude that the model we built is good and that the `predictors` used in the model are significant. In addition, we have a fairly decent `R-squared` value of `0.6034` in our regression analysis, which indicates that about `60 percent` of the total variance is explained by the model.

## Coefficient Analysis

A quick way to grab the coefficients from a model is to either use the `coef` function or get them from the model using the `$` operator on the model object.

```r
coef(house1)
```

We prefer visualizations over tables of information, and a great way of visualizing regression results is a coefficient plot. Rather than build it from scratch, we use the convenient `coefplot` package, where each coefficient is plotted as a point, with a thick line representing the `one standard error` confidence interval and a thin line representing the `two standard error` confidence interval. There is a vertical line indicating 0. In general, a good rule of thumb is that if the two standard error confidence interval does not contain 0, the coefficient is statistically significant.
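The intervals behind such a plot can also be read off numerically with `confint`. A minimal sketch on synthetic data (the variable `x` here is made up for illustration):

```r
# confint() returns the confidence intervals a coefficient plot draws;
# an interval that excludes 0 corresponds to significance at ~5%.
set.seed(7)
d   <- data.frame(x = rnorm(50))
d$y <- 2 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)

confint(fit, level = 0.95)
```

For this strongly related pair, the interval for `x` sits well away from 0, matching the rule of thumb.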

```r
library(coefplot)
coefplot(house1)
```

As expected, being located in `Manhattan` has the largest effect on `value per square foot`. Surprisingly, the number of units or square feet in a building has little effect on value. This is a model with purely additive terms. Interactions between variables can be equally powerful.

## Models With Interaction Terms

To enter them in a formula, separate the desired variables with a `*` instead of a `+`. Doing so results in the individual variables plus the interaction term being included in the model. To include just the interaction term, and not the individual variables, use `:` instead. The results of interacting `Units` and `SqFt` are

```r
house2 <- lm(ValuePerSqFt ~ Units * SqFt + Boro, data = housing)
house3 <- lm(ValuePerSqFt ~ Units : SqFt + Boro, data = housing)
```
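A quick way to see exactly what `*` and `:` expand to is to inspect the design-matrix columns each formula generates, here on a tiny made-up data frame:

```r
# u * s expands to the main effects plus the interaction;
# u : s keeps only the interaction term.
d <- data.frame(u = 1:4, s = c(2, 4, 6, 8))
colnames(model.matrix(~ u * s, data = d))  # "(Intercept)" "u" "s" "u:s"
colnames(model.matrix(~ u : s, data = d))  # "(Intercept)" "u:s"
```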

We have fit numerous models, from which we need to pick the “best” one. Model selection and model diagnostics are a completely different topic altogether, which will be discussed in our next article, *Model Diagnostics and Model Selection — 9*.

In the meantime, we will visualize the coefficients from multiple models using the handy `multiplot` function from the `coefplot` package. The following plot shows a coefficient plot for the models house1, house2 and house3.

```r
multiplot(house1, house2, house3)
```

## Predictions

Regression is often used for prediction, which in R is enabled by the `predict` function. We have multiple models, and our first step would be to find out which model fits the data best before using it for prediction. As discussed earlier, we will see more on the `model diagnostics` and `model selection` steps in our next article. For now, we will take a step back and predict `ValuePerSqFt` for our test data, available at *housingNew.csv*, using our `house1` regression model.

```r
housingNew <- read.table("../data/housingNew.csv", sep = ",",
                         header = TRUE, stringsAsFactors = FALSE)
```

Making the prediction can be as simple as calling `predict`, although caution must be used when dealing with `factor` predictors to ensure that they have the same levels as those used in building the model.

```r
# Predictions with upper and lower bounds based on standard errors
housePredict <- predict(house1, newdata = housingNew,
                        se.fit = TRUE, interval = "prediction", level = 0.95)
head(housePredict$fit)
```
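The factor-level caution above can be handled explicitly by reusing the training levels when building the new data. A minimal sketch with made-up data (`train`, `new`, and the two-level `boro` column are hypothetical):

```r
# Align the levels of a factor in newdata with the levels the model
# was trained on before calling predict().
train <- data.frame(y = c(1, 2, 3, 4),
                    boro = factor(c("A", "A", "B", "B")))
fit <- lm(y ~ boro, data = train)

new <- data.frame(boro = "B")
new$boro <- factor(new$boro, levels = levels(train$boro))  # reuse training levels
predict(fit, newdata = new)  # fitted mean for level "B": 3.5
```

If `newdata` instead contained a level the model never saw, `predict` would stop with an error, so this alignment step is worth making explicit.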

In this article, we discussed how to fit a `regression` model with multiple `predictor` variables. We also visualized our `response` and `predictor` variables to understand them better. We added a few interaction terms in the model-building process and discussed how they affect our model. Finally, we used our base model to predict for a new dataset.

Although our `F-statistic` and `R-squared` suggest that the overall significance of the model is good, that is not sufficient to conclude that the model we built is robust and complete. There are several other ways to test our model's performance and robustness, which will be discussed in our next article.

*Model Diagnostics and Model Selection — 9*

Do share your thoughts and support by commenting and sharing the article among your peer groups.