If you have not read part 7 of the R data analysis series, kindly go through the following article, where we discussed Simple Linear Regression — 7.
The contents of this article are a gist of a couple of books that I was introduced to during my IIM-B days.
R for Everyone — Jared P. Lander
Practical Data Science with R — Nina Zumel & John Mount
All the code blocks discussed in the article are present in the form of R markdown in the GitHub link.
To see all the articles written by me kindly use the link, Vivek Srinivasan.
The logical extension of simple linear regression is multiple regression, which allows for multiple predictors. The idea is still the same; we are still making predictions or inferences about the response, but we now have more information in the form of multiple predictors. The math requires some matrix algebra, but fortunately the lm function is used with very little extra effort.
In this case, the relationship between the response and the p predictors (p − 1 predictors and the intercept) is modeled as

Y = Xβ + ε

where Y is the n×1 response vector, X is the n×p matrix (n rows, with p − 1 predictors plus the intercept), β is the p×1 vector of coefficients (one for each predictor and the intercept) and ε is the n×1 vector of normally distributed errors. This seems more complicated than simple regression, but the algebra actually gets easier. The solution for the coefficients is simply written as

β̂ = (XᵀX)⁻¹XᵀY
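To make the formula concrete, here is a minimal sketch of the normal-equations solution using simulated data (the variable names and the simulated coefficients are made up for the example); the result matches what lm computes:

```r
# Simulate data from a known model: y = 1 + 2*x1 - 3*x2 + noise
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2*x1 - 3*x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^-1 X'y

# The same coefficients come out of lm
coef(lm(y ~ x1 + x2))
```

Solving the linear system with solve(A, b) is numerically preferable to explicitly inverting XᵀX, though lm itself uses a QR decomposition internally.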
To see this in action, we use New York City condo evaluations for the fiscal year 2011–2012, obtained through NYC Open Data. NYC Open Data is an initiative by New York City to make the government more transparent and work better. The original data were separated by borough, with one file each for Manhattan, Brooklyn, Queens, the Bronx and Staten Island, and contained extra information we will not be using. So we combined the five files into one, cleaned up the column names and posted the result at housing.csv. To access the data, either download it from that URL and use read.table on the now-local file, or read it directly from the URL.
housing <- read.table("../data/housing.csv",
sep = ",", header = TRUE,
stringsAsFactors = FALSE)
A few reminders about what that code does: sep specifies that commas were used to separate columns; header means the first row contains the column names; and stringsAsFactors leaves character columns as they are and does not convert them to factors, which speeds up loading time and also makes them easier to work with. Looking at the data, we see that we have a lot of columns and some bad names, so we should rename them:
names(housing) <- c("Neighborhood", "Class", "Units", "YearBuilt",
                    "SqFt", "Income", "IncomePerSqFt", "Expense",
                    "ExpensePerSqFt", "NetIncome", "Value",
                    "ValuePerSqFt", "Boro")
For these data, the response is the value per square foot and the predictors are everything else. However, we ignore the income and expense variables, as they are actually just estimates based on an arcane requirement that condos be compared to rentals for valuation purposes. The first step is to visualize the data with some exploratory data analysis. The natural place to start is with a histogram of the response, ValuePerSqFt:
ggplot(housing, aes(x=ValuePerSqFt)) + geom_histogram(binwidth=10) + labs(x="Value per Square Foot")
The bimodal nature of the histogram means there is something left to be explored. Mapping color to Boro and faceting on Boro reveal that Brooklyn and Queens make up one mode and Manhattan makes up the other, while there is not much data on the Bronx and Staten Island.
library(ggplot2)
library(gridExtra)
p1 <- ggplot(housing, aes(x=ValuePerSqFt, fill=Boro)) +
  geom_histogram(binwidth=10) +
  labs(x="Value per Square Foot") +
  theme(axis.text.x = element_text(angle = 90))
p2 <- ggplot(housing, aes(x=ValuePerSqFt, fill=Boro)) +
  geom_histogram(binwidth=10) +
  labs(x="Value per Square Foot") +
  theme(axis.text.x = element_text(angle = 90)) +
  facet_wrap(~Boro)
grid.arrange(p1, p2, nrow=1, widths=c(3, 5))
Now let us visualize two other important predictors, SqFt and Units, to see whether there are any outlier points that may affect our regression analysis.
p3 <- ggplot(housing, aes(x=SqFt)) + geom_histogram()
p4 <- ggplot(housing, aes(x=Units)) + geom_histogram()
grid.arrange(p3, p4, nrow = 1)
It can be seen from the above figure that there are quite a few buildings with an extremely large number of units and square footage. These may be potential outliers; let us confirm this using boxplots.
p5 <- ggplot(housing, aes(y=SqFt,x=1)) + geom_boxplot()
p6 <- ggplot(housing, aes(y=Units,x=1)) + geom_boxplot()
grid.arrange(p5, p6, nrow = 1)
It is quite visible from the boxplots that there are quite a few points with extreme values. Having these points in our regression analysis may skew our regression estimates and result in poor predictive capability. After all, regression is just a mean estimate of the data, which gets affected if we have extreme values in our data. So we will see how many such points we may need to remove to make the regression analysis better. From the table below, we can see that there are six rows of data with extremely large Units and SqFt values. So we will remove those data points and continue our analysis.
housing[housing$Units > 1000, ]
housing <- housing[housing$Units < 1000, ]
Now that we have viewed our data a few different ways, it is time to start modeling. We already saw from the histograms that accounting for the different boroughs will be important, and the subsequent boxplots indicated that SqFt will be important as well. Fitting the model uses the formula interface in lm. Now that there are multiple predictors, we separate them on the right side of the formula using plus signs:
house1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data=housing)
The first thing to notice is that in some versions of R there is a message warning us that Boro was converted to a factor. This is because Boro was stored as a character, and for modeling purposes character data must be represented using indicator variables, which is how factors are treated inside modeling functions.
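To see how a character or factor column becomes indicator (dummy) variables, we can inspect the design matrix that lm builds internally (a quick sketch, assuming housing has been loaded and renamed as above):

```r
# The design matrix for the same formula; Boro is expanded into
# indicator columns (0/1), with the Bronx absorbed into the intercept
# as the baseline level
head(model.matrix(ValuePerSqFt ~ Units + SqFt + Boro, data=housing))
```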
Model Summary and Interpretation
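The summary discussed below is produced with a single call on the fitted model object:

```r
# Print the call, residual quantiles, coefficient table and fit statistics
summary(house1)
```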
The summary function prints out information about the model, including how the function was called, quantiles for the residuals, standard errors and p-values for each variable, and the degrees of freedom and F-statistic for the model. There is no coefficient for the Bronx because that is the baseline level of Boro, and all the other Boro coefficients are relative to that baseline.
The coefficients represent the effect of the predictors on the response, and the standard errors are the uncertainty in the estimation of the coefficients. The t value (t-statistic) and p-value for the coefficients are numerical measures of statistical significance, though these should be viewed with caution, as most modern data scientists prefer to judge the model as a whole rather than look at the statistical significance of individual coefficients. The F-statistic is a measure of the model's goodness of fit. The degrees of freedom for regression is calculated as the number of observations minus the number of coefficients. In this example, there are nrow(housing) − length(coef(house1)) = 2613 degrees of freedom.
We explained the importance of the t-value and F-statistic elaborately in our last article. The t-value measures the significance of individual variables, while the F-statistic measures the overall significance of the model. Since the p-value of the F-statistic is less than 5 percent (a general rule of thumb), we can safely conclude that the model we built is good and the predictors used in the model are significant enough. In addition, we have a fairly decent R-squared value of 0.6034 in our regression analysis, which indicates that about 60 percent of the total variance is explained by the model.
A quick way to grab the coefficients from a model is to either use the
coef function or get them from the model using the
$ operator on the model object.
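Both routes return the same named vector (a sketch, assuming house1 has been fit as above):

```r
coef(house1)          # extractor function (the preferred way)
house1$coefficients   # the same values via the $ operator
```

Using the extractor function is generally safer, since it works uniformly across model classes even when the internal slot names differ.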
We prefer visualizations over tables of information, and a great way of visualizing regression results is a coefficient plot. Rather than build it from scratch, we use the convenient coefplot package, where each coefficient is plotted as a point, with a thick line representing the one-standard-error confidence interval and a thin line representing the two-standard-error confidence interval. A vertical line indicates 0.
In general, a good rule of thumb is that if the two standard error confidence interval does not contain 0, it is statistically significant.
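Producing the plot takes a single call (assuming the coefplot package is installed):

```r
library(coefplot)
coefplot(house1)   # coefficient point estimates with 1- and 2-SE intervals
```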
As expected, being located in
Manhattan has the largest effect on
value per square foot. Surprisingly, the number of units or square feet in a building has little effect on value. This is a model with purely additive terms. Interactions between variables can be equally powerful.
Models With Interaction Terms
To enter them in a formula, separate the desired variables with a * instead of +. Doing so results in the individual variables plus the interaction term being included in the model. To include just the interaction term, and not the individual variables, use : instead. The results of interacting Units and SqFt in each of the two ways are:
house2 <- lm(ValuePerSqFt ~ Units * SqFt + Boro, data=housing)
house3 <- lm(ValuePerSqFt ~ Units : SqFt + Boro, data=housing)
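Note that * is shorthand: Units * SqFt expands to the main effects plus the interaction, so house2 could equivalently be written with the expansion spelled out (house2a is a hypothetical name for this illustration):

```r
# Equivalent to house2: * expands to Units + SqFt + Units:SqFt
house2a <- lm(ValuePerSqFt ~ Units + SqFt + Units:SqFt + Boro, data=housing)
```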
We have fit numerous models from which we need to pick the “best” one. Model selection and model diagnostics are a completely different topic altogether which will be discussed in our next article.
Model Diagnostics And Model Selection — 8
In the meantime, we will visualize the coefficients from multiple models using the handy multiplot function from the coefplot package. The following plot shows a coefficient plot for models house1, house2 and house3.
multiplot(house1, house2, house3)
Regression is often used for prediction, which in R is enabled by the predict function. We have multiple models, and our first step would be to find out which model fits the data best before using it for prediction. As discussed earlier, we will see more on model diagnostics and model selection in our next article. For now, we will take a step back and predict ValuePerSqFt for our test data, available at housingNew.csv, using our house1 regression model.
housingNew <- read.table("../data/housingNew.csv", sep=",",
                         header=TRUE, stringsAsFactors=FALSE)
Making the prediction can be as simple as calling
predict, although caution must be used when dealing with factor predictors to ensure that they have the same levels as those used in building the model.
housePredict <- predict(house1, newdata=housingNew, se.fit=TRUE,
                        interval="prediction", level=.95)
# predictions with upper and lower bounds based on standard errors
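Because se.fit=TRUE was requested, predict returns a list rather than a plain matrix; the $fit element holds the point predictions along with the lower and upper prediction bounds:

```r
head(housePredict$fit)      # columns: fit, lwr, upr
head(housePredict$se.fit)   # standard errors of the fitted values
```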
In this article, we discussed how to fit a regression model with multiple predictor variables. We also visualized our predictor variables to understand them better, added a few interaction terms in the model-building process and discussed how they affect the model. Finally, we used our base model to predict on a new dataset.
Although the R-squared value suggests the overall fit of the model is good, it is not sufficient to conclude that the model we built is robust and complete. There are several other ways to test our model's performance and robustness, which will be discussed in our next article.
Do share your thoughts and support by commenting and sharing the article among your peer groups.