Multiple Linear Regression — 8
If you have not read part 7 of this R data analysis series, kindly go through the previous article, where we discussed Simple Linear Regression — 7.
The contents of this article are a gist of a couple of books I was introduced to during my IIM-B days.
R for Everyone — Jared P. Lander
Practical Data Science with R — Nina Zumel & John Mount
All the code blocks discussed in this article are available as R Markdown at the GitHub link.
To see all the articles written by me, kindly use the link Vivek Srinivasan.
The logical extension of simple linear regression is multiple regression, which allows for multiple predictors. The idea is still the same: we are still making predictions or inferences about the response, but we now have more information in the form of multiple predictors. The math requires some matrix algebra, but fortunately the lm function can be used with very little extra effort.
In this case, the relationship between the response and the p predictors (p − 1 predictors plus the intercept) is modeled as

Y = Xβ + ε

where Y is the n×1 response vector, X is the n×p matrix (n rows; the p − 1 predictors plus the intercept as columns), β is the p×1 vector of coefficients (one for each predictor and the intercept) and ε is the n×1 vector of normally distributed errors, with

ε ~ N(0, σ²I)

This seems more complicated than simple regression, but the algebra actually gets easier. The solution for the coefficients is simply written as

β̂ = (XᵀX)⁻¹XᵀY
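To make the closed-form solution concrete, here is a small sketch on simulated data (not the housing data, which we load below) showing that the normal-equations solution matches the coefficients lm estimates:

```r
# Sketch: the closed-form solution (X'X)^-1 X'Y recovers the same
# coefficients as lm() on simulated data.
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X <- cbind(Intercept = 1, x1, x2)          # n x p design matrix
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # normal equations

fit <- lm(y ~ x1 + x2)
cbind(matrix = beta_hat, lm = coef(fit))   # the two columns agree
```

In practice lm uses a QR decomposition rather than inverting XᵀX directly, which is numerically more stable, but the estimates are the same.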
To see this in action we use New York City condo evaluations for the fiscal year 2011–2012, obtained through NYC Open Data. NYC Open Data is an initiative by New York City to make the government more transparent and help it work better. The original data were separated by borough, with one file each for Manhattan, Brooklyn, Queens, the Bronx, and Staten Island, and contained extra information we will not be using.
So we combined the five files into one, cleaned up the column names and posted the result as housing.csv. To access the data, either download it from that URL and use read.table on the now-local file, or read it directly from the URL.
housing <- read.table("../data/housing.csv",
                      sep = ",", header = TRUE,
                      stringsAsFactors = FALSE)
A few reminders about what that code does: sep specifies that commas separate the columns; header means the first row contains the column names; and stringsAsFactors = FALSE leaves character columns as they are rather than converting them to factors, which speeds up loading and makes them easier to work with. Looking at the data, we see that we have a lot of columns and some bad names, so we should rename them.
names(housing) <- c("Neighborhood", "Class", "Units", "YearBuilt",
                    "SqFt", "Income", "IncomePerSqFt", "Expense",
                    "ExpensePerSqFt", "NetIncome", "Value",
                    "ValuePerSqFt", "Boro")
head(housing)
For these data, the response is the value per square foot and the predictors are everything else. However, we ignore the income and expense variables, as they are actually just estimates based on an arcane requirement that condos be compared to rentals for valuation purposes. The first step is to visualize the data in some exploratory data analysis. The natural place to start is with a histogram of ValuePerSqFt.
require(ggplot2)
ggplot(housing, aes(x=ValuePerSqFt)) +
  geom_histogram(binwidth=10) +
  labs(x="Value per Square Foot")
The bimodal nature of the histogram means there is something left to be explored. Mapping color to Boro and faceting on Boro reveal that Brooklyn and Queens make up one mode and Manhattan makes up the other, while there is not much data on the Bronx and Staten Island.
require(gridExtra)
p1 <- ggplot(housing, aes(x=ValuePerSqFt, fill=Boro)) +
  geom_histogram(binwidth=10) +
  labs(x="Value per Square Foot") +
  theme(axis.text.x = element_text(angle = 90))
p2 <- ggplot(housing, aes(x=ValuePerSqFt, fill=Boro)) +
  geom_histogram(binwidth=10) +
  labs(x="Value per Square Foot") +
  facet_wrap(~Boro) +
  theme(axis.text.x = element_text(angle = 90))
grid.arrange(p1, p2, nrow = 1, widths = c(3, 5))
Outlier Analysis
Now let us visualize two other important predictors, SqFt and Units, to see if there are any outlier points that may affect our regression analysis.
p3 <- ggplot(housing, aes(x=SqFt)) + geom_histogram()
p4 <- ggplot(housing, aes(x=Units)) + geom_histogram()
grid.arrange(p3, p4, nrow = 1)
It can be seen from the figure above that a few buildings have an extraordinarily large number of units and square footage. They may be potential outliers; let us confirm this using boxplots.
p5 <- ggplot(housing, aes(y=SqFt,x=1)) + geom_boxplot()
p6 <- ggplot(housing, aes(y=Units,x=1)) + geom_boxplot()
grid.arrange(p5, p6, nrow = 1)
So it is quite visible from the boxplots that there are quite a few points with extreme values. Keeping these points in our regression analysis may skew the estimates and result in poor predictive capability. After all, regression is essentially a conditional mean estimate of the data, which gets pulled around by extreme values. So let us see how many such points we need to remove to improve the regression analysis. From the table, we can see that there are 6 rows of data with extreme Units and SqFt values, so we will remove those data points and continue our analysis.
housing[housing$Units > 1000, ]
housing <- housing[housing$Units < 1000, ]
Now that we have viewed our data a few different ways, it is time to start modeling. We already saw from the histograms that accounting for the different boroughs will be important, and the subsequent histograms and boxplots indicated that Units and SqFt will be important as well.
Fitting the model uses the formula interface in lm. Now that there are multiple predictors, we separate them on the right side of the formula using plus signs (+).
house1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data=housing)
summary(house1)
The first thing to notice is that in some versions of R there is a message warning us that Boro was converted to a factor. This is because Boro was stored as character data, and for modeling purposes character data must be represented using indicator variables, which is how factors are treated inside modeling functions.
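As a sketch of what that conversion produces, model.matrix shows the indicator (dummy) variables R builds from a factor; the toy boro vector below is illustrative, not the housing data:

```r
# Sketch: how a character/factor column expands into indicator variables.
boro <- factor(c("Bronx", "Brooklyn", "Manhattan", "Brooklyn"))
model.matrix(~ boro)
# The first level ("Bronx") is absorbed into the intercept as the baseline;
# each remaining level gets its own 0/1 indicator column.
```

This is also why the model summary shows no coefficient for the baseline level, as discussed next.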
Model Summary and Interpretation
The summary function prints out information about the model, including how the function was called, quantiles of the residuals, coefficient estimates, standard errors and p-values for each variable, and the degrees of freedom, p-value and F-statistic for the model. There is no coefficient for the Bronx because that is the baseline level of Boro; all the other Boro coefficients are relative to that baseline.
The coefficients represent the effect of the predictors on the response, and the standard errors are the uncertainty in the estimation of the coefficients. The t value (t-statistic) and p-value for the coefficients are numerical measures of statistical significance, though these should be viewed with caution: most modern data scientists prefer not to judge the statistical significance of individual coefficients in isolation but rather the model as a whole.
The model's p-value and F-statistic are measures of its goodness of fit. The residual degrees of freedom is calculated as the number of observations minus the number of estimated coefficients. In this example, there are nrow(housing) − length(coef(house1)) = 2613 degrees of freedom.
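The same bookkeeping can be checked directly in code; the sketch below uses simulated data rather than the housing data:

```r
# Sketch: residual degrees of freedom = observations - estimated coefficients.
set.seed(1)
d <- data.frame(y = rnorm(50), a = rnorm(50), b = rnorm(50))
m <- lm(y ~ a + b, data = d)

df.residual(m)             # 50 observations - 3 coefficients = 47
nrow(d) - length(coef(m))  # same value, computed by hand
```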
We explained the importance of the t-value and F-statistic in detail in our last article. In short, the t-value measures the significance of an individual variable, while the F-statistic measures the overall significance of the model. Since the p-values for both the t-statistics and the F-statistic are below 5 percent (the general rule of thumb), we can safely conclude that the model we built is good and that the predictors used in it are significant. In addition, we have a fairly decent R-squared value of 0.6034, which indicates that about 60 percent of the total variance is explained by the model.
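For intuition, R-squared can be computed by hand as 1 − RSS/TSS and compared against the value summary reports; the sketch below uses toy data rather than the housing data:

```r
# Sketch: R-squared = 1 - RSS/TSS, the share of total variance explained.
set.seed(7)
d <- data.frame(x = rnorm(80))
d$y <- 2 * d$x + rnorm(80)
m <- lm(y ~ x, data = d)

rss <- sum(residuals(m)^2)          # residual sum of squares
tss <- sum((d$y - mean(d$y))^2)     # total sum of squares
r2  <- 1 - rss / tss
c(manual = r2, summary = summary(m)$r.squared)  # identical
```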
Coefficient Analysis
A quick way to grab the coefficients from a model is to use the coef function, or to extract them from the model object with the $ operator.
coef(house1)
We prefer visualizations over tables of information, and a great way of visualizing regression results is a coefficient plot. Rather than build it from scratch, we use the convenient coefplot package, where each coefficient is plotted as a point, with a thick line representing the one-standard-error confidence interval and a thin line representing the two-standard-error confidence interval. A vertical line indicates 0. In general, a good rule of thumb is that if the two-standard-error confidence interval does not contain 0, the coefficient is statistically significant.
library(coefplot)
coefplot(house1)
As expected, being located in Manhattan has the largest effect on value per square foot. Surprisingly, the number of units or square feet in a building has little effect on value. This is a model with purely additive terms; interactions between variables can be equally powerful.
Models With Interaction Terms
To enter them in a formula, separate the desired variables with a * instead of a +. Doing so results in the individual variables plus the interaction term being included in the model. To include just the interaction term, and not the individual variables, use : instead. The results of interacting Units and SqFt are
house2 <- lm(ValuePerSqFt ~ Units * SqFt + Boro, data=housing)
house3 <- lm(ValuePerSqFt ~ Units : SqFt + Boro, data=housing)
We have fit numerous models, from which we need to pick the “best” one. Model selection and model diagnostics are a separate topic altogether, which will be discussed in our next article.
Model Diagnostics And Model Selection — 9
In the meantime, we will visualize the coefficients from multiple models using the handy multiplot function from the coefplot package. The following plot shows a coefficient plot for models house1, house2 and house3.
multiplot(house1, house2, house3)
Predictions
Regression is often used for prediction, which in R is enabled by the predict function. We have multiple models, and our first step would normally be to find out which of them fits the data best before using it for prediction. As discussed earlier, we will see more on the model diagnostics and model selection steps in our next article. For now, we will take a step back and predict ValuePerSqFt for the test data available at housingNew.csv using our house1 regression model.
housingNew <- read.table("../data/housingNew.csv", sep = ",",
                         header = TRUE, stringsAsFactors = FALSE)
Making the prediction can be as simple as calling predict, although caution must be used when dealing with factor predictors to ensure that they have the same levels as those used in building the model.
# predictions with upper and lower bounds based on standard errors
housePredict <- predict(house1, newdata = housingNew, se.fit = TRUE,
                        interval = "prediction", level = 0.95)
head(housePredict$fit)
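The factor-level caution mentioned above can be handled explicitly by setting the levels of the factor columns in the new data to match those seen during training. A minimal sketch with made-up data:

```r
# Sketch: aligning factor levels between training and new data before predict().
# A level mismatch (an unseen level in the new data) makes predict() fail,
# so set the new data's levels to match the training data's.
train <- data.frame(boro = factor(c("Bronx", "Brooklyn", "Manhattan")),
                    y    = c(100, 150, 400))
m <- lm(y ~ boro, data = train)

newdat <- data.frame(boro = "Brooklyn", stringsAsFactors = FALSE)
newdat$boro <- factor(newdat$boro, levels = levels(train$boro))
predict(m, newdata = newdat)
```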
In this article, we discussed how to fit a regression model with multiple predictor variables. We also visualized our response and predictor variables to understand them better. We added a few interaction terms during model building and discussed how they affect the model. Finally, we used our base model to make predictions for a new dataset.
Although our F-statistic and R-squared suggest that the overall significance of the model is good, that alone is not sufficient to conclude that the model we built is robust and complete. There are several other ways to test our model's performance and robustness, which will be discussed in our next article.
Model Diagnostics and Model Selection — 9
Do share your thoughts and support by commenting and sharing the article among your peer groups.