Linear Regression — Learning the ‘R’ way!

Vipin Verma
Human Systems Data
Mar 1, 2017

This chapter focuses on simple and multiple linear regression, supervised-learning approaches for quantitative data. Linear regression helps discover the relationship between variables, the strength of that relationship, how well the model fits the data, and how well one variable can be predicted from another.

Let us understand the concepts explained in this chapter by applying them to the Communities and Crime Data Set from the UCI Machine Learning Repository. From this data we will try to predict the total number of violent crimes per 100K population in a given community, on the basis of various community-population measures available in the dataset. First, let us consider the simplest case:

ViolentCrimesPerPop ≈ β0 + β1 × population

Where,

ViolentCrimesPerPop = total number of violent crimes per 100K population

population= population for community

β0 and β1 are two unknown constants that represent the intercept and slope terms in the linear model. Together, β0 and β1 are known as the model coefficients or parameters. In the dataset, some rows have a population value of 0.0; for simplicity I have filtered them out using dplyr’s filter function. Using R’s linear-model fitting function (lm from the stats package), we can fit a linear model as follows:

library(data.table)  # for fread
library(dplyr)       # for filter

crimeData <- fread("crime.csv")
crimeData <- filter(crimeData, population > 0)  # drop rows with zero population

pop.fit <- lm(ViolentCrimesPerPop ~ population, data = crimeData)
summary(pop.fit)

plot(ViolentCrimesPerPop ~ population, data = crimeData, main = "Population vs Crime")
abline(pop.fit, col = "red")

The output generated is shown below:

output of simple linear regression
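Once a model like pop.fit is in hand, R’s predict() function returns fitted crime rates for new communities. Here is a minimal sketch on simulated data (the variable names mirror the dataset, but the numbers are synthetic and purely for illustration):

```r
# Simulated stand-in for the crime data (synthetic values, illustration only)
set.seed(1)
crimeSim <- data.frame(population = runif(200, 1e4, 1e6))
crimeSim$ViolentCrimesPerPop <- 0.2 + 5e-7 * crimeSim$population +
  rnorm(200, sd = 0.1)

pop.fit <- lm(ViolentCrimesPerPop ~ population, data = crimeSim)

# Predicted crime rate for a community of 50,000 people,
# with a 95% prediction interval (columns: fit, lwr, upr)
predict(pop.fit, newdata = data.frame(population = 5e4),
        interval = "prediction")
```

predict() accepts any data frame whose columns match the model’s predictors, so the same call works for batch predictions over many communities at once.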

The R-squared value of 0.139 suggests that population alone explains only a small fraction of the variance in the crime rate. This example illustrates linear regression with a single variable. In reality, however, multiple variables can affect the value of the output variable, e.g.

ViolentCrimesPerPop ≈ β0 + β1 × racepctblack + β2 × racePctWhite + β3 × racePctAsian + β4 × racePctHisp

Where,

racepctblack is the percentage of the population that is African American; the remaining three are the percentages of the Caucasian, Asian and Hispanic populations, respectively. Again, using the same function but with different arguments in R:

race.fit <- lm(ViolentCrimesPerPop ~ racepctblack + racePctWhite + racePctAsian + racePctHisp, data = crimeData)
summary(race.fit)
par(mfrow = c(2, 2))  # divides the plot area into a 2x2 grid
plot(ViolentCrimesPerPop ~ racepctblack + racePctWhite + racePctAsian + racePctHisp, data = crimeData, main = "Race percentage vs Crime")

The output generated is shown below:

Output of multiple linear regression
Individual plots of crime rates with the race population percentages

An R-squared value of 0.5 indicates a moderate relationship between the predictors and the crime rate. The signs of the slopes indicate that the crime rate is positively associated with the percentages of the African American and Hispanic populations, and negatively associated with the percentages of the Asian and Caucasian populations.

coefficients(race.fit)

Coefficients from the multiple linear regression

The above model suggests that the four variables, taken together, have an impact on the crime rate in the communities. But if we had used forward selection or backward selection, we might have found different results. Forward selection is an incremental approach: we start with an empty model and, at each step, add the variable whose inclusion results in the lowest Residual Sum of Squares (RSS). Backward selection is the decremental counterpart: we start with all the candidate variables in the model and repeatedly remove the least statistically significant one. The four variables do have an impact, but the small slope values for the Caucasian and Asian percentages suggest that they do not explain much of the model. So let us compare two different fitted models to establish this result:

fitWithTwoVariables = lm(ViolentCrimesPerPop ~ racepctblack + racePctHisp, data=crimeData)
fitWithFourVariables = lm(ViolentCrimesPerPop ~ racepctblack + racePctWhite + racePctAsian + racePctHisp, data=crimeData)

Carrying out an analysis of variance (ANOVA) on these two nested models will tell us whether the two additional variables significantly improve the fit.

anova(fitWithTwoVariables, fitWithFourVariables)

This gives the following results:

Results of ANOVA for two models

An F-value of ~2 (with a non-significant p-value) indicates that the two models fit about equally well, i.e. adding the two variables for the Caucasian and Asian populations does not explain much additional variation in the crime rate. This is also evident from the confidence intervals for the Caucasian and Asian coefficients in the multiple linear regression done above, as both intervals contain 0, shown below:

CIs for the coefficients in multiple linear regression
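Intervals like these come from R’s confint(), which returns 95% confidence intervals for each coefficient by default. A self-contained sketch on simulated data (synthetic values, illustration only), in which x2 has no real effect on the response:

```r
# Simulate a response that depends on x1 but not on x2
set.seed(7)
n  <- 300
x1 <- runif(n)
x2 <- runif(n)                      # pure noise: no effect on y
y  <- 1 + 2 * x1 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x1 + x2)
confint(fit)                        # 95% CIs by default
# The interval for x2 will typically straddle 0, while the
# interval for x1 stays well away from it
```

A coefficient whose interval straddles 0 is statistically indistinguishable from “no effect” at that confidence level, which is exactly the situation for the Caucasian and Asian percentages above.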

We have eliminated two variables from a set of four; this task of determining which predictors are associated with the response is called variable selection. Thus, regression helps not only in learning from the data, but also in deciding which variables from a pool are important to consider.
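In R, this kind of variable selection can be automated with step(), which carries out forward or backward selection; note that step() ranks candidate models by AIC rather than raw RSS, although the spirit is the same. A hedged sketch on simulated data (the variable names mirror the dataset; the values are synthetic):

```r
# Simulated stand-in: only racepctblack and racePctHisp truly affect y
set.seed(42)
n <- 500
crimeSim <- data.frame(racepctblack = runif(n), racePctWhite = runif(n),
                       racePctAsian = runif(n), racePctHisp  = runif(n))
crimeSim$ViolentCrimesPerPop <- 0.5 * crimeSim$racepctblack +
  0.3 * crimeSim$racePctHisp + rnorm(n, sd = 0.1)

# Backward selection: start from the full model, drop terms while AIC improves
full.fit <- lm(ViolentCrimesPerPop ~ racepctblack + racePctWhite +
                 racePctAsian + racePctHisp, data = crimeSim)
backward.fit <- step(full.fit, direction = "backward", trace = 0)

# Forward selection: start from the intercept-only model, add terms greedily
null.fit <- lm(ViolentCrimesPerPop ~ 1, data = crimeSim)
forward.fit <- step(null.fit, direction = "forward",
                    scope = formula(full.fit), trace = 0)
formula(forward.fit)
```

With this simulated data, both directions should retain the two truly predictive variables; on real data the two directions can disagree, which is why the chapter warns that they may give different results.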

Data Source — https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime

Reference:

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 6). New York: Springer.

Simple Linear Correlation and Regression.
