First attempt at Regression using R

What’s the Reading? Chapter 3 of “An Introduction to Statistical Learning: with Applications in R”

Eric del Rio
Human Systems Data
5 min read · Mar 1, 2017

--

Hi Folks,

This post is in response to a reading of Chapter 3 of An Introduction to Statistical Learning: with Applications in R by James and colleagues (2013).

While I have read many scientific (mostly psychology) papers over the years, I feel like I have very rarely encountered studies that use regression as a form of analysis. As a result, I found this chapter to be very useful: it not only explains the functions and mathematical framework of linear and multiple linear regression, but it also does a good job of explaining what the various statistics these analyses produce actually tell you about the data.

In simple terms, linear regression is useful for predicting an outcome (or dependent variable) from a given input (or independent variable). Multiple regression is useful when you want to see how multiple inputs may affect the output. The trick to doing this is to come up with a formula that best fits the data you have, so that your formula will also work for unknown input values.
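To make the idea concrete, here is a toy sketch (my own made-up numbers, not from the book) of how R expresses both cases with the `lm()` function:

```r
# Toy illustration: fit a line y = b0 + b1*x to made-up data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)   # simple linear regression
coef(fit)          # prints the fitted intercept (b0) and slope (b1)

# Multiple regression just adds more predictors on the right-hand side:
# lm(y ~ x1 + x2 + x3)
```

The formula notation `y ~ x` is read as "y modeled by x"; R estimates the coefficients that best fit the data by least squares.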

I have read chapters about regression a number of times before for different classes, but I am somebody who tends to forget specifics if I don’t put them into practice. I am also a novice in R and would like to hone my skills. This is why I chose to attempt regression analyses in R and post it here for others who may be interested.

My Process Step 1: Finding a dataset

So, the chapter by James includes detailed instructions about how to perform a linear regression on a dataset from the MASS library called “Boston”. Because I didn’t want to just copy their code and draw the conclusions written in the paper, I decided to follow their instructions using a different set of data.

In looking for data, I at first found it difficult to locate a dataset with continuous variables that could be used as predictors for a continuous output variable. This is an issue because in order to perform a regression you need to use numeric values for all variables. One solution I found to this issue is what is called "dummy coding": a process where you assign numeric values to categorical data. I found this to be a useful resource to learn more about dummy coding. Also, this is a good resource for how to code categorical variables for regression in R. For this first attempt, though, I had doubts about using categorical variables as predictors.
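For anyone curious, R actually does most of the dummy coding for you. A minimal sketch (with a hypothetical `region` variable I made up for illustration):

```r
# Hypothetical categorical variable; factor() marks it as categorical
region <- factor(c("North", "South", "West", "South", "North"))

# model.matrix() shows the 0/1 indicator ("dummy") columns that lm()
# would build behind the scenes: one column per level except a baseline
model.matrix(~ region)
```

When you pass a factor to `lm()` directly, this expansion happens automatically, so manual dummy coding is rarely needed in R.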

First, I looked on Kaggle to see if I could find a free dataset to use, but nothing jumped out at me. Next, I decided to use one of the datasets from the R “datasets” package. I found this great website on the R datasets package, which shows all of the datasets and has information about the variables.

I eventually arrived at a dataset called USArrests which looks at murder, assault, rape, and urban population % by state. The information about the dataset also included code that repairs some errors in the data.
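Since USArrests ships with base R, loading and inspecting it takes only a couple of lines:

```r
# USArrests comes with base R's "datasets" package -- no install needed
data("USArrests")
head(USArrests)   # Murder, Assault, UrbanPop, Rape columns
str(USArrests)    # 50 observations (one per state), 4 numeric variables
```

Murder, Assault, and Rape are arrest rates per 100,000 residents, and UrbanPop is the percent of the state's population living in urban areas.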

My Process Step 2: How to share my R code with you

At this point, many of my comments might refer to the code in R. It might be helpful to follow along with the code, which can be found in my GitHub Repository named "R_Practice".

Strangely, because Medium doesn’t allow file uploads and I don’t have a website to host files, figuring out a way to share my R Script aside from copy/pasting it as text was the most difficult part of this process. GitHub is definitely worth the time as it is a vast collaborative network of coders. With a repository, you can post your master code and send out requests for others to help you work on it. All of the code that is posted for free is open source and available to the public!

My Process Step 3: Linear Regression

For my attempt at simple linear regression, I decided to see how well murder and assault rates worked as predictors of urban population percentage.

Before I performed the linear regressions, I decided to plot these comparisons to see if they had a visible linear relationship.
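A quick sketch of how those scatterplots can be produced in base R (exact styling is my own choice, not necessarily what's in my script):

```r
data("USArrests")
# Two side-by-side scatterplots to eyeball linearity before fitting anything
par(mfrow = c(1, 2))
plot(USArrests$Murder, USArrests$UrbanPop,
     xlab = "Murder arrests (per 100,000)", ylab = "Urban population %")
plot(USArrests$Assault, USArrests$UrbanPop,
     xlab = "Assault arrests (per 100,000)", ylab = "Urban population %")
```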

Plots of Murder and Assault as predictors for Urban Pop %. Not so Linear….

As you can see, the relationships seem to be quite curved, so perhaps these are not the best predictors. I decided to continue anyway.

I found it very easy to run separate linear regressions of Murder and Assault. I’ll talk about Assault specifically in this section so as not to bore you with the redundancy.

I’ll talk about the results of this analysis in the order in which they are covered in the chapter. Our null hypothesis is that there is no relationship between Assault and UrbanPop. First, the p-value was significant (p = .0475), which means that it is unlikely that we are observing a relationship between these two variables by chance. Second, the F-statistic is 4.138, which is rather low (5 and over would be an indicator that we should reject the null hypothesis). Here is one video explaining the F-statistic on YouTube. Third, the Residual Standard Error (RSE) is 14.11, which means that on average the actual urban population percentage deviates about 14 percentage points from what the regression model predicts for a given assault rate. Fourth, the R² statistic is .08, which means that this model explains a very low percentage of the variance in UrbanPop. Taken together, these values suggest there is likely not a strong linear relationship between these two variables.
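The fit itself is a one-liner; all of the statistics above come out of `summary()`. A minimal sketch (note that my exact numbers may differ slightly if you skip the data-repair step mentioned earlier):

```r
data("USArrests")
# Simple linear regression: Assault rate as a predictor of UrbanPop
fit <- lm(UrbanPop ~ Assault, data = USArrests)
summary(fit)    # coefficients with p-values, RSE, R-squared, F-statistic
confint(fit)    # 95% confidence intervals for the coefficients
```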

I continued to follow the instructions in the textbook chapter and learned some really useful tools for simple linear regression that I recommend you to check out in my R_Practice Repository on GitHub.

My Process Step 4: Multiple Regression

For this step, I conducted a multiple linear regression using Rape, Murder, and Assault as predictors for urban population percentage.

What did it show?

First, the p-value was again significant for the model as a whole (.005), and for Murder (.048) and Rape (.014). Second, the coefficients for these two predictors can be read as: for each additional rape arrest (per 100,000) in a state, the model predicts an urban population percentage about .69 points higher; for each additional murder arrest (per 100,000), about 1.46 points lower. Third, because the F-statistic is ~5, we can reject the null hypothesis and conclude that there is a relationship between these variables. The R² says that the model only accounts for about 24.5 percent of the variance in the data. So what is the strongest predictor? Murder seems to be the biggest predictor, because its coefficient has the largest absolute value among the significant predictors (coefficient = -1.46).
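Extending the simple regression to multiple predictors only changes the formula. A minimal sketch (again, exact numbers may vary slightly with the repaired data):

```r
data("USArrests")
# Multiple linear regression: three crime rates predicting UrbanPop
fit_all <- lm(UrbanPop ~ Murder + Assault + Rape, data = USArrests)
summary(fit_all)   # per-predictor t-tests plus the overall F-test and R-squared
```

One caution worth noting: with predictors this correlated (murder and assault rates track each other closely), the individual coefficients and their signs should be interpreted carefully.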

Concessions: It’s about the process

I admit, it may not seem too useful to be able to predict urban population percentage of a state based on the number of violent crime arrests there are in that state. It seems more interesting to see the patterns of violent crime in states of different population density. However, this data was suitable for some good practice in R. If anybody out there has some tips, or if I have made some mistakes, please comment below.

References

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R. New York: Springer. (Ch. 3)
