Think of It as a Learning Experience

Earl Radina
Published in Human Systems Data
Mar 28, 2017

The following post is the result of several hours of red error text:

After successfully performing multiple regression on the mtcars data set, I decided to attempt the same thing on another data set. Initially, I chose the personality data shared by Professor Chiou (available at https://github.com/rudeboybert/JSE_OkCupid). However, I ran into several problems that stemmed from one major issue: the example instructions work well for quantitative data, but begin to fall apart in the presence of qualitative data. After several attempts, my troubleshooting eventually led me to http://stats.idre.ucla.edu/r/modules/coding-for-categorical-variables-in-regression-models/. That post describes creating a factor variable to code for categorical variables and is worth reading if you are interested. For me, however, transferring this method to my desired data proved impossible.
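For reference, the factor approach the UCLA post describes can be sketched as follows. This is a minimal illustration with made-up data, not the actual OkCupid variables; the column names here are purely hypothetical.

```r
# Hypothetical data frame standing in for a data set with a
# qualitative (categorical) predictor.
people <- data.frame(
  height = c(70, 65, 72, 68, 64, 71),
  sex    = c("m", "f", "m", "f", "f", "m")
)

# Convert the qualitative variable to a factor so that lm() builds
# dummy (indicator) variables for it automatically.
people$sex <- factor(people$sex)

model <- lm(height ~ sex, data = people)

# The coefficient named "sexm" compares level "m" against the
# baseline level "f" (levels are ordered alphabetically by default).
summary(model)
```

The key point is that lm() handles the dummy coding itself once the variable is a factor; passing raw character or arbitrarily numbered categories is where things fall apart.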

As such, I instead performed a simple analysis of carbon monoxide as related to traffic, time, and wind, using the data found at http://www.statsci.org/data/general/cofreewy.html. This data set records carbon monoxide levels beside a busy Los Angeles freeway. The "Hour" variable is an aggregation over several days, with "1" being 1:00 am; "Traffic" is traffic density (cars/speed). My code was as follows:

input <- Freeway.Data[, c("CO", "Hour", "Traffic", "Wind")]
model <- lm(CO ~ Hour + Traffic + Wind, data = input)
print(model)

Call:
lm(formula = CO ~ Hour + Traffic + Wind, data = input)

Coefficients:
(Intercept)         Hour      Traffic         Wind
   1.318967    -0.005689     0.018402     0.179189
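One note on the output above: print(model) reports only the coefficients. To get the standard errors, t statistics and p-values that the teaching output further down includes, use summary(). A minimal sketch, using simulated data in place of the real Freeway.Data (which is read from the statsci.org page linked above):

```r
# Simulated stand-in for the freeway data: 24 hourly observations
# with CO driven by traffic and wind plus noise.
set.seed(1)
sim <- data.frame(
  Hour    = 1:24,
  Traffic = runif(24, 20, 100),
  Wind    = runif(24, 0, 10)
)
sim$CO <- 1.3 + 0.018 * sim$Traffic + 0.18 * sim$Wind + rnorm(24, sd = 0.1)

model <- lm(CO ~ Hour + Traffic + Wind, data = sim)

# summary() adds standard errors, t values and Pr(>|t|) columns,
# which print(model) omits.
summary(model)
```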

Based on this, one could surmise that traffic has a pretty strong effect on carbon monoxide levels, which is what one would expect given that it is the vehicles themselves producing the CO. All would seem well, were it not for the fact that this data is intended for teaching purposes and its creators included their own fit:

Call:
lm(formula = CO ~ Traffic + Wind + I(Wind^2) + sin((2 * pi)/24 * Hour) +
    cos((2 * pi)/24 * Hour) + sin((4 * pi)/24 * Hour) + cos((4 * pi)/24 * Hour))

Residuals:
     Min       1Q   Median       3Q      Max
 -0.3477  -0.1078   0.0059   0.1407   0.3994

Coefficients:
                           Value  Std. Error  t value  Pr(>|t|)
(Intercept)               1.7523      0.7793   2.2484    0.0390
Traffic                   0.0179      0.0013  13.7864    0.0000
Wind                      0.3178      0.3447   0.9220    0.3702
I(Wind^2)                -0.0774      0.0268  -2.8851    0.0108
sin((2 * pi)/24 * Hour)  -0.5437      0.9324  -0.5831    0.5680
cos((2 * pi)/24 * Hour)  -0.7300      0.6574  -1.1106    0.2832
sin((4 * pi)/24 * Hour)   0.3664      0.4331   0.8460    0.4100
cos((4 * pi)/24 * Hour)   0.3110      0.1424   2.1838    0.0442

So what’s the deal? Reading the associated post tells us that the sine and cosine terms are there to fit the daily (24-hour) cycle in the Hour variable, while the squared Wind term lets the wind effect curve rather than forcing it into a straight line. For data that undulates as rapidly and unpredictably as wind, allowing that curvature is a good idea. In R, this is done by adding the term

I(Wind^2)

to your formula. The I() wrapper matters: a bare ^2 inside a formula is interpreted as formula syntax, not as squaring the variable.
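Putting the pieces together, the published model can be reconstructed in R roughly as below. This is my sketch, not the original authors’ code, and it again uses simulated data standing in for the real cofreewy data set:

```r
# Simulated stand-in for the freeway data: three days of hourly
# observations with a quadratic wind effect and a daily cycle.
set.seed(2)
sim <- data.frame(
  Hour    = rep(1:24, 3),
  Traffic = runif(72, 20, 100),
  Wind    = runif(72, 0, 10)
)
sim$CO <- 1.75 + 0.018 * sim$Traffic + 0.3 * sim$Wind - 0.08 * sim$Wind^2 +
  0.3 * cos((4 * pi / 24) * sim$Hour) + rnorm(72, sd = 0.15)

# I() is required so that ^2 means "square this variable" rather than
# being swallowed by the formula mini-language. The sin/cos pairs are
# first and second harmonics of the 24-hour cycle.
full <- lm(CO ~ Traffic + Wind + I(Wind^2) +
             sin((2 * pi / 24) * Hour) + cos((2 * pi / 24) * Hour) +
             sin((4 * pi / 24) * Hour) + cos((4 * pi / 24) * Hour),
           data = sim)
summary(full)
```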

So what are the lessons to be learned here? The biggest is that R will not do the work for you; it is just a tool a researcher can use. It is up to the researcher to keep in mind what kinds of variables they are dealing with. To reference the earlier examples: are your variables qualitative or quantitative? Each requires a different approach, with qualitative data demanding a bit more work (coding it as factors). The other lesson is to look at how your variables behave. If your variables are environmental in nature and likely carry periodic or nonlinear structure, extra modeling work (sin, cos and ^2 terms) will be needed to make the relationships over time visible.
