The normality assumption in linear regression analysis — and why you can most often dispense with it

The normality assumption in linear regression analysis is a strange one indeed. First, it is often misunderstood: many people seem to think that the y-variable (or outcome variable) itself should be normally distributed. Nothing could be further from the truth. Look at Figure 1, a histogram of the variable total number of matches played in the career for a sample of 217 players in the top-tier Norwegian soccer league. (The data stem from a few years back.) Obviously, this variable is far from normally distributed. Does this mean it cannot be the y-variable in a linear regression? Surely not, as we will see shortly.

Figure 1. Histogram of total number of matches played in the career for 217 players in the top-tier Norwegian soccer league.

What, then, does the normality assumption mean in a regression context? The key word here is the so-called residual: the vertical distance between a data point and the regression line. To make matters concrete, look at Figure 2, which shows the total number of matches variable regressed on player age. Unsurprisingly, the figure tells us that older players, on average, have more match experience than younger players. (The regression coefficient or slope is 12.7.)

The residual is a variable; the vertical distance from the 217 data points to the regression line varies from player to player. Some players are located below the regression line, and some are located above. The normality assumption says that the sample distribution of this residual should be normal.
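To make the residual idea concrete in code, here is a small Python sketch (not the Stata I actually use). The real soccer dataset is not available here, so the data below are simulated to mimic it; the slope of 13 and the skewed noise are my own made-up assumptions for illustration:

```python
# Sketch: fitting a line and computing the residuals (the vertical
# distances from the data points to the regression line).
# Simulated stand-in data, NOT the real soccer dataset.
import numpy as np

rng = np.random.default_rng(42)
n = 217
age = rng.uniform(18, 35, n)                    # hypothetical player ages
errors = rng.exponential(scale=60, size=n) - 60  # right-skewed, mean-zero noise
matches = -150 + 13 * age + errors               # assumed data-generating process

# Fit matches = a + b * age by ordinary least squares
b, a = np.polyfit(age, matches, deg=1)           # polyfit returns [slope, intercept]
fitted = a + b * age
residuals = matches - fitted                     # the vertical distances

print(f"estimated slope: {b:.1f}")
print(f"residuals sum to (about) zero: {residuals.sum():.6f}")
```

Note that OLS residuals from a model with an intercept always sum to essentially zero; it is their *distribution*, not their sum, that the normality assumption is about.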

Figure 2. Scatterplot of total number of matches played in the career and age for 217 players in the top-tier Norwegian soccer league, with linear regression line.

Well, then, is our residual normally distributed? As Figure 3 shows: almost, but not completely (the blue line is our residual). Also, a formal test rejects the null hypothesis that the residuals are normally distributed (p < 0.05). In other words, our regression model violates the normality assumption in the formal sense.
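For readers who want to try such a test themselves, here is a hedged Python sketch of a Jarque-Bera-style normality check, which compares the residuals' skewness and kurtosis to a normal distribution's 0 and 3. Again, the data are simulated stand-ins, not the real residuals behind Figure 3:

```python
# Sketch: a Jarque-Bera-type normality test on regression residuals.
# Simulated data with deliberately skewed (exponential) errors.
import numpy as np

rng = np.random.default_rng(0)
n = 217
age = rng.uniform(18, 35, n)
matches = -150 + 13 * age + (rng.exponential(60, n) - 60)  # skewed errors

b, a = np.polyfit(age, matches, 1)
res = matches - (a + b * age)

z = (res - res.mean()) / res.std()
skew = (z ** 3).mean()                 # 0 for a normal distribution
kurt = (z ** 4).mean()                 # 3 for a normal distribution
jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

if jb > 5.99:                          # 5% critical value, chi-squared with 2 df
    print(f"JB = {jb:.1f}: reject normality at the 5% level")
else:
    print(f"JB = {jb:.1f}: no evidence against normality")
```

(In Stata, a Shapiro-Wilk test via swilk on the stored residuals serves the same purpose.)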

Figure 3. Kernel plot of the residual based on the regression reported in Figure 2.

What to do? Now we have arrived at the second strange thing about the normality assumption: we do not have to do anything! That is, nothing needs fixing as long as we are analyzing a so-called "large" sample, which typically means 120 observations or more. In other words, whenever you have more than 120 observations in your data, you can dispense with the normality assumption altogether. The reason is the Central Limit Theorem: in large samples, the sampling distribution of the regression coefficients approaches normality regardless of how the residuals are distributed. But that is a lesson for another day.
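To give a taste of the Central Limit Theorem at work, the following Python sketch draws many simulated samples of 217 observations with clearly non-normal (exponential) residuals and looks at the resulting slope estimates. All the numbers in the data-generating process are invented for illustration:

```python
# Sketch: even with skewed residuals, the sampling distribution of the
# OLS slope is close to normal when the sample is large (CLT in action).
import numpy as np

rng = np.random.default_rng(1)
n, reps = 217, 2000
slopes = np.empty(reps)

for i in range(reps):
    age = rng.uniform(18, 35, n)
    matches = -150 + 13 * age + (rng.exponential(60, n) - 60)  # skewed errors
    slopes[i] = np.polyfit(age, matches, 1)[0]   # store the slope estimate

z = (slopes - slopes.mean()) / slopes.std()
skew = (z ** 3).mean()                           # near 0 if roughly normal
print(f"mean slope = {slopes.mean():.2f}, skewness of slopes = {skew:.2f}")
```

The slope estimates center on the true value (13 by construction) and their distribution is nearly symmetric, even though every single sample's residuals are far from normal.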

By the way, I do all my statistical computing in Stata. But I hope to learn R someday … I also hope to be able to upload datasets on GitHub … And to provide fancy graphs …

Hope you enjoyed or learned something from my first blog post ever!

Best regards,

Christer

I am Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you find them here and here.

christer.thrane@inn.no
