Understanding statistics through sports: Linear regression

Having traced the historical roots of regression analysis (see my blog post here), it’s time to look at its modern-day equivalent. Yet contrary to many such explications, I will here as much as possible skip general phenomena and abstract equations. That is, I start with a sports (research) question: How is the age of football players related to the total number of matches they have played in their career? Having established this, I go on to explain the regression coefficient (i.e., slope), the principle of least squares, and the R-squared before I make the important (but often overlooked) distinction between the regression assumptions that matter and those that, quite frankly, are pretty much trivial. I close by showing regression for the case in which the x-variable is dichotomous (i.e., a dummy), and by extending the bivariate regression model into multiple regression. I shed light on these topics using a data set containing the players in the top-tier male Norwegian football (i.e., soccer) league. These data have been analyzed more formally elsewhere, but for the present purposes I discard the goalkeepers from the sample. That leaves 217 players for analysis.

Players’ age and players’ total number of matches played: the scatterplot

The question: How is the age of football players related to the total number of matches they have played in their career? The scatterplot of these variables appears in Figure 1, where I have also indicated the average values (i.e., the dashed lines). The extreme values for the variables are 18 and 41 (years of age) and 2 and 451 (matches played).

Figure 1 Scatterplot of number of matches played and players’ age.

We note that the scatterplot resembles the one Galton saw when he plotted sons’ and daughters’ heights against their parents’ heights: a positive association. In our case, this means that younger players have typically played few matches in total in their careers, placing them in the lower left quadrant, and that older players have typically played many such matches, placing them in the upper right quadrant. This, of course, makes intuitive sense.

The regression line, the regression coefficient (slope), and the magic of least squares

Linear regression is about finding the general trend in the association depicted in Figure 1. In this regard, we ask the statistics program to find the straight line that “best” (more on this below) summarizes the relationship between (in this case) age and number of matches played. This line is shown in Figure 2.

Figure 2 Scatterplot of number of matches played and players’ age, with linear regression line.

We note the upward-sloping line (from left to right), which not surprisingly bears the name regression line. The key in regression analysis is to calculate the steepness of this line, which we call the regression coefficient or slope. In our case, this regression coefficient (slope) is 12.70, or roughly 13. What does this mean? It means that if we move one unit to the right on the x-axis, say from 24 to 25 years of age, the regression line predicts an increase of 13 in the number of matches played. Or to be strictly precise: a 25-year-old player has on average played 13 more matches in his career than a 24-year-old player.

If we replace age with x1 and number of matches with y, we get the general interpretation for the regression coefficient (slope): The change in y when x1 increases by one unit. Technically, the formula for any straight regression line is given by (and this is my only abstract equation, I promise!):

y = b0 + b1 × x1.

Here, y (number of matches) and x1 (age) are the variables, whereas b1 is the regression coefficient (slope) and b0 is the intercept (constant), i.e., the point where the regression line crosses the y-axis. In our case, this regression equation becomes:

Number of matches = -246 + 13 × age.

That is, 13 is the regression coefficient (slope), and -246 is the intercept, i.e., where the line crosses the y-axis.
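To see what these two numbers do, plug in a couple of ages (using the rounded coefficients, so the results are approximate): a 24-year-old is predicted to have played -246 + 13 × 24 = 66 matches, and a 25-year-old -246 + 13 × 25 = 79 matches. The difference between the two predictions is exactly the slope of 13.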

How does regression find the exact location of the regression line, as in getting to the numbers for the regression coefficient and the intercept? The answer is the principle of least squares. Here is how it works: Choose any data point (i.e., dot) in Figure 2. Now, measure “mentally” the vertical distance (upwards or downwards) from this point to the regression line. Then multiply this distance by itself, as in squaring, and you get an always-positive number. Repeat this procedure 216 more times, i.e., once for each remaining data point. Finally, add the 217 positive numbers together, and you have found the sum of squares for this regression line. Punchline? You can draw an infinite number of other lines through the plot in Figure 2, repeat the procedure just described, and none of the sums of squares you get will be smaller than the one you got for the regression line you started with! In this sense, the least squares principle minimizes the sum of squares and provides the “best” estimate of the relationship between age (x1) and number of matches played (y) — often called “best fit.” From a technical point of view, getting to the numbers for the regression coefficient of 13 (b1) and the intercept of -246 (b0) is done by matrix algebra or something way beyond what I can comprehend …
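If you want to convince yourself of this, the logic is easy to mimic in a few lines of code. The sketch below uses simulated data as a stand-in for the real player file, so the numbers are only illustrative; the point is that the line fitted by least squares always has a smaller sum of squared vertical distances than any other line you try.

```python
# A minimal sketch of the least squares idea, using simulated data as a
# stand-in for the real player file (the ages and matches are made up).
import numpy as np

rng = np.random.default_rng(42)
age = rng.integers(18, 42, size=217)                        # hypothetical ages
matches = -246 + 12.7 * age + rng.normal(0, 60, size=217)   # noisy "matches"

def sum_of_squares(b0, b1):
    """Sum of squared vertical distances from the data points to the line b0 + b1*x."""
    residuals = matches - (b0 + b1 * age)
    return np.sum(residuals ** 2)

# The least squares estimates (here via polyfit) minimize that sum ...
b1_hat, b0_hat = np.polyfit(age, matches, deg=1)
print(sum_of_squares(b0_hat, b1_hat))        # smallest possible sum of squares

# ... so any other line you draw gives a larger sum:
print(sum_of_squares(b0_hat, b1_hat + 2))    # steeper line: larger
print(sum_of_squares(b0_hat + 50, b1_hat))   # shifted line: larger
```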

The output from the statistics program and the R-squared

The regression outputs from different statistics programs look much alike. Figure 3 shows this output from Stata (the key terms are in red). The R-squared is 0.52. This suggests that age explains about 52 percent of the variation (strictly speaking, variance) in the number of matches variable. The R-squared is often reported in regressions, but scholars and research traditions differ in terms of how much weight we should put on this measure. But that’s something for another blog post.

Figure 3 Stata-output for the regression analysis yielding Figure 2.
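For readers who prefer code to Stata output, here is a small sketch of where that R-squared comes from. It assumes a hypothetical file players.csv with the columns age and matches (the file name and column names are mine, not part of the original data).

```python
# A sketch of the R-squared computation, assuming a hypothetical players.csv
# with the columns 'age' and 'matches'.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("players.csv")                  # hypothetical file
fit = smf.ols("matches ~ age", data=df).fit()

ss_res = (fit.resid ** 2).sum()                  # residual sum of squares
ss_tot = ((df["matches"] - df["matches"].mean()) ** 2).sum()
print(1 - ss_res / ss_tot)                       # R-squared "by hand"
print(fit.rsquared)                              # same number from statsmodels
```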

The regression assumptions: introduction, generalities, and partial misunderstandings

How do we know whether to trust that the regression coefficient (or slope or b1) of 13 matches is correct? This question puts us smack in the middle of the realm of the regression assumptions — a topic full of partial misunderstandings. First, there are two kinds of assumptions: (1) those addressing how to get an unbiased estimate of b1, and (2) those having to do with getting the correct p-value when generalizing the sample-b1 to its unknown population counterpart, B1. That is, (2) concerns only the “statistical significance” of b1, whereas (1) concerns its size. In this regard, I hold that (1) is fundamentally more important than (2) in the bulk of applications. It is thus a paradox that (2) gets the lion’s share of attention in most explications of regression analysis, but let’s not get into that. We’ll start with the most important assumptions: the ones about getting an unbiased estimate of b1.

The very important regression assumptions

1. The regression model includes all relevant x-variables. Think of our age and number of matches regression. The assumption says that our b1 of 13 is unbiased if and only if there are no other relevant x-variables “out there.” Here, relevant means (a) correlated with age and (b) affecting number of matches. In practice, however, we can usually think of several such x-variables, and that is generally why we bring other x-variables into the model, as in doing multiple regression (see below). Multiple regression does not solve the problem entirely, however, because there might always be other relevant x-variables “out there” which we do not have access to in our data. The long and short of it: Most regressions do not include all relevant x-variables. This assumption is thus most often not met, and we cannot be sure that b1 is unbiased.

2. Linearity. The assumption says there should be a roughly linear relationship between x1 and y for b1 to be unbiased. This is straightforward to check, and most often it entails replacing the straight regression line with a line that “lets the data speak for themselves” in terms of how x1 and y are related. That is, we use a lowess: a locally weighted scatterplot smoother. This is shown in Figure 4, and a code sketch of the check follows after the figure. We conclude that our relationship is roughly linear. The linearity assumption is thus met. (We can also do a statistical test to check linearity, but let’s not get into that now.)

Figure 4 Scatterplot of number of matches played and players’ age, with locally weighted scatterplot smoother.
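Here is a minimal sketch of that check, again assuming the hypothetical players.csv from before; it overlays the lowess curve on the scatterplot so you can eyeball how far it departs from a straight line.

```python
# A sketch of the lowess check for linearity, assuming the hypothetical
# players.csv with the columns 'age' and 'matches'.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

df = pd.read_csv("players.csv")
smoothed = lowess(df["matches"], df["age"])      # y (endog) first, x (exog) second

plt.scatter(df["age"], df["matches"], alpha=0.5)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")   # the lowess curve
plt.xlabel("Age")
plt.ylabel("Matches played")
plt.show()
```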

3. No influential outliers. This means that if we remove outlying observations — that is, in our case, players who are much younger/older than the “bulk” of players and/or players who have played many more or far fewer matches than the “bulk” of players — the b1 of 13 matches should not change much in magnitude. There are statistical ways to detect such outliers, but graphical inspection of Figure 4 goes a long way at this introductory level. For the sake of the argument, say we remove players older than 35 years of age (n = 5) and players with more than 300 matches in their careers (n = 2). What is the b1 for this slightly reduced data set (n = 210)? The answer is 10. We might have some influential outliers; this assumption might not be met. Generally, there are no bullet-proof procedures for handling influential outliers.
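This informal check is easy to script. The sketch below, again assuming the hypothetical players.csv, simply re-fits the regression on the trimmed sample and compares the two slopes.

```python
# A sketch of the informal outlier check: drop the oldest players and the
# players with the most matches, then compare the slopes. Assumes the
# hypothetical players.csv with the columns 'age' and 'matches'.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("players.csv")
full = smf.ols("matches ~ age", data=df).fit()

trimmed = df[(df["age"] <= 35) & (df["matches"] <= 300)]
reduced = smf.ols("matches ~ age", data=trimmed).fit()

print(full.params["age"])      # about 13 in the blog's data
print(reduced.params["age"])   # about 10 in the blog's data
```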

I consider the last two very important assumptions, (4) no perfect multicollinearity and (5) additive effects, when we get to multiple regression at the tail end of the blog post.

The not-so-important regression assumptions

The assumptions covered above concern the (un)biasedness of b1. And, frankly, they are the ones we should care most about. The remaining assumptions do not concern the size of b1; they concern whether our tests of “statistical significance” are correct or not. But first we must make a giant leap in terms of a thought experiment. From here on, we assume that our data might be thought of as a random sample from some “super-population” of football players (which is a stretch, I admit). In other words, we are analyzing a representative sample from a large population, and we want to find out whether it is ok to generalize our sample-b1 of 13 to this population. For this, as you probably know, we use a hypothesis test yielding a so-called p-value. How we arrive at this p-value, and how to interpret it correctly, is a large subject we can’t go into, but the crux of the matter is that we always need to calculate something called the standard error (SE) of b1 to get its p-value. And if this SE somehow gets calculated wrongly, the p-value will also be wrong. We then risk calling something (in this case a regression coefficient) statistically significant when it is not — or the other way around. The remaining regression assumptions thus concern how to get our SEs correct.

Technically, these not-so-important assumptions have to do with the residual. I did not mention it when I explained the principle of least squares above, but the residual is another name for the vertical “distance” between the data point and the regression line. So, the remaining assumptions are about the residuals.

6. No heteroscedasticity for the residuals. In plain words: The spread around the regression line should be constant for the various levels of x — or age in our case. We use a plot and/or a statistical test to shed light on this. The plot appears in Figure 5. The dashed line represents the regression line (the b1 of 13), and we clearly see that the spread around it increases for larger values of age (the x-axis). We have heteroscedasticity; the assumption is violated.

Figure 5 Residuals versus predicted values-plot for the regression reported in Figure 2.

Twenty or so years ago, a violation of the homoscedasticity assumption was problematic. Today, it is not. We just tell our statistics program to compute so-called robust SEs to get correct p-values, as in Figure 6. The robust SE appears in red, as does the correct p-value 0.000 (meaning smaller than 0.0001). We can thus forget about this assumption if we ask for robust standard errors.

Figure 6 Stata-output for the regression analysis yielding Figure 2, with robust standard errors (SEs).
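In code, asking for robust SEs is a one-liner. The sketch below, assuming the hypothetical players.csv once more, uses the HC1 correction, which mirrors what Stata’s robust option does; note that the slope itself is untouched, only the SE (and hence the p-value) changes.

```python
# A sketch of heteroscedasticity-robust standard errors. Assumes the
# hypothetical players.csv with the columns 'age' and 'matches'.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("players.csv")
ordinary = smf.ols("matches ~ age", data=df).fit()
robust = smf.ols("matches ~ age", data=df).fit(cov_type="HC1")

print(ordinary.bse["age"], robust.bse["age"])         # the SEs differ ...
print(ordinary.params["age"], robust.params["age"])   # ... the slope does not
```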

7. Normal distribution for the residuals. The residuals, since they might be thought of as a variable, should be normally distributed. I wrote my first “test” blog post on this (you find it here), and the takeaway is that a normal distribution is only necessary for very small samples — say 50 or fewer observations. So, this assumption is redundant in most applications. That said, there are robust procedures to explore, but that would take us too far astray in this blog post.

8. Uncorrelated residuals. In our context: The residual of one player should be independent of the residual of another player. This assumption has to do with how our data came about, and it is not something we normally can test “within” the data. In our case, I don’t think we violate this assumption, but it could happen that, for example, older (or younger) players “cluster” in certain football clubs. If that is the case, we have correlated residuals. The solution is to come up with, yes (!), “cluster-robust” standard errors, which are straightforward to compute in sophisticated statistics programs such as Stata or R.
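As a sketch of what that looks like in code, suppose the hypothetical players.csv also had a club column identifying each player’s club (a column I am inventing purely for illustration):

```python
# A sketch of cluster-robust standard errors, assuming a hypothetical 'club'
# column in players.csv that identifies each player's football club.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("players.csv")
clustered = smf.ols("matches ~ age", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["club"]}
)
print(clustered.bse["age"])   # SE that allows residuals to correlate within clubs
```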

When the x-variable is a dummy

Using a continuous x-variable in a regression is pedagogically convenient since it produces an intuitive scatterplot, as in Figure 2. But since we can reduce regression to an equation (yes, sorry about that), all types of x-variables will do. We can for example look at the relationship between the total number of matches (y) and the origin of the players (x1; coded 0 = Norwegian and 1 = foreign). For this we get:

Number of matches = 94 - 22 × origin,

suggesting that foreign players on average have played 22 fewer matches than Norwegian players (notice the minus sign). The Norwegian players, as indicated by the intercept, have played 94 matches on average. (The only things changing from the statistics output in Figure 3 are the names and values of the coefficients b1 and b0.)
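A sketch of this dummy-variable regression, assuming players.csv also contains the 0/1 origin column described above:

```python
# A sketch of the dummy-variable regression, assuming a hypothetical 'origin'
# column in players.csv coded 0 = Norwegian and 1 = foreign.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("players.csv")
fit = smf.ols("matches ~ origin", data=df).fit()

print(fit.params["Intercept"])   # average matches for Norwegian players (about 94)
print(fit.params["origin"])      # difference for foreign players (about -22)
```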

From bivariate to multiple regression

We do multiple regression rather than bivariate regression in most practical applications (for many reasons we cannot get into here). This is more complicated from a statistical point of view, but from a practical vantage point it’s just about adding more x-variables to the regression model: x2 gets b2, x3 gets b3, and so on. If we make a multiple regression out of our two examples above, we get:

Number of matches = -245 + 13 × age - 34 × origin.

The regression coefficient for age is still 13, but the coefficient for origin has increased in the sense that it has become more negative: from -22 to -34. Its interpretation has also changed, as it does in all multiple regressions: Foreign players have on average played 34 fewer matches than Norwegian players when we, to simplify a bit, compare players of the same age. We are controlling for (or adjusting for) age, as the saying goes.
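In code, moving from the bivariate to the multiple regression is just a longer formula; the sketch below assumes the same hypothetical players.csv with age and origin columns.

```python
# A sketch of the multiple regression: add more x-variables on the right-hand
# side of the formula. Assumes the hypothetical players.csv used above.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("players.csv")
fit = smf.ols("matches ~ age + origin", data=df).fit()
print(fit.params)   # intercept, the slope for age, and the adjusted origin effect
```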

The additional very important assumptions for multiple regression

As said above, there are two more very important regression assumptions in the multiple regression case.

4. No perfect multicollinearity. From a practical standpoint (which, as you know by now, I happen to be keen on), this assumption says that the correlation between the x-variables should not be “too high.” If it is, the b1 of the collinear x1 becomes unstable — or at the very least “untrustworthy.” So, the task is to assess the level of collinearity — typically measured by the VIFs (Variance Inflation Factors). Most textbooks say that the mean VIF score should be less than 10, but I think they just “repeat” this number without checking it. In any event, I follow Allison’s more prudent advice (see here) and get anxious when mean VIFs are larger than 2.5. What to do about multicollinearity? There is no magic bullet. Doing nothing is good advice when all regression coefficients make sense. Adding more observations, if possible, is another good piece of advice. A third option is to remove the problematic x-variable, but that might put you in conflict with the first regression assumption mentioned earlier: The regression model includes all relevant x-variables.
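A sketch of the VIF check, assuming the hypothetical players.csv with the age and origin columns (with only two x-variables the VIFs will usually be low, but the mechanics are the same for bigger models):

```python
# A sketch of the collinearity check with variance inflation factors (VIFs),
# assuming the hypothetical players.csv with 'age' and 'origin' columns.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("players.csv")
X = sm.add_constant(df[["age", "origin"]])   # design matrix with an intercept

for i, name in enumerate(X.columns):
    if name != "const":                      # skip the intercept column
        print(name, variance_inflation_factor(X.values, i))
```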

5. Only additive effects of x1 on y. In words: There should be no interaction effects between the x-variables. If this is the case, b1 is unbiased. In our example this boils down to the following: The “age-effect” on number of matches should be similar for both Norwegian and foreign players. In practice, this means testing whether an interaction regression model fits the data better than the plain-vanilla multiple regression model. If it does, we must abandon the plain-vanilla version and report two regression coefficients for age: one for the Norwegians and one for the foreigners. But interaction effects are something for another blog post. BTW, the age-effect is similar for the two groups in our data. That is, the additive effects assumption is not violated.
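For completeness, here is a sketch of that comparison, again assuming the hypothetical players.csv; the F-test compares the interaction model with the additive one, and the interaction term’s p-value tells the same story.

```python
# A sketch of the interaction check: does a model where age has different
# slopes for Norwegian and foreign players fit better than the additive model?
# Assumes the hypothetical players.csv with 'age', 'origin', and 'matches'.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("players.csv")
additive = smf.ols("matches ~ age + origin", data=df).fit()
interact = smf.ols("matches ~ age * origin", data=df).fit()

print(interact.pvalues["age:origin"])      # p-value of the interaction term
print(interact.compare_f_test(additive))   # (F, p, df): does the interaction help?
```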

Then we’re more than done for today, I think! Sorry that this post ended up on the longish side. I plan to make the next one shorter. Please find my recent blog posts on data analysis and data visualization in the field of sports here and here.

About me

I’m Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you find them here and here. I am on ResearchGate here, and you also reach me at christer.thrane@inn.no
