Logistic versus linear regression: Why the fuss?

“Welcome to the world championships of statistics! The task in today’s final game is to estimate the probability of a football player being on a national team as opposed to not. In the blue corner, the reigning champion, logistic regression analysis. In the red corner, the challenger, linear regression analysis.” Well, you get the idea …

Whenever a dependent variable (or y-variable, or response variable) is dichotomous, aka a dummy variable, we as analysts must decide: How should we model this “choice variable”? The standard recommendation, and thus the standard approach, has for many years been logistic regression. Yet more recently many have argued for a resurrection of linear regression, especially in economics, political science, and sociology. In this blog post, by analyzing real sports data in an intuitive and non-technical manner, I contrast the logistic and the linear regression model for analyzing a dummy dependent variable. This dummy is whether a football (i.e., soccer) player plays on a national team (yes, as in coded 1) or not (no, as in coded 0). In these data, 33 percent are national team players and 67 percent are not. The independent variables (or x-variables) are the origin of the player (0 = Norwegian, 1 = Foreign), the total number of club matches in the player’s career, and the rank of the player’s current club (1 = best-performing club; 16 = worst-performing club). The data cover the players in the top-tier male Norwegian football league. These data have been analyzed elsewhere, but for the present purposes I discard the goalkeepers from the sample. That leaves 217 players for analysis.

To make matters as intuitive as possible, I start with the more familiar linear regression model. If you are rusty on this, please look at my recent primer on linear regression here. I then proceed to logistic regression and compare this with the linear approach.

The pros and cons of linear regression when the y-variable is a dummy (LPM)

The linear regression model is also known as the LPM, or Linear Probability Model, in the context of a dummy y-variable. The pros of the LPM are easy to summarize. Since we really are talking about a plain-vanilla linear regression model, the interpretation of a regression coefficient is intuitive and straightforward: the change in the probability that y = 1, in percentage points, when x increases by one unit. Consider the output in Stata-exhibit 1 for our national team probability, i.e., on a national team = 1.

Stata-exhibit 1 Probability of playing on a national team by three x-variables. Linear regression.

The results: Foreign players have a 25.1 percentage points larger probability of playing on a national team than Norwegian players (i.e., the reference), all else being equal (which I skip hereafter). Players with 110 club matches of experience have a 2.3 percentage points larger probability of playing on a national team than players with 100 club matches of experience (i.e., 0.0023 × 10 more matches). Finally, players on a team ranked as number eight have a 2.0 percentage points smaller probability of playing on a national team than players on a team ranked as number seven. That is, the poorer a team performs, the lower the probability of its players being national team players. All coefficients are statistically significant at conventional levels, but let’s not get into that.

This transparent interpretation is the LPM’s main advantage. The cons are these: (1) Probabilities are by definition bounded between 0 and 1 (or 0 and 100 percent), but the linear regression model might yield predictions below 0 and/or above 1 (or 100 percent). An ever-increasing or ever-decreasing regression line makes this obvious. (2) Poor model fit (i.e., R-squared), because the data points are not scattered around the regression line but consist only of 0s and 1s. (3) The residuals of the LPM are always heteroscedastic. Of these three problems, only the first is severe, and it is therefore the one we follow up below. The second and third are more nuisances than real problems, as I show in my primer on linear regression.
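To make the out-of-bounds problem concrete, here is a minimal sketch in Python on simulated data (NOT the blog’s actual sample; the variable names and made-up coefficients merely mirror its three x-variables). It fits an LPM by ordinary least squares and counts predictions outside [0, 1]:

```python
# LPM sketch on simulated data: an OLS fit to a 0/1 outcome can
# produce fitted probabilities below 0 or above 1.
import numpy as np

rng = np.random.default_rng(42)
n = 217
foreign = rng.integers(0, 2, n)      # 0 = Norwegian, 1 = Foreign
matches = rng.integers(0, 400, n)    # club matches in career
rank = rng.integers(1, 17, n)        # 1 = best club, 16 = worst

# Dummy y drawn from an assumed latent probability (illustrative numbers)
p_true = np.clip(0.10 + 0.25 * foreign + 0.002 * matches - 0.02 * rank, 0, 1)
y = rng.binomial(1, p_true)

# Ordinary least squares gives the LPM coefficients and fitted values
X = np.column_stack([np.ones(n), foreign, matches, rank])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta

print("predictions below 0:", (pred < 0).sum())
print("predictions above 1:", (pred > 1).sum())
```

The same bookkeeping is what Stata does behind the scenes when we ask for out-of-range predictions.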

How many predictions below 0 did our linear regression model yield? And how many above 1? Stata-exhibit 2 tells us that this number is 11 + 4 = 15 out of 217. Is this many? Hard to say …

Stata-exhibit 2 Predictions below 0 and above 1 for regression model in Stata-exhibit 1.

Figure 3 presents the linear predictions for player origin and number of club matches in career. For the predictions in the interval from 3 to 292 matches (2 players have played less than 3 matches; 6 players have played more than 292 matches), all predictions make sense, i.e., they lie above 0 and below 1.

Figure 3 Predictions based on regression model in Stata-exhibit 1. Rank of club is set at mean.

The pros and cons of logistic regression when the y-variable is a dummy

The main advantage of logistic regression is that, by mathematical necessity, it solves the problem of probabilities larger than 1 (or 100 percent) and/or smaller than 0 (or 0 percent). The reason is the sigmoid form of the logistic regression line, which never dips below 0 or rises above 1; see Figure 4.

Figure 4 The sigmoid regression line for logistic regression.
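For the formula-minded, the sigmoid in Figure 4 is p = 1 / (1 + e^(−z)). A quick numerical check (a generic sketch, not tied to the blog’s data) shows that it never leaves the open interval (0, 1):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Even fairly extreme inputs stay strictly inside the probability bounds
z = np.array([-30.0, -5.0, 0.0, 5.0, 30.0])
print(sigmoid(z))
```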

This obvious advantage comes at a price, however. Unlike their linear counterparts, logistic regression coefficients have no intuitive metric. We must therefore “back-transform” the logistic regression coefficients into something resembling linear regression coefficients: marginal effects. The logistic coefficients and these marginal effects appear in Stata-exhibit 3.

Stata-exhibit 3 Probability of playing on a national team by three x-variables. Logistic regression and marginal effects.

The results are straightforward to summarize: Foreign players have a 25.6 percentage points larger probability of playing on a national team than Norwegian players (i.e., the reference). Players with 110 club matches of experience have a 2.2 percentage points larger probability of playing on a national team than players with 100 club matches of experience (i.e., 0.0022 × 10 more matches). Finally, players on a team ranked as number eight have a 2.0 percentage points smaller probability of playing on a national team than players on a team ranked as number seven. In short: The logistic regression model produces results that for all intents and purposes are identical to those of the linear regression model.
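For the curious, the “back-transformation” can be sketched in a few lines of Python. This is simulated data with assumed coefficients, not the blog’s sample: fit a logistic regression by Newton-Raphson, then compute average marginal effects as mean(p × (1 − p)) × beta. (For a dummy x, the exact marginal effect is a discrete change in predicted probability; the formula below is the common approximation.)

```python
# Logistic regression by Newton-Raphson, followed by average
# marginal effects (AME) -- simulated, illustrative data only.
import numpy as np

rng = np.random.default_rng(0)
n = 217
foreign = rng.integers(0, 2, n)
matches = rng.integers(0, 400, n)
rank = rng.integers(1, 17, n)
X = np.column_stack([np.ones(n), foreign, matches, rank])

# Assumed true logit coefficients, for illustration only
logit_true = -1.5 + 1.2 * foreign + 0.01 * matches - 0.1 * rank
y = rng.binomial(1, 1 / (1 + np.exp(-logit_true)))

beta = np.zeros(X.shape[1])
for _ in range(25):                       # Newton-Raphson iterations
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)                  # score vector
    hess = X.T @ (X * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hess, grad)

p = 1 / (1 + np.exp(-X @ beta))
ame = (p * (1 - p)).mean() * beta[1:]     # one average marginal effect per x
print(dict(zip(["foreign", "matches", "rank"], ame.round(4))))
```

This is essentially what Stata’s margins machinery computes after a logistic fit, averaged over the sample.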

Figure 6 presents, analogous to Figure 3 for the linear model, the logistic predictions for player origin and number of club matches in career. Again, the results are qualitatively similar, and the reason for the two curved (or non-linear) lines is to be found in the sigmoid shape of the logistic regression line.

Figure 6 Predictions based on logistic regression model in Stata-exhibit 3. Rank of club is set at mean.

Takeaways and implications

The main message of this post is that analyzing a dummy dependent variable with linear regression (aka the LPM) or with logistic regression followed by marginal effects yields the same qualitative results, even though the linear regression model produces some nonsensical predictions. But this of course raises a new question: Are my results of a general nature? Yes, it appears so. Or rather: when we study dependent dummy y-s whose distribution ranges between roughly 70:30 and 30:70, the LPM and logistic regression most often yield similar results. And since this typically is the case in the social sciences (but not so often in the medical sciences), it explains the reemergence of the LPM in the social sciences in recent years: The plain-vanilla multiple regression model gives us what we need without further hassle. It is therefore prudent to finish by re-asking the title question: Why the fuss?
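The 70:30-to-30:70 claim can be probed with a small simulation (one x-variable, all numbers assumptions of mine, not results from the blog’s data): for a roughly 33/67 dummy, the LPM slope and the logistic average marginal effect land close to each other.

```python
# Compare the LPM slope with the logistic average marginal effect
# on one simulated 33/67 dummy outcome.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = rng.binomial(1, np.clip(0.33 + 0.10 * x, 0.01, 0.99))
X = np.column_stack([np.ones(n), x])

# LPM: the OLS slope is directly a probability change per unit of x
b_lpm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Logistic regression by Newton-Raphson, then the average marginal effect
b = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ b))
    b += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (y - p))
p = 1 / (1 + np.exp(-X @ b))
ame = (p * (1 - p)).mean() * b[1]

print(f"LPM slope: {b_lpm[1]:.3f}   logistic AME: {ame:.3f}")
```

With more skewed dummies (say, 95:5), the two numbers drift apart, which is exactly why the medical sciences have been slower to re-embrace the LPM.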

Then we’re done for today. Please find my other recent blog posts on data analysis and data visualization in the field of sports here and here.

About me

I’m Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you find them here and here. I am on ResearchGate here, and you also reach me at christer.thrane@inn.no
