Tips for teaching logistic regression more effectively …

I recently posted two “tips” on how to teach regression analysis more successfully to those not well versed in math (tips that also help the mathematically inclined), which you can find here and here. This post has a similar ambition for logistic regression. As before, my pieces of advice are based on 30 years of teaching regression analysis and my textbooks on the subject (see here and here). I will use fresh data on football (i.e., soccer) players for illustration, which I have also used for regression purposes in previous posts (see here and here). I cover only the bivariate logistic regression model in this post. If I get 50 reads and 25 claps on it, I will write a follow-up for the multiple logistic regression case; I promise 😊.

Preamble 1: the frequency table

Many instructors begin their introductions to logistic regression by introducing the sigmoid (i.e., almost S-like) regression line connecting x and y, accompanied by the equation for this line. Don’t! It scares the living sh… out of those not well versed in math. Instead, start with the frequency table describing the distribution for the binary y-variable. Our y-variable in this post is whether a football player was picked for the team of the round (at least once) during the 2022 season (coded 1) or not at all (coded 0). The frequency distribution for this variable is shown in Stata-exhibit 1. (Given the ambition and the potential audience for this post, I see no reason to present fancy tables or graphs.)

Stata-exhibit 1

46 percent of the players were picked for the team of the round, which obviously leaves 54 percent for the non-picked group. Having introduced y’s frequency distribution, it’s time to ask how x affects this y. In my experience, it pays to let the first x-variable you consider be binary. (Then you can postpone all things about the sigmoid regression function.) My chosen x-variable at present is whether a player has played any matches for his national team (coded 1) or not (coded 0). The percentages for this variable, not shown, are 19 (> 0 matches) and 81 (0 matches).
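For readers who prefer to check such percentages by hand, here is a minimal Python sketch. It assumes the counts quoted in the odds calculations later in this post (143 picked, 167 not picked; 34 + 24 players with and 109 + 143 without national team matches). The function and variable names are my own illustration and not part of the Stata output.

```python
# Counts taken from the figures quoted in this post; names are illustrative only.
picked, not_picked = 143, 167                  # y: team of the round at least once / never
national, non_national = 34 + 24, 109 + 143    # x: national team matches > 0 / = 0

def percentages(*counts):
    """Return each count as a percentage of the total."""
    total = sum(counts)
    return [100 * c / total for c in counts]

y_pct = percentages(picked, not_picked)
x_pct = percentages(national, non_national)

print(f"y: picked {y_pct[0]:.1f}%, not picked {y_pct[1]:.1f}%")
print(f"x: national team {x_pct[0]:.1f}%, no national team {x_pct[1]:.1f}%")
```

Rounded to whole numbers, these should agree with the 46/54 and 19/81 splits mentioned above.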

Preamble 2: the cross-table

The cross-table in Stata-exhibit 2 shows the cross-tabulation between the two aforementioned variables.

Stata-exhibit 2

At this point, the key issue to communicate to students is the numbers in red: almost 59 percent of the players with representation for their national teams made the team of the round at least once, whereas (only) 43 percent of the players with no representation for their national teams did so. This difference, it should be emphasized, is 15 percentage points (58.62 - 43.25 = 15.37). If the students get this, and they will (!), my experience tells me that it becomes straightforward to teach all types of students what logistic regression is about! But let’s not get ahead of ourselves …
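The row percentages and their difference can be reproduced with a few lines of Python. This is only a sketch: the cell counts are the ones implied by the odds quoted later in this post (34/24 and 109/143), and the dictionary layout is my own assumption, not something taken from the Stata output.

```python
# Cell counts implied by the odds quoted in this post.
# Keys: national team matches (1 = yes, 0 = no); values: (picked, not picked).
table = {1: (34, 24), 0: (109, 143)}

# Share picked for the team of the round within each row of the cross-table.
share_picked = {x: 100 * yes / (yes + no) for x, (yes, no) in table.items()}
difference = share_picked[1] - share_picked[0]

for x, pct in share_picked.items():
    print(f"national team = {x}: {pct:.2f}% picked")
print(f"difference: {difference:.2f} percentage points")
```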

Preamble 3: the linear regression model

The final introductory phase of logistic regression should involve using traditional linear regression to estimate the relationship between our two binary variables, that is, the so-called linear probability model (LPM). Stata-exhibit 3 takes care of this.

Stata-exhibit 3

Obviously, the coefficient (in red) is the 15.37 percentage point difference noted in the cross-table in Stata-exhibit 2. Also, the constant says that 43.25 percent of the players with no representation for their national teams made the team of the round at least once. In short, a linear regression involving two dummies always reproduces the 2 × 2 cross-table perfectly!
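To see why the regression reproduces the cross-table, here is a small Python sketch that rebuilds one row per player from the cell counts quoted in this post and applies the textbook formulas for bivariate OLS. It does not use Stata or any regression library, and the variable names are illustrative assumptions.

```python
# Rebuild the 310 players from the cell counts quoted in this post.
# Each tuple is (x, y): x = national team matches (1/0), y = team of the round (1/0).
cells = {(1, 1): 34, (1, 0): 24, (0, 1): 109, (0, 0): 143}
data = [pair for pair, n in cells.items() for _ in range(n)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Bivariate OLS: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
var_x = sum((x - mean_x) ** 2 for x, _ in data)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

print(f"constant:    {intercept:.4f}")
print(f"coefficient: {slope:.4f}")
```

The constant is the share picked among players without national team matches, and the coefficient is the difference between the two shares.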

Introducing logistic regression

At this point, I normally introduce the three problems of using linear regression on a binary y: poor functional form, predictions possibly larger than 100 percent or smaller than 0 percent, and heteroscedasticity. And then I introduce the sigmoid logistic regression function that saves the day in these respects (but not the equation!). The results appear in Stata-exhibit 4.

Stata-exhibit 4

The key point to tell your students is this: since the logistic regression line is non-linear (i.e., sigmoid) in its form, we can only interpret two features of the output: the sign of the coefficient (positive) and its significance level (0.036). That’s it! In other words, we must do some calculations to convert the logistic regression coefficient into more “meaningful” entities in terms of interpretation …
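For the curious, the sign and significance level can be checked without fitting anything iteratively. The sketch below relies on a known shortcut, which is my addition and not shown in the Stata output: with a single binary x, the logistic regression coefficient equals the log of the odds ratio from the 2 × 2 table, and its Wald standard error is the square root of the summed reciprocal cell counts. The cell counts are those quoted in this post; the variable names are illustrative.

```python
import math

# Cell counts quoted in this post: (picked, not picked) by national team status.
picked_1, not_1 = 34, 24      # players with national team matches
picked_0, not_0 = 109, 143    # players without national team matches

# With one binary x, the maximum likelihood estimates have a closed form:
# the constant is the log-odds for x = 0, the coefficient is the log odds ratio.
constant = math.log(picked_0 / not_0)
coefficient = math.log(picked_1 / not_1) - constant

# Wald standard error of a log odds ratio from a 2 x 2 table.
std_error = math.sqrt(1 / picked_1 + 1 / not_1 + 1 / picked_0 + 1 / not_0)
z = coefficient / std_error
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value from the normal distribution

print(f"coefficient = {coefficient:.4f}, constant = {constant:.4f}")
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

The coefficient should come out positive, with a p-value close to the 0.036 mentioned above.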

What to do regarding the interpretation of a logistic regression coefficient 1? Marginal effects

One possibility, which I prefer, is to back-transform the logistic regression coefficients into something resembling linear coefficients, which in the lingo goes by the name of “marginal effects.” Both R and Stata have canned versions for doing this, and Stata-exhibit 5 illustrates.

Stata-exhibit 5

We note in this case that the linear regression results (see Stata-exhibit 3) and the marginal effect based on the logistic regression model give exactly the same result: 0.154. This is no coincidence: with a single binary x, both models simply reproduce the two group percentages in the cross-table. In other cases, the results may differ, but usually not by much. (There are various ways to calculate marginal effects, but let’s not get into that.)
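Here is a rough Python sketch of what the back-transformation amounts to for a binary x: convert the log-odds for each group into a probability and take the difference. This is not Stata's own routine. The logit estimates are derived from the cell counts quoted in this post, and `inv_logit` is a helper name I made up for the illustration.

```python
import math

def inv_logit(z):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-z))

# Logit estimates implied by the 2 x 2 cell counts quoted in this post.
constant = math.log(109 / 143)                  # log-odds when x = 0
coefficient = math.log(34 / 24) - constant      # log odds ratio

p_without = inv_logit(constant)                 # predicted probability, x = 0
p_with = inv_logit(constant + coefficient)      # predicted probability, x = 1
marginal_effect = p_with - p_without

print(f"P(picked | x = 0) = {p_without:.4f}")
print(f"P(picked | x = 1) = {p_with:.4f}")
print(f"marginal effect   = {marginal_effect:.3f}")
```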

What to do regarding the interpretation of a logistic regression coefficient 2? Odds ratios

Here is another approach, which I don’t recommend! The odds of being picked for the team of the round, according to Stata-exhibit 1, are 0.8563 (143/167 = 0.8563). For a player with representation for his national team, the corresponding odds, according to Stata-exhibit 2, are 1.41667 (34/24 = 1.41667). For a player with no representation for his national team, by contrast (and again according to Stata-exhibit 2), the analogous odds are 0.7622 (109/143 = 0.7622). The ratio of these two odds, the odds ratio, is thus 1.86 (1.41667/0.7622 = 1.8586): the odds of being picked for the team of the round are 1.86 times as large for players having represented their national teams. If you don’t trust my calculations (I wouldn’t!), please see Stata-exhibit 6.

Stata-exhibit 6
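The odds arithmetic above is easy to verify in a few lines of Python. The counts are the ones quoted in the text; the `odds` helper is a hypothetical name used only for this illustration.

```python
def odds(successes, failures):
    """Odds = number of successes per failure."""
    return successes / failures

overall = odds(143, 167)            # all players
with_national = odds(34, 24)        # national team matches > 0
without_national = odds(109, 143)   # no national team matches

# The odds ratio is a ratio of two odds, not a difference between them.
odds_ratio = with_national / without_national

print(f"overall odds:          {overall:.4f}")
print(f"odds, national team:   {with_national:.5f}")
print(f"odds, no national team: {without_national:.4f}")
print(f"odds ratio:            {odds_ratio:.4f}")
```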

When the x-variable is continuous

Given the analyses above, introducing a continuous x-variable should be straightforward. Suppose the x-variable is the number of matches played in the season. Stata-exhibit 7 presents the results, and Figure 1 plots them.

Stata-exhibit 7

In short: A player with, say, 20 matches in the season has a 2.45 percentage point higher probability of making the team of the round than a player with 19 matches in the season. Notice also the slightly sigmoid form of the regression line in Figure 1. Cool …!

Teaching multiple logistic regression analysis

Well, as stated in the introduction, this might be a work in progress … 😊

Takeaways

Understanding logistic regression can be hard, especially for those not well versed in math. This raises several challenges for any instructor. In this regard, a strong pedagogical case can be made for an intuitive approach that comprises a “detour” focusing on frequency tables, cross-tables and plain-vanilla linear regression.

About me

I’m Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you can find them here and here. I am on ResearchGate here, and you can also reach me at christer.thrane@inn.no
