Tips for teaching regression analysis to those not well versed in math (that also help the mathematically inclined)

A long time ago, Seneca, a Roman philosopher, said that the best way to really learn something was to teach it. I agree (for what it’s worth), and this post is about teaching linear regression analysis in a more efficient manner. Textbooks, research papers, PowerPoints, and blogs abound on the inner workings of linear regression. Yet among these sources one is hard-pressed to find much advice on what works and what doesn’t in a practical teaching context. Of course, the implicit advice in any given textbook is that the pedagogical approach taken “here” is superior. Given the number of textbooks on regression analysis on the market, we can thus conclude that there are many opinions on the subject … Below follows my personal take on this, based partly on my 30 years of teaching regression analysis and partly on my textbooks on the subject (see here and here).

Know your audience!

Whenever you are going to teach the fundamentals of linear regression analysis (aka OLS regression), the first thing you should think about is your audience’s mathematical skills. In the social sciences, my home turf, experience tells me that too many instructors overestimate such skills. That is, they rely too much on symbols, equations, and abstract reasoning, and this emphasis causes many students to drop out. All else being equal, an instructor for STEM students can safely assume better math skills on average, but the moment he or she overestimates these skills …

Assuming you’ve hit the mark on the audience’s mathematical skills, three “choices” or reflection points need to be addressed:

(1) Choice of variables and units (i.e., data sets)

(2) Equations versus graphs

(3) Causal effects versus multiple causes approach

(1) Choice of variables and units (i.e., data sets)

Starting with the dependent variable, y, what should this be like? My strong conviction in this regard is that y should always be intuitively recognizable and measurable, to anyone! That is, good ys are the count of something, the amount of time or money spent on doing something, or the price of something. Five examples are doctor visits, goals scored, hours of exercise per week, expenditures on vacation trips, and the price of dwellings. Not-so-good ys are all kinds of composite indexes (e.g., Gini coefficients), aggregated rates (e.g., homicide rates per 100,000), and all kinds of variables where the minimum, maximum, and average are “blurry” for want of a better word. The vital point is that students should not have to think about what really is at stake regarding variation in y … As for the independent variables, the x-es, at least a couple of them should be on the numerical measurement level (aka continuous variables) with many possible values. This facilitates the use of scatterplots, to which I return in point (2) below.

In many cases, but not necessarily all, my prerequisites for the y- and x-variables lead me to prefer micro-level data (i.e., data for which “ordinary” people, athletes, products etc. are the units) over aggregated data (i.e., data on firms, schools, municipalities, countries etc.) when teaching regression, at least at the beginning of the course. Later, when the statistical fundamentals are in place, such aggregated data (possibly more relevant for the students’ choice of main subject) are welcome!

(2) Equations versus graphs

My previous posts on Medium.com on regression analysis (see, e.g., here, here, and here) testify to my answer to the question of equations versus graphs. Graphs, as in a scatterplot with the dependent variable on the y-axis and the independent variable on the x-axis, trump equations every time, seven days a week. The reason is obvious. Everybody understands a scatterplot, but not everybody understands equations. This does not mean that I discourage the use of equations; it just means that the equations should be placed on the “map” of the scatterplot already there. In contrast, if you start with the equations, many students will not make it to the scatterplot …

At this point, I should put my money where my mouth is. Figure 1 is based on a data set on large, second-hand cabin cruisers sold in Norway. The dependent variable, y, is the sales price of the boat, and the independent variable, x, is the length of the boat. The red line is the regression line, and everybody (literally everybody!) gets that the regression line might be thought of as the general and positive association between length and sales price for the “average” boat in the data. (At this point in the course, there is no need to be more specific than that.)

When a positive association has been explicated (more of x goes together with more of y), one may proceed to the equation yielding the regression line in Figure 1. In our case, this equation becomes:

Price = -380.58 + 15.54 × Length,

suggesting (with price measured in thousands of Euro and length in feet) that a 45-foot cabin cruiser costs roughly 15,500 Euro more on average than a 44-foot cabin cruiser. Now, it is only a small mental leap to the explication of the constant (or intercept) and the slope (or regression coefficient) in general, abstract terms.
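To make the slope concrete, the fitted equation can be evaluated for two boat lengths. The sketch below is illustrative only: the function name `predicted_price` is made up here, and it assumes that price is recorded in thousands of Euro (which is what reading the 15.54 slope as roughly 15,500 Euro implies) and length in feet.

```python
# Coefficients copied from the fitted equation in the text.
INTERCEPT = -380.58   # constant (assumed unit: thousands of Euro)
SLOPE = 15.54         # change in predicted price per extra foot of length


def predicted_price(length_feet):
    """Predicted sales price (thousands of Euro) for a boat of the given length."""
    return INTERCEPT + SLOPE * length_feet


price_44 = predicted_price(44)
price_45 = predicted_price(45)

print(f"44 feet: {price_44:.2f}")                 # 303.18
print(f"45 feet: {price_45:.2f}")                 # 318.72
print(f"difference: {price_45 - price_44:.2f}")   # 15.54, i.e. the slope
```

Whichever two adjacent lengths are chosen, the difference in predicted price is always the slope. The negative constant is simply the line extended to a length of zero feet, far outside the range of any real boat.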

Once the regression equation above and similar ones are introduced (and repeated several times in computer training exercises), one might replace the numerical x-variable (i.e., length in Figure 1) with a dummy x-variable. For example, some boats have a so-called flybridge (i.e., a “second floor”), whereas others do not. The regression equation is:

Price = 171.68 + 112.16 × Flybridge,

and it is visualized in Figure 2. The figure should be straightforward to explain. The vertical distance between the tops of the two bars is the slope (or regression coefficient) of 112.16, suggesting that boats with a flybridge cost about 112,000 Euro more on average than boats without a flybridge. (The average price for a boat without a flybridge is just north of 171,000 Euro, i.e., the constant.)
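The claim that the dummy’s slope is simply the gap between two averages can be checked on a toy example. The six prices below are invented for illustration (they are not the cabin cruiser data), and `ols_simple` is a hypothetical helper written for this sketch using the textbook least-squares formulas.

```python
from statistics import mean


def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Invented toy data: 1 = boat has a flybridge, 0 = it does not.
flybridge = [0, 0, 0, 1, 1, 1]
price = [150.0, 170.0, 190.0, 260.0, 280.0, 300.0]  # thousands of Euro

intercept, slope = ols_simple(flybridge, price)

# Plain group averages, computed without any regression.
mean_without = mean(p for f, p in zip(flybridge, price) if f == 0)
mean_with = mean(p for f, p in zip(flybridge, price) if f == 1)

print(intercept, mean_without)          # constant vs. mean price without flybridge
print(slope, mean_with - mean_without)  # slope vs. difference in mean prices
```

In this setup the constant matches the average price of the boats coded 0, and the slope matches the difference between the two group averages, which is exactly what the two bars in a figure of this kind display.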

The last point on today’s agenda concerns the theoretical motivation for doing multiple regression analysis.

(3) Causal effects versus multiple causes approach

Several justifications can be made for multiple regression analysis, depending on context and subject matter. One approach is to seek the unbiased effect of x1 on y, as in the causal effect framework. In this scenario, other x-variables are most often passive control variables (as in potential confounders). Another approach to motivate multiple regression is to “look at nature” and observe the trite fact that most things in life have multiple causes. As such, if our regression model is to resemble nature at all, it should contain several x-variables. Both approaches work fine by me. Yet from a pedagogical stance, the “simplest” extension to a multiple regression with only two x-variables has lots going for it. Continuing with our example, we might check out how boat price is associated with boat length and (presence of) flybridge simultaneously. The regression equation becomes:

Price = -371.13 + 15.02 × Length + 23.95 × Flybridge,

which also is shown in Stata-exhibit 1.

Stata-exhibit 1.

Here, not much happens to the regression coefficient (slope) for boat length. But the coefficient for flybridge drops to about 24,000 Euro (from 112,000 Euro) in the multiple case. That is, if we compare boats of the same length with and without a flybridge, the ceteris paribus price difference between these two boat types is “only” about 24,000 Euro. The mathematically inclined might want a technical explanation as to why this occurs, but my mathematical abilities stop here. Sorry!
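For readers who do want the mechanics, one standard account is the omitted-variable identity for least squares: the flybridge coefficient from the simple regression equals the flybridge coefficient from the multiple regression plus the length coefficient times the average difference in length between boats with and without a flybridge. The sketch below only rearranges the three published equations, taking the flybridge coefficient of 23.95 from the multiple regression equation. The implied length gap is a back-calculation from rounded coefficients, not a number reported for the data, and the variable names are chosen here for illustration.

```python
# Coefficients copied from the three regression equations in the text
# (price assumed to be in thousands of Euro, length in feet).
b_fly_simple = 112.16    # flybridge, simple regression
b_fly_multiple = 23.95   # flybridge, holding length constant
b_len_multiple = 15.02   # length, holding flybridge constant

# Identity: b_fly_simple = b_fly_multiple + b_len_multiple * length_gap,
# where length_gap is the mean length of boats with a flybridge
# minus the mean length of boats without one.
length_gap = (b_fly_simple - b_fly_multiple) / b_len_multiple

# Share of the raw price gap that is really a payment for extra length.
due_to_length = b_len_multiple * length_gap

print(f"implied length gap: {length_gap:.2f} feet")
print(f"price gap attributable to length: {due_to_length:.2f}")
```

Taken at face value, the equations imply that boats with a flybridge are roughly six feet longer on average, so most of the raw price difference reflects length rather than the flybridge itself.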

The next stop is more x-variables, interaction effects, and non-linearities (see here and here), before moving on to tests of significance (see here). But that’s something for another day.

Takeaways

There is no single correct way to arrange and teach a course in regression analysis. But some success factors are:

(1) Know the background of your audience, especially their mathematical skills

(2) Use transparent and easy-to-recognize variables, preferably for micro-level data, at the start of the course

(3) Use graphs before turning to equations

(4) Let the first multiple regression contain one numerical x-variable and one dummy x-variable

About me

I’m Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you can find them here and here. I am on ResearchGate here, and you can also reach me at christer.thrane@inn.no
