Understanding Linear Regression Is So Simple a Manager Could Do It

Davin Cermak
6 min read · May 27, 2022


Ok, I’ll admit that this title is snarky and unfair, but it just got stuck in my head and so I went with it.

Shortly after starting a position with a new employer, I proposed a research project to my supervisor that included a linear regression model. The supervisor explained I could not proceed with the project as designed because the group I was working in did not provide forecasts or predictions. My proposal was to use a regression model to understand relationships between variables and not to forecast or predict. But being relatively new to the group, I did not push the issue further and changed the research proposal.

It made me realize that academics — at least when I was last in the classroom — often teach linear regression as a prediction tool. But the thought of how to describe linear regression models to help understand data continued to nag me until I found my “aha” moment looking through a regression analysis explanation in a data science context. I hope the following discussions will provide an “aha” moment for you as well (I wish I had discovered this when studying econometrics!) and will help you better understand linear regressions and better explain the results to others.

NOTE: I will discuss a couple of simple examples of linear regression models where the explanatory variables, ‘sex’ and ‘marital status’, are categorical: ‘sex’ is a binary variable coded 0 or 1, and ‘marital status’ has several categories. I will simply focus on extracting group means from the data and will not get into a discussion of linear regression assumptions. The goal here is to establish a simple and practical understanding of regression analysis on which to build.

A Naïve Model: The Mean

Using data from the Panel Study of Income Dynamics from the University of Michigan, we will look at several factors related to the wage earned by the head of household in the survey.

The most basic understanding of our dependent variable, ‘wage’, comes from a measure of central tendency (e.g., mean, median, or mode). Since it is next to impossible for the human brain to look at a list of 6,000+ wages and process any meaningful information, we often want to know the average wage to get at least a basic understanding of the data we are considering. Calculating the mean of a set of wages (sum(wage) / number of wages) is considered a naïve model. It is naïve because we make absolutely no assumptions about either the distribution of the data or its dependence on any other data. All we know at this point is that the mean wage in our data is $55,460.
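If you prefer to see the arithmetic as code, here is a minimal sketch in Python with pandas. The file name psid_wages.csv and the column name wage are hypothetical placeholders for however you load your own PSID extract; the numbers in the comments are the ones reported in this article.

```python
import pandas as pd

# Hypothetical file and column names -- the PSID extract used in the article
# is not reproduced here, so treat this as an illustrative sketch only.
psid = pd.read_csv("psid_wages.csv")

# The naive model: a single number, the overall mean wage, with no assumptions
# about the distribution of wages or their relationship to any other variable.
mean_wage = psid["wage"].mean()
print(f"Mean wage: ${mean_wage:,.0f}")  # about $55,460 in the author's sample
```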

A Univariate Linear Regression Model: Controlling for Gender

There are additional pieces of information to consider in the data besides the mean wage, so let us start with looking at the mean wages of males and females:

sum(wages, males) / number of male wages = $64,107, and

sum(wages, females) / number of female wages = $35,431.
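Continuing the hypothetical pandas sketch from above, these two numbers are just a grouped version of the same calculation. The ‘sex’ column name and its 0 = Male, 1 = Female coding are assumptions about how the data is stored.

```python
# Group means by gender: the same naive mean calculation, done once per group.
# 'sex' is assumed to be coded 0 = Male, 1 = Female.
group_means = psid.groupby("sex")["wage"].mean()
print(group_means)
# In the article's sample this works out to roughly $64,107 for males (sex = 0)
# and $35,431 for females (sex = 1).
```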

In this case, we are calculating two naïve models, one for each gender, instead of just one for the entire sample. However, we can also use a simple univariate linear regression model to derive the mean wages for males and females. (I will forgo the formal specification of the model here, since several textbooks and online sources cover it, and focus instead on the model results and their interpretation.)

The resulting coefficients from the model allow us to calculate the mean wages for each group. When the gender variable “sex” is defined as a factor, the linear regression model treats it as a dummy variable where 0 = Male and 1 = Female.

If you have had a statistics course that covered linear regression models, you may recall the simple univariate linear regression formula: wage = β₀ + β₁(sex) + ε. With our estimates, this becomes wage = $64,107 − $28,677 × sex. The constant is the same as the mean wage for males. Solving this equation with sex = 1 for females gives us the mean wage for females, $35,431.
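Here is a minimal sketch of that univariate regression using statsmodels, again on the hypothetical DataFrame from the earlier snippets. Because ‘sex’ is a 0/1 dummy, the intercept is the male mean and the coefficient on sex is the male-to-female difference.

```python
import statsmodels.formula.api as smf

# wage ~ sex with a 0/1 dummy: the intercept is the mean wage when sex = 0
# (males) and the coefficient on sex is the difference in group means.
model = smf.ols("wage ~ sex", data=psid).fit()
print(model.params)
# Intercept ≈ 64,107 (male mean); sex ≈ -28,677, so the female mean is
# 64,107 - 28,677 ≈ 35,431, matching the group means computed directly.
```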

We can learn more about the differences in wages between males and females from the linear regression results typically generated by statistical packages than we can by just considering the mean values of the two groups. Here, the difference in the means is pretty large, and with 6,815 people in the data set we would expect it to be statistically significant rather than just the result of random errors when collecting the data. We can confirm this by looking at the t-values, which are calculated in the linear regression output.

The interpretation of the t-values is straightforward in this case. The benchmark in econometrics is that coefficients with a t-value probability, ‘Pr(>|t|)’, less than 0.05 are statistically significant. But statistically significant from what? Well, recalling that the constant is the mean wage for males in our data, the probability of seeing a value like this if the true population mean were $0 is less than 1 percent. This is, however, not as important as the t-value probability for the female coefficient. That probability tests whether the wages of females are statistically different from the wages of males. In this model, the probability of seeing a gap this large by random chance alone, if the mean wages of women and men were actually the same, is less than 1 percent.
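Continuing the hypothetical statsmodels sketch, the t-values and their probabilities come straight from the fitted model: the summary prints the full regression table, and the p-values are also available on their own.

```python
# The full regression table, including t-values and Pr(>|t|), for the model
# fitted in the previous snippet.
print(model.summary())

# The p-values alone: a value below 0.05 for 'sex' means a gap this large
# would be very unlikely if male and female mean wages were equal
# in the population.
print(model.pvalues)
```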

We can use this information to explain a couple of aspects of the data. On average, wages for males are $28,677 higher than average wages for females. Additionally, there is a very low probability that the gender difference in wages is because of random errors. We can be pretty confident that the difference in the survey data exists in the broader population.

Extending to a Multivariate Linear Regression Model: Controlling for Marital Status

This same understanding can be applied to a multivariate model with a more complex factor used to describe wages: marital status, which has several categories rather than just two.

Here, we have several more potential influences on wages than whether the survey respondent is male or female. Unlike the previous univariate linear regression model, the model used here will be a multivariate model, but the same concepts will apply.

Similar to the univariate regression model, the multivariate model has a constant that represents married respondents, whose mean wage is $74,207. The coefficient for each marital status category is that group’s difference from the married mean: never married respondents earn, on average, $36,431 less than married respondents, widowed respondents $38,218 less, and so on. It is important in this model to know the meaning of the constant value, as most software programs will not include the ‘Married’ label (I added it manually) but will just refer to it as a ‘Constant’ or ‘Intercept’ value.

The t-value probability interpretation is like that of the univariate model above. In this multivariate model, each individual marital status mean wage is statistically different (and much lower!) from the mean wage of our base case, married respondents. Keep in mind that this model does NOT tell us whether the difference in mean wages between Never Married and Widowed respondents, which are quite similar, is statistically significant.
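A sketch of the marital status model with statsmodels might look like the following. The column name marital_status and its labels are assumptions about how the data is coded; C(...) with Treatment(reference='Married') builds the dummy variables and makes married respondents the base category, so the intercept is the married mean wage.

```python
# Marital status as a categorical regressor with 'Married' as the base category.
# The intercept is then the married mean wage, and each coefficient is that
# group's difference from the married mean.
marital_model = smf.ols(
    "wage ~ C(marital_status, Treatment(reference='Married'))",
    data=psid,
).fit()
print(marital_model.params)
# Intercept ≈ 74,207 (married mean); the article reports, for example,
# roughly -36,431 for never married and -38,218 for widowed respondents.
```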

One last point. You may have heard the terms ‘all else equal’ or ‘ceteris paribus’ used in regression model discussions. In this model, it simply means that we are comparing each individual marital status group to our base married group, holding everything else in the model constant. Here, that makes intuitive sense, since each respondent can only be in one of these categories at a time.

To recap, understanding a linear regression result is as simple as understanding the mean of a set of numbers. To calculate the mean of one of the groups from the regression model, simply solve the linear regression equation. Unlike calculating the simple mean of a group, linear regression output from most statistical packages also includes information that can be used to guide us in determining whether the difference between the means is due to random chance.


Davin Cermak

A long-time economist and data analyst moving into the world of freelance consulting.