Demystifying Statistical Analysis 2: The Independent t-Test Expressed in Linear Regression

The Curious Learner
Sep 2, 2018 · 4 min read

Group comparison analyses such as the independent t-test and ANOVA may seem quite different from linear regression, but if we take a look at the cheat sheet in the first part of this series, we will notice that they actually fall under the same column of predicting a continuous dependent variable. The main difference is that t-tests and ANOVAs involve the use of categorical predictors, while linear regression involves the use of continuous predictors. When we start to recognise whether our data is categorical or continuous, selecting the correct statistical analysis becomes a lot more intuitive.

Categorical predictors (e.g. Male vs Female, Children vs Teens vs Adults, etc.) can be expressed in a linear regression using dummy or contrast codes. In fact, statistical packages such as SPSS automatically create coded predictors in the background before running the appropriate statistical analysis, hence the term General Linear Model. In the next few parts of this series, I will attempt to illustrate how some of the popular group comparison analyses are represented in linear regression analysis, with the help of the textbook “Data Analysis: A Model Comparison Approach” by Carey Ryan, Charles M. Judd, and Gary H. McClelland. I hope that by drawing the connections between the various statistical analyses, it will become easier to identify when each statistical analysis should be used.

The independent t-test is one of the most commonly used statistical tests for determining whether there is any difference between 2 unrelated groups. For those who are unfamiliar with the test and want to know how it is usually conducted in SPSS, Laerd Statistics provides a comprehensive step-by-step guide. Otherwise, I will explain the independent t-test using the following regression equation:

Ŷi = b0 + b1Xi

When comparing 2 groups, there are 2 parameters in the regression model that need to be estimated: b0 and b1.

b0, commonly known as the intercept, is the value of Ŷi when Xi is equal to 0. b1 is the estimated slope for the predictor Xi, which in this case represents the comparison between 2 groups such as “Male vs Female”. This comparison is done either by dummy coding (where Male is coded as 0 and Female is coded as 1) or by contrast coding (where Male is coded as -1 and Female is coded as 1).

When dummy coding is used, substituting Xi with 0 gives the Male mean (Ŷi = b0), and substituting Xi with 1 gives the Female mean (Ŷi = b0 + b1). Hence, in dummy coding, b0 always represents the mean of the group coded as 0 (here the Male mean, i.e. the reference group being compared to), while b1 represents the difference between the two means (the Female mean minus the Male mean).

Example values of Xi when dummy coding is used to differentiate Male vs Female.
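To make the dummy-coding interpretation concrete, here is a minimal sketch in plain Python. The scores are made up purely for illustration, and fit_line is a hypothetical helper (not from the textbook or SPSS) that applies the usual least-squares formulas for a single predictor.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up scores, for illustration only.
male = [4.0, 5.0, 6.0, 5.0]
female = [7.0, 8.0, 6.0, 7.0, 9.0]

y = male + female
x_dummy = [0] * len(male) + [1] * len(female)  # Male = 0, Female = 1

b0, b1 = fit_line(x_dummy, y)
print("b0:", b0, "| Male mean:", mean(male))
print("b1:", b1, "| Female mean - Male mean:", mean(female) - mean(male))
```

Up to floating-point rounding, b0 should match the Male mean and b1 should match the difference between the two group means.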

Similarly, when contrast coding is used, substituting Xi with -1 gives the Male mean, and substituting Xi with 1 gives the Female mean. In this case, however, b0 represents the mean of the Male and Female means, while b1 represents 1/2 the difference between the means of Male vs Female (because -1 and 1 are 2 units apart).

Example values of Xi when contrast coding is used to differentiate Male vs Female.
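The same kind of sketch works for contrast coding. Again, the scores are invented and fit_line is a hypothetical helper. The group sizes are deliberately unequal, to show that b0 is the mean of the two group means rather than the mean of all the scores pooled together.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up scores, for illustration only.
male = [4.0, 5.0, 6.0, 5.0]
female = [7.0, 8.0, 6.0, 7.0, 9.0]

y = male + female
x_contrast = [-1] * len(male) + [1] * len(female)  # Male = -1, Female = 1

b0, b1 = fit_line(x_contrast, y)
mean_of_means = (mean(male) + mean(female)) / 2
half_difference = (mean(female) - mean(male)) / 2

print("b0:", b0, "| mean of the two group means:", mean_of_means)
print("b1:", b1, "| half the difference:", half_difference)
print("mean of all scores pooled:", mean(y))  # differs from b0 when groups are unequal
```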

Contrast coding provides a little more information than dummy coding, and avoids making comparisons to a fixed reference group. Nonetheless, the significance test of the slope b1 gives the same result for both types of coding, i.e. whether or not the groups are statistically different. Essentially, a hypothesis test on the slope asks whether or not the slope is 0. If the 95% confidence interval of b1 captures the value of 0 (e.g. -0.6 to 0.3), the difference between the two groups is not statistically significant at the .05 level; conversely, if the 95% confidence interval of b1 does not include 0 (e.g. 0.4 to 1.3), the groups are statistically different, and the p-value of b1 will be less than .05.

Confidence interval of b1, when the interval captures the value of 0. If 0 is a plausible value for b1, the difference between Males and Females will not be statistically significant.
Confidence interval of b1, when the interval does not include the value of 0. If the confidence interval of b1 does not include the value of 0, the difference between Males and Females will be statistically significant.
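As a rough sketch of how such an interval is obtained, the snippet below computes a 95% confidence interval for b1 under dummy coding, using made-up scores. The critical value 2.365 is an assumption taken from a standard t table for 7 degrees of freedom (9 scores minus 2 estimated parameters); with a different sample size it would have to be looked up again.

```python
from math import sqrt
from statistics import mean

# Made-up scores, for illustration only.
male = [4.0, 5.0, 6.0, 5.0]
female = [7.0, 8.0, 6.0, 7.0, 9.0]
y = male + female
x = [0] * len(male) + [1] * len(female)  # dummy coding: Male = 0, Female = 1

# Least-squares estimates of the intercept and slope.
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the slope, from the residual sum of squares.
df = len(y) - 2
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = sqrt(sse / df / sxx)

T_CRIT = 2.365  # assumed table value: two-sided 95%, df = 7
lower = b1 - T_CRIT * se_b1
upper = b1 + T_CRIT * se_b1
significant = not (lower <= 0 <= upper)

print("b1:", round(b1, 3))
print("95% CI:", round(lower, 3), "to", round(upper, 3))
print("interval excludes 0:", significant)
```

For these invented scores the interval lies entirely above 0, so the difference would be reported as statistically significant at the .05 level.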

Quite evidently, this linear regression approach is equivalent to the independent-samples t-test that most people are familiar with (the standard version that assumes equal variances in the two groups), and produces the same results. In the subsequent posts of this series, I will continue to explain other statistical analyses using the same method of linear regression and dummy/contrast coding.
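One way to check this equivalence by hand is to compute the t statistic of the slope under both coding schemes and compare it with the classic pooled-variance t statistic. The sketch below does this with made-up scores; slope_t and pooled_t are hypothetical helper names, and the pooled formula assumes equal variances in the two groups.

```python
from math import sqrt
from statistics import mean, variance

def slope_t(x, y):
    """t statistic for the slope b1 in a one-predictor regression."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = sqrt(sse / (len(y) - 2) / sxx)
    return b1 / se_b1

def pooled_t(group_a, group_b):
    """Classic independent t statistic (equal variances assumed)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var * (1 / na + 1 / nb))

# Made-up scores, for illustration only.
male = [4.0, 5.0, 6.0, 5.0]
female = [7.0, 8.0, 6.0, 7.0, 9.0]
y = male + female

t_dummy = slope_t([0] * len(male) + [1] * len(female), y)
t_contrast = slope_t([-1] * len(male) + [1] * len(female), y)
t_classic = pooled_t(male, female)

print("t (dummy coding):   ", round(t_dummy, 4))
print("t (contrast coding):", round(t_contrast, 4))
print("t (pooled t-test):  ", round(t_classic, 4))
```

All three values should agree up to floating-point rounding. The slope itself differs between the two codings, but so does its standard error, so the ratio is unchanged.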

