The Statistical Foundation of Linear Regression: T-Tests, ANOVA, and Chi-Square Tests

Albane Colmenares
Sep 4, 2023


Image source: Flatiron School

In Kaggle’s 2020 State of Data Science and Machine Learning Survey, 83.7% of data scientists reported using linear or logistic regression, making them the most widely employed machine learning algorithms. As data scientists, we may wonder why we need to learn statistical tests at all when libraries can handle the calculations for us. Nevertheless, statistics plays a crucial role behind the scenes: it builds the foundation for effective predictive modeling, and understanding it helps us make sense of the powerful models we build.

This post will explore how statistical tests such as t-tests, ANOVA, and chi-square tests underpin linear regression.

Understanding Statistical Tests

Before diving into their connection with machine learning, we will first define the three statistical tests we are focusing on:

  • T-test: A t-test is used to compare the means of two groups and determine whether they are significantly different from each other. It evaluates the difference between the means relative to the variation within each group, providing insight into the statistical significance of observed differences. There are two types of t-tests: the one-sample t-test and the two-sample t-test.
  • a- One-Sample T-test
    This procedure is used to compare a single group’s mean to a known or hypothesized value. If the population mean is known, the comparison can be performed using a one-sample t-test.
    For example, this statistic could be used to determine whether students with prior coding experience earn grades in a data science program that differ from the known program-wide average, as sketched below.
Image source: https://www.statstest.com/single-sample-t-test/
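Here is a minimal sketch of a one-sample t-test in Python using scipy; the grades and the assumed program-wide average of 80 are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical grades of students who had prior coding experience
coders = np.array([85, 78, 92, 88, 75, 90, 84, 79])

# Is their mean grade different from an assumed program-wide average of 80?
t_stat, p_value = stats.ttest_1samp(coders, popmean=80)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # p < 0.05 suggests a real difference
```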
  • b- Two-Sample T-test
    A two-sample t-test is used when the means to be compared come from two independent samples.
    A common use of the two-sample t-test is in clinical trials of new medications. If a new drug designed to lower blood pressure were released, the means of two distinct groups would be compared: the one given the drug (the experimental group) and the one given the placebo (the control group), as sketched below.
Image source: https://www.statstest.com/independent-samples-t-test/
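The two-sample case looks similar; here is a sketch with invented blood pressure readings, using Welch’s variant (equal_var=False) so that we do not have to assume the two groups have equal variances:

```python
import numpy as np
from scipy import stats

# Hypothetical systolic blood pressure readings after treatment
drug = np.array([128, 131, 125, 122, 130, 127, 124])     # experimental group
placebo = np.array([138, 135, 140, 133, 137, 139, 136])  # control group

# Welch's two-sample t-test: are the group means significantly different?
t_stat, p_value = stats.ttest_ind(drug, placebo, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```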
  • ANOVA, or Analysis of Variance: ANOVA is used to analyze the differences between three or more groups by comparing the variance within the groups to the variance between them. It helps determine whether there are significant differences among the means of the groups being compared.
    For example, ANOVA could be used to measure the effectiveness of three different fertilizers on the growth of a specific vegetable or plant. The hypothesis would be that one fertilizer leads to a significantly higher yield than the others (see the sketch below).
Image source: https://databrio.com/blog/anova-step-by-step-procedure-
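The fertilizer example might look like this with scipy’s one-way ANOVA; the yield numbers are invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical vegetable yields under three different fertilizers
fertilizer_a = np.array([20, 22, 19, 24, 25])
fertilizer_b = np.array([28, 30, 27, 26, 29])
fertilizer_c = np.array([18, 20, 22, 19, 24])

# One-way ANOVA: is at least one group mean different from the others?
f_stat, p_value = stats.f_oneway(fertilizer_a, fertilizer_b, fertilizer_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```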
  • Chi-square test: The chi-square test examines whether there is a significant association between two categorical variables. It helps determine if the observed frequencies in different categories significantly deviate from the expected frequencies, indicating a relationship between the variables.
    A chi-square test could be used to evaluate voting preferences among residents of two different states for three different candidates. For example: is there an association between residents from New York and New Jersey and preference between “Candidate X”, “Candidate Y” and “Candidate Z”?
    To analyze this data, the chi-square test helps you assess whether there is a significant association, or dependence, between residence and voting preference. If the test shows a significant association, it can be concluded that residence and voting preference are not independent, i.e. that there is a relationship between the two variables (see the sketch below).
Image source: https://www.statstest.com/chi-square-test-of-independence/
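The voting example can be sketched as a chi-square test of independence on a contingency table; the counts below are made up:

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = state, columns = Candidate X / Y / Z
observed = np.array([
    [120,  90, 40],   # New York residents
    [ 80, 100, 60],   # New Jersey residents
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
# A low p-value suggests residence and preference are not independent
```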

Linear Regression and Statistical Tests

The tests we have just defined are fundamental statistical tools that play a critical role in linear regression. The paragraphs below explain how each one underpins this machine learning algorithm.

T-Tests: Assessing the Independent Variables’ Significance

As seen earlier, T-Tests are often used to determine if there is a significant difference between the means of two groups. When applied in the context of linear regression, they play a crucial role in evaluating the significance of individual predictors or independent variables.

Suppose you were building a linear regression model to predict a student’s final exam score based on previous results such as exam one, exam two, and exam three scores. Each of these factors is considered an independent variable, and this is where t-tests come into play. For each independent variable, a t-statistic and p-value are calculated. The p-value tells us whether the variable has a statistically significant impact on the dependent variable (here, the final exam score). A low p-value suggests a significant impact, helping us decide whether to include or exclude that variable from our model.
Statistics associated with t-tests are highlighted in yellow in the above example of a multiple linear regression’s results.
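To see these statistics in practice, here is a minimal sketch using statsmodels on simulated exam data (all numbers are invented; exam three is deliberately given no real effect, so its t-test should come out non-significant):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated exam data for illustration only
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "exam1": rng.normal(70, 10, 100),
    "exam2": rng.normal(75, 8, 100),
    "exam3": rng.normal(72, 9, 100),
})
# The final score depends on exams one and two, but not on exam three
df["final"] = 0.3 * df["exam1"] + 0.5 * df["exam2"] + rng.normal(0, 5, 100)

X = sm.add_constant(df[["exam1", "exam2", "exam3"]])
model = sm.OLS(df["final"], X).fit()

# The "t" and "P>|t|" columns of the summary are the per-predictor t-tests
print(model.summary())
```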

ANOVA: Evaluating the Model’s Significance

ANOVA, or Analysis of Variance, serves a different but equally important role in linear regression. Instead of assessing individual predictors, ANOVA helps evaluate the overall significance of the regression model.

Continuing with our example, ANOVA helps determine whether the entire linear regression model (i.e. the one using exam one, exam two, and exam three scores as predictors) significantly explains the variance in final exam scores. The F-statistic from ANOVA provides this insight. A low p-value associated with a high F-statistic indicates that the model as a whole is statistically significant; in other words, at least one predictor in the model has a significant effect on the dependent variable.

Statistics associated with ANOVA are highlighted in light blue in the above example.
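Reusing the model fitted in the sketch above, statsmodels exposes the overall F-test directly on the results object:

```python
# Continuing from the OLS sketch above
print(f"F-statistic: {model.fvalue:.2f}")
print(f"F-test p-value: {model.f_pvalue:.2e}")
# A low p-value means at least one exam score helps explain
# the variance in final exam scores
```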

Chi-Square Tests: Tying It Together in Logistic Regression

While t-tests and ANOVA primarily deal with continuous dependent variables, Chi-Square tests come into play when there is a categorical dependent variable, often in the context of logistic regression.

For example, suppose a logistic regression model predicts whether customers will make a purchase based on factors like age group, gender, and website interactions. The answer to this question is either yes or no, which makes the dependent variable categorical.

Here, chi-square tests assess how well the logistic regression model fits the data. These tests compare the predicted probabilities of purchase (from the model) to the actual observed outcomes.
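A full goodness-of-fit test of this kind (such as Hosmer–Lemeshow) takes a bit more setup, but a closely related chi-square statistic is reported directly by statsmodels: the likelihood-ratio chi-square, which compares the fitted model to an intercept-only model. Here is a sketch on simulated purchase data (all names and numbers are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated customer data for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, 200),
    "visits": rng.poisson(5, 200),
})
# Purchase probability rises with age and number of site visits
logits = -4 + 0.03 * df["age"] + 0.4 * df["visits"]
df["purchased"] = (rng.random(200) < 1 / (1 + np.exp(-logits))).astype(int)

X = sm.add_constant(df[["age", "visits"]])
result = sm.Logit(df["purchased"], X).fit(disp=0)

# Likelihood-ratio chi-square: does the model beat an intercept-only model?
print(f"LR chi2 = {result.llr:.2f}, p = {result.llr_pvalue:.2e}")
```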

To conclude, these three foundational statistical tests (t-tests, ANOVA, and chi-square tests) play essential roles in the world of linear regression. T-tests help us decide which predictors to include, ANOVA evaluates the overall model’s significance, and chi-square tests tell us whether the model fits the data well.

Understanding these statistical foundations empowers us to make informed decisions when building and interpreting linear regression models. It also helps us make sense of the results, analyze our models more critically, and ultimately improve the predictions they make. And you thought learning these was just a waste of time!
