F-statistic: Understanding model significance using Python

Aditya Manikantan
Published in Analytics Vidhya · Sep 12, 2021

In statistics, a test of significance is a method for deciding, based on the data, whether to reject or retain a certain claim. In the case of regression analysis, it is used to determine whether an independent variable is significant in explaining the variance of the dependent variable. So suppose we have our regression equation:

y = intercept + β*x

y: dependent variable
β: regression coefficient
x: independent variable

In this case,

  • The null hypothesis H0 would be: β = 0, i.e. predictor x is not able to explain the variance of the dependent variable y.
  • The alternative hypothesis H1 would be: β ≠ 0, i.e. x is significant in predicting the value of y.

Since here we have only one predictor, a T-test should be enough. However, in reality, our model is going to include a number of independent variables. This is where the F-statistic comes into play.

The F-statistic can be used to find the joint significance of multiple independent variables. It is used to compare two models' ability to explain the variance of the dependent variable. To put it another way, it can help determine whether to go with a complex model or a simpler version. The null and alternative hypotheses are similar to those in the T-test. So for the given regression equation:

y = intercept + β1*x1 + β2*x2 + ... + βn*xn
  • The null hypothesis H0 would be: β1= β2 = … = βn = 0
  • Alternative hypothesis H1 would be: βi ≠ 0 for at least one i

So, if even one of the coefficients is significant, there is a high chance of rejecting the null hypothesis, as the coefficients are no longer jointly insignificant. Here the two models are an unrestricted model, which contains all the predictor variables, and a restricted model, in which we restrict the number of predictors. For this article, I'll be using an intercept-only model as the restricted model, as shown below.
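Concretely, with the intercept-only model as the restricted model, the two models being compared are:

Unrestricted: y = intercept + β1*x1 + β2*x2 + ... + βn*xn
Restricted:   y = intercept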

Each model has some residual associated with it. A residual is simply the distance from a data point to the regression line, i.e. the prediction error for that point. In our case we have two types of residuals:

  • SSRr: Sum square of residuals of the restricted model
  • SSRu: Sum square of residuals of the unrestricted model

SSRr will always be greater than (or at best equal to) SSRu, since the restricted model is not able to capture the variance of the dependent variable as well. Adding variables helps reduce the error irrespective of whether we are adding significant variables or just noise. A quick sketch of how these two quantities could be computed is shown below.
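As a rough illustration (this sketch is not part of the original walkthrough), here is how SSRr and SSRu could be computed by hand, assuming X is a DataFrame of numeric predictors and y is the target series (both are defined later in this article):

import numpy as np
import statsmodels.api as sm

# Restricted (intercept-only) model: the best prediction is just the mean of y
SSR_r = np.sum((y - y.mean()) ** 2)

# Unrestricted model: intercept plus all predictors in X
unrestricted = sm.OLS(y, sm.add_constant(X)).fit()
SSR_u = np.sum(unrestricted.resid ** 2)

print(SSR_r, SSR_u)  # SSR_r will be at least as large as SSR_u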

But if the unrestricted model always has less error than the restricted model, then why do we need the F-statistic? Because the question we are asking is whether SSRr is significantly greater than SSRu, i.e. whether the drop in error is larger than what we would expect from simply adding noise variables.

The equation for the F-statistic is:

F = [(SSRr - SSRu) / P] / [SSRu / (N - P - 1)]

where P is the number of predictors being tested and N is the number of observations. Dividing the drop in the sum of squared residuals by P in the numerator and SSRu by N - P - 1 in the denominator scales each part by its degrees of freedom, so under the null hypothesis the ratio follows a known distribution. These two quantities also happen to be our degrees of freedom. So,

  • df1 = P: degree of freedom 1 (the number of predictors being tested)
  • df2 = N - P - 1: degree of freedom 2 (the number of observations minus the number of estimated parameters)
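For example, in the automobile model fitted later in this article there are P = 4 predictors and statsmodels reports df2 = 190, which implies N = 195 observations (195 - 4 - 1 = 190).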

The distribution we are going to compare it with is called the F-distribution. We usually take a confidence level of 95%, which translates to an alpha value of 0.05. Based on the values of the two degrees of freedom and the alpha value, we can find the F-critical value on the F-distribution. If the F-statistic is greater than the F-critical value, we reject the null hypothesis; in other words, we have enough evidence to suggest that the given independent variables are significant in explaining the variance of the dependent variable.
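As a side note (this is not part of the original walkthrough), the F-critical value can also be looked up programmatically instead of from a printed table, for example with scipy.stats:

from scipy import stats

alpha = 0.05
df1, df2 = 4, 190  # degrees of freedom of the model fitted later in this article
f_critical = stats.f.ppf(1 - alpha, df1, df2)
print(f_critical)

Note that this returns the exact critical value for df2 = 190, which is slightly larger than the value read from the last (df2 > 120) row of the printed table used below.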

To show this in Python, I have used an automobile dataset. The problem statement is simply to find the price of the vehicles based on some features/independent variables. These include horsepower, peak-rpm, bore, and stroke.

So we start by first importing pandas and reading the CSV file.

import pandas as pd
df = pd.read_csv('Automobile_data.csv')
print(df.head())

Assign the independent and dependent variables to X and y variables respectively.

X = df.drop(['price'], axis=1)
y = df['price']

To calculate the F-statistic, I’ll be importing a library called statsmodels.

import statsmodels.api as sm

An intercept is not added by default so it has to be added manually.

X = sm.add_constant(X)
print(X)

Finally, we can fit the data using the OLS (Ordinary Least Squares) method of statsmodels.

import numpy as np

results = sm.OLS(y, X).fit()
A = np.identity(len(results.params))[1:, :]  # test every coefficient except the intercept
print(results.f_test(A))

The results show that the given model got an F-statistic of 108.272 compared to the intercept-only model. It also displays the two degrees of freedom: df1 = 4 and df2 = 190. So to find the F-critical value we can look up the F-distribution table for an alpha value of 0.05.

F-distribution table (alpha = 0.05). Source: http://www.socr.ucla.edu/Applets.dir/F_Table.html

Compare the degrees of freedom to get the F-critical value. Our df1 is 4, so we look up the 4th column, and since df2 is greater than 120 we check the last row.

We get an F-critical value of 2.3719, which is much lower than our F-statistic of 108.272. Since F-statistic > F-critical, we reject the null hypothesis, which means that the independent variables are jointly significant in explaining the variance of the dependent variable.

Alternatively, you can also print the summary of the OLS model, which displays the full regression results.

print(results.summary())

In this, we can check the p-value (listed as Prob (F-statistic)) in the summary to determine whether to reject the null hypothesis. Here the p-value is the probability of observing an F-statistic at least this large if the null hypothesis for the full model were true (i.e., if all of the regression coefficients were zero). Since the p-value is approximately zero, we reject the null hypothesis. In other words, there is evidence that suggests a linear relationship between price and the set of predictor variables.
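If you only need these two numbers rather than the whole summary, the fitted statsmodels results object also exposes them directly (a small convenience not used in the article itself):

print(results.fvalue)    # overall F-statistic of the regression
print(results.f_pvalue)  # Prob (F-statistic), i.e. the p-value for that F-statistic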

Summary

  • A test of significance can help us determine whether we should go with a complex model or a simpler one.
  • The F-statistic can be used to understand whether a given set of predictor variables is jointly significant in explaining the variance of the dependent variable.
  • If F-statistic > F-critical, or if Prob (F-statistic) is approximately 0, then we reject the null hypothesis. In other words, the regression as a whole is statistically significant.

Happy modeling!

Data: Source
