“Unveiling the Power of the F-Statistic and t-Statistics in Regression Analysis: Unraveling the Secrets of Statistical Significance” (Part 2)
F-statistic & Prob(F-statistic)
The F-test for overall significance is a statistical test that assesses whether a linear regression model provides a statistically significant improvement in fit compared to a baseline model that uses the mean of the dependent variable.
In linear regression, the goal is to find the best-fitting line that explains the relationship between the independent variable(s) and the dependent variable. The F-test helps determine if this relationship is statistically significant.
The F-test compares the variability explained by the regression model (explained variance) to the variability not explained by the model (residual variance).
It calculates an F-statistic, which is the ratio of the explained variance to the residual variance. If the F-statistic is sufficiently large (i.e., the explained sum of squares, ESS, is large compared to the unexplained residual sum of squares, RSS), it indicates that the regression model provides a significantly better fit than the baseline model.
Note → We have already discussed TSS, ESS, and RSS in an earlier post; make sure you understand those concepts before proceeding.
Here is the link.
Let’s consider an example to illustrate the F-test for overall significance.
Suppose we are interested in understanding the factors that affect the salary of employees in a company. We collect data on the following variables for a sample of 50 employees: years of experience, level of education (measured as highest degree earned), gender (1 for male, 0 for female), and job position (coded as 1 for entry-level, 2 for mid-level, and 3 for senior level).
We want to determine whether the linear regression model we build using these variables is statistically significant in explaining the variation in salary.
Here are the steps we would follow:
- State the null and alternative hypotheses: Null hypothesis (H0): All regression coefficients (except the intercept) are equal to zero (β1 = β2 = β3 = β4 = 0), meaning that none of the independent variables contribute significantly to the explanation of the variation in salary. Alternative hypothesis (H1): At least one regression coefficient is not equal to zero, indicating that at least one independent variable contributes significantly to the explanation of the variation in salary.
- Fit the linear regression model to the data, estimating the regression coefficients (intercept and slopes).
- Calculate the Sum of Squares (SS) values: Total Sum of Squares (TSS): The sum of squared differences between each observed value of salary and its mean. Regression Sum of Squares (ESS): The sum of squared differences between the predicted values of salary and its mean. Residual Sum of Squares (RSS): The sum of squared differences between the observed values and the predicted values of salary.
- Compute the Mean Squares (MS) values: Mean Square Regression (MSR): ESS divided by the degrees of freedom for the model (df_model), which is the number of independent variables (k); this can be thought of as the average explained variance per independent feature. Mean Square Error (MSE): RSS divided by the degrees of freedom for the residuals (df_residuals), which is the number of data points (n) minus the number of estimated parameters, including the intercept (k+1); this can be thought of as the average unexplained variance per residual degree of freedom.
- Calculate the F-statistic: F-statistic = MSR / MSE
- Determine the p-value: Compute the p-value associated with the calculated F-statistic using the F-distribution or a statistical software package.
- Compare the p-value to the chosen significance level (α): If the p-value < α, reject the null hypothesis. This indicates that at least one independent variable contributes significantly to the prediction of salary, and the overall regression model is statistically significant. If the p-value ≥ α, fail to reject the null hypothesis. This suggests that none of the independent variables in the model contribute significantly to the prediction of salary, and the overall regression model is not statistically significant.
For example, let’s say that after performing the F-test, we obtain an F-statistic of 4.67 and a p-value of 0.01. Assuming a significance level of 0.05, we can see that the p-value is less than the significance level, so we reject the null hypothesis. This indicates that at least one independent variable contributes significantly to the prediction of salary, and the overall regression model is statistically significant.
In conclusion, the F-test for overall significance allows us to determine whether the linear regression model we have built is statistically significant in explaining the variation in the dependent variable. It helps us to evaluate the usefulness of the independent variables in predicting the dependent variable, and whether they should be included in the model or not.
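To make these steps concrete, here is a minimal Python sketch, assuming a made-up synthetic dataset loosely modeled on the salary example (50 employees, four predictors; the coefficients, noise level, and seed are arbitrary choices for illustration). It fits the model by ordinary least squares, computes TSS, ESS, and RSS, forms MSR and MSE, and derives the F-statistic and its p-value from the F-distribution with scipy.
import numpy as np
from scipy import stats

# Synthetic stand-in for the salary example: 50 employees, 4 predictors
np.random.seed(0)
n, k = 50, 4
experience = np.random.uniform(0, 20, n)        # years of experience
education = np.random.randint(1, 4, n)          # highest degree earned (1-3)
gender = np.random.randint(0, 2, n)             # 1 = male, 0 = female
position = np.random.randint(1, 4, n)           # 1 = entry, 2 = mid, 3 = senior
salary = 30 + 2 * experience + 5 * education + 8 * position + np.random.normal(0, 5, n)

# Fit the regression by ordinary least squares (intercept + 4 slopes)
X = np.column_stack([np.ones(n), experience, education, gender, position])
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)
y_hat = X @ beta

# Sum of squares
tss = np.sum((salary - salary.mean()) ** 2)     # total
ess = np.sum((y_hat - salary.mean()) ** 2)      # explained (regression)
rss = np.sum((salary - y_hat) ** 2)             # residual

# Mean squares, F-statistic, and p-value
msr = ess / k                                   # df_model = k
mse = rss / (n - k - 1)                         # df_residuals = n - k - 1
f_stat = msr / mse
p_value = stats.f.sf(f_stat, k, n - k - 1)      # upper-tail F probability

print(f"F-statistic = {f_stat:.2f}, p-value = {p_value:.4g}")
If the printed p-value falls below the chosen α (say 0.05), the overall model is judged statistically significant, exactly as in the worked example above.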
R-squared (R²)
R-squared (R²), also known as the coefficient of determination, is a measure used in regression analysis to assess the goodness-of-fit of a model. It quantifies the proportion of the variance in the dependent variable (response variable) that can be explained by the independent variables (predictor variables) in the regression model. R-squared is a value between 0 and 1, with higher values indicating a better fit of the model to the observed data.
In the context of a simple linear regression, R² is calculated as the square of the correlation coefficient (r) between the observed and predicted values. In multiple regression, R² is obtained from the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS):
R² = ESS / TSS, where ESS is the explained sum of squares and TSS is the total sum of squares.
An R-squared value of 0 indicates that the model does not explain any of the variance in the response variable, while an R-squared value of 1 indicates that the model explains all of the variance. However, R-squared can be misleading in some cases, especially when the number of predictor variables is large or when the predictor variables are not relevant to the response variable.
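As a quick sanity check of the two equivalent definitions above, here is a short sketch on arbitrary one-predictor data (the data and seed are made up): the squared correlation between observed and predicted values, the ESS / TSS ratio, and sklearn’s r2_score all give the same number.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Arbitrary one-predictor dataset
np.random.seed(1)
x = np.random.normal(0, 1, 100)
y = 1.5 * x + np.random.normal(0, 1, 100)

model = LinearRegression().fit(x.reshape(-1, 1), y)
y_hat = model.predict(x.reshape(-1, 1))

# Definition 1: squared correlation between observed and predicted values
r = np.corrcoef(y, y_hat)[0, 1]

# Definition 2: explained sum of squares over total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)
tss = np.sum((y - y.mean()) ** 2)

print(f"r^2       = {r ** 2:.4f}")
print(f"ESS / TSS = {ess / tss:.4f}")
print(f"r2_score  = {r2_score(y, y_hat):.4f}")  # all three match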
Because plain R-squared can be misleading when there are many or irrelevant predictors, adjusted R-squared comes into the picture.
Adjusted R-squared
Adjusted R-squared is a modified version of R-squared (R²) that adjusts for the number of predictor variables in a multiple regression model. It provides a more accurate measure of the goodness-of-fit of a model by considering the model’s complexity.
In a multiple regression model, R-squared (R²) measures the proportion of variance in the response variable that is explained by the predictor variables. However, R-squared always increases or stays the same with the addition of new predictor variables, regardless of whether those variables contribute valuable information to the model. This can lead to overfitting, where a model becomes too complex and starts capturing noise in the data instead of the underlying relationships.
Adjusted R-squared accounts for the number of predictor variables in the model and the sample size, penalizing the model for adding unnecessary complexity. Adjusted R-squared can decrease when an irrelevant predictor variable is added to the model, making it a better metric for comparing models with different numbers of predictor variables.
The formula for adjusted R-squared is:
Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)
where:
- R² is the R-squared of the model
- n is the number of observations in the dataset
- k is the number of predictor variables in the model
By using adjusted R-squared, you can more accurately assess the goodness-of-fit of a model and choose the optimal set of predictor variables for your analysis.
Code comparison and demonstration of R-squared vs. adjusted R-squared
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Generate synthetic data
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = np.random.normal(0, 1, n)
irrelevant_predictors = np.random.normal(0, 1, (n, 10))
y = 2 * x1 + 3 * x2 + np.random.normal(0, 1, n)

# Helper function to calculate adjusted R-squared
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Fit linear regression models with different predictors
X = pd.DataFrame({'x1': x1, 'x2': x2})
X_with_irrelevant = pd.concat(
    [X] + [pd.Series(irrelevant_predictors[:, i], name=f"irrelevant_{i}") for i in range(10)],
    axis=1
)
model1 = LinearRegression().fit(X, y)
model2 = LinearRegression().fit(X_with_irrelevant, y)

# Calculate R-squared and adjusted R-squared for each model
models = [
    ('Model with relevant predictors', model1, X.shape[1]),
    ('Model with irrelevant predictors', model2, X_with_irrelevant.shape[1]),
]
for name, model, k in models:
    r2 = r2_score(y, model.predict(X_with_irrelevant.iloc[:, :k]))
    adj_r2 = adjusted_r2(r2, n, k)
    print(f"{name}: R-squared = {r2:.3f}, Adjusted R-squared = {adj_r2:.3f}")
Model with relevant predictors: R-squared = 0.912, Adjusted R-squared = 0.910
Model with irrelevant predictors: R-squared = 0.919, Adjusted R-squared = 0.908
Which one should be used?
The choice between using R-squared and adjusted R-squared depends on the context and the goals of your analysis. Here are some guidelines to help you decide which one to use:
- Model comparison: If you’re comparing models with different numbers of predictor variables, it’s better to use adjusted R-squared. This is because adjusted R-squared takes into account the complexity of the model, penalizing models that include irrelevant predictor variables. R-squared, on the other hand, can be misleading in this context, as it tends to increase with the addition of more predictor variables, even if they don’t contribute valuable information to the model.
- Model interpretation: If you’re interested in understanding the proportion of variance in the response variable that can be explained by the predictor variables in the model, R-squared can be a useful metric. However, keep in mind that R-squared does not provide information about the significance or relevance of individual predictor variables. It’s also important to remember that a high R-squared value does not necessarily imply causation or a good predictive model.
- Model selection and overfitting: When building a model and selecting predictor variables, it’s important to guard against overfitting. In this context, adjusted R-squared can be a helpful metric, as it accounts for the number of predictor variables and penalizes the model for unnecessary complexity. By using adjusted R-squared, you can avoid including irrelevant predictor variables that might lead to overfitting.
In summary, adjusted R-squared is generally more suitable when comparing models with different numbers of predictor variables or when you’re concerned about overfitting. R-squared can be useful for understanding the overall explanatory power of the model, but it should be interpreted with caution, especially in cases with many predictor variables or potential multicollinearity.
T-statistic
Performing a t-test for a simple linear regression, including the intercept term and using the p-value approach, involves the following steps:
1. State the null and alternative hypotheses for the slope and intercept coefficients:
For the slope coefficient (β1):
○ Null hypothesis (H0): β1 = 0 (no relationship between the predictor variable (X) and the response variable (y))
○ Alternative hypothesis (H1): β1 ≠ 0 (a relationship exists between the predictor variable and the response variable)
For the intercept coefficient (β0):
○ Null hypothesis (H0): β0 = 0 (the regression line passes through the origin)
○ Alternative hypothesis (H1): β0 ≠ 0 (the regression line does not pass through the origin)
2. Estimate the slope and intercept coefficients (b0 and b1):
Using the sample data, calculate the slope (b1) and intercept (b0) coefficients for the regression model.
3. Calculate the standard errors for the slope and intercept coefficients (SE(b0) and SE(b1)):
Compute the standard errors of the slope and intercept coefficients using the following formulas:
SE(b1) = √( MSE / Σ(xᵢ − x̄)² )
SE(b0) = √( MSE × (1/n + x̄² / Σ(xᵢ − x̄)²) )
where MSE = RSS / (n − 2) is the residual mean square and x̄ is the mean of the predictor.
4. Compute the t-statistics for the slope and intercept coefficients:
Calculate the t-statistics for the slope and intercept coefficients using the following formulas:
t(b1) = b1 / SE(b1)
t(b0) = b0 / SE(b0)
5. Calculate the p-values for the slope and intercept coefficients:
Using the t-statistics and the degrees of freedom, look up the corresponding p-values from the t-distribution table or use a statistical calculator.
6. Compare the p-values to the chosen significance level (α):
A common choice for α is 0.05, which corresponds to a 95% confidence level. Compare the calculated p-values to α:
○ If the p-value is less than or equal to α, reject the null hypothesis.
○ If the p-value is greater than α, fail to reject the null hypothesis.
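The sketch below implements these steps on made-up data (the coefficients, noise, and seed are arbitrary), computing the least-squares estimates, their standard errors, the t-statistics, and the two-sided p-values with df = n − 2 using scipy.
import numpy as np
from scipy import stats

# Made-up simple linear regression data
np.random.seed(7)
n = 30
x = np.random.uniform(0, 10, n)
y = 4 + 1.2 * x + np.random.normal(0, 2, n)

# Step 2: least-squares estimates of the slope (b1) and intercept (b0)
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Step 3: standard errors from the residual mean square (MSE = RSS / (n - 2))
residuals = y - (b0 + b1 * x)
mse = np.sum(residuals ** 2) / (n - 2)
se_b1 = np.sqrt(mse / np.sum((x - x_bar) ** 2))
se_b0 = np.sqrt(mse * (1 / n + x_bar ** 2 / np.sum((x - x_bar) ** 2)))

# Steps 4-5: t-statistics and two-sided p-values (df = n - 2)
t_b1, t_b0 = b1 / se_b1, b0 / se_b0
p_b1 = 2 * stats.t.sf(abs(t_b1), df=n - 2)
p_b0 = 2 * stats.t.sf(abs(t_b0), df=n - 2)

print(f"slope:     b1 = {b1:.3f}, t = {t_b1:.2f}, p = {p_b1:.4g}")
print(f"intercept: b0 = {b0:.3f}, t = {t_b0:.2f}, p = {p_b0:.4g}")
Step 6 is then the usual comparison of each printed p-value with the chosen α.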
Confidence Intervals for Coefficients
1. Estimate the slope and intercept coefficients (b0 and b1):
Using the sample data, calculate the slope (b1) and intercept (b0) coefficients for the regression model.
2. Calculate the standard errors for the slope and intercept coefficients (SE(b0) and SE(b1)): these are computed with the same formulas shown in the t-test steps above.
3. Determine the degrees of freedom:
In a simple linear regression, the degrees of freedom (df) equal the number of observations (n) minus the number of estimated parameters (2: the intercept and the slope coefficient): df = n − 2.
4. Find the critical t-value:
Look up the critical t-value from the t-distribution table or use a statistical calculator, based on the chosen confidence level (e.g., 95%) and the degrees of freedom calculated in step 3.
5. Calculate the confidence intervals for the slope and intercept coefficients:
Compute the confidence intervals for the slope (b1) and intercept (b0) coefficients using the following formulas:
b1 ± t_critical × SE(b1)
b0 ± t_critical × SE(b0)
These confidence intervals represent the range within which the true population regression coefficients are likely to fall with a specified level of confidence (e.g., 95%)
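Here is a matching sketch for the confidence intervals, reusing the same made-up data and standard-error formulas as in the t-test sketch above; the critical t-value for a 95% interval comes from scipy’s t-distribution with n − 2 degrees of freedom.
import numpy as np
from scipy import stats

# Same made-up simple regression data as in the t-test sketch
np.random.seed(7)
n = 30
x = np.random.uniform(0, 10, n)
y = 4 + 1.2 * x + np.random.normal(0, 2, n)

# Least-squares estimates and standard errors
x_bar = x.mean()
b1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
b0 = y.mean() - b1 * x_bar
residuals = y - (b0 + b1 * x)
mse = np.sum(residuals ** 2) / (n - 2)
se_b1 = np.sqrt(mse / np.sum((x - x_bar) ** 2))
se_b0 = np.sqrt(mse * (1 / n + x_bar ** 2 / np.sum((x - x_bar) ** 2)))

# Critical t-value for a 95% confidence level with df = n - 2
t_crit = stats.t.ppf(0.975, df=n - 2)

# Confidence intervals: estimate +/- t_crit * standard error
print(f"slope:     {b1:.3f} +/- {t_crit * se_b1:.3f}")
print(f"intercept: {b0:.3f} +/- {t_crit * se_b0:.3f}")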
Notes for this story → Session on Assumptions of Linear Regression.pdf — Google Drive
Code used → regression-analysis.ipynb — Colaboratory (google.com)