Descriptive Statistics with Python — Learning Day 7

Linear regression

Gianpiero Andrenacci
Data Bistrot
Aug 16, 2024


Descriptive Statistics with Python — All rights reserved

Just as natural phenomena follow discernible patterns, linear regression helps us uncover relationships and make predictions based on data.

At its core, linear regression analyzes the relationship between two variables, allowing us to model how changes in one variable influence another. By fitting a straight line through a set of data points, we can predict future values and understand underlying trends. This simple yet profound technique is foundational in fields ranging from economics and biology to engineering and social sciences.

Linear regression connects us to the predictive patterns inherent in the world, allowing us to make sense of past data and forecast future events. This powerful technique not only enhances our understanding of relationships within data but also equips us with the tools to navigate and anticipate the complexities of business.

Imagine a scenario where a business wants to predict sales based on advertising spend. By applying linear regression, they can identify the strength and direction of the relationship between these two variables, enabling informed decision-making and strategic planning.

Prediction Based on Correlation

When two variables are correlated, the relationship between them can be used for predictive purposes. For instance, consider the relationship between exercise frequency and body weight. If these variables are strongly correlated, knowing an individual’s exercise frequency can help predict their body weight. The predictive accuracy improves with the strength of the correlation between the two variables.

Let’s imagine we have data indicating that study hours and exam scores are positively correlated. This means that students who study more tend to score higher on exams. By using the number of hours a student studies, we can predict their exam scores. The stronger the correlation between study hours and exam scores, the more accurate our predictions will be.

Regression Line

A regression line is a statistical tool used to describe the relationship between two variables. It is a straight line that best fits the data points on a scatterplot, minimizing the distance between the points and the line itself. The regression line is fundamental in predicting values and understanding the relationship between variables in regression analysis.

Least Squares Regression Line

The least squares regression line is the most common method for fitting a line to a set of data points. This line is determined by minimizing the sum of the squares of the vertical distances (errors) between the observed values and the values predicted by the line. The formula for the least squares regression line is:

ŷ = b0 + b1·x

Where:

  • ŷ is the predicted value of the dependent variable.
  • b0 is the y-intercept of the regression line.
  • b1 is the slope of the regression line.
  • x is the independent variable.

The slope b1 and the intercept b0 are calculated using the following formulas:

b1 = (n·∑(x·y) − ∑x·∑y) / (n·∑x² − (∑x)²)

b0 = (∑y − b1·∑x) / n

Where n is the number of data points, x and y are the individual data points, and ∑ denotes summation over all data points.

Predictive Errors

Predictive errors (or residuals) are the differences between the observed values and the values predicted by the regression line. These errors indicate how well the regression line fits the data. The formula for the error of the i-th data point is:

ei = yi − ŷi

Where:

  • ei is the error for the i-th data point.
  • yi is the observed value of the dependent variable for the i-th data point.
  • ŷi is the predicted value of the dependent variable for the i-th data point.

Predictive errors help in assessing the accuracy of the regression model. Smaller errors indicate a better fit, while larger errors suggest that the model may not be accurately capturing the relationship between the variables.

Total Predictive Error

The total predictive error (also known as the sum of squared errors, SSE) is the sum of the squares of all the predictive errors. It quantifies the overall discrepancy between the observed values and the values predicted by the regression line. The formula for SSE is:

SSE = ∑ ei² = ∑ (yi − ŷi)²

Minimizing the SSE is the primary goal of the least squares regression method, as it results in the best-fitting line for the data. The lower the SSE, the better the regression line fits the data.

Practical Example in Python

Let’s consider a practical example where we have data on the number of hours studied and exam scores. We aim to find the least squares regression line and calculate the predictive errors.

import numpy as np
import matplotlib.pyplot as plt

# Sample data: hours studied and exam scores
hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
exam_scores = np.array([50, 55, 54, 60, 65, 68, 70, 75, 80, 85])

# Calculating the regression line parameters
n = len(hours_studied)
b1 = (n * np.sum(hours_studied * exam_scores) - np.sum(hours_studied) * np.sum(exam_scores)) / (n * np.sum(hours_studied**2) - np.sum(hours_studied)**2)
b0 = (np.sum(exam_scores) - b1 * np.sum(hours_studied)) / n

# Regression line
regression_line = b0 + b1 * hours_studied

# Plotting the data and the regression line
plt.scatter(hours_studied, exam_scores, color='blue', label='Data points')
plt.plot(hours_studied, regression_line, color='red', label='Regression line')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Scores')
plt.title('Least Squares Regression Line')
plt.legend()
plt.show()

# Calculating predictive errors
predictive_errors = exam_scores - regression_line

# Calculating the total predictive error (SSE)
sse = np.sum(predictive_errors**2)
print(f"Total Predictive Error (SSE): {sse}")

The regression line, especially the least squares regression line, is a powerful tool for understanding and predicting the relationship between variables. By minimizing predictive errors, it provides the best fit for the data.

However, it is essential to evaluate the total predictive error to assess the accuracy of the model and ensure that it captures the underlying relationship between the variables effectively.

Total Predictive Error (SSE): 20.49696969696971

Standard Error of Estimate

The standard error of estimate represents a special kind of standard deviation that reflects the magnitude of predictive error in regression analysis. It provides a rough measure of the average amount by which the observed values (known Y values) deviate from their predicted values (predicted Y values).

Understanding the Standard Error of Estimate

The standard error of estimate gives us insight into the accuracy of the predictions made by the regression line.

It essentially tells us how much the actual data points differ from the regression line on average.

A smaller standard error of estimate indicates that the regression line fits the data points more closely, while a larger standard error suggests a less accurate fit.

The formula for the standard error of estimate is:

s_y|x = √( ∑(yi − ŷi)² / (n − 2) ) = √( SSE / (n − 2) )

Where:

  • yi​ are the observed values.
  • ŷi​ are the predicted values.
  • n is the number of observations.

Let’s continue with the previous example of hours studied and exam scores to calculate the standard error of estimate.

import numpy as np
import matplotlib.pyplot as plt

# Sample data: hours studied and exam scores
hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
exam_scores = np.array([50, 55, 54, 60, 65, 68, 70, 75, 80, 85])

# Calculating the regression line parameters
n = len(hours_studied)
b1 = (n * np.sum(hours_studied * exam_scores) - np.sum(hours_studied) * np.sum(exam_scores)) / (n * np.sum(hours_studied**2) - np.sum(hours_studied)**2)
b0 = (np.sum(exam_scores) - b1 * np.sum(hours_studied)) / n

# Regression line
regression_line = b0 + b1 * hours_studied

# Plotting the data and the regression line
plt.scatter(hours_studied, exam_scores, color='blue', label='Data points')
plt.plot(hours_studied, regression_line, color='red', label='Regression line')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Scores')
plt.title('Least Squares Regression Line')
plt.legend()
plt.show()

# Calculating predictive errors
predictive_errors = exam_scores - regression_line

# Calculating the total predictive error (SSE)
sse = np.sum(predictive_errors**2)

# Calculating the standard error of estimate
standard_error_estimate = np.sqrt(sse / (n - 2))
print(f"Standard Error of Estimate (s_y|x): {standard_error_estimate}")

Standard Error of Estimate (s_y|x): 1.6006627415296495

Interpretation

The standard error of estimate provides a rough measure of the average amount of predictive error. It represents the average distance that the observed values fall from the regression line.

  • Smaller standard error of estimate: Indicates that the regression line closely fits the data points, meaning the predictions are more accurate.
  • Larger standard error of estimate​: Suggests that the regression line does not fit the data points well, indicating less accurate predictions.

The standard error of estimate is an important metric in regression analysis, providing a measure of the accuracy of the regression model’s predictions. By understanding and calculating the standard error of estimate, data scientists and analysts can better assess the fit of their regression models and the reliability of their predictions.

The Correlation Coefficient (r)

The correlation coefficient (r) quantifies the degree of linear relationship between two variables, providing valuable insights into how changes in one variable are associated with changes in another. Here are key points that highlight the importance of the correlation coefficient in predictive analysis:

Quantifies Relationship Strength and Direction

  • Strength: The value of r ranges from -1 to 1. An r value close to 1 indicates a strong positive linear relationship, while an r value close to -1 indicates a strong negative linear relationship. An r value around 0 suggests no linear relationship.
  • Direction: A positive r value means that as one variable increases, the other variable tends to increase. A negative r value indicates that as one variable increases, the other tends to decrease.
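
As a quick illustration (a minimal sketch with synthetic data invented here, not taken from the article's dataset), np.corrcoef shows how the sign and size of r capture direction and strength:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)

# Strong positive relationship: y tends to increase with x
y_pos = 2 * x + rng.normal(0, 1, size=x.size)
# Strong negative relationship: y tends to decrease as x increases
y_neg = -2 * x + rng.normal(0, 1, size=x.size)
# No linear relationship: pure noise
y_none = rng.normal(0, 1, size=x.size)

print(np.corrcoef(x, y_pos)[0, 1])   # close to +1
print(np.corrcoef(x, y_neg)[0, 1])   # close to -1
print(np.corrcoef(x, y_none)[0, 1])  # close to 0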

Facilitates Predictive Modeling

Understanding the correlation between variables is fundamental in interpreting and building predictive models. High correlation between the predictor (independent variable) and the target (dependent variable) suggests that the predictor is valuable for making accurate predictions.

Enhances Feature Selection

During the feature selection process in machine learning, variables that have high correlation with the target variable are often selected as features. This helps in improving the model’s performance by including only the most relevant variables.
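
A minimal sketch of this idea, using a small hypothetical DataFrame (the column names and the 0.3 threshold are arbitrary choices for illustration):

import numpy as np
import pandas as pd

# Hypothetical dataset: 'score' is the target, the other columns are candidate features
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'hours_studied': rng.uniform(0, 10, 100),
    'sleep_hours': rng.uniform(4, 9, 100),
    'shoe_size': rng.uniform(36, 46, 100),
})
df['score'] = 50 + 3 * df['hours_studied'] + rng.normal(0, 5, 100)

# Correlation of each candidate feature with the target
corr_with_target = df.corr()['score'].drop('score')
print(corr_with_target)

# Keep only the features whose absolute correlation exceeds an (arbitrary) threshold
selected = corr_with_target[corr_with_target.abs() > 0.3].index.tolist()
print("Selected features:", selected)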

Detects Multicollinearity

Correlation coefficients are also used to detect multicollinearity among predictor variables. Multicollinearity occurs when two or more predictors are highly correlated, which can cause issues in regression analysis. Identifying and addressing multicollinearity ensures the stability and reliability of the model.
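
One common way to check this, sketched below on hypothetical predictors, is to inspect the pairwise correlation matrix and the variance inflation factor (VIF) from statsmodels (the noise level and the usual VIF rule of thumb of roughly 5 to 10 are illustrative assumptions):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors: x2 is almost a copy of x1, so the two are highly correlated
rng = np.random.default_rng(2)
x1 = rng.normal(0, 1, 200)
x2 = x1 + rng.normal(0, 0.05, 200)   # nearly collinear with x1
x3 = rng.normal(0, 1, 200)           # unrelated predictor
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

# The pairwise correlation matrix flags the x1/x2 pair
print(X.corr().round(2))

# Variance inflation factors: values far above ~5-10 signal multicollinearity
X_const = sm.add_constant(X)
for i, name in enumerate(X_const.columns):
    print(name, round(variance_inflation_factor(X_const.values, i), 2))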

Assists in Exploratory Data Analysis (EDA)

Correlation analysis is a vital part of exploratory data analysis. It helps in understanding the underlying structure of the data, identifying relationships, and generating hypotheses for further analysis.

It’s important to recognize that correlation does not imply causation. See also:
https://medium.com/@gianpiero.andrenacci/descriptive-statistics-with-python-learning-day-5-cf33af08f032

Squared Correlation Coefficient (r²)

The squared correlation coefficient (r²), also known as the coefficient of determination, quantifies the proportion of the total variability in one variable that can be predicted from its relationship with another variable.

It provides a measure of how well the independent variable explains the variation in the dependent variable.

  • Definition: The r² value is the square of the correlation coefficient (r). It ranges from 0 to 1.
  • r² = 0: Indicates that the independent variable does not explain any of the variability in the dependent variable.
  • r² = 1: Indicates that the independent variable explains all the variability in the dependent variable.
  • Interpretation: An r² value closer to 1 implies a strong relationship where a significant proportion of the variability in the dependent variable is predictable from the independent variable. Conversely, an r² value closer to 0 implies a weak relationship.

Importance of r²

  • Predictive Power: r² is an essential measure in regression analysis as it indicates the predictive power of the model. A higher r² value means the model better explains the variation in the dependent variable.
  • Model Evaluation: It helps in evaluating and comparing the performance of different regression models. Models with higher r² values are generally preferred, as they explain more of the variability in the data.
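
To see this in code, here is a minimal sketch that reuses the hours_studied and exam_scores arrays from the earlier example (the exact values it prints will differ slightly from the rounded figures used for illustration below):

import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
exam_scores = np.array([50, 55, 54, 60, 65, 68, 70, 75, 80, 85])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(hours_studied, exam_scores)[0, 1]

# Coefficient of determination
r_squared = r**2

print(f"r: {r:.3f}")
print(f"r²: {r_squared:.3f}")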

In this example, let's assume the correlation coefficient r is approximately 0.98. Squaring this value gives r² ≈ 0.96.

This means that approximately 96% of the variability in exam scores can be predicted from the number of hours studied.

The remaining 4% of the variability is due to other factors not captured by the model.

Assumptions of the Linear Regression Model

Before performing a t-test on the coefficients of a linear regression model, it is essential to ensure that the underlying assumptions of the model are met. These assumptions are essential for the validity and reliability of the test results. Here are the primary assumptions of the linear regression model:

  1. Linearity:
  • The relationship between the independent variable (hours studied) and the dependent variable (exam scores) should be linear. This means that a change in the independent variable should result in a proportional change in the dependent variable.
  • Check: A scatter plot of the data points along with the regression line can help visually confirm linearity. If the points roughly follow a straight line, this assumption is likely satisfied.

2. Independence:

  • The observations should be independent of each other. This means that the exam scores of one student should not influence the exam scores of another.
  • Check: This assumption is more about the study design than the data itself. Ensuring that the data collection process was random and independent is key.

3. Homoscedasticity (Constant Variance of the Errors):

  • The variance of the error terms should be constant across all levels of the independent variable. In other words, the spread of the residuals (differences between observed and predicted values) should be roughly the same for all predicted values.
  • Check: A residual plot, which plots the residuals against the predicted values, can be used to assess homoscedasticity. If the residuals are randomly scattered around zero with no clear pattern, this assumption is likely met.

If you want to know more about homoscedasticity, see: https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity

4. Normality of the Error Terms:

  • The error terms (residuals) should be approximately normally distributed. This assumption is particularly important for small sample sizes, as it affects the reliability of the t-tests and confidence intervals.
  • Check: A histogram or a Q-Q plot of the residuals can be used to assess normality. If the residuals roughly follow a normal distribution, this assumption is likely satisfied.

Consequences of Violating Assumptions

If any of these assumptions are violated, the results of the t-test may not be reliable:

  • Linearity Violation: If the relationship is not linear, the model may not accurately capture the relationship between the variables, leading to biased estimates.
  • Independence Violation: If observations are not independent, it can result in underestimated standard errors, leading to inflated t-statistics and potentially spurious significance results.
  • Homoscedasticity Violation: If the error variance is not constant, it can affect the efficiency of the coefficient estimates and the reliability of hypothesis tests.
  • Normality Violation: If the error terms are not normally distributed, the confidence intervals and hypothesis tests may not be accurate, especially for small sample sizes.

Here is how you might check these assumptions using Python, continuing from the previous code example:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import shapiro

# Adding a constant for the intercept
X = sm.add_constant(hours_studied)
model = sm.OLS(exam_scores, X).fit()

# Plotting residuals vs. fitted values to check homoscedasticity
fitted_values = model.fittedvalues
residuals = model.resid
plt.scatter(fitted_values, residuals)
plt.axhline(y=0, color='r', linestyle='-')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()

# Plotting Q-Q plot to check normality
sm.qqplot(residuals, line ='45')
plt.title('Q-Q Plot')
plt.show()

# Performing Shapiro-Wilk test for normality
shapiro_test = shapiro(residuals)
print(f'Shapiro-Wilk test statistic: {shapiro_test.statistic}, p-value: {shapiro_test.pvalue}')

Shapiro-Wilk test statistic: 0.9694209098815918, p-value: 0.8854244947433472

In this example:

  • The residuals vs. fitted values plot helps assess homoscedasticity.
  • The Q-Q plot helps assess the normality of the residuals.
  • The Shapiro–Wilk test statistic is basically a measure of how well the ordered and standardized sample quantiles fit the standard normal quantiles. The statistic will take a value between 0 and 1 with 1 being a perfect match.
  • The Shapiro-Wilk test provides a statistical test for normality, where a p-value less than 0.05 suggests a deviation from normality.

By ensuring these assumptions are met, you can confidently perform t-tests and make reliable inferences about the relationship between the variables in your linear regression model.

Understanding the t-Test in Linear Regression

The t-test in the context of linear regression is used to determine whether the coefficients of the regression model are statistically significantly different from zero. This helps in understanding if the independent variable (hours studied) has a significant effect on the dependent variable (exam scores).

Steps to Conduct a t-Test in Linear Regression

  1. Formulate the Hypotheses:
  • Null Hypothesis (H0): The coefficient is equal to zero (β = 0). This implies that the independent variable has no effect on the dependent variable.
  • Alternative Hypothesis (H1): The coefficient is not equal to zero (β ≠ 0). This implies that the independent variable has a significant effect on the dependent variable.

2. Calculate the t-Statistic:

  • The t-statistic for each coefficient is calculated as:

t = β / SE(β)

where β is the estimated coefficient and SE(β) is the standard error of that estimate (a hands-on version of steps 2 to 4 is sketched after this list).

3. Determine the p-Value:

  • The p-value is determined based on the t-statistic and the degrees of freedom (df), which is typically n−k−1, where n is the number of observations and k is the number of predictors.
  • The p-value indicates the probability of observing the data assuming the null hypothesis is true.

A low p-value (typically < 0.05) suggests that the null hypothesis can be rejected.

4. Decide Whether to Reject the Null Hypothesis:

  • If the p-value is less than the chosen significance level (e.g., 0.05), reject the null hypothesis. This means there is enough evidence to conclude that the coefficient is significantly different from zero.
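
Here is a minimal, hand-rolled sketch of steps 2 to 4 using the closed-form formulas and the noisy 10-point dataset from the earlier examples (not the perfectly linear 30-point dataset used in the next example), so the numbers are purely illustrative:

import numpy as np
from scipy import stats

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
exam_scores = np.array([50, 55, 54, 60, 65, 68, 70, 75, 80, 85])

n = len(hours_studied)
x_mean, y_mean = hours_studied.mean(), exam_scores.mean()

# Least squares estimates of the slope and intercept
b1 = np.sum((hours_studied - x_mean) * (exam_scores - y_mean)) / np.sum((hours_studied - x_mean)**2)
b0 = y_mean - b1 * x_mean

# Residuals and the standard error of the slope
residuals = exam_scores - (b0 + b1 * hours_studied)
s_yx = np.sqrt(np.sum(residuals**2) / (n - 2))
se_b1 = s_yx / np.sqrt(np.sum((hours_studied - x_mean)**2))

# t-statistic for H0: beta1 = 0, and its two-sided p-value with n - 2 degrees of freedom
t_stat = b1 / se_b1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))

print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.6f}")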

Example with Python

Continuing from the previous code example, let’s perform the t-test on the linear regression model:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats import shapiro

# Sample data: hours studied and exam scores (n=30)
# Note: the scores follow the exact rule score = 48 + 2 * hours, so the fit below is perfect
hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                          11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                          21, 22, 23, 24, 25, 26, 27, 28, 29, 30])
exam_scores = np.array([50, 52, 54, 56, 58, 60, 62, 64, 66, 68,
                        70, 72, 74, 76, 78, 80, 82, 84, 86, 88,
                        90, 92, 94, 96, 98, 100, 102, 104, 106, 108])

# Adding a constant for the intercept
X = sm.add_constant(hours_studied)
model = sm.OLS(exam_scores, X).fit(cov_type='HC3') # Use robust standard errors (HC3)

# Summary of the regression model
print(model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.387e+32
Date:                Thu, 25 Jul 2024   Prob (F-statistic):               0.00
Time:                        16:19:55   Log-Likelihood:                 956.60
No. Observations:                  30   AIC:                            -1909.
Df Residuals:                      28   BIC:                            -1906.
Df Model:                           1
Covariance Type:                  HC3
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         48.0000   2.37e-15   2.03e+16      0.000      48.000      48.000
x1             2.0000   1.09e-16   1.84e+16      0.000       2.000       2.000
==============================================================================
Omnibus:                        8.068   Durbin-Watson:                   0.143
Prob(Omnibus):                  0.018   Jarque-Bera (JB):                8.160
Skew:                          -1.261   Prob(JB):                       0.0169
Kurtosis:                       2.590   Cond. No.                         36.5
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)

Interpreting the Output

Explanation of the OLS Regression Results

The output provided is a summary of an Ordinary Least Squares (OLS) regression model.

Ordinary Least Squares (OLS) regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. The main objective of OLS regression is to find the best-fitting line (or hyperplane in multiple dimensions) that minimizes the sum of the squared differences between the observed values and the values predicted by the model.

Key Concepts:

  1. Dependent Variable (y): The variable we are trying to predict or explain.
  2. Independent Variable (x): The variable(s) used to predict the dependent variable.
  3. Regression Line: The line that best fits the data, represented by the equation

ŷ = β0 + β1·x

where:

  • β0​ is the intercept (the value of y when x is 0).
  • β1 is the slope (the change in y for a one-unit change in x).

The OLS method calculates the coefficients β0 and β1 by minimizing the sum of the squared residuals (the differences between the observed and predicted values).
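
As a quick sanity check (a minimal sketch on the noisy 10-point dataset from earlier; np.polyfit is used here only as an independent reference), the closed-form formulas, np.polyfit, and statsmodels OLS should all return the same coefficients:

import numpy as np
import statsmodels.api as sm

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
exam_scores = np.array([50, 55, 54, 60, 65, 68, 70, 75, 80, 85])

# Closed-form least squares (the same formulas used earlier in the article)
n = len(hours_studied)
b1 = (n * np.sum(hours_studied * exam_scores) - np.sum(hours_studied) * np.sum(exam_scores)) / \
     (n * np.sum(hours_studied**2) - np.sum(hours_studied)**2)
b0 = (np.sum(exam_scores) - b1 * np.sum(hours_studied)) / n

# np.polyfit with degree 1 returns [slope, intercept]
slope_np, intercept_np = np.polyfit(hours_studied, exam_scores, 1)

# statsmodels OLS returns [intercept, slope] because the constant column comes first
ols_params = sm.OLS(exam_scores, sm.add_constant(hours_studied)).fit().params

print(b0, b1)
print(intercept_np, slope_np)
print(ols_params)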

Here’s a detailed breakdown of the output for the model summary:

Model Summary

  • Dep. Variable: The dependent variable (exam scores in this case).
  • Model: Specifies that the model used is OLS (Ordinary Least Squares).
  • Method: Indicates that the method used is Least Squares.
  • No. Observations: The number of observations in the dataset (30 in this case).
  • Df Residuals: Degrees of freedom of the residuals, calculated as the number of observations minus the number of parameters estimated (30 − 2 = 28).
  • Df Model: Degrees of freedom of the model, equal to the number of parameters estimated minus one (2 − 1 = 1).

Fit Statistics

  • R-squared: A measure of how well the independent variable(s) explain the variance in the dependent variable. An R-squared of 1.000 indicates a perfect fit, meaning the model explains all the variability in the dependent variable.
  • Adj. R-squared: Adjusted R-squared, which adjusts the R-squared value based on the number of predictors in the model. In this case, it is also 1.000, indicating a perfect fit.
  • F-statistic: Tests whether the model as a whole explains a significant share of the variance in the dependent variable. To interpret it, compare it to a critical value from an F-table or use the associated p-value; a very large F-statistic indicates that the regression model predicts the dependent variable significantly better than simply using its mean.
  • Prob (F-statistic): The p-value associated with the F-statistic. A value of 0.00 indicates that the model is statistically significant.

Log-Likelihood and Information Criteria

  • Log-Likelihood: A measure of the model fit. Higher values indicate a better fit.
  • AIC (Akaike Information Criterion): A measure of the model’s quality that balances goodness of fit against complexity. Lower values are better: a lower AIC means the model achieves a higher likelihood with fewer parameters.
  • BIC (Bayesian Information Criterion): Similar to AIC, but with a stronger penalty for models with more parameters. Lower values are better.
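
For instance (a minimal sketch, again on the noisy 10-point dataset, so the specific numbers are only illustrative), statsmodels exposes these quantities as result attributes, which makes it easy to compare a model with a predictor against an intercept-only baseline:

import numpy as np
import statsmodels.api as sm

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
exam_scores = np.array([50, 55, 54, 60, 65, 68, 70, 75, 80, 85])

# Model with the predictor
full_model = sm.OLS(exam_scores, sm.add_constant(hours_studied)).fit()

# Intercept-only baseline (always predicts the mean of exam_scores)
baseline = sm.OLS(exam_scores, np.ones_like(exam_scores)).fit()

print("With predictor:  log-likelihood =", round(full_model.llf, 2),
      "AIC =", round(full_model.aic, 2), "BIC =", round(full_model.bic, 2))
print("Intercept only:  log-likelihood =", round(baseline.llf, 2),
      "AIC =", round(baseline.aic, 2), "BIC =", round(baseline.bic, 2))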

Coefficients

  • const: The intercept of the model. A coefficient of 48.0000 with a very small standard error indicates a precise estimate. The intercept is statistically significant with a z-value that is very high.
  • x1: The slope of the model (hours studied). A coefficient of 2.0000 with a very small standard error indicates a precise estimate. The slope is statistically significant with a z-value that is very high and a p-value of 0.000.

Diagnostic Tests

  • Omnibus: A test for normality of the residuals. A value of 8.068 with a p-value of 0.018 suggests some deviation from normality.
  • Durbin-Watson: A test for autocorrelation in the residuals. A value of 0.143 suggests positive autocorrelation.
  • Jarque-Bera (JB): Another test for normality of the residuals. A value of 8.160 with a p-value of 0.0169 also suggests some deviation from normality.
  • Skew: A measure of the asymmetry of the distribution of residuals. A skew of -1.261 indicates that the residuals are left-skewed.
  • Kurtosis: A measure of the shape of the distribution of residuals. A kurtosis of 2.590 indicates a distribution that is slightly less peaked and has lighter tails than a normal distribution.
  • Cond. No.: A measure of multicollinearity. A value of 36.5 does not indicate severe multicollinearity.
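
These diagnostics can also be reproduced directly from the residuals of any fitted model (a minimal sketch; residual_diagnostics is a helper defined here for illustration, wrapping the standard statsmodels functions):

from statsmodels.stats.stattools import durbin_watson, jarque_bera, omni_normtest

def residual_diagnostics(residuals):
    # Omnibus test of normality (the same test reported in the summary table)
    omnibus_stat, omnibus_p = omni_normtest(residuals)
    # Jarque-Bera test of normality, plus skew and kurtosis of the residuals
    jb_stat, jb_p, skew, kurtosis = jarque_bera(residuals)
    # Durbin-Watson statistic for autocorrelation (values near 2 mean little autocorrelation)
    dw = durbin_watson(residuals)
    print(f"Omnibus: {omnibus_stat:.3f} (p = {omnibus_p:.3f})")
    print(f"Jarque-Bera: {jb_stat:.3f} (p = {jb_p:.4f}), skew = {skew:.3f}, kurtosis = {kurtosis:.3f}")
    print(f"Durbin-Watson: {dw:.3f}")

# Example usage with the fitted statsmodels model from the previous code block:
# residual_diagnostics(model.resid)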

Notes

  • Standard Errors: The standard errors are robust to heteroscedasticity (HC3). This means they have been adjusted to account for non-constant variance in the residuals, making the estimates more reliable under heteroscedasticity.
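
To see the difference in practice (a minimal sketch on the noisy 10-point dataset from earlier), the same model can be fitted with ordinary and with HC3 robust standard errors and the two sets of standard errors compared:

import numpy as np
import statsmodels.api as sm

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
exam_scores = np.array([50, 55, 54, 60, 65, 68, 70, 75, 80, 85])
X = sm.add_constant(hours_studied)

# Ordinary (non-robust) standard errors
fit_plain = sm.OLS(exam_scores, X).fit()
# Heteroscedasticity-robust (HC3) standard errors
fit_hc3 = sm.OLS(exam_scores, X).fit(cov_type='HC3')

print("Non-robust standard errors:", fit_plain.bse)
print("HC3 robust standard errors:", fit_hc3.bse)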

Main Takeaways

  1. Perfect Fit: The model has an R-squared and adjusted R-squared of 1.000, indicating a perfect linear relationship between hours studied and exam scores.
  2. Statistical Significance: Both the intercept and the slope are highly statistically significant, with p-values of 0.000.
  3. Model Robustness: The standard errors are heteroscedasticity robust, which adds reliability to the coefficient estimates.
  4. Potential Issues: There are indications of autocorrelation (Durbin-Watson) and deviations from normality (Omnibus and Jarque-Bera tests) in the residuals, which could impact the reliability of the model.

This output suggests an extremely strong linear relationship between hours studied and exam scores, though some diagnostic checks indicate potential issues that might need further investigation.

Conclusion Bonus: Written in collaboration with the Interactive Writer

The Elegance of Linearity

In the world of numbers and data’s vast expanse, linear regression stands as a beacon of simplicity and grace. Its straight line, drawn through the scattered points of our experiences, whispers secrets of relationships and trends.

With every slope and intercept, it reveals the harmony within the chaos, the pattern within the noise.

Linear regression, the poet’s pen in the scientist’s hand, captures the essence of predictability and connection. It reminds us that within the complex symphony of variables, there lies a simple, elegant truth:

that life’s myriad phenomena often align along the gentle gradient of a line.

As we draw these lines through our data, we honor the timeless elegance of linearity, finding beauty in simplicity and wisdom in its straightforward path.
