Essential Math for Machine Learning: R-Squared
The Regression Fitness Score
This article is part of the series Essential Math for Machine Learning.
Introduction
Machine learning (ML) offers powerful tools for extracting insights and building predictive models from data. Understanding the mathematical concepts behind ML algorithms is crucial to making informed decisions during the model development process. One such concept is R-squared (R²), also known as the coefficient of determination. In this blog post, we’ll dive into R-squared, addressing what it is, its significance, and how it works, complete with a Python code example.
What is R-Squared?
In essence, R-squared is a statistical metric that reveals how well your machine learning model fits the data it was trained on. It represents the proportion of the variance in the dependent variable (the variable you’re trying to predict) that can be explained by the independent variables (the features you’re using for prediction).
Why Use R-Squared?
- Model Evaluation: R-squared provides a quick way to gauge the goodness-of-fit of your regression models. A higher R-squared value generally indicates a better fit of the model to the data.
- Model Comparison: When comparing different models, R-squared can help you determine which model explains more of the variability in your target variable.
- Caveat: It’s important to use R-squared in conjunction with other evaluation metrics. A high R-squared can be misleading (for example, it never decreases when you add more features to a linear model, even irrelevant ones), so interpret it alongside residual plots and other diagnostic measures.
How Does R-Squared Work?
Let’s break down the intuition behind R-squared:
- Baseline Model: Consider the simplest possible model — one that always predicts the average value of your target variable, regardless of the input features. This is our baseline.
- Total Variation: Calculate the total variation in the target variable (how much it deviates from its mean).
- Residuals: Measure the difference between the actual values and the values predicted by your machine learning model (these differences are the residuals).
- Explained Variation: Determine the variation explained by your model by seeing how much smaller the errors (residuals) are compared to the total variation.
- R-Squared Calculation: R-squared is essentially the explained variation divided by the total variation.
Formula: R² = 1 - (SS_res / SS_tot)
where:
- SS_res: Sum of squares of residuals (errors)
- SS_tot: Total sum of squares
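Written out with summations, where yᵢ is an actual value, ŷᵢ is the model’s prediction for it, and ȳ is the mean of the actual values:
R² = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²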
Example: Predicting Ice Cream Sales
Imagine you have data on ice cream sales for different days along with the corresponding temperatures:
- Temperature (X): Independent variable (what we use to predict)
- Sales (Y): Dependent variable (what we’re trying to predict)
Understanding the Components
SS_tot (Total Sum of Squares): Measures the total variation in ice cream sales. We calculate it by:
- Finding the average ice cream sales.
- For each day, subtracting the average sales from the actual sales and squaring the difference.
- Adding up all these squared differences.
SS_res (Sum of Squares of Residuals): Measures how much your model’s predictions differ from the actual sales. We calculate it by:
- For each day, finding the difference between the predicted sales and actual sales (these are the residuals).
- Squaring each residual.
- Adding up all the squared residuals.
R² Calculation
R² = 1 - (SS_res / SS_tot)
Intuition: The closer your predictions are to the real sales, the smaller SS_res will be. A small SS_res relative to SS_tot means a larger R-squared, indicating a better-fitting model.
Let’s say the total variation in sales (SS_tot) is 100, and your model’s errors result in a sum of squares of residuals (SS_res) of 20.
- R² = 1 - (20 / 100) = 0.8
An R-squared of 0.8 means your model explains 80% of the variability in ice cream sales based on temperature!
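Here is a minimal sketch of the same arithmetic in Python, using a small made-up set of sales figures and hypothetical model predictions (illustrative values only, not real data):
import numpy as np
# Hypothetical actual sales and a model's predicted sales for four days (made-up values)
actual_sales = np.array([200.0, 240.0, 260.0, 300.0])
predicted_sales = np.array([210.0, 235.0, 255.0, 300.0])
ss_tot = np.sum((actual_sales - actual_sales.mean()) ** 2)  # total variation around the mean
ss_res = np.sum((actual_sales - predicted_sales) ** 2)      # variation the model fails to explain
r2 = 1 - ss_res / ss_tot
print(ss_tot, ss_res, r2)  # 5200.0 150.0 ~0.97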
Python Implementation
Let’s illustrate this using Python’s scikit-learn library for a simple linear regression. The code is available in this Colab notebook.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Generate test data with a linear relationship
def generate_dataset(n_samples=100, slope=2, intercept=5, noise_std=0.1):
    X = np.random.rand(n_samples, 1)  # Generate random features
    y = slope * X + intercept + noise_std * np.random.randn(n_samples, 1)  # Linear signal plus Gaussian noise
    return X, y
# Calculate R-squared from scratch
def r_squared(y_true, y_pred):
    mean_y = np.mean(y_true)
    ss_tot = np.sum((y_true - mean_y)**2)  # total sum of squares
    ss_res = np.sum((y_true - y_pred)**2)  # sum of squared residuals
    r2 = 1 - (ss_res / ss_tot)
    return r2
# Generate the data
X, y = generate_dataset()
# Create a linear regression model
model = LinearRegression()
model.fit(X, y)
# Make predictions
predictions = model.predict(X)
r2 = r_squared(y, predictions)
print("R-squared from scratch:", r2)
# R-squared using scikit-learn's function
r2_sklearn = r2_score(y, predictions)
print("R-squared (scikit-learn):", r2_sklearn)
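If both implementations are correct, the two printed values will match. With the small noise level used in generate_dataset (noise_std=0.1 against a slope of 2), the R-squared should come out very close to 1.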
Interpretation
- R-squared of 1: A perfect fit — all data points lie directly on the regression line.
- R-squared of 0: Your model offers no explanatory power — it’s as good as just predicting the mean value every time.
- Typical R-squared: Most real-world models fall somewhere between 0 and 1 (R-squared can even be negative when a model fits the data worse than simply predicting the mean).
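A quick way to see the two extreme cases is to score a hypothetical set of values with scikit-learn’s r2_score (the numbers are made up for illustration):
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([3.0, 5.0, 7.0, 9.0])  # made-up observed values
# Predicting every value exactly gives an R-squared of 1
print(r2_score(y_true, y_true))  # 1.0
# Always predicting the mean gives an R-squared of 0 (no explanatory power)
mean_prediction = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, mean_prediction))  # 0.0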
Conclusion
R-squared provides a valuable metric for understanding the goodness-of-fit of your linear regression models. Remember, it’s one of several tools you should use to evaluate your model’s performance. By understanding the concepts behind R-squared, you’re better equipped to make informed choices as you build machine learning models.