
Striking the Perfect Balance: Balancing Bias and Variance in Machine Learning

Balancing bias and variance is essential to building models that generalize well to new data. In this post, you will learn the key concepts of bias and variance in machine learning, with practical Python examples and best practices.

Carlos Raul Morales
9 min read · May 10, 2023


Introduction

Machine learning is all about making predictions based on data, but how can we tell if a model’s predictions are any good? This is where the concepts of bias and variance come in. Think of bias in terms of a bullseye: a model with high bias consistently hits the same spot, but that spot may be far from the true value. Variance, on the other hand, is like a shotgun: a model with high variance scatters its shots all over the place, making it hard to know where the next one will land.

In this post, we will discuss one of the most essential concepts in machine learning, the bias-variance tradeoff. By understanding this concept, we can choose the right model and optimize its performance.

TL;DR:

  • Bias refers to the difference between the true value and the expected value of a model’s predictions, while variance refers to the variability of the model’s predictions.
  • Balancing bias and variance is key to building models that generalize well to new data.
  • Techniques such as regularization, early stopping, ensembling, and hyperparameter tuning can help reduce bias and variance and improve model performance.
  • Hyperparameter tuning can help you find the sweet spot that gives the best bias-variance tradeoff.

Let’s get started by defining what bias and variance are in the context of machine learning.

Section 2: The Battle Between Bias and Variance

Imagine you’re trying to build a machine-learning model to predict the winners of horse races. But here’s the problem — your model is either too simple (biased) or too complex (high variance).

If your model is biased, it’s like you always bet on the same horse, even if the data shows that another horse is more likely to win. This can lead to poor predictions. If your model has high variance, you keep changing your bet every time you run the model. Sometimes you might win big, but other times you might lose everything. This makes your predictions unreliable. Bias and variance are like two wrestlers in the ring, constantly battling for dominance in your model.

High bias reflects a model’s tendency to oversimplify: it underfits the data and fails to capture important patterns.

High variance reflects a model’s tendency to overcomplicate: it overfits the data and captures noise along with the signal.

Bias and variance contributing to total error (taken from https://scott.fortmann-roe.com/docs/BiasVariance.html)
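
For squared-error loss, this picture has an exact formula: the expected prediction error at a point x decomposes into a bias term, a variance term, and irreducible noise (the standard decomposition, with f the true function, f̂ the trained model, and σ² the noise variance):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}} + \sigma^2$$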

To understand this better, let’s consider the following example. Suppose we have a dataset with one feature (x) and one target variable (y). The linear regression model tries to fit a line to the data that minimizes the error between predicted and true values.

If we use a linear regression model with a single feature to fit a non-linear relationship between x and y, the model will have high bias and low variance. This is because the model is too simple to capture the complexity of the data.

On the other hand, if we use a high-degree polynomial to fit the data, the model will have low bias and high variance. This is because the model is too complex and captures the noise in the data.

See the example below:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

import seaborn as sns

# use the seaborn style for plotting
sns.set(style='whitegrid', palette='muted', font_scale=1)

# generate a random dataset with a non-linear relationship
num_of_samples = 100
np.random.seed(42)
x = np.linspace(-1, 1, num_of_samples)
y = np.sin(1.2 * np.pi * x) + np.random.normal(0, 0.25, size=num_of_samples)
x_train, x_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y.reshape(-1, 1), test_size=0.3, random_state=1)

# polynomial degrees to evaluate
degrees = [1, 2, 4, 8, 16, 20, 26, 32]
train_mse = []
test_mse = []

fig, axx = plt.subplots(nrows=3, ncols=3, sharex=True,
                        sharey=True, figsize=(24, 10))

for i in range(len(degrees)):
    degree = degrees[i]
    ax = axx[i // 3, i % 3]

    # Build polynomial features and fit a linear regression on them
    poly_features_train = np.column_stack(
        [x_train**d for d in range(1, degree + 1)])
    poly_features_test = np.column_stack(
        [x_test**d for d in range(1, degree + 1)])
    model = LinearRegression().fit(poly_features_train, y_train)

    # Sort x values in ascending order for a smooth prediction curve
    x_sorted = np.sort(x).reshape(-1, 1)

    # Predict corresponding y values using the trained model
    poly_features_sorted = np.column_stack(
        [x_sorted**d for d in range(1, degree + 1)])
    y_sorted_pred = model.predict(poly_features_sorted)

    # Predict on training and testing data
    y_train_pred = model.predict(poly_features_train)
    y_test_pred = model.predict(poly_features_test)

    # Compute MSE
    train_mse.append(mean_squared_error(y_train, y_train_pred))
    test_mse.append(mean_squared_error(y_test, y_test_pred))

    # Plot predictions and data
    ax.plot(x_sorted, y_sorted_pred, 'r-', label='prediction')
    ax.plot(x_train, y_train, 'b.', label='training data')
    ax.plot(x_test, y_test, 'g.', label='testing data')
    ax.set_title(f'Degree {degree}')
    ax.legend()

axx[2, 2].axis('off')  # hide the unused ninth panel
plt.show()

# Plot MSE vs. degree
plt.plot(degrees, train_mse, 'ro-', label='Training MSE')
plt.plot(degrees, test_mse, 'go-', label='Testing MSE')
plt.title('MSE vs. Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('MSE')
plt.legend()
plt.show()

In this example, we generate synthetic data with a non-linear relationship between x and y. Then, we fit polynomial regression models of increasing degree to the data, where degree 1 corresponds to plain linear regression.

As expected, the low-degree models have high bias and low variance, while the high-degree polynomial models have low bias and high variance. The goal is to find the optimal tradeoff between bias and variance, which in this case occurs at a polynomial degree of 4.
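
We can read this off the MSE plot, or confirm it from the lists we just populated (a quick check on this particular train/test split, not a substitute for cross-validation):

best_idx = int(np.argmin(test_mse))  # index of the lowest test MSE
print(f'Degree with the lowest test MSE: {degrees[best_idx]}')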

Section 3: Regularization

Regularization balances the tradeoff between the bias and variance of a model by adding a penalty term to the model’s objective function. The penalty term adds a cost for complexity, encouraging the model to learn simpler, more generalizable representations. This penalty can take different forms depending on the technique used; the two most common, described below (and written out formally after the list), are L1 and L2 regularization.

  • L1 regularization: also known as Lasso regularization, adds a penalty term that is proportional to the absolute values of the model parameters. This technique encourages the model to have sparse weights, meaning that some weights will be set to zero, leading to a simpler model.
  • L2 regularization: also known as Ridge regularization, adds a penalty term proportional to the square of the model parameters. This technique encourages the model to have small weights, reducing the impact of individual features and making the model more robust to noise in the data.
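
Concretely, for a linear model trained with squared-error loss, the two penalized objectives can be written as follows (a standard formulation; the strength parameter α corresponds to the alpha argument in scikit-learn’s Lasso and Ridge):

$$J_{\text{L1}}(\mathbf{w}) = \sum_{i=1}^{n}\big(y_i - \mathbf{w}^\top\mathbf{x}_i\big)^2 + \alpha\sum_{j=1}^{p}|w_j|$$

$$J_{\text{L2}}(\mathbf{w}) = \sum_{i=1}^{n}\big(y_i - \mathbf{w}^\top\mathbf{x}_i\big)^2 + \alpha\sum_{j=1}^{p}w_j^2$$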

To see this in action, we will use Ridge regression as a form of regularization to prevent overfitting. We create a pipeline with PolynomialFeatures and Ridge regression, fit it to the training data, and observe how the regularization strength affects the bias-variance tradeoff.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

import seaborn as sns

# use the seaborn style for plotting
sns.set(style='whitegrid', palette='muted', font_scale=1)

# generate the same synthetic dataset as before
num_of_samples = 100
np.random.seed(42)
x = np.linspace(-1, 1, num_of_samples)
y = np.sin(1.2 * np.pi * x) + np.random.normal(0, 0.25, size=num_of_samples)
x_train, x_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y.reshape(-1, 1), test_size=0.3, random_state=1)

degree = 20
train_mse = []
test_mse = []
alphas = [0, 0.001, 0.01, 0.05, 0.1, 1, 5, 10, 100]  # regularization strengths

fig, axx = plt.subplots(nrows=3, ncols=3, sharex=True,
                        sharey=True, figsize=(26, 15))

for i in range(len(alphas)):
    ax = axx[i // 3, i % 3]
    alpha = alphas[i]

    # Create a pipeline with PolynomialFeatures and Ridge regression
    model = make_pipeline(
        PolynomialFeatures(degree),
        Ridge(alpha=alpha)
    )

    # Fit the model to the training data
    model.fit(x_train, y_train)

    # Sort x values in ascending order for a smooth prediction curve
    x_sorted = np.sort(x).reshape(-1, 1)

    # Predict corresponding y values using the trained model
    y_sorted_pred = model.predict(x_sorted)

    # Predict on training and testing data
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)

    # Compute MSE
    train_mse.append(mean_squared_error(y_train, y_train_pred))
    test_mse.append(mean_squared_error(y_test, y_test_pred))

    # Plot predictions and data
    ax.plot(x_sorted, y_sorted_pred, 'r-', label='prediction')
    ax.plot(x_train, y_train, 'b.', label='training data')
    ax.plot(x_test, y_test, 'g.', label='testing data')
    ax.set_title(f'Regularization {alpha}, degree {degree}')
    ax.legend()

plt.tight_layout()
plt.show()
Ridge regularization applied to polynomial regression

As we can see, when we apply regularization to the degree-20 polynomial regression, the model finds it harder to overfit the data. As the strength of the penalty increases, the model is encouraged to find simpler and more generalizable representations of the data. However, we must be cautious not to over-regularize, as this can lead to underfitting and a model with high bias.

In other words, we must find the sweet spot that maximizes the model’s performance without sacrificing its ability to capture the complexity of the data.
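
One simple way to locate that sweet spot is to plot the MSE values collected in the loop above against the regularization strength (a minimal sketch reusing the train_mse and test_mse lists; a symlog x-scale is used so that alpha = 0 fits on a log-like axis):

# Plot MSE vs. regularization strength
plt.figure(figsize=(8, 5))
plt.plot(alphas, train_mse, 'ro-', label='Training MSE')
plt.plot(alphas, test_mse, 'go-', label='Testing MSE')
plt.xscale('symlog', linthresh=1e-3)  # symlog handles alpha = 0
plt.title('MSE vs. Regularization Strength')
plt.xlabel('alpha')
plt.ylabel('MSE')
plt.legend()
plt.show()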

Section 4: Hyperparameter Tuning

Hyperparameter tuning is a powerful technique that can help improve the tradeoff between bias and variance in machine learning models. By fine-tuning the hyperparameters of a model, we can achieve better performance and reduce overfitting or underfitting.

In the case of polynomial regression, hyperparameter tuning can be applied to the degree of the polynomial used to fit the data, as well as the regularization parameter used to penalize large coefficients. A common approach to hyperparameter tuning is grid search, which involves systematically evaluating the model performance for a range of hyperparameter values.

To implement grid search, we need to define a range of values for each hyperparameter of interest. For example, in the case of polynomial regression, we can define a range of polynomial degrees to evaluate, such as [1, 2, 3, …, 20]. We can then train and evaluate the model for each combination of hyperparameters using cross-validation.

Cross-validation involves splitting the data into multiple folds and training the model on a subset of the data while evaluating its performance on the remaining data. This allows us to obtain a more accurate estimate of the model’s performance and avoid overfitting the training data.
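
As a quick illustration of how cross-validation scores a single configuration (a minimal sketch using scikit-learn’s cross_val_score; the degree of 8 is an arbitrary example, not a recommendation):

from sklearn.model_selection import cross_val_score

# Score a degree-8 polynomial pipeline with 5-fold cross-validation
model = make_pipeline(PolynomialFeatures(8), LinearRegression())
scores = cross_val_score(model, x_train, y_train,
                         cv=5, scoring='neg_mean_squared_error')
print(f'CV MSE: {-scores.mean():.3f} (+/- {scores.std():.3f})')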

After evaluating the model performance for all hyperparameter combinations, we can select the best set of hyperparameters that achieve the lowest validation error. This set of hyperparameters can then be used to train the final model on the full training set and evaluate its performance on a holdout test set.

See the example below:

from sklearn.model_selection import GridSearchCV

degrees = [1, 2, 4, 8, 16, 20, 26, 32]

# create a pipeline of polynomial features followed by linear regression
pipeline = make_pipeline(PolynomialFeatures(), LinearRegression())

# set the hyperparameter space
param_grid = {'polynomialfeatures__degree': degrees}

# perform grid search with 5-fold cross-validation
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(x_train, y_train)

# get the best hyperparameters and model
best_degree = grid.best_params_['polynomialfeatures__degree']
best_model = grid.best_estimator_

# Sort x values in ascending order for a smooth prediction curve
x_sorted = np.sort(x).reshape(-1, 1)

# Predict corresponding y values using the best model
y_sorted_pred = best_model.predict(x_sorted)

# Predict on training and testing data
y_train_pred = best_model.predict(x_train)
y_test_pred = best_model.predict(x_test)

# Compute MSE
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

# Plot predictions and data
fig, ax1 = plt.subplots(nrows=1, ncols=1, figsize=(16, 6))
ax1.plot(x_sorted, y_sorted_pred, 'r-', label='prediction')
ax1.plot(x_train, y_train, 'b.', label='training data', markersize=10)
ax1.plot(x_test, y_test, 'g.', label='testing data', markersize=10)
plt.title(
    f'Degree {best_degree}, Train MSE={train_mse:.3f}, Test MSE={test_mse:.3f}')
plt.legend()

plt.tight_layout()
plt.show()
Best model after hyperparameter tuning.

This can be a time-consuming process, especially for complex models and large datasets. However, it can be a powerful technique for improving the performance of a model and achieving a better tradeoff between bias and variance.
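
When the grid becomes too large to evaluate exhaustively, a common alternative is randomized search, which samples a fixed number of hyperparameter combinations instead of trying them all. Below is a minimal sketch using scikit-learn’s RandomizedSearchCV (the search space is illustrative, not tuned):

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

pipeline = make_pipeline(PolynomialFeatures(), Ridge())
param_distributions = {
    'polynomialfeatures__degree': range(1, 33),  # sampled uniformly
    'ridge__alpha': loguniform(1e-3, 1e2),       # sampled on a log scale
}
search = RandomizedSearchCV(pipeline, param_distributions,
                            n_iter=20, cv=5, random_state=42)
search.fit(x_train, y_train)
print(search.best_params_)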

References

  1. “Understanding the Bias-Variance Tradeoff” by Scott Fortmann-Roe: This is a classic article that explains the concept of bias and variance in an easy-to-understand way, with clear examples and illustrations. It can be found at: https://scott.fortmann-roe.com/docs/BiasVariance.html
  2. “The Bias-Variance Tradeoff” by Sebastian Thrun: This is a lecture from the Stanford CS229 course that covers the concept of bias and variance in detail, with a focus on how it applies to machine learning models. It can be found at: https://www.youtube.com/watch?v=EuBBz3bI-aA
  3. “The Bias-Variance Tradeoff in Machine Learning” by Jason Brownlee: This is a blog post that covers the concept of bias and variance in machine learning, with clear examples and practical advice for how to balance the two. It can be found at: https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
