Explained: Tuning Hyperparameters in Deep Learning: Part 1
In simple English and Code
This article explores the art of finding the best values for hyperparameters in deep learning. We are following up on the hyperparameters explained article.
Read that article first if you are not familiar with hyperparameters in general.
A short TL;DR: hyperparameters are the variables in AI model training that are set before the model's training starts. For example, the learning rate is a popular hyperparameter.
The set of hyperparameters we define for training an AI model can significantly impact its performance.
How exactly do we find the best values for them?
How do we know the optimal values for something like the learning rate?
The process of finding that out is called tuning. This article aims to build a clear intuition for how to approach it.
Note: This article uses AI assistance to improve the brevity and clarity of the content and code samples. However, the article is thoroughly edited and fact-checked to the best of my ability. The code samples are also tested and validated on my local machine.
Basic Tuning Techniques
Let’s start our journey with the 2011 paper, ‘Algorithms for Hyper-Parameter Optimization,’ co-authored by Yoshua Bengio (Remember him from the previous article?).
The earliest techniques relied entirely on manual effort to find optimal values. Most deep learning networks of the 2010s had about 10–50 hyperparameters, which is quite hard to tune by hand. With the rise of compute and GPU clusters, new methods emerged.
Random Search
Imagine optimizing a neural network with two hyper-parameters: learning rate and batch size.
Here’s how a random search might work:
- Define ranges: Learning rate [0.001, 0.01], batch size [16, 128].
- Randomly pick values: e.g. learning rate = 0.007, batch size = 64.
- Train the model with these values and evaluate its performance.
- Repeat the process, e.g. learning rate = 0.003, batch size = 32, and so on.
- After 100 iterations, select the combination with the highest accuracy or lowest loss.
Random search is a simple method where you randomly select values for hyper-parameters from specified ranges and evaluate the model’s performance. It’s straightforward and can often find good solutions faster than grid search (which exhaustively checks every possible combination).
However, random search might still be inefficient because it doesn’t leverage any information from previous evaluations to guide future searches.
Here’s a simple code example that implements a random search. Try to read the code structure and see how the loop searches for the best hyperparameters.
(Note that this is just a barebones demo, and the actual code and architectures used these days can differ)
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import random

# Define the device to use for training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load and preprocess the MNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)

# Define a simple neural network model
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Train and evaluate the model
def train_and_evaluate(learning_rate, batch_size):
    # Data loaders
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    # Model, loss function, and optimizer
    model = SimpleNN().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Training loop
    model.train()
    for epoch in range(3):  # 3 epochs for demonstration purposes
        for data, target in train_loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

    # Evaluation loop
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            _, predicted = torch.max(output.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

    # Calculate accuracy
    accuracy = correct / total
    return accuracy

# Define the range for hyperparameters
learning_rate_range = [0.001, 0.01]
batch_size_range = [16, 128]

# Number of iterations for random search
iterations = 10

# Track the best hyperparameters and accuracy
best_accuracy = 0
best_hyperparameters = {}

for i in range(iterations):
    # Randomly select hyperparameters
    learning_rate = random.uniform(learning_rate_range[0], learning_rate_range[1])
    batch_size = random.randint(batch_size_range[0], batch_size_range[1])

    # Train and evaluate the model
    accuracy = train_and_evaluate(learning_rate, batch_size)

    # Print the results
    print(f"Iteration {i + 1}: Learning Rate = {learning_rate:.4f}, Batch Size = {batch_size}, Accuracy = {accuracy:.4f}")

    # Update the best hyperparameters if the current accuracy is better
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_hyperparameters = {
            'learning_rate': learning_rate,
            'batch_size': batch_size
        }

# Print the best hyperparameters and accuracy
print("\nBest Hyperparameters:")
print(f"Learning Rate: {best_hyperparameters['learning_rate']}")
print(f"Batch Size: {best_hyperparameters['batch_size']}")
print(f"Accuracy: {best_accuracy:.4f}")
Iteration 1: Learning Rate = 0.0071, Batch Size = 34, Accuracy = 0.9487
Iteration 2: Learning Rate = 0.0016, Batch Size = 98, Accuracy = 0.9688
Iteration 3: Learning Rate = 0.0061, Batch Size = 24, Accuracy = 0.9465
Iteration 4: Learning Rate = 0.0017, Batch Size = 94, Accuracy = 0.9701
Iteration 5: Learning Rate = 0.0045, Batch Size = 97, Accuracy = 0.9698
Iteration 6: Learning Rate = 0.0062, Batch Size = 40, Accuracy = 0.9507
Iteration 7: Learning Rate = 0.0075, Batch Size = 107, Accuracy = 0.9600
Iteration 8: Learning Rate = 0.0036, Batch Size = 94, Accuracy = 0.9696
Iteration 9: Learning Rate = 0.0018, Batch Size = 85, Accuracy = 0.9728
Iteration 10: Learning Rate = 0.0058, Batch Size = 38, Accuracy = 0.9563
Best Hyperparameters:
Learning Rate: 0.0017920013096534215
Batch Size: 85
Accuracy: 0.9728
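To make the contrast with grid search concrete, here is a minimal sketch that gives both methods the same budget of nine trials. The `toy_score` function is a made-up stand-in for "train the model and return accuracy" (it is not the MNIST model above), and the grid values are arbitrary picks from the same ranges.

```python
import itertools
import random

def toy_score(learning_rate, batch_size):
    # Hypothetical stand-in for a real training run (an assumption for this demo).
    # The score peaks near learning_rate = 0.002; batch size matters only slightly.
    return 1.0 - 5000 * (learning_rate - 0.002) ** 2 - 0.0001 * abs(batch_size - 64)

def grid_search(lr_values, batch_values):
    # Exhaustively evaluate every combination on the grid.
    return [(lr, bs, toy_score(lr, bs))
            for lr, bs in itertools.product(lr_values, batch_values)]

def random_search(lr_range, batch_range, n_trials, seed=0):
    # Draw each hyperparameter independently from its range.
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        lr = rng.uniform(*lr_range)
        bs = rng.randint(*batch_range)
        trials.append((lr, bs, toy_score(lr, bs)))
    return trials

# Same budget for both methods: 3 x 3 grid points vs. 9 random samples.
grid_trials = grid_search([0.001, 0.0055, 0.01], [16, 72, 128])
random_trials = random_search((0.001, 0.01), (16, 128), n_trials=9)

print("distinct learning rates tried (grid):  ", len({t[0] for t in grid_trials}))
print("distinct learning rates tried (random):", len({t[0] for t in random_trials}))
print("best grid trial:  ", max(grid_trials, key=lambda t: t[2]))
print("best random trial:", max(random_trials, key=lambda t: t[2]))
```

With the same nine trials, the grid only ever tests three distinct learning rates, while random search tests nine. That wider coverage of each individual hyperparameter is one common explanation for why random search often does well when only a few hyperparameters really matter.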
Sequential Model-based Global Optimization
Sequential Model-Based Global Optimization (SMBO) offers a more intelligent approach to hyper-parameter optimization by using a surrogate model to approximate the true objective function (which measures the performance of hyper-parameter sets).
Here’s how it builds on the ideas of random search and enhances the optimization process:
- Surrogate Model: Instead of randomly sampling hyper-parameters, SMBO builds a surrogate model that predicts the performance of hyper-parameter sets. This model is cheaper to evaluate than the actual performance evaluation.
- Iterative Improvement: SMBO iteratively refines its search by:
  - Using the surrogate model to predict which hyper-parameter sets are most promising.
  - Evaluating the true performance of these sets.
  - Updating the surrogate model with the new performance data.
- Efficient Exploration: Each step uses the information from previous evaluations to make smarter choices about which hyper-parameters to try next. This reduces the number of expensive evaluations needed to find a good solution.
Gaussian Process (GP) for Hyper-Parameter Optimization
One option for the surrogate model in SMBO is the Gaussian Process (GP), a statistical model used to make predictions about unknown functions. It assumes that the function's values at any set of points follow a joint Gaussian distribution, characterized by a mean function and a covariance function. As a result, it provides both a prediction and an uncertainty estimate for every hyper-parameter set that has not been tested yet.
Using GPs in Sequential Model-Based Optimization (SMBO)
- Start Simple: Begin with no knowledge (an empty slate).
- Make Predictions: Use the GP to predict which hyper-parameters might work well.
- Test and Learn: Pick the most promising hyper-parameters and test them by running the model. This is the only expensive step because it involves training the model.
- Update Predictions: Use the results to update the GP, making it smarter for the next round.
- Iterate: Repeat this process, gradually moving towards the best hyper-parameters.
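The loop above can be sketched in a few lines of NumPy. This is a minimal illustration under several assumptions, not a production implementation: the `objective` function is a made-up stand-in for an expensive training run, the GP uses a fixed RBF kernel with hand-picked settings, and the "most promising" point is chosen with a lower-confidence-bound rule (one of several possible acquisition rules).

```python
import numpy as np

def objective(x):
    # Hypothetical stand-in for an expensive training run: validation loss
    # as a function of a single hyperparameter, x = log10(learning rate).
    return (x + 2.5) ** 2 + 0.1 * np.sin(5 * x)

def rbf_kernel(a, b, length_scale=0.5):
    # Covariance function: nearby hyperparameter values get similar losses.
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    # Standard GP regression equations on mean-centered targets.
    y_mean = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    solved = np.linalg.solve(k_tt, k_tq)
    mean = y_mean + solved.T @ (y_train - y_mean)
    var = 1.0 - np.sum(k_tq * solved, axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

rng = np.random.default_rng(0)
candidates = np.linspace(-4.0, -1.0, 200)   # search space for log10(lr)

# Start with a few random evaluations.
x_seen = rng.uniform(-4.0, -1.0, size=3)
y_seen = objective(x_seen)

for step in range(10):
    # Make predictions: cheap, uses only the surrogate model.
    mean, std = gp_posterior(x_seen, y_seen, candidates)
    # Pick the most promising candidate: low predicted loss or high uncertainty.
    acquisition = mean - 2.0 * std
    x_next = candidates[np.argmin(acquisition)]
    # Test and learn: the only expensive step in a real setting.
    y_next = objective(x_next)
    # Update predictions: the GP is refit on all data in the next iteration.
    x_seen = np.append(x_seen, x_next)
    y_seen = np.append(y_seen, y_next)

best_index = np.argmin(y_seen)
print(f"best log10(lr) = {x_seen[best_index]:.3f}, loss = {y_seen[best_index]:.4f}")
```

Only `objective` would be costly in practice; everything else is cheap linear algebra, which is why spending effort on choosing the next point pays off.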
Example to Illustrate
Imagine you’re trying to bake the perfect cake, and you have different settings like oven temperature, baking time, and ingredient amounts.
- Random Search: Randomly trying different settings might eventually give you a good cake, but it can take many attempts.
- GP Approach:
  - Initial Tries: Start with a few random cake recipes.
  - Predict: After each attempt, a friend (GP) learns which settings are promising and suggests new ones.
  - Test and Learn: You bake a cake with the suggested settings and tell your friend how it turned out.
  - Refine: Your friend updates their advice based on your feedback, getting better each time.
With the GP approach, you spend less time on failed cakes and quickly zero in on the perfect recipe.
There are more such techniques, like the Tree-structured Parzen Estimator Approach (TPE), but without complicating further, let’s build an intuitive conclusion of what we are trying to do.
All of these techniques are essentially mathematical steps to shortlist the hyperparameters that could be the best, without having to train the complete model for every candidate.
The goal is to save computation and time with such smart predictions. Only these shortlisted parameters are tested and validated.
Batch Normalization
In 2015, Sergey Ioffe and Christian Szegedy introduced a technique called Batch Normalization. This method adds normalization layers to the network (along with a few hyperparameters of their own) to address something called the internal covariate shift problem, which ultimately helps in making the training of neural networks faster and more stable.
Imagine you’re trying to learn something new, but the rules keep changing every time you make progress. This constant change makes learning slow and frustrating. In neural networks, this happens when the distribution of inputs to each layer changes during training, making it hard for the network to learn effectively. This phenomenon was referred to as the internal covariate shift.
Batch normalization tackles this issue by normalizing the inputs to each layer, using the statistics of the current mini-batch. This means adjusting the data so that it has a mean of zero and a standard deviation of one. Think of it as leveling the playing field, making sure the rules stay consistent as the network learns.
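As a quick illustration of that normalization step, here is a minimal NumPy sketch. The batch of "activations" is random data invented for the demo, and `eps` is a small constant assumed here to avoid division by zero.

```python
import numpy as np

def normalize_batch(batch, eps=1e-5):
    # Normalize each feature (column) using the statistics of this mini-batch.
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    return (batch - mean) / np.sqrt(var + eps)

# A made-up mini-batch: 64 samples, 4 features, far from zero mean / unit std.
rng = np.random.default_rng(0)
activations = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

normalized = normalize_batch(activations)
print("mean per feature before:", activations.mean(axis=0).round(2))
print("mean per feature after: ", normalized.mean(axis=0).round(2))
print("std per feature after:  ", normalized.std(axis=0).round(2))
```

Whatever the incoming scale, each feature leaves this step with a mean close to zero and a standard deviation close to one.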
How It Helps?
- Speed: By keeping the input distribution stable, batch normalization allows the network to train faster.
- Stability: It also makes the training process more stable, reducing the chances of the network getting stuck or diverging.
More Intuition
Imagine you’re trying to bake a cake once again, but every time you follow a recipe, the ingredients’ quality keeps changing unpredictably. One time, the flour is too fine; the next time, it’s too coarse. This inconsistency makes it hard to get the cake just right.
Batch normalization is like having a machine that ensures the flour is always the same quality, the eggs are always the same size, and the sugar is always the same sweetness. With these consistent ingredients, you can bake your cake faster and more reliably, knowing it will turn out well each time.
Exploring through code
Here’s a simple code example to understand how batch normalization is introduced into a neural network with PyTorch.
import torch
import torch.nn as nn

class NetWithBatchNorm(nn.Module):
    def __init__(self, momentum):
        super(NetWithBatchNorm, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.bn1 = nn.BatchNorm1d(128, momentum=momentum)  # Batch Normalization layer
        self.fc2 = nn.Linear(128, 64)
        self.bn2 = nn.BatchNorm1d(64, momentum=momentum)  # Batch Normalization layer
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = torch.relu(self.bn1(self.fc1(x)))  # Apply Batch Normalization and ReLU
        x = torch.relu(self.bn2(self.fc2(x)))  # Apply Batch Normalization and ReLU
        x = self.fc3(x)
        return x
We introduce an nn.BatchNorm1d layer that takes an input parameter called momentum.
In the given code block, gamma and beta are implicitly introduced as learnable parameters within the nn.BatchNorm1d layers. The layers automatically manage these parameters in the normalization process.
- Gamma (γ): This is a scale parameter that is learned during training. It allows the model to scale the normalized output, providing flexibility in adjusting the distribution of the activations. It typically starts at 1.0 and is adjusted during training.
- Beta (β): This is a shift parameter that is also learned during training. It allows the model to shift the normalized output, effectively adjusting the mean of the activations. It typically starts at 0.0 and is adjusted during training.
In PyTorch, nn.BatchNorm1d internally maintains these parameters and updates them as part of the model's learning process. They are learned alongside other model parameters, such as the weights in fully connected layers.
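To see what gamma and beta do, here is a minimal NumPy sketch of the batch normalization forward pass, written by hand rather than taken from PyTorch's internals. The "learned" values below are invented for illustration. In PyTorch itself, gamma and beta are exposed as the `weight` and `bias` attributes of the layer.

```python
import numpy as np

def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    # Step 1: normalize each feature to zero mean and unit standard deviation.
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    x_hat = (batch - mean) / np.sqrt(var + eps)
    # Step 2: scale by gamma and shift by beta (the learnable parameters).
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 2))

# Initial values: gamma = 1 and beta = 0 leave the normalized output unchanged.
out_initial = batch_norm_forward(batch, gamma=np.ones(2), beta=np.zeros(2))

# Hypothetical values after some training (made up for this demo).
out_learned = batch_norm_forward(batch, gamma=np.array([2.0, 0.5]), beta=np.array([1.0, -1.0]))

print("initial mean/std:", out_initial.mean(axis=0).round(2), out_initial.std(axis=0).round(2))
print("learned mean/std:", out_learned.mean(axis=0).round(2), out_learned.std(axis=0).round(2))
```

After the scale and shift, each feature has a mean of roughly beta and a standard deviation of roughly gamma, so the network can undo or reshape the normalization wherever that helps.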
In the above code, momentum controls how much the running averages of the mean and variance change as new data comes in during training. Note that PyTorch defines it as the weight given to the newest batch: running = (1 - momentum) × running + momentum × new, with a default of 0.1. This is the reverse of the conventional notion of momentum used in optimizers.
Imagine you’re tracking the average temperature over time, and you get to decide how much importance to give to the most recent day:
- Low Momentum (e.g. 0.1): If you’ve had many days of warm weather and today is cold, the running average temperature won’t change much. You’re essentially saying, “I trust the warm trend more than today’s cold spike.”
- High Momentum (e.g. 0.9): If you use a higher momentum, today’s cold temperature will have a bigger impact on the running average. You’re saying, “I’m willing to adjust more quickly to this new cold trend.”
Why Use Momentum?
- Stability: Low momentum helps keep the running averages stable, making the model’s performance more predictable, especially when training with noisy or varied data.
- Adaptability: Higher momentum allows the model to quickly adapt to changes, which can be useful if the data distribution shifts over time.
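The temperature analogy can be checked with a few lines of plain Python. This sketch uses the update rule that PyTorch documents for nn.BatchNorm1d, where momentum is the weight given to the newest batch statistic; the helper names and the temperature values are made up for the demo.

```python
def update_running_mean(running_mean, batch_mean, momentum):
    # PyTorch convention: momentum is the weight of the NEW batch statistic.
    return (1 - momentum) * running_mean + momentum * batch_mean

def track(batch_means, momentum, start):
    # Apply the update once per batch and record the running mean each time.
    running = start
    history = []
    for batch_mean in batch_means:
        running = update_running_mean(running, batch_mean, momentum)
        history.append(running)
    return history

# Five "warm" batches (30 degrees) followed by one "cold" batch (10 degrees).
batch_means = [30.0] * 5 + [10.0]

slow = track(batch_means, momentum=0.1, start=30.0)
fast = track(batch_means, momentum=0.9, start=30.0)

print(f"momentum=0.1 -> running mean after the cold batch: {slow[-1]:.1f}")  # 28.0
print(f"momentum=0.9 -> running mean after the cold batch: {fast[-1]:.1f}")  # 12.0
```

With momentum 0.1 the running mean barely moves (30 to 28), while with momentum 0.9 it jumps almost all the way to the new value (30 to 12).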
Phew! Take a short break to introspect on the intuition behind batch normalization.
Automated Hyperparameter Optimization
In the mid-2010s, there was a significant increase in research aimed at automating hyperparameter optimization (HPO). Tools and approaches like Hyperopt, Bayesian optimization, and Google’s AutoML emerged, marking a pivotal shift.
Example of using Hyperopt
Let’s consider a scenario where you’re optimizing the hyperparameters of a machine learning model to predict house prices based on various features like the number of bedrooms, square footage, and location. We’ll use a simple model like a Random Forest Regressor and optimize two hyperparameters: the number of trees (n_estimators) and the maximum depth of each tree (max_depth). We'll use Hyperopt to automate this optimization.
from hyperopt import fmin, tpe, hp, Trials
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_california_housing

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Define the objective function
def objective(params):
    model = RandomForestRegressor(
        n_estimators=int(params['n_estimators']),
        max_depth=int(params['max_depth']),
        random_state=42
    )
    # Use cross-validation. scikit-learn reports negative MSE,
    # so flip the sign to get the (positive) mean squared error.
    mse = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error').mean()
    return mse

# Define the search space for hyperparameters
space = {
    'n_estimators': hp.quniform('n_estimators', 10, 200, 1),
    'max_depth': hp.quniform('max_depth', 5, 50, 1)
}

# Perform optimization using TPE
trials = Trials()
best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=50,
    trials=trials
)

# Convert hyperparameters to integers
best = {k: int(v) for k, v in best.items()}
print("Best hyperparameters found:", best)
100%|██████████| 50/50 [32:53<00:00, 39.48s/trial, best loss: 0.42483916572972397]
Best hyperparameters found: {'max_depth': 48, 'n_estimators': 107}
Key building blocks in the code
- Data Loading: The California housing dataset is used to predict house prices.
- Objective Function: The objective function calculates the cross-validated mean squared error (MSE) for a given set of hyperparameters. scikit-learn reports this score as a negative number, so the code flips the sign and returns the positive MSE, which Hyperopt minimizes to find the best hyperparameters.
- Search Space: The space dictionary defines the range of possible values for n_estimators and max_depth. hp.quniform specifies that the values should be sampled uniformly from these ranges and rounded to steps of 1, ensuring a broad search.
- Optimization: The fmin function uses the Tree-structured Parzen Estimator (TPE), a Bayesian optimization method, to efficiently search for the best hyperparameters within the defined space. It runs for 50 evaluations, balancing exploration and exploitation.
- Result Conversion: The best hyperparameters found are converted to integers and printed out.
Notice how you need not worry about the mathematical steps, the formulae, or the manual implementation of anything; you can directly use the approach when required. However, it is important to have an intuition of when and why you should use these techniques.
Efficient Hyperparameter Optimization
The focus on making hyperparameter optimization more efficient continued, with techniques such as early stopping for hyperparameter tuning becoming more refined. This approach allows for the termination of less promising hyperparameter configurations before completion, saving computational resources.
Another technique, called Hyperband, combines random search with adaptive resource allocation and early stopping. It dynamically allocates more resources to promising configurations and stops less promising ones early, thus optimizing the search process.
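Hyperband builds on a simpler routine called successive halving, and the sketch below shows only that inner routine, not the full Hyperband algorithm. The `partial_score` function is a made-up stand-in for "train this configuration for a given number of epochs and return validation accuracy", and the halving factor `eta` and starting budget are assumptions chosen for the demo.

```python
import random

def partial_score(config, budget):
    # Hypothetical stand-in for a partial training run (an assumption for this demo).
    # Learning rates closer to 0.003 plateau higher; more budget always helps.
    quality = 1.0 - abs(config["learning_rate"] - 0.003) * 100
    return quality * (1 - 0.5 ** budget)

def successive_halving(configs, min_budget=1, eta=2):
    # Repeatedly train all survivors briefly, keep the top 1/eta,
    # and give the remaining ones eta times more budget.
    survivors = list(configs)
    budget = min_budget
    total_epochs = 0
    while len(survivors) > 1:
        total_epochs += len(survivors) * budget
        ranked = sorted(survivors, key=lambda c: partial_score(c, budget), reverse=True)
        survivors = ranked[:max(1, len(ranked) // eta)]  # early-stop the rest
        budget *= eta
    return survivors[0], total_epochs

# Random search supplies the candidates; successive halving allocates the budget.
rng = random.Random(0)
configs = [{"learning_rate": rng.uniform(0.001, 0.01)} for _ in range(16)]

best_config, total_epochs = successive_halving(configs)
print("best learning rate:", round(best_config["learning_rate"], 5))
print("epochs spent with successive halving:", total_epochs)       # 64
print("epochs spent training all 16 for 8 epochs:", 16 * 8)        # 128
```

In this toy setup, the rounds cost 16×1 + 8×2 + 4×4 + 2×8 = 64 epochs, half of what it would take to train all 16 configurations for 8 epochs each. In real training, early scores can be misleading, which is one reason Hyperband runs several such rounds with different starting budgets.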
We have covered a lot of ground in this article. Let’s pause our exploration here.
For a more in-depth dive, you can take up the challenge of reviewing the literature listed below line by line:
- Hyper-Parameter Optimization: A Review of Algorithms and Applications
- Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges
Stay tuned for part two of this article, where we shall intuitively cover the following topics:
- Hyperparameter optimization for transformer and diffusion models
- How hyperparameters influence the performance and capabilities of large-scale models.
- Standardization of hyperparameters.
- More sophisticated hyperparameter optimization algorithms that utilize reinforcement learning, evolutionary algorithms, and meta-learning.
- Common challenges and misconceptions
- Best Practices in Hyperparameter Tuning
(The link will be updated here once it is published. Follow me and The Research Nest to never miss an update!)