Explained: Hyperparameters in Deep Learning

In simple English for everyone.

Published in

The Research Nest

9 min read · Mar 18, 2024

The year was 1986. Tom Cruise’s Top Gun was buzzing as the summer blockbuster. A few months before, Nintendo released The Legend of Zelda, which took the gaming community by storm.

Let’s not forget the sorrows, too. It was the year of the Challenger Space Shuttle and the Chernobyl nuclear disasters. 1986 was a pivotal year in human history for many reasons.

Around the same time, in the lesser-known corners of the world, three researchers published a paper on what they called a “new learning procedure.”

The idea was to create a procedure that would adjust the variables of a mathematical equation to minimize the error between the output it gives and the actual output.

For example, you have an Input X and want an output Y. But you don’t know which equation, say f(X), will give you Y. To figure that out, you can create a random equation, say:

f(X) = aX^2 + bX + c

You can put random values for a, b, and c. You get some output, which will be some other random value. But you also know the output you want, Y. Looking at the difference between the current output and the desired output (Y), you can adjust the values of a, b, and c and try again and again until you get the correct answer.

Note that the equation was an oversimplification, but you get the gist?
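To make the "try again and again" idea concrete, here is a rough sketch of the naive approach: nudge a, b, and c at random and keep a nudge only when it brings the output closer to Y. The input X = 2, the target Y = 0.8, the nudge size, and the number of attempts are all assumptions picked for illustration.

```python
import random

def f(X, a, b, c):
    """The toy equation from the text: f(X) = aX^2 + bX + c."""
    return a * X**2 + b * X + c

random.seed(0)  # fixed seed so repeated runs behave the same

X, Y = 2.0, 0.8  # assumed input and desired output
params = [random.uniform(-1, 1) for _ in range(3)]  # random a, b, c
initial_error = abs(Y - f(X, *params))
best_error = initial_error

for _ in range(2000):
    # Nudge each variable by a small random amount
    candidate = [p + random.uniform(-0.05, 0.05) for p in params]
    candidate_error = abs(Y - f(X, *candidate))
    # Keep the nudge only if it brings the output closer to Y
    if candidate_error < best_error:
        params, best_error = candidate, candidate_error

print(f"error before: {initial_error:.4f}, after: {best_error:.6f}")
```

This works for three variables, but it is guessing blindly. It never uses the error to decide which way to move, which is exactly the gap the next idea fills.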

How can the process of finding such variables as a, b, and c be made more efficient and automatic?

Let’s take a more complex equation for demonstration.

f(X) = w.sigmoid(v.X + b)

Here, w, v, and b are variables. Sigmoid is a special mathematical function. Let’s say for X=2, the output (let’s call it Y) must be 0.8.

For what values of w, v, and b will that happen? How do you automatically figure it out?

The idea goes like this:

1. Start with random values for the variables.
2. Compute the output.
3. Check the difference between the correct output and the output we get.
4. Use this error to calculate the gradient of the error with respect to each variable, that is, how the error changes when that variable changes. This tells us how much each variable contributes to the error.
5. Update the variables in the direction that reduces the error, scaling each update by a learning rate.

Read the Python code below for a more granular understanding.

import numpy as np

# Sigmoid function and its derivative
# Don't worry about what it does
# Just think of it as some math operation
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

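# Note: this expects the sigmoid *output* s = sigmoid(z), not z itself,
# because the derivative of sigmoid at z equals s * (1 - s)
def sigmoid_derivative(s):
    return s * (1 - s)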

# Initialize parameters as some random variables
w, v, b = 0.5, 0.3, -0.2
X = 2 # Input
Y = 0.8 # Desired output
learning_rate = 0.01

# Let's make the math equation
z = v * X + b
a = sigmoid(z)
Y_pred = w * a

# Compute error as a simple difference between current output and actual desired output
error = Y - Y_pred

# How much each variable should change to shrink the error
# (the negative gradient of 0.5 * error**2 with respect to that variable)
dw = error * a
dv = error * w * sigmoid_derivative(a) * X
db = error * w * sigmoid_derivative(a)

# Update the variables based on some learning rate
w = w + learning_rate * dw
v = v + learning_rate * dv
b = b + learning_rate * db

print(f'Updated parameters: w={w}, v={v}, b={b}')

Updated parameters: w=0.502997366709045, v=0.30120288024750785, b=-0.19939855987624608

In short, we are finding the error and updating our variables based on it.

We can continually do this process till the error is close to zero. The final values of the variables we get are essentially what we call the “weights” and “biases” of the equation (in this case, the neural network).
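Here is a minimal sketch of that repetition, reusing the same starting values and update rule as the single step above. The stopping threshold (`tolerance`) and the safety cap (`max_steps`) are assumptions added for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w, v, b = 0.5, 0.3, -0.2  # same starting values as the single step
X, Y = 2, 0.8
learning_rate = 0.01
tolerance = 1e-4          # assumed: stop once the error is this small
max_steps = 100_000       # assumed: safety cap so the loop always ends

for step in range(max_steps):
    a = sigmoid(v * X + b)
    error = Y - w * a
    if abs(error) < tolerance:
        break
    # Compute all three updates from the current values before applying them
    dw = error * a
    dv = error * w * a * (1 - a) * X
    db = error * w * a * (1 - a)
    w += learning_rate * dw
    v += learning_rate * dv
    b += learning_rate * db

print(f"stopped after {step} steps with error {error:.6f}")
print(f"w={w:.4f}, v={v:.4f}, b={b:.4f}")
```

Many different combinations of w, v, and b produce 0.8 for X=2, so the values you end up with depend on where you start.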

The function that computes this error is called the loss function. The goal of training (or learning) the neural network is to minimize this loss, which is achieved by adjusting the weights and biases of the network through some process and using the learning rate to control the size of these adjustments.

But what’s the deal with the learning rate? Why don’t we directly do w = w + dw?

If you imagine the process of minimizing the error (loss function) as walking down a hill, the learning rate dictates how big each step is. Ideally, we want to control this process and take smaller steps to change the variables. We need a balance. Here are a few reasons:

Overshooting: Without a learning rate, the weight adjustments might be too large, causing the algorithm to overshoot the minimum of the loss function. This can lead to divergent behavior, where the error actually increases with each update.
Precision: Smaller updates, controlled by a lower learning rate, allow the optimization algorithm to converge to the minimum more precisely. Direct updates, especially if they’re large, can prevent the model from fine-tuning its parameters to the optimal values.
Generalization: Gradual learning, facilitated by a controlled learning rate, helps find a more generalizable set of parameters. Rapid, direct updates might lead to a solution that works well for the training data but poorly generalizes to unseen data.
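The overshooting point is easy to see on a tiny example. The sketch below uses an assumed toy loss, L(w) = w², instead of the network above, because its minimum is obviously at w = 0. Here the gradient is subtracted, which is the same thing as adding the dw computed from the error in the earlier code.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Gradient descent on the toy loss L(w) = w**2, whose minimum is at w = 0."""
    for _ in range(steps):
        gradient = 2 * w                  # dL/dw
        w = w - learning_rate * gradient  # step downhill
    return w

for lr in (0.01, 0.1, 1.0, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {descend(lr):.4f}")
```

With a learning rate of 0.01, w creeps towards zero slowly. With 0.1, it gets close quickly. With 1.0, which is the same as applying the gradient directly, w just flips between 1 and -1 forever. With 1.1, every step lands farther from the minimum than the last.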

The process we illustrated in the example to work out how each weight should change is called backpropagation (the update step itself is known as gradient descent). It was popularized by a 1986 research paper by David Rumelhart, Geoffrey Hinton (who would eventually be known as the Godfather of AI), and Ronald Williams.

The learning rate, in this context, is what we now call a hyperparameter.

In short, hyperparameters are parameters that are set before the learning process begins and are not learned from the data. They control the learning process (the process of finding the best weights) itself, such as the learning rate, the number of hidden layers and neurons in a neural network, or the regularization strength. The optimal values for hyperparameters are usually found through experimentation or techniques like grid or random search.
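As a rough sketch of what a grid search looks like, the code below trains the same toy equation from earlier for a fixed number of steps with several candidate learning rates and keeps the one with the smallest final error. The candidate values and the step budget are assumptions picked for illustration, and the helper `train` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train(learning_rate, steps=200):
    """Run a fixed number of update steps and return the final absolute error."""
    w, v, b = 0.5, 0.3, -0.2
    X, Y = 2, 0.8
    for _ in range(steps):
        a = sigmoid(v * X + b)
        error = Y - w * a
        # Compute the shared factor before w changes
        inner = error * w * a * (1 - a)
        w += learning_rate * error * a
        v += learning_rate * inner * X
        b += learning_rate * inner
    return abs(Y - w * sigmoid(v * X + b))

# The grid: every candidate is tried with the same training budget
candidates = [0.001, 0.01, 0.1, 1.0, 10.0]
results = {lr: train(lr) for lr in candidates}
best_lr = min(results, key=results.get)

for lr, final_error in results.items():
    print(f"learning rate {lr}: final error {final_error:.6f}")
print(f"best learning rate in this grid: {best_lr}")
```

Random search follows the same pattern, except the candidates are sampled at random from a range instead of being listed by hand. Note that the winner here only applies to this one-example toy problem. Real models are judged on held-out data, not on the training error.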

Phew! That’s a very long introduction to the origin of hyperparameters in deep learning, but I hope it gave you a proper intuition as to where all these fancy terms are coming from and what they really mean.

More than 35 years later, the concepts of backpropagation and hyperparameters continue to be integral to training neural networks.

Fittingly, they are as relevant today as Top Gun or The Legend of Zelda.

Our journey starts in 1986, but it certainly doesn’t end there.

Read on.

Further Experiments

Towards late 1989, Yann LeCun (who would later become the AI chief of Meta), a 29-year-old researcher at that time, used backpropagation for handwritten postal zip code recognition. It was one of the first applications of backpropagation in the real world. The input would be the images represented as arrays. The desired output would be the number the image represents (0–9). The results and methods would lay the foundations for Convolutional Neural Networks.

In 1998, LeCun et al. published a 46-page paper on Gradient-Based Learning Applied to Document Recognition. It discusses the architecture of CNNs, including hyperparameters like the size of the convolutional kernels and the number of feature maps.

Sepp Hochreiter and Jürgen Schmidhuber had introduced the Long Short-Term Memory (LSTM) architecture a year earlier, in 1997. All these developments increased the complexity of the networks and the number of hyperparameters involved. However, the main focus wasn’t on the hyperparameters themselves.

Legendary AI researchers like Hinton and Yann LeCun continued to bring more techniques into the picture.

Fast forward to 2012, when decades of research on training neural networks culminated in a paper titled Practical Recommendations for Gradient-Based Training of Deep Architectures by Yoshua Bengio. This paper focuses mainly on the training/learning process itself, where hyperparameters play a key role.

Let’s explore the common types of hyperparameters. Don’t worry about the long lists that follow. The idea is to get familiar with the names of different terms for now so that in the future, when you come across one of them, it won’t be completely alien to you.

Common Hyperparameters in Deep Learning

Following up on Mr. Bengio’s paper, here are some hyperparameters you might often encounter, explained in simple English.

Learning rate: We have already seen this above. Typical values are less than 1 but greater than 10^-6. A common default is 0.01.
Batch size: This refers to the number of training examples used in each iteration of the optimization algorithm. The choice of the batch size can significantly impact the performance of the optimization algorithm. Typically, it's between 1 and a few hundred.
Regularization coefficient: This is a hyperparameter used in regularization techniques to control the complexity of the model. We will discuss more about regularization in a separate article. I will link it here when it’s done.
Number of hidden units: This is the number of neurons in the hidden layers of the neural network.
Number of training epochs: This is the total number of times the entire training set is passed through the network during training.
Activation Functions: Functions like ReLU, Sigmoid, Tanh, etc., that determine the output of a neural network node given an input or set of inputs.
Optimizer: Algorithms like Adam, SGD (Stochastic Gradient Descent), etc., are used to update weights in the training phase. We will discuss more on Optimizers in a separate article. I will link it here when it’s done.
Learning Rate Decay: The technique of reducing the learning rate as the training progresses.
Dropout Rate: The fraction of neurons that are randomly ignored at each training step. Dropout is a regularization technique used to prevent overfitting.
Weight Initialization: Methods to set the initial values for the weights.
Momentum: A parameter that helps to accelerate SGD in the relevant direction and dampens oscillations.
Gradient Clipping: A technique to prevent exploding gradients in deep neural networks by limiting the values of gradients to a small range.

Note that these hyperparameters can be different depending on the context of the task and the model. This is not an exhaustive list.
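To see how a few of these fit together, here is a small sketch that combines a learning rate, learning rate decay, momentum, and gradient clipping in a single update loop. It uses an assumed toy loss (the sum of squared weights) rather than a real network, and every value below is an illustrative assumption, not a recommendation.

```python
import numpy as np

# Illustrative hyperparameter values (assumptions, not recommendations)
learning_rate = 0.1  # initial step size
decay = 0.99         # multiply the learning rate by this after every step
momentum = 0.9       # fraction of the previous update carried forward
clip_norm = 1.0      # maximum allowed gradient norm

def gradient(w):
    """Gradient of the toy loss L(w) = sum(w**2), minimized at w = 0."""
    return 2 * w

w = np.array([5.0, -3.0])
velocity = np.zeros_like(w)

for step in range(200):
    g = gradient(w)
    # Gradient clipping: rescale the gradient if its norm is too large
    norm = np.linalg.norm(g)
    if norm > clip_norm:
        g = g * (clip_norm / norm)
    # Momentum: blend the previous update into the current one
    velocity = momentum * velocity - learning_rate * g
    w = w + velocity
    # Learning rate decay: take smaller steps as training progresses
    learning_rate *= decay

print(f"final w = {w}, final learning rate = {learning_rate:.4f}")
```

None of the four values is learned from data. Changing any of them changes how the weights travel towards the minimum, which is what makes them hyperparameters.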

Hyperparameters for Transformer Models

Read this blog post first if you are not familiar with Transformers.

Model size parameters
— Number of encoder-decoder layers
— Hidden size for the feed-forward networks
— Number of attention heads in the multi-head attention process
Training parameters
— Batch size
— Learning rate
— Optimizer
Regularization parameters
— Dropout rate
— Weight decay
— Gradient clipping
Attention parameters
— Attention dropout
— Positional encoding dimensions
Other parameters
— Maximum Sequence Length: The maximum length of the input sequences the model can handle.
— Vocabulary Size: The number of distinct tokens the model knows, which determines the dimensions of the embedding layer.
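In practice, settings like these are often collected in a single configuration object. The sketch below is one hypothetical way to do that. The class name, field names, and helper methods are made up for illustration and do not come from any particular library. The model size defaults loosely echo the base configuration of the original Transformer paper, and the rest are assumed placeholder values.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    # Model size parameters
    num_layers: int = 6
    hidden_size: int = 512        # width of each token representation
    feed_forward_size: int = 2048
    num_heads: int = 8
    # Training parameters
    batch_size: int = 32
    learning_rate: float = 1e-4
    optimizer: str = "adam"
    # Regularization parameters
    dropout_rate: float = 0.1
    weight_decay: float = 0.01
    gradient_clip_norm: float = 1.0
    # Attention parameters
    attention_dropout: float = 0.1
    # Other parameters
    max_sequence_length: int = 512
    vocabulary_size: int = 32000

    def head_size(self) -> int:
        """Each attention head works on an equal slice of the hidden size."""
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    def embedding_parameters(self) -> int:
        """Token embedding table: one hidden_size vector per vocabulary entry."""
        return self.vocabulary_size * self.hidden_size

config = TransformerConfig()
print(config.head_size())             # 64
print(config.embedding_parameters())  # 16384000
```

The two helper methods show why these choices are linked: the hidden size has to divide evenly among the attention heads, and the vocabulary size directly sets how many weights the embedding layer has.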

Hyperparameters for Diffusion Models

Diffusion models are a class of generative models that simulate a diffusion process. The process starts by gradually adding noise to data over a series of steps until the data is completely random noise. Then, a reverse process is used, where a model learns to denoise this data step by step, eventually generating samples from the noise. This approach differs fundamentally from the token-by-token generation of Transformer language models (although diffusion models often use attention layers internally) and is primarily used for generating high-fidelity images, audio, and other types of dense data.

Here are some specific parameters in the context of diffusion models.

Number of Channels: In image models, this refers to the number of channels in the convolutional layers.
Number of Diffusion Steps: The number of steps used to gradually add noise to the data and, in reverse, to remove it. More steps can lead to higher-quality generation but require more computation.
Noise Schedule: A schedule that determines how much noise to add at each diffusion step. This can be linear, cosine, or learned.
Sampling Temperature: Controls the randomness of the generation process during inference. Lower temperatures can lead to less random (more deterministic) outputs.
Conditioning Information: For conditional diffusion models, parameters define how conditioning information (like text descriptions for text-to-image models) is incorporated.

Other hyperparameters, like optimizer, learning rate, batch size, and training steps, are generally shared across all models.
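To give a feel for the noise schedule and the number of diffusion steps, here is a minimal NumPy sketch of a linear and a cosine schedule, plus the forward (noising) process. The start and end values of the linear schedule, the small offset `s` in the cosine schedule, and the 1000 steps are commonly used choices, but treat them as assumptions here. The function names are made up for this example.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise added per step grows linearly from beta_start to beta_end."""
    return np.linspace(beta_start, beta_end, num_steps)

def cosine_beta_schedule(num_steps, s=0.008):
    """Per-step noise derived from a cosine-shaped 'signal remaining' curve."""
    t = np.arange(num_steps + 1) / num_steps
    signal = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    signal = signal / signal[0]
    betas = 1 - signal[1:] / signal[:-1]
    return np.clip(betas, 0, 0.999)

def add_noise(x0, step, betas, rng):
    """Jump straight to a given step of the forward (noising) process."""
    signal_kept = np.cumprod(1 - betas)[step]  # fraction of signal variance left
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(signal_kept) * x0 + np.sqrt(1 - signal_kept) * noise

num_steps = 1000
rng = np.random.default_rng(0)
x0 = np.ones(4)  # stand-in for real data such as image pixels

for name, betas in [("linear", linear_beta_schedule(num_steps)),
                    ("cosine", cosine_beta_schedule(num_steps))]:
    kept = np.cumprod(1 - betas)
    print(f"{name}: signal kept at step 500 = {kept[499]:.3f}, at the end = {kept[-1]:.5f}")

print(add_noise(x0, step=10, betas=linear_beta_schedule(num_steps), rng=rng))
```

Both schedules end with almost no signal left, but they get there differently. The cosine schedule keeps more of the original data through the middle steps, which is one reason the choice of schedule matters.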

The Importance of Hyperparameters

In the early 2010s, the success of models like AlexNet (2012) for image recognition showed the impact of choosing the correct hyperparameters in the training process. Hyperparameters are essential because they determine the capacity of a machine learning model to learn from data and generalize to new, unseen data.

The choice of hyperparameters can significantly affect the time required to train and test a model. Moreover, the reproducibility of machine learning research depends on the clear reporting of hyperparameters used in studies, as other researchers can only replicate results if they know the original hyperparameters.
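One lightweight habit that helps with reproducibility is to keep every hyperparameter, including the random seed, in one place and save it alongside the results. The sketch below shows the idea with a plain dictionary and JSON. The keys and values are assumed examples.

```python
import json
import random

# Everything needed to repeat the run lives in one dictionary
hyperparameters = {
    "learning_rate": 0.01,
    "batch_size": 32,
    "num_epochs": 10,
    "optimizer": "sgd",
    "momentum": 0.9,
    "seed": 42,
}

# Seed the random number generator from the recorded value
random.seed(hyperparameters["seed"])

# Serialize the settings so they can be published with the results
report = json.dumps(hyperparameters, indent=2, sort_keys=True)
print(report)

# Anyone reading the report gets back exactly the same settings
restored = json.loads(report)
```

In a real project, the same report would typically be written to a file next to the trained model.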

Let’s say you have a neural network, and the entire training process is defined. Before you start the training, the natural question to ask is: How do we determine what values to use for our hyperparameters?

But let’s stop here for this article. We have already explored a lot. Take a short break to reflect on what we have learned so far.

Stay tuned for a follow-up article where I will document in detail how to approach hyperparameter tuning — the process that will help us get the best values of hyperparameters we can use to train our AI models.

(A link to the article will be added here once it is ready)

Random Fun Fact

In 2018, Yann LeCun, Geoffrey Hinton, and Yoshua Bengio were awarded the prestigious Turing Award for their breakthrough contributions to the field of deep neural networks.
