How to Choose the Best Learning Rate for a Neural Network (A Beginner's Approach)

Ferisa Tri Putri Prestasi
Data Science Indonesia
5 min read · Aug 9, 2020

In this article, before getting into the parameter-tuning topic, I'm going to introduce the artificial neural network. Why? Because it is important to start with the concept first. Neural networks are a branch of artificial intelligence that is quite broad and closely related to other disciplines. With just enough detail, we will be ready to see why neural networks are such a good choice in their applications, because the underlying concept is crucial for bringing them to life. I think it's safe to assume that everyone reading this article has at least heard of neural networks, and you're probably also aware that they have turned out to be an extremely powerful tool when applied to a wide variety of important real-world problems, like text translation, image recognition, etc.

Overview of Neural Networks

The architecture of a neural network is drawn below:

The architecture of the Neural Network

As a college student majoring in Mathematics, the neural network concept is a natural choice for me, because it is familiar territory: calculus and some vector-matrix notation and operations are applied inside it. Fundamentally, a neural network is just a mathematical function that takes a variable in and gives another variable back, where both of these variables can be vectors. We can see the illustration below:

Both of these variables can be vectors
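To make this concrete, here is a minimal Python sketch of that idea; the weight and input values are made up purely for illustration, not taken from the diagram:

```python
import numpy as np

# A toy sketch (values are made up for illustration): a neural network
# viewed as a function that takes a vector in and gives a vector back.
W = np.array([[0.2, -0.5],
              [0.7,  0.1],
              [0.4,  0.9]])      # weight matrix: 2 inputs -> 3 outputs
b = np.array([0.1, 0.0, -0.2])   # bias vector

def f(x):
    return W @ x + b             # both x and f(x) are vectors

print(f(np.array([1.0, 2.0])))   # -> [-0.7  0.9  2.0]
```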

The mathematical treatment has been kept at a minimal level, consistent with the primary aims of clarity and correctness. Derivations, theorems, and proofs are included when they serve to illustrate the important features of a particular neural network. For example, the mathematical derivation of the backpropagation training algorithm makes clear the correct order of operations.

From these diagrams, we have a typical multilayer net, so called because it has more than one layer of connections. Every connection carries a weight that is adjusted by an iterative training process. Typically, there is a layer of units between the input and the output, called the hidden units or the hidden layer. The weight on every connection can be seen below:
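As a rough sketch of a forward pass through such a multilayer net (the layer sizes, the tanh activation, and the random weights are my own assumptions, not values from the diagram):

```python
import numpy as np

# Sketch of a multilayer net: input layer -> hidden layer -> output layer.
# Layer sizes, activation, and random weights are assumptions for the demo.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(4, 2))   # weights between input and hidden layer
W_output = rng.normal(size=(1, 4))   # weights between hidden and output layer

def forward(x):
    h = np.tanh(W_hidden @ x)        # hidden units
    return W_output @ h              # output

print(forward(np.array([1.0, 2.0])))
```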

Hyperparameter Subject

A neural network needs some hyperparameters; one of them is the learning rate (LR), which takes a value between 0 and 1 and is used by gradient descent (GD), the optimizer in a neural net. If you do not know what a hyperparameter is, I will explain first: hyperparameters are the variables that determine the network structure (for example, the number of hidden layers) and the variables that determine how the network is trained (for example, the learning rate). Hyperparameters are set before training (before optimizing the weights and biases).
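For instance (a sketch with example values of my own choosing), hyperparameters are fixed before the weights are ever optimized:

```python
# Hyperparameters are chosen before training starts; the weights and biases
# are what training itself optimizes. These values are just examples.
hyperparams = {
    "n_hidden_layers": 2,   # determines the network structure
    "learning_rate": 0.01,  # determines how the network is trained (0-1)
}
```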

Generally, gradient descent works with the derivative of the function itself. The role of the learning rate in the neural net is to control the rate, or speed, at which the model learns. Specifically, the learning rate controls the amount of apportioned error with which the weights of the model are updated each time they are updated, such as at the end of each batch of training examples. An example is given below:

example of a function

Z symbolizes the function to be minimized, and w is the weight in a neuron. Take the example Z = w² + 1, which we want to minimize as far as possible, and initialize the weight at w = 1. Gradient descent should then drive w toward 0, where Z attains its minimum value, Z = 1.

First, we must know the formula for updating the weight in every neuron, shown below: the new weight equals the old weight minus the learning rate times the gradient of Z with respect to the weight (w_new = w_old − LR · ∂Z/∂w).

update weight
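A minimal sketch of this update rule applied to the earlier example Z = w² + 1 (the starting point and the learning rate value are my choices for the demo):

```python
# Gradient descent on Z(w) = w**2 + 1, whose derivative is dZ/dw = 2*w.
w = 1.0              # initial weight, as in the example above
learning_rate = 0.1  # an example value

for epoch in range(50):
    gradient = 2 * w                  # dZ/dw at the current weight
    w = w - learning_rate * gradient  # w_new = w_old - LR * gradient

print(w, w**2 + 1)   # w is close to 0, so Z is close to its minimum, 1
```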

Gradient descent can include decay, a scheme that updates the learning rate every epoch. We must note that "update" in this context means updating the learning rate, not the weight. If we do not use decay, the learning rate stays constant from the first epoch until the last. If we use decay, the formula is written below:

update learning rate
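The figure above shows the update formula; since several variants exist, here is a sketch of the common time-based schedule lr = lr_initial / (1 + decay · epoch), which I am assuming as the form consistent with the description:

```python
initial_lr = 0.1  # learning rate at the first epoch
decay = 1e-2      # decay > 0, e.g. 1e-1, 1e-2, 1e-3, or 1e-4

def decayed_lr(epoch):
    # Each epoch the learning rate shrinks a little; the weights are
    # still updated separately by the gradient descent rule above.
    return initial_lr / (1 + decay * epoch)

print([round(decayed_lr(e), 4) for e in (0, 10, 50)])  # [0.1, 0.0909, 0.0667]
```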

The old learning rate, i.e. the learning rate initialized in the first epoch, usually has a value of 0.1 or 0.01, while decay is a parameter with a value greater than 0, commonly set to 1e-1, 1e-2, 1e-3, or 1e-4. We must choose the value of the learning rate carefully, because the function Z must be minimized as far as possible, and a learning rate that is not suitable will make the training diverge. In mathematics, convergence is the property (exhibited by certain infinite series and functions) of approaching a limit more and more closely as an argument (variable) of the function increases or decreases, or as the number of terms of the series increases. Three learning-rate cases are drawn below:

graph of a function with a suitable learning rate
graph of a function with a learning rate that is too small
graph of a function with a learning rate that is too large
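As a rough numerical sketch of these three cases, using the same toy function Z = w² + 1 and learning-rate values I picked for illustration:

```python
# Compare three learning rates on the toy function Z(w) = w**2 + 1.
for lr, label in [(0.1, "suitable"), (0.001, "too small"), (1.1, "too large")]:
    w = 1.0
    for epoch in range(30):
        w = w - lr * 2 * w           # gradient descent step, dZ/dw = 2*w
    print(f"{label:>10}: w = {w:.4f} after 30 epochs")

# suitable  : w ends near 0           (converges)
# too small : w has barely moved      (converges, but slowly)
# too large : |w| blows up each step  (diverges)
```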

Final Thought

Based on the three graphs above, a suitable learning rate, within range and combined with decay, can make the graph converge (that is, reach the solution quickly). The learning rate defines how quickly a network updates its parameters. In conclusion, you must run many experiments to learn how your model improves. A learning rate that is too small slows down the learning process but converges smoothly; a learning rate that is too large speeds up learning but may not converge. I prefer using a decaying learning rate, which updates the value of the learning rate every epoch.

Next, I'll be discussing improved neural network architectures like convolutional neural networks, which relate to my Bachelor's degree final project. Thank you.

Resources

Fausett, Laurene. Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. Prentice Hall, 1994.

Yaldi, Gusri. "Improving the Neural Network Testing Performance for Trip Distribution Modelling by Transforming Normalized Data Nonlinearly". IJASEIT, 2017.

Vasudevan, Shrihari. "Mutual Information Based Learning Rate Decay for Stochastic Gradient Descent Training of Deep Neural Networks". Entropy 22(5), 560, 2020.
