Artificial Neural Networks - An Intuitive Approach: Part 2

Niketh Narasimhan · Published in Analytics Vidhya · Jul 25, 2020 · 11 min read

A continuation of an earlier article.

Please find the link for Part 1.

Contents

  1. Perceptron Learning
  2. Methods of updating the weights
  3. Weight decay
  4. Learning rate
  5. Hyperparameter fixing

Perceptron Learning

Let us recap what perceptrons do in their most basic form.

  1. Perceptrons take inputs, scale/multiply them with weights, sum them up, and then pass the sum through an activation function to obtain a result.
  2. The weights are initialized randomly in the first instance to obtain an output.
  3. The weights are then adjusted to minimize the error, using optimization of a loss function together with regularization (please go through the concept of regularization).
  4. Gradient descent is used to optimize the loss function to obtain a minimum error (see the sketch after this list).
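
As a minimal sketch, here is how those steps might look in Python. The AND-gate data, step activation, and learning rate are illustrative assumptions, not anything prescribed by this article:

```python
import numpy as np

rng = np.random.default_rng(0)

def step(z):
    """Step activation: fires 1 if the weighted sum crosses the threshold."""
    return 1 if z > 0 else 0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # inputs (AND gate, illustrative)
y = np.array([0, 0, 0, 1])                      # targets

w = rng.normal(size=2)   # weights initialized randomly (step 2 above)
b = 0.0                  # bias
lr = 0.1                 # learning rate

# Perceptron learning rule: nudge the weights in proportion to the error (step 3).
for epoch in range(10):
    for xi, target in zip(X, y):
        pred = step(np.dot(w, xi) + b)   # weighted sum -> activation (step 1)
        error = target - pred
        w += lr * error * xi             # adjust weights to reduce the error
        b += lr * error
```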

Note: For a quick refresher on gradient descent, it is recommended to go through the link below.

Methods of updating the weights:

There are multiple methods of updating the weights:

After the loss is calculated, the gradient of the loss is computed with respect to the weights, since the loss is in effect a multivariable function of the weights of the network. The weights are then updated in the direction opposite to the gradient, which decreases the loss; that is the aim. Over the course of training, the weights, which were initialized randomly or with some initial value, keep updating so that the loss function is minimized.
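
To make this concrete, here is a minimal sketch of a single gradient-descent update on a toy squared-error loss; the data and learning rate are illustrative:

```python
import numpy as np

# Toy loss: L(w) = mean((Xw - y)^2), a function of the weights w.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.zeros(2)   # initial weights
lr = 0.01         # learning rate

grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the loss w.r.t. the weights
w = w - lr * grad                      # step opposite to the gradient -> loss decreases
```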

Note: Online gradient descent and stochastic gradient descent are the same thing.

The three main update schemes are batch, mini-batch, and stochastic (online) updates.

One epoch typically means your algorithm sees every training instance once. Now assume you have n training instances:

If you run batch updates, every parameter update requires your algorithm to see each of the n training instances exactly once, i.e., every epoch your parameters are updated once.

If you run mini-batch updates with batch size b, every parameter update requires your algorithm to see b of the n training instances, i.e., every epoch your parameters are updated about n/b times.

If you run SGD (stochastic gradient descent) updates, every parameter update requires your algorithm to see 1 of the n training instances, i.e., every epoch your parameters are updated about n times.

To make the above points clearer:

In gradient descent, or batch gradient descent, we use the whole training data per parameter update, whereas in stochastic gradient descent (online) we use only a single training example per update. Mini-batch gradient descent lies between these two extremes, using a mini-batch (a small portion) of the training data per update. The sketch below makes the arithmetic concrete.
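
Here is a small illustrative sketch of that arithmetic, assuming n = 1000 training instances (the numbers are illustrative only):

```python
n = 1000  # training instances

# Batch size controls how many parameter updates happen per epoch.
for batch_size, name in [(n, "batch"), (100, "mini-batch"), (1, "SGD")]:
    updates_per_epoch = n // batch_size
    print(f"{name:10s}: {updates_per_epoch} parameter updates per epoch")

# batch     : 1 update per epoch      (sees all n examples per update)
# mini-batch: 10 updates per epoch    (n / b = 1000 / 100)
# SGD       : 1000 updates per epoch  (one example per update)
```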

Now let us introduce two important concepts: weight decay and learning rate.

Weight decay:

Neural networks learn a set of weights through iterations while propagating the error backwards.

A network with large weights may signal an unstable network, as small changes in the input could lead to large changes in the output. This might indicate that the network has overfit the training data and will fit poorly on test data.

Large weights make the network unstable: although the weights will be specialized to the training dataset, minor variation or statistical noise in the expected inputs will result in large differences in the output.

A solution to this problem is to update the learning algorithm to encourage the network to keep the weights small. This is called weight regularization and it can be used as a general technique to reduce overfitting of the training dataset and improve the generalization of the model.

Regularization is a basic concept (readers are expected to be familiar with it), but we will recap some of the basics here.

Another possible issue is that there may be many input variables, each with different levels of relevance to the output variable. Sometimes we can use methods to aid in selecting input variables, but often the interrelationships between variables are not obvious.

Having small weights or even zero weights for less relevant or irrelevant inputs to the network will allow the model to focus learning. This too will result in a simpler model.

Encourage Small Weights

The learning algorithm can be updated to encourage the network toward using small weights.

One way to do this is to change the calculation of loss used in the optimization of the network to also consider the size of the weights.

The addition of a weight size penalty or weight regularization to a neural network has the effect of reducing generalization error and of allowing the model to pay less attention to less relevant input variables.

How to Penalize Large Weights

There are two parts to penalizing the model based on the size of the weights.

The first is the calculation of the size of the weights, and the second is the amount of attention that the optimization process should pay to the penalty.

Calculate Weight Size

Neural network weights are real values that can be positive or negative; as such, simply adding the weights is not sufficient. There are two main approaches used to calculate the size of the weights:

  • Calculate the sum of the absolute values of the weights, called L1.
  • Calculate the sum of the squared values of the weights, called L2.

L1 encourages weights toward 0.0 if possible, resulting in more sparse weights (weights with more 0.0 values). L2 offers more nuance, penalizing larger weights more severely but resulting in less sparse weights. The use of L2 in linear and logistic regression is often referred to as ridge regression. This is useful to know when trying to develop an intuition for the penalty.
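
As a quick sketch, the two penalty calculations look like this (the weight values are illustrative):

```python
import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.1])  # illustrative network weights

l1_penalty = np.sum(np.abs(w))  # L1: sum of absolute values -> encourages sparsity
l2_penalty = np.sum(w ** 2)     # L2: sum of squares -> penalizes large weights harder
```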

It is possible to include both L1 and L2 approaches to calculating the size of the weights as the penalty. This is akin to the use of both penalties used in the Elastic Net algorithm for linear and logistic regression.

The L2 approach is perhaps the most used and is traditionally referred to as “weight decay” in the field of neural networks. It is called “shrinkage” in statistics, a name that encourages you to think of the impact of the penalty on the model weights during the learning process.

Recall that each node has input weights and a bias weight. The bias weight is generally not included in the penalty because the “input” is constant.

Control Impact of the Penalty

The calculated size of the weights is added to the loss objective function when training the network.

Rather than adding each weight to the penalty directly, the penalty can be weighted using a new hyperparameter called alpha (α), or sometimes lambda. This controls the amount of attention that the learning process should pay to the penalty, or, put another way, the amount to penalize the model based on the size of the weights.

The alpha hyperparameter has a value between 0.0 (no penalty) and 1.0 (full penalty). This hyperparameter controls the amount of bias in the model from 0.0, or low bias (high variance), to 1.0, or high bias (low variance).

If the penalty is too strong, the model will underestimate the weights and underfit the problem. If the penalty is too weak, the model will be allowed to overfit the training data.
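
A minimal sketch of the penalized objective, using illustrative numbers and the alpha notation from above:

```python
import numpy as np

predictions = np.array([0.9, 0.2, 0.8])  # illustrative model outputs
targets = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -1.2, 3.1])           # illustrative weights

data_loss = np.mean((predictions - targets) ** 2)  # how well the model fits
l2_penalty = np.sum(w ** 2)                        # size of the weights
alpha = 0.01                                       # attention paid to the penalty

# alpha = 0.0 -> no penalty; larger alpha -> more shrinkage pressure on the weights.
total_loss = data_loss + alpha * l2_penalty
```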

Tips for Using Weight Regularization

Use With All Network Types

Weight regularization is a generic approach.

It can be used with most, perhaps all, types of neural network models, not least the most common network types of Multilayer Perceptrons, Convolutional Neural Networks, and Long Short-Term Memory Recurrent Neural Networks.

In the case of LSTMs, it may be desirable to use different penalties or penalty configurations for the input and recurrent connections.
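
For example, assuming TensorFlow/Keras, separate penalties for the input and recurrent connections might be configured like this (the unit count and penalty values are illustrative):

```python
import tensorflow as tf

# Different penalty strengths for input vs. recurrent connections of an LSTM.
layer = tf.keras.layers.LSTM(
    units=64,
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),     # input connections
    recurrent_regularizer=tf.keras.regularizers.l2(1e-5),  # recurrent connections
)
```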

Standardize Input Data

It is generally good practice to update input variables to have the same scale.

When input variables have different scales, the scale of the weights of the network will, in turn, vary accordingly. This introduces a problem when using weight regularization because the absolute or squared values of the weights must be added for use in the penalty.

This problem can be addressed by either normalizing or standardizing input variables.
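
As a sketch, assuming scikit-learn, standardization might look like this (the data values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # very different scales

scaler = StandardScaler().fit(X_train)   # learn mean/std on the training data only
X_train_std = scaler.transform(X_train)  # reuse the same transform on test data later
```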

Use a Larger Network

It is common for larger networks (more layers or more nodes) to more easily overfit the training data.

When using weight regularization, it is possible to use larger networks with less risk of overfitting. A good configuration strategy may be to start with larger networks and use weight decay.

Grid Search Parameters

It is common to use small values for the regularization hyperparameter that controls the contribution of each weight to the penalty.

Perhaps start by testing values on a log scale, such as 0.1, 0.01, 0.001, and 0.0001. Then use a grid search at the order of magnitude that shows the most promise, as sketched below.
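
A sketch of that coarse-to-fine strategy; note that train_and_evaluate here is a hypothetical helper returning a validation score, not a real API:

```python
# Coarse pass over orders of magnitude (log scale).
coarse_values = [0.1, 0.01, 0.001, 0.0001]
best = max(coarse_values, key=lambda a: train_and_evaluate(alpha=a))  # hypothetical helper

# Fine pass around the winning order of magnitude, e.g. if 0.01 won:
fine_values = [0.005, 0.01, 0.02, 0.05]
best_fine = max(fine_values, key=lambda a: train_and_evaluate(alpha=a))
```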

Use L1 + L2 Together

Rather than trying to choose between L1 and L2 penalties, use both.

Modern and effective linear regression methods such as the Elastic Net use both L1 and L2 penalties at the same time and this can be a useful approach to try. This gives you both the nuance of L2 and the sparsity encouraged by L1.
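
Assuming TensorFlow/Keras, combining both penalties on a layer might look like this (the values are illustrative):

```python
import tensorflow as tf

# L1 encourages sparsity, L2 shrinks large weights; l1_l2 applies both at once.
dense = tf.keras.layers.Dense(
    units=32,
    kernel_regularizer=tf.keras.regularizers.l1_l2(l1=1e-5, l2=1e-4),
)
```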

Learning Rate:

The weights of a neural network cannot be calculated using an analytical method. Instead, the weights must be discovered via an empirical optimization procedure called stochastic gradient descent.

The optimization problem addressed by stochastic gradient descent for neural networks is challenging, and the space of solutions (sets of weights) may contain many good solutions (called global optima) as well as easy-to-find but low-skill solutions (called local optima).

The amount of change to the model during each step of this search process, or the step size, is called the “learning rate” and provides perhaps the most important hyperparameter to tune for your neural network in order to achieve good performance on your problem.

In this tutorial, you will discover the learning rate hyperparameter used when training deep learning neural networks.

What Is the Learning Rate?

Deep learning neural networks are trained using the stochastic gradient descent algorithm. For more details, go through the link for gradient descent.

Stochastic gradient descent is an optimization algorithm that estimates the error gradient for the current state of the model using examples from the training dataset, then updates the weights of the model using the back-propagation of errors algorithm, referred to as simply backpropagation.

The amount that the weights are updated during training is referred to as the step size or the “learning rate.”

Specifically, the learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0.

The learning rate is often represented using the lowercase Greek letter eta (η).

During training, the backpropagation of error estimates the amount of error for which the weights of a node in the network are responsible. Instead of updating the weight with the full amount, it is scaled by the learning rate.

This means that a learning rate of 0.1, a traditionally common default value, would mean that weights in the network are updated by 0.1 * (estimated weight error), i.e. 10% of the estimated weight error, each time the weights are updated.
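
A tiny worked sketch of this scaling, with illustrative numbers:

```python
weight = 0.80
estimated_weight_error = 0.50  # error attributed to this weight by backpropagation
learning_rate = 0.1

update = learning_rate * estimated_weight_error  # 10% of the estimated error = 0.05
weight = weight - update                         # new weight: 0.75
```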

Effect of Learning Rate

A neural network learns or approximates a function to best map inputs to outputs from examples in the training dataset.

The learning rate hyperparameter controls the rate or speed at which the model learns. Specifically, it controls the amount of apportioned error that the weights of the model are updated with each time they are updated, such as at the end of each batch of training examples.

Given a perfectly configured learning rate, the model will learn to best approximate the function given available resources (the number of layers and the number of nodes per layer) in a given number of training epochs (passes through the training data).

Generally, a large learning rate allows the model to learn faster, at the cost of arriving at a sub-optimal final set of weights. A smaller learning rate may allow the model to learn a more optimal or even globally optimal set of weights but may take significantly longer to train.

At extremes, a learning rate that is too large will result in weight updates that will be too large and the performance of the model (such as its loss on the training dataset) will oscillate over training epochs. Oscillating performance is said to be caused by weights that diverge (are divergent). A learning rate that is too small may never converge or may get stuck on a suboptimal solution.

In the worst case, weight updates that are too large may cause the weights to explode (i.e. result in a numerical overflow).

Therefore, we should not use a learning rate that is too large or too small. Nevertheless, we must configure the model in such a way that on average a “good enough” set of weights is found to approximate the mapping problem as represented by the training dataset.

How to Configure Learning Rate

It is important to find a good value for the learning rate for your model on your training dataset.

The learning rate may, in fact, be the most important hyperparameter to configure for your model.

Unfortunately, we cannot analytically calculate the optimal learning rate for a given model on a given dataset. Instead, a good (or good enough) learning rate must be discovered via trial and error.

The range of values to consider for the learning rate is less than 1.0 and greater than 10^-6.

The learning rate will interact with many other aspects of the optimization process, and the interactions may be nonlinear. Nevertheless, in general, smaller learning rates will require more training epochs. Conversely, larger learning rates will require fewer training epochs. Further, smaller batch sizes are better suited to smaller learning rates given the noisy estimate of the error gradient.

A traditional default value for the learning rate is 0.1 or 0.01, and this may represent a good starting point on your problem.

Leslie N. Smith found a solution to this in Section 3.3 of the 2015 paper “Cyclical Learning Rates for Training Neural Networks”.

This technique trains a network starting from a low learning rate and increases the learning rate exponentially for every batch.

Learning rate increases after each mini-batch

Record the learning rate and training loss for every batch, then plot the loss against the learning rate. Typically, the loss decreases in the beginning, then the training process starts diverging.

First, with low learning rates, the loss improves slowly, then training accelerates until the learning rate becomes too large and loss goes up: the training process diverges.

We need to select a point on the graph with the fastest decrease in the loss. In this example, the loss decreases quickly when the learning rate is between 0.001 and 0.01.
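
As a sketch of the range test, assuming a hypothetical train_batch helper that performs one mini-batch update and returns the loss:

```python
# Grow the learning rate exponentially each mini-batch, recording the loss.
lr, lr_max, growth = 1e-6, 1.0, 1.1
lrs, losses = [], []

while lr < lr_max:
    loss = train_batch(learning_rate=lr)  # hypothetical one-step training call
    lrs.append(lr)
    losses.append(loss)
    lr *= growth                          # exponential increase per batch

# Plot losses vs. lrs and pick a rate from the region of fastest loss decrease.
```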

An interesting read below can be used for further knowledge (a beautifully written article):

Hyperparameter fixing:
