Artificial Neural Networks - An Intuitive Approach, Part 5

A continuation of an earlier article

Niketh Narasimhan
Analytics Vidhya
15 min read · Jul 25, 2020


Please find the link for part 4

Contents

  1. Gradient descent advanced optimization techniques
  2. Autoencoders
  3. Dropout
  4. Pruning

Gradient descent advanced optimization techniques:

In this post we will cover a few advanced techniques, namely:

Momentum, AdaGrad, RMSProp, Adam, etc.

A machine learning model can contain millions of parameters or dimensions. Therefore the cost function has to be optimized over millions of dimensions.

The goal is to obtain a global minimum of the function which will give us the best possible values to optimize our evaluation metric with the given parameters.

The odds of getting stuck in a local minimum across most dimensions of such a high-dimensional space are low; we are much more likely to encounter saddle points.

Saddle points: a saddle point is a point at which a function of two variables has partial derivatives equal to zero but at which the function has neither a maximum nor a minimum value.

In mathematics, a saddle point is a point on the surface of the graph of a function where the slopes (derivatives) in orthogonal directions are all zero (a critical point), but which is not a local extremum of the function. A typical example is a critical point that is a relative minimum along one axial direction (between peaks) and a relative maximum along the crossing axis.

As shown in the example below:
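A minimal worked example (a standard textbook saddle, chosen here purely for illustration):

```latex
f(x, y) = x^2 - y^2
\quad\Rightarrow\quad
\left.\frac{\partial f}{\partial x}\right|_{(0,0)} = 2x = 0,
\qquad
\left.\frac{\partial f}{\partial y}\right|_{(0,0)} = -2y = 0
```

Both partial derivatives vanish at the origin, yet the origin is a minimum along the x-axis and a maximum along the y-axis, so it is a saddle point rather than an extremum.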

Saddle points can drastically slow down the optimization process. In the diagrams shown below, stochastic gradient descent converges prematurely to a value below the optimum; the other points correspond to different optimization techniques.

In gradient descent we take a step along the gradient in each dimension. In the first animation, using SGD, we get stuck in the local minimum of one dimension while we are also at the local maximum of another dimension (the gradient is close to zero).

Because our step size in a given dimension is determined by the gradient value, we’re slowed down in the presence of local optima.

Stochastic Gradient Descent:

Stochastic gradient descent, also known as vanilla gradient descent, updates the current weight wt using the current gradient ∂L/∂wt multiplied by a factor called the learning rate, α.
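As a minimal sketch of this update rule in plain NumPy (`grad_fn`, which returns ∂L/∂w for the current weights, is a hypothetical helper, not something from the article):

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    """Vanilla SGD: step against the current gradient, scaled by the learning rate alpha."""
    return w - lr * grad

# Hypothetical usage, assuming grad_fn(w) returns dL/dw:
# w = np.random.randn(10)
# for _ in range(100):
#     w = sgd_update(w, grad_fn(w), lr=0.01)
```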

Momentum:

Instead of depending only on the current gradient to update the weight, we use a moving average of gradients at each time step. With this method we are less likely to get stuck in a local minimum, because the update depends not just on the current gradient but on the average of the gradients over the previous steps. That said, it is still possible to get stuck at a saddle point if the momentum only points in the direction of the minimum; you can see this in the first animation, where the momentum optimizer fails to exit the local optimum.

Gradient descent with momentum replaces the current gradient with Vt (which stands for velocity), the exponential moving average of current and past gradients (i.e. up to time t).

Momentum
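A minimal sketch of this update, where the velocity Vt is the exponential moving average of gradients (the decay factor β ≈ 0.9 is an assumed typical value):

```python
def momentum_update(w, grad, v, lr=0.01, beta=0.9):
    """Momentum: replace the raw gradient with an exponential moving average of gradients."""
    v = beta * v + (1 - beta) * grad   # V_t = beta * V_{t-1} + (1 - beta) * dL/dw_t
    w = w - lr * v                     # step along the smoothed direction instead of the raw gradient
    return w, v
```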

Nesterov accelerated gradient:

Intuition: instead of computing the gradient at the current position, Nesterov momentum first takes a step in the direction of the accumulated velocity and evaluates the gradient at that look-ahead position, so the update can correct itself before overshooting.
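A rough sketch of the look-ahead idea, in the spirit of the momentum update above (one of several equivalent formulations; `grad_fn` is again a hypothetical helper returning the gradient at any point):

```python
def nesterov_update(w, v, grad_fn, lr=0.01, beta=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point rather than
    at the current weights, then update the velocity and take the step."""
    lookahead_grad = grad_fn(w - lr * beta * v)   # where the momentum alone would carry us
    v = beta * v + (1 - beta) * lookahead_grad    # accumulate velocity using the look-ahead gradient
    w = w - lr * v
    return w, v
```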

AdaGrad — Gradient Descent with Adaptive Learning Rate

The main motivation behind AdaGrad was the idea of an adaptive learning rate for different features in the dataset, i.e. instead of using the same learning rate across all the features, we might need different learning rates for different features.

It adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well suited for dealing with sparse data.
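A minimal per-parameter sketch of AdaGrad (ε is a small constant, assumed to be around 1e-8, that avoids division by zero):

```python
import numpy as np

def adagrad_update(w, grad, cache, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate the sum of squared gradients and use it to shrink the
    effective learning rate of frequently updated parameters."""
    cache = cache + grad ** 2                    # running sum of squared gradients, per parameter
    w = w - lr * grad / (np.sqrt(cache) + eps)   # larger cache -> smaller effective step
    return w, cache
```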

RMSProp:

Intuition

AdaGrad decays the learning rate very aggressively (as the denominator keeps growing). As a result, after a while, the parameters for frequent features start receiving very small updates because of the decayed learning rate. To avoid this, why not decay the denominator itself and prevent its rapid growth?

This technique aims to prevent local optima from slowing our convergence by adaptively scaling the learning rate in each dimension according to an exponentially weighted average of the gradients.

In RMSProp the history of gradients is accumulated using an exponentially decaying average, unlike the running sum of squared gradients in AdaGrad, which helps prevent the rapid growth of the denominator for dense features.
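A minimal sketch showing the single change relative to AdaGrad, a decaying average instead of a running sum (the decay rate of 0.9 is an assumed typical default):

```python
import numpy as np

def rmsprop_update(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """RMSProp: like AdaGrad, but the squared-gradient history decays exponentially,
    so the denominator does not grow without bound."""
    cache = decay * cache + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```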

Adam:

The name Adam is derived from adaptive moment estimation.

In momentum-based gradient descent we use the cumulative history of gradients to move faster over gentle surfaces, and we have seen that RMSProp also uses history, to decay the denominator and prevent its rapid growth. The way these algorithms use the history is different: in momentum GD the history is used to compute the current update, whereas in RMSProp it is used to adjust the learning rate (shrink or boost it).

Adam combines these two separate histories into one algorithm, adapting:

(i) the gradient component, by using V, the exponential moving average of gradients (as in momentum), and
(ii) the learning rate component, by dividing the learning rate α by the square root of S, the exponential moving average of squared gradients (as in RMSProp).
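A minimal sketch of the combined update; the bias-correction terms (dividing by 1 − β^t) are part of the standard Adam formulation even though they are not spelled out above, and β1 = 0.9, β2 = 0.999 are the usual assumed defaults:

```python
import numpy as np

def adam_update(w, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style EMA of gradients (v) plus RMSProp-style EMA of squared
    gradients (s), with bias correction for the early steps. t is the step count, starting at 1."""
    v = beta1 * v + (1 - beta1) * grad         # gradient component (as in momentum)
    s = beta2 * s + (1 - beta2) * grad ** 2    # learning-rate component (as in RMSProp)
    v_hat = v / (1 - beta1 ** t)               # bias-corrected estimates
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```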

Intuition

Here, I'd like to share some intuition on why gradient descent optimisers use an exponential moving average for the gradient component and the root mean square for the learning rate component.

Why take exponential moving average of gradients?

We need to update the weight, and to do so we need to make use of some value. The only value we have is the current gradient, so let’s utilise this to update the weight.

But taking only the current gradient value is not enough. We want our updates to be ‘better guided’. So let’s include previous gradients too.

One way to 'combine' the current gradient and information from past gradients is to take a simple average of all past and current gradients. But this means each of these gradients is equally weighted, which is not intuitive: spatially, if we are approaching the minimum, the most recent gradient values are likely to provide more information than the earlier ones.

Hence the safest bet is to take the exponential moving average, where recent gradient values are given higher weights (importance) than earlier ones.

Why divide learning rate by root mean square of gradients?

The goal is to adapt the learning rate component. Adapt to what? The gradient. All we need to ensure is that when the gradient is large, we want the update to be small (otherwise, a huge value will be subtracted from the current weight!).

In order to create this effect, let’s divide the learning rate α by the current gradient to get an adapted learning rate.

Bear in mind that the learning rate component must always be positive (because the learning rate component, when multiplied with the gradient component, should have the same sign as the latter). To ensure it’s always positive, we can take its absolute value or its square. Let’s take the square of the current gradient and ‘cancel’ back this square by taking its square root.

But, as with momentum, taking only the current gradient value is not enough. We want our updates to be 'better guided', so let's make use of previous gradients too. As discussed above, we take the exponential moving average of past squared gradients ('mean square') and then take its square root ('root'), hence 'root mean square'. All optimisers in this post that act on the learning rate component do this, except for AdaGrad (which takes the cumulative sum of squared gradients).

Note: two newer optimisation concepts, RAdam and Lookahead, are gaining traction; those interested should have a look.

Autoencoders:

Autoencoders are a specific type of feedforward neural networks where the input is the same as the output. They compress the input into a lower-dimensional code and then reconstruct the output from this representation. The code is a compact “summary” or “compression” of the input, also called the latent-space representation.

Intuition: when we read the above points, the concept of dimensionality reduction in the form of PCA immediately strikes us! In fact, if we were to construct a linear network (i.e. without nonlinear activation functions at each layer), we would observe a dimensionality reduction similar to that of PCA.

Major points of Autoencoders:

  1. Specifically, we’ll design a neural network architecture such that we impose a bottleneck in the network which forces a compressed knowledge representation of the original input.
  2. If the input features were each independent of one another, this compression and subsequent reconstruction would be a very difficult task. However, if some sort of structure exists in the data (ie. correlations between input features), this structure can be learned and consequently leveraged when forcing the input through the network’s bottleneck.
  3. We now have a model where the reconstruction (x̄) is obtained after passing the input (x) through this bottleneck.
  4. We train the model to minimize the reconstruction error between x̄ and x. Thus our entire input is compressed into a smaller number of features.
  5. A bottleneck constrains the amount of information that can traverse the full network, forcing a learned compression of the input data.

This is what a typical autoencoder network looks like. The network is trained in such a way that the features (z) can be used to reconstruct the original input data (x). If the output (x̄) differs from the input (x), the loss penalizes the difference and drives the network to reconstruct the input data.

Autoencoders are mainly a dimensionality reduction (or compression) algorithm with a couple of important properties:

  • Data-specific: autoencoders are only able to meaningfully compress data similar to what they have been trained on. Since they learn features specific to the given training data, they are different from a standard compression algorithm like gzip. So we can't expect an autoencoder trained on handwritten digits to compress landscape photos.
  • Lossy: the output of the autoencoder will not be exactly the same as the input; it will be a close but degraded representation. If you want lossless compression, autoencoders are not the way to go.
  • Unsupervised: to train an autoencoder we don't need to do anything fancy, just throw the raw input data at it. Autoencoders are considered an unsupervised learning technique since they don't need explicit labels to train on. To be more precise, they are self-supervised, because they generate their own labels from the training data.

There are 4 hyperparameters that we need to set before training an autoencoder (a minimal code sketch using them follows this list):

  • Code size: the number of nodes in the middle (hidden) layer. A smaller size results in more compression.
  • Number of layers: the autoencoder can be as deep as we like. In the figure above we have 2 layers in both the encoder and the decoder, not counting the input and output.
  • Number of nodes per layer: the architecture we're working with is called a stacked autoencoder, since the layers are stacked one after another. Stacked autoencoders usually look like a "sandwich": the number of nodes per layer decreases with each subsequent layer of the encoder and increases back in the decoder, and the decoder is symmetric to the encoder in terms of layer structure. As noted above, this is not strictly necessary and we have full control over these parameters.
  • Loss function: we use either mean squared error (MSE) or binary crossentropy. If the input values are in the range [0, 1] we typically use crossentropy, otherwise mean squared error.
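As a rough Keras sketch of such a stacked autoencoder (the 784-dimensional input, e.g. flattened 28×28 images scaled to [0, 1], the code size of 32 and the layer sizes are illustrative assumptions):

```python
from tensorflow.keras import layers, Model

input_dim, code_size = 784, 32   # assumed: flattened 28x28 images, code size of 32

inputs = layers.Input(shape=(input_dim,))
x = layers.Dense(128, activation="relu")(inputs)             # encoder layer
code = layers.Dense(code_size, activation="relu")(x)         # bottleneck: the "code"
x = layers.Dense(128, activation="relu")(code)               # decoder mirrors the encoder
outputs = layers.Dense(input_dim, activation="sigmoid")(x)   # reconstruction in [0, 1]

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Trained with the input as its own target:
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=256, validation_data=(x_test, x_test))
```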

Avoid overfitting:

  1. Increasing these hyperparameters will let the autoencoder learn more complex codings. But we should be careful not to make it too powerful; otherwise the autoencoder will simply learn to copy its inputs to the output without learning any meaningful representation.
  2. It will just mimic the identity function. The autoencoder will reconstruct the training data perfectly, but it will be overfitting without being able to generalize to new instances, which is not what we want.
  3. Deliberately keep the code size small. Since the coding layer has a lower dimensionality than the input data, the autoencoder is said to be undercomplete. It won't be able to directly copy its inputs to the output, and will be forced to learn intelligent features. Ideally, this encoding will learn and describe latent attributes of the input data.

Autoencoders are trained the same way as ANNs, via backpropagation.

Because neural networks are capable of learning nonlinear relationships, this can be thought of as a more powerful (nonlinear) generalization of PCA. Whereas PCA attempts to discover a lower dimensional hyperplane which describes the original data, autoencoders are capable of learning nonlinear manifolds (a manifold is defined in simple terms as a continuous, non-intersecting surface). The difference between these two approaches is visualized below.

Denoising autoencoders:

Keeping the code layer small forced our autoencoder to learn an intelligent representation of the data. There is another way to force the autoencoder to learn useful features, which is adding random noise to its inputs and making it recover the original noise-free data. This way the autoencoder can’t simply copy the input to its output because the input also contains random noise. We are asking it to subtract the noise and produce the underlying meaningful data. This is called a denoising autoencoder.

To answer the question of how an autoencoder removes the noise from the images:

A small tweak is all that is required here. Instead of using the input and the reconstructed output to compute the loss, we can calculate the loss by using the ground truth image and the reconstructed image. This diagram illustrates my point wonderfully:

Denoising
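A rough sketch of that tweak, reusing the `autoencoder` sketched earlier: corrupt the inputs with Gaussian noise, but compute the loss against the clean images (the noise level of 0.3 and the placeholder `x_train` are assumptions):

```python
import numpy as np

x_train = np.random.rand(1000, 784)   # placeholder for the real clean training images in [0, 1]

noise_factor = 0.3                    # assumed noise level
x_train_noisy = x_train + noise_factor * np.random.normal(size=x_train.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)

# Noisy images in, clean (ground-truth) images as the target:
# autoencoder.fit(x_train_noisy, x_train, epochs=20, batch_size=256)
```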

Sparse Autoencoders:

We have introduced two ways to force the autoencoder to learn useful features: keeping the code size small and denoising. A third method is regularization: we can regularize the autoencoder with a sparsity constraint, such that only a fraction of the nodes, called active nodes, have nonzero values.

In particular, we add a penalty term to the loss function such that only a fraction of the nodes become active. This forces the autoencoder to represent each input as a combination of a small number of nodes, and demands that it discover interesting structure in the data. This method works even if the code size is large, since only a small subset of the nodes will be active at any time.
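In Keras this can be sketched by adding an L1 activity regularizer on the code layer (the penalty weight of 1e-5 and the layer sizes are assumed values to be tuned):

```python
from tensorflow.keras import layers, regularizers, Model

inputs = layers.Input(shape=(784,))
# The L1 activity penalty pushes most code activations towards zero (the sparsity constraint),
# so only a small subset of nodes is active for any given input.
code = layers.Dense(64, activation="relu",
                    activity_regularizer=regularizers.l1(1e-5))(inputs)
outputs = layers.Dense(784, activation="sigmoid")(code)

sparse_autoencoder = Model(inputs, outputs)
sparse_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```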

Following are some observations about sparse autoencoders:

  1. One result of this is that we allow the network to sensitize individual hidden-layer nodes toward specific attributes of the input data.
  2. Whereas an undercomplete autoencoder will use the entire network for every observation, a sparse autoencoder will be forced to selectively activate regions of the network depending on the input data.
  3. As a result, we've limited the network's capacity to memorize the input data without limiting the network's capability to extract features from the data.
  4. This allows us to consider the latent-state representation and the regularization of the network separately, so that we can choose a latent-state representation (i.e. encoding dimensionality) that makes sense given the context of the data, while imposing regularization through the sparsity constraint.

Dropout

Large neural nets trained on relatively small datasets can overfit the training data. The model ends up learning the statistical noise in the training data, which results in poor performance when the model is evaluated on new data, e.g. a test dataset. Generalization error increases due to overfitting.

With Dropout, the training process essentially drops out neurons in a neural network. They are temporarily removed from the network, which can be visualized as follows:

Dropout

Dropout intuition

Why could Dropout reduce overfitting?

You may now wonder: why does attaching Bernoulli variables to a regular neural network, making the network thinner, reduce overfitting?

  1. The gradient is computed with respect to the error, but also with respect to what all the other units are doing. This means that certain neurons, through changes in their weights, may fix the mistakes of other neurons, leading to complex co-adaptations that may not generalize to unseen data, i.e. to overfitting.
  2. Dropout prevents these co-adaptations by making the presence of other hidden neurons unreliable. Neurons simply cannot rely on other units to correct their mistakes, which reduces the number of co-adaptations that do not generalize to unseen data, and thus presumably reduces overfitting as well.
  3. A fully connected layer occupies most of the parameters, and hence neurons develop co-dependencies amongst each other during training, which curbs the individual power of each neuron and leads to overfitting of the training data.

Intuitively: suppose a football team is forced to play each of its matches with only 9 players instead of 11 (a random mix of the eleven players each time). When this team gets back to playing with all 11 players, it is able to adapt to all situations, because every player has learnt to take on the responsibilities of the missing players.

Training Phase:

For each hidden layer, for each training sample, for each iteration, ignore (zero out) a random fraction, p, of the nodes (and the corresponding activations).

Testing Phase:

Use all activations, but scale them down by the keep probability (1 − p) to account for the activations that were dropped during training.

Note: all activations are kept active during the test phase.
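A minimal NumPy sketch of this scheme, keeping the convention above that p is the fraction of nodes dropped during training:

```python
import numpy as np

def dropout_train(activations, p=0.5):
    """Training: zero out a random fraction p of the activations."""
    mask = (np.random.rand(*activations.shape) >= p).astype(activations.dtype)
    return activations * mask

def dropout_test(activations, p=0.5):
    """Testing: keep every activation but scale by the keep probability (1 - p),
    so the expected input to the next layer matches what it saw during training."""
    return activations * (1 - p)
```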

Some Observations:

  1. Dropout forces a neural network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
  2. Dropout roughly doubles the number of iterations required to converge. However, training time for each epoch is less.
  3. With H hidden units, each of which can be dropped, we have 2^H possible models. In the testing phase the entire network is used, and each activation is scaled by (1 − p).

Pruning:

Simply put, pruning is a way to reduce the size of a neural network through compression. After the network is pre-trained, it is fine-tuned to determine the importance of connections, which is done by ranking the neurons in the network.

The basic principle of pruning is to remove unimportant weights, for example using second-derivative information. This results in better generalisation, improved processing speed, and a reduced model size.

  1. Pruning is usually done in an iterative fashion, to avoid removing necessary neurons. This also ensures that an important part of the network is not lost, since neural networks are a black box. The first step is to determine which neurons are important and which aren't.
  2. After this, the least important neuron is removed, followed by fine-tuning of the network. At this point a decision can be made to continue the pruning process or to stop.
  3. While pruning has not been a widely publicised way of reducing model size, this is largely due to the previous ineffectiveness of ranking algorithms. It is also a better approach to start with a larger network and prune it after training than to train a smaller network from the get-go.

One of the first pruning methods prunes entire convolutional filters: all filters in the network are ranked by the L1 norm of their weights, the n lowest-ranking filters are pruned globally, and the model is then retrained. This process is repeated.
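A rough sketch of the ranking step, assuming `conv_weights` is a list of per-layer filter tensors with shape (num_filters, channels, height, width):

```python
import numpy as np

def rank_filters_by_l1(conv_weights):
    """Return (layer_index, filter_index, l1_norm) for every filter, lowest norm first."""
    ranking = []
    for layer_idx, w in enumerate(conv_weights):
        # L1 norm of each filter = sum of absolute weights over its channels and kernel
        l1_norms = np.abs(w).reshape(w.shape[0], -1).sum(axis=1)
        ranking += [(layer_idx, f_idx, float(norm)) for f_idx, norm in enumerate(l1_norms)]
    return sorted(ranking, key=lambda item: item[2])

# The n lowest-ranking filters (globally) would then be removed, the model retrained,
# and the whole procedure repeated:
# filters_to_prune = rank_filters_by_l1(conv_weights)[:n]
```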

Importance of pruning

With the rise of mobile inference and machine learning capabilities, pruning becomes more relevant than ever before. Lightweight algorithms are the need of the hour, as more and more applications find use with neural networks.

The most recent example of this comes in the form of Apple’s new products, which use neural networking to ensure a multitude of privacy and security features across products. Owing to the disruptive nature of the technology, it is easy to see its adoption by various companies.

As can be seen in the graph above (the 100%, 51.3% and 21.1% curves), accuracy increases slightly with pruning. Dropout keeps the same activations during the test phase, while pruning removes activations from the network used at test time as well; hence the slightly higher accuracies observed after pruning, since the test-time network has been pruned correspondingly.
