# Deep Neural Networks into Deeper

The toughest job in the field of Data Science is training DNNs

# Motivation

I have been working in Data Science for the past two years, and the field keeps pushing me to learn more. Working with DNNs, I have grown comfortable with them and gathered tricks that make neural networks train faster and reach the best accuracy for my models. In this article, I share those tricks to turn the toughest job into a simpler one.

**Training DNNs** is tough because of the computation time and effort involved. A few problems come up along the way:

- Vanishing gradients or, in some cases, exploding gradients.
- Training may be extremely slow.
- The model may overfit the training set, especially if the dataset is small and noisy, which is very often the case.

# Contents

- Vanishing & Exploding gradients
- Pretrained Layers
- Optimizers
- Regularization techniques

# Vanishing & Exploding gradients

This problem arises during backpropagation, which propagates the error gradient from the output layer back to the input layer so that the weights can be updated. Gradients often get smaller and smaller as the algorithm moves down to the lower layers, so the weights of those layers are left virtually unchanged. This is the “vanishing gradients” problem.

In other cases, the gradients grow larger and larger as the algorithm moves down to the lower layers, so those layers get very large weight updates and training diverges. This is the “exploding gradients” problem.

With the logistic activation function, when the inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus, when backpropagation kicks in, it has virtually no gradient to propagate back through the network, and what little gradient exists keeps getting diluted as backpropagation progresses down from the top layers. So there is nothing left for the lower layers.
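The saturation described above can be checked numerically. The following is a minimal sketch in plain Python (no Keras needed); the function names are illustrative, and the per-layer weights that the chain rule would also multiply in are deliberately ignored.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at 0.25 for z = 0 and shrinks rapidly as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.6f}  derivative={sigmoid_derivative(z):.6f}")

# Backpropagation multiplies one such factor per layer, so even in the best
# case (0.25 per layer, weights ignored) the signal shrinks geometrically.
best_case_after_10_layers = 0.25 ** 10
```

Even with every neuron sitting at its most favorable input, ten logistic layers scale the gradient by less than one millionth, which is why the lower layers barely learn.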

The ReLU activation function is not perfect either: it suffers from a problem known as dying ReLUs. During training, some neurons may die, meaning they stop outputting anything other than zero.

- Leaky ReLUs never die. They can go into a long coma, but they have a chance to eventually wake up.
- To use the leaky ReLU activation function in Keras, create a `LeakyReLU` layer and add it to your model just after the layer you want to apply it to.
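To see why a leaky ReLU can wake up while a plain ReLU cannot, it helps to compare their gradients for negative inputs. This is a plain-Python sketch of the arithmetic only, not the Keras layer itself; the slope `alpha=0.01` is an assumed, illustrative value.

```python
def relu(z):
    """Standard ReLU: zero for negative inputs."""
    return max(0.0, z)

def relu_grad(z):
    """Gradient of ReLU: exactly 0 for negative inputs, so no learning signal."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: a small slope alpha (assumed value) for negative inputs."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Gradient of leaky ReLU: never exactly 0, so the neuron can recover."""
    return 1.0 if z > 0 else alpha

z = -3.0
dead_signal = relu_grad(z)         # 0.0: the weights feeding this neuron stop updating
leaky_signal = leaky_relu_grad(z)  # 0.01: small, but still a usable gradient
```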

## Batch Normalization

This technique addresses the vanishing and exploding gradients problems by normalizing the inputs of each layer. A good weight initializer reduces these problems at the beginning of training, but it cannot guarantee that they won’t come back later; BN largely takes care of that. It adds an operation to the model just before or after the activation function of each hidden layer. This operation simply zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling and the other for shifting. In other words, the operation lets the model learn the optimal scale and mean of each of the layer’s inputs. In most cases, if you add a BN layer as the very first layer, you do not need to standardize your training set (e.g., with a standard scaler); the BN layer will do it for you.
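The zero-center, normalize, scale, and shift steps can be sketched for a single input feature as follows. This is a simplified plain-Python illustration, not the Keras implementation; `gamma`, `beta`, and `eps` are illustrative names for the scale parameter, the shift parameter, and a small smoothing term.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift.

    gamma (scale) and beta (shift) stand in for the two learned parameter
    vectors; eps is a small smoothing term that avoids division by zero.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]

batch = [10.0, 20.0, 30.0, 40.0]
out = batch_norm(batch)                           # zero mean, roughly unit variance
shifted = batch_norm(batch, gamma=2.0, beta=5.0)  # mean 5, standard deviation about 2
```

During training, `gamma` and `beta` would be updated by backpropagation like any other weight.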

At test time, we may need to make predictions for individual instances or small batches, and statistics computed over so few instances would be unreliable. One solution is to wait until the end of training, run the whole training set through the neural network, and compute the mean and standard deviation of each input of the BN layer. These “final” input means and standard deviations can then be used instead of the batch input means and standard deviations when making predictions. In practice, most BN implementations estimate these final statistics during training by using a moving average of the layer’s input means and standard deviations. This whole process happens automatically when you use a batch normalization layer.
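The moving-average estimate can be sketched as an exponential moving average that is updated once per training batch. The helper name, the momentum of 0.99, and the batch means below are assumptions made for illustration.

```python
def update_moving_average(moving, batch_stat, momentum=0.99):
    """Blend the running estimate with the statistic of the current batch."""
    return momentum * moving + (1.0 - momentum) * batch_stat

# Hypothetical per-batch means that fluctuate around a true mean of 5.0.
batch_means = [4.8, 5.2, 5.1, 4.9] * 250  # 1000 training steps

moving_mean = 0.0  # initial estimate before any batch has been seen
for batch_mean in batch_means:
    moving_mean = update_moving_average(moving_mean, batch_mean)

# moving_mean is now close to 5.0 and would be used at prediction time
# instead of the mean of whatever batch happens to be passed in.
```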

- Each epoch is a bit slower with BN because of the extra computation, but the model usually converges in fewer epochs.
- A BN layer adds four parameters per input, so its parameter count is four times the output size of the previous layer.
- Of these four, two (the scale and the offset) are trained by backpropagation; the other two are the moving averages, which are not trainable.
- BN layers can also be placed before the activation functions: remove the activation function from each hidden layer and add it as a separate layer after the BN layer. Since a BN layer includes one offset parameter per input, you can also remove the bias term from the preceding hidden layer by setting `use_bias=False`.

## Gradient Clipping

This is the most prominent technique for the exploding gradients problem: clip the gradients during backpropagation so that they never exceed some threshold. It is used mostly in RNNs, and in practice it works well. In Keras, implementing gradient clipping is just a matter of setting the `clipvalue` or `clipnorm` argument when creating an optimizer.

- With `clipvalue=1.0`, the optimizer clips every component of the gradient vector to a value between -1.0 and 1.0.
- Clipping by value may change the direction of the gradient vector. If you need gradient clipping that preserves the direction, clip by norm using `clipnorm` instead of `clipvalue`.
- For example, if the original gradient vector is [0.9, 100.0], `clipvalue=1.0` gives [0.9, 1.0], while `clipnorm=1.0` gives approximately [0.00899964, 0.9999595].
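The difference between the two clipping modes can be reproduced by hand. The sketch below reimplements the arithmetic in plain Python so the numbers can be checked; it is not the Keras code path, and the helper names are illustrative.

```python
import math

def clip_by_value(grad, clip_value):
    """Clip each component independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its L2 norm exceeds clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad, 1.0)  # [0.9, 1.0]: the direction changes
by_norm = clip_by_norm(grad, 1.0)    # about [0.0089996, 0.9999595]: direction kept
```

Clipping by value leaves the two components almost equal, so the update no longer points mainly along the second axis. Clipping by norm keeps the original ratio between components and only shortens the vector.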

# Pretrained Layers

We will discuss three different types of pre-trained layers :

- Transfer Learning
- Unsupervised pretraining
- Pretraining on an auxiliary task

## Transfer Learning

Building a DNN from scratch is hard and time-consuming, so we can reuse pre-trained models (models that have already been trained) for the same type of problem. With a pre-trained model, we need to decide which hidden layers to reuse and what to do with the output layer.

- We can reuse layers from the pre-trained model in our new model and freeze them.
- For some problems, we need a different output than the pre-trained model provides, so we can replace the output layer and reuse the rest of the layers.
- We can also reuse and freeze only the lower layers of the pre-trained model, drop its upper layers, and add new layers that produce the required output. The lower layers are the most likely to be useful, because they learn low-level features.
- If you have plenty of training data, you can even add more hidden layers to the new model.
- The most common difficulty in transfer learning is matching the input shape of the pre-trained model. We have to reshape or preprocess our data to the shape the pre-trained model expects. Transfer learning works best when the inputs have similar low-level features.
- A new output layer is initialized randomly, so it makes large errors at the start of training, and the resulting large error gradients may wreck the reused weights. To avoid this, freeze the reused layers for the first few epochs by setting `trainable=False`, then unfreeze them and continue training.
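The freeze-then-unfreeze schedule can be illustrated with a toy model that has one reused weight and one new output weight. Everything here (the model, the data, the learning rates) is hypothetical and written in plain Python; in Keras, the frozen phase corresponds to setting `trainable=False` on the reused layers.

```python
def train(w_reused, w_out, steps, lr, reused_trainable):
    """Gradient descent on 0.5 * (w_out * w_reused * x - y)**2 for one sample."""
    x, y = 1.0, 6.0  # toy data (hypothetical)
    for _ in range(steps):
        error = w_out * w_reused * x - y
        grad_out = error * w_reused * x
        grad_reused = error * w_out * x
        w_out -= lr * grad_out
        if reused_trainable:  # frozen weights are simply not updated
            w_reused -= lr * grad_reused
    return w_reused, w_out

pretrained = 2.0   # weight copied from the (hypothetical) pre-trained model
new_output = -1.0  # freshly initialized output weight, far from its optimum

# Phase 1: reused weight frozen, only the new output layer learns.
frozen_w, warm_out = train(pretrained, new_output, steps=20, lr=0.1,
                           reused_trainable=False)

# Phase 2: unfreeze and fine-tune everything with a lower learning rate.
tuned_w, tuned_out = train(frozen_w, warm_out, steps=20, lr=0.01,
                           reused_trainable=True)

# For comparison: training everything from the start moves the reused weight.
wrecked_w, _ = train(pretrained, new_output, steps=20, lr=0.1,
                     reused_trainable=True)
```

With the frozen phase, the reused weight stays at its pre-trained value while the output weight settles; without it, the large initial error drags the reused weight well away from where it started.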

## Unsupervised Pretraining

It is closely related to self-supervised learning: the model has to learn by itself from data without proper labels. This makes it quite different from supervised learning, and pre-trained models of this kind are harder to find. When the data has no proper labels, we would have to gather them ourselves, which is costly when the training set is big. We can still perform unsupervised pretraining with autoencoders or GANs (Generative Adversarial Networks). We can then reuse the lower layers of the autoencoder or GAN, add an output layer for our task on top, and fine-tune the final network with supervised learning.

- Unsupervised pretraining with GANs or autoencoders is a good option when you have a complex task with little labeled data but plenty of unlabeled data.

## Pretraining on an auxiliary task

We can first train a neural network on an auxiliary task for which labeled data is easy to obtain or generate, then reuse it for our actual task. The lower layers of the first network learn features from the input that are likely to be reusable by the second network.

- For example, suppose we want to build a neural network that classifies the faces of a few famous people, but we have too few pictures of them to train a good classifier. We can instead use a model pre-trained on a similar face recognition problem.
- We can then reuse the lower layers of that model to train our face classifier.
- The lower layers of such a network are good feature detectors for faces, which makes them valuable to reuse when building other models.

Self-supervised learning is when you automatically generate the labels from the data itself, then you train a model on the resulting “labeled” dataset using supervised learning techniques. It is best classified as a form of unsupervised learning.
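As a small illustration of generating labels from the data itself, the sketch below turns an unlabeled sequence into (inputs, label) pairs by using the next item as the label. Next-item prediction is just one assumed example of such a task; the function name and the data are hypothetical.

```python
def make_next_item_pairs(sequence, window=3):
    """Turn an unlabeled sequence into (inputs, label) pairs.

    The "label" for each window is simply the item that follows it, so no
    human annotation is needed.
    """
    return [
        (sequence[i:i + window], sequence[i + window])
        for i in range(len(sequence) - window)
    ]

unlabeled = [3, 1, 4, 1, 5, 9, 2, 6]
pairs = make_next_item_pairs(unlabeled)
# First pair: ([3, 1, 4], 1). Any supervised model can now be trained on `pairs`.
```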

# Optimizers

Optimizers are used to speed up the training of neural networks. Plain Gradient Descent is widely used, but several faster optimizers and related techniques are available:

- Momentum Optimization
- Nesterov Accelerated Gradient
- AdaGrad
- RMSProp
- Adam and Nadam Optimization.
- Learning rate scheduling

**Momentum Optimization**

Because of momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum. This is one of the reasons it’s good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.

- It keeps track of the exponentially decaying average of past gradients.

`optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)`
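The update rule behind momentum optimization can be sketched in a few lines of plain Python. The toy loss J(θ) = θ² and the helper names are assumptions for illustration; the hyperparameters mirror the ones in the line above.

```python
def gradient(theta):
    """Gradient of the toy loss J(theta) = theta**2 (hypothetical example)."""
    return 2.0 * theta

def momentum_descent(theta, lr=0.001, momentum=0.9, steps=100):
    m = 0.0  # momentum vector: decaying accumulation of past gradients
    for _ in range(steps):
        m = momentum * m - lr * gradient(theta)
        theta = theta + m
    return theta

def plain_descent(theta, lr=0.001, steps=100):
    for _ in range(steps):
        theta = theta - lr * gradient(theta)
    return theta

start = 5.0
with_momentum = momentum_descent(start)  # ends much closer to the minimum at 0
without_momentum = plain_descent(start)  # has barely moved after 100 steps
```

The `momentum` hyperparameter plays the role of friction: 0 means high friction (plain Gradient Descent) and values close to 1 mean very little friction.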

**Nesterov Accelerated Gradient**

It is generally faster than regular momentum optimization. To use it, simply set `nesterov=True`:

`optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)`
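The only change from regular momentum is where the gradient is measured: slightly ahead, in the direction the momentum is about to carry the weights. A minimal plain-Python sketch, using an assumed toy loss J(θ) = θ²:

```python
def gradient(theta):
    """Gradient of the toy loss J(theta) = theta**2 (hypothetical example)."""
    return 2.0 * theta

def nesterov_descent(theta, lr=0.001, momentum=0.9, steps=100):
    m = 0.0
    for _ in range(steps):
        lookahead = theta + momentum * m  # where momentum is about to take us
        m = momentum * m - lr * gradient(lookahead)
        theta = theta + m
    return theta

result = nesterov_descent(5.0)  # moves steadily from 5.0 toward the minimum at 0
```

On a one-dimensional quadratic like this the benefit over plain momentum is not visible; the look-ahead mainly helps by damping oscillations on harder loss surfaces.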

**AdaGrad**

It often performs well on simple quadratic problems, but it often stops too early when training neural networks: the learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum.

- Even though Keras has an AdaGrad optimizer, you should not use it to train deep neural networks. It can still be useful for simpler tasks such as linear regression.
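The early-stopping behavior comes from the way AdaGrad accumulates squared gradients without ever forgetting them. The sketch below tracks the size of each update for a constant gradient; it is plain Python with assumed hyperparameter values, not the Keras optimizer.

```python
import math

def adagrad_step_sizes(grads, lr=0.1, eps=1e-10):
    """Return the size of each AdaGrad update for a stream of gradients."""
    s = 0.0  # running sum of squared gradients: it only ever grows
    sizes = []
    for g in grads:
        s += g * g
        sizes.append(lr * abs(g) / math.sqrt(s + eps))
    return sizes

sizes = adagrad_step_sizes([1.0] * 100)
first_step, last_step = sizes[0], sizes[-1]
# The gradient never changed, yet the 100th step (about 0.01) is ten times
# smaller than the first (about 0.1).
```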

**RMSProp**

It generally works better than the AdaGrad optimizer because it accumulates only the gradients from the most recent iterations. It was the preferred optimizer of many researchers until Adam optimization came around.

- It keeps track of an exponentially decaying average of past squared gradients.

`optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)`
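Replacing AdaGrad's ever-growing sum with a decaying average is what keeps the step size from collapsing. A plain-Python sketch of the step size under a constant gradient, with an assumed `eps` value and illustrative names:

```python
import math

def rmsprop_step_sizes(grads, lr=0.001, rho=0.9, eps=1e-7):
    """Return the size of each RMSProp update for a stream of gradients."""
    s = 0.0  # exponentially decaying average of squared gradients
    sizes = []
    for g in grads:
        s = rho * s + (1.0 - rho) * g * g
        sizes.append(lr * abs(g) / math.sqrt(s + eps))
    return sizes

sizes = rmsprop_step_sizes([1.0] * 100)
last_step = sizes[-1]
# The step size settles near lr (0.001) instead of shrinking toward zero.
```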

**Adam**

Adam stands for adaptive moment estimation and combines the ideas of momentum optimization and RMSProp. Like momentum optimization, it keeps track of an exponentially decaying average of past gradients; like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.

- Since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter.
- This makes Adam even easier to use than Gradient Descent.
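Putting the two running averages together gives the Adam update. The following is a minimal single-parameter sketch in plain Python; the hyperparameter values are common defaults assumed for illustration, and the toy gradient function is hypothetical.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-7, steps=1000):
    m = 0.0  # decaying average of past gradients (momentum part)
    s = 0.0  # decaying average of past squared gradients (RMSProp part)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        s = beta2 * s + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction: m and s start at 0
        s_hat = s / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta

# Toy loss J(theta) = theta**2, whose gradient is 2 * theta.
result = adam_minimize(lambda th: 2.0 * th, theta=1.0)
```

Because the update is divided by the root of the squared-gradient average, the step size depends little on the raw gradient scale, which is why the learning rate needs less tuning.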

**Learning Rate Scheduling**

The learning rate plays a crucial role when training DNNs. A well-chosen learning rate lets the network learn quickly and reduces training time, but finding it is hard: the rate can be neither too high nor too low.

- If the learning rate is too high, training may diverge. If it is too low, training will eventually converge to the optimum, but it will take a very long time.
- A balanced learning rate is therefore needed to train the network well.
- We can find a good learning rate by training the model for a few hundred iterations, exponentially increasing the learning rate from a very small value to a very large value, and then looking at the learning curve and picking a learning rate slightly lower than the one at which the learning curve starts shooting back up. You can then reinitialize the model and train it with that learning rate.
- There are many different strategies to reduce the learning rate during training. It can also be beneficial to start with a low learning rate, increase it, then drop it again. These strategies are called “learning schedules”.
- The most commonly used schedules are:

- Power scheduling
- Exponential scheduling
- Piecewise constant scheduling
- Performance scheduling
- 1cycle scheduling

`LearningRateScheduler` will update the optimizer’s `learning_rate` attribute at the beginning of each epoch.
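Three of the schedules listed above can be written as simple functions of the step or epoch index. The constants (`lr0`, `s`, `c`, and the epoch boundaries) are assumed values chosen for illustration; functions of this kind are what one would typically hand to a scheduler callback, though the exact signature it expects depends on the Keras version.

```python
def power_schedule(step, lr0=0.01, s=20, c=1):
    """Power scheduling: lr0 / (1 + step / s) ** c."""
    return lr0 / (1 + step / s) ** c

def exponential_schedule(epoch, lr0=0.01, s=20):
    """Exponential scheduling: the rate drops by a factor of 10 every s epochs."""
    return lr0 * 0.1 ** (epoch / s)

def piecewise_constant_schedule(epoch):
    """Piecewise constant scheduling with hand-picked boundaries."""
    if epoch < 5:
        return 0.01
    if epoch < 15:
        return 0.005
    return 0.001

for epoch in (0, 10, 20, 40):
    print(epoch,
          round(power_schedule(epoch), 6),
          round(exponential_schedule(epoch), 6),
          piecewise_constant_schedule(epoch))
```

Power scheduling drops quickly at first and then more and more slowly, whereas exponential scheduling keeps dividing the rate by 10 every `s` epochs.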

# Regularization

DNNs can have tens of thousands of parameters, sometimes even millions. This gives them an incredible amount of freedom and means they can fit a huge variety of complex datasets. But this great flexibility also makes the network prone to overfitting the training set, which is why we need regularization. There are many regularization techniques; we will discuss a few popular ones.

## l1 and l2 Regularization

You can use l2 regularization to constrain a neural network’s connection weights, or l1 regularization if you want a sparse model (with many weights equal to 0).

- The `l2()` function returns a regularizer that will be called at each step during training to compute the regularization loss. This is then added to the final loss. As you might expect, you can just use `keras.regularizers.l1()` if you want l1 regularization. If you need both l1 and l2, use `keras.regularizers.l1_l2()`.
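The regularization loss itself is a simple function of the weights. The sketch below computes both penalties in plain Python for a small weight vector; the factor 0.01, the weights, and the data loss are hypothetical values.

```python
def l1_penalty(weights, factor=0.01):
    """l1 regularization loss: factor * sum of absolute weights."""
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor=0.01):
    """l2 regularization loss: factor * sum of squared weights."""
    return factor * sum(w * w for w in weights)

weights = [0.5, -1.0, 0.0, 2.0]
data_loss = 0.30  # hypothetical loss on the training batch

# The penalty is added to the data loss, so large weights cost extra.
total_loss_l2 = data_loss + l2_penalty(weights)  # about 0.3525
total_loss_l1 = data_loss + l1_penalty(weights)  # about 0.335
```

The l2 penalty punishes large weights much more than small ones, while the l1 penalty grows linearly and tends to push small weights all the way to zero.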

**Dropout**

It is one of the most popular regularization techniques. As its name suggests, it drops a random set of neurons in a layer at each training step. To implement dropout in Keras, you can use the `keras.layers.Dropout` layer.

- During training, it randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep probability. After training, it does nothing at all; it just passes the inputs to the next layer.
- Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So make sure to evaluate the training loss without dropout (e.g., after training).
- The dropout rate can be tuned in both directions: increase it if the model is overfitting, and decrease it if the model is underfitting.
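The drop-and-rescale behavior during training, and the pass-through behavior afterwards, can be sketched in plain Python. This is an illustration of the idea rather than the Keras layer; the function signature and the seeded generator are assumptions made to keep the example reproducible.

```python
import random

def dropout(inputs, rate, training, rng):
    """Inverted dropout: zero inputs with probability `rate` during training
    and divide the survivors by the keep probability (1 - rate)."""
    if not training:
        return list(inputs)  # no-op at inference time
    keep_prob = 1.0 - rate
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in inputs]

rng = random.Random(42)
inputs = [1.0] * 10000
train_out = dropout(inputs, rate=0.5, training=True, rng=rng)
test_out = dropout(inputs, rate=0.5, training=False, rng=rng)

# During training roughly half the values are 0.0 and the rest are 2.0, so
# the average activation stays close to the original 1.0.
train_mean = sum(train_out) / len(train_out)
```

Dividing by the keep probability is what keeps the expected input of the next layer the same with and without dropout.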

**Max-Norm Regularization**

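For each neuron, max-norm regularization constrains the weights w of the incoming connections such that ‖w‖₂ ≤ r, where r is the max-norm hyperparameter and ‖·‖₂ is the l2 norm. It does not add a regularization loss term to the overall loss function; instead, the weights are rescaled after each training step if needed. Reducing r increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the unstable gradients problems (if you are not using Batch Normalization).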
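The rescaling step can be sketched in plain Python, with r as the max-norm hyperparameter: after each training step, a neuron's incoming weight vector is scaled back whenever its l2 norm exceeds r. The function name and the sample weights are illustrative.

```python
import math

def max_norm(weights, r):
    """Rescale the weight vector so that its l2 norm is at most r."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= r:
        return list(weights)  # already within the constraint: unchanged
    return [w * r / norm for w in weights]

too_large = [3.0, 4.0]                  # l2 norm is 5.0
constrained = max_norm(too_large, 1.0)  # about [0.6, 0.8], norm is now 1.0
small = max_norm([0.3, 0.4], 1.0)       # norm 0.5 <= r, left as is
```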

# Conclusion

Congratulations! You have just learned some useful techniques that make tough jobs in Data Science simpler. With these techniques, you can tackle unstable gradients, slow training, and overfitting of the training set.

Hope you like it! Thanks for reading.

Here is the repo of my complete project, where you can check the code used in this article.