How to train Neural Networks

Arpitha M S · Published in Analytics Vidhya · 4 min read · Oct 5, 2020

In this post, I am going to outline a general blueprint that can be followed for any deep learning model. I am not going in-depth into deep learning concepts here; rather, these are basic steps that can be followed to develop neural networks. Steps may be added to or removed from the list below based on your requirements.

1. Data preprocessing

The data we get for modeling is most of the time raw and unstructured, and much of it is not required for our use case. So we need to keep the data that is necessary and leave out the rest.
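
As a minimal sketch of this step (using NumPy and scikit-learn on made-up data, since the post does not fix a particular dataset), preprocessing often comes down to keeping the relevant columns, splitting, and scaling:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up raw data: 1000 rows, 20 columns, binary labels.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# Keep only the columns we actually need (here, assume the first 15 matter).
X = X[:, :15]

# Split before scaling so the test set stays truly unseen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale using statistics computed on the training set only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```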

2. Weight initialization

The first step in modeling a neural network is weight initialization, and it is an extremely important one: if the weights are not initialized properly, converging to a minimum can be very slow or may fail altogether, but if it is done the right way, optimization is achieved in the least time. There are several techniques, such as zero initialization, random initialization, He initialization, and Glorot (Xavier) initialization.
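
As a small sketch of how this looks in practice (assuming tf.keras, which the post does not prescribe), you can pick an initializer per layer; He initialization is commonly paired with ReLU, and Glorot/Xavier with tanh or sigmoid:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(15,),
                 kernel_initializer="he_normal"),       # He init pairs well with ReLU
    layers.Dense(32, activation="tanh",
                 kernel_initializer="glorot_uniform"),  # Glorot/Xavier init suits tanh/sigmoid
    layers.Dense(1, activation="sigmoid"),
])
```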

3. Choose the right activation function

The activation function can be thought of as a gate: it can be as simple as on/off, or it can transform the input of the neuron into an output. There are several types of activation functions that you can choose from based on the use case, broadly categorized into linear and non-linear. The problem with linear activation functions is that their derivative is a constant, so backpropagation cannot use them to learn anything input-dependent; and stacking multiple layers of linear transformations is still equivalent to a single layer, because a composition of linear functions is itself a linear function. Non-linear activations like sigmoid, tanh, and ReLU solve these problems.

Image courtesy: https://wp.wwu.edu/machinelearning/2017/02/12/deep-neural-networks/
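
As a quick illustration (plain NumPy, nothing framework-specific), here is how the three common non-linear activations transform the same inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)          # zeroes out negatives, keeps positives

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))
print(tanh(x))
print(relu(x))
```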

4. Batch Normalization

Normalization means bringing all the features to a single scale. For example, one feature may take values from 1–100 while another ranges from 0–1; normalizing the data, for instance to the 0–1 range, makes learning faster. If the input layer benefits from normalization, the hidden layers can benefit too, so we add batch normalization to the hidden layers as well, especially the later layers closer to the output in a deep network, so that convergence becomes easier. Normalization also keeps the activations from growing too large.
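
A minimal sketch of adding batch normalization to the hidden layers (again assuming tf.keras) looks like this:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, input_shape=(15,)),
    layers.BatchNormalization(),   # normalize activations of the first hidden layer
    layers.Activation("relu"),
    layers.Dense(32),
    layers.BatchNormalization(),   # later layers benefit too
    layers.Activation("relu"),
    layers.Dense(1, activation="sigmoid"),
])
```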

5. Add Dropout layers if required

Dropout layers are added to avoid overfitting, so include them if you suspect your model is overfitting. Dropout simply drops some of the neurons in a particular layer at random during training. Adding dropout can mean the model needs more epochs to converge, but each epoch tends to take a shorter time.
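
A minimal sketch (tf.keras assumed; the 30% drop rate is chosen arbitrarily for illustration):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(15,)),
    layers.Dropout(0.3),   # randomly zero 30% of this layer's outputs during training
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])
```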

6. Choose a good optimizer

Optimizers are methods that change the attributes of the neural network, such as the weights and the learning rate, in order to reduce the loss. There are many optimizers to choose from, such as gradient descent, stochastic gradient descent, mini-batch gradient descent, momentum, Adagrad, AdaDelta, and Adam. Adam is one of the more recent optimizers and is a strong default choice; in practice it often takes less time and trains neural networks more efficiently.
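
As a sketch of how the choice is made in code (tf.keras assumed; the tiny one-layer model is just a placeholder):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([layers.Dense(1, activation="sigmoid", input_shape=(15,))])

adam = keras.optimizers.Adam(learning_rate=1e-3)              # a strong default
sgd = keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9)  # a common alternative

model.compile(optimizer=adam, loss="binary_crossentropy", metrics=["accuracy"])
```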

7. Hyper Parameter tuning

Hyperparameters are all the training variables set manually with a pre-determined value before starting the training.

Some of the common hyperparameters are as follows :

  • Learning rate
  • Momentum
  • Adam’s hyperparameters (β1, β2, ε)
  • Number of hidden layers
  • Number of hidden units for different layers
  • Learning rate decay
  • Mini-batch size

Among all the hyperparameters, the learning rate (step size) is the most important; it tells how far to move along the gradient at each update. If the learning rate is small, training is more reliable, but it takes a lot of time to converge. There are several methods to find a good learning rate, such as trial and error, grid search, random search, and Bayesian optimization.
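
Below is a rough sketch of a random search over the learning rate (tf.keras assumed; the data and the small `build_model` helper are made up for illustration):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Made-up data; in practice use your own preprocessed training set.
X = np.random.rand(500, 15)
y = np.random.randint(0, 2, size=500)

def build_model(learning_rate):
    model = keras.Sequential([layers.Dense(32, activation="relu", input_shape=(15,)),
                              layers.Dense(1, activation="sigmoid")])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

best_lr, best_val_acc = None, -np.inf
for _ in range(10):
    lr = 10 ** np.random.uniform(-4, -1)   # sample on a log scale between 1e-4 and 1e-1
    history = build_model(lr).fit(X, y, validation_split=0.2,
                                  epochs=10, batch_size=32, verbose=0)
    val_acc = max(history.history["val_accuracy"])
    if val_acc > best_val_acc:
        best_lr, best_val_acc = lr, val_acc

print(f"best learning rate: {best_lr:.5f} (validation accuracy {best_val_acc:.3f})")
```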

8. Loss Function

Ultimately, what matters in ML or deep learning models is reducing the loss function. In classification tasks we normally minimize log loss (binary cross-entropy); in multiclass classification it is multiclass (categorical) log loss; and in regression tasks it is usually mean squared error.
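
In code this amounts to picking the loss when compiling the model (tf.keras assumed; the one-layer model is just a placeholder):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([layers.Dense(1, activation="sigmoid", input_shape=(15,))])

model.compile(optimizer="adam", loss="binary_crossentropy")        # binary classification (log loss)
# model.compile(optimizer="adam", loss="categorical_crossentropy") # multiclass (one-hot labels)
# model.compile(optimizer="adam", loss="mse")                      # regression (mean squared error)
```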

9. Monitor gradients

There are two kinds of problems regarding gradients: exploding gradients and vanishing gradients. If the model is unstable, with large changes in loss from update to update, we can say it is suffering from exploding gradients. The vanishing gradient problem is that, in some cases, the gradient becomes vanishingly small, effectively preventing the weights from changing their values; in the worst case, this may completely stop the neural network from training further. We need to monitor the gradients, and techniques like gradient clipping help keep them under control.
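
A minimal sketch of gradient clipping (tf.keras assumed; the clipping thresholds are arbitrary examples):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([layers.Dense(1, activation="sigmoid", input_shape=(15,))])

optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)      # rescale gradients whose norm exceeds 1.0
# optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipvalue=0.5)   # or clip each element to [-0.5, 0.5]

model.compile(optimizer=optimizer, loss="binary_crossentropy")
```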

10. Visualization

Plots give a nice visualization of model performance. They make it easy to make sense of the metrics the model produces and to make informed decisions about changes to the parameters or hyperparameters that affect the model.
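
A typical example is plotting the training and validation loss curves (tf.keras and matplotlib assumed; the toy data and model exist only to produce a history to plot):

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers

# Toy data and model, just to produce a training history to plot.
X = np.random.rand(500, 15)
y = np.random.randint(0, 2, size=500)
model = keras.Sequential([layers.Dense(16, activation="relu", input_shape=(15,)),
                          layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

history = model.fit(X, y, validation_split=0.2, epochs=20, verbose=0)

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```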

This article should not be considered a complete guide to building a deep learning model; rather, it is just one way deep learning models can be built.

Happy Learning
