Machine Learning

Solving the Vanishing Gradient Problem with Self-Normalizing Neural Networks using Keras

How to Improve Convergence and Performance of Deep Feed-Forward Neural Networks via a Simple Model Configuration

Jonathan Quijas
Oct 21 · 7 min read
Photo by Martin Sanchez on Unsplash

Problem Statement

Training deep neural networks can be a challenging task, especially for very deep models. A major part of this difficulty is due to the instability of the gradients computed via backpropagation. In this post, we will learn how to create a self-normalizing deep feed-forward neural network using Keras. This solves the gradient instability issue, speeding up training convergence and improving model performance.

Disclaimer: This article is a brief summary with focus on implementation. Please read the cited papers for full details and mathematical justification (link in sources section).

Background

In their 2010 landmark paper, Xavier Glorot and Yoshua Bengio provided invaluable insights concerning the difficulty of training deep neural networks.

It turns out the then-popular choices of activation function and weight initialization technique were directly contributing to what is known as the Vanishing/Exploding Gradient Problem.

In succinct terms, this is when the gradients shrink or grow so much that they make training impossible.

Before the wide adoption of the now ubiquitous ReLU function and its variants, sigmoid (S-shaped) functions were the most popular choice of activation function. One such example of a sigmoid activation is the logistic function:

Source: https://www.mhnederlof.nl/logistic.html

One major disadvantage of sigmoid functions is that they saturate. In the case of the logistic function, the outputs saturate to 0 for large negative inputs and to 1 for large positive inputs. This leads to smaller and smaller gradients (very close to 0) as the magnitude of the inputs increases.

Visualization of the logistic and ELU activation functions, along with their corresponding derivatives. Due to saturation, the derivatives of the logistic function will tend to shrink. In contrast, the derivative of the ELU function will be constant for positive inputs.

Since the ReLU and its variants do not saturate for positive inputs, they alleviate this vanishing gradient phenomenon. Improved variants of the ReLU such as the ELU function (shown above) have well-behaved derivatives, as the short numerical check after the list below illustrates:

  • For any positive input, the derivative is always 1
  • For small negative inputs, the derivative does not collapse to zero
  • The derivative is smooth across the entire input range
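The sketch below (minimal NumPy, not taken from the paper) evaluates both derivatives at a few input magnitudes, using the standard textbook definitions of the logistic and ELU derivatives:

import numpy as np

def logistic_derivative(x):
    # Derivative of the logistic sigmoid: sigma(x) * (1 - sigma(x))
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def elu_derivative(x, alpha=1.0):
    # Derivative of the ELU: 1 for x > 0, alpha * exp(x) for x <= 0
    return np.where(x > 0, 1.0, alpha * np.exp(x))

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, logistic_derivative(x), elu_derivative(x))

# The logistic derivative shrinks from 0.25 at x = 0 to roughly 4.5e-5 at x = 10,
# while the ELU derivative stays at exactly 1 for every positive input.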

NOTE: It is therefore beneficial to have inputs with an expected value of 0, as well as a small variance. This would help maintain strong gradients across the network.

Another important insight found in the paper was the effect of weight initialization using a normal distribution with 0 mean and standard deviation of 1, a widely popular choice prior to the authors’ findings.

The authors showed that the particular combination of a sigmoid activation and this weight initialization (a normal distribution with 0 mean and standard deviation of 1) causes the outputs of a layer to have a larger variance than its inputs. The effect compounds across the network, so the inputs to deeper layers have a much larger magnitude than the inputs to shallower (earlier) layers. This phenomenon was later named Internal Covariate Shift in a 2015 landmark paper by Sergey Ioffe and Christian Szegedy.

As we saw above, this translates into smaller and smaller gradients when using sigmoid activations.

The problem is further accentuated with the logistic function, since its outputs have an expected value of 0.5 rather than 0. The hyperbolic tangent, also a sigmoid-shaped function, has an expected output of 0 and thus behaves better in practice (but it still saturates).

The authors argued that in order for the gradients to be stable during training, the inputs and outputs of all layers must preserve more or less the same variance across the entire network. This would prevent the signal from dying or exploding when propagating in a forward pass, as well as gradients vanishing or exploding during backpropagation.
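As a rough illustration of this variance argument, here is a toy simulation (not the paper’s experiment): pushing standardized inputs through a stack of linear layers whose weights are drawn from N(0, 1) inflates the activation scale at every layer, whereas drawing them from N(0, 1/fan_in), which is the LeCun Normal recipe discussed next, keeps the scale roughly constant. Activation functions are omitted here to isolate the effect of the weight scale.

import numpy as np

rng = np.random.default_rng(0)
fan_in = 100
x = rng.standard_normal((1000, fan_in))  # standardized inputs: mean 0, std 1

def output_std(weight_std, num_layers=10):
    # Propagate the inputs through stacked linear layers (no activations)
    h = x
    for _ in range(num_layers):
        w = rng.normal(0.0, weight_std, size=(fan_in, fan_in))
        h = h @ w
    return h.std()

print(output_std(1.0))                    # explodes: std grows roughly 10x per layer
print(output_std(np.sqrt(1.0 / fan_in)))  # stays close to 1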

The distribution generated with the LeCun Normal initialization concentrates much more probability mass around 0 and has a smaller variance. This is in contrast to the distribution generated with the normal initialization (0 mean and standard deviation of 1), which produces a much wider spread of values (larger variance).

To achieve this, they proposed a weight initialization technique, named Glorot (or Xavier) initialization after the paper’s first author. It turns out that with a slight modification of the Glorot technique, we get LeCun initialization, named after Yann LeCun.

Yann LeCun proposed this initialization scheme in the 1990s, with references found in the Springer publication Neural Networks: Tricks of the Trade (1998).
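For reference, LeCun Normal draws each weight from a distribution with standard deviation sqrt(1/fan_in), while Glorot Normal uses sqrt(2/(fan_in + fan_out)); Keras implements both (as slightly truncated normals). A quick way to see the smaller spread, assuming TensorFlow 2.x Keras:

from tensorflow import keras

fan_in = 100
lecun_init = keras.initializers.get('lecun_normal')
standard_init = keras.initializers.RandomNormal(mean=0.0, stddev=1.0)

w_lecun = lecun_init(shape=(fan_in, fan_in)).numpy()
w_standard = standard_init(shape=(fan_in, fan_in)).numpy()

# LeCun Normal yields a standard deviation near sqrt(1/100) = 0.1 here,
# versus 1.0 for the standard normal initialization discussed above.
print(w_lecun.std(), w_standard.std())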

Self-Normalizing Feed Forward Neural Networks (SNNs)

In 2017, Günter Klambauer et al. introduced self-normalizing neural networks (SNNs). Provided a few conditions are met, these networks preserve layer outputs with a mean close to 0 and a standard deviation close to 1 across all layers. This means SNNs do not suffer from the vanishing/exploding gradient problem and thus converge much faster than networks without this self-normalizing property. According to the authors, SNNs significantly outperformed the variants without self-normalization in all learning tasks reported in the paper. Below is a more detailed description of the conditions needed to create an SNN.

An SNN must be a sequential model composed only of fully-connected layers.

NOTE: Certain types of networks are better suited than others depending on the task. For example, convolutional neural networks are commonly used in computer vision tasks, primarily due to their parameter efficiency. Make sure a fully-connected architecture is adequate for your task. If it is, consider using an SNN. Otherwise, Batch Normalization is an excellent way to ensure proper normalization across the network.

Example of a deep, sequential, fully-connected neural network. Image source: https://www.oreilly.com/library/view/tensorflow-for-deep/9781491980446/ch04.html

A sequential model in this case refers to one where layers are in a strict sequential order. In other words, for each hidden layer l, the only inputs layer l receives are strictly the outputs of layer l-1. In the case of the first hidden layer, it only receives the input features. In Keras, this type of model is in fact referred to as a Sequential model.

A fully connected layer is one where each unit in the layer has a connection to every single input. In Keras, this type of layer is referred to as a Dense layer.

The input features must be standardized. This means each input feature should have a mean of 0 and a standard deviation of 1, with the statistics computed on the training data.
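A minimal standardization sketch, assuming the features live in NumPy arrays named X_train and X_valid (these names are placeholders for your own data); the statistics must come from the training split only:

import numpy as np

# Compute per-feature statistics on the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8  # small constant guards against zero variance

X_train_scaled = (X_train - mean) / std
X_valid_scaled = (X_valid - mean) / std  # reuse the training statistics

Scikit-learn’s StandardScaler accomplishes the same thing with fit_transform on the training set and transform everywhere else.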

All layers in an SNN must be initialized using the LeCun Normal initialization. As we saw earlier, this will ensure the range of weight values lies closer to 0.

Visualization of weights as a matrix. The range of values for the standard normal initialized weights is much larger than the range of the LeCun Normal initialized weights.

The authors introduced the Scaled ELU (SELU) function as the activation function for SNNs. As long as the previous conditions are met, the SELU provides a guarantee of self-normalization.
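For completeness, the SELU is a scaled ELU: selu(x) = lambda * x for x > 0 and lambda * alpha * (exp(x) - 1) for x <= 0, where the paper derives the fixed constants alpha ≈ 1.6733 and lambda ≈ 1.0507. A short sketch comparing a hand-rolled version against Keras’ built-in activation:

import numpy as np
from tensorflow import keras

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # Scaled ELU as defined by Klambauer et al. (2017)
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

x = np.linspace(-3.0, 3.0, 7).astype('float32')
print(selu(x))
print(keras.activations.selu(x).numpy())  # should match closely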

Keras Implementation

The following example shows how to define an SNN for a 10-class classification task:

from tensorflow import keras

def get_model(num_hidden_layers=20, input_shape=None, hidden_layer_size=100):
    model = keras.models.Sequential()
    model.add(keras.layers.Flatten(input_shape=input_shape))
    # Hidden layers: SELU activation with LeCun Normal initialization
    for layer in range(num_hidden_layers):
        model.add(keras.layers.Dense(hidden_layer_size,
                                     activation='selu',
                                     kernel_initializer='lecun_normal'))
    # Output layer for the 10-class classification task
    model.add(keras.layers.Dense(10, activation='softmax'))
    return model
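And a hedged usage sketch: the input shape and the X_train/y_train and X_valid/y_valid arrays below are illustrative placeholders, while the Nadam optimizer and 7e-4 learning rate mirror the experimental configuration listed in the next section:

from tensorflow import keras

model = get_model(num_hidden_layers=20,
                  input_shape=(28, 28),   # placeholder input shape
                  hidden_layer_size=100)

model.compile(loss='sparse_categorical_crossentropy',  # assumes integer class labels
              optimizer=keras.optimizers.Nadam(learning_rate=7e-4),
              metrics=['accuracy'])

# X_train / y_train and X_valid / y_valid stand in for your standardized data.
history = model.fit(X_train, y_train,
                    epochs=50,
                    validation_data=(X_valid, y_valid))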

Experimental Results

Below is a comparison between a regular feed-forward neural network and an SNN on three different tasks.

Both networks shared the following configuration:

  • 20 hidden layers
  • 100 units per hidden layer
  • Nadam optimizer
  • Learning rate of 7e-4
  • 50 epochs

For both models, the learning curves stop at the epoch where the best performance metric was achieved.

The SNN reached the best validation accuracy in 28% fewer iterations than the regular model.

Distribution over time of the SNN’s last layer’s weights.

Distribution over time of the regular model’s last layer’s weights.

Validation loss and accuracy are consistently better for the SNN than for the regular model.

The SNN achieved a 12% better F1 score on the test set than the regular model.

The SNN achieved the best validation accuracy in 32% fewer epochs than the regular model (34 epochs for the SNN vs. all 50 for the regular model).

Conclusion

By ensuring our feed-forward neural network configuration meets a set of conditions, we can get it to automatically normalize. The required conditions are:

  • The model must be a sequence of fully-connected layers
  • Weights are initialized with the LeCun normal initialization technique
  • The model uses the SELU activation function
  • Inputs are standardized

This will almost always lead to improved performance and faster convergence compared to models without self-normalization. If your task calls for a regular feed-forward neural network, consider using the SNN variant. Otherwise, Batch Normalization is an excellent (but more time- and compute-costly) normalization strategy.
