Meet the ReL function

moodayday™ | AI³ | Theory, Practice, Business | Aug 18, 2019

In this post, I’m going to give a little refresher on the famous ReLU, which is so prevalent in the deep learning world today.

Overview

This post is divided into the following parts:

  1. What is an activation function
  2. Linear and non-linear activation functions
  3. Pros of the ReLU
  4. Cons of the ReLU

What is an activation function

First, what is an activation anyway? For each node (also called a neuron), the inputs are multiplied by the node’s weights and summed together. This value is referred to as the summed activation of the node.

This value is then passed to a function called the activation function, whose output defines the output, or “activation”, of the node.
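To make this concrete, here is a toy sketch of a single node in plain Python (the name node_output and the example numbers are mine, purely for illustration):

def node_output(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus a bias: the "summed activation" of the node.
    summed = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The activation function turns that sum into the node's output ("activation").
    return activation(summed)

print(node_output([1.0, 2.0], [0.5, -0.25], 0.25, lambda s: max(0, s)))  # prints 0.25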

Linear & Non-Linear

Now, activation functions can be linear or non-linear.

Linear activation functions are just… linear functions. Linear means that the output is a linear combination of the inputs. A network using only such functions is easier to train, but the downside is that it cannot learn complex hidden relationships between the features of the data and the labels: a stack of linear layers collapses into a single linear transformation.

They are not completely useless, however: linear activations are still used in the output layer of networks that predict quantities, for example in regression problems.
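To see why stacking purely linear layers buys no extra expressive power, here is a tiny 1-D sketch (the numbers are arbitrary): composing two linear layers yields just another linear function.

# Two 1-D "layers" with linear activations collapse to a single linear function.
def layer1(x): return 2.0 * x + 1.0          # first linear layer
def layer2(h): return -3.0 * h + 0.5         # second linear layer

def stacked(x): return layer2(layer1(x))     # the two-layer "network"
def collapsed(x): return -6.0 * x - 2.5      # algebraically the same map

print(stacked(4.0), collapsed(4.0))          # both print -26.5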

Nonlinear activation functions are better suited to complex learning problems such as computer vision or speech recognition, because they allow the neurons to learn more complex patterns in the data.

Until recently, the two most widely used nonlinear activation functions were the sigmoid and the hyperbolic tangent (tanh). They are no longer the best option out there, so we will not dwell on them here. The main problem with both the sigmoid and tanh functions is that they saturate (read this paper for a nice analysis): for inputs of large magnitude they flatten out and their gradient becomes nearly zero, so at some point they simply stop letting the neural network adapt its weights to improve the performance of the model (which is the meaning of learning in Machine Learning in general).
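To make the saturation problem concrete, here is a small sketch in plain Python (the helper names are mine): the sigmoid’s gradient peaks at 0.25 around zero and practically vanishes for large inputs, which is what stalls the weight updates.

from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def sigmoid_gradient(x):
    s = sigmoid(x)
    return s * (1.0 - s)       # largest at x = 0, shrinks toward 0 as |x| grows

print(sigmoid_gradient(0))     # 0.25 -- gradients flow
print(sigmoid_gradient(10))    # about 4.5e-05 -- the unit has saturated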

This limitation long kept the performance of neural networks under a mediocre ceiling… until light was shed on the superiority of the rectified linear function.

ReL, or the rectified linear function

In the context of artificial neural networks, the rectifier is an activation function defined as the positive part of its argument. In Python it would look like this:

>>> def rel(x): return max(0, x)  # where x is the input to a neuron; max is a Python builtin

This max(0,x) function is also known as a ramp function.

>>> rel(30)
30
>>> rel(-2019)
0
>>> rel(0)
0
[Figure: plot of the rectified linear (ramp) function]

Actually, although this function is called “linear”, it is not exactly a linear function. Over input values distributed around zero, the ReL function is linear for half of the input domain (the positive inputs) and zero for the other half, with a kink at zero, so overall it is nonlinear. It is therefore referred to as a piecewise linear function or a hinge function. This makes the ReL function very powerful because it has the best of both worlds:

  • it can make the neural network learn (to mimic) very complex functions from the data (nonlinear functions are best at this)
  • it is easy to train and doesn't saturate (linear activation functions are best at this)

This activation function was first introduced to a dynamical network by Hahnloser et al. in 2000 with strong biological motivations and mathematical justifications. It has been demonstrated for the first time in 2011 to enable better training of deeper networks, compared to the widely used activation functions prior to 2011, e.g., the logistic sigmoid (which is inspired by probability theory; see logistic regression) and its more practical counterpart, the hyperbolic tangent. The rectifier is, as of 2017, the most popular activation function for deep neural networks. (Wikipedia)

A node or unit that implements this activation function is referred to as a rectified linear activation unit, or ReLU for short. Often, networks that use the rectifier function for the hidden layers are referred to as rectified networks.

Despite the simplicity of this activation function, the adoption of ReLU has had such a positive impact on the deep learning community that it is often considered one of the few techniques that made the routine development of very deep neural networks possible.

Its derivative

The derivative of the ReL function is the Heaviside step function, which is as easy to compute as the function itself: it outputs 1 for positive inputs (x > 0) and 0 otherwise. The derivative of the activation function is required when updating the weights of a node as part of the back-propagation of error.
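A quick sketch of that derivative, in the same REPL style as the rel function above (the name rel_derivative is mine, and I arbitrarily pick 0 as the value at x = 0, as discussed in the cons below):

>>> def rel_derivative(x): return 1 if x > 0 else 0  # Heaviside step
>>> rel_derivative(30)
1
>>> rel_derivative(-2019)
0
>>> rel_derivative(0)
0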

[Figure: the Heaviside step function]

However, as you can easily notice, the ReL function is differentiable everywhere except at zero, which is one of its few cons.

Pros of the ReLU

  • It’s computationally efficient: taking the max of two numbers is cheap.
  • Scale-invariant: max(0, s * x) = s * max(0, x) for any s ≥ 0.
  • Sparse activation: in a randomly initialized network, only about 50% of hidden units are activated (have a non-zero output). Sparse representations are desirable in representation learning because they can accelerate learning and simplify the model (see the small sketch after the quote below).
  • Better gradient propagation: fewer vanishing-gradient problems compared to sigmoidal activation functions, which saturate in both directions.
  • Linear behavior: in general, a neural network is easier to optimize when its behavior is linear or close to linear.

“Deep rectifier networks can reach their best performance without requiring any unsupervised pre-training on purely supervised tasks with large labeled datasets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised neural networks, and closing the performance gap between neural networks learnt with and without unsupervised pre-training.” (Glorot et al., 2011, “Deep Sparse Rectifier Neural Networks”)
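As a rough illustration of the sparse-activation point above, here is a toy sketch (not a real network: random Gaussian inputs and zero-centered random weights stand in for a randomly initialized layer, and all the names are mine) showing that roughly half of the units end up with a zero output:

import random

random.seed(0)

def layer_outputs(n_units, n_inputs):
    # One dense layer with zero-centered random weights and a ReL activation.
    inputs = [random.gauss(0, 1) for _ in range(n_inputs)]
    outputs = []
    for _ in range(n_units):
        weights = [random.gauss(0, 1) for _ in range(n_inputs)]
        summed = sum(w * x for w, x in zip(weights, inputs))
        outputs.append(max(0, summed))
    return outputs

outs = layer_outputs(n_units=1000, n_inputs=100)
active = sum(1 for o in outs if o > 0)
print(active / len(outs))   # close to 0.5, i.e. about half the units are active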

Cons

The rectified linear function is great but not perfect. Here are some of its problems:

  • Non-differentiable at zero; however, it is differentiable everywhere else, and the value of the derivative at zero can be arbitrarily chosen to be 0 or 1.
  • Not zero-centered.
  • Not bounded.
  • Should be used on hidden layers only; on the output layer, prefer something like the softmax activation function (for classification) or a linear activation (for regression).
  • Doesn’t perform well in recurrent neural networks; go for the hyperbolic tangent activation function instead (NOT the sigmoid).
  • Dying ReLU problem: ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron, so the neuron becomes stuck in a perpetually inactive state and “dies”. This is a form of the vanishing gradient problem. In some cases, large numbers of neurons in a network can become stuck in dead states, effectively decreasing the model capacity. This problem typically arises when the learning rate is set too high. It can be mitigated by using leaky ReLUs instead, which assign a small positive slope for x < 0 (sketched just after this list). If you run into this problem, you can also use other variants of the ReL function designed to cope with it, e.g. the maxout function.
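For reference, a leaky variant is a one-line change. This is only a sketch, and the default slope of 0.01 below is just a commonly used value, not a prescribed one:

>>> def leaky_rel(x, slope=0.01): return x if x > 0 else slope * x
>>> leaky_rel(30)
30
>>> leaky_rel(-100, slope=0.5)   # exaggerated slope just to make the effect visible
-50.0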

Use “He Weight Initialization”

The ground-breaking paper (Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification) published in 2015 by Kaiming He et al. presented a very efficient recipe to use with ReLU.

Before training a neural network, the weights of the network must be initialized to small random values.

If you use ReLU in your network and initialize the weights to small random values centered on zero, then by default about half of the units in the network will output a zero value.

For example, after uniform initialization of the weights, around 50% of hidden units’ continuous output values are real zeros.

In their paper, they proposed an initialization method very well suited to ReLU, now commonly referred to as “He initialization”. The idea is simple to express: initialize the weights with random values drawn from a Gaussian distribution with mean 0 and standard deviation sqrt(2/n), where n is the number of inputs to the node (known as the fan-in). In practice, both Gaussian and uniform versions of the scheme can be used for He initialization.
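Here is a minimal sketch of the Gaussian version of He initialization for one layer’s weight matrix, in plain Python for illustration (the name he_initialize is mine; in practice you would use your framework’s built-in initializer):

import math
import random

def he_initialize(fan_in, fan_out):
    # He initialization (Gaussian version): weights drawn from N(0, sqrt(2 / fan_in)).
    std = math.sqrt(2.0 / fan_in)
    return [[random.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

weights = he_initialize(fan_in=256, fan_out=128)  # 128 nodes, each with 256 incoming weights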

Summary

In this post I presented the ReLU activation function. I hope you liked it.

Please don’t forget to CLAP & FOLLOW me if you enjoyed the reading. I would also love to read your response!

Go further
