Activation Functions for Deep Learning

Mehmet Toprak
4 min read · Jun 14, 2020


Activation functions play a major role in the learning process of a neural network. So far, we have used only the sigmoid function as the activation function in our networks, but we saw that the sigmoid function has its shortcomings, since it can lead to the vanishing gradient problem for the earlier layers. In this blog, we will discuss other activation functions, ones that are more efficient and more applicable to deep learning applications.
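To see why vanishing gradients hurt the earlier layers, recall that backpropagation multiplies in the sigmoid's derivative once per layer. The minimal NumPy sketch below (the layer counts are just for illustration) shows how quickly that product shrinks:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25 (its value at z = 0)

# Backpropagation multiplies in one derivative factor per sigmoid layer,
# so even in the best case the gradient reaching an early layer shrinks
# exponentially with depth.
for depth in [1, 3, 5, 10]:
    gradient_factor = sigmoid_derivative(0.0) ** depth
    print(f"{depth} layers deep: gradient factor ~ {gradient_factor:.6f}")
# 10 layers deep: gradient factor ~ 0.000001 -> the early layers barely learn
```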

There are seven types of activation functions that you can use when building a neural network: the binary step function, the linear (identity) function, our old friend the sigmoid (logistic) function, the hyperbolic tangent (tanh) function, the rectified linear unit (ReLU) function, the leaky ReLU function, and the softmax function. In this blog, we will discuss the most popular ones: the sigmoid, the hyperbolic tangent, the ReLU, and the softmax functions.

Sigmoid Function

The sigmoid function is defined as a = 1 / (1 + e^(-z)). At z = 0, a is equal to 0.5; when z is a very large positive number, a is close to 1; and when z is a very large negative number, a is close to 0. Sigmoid functions used to be widely used as activation functions in the hidden layers of a neural network. However, the function is pretty flat beyond the +3 and -3 region, which means that once the input falls in that region, the gradients become very small. This results in the vanishing gradient problem that we discussed, and as the gradients approach 0, the network doesn't really learn. Another problem with the sigmoid function is that its values only range from 0 to 1, so the function is not symmetric around the origin: the values it passes on are all positive. We do not always want the values going to the next neuron to have the same sign. This can be addressed by scaling the sigmoid function, which brings us to the next activation function: the hyperbolic tangent function.
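Here is a minimal NumPy sketch of the sigmoid (the sample inputs are just for illustration) that confirms those values and shows how flat the function gets:

```python
import numpy as np

def sigmoid(z):
    """a = 1 / (1 + e^(-z)): squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                # 0.5
print(sigmoid(10.0))               # ~0.99995 (large positive z -> close to 1)
print(sigmoid(-10.0))              # ~0.00005 (large negative z -> close to 0)
print(sigmoid(3.0), sigmoid(6.0))  # 0.9526 vs 0.9975: already nearly flat
```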

Hyperbolic Tangent Function

The hyperbolic tangent, or tanh, function is very similar to the sigmoid function. It is actually just a scaled version of the sigmoid, but unlike the sigmoid, it is symmetric about the origin and ranges from -1 to +1. However, although it overcomes the sigmoid's lack of symmetry, it still leads to the vanishing gradient problem in very deep neural networks.
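The "scaled version of the sigmoid" claim can be checked directly, since tanh(z) = 2 · sigmoid(2z) - 1. A minimal sketch (the sample inputs are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """A scaled, shifted sigmoid: symmetric about the origin, range (-1, 1)."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(tanh(z))     # approximately [-1.  -0.7616  0.  0.7616  1.]
print(np.tanh(z))  # matches NumPy's built-in tanh
```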

ReLU Function

The rectified linear unit, or ReLU, function is the most widely used activation function when designing networks today. In addition to being nonlinear, the main advantage of the ReLU function over the other activation functions is that it does not activate all the neurons at the same time. Any negative input is converted to 0, so the corresponding neuron does not get activated. This means that at any given time only some of the neurons are activated, making the network sparse and very efficient. The ReLU function was also one of the main advancements in the field of deep learning that led to overcoming the vanishing gradient problem.
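A minimal sketch of ReLU and the sparsity it induces (the pre-activation values below are made up for illustration):

```python
import numpy as np

def relu(z):
    """max(0, z): negative inputs are zeroed out, positive inputs pass through."""
    return np.maximum(0.0, z)

# Pre-activations for a layer of 8 neurons; the negative ones are switched off.
z = np.array([-2.1, 0.7, -0.3, 1.5, -0.9, 2.4, -1.2, 0.1])
a = relu(z)
print(a)                                                    # [0.  0.7 0.  1.5 0.  2.4 0.  0.1]
print(np.count_nonzero(a), "of", a.size, "neurons active")  # 4 of 8 neurons active
```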

Softmax Function

One last activation function that we will discuss here is the softmax function. The softmax function is a generalization of the sigmoid function, and it is handy when we are trying to handle classification problems. The softmax function is ideally used in the output layer of the classifier, where we are actually trying to get probabilities that define the class of each input. So, if a network with 3 neurons in the output layer outputs [1.6, 0.55, 0.98], then with a softmax activation function the outputs get converted to approximately [0.53, 0.19, 0.28]. This way, it is easier for us to classify a given data point and determine to which category it belongs.
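That example can be reproduced with a minimal softmax sketch:

```python
import numpy as np

def softmax(z):
    """Exponentiate each output and normalize so the results sum to 1."""
    exp_z = np.exp(z - np.max(z))  # subtracting the max improves numerical stability
    return exp_z / np.sum(exp_z)

outputs = np.array([1.6, 0.55, 0.98])
probs = softmax(outputs)
print(np.round(probs, 2))  # [0.53 0.19 0.28]
print(probs.sum())         # ~1.0, a proper probability distribution
```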

In conclusion, the sigmoid and tanh functions are avoided in many applications nowadays since they can lead to the vanishing gradient problem. The ReLU function is the one most widely used today, and it is important to note that it is only used in the hidden layers. Finally, when building a model, you can begin with the ReLU function and then switch to other activation functions if ReLU does not yield good performance.
