8. Introduction to Deep Learning with Computer Vision — Activation Functions

Inside AI · Deep-Learning-For-Computer-Vision
Nov 23, 2019

Written by Praveen Kumar & Nilesh Singh

After our audacious first attempt at building an actual network (MNIST from scratch), we are finally back to theory.

In this blog we’ll try to understand:

What are activation functions? How do they work? Why do we need them? What are their different types? And which one should you choose for your network?

What is an activation function?

Fig 1

In the simplest of terms, an activation function is a function that takes an input signal and converts it to an output signal. According to Wikipedia, "In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs."

In Fig 1, node 2 receives 3 inputs. Node 2 performs some arithmetic on them and derives the activated output; the final output is simply the activated output, i.e. the output of the activation function.

Inside node 2, we can see "f" (the activation function) written. Let's try to understand how activation functions work and what their different types are.

How do activation functions work?

Activation functions are simply mathematical functions that act as a gateway for the incoming signal (the input values). For example, consider the step function.

Fig 2

In Fig 2, if we get an input value which is less than 0, we always return 0, and if the input is 0 or any positive value, we always return 1.

That's how all activation functions work. Each activation function is a simple mathematical formula that acts as a threshold gateway, changing its behaviour based on the input at hand and giving an activated output.
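
To make this concrete, here is a minimal sketch of the step function described above (a NumPy illustration of our own, not code from the original series):

    import numpy as np

    def step(x):
        # Return 1 where the input is 0 or positive, and 0 otherwise,
        # exactly as described for the threshold step function above.
        return np.where(x >= 0, 1, 0)

    print(step(np.array([-2.0, -0.5, 0.0, 3.0])))  # -> [0 0 1 1]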

Let's take another example to fully understand how the output differs depending on the activation function used.

Fig 3

In Fig 3, we have a sigmoid activation function. It is very similar to our threshold step function; the only real difference is the output for input values in the range -4 to +4. For inputs in roughly that range, the sigmoid returns values between about 0.1 and 0.9 (and exactly 0.5 at an input of 0), whereas the threshold step function always returns 0 for negative inputs and 1 for inputs in the range 0 to +infinity.

Why do we require activation functions?

For an artificial neural network to work and perform complicated computations, it requires more than a linear representation. Activation functions provide the network with the required non-linearity. Without activation functions, our neural networks would cease to be anything more than a simple linear regression model.

Activation functions introduce non-linearity in our network.

Consider the following 2 cases:

  1. The parameters of a linear equation
Fig 4

If we were to learn the parameters of the above equation, we would not require an activation function; this could simply be done by a predefined linear model.

2. Learning a human face

Fig 5

If we were to learn a human face, we would require a function that allows such complex and non-linear information to pass between layers.

In such cases, we want to move complex, non-linear information between layers. This is not achievable by a simple predefined linear function, and that's where activation functions come to our rescue. Activation functions also give us the ability to learn complex mappings from data.
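
To see why stacking purely linear layers doesn't help, here is a quick sketch of our own (assuming NumPy): two linear layers with no activation in between collapse into a single linear layer.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 3))                            # a small batch of inputs
    W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))

    # Two "layers" with no activation in between...
    out_two_layers = (x @ W1) @ W2

    # ...are exactly equivalent to one linear layer with weights W1 @ W2.
    out_one_layer = x @ (W1 @ W2)
    print(np.allclose(out_two_layers, out_one_layer))      # True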

Types of activation functions

  1. Sigmoid function:

Sigmoid is one of the earliest activation functions. In almost all cases, newer and more reliable functions are now preferred over it. Nonetheless, it won't hurt to know a little about it.

The curve for Sigmoid looks like an S.

Sigmoid Activation Function

As is evident from the figure, the range of output values produced by sigmoid is 0 to 1.

Mathematically it can be expressed as:

f(x) = 1 / (1 + exp(-x))

Sigmoid does have some good characteristics, such as being differentiable at every point (this comes in handy while backpropagating), but its pros are heavily outweighed by its vices. Sigmoid activations cause the vanishing gradient problem: for inputs far from zero the curve saturates, its gradient becomes nearly zero, and the gradients flowing backward through the network are effectively killed. The convergence rate of sigmoid-based networks is also very slow.
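
A minimal NumPy sketch of sigmoid and its derivative (our own illustration) shows the saturation that causes vanishing gradients:

    import numpy as np

    def sigmoid(x):
        # f(x) = 1 / (1 + exp(-x)), output in (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        # Derivative f'(x) = f(x) * (1 - f(x)); it peaks at only 0.25 when x = 0
        s = sigmoid(x)
        return s * (1.0 - s)

    for x in [0.0, 4.0, 10.0]:
        print(x, sigmoid(x), sigmoid_grad(x))
    # The gradient at x = 10 is ~4.5e-05; multiplying many such tiny
    # gradients through deep layers is what makes them vanish.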

2. Hyperbolic tangent Activation or tanh:

The curve for tanh is very similar to sigmoid, the only real difference being the output range: the tanh function maps input values to an output range between -1 and 1, and it is centered at 0.

Mathematical representation:

f(x) = (1 - exp(-2x)) / (1 + exp(-2x)), or equivalently 2*sigmoid(2x) - 1

Comparison between sigmoid and tanh curves

Even with the increased range, the tanh activation function suffers from milder versions of the same ailments as the sigmoid function (its gradients still saturate for large inputs) and hence is not very popular today.
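
Again as a small sketch of our own (assuming NumPy), tanh can be checked against the 2*sigmoid(2x) - 1 identity in the formula above:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh_via_sigmoid(x):
        # Same curve as np.tanh, just rewritten in terms of sigmoid
        return 2.0 * sigmoid(2.0 * x) - 1.0

    x = np.linspace(-3, 3, 7)
    print(np.allclose(np.tanh(x), tanh_via_sigmoid(x)))  # True
    print(np.tanh(x).min(), np.tanh(x).max())            # outputs stay within (-1, 1)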

3. ReLU (Rectified Linear Unit) Activation Function

The name sounds quite complicated, doesn't it? Trust me, the way it works is not half as complicated. Think of ReLU as a gatekeeper who knows only two rules:

  1. If the number trying to enter is greater than 0, then open the gate and let it in.
  2. If the number is negative, then don't let it in; instead, just send a 0 in its place.

Mathematical Representation:

f(x) = max(0,x)

Curve for ReLU

Despite its very simple nature, ReLU outperforms nearly all other activation functions and is the most commonly used activation function across the DL community. It does not suffer from the vanishing gradient problem or a slow convergence rate. Another big plus point of ReLU is that it is fast, as it does not involve any complex calculations the way tanh or sigmoid do.
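
The gatekeeper analogy and the formula f(x) = max(0, x) translate almost one-to-one into code (a sketch of our own, assuming NumPy):

    import numpy as np

    def relu(x):
        # Let positive values through unchanged, replace negatives with 0
        return np.maximum(0, x)

    x = np.array([-3.0, -0.1, 0.0, 2.5])
    print(relu(x))  # -> [0.  0.  0.  2.5]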

It does suffer from a problem called dying ReLU. To quote Wikipedia:

“ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron, and so the neuron becomes stuck in a perpetually inactive state and “dies.” In some cases, large numbers of neurons in a network can become stuck in dead states, effectively decreasing the model capacity. This problem typically arises when the learning rate is set too high.”

4. All other kinds of ReLU:

To counter the dying ReLU problem, Leaky ReLU was introduced, which allows a small negative value to pass through the neuron. The slope on the negative side is controlled by a parameter alpha. The value of alpha is 0.0 in the case of vanilla ReLU; if it is set to 1.0, the curve on the negative side becomes the line y = x and we get a plain linear function.

Leaky ReLU with diff. alpha values
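
Here is a minimal sketch of Leaky ReLU with the alpha parameter described above (our own illustration, assuming NumPy):

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        # Positive inputs pass unchanged; negative inputs are scaled by alpha.
        # alpha = 0.0 recovers vanilla ReLU; alpha = 1.0 gives the identity line y = x.
        return np.where(x > 0, x, alpha * x)

    x = np.array([-4.0, -1.0, 0.5, 3.0])
    print(leaky_relu(x))             # small negative "leak": [-0.04 -0.01  0.5   3.  ]
    print(leaky_relu(x, alpha=0.0))  # vanilla ReLU behaviour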

We have a plethora of other over-engineered versions of ReLU. They include Thresholded ReLU, ELU (Exponential Linear Unit), SELU (Scaled Exponential Linear Unit) and so on. These activation functions are rarely used, and it sometimes feels that they were engineered just for the sake of it.

If you wish to read about all the activation functions in more detail, you can visit this link.

Which one should you choose?

As a beginner, you are most likely going to use ReLU over other activation functions. Sometimes people also prefer sigmoid and tanh, but that's about it. Due to their complex nature, sigmoid and tanh are often blamed for pulling the DL revolution back by many years. ReLU is currently the state of the art, and for fun, people also call it the "simple-of-the-art".
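
As a practical illustration (a sketch of our own, assuming the Keras API, which is not prescribed anywhere in this post), choosing an activation is usually just a one-word argument per layer:

    from tensorflow.keras import layers, models

    # A small fully connected model: ReLU in the hidden layers is the
    # default, "safe" choice recommended above.
    model = models.Sequential([
        layers.Dense(64, activation='relu', input_shape=(784,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(10)  # output layer; its activation depends on the task
    ])
    model.summary()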

Let it sink in for a while now. Which activation to use is a choice that changes according to your problem statement. As we go further, we will also highlight some other pros and cons of ReLU as well as of other activation functions. Let's take some rest and get back to more concepts in upcoming articles.

Hope you enjoyed it. See you soon!

NOTE: We are starting a new Telegram group to tackle all the questions and any sort of queries. You can openly discuss concepts with other participants and get more insights, and this will be more helpful as we move further down the publication. [Follow this LINK to join]
