Activation Functions (Part 1)

Vineeth S Subramanyam · Published in Analytics Vidhya · Mar 15, 2021

An activation function is applied to the output of a neuron, and it is what allows a network to learn more complex functions as we go deeper. Activations can also be thought of as mappings that modify the range of a neuron's output. For example, we could use an activation function to ensure that all our outputs lie between 0 and 1, or between -1 and 1.

Figure 1: Expression for the output of a neuron, followed by applying an activation function
  • The first equation in Figure 1 is the formula used to calculate the output of a neuron. The equation takes the form of a straight line y = mx + c
Figure 2: Plot of the equation of y
  • As we increase the number of layers in a neural network, we want it to learn more complex features that better represent the data. If we were to use linear functions as activations, learning such features would become quite challenging. Looking at the figure below, it is clear that these points cannot be split with a straight line.
Figure 3: Non Linear Features
Figure 4: 2 Layer Neural Network
Figure 5: Defining the terms used
  • If we were to use a linear activation function like the equation defined in Figure 5, where f_activation = a*y, we could end up representing the entire network as a single layer.
Figure 6: Expanding the derivatives
  • Going through the equations in Figure 6, y represents the final output of the network after the linear activation. x1 represents the output of layer 1 after the linear activation.
  • We can now replace the x1 term in the first equation with the expression for x1 from the second equation, and then expand the equation.
Figure 7: Simplifying the derivatives
  • After simplifying the equation, since a2 and a1 are constants, we can write a2*W2*a1*W1 as a new weight matrix W3.
  • The remaining term (a2*W2*a1*b1 + a2*b2) is, in essence, a new constant vector that we can call a bias b3.
  • This means the entire two-layer network can be replaced by a single linear equation, y = W3*x0 + b3, as the sketch below verifies numerically.
  • This is one of the main reasons a non-linear activation function is used: to break the direct linear relationship between input and output.
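As a quick sanity check, the collapse above can be verified numerically. The following NumPy sketch is my own illustration (the shapes and the slopes a1, a2 are arbitrary): it builds a two-layer network with linear activations f(y) = a*y and confirms its output matches a single layer with W3 = a2*W2*a1*W1 and b3 = a2*W2*a1*b1 + a2*b2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer parameters (shapes are arbitrary, chosen only for illustration)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
a1, a2 = 0.5, 2.0            # slopes of the "linear activations" f(y) = a*y

x0 = rng.normal(size=3)      # network input

# Two layers, each followed by a linear activation
x1 = a1 * (W1 @ x0 + b1)     # output of layer 1
y  = a2 * (W2 @ x1 + b2)     # final output

# Collapsed single layer: y = W3 @ x0 + b3
W3 = a2 * W2 @ (a1 * W1)
b3 = a2 * W2 @ (a1 * b1) + a2 * b2
print(np.allclose(y, W3 @ x0 + b3))   # True: two linear layers act as one
```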

Commonly used activations:

  • Sigmoid Function
Figure 8: Sigmoid Function
Figure 9: Graph of sigmoid function, and its derivative

Forward Pass of a Sigmoid Neuron:

  • The sigmoid function ranges from 0 to +1
  • If we analyze the sigmoid graph, for input values greater than about 5 or less than about -5 we get nearly the same output (i.e., inputs of 10, 100, or 1000 all produce an output of roughly 1), as the short sketch below illustrates.
  • The presence of an exponent calculation in its formula also makes it slower to compute in comparison to some other activation functions, although that isn't a significant drawback.
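A minimal NumPy sketch of the forward pass (the function name sigmoid and the sample inputs are my own, not from the original code) makes the saturation easy to see.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-100.0, -10.0, -5.0, 0.0, 5.0, 10.0, 100.0])
print(sigmoid(x))
# Inputs beyond roughly |x| > 5 are already saturated:
# sigmoid(10) and sigmoid(100) both print as ~1, sigmoid(-10) and sigmoid(-100) as ~0.
```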

Backward Pass of a Sigmoid Neuron:

Figure 10: Derivative of Sigmoid Neuron
  • This derivative can be rewritten as:
Figure 11: Simplified derivative of sigmoid neuron
  • The graph of the derivative is represented by the backward plot in Figure 9
  • Looking at the graph, we can see that just as the forward pass saturates at large input magnitudes, the derivative saturates there as well: the slope of the sigmoid is negligible for large positive or negative inputs, so the derivative is almost 0 in those regions. This is commonly called the vanishing gradient problem.
  • Once the weights feeding a sigmoid neuron become too large in magnitude, its pre-activation input becomes large, its forward output saturates to either 0 or 1, and its derivative becomes nearly 0, so the weights are no longer updated by any significant amount.
  • Since the sigmoid function squashes its output between 0 and 1, it is not zero centered (i.e., the sigmoid of an input of 0 is 0.5 rather than 0, and the function never outputs negative values).
Figure 12: Derivative of a weight using sigmoid neuron
Figure 13: Derivative of y with respect to weights
  • From Figures 12 and 13, we can see that the derivative of the loss with respect to the weights of a layer depends on the input to the layer itself (i.e., x).
  • For example, if we have a 2-layer neural network and use a sigmoid activation after layer 1, we know that the output of layer 1 always lies between 0 and 1.
  • If we calculate the derivative of layer 2's output with respect to its weights, i.e. dy/d(weight), Figure 13 shows that it equals x (the output of layer 1). Since we used a sigmoid activation at layer 1, x is always a positive value between 0 and 1.
  • This means the sign of the derivative of layer 2's weights depends entirely on the sign of the term d(Loss)/dy, since dy/d(weight) is always positive.
  • This constrains all the weights of layer 2 to be updated in the same direction within a single step: either all positively or all negatively, never a mix of the two. That can lead to a lot of zig-zagging, where the weights are all pushed in one direction in one step and in the opposite direction in the next. The sketch below illustrates both this effect and the vanishing gradient.
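The two issues above can be seen in a short sketch (my own; the helper names and the sample numbers, including the made-up upstream gradient dL_dy, are not from the original post). The derivative sigmoid(x)*(1 - sigmoid(x)) collapses towards zero for large |x|, and because the sigmoid outputs feeding layer 2 are always positive, every entry of the weight gradient shares the sign of d(Loss)/dy.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# 1) Vanishing gradient: the derivative shrinks towards 0 for large |x|
print(sigmoid_derivative(np.array([0.0, 2.0, 5.0, 10.0])))
# -> roughly [0.25, 0.105, 0.0066, 0.000045]

# 2) Not zero centered: with x = sigmoid(layer-1 pre-activations) > 0,
#    dLoss/dW2 = dLoss/dy * x, so every weight gradient has the sign of dLoss/dy.
x = sigmoid(np.array([-1.2, 0.3, 2.0]))   # layer-1 outputs, all in (0, 1)
dL_dy = -0.7                              # a made-up upstream gradient
print(dL_dy * x)                          # all entries share the sign of dL_dy
```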
  • Tanh Function
Figure 14: Tanh Function
Figure 15: Graph of tanh function, and its derivative

Forward Pass of a Tanh Neuron:

  • The tanh function ranges from -1 to +1
  • If we analyze the tanh graph, we notice the same issue we saw with the sigmoid neuron: for inputs greater than about 3 or less than about -3 we end up with nearly the same output of +1 or -1, i.e. the function saturates (see the sketch below).
  • The presence of multiple exponents in its formula makes it slower to compute than the sigmoid neuron.
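A small NumPy sketch of the forward pass (my own; tanh_manual and the sample inputs are illustrative) writes out the exponent form of tanh, (e^x - e^-x) / (e^x + e^-x), checks it against np.tanh, and shows the saturation beyond |x| ≈ 3.

```python
import numpy as np

def tanh_manual(x):
    """tanh written out with its exponents: (e^x - e^-x) / (e^x + e^-x)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.array([-10.0, -3.0, -1.0, 0.0, 1.0, 3.0, 10.0])
print(np.allclose(tanh_manual(x), np.tanh(x)))   # True
print(np.tanh(x))
# Output lies in (-1, 1) and is already ~±1 for |x| >= 3,
# e.g. tanh(3) ≈ 0.995 and tanh(10) ≈ 1.0
```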

Backward Pass of a Tanh Neuron:

Figure 16: Derivative of a function u/v
  • Since tanh is a ratio of two functions u/v, its derivative can be computed with the quotient rule shown in Figure 16.
Figure 17: Calculating the Tanh Derivative
Figure 18: Tanh derivative continued
Figure 19: Tanh derivative continued
Figure 20: Tanh Derivative simplified
  • The graph of this derivative is shown by the backward plot in Figure 15.
  • The saturation of the tanh function at large magnitudes poses the same vanishing gradient issue as the sigmoid neuron (i.e., the derivative becomes nearly 0 at large positive or negative values, leading to negligible weight updates).
  • The advantage the tanh function has over the sigmoid function is that it is zero centered.
  • Going back to the explanation for Figures 12 and 13, the derivative with respect to a layer's weights (i.e., dy/d(weight)) depends directly on the input to that layer (i.e., x). Using a tanh activation in the previous layer gives us an x containing both positive and negative values.
  • This removes the constraint of updating all the weights positively or all negatively, since x now lies in (-1, 1); the sketch below shows the mixed-sign gradients.
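A companion sketch to the sigmoid one above (again my own, with made-up numbers): the simplified derivative 1 - tanh²(x) still vanishes at large |x|, but because tanh outputs take both signs, the downstream weight gradients are no longer forced to share one sign.

```python
import numpy as np

def tanh_derivative(x):
    """d/dx tanh(x) = 1 - tanh(x)**2, the simplified form the derivation arrives at."""
    return 1.0 - np.tanh(x) ** 2

# Vanishing gradient at large magnitudes, same story as the sigmoid
print(tanh_derivative(np.array([0.0, 1.0, 3.0, 10.0])))
# -> roughly [1.0, 0.42, 0.0099, 0.0000000082]

# Zero-centered outputs: layer-1 outputs now carry both signs,
# so dLoss/dW2 = dLoss/dy * x can mix positive and negative updates.
x = np.tanh(np.array([-1.2, 0.3, 2.0]))
dL_dy = -0.7                 # a made-up upstream gradient
print(dL_dy * x)             # mixed signs, unlike the sigmoid case
```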

Summary:

  • In the next post I will go over some more commonly used activations like ReLU and Leaky ReLU.
  • Here is the link to simple code that implements these activation functions along with their derivatives and visualizes their outputs.
