Activation Functions: A Short Summary

심현주
Hyunjulie
Sep 29, 2018 · 5 min read

1. What are Activation Functions?

Activation Functions make the decision of whether or not to pass a signal to the next layer. They take in the weighted sum of inputs plus a bias as an input. They are usually applied to both hidden and output layers.

An activation function can also be thought of as a non-linear transformation applied to the input signal. But why do we need non-linearity at all? Linear functions (e.g. f(x) = x) are easy to work with, but they are limited in the complexity they can express: a stack of linear layers collapses into a single linear mapping, which makes the network less powerful when learning complex functional mappings from data. Non-linearity gives the network the ability to learn complex patterns and represent arbitrary functional mappings.

A simple example of how an activation function is applied
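
To make this concrete, here is a minimal sketch of a single neuron: it takes the weighted sum of its inputs plus a bias and passes the result through an activation function (the inputs, weights, and bias below are made-up illustrative values, and sigmoid is used only as one possible choice):

import numpy as np

x = np.array([0.5, -1.2, 3.0])   # example inputs (arbitrary values)
w = np.array([0.4, 0.3, -0.2])   # example weights (arbitrary values)
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted sum of inputs plus a bias
a = 1 / (1 + np.exp(-z))         # activation function applied to z (sigmoid here)
print(z, a)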

2. Short Summary of Gradient Descent and Backpropagation

Suppose the desired output of a network is y and the network produces an output y'. The difference between the predicted output and the desired output (y - y') is turned into a loss function (J). Our goal is to optimize the loss function (i.e. make the loss as small as possible) over the training set.

Visualization of Gradient Descent

The loss function above is shaped like a bowl: the partial derivative of the loss function (J) with respect to a weight is the slope of the bowl at that location. By moving in the direction indicated by the partial derivatives, we can get to the bottom of the bowl and minimize our loss function.

Gradient Descent is the idea of using the partial derivatives of the loss function iteratively to reach its local minimum.
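
As a small sketch of this idea, here is gradient descent on a made-up one-parameter loss J(w) = (w - 3)^2 (not a real network, just an illustration of repeatedly stepping along the negative slope):

def dJ_dw(w):
    # partial derivative of J(w) = (w - 3)**2 with respect to w
    return 2 * (w - 3)

w = 0.0    # arbitrary starting point
lr = 0.1   # learning rate (step size)
for _ in range(100):
    w = w - lr * dJ_dw(w)   # step in the direction that decreases J
print(w)                    # approaches 3, the bottom of the "bowl"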

In Neural Networks, the weights are updated using a method called Backpropagation. It applies the chain rule iteratively (the backward pass in the picture) to compute the gradient of the loss with respect to each weight, which gradient descent then uses to move toward the minimum of the loss function.

Simple Visualization of Backpropagation
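
As a rough illustration of the backward pass, here is the chain rule written out for a single sigmoid neuron with a squared-error loss (all values are made up for the example):

import numpy as np

x, w, b, y = 1.5, 0.8, 0.1, 1.0   # made-up input, weight, bias, and target

# forward pass
z = w * x + b
y_hat = 1 / (1 + np.exp(-z))      # predicted output y'
J = 0.5 * (y - y_hat) ** 2        # loss

# backward pass: chain rule dJ/dw = dJ/dy' * dy'/dz * dz/dw
dJ_dyhat = -(y - y_hat)
dyhat_dz = y_hat * (1 - y_hat)    # derivative of sigmoid
dz_dw = x
dJ_dw = dJ_dyhat * dyhat_dz * dz_dw

w = w - 0.1 * dJ_dw               # one gradient descent update of the weight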

3. Types of Activation Functions

There are two types of activation functions: Linear Activation Functions (f(x) = x, which essentially pass the input through unchanged) and Non-Linear Activation Functions. Below I discuss the most frequently used activation functions today.

Cheatsheet of most frequently used Activation Functions

3.1 Sigmoid

  • Also known as Logistic Activation Function
  • Range: Between 0 and 1
Sigmoid Function
  • Problems:
    — Vanishing Gradient problem: the function is flat near 0 and 1, so during back-propagation the gradients of neurons whose output is near 0 or 1 are nearly 0 (a.k.a. saturated neurons). This leaves the weights of these neurons unable to update (a small numeric check follows the code below)
    — Output is not zero-centered: the gradients of the weights all share the same sign, so gradient updates tend to zig-zag in different directions
    — Saturates and kills gradients
    — Slow convergence
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # note: expects x to already be the output of sigmoid(x)
    return x * (1 - x)
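
As mentioned in the vanishing-gradient bullet above, the gradient shrinks towards zero as the input moves away from 0. A quick numeric check using the functions just defined (remember that sigmoid_derivative expects the sigmoid output, not the raw input):

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(x, sigmoid_derivative(s))   # 0.25 at x = 0, nearly 0 by x = 10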

3.2 Hyperbolic Tangent (Tanh)

  • Tanh(x) = 2Sigmoid(2x) - 1
    Also known as a scaled sigmoid function; it was introduced to fix the problem that sigmoid is not zero-centered
  • Range: Between -1 to 1 (zero-centered)
Tanh Function
  • Usually used in classification between two classes
  • Problem:
    Vanishing Gradient problem
def tanh(x):
    # equivalent to 2 * sigmoid(2 * x) - 1
    return np.tanh(x)

def dtanh(x):
    # note: expects x to already be the output of tanh(x)
    return 1. - x * x
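
A quick numeric check of the identity Tanh(x) = 2Sigmoid(2x) - 1 stated above, reusing the sigmoid defined earlier:

xs = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(xs), 2 * sigmoid(2 * xs) - 1))   # True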

3.3 Rectified Linear Units (ReLU)

  • Simple and efficient: it is reported to converge about 6 times faster than the tanh function
  • Range: [0, infinity)
ReLU Function
  • Avoids Vanishing Gradient problem
  • Can only be used within the hidden layers of a Neural Network model (its output is not scaled to a fixed range)
  • Problem:
    Some gradients are fragile during training and can die: a large weight update can push a neuron into a region where it never activates on any data point again (the "dying ReLU" problem).
def ReLU(x):
    return np.maximum(0.0, x)

def ReLU_derivative(x):
    # scalar version; the gradient at x = 0 is taken to be 0
    if x <= 0:
        return 0
    else:
        return 1
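
Because ReLU above works on numpy arrays while ReLU_derivative only handles scalars, one possible vectorized variant (an assumption on my part, not from the original text) is:

def ReLU_derivative_vec(x):
    # 1 where x > 0, 0 elsewhere (the gradient at x = 0 is taken to be 0)
    return np.where(x > 0, 1.0, 0.0)

print(ReLU_derivative_vec(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 1.]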

3.4 Leaky ReLU

  • Fixes the problem of dying neurons
  • Range: (-infinity, infinity) → extends the range of the ReLU function
ReLU vs Leaky ReLU
  • The slope below x=0 (usually called 'a') is in most cases set to 0.01
  • When a is chosen randomly rather than fixed, the function is called Randomized ReLU
def LReLU(x):
    a = 0.01
    if x >= 0:
        return x
    else:
        return a * x

def LReLU_derivative(x):
    a = 0.01
    if x >= 0:
        return 1
    else:
        return a
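
For completeness, a vectorized sketch of the same idea (keeping a = 0.01, the common default mentioned above):

def LReLU_vec(x, a=0.01):
    return np.where(x >= 0, x, a * x)

def LReLU_derivative_vec(x, a=0.01):
    return np.where(x >= 0, 1.0, a)

print(LReLU_vec(np.array([-3.0, 2.0])))   # [-0.03  2.  ]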

3.5 Exponential Linear Unit (ELU)

  • Acts like ReLU when x is positive, but for negative values it smoothly approaches a fixed lower bound of -1 (with a = 1)
  • Range: (-1, infinity)
  • Fast
ELU vs ReLU vs Leaky ReLU functions
def ELU(x):
    a = 1.0  # a = 1 gives the range (-1, infinity) stated above
    if x >= 0:
        return x
    else:
        return a * (np.exp(x) - 1)

def ELU_derivative(x):
    a = 1.0
    if x >= 0:
        return 1
    else:
        return ELU(x) + a  # equals a * exp(x) for x < 0
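
A quick check that ELU matches ReLU for positive inputs and approaches its lower bound of -1 (with a = 1) for very negative inputs:

for x in [-10.0, -1.0, 0.0, 2.0]:
    print(x, ELU(x))   # tends to -1 as x becomes very negative, equals x for x >= 0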

3.6 Self-Gated Activation Function (SWISH)

  • A recently proposed activation function by Google researchers
  • According to the paper, it performs better than ReLU
  • One-sided boundedness at zero: like ReLU, it is unbounded above but bounded below
Mathematical representation of Swish
def SWISH(x):
    b = 1.0
    return x * sigmoid(b * x)

def SWISH_derivative(x):
    # f'(x) = b * f(x) + sigmoid(b * x) * (1 - b * f(x))
    return b * SWISH(x) + sigmoid(b * x) * (1 - b * SWISH(x))
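
A small finite-difference sanity check (my own, not from the paper) that SWISH_derivative matches the slope of SWISH:

eps = 1e-6
for x in [-2.0, 0.5, 3.0]:
    numeric = (SWISH(x + eps) - SWISH(x - eps)) / (2 * eps)
    print(abs(numeric - SWISH_derivative(x)) < 1e-5)   # True for each x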

In Neural Networks, the rule of thumb is to use ReLU or Leaky ReLU, but there's no harm in exploring newer activation functions to see if they work better :)

Sources:

http://image-net.org/challenges/posters/JKU_EN_RGB_Schwarz_poster.pdf

https://arxiv.org/pdf/1710.05941.pdf
