Activation Functions: A Short Summary

심현주
Hyunjulie
Sep 29, 2018 · 5 min read

1. What are Activation Functions?

Activation Functions make the decision of whether or not to pass a signal to the next layer. They take in the weighted sum of inputs plus a bias as an input. They are usually applied to both hidden and output layers.

An activation function can also be thought of as a non-linear transformation applied to the input signal. But why do we need non-linearity at all? Linear functions (e.g. f(x) = x) are easy to work with, but they are limited in the complexity they can express: a stack of linear layers collapses into a single linear mapping, which makes the network less powerful when learning complex functional mappings from data. Non-linearity gives the network the ability to learn complex patterns and represent arbitrary functional mappings.

A simple example of how an activation function is applied
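
To make this concrete, here is a minimal sketch of a single neuron: it takes the weighted sum of its inputs plus a bias and passes the result through an activation function (the inputs, weights, and bias below are made-up illustrative values, and sigmoid is used only as one possible choice):

import numpy as np

x = np.array([0.5, -1.2, 3.0])   # example inputs (arbitrary values)
w = np.array([0.4, 0.3, -0.2])   # example weights (arbitrary values)
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted sum of inputs plus a bias
a = 1 / (1 + np.exp(-z))         # activation function applied to z (sigmoid here)
print(z, a)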

2. Short Summary of Gradient Descent and Backpropagation

Suppose the desired output of a network is y and the network produces an output y'. The difference between the predicted output and the desired output (y - y') is turned into a loss function (J). Our goal is to optimize the loss function (i.e. make the loss as small as possible) over the training set.

Visualization of Gradient Descent

The loss function above is shaped like a bowl: the partial derivative of the loss function (J) with respect to a weight is the slope of the bowl at that location. By moving in the direction indicated by the partial derivatives, we can get to the bottom of the bowl and minimize our loss function.

Gradient Descent is the idea of using the partial derivatives of the loss function iteratively to reach its local minimum.
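
As a small sketch of this idea, here is gradient descent on a made-up one-parameter loss J(w) = (w - 3)^2 (not a real network, just an illustration of repeatedly stepping along the negative slope):

def dJ_dw(w):
    # partial derivative of J(w) = (w - 3)**2 with respect to w
    return 2 * (w - 3)

w = 0.0    # arbitrary starting point
lr = 0.1   # learning rate (step size)
for _ in range(100):
    w = w - lr * dJ_dw(w)   # step in the direction that decreases J
print(w)                    # approaches 3, the bottom of the "bowl"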

In Neural Networks, the weights are updated using a method called Backpropagation. It applies the chain rule iteratively (the backward pass in the picture) to compute the gradient of the loss with respect to each weight, which gradient descent then uses to move toward the minimum of the loss function.

Simple Visualization of Backpropagation
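
As a rough illustration of the backward pass, here is the chain rule written out for a single sigmoid neuron with a squared-error loss (all values are made up for the example):

import numpy as np

x, w, b, y = 1.5, 0.8, 0.1, 1.0   # made-up input, weight, bias, and target

# forward pass
z = w * x + b
y_hat = 1 / (1 + np.exp(-z))      # predicted output y'
J = 0.5 * (y - y_hat) ** 2        # loss

# backward pass: chain rule dJ/dw = dJ/dy' * dy'/dz * dz/dw
dJ_dyhat = -(y - y_hat)
dyhat_dz = y_hat * (1 - y_hat)    # derivative of sigmoid
dz_dw = x
dJ_dw = dJ_dyhat * dyhat_dz * dz_dw

w = w - 0.1 * dJ_dw               # one gradient descent update of the weight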

3. Types of Activation Functions

There are two types of activation functions: Linear Activation Functions (f(x) = x, which essentially pass the input through unchanged) and Non-Linear Activation Functions. Below I discuss the most frequently used activation functions today.

Cheatsheet of most frequently used Activation Functions

3.1 Sigmoid

  • Also known as Logistic Activation Function
  • Range: Between 0 and 1
Sigmoid Function
  • Problems:
    — Vanishing Gradient problem: the function is flat near 0 and 1, so during back-propagation the gradients of neurons whose output is near 0 or 1 are nearly 0 (a.k.a. saturated neurons). This leaves the weights of these neurons unable to update (a small numeric check follows the code below)
    — Output is not zero-centered: the gradients of the weights all share the same sign, so gradient updates tend to zig-zag in different directions
    — Saturates and kills gradients
    — Slow convergence
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # note: expects x to already be the output of sigmoid(x)
    return x * (1 - x)
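
As mentioned in the vanishing-gradient bullet above, the gradient shrinks towards zero as the input moves away from 0. A quick numeric check using the functions just defined (remember that sigmoid_derivative expects the sigmoid output, not the raw input):

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(x, sigmoid_derivative(s))   # 0.25 at x = 0, nearly 0 by x = 10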

3.2 Hyperbolic Tangent (Tanh)

  • Tanh(x) = 2Sigmoid(2x) - 1
    Also known as a scaled sigmoid function; it was introduced to fix the problem that sigmoid is not zero-centered
  • Range: Between -1 to 1 (zero-centered)
Tanh Function
  • Usually used in classification between two classes
  • Problem:
    Vanishing Gradient problem
def tanh(x):
    # equivalent to 2 * sigmoid(2 * x) - 1
    return np.tanh(x)

def dtanh(x):
    # note: expects x to already be the output of tanh(x)
    return 1. - x * x
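
A quick numeric check of the identity Tanh(x) = 2Sigmoid(2x) - 1 stated above, reusing the sigmoid defined earlier:

xs = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(xs), 2 * sigmoid(2 * xs) - 1))   # True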

3.3 Rectified Linear Units (ReLU)

  • Simple and efficient: it is reported to converge about 6 times faster than the tanh function
  • Range: [0, infinity)
ReLU Function
  • Avoids Vanishing Gradient problem
  • Can only be used within the hidden layers of a Neural Network model (its output is not scaled to a fixed range)
  • Problem:
    Some gradients are fragile during training and can die: a large weight update can push a neuron into a region where it never activates on any data point again (the "dying ReLU" problem).
def ReLU(x):
    return np.maximum(0.0, x)

def ReLU_derivative(x):
    # scalar version; the gradient at x = 0 is taken to be 0
    if x <= 0:
        return 0
    else:
        return 1
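
Because ReLU above works on numpy arrays while ReLU_derivative only handles scalars, one possible vectorized variant (an assumption on my part, not from the original text) is:

def ReLU_derivative_vec(x):
    # 1 where x > 0, 0 elsewhere (the gradient at x = 0 is taken to be 0)
    return np.where(x > 0, 1.0, 0.0)

print(ReLU_derivative_vec(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 1.]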

3.4 Leaky ReLU

  • Fixes the problem of dying neurons
  • Range: (-infinity, infinity) → extends the range of the ReLU function
ReLU vs Leaky ReLU
  • The slope below x=0 (usually called 'a') is in most cases set to 0.01
  • When a is chosen randomly rather than fixed, the function is called Randomized ReLU
def LReLU(x):
    a = 0.01
    if x >= 0:
        return x
    else:
        return a * x

def LReLU_derivative(x):
    a = 0.01
    if x >= 0:
        return 1
    else:
        return a
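
For completeness, a vectorized sketch of the same idea (keeping a = 0.01, the common default mentioned above):

def LReLU_vec(x, a=0.01):
    return np.where(x >= 0, x, a * x)

def LReLU_derivative_vec(x, a=0.01):
    return np.where(x >= 0, 1.0, a)

print(LReLU_vec(np.array([-3.0, 2.0])))   # [-0.03  2.  ]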

3.5 Exponential Linear Unit (ELU)

  • Acts like ReLU when x is positive, but for negative values it smoothly approaches a fixed lower bound of -1 (with a = 1)
  • Range: (-1, infinity)
  • Fast
ELU vs ReLU vs Leaky ReLU functions
def ELU(x):
    a = 1.0  # a = 1 gives the range (-1, infinity) stated above
    if x >= 0:
        return x
    else:
        return a * (np.exp(x) - 1)

def ELU_derivative(x):
    a = 1.0
    if x >= 0:
        return 1
    else:
        return ELU(x) + a  # equals a * exp(x) for x < 0
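
A quick check that ELU matches ReLU for positive inputs and approaches its lower bound of -1 (with a = 1) for very negative inputs:

for x in [-10.0, -1.0, 0.0, 2.0]:
    print(x, ELU(x))   # tends to -1 as x becomes very negative, equals x for x >= 0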

3.6 Self-Gated Activation Function (SWISH)

  • A recently proposed activation function by Google researchers
  • According to the paper, it performs better than ReLU
  • One-sided boundedness at zero: like ReLU, it is unbounded above but bounded below
Mathematical representation of Swish
def SWISH(x):
    b = 1.0
    return x * sigmoid(b * x)

def SWISH_derivative(x):
    # f'(x) = b * f(x) + sigmoid(b * x) * (1 - b * f(x))
    return b * SWISH(x) + sigmoid(b * x) * (1 - b * SWISH(x))
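
A small finite-difference sanity check (my own, not from the paper) that SWISH_derivative matches the slope of SWISH:

eps = 1e-6
for x in [-2.0, 0.5, 3.0]:
    numeric = (SWISH(x + eps) - SWISH(x - eps)) / (2 * eps)
    print(abs(numeric - SWISH_derivative(x)) < 1e-5)   # True for each x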

In Neural Networks, the rule of thumb is to use ReLU or Leaky ReLU, but there's no harm in exploring newer activation functions to see if they work better :)

Sources:

http://image-net.org/challenges/posters/JKU_EN_RGB_Schwarz_poster.pdf

https://arxiv.org/pdf/1710.05941.pdf
