## Deep Learning

# Neural Networks Part 2: Activation Functions And Differentiation

# Activation Functions

A neural network is a network of artificial neurons connected to each other in a specific way. The job of a neural network is to learn from given data. The prediction function that the network must learn can be highly non-linear, so the activation functions of the artificial neurons are chosen to capture this underlying non-linearity.

Activation functions (generally) have the functional form **f(u) = f(wᵀx + b)**, where **w** is the weight vector and **x** is a single training data vector. This can be treated as a linear combination of the inputs followed by a non-linear transformation. There is a multitude of options available for choosing the non-linear transformation. Some of the prominent ones are as follows.
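As a minimal sketch of this functional form (using NumPy; the weights, input, and bias values are made up purely for illustration), a single neuron first forms the linear combination u = wᵀx + b and then applies a non-linear f:

```python
import numpy as np

def neuron(x, w, b, f):
    """Compute f(w.T x + b): a linear combination of inputs followed by a non-linearity."""
    u = np.dot(w, x) + b          # pre-activation (linear combination)
    return f(u)                   # non-linear transformation

# Illustrative values only
x = np.array([0.5, -1.2, 3.0])    # single training data vector
w = np.array([0.1, 0.4, -0.2])    # weight vector
b = 0.05                          # bias

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
print(neuron(x, w, b, sigmoid))
```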

## 1. Sigmoid Activation Function

A sigmoid function is **f(u) = 1 / (1 + e⁻ᵘ)**. It takes a real-valued number and “squeezes” it into the range between 0 and 1: large negative numbers become ≈ 0 and large positive numbers become ≈ 1.

**Pros:**

For binary classification problems it is used as the activation of the output layer of a neural network.

**Cons**:

- **Can saturate and kill gradients:** When the neuron’s activation saturates at 0 or 1, the gradient becomes almost zero. This creates difficulties in learning (see the sketch after this list).
- **Outputs are not zero-centered:** Since outputs are in the range 0 to 1, neurons in the next layer receive data that is not zero-centered. Hence, the gradients of the weights **w** during backpropagation will be either all positive or all negative, which can cause undesirable zig-zagging dynamics in the gradient updates of the weights. When gradients are summed over all training data in a batch, this problem is less severe than “can saturate and kill gradients”.
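A small NumPy sketch (illustrative inputs only) of the sigmoid and its gradient makes the saturation problem concrete: for large |u| the gradient is nearly zero.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_grad(u):
    s = sigmoid(u)
    return s * (1.0 - s)          # derivative of the sigmoid

for u in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(u, sigmoid(u), sigmoid_grad(u))
# At u = ±10 the gradient is ~4.5e-5: the neuron is saturated,
# so almost no gradient flows back during backpropagation.
```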

## 2. Tanh Activation Function

A **tanh** function is **f(u) = sinh(u) / cosh(u)**. It takes a real-valued number and “squeezes” it into the range between -1 and 1: large negative numbers become ≈ −1 and large positive numbers become ≈ 1.

**Pros**:

It is preferred over sigmoid because its outputs are zero-centered.

**Cons**:

**Can saturate and kill gradients:** When the neuron’s activation saturates at 1 or -1, the gradient becomes almost zero. This creates difficulties in learning.
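A quick comparison sketch (not from the original post; values are illustrative) shows why tanh outputs are zero-centered while sigmoid outputs are not, and that tanh still saturates:

```python
import numpy as np

u = np.linspace(-5, 5, 11)
print(np.tanh(u).mean())                  # ~0: tanh outputs are zero-centered
print((1.0 / (1.0 + np.exp(-u))).mean())  # ~0.5: sigmoid outputs are not

tanh_grad = 1.0 - np.tanh(u) ** 2         # derivative of tanh
print(tanh_grad[[0, -1]])                 # ~1.8e-4 at u = ±5: still saturates
```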

## 3. ReLU Activation Function

The Rectified Linear Unit, **ReLU**, is *f(u) = max(0, u)*.

**Pros:**

- Greatly increases training speed compared to tanh and sigmoid
- Less expensive computation compared to tanh and sigmoid
- Reduces the likelihood of the gradient vanishing, since when *u > 0* the gradient has a constant value
- **Sparsity:** When more inputs give *u ≤ 0*, the activations *f(u)* become more sparse

**Cons:**

- Tends to blow up the activation (there is no mechanism to constrain the output of the neuron, as *u* itself is the output).
- **Closed ReLU or Dead ReLU:** If the inputs tend to make *u ≤ 0*, then most of the neurons will always receive zero gradient updates and hence are closed or dead (see the sketch below).
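A minimal sketch of ReLU and its gradient (illustrative inputs only) shows both the constant gradient for u > 0 and the “dead” region where the gradient is exactly zero:

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def relu_grad(u):
    return (u > 0).astype(float)   # 1 for u > 0, 0 otherwise

u = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(u))        # [0.  0.  0.  0.5 3. ]
print(relu_grad(u))   # [0. 0. 0. 1. 1.] -> if u <= 0 for all data, the neuron never updates ("dead ReLU")
```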

## 4. Leaky ReLU:

It solves the dead ReLU problem by giving negative inputs a small, non-zero slope: *f(u) = u* for *u > 0* and *f(u) = 0.01u* for *u ≤ 0*, where **0.01** is the coefficient of leakage.

## 5. Parameterized ReLU Or PReLU:

Parameterizes the coefficient of leakage *α* in Leaky ReLU, so that *α* is learned during training rather than being fixed at 0.01.
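A small sketch covering both variants (the slope 0.01 is the leakage coefficient mentioned above; the α passed to the PReLU version below is an illustrative value, since in practice it would be optimized along with the other weights):

```python
import numpy as np

def leaky_relu(u, alpha=0.01):
    """Leaky ReLU: u for u > 0, alpha * u otherwise (alpha fixed at 0.01)."""
    return np.where(u > 0, u, alpha * u)

def prelu(u, alpha):
    """PReLU: same form, but alpha is a parameter learned during training."""
    return np.where(u > 0, u, alpha * u)

u = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(leaky_relu(u))         # negative inputs keep a small non-zero output and gradient
print(prelu(u, alpha=0.25))  # illustrative alpha
```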

## 6. Maxout

A generalization of ReLU, Leaky ReLU and PReLU. It does not have the functional form **f(u) = f(wᵀx + b)**; instead it computes the function *max(w′ᵀx + b′, wᵀx + b)*.

**Pros:**

Maxout has the pros of ReLU but doesn’t have the dead ReLU issue.

**Cons:**

It has twice the number of weight parameters to learn, **w′** and **w**.
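A brief sketch of a maxout unit with its two weight sets (illustrative values only; note that ReLU is recovered as the special case w′ = 0, b′ = 0):

```python
import numpy as np

def maxout(x, w1, b1, w2, b2):
    """Maxout unit: the max of two affine functions of the input."""
    return max(np.dot(w1, x) + b1, np.dot(w2, x) + b2)

x = np.array([0.5, -1.2, 3.0])
w1, b1 = np.array([0.1, 0.4, -0.2]), 0.05   # w, b
w2, b2 = np.zeros(3), 0.0                   # w' = 0, b' = 0 recovers ReLU
print(maxout(x, w1, b1, w2, b2))            # == max(0, w.x + b)
```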

## 7. Softmax

A softmax function is a generalization of the sigmoid function. Sigmoid is used for 2-class (binary) classification, whereas softmax is used for multi-class classification. For example, softmax turns the logits [2.0, 1.0, 0.1] into the probabilities [0.7, 0.2, 0.1].
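A minimal sketch of the softmax computation on the logits mentioned above (subtracting the maximum logit is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(logits):
    """Exponentiate and normalize so the outputs sum to 1."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))    # ~[0.66, 0.24, 0.10], i.e. roughly [0.7, 0.2, 0.1]
```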

# What Activation Function Should I Use?

- For the output layer, use sigmoid or softmax in classification tasks
- For the output layer, use no activation or the Purelin function *f(u) = u* in regression tasks
- Use the ReLU non-linearity if you carefully set learning rates and monitor the fraction of “dead ReLUs” in the network
- Else try Leaky ReLU or Maxout
- Or try tanh, although it might work worse than ReLU
- Avoid sigmoid in hidden layers

# Differentiation:

## Basic Formulas:

Given that f(x) and g(x) are differentiable functions (the derivative exists) and c and n are any real numbers:
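For reference, the standard rules are:

```latex
\frac{d}{dx}(c) = 0 \qquad \frac{d}{dx}\big(c\,f(x)\big) = c\,f'(x) \qquad \frac{d}{dx}\big(x^{n}\big) = n\,x^{n-1}

(f \pm g)' = f' \pm g' \qquad (f\,g)' = f'g + f\,g' \qquad \left(\frac{f}{g}\right)' = \frac{f'g - f\,g'}{g^{2}}

\big(f(g(x))\big)' = f'\big(g(x)\big)\,g'(x)
```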

## Sigmoid Function:
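Applying the chain and quotient rules to the sigmoid gives a derivative that can be expressed in terms of the function itself:

```latex
f(u) = \frac{1}{1 + e^{-u}} \qquad
f'(u) = \frac{e^{-u}}{\left(1 + e^{-u}\right)^{2}} = f(u)\,\big(1 - f(u)\big)
```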

## Tanh Function:
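Similarly, differentiating tanh with the quotient rule gives:

```latex
f(u) = \tanh(u) = \frac{\sinh(u)}{\cosh(u)} \qquad
f'(u) = \frac{\cosh^{2}(u) - \sinh^{2}(u)}{\cosh^{2}(u)} = 1 - \tanh^{2}(u)
```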

# References:

- TensorFlow Playground
- http://cs231n.github.io
- Udacity Deep Learning Slide on Softmax