# Activation Functions

### What are activation functions?

An activation function, also known as a transfer function, maps a node's inputs to its output in a certain fashion.

#### They are used to impart non-linearity.

There are many activation functions used in machine learning; the commonly used ones are listed below:

---

Identity or Linear

→ F(x) = x

→ You get the exact same curve.

→ Each input maps to itself, unchanged.
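As a minimal sketch, the identity activation in Python is just a function that returns its input untouched:

```python
def identity(x):
    """Identity (linear) activation: output equals input, F(x) = x."""
    return x

print(identity(3.5))   # 3.5
print(identity(-2))    # -2
```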

Binary Step

→ Outputs one of two values depending on a threshold, so it is very useful in binary classifiers.
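A minimal sketch of the binary step function, assuming the conventional threshold at 0:

```python
def binary_step(x):
    """Binary step: 1 for non-negative inputs, 0 otherwise (threshold at 0 assumed)."""
    return 1 if x >= 0 else 0

print(binary_step(-2), binary_step(0.7))   # 0 1
```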

Logistic or Sigmoid

→ Maps any real-valued input to an output in the range (0, 1).

→ Useful in neural networks.
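A minimal sigmoid implementation using only the standard library:

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))    # 0.5, the midpoint of the output range
```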

Tanh

→ Maps input to an output in the range (-1, 1).

→ Similar to the sigmoid function, except that it maps output to (-1, 1), whereas the sigmoid maps output to (0, 1).
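One way to see the relationship: tanh is a rescaled, shifted sigmoid, tanh(x) = 2 * sigmoid(2x) - 1. A quick numerical check:

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

x = 1.2
# tanh is a rescaled sigmoid: both expressions below give the same value.
print(math.tanh(x))
print(2 * sigmoid(2 * x) - 1)
```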

ArcTan

→ Maps input to an output in the range (-π/2, π/2).

→ Similar in shape to the sigmoid and tanh functions.
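A sketch using the standard library's `math.atan`; note that even very large inputs stay strictly inside (-π/2, π/2):

```python
import math

def arctan_activation(x):
    """Arctangent activation: output in the open interval (-pi/2, pi/2)."""
    return math.atan(x)

print(arctan_activation(1000.0))   # close to pi/2, but never reaches it
```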

Rectified Linear Unit (ReLU)

→ It zeroes out the negative part of the input: F(x) = max(0, x).
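A minimal ReLU, written directly as F(x) = max(0, x):

```python
def relu(x):
    """ReLU: zero for negative inputs, identity for non-negative inputs."""
    return max(0.0, x)

print(relu(-3.0), relu(2.5))   # 0.0 2.5
```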

Leaky ReLU

→ The only difference from ReLU is that it does not completely zero out the negative part; it just lowers its magnitude by multiplying negative inputs by a small slope.
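A sketch of Leaky ReLU; the negative slope `alpha` is a free parameter, and the 0.01 used below is just a common default, not something prescribed by the article:

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: scales negative inputs by a small slope alpha instead of zeroing them."""
    return x if x >= 0 else alpha * x

print(leaky_relu(-3.0), leaky_relu(2.5))   # negative input scaled down, positive unchanged
```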

Softmax

→ The softmax function converts a vector of scores into a probability distribution when you have more than one output.

→ Useful for finding the most probable output relative to the others.
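A minimal softmax sketch; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not part of the mathematical definition:

```python
import math

def softmax(scores):
    """Softmax: turns a list of real-valued scores into a probability distribution.

    Subtracting the max score first avoids overflow in exp() for large scores.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # probabilities summing to 1; the largest score gets the largest share
print(sum(probs))
```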

---

### Desirable properties of activation functions

1. Non-linearity

The purpose of the activation function is to introduce non-linearity into the network, which in turn allows you to model a response variable (a.k.a. target variable, class label, or score) that varies non-linearly with its explanatory variables.

Non-linear means that the output cannot be reproduced from a linear combination of the inputs.

Another way to think of it: without a non-linear activation function, a neural network, no matter how many layers it had, would behave just like a single-layer perceptron, because composing linear layers gives you just another linear function (see the definition just above).
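A tiny numerical illustration of this collapse, using scalar "layers" for simplicity (a sketch, not a real network):

```python
# Two linear "layers" without activations: y = w2 * (w1 * x).
# For scalars this is just (w2 * w1) * x, i.e. a single linear map,
# no matter how many such layers we stack.
w1, w2 = 0.5, -2.0
x = 3.0

two_layer = w2 * (w1 * x)
one_layer = (w2 * w1) * x
print(two_layer, one_layer)   # identical: -3.0 -3.0
```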

2. Continuously differentiable

This property is necessary for enabling gradient-based optimization methods.

The binary step activation function is not differentiable at 0, and its derivative is 0 for all other values, so gradient-based methods can make no progress with it.
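This can be checked numerically with a central-difference gradient estimate (a sketch for illustration, not how real frameworks compute gradients):

```python
import math

def numerical_grad(f, x, h=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

step = lambda x: 1.0 if x >= 0 else 0.0
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

print(numerical_grad(step, 2.0))     # 0.0: no gradient signal away from 0
print(numerical_grad(sigmoid, 0.0))  # ≈ 0.25 (sigmoid's steepest slope)
```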

3. Range

When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights.

When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller learning rates are typically necessary.

4. Monotonic

When the activation function is monotonic, the error surface associated with a single-layer model is guaranteed to be convex.

5. Approximates identity near the origin

When activation functions have this property, the neural network will learn efficiently when its weights are initialized with small random values.

When the activation function does not approximate identity near the origin, special care must be used when initializing the weights.
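For example, tanh has this property: tanh(x) ≈ x near the origin, and the approximation gets better as x shrinks, as a quick check shows:

```python
import math

# tanh approximates the identity near 0: the gap |tanh(x) - x| shrinks
# rapidly as x approaches the origin.
for x in (0.5, 0.1, 0.01):
    print(x, math.tanh(x))
```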

---

### Which activation function should you use for your model?

→ There is no universal answer: try out different activation functions and choose the one that fits your model best.

---