Activation Functions, Optimization Techniques, and Loss Functions

Afaf Athar
Published in Analytics Vidhya
Aug 22, 2020

Activation Functions:

A significant piece of a neural network, activation functions are mathematical equations that determine the output of the network. The function is attached to each neuron and decides whether that neuron should be activated ("fired") or not, based on whether the neuron's input is relevant for the model's prediction. Activation functions also help normalize the output of each neuron to a range between 0 and 1 or between -1 and 1.

Increasingly, neural networks use both linear and non-linear activation functions, which enable the network to learn complex data, compute and learn almost any function, and give accurate predictions.

Linear Activation Functions:

Step-Up: Activation functions are the active units of neural networks; they compute the net output of a neural node. The Heaviside step function is one of the most common activation functions in neural networks. It produces binary output, which is why it is also called a binary step function.

The function produces 1 (or true) when the input passes a threshold, and 0 (or false) when it doesn't. This makes step functions very useful for binary classification. Every logic function can be implemented by a neural network, so the step function is commonly used in primitive neural networks without a hidden layer, otherwise known as single-layer perceptrons.

  • The simplest kind of activation function
  • Consider a threshold value: if the net input, say x, is greater than the threshold, the neuron is activated
Step-Up Function

This kind of network can classify linearly separable problems, such as the AND gate and the OR gate. In other words, the two classes (0 and 1) can be separated by a single straight line, as illustrated below. Assume the threshold value is 0. Then the following single-layer neural network models will satisfy these logic functions.

AND-GATE and OR-GATE
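To make this concrete, here is a minimal sketch (mine, not from the original article) of a single-layer perceptron with a step activation implementing the AND and OR gates. The weights and biases are illustrative choices consistent with a threshold of 0:

```python
import numpy as np

def step(x):
    """Heaviside step activation: 1 if x > 0, else 0."""
    return (x > 0).astype(int)

def perceptron(inputs, weights, bias):
    """Single-layer perceptron: weighted sum followed by the step activation."""
    return step(inputs @ weights + bias)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Illustrative weights and biases (my choices, not from the article):
print(perceptron(X, np.array([1, 1]), -1.5))  # AND gate -> [0 0 0 1]
print(perceptron(X, np.array([1, 1]), -0.5))  # OR gate  -> [0 1 1 1]
```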

However, a linear activation function has two major problems:

  1. It is not possible to use backpropagation (gradient descent) to train the model: the derivative of the function is a constant and has no relation to the input X, so there is no way to go back and understand which weights in the input neurons can give a better prediction.
  2. All layers of the neural network collapse into one: with linear activation functions, no matter how many layers the network has, the last layer is a linear function of the first (because a linear combination of linear functions is still a linear function). So a linear activation function turns the neural network into just one layer.

Non-Linear Activation Functions:

Modern neural network models use non-linear activation functions. These allow the model to create complex mappings between the network's inputs and outputs, which is essential for learning and modeling complex data such as images, video, audio, and data sets that are non-linear or high-dimensional.

Almost any process imaginable can be represented as a functional computation in a neural network, provided the activation function is non-linear. Non-linear functions address the problems of a linear activation function: they allow backpropagation, because their derivative depends on the inputs, and they allow the "stacking" of multiple layers of neurons to create a deep neural network.

Multiple hidden layers of neurons are needed to learn complex data sets with a high level of accuracy. The main types of non-linear activation functions are:

  1. SIGMOID:

The fundamental reason we use the sigmoid function is that its output exists between 0 and 1. It is therefore especially suited to models where we need to predict a probability as the output. Since the probability of anything exists only in the range 0 to 1, sigmoid is the right choice.

Sigmoid-Function
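As a quick illustration (a sketch of my own, using the standard definition σ(x) = 1 / (1 + e^(−x))):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # approx [0.119 0.5 0.881]
```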

Advantages:

  • Smooth gradient, preventing "jumps" in output values.
  • Output values bound between 0 and 1, normalizing the output of each neuron.
  • Clear predictions: for X above 2 or below -2, the Y value (the prediction) is brought to the edge of the curve, very close to 1 or 0. This enables clear predictions.

Disadvantages:

  • Vanishing gradient: for very high or very low values of X, there is almost no change in the prediction, causing a vanishing gradient problem. This can result in the network refusing to learn further, or being too slow to reach an accurate prediction.
  • Outputs are not zero-centered, and the function is computationally expensive.
Sigmoid-Curve

2. TANH:

The tanh function is non-linear, so we can stack layers. It is bound to the range (-1, 1). The gradient is stronger for tanh than for sigmoid (its derivatives are steeper). Like sigmoid, tanh also has a vanishing gradient problem.
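A minimal sketch (mine, using the standard definition tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))):

```python
import numpy as np

def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x)^2; its maximum is 1 at x = 0,
    steeper than sigmoid's maximum of 0.25."""
    return 1.0 - np.tanh(x) ** 2

print(np.tanh(np.array([-2.0, 0.0, 2.0])))          # approx [-0.964 0. 0.964]
print(tanh_derivative(np.array([-2.0, 0.0, 2.0])))  # approx [0.071 1. 0.071]
```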

Advantages

  • Zero centered making it easier to model inputs that have strongly negative, neutral, and strongly positive values.
  • Otherwise like the Sigmoid function.

Disadvantages

  • Like the Sigmoid function
Tanh-Function

  3. Rectified Linear Unit (ReLU):

In mathematics, a function f: A→B is considered linear whenever, for every x and y in the domain A, it has the property f(x) + f(y) = f(x + y). By definition, ReLU is max(0, x). Accordingly, if we restrict the domain to (−∞, 0] or [0, ∞), the function is linear on each piece. However, it is easy to see that f(−1) + f(1) ≠ f(0). Hence, by definition, ReLU is not linear.

ReLU-Function
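A short sketch (mine) of ReLU and the linearity check described above:

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x) elementwise."""
    return np.maximum(0, x)

# Linearity would require f(x) + f(y) == f(x + y) for all x and y.
# ReLU fails this across the origin, so it is not linear:
print(relu(-1) + relu(1))  # 1
print(relu(-1 + 1))        # 0, so f(-1) + f(1) != f(0)
```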

In practice, ReLU is so close to linear that this often confuses people, who wonder how it can be used as a universal approximator. In my experience, the best way to think about ReLUs is like Riemann sums: just as you can approximate any continuous function with lots of little rectangles, ReLU activations can produce lots of little rectangles. In practice, ReLU can build up rather complicated shapes and approximate many complicated regions.

I also want to clarify another point. As has been pointed out before, neurons don't die under sigmoid; rather, their gradients vanish. The reason is that the derivative of the sigmoid function has a maximum of 0.25. Hence, after many layers, you end up multiplying these gradients together, and the product of very small numbers below 1 goes to zero quickly.

Consequently, if you're building a deep learning network with a lot of layers, your sigmoid gradients will go stale rather quickly and become more or less useless. The key takeaway is that the vanishing comes from multiplying the gradients together, not from the gradients themselves.

ReLU-Curve

Advantages

  • Computationally efficient: allows the network to converge quickly
  • Non-linear: although it looks like a linear function, ReLU has a derivative and allows backpropagation

Disadvantages

  • The dying ReLU problem: when inputs approach zero or are negative, the gradient of the function becomes zero, so the network cannot perform backpropagation and cannot learn.

4. LEAKY ReLU:

Leaky-ReLU-Function

Leaky ReLU is one attempt to fix the "dying ReLU" problem. Instead of the function being zero when x < 0, a Leaky ReLU has a small slope in the negative region (of 0.01, or so). That is, the function computes f(x) = 1(x<0)·(αx) + 1(x≥0)·(x), where α is a small constant.
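A minimal sketch (mine), assuming the common default α = 0.01:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x >= 0, alpha * x for x < 0."""
    return np.where(x >= 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.0, 2.0])))  # [-0.02  0.  2.]
```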

Leaky-ReLU-Curve

Advantages

  • Prevents the dying ReLU problem: this variation of ReLU has a small slope in the negative region, so it enables backpropagation even for negative input values
  • Otherwise like ReLU

Disadvantages

  • Results are not consistent: Leaky ReLU doesn't give consistent predictions for negative input values.

Optimization Functions

Gradient Descent Update Rule: Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model; parameters refer to coefficients in linear regression and weights in neural networks.

Walking-Down-Hill

Starting at the top of the mountain, we take our first step downhill in the direction specified by the negative gradient. Next, we recalculate the negative gradient (passing in the coordinates of our new point) and take another step in the direction it specifies. We continue this process iteratively until we get to the bottom of our graph, or to a point where we can no longer move downhill: a local minimum.

Gradient-Update-Rule_Function

Learning rate: The size of these steps is known as the learning rate. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point, since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient, since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take a long time to get to the bottom.

Cost function: A loss function tells us "how good" our model is at making predictions for a given set of parameters. The cost function has its own curve and its own gradients. The slope of this curve tells us how to update our parameters to make the model more accurate.

Step-by-step: Now let's run gradient descent using our new cost function. There are two parameters in our cost function that we can control: m (weight) and b (bias). Since we need to consider the impact each one has on the final prediction, we use partial derivatives. We calculate the partial derivatives of the cost function with respect to each parameter and store the results in a gradient.

Math

Given the cost function:

Cost-Function

To solve for the gradient, we iterate through our data points using our new m and b values and compute the partial derivatives. This new gradient tells us the slope of our cost function at our current position (current parameter values) and the direction we should move to update our parameters. The size of our update is controlled by the learning rate.
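A compact sketch (mine, assuming the cost function is the mean squared error of a line y = m·x + b, consistent with the m and b parameters above) of one way to implement this loop:

```python
import numpy as np

def gradient_step(m, b, x, y, lr):
    """One gradient descent update for the MSE cost mean((y - (m*x + b))^2)."""
    y_hat = m * x + b
    dm = -2.0 * np.mean(x * (y - y_hat))  # partial derivative w.r.t. m
    db = -2.0 * np.mean(y - y_hat)        # partial derivative w.r.t. b
    return m - lr * dm, b - lr * db

# Toy data (illustrative): points near y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

m, b = 0.0, 0.0
for _ in range(1000):
    m, b = gradient_step(m, b, x, y, lr=0.05)
print(m, b)  # approaches roughly m = 1.94, b = 1.09
```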

Types Of Optimization Techniques:

  1. Momentum Based GD:
  • Momentum-based Gradient Descent update rule: one of the main issues with gradient descent is that it takes a lot of time to navigate regions with a gentle slope, because the gradient is very small in those regions
  • An intuitive solution: if the algorithm is repeatedly being asked to move in the same direction, it should probably gain some confidence and start taking bigger steps in that direction
  • Now we have to convert this intuition into a set of mathematical equations. Starting from the gradient descent update rule,

ωt+1 = ωt − η∇ωt

the momentum update becomes:

→ υt = γ·υt−1 + η∇ωt

→ ωt+1 = ωt − υt

→ ωt+1 = ωt − γ·υt−1 − η∇ωt

→ If γ·υt−1 = 0, this is the same as the regular gradient descent update rule

→ To put it briefly, υt−1 is the history of the movement in a direction, and γ ranges from 0 to 1
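A minimal sketch (mine) of this update rule, with illustrative values η = 0.1 and γ = 0.9:

```python
def momentum_update(w, v, grad, lr=0.1, gamma=0.9):
    """Momentum GD: accumulate a velocity from past gradients, then step with it."""
    v = gamma * v + lr * grad  # v_t = gamma * v_{t-1} + eta * grad
    w = w - v                  # w_{t+1} = w_t - v_t
    return w, v

# Example: minimizing f(w) = w^2, whose gradient is 2w (illustrative)
w, v = 5.0, 0.0
for _ in range(100):
    w, v = momentum_update(w, v, grad=2.0 * w)
print(w)  # oscillates past the minimum at 0 (u-turns), then settles near it
```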

A few points to note:

a. Momentum based gradient descent oscillates in and out of the minima valley (u-turns)

b. Despite these u-turns, it still converges faster than vanilla gradient descent

Now, we will look at reducing the oscillations in Momentum based GD

2. Nesterov Accelerated Gradient Descent (NAG):

In momentum-based gradient descent, the movement occurs in two steps: the first with the history term γ·υt−1, and the second with the gradient term η∇ωt. The idea is to first move with the history term, and then calculate the second step from where we are located after the first step (ωtemp).

Using the above intuition, the Nesterov Accelerated Gradient Descent solves the problem of overshooting and multiple oscillations

→ ωtemp = ωt − γ·υt−1 (compute ωtemp based on the movement with the history term)

→ ωt+1 = ωtemp − η∇ωtemp (move further in the direction of the derivative at ωtemp)

→ υt = γ·υt−1 + η∇ωtemp (update the history with the movement due to the derivative at ωtemp)
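A sketch (mine) of the look-ahead step, reusing the toy f(w) = w² example:

```python
def nag_update(w, v, grad_fn, lr=0.1, gamma=0.9):
    """Nesterov accelerated GD: evaluate the gradient at the look-ahead point."""
    w_temp = w - gamma * v  # first move with the history term
    g = grad_fn(w_temp)     # gradient at where the history takes us
    v = gamma * v + lr * g  # update history with the movement at w_temp
    return w_temp - lr * g, v  # w_{t+1} = w_temp - eta * grad(w_temp)

w, v = 5.0, 0.0
for _ in range(100):
    w, v = nag_update(w, v, grad_fn=lambda w: 2.0 * w)
print(w)  # converges to 0 with fewer and smaller u-turns than plain momentum
```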

3. Adaptive Gradient (Adagrad):

Intuition: decay the learning rate for each parameter in proportion to its update history (fewer updates, less decay). Adagrad (Adaptive Gradient) is an algorithm that satisfies this intuition:

→ υt = υt−1 + (∇ωt)², squared to ignore the sign of the derivative

→ This accumulator grows with the gradient of each iteration, i.e. whenever the value of the feature is non-zero

→ In the case of dense features, it grows on most iterations, resulting in a larger υt value

→ For sparse features, it does not grow much, since the gradient is often 0, leading to a lower υt value

→ The denominator term √υt serves to regulate the learning rate η, giving the update ωt+1 = ωt − (η / (√υt + ε))·∇ωt. For dense features, υt is larger, so √υt becomes larger, thereby lowering the effective η

→ For sparse features, υt is smaller, so √υt becomes smaller and lowers η to a smaller extent. The ε term is added to the denominator, √υt + ε, to prevent a divide-by-zero error in the case of very sparse features, i.e. where all the data points yield a zero gradient up to the measured instance

Advantage: parameters corresponding to sparse features get better updates.

Disadvantage: the learning rate decays very aggressively as the denominator grows (not good for parameters corresponding to dense features).
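A sketch (mine) of the Adagrad update, with illustrative η and ε values:

```python
import numpy as np

def adagrad_update(w, v, grad, lr=0.1, eps=1e-8):
    """Adagrad: accumulate squared gradients and divide the learning rate
    by their square root, so frequently-updated parameters slow down."""
    v = v + grad ** 2  # v_t = v_{t-1} + grad^2
    w = w - (lr / (np.sqrt(v) + eps)) * grad
    return w, v
```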

4. RMSProp:

RMSProp keeps a history of the gradients, multiplied by a decay ratio. Adagrad gets stuck when it is close to convergence (no longer able to move in the vertical direction because of the decayed learning rate); RMSProp overcomes this problem by being less aggressive with the decay:

υt = β·υt−1 + (1 − β)(∇ωt)²
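Changing only the accumulator line from the Adagrad sketch gives an RMSProp sketch (mine, with an illustrative β = 0.9):

```python
import numpy as np

def rmsprop_update(w, v, grad, lr=0.01, beta=0.9, eps=1e-8):
    """RMSProp: exponentially decayed average of squared gradients,
    so the effective learning rate can recover instead of only shrinking."""
    v = beta * v + (1.0 - beta) * grad ** 2
    w = w - (lr / (np.sqrt(v) + eps)) * grad
    return w, v
```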

5. Adaptive Moment Estimation (Adam):

→ Adaptive Moment Estimation (Adam) combines ideas from both RMSProp and Momentum. It computes adaptive learning rates for each parameter and works as follows.

Adam-Rule

It computes the exponentially weighted average of past gradients (as in momentum). It also computes the exponentially weighted average of the squares of past gradients (as in RMSProp). These averages have a bias towards zero, so a bias correction is applied, and the parameters are updated using the corrected averages.
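A sketch (mine) of the Adam update, assuming the commonly used defaults β₁ = 0.9 and β₂ = 0.999:

```python
import numpy as np

def adam_update(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style first moment plus RMSProp-style second moment,
    each corrected for its bias towards zero (t is the step count, from 1)."""
    m = beta1 * m + (1.0 - beta1) * grad       # average of past gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # average of past squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```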

Loss Function:

Machines learn by means of a loss function, which is a method of evaluating how well a specific algorithm models the given data. If predictions deviate too much from the actual results, the loss function produces a very large number. Gradually, with the help of some optimization function, the loss function learns to reduce the error in prediction. This section goes through several loss functions and their applications in the domain of machine/deep learning.

Broadly, loss functions can be classified into two major categories depending on the type of learning task we are dealing with: regression losses and classification losses. In classification, we are trying to predict an output from a set of finite categorical values, e.g. given a large data set of images of handwritten digits, categorizing them into one of the digits 0-9. Regression, on the other hand, deals with predicting a continuous value, e.g. given the floor area, the number of rooms, and the size of the rooms, predict the price of the house.

NOTE 
n - Number of training examples.
i - ith training example in a data set.
y(i) - Ground truth label for ith training example.
y_hat(i) - Prediction for ith training example.

Regression:

  1. Mean Square Error/Quadratic Loss/L2 Loss:
  • Mean Square Error (MSE) is the most commonly used regression loss function. MSE is the mean of the squared distances between our target variable and the predicted values.
Mean Square Error
  • Below is a plot of an MSE function where the true target value is 100, and the predicted values range between -10,000 and 10,000. The MSE loss (Y-axis) reaches its minimum value at prediction (X-axis) = 100. The range is 0 to ∞.
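A direct sketch (mine) of MSE using the notation above:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: mean of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

print(mse(np.array([100.0]), np.array([100.0])))  # 0.0, the minimum
print(mse(np.array([100.0]), np.array([110.0])))  # 100.0
```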

2. Mean Absolute Error/L1 Loss:

Mean Absolute Error (MAE) is another loss function used for regression models. MAE is the mean of the absolute differences between our target and predicted variables, so it measures the average magnitude of errors in a set of predictions without considering their directions. (If we considered directions as well, that would be the Mean Bias Error (MBE), which is the mean of the residuals/errors.) The range is also 0 to ∞.

Mean Absolute Error
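And the corresponding MAE sketch (mine):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    return np.mean(np.abs(y_true - y_pred))

print(mae(np.array([3.0, -0.5, 2.0]), np.array([2.5, 0.0, 2.0])))  # approx 0.333
```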

3. Huber Loss/Smooth Mean Absolute Error:

Huber loss is less sensitive to outliers in the data than the squared error loss. It is also differentiable at 0. It is essentially an absolute error that becomes quadratic when the error is small. How small the error has to be for the quadratic regime to apply depends on a hyperparameter, 𝛿 (delta), which can be tuned. Huber loss approaches MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large values of 𝛿).

Huber-Loss
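A sketch (mine) of the piecewise definition, quadratic for errors within 𝛿 and linear beyond:

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: 0.5*e^2 where |e| <= delta, else delta*(|e| - 0.5*delta)."""
    e = y_true - y_pred
    quadratic = 0.5 * e ** 2
    linear = delta * (np.abs(e) - 0.5 * delta)
    return np.mean(np.where(np.abs(e) <= delta, quadratic, linear))
```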

4. Log Cosh Loss

Log-cosh is another function used in regression tasks that’s smoother than L2. Log-cosh is the logarithm of the hyperbolic cosine of the prediction error.

Log-Cosh-Loss
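A one-line sketch (mine) of the definition:

```python
import numpy as np

def log_cosh(y_true, y_pred):
    """Log-cosh loss: log(cosh(error)); behaves like e^2/2 for small errors
    and like |e| - log(2) for large ones."""
    return np.mean(np.log(np.cosh(y_pred - y_true)))
```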

CLASSIFICATION:

  1. Cross-Entropy loss: Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So, predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
  2. Categorical Cross-Entropy loss: Categorical cross-entropy is a loss function that is used in multiclass classification tasks. These are tasks where an example can only belong to one out of many possible categories, and the model must decide which one. Formally, it is designed to quantify the difference between two probability distributions.
Categorical-Cross-Entropy-loss

3. Binary Cross-Entropy Loss/ Log Loss: Binary cross-entropy is a loss function that is used in binary classification tasks. These are tasks that answer a question with only two choices (yes or no, A or B, 0 or 1, left or right). Several independent such questions can be answered at the same time, as in multi-label classification or binary image segmentation.
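A sketch (mine) of binary cross-entropy; the clipping is an implementation detail I've added to avoid log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """BCE: -mean(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# The example from above: predicting 0.012 when the true label is 1
print(binary_cross_entropy(np.array([1.0]), np.array([0.012])))  # approx 4.42
print(binary_cross_entropy(np.array([1.0]), np.array([0.99])))   # approx 0.01
```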

Hope this helps :)

Follow if you like my posts. Please leave comments for any clarifications or questions.

Additional Resources I found Useful:
1. https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23
2. https://ieeexplore.ieee.org/document/8407425
3. https://www.researchgate.net/publication/228813985_Performance_Analysis_of_Various_Activation_Functions_in_Generalized_MLP_Architectures_of_Neural_Networks

Connect via LinkedIn https://www.linkedin.com/in/afaf-athar-183621105/

Happy learning 😃
