Derivatives represent the slope of a curve; they can be used to find the maxima and minima of functions, which occur where the slope is zero. The derivative also measures the steepness of the graph of a function at a particular point on the graph. In computational networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard computer chip circuit can be seen as a digital network of activation functions that are either “ON” (1) or “OFF” (0), depending on the input.
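As a quick illustration with a standard calculus example, consider the parabola f(x) = x²:

$$
f(x) = x^{2}, \qquad f'(x) = 2x, \qquad f'(x) = 0 \;\Longrightarrow\; x = 0,
$$

and x = 0 is indeed where the parabola reaches its minimum.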

When constructing Artificial Neural Network (ANN) models, one of the key considerations is selecting activation functions for the hidden and output layers that are differentiable. This is because backpropagation uses the gradient of the activation function to determine the parameter updates for each layer. The most commonly used activation functions in ANNs are the identity function, the logistic sigmoid function, and the hyperbolic tangent function.

The identity activation function simply maps the pre-activation to itself and can output values that range from negative infinity to positive infinity. But why use an identity activation function? It turns out that it is very useful. For instance, some of the traditional methods for forecasting include linear and nonlinear regression, ARMA and ARIMA time series forecasting, logistic regression, principal component analysis, discriminant analysis, and cluster analysis. These methods require statistical analysts to filter through tens or even hundreds of variables to determine which ones might be appropriate to use in one of these classical statistical techniques.

Theoretically, any differentiable function can be used as an activation function; however, the identity and sigmoid functions are the two most commonly applied. The identity activation function, also referred to as linear activation, is a flow-through mapping analogous to least-squares linear regression: h(xₗ) = xₗ.
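As a minimal sketch in Python (the function names here are illustrative, not from any particular library), the identity activation and its derivative look like this:

```python
import numpy as np

def identity(x):
    """Identity (linear) activation: passes the pre-activation through unchanged."""
    return x

def identity_deriv(x):
    """The derivative of the identity activation is 1 everywhere."""
    return np.ones_like(x)

z = np.array([-100.0, -1.0, 0.0, 2.5, 1e6])
print(identity(z))        # the values flow through unchanged -- the output range is unbounded
print(identity_deriv(z))  # all ones
```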

A multi-layer network that has nonlinear activation functions amongst the hidden units and an output layer that uses the identity activation function implements a powerful form of nonlinear regression. Specifically, the network can predict continuous target values using a linear combination of signals that arise from one or more layers of nonlinear transformations of the input.
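A small sketch of that idea, assuming a single tanh hidden layer, an identity output layer, and randomly initialized weights (the shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: 3 inputs -> 5 hidden units (tanh) -> 1 output (identity)
W1 = rng.normal(scale=0.1, size=(3, 5))   # input-to-hidden weights
b1 = np.zeros(5)
W2 = rng.normal(scale=0.1, size=(5, 1))   # hidden-to-output weights
b2 = np.zeros(1)

def forward(X):
    """Nonlinear hidden layer followed by an identity (linear) output layer."""
    hidden = np.tanh(X @ W1 + b1)   # nonlinear transformation of the input
    return hidden @ W2 + b2         # linear combination -> continuous prediction

X = rng.normal(size=(4, 3))         # four example input vectors
print(forward(X))                   # four continuous predictions, one per input vector
```

Because the output layer is linear, the predictions are an unbounded linear combination of the hidden-layer signals, which is exactly what continuous-valued regression requires.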

The Logistic Sigmoid Activation Function

Nonlinear activation functions are the most widely used activation functions. Nonlinearity is what makes a network suitable for binary classification problems; the logistic sigmoid, for instance, outputs values in the range (0, 1).

Therefore, it is especially useful for models where we have to predict a probability as an output. Since the probability of anything exists only in the range between 0 and 1, the sigmoid is a natural choice. It is also a differentiable function, which means we can find the slope of the sigmoid curve at any point by using the derivative.
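A short sketch of the sigmoid and its slope (the helper names are illustrative):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_slope(x):
    """Slope of the sigmoid at x, using sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

z = np.array([-4.0, 0.0, 4.0])
print(sigmoid(z))        # roughly [0.018, 0.5, 0.982] -- always between 0 and 1
print(sigmoid_slope(z))  # roughly [0.018, 0.25, 0.018] -- steepest at x = 0
```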

The logistic sigmoid is inspired somewhat by biological neurons and can be interpreted as the probability of an artificial neuron “firing” given its inputs. Moreover, the logistic sigmoid can also be derived as the maximum likelihood solution for logistic regression in statistics. Calculating the derivative of the logistic sigmoid function makes use of the quotient rule and a clever trick that both adds and subtracts a one in the numerator:
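Written out with the standard definition of the sigmoid, the derivation goes roughly like this:

$$
\begin{aligned}
\sigma(x) &= \frac{1}{1+e^{-x}} \\
\frac{d}{dx}\,\sigma(x) &= \frac{e^{-x}}{(1+e^{-x})^{2}} \quad \text{(quotient rule)} \\
&= \frac{(1+e^{-x})-1}{(1+e^{-x})^{2}} \quad \text{(add and subtract 1 in the numerator)} \\
&= \frac{1}{1+e^{-x}} - \frac{1}{(1+e^{-x})^{2}} \\
&= \sigma(x) - \sigma(x)^{2} \\
&= \sigma(x)\,\bigl(1-\sigma(x)\bigr)
\end{aligned}
$$

This compact result is what makes the sigmoid so convenient during backpropagation: the slope can be computed directly from the activation value itself.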

Deriving the Sigmoid Derivative for Neural Networks

Training a neural network refers to finding values for every cell in the weight matrices such that the squared differences between the observed and predicted data are minimized. In practice, the individual weights comprising the two weight matrices are adjusted by iteration and their initial values are often set randomly.

The question then becomes how the weights should be adjusted, i.e., in which direction (+/-) and by what amount. That is where the derivative comes in. A large value for the derivative results in a large adjustment to the corresponding weight. This makes sense because a large derivative means one is far from a minimum. Weights are adjusted in the direction of steepest descent on the error surface, which is defined by the total squared error between the observed and predicted values.
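Written out as formulas (with η denoting the learning rate, a symbol introduced here for illustration), the squared error and the weight update take roughly this form:

$$
E = \frac{1}{2}\sum_{k}\left(t_{k} - o_{k}\right)^{2},
\qquad
\Delta w_{ij} = -\,\eta\,\frac{\partial E}{\partial w_{ij}},
$$

where tₖ are the observed target values and oₖ the network's predictions. The negative sign moves each weight downhill on the error surface, and a large gradient produces a correspondingly large adjustment.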

After the error on each pattern is computed by subtracting the actual value of the output vector from the value predicted by the NN during that iteration, each weight in the weight matrices is adjusted in proportion to the calculated error gradient. Because the error calculation begins at the end of the NN and proceeds to the front, it is called back-propagation.
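A compact numerical sketch of that loop, assuming a small network with two weight matrices, a sigmoid hidden layer, an identity output layer, and a squared-error objective (the shapes, learning rate, and toy data are illustrative; biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: learn y = sum of the inputs (a continuous target)
X = rng.normal(size=(64, 3))
y = X.sum(axis=1, keepdims=True)

# Two weight matrices with random initial values
W1 = rng.normal(scale=0.5, size=(3, 8))
W2 = rng.normal(scale=0.5, size=(8, 1))
lr = 0.1                                   # learning rate

for step in range(500):
    # Forward pass: sigmoid hidden layer, identity output layer
    h = sigmoid(X @ W1)
    pred = h @ W2

    # Error on this iteration: predicted minus observed
    err = pred - y

    # Backward pass: the error gradient flows from the output back toward the input
    grad_W2 = h.T @ err / len(X)           # gradient for the output-layer weights
    dh = (err @ W2.T) * h * (1.0 - h)      # chain rule through the sigmoid derivative
    grad_W1 = X.T @ dh / len(X)            # gradient for the hidden-layer weights

    # Adjust each weight in proportion to (the negative of) its error gradient
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

print(float(np.mean(err ** 2)))            # mean squared error after training
```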

At this point, the process is complete. The simple technique that was actually used is the same one used to derive the quotient and product rules in calculus: adding and subtracting the same thing, which changes nothing, to create a more useful representation.
