Deep Neural Networks. Theory. Part 2.

Gradient Descent

Send us a message if you’re interested in Blockchain and FinTech software development, or just say hi at Pharos Production Inc.

Or follow us on YouTube to learn more about Software Architecture, Distributed Systems, Blockchain, High-load Systems, Microservices, and Enterprise Design Patterns.

Pharos Production Youtube channel

In the previous article we said that the network’s output error is a function of its weights. So our goal is to find the weights that minimize the error.

GRADIENT DESCENT

To find the weights with minimum error, we tune them in the direction that reduces the error, as in the image above. Each step goes opposite to the slope, that is, the gradient. If we keep descending along the gradient, we reach the bottom: the minimum of the error function.

Let’s define an error term and a weight step for gradient descent. The weight step is proportional to the gradient: the partial derivative of the error with respect to each weight. We also introduce an arbitrary scaling parameter, the learning rate (eta), which sets the size of each gradient descent step.

So the error term is the error times the derivative of the activation function, and the weight step is the learning rate times the error term times the input value.

Error Term
Weight step
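Written out (a standard formulation; y is the target, ŷ is the network output, f is the activation function, h is the input to the unit, and x_i is the i-th input):

```latex
\delta = (y - \hat{y})\, f'(h),
\qquad
\Delta w_i = \eta\, \delta\, x_i
```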

These formulas are for one output unit. For multiple outputs, the total error is the sum of the individual output errors.

MEAN SQUARE ERROR

When we have many records, summing up all the weight steps can lead to really large update values. So instead of the SSE we will use the Mean Square Error (MSE). We could compensate for the large steps with a really small learning rate, but instead we divide by the number of records in our data. Then the learning rate will typically fall in the range 0.01–0.001.

Mean of the Square Errors
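As a formula (a common convention; the factor 1/2 simplifies the derivative, m is the number of records, and μ indexes the records):

```latex
E = \frac{1}{2m} \sum_{\mu} \left( y^{\mu} - \hat{y}^{\mu} \right)^{2}
```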

ALGORITHM

Here is the general algorithm for updating the weights using gradient descent:

  1. Set the weight step to zero;
  2. For each record, make a forward pass through the network and:
  • calculate the output
Output
  • calculate the error gradient at the output unit
Error gradient
  • update the weight step
Weight step

3. Update the weights, where eta is the learning rate and m is the number of records.

Weight

4. Repeat for e epochs.

For the activation function we use the sigmoid here.

Sigmoid

And the derivative of the sigmoid is

Sigmoid derivative
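Written out, the sigmoid and its derivative are:

```latex
f(h) = \frac{1}{1 + e^{-h}},
\qquad
f'(h) = f(h)\,\left(1 - f(h)\right)
```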

Where h is the input to the output unit

Input to output unit
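Putting the algorithm together, here is a minimal sketch in Python with NumPy for a single sigmoid output unit. The data, learning rate, and number of epochs are illustrative assumptions, not values from the text:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))            # m records with 3 features each
true_w = np.array([0.5, -0.2, 0.1])      # hidden "true" weights
y = sigmoid(X @ true_w)                  # synthetic targets

weights = np.zeros(3)
eta = 0.5                                # learning rate
m = X.shape[0]                           # number of records

for epoch in range(200):
    delta_w = np.zeros(3)                          # 1. set the weight step to zero
    for x_i, y_i in zip(X, y):
        h = x_i @ weights                          # input to the output unit
        output = sigmoid(h)                        # 2. forward pass: output
        error = y_i - output                       #    output error
        error_term = error * output * (1 - output) #    error * f'(h)
        delta_w += error_term * x_i                #    update the weight step
    weights += eta * delta_w / m                   # 3. update weights (divide by m)

mse = np.mean((y - sigmoid(X @ weights)) ** 2)     # final mean square error
```

The division by m in the update is exactly the MSE scaling discussed above: it keeps the step size independent of how many records we sum over.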

HIDDEN LAYER

So far we have dealt with a single output unit. Now let’s add our first hidden layer.

Hidden layer
Each hidden layer unit
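In symbols (a standard formulation), each hidden unit j receives a weighted sum of the inputs and applies the activation function:

```latex
h_j = \sum_i w_{ij}\, x_i,
\qquad
a_j = f(h_j)
```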

BACKPROPAGATION

Let’s make a multilayer neural network learn. We will update the weights of the hidden layers based on the output error. But to do this we need to know how much each of the hidden units contributed to the final error. Since we know the error at the output, we can use the weights to work backwards to the hidden layers.

Error of each hidden unit

j — hidden unit, k — output unit, delta-o — output error term, W — weight matrix between the hidden and output layers, f′(hj) — the derivative of the activation function at the hidden unit.
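With that notation, the error term of hidden unit j is:

```latex
\delta_j = f'(h_j) \sum_k W_{jk}\, \delta^{o}_k
```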

Then the gradient descent step is the same as before, just with the new error terms.

Weight step

Where w — weights between inputs and hidden layer, x — inputs.
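So the step for the input-to-hidden weights is (same notation as above):

```latex
\Delta w_{ij} = \eta\, \delta_j\, x_i
```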

The same pattern extends to however many layers there are.

Delta output — the error term of the layer’s output, V input — the input to the layer (for example, the hidden layer activations for the output layer), eta — the learning rate (step size).
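The backpropagation procedure can be sketched in Python with NumPy; the general update (eta times the output error term times the input to the layer) appears as the last two lines of the loop. The network size, input, target, and hyperparameters here are illustrative assumptions:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # one input record, 3 features
y = 0.7                                         # target

W_hidden = rng.normal(scale=0.1, size=(3, 2))   # input -> hidden weights
w_out = rng.normal(scale=0.1, size=2)           # hidden -> output weights
eta = 0.5                                       # learning rate

for step in range(500):
    # Forward pass
    hidden_out = sigmoid(x @ W_hidden)          # hidden layer activations
    output = sigmoid(hidden_out @ w_out)        # network output

    # Backward pass: propagate the output error to the hidden layer
    delta_o = (y - output) * output * (1 - output)              # output error term
    delta_h = delta_o * w_out * hidden_out * (1 - hidden_out)   # hidden error terms

    # Gradient descent step: eta * (layer's error term) * (layer's input)
    w_out += eta * delta_o * hidden_out
    W_hidden += eta * np.outer(x, delta_h)

final_output = sigmoid(sigmoid(x @ W_hidden) @ w_out)
```

Note how `delta_h` is computed by pushing `delta_o` backwards through `w_out` and multiplying by the sigmoid derivative at the hidden units, exactly as in the hidden-unit error formula above.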

Thanks for reading!


Dmytro Nasyrov
Pharos Production

We build high-load software. Pharos Production founder and CTO.