Deep Neural Networks. Theory. Part 2.

Gradient Descent

Send us a message if you’re interested in Blockchain and FinTech software development, or just say hi at Pharos Production Inc.

Or follow us on YouTube to learn more about Software Architecture, Distributed Systems, Blockchain, High-load Systems, Microservices, and Enterprise Design Patterns.

Pharos Production Youtube channel

In the previous article we said that the network’s output error is a function of its weights. So our goal is to find the weights that minimize the error.

GRADIENT DESCENT

To find the weights with minimum error, we tune them in the direction that reduces the error, as in the image above. Each step goes opposite to the slope, that is, the gradient. If we keep descending along the gradient, we reach the bottom: the minimum of the error function.

Let’s define an error term and a weight step for gradient descent. The weight step is proportional to the gradient: the partial derivative of the error with respect to each weight. We also introduce an arbitrary scaling parameter, the learning rate (eta), which sets the size of each gradient descent step.

So the error term is the error times the derivative of the activation function, and the weight step is the learning rate times the error term times the input value.

Error Term
Weight step
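Written out (a standard formulation; y is the target, ŷ is the network output, f is the activation function, h is the input to the unit, and x_i is the i-th input):

```latex
\delta = (y - \hat{y})\, f'(h),
\qquad
\Delta w_i = \eta\, \delta\, x_i
```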

These formulas are for one output unit. For multiple outputs, the total error is the sum of the individual output errors.

MEAN SQUARE ERROR

When we have many records, summing up all the weight steps can lead to really large update values. So instead of the SSE we will use the Mean Square Error (MSE). We could compensate for the large steps with a really small learning rate, but instead we divide by the number of records in our data. Then the learning rate will typically fall in the range 0.01–0.001.

Mean of the Square Errors
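As a formula (a common convention; the factor 1/2 simplifies the derivative, m is the number of records, and μ indexes the records):

```latex
E = \frac{1}{2m} \sum_{\mu} \left( y^{\mu} - \hat{y}^{\mu} \right)^{2}
```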

ALGORITHM

Here is the general algorithm for updating the weights using gradient descent:

  1. Set the weight step to zero;
  2. For each record, make a forward pass through the network and:
  • calculate the output
Output
  • calculate the error gradient at the output unit
Error gradient
  • update the weight step
Weight step

3. Update the weights, where eta is the learning rate and m is the number of records.

Weight

4. Repeat for e epochs.

For the activation function we use the sigmoid here.

Sigmoid

And the derivative of the sigmoid is

Sigmoid derivative
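Written out, the sigmoid and its derivative are:

```latex
f(h) = \frac{1}{1 + e^{-h}},
\qquad
f'(h) = f(h)\,\left(1 - f(h)\right)
```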

Where h is the input to the output unit

Input to output unit
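Putting the algorithm together, here is a minimal sketch in Python with NumPy for a single sigmoid output unit. The data, learning rate, and number of epochs are illustrative assumptions, not values from the text:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))            # m records with 3 features each
true_w = np.array([0.5, -0.2, 0.1])      # hidden "true" weights
y = sigmoid(X @ true_w)                  # synthetic targets

weights = np.zeros(3)
eta = 0.5                                # learning rate
m = X.shape[0]                           # number of records

for epoch in range(200):
    delta_w = np.zeros(3)                          # 1. set the weight step to zero
    for x_i, y_i in zip(X, y):
        h = x_i @ weights                          # input to the output unit
        output = sigmoid(h)                        # 2. forward pass: output
        error = y_i - output                       #    output error
        error_term = error * output * (1 - output) #    error * f'(h)
        delta_w += error_term * x_i                #    update the weight step
    weights += eta * delta_w / m                   # 3. update weights (divide by m)

mse = np.mean((y - sigmoid(X @ weights)) ** 2)     # final mean square error
```

The division by m in the update is exactly the MSE scaling discussed above: it keeps the step size independent of how many records we sum over.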

HIDDEN LAYER

So far we have dealt with a single output unit. Now let’s add our first hidden layer.

Hidden layer
Each hidden layer unit
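In symbols (a standard formulation), each hidden unit j receives a weighted sum of the inputs and applies the activation function:

```latex
h_j = \sum_i w_{ij}\, x_i,
\qquad
a_j = f(h_j)
```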

BACKPROPAGATION

Let’s make a multilayer neural network learn. We will update the weights of the hidden layers based on the output error. But to do this we need to know how much each of the hidden units contributed to the final error. Since we know the error at the output, we can use the weights to work backwards to the hidden layers.

Error of each hidden unit

j — hidden unit, k — output unit, delta-o — output error term, W — weight matrix between the hidden and output layers, f′(hj) — the derivative of the activation function at the hidden unit.
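With that notation, the error term of hidden unit j is:

```latex
\delta_j = f'(h_j) \sum_k W_{jk}\, \delta^{o}_k
```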

Then the gradient descent step is the same as before, just with the new error terms.

Weight step

Where w — weights between inputs and hidden layer, x — inputs.
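So the step for the input-to-hidden weights is (same notation as above):

```latex
\Delta w_{ij} = \eta\, \delta_j\, x_i
```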

The same pattern extends to however many layers there are.

Delta output — the error term of the layer’s output, V input — the input to the layer (for example, the hidden layer activations for the output layer), eta — the learning rate (step size).
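The backpropagation procedure can be sketched in Python with NumPy; the general update (eta times the output error term times the input to the layer) appears as the last two lines of the loop. The network size, input, target, and hyperparameters here are illustrative assumptions:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # one input record, 3 features
y = 0.7                                         # target

W_hidden = rng.normal(scale=0.1, size=(3, 2))   # input -> hidden weights
w_out = rng.normal(scale=0.1, size=2)           # hidden -> output weights
eta = 0.5                                       # learning rate

for step in range(500):
    # Forward pass
    hidden_out = sigmoid(x @ W_hidden)          # hidden layer activations
    output = sigmoid(hidden_out @ w_out)        # network output

    # Backward pass: propagate the output error to the hidden layer
    delta_o = (y - output) * output * (1 - output)              # output error term
    delta_h = delta_o * w_out * hidden_out * (1 - hidden_out)   # hidden error terms

    # Gradient descent step: eta * (layer's error term) * (layer's input)
    w_out += eta * delta_o * hidden_out
    W_hidden += eta * np.outer(x, delta_h)

final_output = sigmoid(sigmoid(x @ W_hidden) @ w_out)
```

Note how `delta_h` is computed by pushing `delta_o` backwards through `w_out` and multiplying by the sigmoid derivative at the hidden units, exactly as in the hidden-unit error formula above.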

Thanks for reading!


Dmytro Nasyrov
Pharos Production

We build high-load software. Pharos Production founder and CTO.