Stochastic Gradient Descent

Gary (Chang, Chih-Chun)
Published in Deep Learning · 2 min read · Apr 30, 2018

Cost Function?

To begin, we formulate a “cost function” (also called a loss function) to evaluate how bad a parameter set 𝛳 of the neural network is, written C(𝛳). We then have to find the 𝛳* that minimizes the cost:

𝛳* = arg min_𝛳 C(𝛳)

For example, take the cost function defined for handwritten digit classification. We are given R training examples x¹, …, xʳ, …, xᴿ, where each y is the corresponding ground truth. Our goal is to find the 𝛳* that minimizes C(𝛳).
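Concretely, the total cost is usually the sum of per-example losses over all R training pairs. A minimal sketch (the function names and the squared-error loss here are illustrative, not from the original article):

```python
# Hypothetical sketch: total cost C(theta) as the sum of per-example losses.
def cost(theta, xs, ys, loss):
    """C(theta) = sum of loss(theta, x^r, y^r) over all R training pairs."""
    return sum(loss(theta, x, y) for x, y in zip(xs, ys))

# Example: squared-error loss for a toy 1-parameter model y = theta * x.
squared_error = lambda theta, x, y: (theta * x - y) ** 2

print(cost(2.0, [1, 2, 3], [2, 4, 6], squared_error))  # 0.0 at the optimum
print(cost(1.0, [1, 2, 3], [2, 4, 6], squared_error))  # 14.0 away from it
```

The cost is zero only when the parameter fits every training pair exactly; otherwise it accumulates error across examples.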

How to find the minimum?

Imagine dropping a ball somewhere on the cost surface: it rolls downhill and comes to rest at a local minimum, just as the picture shows.

So we repeatedly compute the gradient at the current point and move in the opposite direction, until the gradient approaches zero. Suppose that 𝛳 has two variables {𝛳1, 𝛳2}; each step updates both of them simultaneously.
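The repeated update 𝛳 ← 𝛳 − 𝜂∇C(𝛳) for two variables might be sketched as follows (the quadratic cost and its gradient are illustrative examples, chosen so the minimum is known to be at (0, 0)):

```python
# Illustrative gradient descent on C(theta1, theta2) = theta1**2 + theta2**2,
# whose gradient is (2*theta1, 2*theta2) and whose minimum sits at (0, 0).
def gradient_descent(theta, grad, eta=0.1, steps=100):
    for _ in range(steps):
        g1, g2 = grad(theta)
        # Move each variable against its partial derivative.
        theta = (theta[0] - eta * g1, theta[1] - eta * g2)
    return theta

grad = lambda t: (2 * t[0], 2 * t[1])
theta = gradient_descent((3.0, -4.0), grad)
print(theta)  # very close to the minimum (0, 0)
```

Each iteration shrinks both coordinates toward the minimum; the loop stops after a fixed number of steps here, but in practice one also stops when the gradient is near zero.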

Learning Rate

𝜂 represents the learning rate: how far each update moves from the current point to the next. If we pick a small learning rate, it may take a long time to reach the local minimum, while if we set one that is too large, we will overshoot and miss the minimum.
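This tradeoff is easy to see numerically on a one-dimensional quadratic (the cost C(𝛳) = 𝛳² and the step counts below are illustrative):

```python
# Effect of the learning rate eta when minimizing C(theta) = theta**2,
# whose gradient is 2*theta and whose minimum is at theta = 0.
def run(eta, theta=1.0, steps=50):
    for _ in range(steps):
        theta -= eta * 2 * theta
    return theta

print(run(0.01))  # small eta: still far from 0 after 50 steps (slow progress)
print(run(0.4))   # moderate eta: converges to ~0 quickly
print(run(1.1))   # eta too large: each step overshoots and |theta| diverges
```

With a tiny 𝜂 the iterate creeps toward the minimum; past a critical size, each step jumps over the minimum to a point farther away than before, and the iterates blow up.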

Stochastic Gradient Descent

To calculate the cost, the gradient descent algorithm requires summing over all the examples in our training data, but if there are millions of examples, each update takes a long time. To speed this up, we use stochastic gradient descent.

Instead, we pick a single example x at a time, assuming that every example has an equal probability of being picked, and update the parameters using only that example's gradient.

# http://cs231n.github.io/optimization-1/
while True:
    data_batch = sample_training_data(data, batch_size=1)
    weights_grad = evaluate_gradient(loss_func, data_batch, weights)
    weights += -step_size * weights_grad  # move against the gradient
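A runnable toy version of this loop, with the sampling and gradient helpers filled in for a one-parameter linear model y = w·x with squared-error loss (the dataset, true weight 3.0, step size, and step count are all illustrative assumptions, not from the cs231n snippet):

```python
import random

# Toy dataset for y = w * x; the true weight is 3.0 (an assumed example).
data = [(x, 3.0 * x) for x in range(1, 6)]

def evaluate_gradient(w, example):
    x, y = example
    return 2 * (w * x - y) * x  # derivative of (w*x - y)**2 with respect to w

random.seed(0)
w = 0.0
for _ in range(1000):
    x_r, y_r = random.choice(data)          # each example equally likely
    w -= 0.01 * evaluate_gradient(w, (x_r, y_r))  # step on one example only

print(w)  # converges to the true weight, close to 3.0
```

Each iteration touches one example instead of all of them, so an update costs O(1) in the dataset size; the noise introduced by random sampling averages out over many steps.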

If you like this article and find it useful, please support it with 👏.
