Gradient Descent: The Best Optimization Algorithm

Faisal Shahbaz
Dec 13, 2018 · 4 min read


At a theoretical level, gradient descent is an algorithm that minimizes functions. In machine learning, its main objective is to minimize the cost function, which measures the error (the difference between the actual value and the predicted value).

It basically does what we would otherwise do by hand: change the theta values (the parameters) bit by bit until we find the desired minimum.
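
Formally, each parameter θ_j is repeatedly nudged in the direction of the negative gradient of the cost J(θ), scaled by a learning rate α (the step size):

\theta_j := \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j}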

Understand With an Example

Suppose you are at the top of a mountain and have to reach a lake at the mountain's lowest point. The twist is that you are blindfolded, so you have zero visibility of where you are headed. What approach will you take to reach the lake?

The best way is to check the ground near you and observe where the land tends to descend. This gives you an idea of the direction in which to take your first step. If you keep following the descending path, it is very likely you will reach the lake.

More concretely, given a cost function J, gradient descent requires the derivative of the cost with respect to each weight, dJ/dw, computed for all of the weights. What dJ/dw looks like depends on your choice of cost function; the most common choice is the Mean-Squared Error (MSE) cost function.

The formulas below show the cost and gradient computation for linear regression with the Mean-Squared Error cost function.
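
With the hypothesis h_θ(x) = θᵀx, m training examples and the usual 1/2m convention (the factor of 1/2 simplifies the derivative), the cost and its gradient with respect to each weight θ_j are:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}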

Types of Gradient Descent Algorithms

There are three types of Gradient Descent Algorithms:

  1. Batch Gradient Descent
  2. Stochastic Gradient Descent
  3. Mini-Batch Gradient Descent

Batch Gradient Descent

  • In batch gradient descent, to calculate the gradient of the cost function, we sum over all training examples at each step.
  • If we have 5 million training samples, the algorithm has to sum over all 5 million samples every epoch; taking a single step means computing 5 million contributions!
  • The model is updated only after all training samples have been evaluated, i.e. once per epoch.
  • Model updates, and in turn training speed, may become very slow for large data-sets.
  • Python code implementation.
import numpy as np

def gradientDescent(X, y, theta, alpha, num_iters):
    """
    Performs batch gradient descent to learn theta.
    X is the (m, n) design matrix, y the (m,) targets, theta the (n,) initial
    parameters, alpha the learning rate and num_iters the number of iterations.
    """
    m = y.size  # number of training examples
    for i in range(num_iters):
        y_hat = np.dot(X, theta)  # predictions for all m examples at once
        theta = theta - alpha * (1.0 / m) * np.dot(X.T, y_hat - y)  # MSE gradient step
    return theta
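
As a minimal usage sketch (the synthetic data and hyperparameters below are illustrative assumptions, not part of the original example), fitting a one-feature linear model might look like this:

# Synthetic data: a bias column plus one feature, with known true parameters
X = np.hstack([np.ones((100, 1)), np.random.rand(100, 1)])
true_theta = np.array([2.0, 3.0])
y = X.dot(true_theta) + 0.1 * np.random.randn(100)

theta = gradientDescent(X, y, theta=np.zeros(2), alpha=0.1, num_iters=5000)
print(theta)  # should end up close to [2.0, 3.0]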

Stochastic Gradient Descent

  • In stochastic gradient descent, we use a single training example at each iteration instead of summing over the whole data-set at every step.
  • SGD is widely used for training on larger data-sets because each update is computationally cheap, and the work can be distributed across machines.
  • The training examples need to be randomly shuffled before each pass over them.
  • The frequent updates immediately give insight into the performance of the model and its rate of improvement; this one-example-at-a-time variant is often called online machine learning.
  • The frequent updates also produce a noisy gradient signal, and this noisy descent down the error gradient can make it harder for the algorithm to settle on a minimum.
  • Python code implementation.
def SGD(f, theta0, alpha, num_iters):
    """
    Arguments:
    f -- the function to optimize; it takes a single argument (theta) and
         returns two outputs, a cost and the gradient with respect to theta,
         evaluated on a single randomly chosen training example
    theta0 -- the initial point to start SGD from
    alpha -- the learning rate
    num_iters -- total iterations to run SGD for
    Return:
    theta -- the parameter value after SGD finishes
    """
    theta = theta0
    for it in range(num_iters):
        _, grad = f(theta)            # cost and gradient for one example
        theta = theta - alpha * grad  # plain scaling, no dot product
    return theta
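
As a sketch of how f could be built for the linear-regression data from the batch example above (make_single_example_objective is an illustrative helper, not part of the original post), each call evaluates the cost and gradient on one randomly chosen example:

def make_single_example_objective(X, y):
    """Builds an f(theta) that evaluates the MSE cost/gradient on one random example."""
    def f(theta):
        i = np.random.randint(y.size)   # pick a single training example at random
        err = X[i].dot(theta) - y[i]
        cost = 0.5 * err ** 2
        grad = err * X[i]               # gradient of the single-example cost
        return cost, grad
    return f

theta_sgd = SGD(make_single_example_objective(X, y), theta0=np.zeros(2),
                alpha=0.05, num_iters=5000)
# theta_sgd hovers near the batch solution rather than converging exactly,
# because the constant step size keeps the noisy updates jittering around it.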

Mini-Batch Gradient Descent

  • It splits the training data-set into small batches that are used to calculate the model error and update the model coefficients.
  • The batched updates are computationally more efficient than pure stochastic gradient descent.
  • Error information is accumulated across each mini-batch of training examples, as in batch gradient descent, before the update is applied.
  • Python code implementation.
import numpy as np

# This snippet assumes helpers that are not shown in this post: make_network()
# builds the model, sgd() trains it with mini-batches of the given size, and
# forward() returns the predicted class probabilities for one input. It also
# assumes X_train, y_train, X_test, y_test are already loaded.

minibatch_size = 50
n_experiment = 100

# Create placeholder to accumulate prediction accuracy
accs = np.zeros(n_experiment)

for k in range(n_experiment):
    # Reset model
    model = make_network()

    # Train the model on mini-batches of 50 examples
    model = sgd(model, X_train, y_train, minibatch_size)

    y_pred = np.zeros_like(y_test)

    for i, x in enumerate(X_test):
        # Predict the distribution over labels
        _, prob = forward(x, model)
        # Get the label by picking the most probable one
        y_pred[i] = np.argmax(prob)

    # Compare the predictions with the true labels and take the fraction correct
    accs[k] = (y_pred == y_test).sum() / y_test.size

print('Mean accuracy: {}, std: {}'.format(accs.mean(), accs.std()))
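
Since those helpers are not shown in this post, a self-contained mini-batch sketch for the linear-regression data used earlier might look like the following (the function name and hyperparameters are illustrative assumptions):

def minibatch_gradient_descent(X, y, theta, alpha, num_epochs, batch_size):
    """Mini-batch gradient descent for linear regression with the MSE cost."""
    m = y.size
    for epoch in range(num_epochs):
        perm = np.random.permutation(m)              # shuffle the examples each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T.dot(Xb.dot(theta) - yb) / idx.size   # MSE gradient on this batch
            theta = theta - alpha * grad
    return theta

theta_mb = minibatch_gradient_descent(X, y, np.zeros(2), alpha=0.1,
                                      num_epochs=50, batch_size=10)
# Each update uses 10 examples: cheaper than a full batch, far less noisy than SGD.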

