Day 3: Improved Batch Gradient Descent

Krrish Dholakia
Journey Into Vision and AI
5 min read · Oct 12, 2017

Disclaimer

This is my attempt at expanding my own knowledge of artificial intelligence by sharing what I learn from various online resources on a weekly basis. I will always include the relevant links, in an attempt to redirect you to people who know far more about ML/AI than I currently do. If you’d like me to elaborate on something, or if I have misinterpreted something and made an error in explaining it, please leave a comment! This will only help me improve the quality of content available to people on this publication.

Resource Materials

A great resource for an introduction to Machine Learning is Andrew Ng’s course, available on both Coursera and YouTube. Although I already have some experience with the Coursera course, having started watching the YouTube videos, my personal preference is the YouTube series: I enjoy the work Ng does with proofs, and the student-teacher interactions in the recorded lectures.

Another resource that I have begun using for Machine Learning is the course created by Udacity and Georgia Tech, which is available for free on the Udacity website.

Non-Iterative Batch Gradient Descent

A significant issue with the initial learning algorithm for batch gradient descent was that we had to iterate across every training example to improve the weights. The algorithm in focus is described below:

Repeat {

    for j = 0 to n {

        theta_j := theta_j − alpha · (1/m) · Σ_{i=1..m} ( h(x^(i)) − y^(i) ) · x_j^(i)

    }

}
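Since the original listing is only a skeleton, here is a minimal runnable sketch of the iterative algorithm, assuming the usual least-squares setup from Ng’s course (the function and variable names are my own):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Iterative batch gradient descent (illustrative sketch).

    X: (m, n+1) design matrix whose first column is all ones (x_0 = 1)
    y: (m,) vector of target values
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        # Every update sums the prediction error over all m training
        # examples -- the per-iteration cost the article describes.
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta
```

Note that each pass touches every training example, which is exactly the expense the rest of the article works to avoid.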

What we’re going to do is try to improve the weights (theta) in a single step.

The cost function J(THETA) can be re-written in vector form.

The learning rate (alpha) is constant across all values. This allows us to keep it constant in the general vector equation we are going to describe below.

The THETA values can be collected into a vector containing every weight from THETA_0 to THETA_n (here n is the number of features, so the vector has n + 1 entries).

Here’s how that update equation would now look:
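The figure with the update equation is missing from this version of the article; a reconstruction in Ng’s standard notation (an assumption on my part, based on the surrounding text) would read:

```latex
\theta := \theta - \alpha \, \nabla_\theta J(\theta),
\qquad
\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m}
    \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```

All n + 1 components of THETA are updated simultaneously in this single vector step.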

Improving The Initial Learning Algorithm

It’s important at this stage to remember what the x-value holds: the list of feature values for a particular training example.

Therefore, from the above description, the feature vector x for a single training example can be written as:
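The figure for this vector is also missing; in the usual convention (which I am assuming here), with an extra x_0 = 1 entry for the intercept term, it is:

```latex
x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix},
\qquad x_0 = 1
```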

Remember that the formula for hypothesis was:

hypothesis with multiple features

This equation was derived in the previous article (corresponding equation number is (3)).

Having written down the features of a training example as a feature vector, we can now conduct vector multiplication. If you recall from the previous sub-heading, we managed to get the weights as a vector as well, so here is how the combined vector multiplication looks:
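The original equation image is missing; a plausible reconstruction, matching the multiple-feature hypothesis from the previous article, is:

```latex
h_\theta(x) = \theta^{T} x
            = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n
```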

The same weight vector applies to every training example. This is a benefit of the non-iterative batch gradient descent algorithm we devised above. The hypothesis equation can therefore be re-written as:

As you may have noticed above, I included a subscript “i” on the hypothesis to denote a different hypothesis value for each example in the training set. This is intuitive enough. What it allows us to do is re-write the hypothesis values as an “m”-dimensional vector:

hypothesis vector
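The image of this vector is missing; assuming the standard design-matrix notation (where each row of X is one training example’s feature vector, transposed), it is:

```latex
h = \begin{bmatrix} h_\theta(x^{(1)}) \\ \vdots \\ h_\theta(x^{(m)}) \end{bmatrix}
  = X\theta,
\qquad
X = \begin{bmatrix} (x^{(1)})^{T} \\ \vdots \\ (x^{(m)})^{T} \end{bmatrix}
```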

Similarly, we can also write the y-values as a vector:

This encompasses all the “true prices” corresponding to the “predicted prices”.

The true-price vector (y value vector) will also be an m-dimensional vector.
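The missing figure for this vector would, under the same notation as above, simply be:

```latex
y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}
```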

Recall that Cost-Function (J(THETA)) is:
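The cost-function image is missing; in the least-squares form used throughout Ng’s course (which I am assuming this article follows), both the summation and vector versions are:

```latex
J(\theta)
  = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
  = \frac{1}{2m} \, (X\theta - y)^{T} (X\theta - y)
```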

What we’re going to do is derive the cost function from our vectors and then minimise it using vector algebra. For problems with a modest number of features, this closed-form approach can be much faster than iterating.

Equation (26) is the vector form of the J(THETA) cost-function, which we differentiate with respect to THETA in order to find the minimum of the cost function. At a minimum (local or global) the slope is zero, so we set the first derivative equal to 0.

Another point to note, in case you were wondering how we managed to eliminate one of the terms: any term that does not involve THETA is constant with respect to THETA, and therefore its derivative is 0, as shown in equation (23).

This is how our equation will re-work itself when we attempt to find the minimum cost-function value.
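The derivation images are missing; starting from the vector cost function above, the standard steps (a reconstruction, not the article’s original figures) are:

```latex
\nabla_\theta J(\theta) = \frac{1}{m} X^{T} (X\theta - y) = 0
\;\Longrightarrow\;
X^{T} X \, \theta = X^{T} y
\;\Longrightarrow\;
\theta = (X^{T} X)^{-1} X^{T} y
```

This last line is commonly known as the normal equation.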

Thus we have arrived at a closed-form expression for THETA that minimises the cost function.
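As a sketch of how this closed-form result can be computed in practice (assuming NumPy and the same design-matrix setup as before; the function name is my own):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares fit: theta = (X^T X)^(-1) X^T y.

    Solving the linear system X^T X theta = X^T y directly is more
    numerically stable than explicitly inverting X^T X.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```

Note that `np.linalg.solve` requires X^T X to be invertible; when it is singular (e.g. redundant features), a pseudo-inverse via `np.linalg.lstsq` is the usual fallback.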

Thank you for taking the time to read this! If you have any doubts or questions, feel free to leave them in the comments. In the next article, I will be showing non-linear relationships between predicted values (hypothesis) and our features!

Relevant links:

Andrew Ng’s Coursera course: https://www.coursera.org/learn/machine-learning

Andrew Ng’s YouTube videos: https://www.youtube.com/watch?v=UzxYlbK2c7E&list=PLA89DCFA6ADACE599

Udacity+Georgia Tech Machine Learning Course: https://classroom.udacity.com/courses/ud262
