Learning Machine Learning — Part 1: Linear Regression, Cost Functions and Gradient Descent

Ryan Gotesman
5 min read · Mar 18, 2018


I just finished going through week 1 of Andrew Ng’s machine learning course on Coursera and there were a lot of interesting ideas. I thought I would summarize and discuss the more important ones.

First off, a definition.

Machine learning is the science of teaching a computer how to solve a problem without explicitly programming the solution. Rather, you let the computer arrive at its own empirical solution by looking at the data.

This is a very powerful method because it can generalize across many fields and be used everywhere from diagnosing breast cancer to suggesting a Spotify song.

Machine learning is generally broken down into two main categories: supervised and unsupervised learning.

Supervised learning is used when you have a dataset with both input x and output y, i.e. the data is labeled. You know that a relationship exists that somehow links x to y, and you set about discovering it through machine learning.

For instance, if you wanted to predict a person’s weight from their height, and you had a dataset consisting of pairs of heights and weights, this would be a supervised learning problem: both the independent and dependent variables are known. Specifically, it is a regression problem, since you are trying to predict a continuous variable.

If instead the dataset told you each person’s weight and whether they were “healthy” or “unhealthy”, and given somebody’s weight you wanted to make this prediction, that would be a classification problem, since you are trying to predict discrete classes.

The second main kind of machine learning is unsupervised learning, which is when you have a dataset with no labels and you want to see whether it can be divided into subgroups that reveal some underlying order. A recent example of this appeared in the medical journal The Lancet, where researchers used cluster analysis to identify 5 subgroups of diabetes, compared to the usual 2.

The course then moves on to the first learning model: linear regression with one independent variable, or univariate linear regression.

Figure 1 — Univariate Linear Regression Model

Linear regression has the mathematical form h shown in Fig 1, where the parameters theta zero and theta one define the y-intercept and the slope respectively. For any dataset, you want a learning algorithm that can determine which values of theta lead to h best modeling the data.
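
In standard notation, the hypothesis in Fig 1 is just the equation of a line:

h_\theta(x) = \theta_0 + \theta_1 x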

The obvious question is how do we measure how well h fits the data? And that is where the cost function J comes in. J can come in many forms but in this case we define the cost function as below. Note that it is a function of theta whereas h is a function of x.

Figure 2 — The cost function
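
Written out, the cost function shown in Fig 2 is the (halved) mean squared error:

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2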

The way this cost function works is as follows. For each input x in the dataset, we look at what our candidate linear regression model predicts the dependent variable should be, compare it with the true value of the dependent variable associated with that x, take the difference and square it. We do this for every element of the dataset, add the results up and divide by m, the total number of data elements, to obtain the average squared distance between what h predicts the dependent variable should be and what it truly is. We divide this value by 2 simply for convenience: the factor cancels neatly when we take the derivative during gradient descent.
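
To make this concrete, here is a minimal sketch of the cost computation in Python. The function and variable names are my own, and the height/weight numbers are made up for illustration; they are not from the course.

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Halved average squared error between predictions h(x) and targets y."""
    m = len(y)                          # number of training examples
    predictions = theta0 + theta1 * x   # h(x) for every example
    squared_errors = (predictions - y) ** 2
    return squared_errors.sum() / (2 * m)

# Toy data: heights in metres (x) and weights in kg (y), invented for illustration
x = np.array([1.5, 1.6, 1.7, 1.8])
y = np.array([55.0, 60.0, 65.0, 70.0])
print(compute_cost(x, y, theta0=0.0, theta1=40.0))  # cost for one candidate theta
```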

Clearly, if the linear regression hypothesis were perfect it would always predict the correct value of the dependent variable, giving J = 0. That is the goal. Depending on the values of theta that we give our model, J will take on some value, and the shape of J for univariate linear regression is shown below (Fig 3). Rather than plugging in thousands, or tens of thousands, of combinations of theta and seeing which gives the smallest value of J, we would like an algorithm that can get us to the values of theta that minimize J, regardless of the starting values of theta.

Figure 3 — Cost function J for Univariate Linear Regression
Figure 4 — Gradient descent updating theta

The algorithm that does this is called gradient descent. Consider some arbitrary cost function J with the shape in Fig 4, and imagine we initialized our learning model with the values of theta marked by the X. Gradient descent updates the values of theta, choosing ones that reduce J with each iteration, until a local minimum is reached. Depending on the initial values of theta, gradient descent may converge to different local minima, but this is not a problem for univariate linear regression since its cost function has a single minimum.

The rule defining the gradient descent algorithm is as follows:

Figure 5 — Gradient Descent Algorithm
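
Written out, the update rule in Fig 5 is, for j = 0 and j = 1:

\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)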

A key feature of the algorithm is that the thetas are updated simultaneously: the partial derivative of J is calculated with respect to both theta zero and theta one before either value of theta is changed. Another important feature is alpha, the learning rate, which determines how quickly gradient descent converges. It is equivalent to a step size. Make it too small and the algorithm will take a long time to converge. Make it too large and the algorithm might overstep the local minimum, or even diverge. The partial derivative gives the direction in which J is increasing, and you take a step in the opposite, downhill direction, updating theta as you go until a local minimum is reached.
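
Putting the pieces together, here is a minimal sketch of batch gradient descent for univariate linear regression in Python. As before, the names, learning rate and iteration count are illustrative choices of mine rather than values from the course.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.5, num_iters=10000):
    """Fit theta0 (intercept) and theta1 (slope) by batch gradient descent."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        predictions = theta0 + theta1 * x
        error = predictions - y
        # Compute both partial derivatives of J before touching theta,
        # so the two parameters are updated simultaneously.
        grad0 = error.sum() / m            # dJ/d(theta0)
        grad1 = (error * x).sum() / m      # dJ/d(theta1)
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Same toy height/weight data as above; the result should approach the
# least-squares line for this data.
x = np.array([1.5, 1.6, 1.7, 1.8])
y = np.array([55.0, 60.0, 65.0, 70.0])
print(gradient_descent(x, y))
```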

The week ended with a review on basic linear algebra properties but I knew most of this from a university course so nothing too interesting was shared.

Overall I really enjoyed the first week of the course and am looking forward to learning more. Comment below if you have any questions or if anything was unclear.

Thanks
