Learning Machine Learning — Part 2: Multiple Linear Regression

Ryan Gotesman
4 min read · Mar 25, 2018


This is a continuation of my Learning Machine Learning series. You can find Part 1 here.

Week 2 of Coursera’s Machine Learning course covers multiple linear regression. This is similar to univariate linear regression, except that instead of a single independent variable, our model now includes two or more. Each of these variables comes with its own parameter, so we can write our new hypothesis as:
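$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$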

Though this formula may seem complex, understanding it is quite simple. theta_0 represents the value of the hypothesis when all independent variables have a value of zero. Every other parameter theta_i can be thought of as how much h(x) increases when x_i increases by 1 unit. That is because, if we hold every other variable constant and increase x_i by 1, all the other terms cancel and we are left with

$$h(x_1, \dots, x_i + 1, \dots, x_n) - h(x_1, \dots, x_i, \dots, x_n) = \theta_i (x_i + 1) - \theta_i x_i = \theta_i$$

To make the notation cleaner we can group all the thetas into a column vector theta, all the variables into another column vector x, and by defining x_0=1 we can write h(x) succinctly as:
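$$h_\theta(x) = \theta^T x$$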

As before, we want to find the parameters theta that minimize our cost function J:
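$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where m is the number of training examples and (x^(i), y^(i)) is the i-th training example.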

and we achieve this through the gradient descent algorithm:
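$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

repeated until convergence, with every theta_j updated simultaneously.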

Note that the summation term comes from taking the partial derivative of J with respect to theta_j:
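$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$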

Using gradient descent we can start with initially random values of theta and step by step converge to parameters that optimize the fit of our hypothesis.

The course then moved on to consider more practical matters when running the gradient descent algorithm. The first was the importance of feature scaling. Having data variables with wildly different scales can increase the time it takes for gradient descent to converge.

Imagine you’re trying to model data with 2 variables, one being household income, the second the number of children in the home. It’s immediately clear that these 2 variables exist on different scales. The income variable can range from 0 to the millions or even billions, while the second variable probably won’t go over 10. In cases like these it’s important to scale the data, which can involve dividing each variable by its value range or some measure of dispersion like the standard deviation. Doing so puts the variables on a comparable scale and speeds up the rate at which gradient descent converges.
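As a rough sketch of what that scaling might look like (my own illustration, not code from the course), here is mean normalization in NumPy on a made-up feature matrix with an income column and a children column:

```python
import numpy as np

# Made-up feature matrix: column 0 is household income, column 1 is number of children.
X = np.array([[120000.0, 2.0],
              [ 45000.0, 0.0],
              [250000.0, 3.0],
              [ 80000.0, 1.0]])

# Mean normalization: subtract each column's mean and divide by its standard deviation,
# so both features end up on a comparable scale.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
```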

Another pragmatic factor to consider is how to set the learning rate alpha. If alpha is too small, gradient descent will take a long time to converge. If it is too large, gradient descent may never converge. The best way to figure out the optimal alpha is through trial and error. Start with an alpha of 0.001 and work upward toward 1, assessing how the cost function behaves with each alpha: does it decrease quickly, decrease slowly, or oscillate? Picking a good alpha makes a big difference to how reliably and quickly gradient descent converges.
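To illustrate that trial-and-error process (again my own sketch, not course code), the following runs gradient descent on a tiny made-up dataset with a few candidate learning rates and reports how the cost behaves for each:

```python
import numpy as np

def gradient_descent(X, y, alpha, iterations=50):
    """Gradient descent for linear regression; returns the cost after each step."""
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(iterations):
        predictions = X @ theta
        new_theta = theta.copy()
        for j in range(n):  # update each theta_j from the same predictions (simultaneous update)
            new_theta[j] = theta[j] - (alpha / m) * np.sum((predictions - y) * X[:, j])
        theta = new_theta
        costs.append(np.sum((X @ theta - y) ** 2) / (2 * m))
    return costs

# Tiny made-up dataset: a leading column of ones for x_0 plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Try a few learning rates and see whether the cost falls quickly, falls slowly, or blows up.
for alpha in (0.001, 0.01, 0.1, 1.0):
    costs = gradient_descent(X, y, alpha)
    print(f"alpha={alpha}: cost went from {costs[0]:.3f} to {costs[-1]:.3f}")
```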

The course then touched upon the normal equation, a closed-form formula for the optimal values of theta:
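$$\theta = (X^T X)^{-1} X^T y$$

where X is the matrix of training examples (with the x_0 = 1 column included) and y is the vector of target values.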

Unfortunately the course doesn’t go over how it’s derived, but I found a nice post that does. The cool thing about the normal equation is that you don’t need to rely on the iterative process of gradient descent to find your optimal values of theta. For datasets with a small number of features this is very convenient and efficient. However, once you get to 10,000 or more features, multiplying and inverting those enormous matrices becomes incredibly expensive and gradient descent is the way to go. In addition, the normal equation, while applicable to linear regression, can’t be used for other ML methods like logistic regression, whereas gradient descent can.
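As a quick NumPy sketch of that contrast (my own illustration), the normal equation gives the optimal theta in a single direct solve, with no learning rate and no iterations:

```python
import numpy as np

# Toy design matrix: a leading column of ones for x_0 plus two features.
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 1.5, 0.0],
              [1.0, 2.5, 3.0],
              [1.0, 3.5, 2.0]])
y = np.array([3.0, 4.0, 9.0, 10.0])

# Normal equation: solve (X^T X) theta = X^T y directly.
# np.linalg.solve is used rather than explicitly inverting X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [1., 2., 1.] for this data
```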

The week ended with an overview of key commands in the Octave programming language. Personally, I would have preferred the course to work with Python, since that is what I will use for most of my ML projects, but the knowledge is pretty transferable. One cool thing I figured out was how to vectorize the gradient descent update so that all the values of theta are updated in a single line.
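In NumPy terms (a rough equivalent of that Octave one-liner, shown here only as a sketch), the whole simultaneous update collapses to a single line:

```python
import numpy as np

# Toy setup: design matrix with a leading column of ones, targets, parameters, learning rate.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([2.0, 4.0, 6.0, 8.0])
theta = np.zeros(2)
alpha = 0.1
m = len(y)

# One gradient descent step for every theta at once: theta := theta - (alpha/m) * X^T (X theta - y)
theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
print(theta)
```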
