Andrew Ng Machine Learning Course Summary — Week 2
This article summarizes multivariate linear regression, gradient descent, polynomial regression, and the normal equation.
Disclaimer: If you haven’t read/watched the week 1 summary, I recommend you do that before reading this article.
For those of you who prefer to watch, here is a video I made on my channel regarding my journey:
Multivariate Linear Regression
While predicting the price of a house based only on its size is nice, it does not reflect how prices work in the real world. For example, the price of a house is also affected by the number of floors. That is the problem multivariate linear regression solves!
Multivariate linear regression, to put it simply, is linear regression with multiple variables. Instead of having the hypothesis hθ(x) = θ₀ + θ₁x, we will now have hθ(x) = θ₀ + θ₁x₁ + … + θₙxₙ.
Going back to the “housing price” example, for n=2:
- θ₀ — represents the base price of a house.
- θ₁ — represents the price per square foot.
- θ₂ — represents the price per floor.
Also, for those of you who know matrix multiplication, our hypothesis can be represented compactly as hθ(x) = θᵀx, where θ = [θ₀, θ₁, …, θₙ]ᵀ, x = [x₀, x₁, …, xₙ]ᵀ, and x₀ is defined to be 1.
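For a concrete picture of that vectorized form, here is a minimal NumPy sketch; the parameter values and the example house are made up purely for illustration:

```python
import numpy as np

# Hypothetical parameter values: [base price, price per square foot, price per floor]
theta = np.array([50_000.0, 120.0, 10_000.0])

# One example, with x0 = 1 prepended so the intercept theta_0 is picked up by the dot product
x = np.array([1.0, 1500.0, 2.0])  # [x0, size in square feet, number of floors]

# h_theta(x) = theta_0*x0 + theta_1*x1 + theta_2*x2, i.e. the dot product of theta and x
prediction = theta @ x
print(prediction)  # 50000 + 120*1500 + 10000*2 = 250000.0
```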
Gradient Descent for Multiple Variables
The gradient descent update itself keeps the same general form; we just repeat it for each of our n features: θⱼ := θⱼ − α·(1/m)·Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾ for j = 0, …, n.
* One thing to notice is that all parameters θ₀, …, θₙ have to be updated simultaneously, just like we did before with one variable.
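Here is a minimal sketch of that simultaneous update in NumPy; the function name, learning rate, and toy data are my own (not from the course), and the vectorized update below changes every θⱼ at once:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X is an (m, n+1) matrix whose first column is all ones (x0 = 1),
    y is an (m,) vector of targets.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        predictions = X @ theta             # h_theta(x) for every example
        errors = predictions - y            # h_theta(x^(i)) - y^(i)
        gradient = (X.T @ errors) / m       # partial derivatives for every theta_j
        theta = theta - alpha * gradient    # simultaneous update of all theta_j
    return theta

# Tiny made-up example where y = 1 + 2*x1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X, y, alpha=0.1, num_iters=2000))  # approximately [1., 2.]
```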
Feature Scaling And Mean Normalization
Going back to the “housing price” example, something you might have noticed is that while the size of a house in feet² is usually a number with 3 digits or more, the number of floors in a private house does not go beyond a single digit (well, at least for most houses). It turns out that such a difference in the scale of the input values can slow gradient descent down. That is where “feature scaling” and “mean normalization” come in.
- Feature scaling — adjusting the input values so they’re all on a similar scale, such as dividing them by the range maximum so they all fall between 0 and 1. For example, if private houses have at most 9 floors, we divide the number of floors by 9.
- Mean normalization — making the average value of each input roughly zero by subtracting the mean from each value. For example, if housing prices range from 100 to 2000 with a mean of 1000, we subtract 1000 from each price (and commonly also divide by the range, 1900, so the values end up roughly between −0.5 and 0.5). A short code sketch of both techniques follows this list.
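Here is that sketch on made-up housing features; the numbers are mine and only there to show the effect:

```python
import numpy as np

# Made-up features: [size in square feet, number of floors]
X = np.array([
    [2104.0, 3.0],
    [1416.0, 2.0],
    [1534.0, 3.0],
    [ 852.0, 1.0],
])

# Feature scaling: divide each column by its maximum so every value lands in (0, 1]
X_scaled = X / X.max(axis=0)

# Mean normalization: subtract each column's mean so the average becomes roughly zero
# (commonly combined with dividing by the range, max - min)
X_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
print(X_norm)
```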
Polynomial Regression
If a linear function does not fit our data well, we can change the behavior or curve of our hypothesis function by making it a quadratic (hθ(x) = θ₀ + θ₁x₁ + θ₂x₁²), cubic (hθ(x) = θ₀ + θ₁x₁ + θ₂x₁² + θ₃x₁³), square root (hθ(x) = θ₀ + θ₁√x₁) function or any other form.
For example, if our data follows a parabola-shaped curve, a quadratic hypothesis function such as hθ(x) = x₁² − 10x₁ + 25 will be a good match, and if our data rises quickly and then levels off, a square root hypothesis function such as hθ(x) = √x₁ will be a good match.
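One way to sketch polynomial regression in code is to build the extra terms (√x₁, x₁², x₁³, …) as new features and solve ordinary linear regression on them. The data below is made up, and `np.linalg.lstsq` is used only as a convenient least-squares solver for the illustration:

```python
import numpy as np

# Made-up, roughly square-root-shaped data
x1 = np.array([1.0, 4.0, 9.0, 16.0, 25.0])
y  = np.array([1.1, 2.0, 2.9, 4.1, 5.0])

# Treat sqrt(x1) as just another feature and fit a linear model:
# h_theta(x) = theta_0 + theta_1 * sqrt(x1)
X = np.column_stack([np.ones_like(x1), np.sqrt(x1)])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # roughly [0, 1], i.e. h_theta(x) is approximately sqrt(x1)

# The same trick works for quadratic (or cubic) terms:
X_quad = np.column_stack([np.ones_like(x1), x1, x1**2])
theta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)
print(theta_quad)
```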
Features
- Combining Features — sometimes one feature depends on another feature; in that case, we can combine them into one. For instance, if we want our hypothesis function to ignore x₂ when x₁ = 0 and take x₂ into account when x₁ = 1, we can introduce a new feature x₃ = x₁x₂ to our hypothesis and end up with hθ(x) = θ₀ + θ₁x₁ + θ₃x₁x₂.
- Scaling — when using a polynomial hypothesis, one thing to keep in mind is the range of our features. For instance, if our hypothesis is of the form hθ(x) = θ₀ + θ₁x₁ + θ₂x₁² + θ₃x₁³ and x₁ has a range of 1–1,000, then the range of x₁² becomes 1–1,000,000 and that of x₁³ becomes 1–1,000,000,000, giving x₁³ an outsized impact on the hypothesis output unless the features are scaled (see the sketch below).
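A small sketch of both points; the feature meanings and numbers are made up for illustration:

```python
import numpy as np

# Combining features: x3 = x1 * x2, so x2 only matters when x1 is nonzero
x1 = np.array([0, 1, 1, 0])        # e.g. a hypothetical binary "has a garden" flag
x2 = np.array([30, 50, 20, 80])    # e.g. garden size
x3 = x1 * x2                       # contributes nothing whenever x1 == 0

# Scaling polynomial features: without scaling, x1**3 dwarfs x1
x = np.array([1.0, 10.0, 100.0, 1000.0])
features = np.column_stack([x, x**2, x**3])
print(features.max(axis=0))        # [1e3, 1e6, 1e9] -- wildly different ranges
scaled = features / features.max(axis=0)
print(scaled.max(axis=0))          # [1., 1., 1.] -- comparable ranges again
```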
The Normal Equation
While using gradient descent to minimize our cost function can work well, it takes many iterations and also requires us to select a good learning rate (α), which can be quite tricky. That is where the normal equation comes in!
The normal equation is a non-iterative method that minimizes our cost function J(θ) by explicitly taking its partial derivatives with respect to each θⱼ and setting them to zero. This allows us to find the optimal θ without iterating.
The normal equation formula is the following: θ = (XᵀX)⁻¹Xᵀy
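Here is a minimal sketch of that formula in NumPy; the function name is mine, and it mirrors the formula directly rather than using a numerically safer solver:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution theta = (X^T X)^(-1) X^T y.

    X is an (m, n+1) design matrix whose first column is all ones (x0 = 1),
    y is an (m,) vector of targets.
    """
    return np.linalg.inv(X.T @ X) @ X.T @ y
```

In practice you would usually prefer a pseudo-inverse or a least-squares solver; the noninvertibility section below touches on why.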
Example of Using The Normal Equation
Let’s say we want to predict the price of a house based on its size (x₁), number of bedrooms (x₂), number of floors (x₃), and the age of the home (x₄). In that case, given 4 training examples, X is a 4×5 matrix whose first column is all 1s (the x₀ term) and whose remaining columns hold the four feature values of each example, and y is a 4×1 vector of the corresponding prices.
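A sketch of that construction with made-up numbers (the sizes, bedroom counts, and prices are illustrative only):

```python
import numpy as np

# Four made-up training examples.
# Columns of X: x0 (always 1), size (sq ft), bedrooms, floors, age (years)
X = np.array([
    [1.0, 2104.0, 5.0, 1.0, 45.0],
    [1.0, 1416.0, 3.0, 2.0, 40.0],
    [1.0, 1534.0, 3.0, 2.0, 30.0],
    [1.0,  852.0, 2.0, 1.0, 36.0],
])

# y holds the corresponding prices (here in thousands of dollars)
y = np.array([460.0, 232.0, 315.0, 178.0])

print(X.shape, y.shape)  # (4, 5) and (4,)
```

These are exactly the shapes the `normal_equation` sketch above expects. Note, though, that with only four examples and five parameters, XᵀX here is singular, which is precisely the situation the next section covers.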
What Happens If XᵀX Is Noninvertible?
There can be many causes for this problem, but the most common are:
- Redundant features — two features are very closely related, so they are probably linearly dependent.
- Too many features (m ≤ n) — in this case, we might need to delete some features; see the sketch below for a numerical workaround.
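In code, a common workaround is to use a pseudo-inverse instead of a plain inverse; this is a sketch under that assumption, not the course’s own implementation:

```python
import numpy as np

def normal_equation_pinv(X, y):
    # np.linalg.pinv computes the Moore-Penrose pseudo-inverse, which still
    # returns a sensible (minimum-norm least-squares) theta even when X^T X
    # is singular, e.g. because of redundant features or m <= n.
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```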
When To Use Gradient Descent Over The Normal Equation?
If you think like me, you might be telling yourself that there is no reason not to use the normal equation; it just sounds better. Well, that is not always the case.
Since computing the inverse of a matrix has a time complexity of O(n³), the normal equation becomes slow when we have a large number of features (notice that XᵀX has dimensions (n+1)×(n+1)). In practice, once n exceeds about 10,000 it is usually a good time to switch to gradient descent.
Note: you might have noticed that I haven’t mentioned feature scaling in the context of the normal equation. That’s because you don’t need it; this is another one of the advantages of the normal equation method.
Hope you enjoyed!!
I don’t make money from my articles; I just love sharing my knowledge.
Feel free to clap and/or support me on my socials below:
Link to my YouTube channel: https://www.youtube.com/@ExcaliBearCodes
Link to a video regarding this subject on my YouTube channel:
Link to my blog: https://excali-blog.vercel.app
Appendix
Relevant articles from my “Zero To Hero Machine Learning” series: