Andrew Ng Machine Learning Course Summary — Week 1

This article summarizes supervised and unsupervised learning, linear regression, and gradient descent.

ExcaliBear
5 min read · Nov 30, 2023


INFO: This summary is based on my “Zero To Hero Machine Learning” series, where I upload daily and explain what I learned that day from Andrew Ng’s machine learning course.

For those of you who prefer video, I made one on my channel summarizing week 1 of the course: https://youtu.be/Gq18titC1ds

Machine Learning Basics

What is Machine Learning?

Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.

Supervised Learning

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

In this type of learning, problems fall into two categories: “regression” and “classification”.

  • In a regression problem, we are trying to predict results within a continuous output. For example, predicting the price of a house based on its size.
  • In a classification problem, we are trying to predict results in a discrete output. For example, given a patient with a tumor, we have to predict whether the tumor is malignant or benign.

Unsupervised Learning

In unsupervised learning, we are given a data set, and we don’t know what our correct output should look like. Instead of having labeled examples to learn from, the algorithm explores the data on its own, identifying inherent patterns, relationships, or clusters without explicit guidance.

For example, given a large collection of articles, using this type of learning you can analyze the text and automatically group similar articles together, revealing patterns or themes without any predefined labels.
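
To make this concrete, here is a minimal clustering sketch (my own illustration, not from the course), assuming scikit-learn and NumPy are available. The 2-D points are made up; real articles would first have to be converted into numeric feature vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Six unlabeled points that visually form two groups (made-up data)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

# KMeans groups the points without ever seeing a label
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g., [0 0 0 1 1 1] -- clusters found on its own
```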

Linear Regression Basics

Model Representation

We’ll use x^(i) to denote the input variable and y^(i) to denote the output variable we’re trying to predict, so the pair (x^(i), y^(i)) is the i-th training example. Using this dataset, we want to learn a hypothesis function h that takes an input (x) and gives us a good guess for the output (y).

Model Representation Diagram
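
To make the notation concrete, here is a tiny sketch in Python (my own example, not from the course; the data and parameter values are made up):

```python
# Training set: each pair (x_i, y_i) is one training example.
# Made-up data: x = house size in 1000s of sq ft, y = price in $1000s.
x = [1.0, 2.0, 3.0, 4.0]          # inputs x^(1) .. x^(4)
y = [150.0, 200.0, 250.0, 300.0]  # outputs y^(1) .. y^(4)

# Hypothesis: a straight line h_theta(x) = theta0 + theta1 * x
def h(x_val, theta0, theta1):
    return theta0 + theta1 * x_val

# A "good guess" for an unseen input, using made-up parameter values
print(h(2.5, 100.0, 50.0))  # 225.0
```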

Cost Function

The cost function measures how good our hypothesis is, by looking at the average difference between what our hypothesis predicts and what actually happens (the real result).

The cost function is like asking, “How wrong are we on average?” It calculates the mean (average) of the squared differences between our guesses and the real outcomes. This is known as the “Squared Error” or “Mean Squared Error” cost function.

Cost Function
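
Written out, the squared error cost function from the course looks like this (the extra factor of 1/2 is a convention that makes the derivative cleaner):

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\big(x^{(i)}\big) - y^{(i)} \right)^2
```

where m is the number of training examples.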

The cost function tells us how far off our hypothesis is, which in turn tells us how much we need to adjust its parameters to make better guesses next time.

Linear Regression with One Variable

Alright, let’s break it down. Imagine you have a bunch of points on a graph. You want to draw a straight line (represented by hθ(x) = θ₀ + θ₁x) that goes through these points in the best way possible.

The idea is to find θ₀ and θ₁ so that the average squared vertical distance from the points to the line is as small as possible; doing so minimizes our cost function.
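
As a minimal sketch of that idea (my own code, assuming NumPy; the points are made up), here is the cost J(θ₀, θ₁) computed for a given line:

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared error cost J = (1 / 2m) * sum((h(x) - y)^2)."""
    m = len(y)
    predictions = theta0 + theta1 * np.asarray(x)  # h_theta(x) for every example
    errors = predictions - np.asarray(y)           # vertical distances to the line
    return (1.0 / (2 * m)) * np.sum(errors ** 2)

# Made-up points that lie exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0]
y = [1.0, 3.0, 5.0]
print(compute_cost(x, y, 1.0, 2.0))  # 0.0 (perfect fit)
print(compute_cost(x, y, 0.0, 2.0))  # 0.5 (line is off by 1 everywhere)
```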

In a perfect scenario, the line would go exactly through every point, and the cost would be 0:

Line goes through every point

But usually nothing is perfect, and most of the time we will end up with something like this:

Line does not go through every point

Gradient Descent

So we have our hypothesis (hθ(x) = θ₀ + θ₁x) and a way of measuring how well it fits the data (the cost function). Our goal is to minimize the cost function; that way we know our hypothesis makes good guesses. That’s where gradient descent comes in.

What Is Gradient Descent?

Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function. In simple terms, it helps us improve our hypothesis by minimizing the cost function.

A simple way to visualize the value of the cost function for each pair (θ₀, θ₁) is to put θ₀ on the x-axis, θ₁ on the y-axis, and the cost function on the vertical z-axis.

Gradient Descent Graph Clear
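
If you want to reproduce a surface like this yourself, here is a small plotting sketch (my own code, assuming NumPy and Matplotlib; the data and grid ranges are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])  # same made-up data as above

# Grid of candidate (theta0, theta1) pairs
t0, t1 = np.meshgrid(np.linspace(-3, 5, 100), np.linspace(-2, 6, 100))

# Cost J(theta0, theta1) evaluated at every grid point
J = sum((t0 + t1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * len(y))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(t0, t1, J, cmap="viridis")
ax.set_xlabel("theta0")
ax.set_ylabel("theta1")
ax.set_zlabel("J(theta0, theta1)")
plt.show()
```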

Remember, our goal is to minimize the cost function, i.e., to get to the very bottom of the pits in our graph.

How Does It Work?

Starting from arbitrary values of θ₀ and θ₁, the way to do so is by checking the slope of the landscape at your current location and using it to walk downhill from your starting point until you reach the bottom.

Gradient Descent Graph With Steps

Your step size is affected by 2 things:

  1. The slope of the landscape — the steeper the landscape is, the bigger your step.
  2. The learning rate (α) — Too big and you might overshoot the minimum; too small, and it’ll take forever.

*α is the parameter by which we multiply the partial derivative of the cost function when updating both θ₀ and θ₁.
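
In symbols, each step of gradient descent updates both parameters simultaneously (this is the update rule from the course):

```latex
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
\qquad \text{(simultaneously for } j = 0 \text{ and } j = 1)
```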

The Bad News

You might have noticed that our starting point can change the final θ₀ and θ₁ we reach at the end of the gradient descent algorithm, and that’s true.

Starting from this position:

Gradient Descent Graph Different Start

We might end up here:

Gradient Descent Graph Different End

Which is obviously not “the very bottom of the pits in our graph”.

The Good News

For linear regression, the squared error cost function is convex (bowl-shaped), so there’s only one lowest point (the global minimum) on this landscape.

Gradient Descent Graph Linear Regression

Gradient Descent will get you there no matter your starting point, assuming you don’t take steps that are too big (i.e., your learning rate α isn’t too large).
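
Putting the pieces together, here is a minimal gradient descent sketch for linear regression (my own code, assuming NumPy; the learning rate, iteration count, and data are arbitrary choices):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Minimize the squared error cost for h(x) = theta0 + theta1 * x."""
    x, y = np.asarray(x), np.asarray(y)
    m = len(y)
    theta0, theta1 = 0.0, 0.0  # arbitrary starting point
    for _ in range(iterations):
        errors = (theta0 + theta1 * x) - y
        # Partial derivatives of J with respect to theta0 and theta1
        grad0 = errors.sum() / m
        grad1 = (errors * x).sum() / m
        # Simultaneous update: theta_j := theta_j - alpha * dJ/dtheta_j
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Made-up points lying on y = 1 + 2x; we should recover roughly (1, 2)
print(gradient_descent([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))
```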

Hope you enjoyed!!

I don’t make money from my articles; I just love sharing my knowledge.
Feel free to clap and/or support me on my socials below:

Link to my YouTube channel: https://www.youtube.com/@ExcaliBearCodes

Link to a video regarding this subject on my YouTube channel: https://youtu.be/Gq18titC1ds

Link to my blog: https://excali-blog.vercel.app

Appendix

Relevant articles from my “Zero To Hero Machine Learning” series:
