Building Intuition around Supervised Machine Learning with Gradient Descent

Sean Gahagan
6 min read · Sep 18, 2022


In my last note, we looked at the different types of machine learning, including types of supervised learning. This week, we’ll look at one of the most common ways that models learn in supervised learning. To do this, we’ll need to introduce some terminology and light notation that (I promise) will be important to explain future concepts clearly.

TERMINOLOGY AND LIGHT NOTATION

Training Sets: It’s good practice to break the data set used for a machine learning model into three buckets:

  1. The training set is the subset of data that has been set aside for use in training the model.
  2. The cross validation set is reserved for evaluating and comparing candidate models as you develop them.
  3. The test set is held back to estimate how well the final model performs on data it has never seen.

For example, if you have data on 1000 home sales, you might use 800 of them for your training set, 100 for your cross validation set, and 100 for your test set. The ideal ratio varies.
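As a minimal sketch of that 800/100/100 split (the home-sales data here is randomly generated purely for illustration):

```python
import random

# Hypothetical data: 1000 home sales, each a (square_footage, sale_price) pair.
examples = [(random.randint(800, 4500), random.randint(100_000, 900_000))
            for _ in range(1000)]

# Shuffle before splitting so each subset is representative of the whole.
random.shuffle(examples)

training_set = examples[:800]              # 80% for training
cross_validation_set = examples[800:900]   # 10% for validation
test_set = examples[900:]                  # 10% for testing

print(len(training_set), len(cross_validation_set), len(test_set))  # 800 100 100
```

Libraries such as scikit-learn provide ready-made splitting utilities, but the idea is just this: partition once, then never train on the validation or test examples.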

This week, we will only focus on the training set, but we’ll talk more about the key uses of the cross validation and test sets in my upcoming notes.

A training set contains a number of training examples each consisting of input variables known as features (denoted by the variable x) and output variables (also known as “target” variables, denoted by the variable y). The number of examples in the set is denoted by the variable m.

If you’re using data on a home’s square footage (denoted by x) to predict the sale price of that home ( y ), a training example might look like this: (x,y) = (3,320 sqft, $504,000). If we have 1000 such examples, then m=1000.

Learning Algorithms: A learning algorithm is a process followed by a machine to train a model using the training set. After each iteration of training, the model updates its hypothesis (a function denoted by h) about the relationship between the inputs (x) and the outputs (y).

Parameters: The hypothesis is defined by the model’s parameters (denoted by θ). Parameters are also known as “weights”, but that name can be misleading: in a neural network, a parameter’s value doesn’t translate directly into a given feature’s importance.

Following our example: The hypothesis function for a linear regression model that is predicting a home’s sale price based on square footage would look like this: h(x) = θ0 + θ1*x. Here, θ0 is the intercept term (known as the bias term), and θ1 is the weight or parameter applied to the square footage (x) to calculate a prediction (h) of the home’s sale price (y).
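The hypothesis function above is just a line, which we could sketch like this (the parameter values are hypothetical, chosen only to make the arithmetic readable):

```python
def h(x, theta0, theta1):
    """Linear hypothesis: predicted sale price for a home of x square feet."""
    return theta0 + theta1 * x

# Hypothetical parameters: a $20,000 bias term plus $150 per square foot.
theta0, theta1 = 20_000.0, 150.0
print(h(3320, theta0, theta1))  # 20000 + 150 * 3320 = 518000.0
```

Training is the process of finding the values of θ0 and θ1 that make these predictions as accurate as possible.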

Cost Function: The way that a learning algorithm assesses the accuracy of a model’s predictions is known as a cost function. A simple cost function is the sum of the squared errors between the model’s predictions (h) and the “right answers” (y) across each training example included in the calculation. The cost function’s value is denoted as J. In general, the lower the value of the cost function, the more accurate the model’s predictions are.

The cost function in our example would be the squared difference between the predicted sale price (h) and the actual sale price (y), summed over all training examples included.
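That sum-of-squared-errors cost can be written in a few lines (the two training examples and the parameter values below are hypothetical; many texts also divide J by the number of examples, or by 2m, which doesn’t change where the minimum is):

```python
def cost(training_set, theta0, theta1):
    """J: sum of squared errors between predictions and actual prices."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in training_set)

examples = [(3320, 504_000), (1500, 230_000)]
# Errors are 518000 - 504000 = 14000 and 245000 - 230000 = 15000.
print(cost(examples, 20_000.0, 150.0))  # 14000**2 + 15000**2 = 421000000.0
```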

LEARNING BY GRADIENT DESCENT

Thank you for reading through the terminology and notation. I know it’s a lot, and I’ll link back to this note in future notes, so don’t worry about trying to memorize all of it at once. Now that we have this terminology, we can dive into one of the most common ways that models learn and develop the hypothesis function needed to make predictions: gradient descent. I’ll move away from math now and instead use a visualization based on our home sales prediction example.

The Cost Function as a “Landscape”: Imagine a 3D plot of our cost function. The two horizontal axes are our two parameters (θ0 and θ1), and these form a horizontal plane. In this example, you can think about the values of these two parameters as coordinates on a map (like latitude and longitude).

The vertical axis is the cost function (J) for any combination of our parameters, with the higher values of J being at a higher altitude above the horizontal plane. For each coordinate on our map (i.e., for each combination of parameter values), the cost function’s output is a certain altitude J above the horizontal plane. All of the values of J form a surface that might look like a hilly landscape with peaks and valleys.

To have our model make the most accurate predictions about a home’s sale price, we want our model to use the location coordinates (θ0, θ1) of the bottom of the lowest valley as its parameters. To minimize the aggregate error between the model’s predictions and the “right answers”, the learning algorithm tries to find the bottom of the lowest valley.

Finding the Bottom of the Lowest Valley

Given this type of landscape, the steps of gradient descent look like this:

  1. Start at a point on the landscape (i.e., “coordinates on the map” given by a certain combination of parameters).
  2. Determine the gradient at that point (i.e., the slope of the “terrain” at that point).
  3. Take a step in the direction of the steepest descent. (The size of each step can be adjusted using a term called the “learning rate” denoted by α. More on that in the future.)
  4. Repeat steps 2 and 3 from your new position until you’re no longer descending.
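The steps above can be sketched as a minimal batch gradient descent loop for our linear hypothesis (the learning rate α, step count, and single training example are hypothetical choices for illustration):

```python
def gradient_descent(training_set, alpha=1e-8, steps=1000):
    """Minimal batch gradient descent for h(x) = theta0 + theta1 * x,
    minimizing J = sum of squared errors over the training set."""
    theta0, theta1 = 0.0, 0.0  # step 1: pick a starting point on the landscape
    for _ in range(steps):
        # Step 2: the gradient of J with respect to each parameter.
        grad0 = sum(2 * (theta0 + theta1 * x - y) for x, y in training_set)
        grad1 = sum(2 * (theta0 + theta1 * x - y) * x for x, y in training_set)
        # Step 3: move in the direction of steepest descent, scaled by alpha.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

theta0, theta1 = gradient_descent([(3320, 504_000)])
print(round(theta0 + theta1 * 3320))  # ≈ 504000, the "right answer"
```

Note that the learning rate matters: too large and the steps overshoot the valley; too small and the descent crawls.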

At the end of this process, your final “coordinates” are the parameter values for an optimized hypothesis function that will make better predictions. Depending on the cost function and the model being used, there is a chance that you will end up at a local optimum (i.e., the bottom of a valley, but not the lowest one), but we’ll discuss that more later. Applying gradient descent to the cost function is an example of a learning algorithm.

There are two main types of gradient descent, which differ in how many training examples you use at a time to calculate the cost function. The size of the training set and the computational cost of calculating the cost gradients will guide which type you choose.

  • Batch Gradient Descent — You use all of the training examples at once in one giant “batch” to calculate the cost function “landscape”. In this type of gradient descent, the “landscape” represents the aggregate of the prediction error for all of the training examples, and you take all of your descending steps down the slope of this landscape. This is easy to visualize, because you can just imagine walking downhill until you reach the bottom.
  • Stochastic Gradient Descent (“one at a time”) — You only use one training example at a time to calculate the cost function “landscape”. At your starting point, you start with your first training example, calculate the slope of the cost function for that training example at your position, then take a step downhill. Then at your new position, you take the second training example, calculate the slope of the cost function with this second training example, and then take another step downhill. You repeat this process for all training examples, and when you reach the last training example, you take your final step downhill. This is harder (but more fun) to visualize: You can imagine this as taking a step downhill, then after you take that step the entire landscape changes, then you take another step downhill, then the landscape changes again, and you keep walking downhill in this changing landscape until you reach the last landscape.
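The “one at a time” version might be sketched like this (the learning rate and epoch count are again hypothetical; compare it to the batch loop, where the gradient sums over every example before each step):

```python
def stochastic_gradient_descent(training_set, alpha=1e-8, epochs=100):
    """Stochastic gradient descent: each step descends the cost
    "landscape" of a single training example, so the landscape
    effectively changes between steps."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in training_set:
            error = theta0 + theta1 * x - y  # h(x) - y for this one example
            theta0 -= alpha * 2 * error
            theta1 -= alpha * 2 * error * x  # gradient for just this example
    return theta0, theta1
```

In practice the training set is usually reshuffled between passes, and a middle ground (small “mini-batches” of examples per step) is common; the walking-downhill intuition is the same either way.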

So in a nutshell, gradient descent is just walking downhill on the cost function “landscape”, and when you reach the bottom (or the last landscape), your coordinates are the parameters that will optimize your model’s predictions. There are other more advanced algorithms to optimize your model’s parameters, but in most cases gradient descent is good enough.

In this visualization, we looked at gradient descent in a 3D space (two parameters and one cost value), because we can visualize it. Here’s what’s really neat though: Gradient descent works the same for 2 parameters as it does for 2 billion. Computational cost eventually limits how large a model you can train this way, but instead of walking down a 3D landscape, machine learning models can walk down an arbitrarily high-dimensional landscape, and that is a big part of what makes machine learning predictions so powerful.

Next Up

Next, I’ll share a note on ways to build new features for a model’s predictions, how adjustments to a model’s features can make gradient descent faster, how we can check if gradient descent is working, what can be done when it’s not, and a shortcut for getting to “the bottom of the lowest valley”.
