Helping Supervised Learning Models Learn Better & Faster

Sean Gahagan
5 min read · Sep 24, 2022


My last note introduced some important terminology, some light notation, and a description of one of the most common ways that machine learning models learn: gradient descent (walking downhill on the cost function “landscape”). Please feel free to reference back to that post as needed — links to past posts are listed below.

This week’s post is focused on:

  • how to help models learn faster by “simplifying the terrain” that the learning algorithm has to navigate during gradient descent.
  • building new features from available data.
  • checking gradient descent to be sure it’s working
  • a shortcut to “the bottom of the lowest valley” that’s sometimes (but not always) faster than gradient descent.

Simplifying the terrain with feature scaling & mean normalization

If you think about gradient descent as walking down the “cost function landscape”, you can imagine that some landscapes might take longer to get to the bottom of than others, especially if you’re only able to check the slope of the terrain at one point at a time when determining which way to step, and if the size of your steps are dictated by an equation.

Feature scaling and mean normalization are two ways to help simplify the landscape and make it easier for your learning algorithms to find “the bottom of the lowest valley”.

In feature scaling, you bring all of your features’ values into a common, smaller range of numbers by dividing every value in a feature set by the same number. A common rule of thumb is to aim for each feature set to end up with a range somewhere between roughly (-1/3 to 1/3) and (-3 to 3).

If we had a square footage feature set (x1) where the smallest house was 1,000 sq ft and the largest was 5,000 sq ft, then to apply feature scaling, we could update each example’s square footage by dividing it by 5,000, bringing those values onto a scale much closer to the “number of bedrooms” feature set (x2).
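As a quick sketch of that idea (using a small, made-up set of square footage values in a NumPy array), dividing by the largest value could look like this:

```python
import numpy as np

# Hypothetical square footage values (x1) for five homes
sqft = np.array([1000, 1800, 2500, 3200, 5000], dtype=float)

# Feature scaling: divide every value by the largest value (5000)
sqft_scaled = sqft / sqft.max()
print(sqft_scaled)  # values now fall between 0.2 and 1.0
```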

In mean normalization, you bring the mean of a feature set close to zero by subtracting the feature set’s mean from each of its values. If you do this for all of your feature sets, you can make your “cost function landscape” quite a bit simpler.

For our “number of bedrooms” feature set (x2), if the average number of bedrooms is 2.5, then to apply mean normalization, we would update the number of bedrooms for each example by subtracting 2.5.
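Mean normalization can be sketched the same way (using a small, made-up set of bedroom counts whose mean happens to be 2.5):

```python
import numpy as np

# Hypothetical "number of bedrooms" values (x2) for five homes
bedrooms = np.array([1, 2, 2, 3, 4.5])  # mean is 2.5

# Mean normalization: subtract the feature set's mean from every value
bedrooms_normalized = bedrooms - bedrooms.mean()
print(bedrooms_normalized)  # [-1.5 -0.5 -0.5  0.5  2. ]
```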

When combined (applying feature scaling and mean normalization to all features), these two methods can greatly simplify the landscape that a learning algorithm has to navigate as it’s doing gradient descent and stepping downhill. This can help it find the bottom even faster.

Building New Features

In our examples so far, our features have all been direct inputs from the available data. In predicting home sale prices, we directly used square footage, number of bedrooms, etc. But models will often use features that aren’t taken straight from the available data and are instead built from it.

One simple example is combining inputs into a new feature: instead of learning from a home’s frontage (property width) and its depth separately, you can multiply them to create a feature for the property’s area (frontage × depth). The model can then use this as a new feature, which may result in better predictions.
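As a small sketch (with made-up frontage and depth values), building the area feature is just an element-wise multiplication:

```python
import numpy as np

# Hypothetical lot dimensions, in feet
frontage = np.array([50.0, 40.0, 60.0])   # property width
depth = np.array([100.0, 120.0, 90.0])

# New feature: lot area = frontage x depth
area = frontage * depth
print(area)  # [5000. 4800. 5400.]
```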

Another example is polynomial regression, where we not only use features like the number of bedrooms (x), but also features like the square or cube of the number of bedrooms (x², x³, and so on). Because this can create much larger numbers, feature scaling can really help keep these features under control.

Another example is to use the square root of a feature as a new feature, like the square root of the number of bedrooms (sqrt(x)).
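Here’s a rough sketch of both ideas (using a small, made-up bedroom-count feature), with feature scaling applied to the fastest-growing term:

```python
import numpy as np

# Hypothetical feature: number of bedrooms (x)
x = np.array([1.0, 2.0, 3.0, 4.0])

# Polynomial features grow quickly...
x_squared = x ** 2
x_cubed = x ** 3
# ...so feature scaling helps keep them in a similar range
x_cubed_scaled = x_cubed / x_cubed.max()

# Square-root feature
x_sqrt = np.sqrt(x)
```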

When determining what inputs to use in a machine learning model, it’s helpful to have a sense of why each input might be meaningful to the model’s predictions, so that you can build features that capture that relationship in a useful form.

To get inspiration for what types of features you could explore building, consult human experts in the field where your model will be used. Ask what indicators they find important for their own judgments, why they think those indicators matter, and how they relate to other indicators and variables.

Making Sure Gradient Descent is Working

Gradient descent can be checked by simply plotting the cost function (i.e., prediction error) against the number of iterations of gradient descent, where an iteration is a single step downhill. As the number of iterations increases, you should see the cost function value decrease (i.e., your predictions are getting more accurate).

You should see the cost function decrease over time with more iterations of gradient descent.
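A minimal sketch of this check (assuming you’ve recorded the cost after each iteration in a list, here filled with made-up values) could look like this:

```python
import matplotlib.pyplot as plt

# Made-up cost values recorded after each iteration of gradient descent
cost_history = [120.0, 60.0, 35.0, 22.0, 15.0, 11.0, 9.0, 8.2, 7.9, 7.8]

plt.plot(range(1, len(cost_history) + 1), cost_history)
plt.xlabel("Iteration of gradient descent")
plt.ylabel("Cost (prediction error)")
plt.title("Cost should decrease as iterations increase")
plt.show()
```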

If instead it is increasing, or if it is oscillating, the learning rate (α) might be too big (i.e., the “steps downhill” might be too big).

To visualize this problem, imagine you’re a giant walking downhill. At the exact spot where you’re standing, the direction of the steepest downhill slope is northeast. So, you take one giant-sized step northeast, and your stride takes you over a valley, and your foot lands on an even higher hill that happened to be in that direction.

Decreasing the learning rate (i.e., taking smaller steps) may solve this.

The Normal Equation

You don’t always need to do gradient descent to find the bottom of the lowest valley. Sometimes you can take a shortcut with the normal equation, which uses linear algebra to solve directly for the parameter values that optimize your model’s predictions.
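As a hedged sketch (using a tiny, made-up housing dataset where X has a leading column of ones for the intercept), solving the normal equation θ = (XᵀX)⁻¹Xᵀy with NumPy might look like this:

```python
import numpy as np

# Hypothetical training data: [intercept term, square footage, bedrooms]
X = np.array([
    [1.0, 2104.0, 3.0],
    [1.0, 1600.0, 3.0],
    [1.0, 2400.0, 3.0],
    [1.0, 1416.0, 2.0],
])
y = np.array([399.9, 329.9, 369.0, 232.0])  # made-up sale prices (in $1000s)

# Normal equation: theta = (X^T X)^(-1) X^T y
# np.linalg.solve avoids explicitly inverting X^T X
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # parameter values that minimize the cost, found in one shot
```

No learning rate and no iterations are needed; the trade-off is the cost of the matrix algebra, as summarized below.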

So, is this always better than gradient descent? No, there are pros and cons to both approaches:

Gradient Descent

  • Pros: Works well, even when you have more than 10,000 features!
  • Cons: You need to choose a good learning rate, and it can take many iterations to find the best parameter values.
  • Computational Cost: Cheaper (faster); correlates with (number of features)²

Normal Equation

  • Pros: You don’t have to choose a learning rate, and you only need to solve it once (no iterations).
  • Cons: You have to compute some linear algebra that can become very slow and computationally expensive with higher numbers of features (e.g., greater than 10,000 features).
  • Computational Cost: More expensive (slower); correlates with (number of features)³

Next Up

My next note will introduce a new type of prediction that will be critical for future notes on classification models and neural networks: the sigmoid function!

Earlier Notes in this Series:

  1. Towards a High-Level Understanding of Machine Learning
  2. Building Intuition around Supervised Machine Learning with Gradient Descent
