In-Depth Machine Learning for Teens: Linear Regression

Endothermic Dragon
10 min read · Aug 21, 2022


Linear regression is a repetitive process that tries to fit an equation to a bunch of data points. In machine learning, it’s often used to easily interpolate and extrapolate data that isn’t in a dataset. For example, you may have a list of a few hundred house prices, each house’s square footage, and the number of bedrooms.


But how would you predict the price of a new home that’s being built with 4 bedrooms and 2300 square feet? Of course, there’s no way you’re going through the data manually. Instead, linear regression allows you to set the variables in a term, and the algorithm simply alters the coefficients of each term until it fits the model just right. Traditionally, the coefficients that you adjust are denoted with theta (θ).
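For instance, a prediction function for the house-price example might look something like this (just one illustrative form):

f(θ, x) = θ₀ + θ₁·(bedrooms) + θ₂·(square footage) + θ₃·(bedrooms · square footage)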

This is just an example equation — there can be even more combinations between the inputs, and it could even include other mathematical functions, such as square roots or trigonometry.

Note From Author

As of now, this article is part of a series with 5 others. It is recommended that you read them in order, but feel free to skip any if you already know the material. Keep in mind that material from the previous articles may (and most probably will) be referenced at multiple points in this article — especially gradient descent, the keystone of all machine learning.


Survey

If you wouldn’t mind, please fill out this short survey before reading this article! It is optionally anonymous and would help me a lot, especially when improving the quality of these articles.

Linear Regression Survey

Before You Begin

When using any kind of machine learning algorithm, it’s important to be aware of bias and variance.

Bias, in this context, can be thought of as “underfitting” data. For example, you could be using a linear function to model the intensity of sunlight from 7 AM to 5 PM; however, for this application, something like a quadratic or a sine function would be a much better model.

Clearly, this data cannot be fit correctly with a straight line (blue). Instead, something which can curve up and back down would serve more utility — like a quadratic (red) or a sine function (green).

However, a model with high variance is also not good. This happens when you “overfit” data. Essentially, the model tries so hard to fit every data point that it loses the capability to predict new data points with accuracy and thus does not generalize well. An example is shown below to demonstrate this concept:

The red eight-degree polynomial fit, while fitting all the data points perfectly, is not a good model.

Keeping in mind that we’re trying to predict data that doesn’t exist, the quadratic fit (blue) offers a broad and generalized fit, whereas the eight-degree polynomial (red) fits all the data points but fluctuates too much to predict values for new points accurately — not to mention it has a very sharp rise on both sides, which would lead to unreasonably high predictions in those regions. As you can see, fitting the data extremely closely isn’t necessarily optimal.
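If you'd like to see this effect for yourself, here is a minimal sketch with made-up, synthetic data that fits both a quadratic and an eight-degree polynomial using NumPy (the data and the degrees are just illustrative choices):

```python
import numpy as np

# Made-up noisy data that roughly follows a downward-opening quadratic trend
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 9)
y = -(x - 5) ** 2 + 25 + rng.normal(0, 2, size=x.size)

# Fit a quadratic (degree 2) and an eight-degree polynomial to the same 9 points
quad_coeffs = np.polyfit(x, y, deg=2)
wiggly_coeffs = np.polyfit(x, y, deg=8)

# An eight-degree polynomial has 9 coefficients, so it can pass through all 9 points
print("degree 2 residuals:", np.round(np.polyval(quad_coeffs, x) - y, 2))
print("degree 8 residuals:", np.round(np.polyval(wiggly_coeffs, x) - y, 2))

# Compare how each model extrapolates just beyond the data
print("degree 2 at x = 11:", round(float(np.polyval(quad_coeffs, 11)), 2))
print("degree 8 at x = 11:", round(float(np.polyval(wiggly_coeffs, 11)), 2))
```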

How can this problem be fixed? Well, the data can be split into two parts, called training data and validation data. As the name implies, the training data is used to train the machine learning model, and the validation data is used to check whether the predictions of the model come out as reasonable. When paired with a cost function, which measures how good of a fit your model provides (you’ll be learning more about them in the upcoming sections), you can identify both bias and variance. Specifically, high bias will result in an exceptionally high value from the cost function in either dataset, whereas high variance will result in a low value from the cost function in the training data, but an exceptionally high one in the validation data. Using this concept can help you both pick the form of the equation used in modeling, and check if that equation fits the purpose.
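Here is a rough sketch of that idea in code, continuing with the same kind of made-up synthetic data (the split ratio, polynomial degrees, and cost function shown are illustrative choices, not fixed rules):

```python
import numpy as np

# Made-up noisy data that roughly follows a quadratic trend
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = -(x - 5) ** 2 + 25 + rng.normal(0, 2, size=x.size)

# Shuffle the rows, then split roughly 70/30 into training and validation data
order = rng.permutation(x.size)
train, val = order[:21], order[21:]

def cost(coeffs, xs, ys):
    """Mean squared error: the average of the squared loss over the given points."""
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

# Train each candidate model on the training split only, then evaluate both splits
for degree in (1, 2, 8):
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    print(f"degree {degree}: train cost = {cost(coeffs, x[train], y[train]):.2f}, "
          f"validation cost = {cost(coeffs, x[val], y[val]):.2f}")

# High cost on both splits suggests high bias (underfitting);
# low training cost but much higher validation cost suggests high variance (overfitting).
```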

Some of you are probably wondering by now, “Why do I need all of these checks when I can simply graph the data like you did?” Well, sometimes you cannot graph the data, or visualize it in any convenient form. In fact, if you have more than two parameters in a dataset, you already cannot visualize it in a simple and intuitive manner! For this reason, you need other ways of measuring how well the model fits the data. While it may seem a little abstract and uncomfortable initially, you’ll get used to it.

Bias Term

You’re probably wondering why we’re going over bias again. We’re not — this is something different! Unfortunately in machine learning, there are lots of things with “bias” in the name.

The bias term is a lone coefficient θ that isn’t multiplied with any other variables. There isn’t a benefit to having multiple of these, as you could add them up and get an equivalent equation. However, it does serve a very important purpose. You can think of it as the “b” term typically found in the equation of a line (y = mx + b): it offsets the equation vertically. For example, if you’re trying to predict house prices and they’re all around 100,000, then you want to start somewhere around there and adjust your prediction up or down based on the other factors.
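In the house-price example from earlier, the bias term would be the θ₀ standing on its own (again, just an illustrative form):

f(θ, x) = θ₀ + θ₁·(bedrooms) + θ₂·(square footage)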

Basic Definitions and Conventions

Mathematical Definitions

Σ — known as summation. In this notation, a new variable is declared and iterated up by one each time, starting at the lower bound and ending at the upper bound; the expression to the right of the Σ is evaluated for each value of that variable, and the results are all added together.
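For example:

Σᵢ₌₁⁴ i² = 1² + 2² + 3² + 4² = 30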

This isn’t limited to just simple expressions, as we will see later on

Mathematical Notation

x represents all the input data that we have in our data table

xᵢ represents the values in the i-th row of the data table that we plug into the prediction function f(θ, xᵢ)

xᵢ,ⱼ represents the value of the j-th parameter in the i-th row of the data table

θ represents all the coefficient parameters that we plug into the prediction function f(θ, xᵢ). These are updated at each gradient descent step.

θᵢ represents the value of the i-th coefficient parameter that we plug into function f(θ, xᵢ). This is updated at each gradient descent step.

f(θ, xᵢ) is the prediction function that takes in coefficient parameters θ and input parameters xᵢ

y represents the true value of the output for all the inputs in the data table

yᵢ represents the true value of the output at the i-th row of the data table

Terminology

Loss function — how the “goodness of fit” is measured for a single data point, often uses ordinary least squares (OLS) or least absolute deviation (LAD)

Cost function — average of the loss function over all data points in the data table, typically represented by J(θ)

n — number of data points in the data table, think “the number of rows in the data table”

Ordinary least squares (OLS) — the method of using the square of the difference between the actual output yᵢ and the predicted output f(θ, xᵢ) as the loss function. The measured value of the loss function at a single point is its squared error, and the measured value of its cost function counterpart is called the Mean Squared Error, or MSE for short.

Least absolute deviation (LAD) — the method of using the absolute value of the difference between the actual output yᵢ and the predicted output f(θ, xᵢ) as the loss function.

Popping the Hood — How it Works

What to Minimize

In gradient descent, we need a function to “descend” on, as the goal is to reach a minimum. So, naturally, to perform linear regression, we first need to define something called a cost function. Before we can define the cost function, though, we have to define something else: the loss function.

The loss function is defined for each data point and measures how well the prediction is relative to the true value. This value is always positive, and a higher value indicates that the true value is far away from the prediction, whereas a lower value indicates that the true value is close to the prediction.

A common loss function is called ordinary least squares, or OLS for short. It is measured by finding the square of the difference between the predicted value and the true value.
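Written out with the notation from above (using lossᵢ simply as a label for the loss at the i-th data point):

lossᵢ = (f(θ, xᵢ) − yᵢ)²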

The cost function is defined on top of this, as the average of the loss function of all the points in the dataset.
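Putting the two together for OLS, one standard way of writing the cost function is:

J(θ) = (1/n) · Σᵢ₌₁ⁿ (f(θ, xᵢ) − yᵢ)²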

While this may initially look foreign, it’s still an average: Σ adds together n values, and the 1/n in front divides by n to produce the average

An obvious first question would be why least squares is generally preferred over absolute value (formally known as least absolute deviation, or LAD for short) as a loss function. Well, LAD does have its benefits, especially when dealing with outliers.

This is because OLS “punishes” the model more than LAD for data points that are farther away from the predicted value, pushing it toward a sort of equilibrium where all the data points end up moderately close to the line of fit. In contrast, LAD doesn’t add any extra “weight” to the loss function for points that are farther away. Note that this “weight” comes in the form of squaring the term. As a watered-down example, LAD is like having loss function values 1 and 8, where the model has to pay slightly more attention to the 8 in order to lower it, whereas OLS is like having loss function values 1² = 1 and 8² = 64, where the model has to pay significantly more attention to the 64 in order to lower it.

However, LAD also has drawbacks. First and foremost, it sometimes has more than one solution, potentially even infinitely many, which can lead to inconsistent results. Another drawback is that LAD effectively ignores how extreme an outlier is, and significant outliers aren’t necessarily bad information: general errors and noise tend to stay near where the data is supposed to be and should even out in a moderately sized dataset, so a data point that lands far away from the rest is unlikely to be ordinary noise. A large deviation like that should often serve as motivation to look into why it’s there, and even so, with a large enough dataset, it too should even out.

But which one do you choose? That depends on what you’re trying to do. If you care about outliers or have a pretty big dataset, then you should probably use OLS. If you know that the data contains spiking outliers, then you should probably use LAD. While OLS is more common and is good enough for almost all applications, in the end, it’s really up to you.

One more thing — if you really want to apply the advantages of both, then you can look into Huber Loss, which applies a smooth “transition” between the two using the concept of derivatives — OLS when in nearby territory and LAD when farther out. But be warned — you’re about to get hit with a bunch of calculus, and it’s not going to be explained because it assumes at least an undergrad audience.
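If you’re curious what that looks like, here is a minimal sketch of the Huber loss (the threshold delta is a knob you choose; 1.0 below is an arbitrary default, and the 0.5 factors are there so the two pieces join smoothly):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Behaves like OLS (squared) for small residuals and like LAD (absolute,
    shifted so the two pieces meet smoothly) for large ones."""
    residual = np.abs(y_true - y_pred)
    squared = 0.5 * residual ** 2                  # OLS-like branch, |residual| <= delta
    linear = delta * (residual - 0.5 * delta)      # LAD-like branch, |residual| > delta
    return np.where(residual <= delta, squared, linear)

# A small residual is punished quadratically, a large one only linearly
print(huber_loss(np.array([1.0, 1.0]), np.array([1.5, 9.0])))  # [0.125 7.5]
```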

How to Minimize

This part’s quick and easy, at least in explanation. First, you pick some starting values for your coefficients. These are typically left as 0 for linear regression, but theoretically, they can be anything you choose. After that, you simply use gradient descent to perform a step based on the gradients of the cost function, J(θ), and repeat until you converge to a minimum of J(θ).
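To make that concrete, here is a minimal from-scratch sketch of the whole loop, using the MSE cost with a linear prediction function (this is not the lab’s exact framework; the learning rate, step count, and made-up data are just illustrative):

```python
import numpy as np

def predict(theta, X):
    """f(theta, x_i) for every row: the bias term plus one coefficient per input column."""
    return theta[0] + X @ theta[1:]

def cost(theta, X, y):
    """J(theta): the mean squared error over all n rows."""
    return np.mean((predict(theta, X) - y) ** 2)

def gradient_descent(X, y, learning_rate=0.01, steps=5000):
    theta = np.zeros(X.shape[1] + 1)           # start every coefficient at 0
    n = len(y)
    for _ in range(steps):
        error = predict(theta, X) - y          # f(theta, x_i) - y_i for each row
        grad_bias = 2 / n * np.sum(error)      # partial derivative w.r.t. theta_0
        grad_rest = 2 / n * (X.T @ error)      # partial derivatives w.r.t. the rest
        theta -= learning_rate * np.concatenate(([grad_bias], grad_rest))
    return theta

# Made-up example where the true relationship is y = 3 + 2x plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 + 2 * X[:, 0] + rng.normal(0, 0.1, size=50)

theta = gradient_descent(X, y)
print("fitted coefficients:", np.round(theta, 2))   # should land close to [3, 2]
print("final cost J(theta):", round(float(cost(theta, X, y)), 4))
```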

Hands-On Lab

It’s time for your first lab! Here, you will implement a linear regression model from scratch, using a basic framework. Simply create a copy of the Colab and you’re good to go!

As promised, the partial derivatives are given to you, under a dropdown. However, if you know even just the basics of calculus, it is suggested that you derive these yourself. Don’t worry, partial derivatives are not nearly as complicated as they look — simply treat all the variables as constants, other than the variable you’re taking the partial derivative of.
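For reference, if the prediction function is linear in the inputs and the cost is the MSE defined earlier, the partial derivatives work out to the following (the lab’s dropdown may fold the constant 2 into the learning rate or define the cost with an extra 1/2, so don’t worry if its version differs by a constant factor):

∂J/∂θⱼ = (2/n) · Σᵢ₌₁ⁿ (f(θ, xᵢ) − yᵢ) · xᵢ,ⱼ, for a coefficient θⱼ that multiplies the input xᵢ,ⱼ

∂J/∂θ₀ = (2/n) · Σᵢ₌₁ⁿ (f(θ, xᵢ) − yᵢ), for the bias term θ₀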

Parting Notes

Congratulations on completing… the most basic form of machine learning. We’re not even up to AI yet! I know, I know, it’s a lot to learn, and there’s a ton of background stuff involved. While it may initially seem a bit slow, having a strong foundation is necessary for the things that we will be working towards in the future.

One more thing: the next article discusses matrices, which is a crucial topic for all future labs. It’s important that you have this concept down, as it’s used extensively by data scientists who make use of machine learning, no matter the language they code in, or whether they use a fancy library that does all the math for them *stares disapprovingly*.

In all seriousness, feel free to use libraries to do all of this for you after you’re done with this course, but it’s important to know what’s going on inside if you’re going to be building on top of it. Besides, you’re here to learn how it works (without the mind-boggling math), not just how to use it — and that’s exactly what you’re going to get :P.

Done reading? Ready to learn more? Check out my other articles in this series!
Training Faster and Better
Logistic Regression
Neural Networks

Or, feel free to refer back to any previous articles in this series:
Gradient Descent


Endothermic Dragon

My name is Eshaan Debnath, and I love computer science and mathematics!