Linear Regressions | Machine Learning for High Schoolers

Ayana Bharadwaj
4 min read · Oct 28, 2021


Linear regression on a set of arbitrary points

This blog post covers lectures 2, 3, and 4 from the Stanford Machine Learning Course http://cs229.stanford.edu/syllabus-spring2020.html.

The document being used is the following: http://cs229.stanford.edu/notes2020spring/cs229-notes1.pdf

Least Mean Squares Algorithm

This is your classic linear regression, or “line of best fit.” The goal of the least mean squares algorithm is to find the line that minimizes the sum of the squared distances between the data points and the points on the line with the same x-values. Linear regression is a very basic algorithm, used in cases where the data is mostly linear. Just a few terms to know:

  • A regression deals with continuous variables whereas a classification deals with discrete or categorical variables.
  • The actual value of the dependent variable minus the predicted value from a linear regression is called a residual.

We start by randomly initializing our weights (think of these as the slope and y-intercept). From here, it is easy to calculate the sum of the squared residuals, which is what we will try to minimize.
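
In the notation of the linked notes, with the hypothesis hθ(x) = θ₀ + θ₁x, this quantity is J(θ) = ½ Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)², where the sum runs over the training examples and the factor of ½ is just a convention that simplifies the derivative.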

Now we get to the hard part: updating the coefficients. Using gradient descent, we make small updates to the slope and intercept to more closely fit the data. The reason we have to do this step by step is that a big jump might overshoot the best fit line. In each step we nudge both the slope and the intercept, each according to its own partial derivative (since we are trying to optimize with respect to more than one variable), in order to minimize the overall function. Here’s some more information on gradient descent: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

When we get very close to the actual best fit line, even small changes could cause us to overshoot. A tolerance parameter keeps this from making the program run forever with the line “oscillating” around the best fit: it essentially tells us how close to the best fit line is close enough. Usually, in the real world, undetectable differences don’t matter, and you have to pay attention to the context of the data to determine how close is good enough.
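
To make this concrete, here is a minimal sketch of the whole procedure in Python for a line y = w·x + b. The function name and the particular learning_rate and tolerance values are illustrative choices, not something prescribed by the course notes:

```
import numpy as np

def fit_line(x, y, learning_rate=0.01, tolerance=1e-6, max_iters=10_000):
    w, b = np.random.randn(), np.random.randn()  # random initial slope and intercept
    prev_cost = float("inf")
    for _ in range(max_iters):
        predictions = w * x + b
        errors = predictions - y                 # negative of the residuals
        cost = 0.5 * np.sum(errors ** 2)         # (half the) sum of squared residuals
        if abs(prev_cost - cost) < tolerance:    # "close enough" -> stop
            break
        prev_cost = cost
        w -= learning_rate * np.sum(errors * x)  # step both parameters downhill
        b -= learning_rate * np.sum(errors)
    return w, b
```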

Stochastic Gradient Descent

In the first example, we had to go over the entire dataset in order to make a single update. That process is called batch gradient descent. With stochastic gradient descent, we instead update the parameters after every single training example we look at. Both versions use a learning rate, a measure of how much to “listen to” each new piece of data; with stochastic gradient descent we typically start by adapting a lot to the new data and then fine tune by gradually decreasing the learning rate. Stochastic gradient descent is generally used when there is a lot of data, since sweeping over the entire dataset just to make one update becomes too slow.
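
A minimal sketch of the same line fit with stochastic updates might look like this; the decaying learning-rate schedule is just one simple choice, not the only option:

```
import numpy as np

def fit_line_sgd(x, y, initial_lr=0.05, epochs=50):
    w, b = 0.0, 0.0
    for epoch in range(epochs):
        lr = initial_lr / (1 + epoch)        # gradually "listen less" to each new point
        for xi, yi in zip(x, y):
            error = (w * xi + b) - yi        # error on this single example
            w -= lr * error * xi             # update immediately, one point at a time
            b -= lr * error
    return w, b
```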

Locally Weighted Linear Regression

One way to check whether a plain linear regression is appropriate, and to handle data that a single straight line does not fit well, is a locally weighted linear regression. When making a prediction at a particular point, training points close to that point are given more weight than points far away. This allows for a nonlinear prediction function that shows the true pattern of the data while still protecting against extreme “overfitting” (i.e. not forcing the prediction function to pass through all the data points). This is actually sometimes easier than plain linear regression, as there is no learning phase: the algorithm just predicts based on the points around the value it is trying to predict for. The tradeoff is that the whole training set has to be kept around and reused for every prediction.
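
A minimal sketch of one locally weighted prediction, using a Gaussian weighting function (the bandwidth parameter tau is an illustrative choice), might look like:

```
import numpy as np

def lwr_predict(x_query, x_train, y_train, tau=1.0):
    # weight each training point by how close it is to the query point
    weights = np.exp(-((x_train - x_query) ** 2) / (2 * tau ** 2))
    # solve the weighted least-squares problem for a local intercept and slope
    X = np.column_stack([np.ones_like(x_train), x_train])
    W = np.diag(weights)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
    return theta[0] + theta[1] * x_query
```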

Discrete Variables

For discrete predictions, we should use a logistic regression, which is built around the sigmoid function because its output is limited to the range (0, 1). This is a very common algorithm as it is very simple, and it is great at classification problems. It is also used in neural networks, which will be discussed in a later blog post.

The sigmoid function is σ(z) = 1/(1 + e⁻ᶻ): an S-shaped curve that squashes any real number into the range (0, 1).

Logistic regression is the natural choice when we want to choose between two options. One thing to note: to update the weights, we use a gradient ascent algorithm this time, since we want to maximize a probability (the likelihood of the data) rather than minimize an error. When dealing with classification problems involving more than two classes, there is softmax regression. The reason we can use regression methods for classification is that we are modeling the probability that a certain object falls within a particular class, and probability is continuous.
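
A minimal sketch of logistic regression trained with gradient ascent on the log-likelihood might look like this, assuming the feature matrix X already includes a column of ones for the intercept:

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        probs = sigmoid(X @ theta)            # predicted probability of the "1" class
        gradient = X.T @ (y - probs)          # gradient of the log-likelihood
        theta += learning_rate * gradient     # ascend, since we maximize the likelihood
    return theta
```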

Newton’s Method

Another way to update the weights is Newton’s method. For finding a zero of a function f, the update rule is x := x₀ − f(x₀)/f′(x₀); to maximize the log-likelihood, we apply that update to its derivative, repeatedly setting θ := θ − ℓ′(θ)/ℓ″(θ) (with the second derivative replaced by the Hessian matrix when there are several parameters). If there is not an overwhelming number of dimensions, this actually converges faster than gradient descent.
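
A minimal sketch of that idea applied to logistic regression, where the “derivative of the derivative” becomes the Hessian matrix, might look like:

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, iters=10):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        probs = sigmoid(X @ theta)
        gradient = X.T @ (y - probs)                  # first derivative of the log-likelihood
        hessian = -(X.T * (probs * (1 - probs))) @ X  # matrix of second derivatives
        theta -= np.linalg.solve(hessian, gradient)   # theta := theta - H^-1 * gradient
    return theta
```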

Special thanks to Ishita Sharma for the feedback on this blog!

