ML Basics: Linear Regression

Stefan Blos
13 min read · May 16, 2019


In this ongoing series of blog posts called Machine Learning Basics I want to cover the underlying core concepts of modern machine learning. While it seems that more and more people are simply throwing neural networks at problems, it is very important to know how things work under the hood. This gives a more intuitive understanding of the process and may offer insight into different solutions and approaches.

Warning: These posts will contain a lot of math. I’ll do my best to explain everything in an easy and comprehensible way and provide further links where necessary. If you can’t follow along, don’t worry; these concepts are hard and need some time to sink in. If you have questions, however, you can always comment on the post or reach out to me on Twitter.


Today’s topic: Linear Regression

So first of all what is Linear Regression?

You will probably know that machine learning can broadly be split up into supervised and unsupervised learning. If we look closer into the field of supervised learning, the underlying idea is always to predict something. Now here comes the biggest distinction:

  • Do we want to classify something into a set of predefined classes? Then we do classification! (An example would be the MNIST dataset, where we get an image as input and want to output which digit it shows. The output always has to be one of the digits 0–9, and only these 10 values are a meaningful prediction, so we want to classify the image into one of 10 classes.)
  • Do we want to output a continuous value? Then we do regression! (Here we can think of predicting the price of a house. We have some features like number of bedrooms, square feet, and distance to the city center, and want to output the cost in dollars. The cost is a value in a range of e.g. $100 to $1 million. It is not one of a fixed set of classes but one of an unbounded number of possible outcomes, which is what we call continuous.)

Being the clever reader that you are, you have already guessed what linear regression does: regression! Wonderful, so we know what we want to predict, and that is already the first step in the right direction.

Mathematical Data Representation

When we start to think about machine learning we have to remind ourselves that machine learning is math. Not very complicated math if we take a closer look at it. But it might be intimidating at first glance. Don’t worry though, you’ll get the hang of it. As some of you might agree math can be abstract at times. This is why I’ll try to explain the concepts with an example. As mentioned before we will try to predict the price of a house from the information we have about this house.

So we can describe our task in the following way. We have the size of a house in square feet and want to predict the price of the house in dollars (or any other currency). In order to create a linear regression model we need a number of examples where we know the size and the price of houses. From these examples we want to create an algorithm that will predict the price of a new house that we have never seen before only from its size in square feet.

It is always easier to understand concepts when we have a visual explanation, so in the image below we see an example for a trained linear regression algorithm. The x-axis in this case is the size of the house in square feet and the y-axis is the price of the house. The blue dots are the samples of the houses we know and the red line is what our algorithm will predict. We see that while it is not perfectly fitting all values it has a pretty good average prediction over the whole space.

Example for a linear regression curve. Source: Wikipedia

Now that we know what we want to do, let’s define the mathematical representations. We have a number of houses m; this collection is called the training set, as we train our algorithm on it. We have the size x for each house, which is called the input variable / sample. Also we have the price of each house y, called the output variable. Let’s have a quick look at what we have so far:

  • m: the number of training samples
  • x: the input variable
  • y: the output variable

What we will do now is to simply combine the information we have and stick them into an easier-to-use format. We use vectors and matrices for that and the idea is quite simple. Each input variable x will be one column of our input matrix X. So if we represent the first house with size x₁ and the second house with size x₂​ and so on we can then create a matrix that looks like this (for 4 samples):

How we can represent single data samples into a matrix.
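Written out (a reconstruction in LaTeX notation of the matrix the figure showed; x₁ through x₄ stand for the four house sizes), this looks like:

```latex
X = \begin{pmatrix} x_1 & x_2 & x_3 & x_4 \end{pmatrix}
```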

Therefore we can create our new input matrix X which has the dimensionality: ℝⁿˣᵐ. Again m​ is the number of samples we have and n​ is the number of features we are observing (here: n=1). In our case this is only a single feature (the size of the house in square feet) but it can actually be multiple features such as number of bedrooms, number of bathrooms, size of the garden, etc. If we only have a single input we perform univariate linear regression whereas for multiple input features we do what is called multivariate linear regression. We stick to one feature for the sake of simplicity.

We can do the same sort of operation for the output variables (which are all of the house prices we know). To get a better mathematical representation we input the prices of all the samples we have into a single vector. The resulting vector y has the dimensionality ℝᵐ because it contains one value for each sample (remember: m = number of samples).

How we can transform our output samples into a vector.
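As a concrete sketch of this setup in plain Python (the sizes and prices are made-up placeholder values, not real data):

```python
# Four training samples: house sizes in square feet (input) and
# prices in dollars (output). All numbers are made-up examples.
sizes = [1400.0, 1600.0, 1700.0, 1875.0]
prices = [245000.0, 312000.0, 279000.0, 308000.0]

# X has shape (n, m): one row per feature, one column per sample.
# With a single feature (n = 1), X is just one row of sizes.
X = [sizes]

# y has one price per sample, so it has m entries.
y = prices

m = len(y)   # number of training samples
n = len(X)   # number of features
print(n, m)  # 1 4
```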

The pipeline

Now that we have a good understanding of the setup and the representations of the values we have it is time to think about what we want to do. I will try to explain the steps in an overview first to give you a better understanding of the pipeline we will go through in order to create our algorithm. Don’t worry, we will look at each step in detail but let’s jump onto the buzzword-train first.

In order to go from our input values (sizes of houses) to a prediction (estimates of house prices) we need something called a hypothesis function which we will denote by ​h(X). So again we use our training set to estimate the prices of the houses we already know the correct value y​ for. Because we know the correct values we can determine how far off our predictions were. This difference is calculated with a function called cost function. Then we will use a learning algorithm called gradient descent to adjust our hypothesis function in some way to make better predictions. Those were a lot of new names and concepts so let’s look at a graphical representation of that. Then we will look at each step in more detail.

The overall pipeline for our linear regression model.

Hypothesis function

The hypothesis function is the core of the estimation process. We feed our samples X into the function and want it to predict a continuous output (usually denoted ŷ, i.e. y with a ^ on top). You already learned what the term regression stands for (we want to predict a continuous output, e.g. the price of a house). Now we come to the term linear, because the form of the hypothesis function is simply a linear function. If you are not sure what this means, it will become clear when you see the function written out:
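In our univariate case this function is (a LaTeX reconstruction of the formula the original figure showed):

```latex
h_\theta(x) = \theta_0 + \theta_1 x
```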

We introduce a new variable here called θ​ (theta). These are the parameters that the algorithm uses to predict the price of the house. It consists of n+1​ (giving it dimensionality ℝⁿ⁺¹​) values which are also called weights as each of them can be thought of as the weight of a certain feature. In our case we only have a single feature but in the case of n​ features it would look like this:
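For n features the hypothesis would be (again a reconstruction of the formula the original figure showed):

```latex
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
```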

So for each feature we have a value θᵢ​ to multiply it with, giving it a certain weight. We can think of this as: How important is a feature (for the final price of the house)? If it has a high impact on the price (e.g. square feet) it will have a high weight attached to it. On the other hand, if it only has a minor impact on the final price (e.g. number of door handles) it will have a low weight attached to it.

You may notice that the last parameter θ₀​ does not belong to a feature but is a constant. This parameter is called the bias, and it is important because it allows us to give a base price even if all other values are zero. This is part of (almost) every machine learning system, so make sure to think of it when you implement this.
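As a minimal Python sketch (the parameter values are arbitrary placeholders, not learned values), the hypothesis with its bias term looks like this:

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis: predicted price for a house of size x.

    theta0 is the bias (a base price), theta1 the weight of the size feature.
    """
    return theta0 + theta1 * x

# Arbitrary placeholder parameters: $50,000 base price plus $150 per sq ft.
print(hypothesis(50000.0, 150.0, 1400.0))  # 260000.0
```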

Cost function

Now that we have the estimation of our housing prices we need to determine how well we performed. I mentioned before that we need to define a cost function J for this. We are doing this because we want to know how well the parameters θ​ do at predicting the exact house prices. Because we have the real prices in our training set we can make use of this with a function called the mean squared error function which looks like this:
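Written out (a reconstruction of the formula the original image showed, using the notation defined above):

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2
```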

  • h(x⁽ⁱ⁾)​: this is simply the result of the hypothesis function for the i-th training sample
  • y⁽ⁱ⁾​: the ground truth (house price) for sample i. The difference between our prediction (hypothesis) and the correct value gives us the error (how far off we were with our prediction). We then take the square so that only positive values remain and we have the squared error.
  • 1/(2m) Σ​: the name of our function includes the term mean, and this is what we calculate here. Because we sum up the squared errors for all training samples and then divide by m​ we get the average error. This is important as we want to perform well over the whole set and not focus on single samples. (Note: do not be confused by the factor 2 in the denominator. It will make the next steps easier but aside from that has no meaning.)
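These pieces translate to a short Python sketch (continuing the univariate setup; the data values in the example call are placeholders):

```python
def cost(theta0, theta1, xs, ys):
    """Mean squared error J(theta) over the whole training set,
    including the conventional factor 1/(2m)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = theta0 + theta1 * x   # hypothesis h(x)
        total += (prediction - y) ** 2     # squared error for this sample
    return total / (2 * m)

# A perfect fit yields zero cost: with theta0=0 and theta1=2 the line
# y = 2x passes exactly through these points.
print(cost(0.0, 2.0, [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 0.0
```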

Now we can calculate how far off we were on our predictions for the whole dataset. Next, we need a way to adjust the parameters to perform better on the next run.

Learning algorithm

At first there is no intuitive way to see how we can correct the parameters so that we know we will predict better in the future. This is where the gradient descent algorithm comes into play. I’d recommend everyone who wants to be involved in the field of machine learning to really understand this, because gradient descent (and its variants) is used all over the field (even in deep learning and neural networks). While it is simple in principle, let’s first take a look at how we can come up with this solution.

So first we plot the cost function J (the mean squared error function):

A simplified plot of the cost function J. The red cross marks the minimum point of the function.

As we can see the cost function has a minimum value. This value is what we are looking for. We can think of this as the point where our cost function will yield the lowest possible result. Therefore the parameters of our model (which are responsible for the result of the cost function) have the best configuration we can calculate.

However we will initialize our parameters with a random factor that is very unlikely (read: never) to be the optimal value. So our goal is to start at a certain point (e.g. the blue cross in the figure below) and work our way towards the “bottom” of the function (indicated by the red arrow).

Of course (not) all of us remember from calculus how we can find the minimum of a function: calculating the derivative! The approach of gradient descent is to calculate the partial derivative for all parameters of the cost function and subtract it (weighted by a factor α)​ from the original parameter value.
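As a formula (a reconstruction consistent with the cost function J defined above), a single update of the parameter θⱼ looks like this:

```latex
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
```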

If we do this over and over we will arrive at the minimum after a certain number of steps. So the overall algorithm can be described as:

  1. Start with some (random) ​θ’s
  2. Calculate predictions and evaluate the cost function ​J(θ)
  3. Adjust the parameters θ​ to reduce the error ​J(θ)
  4. Repeat steps 2 & 3 until we arrive at a minimum (in mathematical terms: until convergence)
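The four steps above can be sketched in Python (continuing the univariate example; the gradient formulas used here match the derivatives worked out later in the post):

```python
def gradient_descent_step(theta0, theta1, xs, ys, alpha):
    """One update of both parameters using the mean-squared-error gradients."""
    m = len(xs)
    grad0 = 0.0
    grad1 = 0.0
    for x, y in zip(xs, ys):
        error = (theta0 + theta1 * x) - y  # h(x) - y
        grad0 += error                     # gradient w.r.t. the bias
        grad1 += error * x                 # gradient w.r.t. the weight
    # Subtract the averaged gradients, scaled by the learning rate alpha.
    theta0 -= alpha * grad0 / m
    theta1 -= alpha * grad1 / m
    return theta0, theta1

# Repeat until convergence (a fixed iteration count as a stand-in here).
theta0, theta1 = 0.0, 0.0  # step 1: start with some initial parameters
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
for _ in range(2000):      # steps 2-4: predict, evaluate, adjust, repeat
    theta0, theta1 = gradient_descent_step(theta0, theta1, xs, ys, 0.1)
print(round(theta1, 2))    # 2.0 (the data lie exactly on y = 2x)
```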

Great, now we have everything in place and we can create the mathematical formula for adjusting our parameters. The minus sign in the formula stems from the fact that we want to descend to a minimum of our cost function.

  • J(θ): the cost function we defined before in order to calculate how far off our prediction was from the ground truth.
  • ∂J/∂θⱼ: we take the gradient (i.e. the derivative with respect to the j-th parameter θⱼ) of this cost function.
  • α: this factor is called the learning rate, and the name sums up perfectly what it does. If we choose it to be small (e.g. 0.001​) the progress in our learning process will be slow compared to larger values (e.g. 0.1​). Finding the perfect value is tricky. If we choose the learning rate to be too large we might overshoot the perfect value and never come up with a great solution. If it is too small the training takes a (very) long time. Take a look at the graph below to see this for yourself.
Learning rates explained (blue cross is the start, red arrows indicate a single step in the gradient descent): (1) indicates a learning rate that is too large and the error “overshoots” and gets even larger over time. (2) shows a well-chosen learning rate as we progress steadily towards the minimum. (3) has a learning rate that is too small which takes a very long time to eventually reach the minimum.

We have almost reached the finish line. Now let’s finally take a look at what the final equation looks like, because we have to make an important distinction. First let’s look at the general formula for calculating the gradient in our case of predicting the price of a house with one parameter θ₁ (weighting the size of the house) and what we called the bias θ₀:
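That formula (reconstructed here in LaTeX from the definitions above) is the derivative of the mean squared error cost with respect to a parameter θⱼ:

```latex
\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2
```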

Here we replaced the cost function with the formula we defined for it before. Now we still have the hypothesis function in there so let’s replace that, too (here we have the 2 parameters for size of house θ₁ and bias θ₀):
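With the hypothesis substituted in, this becomes (again a reconstruction):

```latex
\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2
```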

Now we have something we can rather simply derive. So let’s first look at the derivation for the bias:
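Carrying out that derivative gives (a reconstruction consistent with the cost function above; the factor 2 from the square cancels the 2 in the denominator):

```latex
\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)
```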

The difference for the other parameters is that we have to differentiate the content of the brackets as well, so we get the same result, only with the input feature multiplied in at the end:
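Written out (again a reconstruction), the gradient for the weight is:

```latex
\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x^{(i)}
```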

Great! Now we have everything we need to adjust our parameters from the training samples we have. Let’s summarize again what we have learned here.

Summary

We have covered quite a lot in this post. First we learned the mathematical representation of our input data. Then we took a look at the general pipeline of our linear regression machine learning model. After that we defined the hypothesis function to generate predictions and the cost function to evaluate how far off these predictions were. Finally, the gradient descent algorithm was introduced so that we can really learn and adjust the parameters of our model to get better and better over time at predicting values.

To sum up our linear regression workflow looks like this:

  1. create an input matrix X containing the known samples we have for our use-case
  2. predict values for each sample using our hypothesis function hθ(x)
  3. evaluate how well our predictions worked with the use of our mean squared error cost function J(θ)
  4. adjust our parameters using gradient descent to produce better results for the next predictions
  5. repeat steps 2–4 until our cost function reaches its lowest point
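The five workflow steps can be strung together in one short, self-contained Python sketch (made-up data, no external libraries; a real project would reach for a library instead):

```python
# 1. Input data: house sizes (in 1000s of sq ft, scaled for stability)
#    and prices (in $1000s). All numbers are made-up examples.
xs = [1.4, 1.6, 1.7, 1.875]
ys = [245.0, 312.0, 279.0, 308.0]
m = len(xs)

theta0, theta1 = 0.0, 0.0  # arbitrary initial parameters
alpha = 0.1                # learning rate

for step in range(5000):
    # 2. Predict with the hypothesis h(x) = theta0 + theta1 * x.
    errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
    # 3. Evaluate the mean squared error cost J(theta).
    J = sum(e * e for e in errors) / (2 * m)
    # 4. Adjust the parameters via gradient descent.
    theta0 -= alpha * sum(errors) / m
    theta1 -= alpha * sum(e * x for e, x in zip(errors, xs)) / m
    # 5. Repeat until convergence (fixed step count as a stand-in).

print(round(theta0, 1), round(theta1, 1))
```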

We are done here, but of course this is only the tip of the iceberg. This is the theory behind the linear regression algorithm, and it is very important to understand the entire process. However, when we want to apply said algorithm in the real world there are a lot of errors that can be made and a whole set of practical questions that will arise. Let me know if you are also interested in another blog post on that.

As a final remark I want to mention that, in practice, most of the work discussed here is done by tools. Popular libraries like scikit-learn or TensorFlow will do the heavy lifting for you. However, I think it is very important to know the underlying principles and formulas so that you have a solid understanding of how to properly train and test algorithms, as well as spot errors and problems in the process.

Thanks for following this post until the end and feel free to comment if you have questions or reach out to me on Twitter.
