Linear Regression: Deriving the Formulas

Jonathan Bogerd
5 min read · Aug 31, 2022


Introduction

In this article we will cover the basics of Linear Regression. In Part 1 we will derive the formulas and test their accuracy. Part 2 will discuss the assumptions underlying Linear Regression, and in particular Ordinary Least Squares (OLS), and will test what happens if any of these assumptions is violated. In Part 3 we will investigate Linear Regression with multiple input variables. This three-part series will teach you all the basics you need to know about Linear Regression.

Deriving the Formulas

The idea of linear regression is to find the linear relation between an outcome variable, usually called ‘y’, and a set of input variables, denoted by ‘x’. This linear relation can be mathematically expressed as follows:
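y = a + b x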

In this equation, the input variable(s) x are related to the output y by a factor ‘b’ and a constant ‘a’. Graphically, a is the intercept of the line and b is its slope. In this article, we will only discuss the case with one explanatory variable, that is, a single variable x instead of a set of input variables. This is usually called simple linear regression.

Linear Regression deals with the question: what are the best estimates for a and b to approximate the linear relation between x and y? In other words, we want to find the line that gives the best fit through the data points. In order to do this, we first have to determine what ‘best’ really means.

Determine the Error

Probably the most intuitive measure would be to minimize the error for each observation, that is, the vertical distance between each point and the given line. This error for observation i is denoted by
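e_i = y_i - (a + b x_i)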

The vertical distance is then the absolute value of e_i. We want to minimize the sum of these absolute errors over all observations, hence:
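\min_{a, b} \sum_{i=1}^{n} |e_i|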

Here we sum over all observations and for the rest of this article we will use the symbol n to denote the number of observations.

A standard approach for minimizing a function is to calculate its derivative and equate it to zero. The solution to this equation is then a candidate for the minimum. Of course, you would also have to check that it really is a minimum, for instance by calculating the second-order derivative, but for simplicity we will skip that step. If you are familiar with calculus, you know that the derivative of an absolute value function is only piecewise defined and does not exist at zero: the slope of |e| jumps from -1 to 1 at that point.

Therefore, although it is of course possible to use this error function, the most common approach is to minimize the sum of all squared errors:
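\min_{a, b} \sum_{i=1}^{n} e_i^2 = \min_{a, b} \sum_{i=1}^{n} (y_i - a - b x_i)^2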

Finding the parameters a and b by minimizing this function is called Ordinary Least Squares. Let's start by finding the value of a that minimizes the error function.

Calculating the Intercept a

As explained, we want to find the minimum of the sum of squared errors by calculating the derivative and equating it to 0. However, as we have two parameters that can be tuned, we need partial derivatives with respect to both a and b. Let's start by finding the expression for a:
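\frac{\partial}{\partial a} \sum_{i=1}^{n} (y_i - a - b x_i)^2 = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0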

Note that b and x are not variables in this partial derivative, but fixed constants. The chain rule therefore only contributes a factor of -1 (the derivative of the inner term with respect to a), which, combined with the factor 2 from the square, gives the derivative shown above. Both y and x depend on i, but a does not. Rewriting to make this explicit yields:
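2 n a = 2 \sum_{i=1}^{n} y_i - 2 b \sum_{i=1}^{n} x_i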

Dividing both sides by 2n results in an expression for a. Note that summing y over all observations and then dividing by n is equivalent to the average of y, denoted by y-bar. The same applies to x. Therefore, we have found the value of a that minimizes the squared error:
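a = \bar{y} - b \bar{x}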

Calculating the Slope b

Finding the value for b is a little more involved. First note that the average of x, x-bar, is independent of i. The expression for the derivative with respect to a can therefore also be written as follows:
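2 \bar{x} \sum_{i=1}^{n} (y_i - a - b x_i) = 0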

where we multiplied both sides by the average value of x and divided by -1. Now we move the average of x inside the summation, which is allowed due to the mentioned independence. This yields:
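2 \sum_{i=1}^{n} \bar{x} (y_i - a - b x_i) = 0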

But, why would we do this? We want to obtain a value for b, not for a! However, it turns out that we can use this expression, which we know equals 0, to our advantage.

The partial derivative with respect to b is:
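\frac{\partial}{\partial b} \sum_{i=1}^{n} (y_i - a - b x_i)^2 = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0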

Subtracting the equation based on the derivative with respect to a and reworking terms (the common factor of 2 can be dropped, since both equations equal zero) yields:
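\sum_{i=1}^{n} (x_i - \bar{x}) (y_i - a - b x_i) = 0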

Now we can use the equation we found for a to solve for b:
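\sum_{i=1}^{n} (x_i - \bar{x}) (y_i - \bar{y} - b (x_i - \bar{x})) = 0 \quad \Longrightarrow \quad b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}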

Here we have our formulas for both a and b! Note that the formula for b is the sample covariance of x and y divided by the sample variance of x. This is not surprising, as both the covariance and b measure the relation between x and y, that is, the effect of a change in x on y.

Data Generation and Test

Now, let's put our formulas to the test using some Python code. We will generate a random sample of 1000 x values between 0 and 10 and a random sample of 1000 error values drawn from a standard normal distribution with mean zero. In the next article, on the underlying assumptions of OLS, we will go into the reasons for the normal distribution.

Then we calculate y using the linear relation formula, now including the error term:
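y = a + b x + e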

We will set a to 3 and b to 2 in this example. Now let's calculate our best estimates for a and b using our formulas. All this is implemented in the section of code below:
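A minimal sketch of these steps, assuming NumPy (the variable names and the random seed are illustrative):

import numpy as np

# Illustrative seed; the exact estimates depend on the random draw
rng = np.random.default_rng(0)

# True parameters of the linear relation
a_true, b_true = 3, 2

# 1000 x values uniformly between 0 and 10, and standard normal errors
x = rng.uniform(0, 10, 1000)
e = rng.standard_normal(1000)

# Generate y from the linear relation, including the error term
y = a_true + b_true * x + e

# OLS estimates using the derived formulas
b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y.mean() - b_hat * x.mean()

print(f"a = {a_hat:.2f}, b = {b_hat:.2f}")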

In this case, we find a = 3.09 and b = 1.99, very close to the true underlying values of a and b, despite the random errors that are present. In the graph below, you can find a plot of the data together with the fitted line.

Conclusion

In this article we derived the formulas used in Ordinary Least Squares and tested the accuracy of the resulting estimates. However, Ordinary Least Squares is based on 7 assumptions that we did not explicitly discuss in this article. That will be the topic of the next article!

If you want to read more articles on data science, machine learning and AI, be sure to follow me on Medium!
