The AI Odyssey: Linear Regression

Beatrice de Waal · Published in AI Odyssey · Oct 23, 2023

“Prediction is very difficult, especially about the future.” Niels Bohr

There are many instances in our daily lives in which we try to imagine what the future is going to look like; in some of these instances, though, we cannot settle for imagination. Very often, in fact, we need a framework, a model, that will give us an accurate prediction we can justify; that is, one for which we can explain the reasoning that led us to it.

Think of forecasting stock prices to make an investment, deciding which customer segment to target when advertising a new service, or trying to determine the relationship between developing a medical condition and some risk factors.

This is exactly the aim of regression: to find the relationship between a set of independent variables, or predictors (our inputs), and one variable that depends on them, which is therefore called the dependent variable (our output). Once we have discovered what this relationship is, and we have formalised it in a model, we can use it to make predictions about what output values we would obtain for values of the inputs for which we have no direct observation.

Let’s now put this idea into a mathematical form. We need a way to map an input onto an output; that is exactly what functions are for:

Y = f(X) + ε

Let’s analyse the terms one by one:

  • X is our independent input variable; it could be a scalar variable, in which case we would be looking at a simple regression, or a vector of predictors, which would make our model a multiple regression.
  • Y is the dependent output variable, which is always a scalar.
  • f( ) is the function that captures the relationship between the input and the output.
  • Finally, we introduce the ε term into our model, which is the error term. It is a random variable that accounts for all the variations in output Y that are not explained by the variations of input X. We can also refer to it as noise.

The simplest, yet powerful, regression model is linear regression. As implied by its name, in linear regression there is a linear relationship between the input variables and the output variable. Mathematically, we can represent it like this:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε

Linear regression is thus a parametric model, meaning that it contains parameters (βᵢ) that we shall estimate, or learn, through the data we have already available.

To apply the linear regression model successfully, we must know, or at least reasonably suspect, that there is a linear relationship between our inputs and output.

Let us contextualise linear regression in the machine learning landscape: first of all, since we learn the model from data in which the output variable is directly observed, linear regression falls within supervised learning algorithms. To adopt some machine learning jargon, from now on we can also refer to Y as our target variable.

Second, our goal when applying linear regression is to predict the value of a continuous numerical variable: this is the decisive feature that distinguishes regression from classification models; the latter deal with categorical, as opposed to numerical, outputs.

Linear Regression: step by step

We can now dive into the more technical aspects of linear regression. As already mentioned, despite its relative simplicity linear regression is an extremely powerful tool, which serves as a fundamental building block for more complex methods; it is therefore crucial to understand it well. To keep the odds in our favour, we will look at the most basic type: simple linear regression, with two variables (one independent, one dependent) and two parameters (an intercept and a slope). One common application of such a model is the relationship between income and spending (assume there is no lending or borrowing allowed): it is reasonable to assume that people with more money at their disposal will spend more than those with limited means.

Our starting point is a dataset of matching observations of variables X and Y. We can visualise it through a scatterplot.
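
To make the running example concrete, here is a minimal R sketch assuming nothing more than base R; the variable names (income, spending), the sample size and the coefficients are invented for illustration, not taken from any real dataset:

```r
# Simulate a small, purely hypothetical income/spending dataset
set.seed(42)
n <- 100
income   <- runif(n, min = 1000, max = 5000)          # monthly income
spending <- 200 + 0.6 * income + rnorm(n, sd = 300)   # linear trend plus noise

# Scatterplot: a first visual check for a linear relationship
plot(income, spending,
     xlab = "Income", ylab = "Spending",
     main = "Simulated income vs. spending")
```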

We use the scatterplot as a first check of whether there is an actual linear relationship between variables X and Y; in our income and spending example, the linear relationship is pretty clear.

What we now need to find is the line that will best fit all our data; in other words, we need to estimate the parameters A and B of the equation:

y = Ax + B

such that the distances between the data points and the regression line are as small as possible.

As for any linear function, we know that B is the intercept of the best-fit line, while A is the slope: the change in the dependent variable Y for a one-unit change in variable X. When we interpret the model, we have to be careful with the meaning we give to parameter B: from a purely mathematical standpoint, B is the value that variable Y assumes when variable X is zero. However, depending on the situation we are modelling, this interpretation may not make sense; think of our previous example of spending and income and assume B is positive: if a person has no income whatsoever and cannot borrow any money, how could they spend money they do not have?

This example also allows us to note the difference between making estimates of Y for values of X within the range of the observed data (interpolation) and for values of X outside that range (extrapolation). Results obtained by interpolation are generally much safer than those obtained by extrapolation, especially in some fields. Think, for instance, of a theory in economics that describes human behaviour: while a relationship between two variables may be linear within a certain interval, there is nothing to guarantee that, beyond a certain point, the relationship will not change. In our example we assumed a linear relationship between income and spending; it is plausible, though, to expect this relationship to break down for very wealthy people, who are not subject to many of the constraints faced by the majority, whose income allows a fairly comfortable life only with sound budgeting.

Even though the scatterplot clearly shows some sort of linear relationship between X and Y, the variation of Y is not entirely accounted for by the variation of X. Income may be a good predictor of spending, but it is not the only variable that affects it. We use the term ε, the error, to account for the effects of all other influencing factors on Y.

Once we have found our estimates B* and A* of parameters B and A, we can finally compute an estimate y* of the value of Y when variable X takes the value x, through the equation:

y* = A*x + B*

We can then draw the best-fit line on the scatterplot.

Therefore, every point on the best-fit line is an estimate y*, while every data point is an actual observation y. The distance between y* and y, that is, the error of the specific estimate y* with respect to the real observation y, is e = y* − y.

So far we have stated multiple times that to find the best-fit line we need to find the parameter estimates that minimise the distances between the data points and the line. But how do we do that? One of the most widely used methods is least squares: let's see how it works.

As we have already mentioned, we want to describe the variation of Y in terms of the variation of X, but X does not exclusively determine Y; there is also the error e, a catch-all term for every other factor. Since we need a model that gives the best possible predictions, we must choose the parameters B* and A* that make the errors as small as possible. To do so, we minimise a measure of the total error. Since some errors are positive and some are negative, simply summing the differences between estimates and observations would let them cancel each other out. A possible solution would be to take the absolute values of the differences, but the more widely adopted choice is to take their squares. We thus have:

SSE = Σ eᵢ² = Σ (y*ᵢ − yᵢ)² = Σ (A*xᵢ + B* − yᵢ)²

Please note that all the bounds of the sums go from 1 to n; they have been omitted for lighter notation.

To minimise this function, we resort to differential calculus; since we have two unknowns, we first differentiate with respect to B* and set that derivative equal to zero, obtaining an expression for B* that depends on A*. We then differentiate the sum of squared errors with respect to A*, substituting for B* the expression we have just found; setting this derivative equal to zero as well, we obtain:

A* = Σ (xᵢ − x̄)(yᵢ − ȳ) / Σ (xᵢ − x̄)²
B* = ȳ − A* x̄

where x̄ and ȳ denote the sample means of X and Y.

If you are interested in all the steps of the derivation, you can find them explained with great clarity in the jbstatistics video listed in the bibliography.

Please notice that we can rewrite the term A* as follows:

A* = Cov(X, Y) / sd(X)² = r · sd(Y) / sd(X)

using the covariance of variables X and Y, the sample standard deviations of X and Y and the Pearson correlation coefficient r. It is absolutely sensible for the correlation coefficient to appear in this formula since it measures exactly the linear correlation between two variables. We will touch upon the correlation coefficient again later on.
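
As a rough illustration, the sketch below computes these estimates directly on the simulated income and spending vectors from earlier (A_star and B_star are names of my own choosing), and checks the equivalent covariance and correlation forms of the slope:

```r
# Least-squares estimates computed directly from the formulas above
x <- income
y <- spending

A_star <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
B_star <- mean(y) - A_star * mean(x)                                 # intercept

# Equivalent expressions for the slope
cov(x, y) / var(x)         # covariance over variance of X
cor(x, y) * sd(y) / sd(x)  # r * (sd of Y / sd of X)

c(slope = A_star, intercept = B_star)
```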

How a computer makes estimates: a brief introduction to gradient descent

Computing the values of A* and B* on paper is potentially extremely time-consuming, so we usually rely on computers; a procedure widely used to minimise functions such as our sum of squared errors is gradient descent. We are not going to focus on gradient descent in this article, but to grasp the basic idea behind it, imagine a ball rolling on a convex surface: it will reach the surface's lowest point. Then, think of the gradient as a generalisation of the slope: it tells us the direction in which the function grows the most, that is, where it is steepest. Descending the gradient therefore means moving in the opposite direction, towards where the function decreases.

We can visualise gradient descent for the minimisation of the squared error function; viewed as a function of a single parameter (say A*, with B* held fixed), it is

SSE(A*) = Σ (A*xᵢ + B* − yᵢ)²

which is indeed a convex parabola. Imagine standing at some point on the horizontal axis (a value of the parameter) and taking small steps to the left and to the right.

If the step is not too big, you will reach a point where the function has a lower value. You can iterate this process until you arrive at a minimum, where the gradient is equal (or extremely close) to zero. We can model each update with this simple expression, where θ stands for either parameter (A* or B*):

θ ← θ − α · ∂SSE/∂θ

where α is the so-called learning rate, a parameter chosen by the experimenter that stands for “the width of the step”.

Parameters that are chosen by the experimenter are called hyperparameters; it is always crucial to keep clearly in mind which parameters are set by the experimenter (such as the learning rate in gradient descent) and which ones are learned by the model (such as B* and A* in linear regression).
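
For the curious, here is a minimal, illustrative sketch of gradient descent for our two parameters (statistical software typically uses the closed-form solution instead); the standardisation step, the learning rate and the iteration count are arbitrary choices of mine:

```r
# Gradient descent for simple linear regression, minimising the mean squared error
gradient_descent <- function(x, y, alpha = 0.1, iterations = 1000) {
  # work on standardised copies so a single learning rate behaves well
  xs <- (x - mean(x)) / sd(x)
  ys <- (y - mean(y)) / sd(y)
  n  <- length(x)
  A <- 0  # slope on the standardised scale
  B <- 0  # intercept on the standardised scale
  for (i in seq_len(iterations)) {
    e <- (A * xs + B) - ys                  # errors of the current fit
    A <- A - alpha * (2 / n) * sum(e * xs)  # step against d(MSE)/dA
    B <- B - alpha * (2 / n) * sum(e)       # step against d(MSE)/dB
  }
  # convert the estimates back to the original scale of x and y
  slope     <- A * sd(y) / sd(x)
  intercept <- mean(y) + B * sd(y) - slope * mean(x)
  c(slope = slope, intercept = intercept)
}

gradient_descent(income, spending)  # should be close to the least-squares estimates above
```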

Looking back on our steps

Before moving any further, we shall take a step back and analyse some assumptions that need to hold for linear regression to be applicable:

  1. The values x are either fixed numbers or realisations of a random variable, and are all independent of the error terms ε. This means that there isn’t any correlation between the value taken by X and the one taken by ε.
  2. The error terms ε are random variables with mean equal to zero and constant variance. This property is called homoscedasticity: the spread of the errors does not depend on the value of x. Its opposite is heteroscedasticity, meaning that the variance of the errors changes across the sample, possibly in a way related to x. It is quite common for errors associated with different ranges of x to have different variances: going back to our income and spending example, people with low income all spend around the same amount, because they do not really have a choice; among people with high income, on the other hand, some spend extravagantly and others more moderately, resulting in a higher variance of the error. Such variances produce a cone-shaped, heteroscedastic pattern in the errors (a quick graphical check is sketched right after this list).
  3. The random errors ε are independent of one another.
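
A quick, informal way to eyeball the zero-mean and constant-variance assumptions is to plot the errors against the fitted values. Here is a minimal sketch, reusing the simulated data and the estimates A_star and B_star from the earlier snippets (both of which are assumptions of these examples, not part of any library):

```r
# Errors-versus-fitted plot: a quick visual check of the error assumptions
fitted_y <- A_star * income + B_star   # estimates y*
e        <- fitted_y - spending        # e = y* - y, as defined above

plot(fitted_y, e,
     xlab = "Fitted values y*", ylab = "Errors e",
     main = "Errors vs. fitted values")
abline(h = 0, lty = 2)  # under the assumptions, points scatter evenly around this line
# A cone or funnel shape widening to one side would suggest heteroscedasticity.
```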

At this point, it should be clear enough when and how linear regression can be used. But once we have applied it and it is finally time to get our first estimates y*, how can we be sure that our model is good enough?

Many measures and tests can be computed to assess the quality of our model; in this section, we will look at the one which is probably the most popular: the coefficient of determination, or R².

R² is a measure that represents the proportion of the variance of the observed values y that is explained by the regression, i.e. by the variation of the independent variable x. Before stating the formula for R², we shall define three sums that help us analyse the linear regression:

  • Sum of Squares Error (SSE) = Σ (yᵢ − y*ᵢ)²: we have already seen this sum, as it is the function we minimised in the least-squares procedure. It accounts for the variability of y that is not explained by the regression.
  • Sum of Squares Regression (SSR) = Σ (y*ᵢ − ȳ)²: it accounts for the variability of y that is explained by the regression equation.
  • Sum of Squares Total (SST) = Σ (yᵢ − ȳ)²: it is the total variability of the observations y, and it satisfies SST = SSR + SSE.

Finally, we can compute R² as the ratio between the Sum of Squares Regression and the Sum of Squares Total:

R² = SSR / SST = 1 − SSE / SST

Note that R² is always a number between 0 and 1: the closer it is to 1, the larger the proportion of the variability of y explained by the linear regression equation, and thus the better our model fits the data. It is often reported as a percentage.
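
Continuing the running sketch with the simulated data and the hand-computed estimates (again, names and numbers of my own making), the three sums and R² can be computed directly:

```r
# Sums of squares and the coefficient of determination, computed by hand
y_star <- A_star * income + B_star          # fitted values y*
SSE <- sum((spending - y_star)^2)           # variability left unexplained
SSR <- sum((y_star - mean(spending))^2)     # variability explained by the regression
SST <- sum((spending - mean(spending))^2)   # total variability (= SSR + SSE)

R2 <- SSR / SST
R2                                          # equivalently, 1 - SSE / SST
```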

Something important to keep in mind about R² is that it does not tell us how accurate our predictions for new, unobserved values will be; it only tells us how well the linear model fits the data we have already observed.

Lastly, it can be shown, although we won't prove it here, that for a simple linear regression the coefficient of determination is equal to the square of the Pearson correlation coefficient:

R² = r²

As easy as it can get: Linear Regression in R

One of the easiest ways to set up a linear regression model is to use R: once we have our vectors, or columns of a data frame, containing the values of the dependent and independent variables, all we need to do is call the function lm( ) and assign the result to a variable to store the fitted model. To obtain some useful summary measures, we can then call the function summary( ):
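
A minimal sketch of what this looks like, reusing the simulated income and spending vectors from the earlier examples (those names are assumptions of this article's sketches, not anything required by R), follows:

```r
# Fit the simple linear regression: spending is the dependent variable, income the predictor
fit <- lm(spending ~ income)

summary(fit)   # estimates, standard errors, R-squared and other diagnostics
coef(fit)      # just the estimated intercept (B*) and slope (A*)

# Predictions for new values of income (interpolation if they fall inside the observed range)
predict(fit, newdata = data.frame(income = c(1500, 3000, 4500)))
```

The coefficients reported by summary(fit) should match the hand-computed A_star and B_star from the earlier sketches, and the Multiple R-squared value should match the R2 we computed by hand.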

Conclusion

As we reach the end of our discussion of linear regression, one important observation is worth making: linear regression is a rather basic statistical method, invented long before the rise of machine learning and artificial intelligence. Yet, it is one of the best places to start your journey, as it allows you to get acquainted with some cornerstones of a data scientist's thought process.

Linear regression, and some of the other concepts we have met today, were most likely not completely alien to you before. This is no accident: there is a certain mist hovering around machine learning that is often not really justified; much of it is just a dreaded language barrier. I hope you will follow us to the next step of this voyage, to learn some more rules of this fascinating language game.

Bibliography:

IBM. “What is Linear Regression?” Accessed October 17, 2023.

https://www.ibm.com/topics/linear-regression#:~:text=Linear%20regression%20analysis%20is%20used,is%20called%20the%20independent%20variable

3Blue1Brown. “Gradient descent, how neural networks learn”. Accessed October 17, 2023.

https://youtu.be/IHZwWFHWa-w?si=wrutY0v8am9F0Abi

Khan Academy. “Gradient descent”. Accessed October 17, 2023.

https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/optimizing-multivariable-functions/a/what-is-gradient-descent#:~:text=Gradient%20descent%20minimizes%20differentiable%20functions,direction%20of%20the%20negative%20gradient

jbstatistics. “Deriving the least squares estimators of the slope and intercept (simple linear regression)”. Accessed October 17, 2023.

https://youtu.be/ewnc1cXJmGA?si=o4arom1sbKx0XE3d

P. Newbold, W. L. Carlson, B. M. Thorne. “Statistics for Business and Economics”. 9th ed. 2020.

A. Lindholm, N. Wahlström, F. Lindsten, T. B. Schön. “Machine Learning. A First Course for Engineers and Data Scientists”. 1st ed. 2022.

D. P. McGibney. “Applied Linear Regression for Business Analytics with R”. 1st ed. 2023.
