Linear Regression: An Introduction

Indrani Banerjee
8 min read · Oct 22, 2022


Many of us are taught about regression in the early days of secondary school science courses as part of ‘lines of best fit’. Somewhere along the journey they become regression lines, and now they are one of the most important skills to show off when trying to enter the world of data science.

So, what is regression and why is it so important?

Regression analysis is really a fancy way of saying we want to predict what a dependent variable is doing for a specific independent variable. Just because we may want to do something, or can do something, doesn’t always mean we should, right? This is where interpolation and extrapolation come into the conversation. And, finally, what does our model actually mean? We have to answer this to wrap up our regression analysis. This is where we discuss the quality of our model, and you’ll hear words like correlation and causation being thrown around.

At its core, regression analysis is a statistical method we use to try and determine the association between one variable (the dependent variable) and at least one other variable (the independent variable). It’s worth noting that we are looking for association rather than causation here: we want to know what happens to y when x changes rather than why y changes as x changes.

When should you use regression analysis?

It’s all about the data! You can only do regression analysis when you have a continuous dependent variable and want to predict numerical values about what happens to that dependent variable as an independent variable changes.

It’s worth noting that there are a few terms that mean basically the same thing here, and what you choose to use is very dependent on the industry you’re working in. In the world of Physics, we stick to independent and dependent variables, and in the context of Data Science they are commonly referred to as features and target variables.

I’ll keep this post limited to Linear Regression, and I’ll follow this up later with more complex regression models. My aim is to demystify a lot of the jargon that’s thrown around about linear regression, and then show that applying the same principles to non-linear models is essentially pretty easy.

What is Linear Regression?

Say you have two variables, and when you plot them on a graph you see a linear trend: this means you can almost see a straight line that illustrates the trend. As you can see in the left-hand figure, the red dots are scattered in a way that almost shows a trend of y values increasing as x values increase. The figure on the right shows what a possible regression line, or line of best fit, could look like.

Scatterplot (left) and the same scatterplot with an added trendline.

Pretty simple, isn’t it? You’re probably remembering drawing these in secondary school science and maths lessons with a ruler and a pencil. So, what’s the fuss? The fuss is about whether the line of best fit is really, truly the best possible line for the data set. Imagine you were making predictions on house prices in relation to increases in the saleable area of the house: you’d want to be pretty confident about your predictions before finding yourself knee deep in mortgage payments without the rewards of the eventual payoff.

To make sure the regression analysis can make valuable predictions, there are a couple of things to consider:

1. The Data!

For us to use the data to make meaningful predictions, the dataset must be good to start with. This means the data should be representative of the population, or maybe you’ve got the data for the whole population. This has to be addressed at the data collection stage, through the choice of sampling method. And of course, as mentioned before, the dependent variable needs to be continuous.

2. Outliers

A dataset with outliers isn’t a bad dataset. There are quite a few different ways of identifying outliers. One common method is to use the interquartile range.
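
As a rough sketch of the interquartile-range idea (using NumPy here, which isn’t covered in this post, and an illustrative array of made-up values), you could flag anything sitting more than 1.5 × IQR beyond the quartiles:

```python
import numpy as np

# Illustrative data: mostly typical values plus one extreme point
values = np.array([12, 15, 14, 16, 13, 15, 14, 17, 16, 45])

q1, q3 = np.percentile(values, [25, 75])        # first and third quartiles
iqr = q3 - q1                                   # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # common 1.5 * IQR fences

outliers = values[(values < lower) | (values > upper)]
print(outliers)   # the 45 gets flagged as an outlier
```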

Honestly, outliers are part of the randomness of nature, so expect them to exist in the data. The question here really is what to do with them. One of the disadvantages of linear regression is that it’s quite sensitive to outliers, so we need to have a think about how ‘bad’ the outliers really are. Were we to do a linear regression by hand, we would need to consider, for example, whether we want to reduce the effect of extreme values or not. If so, we might decide to use a median–median line. With linear regression the real question is what kind of linear regression to select: do you want the effects of the outliers to be amplified, or do you want them to be minimised?

Say you wanted to explore whether an increase in GDP leads to an increase in the average salary in a country. Then you’d like your dataset to not just contain the salaries of the ultra-wealthy, nor exclusively those of people living below the poverty line. How do you check to make sure you can work effectively with your data?

They say a picture is worth a thousand words: you graph your data! Sometimes seeing the plotted raw data can show you if there are obviously anomalous data points. Then peek at the summary statistics: mean, mode, median, standard deviation, variance, maximums, and minimums. See if there is a linear trend.
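
For instance, with pandas (my choice here; the column names and numbers are just placeholders), a quick first look might be:

```python
import pandas as pd

# Hypothetical dataset: GDP per capita vs. average salary
df = pd.DataFrame({
    "gdp_per_capita": [30_000, 42_000, 55_000, 61_000, 48_000],
    "average_salary": [22_000, 30_000, 38_000, 41_000, 33_000],
})

print(df.describe())   # mean, std, min, quartiles, max for each column
print(df.median())     # medians per column

# Eyeball the raw data for anomalies and a linear trend
df.plot.scatter(x="gdp_per_capita", y="average_salary")
```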

When you’re in school you’re most likely told to draw a line of best fit with an equal number of data points above and below the line. Not a bad call for an eyeballing method. In that case you were most likely deciding to ignore values that were quite far away from the general points. Chances are, though, that if you and your friend work with the same data but draw your own individual graphs, your lines would be slightly different; so how do we get the same lines when we add trend lines to our graphs on graphing calculators and computers?

When using technology, machines automatically use an algorithm, a method for determining the equation of the straight line. The most common types of linear regression that you might have heard of from secondary school are least-squares regression lines and median–median lines. They are a little different in how they work, but they essentially do the same thing: they minimise the vertical distances between the line’s predictions and the actual values in the dataset. The result is a straight line that is in the ‘best’ position to illustrate the trend of the data. As this method calls for some maths rather than eyeballing, we all get essentially the same equation for the line of best fit. Getting the same model as everyone else is great: it makes our models reproducible.
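
Here’s a minimal sketch of a least-squares fit using NumPy’s polyfit (the x and y values are made up): anyone who runs this on the same data gets the same slope and intercept, which is exactly the reproducibility point.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

# Degree-1 polyfit performs an ordinary least-squares straight-line fit
slope, intercept = np.polyfit(x, y, deg=1)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```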

The Model: What is it?

A linear regression model is an equation representing a straight line.

When you have one independent variable it’s known as simple linear regression, and this is the focus of this post.
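
In the usual textbook notation, the simple linear regression model is written as y = β₀ + β₁x + ε, where β₀ is the intercept, β₁ is the coefficient (the slope of the line), and ε is the error term.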

When you have n independent variables, it’s known as multiple linear regression:
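
y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε, with one coefficient βᵢ for each independent variable xᵢ (again in the standard textbook notation).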

How do we judge the quality of our model?

There are quite a few different ways of determining the quality of the model, and even more importantly, of deciding whether a linear model is appropriate at all.

The Coefficients
For a simple linear regression we will only have one coefficient; for multiple linear regressions there will be one coefficient for each independent variable. The value tells us whether the independent variable and the dependent variable have a positive or negative correlation. A coefficient is expressed in units of the dependent variable per unit of the independent variable, so it’s easy to interpret its meaning.
Positive Correlation: As the independent variable increases the dependent variable also increases.
Negative Correlation: As the independent variable increases the dependent variable decreases.
A large coefficient signifies that a one-unit change in the independent variable results in a big change in the dependent variable.
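
As a quick illustration (a sketch using SciPy’s linregress, which isn’t mentioned in this post, on made-up arrays), the sign of the fitted coefficient tells you the direction of the association:

```python
import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 2.4, 2.0, 1.2, 0.6])   # y falls as x rises

result = linregress(x, y)
print(result.slope)       # negative slope -> negative correlation
print(result.intercept)
```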

Correlation Coefficient (r)
This is a measure of how linear the association between the independent and dependent variables actually is. The most common type of correlation we use is known as the Pearson Product Moment Correlation. It ranges from -1.0 to 1.0, where 1.0 indicates a perfect positive linear correlation and -1.0 a perfect negative linear correlation.

This illustrates a few different datasets and example correlation coefficients.
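
If you want to compute r yourself, one option is SciPy’s pearsonr (again with placeholder arrays):

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.3, 9.9])

r, p_value = pearsonr(x, y)
print(r)   # close to 1.0 -> strong positive linear association
```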

Coefficient of Determination (R-Squared)
Simply speaking, for simple linear regression this is the square of the correlation coefficient, so its values range from 0 to 1. The R-squared value essentially tells us the proportion of the variance in the dependent variable that can be predicted from the independent variable. If our R-squared value is close to 1, we can conclude that our linear model is a good fit for the dataset.
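
For a simple linear regression you can just square r, or, if you already have predictions from a fitted model, use scikit-learn’s r2_score (the arrays below are illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

y_actual    = np.array([2.0, 4.1, 5.9, 8.3, 9.9])
y_predicted = np.array([2.1, 4.0, 6.0, 8.0, 10.0])   # from some fitted line

print(r2_score(y_actual, y_predicted))   # close to 1 -> good linear fit
```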

Residual Plots

A residual plot is where the residuals are shown on the vertical axis and the independent variable on the horizontal axis.
If we see the points randomly scattered around the horizontal axis, we can conclude a linear model is appropriate for the dataset.

Residual Analysis in Regression from Stattrek.com
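
A minimal residual-plot sketch with matplotlib and NumPy (made-up data, fitted with a least-squares line):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)    # observed minus predicted

plt.scatter(x, residuals)
plt.axhline(0, color="grey")               # horizontal axis for reference
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```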

Error Analysis
There are a couple of types of errors that we can calculate to determine the accuracy of our model’s predictions. Feel free to check out their formulas. They can look a little complicated, but you rarely work them out manually.
Mean Squared Error (MSE): This is the mean/average of the squared differences between our observed and predicted values. This is actually the value our regression model tries to minimise when determining the ‘best’ line of fit for our trend. Because the units of the MSE are the square of the units of our data, it is a little harder to interpret. Another interesting point worth noting is that the MSE penalises larger errors more than smaller errors.
Root Mean Squared Error (RMSE) is the standard deviation of the residuals, in other words, the square root of the MSE. The smaller the RMSE the higher the concentration of the data around the regression line. One advantage the RMSE has over the MSE is the units: RMSE has the same units as our data so it’s much easier to interpret.
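
Both are straightforward to compute by hand with NumPy (the observed and predicted values below are placeholders):

```python
import numpy as np

y_actual    = np.array([2.0, 4.1, 5.9, 8.3, 9.9])
y_predicted = np.array([2.1, 4.0, 6.0, 8.0, 10.0])

mse  = np.mean((y_actual - y_predicted) ** 2)   # in squared units of y
rmse = np.sqrt(mse)                             # in the same units as y
print(mse, rmse)
```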

I’m going to follow this up with pythonic ways of conducting linear regressions next week. I would recommend Statsmodel, Scikit Learn, or Pingouin if you would like to use Python to try out some linear regression. There are slight differences between these packages, but they’re all pretty simple to learn. Check out how they work in my notebook next week, and of course have a play yourself!
