An Introduction to Simple Linear Regression
A Brief Personal Introduction
Hi! My name is Kit 😀 I work at EDF, on their (excellent!) Data and Technology Graduate Scheme. Before this, however, I studied at the University of Glasgow, where after five years I obtained a first-class MSci in Statistics. I think statistics underpins almost everything that data professionals do, and in this series of blogs I hope to introduce you to a variety of important statistical concepts, to help make you a more informed and well-rounded data professional.
I am currently in the third placement of my scheme, working in the Volume Forecasting team within Wholesale Market Services. Specifically, I work in the Export part of the team, which oversees the forecasting of EDF’s generation assets, e.g. solar farms, wind farms, energy-from-waste sites, etc. Statistics is crucial to the team and underpins everything it does, as all of its forecasting models are fundamentally statistical models. Regression in particular is central to the team’s work, for example the Wind Regression Model. This is a model that uses a huge amount of data, such as near real-time weather forecasts of wind speed, air pressure and so on, to predict the amount of power that a wind site will generate. Although the model we use is far more complicated than what we will cover today, it is genuinely just a more elaborate version of the same thing, built on the exact same principles.
I won’t focus too much on the mathematical theory behind things, instead trying to emphasise practical examples, and including R code and the necessary data to illustrate things as I go. The data and code will be available for you to take and play around with yourself. Without any further ado, I present our first topic:
The Basics of Simple Linear Regression
Linear regression is one of the most powerful data modelling tools at your disposal, and one of the most widely used techniques. A huge number of statistical models stem from simple linear regression, and once you know the basics, you can build on them by learning more advanced and more powerful versions, such as Multiple Linear Regression, Generalised Linear Models (GLMs), Mixed Models, etc… (but more on those in a future blog!).
What does Simple Linear Regression aim to do? Simply put, in Simple Linear Regression you have a single variable that you are interested in modelling, and one other variable that you think may be useful for modelling it. Let us now define some terms so we can more easily refer to things later:
Response Variable (Y) — This is the variable we are interested in modelling, and the one we may want to predict values of. Imagine this as the variable on the Y axis of a graph. This is also known as the dependent variable, or the output.
Explanatory Variable (x) — This is the variable we think may be useful in predicting the value of our response variable. Imagine this as the variable on the X axis of a graph. This is also referred to as a predictor variable, independent variable, or covariate.
Fundamentally, almost all regression models (not just simple ones) can be expressed in the following form:

Y = f(x) + ε

where:
Y = The response variable we defined earlier.
x = The explanatory variable we defined earlier.
f(x) = Some function of the explanatory variable(s) x. In the case of Simple Linear Regression, this will be defined shortly.
ε = A so-called random error.
Imagine for one moment that there was no random error, and we just had the following:

Y = f(x)
This would imply that we could exactly predict the response variable Y using some function of our explanatory variable x. However, this sort of exactness and 100% confidence is reserved only for mathematics, not statistics. We introduce the random error ε to define a level of uncertainty in our model, and our subsequent predictions.
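To make the role of the random error a little more concrete, here is a quick simulated illustration (the numbers are entirely made up for demonstration purposes): without the error term every point sits exactly on the line, while adding random noise produces the kind of scatter we see in real data.
# A made-up illustration of the role of the random error
set.seed(1)
x <- 1:50
y_exact <- 2 + 3 * x                       # Y = f(x): every point lies exactly on the line
y_noisy <- 2 + 3 * x + rnorm(50, sd = 10)  # Y = f(x) + error: points scatter around the line
plot(x, y_noisy)
lines(x, y_exact)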
In the case of Simple Linear Regression, we consider a simple function f(x) and therefore a relatively simple model. We can express this in the following way:

Y = α + βx + ε

This is simply the equation of a straight line (plus the random error), where:
α = The y intercept
β = The gradient of the line
The other components were defined above.
I won’t go into the details here, but Simple Linear Regression is essentially a method for calculating the values of α and β, as well as estimating the random errors ε. In other words, Simple Linear Regression allows you to estimate these model parameters. How these values are calculated may be discussed in a subsequent blog, but if you are curious in the meantime, look up Ordinary Least Squares or, alternatively, the Maximum Likelihood approach.
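For the curious, the Ordinary Least Squares estimates can be written down directly from the data. The sketch below (the function name ols_estimates is just for illustration) assumes x and y are numeric vectors of equal length; you don’t need to follow it to use the method, as R does all of this for us, as we will see shortly.
# A minimal sketch of the Ordinary Least Squares calculation for simple linear regression
ols_estimates <- function(x, y) {
  beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # gradient
  alpha_hat <- mean(y) - beta_hat * mean(x)                               # intercept
  c(intercept = alpha_hat, gradient = beta_hat)
}
# e.g. ols_estimates(Orange$circumference, Orange$age) should match the lm() output later on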
A worked example with some R code
Let us illustrate all of this with a practical example. I have included all the R code necessary to perform this analysis and will show the relevant outputs.
The dataset we are using is one that is included within R, and so requires no effort to download and start investigating. The dataset is called ‘Orange’ and contains values for the age and circumference of orange trees. We are interested in whether the circumference of a tree can be used to predict the age. So, using the terminology defined earlier, we have:
Response Variable = Y = Age of the orange tree, in days
Explanatory Variable = x = Circumference of the tree, in millimetres
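Since Orange ships with R (it lives in the built-in datasets package), we can take a quick peek at the first few rows straight away, just as a sanity check:
# Take a quick look at the first few rows of the data
head(Orange)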
This is what our data looks like:
# Install packages, if required
# install.packages('tidyverse')
# Load in packages
library(tidyverse)
# Load in the data
data("Orange")
force(Orange)
# Plot the data
ggplot(data = Orange, mapping = aes(x = circumference, y = age)) +
  geom_point() +
  labs(title = 'Circumference and age of orange trees',
       x = 'Circumference (mm)',
       y = 'Age (days)') +
  theme_minimal()
What can we say initially about a potential relationship between these two variables? Well, circumference and age appear to be positively correlated, i.e. as the circumference of the tree increases, so too does the age of the tree. We can also note that this relationship appears approximately linear, i.e. you could reasonably plot a straight line of best fit through this data; the relationship doesn’t appear non-linear, e.g. exponential or quadratic. This is good because, in short, it means Simple Linear Regression appears an appropriate model choice for this data.
Now that we are satisfied that Simple Linear Regression is an appropriate choice for this data, we can create a model. As we said earlier, what is that essentially doing in this case? Crucially, it is fitting a line of best fit by calculating the appropriate model parameters (intercept, gradient, and random errors), with the equation of that line being of the form:

age = α + (β × circumference)
R makes it incredibly easy to calculate the values of α and β. With a few short lines of code, we can calculate these values and plot the resulting line on top of our data:
# Fit a linear model to age and circumference, and view the model parameter values
linearModel <- lm(age ~ circumference, data = Orange)
linearModel
# Plot the data with the regression line plotted on top
ggplot(data = Orange, mapping = aes(x = circumference, y = age)) +
  geom_point() +
  labs(title = 'Circum. & age of orange trees, w/ a regression line',
       x = 'Circumference (mm)',
       y = 'Age (days)') +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
How well does our line fit our data? It is not a perfect fit; there is a reasonable amount of spread of values around the line, particularly for higher values of circumference. However, as we mentioned earlier, a model will never fit the data perfectly, and relatively speaking this regression fits the data reasonably well.
We can mathematically define how strong this linear relationship is using something called the sample correlation coefficient, r. For simplicity’s sake I won’t define it formally here, but it can be described abstractly in the following way:
r = A measure of the linear association between x and Y. It has a value between -1 and 1, with 0 implying there is no correlation whatsoever between x and Y, -1 implying a perfect negative correlation, and 1 implying a perfect positive correlation.
Using some R code, we can calculate the value of r in this case to be 0.91. This is a very strong positive correlation between the circumference and age of an orange tree.
# Calculate the correlation coefficient
cor(x = Orange$circumference, y = Orange$age)
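(If you are curious where r comes from, it is the covariance of x and Y scaled by their two standard deviations, and we can reproduce the value above by hand as a quick check:)
# Equivalent calculation: covariance scaled by the two standard deviations
cov(Orange$circumference, Orange$age) /
  (sd(Orange$circumference) * sd(Orange$age))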
It is very important to note at this point, however, that correlation does not imply causation. Just because our variables are strongly correlated, we cannot say that one causes the other. The correlation may be caused by some other unknown third variable. (This is known as confounding, which you can look up to learn about in more detail). Much more complicated and detailed analysis is required to determine which variable causes an effect in another (one of the many reasons why medical trials are so long, complicated, and often not particularly clear or helpful).
Back to our model! We got the following model parameters (rounded to 1 decimal place):

α (intercept) = 16.6
β (gradient) = 7.8

Giving a model equation of:

age = 16.6 + (7.8 × circumference)
# Get the model parameters
linearModel$coefficients
How can we interpret the values of our linear model? We can say that a 1mm increase in the circumference of the tree corresponds to an increase in the age of the tree of around 7.8 days, on average. (Technically we also interpret the 16.6 intercept value as saying that a tree with a circumference of 0mm is, on average, 16.6 days old; however, we know this doesn’t really make sense. This inconsistency comes from the fact that our model was calculated on a finite range of age/circumference values, and can only be sensible over the range of data available.) This touches on a crucial point that I won’t go into in too much detail: essentially, we can only use our models for interpolation, not extrapolation. That is, we can only use our models for analysis and predictions within the range of data available; we cannot extend them any further (extrapolation), and any such attempts will be invalid.
Another important statistic to calculate and examine is R², the coefficient of determination. This measures how much of the variation in our response variable Y has been explained by the model. Again, I won’t go into the mathematical details here, but you can look up more information if you’re curious.
(Note: when looking at linear regression with more than one explanatory variable, i.e. non-simple linear regression, you would look at a value called adjusted R², but we just need R² here.) Again, using some R code to calculate this, we get that:
# Calculate the coefficient of determination
summary(linearModel)
summary(linearModel)$r.squared
In other words, roughly 83% of the variation in the age of orange trees can be explained by our model (using circumference). For context, R² values approaching 1 imply the model explains most of the variation in the data (a good fit), while values closer to 0 imply it explains very little (a poor fit). In this case, our model has done reasonably well.
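(A handy fact: for simple linear regression with a single explanatory variable, R² is exactly the square of the sample correlation coefficient r we calculated earlier, which we can verify directly:)
# For simple linear regression, R-squared equals the correlation coefficient squared
cor(x = Orange$circumference, y = Orange$age)^2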
Assumptions of the Model
The simple linear regression model we defined earlier relies on a certain set of assumptions. Formally, any prediction or other analysis from the model can only be valid if these assumptions are met. Let us define the assumptions of a Simple Linear Regression Model:
- The deterministic part of the model captures all the non-random structure in the data.
- The scale of the variability of the errors is constant at all values of the explanatory variables.
- The errors are independent.
- The errors are normally distributed.
- The values of the explanatory variables are recorded without error.
# Check assumptions of the model
plot(linearModel)
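(For the curious: these diagnostic plots are built from the model’s residuals, i.e. the estimated random errors, and its fitted values. Below is a minimal sketch of how you could pull these out and take a quick look yourself; what exactly to look for is left for a future blog.)
# Extract the residuals (estimated random errors) and the fitted values from the model
modelResiduals <- residuals(linearModel)
modelFitted <- fitted(linearModel)
# Residuals vs fitted values: a roughly random scatter around zero, with no obvious
# pattern or funnel shape, supports the first two assumptions
plot(modelFitted, modelResiduals, xlab = 'Fitted values', ylab = 'Residuals')
abline(h = 0, lty = 2)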
The exact details of these assumptions and how to check them are beyond the scope of this blog, and may be discussed in further detail in a future blog, but for now let us assume that all of the assumptions are satisfied, and do some prediction using our model.
Recall our model equation:

age = 16.6 + (7.8 × circumference)
We can use this to predict the age of an orange tree, given a circumference measurement. (Again, as I said earlier, we can only do this for circumference values within the range of our data, in this case between 30mm and 214mm.)
# Check range of data
dataRange <- c(min(Orange$circumference), max(Orange$circumference))
dataRange
Let’s say that we measured the circumference of an orange tree to be 100mm and wanted to predict the age of the tree. We would simply plug the 100mm value into our model in the following way:

age = 16.6 + (7.8 × 100)
Which approximately equals 798 days. Neat! 😀
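Rather than doing this by hand, we can also ask R for the prediction using the built-in predict() function. (This assumes the model was fitted with the formula-and-data form shown earlier, i.e. lm(age ~ circumference, data = Orange).)
# Predict the age of a tree with a circumference of 100mm using the fitted model
predict(linearModel, newdata = data.frame(circumference = 100))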
(Note: Whenever we predict a value like this, we always express some level of uncertainty in our prediction, usually in the form of a confidence interval. However, the details of how to calculate confidence intervals are beyond the scope of this blog and may be detailed in a future one).
Wrap Up
Hopefully this gives a useful introduction to statistical modelling, particularly linear regression, and in our case specifically simple linear regression. Not every detail was explained here, but this should give you a solid grounding in what is involved in regression.
Appendix (Code)
# Install packages, if required
# install.packages('tidyverse')
# Load in packages
library(tidyverse)
# Load in the data
data("Orange")
force(Orange)
# Plot the data
ggplot(data = Orange, mapping = aes(x = circumference, y = age)) +
  geom_point() +
  labs(title = 'Circumference and age of orange trees',
       x = 'Circumference (mm)',
       y = 'Age (days)') +
  theme_minimal()
# Fit a linear model to age and circumference, and view the model parameter values
linearModel <- lm(age ~ circumference, data = Orange)
linearModel
# Plot the data with the regression line plotted on top
ggplot(data = Orange, mapping = aes(x = circumference, y = age)) +
  geom_point() +
  labs(title = 'Circum. & age of orange trees, w/ a regression line',
       x = 'Circumference (mm)',
       y = 'Age (days)') +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
# Calculate the correlation coefficient
cor(x = Orange$circumference, y = Orange$age)
# Get the model parameters
linearModel$coefficients
# Calculate the coefficient of determination
summary(linearModel)
summary(linearModel)$r.squared
# Check assumptions of the model
plot(linearModel)
# Check range of data
dataRange <- c(min(Orange$circumference), max(Orange$circumference))
dataRange