Understanding Multivariate Linear Regression with Python and Football

Part II of Linear Regression to Reinforcement Learning Football Mastery

B1ll
7 min read · Oct 31, 2023

A Side Note on Cost Functions

Linear regression model performance for the 2021–2022 Premier League season data.

Above is the data from the 2021–2022 Premier League season that we used in the last article, but this time I’ve included the residuals. The residuals are simply the differences between the actual output and the predicted output for each input data point. Our goal when modelling anything is to minimise the error between the actual and predicted values. For a linear regression model, this essentially means translating or rotating the line of best fit until the total error across all residuals is minimised. We square these residuals in the line-of-best-fit calculation for three main reasons:

  1. To stop positive and negative errors cancelling each other out.
  2. To make the larger errors stand out.
  3. It makes it easier to optimise and find the minimum of the cost function (see below for more).
Example Cost Function when Using Linear Regression.

Above is the cost function for modelling the 2021–2022 Goals Scored vs Games Won relationship with linear regression — essentially, this plot shows what plugging in every possible value for a and b does to the total error. Hopefully, what this plot makes clear is that with least squares regression the cost function is convex, meaning there’s a clear, single minimum point that can be found through calculus.
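As a rough illustration (this isn’t the exact code behind the plot above), you can build that error surface yourself by sweeping a grid of candidate a and b values and recording the sum of squared residuals for each pair. The goal and win numbers below are placeholders, not the real season data:

# Sketch only: sweep a grid of (a, b) values and record the sum of squared residuals.
import numpy as np

# Placeholder numbers, not the actual 2021-2022 season data.
goals_scored = np.array([94, 92, 74, 69, 57, 60])   # X: goals scored
games_won = np.array([29, 28, 22, 21, 16, 16])      # Y: games won

a_values = np.linspace(0, 1, 200)      # candidate slopes
b_values = np.linspace(-20, 20, 200)   # candidate intercepts
A, B = np.meshgrid(a_values, b_values)

# Total squared error for every (a, b) pair — this is the surface in the plot.
cost = ((A[..., None] * goals_scored + B[..., None] - games_won) ** 2).sum(axis=-1)

# Convexity means there is a single minimum.
i, j = np.unravel_index(cost.argmin(), cost.shape)
print(f"best a ≈ {A[i, j]:.3f}, best b ≈ {B[i, j]:.3f}")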

To be clear: Just because you can optimise the cost function does not mean you have a good model — an optimised linear regression model just means that you’ve found the lowest error possible whilst modelling using Y = aX+b.

Okay… but what happens when Y isn’t just influenced by one thing? Welcome to multivariate regression.

What is Multivariate Linear Regression?

Multivariate linear regression is a statistical method used to model the relationship between a dependent variable Y and multiple independent variables (X1, X2, …, Xn) by fitting a linear equation to the observed data.

The difference between simple linear regression and multivariate regression is just that multivariate regression handles more independent variables. It’s very unlikely that a single independent variable is wholly responsible for the variation in a dependent variable (e.g. how healthy you are doesn’t depend solely on how many chocolate buttons you eat; it also depends on how much exercise you do, whether you smoke, etc.).

Multivariate regression allows us to capture more complex relationships by considering multiple input variables, or features.

Okay, so how does it work?

The Multivariate Linear Regression Equation

The short answer is that multivariate regression works in a very similar way to simple linear regression.

In simple linear regression, the equation we use to model a relationship is Y = aX + b, where a is the coefficient and b is the intercept. In multivariate regression, the equation extends to Y = a1X1 + a2X2 + … + b, where a1, a2, … are coefficients for each independent variable.

The only difference is that instead of only having to calculate the cost function minima for two values (a and b), we now have to find the optimum coefficient for every variable we’re planning to use in the model plus the intercept.

That sounds more difficult to solve, right?

Well, yes and no. The principles are the same as in simple linear regression: we optimise the cost function by tweaking each coefficient to minimise the sum of squared residuals. The good news is that, as long as the dependent variable Y behaves linearly with respect to each independent variable, the cost function will, as above, be convex. As long as this assumption holds, there will always be a single global minimum that defines the coefficients for each variable (and the intercept) that optimise the cost function.
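To make that concrete, here’s a small sketch with made-up data showing that the same least-squares minimisation simply extends to several coefficients at once; because the cost is convex, the global minimum can even be found in closed form:

# Sketch with made-up data: least squares over several coefficients at once.
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))                        # three synthetic features
y_toy = 2.0 * X_toy[:, 0] - 1.5 * X_toy[:, 1] + 0.5 * X_toy[:, 2] + 3.0 \
        + rng.normal(scale=0.1, size=100)                # known coefficients plus noise

# Append a column of ones so the intercept b is fitted alongside a1, a2, a3.
design = np.column_stack([X_toy, np.ones(len(X_toy))])

# The convex cost function has a single global minimum, recovered here directly.
coefficients, *_ = np.linalg.lstsq(design, y_toy, rcond=None)
print(coefficients)   # ≈ [2.0, -1.5, 0.5, 3.0]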

Challenges with Working with Multiple Features

There are a number of challenges when working with multivariate regression, but I’ll keep this section short. The top three challenges, in my opinion, are:

  1. Multicollinearity: Multicollinearity occurs when independent variables are highly correlated with each other. This makes it difficult to determine the individual effect of each variable on the dependent variable. It can lead to unstable coefficient estimates and reduce the interpretability of the model (a quick diagnostic for this is sketched after this list).
  2. Dimensionality: As the number of independent variables increases, the dimensionality of the problem also increases. This can make visualisation, computation, and interpretation more challenging. As an example, I wouldn’t be able to plot the cost function for a multivariate regression problem — I don’t have enough axes!
  3. Sample Size: The sample size must be large enough to support the inclusion of multiple independent variables. With a small sample, the model may not provide reliable estimates.
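On the multicollinearity point, one common way to quantify it is the variance inflation factor (VIF). Below is a minimal sketch using statsmodels; it assumes your candidate features sit in a pandas DataFrame (like the X we build later), and the helper name vif_table is purely illustrative:

# Sketch: variance inflation factors as a multicollinearity check.
# Assumes the candidate features sit in a pandas DataFrame (like the X built later on).
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(features: pd.DataFrame) -> pd.DataFrame:
    values = features.values
    return pd.DataFrame({
        "feature": features.columns,
        "VIF": [variance_inflation_factor(values, i) for i in range(values.shape[1])],
    })

# A VIF well above ~5-10 suggests a feature is largely explained by the others:
# print(vif_table(X).sort_values("VIF", ascending=False))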

Practical Example — Can Premier League Match Stats Help Us Predict the Match Scores?

Again, I know this is a simple example and not a particularly useful one (yet), given that we wouldn’t know the final match stats before the final result, but we’ll build up the usability and complexity of the models as we cover further techniques. I’ll once again omit a lot of the code just to keep the article clean, but you can find the data and Python code here.

Below is the covariance plot for each Match Stat variable in the dataset. The plot lets us quickly examine the relationship between any two variables in the dataset, both for correlation and for collinearity. Correlation can be spotted by looking for squares with a strong red colour (positive correlation) or a strong blue colour (negative correlation). What is a bit trickier to spot is collinearity between features.

Covariance Plot for Match Stats of 2022–2023 Premier League Data.

As an example, a Shot on Target must, by definition, also be a Shot, and we can see that in the high correlation between the two variables. To check for potential collinearity, read across from the dependent variable (in our case Home Goals FT and Away Goals FT) and check whether two variables have similar colours. Closely related colours, like we see with H Shots on target and H Shots, are a good indicator that there is some collinearity we need to guard against.
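For reference, a plot like this can be generated in a few lines. This is a minimal sketch, assuming stats is the DataFrame of match stats used throughout this article and using seaborn’s defaults rather than the exact styling above:

# Sketch: generating a heatmap like the one above.
# Assumes `stats` is the DataFrame of match stats used throughout this article.
import matplotlib.pyplot as plt
import seaborn as sns

corr = stats.corr(numeric_only=True)            # pairwise correlations between the stats
sns.heatmap(corr, cmap="coolwarm", center=0)    # red = positive, blue = negative
plt.title("2022-2023 Premier League match stats")
plt.show()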

Training the Model

To train the model, we can rely on scikit-learn’s Linear Regression function. We will actually train two models so that we can compute both the home goals and away goals scored:

from sklearn.linear_model import LinearRegression

# `stats` is the DataFrame of match stats loaded earlier (loading code omitted for brevity).
# Separate the X independent variables/features from the dependent variables.
X = stats.drop(['Home Goals FT', 'Away Goals FT'], axis=1)
y_h = stats['Home Goals FT']
y_a = stats['Away Goals FT']

# Train the models with scikit-learn's LinearRegression().
model_h = LinearRegression()
model_h.fit(X, y_h)

model_a = LinearRegression()
model_a.fit(X, y_a)

It’s as simple as that.

Again, a cautionary tale: it’s easy to create a model, it’s harder to make a good model, and it’s almost impossible to debug a bad model if you don’t understand the fundamentals of what is happening beneath the surface.

Informal Prediction

Once the models have been fit to each dataset, we can use the models to predict future score lines.
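A minimal sketch of what that prediction step might look like, assuming X_new is a DataFrame of match stats for the unseen fixtures with the same columns as the training data:

# Sketch: predicting scorelines for unseen fixtures.
# Assumes X_new holds their match stats, with the same columns as the training X.
import numpy as np

pred_home = model_h.predict(X_new)
pred_away = model_a.predict(X_new)

# Goals are whole numbers, so round (and clip at zero) to get readable scorelines.
pred_home = np.clip(np.round(pred_home), 0, None).astype(int)
pred_away = np.clip(np.round(pred_away), 0, None).astype(int)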

Below are the actual and predicted home and away goals for the 10 most recent 2023–2024 season Premier League fixtures. By using the models on this unseen dataset, we can start to evaluate their performance. Across the 2023–2024 dataset, the models predicted 20.2% of scorelines and 64.65% of results correctly — I wouldn’t get excited about improving your Super6 performance just yet…
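Percentages like these can be computed in a few lines of NumPy. This is a sketch rather than the exact evaluation code, and it assumes y_h_new and y_a_new hold the actual home and away goals for the same unseen fixtures:

# Sketch: scoring the predictions from the previous snippet.
# Assumes y_h_new and y_a_new hold the actual goals for the same unseen fixtures.
import numpy as np

correct_score = np.mean((pred_home == y_h_new) & (pred_away == y_a_new))

# A "correct result" only needs the sign of the goal difference (home win / draw / away win).
correct_result = np.mean(np.sign(pred_home - pred_away) == np.sign(y_h_new - y_a_new))
print(f"Correct scorelines: {correct_score:.1%}, correct results: {correct_result:.1%}")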

Actual vs Predicted Home Goals and Away Goals for the 10 Most Recent Premier League Matches (As of Writing)

Formal Performance Measures

This is all well and good, but how do we tell whether the model is behaving as expected or not? scikit-learn offers two built-in metrics that allow us to compare the performance of models:

  1. Mean Squared Error (MSE): This measures the average of the squared differences between the predicted values and the actual values (the residuals). Smaller MSE values mean that the model’s predictions are closer to the actual values.
  2. R-squared (R2) Score: R2 measures the proportion of the variance in the dependent variable (target) that is explained by the independent variables (features) in the model. R2 typically ranges from 0 to 1, where 0 indicates that the model explains none of the variance and 1 means it perfectly explains the variance (it can even be negative for a model that fits worse than simply predicting the mean). It is a measure of goodness of fit.

Both MSE and R2 should be used as relative measures for comparing different models. And while lower MSE and higher R2 values are generally preferred, neither metric on its own provides a full assessment of a model’s performance.
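As a quick sketch, both metrics are available in sklearn.metrics; here they’re applied to the home-goals model on the unseen fixtures from the earlier snippets (again assuming X_new and y_h_new exist):

# Sketch: MSE and R2 for the home-goals model on the unseen fixtures from earlier.
from sklearn.metrics import mean_squared_error, r2_score

predictions = model_h.predict(X_new)
print("MSE:", mean_squared_error(y_h_new, predictions))
print("R2: ", r2_score(y_h_new, predictions))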

How to Improve the Model

The first thing to investigate in improving our model is whether we have utilised the existing data to its full potential.

Using the metrics above, we could investigate whether removing some of the independent variables improves the performance of the model. By removing variables that appear to be collinear with more dominant ones, the coefficients fitted for the remaining variables may yield a better-performing model.
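A sketch of what that might look like in practice — the dropped column here (H Shots) is purely illustrative; in reality you’d drop whichever feature the covariance plot or a VIF check flags:

# Sketch: drop a feature that looks collinear and see whether R2 holds up.
# "H Shots" is illustrative — drop whatever your covariance plot or VIF check flags.
X_reduced = X.drop(columns=["H Shots"])

model_h_reduced = LinearRegression()
model_h_reduced.fit(X_reduced, y_h)

print("Full model R2:   ", model_h.score(X, y_h))
print("Reduced model R2:", model_h_reduced.score(X_reduced, y_h))

Ideally the comparison would be made on held-out data rather than the training set, but the principle is the same.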

Once we are confident that the current data has been optimised, it’s time to explore our other options.

Naturally, obtaining more data is an appealing option, either by adding additional features to the dataset or by engineering new ones from the existing feature set. This, however, must be done with care so as to avoid introducing non-linear relationships — it’s a difficult balance between enriching the dataset and maintaining the integrity of the model.
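As a rough illustration of feature engineering, here’s a sketch that derives a hypothetical new feature from two columns mentioned earlier; the new column name is made up, and you’d still want to check it relates linearly to the goals scored before adding it to the model:

# Sketch: engineering a hypothetical new feature from two existing columns.
import numpy as np

# Guard against dividing by zero when a team registers no shots.
shots = stats["H Shots"].replace(0, np.nan)
stats["H Shot Accuracy"] = (stats["H Shots on target"] / shots).fillna(0)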

Conclusion

Multivariate linear regression is a powerful statistical method that has allowed us to model the complex interplay of factors that influence outcomes, such as Premier League match scores and results.

As we’ve seen, the ability to capture the relationships between multiple independent variables and a dependent variable opens up exciting possibilities. However, we’ve also encountered some of the challenges that come with it, such as multicollinearity and dimensionality complexity as the feature set grows.

Next we’ll explore how to handle non-linear relationships using techniques like polynomial regression.

Until tomorrow :)
