Predicting Football Matches: Part 1

Aug 20, 2022

Plus data wrangling and some data visualisations.

I recently tried my hand at a different type of sports analytics project. Rather than using free event-level sports data, I worked with some fake football league data (the UK Super League) that I came across. It contained match results for a full season, future fixtures, match odds and more. My aim was to interrogate the data to answer some questions about the previous season and make predictions about the upcoming season.

Context

The first season of the UK Super League has just finished, with the second season approaching. Our aim is to analyse the results of the first season, then make predictions for the second season.

Each team retains the same squad for the second season. This is also an American style league for simplicity, with no relegation or promotion from a lower division.

Our aims:

Find which team wins the league in season 1
Find when in the season this team secured the league title
Find the biggest upset in season 1
Predict the outcomes of matches in season 2 and in turn the league
Create a visualisation showing the likelihood of each team’s finishing position in season 2

Exploratory Data Analysis (EDA)

As with all good data analysis, it is important that we understand the data we have: its shape, format, contents and basic statistical properties.

To do this, I will use some of the standard pandas EDA functions on the most important dataset, called “results”:

The .describe() function shows that 1512 matches were played in season 1, that the home and away teams scored on average 1.7 and 1.3 goals per game respectively, and that they had on average 14.97 and 11.75 shots per game respectively. We would expect the home team to have an advantage and to generally perform better than the away team, so it is a good sign that the data confirms this.

The .info() function shows us our data’s columns, the data type of each column and the number of non-null values in each. Each column has 1512 non-null values, matching the 1512 rows, so we know there are no missing values.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1512 entries, 0 to 1511
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   SeasonID    1512 non-null   int64
 1   Gameweek    1512 non-null   int64
 2   MatchID     1512 non-null   int64
 3   HomeTeamID  1512 non-null   int64
 4   HomeScore   1512 non-null   int64
 5   HomeShots   1512 non-null   int64
 6   AwayTeamID  1512 non-null   int64
 7   AwayScore   1512 non-null   int64
 8   AwayShots   1512 non-null   int64
dtypes: int64(9)
memory usage: 106.4 KB
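The two calls described above can be tried on a small stand-in table. This is only a sketch: the four rows below are invented for illustration and merely copy the column layout of the real “results” data, so the numbers will not match the figures quoted in this article.

```python
import pandas as pd

# Toy stand-in for the "results" table (all values invented for illustration)
results = pd.DataFrame({
    "SeasonID":   [1, 1, 1, 1],
    "Gameweek":   [1, 1, 2, 2],
    "MatchID":    [1, 2, 3, 4],
    "HomeTeamID": [1, 3, 2, 4],
    "HomeScore":  [2, 1, 0, 3],
    "HomeShots":  [15, 12, 9, 17],
    "AwayTeamID": [2, 4, 1, 3],
    "AwayScore":  [1, 1, 2, 0],
    "AwayShots":  [10, 11, 14, 7],
})

# describe() returns count, mean, std, min, quartiles and max per numeric column
summary = results.describe()
print(summary.loc[["count", "mean"],
                  ["HomeScore", "AwayScore", "HomeShots", "AwayShots"]])

# info() prints column names, non-null counts and dtypes
results.info()

# A direct missing-value count per column, complementing info()
missing = results.isna().sum()
print(missing)
```

The explicit isna() count is an extra check I have added here; it says the same thing as the non-null counts but is easier to assert on in a script.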

The .corr() function, called on the specified columns, shows us whether there is any correlation between them. Here we selected only the columns likely to give interesting correlation-based insights (e.g. MatchID and Gameweek wouldn’t tell us much and would add noise to our correlation matrix).

Nothing unexpected here: shots are positively correlated with the corresponding score, for both home and away teams.
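As a rough sketch of that step, the snippet below restricts a toy table to the score and shot columns before calling .corr(). The rows are invented, so the coefficients are illustrative only; pandas uses Pearson correlation by default.

```python
import pandas as pd

# Toy rows using the same column names as the "results" data (values invented)
results = pd.DataFrame({
    "Gameweek":  [1, 1, 2, 2],
    "MatchID":   [1, 2, 3, 4],
    "HomeScore": [2, 1, 0, 3],
    "HomeShots": [15, 12, 9, 17],
    "AwayScore": [1, 1, 2, 0],
    "AwayShots": [10, 11, 14, 7],
})

# Keep only the columns where a correlation is meaningful;
# identifiers such as MatchID and Gameweek are left out
cols = ["HomeScore", "HomeShots", "AwayScore", "AwayShots"]
corr_matrix = results[cols].corr()  # Pearson correlation by default

print(corr_matrix.round(2))
```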

Next, let’s take a quick peek at the first few rows of our data, so we’re familiar with the format:

results.head()

Finally, let’s check the teams we are dealing with in this league:
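One way to inspect the teams, assuming the team data is a table with TeamID and TeamName columns (the column names and every team name other than Manchester are placeholders of mine, not taken from the dataset):

```python
import pandas as pd

# Hypothetical team lookup table; names other than Manchester are placeholders
teams = pd.DataFrame({
    "TeamID":   [1, 2, 3, 4],
    "TeamName": ["Manchester", "Team B", "Team C", "Team D"],
})

# Toy results rows that reference the team IDs above
results = pd.DataFrame({
    "HomeTeamID": [1, 3, 2, 4],
    "AwayTeamID": [2, 4, 1, 3],
})

n_teams = teams["TeamName"].nunique()         # number of distinct teams
team_names = sorted(teams["TeamName"].unique())
print(n_teams, team_names)

# Sanity check: every ID used in the results should exist in the lookup table
used_ids = set(results["HomeTeamID"]) | set(results["AwayTeamID"])
unknown_ids = used_ids - set(teams["TeamID"])
print(unknown_ids)
```

An empty set for unknown_ids means the later merge on team IDs will not silently produce missing names.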

Who won the league in Season 1?

There are a number of ways to do this, but I decided to create a table that all football fans will understand: a league table. This will list all our teams in order, based on the number of points they had at the end of the season.

To do this, I have to use the season 1 results data (containing the goals scored by the home and away team in each match) and the team ID data (to figure out which teams played in each match). I could have worked with just the team IDs, but team names are more intuitive and easier to manipulate than random IDs.

The process I followed was to merge the team names onto the results data using the team IDs, then allocate points to each team in each match based on which side scored more goals (or whether the scores were level).
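A minimal sketch of that process is shown below. It assumes the usual 3 points for a win, 1 for a draw and 0 for a loss, and a team lookup table with TeamID and TeamName columns; both are assumptions on my part, and the rows are invented.

```python
import numpy as np
import pandas as pd

# Toy data (values invented; team names other than Manchester are placeholders)
results = pd.DataFrame({
    "MatchID":    [1, 2, 3, 4],
    "HomeTeamID": [1, 3, 2, 4],
    "HomeScore":  [2, 1, 0, 3],
    "AwayTeamID": [2, 4, 1, 3],
    "AwayScore":  [1, 1, 2, 0],
})
teams = pd.DataFrame({
    "TeamID":   [1, 2, 3, 4],
    "TeamName": ["Manchester", "Team B", "Team C", "Team D"],
})

# Attach the team name once for the home side and once for the away side
home_names = teams.rename(columns={"TeamID": "HomeTeamID", "TeamName": "HomeTeamName"})
away_names = teams.rename(columns={"TeamID": "AwayTeamID", "TeamName": "AwayTeamName"})
matches = (results
           .merge(home_names, on="HomeTeamID", how="left")
           .merge(away_names, on="AwayTeamID", how="left"))

# Assumed scoring scheme: win = 3, draw = 1, loss = 0
home_win = matches["HomeScore"] > matches["AwayScore"]
away_win = matches["HomeScore"] < matches["AwayScore"]
draw = matches["HomeScore"] == matches["AwayScore"]
matches["Home_Points"] = np.select([home_win, draw], [3, 1], default=0)
matches["Away_Points"] = np.select([away_win, draw], [3, 1], default=0)

print(matches[["HomeTeamName", "HomeScore", "AwayScore", "AwayTeamName",
               "Home_Points", "Away_Points"]])
```

np.select keeps the three outcomes in one vectorised expression, which avoids a row-by-row apply.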

To get the total points for each team, I grouped and summed the above dataframe by HomeTeamName and AwayTeamName and renamed the columns for the home and away dataframes.
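The grouping step might look roughly like this. The column names Home_GF, Home_GA, Away_GF and Away_GA are my own labels for goals for and against, and the rows are invented.

```python
import pandas as pd

# Toy per-match table with names and points already attached (values invented)
matches = pd.DataFrame({
    "HomeTeamName": ["Manchester", "Team C", "Team B", "Team D"],
    "AwayTeamName": ["Team B", "Team D", "Manchester", "Team C"],
    "HomeScore":    [2, 1, 0, 3],
    "AwayScore":    [1, 1, 2, 0],
    "Home_Points":  [3, 1, 0, 3],
    "Away_Points":  [0, 1, 3, 0],
})

# Home totals: one row per team, summed over all of its home matches
home = (matches
        .groupby("HomeTeamName", as_index=False)[["Home_Points", "HomeScore", "AwayScore"]]
        .sum()
        .rename(columns={"HomeTeamName": "Team",
                         "HomeScore": "Home_GF",    # goals scored at home
                         "AwayScore": "Home_GA"}))  # goals conceded at home

# Away totals: same idea, grouped by the away side
away = (matches
        .groupby("AwayTeamName", as_index=False)[["Away_Points", "AwayScore", "HomeScore"]]
        .sum()
        .rename(columns={"AwayTeamName": "Team",
                         "AwayScore": "Away_GF",    # goals scored away
                         "HomeScore": "Away_GA"}))  # goals conceded away

print(home)
print(away)
```

Renaming both key columns to Team gives the two frames a shared column to join on later.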

Merging these two dataframes on Team gives us the total information for each team across the season. All we need to do is total the Home_Points and Away_Points and add columns for matches played, goals for and against and goal difference. Sorting the dataframe by Points then gives us our league table:
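Putting that together, a sketch of the final table build could look like the following. The inputs are invented home and away summaries, the Played, GF, GA and GD labels are mine, and breaking ties on goal difference and then goals scored is an assumption rather than a stated rule of this league.

```python
import pandas as pd

# Toy home and away summaries, one row per team (values invented)
home = pd.DataFrame({
    "Team":        ["Manchester", "Team B", "Team C", "Team D"],
    "Home_Played": [1, 1, 1, 1],
    "Home_Points": [3, 0, 1, 3],
    "Home_GF":     [2, 0, 1, 3],
    "Home_GA":     [1, 2, 1, 0],
})
away = pd.DataFrame({
    "Team":        ["Manchester", "Team B", "Team C", "Team D"],
    "Away_Played": [1, 1, 1, 1],
    "Away_Points": [3, 0, 0, 1],
    "Away_GF":     [2, 1, 0, 1],
    "Away_GA":     [0, 2, 3, 1],
})

# One row per team with both home and away totals
table = home.merge(away, on="Team")

table["Played"] = table["Home_Played"] + table["Away_Played"]
table["Points"] = table["Home_Points"] + table["Away_Points"]
table["GF"] = table["Home_GF"] + table["Away_GF"]  # goals for
table["GA"] = table["Home_GA"] + table["Away_GA"]  # goals against
table["GD"] = table["GF"] - table["GA"]            # goal difference

# Sort by points; GD and GF as tie-breakers are an assumption
table = (table
         .sort_values(["Points", "GD", "GF"], ascending=False)
         .reset_index(drop=True))

print(table[["Team", "Played", "Points", "GF", "GA", "GD"]])
```

A quick consistency check on any table built this way is that the GD column sums to zero, since every goal scored is also a goal conceded.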

Conclusion: Who won the league in Season 1?

We can see that Manchester won the league by 13 points at the end of the 54-match season (those players must be extremely fit!), with only 25 points separating the top 4.

I will split this article into several parts, as it would be a lengthy read in its entirety. This part was mainly about getting under the hood of the data and setting the context for the rest of the analysis. I hope you enjoyed Part 1; follow for Part 2, coming soon!