Forecasting my beloved Arsenal's performance for the EPL 2021/22 season using Prophet

Vishnu Nandakumar
Published in Analytics Vidhya
7 min read · Jul 30, 2021

I am a massive Arsenal fan and, to be honest, kind of obsessed with them. Most of my free time goes into watching highlights of the wonderful football Arsenal played in the past. That said, they are not doing well right now, which is a worrying sign for the whole footballing world: in my opinion, Arsenal played the most entertaining football of the past decade, bettered only by Barcelona's prime generation. Since I love machine learning, I wondered how many points Arsenal would rack up in the upcoming 2021/22 Premier League season. So I built a forecasting model to predict the results and points Arsenal would get next season. Let us jump into the solution right away.

Data preparation:

Any model needs at least a minimum amount of data, so I used the web scraper package html-table-parser-python3, which extracts the tables from a web page you point it at. Please follow this example from GeeksforGeeks to learn more about it. In our case, I scraped only Arsenal's EPL results from the past three seasons (i.e. 2018–2021). Scraping a page returns every table present on it. For example, while scraping one of Arsenal's seasons I came across the following table.

The tables come back as a list of lists, which has to be transformed into data frames before any further processing. After retrieving the tables, we need some basic processing to turn them into a form the models can consume. Please find the processing steps I used below; you can add more if you want further features.
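To make the "list of lists to data frame" step concrete, here is a minimal sketch with pandas. The rows, column names and the helper `table_to_frame` are made up for illustration; the real scraped tables will have a different layout, so treat this as an assumption and not the exact processing I ran.

```python
import pandas as pd

# Made-up rows in the "list of lists" shape a table scraper returns:
# the first inner list is the header, the remaining ones are matches.
raw_table = [
    ["Date", "Opponent", "Venue", "Result", "Referee"],
    ["2021-01-16", "Team B", "A", "0-0", "Ref Y"],
    ["2021-01-09", "Team A", "H", "2-1", "Ref X"],
    ["2021-01-23", "Team C", "H", "1-3", "Ref X"],
]

def table_to_frame(table):
    """Turn one scraped table (header row + data rows) into a DataFrame."""
    header, *rows = table
    df = pd.DataFrame(rows, columns=[name.strip() for name in header])
    df["Date"] = pd.to_datetime(df["Date"])  # text -> real dates
    return df.sort_values("Date").reset_index(drop=True)

matches = table_to_frame(raw_table)
print(matches)
```

Sorting by date right away matters later, because the cumulative points only make sense when the matches are in chronological order.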

Code-snippet

The above code snippet adds features such as the referee, whether a team has home advantage, the score, etc. We could add other features like temperature, weather, each team's talisman player and holiday effects, as these would also help the modelling. Our final data frame looks like the following. I have created one more variable, “PI”, which expresses the relative probability of Arsenal winning a given game. The derivation of that feature is in the following notebook. Basically, it derives a numerical value for each opponent based on the points they racked up in previous seasons relative to Arsenal.
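As a rough illustration of this kind of feature engineering, the sketch below derives a home flag and the points per match, plus one possible PI-style value. Everything here is an assumption: the data is made up, the score is assumed to list Arsenal's goals first, and the PI formula (Arsenal's share of the combined previous-season points) is only a stand-in, since the actual derivation lives in the notebook.

```python
import pandas as pd

# Made-up matches; "Result" is assumed to list Arsenal's goals first.
matches = pd.DataFrame({
    "Opponent": ["Team A", "Team B", "Team C"],
    "Venue": ["H", "A", "H"],
    "Result": ["2-1", "0-0", "1-3"],
})

def points_from_result(result):
    """3 points for a win, 1 for a draw, 0 for a loss."""
    goals_for, goals_against = (int(g) for g in result.split("-"))
    if goals_for > goals_against:
        return 3
    if goals_for == goals_against:
        return 1
    return 0

matches["home"] = (matches["Venue"] == "H").astype(int)  # 1 = home, 0 = away
matches["points"] = matches["Result"].map(points_from_result)

# Hypothetical PI: Arsenal's share of the combined previous-season points.
# Both the numbers and the formula are placeholders, not the notebook's method.
arsenal_prev_points = 60
opponent_prev_points = {"Team A": 40, "Team B": 60, "Team C": 90}
matches["PI"] = matches["Opponent"].map(
    lambda team: arsenal_prev_points
    / (arsenal_prev_points + opponent_prev_points[team])
)
print(matches)
```

With this stand-in, a PI above 0.5 marks an opponent that collected fewer points than Arsenal in the earlier seasons, and a PI below 0.5 marks a stronger one.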

Basic EDA:

Now that the data is preprocessed, let’s do a simple EDA on the dataset to learn how Arsenal have performed under certain conditions.

  • Home advantage:

As expected for any team in any sport, Arsenal have gained more points when playing at home.

  • Influence of Referees:

Games officiated by referees for Arsenal

Most games are officiated by these six referees: Michael Oliver, Atkinson, Anthony Taylor, J. Moss, M. Dean and P. Tierney. They are bound to have a lot of say in how the games go for each team.

Points secured when games are officiated by different referees

It is fascinating, and strange given their history, that Arsenal have secured more points in the last three seasons when officiated by Mike Dean. The tables have turned in recent years; maybe Mike Dean really disliked Wenger after all.
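For anyone who wants to reproduce the two breakdowns above, here is a minimal sketch with pandas. The data frame and its column names (`home`, `referee`, `points`) are made up for illustration, so the numbers say nothing about the real seasons.

```python
import pandas as pd

# Made-up matches: home flag, referee name and points won.
matches = pd.DataFrame({
    "home": [1, 0, 1, 0, 1, 0],
    "referee": ["Ref X", "Ref Y", "Ref X", "Ref X", "Ref Y", "Ref Z"],
    "points": [3, 0, 1, 3, 3, 1],
})

# Home advantage: games played, total points and points per game by venue.
points_by_venue = matches.groupby("home")["points"].agg(
    games="count", total_points="sum", points_per_game="mean"
)

# Referees: the same summary per referee, busiest referee first.
points_by_referee = (
    matches.groupby("referee")["points"]
    .agg(games="count", total_points="sum", points_per_game="mean")
    .sort_values("games", ascending=False)
)

print(points_by_venue)
print(points_by_referee)
```

Looking at points per game next to the totals is worthwhile, since a referee who simply officiated more games will collect more total points regardless of how those games went.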

Modelling

Since I am treating this as a forecasting problem, let's try fbprophet, Facebook's open-source library for modelling time series. First, let's briefly understand what a time series is: in simple words, any series that depends on time as a variable. Basically, a time series is composed of three components:

  • Trend: The way a time series moves with respect to time, i.e. how its level changes (the long-term component, free of cyclic fluctuations)
  • Seasonality: The cyclic nature of the time series, i.e. how it repeats itself weekly, monthly, yearly etc. (the short-term component with cyclic or periodic fluctuations)
  • Residuals: What remains once the trend and seasonality are removed. Trend and seasonality together describe an ideal version of the series; the residuals are what distinguish the real series from that ideal (the irregular fluctuations)

Time series mostly come in two kinds: additive and multiplicative. If the seasonal component is simply added to the trend, the series is additive; if the size of the seasonal swings grows or shrinks with the level of the trend, it is most likely multiplicative.
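The difference is easier to see with numbers. The sketch below builds one additive and one multiplicative series from the same synthetic trend and cycle (all values are invented for illustration) and compares the size of the seasonal swing early and late in the series.

```python
import numpy as np

t = np.arange(120)                   # 120 time steps
trend = 10 + 0.5 * t                 # steadily rising level
cycle = np.sin(2 * np.pi * t / 12)   # repeating pattern with period 12

additive = trend + 3 * cycle                # swing stays the same size
multiplicative = trend * (1 + 0.2 * cycle)  # swing scales with the level

def swing(series, start, stop):
    """Peak-to-trough size of the seasonal part within one window."""
    detrended = series[start:stop] - trend[start:stop]
    return detrended.max() - detrended.min()

print("additive:      ", swing(additive, 0, 12), swing(additive, 108, 120))
print("multiplicative:", swing(multiplicative, 0, 12), swing(multiplicative, 108, 120))
```

In the additive series the first and last cycles swing by the same amount, while in the multiplicative one the last cycle swings several times wider because the trend has grown.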

As Prophet expects the input data in a certain format, we have to process the data accordingly. Since the data holds only the points for each individual match, let's compute the cumulative points over the course of each season separately. Finally, the training data looks as below.
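A minimal sketch of that step, assuming a data frame with `date`, `season` and `points` columns (the column names and values are made up): the cumulative sum restarts at every season, and the result is renamed to the `ds` and `y` columns Prophet looks for.

```python
import pandas as pd

# Made-up points per match across two seasons.
matches = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-04", "2020-01-11", "2020-01-18", "2020-09-05", "2020-09-12"]
    ),
    "season": ["2019/20", "2019/20", "2019/20", "2020/21", "2020/21"],
    "points": [3, 1, 0, 1, 3],
})

matches = matches.sort_values("date")
# The running total restarts at zero for every season.
matches["cum_points"] = matches.groupby("season")["points"].cumsum()

# Prophet expects the date column to be called "ds" and the target "y".
train = matches.rename(columns={"date": "ds", "cum_points": "y"})[["ds", "y"]]
print(train)
```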

As we can see, the series has multiplicative seasonality because the seasonal terms change with, or are highly dependent on, the trend. Since Prophet requires only the date and value columns, we can use just those two unless we want to add some regressors.

Prophet takes two variables as input: the date column (named ds) and the values to be trained on and forecasted (named y). We can modify parameters such as the seasonality, the regressors, and the constraints on how the seasonality and trend terms are fitted.

  • Let’s first model the data without any regressors, using only the cumulative points for each season as the dependent variable. The training data is as given below; upon fitting the model we get the following result.
  • Using information on whether the game is played at home or away.
  • Adding information on referees and home advantage.
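The three setups above differ only in which regressors are passed to the model, so they can share one helper. This is a sketch under assumptions, not my exact notebook code: `fit_and_forecast` is a hypothetical helper, the data is made up, and the multiplicative seasonality mode simply follows the observation made earlier about the series.

```python
import pandas as pd

def fit_and_forecast(train, future, regressors=()):
    """Fit Prophet on `train` (ds, y, plus regressor columns) and predict `future`."""
    # Imported here because the package name changed: it is `prophet`
    # from release 1.0 on and `fbprophet` in older releases.
    try:
        from prophet import Prophet
    except ImportError:
        from fbprophet import Prophet

    model = Prophet(seasonality_mode="multiplicative")
    for name in regressors:
        model.add_regressor(name)  # each regressor must be a column in both frames
    model.fit(train)
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]

# Made-up training data: cumulative points plus a home/away flag.
train = pd.DataFrame({
    "ds": pd.date_range("2020-09-12", periods=10, freq="7D"),
    "y": [3, 6, 6, 7, 10, 10, 11, 14, 17, 17],
    "home": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Future fixtures: dates and regressor values are known, y is what we want.
future = pd.DataFrame({
    "ds": pd.date_range("2020-11-21", periods=3, freq="7D"),
    "home": [0, 1, 0],
})
```

Calling `fit_and_forecast(train[["ds", "y"]], future[["ds"]])` corresponds to the first setup, and `fit_and_forecast(train, future, regressors=["home"])` to the second; referee information would be added as further 0/1 columns in the same way. Ten rows are far too few for a meaningful fit, so real use needs the full three seasons.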

As the above plots and predictions show, the regressors influence the training a lot; in this case, each added regressor helped the model fit better. In the first case, without any regressors, the model seems to have fitted only the previously seen values of the series, while in the third case I chose the top five referees who officiated the most Arsenal games in the past three seasons, along with where those games were played. Another way to improve time series forecasting is the rolling method, where we train the model on windows of data; this method considers only the recent history and ignores data from way back in the past. You can check my other post on the topic, given below.

Just to give an idea of how rolling forecasting can improve the results of models like Prophet, I fitted the model on the first two seasons of the dataset and forecasted the third season. You can see the results below:

Forecasted in a single go
Rolling forecast with a 7-day window, i.e. retraining every week

In the first image, I fitted the model on the first two seasons and forecasted the entire third season, while in the second case I trained incrementally, adding 7 days to the training dataset each time and forecasting the subsequent 7 days. You can find more on this logic in the article I mentioned above. The catch is that although reducing the window size tends to improve the results, it will not always be the ideal setup for models that learn from the series' own history. The amount of training data required is ideally proportional to the prediction range.
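The incremental scheme described above boils down to a loop over growing training windows. Here is a minimal sketch of that loop, with made-up data and a hypothetical helper `expanding_windows`; a naive "last seen value" forecast stands in for the Prophet fit so the windowing logic can be read on its own.

```python
import pandas as pd

def expanding_windows(df, first_cutoff, last_date, step_days=7):
    """Yield (train, test): all rows before the cutoff, then the next `step_days` days."""
    step = pd.Timedelta(days=step_days)
    cutoff = pd.Timestamp(first_cutoff)
    last_date = pd.Timestamp(last_date)
    while cutoff <= last_date:
        train = df[df["ds"] < cutoff]
        test = df[(df["ds"] >= cutoff) & (df["ds"] < cutoff + step)]
        if not test.empty:           # skip weeks without a match
            yield train, test
        cutoff += step               # the training set grows by one window

# Made-up cumulative points, one match per week.
df = pd.DataFrame({
    "ds": pd.date_range("2021-01-02", periods=8, freq="7D"),
    "y": [3, 4, 7, 7, 8, 11, 14, 14],
})

results = []
for train, test in expanding_windows(df, "2021-01-30", df["ds"].max()):
    # Stand-in for fitting a model on `train` and predicting `test`.
    naive_forecast = train["y"].iloc[-1]
    results.append((len(train), len(test), naive_forecast))

print(results)
```

Each pass refits on everything seen so far, so the cost grows with the number of windows; that is the price paid for forecasts that only ever look one week ahead.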

Last but not least, I forecasted the points Arsenal would get in the 2021–22 season; the input variables are the date, the home/away fixture and the PI value. The results are as follows:

It seems Arsenal could end up with 65 points on the final matchday. This could be really near to or way off from the actual value, but including more features, such as who played more games, weather conditions and the time of the game, might give even better results. I will keep working on this and try to improve the model to see if I can forecast closer to the actual values that will be seen in May 2022. Right now, my prediction is that they will end up with 65 points.

You can find the relevant notebooks here:

Buddies, I highly appreciate the time and support you have given me through this. Stay safe and take care. Until next time, bye.

Vishnu Nandakumar
Analytics Vidhya

Machine Learning Engineer, Cloud Computing (AWS), Arsenal Fan. Have a look at my page: https://bit.ly/m/vishnunandakumar