Forecasting Football Results - Part 1: Web Scraping

Matheus Mello
8 min read · Nov 19, 2022


Python project exploring football results and betting odds

Data solutions have taken hold in many tasks across society, as data modeling becomes a powerful resource for handling randomness. In this series of posts, we aim to build a solution for predicting football outcomes from data collected across the web.

Main goal

First of all, we need to set our project's main goal, which will be the core of our work. As I write, the FIFA World Cup is approaching, and it's common for people to gather and bet on the matches and tournament results, hoping to win a prize or just to have fun. This behavior is a global phenomenon, and sports betting is a billion-dollar industry that attracts millions of people seeking to explore the odds.

In this context arises the main goal of this project: to build a tool that helps us make a profit through betting strategies in the football betting market. With our objective set, we can start structuring the steps that will lead us towards it.

Project setup

OK, our final objective is set, but for now it's a vague and distant goal. Doing some reverse engineering might be a good idea to clear up our path. The first thing that comes to mind is to build an application that lets us simulate betting strategies based on some kind of prediction. As you may imagine, our framework for predicting outcomes will be a machine learning or deep learning model, since we want to use past information to make predictions about future events. That being said, we need to collect the input data somewhere on the web, which is the main topic of this first article.

Before going any further into the web scraping, let’s summarise the steps we’ve just described.

  1. Collecting data
  2. Feature engineering and modelling
  3. Building a betting strategy simulator

Finally, this project is still in development as I write, and its main purpose is to serve as a study report for me, so please treat it as such and feel free to make suggestions or comment on my work. You can access the code repo here. That's it. Let's dive into the data collection.

Data sources

There are a few different data sources I found while searching the web, and each was chosen for a specific purpose. I'll present them and their peculiarities. The first dataset we want to build is the one we'll use to feed our model with past stats and occurrences in football matches. The source we're scraping is fbref, and you can preview the scraping result in the file sample.pkl in our repo. Take a look at the fbref home page structure:

FBRef Home Page

Next, we aim to build a time series dataset with the information available on this page across several dates, plus the information included in the match report links, which contain more specific stats about each match. A match report example is shown in the image below.

Fbref Match Report

Then, by taking a look at the website and its structure, it's clear that we don't need to handle JavaScript code, which would make our scraping task a little more complex, so we'll use BeautifulSoup from now on. We should now plan our scraping structure based on the info we need, since the scraper works linearly, catching the information we desire. The code is wrapped in a class 'scrapper', with all functionality implemented as its methods.
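
Roughly, the class looks like the skeleton below. This is only a sketch; apart from getMatches(), which is named in this article, the method names are illustrative.

```python
import requests
from bs4 import BeautifulSoup

class scrapper:
    """Linear fbref scraper: walk the dates, filter leagues, parse rows and reports."""

    def getSoup(self, url):
        # One place to fetch and parse any fbref page
        response = requests.get(url)
        response.raise_for_status()
        return BeautifulSoup(response.text, "html.parser")

    def getMatches(self, start_date, end_date):
        ...  # steps 1-4 described below live here
```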

So, let’s go trough the step by step I followed:

Old Matches Scraper

1. On the matches page, go to the specified date

This step and every one below is repeated for each day in the iteration range defined by the user. The function getMatches() takes a starting and an ending date, which set the boundaries within which the scraper operates.
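
As a minimal sketch of that daily loop (the URL pattern is my assumption about fbref's daily matches pages, and the function is shown standalone rather than as a class method for brevity):

```python
from datetime import date, timedelta

import requests
from bs4 import BeautifulSoup

# Assumed pattern for fbref's daily matches pages, e.g. /en/matches/2022-11-19
BASE_URL = "https://fbref.com/en/matches/{}"

def getMatches(start_date: date, end_date: date):
    """Yield a parsed matches page for every day between the two boundaries."""
    current = start_date
    while current <= end_date:
        response = requests.get(BASE_URL.format(current.isoformat()))
        response.raise_for_status()
        yield current, BeautifulSoup(response.text, "html.parser")
        current += timedelta(days=1)
```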

2. Get each championship table

Following the first step's example, the leagues variable can be supplied by the user, who chooses the leagues to scrape. We can also see a try-except clause in the code, which deals with structural errors, such as fake tables that can appear on the website.
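
A sketch of what that filtering might look like; the selectors and the caption format are assumptions about fbref's markup, not verified details:

```python
def getLeagueTables(soup, leagues):
    """Keep only the fixture tables of the leagues the user asked for."""
    tables = {}
    for wrapper in soup.select("div.table_wrapper"):
        try:
            # e.g. "Premier League Scores & Fixtures" -> "Premier League"
            caption = wrapper.find("caption").get_text()
            league = caption.split(" Scores")[0]
        except AttributeError:
            continue  # "fake" tables without a caption are skipped here
        if league in leagues:
            tables[league] = wrapper.find("table")
    return tables
```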

3. From each championship table, get info from each match row

Here, besides appending the information we want to our lists, I highlight the time.sleep call, used to throttle how many requests we make in a given window of time and avoid having our IP banned. Also worth noting is the storage of each match report link, which is contained in the score variable. By catching the link from the score cell instead of from the 'Match Report' cell, we avoid storing links to matches that were postponed or cancelled. This leads us to the next step:
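
A sketch of that row walk; the data-stat attribute names are assumptions about fbref's table markup:

```python
import time

def parseMatchRows(table, delay_seconds=3.0):
    """Collect each match's basics plus the report link hidden in the score cell."""
    matches = []
    for row in table.select("tbody tr"):
        score_cell = row.find("td", {"data-stat": "score"})
        link = score_cell.find("a") if score_cell else None
        if link is None:
            continue  # postponed/cancelled matches carry no score link
        matches.append({
            "home": row.find("td", {"data-stat": "home_team"}).get_text(strip=True),
            "away": row.find("td", {"data-stat": "away_team"}).get_text(strip=True),
            "score": link.get_text(strip=True),
            "report_url": "https://fbref.com" + link["href"],
        })
    time.sleep(delay_seconds)  # throttle before the next page request
    return matches
```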

4. Get to each match report and retrieve info

As you can see, this process is a little trickier, so here's a brief explanation. The yellow and red cards are counted by summing the number of card objects in the yellow and red categories. The other stats are taken by a two-step lookup (sketched in code right after the list):

  1. Checking whether the stat is in the dictionary of expected stats
  2. If it is, updating the dictionary with the values linked to that stat, which are the values immediately before and after the stat name
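
A minimal sketch of that lookup, assuming the report's team stats render as a flat sequence of cells with each team's value on either side of the stat label (the selectors and stat names are illustrative):

```python
def parseMatchReport(soup):
    """Pull card counts plus the home/away values around each expected stat label."""
    stats = {
        "yellow_cards": len(soup.select("div.cards span.yellow_card")),
        "red_cards": len(soup.select("div.cards span.red_card")),
    }
    expected = {"Possession": None, "Shots on Target": None, "Saves": None}
    cells = [el.get_text(strip=True) for el in soup.select("#team_stats th, #team_stats td")]
    for i, text in enumerate(cells):
        if text in expected:
            # the home value precedes the stat name and the away value follows it
            expected[text] = (cells[i - 1], cells[i + 1])
    stats.update(expected)
    return stats
```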

The eager reader might have realized that step 2 — getting each championship table — is not mandatory, but it gives us the flexibility to filter only matches from the leagues we want, which is the approach I took.

As an extra step, I realized the need for a checkpoint trigger: the scraper can face unexpected errors, or fbref could block our IP from making new requests, and such situations would mean a huge amount of lost time. So, on the first day of each month we save the scraper's work so far, and if some unexpected error occurs, we have a safe checkpoint to resume from.
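
Something as simple as the sketch below would do; the pickle format mirrors the sample.pkl in the repo, while the function and file names are illustrative:

```python
import pickle

def maybeCheckpoint(current_day, rows, path="checkpoint.pkl"):
    """Persist progress on the first day of each month, so a crash or an IP ban
    only costs us the days scraped since the last checkpoint."""
    if current_day.day == 1:
        with open(path, "wb") as f:
            pickle.dump(rows, f)
```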

And that’s pretty much it. In the bottom of the code below you can see the day update iteraroe and the actions needed to format the final data frame.

DataFrame Preview

This whole process allows us to scrape data to build a model that predicts football matches, but we still need data about upcoming matches, so we can do something useful with what we've already collected. The best source I found for this purpose was SofaScore, an app that also collects and stores information about matches and players; beyond that, it exposes the current Bet365 odds for each match.

SofaScore, in particular, relies on JavaScript, which means the full HTML is not available for us to use with BeautifulSoup. So we need another framework to scrape its information. I chose the widely used Selenium package, which lets us browse the web through Python code as if we were a human user. You can actually see the web driver clicking and navigating in the browser you choose — I chose Chrome.

In the image below you can see the SofaScore home page with the ongoing and upcoming matches, and on the right side you can see what happens when you click on a specific match and then on “LINEUPS”.

SofaScore interface

Since Selenium, as explained, works like a human user browsing the web, you might expect the process to be a little slower, and that is true. Hence, we have to take more care at each step so that we don't click a button that doesn't exist yet, because JavaScript content is only rendered after the user takes some action. For example, when we click on a particular match, the server takes some time to render the side menu we see in the second image, and if the code tries to click the lineup button in the meantime, an error is returned. Now, let's get to the code.

Upcoming Matches Scraper

1. Open the main page and activate the “show odds” button

As I mentioned, after starting the driver and navigating to SofaScore's URL, we need to wait until the odds button is rendered before clicking it. We also create lists in which to store the scraped info.
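
A sketch of that setup; the XPath for the odds toggle is an assumption, since SofaScore's markup changes often, so inspect the page for the real locator:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.sofascore.com/")

# Wait until the odds toggle is rendered and clickable before touching it
wait = WebDriverWait(driver, timeout=15)
odds_button = wait.until(
    EC.element_to_be_clickable((By.XPATH, "//*[contains(text(), 'Odds')]"))
)
odds_button.click()

# Lists that will accumulate the scraped information
home_teams, away_teams, start_times, odds = [], [], [], []
```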

2. Store the matches' main information

There's nothing special in here, but note that in line 8 we keep only matches that haven't started yet. I did this because handling ongoing matches would make the odds scraping much trickier, and it's not yet clear how the future betting simulator will work; it might not handle live results properly.

3. Getting lineups

DataFrame Preview

Summarising the chunk above: we wait until a match's side menu is loaded, click the lineup button, and get the players' names. We need to take some care, because each team captain's name is specially formatted on the website, so we created a helper function to handle it. We then store those player names in a dataframe for each match and, after the whole process, concatenate the matches info with the predicted lineups.
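
A sketch of this last chunk; the captain formatting (assumed here to be a "(C)" suffix) and the selectors are illustrative:

```python
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def cleanPlayerName(raw_name):
    """SofaScore decorates the captain's name; strip the decoration."""
    return raw_name.replace("(C)", "").strip()

def getLineups(driver, wait):
    """Wait for the side menu, open the lineups tab and read the player names."""
    wait.until(EC.element_to_be_clickable((By.XPATH, "//*[text()='LINEUPS']"))).click()
    players = [
        cleanPlayerName(el.text)
        for el in driver.find_elements(By.CSS_SELECTOR, "span.player-name")
    ]
    return pd.DataFrame({"player": players})
```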

Conclusion

So, that’s it for today. In this article we built two scraper tools that can collect past football matches information, and also forecoming matches. This is only the start of the project, once you can expect new articles about getting a dataset containing players info, the predictor modeling and finally the betting strategy simulator. Hope you enjoyed it!
