Web Scraping Football Matches [EPL] With Python

Obalanatosin
15 min read · Jun 6, 2022


This is my first web scraping project with Python. I’m going to scrape match results from the English Premier League. I’ll download the data using a Python library called requests, then parse it with Beautiful Soup to extract what I need, and finally load everything into a pandas data frame so I can clean it up and prepare it for analysis.

The first thing I’m going to do is figure out how to get the html of a page that shows the EPL standings, and I’ll use the requests library to do so. I’ll import it and then set the url that I’m going to start scraping from.

I’m going to download this page by typing requests.get(standings_url). This sends a request to the server and downloads the html for the page. After downloading, I can look at the html by entering data.text, which gives me a very long and difficult-to-parse string of html. What a web browser does is convert that html string into the graphical elements we can see and comprehend.
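The two steps above can be sketched as follows. The url is fbref.com’s Premier League stats page, as used throughout this article; I’ve wrapped the download in a small helper and left the actual call commented out so the snippet runs without network access.

```python
import requests

# EPL standings page on fbref.com (the starting url for this project)
standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"

def download_html(url):
    """Send a GET request and return the raw html string of the page."""
    return requests.get(url).text

# data_text = download_html(standings_url)  # a very long, hard-to-read html string
```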

I’m going to have a look at the page and see what I want to scrape. The information I need is all in the first table, called the league table, which displays every team in the league along with their standings. I’m not interested in the rest of the table; all I want are the urls for each squad, so I can go to each squad’s page and grab the match log from there.

For example, if I click on Manchester United, I can view all of their matches from this season as well as their numbers. I can even click on other specific stats such as shooting statistics from those matches. What I want is to get the urls for each of these squad pages from the Premier League Stats page.

Scores and Fixtures
Shooting Stats

One really nice way to learn how to parse html is to use the inspector. If you right-click on an element and select Inspect, it will show you the html of the page and let you see which html tags correspond to which graphical elements. So I’m going to open up this td element; the element I want is actually the a tag, which in html means anchor, and it holds the actual link to the Manchester United page.

I want to get the href property from the a tag, so I’ll take all of these a tags for Arsenal, Liverpool, and so on, and then pull out the urls.

Parsing HTML Links with Beautiful Soup

To parse the html, I’ll use a library called Beautiful Soup, which is excellent for the job. After I import it, the next step is to initialize it with the html, so I’ll call the BeautifulSoup class and pass in data.text, the html that I downloaded.

So now that I’ve initialized the soup object, I need to give it something to select from the web page. If you look at the html, you’ll notice that all of the data we want is contained within this table tag, which has the class stats_table. So the first thing I’ll do is select this table, and then I’ll select all of the a tags, anchor tags, and actual links that I want inside the table.

To select the table, I’ll type soup.select(&quot;table.stats_table&quot;). This is called a css selector: you type the name of the tag, which is table, then a dot, then the class name, which is stats_table. That selects every table element on the page with the class stats_table, and since I only want the first one, I’ll index it at zero.
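Here is that selector in action on a tiny stand-in for the downloaded page (the real html is far larger; the squad hrefs below are just illustrative):

```python
from bs4 import BeautifulSoup

# A miniature stand-in for the standings page html
html = """
<table class="stats_table">
  <tr><td><a href="/en/squads/19538871/Manchester-United-Stats">Manchester United</a></td></tr>
  <tr><td><a href="/en/squads/18bb7c10/Arsenal-Stats">Arsenal</a></td></tr>
</table>
<table class="other_table"><tr><td>not wanted</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# css selector: tag name, a dot, then the class name; [0] keeps only the first match
standings_table = soup.select("table.stats_table")[0]
print(standings_table.get("class"))  # ['stats_table']
```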

I’ve basically reduced a lot of the superfluous html and now we only have the html for the table; there’s still a lot and it’s still a little overwhelming to read through, but I’ve managed to narrow it down somewhat.

The next step is to find all of the a tags within the table, which I can do with links = standings_table.find_all(&quot;a&quot;). I’m using find_all rather than select because find_all just finds tags, so it’s a little easier to use here.

I want to obtain the href property of each link, so I’ll write a list comprehension: links = [l.get(&quot;href&quot;) for l in links]. This goes through each of the a elements and pulls out the value of its href property. Back on the web page, inside the table element I can see all of the individual rows, which are simply tr elements; what I care about are the a elements inside them, and it’s this href property I’m extracting.

Now that I have the hrefs for each link, the next step is to filter them so that I only keep squad links, which takes another list comprehension: links = [l for l in links if '/squads/' in l]. This checks whether '/squads/' appears in the link and drops the link if it doesn’t.

I observed that the links only contain the last part of the url, not the beginning with the domain, so I’ll turn them into full urls using something called a format string in Python: [f&quot;https://fbref.com{l}&quot; for l in links]. This prepends the string https://fbref.com to each link.

Extract Match Stats using Pandas and Requests

To retrieve the stats I want, I’ll use one of these team urls, such as Manchester United’s. I’ll assign it to team_url and then use requests to fetch the html from that address. Typing data.text again shows a very long and difficult-to-understand html string.

Looking at the Manchester United stats page, there’s a lot of different information, most of which I don’t actually need. The part I’m interested in is the table called Scores &amp; Fixtures; inspecting it shows a table with the class stats_table where each row is one individual match. So what I need to do is grab this whole table out of the page and turn it into a pandas data frame.

Pandas provides an incredible approach for doing this. I’m going to convert that match table into a pandas data frame using the pandas read_html function: I pass in the html, which is data.text, along with a match argument for the scores and fixtures table. pandas.read_html reads all of the tables on the page, every table tag, and the match argument looks for a specific string within each table, so effectively I’m scanning all of the tables and keeping the one that contains this string.
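A small offline sketch of how match= works, using a two-table stand-in for the squad page (the real page has many more tables, and data.text would replace the sample string):

```python
from io import StringIO
import pandas as pd

# A miniature stand-in for the squad page html
html = """
<table><caption>Some Other Table</caption>
  <tr><th>X</th></tr><tr><td>1</td></tr></table>
<table><caption>Scores &amp; Fixtures</caption>
  <tr><th>Date</th><th>Result</th></tr>
  <tr><td>2022-05-22</td><td>W</td></tr>
</table>
"""

# read_html parses every table tag; match= keeps only tables containing that string
matches = pd.read_html(StringIO(html), match="Scores & Fixtures")
print(matches[0])
```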

When I look at the scores and fixtures table, I can see the string Scores &amp; Fixtures inside the table’s caption, so what essentially happens is that pandas finds this string, decides this is the scores and fixtures table, and grabs it.

Looking at matches to see what data is actually there, I find it’s a list, so I take the first element, and I end up with a very well-formatted pandas data frame.

Get Match Shooting Stats with Requests and Pandas

Now that I have the basic match stats, I still don’t have detailed stats about what happened during each game. I can get some of those on the shooting page: the number of shots, the number of shots on target, and other stats like free kicks and penalty kicks. So I want to grab the data from this page as well, and the first thing I’ll have to do is find the url of the shooting page from the scores and fixtures page.

In the inspector, I can see that this is just an a tag whose href has shooting in it, so I’m going to collect all the links and keep only the one with shooting in it, which I’ll use to scrape the shooting stats. I’ll do this in a very similar manner as before.

I’ll get the html for this link with requests.get, which downloads the html of the shooting stats page. I’m going to use a format string here because these are relative urls, so I need to add the fbref.com part at the beginning to actually be able to download the data. Looking at data.text, I see a very long string of html that I can’t parse manually but can parse with pandas.
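The filter-then-prepend step looks like this; the hrefs below are sample values patterned after fbref’s relative links, not scraped data:

```python
# hrefs pulled from a squad page (sample values, patterned after fbref urls)
links = [
    "/en/squads/19538871/Manchester-United-Stats",
    "/en/squads/19538871/2021-2022/matchlogs/all_comps/shooting/",
    "/en/squads/19538871/2021-2022/matchlogs/all_comps/passing/",
]

# keep only the link that contains the shooting match logs
shooting_links = [l for l in links if l and "all_comps/shooting" in l]

# the links are relative, so prepend the domain with an f-string
shooting_url = f"https://fbref.com{shooting_links[0]}"
print(shooting_url)
```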

Cleaning and Merging Scraped Data With Pandas

Looking at the first five rows using the head method, the only issue is the multi-level column index; the extra header level doesn’t do much in this situation, and in most cases a multi-level index isn’t required in pandas. You can tell there are two index levels because there are multiple rows in bold, implying multiple header rows. I’m going to remove the top index level, then look at shooting again; checking head once more, the extra index level has disappeared.
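Dropping the top header level can be sketched with a small frame that mimics the scraped shooting table’s two-level columns (the level names here are illustrative):

```python
import pandas as pd

# A small frame with a two-level column index, like the scraped shooting table
shooting = pd.DataFrame(
    [["2022-05-22", 15, 5]],
    columns=pd.MultiIndex.from_tuples(
        [("For Manchester United", "Date"),
         ("Standard", "Sh"),
         ("Standard", "SoT")]
    ),
)

# droplevel() removes the top header row, leaving plain column names
shooting.columns = shooting.columns.droplevel()
print(shooting.head())
```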

I have two distinct data frames: the shooting data frame and the match data frame. I can see that the first row in the shooting data frame has a date and time, and if I go back to the matches data frame, they all line up.
I’m going to use the pandas merge method to join these data frames and assign the result to a variable named team_data. I don’t want to bring over every column from shooting, however, because many of them, such as time, comp, round, and venue, are duplicated between the two data frames. I’ll only take a few columns from the shooting data frame, and I’ll do the merging on the date column. Looking at team_data.head(), I can see that both data frames have been combined: I’ve effectively taken the matches data frame and added a few new columns.
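The merge can be sketched with two small frames standing in for the scraped tables; the column names follow fbref’s conventions, but the values are made up:

```python
import pandas as pd

matches = pd.DataFrame({
    "Date": ["2022-05-22", "2022-05-17"],
    "Result": ["W", "L"],
    "GF": [3, 0],
})
shooting = pd.DataFrame({
    "Date": ["2022-05-22", "2022-05-17"],
    "Sh": [15, 9], "SoT": [5, 2], "Dist": [16.1, 18.3],
    "FK": [1, 0], "PK": [0, 0], "PKatt": [0, 0],
})

# Take only a few shooting columns and join on the shared Date column
team_data = matches.merge(
    shooting[["Date", "Sh", "SoT", "Dist", "FK", "PK", "PKatt"]], on="Date"
)
print(team_data.head())
```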

Checking the team data, there are 49 rows.
It’s useful to do this when you want to make sure the number of rows and columns match up between data frames.
Some matches existed in the matches data frame but not in the shooting data frame for whatever reason, which is fine; the merge simply dropped them from the data if they didn’t exist in both.

Scraping Data for Multiple Seasons and Teams with a Loop

I’ll need a for loop, but first I’ll make a list of the years I want to scrape: list(range(2022, 2020, -1)). The step of -1 means this starts with the current season and goes backwards to scrape past seasons.

I want to initialize a list called all matches, and when the loop is finished, this list will contain several data frames, each of which will contain the match logs for one team in one season, so I’ll end up with a bunch of little data frames that I’ll then combine into one big data frame once the loop is finished.

I need to specify the url where I want to begin, and this url will be the same as the one I used earlier.
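The loop setup described above amounts to three lines; the url is the same standings page as before:

```python
# Seasons to scrape: the step of -1 walks backwards from the current season
years = list(range(2022, 2020, -1))
print(years)  # [2022, 2021]

# Each team/season match log will be appended here as its own data frame
all_matches = []

# Starting point; each pass of the loop replaces this with the previous season's url
standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"
```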

I’m going to add another layer to this and scrape previous seasons as well. The way to do it is by hitting the Previous Season button, which brings up the data for that season in a similarly structured table that I can look at and scrape in very much the way I scraped the initial table. This lets me scrape multiple seasons simply by following the link to the previous season.

The first thing I’m going to do is get the standings_url html, which I’ve already done, and then parse it with Beautiful Soup. If you recall, I’m looking for the stats_table table, which contains the links to each team’s page with its individual match data. I’ll find all of the team links with find_all, grab each one’s href property, and then filter the links so that I only have squad links; remember, there were player and other links in addition to the squad links. Finally, I’ll convert them from relative links to absolute links, just like before, and that’s how I end up with these links to each squad’s stats page.

I’ll loop through each of the team urls and scrape the match logs for each team individually. I need to make sure I set the team name and year correctly, so team_name = team_url.split(&quot;/&quot;). Taking a look at a team url, you can see the name of the team, Manchester United, plus a bunch of extra stuff, so splitting on the slash gets rid of everything that comes before it.

I’ll add [-1] to take that last piece, which gives me Manchester-United-Stats. Next I need to replace -Stats with nothing, so I have Manchester-United.

Then, because I don’t want team names with dashes, I’ll replace the dash with a space, giving us Manchester United, and then I’ll have the team name.
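The whole team-name cleanup chains together like this, using Manchester United’s url as the example:

```python
team_url = "https://fbref.com/en/squads/19538871/Manchester-United-Stats"

# Split on '/' and take the last piece: 'Manchester-United-Stats'
team_name = team_url.split("/")[-1]
# Drop the '-Stats' suffix, then swap dashes for spaces
team_name = team_name.replace("-Stats", "").replace("-", " ")
print(team_name)  # Manchester United
```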

Then I’ll do what I did before to get the team url, which will allow me to get the scores and fixtures table, which I’ll then read into matches.

Next, I initialize Beautiful Soup and pull out all the links, which I use to get the link to the shooting stats page. So first I parse the scores and fixtures page, then pull out the all_comps shooting link, since that gives me the shooting stats; then I convert it to an absolute url and read in the shooting stats with pandas. There are two levels to the shooting stats index, so I drop the top index level.

Next, I’ll merge the shooting stats and match stats data frames. For some teams the shooting stats aren’t available, and when I try to merge the two together pandas raises an error because the shooting stats data frame is empty. It’s a specific kind of error called a ValueError, so in those cases I just skip the team: try to build team_data with the merge, and if pandas raises a ValueError, continue with the loop and don’t do anything else. I’m essentially skipping over any teams where the shooting stats aren’t available.
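A small sketch of the try/except pattern. To trigger the error offline, the second frame here has an incompatible Date dtype (int instead of string), which also makes pandas raise a ValueError on merge; it’s a synthetic stand-in for a team whose shooting stats are unusable:

```python
import pandas as pd

matches = pd.DataFrame({"Date": ["2022-05-22"], "Result": ["W"]})

shooting_frames = [
    pd.DataFrame({"Date": ["2022-05-22"], "Sh": [15]}),   # mergeable
    pd.DataFrame({"Date": [20220522], "Sh": [9]}),        # int Date vs string Date
]

all_matches = []
for shooting in shooting_frames:
    try:
        team_data = matches.merge(shooting, on="Date")
    except ValueError:
        # shooting stats unusable for this team: skip it and keep looping
        continue
    all_matches.append(team_data)

print(len(all_matches))
```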

The next step is to filter this; the matches are divided into different competition levels. I’m only interested in matches that took place in the Premier League, so I’ll just filter out everything else. After that, I’ll include season and team columns.

I’m going to be combining a lot of other tables for other teams, and we need a way to distinguish which team this was for and which season this was for, so that’s why I’m adding some extra columns here that show the team and season columns. This is something to keep in mind when web scraping because you want to make sure you preserve the kind of information that’s available on the page but isn’t necessarily available in the specific table that you’re scraping.

I have a list called all_matches, which will be a collection of data frames. I’ll add this team_data dataframe to that list, and then we’ll sleep for a second, doing nothing for one second. The reason I’m doing this is that many websites, including fbref, allow scraping but don’t want you to scrape too quickly because it can slow down their website and make it difficult for it to run effectively, so by slowing down how quickly I scrape, I’m ensuring we don’t get blocked from scraping the website.

I need to combine all of the individual data frames into a single data frame, so I’ll use the concat function in pandas, which takes a list of data frames as input and returns a single data frame. Then I like to make sure all of the column names are lowercase, and the final thing I want to do is write this to csv using the pandas to_csv method.
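These three finishing steps can be sketched with two tiny per-team frames standing in for the contents of all_matches (the filename is just an example):

```python
import pandas as pd

# Two small per-team frames standing in for the contents of all_matches
all_matches = [
    pd.DataFrame({"Date": ["2022-05-22"], "Result": ["W"], "Team": ["Manchester United"]}),
    pd.DataFrame({"Date": ["2022-05-22"], "Result": ["L"], "Team": ["Arsenal"]}),
]

# Stack the list of data frames into one big frame
match_df = pd.concat(all_matches)

# Lowercase the column names for consistency
match_df.columns = [c.lower() for c in match_df.columns]

# Write the combined data out to csv
match_df.to_csv("matches.csv")
```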

I need the previous season’s url. There’s a button that says Previous Season; if I right-click it and inspect, I see an anchor (a) tag with the class prev.

So what I can do is grab the url from this anchor tag, which will give me the previous season’s url, which I can scrape.

soup.select(&quot;a.prev&quot;)[0].get(&quot;href&quot;) — this selects anchor tags with the class prev; I take the first one, since select returns a list, then get its href property and use a format string to convert it to an absolute url. Every time the loop runs, I get the previous season’s standings url and scrape data for that season, ensuring that I can scrape data from multiple seasons into a single data frame.
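That selector in isolation, run against a stand-in snippet for the Previous Season button (the href is patterned after fbref’s season urls):

```python
from bs4 import BeautifulSoup

# A stand-in snippet for the standings page's Previous Season button
html = '<a class="prev" href="/en/comps/9/2021-2022/2021-2022-Premier-League-Stats">Previous Season</a>'
soup = BeautifulSoup(html, "html.parser")

# Select anchor tags with class 'prev', take the first, read its href
previous_season = soup.select("a.prev")[0].get("href")
standings_url = f"https://fbref.com{previous_season}"
print(standings_url)
```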

Final Match Results and DataFrame

I got a total of 1520 rows and 27 columns (1520, 27).

Thank you for taking the time to read this.

Github Link

LinkedIn Link

