How to scrape and personalize data from FBref with Python: A guide to unlocking Football Insights

Ricardo André
4 min read · Apr 6, 2023

Football enthusiasts around the world are always on the hunt for new ways to explore and analyze the game. Fortunately, FBref provides a wealth of freely accessible data that can be scraped and analyzed to unlock valuable insights.

In this article, we will demonstrate how to scrape league and player statistics tables from FBref using Pandas (a Python library). By leveraging HTML tables, we’ll show you how to extract and customize this data to better understand the game and players.

First of all, we need to identify which FBref table we want to scrape. Let’s try scraping the player data for the Top-5 leagues. If you explore FBref, you’ll find this data at this link: https://fbref.com/en/comps/Big5/gca/players/Big-5-European-Leagues-Stats

[Image: the FBref table]

Let’s start the code. To extract data from an HTML table using Python, we can use the “Pandas” library. First, we import the library and then pass the URL of the page to the pd.read_html() function. Note that pd.read_html() relies on an HTML parser such as lxml being installed.

# libraries
import pandas as pd

# fbref table link
url_df = 'https://fbref.com/en/comps/Big5/gca/players/Big-5-European-Leagues-Stats#stats_gca'

df = pd.read_html(url_df)
df

As shown below, pd.read_html() returns a list of dataframes, one for each table it finds on the page, so the raw output can be difficult to work with and interpret.

[Image: output of df]

Let’s now extract the table at index [0] and voilà.

df = pd.read_html(url_df)[0]
df.head()
[Image: df and df.info()]
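To see why the index [0] is needed, here is a minimal offline sketch. The HTML snippet, table ids, and values are made up for illustration, and it assumes an HTML parser such as lxml is installed. It shows that pd.read_html() returns one dataframe per table, and that the attrs argument is another way to pick a specific table instead of relying on its position.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables (ids and values invented for this sketch)
html = """
<table id="league_table">
  <thead><tr><th>Squad</th><th>Pts</th></tr></thead>
  <tbody><tr><td>Team A</td><td>70</td></tr></tbody>
</table>
<table id="stats_gca">
  <thead><tr><th>Player</th><th>SCA</th></tr></thead>
  <tbody>
    <tr><td>Player X</td><td>120</td></tr>
    <tr><td>Player Y</td><td>95</td></tr>
  </tbody>
</table>
"""

# read_html always returns a list, even when the page holds a single table
tables = pd.read_html(StringIO(html))
print(type(tables), len(tables))

# Select one table by its HTML id instead of by position in the list
gca = pd.read_html(StringIO(html), attrs={"id": "stats_gca"})[0]
print(gca)
```

Selecting by id can be a bit more robust than a fixed index if the page layout changes, but whether a given FBref table exposes a stable id is something to check on the page itself.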

Our primary objective has been achieved: we have successfully scraped the data. However, the resulting dataframe has multi-index (two-level) column headers. Our next task is to flatten these headers into a single level, create new header names, and slightly modify the dataframe.

# flatten the multi-index headers by joining both levels into a single name
df.columns = [' '.join(col).strip() for col in df.columns]

df = df.reset_index(drop=True)
df.head()
[Image: df and df.info()]

Look at df.info() and how the structure of the dataframe is different now.
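If you want to see the flattening step in isolation, here is a small sketch on a toy dataframe. The column names and values are invented to resemble a two-level header, so treat them as assumptions rather than the exact FBref layout.

```python
import pandas as pd

# Toy two-level header: the first level groups columns, the second names them
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Player"),
    ("SCA", "SCA"),
    ("SCA", "SCA90"),
    ("GCA", "GCA"),
])
toy = pd.DataFrame([[1, "Player X", 100, 3.5, 10]], columns=columns)
levels_before = toy.columns.nlevels

# Each column label is a tuple such as ("SCA", "SCA90"); join it into one string
toy.columns = [" ".join(col).strip() for col in toy.columns]
levels_after = toy.columns.nlevels

print(levels_before, levels_after)
print(list(toy.columns))
```

Columns that had no group name in the original header come out with a prefix like 'Unnamed: 0_level_0', which is why some renaming is still needed afterwards.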

Let’s now rename some columns. The columns whose top-level header was unnamed contain ‘level_0’ in their name, so for those we keep only the last word.

# creating a list with new names
new_columns = []
for col in df.columns:
    if 'level_0' in col:
        new_col = col.split()[-1]  # takes the last name
    else:
        new_col = col
    new_columns.append(new_col)

# rename columns
df.columns = new_columns
df = df.fillna(0)

df.head()
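One optional extra check at this stage: long HTML tables sometimes repeat their header row partway through, and pd.read_html() keeps such rows as ordinary data, which also leaves the numeric columns stored as text. The sketch below assumes the scraped table behaves this way and uses invented toy values. It drops the repeated header rows, converts the numeric columns, and only then fills the gaps with 0.

```python
import pandas as pd

# Toy data where the third row is a repeated header (assumed behaviour)
toy = pd.DataFrame({
    "Rk": ["1", "2", "Rk", "3"],
    "Player": ["Player X", "Player Y", "Player", "Player Z"],
    "SCA": ["100", "80", "SCA", None],
})

# Keep only real player rows: a repeated header has 'Rk' in the 'Rk' column
toy = toy[toy["Rk"] != "Rk"].reset_index(drop=True)

# Convert text to numbers; values that cannot be parsed become NaN
for col in ["Rk", "SCA"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Fill missing values only in the numeric column, not in text columns
toy["SCA"] = toy["SCA"].fillna(0)

print(toy)
print(toy.dtypes)
```

Restricting fillna(0) to numeric columns avoids putting a 0 into text columns such as the player name or nation.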

If you take a look at the DataFrame at this point, you may notice that the ‘Age’ column appears to be in an unusual format.

Additionally, the ‘Pos’ (position) column sometimes contains multiple positions for players who are able to play efficiently in more than one position. To address these issues, I will split the ‘Pos’ column into two separate columns, one for the primary position and another for a secondary position if one exists.

I also think that the format of the ‘Nation’ and ‘Comp’ (competition) columns could be improved. Therefore, I will make some changes to the DataFrame to address these issues and make it more readable.

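# keep only the first two characters of 'Age'
df['Age'] = df['Age'].str[:2]
# characters from the fourth onwards hold the secondary position, if any
df['Position_2'] = df['Pos'].str[3:]
# the first two characters hold the primary position
df['Position'] = df['Pos'].str[:2]
# keep the second token of 'Nation'
df['Nation'] = df['Nation'].str.split(' ').str.get(1)
# rebuild the league name from the second and third tokens of 'Comp'
df['League'] = df['Comp'].str.split(' ').str.get(1)
df['League_'] = df['Comp'].str.split(' ').str.get(2)
df['League'] = df['League'] + ' ' + df['League_']
# drop the helper column and the columns we no longer need
df = df.drop(columns=['League_', 'Comp', 'Rk', 'Pos', 'Matches'])

# replace the position codes with full names
df['Position'] = df['Position'].replace({'MF': 'Midfielder', 'DF': 'Defender', 'FW': 'Forward', 'GK': 'Goalkeeper'})
df['Position_2'] = df['Position_2'].replace({'MF': 'Midfielder', 'DF': 'Defender',
                                             'FW': 'Forward', 'GK': 'Goalkeeper'})
# one-word league names have no third token, so the concatenation above gives NaN;
# Bundesliga is the only one-word name among the five leagues
df['League'] = df['League'].fillna('Bundesliga')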
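As an alternative to slicing by fixed character positions, the same clean-up can be sketched by splitting on separators. The toy values below assume formats like '22-150' for age, 'FW,MF' for position, 'pt POR' for nation and 'eng Premier League' for competition. These are assumptions made for illustration, so check them against your own scraped data.

```python
import pandas as pd

# Toy rows with assumed value formats
toy = pd.DataFrame({
    "Age": ["22-150", "31-004"],
    "Pos": ["FW,MF", "GK"],
    "Nation": ["pt POR", "de GER"],
    "Comp": ["eng Premier League", "de Bundesliga"],
})

# Age: keep the part before the dash and store it as an integer
toy["Age"] = toy["Age"].str.split("-").str.get(0).astype(int)

# Position: split on the comma; a missing second position becomes NaN
positions = toy["Pos"].str.split(",")
names = {"MF": "Midfielder", "DF": "Defender", "FW": "Forward", "GK": "Goalkeeper"}
toy["Position"] = positions.str.get(0).map(names)
toy["Position_2"] = positions.str.get(1).map(names)

# Nation: keep the last token
toy["Nation"] = toy["Nation"].str.split(" ").str.get(-1)

# League: split once on the first space and keep everything after it,
# so one-word and multi-word league names are handled the same way
toy["League"] = toy["Comp"].str.split(" ", n=1).str.get(1)

toy = toy.drop(columns=["Pos", "Comp"])
print(toy)
```

Splitting 'Comp' only once means no special case is needed for league names of different lengths.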

Finally, we have arrived at our cleaned and improved dataset, which is now ready to be used for creating valuable football insights and analysis.

Did you like the final result? Will it be useful for you? Hope so :))

I really hope you enjoyed reading this and found it useful! :)

If you enjoyed my article and feel like giving a little something back, feel free to give it a round of claps 👏 or buy me a virtual coffee :), link here.

Your support is greatly appreciated!

Follow me and my content on:
Linkedin
Tableau
Portfolio page
Football analytics Twitter
Football analytics Instagram

Ricardo André

Mathematician, Data Scientist with huge passion for football.