Hacking football: build a comprehensive player metrics dataset for $0

Belkacem Berchiche
4 min read · May 25, 2023


The first step to building football software is inevitably to get up-to-date, structured football data. Whether you’re building a football analytics app, an automated betting tipster or an Ultimate Team helper tool, data about team performances, individual player performances, physicality, fitness record, etc. is the foundation you build upon.

With the rising importance of data in the beautiful game, there is thankfully now an abundance of commercial data you can buy and build upon. However, if you’re an indie hacker who can’t afford it or just want to experiment and play around with some data, you still have some free options.

I ran into this problem while building my FPL solver, HAUL-9000, and I’d like to share my solution with you.

The analytical soup: combining multiple sources

For my Fantasy Premier League solving app, I could thankfully rely on the official FPL API for data like goals scored, assists, availability, expected goals and so on. It is a sound foundation to build upon. However, I still needed physical data (height, weight, age, estimated speed, strength) and player performance data.

For the physical data, I went with the FIFA video game dataset, which you can grab from https://futdb.app/. For player performance, I built a custom scraper for the FotMob website.

Now, all I had to do was to merge the external data with the existing FPL data. Pretty straightforward, right?

What the fuzz?

N’Golo Kanté or N’Golo Comté?

That’s when I ran into my first big problem. Each data source uses a different player identification system. In an ideal world, every player would carry a universal identifier, such as a FIFA ID code, but we do not live in such a world. Another option would have been to use the club and the player’s shirt number as identification, but shirt numbers can change during the season, and very few data providers expose them.

My only option therefore was to merge using player names, but we all know how tricky and dangerous that can be. If your player is named “Heung Min Son” in FIFA and just “Son” on FotMob, an exact string comparison will never match the two.
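Part of the mismatch is purely cosmetic: accents, casing, hyphens and typographic apostrophes. As a minimal sketch (standard library only; the `normalise_name` helper is my own hypothetical name, not part of any dataset's tooling), you could clean names up before comparing them:

```python
import unicodedata


def normalise_name(name: str) -> str:
    """Lowercase a name, strip accents and unify apostrophes and hyphens."""
    # Replace the typographic apostrophe with a plain one.
    name = name.replace("\u2019", "'")
    # Decompose accented characters (é becomes e + combining accent), then drop the accents.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat hyphens as spaces, collapse repeated whitespace and lowercase.
    return " ".join(without_accents.replace("-", " ").lower().split())


print(normalise_name("N\u2019Golo Kanté"))  # n'golo kante
print(normalise_name("Heung-Min Son"))      # heung min son
```

This narrows the gap, but it does nothing for cases like “Son” versus “Heung Min Son”, which need a more forgiving comparison.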

That’s when I found thefuzz, a Python library for fuzzy string matching that scores how similar two strings are on a scale from 0 to 100. I would go through FPL players by name, find the closest matching names in the FIFA and FotMob datasets and tie them all together.

In order to do that, I would first need to implement a function that finds the best match for a given name from a list of player names:

from typing import List, Optional

from thefuzz import process


def find_best_matching_name(name: str, name_list: List[str]) -> Optional[str]:
    # extractOne returns a (match, score) tuple, or None if nothing reaches the cutoff
    result = process.extractOne(name, name_list, score_cutoff=90)
    return result[0] if result else None

For my specific needs, I found that a cutoff of 90 (out of 100) yielded the best results, but depending on your data, you’ll have to go through a trial-and-error phase to find the threshold that works best for you.
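One way to make that trial-and-error phase less of a guessing game is to hand-check a small sample of names and measure how many are matched correctly at each cutoff. The sketch below is only an illustration: the labelled sample is hypothetical, and `simple_score` is a stand-in built on the standard library's `difflib` so the example runs without extra dependencies. In practice you could pass one of thefuzz's scorers, such as `fuzz.WRatio`, as the `scorer` argument instead.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional


def simple_score(a: str, b: str) -> float:
    """Stand-in similarity score between 0 and 100 (not thefuzz's scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name: str, candidates: List[str], cutoff: float,
               scorer: Callable[[str, str], float] = simple_score) -> Optional[str]:
    """Return the highest-scoring candidate, or None if it scores below the cutoff."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: scorer(name, c))
    return best if scorer(name, best) >= cutoff else None


def accuracy_by_cutoff(labelled: Dict[str, Optional[str]], candidates: List[str],
                       cutoffs: List[float],
                       scorer: Callable[[str, str], float] = simple_score) -> Dict[float, float]:
    """Share of labelled names that are matched correctly at each cutoff."""
    return {
        cutoff: sum(best_match(name, candidates, cutoff, scorer) == expected
                    for name, expected in labelled.items()) / len(labelled)
        for cutoff in cutoffs
    }


# Hypothetical hand-checked sample: source name -> correct target name (None = no counterpart).
labelled = {
    "Mohamed Salah": "Mohamed Salah",
    "Bruno Fernandes": "Bruno Miguel Borges Fernandes",
    "John Smith": None,
}
candidates = ["Mohamed Salah", "Bruno Miguel Borges Fernandes", "Harry Kane"]

for cutoff, accuracy in accuracy_by_cutoff(labelled, candidates, [50, 70, 90]).items():
    print(cutoff, round(accuracy, 2))
```

A cutoff that is too low produces wrong matches, while one that is too high drops correct ones, so the sample should contain both kinds of case.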

With the matching function implemented, all that was left to do was to create a “dictionary” that maps player names across the three sources. I like to use the pandas library for data manipulation, so here’s what my implementation looks like:

import pandas as pd

async def main():
    # FPL data (get_fpl_players_data is a helper not shown here; assumed to return a dataframe)
    fpl_df = get_fpl_players_data()
    # .copy() avoids pandas' SettingWithCopyWarning when adding columns below
    fpl_df = fpl_df[['player_name']].copy()
    fpl_df.columns = ['fpl_name']

    # FotMob data
    fotmob_df = pd.read_csv('data/fotmob.csv')
    fotmob_names = fotmob_df.player.values.tolist()
    fpl_df['fotmob_name'] = fpl_df.fpl_name.apply(
        lambda x: find_best_matching_name(x, fotmob_names))

    # FIFA data
    fifa_df = pd.read_csv('data/futdb_players.csv')
    fifa_df = fifa_df.drop_duplicates(subset='name', keep='last')
    fifa_df = fifa_df[['name']]
    fifa_df.columns = ['fifa_name']
    fifa_names = fifa_df.fifa_name.values.tolist()
    fpl_df['fifa_name'] = fpl_df.fpl_name.apply(
        lambda x: find_best_matching_name(x, fifa_names))

    # One row per FPL player, with the matched name from each source
    return fpl_df

After running this code, you’ll get a dataframe with one row per FPL player and three columns: fpl_name, fotmob_name and fifa_name.
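Because find_best_matching_name returns None when no candidate reaches the cutoff, some cells in this dataframe may be empty. Here is a rough sketch of how you might measure match coverage and list the players that need a manual look. The rows are made up for illustration, and `mapping_df` is just my name for the resulting dataframe.

```python
import pandas as pd

# Made-up mapping rows for illustration; None marks a name with no match above the cutoff.
mapping_df = pd.DataFrame({
    "fpl_name": ["Mohamed Salah", "Son Heung-min", "Erling Haaland"],
    "fotmob_name": ["Mohamed Salah", None, "Erling Haaland"],
    "fifa_name": ["Mohamed Salah", "Heung Min Son", None],
})

# Share of FPL players that found a match in each source.
coverage = mapping_df[["fotmob_name", "fifa_name"]].notna().mean()
print(coverage)

# Players missing from at least one source, to be fixed by hand or retried with a lower cutoff.
unmatched = mapping_df[mapping_df[["fotmob_name", "fifa_name"]].isna().any(axis=1)]
print(unmatched["fpl_name"].tolist())
```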

You can now export it to a .csv or .json file and use it whenever you want to combine the data from these sources. Depending on the nature of the data, you’ll need to update it more or less often. Physical data typically doesn’t need frequent updates, but performance data has to be refreshed weekly to stay relevant. If you want to go deeper, you could create your own database and have a background job that runs periodically to do the necessary scraping, merging and updating of data.
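To make the combining step concrete, here is a minimal sketch that uses the name mapping as a bridge between two sources. The `goals` and `height_cm` columns and all the values are placeholders I made up for illustration; your real column names will depend on your sources.

```python
import pandas as pd

# Placeholder frames; the numbers are not real statistics.
fpl_df = pd.DataFrame({"fpl_name": ["Mohamed Salah", "Son Heung-min"], "goals": [1, 2]})
fifa_df = pd.DataFrame({"fifa_name": ["Mohamed Salah", "Heung Min Son"], "height_cm": [180, 181]})
mapping_df = pd.DataFrame({
    "fpl_name": ["Mohamed Salah", "Son Heung-min"],
    "fifa_name": ["Mohamed Salah", "Heung Min Son"],
})

# Join FPL rows to the mapping on the FPL name, then to the FIFA rows on the FIFA name.
# Left joins keep every FPL player, even those without a match in the other source.
combined = (
    fpl_df
    .merge(mapping_df, on="fpl_name", how="left")
    .merge(fifa_df, on="fifa_name", how="left")
)
print(combined)
```

The same pattern extends to the FotMob data by adding a fotmob_name column to the mapping and chaining one more merge.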

In conclusion

Assembling a player metrics dataset from various free football data sources opens up exciting possibilities, whether you’re a student of the game doing a one-off analysis or an indie developer looking to build football software. In this article, we’ve seen a technique that allows you to combine player data from different sources. Let me know if you’ve encountered this problem and how you solved it, or which topic you’d like me to cover next. Happy hacking :)
