Photo from Unsplash by Michael Lee

2020-21 Transfer Window — Data scraping with Python

Gabriel Meireles
Data Science Soccer Club
3 min read · Sep 17, 2020


Hello there!
I’m really excited to share some of the things I’ve been learning over the past few days. I’m not an expert in Python, statistics, maths, or even in English, but my purpose is to share with you some information about soccer and data science with Python. So, without further ado, let’s start!

What we’ll do

At the beginning of every season the market heats up: many transfers are made, and some clubs invest heavily, aiming not only at local competitions but also at the coveted international titles. Our mission is to capture this information, process it as necessary, and present it through graphical representations.
If we put everything on a list, this is what we have to do:

  • Create a script to get the latest transfer data
  • Create a script to obtain information (goals, assists, cards, etc.) about the respective players
  • Create graphical representations from the scraped data

Scraping the data

The first step is to obtain information about the transfers, and for that we use Transfermarkt:

Latest transfers — Transfermarkt

All we need is to read the information from the table and save it to a csv file. For that we’ll use BeautifulSoup, a library responsible for pulling data out of HTML and XML files (in our case, HTML). With bs4 it’s possible to transform the rows and columns of a table into a Python list of dictionaries.

Let’s start by importing some libraries (the corresponding import lines are shown right after this list):

  • requests to make requests to the web address
  • BeautifulSoup to pull data out of the HTML
  • csv to write the data to a csv file
  • re to handle regular expressions
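
With those choices, the imports are simply:

import csv
import re

import requests
from bs4 import BeautifulSoup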

Now we’re going to create some functions that will help us throughout the application, starting with data_to_csv, which receives a list and saves it to a csv output file:
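
A minimal sketch of such a function, assuming the list holds dictionaries that all share the same keys; the file name transfers.csv is my own choice:

def data_to_csv(players_list, filename='transfers.csv'):
    # Use the keys of the first dictionary as the CSV header
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=players_list[0].keys())
        writer.writeheader()
        writer.writerows(players_list)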

We also have the format_text function, which takes a string and removes characters such as duplicated spaces and escape sequences:
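
A minimal sketch of what format_text could look like, using re to collapse escape sequences and runs of whitespace:

def format_text(text):
    # Replace escape sequences (\n, \t, ...) and repeated spaces
    # with a single space, then trim the ends
    return re.sub(r'\s+', ' ', text).strip()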

Now, a function to handle the currency values:
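
A sketch of a possible format_currency helper, assuming fees appear as strings like '€25.00m' or '€500Th.' (the suffix conventions are assumptions about the site’s formatting and may need adjusting):

def format_currency(value):
    # Fees look like '€25.00m' (millions) or '€500Th.' (thousands);
    # non-numeric values such as 'free transfer', 'Loan' or '-' become 0.0
    value = value.replace('€', '').strip()
    if value.endswith('m'):
        return float(value[:-1]) * 1_000_000
    if value.endswith('Th.'):
        return float(value[:-3]) * 1_000
    try:
        return float(value)
    except ValueError:
        return 0.0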

And finally, our function responsible for accessing the pages, transforming the HTML into a soup object, looking for the element with the responsive-table class, iterating over all the rows with the even and odd classes to get the ‘tds’ (the cells), creating a dictionary with the information we need, appending each player to players_list, and finally returning that list. Easy, right?
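
A minimal sketch of such a scraper; the paging URL pattern, the HEADERS constant, and the cell order are assumptions about the live site and should be checked before running:

BASE_URL = 'https://www.transfermarkt.com/transfers/neuestetransfers/statistik?page={}'
HEADERS = {'User-Agent': 'Mozilla/5.0'}  # the site rejects requests without a browser-like agent

def get_players(pages):
    players_list = []
    for page in range(1, pages + 1):
        response = requests.get(BASE_URL.format(page), headers=HEADERS)
        soup = BeautifulSoup(response.text, 'html.parser')
        table = soup.find('div', class_='responsive-table')
        # Each data row carries either the 'even' or the 'odd' class
        for row in table.find_all('tr', class_=['even', 'odd']):
            # recursive=False skips cells of tables nested inside the row
            tds = row.find_all('td', recursive=False)
            player = {
                'name': format_text(tds[0].get_text()),   # assumed cell order
                'position': format_text(tds[1].get_text()),
                'age': format_text(tds[2].get_text()),
                'fee': format_currency(format_text(tds[-1].get_text())),
            }
            players_list.append(player)
    return players_list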

Then we run the script, saying that we want to browse the first 10 pages. Remembering that each page displays 25 players, we’ll have 250 players in all. We then send that list to the data_to_csv function, responsible for saving the data to a csv file:
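
Putting it together, under the assumed names from the sketches above:

if __name__ == '__main__':
    # 10 pages x 25 players per page = 250 players in total
    players = get_players(10)
    data_to_csv(players)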

And the result is this:

Now that we have the necessary data, the time has come to manipulate and display it.
