How I Used Python To Scrape 100+ Tables Containing 5 Years Of F1 Data

Creating an ETL pipeline project scraping Wikitable data for Formula One (F1) driver performance from the past 5 years.

James Patalan
Pipeline: Your Data Engineering Resource
7 min read · Jan 15, 2024


James Patalan is pursuing an M.S. in data science at Bellevue University and is a contributor to Pipeline: Your Data Engineering Resource.

Getting Technical

If you read my previous article on what I learned by attempting to quantify the skill of a Formula One (F1) driver, you will know that I ended up finding a flaw in my analytical approach and disproved my own theory. While my previous write-up was results-oriented, this one is process-oriented, and as a result it is going to be a lot more technical.

My data all came from information that I scraped off of Wikipedia, which, despite what your high school English teacher may have told you, is generally a reliable source of information. Opting for Wikipedia as a data source offers real advantages: it provides a diverse, up-to-date dataset across a multitude of topics, and the structured nature of its articles simplifies data extraction, something I would soon find out firsthand.

Photo by jacopo marello on Unsplash

Hard Coding & Dynamic Coding

When I began work on my F1 project, I wanted to be as programmatic as I could and avoid “hard coding.” Hard coding is essentially writing code that cannot be easily modified or reused: the programmer does most of the work by manually writing fixed values into the code. With dynamic coding, by contrast, the computer does most of the heavy lifting for you.

In order to avoid hard coding, it is important to first think about what data you are trying to collect, and how that could be achieved with programming.
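As a quick toy illustration (not code from the project itself), the difference looks something like this:

# Hard coded: fixed values written out by hand
years = [2018, 2019, 2020, 2021, 2022]

# Dynamic: the same values generated programmatically
years = list(range(2018, 2023))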

Screenshot of a Wikipedia table showing the results of a race
Screenshot by James Patalan

In my case I wanted to get the table shown above. Every single individual F1 race has its own corresponding Wikipedia page, and each page has this table which displays the ending position (Pos.) and the starting position (Grid) of every driver.

At this point I could simply copy and paste over 100 URLs into my code, but I wanted to generate a more intelligent solution.

Work smarter not harder.

Examining the URLs for the individual races revealed they all shared the same format.

  • wikipedia.org/wiki/2021_British_Grand_Prix
  • wikipedia.org/wiki/2019_Hungarian_Grand_Prix

See it yet?

  • wikipedia.org/wiki/{YEAR}_{NAME}_Grand_Prix

This insight told me that all 100+ URLs could be generated.

Working Smarter

Not every Grand Prix is held every year. While some races, like Monaco, are mainstays, others go in and out of favor. Thankfully, Wikipedia also has articles on each F1 season, which contain the names of every race within that year.

Screenshot of a Wikipedia table showing all the races in a given season
Screenshot by James Patalan
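The season articles follow a predictable URL pattern of their own, so the list of season pages does not need to be typed out either. A minimal sketch of how season_url_lst (the list the next snippet loops over) can be built, assuming the {year}_Formula_One_World_Championship title pattern that the season articles use:

# Generate the URLs for the 2018-2022 season pages
season_url_lst = [
    f'https://en.wikipedia.org/wiki/{year}_Formula_One_World_Championship'
    for year in range(2018, 2023)
]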

Using the Pandas function pd.read_html(), I learned that each season’s list of races could always be found at index position 2. Now we can iterate through the season URLs, pull index position 2, and append the result to a new list.

import pandas as pd

races_by_season_lst = []

for url in season_url_lst:
    data = pd.read_html(url)
    data = data[2]

    races_by_season_lst.append(data)

In a perfect world, this code would work as written. But the real world is messy, and data has to be cleaned before the computer is happy with it. Because of this, I had to add several steps to ensure the data was scraped correctly, for example using a regex replacement to swap the spaces in each race name for underscores so they match the URL format.

data.replace(' ', '_', regex=True, inplace=True)

I ended up generating a list of tuples, where each tuple represented a season and each item was the name of a race within that season (a sketch of that conversion appears below). Using this list of tuples, we can then build out our list of Wikipedia URLs.
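Since read_html returns DataFrames, each season table first has to be reduced to just its race names. A minimal sketch of that conversion, assuming the calendar column is called 'Grand Prix' (the exact column name is an assumption):

# Reduce each season DataFrame to a tuple of race names
# NOTE: 'Grand Prix' is an assumed column name; adjust to match the scraped table
races_by_season_lst = [tuple(season_df['Grand Prix']) for season_df in races_by_season_lst]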

master_url_lst = []
year = 2018

# Loop through the list of seasons and create proper urls
for season in races_by_season_lst:
    for race in season:

        # Fixes error caused by special character
        if 'ã' in race:
            race = race.replace('ã', 'a')

        race_name = f'{year}_{race}'
        race_url = f"https://en.wikipedia.org/wiki/{year}_{race}"
        master_url_lst.append((race_url, race_name))

    # Increase the counter variable once the season has been completed
    year += 1

First, this code creates a new empty list, then sets the year to 2018, which also serves as a counter variable. It then iterates through the list of tuples and uses f-strings to create a new tuple containing the URL and the name of the race. Finally, at the end of each season, the code increases the year by 1, so the counter climbs all the way to 2022.
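Each entry in master_url_lst should now look something like this:

# Example entry: (race_url, race_name)
('https://en.wikipedia.org/wiki/2021_British_Grand_Prix', '2021_British_Grand_Prix')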

Working Harder

All of the needed URLs have been generated, and it is finally time to scrape our data; this is precisely where I hit the first major snag of my project. Remember how all the names of the races were conveniently located in index position 2? No such luck when it comes to the driver table that I was after. For the vast majority of URLs, that table was located at index position 3.

However, I found eighteen that were located at index position 4, and for the life of me I could not identify a trend. Even stranger, there were four races with the desired table at index position 1, two at index position 5, and one at position 7. I ended up reaching out to three separate colleagues within the industry, all of whom collectively scratched their heads over how to solve this.

That is until one friend said to me, “Well… you could just hard code it.”

Hard coding is generally considered a bad practice because it often makes code inflexible and difficult to maintain. That being said, it is sometimes a necessary evil when a deadline is looming.

Seeing as how I had been working on this project for a month and had yet to do any actual analysis, I decided to bite the bullet.

ones = ['https://en.wikipedia.org/wiki/2021_British_Grand_Prix',
        'https://en.wikipedia.org/wiki/2018_Abu_Dhabi_Grand_Prix',
        'https://en.wikipedia.org/wiki/2019_French_Grand_Prix',
        'https://en.wikipedia.org/wiki/2019_Spanish_Grand_Prix']

This is how I ended up with this lovely list in my code. Remember that handful of races whose table was sitting at index position 1? That is what “ones” holds, and there is another list named “fours” for the races at index position 4.

dataframes_lst = []
for url, race in master_url_lst:

    # Checks for URLS with odd table placements
    if url in fours:
        data = pd.read_html(url)
        data = data[4]

    elif url in ones:
        data = pd.read_html(url)
        data = data[1]

    ...

    else:
        data = pd.read_html(url)
        data = data[3]

The above code block should be fairly self-explanatory: it loops through our list of tuples and checks each URL against the list of URLs whose table sat at index position 4; if it finds a match, it grabs the data from the correct position. It does this for all of the weirdly placed tables before defaulting to index position 3.

dataframes_lst = []
for url, race in master_url_lst:

    ...

    max_rows = 20

    # Removes fastest lap row and source row
    if len(data) > max_rows:
        data = data.iloc[:max_rows]

    # Add race name column
    data['Race_Name'] = race

    dataframes_lst.append(data)

Within the same for loop, we need to clean the data a little before storing it. When Pandas scrapes a table from Wikipedia, it can accidentally grab things that you do not want included as rows, such as the fastest-lap and source rows. Since there are 20 F1 drivers in a race, I set a max_rows variable that each table is checked against, ensuring that only the first 20 rows are kept.

Checkered Flag

So far I have dynamically generated URLs, extracted data from those pages to create even more URLs, then iterated through them and scraped the tables I wanted while cleaning and transforming them along the way.

Finally, it is time to complete the ETL process and load the data into a database; for this task I leveraged SQLite.

import sqlite3

def sql_table(driver_df):
    conn = sqlite3.connect('formula1.db')
    c = conn.cursor()

    c.execute('''CREATE TABLE IF NOT EXISTS drivers (End_Pos REAL, No TEXT, Driver TEXT, Constructor TEXT,
                 Laps TEXT, TimeRetired TEXT, Start_Pos REAL, Points TEXT, Race_Name TEXT)''')

    # Insert each scraped row into the drivers table
    for row in driver_df.itertuples():
        c.execute(
            'INSERT INTO drivers (End_Pos, No, Driver, Constructor, Laps, TimeRetired, Start_Pos, Points, Race_Name) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)',
            (row[1], row[2], row[3], row[4], row[5], row[6], row[7], row[8], row[9])
        )

    # Commit the inserts so they persist, then read the full table back into a DataFrame
    conn.commit()
    formula1 = pd.read_sql('SELECT * FROM drivers', conn)

    conn.close()

    return formula1
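Each race’s DataFrame can then be stacked into a single frame and passed to the function above; a minimal sketch, assuming the frames are combined with pd.concat:

# Combine the per-race DataFrames and load them into SQLite
formula1 = sql_table(pd.concat(dataframes_lst, ignore_index=True))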

Now I know what you’re thinking: “What about all those cool visualizations you made?” For that I turned to my friend, Matplotlib’s pyplot. The visualizations for the individual drivers were fairly straightforward, and all pretty much followed the same procedure.

import matplotlib.pyplot as plt
import numpy as np

verstappen_df = formula1[formula1['Driver'] == 'Max Verstappen']

ax = plt.axes()
plt.hist(verstappen_df['End_Pos'], color='black', edgecolor='ivory', bins=20)
plt.xlabel('Position', weight='bold')
plt.ylabel('Frequency', weight='bold')
plt.suptitle('Finishing Positions of Max Verstappen', weight='bold')
plt.title('Data Scraped from Wikipedia')
ax.set_facecolor('silver')
plt.xticks(np.arange(1, 21, 1))
plt.show()

Essentially, I isolated all of the data for a particular driver by selecting every row where the “Driver” column matched the name I was looking for, then used Matplotlib to create a histogram. For the average gain/loss graphs, however, I used the Pandas groupby method.

formula1.groupby('Constructor')['Gain-Loss'].mean().plot(kind='bar', color='blue')

This groups the data by constructor (team), selects the “Gain-Loss” column, calculates the mean for each group, and plots the result as a bar chart.
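The “Gain-Loss” column itself is derived from the scraped positions; a minimal sketch, assuming gain/loss means grid position minus finishing position (so a positive value is places gained):

# Positions gained (+) or lost (-) between the grid and the checkered flag
# Assumes End_Pos and Start_Pos are numeric; rows for non-finishers may need extra handling
formula1['Gain-Loss'] = formula1['Start_Pos'] - formula1['End_Pos']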

While my journey of quantifying the skill of F1 drivers may have ended in a different position analytically than I had expected, I would call the technical portion a success. Despite my initial aspiration to avoid hard coding, practical challenges led to some compromises, and the process taught me the importance of adaptability in coding and problem solving while balancing efficiency and pragmatism.

For a complete look at my entire Python script, please visit my GitHub.

Contact James
