F1Archive: A Python Library for Analysing F1 Data
Building on a previous blog post in which I began analysing some historical F1 data, the code repository I discuss in this post aims to make it really simple to import data available on the Formula One website so you can begin your own data science projects. The repository consists of several classes that extract data from the website with only one or two method calls and minimal arguments. Under the hood, the code relies heavily on Beautiful Soup and pandas, first to pull the data and then to transform it into an easy-to-use format.
At present, the codebase is formed around three main classes which correspond to three areas of interest:
- The Drivers Championship: Race-by-race position and points for each driver across an entire season.
- Qualifying Performance: Absolute times and comparative performance between team-mates and the rest of the field.
- Constructors Championship: Total points, percentages of available points and trends over several seasons.
Each of the three areas is supplemented with data transformations and visualization functions to assist with data analysis. In the remainder of the article I will go into more detail on each and demonstrate some of the functionality available.
The library is available on GitHub and includes separate tutorial notebooks for each of the Drivers, Qualifying and Constructors classes. Downloading the code as a zip and running it locally is probably the best way to get started.
Drivers Championship
For the Drivers Championship, we are interested in the positions and points scored by each driver at each race for an entire season. To make this really easy, the library contains a DataExtractor class to extract data from the Formula One website and format it in a schema suitable for pandas. In addition to single seasons, we can also pass it a list of seasons.
The DataExtractor class is a base class for extracting data, and the qualifying and constructors classes inherit from it (see below). The code snippet below shows how simple it is to get the results for an entire season. We first instantiate the class. Then we call the get_race_urls method, passing the required year or list of years. The year can be passed as either an integer or a string. This method searches for all the URLs pointing to webpages with the race-by-race results for that season. From there, we simply call the seasons_results method, which extracts the data from each page and formats it in a DataFrame. Results from each year are stored together in a dictionary with the year as the key and the DataFrame as the value.
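To give a feel for the URL search step, here is a minimal sketch of the idea rather than the library's actual code. It uses the standard-library HTMLParser as a stand-in for Beautiful Soup, and the function name find_race_urls, the example.com base URL and the "/<year>/races/" link pattern are all assumptions made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RaceLinkParser(HTMLParser):
    """Collect links whose href contains a given marker (hypothetical pattern)."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            full_url = urljoin(self.base_url, href)
            if full_url not in self.urls:  # skip repeated links
                self.urls.append(full_url)


def find_race_urls(html, year, base_url="https://example.com/"):
    """Return {year: [race result urls]}; year may be an int or a string."""
    year = str(year)
    parser = RaceLinkParser(base_url, marker=f"/{year}/races/")
    parser.feed(html)
    return {year: parser.urls}


# Made-up page fragment standing in for a season index page.
SAMPLE_HTML = """
<ul>
  <li><a href="/results/2005/races/1/australia.html">Australia</a></li>
  <li><a href="/results/2005/races/2/malaysia.html">Malaysia</a></li>
  <li><a href="/results/2005/drivers.html">Drivers</a></li>
</ul>
"""

urls_by_year = find_race_urls(SAMPLE_HTML, 2005)
```

Keying the result by the year as a string mirrors the dictionary-of-results layout described above, and means an integer or string year ends up under the same key.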
A sample dataframe from the 2005 season shows how results are stored. For each driver, we store their position at each race that season, as well as the points scored.
With the position and points data loaded, we have the option of dropping the position data to consider only the points (generally, points data is more informative than anything we can do with race-by-race positions). This allows us to add a new column to the DataFrame for each driver's total points in the season and to reorder the data by final championship position.
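In plain pandas, that transformation might look something like the sketch below. The column naming ("Race 1 Points", "Race 1 Position") and the values are placeholders I have invented for illustration, not the library's actual schema or real results.

```python
import pandas as pd

# Placeholder data: two races, three drivers, invented values.
results = pd.DataFrame(
    {
        "Driver": ["Driver A", "Driver B", "Driver C"],
        "Race 1 Position": [1, 2, 3],
        "Race 1 Points": [10, 8, 6],
        "Race 2 Position": [3, 1, 2],
        "Race 2 Points": [6, 10, 8],
    }
).set_index("Driver")

# Keep only the points columns, dropping the positions.
points = results.filter(like="Points").copy()

# Add a season total and reorder by final championship position.
points["Total"] = points.sum(axis=1)
standings = points.sort_values("Total", ascending=False)
```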
If you prefer a visual on how the championship score breaks down for any particular season, there is basic functionality for plotting a bar chart of results, showing everyone from the champion down to the non-scorers. Here, the reordered DataFrame and subsequent bar chart make it immediately clear that 2005 was a two-horse race between Fernando Alonso and Kimi Raikkonen.
Another interesting way to look at the Drivers Championship is to visualize the progression through the season, which shows how a championship battle pitched and rolled throughout the year. In 2005, although it probably didn’t seem like it (watch Nurburgring), the fight between Alonso and Raikkonen stayed fairly static for most of the season. The cumulative chart shows how little the gap changed after those first few races.
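The progression view boils down to a running total of points per driver. A minimal sketch, assuming a DataFrame with one row per driver and one column of points per race (the names and values here are invented placeholders):

```python
import pandas as pd

# Placeholder points per race, invented for illustration.
race_points = pd.DataFrame(
    {"Race 1": [10, 8], "Race 2": [8, 10], "Race 3": [10, 6]},
    index=["Driver A", "Driver B"],
)

# Running total across the season for each driver.
cumulative = race_points.cumsum(axis=1)

# Gap between the two drivers after each race.
gap = cumulative.loc["Driver A"] - cumulative.loc["Driver B"]
```

With Matplotlib installed, calling cumulative.T.plot() would draw one line per driver across the races.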
Further functionality around drivers’ race and championship results can be viewed in the associated Jupyter Notebook.
Qualifying Results
Qualifying performance is the gold-standard speed test in Formula One for both cars and drivers. I don’t think it is the only metric by which we might judge a driver’s speed, but it is certainly the clearest. For this reason, the library includes several methods for extracting qualifying results from each race, as well as for comparing team-mates and the entire grid across a season. In addition to the brief overview here, I have written a sister post on the code developed for qualifying results, since it is a little more complex than the other classes.
The QualyExtractor class inherits from DataExtractor and effectively performs the same function for qualifying results, with some differences under the hood. Pass it a year and it will find the corresponding webpage urls, extract the information and format it accordingly in a DataFrame. Below, the code snippet shows how simple this is.
You can see below that the code returns positions and qualifying times for each race in the 1998 season. You may have noticed that there is a set of empty columns for each race named ‘Team-mate’. These columns can be filled with a relative time to each driver’s team-mate, enabling comparative performance analysis to indicate driver speed.
Functions for computing transformations of the data for different types of analysis are in the data_transforms.transformations.py file, and several of them operate on qualifying times. Some processing is required to compute qualifying deltas, much of it to do with converting strings to datetimes, and I go into more detail on this in the other post.
To get time relative to team-mate, the qualy_differences function takes the above DataFrame and a list of race names, filters the DataFrame by ‘Car’ (manufacturer) and computes the differences between a pair of qualifying times. This gives us the revised DataFrame below, with the ‘Team-mate’ column now filled.
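The core of that calculation can be sketched as follows. This is a simplified illustration, not the qualy_differences implementation: it converts lap-time strings to plain seconds instead of datetimes, assumes exactly two drivers per car, and uses invented driver names, team names and times.

```python
import pandas as pd


def lap_to_seconds(lap):
    """Convert a lap time string such as '1:30.500' to seconds."""
    minutes, seconds = lap.split(":")
    return int(minutes) * 60 + float(seconds)


# Placeholder qualifying results for a single race.
qualy = pd.DataFrame(
    {
        "Driver": ["Driver A", "Driver B", "Driver C", "Driver D"],
        "Car": ["Team X", "Team X", "Team Y", "Team Y"],
        "Time": ["1:30.000", "1:30.500", "1:31.250", "1:31.000"],
    }
)
qualy["Seconds"] = qualy["Time"].map(lap_to_seconds)

# With two drivers per car, the team-mate's time is (team sum - own time),
# so own time minus team-mate's time is 2 * own - team sum.
team_sum = qualy.groupby("Car")["Seconds"].transform("sum")
qualy["Team-mate"] = (2 * qualy["Seconds"] - team_sum).round(3)
```

A negative value means the driver was faster than their team-mate. Teams with a missing time or a mid-season driver change would need extra handling that this sketch leaves out.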
There are several things we can do with this information, and the library provides a suite of functions for different visualizations (see below). One of these, plot_teammate_comparison, allows us to display the relative delta between team-mates at each race for a given season. First, we have to stack the DataFrame to get the correct format for plotting. For this, we use the stack_qualy_results method of the QualyExtractor class.
The resulting plot gives us a reasonable indicator of the difference in speed between two drivers. Here, negative numbers indicate a faster lap. Schumacher is dominant, of course: faster in all but one race, and by more than 0.5s in 11 out of 16!
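Stacking here means reshaping from one column per race to one row per driver and race, which is the long format most plotting functions expect. A minimal sketch of that reshaping with pandas melt, using invented column names and values rather than the output of stack_qualy_results:

```python
import pandas as pd

# Placeholder wide table: one column of team-mate deltas per race.
wide = pd.DataFrame(
    {
        "Driver": ["Driver A", "Driver B"],
        "Race 1": [-0.5, 0.5],
        "Race 2": [0.2, -0.2],
    }
)

# Long format: one row per (driver, race) pair.
long_format = wide.melt(id_vars="Driver", var_name="Race", value_name="Delta")
```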
There are a lot of other things we can envisage doing with qualifying data, and some of what the library facilitates can be walked through in the Jupyter Notebook here.
Constructors Championship
As with the qualifying data, there is a subclass of DataExtractor for extracting constructors championship data, called TeamsExtractor. The basic syntax for using this class is the same as before, and it initially returns a dictionary of DataFrames containing one or more championship tables.
You can see from the DataFrame above that the champ_standings method returns total points for each constructor, the percentage of total points available and the percentage of the maximum any single team could actually score, i.e. the points for finishing first and second in every race. There are plotting functions available for each of the numeric columns in the visualizations file and these are demonstrated here.
For this post, we will look at an interesting trend in the performance of teams over multiple seasons. To help with this, the TeamsExtractor class can also be used to extract results for multiple seasons. The code snippet below demonstrates how to go about plotting several seasons' worth of constructors points data.
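The two percentages are simple arithmetic once the points scale is fixed. The sketch below is my own illustration, not the champ_standings code: the function name is hypothetical, the team total of 430 points over 20 races is an invented example, and it ignores extras such as fastest-lap points.

```python
def points_percentages(team_points, n_races, points_table):
    """Return (% of all points available, % of the maximum one team could score)."""
    per_race_total = sum(points_table)
    # A team runs two cars, so its best possible race is first plus second.
    per_race_team_max = sum(sorted(points_table, reverse=True)[:2])
    pct_available = 100 * team_points / (per_race_total * n_races)
    pct_of_max = 100 * team_points / (per_race_team_max * n_races)
    return round(pct_available, 1), round(pct_of_max, 1)


# The 25-18-15-12-10-8-6-4-2-1 top-ten scale introduced in 2010.
POINTS_TABLE = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]

# Invented example: a team scoring 430 points over a 20-race season.
pct_available, pct_of_max = points_percentages(430, 20, POINTS_TABLE)
```

With this scale there are 101 points on offer per race, of which one team can take at most 43, so the second percentage is the fairer measure of how close a team came to a perfect season.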
First, we pass a list of years to the champ_standings method which returns a dictionary of DataFrames. Then get_seasons_df is used to combine results from several seasons into a single DataFrame. Within this method, another method (generic_team_names) is used to convert seasonal differences in team names into a single entity, e.g. ‘Red Bull Racing Tag Heuer’ to ‘Red Bull’ (Note: at present this method is not fully functional beyond the demonstration). Next, we stack the data into a format for plotting (stack_constructor_trends) and finally, the plot_constructors_trend visualization function is used to generate the graph below.
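The combine-and-rename step can be approximated in plain pandas as below. This is a sketch under assumptions, not the library's get_seasons_df or generic_team_names: the column names, "Team Z" and the points values are invented, and only the Red Bull mapping comes from the example above.

```python
import pandas as pd

# Map season-specific names to a single generic name.
GENERIC_NAMES = {"Red Bull Racing Tag Heuer": "Red Bull"}

# Placeholder standings keyed by year, mimicking a dictionary of DataFrames.
standings_by_year = {
    "2016": pd.DataFrame(
        {"Team": ["Red Bull Racing Tag Heuer", "Team Z"], "Points": [100, 50]}
    ),
    "2017": pd.DataFrame({"Team": ["Red Bull", "Team Z"], "Points": [90, 60]}),
}

# Tag each season with its year and combine into a single DataFrame.
combined = pd.concat(
    [df.assign(Year=year) for year, df in standings_by_year.items()],
    ignore_index=True,
)
combined["Team"] = combined["Team"].replace(GENERIC_NAMES)

# One row per season, one column per team, ready for a trend plot.
trend = combined.pivot(index="Year", columns="Team", values="Points")
```

Without the renaming step, the 2016 and 2017 Red Bull entries would end up in separate columns and the trend line would be broken.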
Unsurprisingly, Mercedes, Ferrari and Red Bull dominated in points percentage over this period, although Ferrari dropped off massively in 2020. This chart really helps visualize the consistent gulf in performance between the best and the rest, and indicates that F1 is not as competitive in the hybrid era as we might like to believe.
As with drivers and qualifying results, there is an accompanying notebook for constructors data here.
Visualization Utils
Obviously, if people want to use any of this code for data science projects, visualization is going to be an important feature. At the moment, the code contains several functions for plotting different aspects of the data. These can be found in the visualization.viz.py file, which currently has nine such plotting functions. Among these are functions to plot qualifying field spread and constructors championship points over time.
These plotting functions are all demonstrated in the three notebooks accompanying the race, qualifying and constructor data extraction classes. At present, the visualizations use standard Matplotlib and Seaborn methods and have not been optimally configured to represent team colors. This is something to add to the list of enhancements, of which there are many.
Future Enhancements
I must admit that I realised when undertaking this project that the historical F1 data is much messier than I had anticipated! In some cases, I probably should have known it would be a little awkward. For instance, teams change their names from year to year depending on sponsorship, and things like points for fastest lap have not been integrated. Other issues were less predictable. The 2020 season caused problems because several racetracks hosted multiple races, which was not accounted for in my first pass of the code.
There are definitely still some omissions which require further work. For instance, the abbreviations in viz_utils.py are incomplete, color-coding would be nice and generalizing old team names beyond 2016 is required. So, please feel free to make amendments when you notice something missing.
Finally, if you find this stuff interesting, useful or worth continuing to develop, please give it a like or a star on GitHub.