Nobody can guarantee you anything in Formula 1 … but are we sure?

Diana Ballesteros
Published in LCC-Unison
6 min read · Dec 8, 2021

Predicting the podium of the Formula 1 Silverstone race

They say that Formula 1 cannot be predicted, so why did I choose this topic for a predictive analysis? It is true that so many things outside anyone’s control can go wrong in a race, but most of the time there are patterns: in the way each driver drives, in the weather at a given track, in the strength of each team, and so on.

Of all the circuits, I chose Silverstone because the first World Championship Grand Prix was held there in 1950, which means there is a lot of data for that circuit.

In this project I tried to predict the finishing (“podium”) position of each driver in the race.

Let’s get started!

For this project I used a dataset that I found on Kaggle; it is in the repository linked at the end.

First, we import the libraries we will need later on:

import numpy as np
import pandas as pd
import seaborn as sns
import autokeras as ak
import tensorflow as tf
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

We read our data

data = pd.read_csv('finalSpa.csv')
data = data.drop('Unnamed: 0', axis =1)

Before training the model, I wanted to know what the dataset contained and how the data was structured.

I explored it with code; here I will only show the results, and the code can be seen in my repository (link at the end).
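For readers who want to reproduce these checks without opening the repository, a minimal sketch might look like the following. It runs on a tiny stand-in DataFrame rather than the real dataset, and the `season` column name is an assumption (the article does not name the year column).

```python
import pandas as pd

# Tiny stand-in for the real dataset. "season" is an assumed column name;
# "driver" and "podium" are columns used later in the article.
toy = pd.DataFrame({
    "season": [1983, 2019, 2020],
    "driver": ["driver_a", "driver_b", "driver_c"],
    "podium": [1, 1, 2],
})

# Number of rows and columns
n_rows, n_cols = toy.shape

# First and last year covered by the data
year_range = (int(toy["season"].min()), int(toy["season"].max()))

# Total number of missing values across all columns
null_total = int(toy.isnull().sum().sum())

print(n_rows, n_cols, year_range, null_total)
```

The same three calls (`shape`, `min`/`max` on the year column, and `isnull().sum()`) can be applied directly to `data` once it is loaded.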

The dataset consisted of 14,566 rows spanning the years 1983 to 2020, with 21 columns including the name of the circuit, the driver, the team, the starting position, and the final position, among others.
I checked how many null values there were, and the result was zero, which kept the project simple.
As an extra, I built a word cloud of the driver names in the dataset (not necessary, but it looks nice).

wordcloud2 = WordCloud().generate(' '.join(data['driver']))
plt.imshow(wordcloud2)
plt.axis("off")

Out of pure curiosity, I wanted to see the distribution of driver nationalities from 1983 to date, which we can see below:

sns.set_theme(style='whitegrid')

# Count the distinct drivers per nationality, most common first
df_tmp = data.groupby('nationality', as_index=False)\
    .agg(count_drivers=pd.NamedAgg(column='driver', aggfunc='nunique'))\
    .sort_values('count_drivers', ascending=False)

plt.figure(figsize=(15, 15))
graph = sns.barplot(
    data=df_tmp,
    y='nationality',
    x='count_drivers',
    color='#25B5D9'
).set_title(
    'Formula 1 Driver Nationalities',
    size=14
)
plt.xlabel('# of drivers')
plt.ylabel('Nationalities')

We could explore more questions like this, but that is not the main purpose of this project.

Now it’s time to prepare our data for training!

Before choosing which columns to train with, let’s convert all of them to numerical values, so that in the end we are free to choose among all the columns.

One of the columns is “nationality”, so we will assign a number to each unique value in this column.

data.nationality.unique()

# Map each nationality to an integer code (assigning the result back avoids relying on inplace=True on a column)
data["nationality"] = data["nationality"].replace({"Finnish": 1, "French": 2, "Brazilian": 3, "British": 4, "Italian": 5, "American": 6,"Austrian": 7, "Colombian": 8, "Venezuelan": 9, "Swiss": 10, "German": 11, "Chilean": 12,"Australian": 13, "Belgian": 14, "Swedish": 15, "Dutch": 16, "Canadian": 17,"Japanese": 18, "Spanish": 19, "Argentine": 20, "Portuguese": 21, "Monegasque": 22, "Danish": 23, "Czech": 24, "Malaysian": 25, "Irish": 26, "Hungarian": 27, "Indian": 28, "Polish": 29,"Russian": 30, "Mexican": 31, "Indonesian": 32, "New Zealander": 33, "Thai": 34})

We do the same for the constructor and circuit_id columns as we did for the nationality column.
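The code for those two columns is not shown here, so the following is only a sketch of one way it could be done without typing each mapping by hand. The helper `encode_column` and the toy values are hypothetical, and the codes it produces depend on the order in which values first appear, so they will not necessarily match the numbers used in this article (for example, Silverstone being 9).

```python
import pandas as pd

# Toy stand-in data; the values are illustrative only.
toy = pd.DataFrame({
    "constructor": ["mercedes", "red_bull", "mercedes", "ferrari"],
    "circuit_id": ["silverstone", "spa", "silverstone", "monza"],
})

def encode_column(df, column):
    """Replace each unique value in `column` with an integer code
    (starting at 1, in order of first appearance) and return the mapping."""
    mapping = {value: code
               for code, value in enumerate(df[column].unique(), start=1)}
    df[column] = df[column].map(mapping)
    return mapping

constructor_codes = encode_column(toy, "constructor")
circuit_codes = encode_column(toy, "circuit_id")

print(constructor_codes)
print(circuit_codes)
```

Keeping the returned mappings around makes it possible to look up later which number a given circuit or team received.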

With all our values now numeric, and since our goal is to predict the Silverstone race, we keep only the rows for the Silverstone track from previous years (1983 to 2019, because at the end we will predict the 2020 race).

dataSilverstone=data['circuit_id']==9 
filtered_df = data[dataSilverstone]

We use the value 9 because, when the columns were converted to numbers, the Silverstone circuit was assigned the value 9 in the circuit_id column.
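The filter above selects by circuit only. If the loaded file also contained 2020 rows, they would need to be excluded explicitly before training. A minimal sketch of that, assuming a year column named `season` (a hypothetical name, not confirmed by the article), could be:

```python
import pandas as pd

# Toy stand-in data; "season" is an assumed column name.
toy = pd.DataFrame({
    "season": [2018, 2019, 2020, 2019],
    "circuit_id": [9, 9, 9, 4],
    "podium": [1, 2, 1, 3],
})

# Combine both conditions: Silverstone rows, and seasons up to 2019 only
is_silverstone = toy["circuit_id"] == 9
before_2020 = toy["season"] <= 2019
toy_train = toy[is_silverstone & before_2020]

print(toy_train)
```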

Preparing the data

train = filtered_df

As a first test, we build our X without the ‘driver’, ‘podium’ and ‘round’ columns. I also remove ‘circuit_id’ because, since all the rows come from the same circuit, it adds no information. Our y is the podium values.

X_train = train.drop(['driver', 'podium', 'round', 'circuit_id'], axis = 1)
y_train = train.podium
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)

Fitting a classification model using auto-keras

tf.keras.backend.set_floatx('float64')
clf = ak.StructuredDataClassifier(overwrite=True, max_trials=2) #max_trials is the number of models to test
clf.fit(X_train, y_train, epochs=400)

The result was an accuracy of 0.8333, which is quite good.
But let’s see what happens if we remove more columns: this time we train without ‘driver’, ‘podium’, ‘round’ and ‘circuit_id’ as before, and also without the constructor column.

X_train = train.drop(['driver', 'podium', 'round', 'circuit_id', 'constructor'], axis = 1)
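Only the changed drop call is shown here. Since a StandardScaler remembers the columns it was fitted on, the scaling step presumably has to be repeated whenever the column set changes, and the same fitted scaler reused later on the test data. A sketch of that pattern on toy data (the `grid` column name and all values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training data; "grid" is a hypothetical name for the starting position.
toy_train = pd.DataFrame({
    "grid": [1, 2, 3, 4],
    "constructor": [1, 2, 1, 3],
    "driver_points": [50.0, 40.0, 10.0, 0.0],
    "podium": [1, 2, 3, 4],
})

# New column combination: drop the target and the constructor column
X_toy = toy_train.drop(["podium", "constructor"], axis=1)

# Refit the scaler on exactly the columns that will be used for training
scaler_toy = StandardScaler()
X_toy = pd.DataFrame(scaler_toy.fit_transform(X_toy), columns=X_toy.columns)

# Later, unseen rows are transformed with the already fitted scaler
X_new = pd.DataFrame({"grid": [2], "driver_points": [25.0]})
X_new_scaled = pd.DataFrame(scaler_toy.transform(X_new), columns=X_new.columns)

print(X_toy)
print(X_new_scaled)
```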

This time the accuracy is 0.9832, which is quite high.
Let’s try another column combination and see what happens before choosing one to predict with.

Now let’s try also removing the columns with the driver’s information, such as his points, number of races won, standings position and age, and see what happens.

X_train = train.drop(['driver', 'podium', 'round', 'circuit_id', 'constructor', 'driver_points', 'driver_wins', 'driver_standings_pos', 'driver_age'], axis = 1)

As a result we get an accuracy of 0.6176, which is quite low, so we know these are not good columns to remove.

There is a huge number of possible combinations, but for the purposes of this project let’s stick with the one that gave us an accuracy of 0.9832.
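When trying several column combinations like the three above, it can help to describe them in one place instead of rewriting the drop call each time. The sketch below only builds the feature tables on toy data; the experiment names, the helper `build_features` and the `grid` column are hypothetical, and the model fitting step is left out.

```python
import pandas as pd

# Toy stand-in with the column names used in this article, plus a
# hypothetical "grid" column so that something remains after dropping.
toy = pd.DataFrame({
    "driver": ["driver_a", "driver_b"],
    "podium": [1, 2],
    "round": [4, 4],
    "circuit_id": [9, 9],
    "constructor": [1, 2],
    "driver_points": [50, 40],
    "driver_wins": [3, 1],
    "driver_standings_pos": [1, 2],
    "driver_age": [35, 22],
    "grid": [1, 2],
})

# Columns removed in every experiment
ALWAYS_DROPPED = ["driver", "podium", "round", "circuit_id"]

# Extra columns removed in each of the three experiments described above
EXPERIMENTS = {
    "baseline": [],
    "no_constructor": ["constructor"],
    "no_constructor_no_driver_stats": [
        "constructor", "driver_points", "driver_wins",
        "driver_standings_pos", "driver_age",
    ],
}

def build_features(df, extra_drops):
    """Return the feature table for one experiment."""
    return df.drop(columns=ALWAYS_DROPPED + extra_drops)

feature_sets = {name: build_features(toy, extra)
                for name, extra in EXPERIMENTS.items()}

for name, features in feature_sets.items():
    print(name, list(features.columns))
```

Each entry of `feature_sets` could then be scaled and passed to the classifier in a loop, recording the accuracy per experiment name.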

Testing

For prediction we will use the data from the 2020 Silverstone race (not in the training data).

X_test = pd.read_csv('testSilverstone.csv')
X_test = X_test.drop('Unnamed: 0', axis =1)
drivers = X_test.driver
X_test = X_test.drop(['driver', 'podium', 'round', 'circuit_id', 'constructor'], axis = 1)
X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns)
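# clf2: presumably the classifier trained on the chosen column combination (its definition is not shown above)
z = clf2.predict(X_test)
res = pd.DataFrame({'Driver': drivers})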
res['pos'] = z.astype(int)

It predicts the following podium order

If we compare our prediction with the actual results of the race, we can see that we only got first place right (the model has either Verstappen or Hamilton winning), which is a very low hit rate. In general terms, though, the picture is better: the actual top six were Hamilton, Verstappen, Leclerc, Ricciardo, Norris and Ocon, and our predicted top six included Hamilton, Verstappen, Ricciardo and Ocon.
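The two comparisons made above (exact positions and overlap within the top places) can also be computed rather than read off by eye. The sketch below uses placeholder driver labels and made-up positions, not the model’s real output or the real race result.

```python
import pandas as pd

# Placeholder data: these are NOT the real predictions or results.
predicted = pd.Series({"driver_a": 1, "driver_b": 1, "driver_c": 4, "driver_d": 3})
actual = pd.Series({"driver_a": 1, "driver_b": 2, "driver_c": 3, "driver_d": 4})

# How many drivers were placed in exactly the right position
exact_hits = int((predicted == actual).sum())

# Looser comparison: which drivers appear in both the predicted
# and the actual top N, regardless of their order
top_n = 3
predicted_top = set(predicted[predicted <= top_n].index)
actual_top = set(actual[actual <= top_n].index)
overlap = predicted_top & actual_top

print(exact_hits, sorted(overlap))
```

With the real data, `predicted` would come from the `res` table and `actual` from the podium column of the test file, both indexed by driver.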

Maybe by choosing other columns, a different number of epochs, or different max_trials values we could get closer to the true podium. Even though the prediction was only partly right, I really enjoyed doing this project and the “experiment” of trying to predict Formula 1 :)!

Link to repository
