Oscar Piastri: Star in the Making?

Anthony Dalke
4 min read · Aug 6, 2024


III. Data Architecture

Welcome to the third installment in this series assessing the historical strength of Oscar Piastri’s rookie season! Parts I and II focused, respectively, on an overview and data requirements gathering. Today, we’ll get more granular (and get our coding beaks wet), identifying where to find the required data points in the FastF1 API’s ecosystem and how to translate that into a data architecture.

Recall that in the previous part of this project, I concluded that the Results and EventSchedule topics from the API satisfied my data requirements. While this helped get a “lay of the land”, I wanted to dig deeper and pinpoint the specific fields to retrieve and where to find them. To explore the structure of the data in the API’s Session and EventSchedule objects, I took the quick-and-dirty approach of entering a few basic commands in a Jupyter notebook.

import fastf1 as ff1

# Pull the event schedule for a season and list its round numbers
schedule = ff1.get_event_schedule(1994)
print(f"Round numbers in 1994 season: {schedule.RoundNumber.to_list()}")

# Load a qualifying session and preview its event metadata
session_quali = ff1.get_session(2020, 1, "Q")
session_quali.load()
session_event = session_quali.event
print(f"Preview of event data in session object: {session_event}")

# Load the corresponding race session and preview its results
session_race = ff1.get_session(2020, 1, "R")
session_race.load()
session_results = session_race.results
print(f"Preview of results data in session object: {session_results}")
The snippets above cover retrieving round numbers in a given season, exploring the contents of session event data, and exploring the contents of session result data.

I continued playing around to gain a sense of the fields in each object, then summarized my findings in tables connecting each field I ultimately wanted for the analysis to the API object exposing it and its assigned column name in the API.
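As a quick aside, because EventSchedule and the results objects behave like Pandas dataframes, printing their column names is an easy way to see every field they expose. A minimal sketch, reusing the schedule and session_results variables from the snippet above:

# List every column the schedule and results dataframes expose
print(f"EventSchedule columns: {schedule.columns.to_list()}")
print(f"Results columns: {session_results.columns.to_list()}")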

During this exploration, I also noted a few observations about working with the API.

— The initial load of a single session takes roughly 15 to 20 seconds; the time is spent in the load() call that follows get_session(). After that, the API leverages caching to reduce subsequent loads of the same session’s data to about a second.

— For qualifying data, only the Position field contains values, as opposed to race data, which populates both Position and ClassifiedPosition.

— Also for qualifying data, values can differ by era. For example, older seasons used different qualifying formats, so no values appear in the Q3 column for those years.

— The Status field indicates whether a driver failed to finish a race. Alternatively, records with a ClassifiedPosition value of “R” (retired) also correspond to a DNF.

— Rate limiting could affect larger data pulls, such as retrieving multiple seasons. Incorporating the sleep function from Python’s time module should reduce the impact, as sketched below.
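To make the caching and rate-limiting points concrete, here is a minimal sketch of how a multi-season pull might pace itself. The cache directory, season range, and one-second pause are illustrative assumptions rather than values from this project:

import time
import fastf1 as ff1

# Cache API responses locally so repeat loads of a session return quickly
# ("./ff1_cache" is an assumed, hypothetical directory)
ff1.Cache.enable_cache("./ff1_cache")

for season in range(2021, 2024):
    schedule = ff1.get_event_schedule(season)
    for round_number in schedule.RoundNumber.to_list():
        if round_number == 0:
            continue  # skip pre-season testing entries
        session = ff1.get_session(season, round_number, "R")
        session.load()  # slow on the first call, near-instant once cached
        results = session.results
        # ... transform / persist results here ...
        time.sleep(1)  # brief pause between calls to ease rate limiting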

Next, I started thinking about how to structure the pipeline and project infrastructure. My mind naturally gravitates toward bullet points, so that thinking resulted in the following.

— Since the API natively incorporates Pandas dataframes to represent data, it made sense to embrace a Pandas-centric approach to data transformations. Pandas reaches limitations at greater data volumes, since it stores data in memory, but that wouldn’t come into play with this project.

— I wanted the option to store this data for future analyses or other projects, so I decided to create a Postgres database and write to it via a library like sqlalchemy (see the sketch after this list).

— For the actual analysis, a SQL-, rather than Python-centric, approach held greater appeal, since it opened the door to experimenting with tools like dbt at some point.

— I made a point of containerizing the project to practice my Docker skills and aid reproducibility.
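To illustrate the Postgres piece, writing one of the dataframes from earlier might look roughly like the following. The connection string, table name, and sqlalchemy/psycopg2 setup are hypothetical placeholders, not this project’s actual configuration:

from sqlalchemy import create_engine

# Hypothetical local connection string; replace with real credentials and host
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/f1")

# Persist the results dataframe from the earlier snippet to an assumed raw table
session_results.to_sql(
    "race_results_raw",  # assumed table name
    engine,
    if_exists="append",  # append each new session's rows
    index=False,
)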

To offer a basic visual depiction of the data architecture, I envisioned it looking something like this:

Simplified data architecture

At that point, I felt fairly comfortable with the broad strokes of the project and set my sights on practicing another data engineering skill I’ll explain next time: data modeling.

Thank you for reading, and I hope you come back again!

I. Introduction

II. Data Requirements Gathering

III. Data Architecture (today)

IV. Data Modeling (coming next week)

V. Development

VI. Analysis

VII. Next Steps & Enhancements

