Can aircraft tail number be a good predictor of flight delays?

7 min readJul 27, 2018

As an aviation geek, I usually search which exact aircraft that I will get for my long-haul flights. Start making questions and get answers: Is it a brand new aircraft?; Do I make a good seat selection?; Does it have a prior scheduled flight to fly into my airport?; If so, does it operate on time?; If not, how long do I need to wait for my airplane arrive?; and so on. When you make these search, you may end up yourself finding with the 4–6 digits number, which we call it as the tail number or the registration number, one of the unique identifier of the aircraft. If you ever had taken airplane photo when you get excited getting onboard, please zoom in the tail part of it. You will find this number printed on both port and starboard side of the aircraft. Like this.

after 14 hours long flight and captured its tail number

This unique number inspired me to gear up my data science project. I believed that this tail number can be a good predictor of the flight delays. And I stretched this inference to — what about aircraft makers or models, can it be found by this number — and can it be variables that I can use them for my prediction models? That’s how I started the project by having this ambitious assumption: if there is a certain relationship between aircraft and delays, it can be linked to the fleet operation strategies to the aviation companies — whether to diversify or simplify their fleet inventories.

On-Time Performance open-source data in 29 millions data points

The dataset, the on-time performance, can be obtained from the Bureau of Transportation’s Static website. In my case, I dived into 5 recent years of the data from 2013 to 2017, and it ended up with 29 millions data points of the entire domestic passenger flights. The CSV file contains from basic information of each flights(e.g. date, scheduled time, actual time, etc) to specific detailed log, such as minutes of delays, reason of delays, flight status(cancelled, delayed or diverted), etc.

# Since BTS webpage allows to get dataset by month, I downloaded first and merged it.import glob
import pandas as pd
import os, os.path# Set the file path
path = './assets/rawdata/'# Make reproducible, whether the combined dataset existed or not
if os.path.isfile('./assets/newdf_2013.csv') == True:
    newdf_2013 = pd.read_csv('./assets/newdf_2013.csv', low_memory=False)
    print("File existed!", newdf_2013.shape)
else:
    # pull 2013 datasets, segmented down by months, and merge
    files = glob.glob(path + str(2013) + "/*.csv")
    df_2013 = pd.concat((pd.read_csv(f, low_memory=False) for f in files))
    print("Shape of 2013 raw datasets: ", df_2013.shape)

Once you fetch the dataset and complete the merge, make sure to have a basic analysis on the dataset which columns and features you want to mainly take a look. Here I selected around 30 features out of 110 columns. Dumped columns were flight diverted information or somewhat duplicated geographical airport/market data.

# Select only relevant columns
if os.path.isfile('./assets/newdf_2013.csv') == True:
    print("File existed and already preprocessed")
else:
    cols = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'Carrier', 
            'FlightNum', 'TailNum', 'Origin', 'Dest', 'CRSDepTime',
            'DepTime', 'DepDelay', 'TaxiOut', 'WheelsOff',
            'CRSElapsedTime', 'ActualElapsedTime', 'AirTime', 
            'Distance', 'WheelsOn', 'TaxiIn', 'CRSArrTime', 
            'ArrTime', 'ArrDelay', 'Diverted', 'Cancelled', 
            'CancellationCode', 'NASDelay', 'SecurityDelay', 
            'CarrierDelay', 'LateAircraftDelay', 'WeatherDelay']

    newdf_2013 = df_2013[[col for col in df_2013.columns 
                          if col in cols]]
    print("Shape of 2013 modified datasets: ", newdf_2013.shape)

For the computational reasons, I subset it to local market data in both outbound and inbound flights. In result, it came down to 2.2 million data points.

# only select Washington Metro airport
if os.path.isfile('./assets/washington.csv') == True:
    print("File existed and already preprocessed")
    was_air = pd.read_csv('./assets/washington.csv', low_memory=False)
    print("Washington airports dataset dimension: ", was_air.shape)
else:
    was_airports = ['DCA', 'IAD', 'BWI']

    a = newdf_2013[newdf_2013.Origin.isin(was_airports)]
    b = newdf_2014[newdf_2014.Origin.isin(was_airports)]
    c = newdf_2015[newdf_2015.Origin.isin(was_airports)]
    d = newdf_2016[newdf_2016.Origin.isin(was_airports)]
    e = newdf_2017[newdf_2017.Origin.isin(was_airports)]

    a_= newdf_2013[newdf_2013.Dest.isin(was_airports)]
    b_= newdf_2014[newdf_2014.Dest.isin(was_airports)]
    c_= newdf_2015[newdf_2015.Dest.isin(was_airports)]
    d_= newdf_2016[newdf_2016.Dest.isin(was_airports)]
    e_= newdf_2017[newdf_2017.Dest.isin(was_airports)]

    was_air = pd.concat([a, b, c, d, e, a_, b_, c_, d_, e_], ignore_index=True)
    print("Washington airports dataset dimension: ", was_air.shape)

    # save dataset for the future usage
    was_air.to_csv('./assets/washington.csv', index=False)

As I previously mentioned, I was going to plan capture aircrafts’ identification with model, maker and its age by provided tail no in the dataset. In the 2.2 million datasets, there were around 6,000 unique tail numbers found. Therefore I decided to use Selenium to web-scrap from flightradar24.com which renowned as its live flight tracker service. This site contains detailed aircrafts code, serial number, age and delivered date to paid subscribe users. As I followed the site, after some tests with Selenium I successfully scrapped thousands of aircrafts information. You can find it more on my GitHub.

Once all the dataset is prepared, let’s have exploratory data analysis and feature engineering. I have several golden nuggets I found from the data.

The national capital area has three international/national airports: Baltimore/Washington International Thurgood Marshall Airport(IATA Code name — BWI: Baltimore, MD), Ronald Reagan Washington National Airport(DCA: Arlington, VA), Washington Dulles International Airport(IAD: Dulles, VA) Out of these three airports, BWI has the largest flights counts, followed by DCA and IAD. Interestingly, all the airports showed some dipped in February and spiked up in March. My guess is as same as yours, the Spring break!

When you flight counts in both in- and out-bounds, Southwest ranked the first and followed by United, American, and Delta. (As you can see, subsidiary airlines were not re-categorized by which they were operated for, and as it also separately showed US Airways in the dataset)

Before we are saying which airlines performed the best and the worst — guess the main reason to read this far to quench your thirst and curiosity — let’s have one final technical point. I wanted to use departure and arrival time logged in the dataset as a predictor to my model. With the help of Ben Shaver, I found Mr. Ian London’s blog post about cyclical time conversion. The code and thought process of the part in my project inspired from the blog and credit goes to Mr.London and Ben. Basically, we are converting time a sine transform and a cosine transform. Once calculate the sin_time and cos_time, you can scatter plot it into 24-hour clock shape.

As you can see from the different colored clock above, the flight delays made extra unscheduled work in the airport.

Here is the analysis of the departure and arrival flights which made more than 15 minutes delayed from its scheduled time. In the first page with the red line showed the mean value of delayed rate. With the second page, as the code and plot credit to Mr. Fabien Daniel, his tutorial in Kaggle, represented how each airlines’ mean delayed minutes in both departure and arrival sequences.

Challenges: too many unknown values in the aircrafts info

Okay, let’s get back to the original inference and assumption. Can we predict flight delays with aircrafts models and makers? My short answer is Yes it can be, but my longer answer is I cannot get the meaningful result at this moment. The reason behind this is, I got too much unknown aircrafts information from the flightradar24.com. Out of 6,000 unique tail number, after I successfully finished my selenium bot work, I got around 20% of unknown or null data. Therefore my challenge and new journey here is, I need to find the better source to collect aircraft information based on its tail number. If anyone has thoughts or suggestions on my ambitious projects, I will be more than happy to chat with and learn from you. As a new data scientist, I am always up to listen to all the feedback and ready to tune my work. All of my current work and process can be found on my GitHub. Thank you very much for reading it.

Can aircraft tail number be a good predictor of flight delays?

On-Time Performance open-source data in 29 millions data points

Challenges: too many unknown values in the aircrafts info

Written by Kihoon Sohn