There are three main sets of data that we are using. The first one is collected from the from the Bureau of Transportation Statistics (BTS) called “On-Time : Reporting Carrier On-Time Performance.” The dataset has been available since 1987. The second one is aircraft tail data, giving us information of the aircraft age, series, and engine type. This data is provided on the Federal Aviation Administration (FAA). The third data set is hourly weather reports at all of the commercial airports of the United States. This data was taken from the archives of the National Centers for Environmental Information (NOAA).
Data from BTS contains information on the flight date, the day of the week, operating airline, tail number, origin and destination airport, departure and arrival time, departure and arrival taxi time, cancellation, distance, airtime, cause of delay (due to carrier, weather, national air system, security, or late aircraft). ShortVIX has chosen to focus on flight data between January 1st, 2016 to January 31st, 2017. Therefore, there are approximately 12 million observations (flights), and the total file size is 5 gigabytes. The data is easily found from the following link: https://www.transtats.bts.gov/Tables.asp?DB_ID=120. Data provided from BTS is already clean, so there will be no large data wrangling needed.
However, data from the FAA website in regards to finding aircraft age, series, and engine type based on tail numbers is more complex. Since there are thousands of aircraft in the United States and data for each tail number can be collected one by one, ShortVIX has decided to conduct web scraping of the FAA website to automatically pull the necessary data. But sometimes, the tail numbers in the BTS data don’t correspond to the tail numbers in the FAA database, e.g. American Airlines reports nose numbers to BTS and some planes retired from 2016 to 2018. So matching from the tail numbers in BTS data to the tail numbers in FAA database is necessary. There has been some trouble in cleaning up the data because there are some missing values in FAA data and sometimes FAA data is incorrect like the wrong engine manufacturer name and model. So these data need to be filled and corrected manually.
Data from NOAA contains archived weather data of all airports in the United States. ShortVIX is focusing on downloading hourly weather data containing data on sky condition, visibility, dry and wet temperature, dew point, humidity, wind speed and direction, pressure, precipitation, and daily sunset and sunrise time. There is a total of 324 airports, providing about 5.5 million rows of data. However, the NOAA imposes a 5 airport limit per download. Therefore, data has to be downloaded 65 times. The data could be found in the following link: https://www.ncdc.noaa.gov/cdo-web/datatools/lcd. After downloading all the CSV files, the tables need to be appended together in one file. The column with the ‘Hourly Present Weather Type’ will be explored to see the weather type in order to incorporate those into the model.