Attempting to Predict the Aircraft of Plane Crashes
The Question
Since the 1960s and 70s, the plane has been a critical player in long-distance passenger travel. Today, aircraft rarely crash. However, sending hundreds of tons into the air does not come without risk, and when planes do crash, the results are deadly. This raises the question:
Can we predict which civilian aircraft are more likely to crash with significant accuracy based on historic data?
There are several significant stakeholders that could ask this question. Government organizations that focus on transportation such as the Federal Aviation Administration and the National Transportation Safety Board may use this data to know which planes need engineering review. This could be the same for manufacturers such as Boeing and Airbus for their own internal quality assurance.
It is important to remember that these organizations would all have extremely experienced teams working on a model such as this. Their answer to this question would be different. I will be answering this question to the best of my undergraduate ability and time.
The Data
The ideal data for this analysis would be a dataset that contains all passenger aircraft crashes. This dataset would include the manufacturer, aircraft type, number of fatalities, and the date the crash occurred. For this model, the ground-truth label will be the aircraft type. This is the only feature from this potential dataset (and the dataset that will be used) that can be used to answer the question.
The data used for this project comes from www.planecrashinfo.com. This is close to the ideal data and focuses on civilian accidents. However, it does include military transport accidents. For a separate project, I had already scraped and cleaned this information from the website. The rough code for this is below.
from io import StringIO
from time import sleep

import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.DataFrame()
for year in range(1920, 2024):  # one page per year, 1920 through 2023
    response = requests.get(f"https://www.planecrashinfo.com/{year}/{year}.htm")
    soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML
    table = soup.find('table')  # find the first table on the page
    single_df = pd.read_html(StringIO(str(table)))[0]  # convert the table into a pandas DataFrame
    single_df = single_df.drop(0)  # remove the first row
    df = pd.concat([df, single_df], ignore_index=True)
    print(f"Collected year: {year}")
    sleep(0.25)  # short pause between requests
df = df.rename(columns={0: "Date", 1: "Location / Operator", 2: "Aircraft Type / Registration", 3: "Fatalities"})
df.head(20)
df.to_csv('output.csv', index=False)  # index=False avoids writing row indices to the CSV file
This program exported the data as a CSV file. It was cleaned in a separate program (which will be discussed later) and then imported into a Jupyter Notebook file (ipynb).
The Model and Its Features
A classification model will be used for this analysis, specifically sklearn’s Decision Tree Classifier. This is because the feature that is being predicted, the type of aircraft, is categorical in nature. The features that will be used to predict this are fatalities and the year of the crash. Although these are numeric in nature, they will be placed into a matrix in a binary fashion.
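As a rough illustration of what placing a numeric value into a matrix "in a binary fashion" can look like, the sketch below one-hot encodes crash years in plain Python. The sample years and the helper name one_hot_years are made up for illustration, the "c.24" style of column name is borrowed from the matrix described in the next section, and the notebook may well build its matrix differently (pandas users would more likely reach for pd.get_dummies).

```python
def one_hot_years(years):
    """Return (column_names, rows); each row has a single 1 in its year's column."""
    # Hypothetical helper: one column per distinct year, named like "c.24" for 1924.
    # Two-digit names are ambiguous across centuries (1924 vs. 2024), so a real
    # pipeline may prefer the full year in the column name.
    distinct = sorted(set(years))
    columns = [f"c.{year % 100:02d}" for year in distinct]
    rows = [[1 if year == col_year else 0 for col_year in distinct] for year in years]
    return columns, rows


# Made-up sample years, not taken from the dataset.
columns, rows = one_hot_years([1924, 1925, 1924, 1937])
print(columns)
for row in rows:
    print(row)
```

Each crash ends up with exactly one 1 among the year columns, which lets a decision tree split on "was the crash in this year or not" rather than treating the year as a continuous number.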
Answering the Question
The notebook contains too many blocks to discuss in a single post (the full ipynb file is available on my GitHub). However, I will share the program that created this model from the matrix. Below is the matrix that contains each crash as a row, with the aircraft, year (c.24 means 1924, c.25 means 1925, etc.), and number of fatalities as columns. What follows is the Python code that transforms this data into x and y variables that are accepted by sklearn’s Decision Tree Classifier. Note that the code is modified from one of Professor Cody Buntain’s in-class examples.
import random

from sklearn.tree import DecisionTreeClassifier

crash_indexes = matrix.index.tolist()  # create a list of crash indexes
split_index = int(0.7 * len(crash_indexes))  # position of a 70/30 train/test split
random.shuffle(crash_indexes)  # shuffle them!
train_crash_indexes = crash_indexes[:split_index]
test_crash_indexes = crash_indexes[split_index:]
print("train data:", len(train_crash_indexes))
print("test data:", len(test_crash_indexes))
x = matrix.drop(columns=["aircraft"])  # features: every column except the aircraft label
x.columns = x.columns.astype(str)  # ensure the column names are strings
x
x_train = x.loc[train_crash_indexes]  # locate training data for x
x_test = x.loc[test_crash_indexes]  # locate test data for x
y = matrix["aircraft"]  # labels: the aircraft column
y
y_train = y.loc[train_crash_indexes]  # locate training data for y
y_test = y.loc[test_crash_indexes]  # locate test data for y
# create the tree and fit the model
model = DecisionTreeClassifier(max_depth=12)
model.fit(x_train, y_train)
y_predict = model.predict(x_test)  # create predictions
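The split above relies on random.shuffle without a seed, so each run trains and tests on a different set of crashes and the counts reported below may shift slightly between runs. One possible way to make the split repeatable is sketched here; the helper name split_indexes, the seed value, and the stand-in index list are assumptions for illustration and are not part of the notebook.

```python
import random


def split_indexes(indexes, train_fraction=0.7, seed=0):
    """Shuffle a copy of `indexes` with a fixed seed, then slice into train/test lists."""
    shuffled = list(indexes)               # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # dedicated RNG keeps the split repeatable
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Stand-in for matrix.index.tolist(); the real index values are not shown here.
crash_indexes = list(range(10))
train, test = split_indexes(crash_indexes)
print(len(train), len(test))
```

Calling split_indexes again with the same seed returns the same two lists, which makes it easier to compare tree depths on identical test data.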
Max depths of 3 and 2 both returned 1,116 wrong predictions and 125 correct predictions out of 1,241, for a 10.1% accuracy rate. 1,240 of these predicted the aircraft would be a Douglas DC-3 and 1 predicted the aircraft would be a Douglas C-47a.
A max depth of 1 returned 1,115 wrong predictions and 126 correct predictions, for a 10.2% accuracy rate. All were predicted to be a Douglas DC-3.
The depth was then increased to 12, which returned 1,127 wrong predictions and 114 correct predictions, for a 9.2% accuracy rate. The predictions included 749 aircraft, with 112 being the DC-3.
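For reference, tallies like these can be produced by comparing the predictions against the true labels, with accuracy taken as correct predictions divided by all predictions (correct plus wrong). The sketch below uses made-up labels and a hypothetical helper named summarize; it is not the notebook's own evaluation code.

```python
from collections import Counter


def summarize(y_true, y_pred):
    """Return (correct, wrong, accuracy) for two equal-length label sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    wrong = len(y_true) - correct
    accuracy = correct / len(y_true)  # correct over ALL predictions, not correct over wrong
    return correct, wrong, accuracy


# Made-up labels for illustration only.
y_true = ["douglas dc-3", "douglas dc-4", "douglas dc-3", "boeing 747"]
y_pred = ["douglas dc-3", "douglas dc-3", "douglas dc-3", "douglas dc-3"]

correct, wrong, accuracy = summarize(y_true, y_pred)
print(correct, wrong, f"{accuracy:.1%}")
print(Counter(y_pred).most_common(1))  # the aircraft the model predicts most often
```

A model that predicts the same aircraft for every crash, as in this toy example, scores an accuracy equal to that aircraft's share of the test set, which is a useful baseline to compare a deeper tree against.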
There is not much to discuss regarding the specifics of the other tests. The following are 5 samples that the last model (depth of 12) got wrong, along with information about their real-life crashes. Note that the crash numbers are clickable and lead to a site containing more information about each crash. Crash numbers are not based on any value; they are purely the index of the initial DataFrame.
1. Crash 4225 occurred in 1999 and had 4 fatalities. The model predicted it to be a Douglas DC-3, while it was listed as a Boeing 747-2B5F.
2. Crash 3380 occurred in 1986 and had 13 fatalities. The model predicted it to be a Douglas DC-3, while it was listed as a Short SC-7 Skyvan.
3. Crash 1448 occurred in 1954 and had 26 fatalities. The model predicted it to be a Douglas DC-6b, while it was listed as a Convair RB-36h (conflicting reports).
4. Crash 1600 occurred in 1957 and had 2 fatalities. The model predicted it to be a de Havilland Canada DHC-3 Otter, while it was listed as a Douglas DC-4.
5. Crash 446 occurred in 1937 and had 7 fatalities (conflicting reports). The model predicted it to be a Douglas DC-3, while it was actually a Dewoitine D-333.
One of the reasons the Douglas DC-3 was so prevalent in this model is how many were made. According to the Smithsonian, more than 13,000 DC-3s have been produced since 1935 and many still fly. In this dataset, the DC-3 is the most prevalent aircraft. This is not because the aircraft is inherently dangerous (the Smithsonian also considers them to be safe aircraft for their time); it is because of the sheer number of DC-3s that were in the air. Below is a graph showing how many times a Douglas DC-3 is involved in a crash in the dataset.
Can we predict which civilian aircraft are more likely to crash with significant accuracy based on historic data?
No. It may be possible for others. However, with this model I cannot predict which civilian aircraft are more likely to crash with significant accuracy.
Data Cleaning
The initial data was extremely dirty. Columns held multiple fields of information, contained significant spelling errors, and had significant inconsistencies. The original table contained the aircraft type and registration in the same column. OpenRefine was used to both remove the registration number from each cell in this column and set each character to lowercase. The Fatalities column posed a similar problem. The original table listed the number of fatalities out of the total passengers, with the number of ground fatalities in parentheses. OpenRefine was used to split this data into multiple columns. Anyone retrieving data from this website should expect to apply similar processes.
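The cleaning for this project was done in OpenRefine, but the Fatalities split could also be scripted. The sketch below is a Python alternative, not the process actually used: it assumes each cell looks like "13/20(0)" (fatalities, then total aboard, then ground fatalities in parentheses) and that unknown values appear as "?". The sample strings and the function names are invented for illustration.

```python
import re

# Assumed cell layout: "<fatalities>/<aboard>(<ground>)", e.g. "13/20(0)".
# Treating "?" as an unknown value is also an assumption about the raw data.
FATALITIES_PATTERN = re.compile(
    r"^\s*(\d+|\?)\s*/\s*(\d+|\?)\s*\(\s*(\d+|\?)\s*\)\s*$"
)
FIELDS = ("fatalities", "aboard", "ground")


def parse_fatalities(cell):
    """Split one raw cell into integers, using None for unknown or unparseable parts."""
    match = FATALITIES_PATTERN.match(cell)
    if match is None:
        return dict.fromkeys(FIELDS)  # every field is None when the layout is unexpected
    values = [None if part == "?" else int(part) for part in match.groups()]
    return dict(zip(FIELDS, values))


def normalize_aircraft(name):
    """Lowercase and trim an aircraft type so spelling variants group together."""
    return name.strip().lower()


# Made-up sample cells, not copied from the site.
print(parse_fatalities("13/20(0)"))
print(parse_fatalities("?/5(1)"))
print(normalize_aircraft("  Douglas DC-3 "))
```

Returning None for anything that does not match keeps malformed rows visible for manual review instead of silently turning them into zeros.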
Conclusion
Future models like this should consider additional features beyond year and fatalities. These additional features could increase the accuracy of the model. For example, the location of the accident could be included. Additionally, future analysis could include a regression model. Instead of attempting to predict the aircraft, a model could attempt to predict the number of fatalities based on the aircraft.
This analysis and its model do not explore each person in depth. It is important to remember that each number representing a fatality represents a real person who had family and friends. Each crash was a tragic incident. However, a working prediction model like this could potentially save lives by learning from previous mistakes.
Resources
planecrashinfo.com: https://www.planecrashinfo.com/
GitHub Repository: https://github.com/smiller1551/PlaneCrashPredictions
Professor Cody Buntain’s Github: https://github.com/cbuntain
This Medium post was created by Simon Miller at the University of Maryland — College Park for INST414: Data Science Techniques under Professor Cody Buntain.