HDSC August ‘21 Premiere Project Presentation: Limiting Plane Crashes

A Project by Team Vispy

HamoyeHQ

Published in

Hamoye Blog

5 min read · Oct 5, 2021

Data Analysis on causes of Airplane crashes and solutions to limit such occurrences

Introduction

An airplane is a fixed-wing aircraft that is propelled forward by thrust from a jet engine, propeller, or rocket engine. Industries in the air transportation subsector provide air transportation of passengers and/or cargo using aircraft such as airplanes and helicopters. Air travel is also widely regarded as one of the safest, fastest, and most comfortable means of transportation. An aviation accident is defined by the Convention on International Civil Aviation Annex 13 as an occurrence associated with the operation of an aircraft which affects the safety of operation (AirSafe.com, 2009). However, air crashes still occur for a variety of reasons, and they can have severe consequences for people's lives and valuable assets. According to an article from natlawreview, the most common reasons for airplane crashes include pilot error, mechanical failure, and poor weather.

This project sets out to analyze airplane crash data and to confirm or refute the claims made by natlawreview, while proposing solutions to limit such occurrences.

Objectives

The project aims to investigate the number of plane crashes per year, the number of people on board, and the number of survivors. It also aims to identify the periods with the highest number of crashes and to observe patterns in the data in order to make inferences about the cause(s) of crashes.

Our Approach

We carried out data profiling and cleaning, exploratory data analysis, and K-Means clustering to draw inferences from the data.

Dataset Used for Analysis

The dataset, obtained from Kaggle, contains information about airplane crashes, including the date, time, operator, number of people aboard, and fatalities.

Collaborators

This project is an open-source project for the Hamoye Data Science Internship. We are a team of data scientists, data storytellers, and data engineers; each team member was assigned a specific role.

Dataset Exploration

As with all real-life data, the dataset contained missing values and imperfections, so it had to be cleaned. Missing values in the columns of interest (‘Aboard’, ‘Fatalities’, and ‘Ground’) were filled with zeros (0) because exact information was unavailable. The date column was rearranged into a more familiar form. A year column was created for later analysis, and a survivors column was derived from the ‘Aboard’ and ‘Fatalities’ columns.
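The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration on made-up rows, not the team's actual notebook: the column names, the `%m/%d/%Y` date format, and the definition of survivors as people aboard minus fatalities are all assumptions.

```python
import pandas as pd

# Made-up sample rows; column names are assumed to match the Kaggle file.
df = pd.DataFrame({
    "Date": ["09/17/1908", "07/12/1912", "09/11/2001"],
    "Aboard": [2.0, 5.0, None],
    "Fatalities": [1.0, None, None],
    "Ground": [0.0, None, 0.0],
})

# Fill missing counts with zero, as described above.
count_cols = ["Aboard", "Fatalities", "Ground"]
df[count_cols] = df[count_cols].fillna(0).astype(int)

# Parse the date strings (format is an assumption) and derive a year column.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df["Year"] = df["Date"].dt.year

# Survivors are assumed to be people aboard minus fatalities.
df["Survivors"] = df["Aboard"] - df["Fatalities"]

print(df)
```

Note that filling with zeros is a simplification: a row with a missing fatality count ends up reporting everyone aboard as a survivor.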

The diagram below shows the relationship between the three properties (Aboard, Fatalities, and Ground).

Useful deductions were made from the data by exploring the number of crashes per year, and the death toll per year.

We see that after the 1940s, there is a significant increase in airplane crashes. The highest peaks are between 1960 and 2000.

Here we can see the same pattern; the years that had the most accidents are also the ones with the most fatalities.

From the 1940s, the number of people aboard airplanes started to increase. The period from 1960 to 2000 has the most people aboard, and it is also the period with the most plane crashes and fatalities.
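Yearly figures like the ones discussed above can be obtained with a single group-by. The sketch below uses made-up numbers and assumes a cleaned table with ‘Year’, ‘Aboard’, and ‘Fatalities’ columns; it is only meant to show the shape of the computation.

```python
import pandas as pd

# Made-up rows standing in for the cleaned dataset.
df = pd.DataFrame({
    "Year": [1950, 1950, 1972, 1972, 1972, 2001],
    "Aboard": [10, 20, 100, 50, 30, 60],
    "Fatalities": [10, 5, 80, 50, 0, 60],
})

# One row per year: number of crashes, people aboard, and fatalities.
per_year = df.groupby("Year").agg(
    Crashes=("Fatalities", "size"),   # row count per year
    Aboard=("Aboard", "sum"),
    Fatalities=("Fatalities", "sum"),
)

# Year with the most crashes in this sample.
busiest_year = per_year["Crashes"].idxmax()

print(per_year)
print("Year with most crashes:", busiest_year)
```

Plotting `per_year` as line charts would then give the crashes, fatalities, and aboard trends side by side.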

We see a spike in ground casualties. On further investigation, we found that it corresponds to the 9/11 disaster, when two hijacked planes were crashed into the Twin Towers.

The plots below show that Aeroflot had the most crashes and, as expected, the most fatalities. One might keep this in mind when booking the next flight ticket, although raw counts do not account for how many flights each operator has flown.
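A ranking of operators by crashes and fatalities can be computed as sketched below. The operator names and numbers here are placeholders, and the ‘Operator’ and ‘Fatalities’ column names are assumptions about the dataset's layout.

```python
import pandas as pd

# Placeholder operators and counts, for illustration only.
df = pd.DataFrame({
    "Operator": ["Airline A", "Airline B", "Airline A",
                 "Airline C", "Airline A", "Airline B"],
    "Fatalities": [10, 40, 5, 7, 20, 1],
})

# Crashes and total fatalities per operator, most crashes first.
by_operator = (
    df.groupby("Operator")
      .agg(Crashes=("Fatalities", "size"), Fatalities=("Fatalities", "sum"))
      .sort_values("Crashes", ascending=False)
)

print(by_operator.head(10))
```

In this toy sample the operator with the most crashes is not the one with the most fatalities, which is why it helps to look at both columns.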

Text Analysis using K-Means Clustering

The summary column gives an idea of the causes of the crashes. We therefore analyzed it to find the most common causes and to confirm or refute the claims of the natlawreview article. This was done using K-Means clustering.

Steps in the K-Means clustering approach:

First, we drop the rows with an empty summary, as they are not useful to us.
K-Means works only on numeric vectors, so the text must first be converted to numbers.
To get numbers, we do feature extraction. The feature we use is TF-IDF, a numerical statistic that combines term frequency and inverse document frequency.
The TfidfVectorizer class from scikit-learn implements TF-IDF.
We want a summary of the top 5 causes of airplane crashes, so we use 5 clusters.
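The steps above can be sketched with scikit-learn as follows. The summaries are made up, and the English stop-word list, `n_init`, and `random_state` values are assumptions chosen for the illustration rather than the settings the team used.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up summaries standing in for the dataset's summary column.
df = pd.DataFrame({"Summary": [
    "Crashed shortly after takeoff from the runway.",
    "Engine failure forced an emergency descent.",
    None,
    "Crashed while attempting to land in heavy fog.",
    "Lost power on takeoff and struck trees.",
    "Flew into a mountain in poor weather and fog.",
    "Engine fire during climb after takeoff.",
    "Overshot the runway while landing in rain.",
]})

# Step 1: drop rows with an empty summary.
summaries = df["Summary"].dropna()

# Steps 2-4: turn text into TF-IDF vectors (stop-word removal is an assumption).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(summaries)

# Step 5: group the vectors into 5 clusters.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(X.shape)
print(labels)
```

Each entry of `labels` is the cluster assigned to the corresponding summary; the cluster numbers themselves are arbitrary and carry no ranking.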

Results

From the cluster analysis, the following inferences were made:

Cluster 1 contains words relating to crashes during and shortly after takeoff
Cluster 2 contains words relating to crashes while attempting to land the plane
Cluster 3 contains words relating to crashes due to engine failure
Cluster 4 contains words relating to poor weather conditions
Cluster 5 also contains words relating to takeoff

Hence, we can completely agree with two of the causes highlighted by the natlawreview article:

Mechanical failure
Poor Weather

There were indications of pilot error, but we couldn’t come to a suitable conclusion as no cluster explicitly contained words relating to pilot error.

Conclusion

In this analysis, we identified the years with the most crashes and fatalities and the operator with the highest number of crashes, which gave an idea of the peculiarities of the data. The main challenges were missing values in the dataset and the lack of additional information that would support better inferences. A suggested improvement is to vary the number of clusters and compare the results to find the optimal number.
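The suggested improvement could look something like the sketch below, which fits K-Means for several cluster counts and compares them using the silhouette score. The silhouette score is one possible comparison criterion chosen here as an assumption (the inertia-based elbow method is another), and the summaries are made up.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up summaries, for illustration only.
texts = [
    "Crashed shortly after takeoff from the runway.",
    "Lost power on takeoff and struck trees.",
    "Engine fire during climb after takeoff.",
    "Engine failure forced an emergency descent.",
    "Crashed while attempting to land in heavy fog.",
    "Overshot the runway while landing in rain.",
    "Flew into a mountain in poor weather and fog.",
    "Struck a hill in low cloud and poor weather.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(texts)

# Fit K-Means for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print("Best k by silhouette:", best_k)
```

On short, sparse texts like crash summaries the scores tend to be low overall, so the comparison is best treated as a guide alongside a manual look at the top terms per cluster.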

Inferences were made on the possible causes of airplane crashes. From this, the following solutions were proposed:

Reduced flights in poor weather conditions
Adequate checks on the plane to prevent mechanical failure
Adequate care should be taken on the part of flight operators and pilots for efficient landing and takeoff to prevent unwanted occurrences.

The team members who worked on this project:

1. Eseose Okiti

2. Anosike Prosper Udochukwu

3. Ogwu Augustine Ugbedeojo

4. Magret Tolulope Akinwande

5. God’spower Ebole (Query Analyst)

6. Sayak Mallick (Project Team Lead)

7. Isaac Omolayo

8. Patrick Jeremiah O.

9. Hafsah Anibaba

10. Adeboje Olusola (Group Article Author)