The Severity of Airplane Accidents: Predictions Using ML Techniques

Dr. Mohit Sharma
9 min read · Jun 15, 2022


https://globalnews.ca/news/702034/analysis-early-evidence-suggests-pilot-error-in-the-crash-of-asiana-214/

Table of Contents:

  1. Introduction
  2. Exploratory Data Analysis (EDA)
  3. Split the data
  4. Logistic Regression
  5. Support Vector Machines
  6. Gradient Boosting Decision Tree
  7. Feature Importance
  8. Conclusion
  9. Future Work
  10. References

Introduction:

Aircraft accidents are among the most dangerous and catastrophic types of vehicle accidents. Since the first aviation disaster, a balloon crash in France on June 15, 1785, there has been a long sequence of tragic incidents in aviation, including the crash of the Wright Model A, a powered aircraft built by the Wright Brothers.

Since 1980, the frequency of aviation accidents has been falling year by year, suggesting a growing understanding of the factors that cause an airplane crash. This article discusses one such case study, based on the airplane accident severity dataset available on Kaggle: https://www.kaggle.com/datasets/kaushal2896/airplane-accidents-severity-dataset.

Exploratory Data Analysis (EDA):

In this section, we explore the data with visualizations to see whether they can help us judge the severity of accidents.

The safety score and the ‘Days_Since_Inspection’ column together can, to some extent, predict the type of risk.
Top-Left: almost no information about accident severity can be extracted by plotting the max elevation against the cabin temperature. Top-Right: combining total safety complaints with the control metric, the probability of high damage and serious injury increases when safety complaints exceed 50. Bottom-Right: the accident types cannot be distinguished using total safety complaints alone. Bottom-Left: the safety score can slightly separate ‘Highly Fatal And Damaging’ from ‘Significant Damage And Fatalities’.
Code Snippet for PCA
Visualization of PCA in 2-D
Code Snippet for TSNE
TSNE plots
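
The PCA and t-SNE snippets above appear as images in the original post. Below is a minimal sketch of how such 2-D projections could be produced with scikit-learn; the target column name ‘Severity’, the t-SNE perplexity, and the random seed are assumptions, not taken from the post.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

df = pd.read_csv("train.csv")
y = df["Severity"]                                   # target column name assumed
X = StandardScaler().fit_transform(df.drop(columns=["Severity"]))

# 2-D linear projection onto the first two principal components
X_pca = PCA(n_components=2).fit_transform(X)

# 2-D non-linear embedding; perplexity and seed are illustrative choices
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
labels = pd.factorize(y)[0]                          # integer class codes for coloring
for ax, emb, title in zip(axes, [X_pca, X_tsne], ["PCA (2-D)", "t-SNE (2-D)"]):
    ax.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="viridis")
    ax.set_title(title)
plt.show()
```
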

Accidents in the data are classified into four groups based on their severity: (1) Minor Damage And Injuries, (2) Highly Fatal And Damaging, (3) Significant Damage And Serious Injuries, and (4) Significant Damage And Fatalities. Because not all causes contribute equally to an accident, it is critical to recognize the key ones so that accidents can be prevented by addressing them. Machine Learning (ML) approaches, as well as data analysis (to a limited extent), can help us extract this information from the incidents.

One may ask how ML algorithms, which are purely mathematical in nature, can reveal hidden knowledge about accidents and help us avoid them. Let us look at the code below and see what we can find.

Split the data:

We begin by importing the ‘train.csv’ file, which has the shape (10000, 12). After separating the class labels, denoted ‘y’, we are left with the feature matrix ‘X’, which has the shape (10000, 11). Since we want some data points to validate our predictions, we randomly divide ‘X’ and ‘y’ in an 80:20 ratio, with 80 percent of the data used for training and the remainder kept as test data. We use stratified sampling to partition the data, so that the ratio of class labels in the train and test sets is almost identical.

Several of the features, such as ‘Safety Score’, ‘Turbulence In gforces’, and ‘Adverse Weather Metric’, can have very different orders of magnitude. Training a model on features with such different scales can lead to poor predictions, because distances along one feature (say, Safety Score) dominate or are dwarfed by distances along the others, and outliers that lie far from the bulk of the data become hard to distinguish from inliers. It is therefore recommended to scale every feature. Here, we use StandardScaler() from the scikit-learn library to standardize the train and test features to zero mean and unit variance.

Data splitting and Preprocessing
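
The splitting and preprocessing snippet is shown as an image in the original post. The sketch below reproduces the steps described above (stratified 80:20 split plus standardization); the column name ‘Severity’ and the random seed are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("train.csv")                  # shape (10000, 12)
y = df["Severity"]                             # class labels (column name assumed)
X = df.drop(columns=["Severity"])              # remaining 11 features

# Stratified 80:20 split so both sets keep (almost) the same class ratios
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize: fit the scaler on the training data only, then apply to both
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
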

Logistic Regression:

Logistic Regression (LR), despite its name, is a classification technique. It is slightly more advanced than Linear Regression in the sense that it uses a sigmoid function, which gives a probabilistic interpretation of whether a point belongs to the positive or the negative class. In addition, the sigmoid function reduces the impact of outliers: even if an outlier at a very large distance is added to the data, the sigmoid output does not change much, because it is bounded in the range [0, 1] and never exceeds unity. It should also be noted that LR works best when the classes are substantially separable.
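
To make the bounding property concrete, here is a tiny standalone example (not from the original post) showing that the sigmoid output stays inside (0, 1) even for points very far from the decision boundary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Even extreme scores map into (0, 1), so a distant outlier
# cannot push the model's output arbitrarily far.
for z in [-50, -5, 0, 5, 50]:
    print(z, sigmoid(z))
```
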

To implement the LR classifier and train the model, we use GridSearchCV to find the best-fit value of the hyperparameter ‘C’. In this procedure, the model is trained for each value in the set we provide for ‘C’ and returns the one that fits the data best. Since cross-validation requires some ‘unseen’ data, GridSearchCV splits the given training data into several parts, controlled by the argument ‘cv’. In the code below we take cv=3, which means that for each value of ‘C’ in ‘params’, the training data is split into three equal parts: two-thirds are used for training and the remaining one-third for cross-validation, rotating over the three folds.

Code Snippet for Logistic Regression
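
The logistic regression snippet appears as an image in the post; the sketch below follows the description above (grid search over ‘C’ with cv=3). The candidate values for ‘C’, the scoring metric, and max_iter are assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate values of the inverse-regularization strength C (illustrative grid)
params = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}

lr_clf = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=params,
    cv=3,                 # 3-fold cross-validation, as described above
    scoring="accuracy",
)
lr_clf.fit(X_train, y_train)
print("Best C:", lr_clf.best_params_)
```
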
Confusion Matrix for Logistic Regression

To assess prediction quality and see how many points are correctly identified and how many are not, we use a confusion matrix plot. In the figure above, one can see that the training data is fitted better than the test data, but the off-diagonal values are high in both cases. This means that LR does not work well for this data, probably because the data is not linearly separable.

Code Snippet to get LR Accuracy
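
A minimal stand-in for the accuracy snippet, assuming the fitted grid-search object ‘lr_clf’ from the sketch above:

```python
from sklearn.metrics import accuracy_score

print("Train accuracy:", accuracy_score(y_train, lr_clf.predict(X_train)))
print("Test accuracy :", accuracy_score(y_test, lr_clf.predict(X_test)))
```
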

The total accuracy of the training data is 0.644375, whereas the overall accuracy of the test data is 0.6235.

Code Snippet for F1 Score
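
Likewise for the F1 score; the ‘weighted’ averaging below is an assumption, since the post does not state which averaging it uses for the four classes:

```python
from sklearn.metrics import f1_score

print("Train F1:", f1_score(y_train, lr_clf.predict(X_train), average="weighted"))
print("Test F1 :", f1_score(y_test, lr_clf.predict(X_test), average="weighted"))
```
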

Similarly, the F1 scores for the train and test data are 0.618 and 0.600, respectively. Let us now move on to a method that can handle non-linearly separable data.

Support Vector Machines:

The Support Vector Machine (SVM) is one of the most popular algorithms and works for both classification and regression problems. It creates a decision boundary between points of different classes using the extreme points, known as support vectors. One of the best things about SVM is that once it finds the support vectors, it uses only them, rather than all the data points, to predict the test data, which helps reduce the time complexity.

In the code below, we again use the GridSearchCV method to find the best-fit value of the hyper-parameter ‘C’, as we did for the LR case. However, in order to take into account the non-linearity in the data, we choose the Radial Basis Function (rbf) kernel. This kernel checks the similarity between two points by using a ‘Gaussian-like’ function.

Code Snippet for SVM
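
The SVM snippet is also shown as an image in the post; a sketch following the description above (rbf kernel, grid search over ‘C’ with cv=3) might look like this, with the candidate values of ‘C’ being assumptions:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the exact values used in the post are not visible
params = {"C": [0.1, 1, 10, 100]}

svm_clf = GridSearchCV(
    SVC(kernel="rbf", gamma="scale"),   # rbf kernel to capture non-linearity
    param_grid=params,
    cv=3,
    scoring="accuracy",
)
svm_clf.fit(X_train, y_train)
print("Best C:", svm_clf.best_params_)
```
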
Confusion Matrix for Support Vector Machines

After training the SVM with the rbf kernel, we find that the off-diagonal values in both the train and test confusion matrices are smaller than for LR. This implies that using the ‘rbf’ kernel has helped us obtain a better classification of the data points.

Code Snippet to get SVM accuracy

The SVM’s accuracy for train data is 0.96525, while it is 0.89 for test data. It is worth noting that by accounting for point similarity, the accuracy has increased significantly over the previous case.

Code Snippet for SVM F1 score

The corresponding F1 scores for train and test data are 0.966, and 0.892, respectively. Let us now progress to a more complicated ML method.

Gradient Boosting Decision Tree:

The Gradient Boosting Decision Tree (GBDT) is one of those techniques that performs really well on non-linearly separable data. It is an ensemble technique in which decision trees are used as the base learners and the model is fitted in a cascading, stage-wise fashion. The model is first trained on the given data; the difference between the model’s prediction and the actual target, known as the pseudo-residual, then serves as the target for the next decision tree. This procedure is repeated until adequate accuracy is obtained without overfitting the model. One of the major advantages of GBDT is that it can work with any loss function, provided it is differentiable.
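
To make the cascading idea concrete, here is a small toy example (not from the post) that fits successive shallow regression trees to the pseudo-residuals of a squared loss:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem
rng = np.random.RandomState(0)
X_toy = np.linspace(0, 10, 200).reshape(-1, 1)
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y_toy)     # start from a zero model
trees = []

for _ in range(50):
    residual = y_toy - prediction     # the pseudo-residual for squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X_toy)   # cascade: add the new tree

print("Final training MSE:", np.mean((y_toy - prediction) ** 2))
```
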

We apply the GBDT approach in the code below, tuning two of the most relevant hyperparameters: ‘max_depth’, which limits the depth of each decision tree, and ‘min_samples_split’, which sets the minimum number of samples required to split an internal node. Following the same procedure as above, we fit our model on the training data to find the best-fit hyperparameters.

Code Snippet for GBDT
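
The GBDT snippet is shown as an image in the post; a sketch of the described grid search over ‘max_depth’ and ‘min_samples_split’ is given below, with the candidate values being assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grids; the exact candidate values are not visible in the post
params = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 5, 10],
}

gbdt_clf = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid=params,
    cv=3,
    scoring="accuracy",
)
gbdt_clf.fit(X_train, y_train)
print("Best parameters:", gbdt_clf.best_params_)
```
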

The best-fit parameters are as follows:

Best-fit parameters
Confusion Matrix for Gradient Boosting Decision Trees

In the confusion matrix plot above, it is easy to see that all diagonal elements have substantially greater values than the off-diagonal ones. The plot also clearly indicates that GBDT has performed the best among the three models we have used.

Code Snippet for GBDT accuracy

Remarkably, the training accuracy has reached more than 99% and the test accuracy 95.5%, which is a substantial improvement over the LR and SVM models.

GBDT F1 score

The F1 scores for the train and test data have also improved, to 0.991 and 0.955, respectively.

Feature Importance:

We have, in total, eleven features that determine the severity of airplane accidents, but not all of them contribute to the same extent. From the figure given below, we can check which features are dominantly responsible for determining the severity.
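
A figure like the one below can be generated from the fitted model’s feature importances; here is a sketch, assuming the grid-search object ‘gbdt_clf’ from the previous sketch and the original feature DataFrame ‘X’ from the splitting step:

```python
import numpy as np
import matplotlib.pyplot as plt

# Feature importances from the best GBDT model found by the grid search
best_gbdt = gbdt_clf.best_estimator_
importances = best_gbdt.feature_importances_
feature_names = np.array(X.columns)        # column names of the unscaled DataFrame

order = np.argsort(importances)[::-1]      # most important features first
plt.figure(figsize=(8, 4))
plt.bar(feature_names[order], importances[order])
plt.xticks(rotation=45, ha="right")
plt.ylabel("Importance")
plt.tight_layout()
plt.show()
```
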

Feature importances

From this figure, we find that the following are the top four features that contribute the most to the severity of airplane accidents.

Important features

Conclusion:

In this case study, we found that airplane accident data is not easily interpretable and therefore requires a fairly complex computational technique, such as GBDT. The latter reaches an accuracy of 95.5% on the test data, although it is slightly over-fitted, as the training accuracy exceeds 99%. Nevertheless, it turns out to be the best model compared with Logistic Regression and SVM. Finally, we find that out of the eleven features, only four contribute more than 10% each to determining the severity of accidents.

Future Work:

Using the above airplane accident data, one can extend the analysis with various deep learning techniques, which have the potential to give better results. Moreover, if data becomes available that specifies either the operator of the aircraft involved or the region in which the accident happened, a lot of additional information could be extracted to help identify the causes of accidents.

References:

  1. https://www.kaggle.com/code/kaushal2896/airplane-accidents-severity-starter-eda-rf
  2. https://www.kaggle.com/code/snide713/flight-crash-severity-prediction-top-1-approach
