The Severity of Airplane Accidents: Predictions Using ML Techniques
Table of Contents:
- Introduction
- Exploratory Data Analysis (EDA)
- Split the data
- Logistic Regression
- Support Vector Machines
- Gradient Boosting Decision Tree
- Feature Importance
- Conclusion
- Future Work
- References
Introduction:
Aircraft accidents are among the most dangerous and catastrophic types of vehicle accidents. Since the first aviation disaster, a balloon crash on June 15, 1785, in France, there has been a long sequence of tragic incidents in aviation, including the crash of the Wright Model A, a powered aircraft built by the Wright Brothers.
Since 1980, the frequency of aviation accidents has been falling year by year, suggesting a growing understanding of the factors that cause an airplane to crash. This article discusses one such case study, based on the airplane accident severity dataset available on Kaggle: https://www.kaggle.com/datasets/kaushal2896/airplane-accidents-severity-dataset.
Exploratory Data Analysis (EDA):
In this part, we study the data using visualization to see whether it can help us draw conclusions about the severity of accidents.
Accidents in the data are classified into four groups based on their severity: (1) Minor Damage And Injuries, (2) Highly Fatal And Damaging, (3) Significant Damage And Serious Injuries, and (4) Significant Damage And Fatalities. Because not all causes contribute equally to an accident, it is critical to identify the key ones so that accidents can be prevented by addressing them. Machine Learning (ML) approaches, as well as data analysis (to a limited extent), can help us extract the necessary information about these incidents.
One may ask how ML algorithms, which are merely mathematical in nature, can reveal hidden knowledge about accidents and help us avoid them. Let us look at the code below and see what we can find.
Split the data:
We begin by importing the ‘train.csv’ file, which has the shape (10000, 12). After removing the class labels, given by ‘y’, we are left with the remaining data, given by ‘X’, which has the shape (10000, 11). Since we want some data points to validate our prediction accuracy, we randomly divide ‘X’ and ‘y’ in an 80:20 ratio, with 80 percent of the data used for training and the remainder used as test data. We use stratified sampling to partition the data, which means that the ratio of class labels in both train and test data is almost identical.
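As a sketch of this step, the snippet below performs the same 80:20 stratified split on a synthetic frame of the same shape, since the Kaggle CSV is not reproduced here; the column names and the ‘Severity’ label name are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 'train.csv'; the real file has shape (10000, 12).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 11)),
                  columns=[f"feature_{i}" for i in range(11)])
df["Severity"] = rng.integers(1, 5, size=1000)  # four severity classes

y = df["Severity"]
X = df.drop(columns=["Severity"])

# 80:20 stratified split: class ratios in train and test stay nearly equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
```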
Since we have several measures of severity, such as ‘Safety Score’, ‘Turbulence In gforces’, and ‘Adverse Weather Metric’, some (or all) of them can be of different orders of magnitude. Training a model with features on different scales leads to poor predictions, because distances along one feature (say, Safety Score) are not comparable with distances along the rest. As a result, outliers that lie far from the bulk of the data cannot be reliably distinguished from inliers. It is therefore recommended to scale every feature. Here, we use StandardScaler() from the scikit-learn library, which standardizes each feature to zero mean and unit variance (scaling to the range [0, 1] would instead require MinMaxScaler()).
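A minimal sketch of the scaling step, using random stand-in arrays in place of the actual features; the key point is to fit the scaler on the training split only and reuse the fitted transformation on the test split to avoid information leakage.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrices on a deliberately large scale
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(800, 11))
X_test = rng.normal(loc=50.0, scale=10.0, size=(200, 11))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train data only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics

# Each training feature now has (approximately) zero mean and unit variance
print(X_train_scaled.mean(axis=0).round(6)[0], X_train_scaled.std(axis=0).round(6)[0])
```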
Logistic Regression:
Logistic Regression (LR), despite what its name suggests, is a classification technique. It is slightly more advanced than Linear Regression in the sense that it uses a sigmoid function, which gives a probabilistic interpretation of whether a point belongs to the positive or the negative class. In addition, the sigmoid function reduces the impact of outliers: if an outlier at a very large distance is added to the data, the sigmoid output changes very little, because the sigmoid never exceeds unity and is always bounded inside the range [0, 1]. It should also be noted that LR works best when the classes are substantially separable.
To implement the LR classifier and train the model, we use GridSearchCV to find the best-fit value of the hyperparameter ‘C’. In this procedure, the model trains itself on each value in the set we supply for ‘C’ and returns the one that fits the data best. Since cross-validation requires some ‘unseen’ data, GridSearchCV divides the given train data into several parts, controlled by the argument ‘cv’. In the code below we take cv=3, which means that for each value of ‘C’ in ‘params’, it splits the train data into three equal parts, two-thirds of which are used for training and the remaining one-third for cross-validation.
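A sketch of that grid search, assuming a simple logarithmic grid for ‘C’ (the exact values used in the article are not shown) and a synthetic four-class dataset in place of the real one:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in for the article's training split
X_train, y_train = make_classification(n_samples=800, n_features=11,
                                       n_informative=6, n_classes=4,
                                       random_state=42)

params = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}  # assumed search grid
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid=params, cv=3)  # 3-fold cross-validation
grid.fit(X_train, y_train)
print(grid.best_params_)
```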
To assess prediction accuracy and see how many points are properly identified and how many are not, we use a confusion matrix plot for visualization. In the figure above, one can see that the training data is fitted better than the test data, but the off-diagonal values are high in both cases. This means that LR does not work well for this data, likely because the data is not linearly separable.
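To illustrate how such a matrix is read, here is a toy example with hypothetical labels; in the article the matrix is plotted (for instance with scikit-learn's ConfusionMatrixDisplay), but the numbers carry the same meaning: diagonal entries are correct predictions, off-diagonal entries are misclassifications.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted severity labels for eight points
y_true = np.array([1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([1, 2, 2, 2, 3, 4, 4, 4])

cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])
print(cm)
print(cm.trace() / cm.sum())  # overall accuracy = diagonal sum / total
```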
The overall accuracy on the training data is 0.644375, whereas on the test data it is 0.6235.
Similarly, the F1 scores for train and test data are 0.618 and 0.600, respectively. Let us now implement a method that can handle data that is not linearly separable.
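For reference, F1 scores like those quoted here can be computed with scikit-learn; the labels below are hypothetical, and for a four-class problem an averaging scheme (e.g. 'weighted') must be chosen.

```python
from sklearn.metrics import f1_score

# Hypothetical multiclass labels; 'weighted' averages the per-class F1
# scores, weighted by how many true samples each class has.
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 2, 2, 2, 3, 4, 4, 4]

score = f1_score(y_true, y_pred, average="weighted")
print(round(score, 3))
```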
Support Vector Machines:
The Support Vector Machine (SVM) is one of the most popular algorithms for both classification and regression problems. It creates a decision boundary between points of different classes by using the extreme points, known as the support vectors. One of the best things about the SVM is that once it finds the support vectors, it uses only them, not all the data points, when predicting on test data. This helps reduce the time complexity.
In the code below, we again use the GridSearchCV method to find the best-fit value of the hyper-parameter ‘C’, as we did for the LR case. However, in order to take into account the non-linearity in the data, we choose the Radial Basis Function (rbf) kernel. This kernel checks the similarity between two points by using a ‘Gaussian-like’ function.
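A sketch of this step, again on a synthetic stand-in dataset and with an assumed grid for ‘C’; the only substantive change from the LR search is swapping the estimator for an SVC with the rbf kernel.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical stand-in for the article's training split
X_train, y_train = make_classification(n_samples=400, n_features=11,
                                       n_informative=6, n_classes=4,
                                       random_state=42)

params = {"C": [0.1, 1, 10, 100]}  # assumed grid; the article tunes only C
grid = GridSearchCV(SVC(kernel="rbf"), param_grid=params, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
```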
After training the SVM with the rbf kernel, we find that the off-diagonal values in both train and test confusion matrices are lower than for LR. This implies that the ‘rbf’ kernel has helped us obtain a better classification of the data points.
The SVM’s accuracy is 0.96525 on train data and 0.89 on test data. It is worth noting that, by accounting for point similarity, the accuracy has increased significantly over the previous case.
The corresponding F1 scores for train and test data are 0.966 and 0.892, respectively. Let us now move on to a more complex ML method.
Gradient Boosting Decision Tree:
The Gradient Boosting Decision Tree (GBDT) is one of those techniques that performs really well on data that is not linearly separable. It is an ensemble technique in which decision trees are used as the base learners, and the model is fitted in a cascading fashion: the first tree is trained on the given data, and the difference between the model's prediction and the actual class label, known as the ‘pseudo-residual’, then serves as the target for the next decision tree. This process is repeated until adequate precision is obtained without overfitting the model. One major advantage of GBDT is that it can incorporate any loss function, provided it is differentiable.
We apply the GBDT approach in the code below, tuning two of the most relevant hyperparameters: ‘max_depth’, which limits the depth of each decision tree, and ‘min_samples_split’, the minimum number of samples required to split an internal node. Following a similar procedure as above, we fit our model on the train data to find the best-fit hyperparameters.
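A sketch of that search, with an assumed small grid and a synthetic dataset standing in for the real features; GridSearchCV then exposes the winning combination through best_params_.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in for the article's training split
X_train, y_train = make_classification(n_samples=300, n_features=11,
                                       n_informative=6, n_classes=4,
                                       random_state=42)

# Assumed grid: tree depth and minimum samples needed to split a node
params = {"max_depth": [2, 3], "min_samples_split": [2, 10]}
grid = GridSearchCV(GradientBoostingClassifier(n_estimators=50, random_state=42),
                    param_grid=params, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
```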
The best-fit parameters are as follows:
In the above confusion matrix plot, it is easy to see that all diagonal elements have substantially greater values than the off-diagonal ones. The plot also clearly indicates that GBDT has performed the best among the three models we have used.
Remarkably, the training accuracy has reached more than 99% and the test accuracy 95.5%, a marked improvement over the LR and SVM models.
The F1 scores for train and test data have also improved, to 0.991 and 0.955, respectively.
Feature Importance:
We have, in total, eleven features that determine the severity of airplane accidents. However, not all of them contribute to the same extent. From the figure given below, we can check which features are dominantly responsible for determining the severity.
From this figure, we find that the following are the top four features contributing most to the severity of airplane accidents.
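Since the article's bar plot is not reproduced here, the sketch below shows how such a ranking can be extracted from a fitted GBDT via its feature_importances_ attribute; the data and feature names are synthetic placeholders for the real columns (‘Safety Score’, ‘Turbulence In gforces’, and so on).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in with four genuinely informative features out of eleven
X, y = make_classification(n_samples=300, n_features=11, n_informative=4,
                           n_classes=4, random_state=42)
feature_names = [f"feature_{i}" for i in range(11)]  # placeholder names

model = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X, y)

# Importances are normalized to sum to 1; sort descending and take the top 4
order = np.argsort(model.feature_importances_)[::-1]
for idx in order[:4]:
    print(feature_names[idx], round(model.feature_importances_[idx], 3))
```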
Conclusion:
In this case study, we found that airplane accident data is not easily interpretable and therefore requires a fairly complex computational technique such as GBDT, which reaches an accuracy of 95.5% on test data. It does overfit slightly, as the training accuracy is over 99%. Nevertheless, it turns out to be the best of the three models, ahead of Logistic Regression and SVM. Finally, we find that of the eleven features, only four contribute more than 10% each in determining the severity of accidents.
Future Work:
Using the above airplane accident data, one can extend the analysis with various deep learning techniques, which have the potential to give better results. Moreover, if data becomes available that specifies the airline operating the aircraft or the region in which the accident happened, a lot of additional information could be extracted to help identify the causes of accidents.