Predicting the severity of Airplane accidents with Feature Engineering

Sudeesh Reddy
9 min read · Aug 2, 2022

Business Problem and Overview :

Nowadays air transport plays a key role in every industry, and the safety of airlines and passengers is the first concern. The use of airways for passengers and cargo is increasing rapidly. Safety checks are carried out manually and continuously, 24/7, and every safety measure and precaution is taken by the airline team, but accidents still happen for various reasons: pilot error, air traffic controller error, design and manufacturing defects, maintenance failures, sabotage, inclement weather, etc. There are ways to avoid such failures.

Our goal is to reduce airline accidents as much as possible, and data from previous crashes and failures can help us prevent similar cases in the future. This can save a lot of lives and money. Automatically predicting the severity of an aeroplane accident gives an idea of what is happening, so that the right measures can be taken. Using features such as Safety Score, Days Since Inspection, Total Safety Complaints, Control Metric, Turbulence in g-forces, Cabin Temperature, Accident Type Code, Max Elevation, Violations, and Adverse Weather Metric, we predict the severity of an accident, classified into four types:

1. Minor Damage and Injuries

2. Significant Damage and Fatalities

3. Significant Damage and Serious Injuries

4. Highly Fatal and Damaging

Use of Machine Learning :

For this problem, a predictive model adds real value and can save lives and money. Models such as trees, boosting, SVMs, and even deep learning models help us reach a faster and more accurate solution. Here is how our model should behave in production — some constraints:

  • medium latency requirement
  • Interpretability is important.
  • Errors can be very costly.

Source of your data and problem :

The data and problem come from a competition on HackerEarth — Link

The metric used for evaluation is the F1 score.
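As a quick sanity check, scikit-learn's f1_score can compute this metric. A minimal sketch on toy labels — the macro average here is an assumption (it treats all four severity classes equally, which matters for imbalanced data; the exact averaging used by the contest isn't stated):

```python
from sklearn.metrics import f1_score

# Toy true/predicted labels for a 4-class problem (illustration only)
y_true = [0, 1, 2, 3, 1, 2, 0, 3]
y_pred = [0, 1, 2, 3, 2, 2, 0, 1]

# Macro-averaged F1: per-class F1 scores, averaged with equal weight,
# so a rare class like Significant_Damage_And_Fatalities counts fully
score = f1_score(y_true, y_pred, average="macro")
print(round(score, 4))
```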

Data Info

Existing approaches :

Existing solution1: link

Their method is straightforward: find the top features and select only the top 5 as input. They mention that using all the features as input causes overfitting. They directly trained a GBDT with hyperparameter tuning, which gave 97% accuracy — but the data is imbalanced, and they did no pre-processing such as data normalization.

Existing solution2: link

This blog is similar to the one above, but the data is scaled with StandardScaler(), and I didn't find any feature engineering. The final model is a GBDT with an F1 score of 0.95 on validation data and 0.99 on train data; there is no test-data score, and they mention the GBDT is overfitting. These are the top features they found among the 11:

['Safety_Score', 'Days_Since_Inspection', 'Control_Metric', 'Accident_Type_Code']

My improvements to the existing solution:

  1. Balancing Data
  2. Data pre-processing Techniques
  3. Add Feature Engineering Techniques
  4. Feature Importance and Model Interpretability

Exploratory Data Analysis (EDA)

No Null values in the Data set

Class Feature :

  1. The train data has 10k rows
  2. Significant_Damage_And_Fatalities is the minority class
  3. Highly_Fatal_And_Damaging is the majority class, with the most data points

Bar Charts of Categorical with Class :

1. We can see clearly that accident_type_code 4 leads to Highly_Fatal_And_Damaging; similarly, type 2 leads to minor damage and type 3 to Significant Damage And Serious Injuries. It looks like accident type can help us.

2. The distribution of violations looks much the same for each class, except that the very few points with 5 violations mostly result in Significant Damage And Fatalities.

Scatter plots -2D and PDF:

PDFs :

  1. The safety score PDF looks good compared to the others; safety score might be an essential feature
  2. We can observe that 10–15 Days_Since_Inspection has more incidents
  3. All the other PDFs completely overlap each other, so let's try scatter plots with two features

Scatter Plots :

  1. Safety score vs. days_since_inspection looks good and easy to separate; we can try some feature engineering on these features.
  2. In accident_type_code vs. adverse_weather_metric, the yellow and green points are grouped together, as are the orange and blue points.
  3. In safety score vs. the rest, we can roughly separate at least two classes, yellow and blue.
  4. We also find an interesting plot: accident_type vs. violations.
  5. Only 4–5 features look good across the plots.
  6. Let's try a dimensionality-reduction plot and see if there is any luck.

Finding Outliers: using Z-score

Safety_Score = 29
Days_Since_Inspection=13
Total_Safety_Complaints=129
Control_Metric=20
Turbulence_In_gforces=19
Cabin_Temperature=93
Max_Elevation=27
Adverse_Weather_Metric=254
  1. Safety_Score: only a few outliers
  2. Days_Since_Inspection: only a few outliers
  3. Total_Safety_Complaints: many outliers
  4. Control_Metric: a few outliers
  5. Turbulence_In_gforces: no low or high outliers
  6. Cabin_Temperature: no low or high outliers
  7. Max_Elevation: a few outliers in all classes
  8. Adverse_Weather_Metric: many outliers
  9. Adverse_Weather_Metric and Total_Safety_Complaints are positively skewed in every class, so these features have many high outliers
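Outlier counts like the ones above come from a simple z-score rule. A minimal sketch on toy data, assuming the common |z| > 3 cutoff (the exact threshold used isn't stated in the post):

```python
import numpy as np

def zscore_outlier_count(x, threshold=3.0):
    """Count points whose |z-score| exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return int((np.abs(z) > threshold).sum())

# Toy column: a tight cluster of values plus one extreme point,
# mimicking a right-skewed feature like Total_Safety_Complaints
col = np.array([9, 10, 11] * 10 + [100])
print(zscore_outlier_count(col))
```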

T-SNE :

The t-SNE plot doesn't look too bad; we can find a few cluster groups in it.
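A t-SNE projection like this can be produced with scikit-learn. The matrix below is a random stand-in for the real scaled feature matrix, just to show the shape of the call:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the scaled training matrix: 100 points, 11 features
X = rng.normal(size=(100, 11))

# Project to 2D; perplexity must stay below the number of samples
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)
```

The 2D embedding would then be scattered with one colour per severity class to look for cluster groups.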

Correlation features :

  1. Most pairs have a correlation of less than 0.01
  2. (Accident_Type_Code — Safety_Score) has the highest positive correlation, 0.17
  3. (Adverse_Weather_Metric — Accident_Type_Code), (Turbulence_In_gforces — Control_Metric), and (Safety_Score — Days_Since_Inspection) have the highest negative correlations
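A pairwise correlation table like this comes straight from pandas. A toy sketch, with an engineered negative relationship standing in for a pair like Safety_Score vs. Days_Since_Inspection (the values are synthetic, not the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"Safety_Score": rng.normal(size=200)})
# Build a column that is negatively related to Safety_Score, plus noise
df["Days_Since_Inspection"] = -0.4 * df["Safety_Score"] + rng.normal(size=200)
df["Cabin_Temperature"] = rng.normal(size=200)

corr = df.corr()  # Pearson correlation by default
# The engineered relationship shows up as a negative coefficient
print(corr.loc["Safety_Score", "Days_Since_Inspection"])
```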

EDA Conclusion:

  1. We found that Safety_Score, Days_Since_Inspection, Accident_Type_Code, and Violations will help us a lot
  2. Adverse_Weather_Metric and Total_Safety_Complaints don't look like good features; they are right-skewed with many outliers
  3. Highly_Fatal_And_Damaging and Minor_Damage_And_Injuries can be separated easily, while the other two classes overlap a lot

My Approach:

Feature Engineering (10 Features)

I came up with these by looking at the PDFs, scatter plots, and correlations — for example, if a PDF looks positively skewed, apply a log transform.

These are my final features. I experimented with a few more that didn't add much value, so I removed them, such as:

  1. Binning Safety_score
  2. changing numerical feature to categorical feature then CountVectorizer

Data pre-processing:

  1. Applying StandardScaler to the numerical features
  2. Converting numerical features to categorical and encoding them with CountVectorizer
  3. Stacking all the features together (26 features in total)
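The three steps above can be sketched as follows. The column values are illustrative, not the real dataset, and treating the categorical codes as "tokens" for CountVectorizer is one way to realise step 2:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Toy columns: one numeric feature and one categorical code feature
num = np.array([[29.0], [41.5], [33.2], [50.1]])
accident_type = ["type_1", "type_4", "type_2", "type_4"]

# 1. Standardize the numeric feature
num_scaled = StandardScaler().fit_transform(num)

# 2. Encode the categorical codes as count/one-hot columns
cat_encoded = CountVectorizer().fit_transform(accident_type)

# 3. Stack everything into a single design matrix
X = hstack([num_scaled, cat_encoded]).toarray()
print(X.shape)
```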

Important features by RandomForest

[('FE1', 0.26530467171052063),
('Control_Metric', 0.15652313268107496),
('Days_Since_Inspection', 0.07326915157366469),
('Safety_Score', 0.05569825602490337),
('FE8', 0.05093572335387651),
('FE10', 0.049667016691478195),
('FE9', 0.0470837934097891),
('Turbulence_In_gforces', 0.03483850050070113),
('FE5', 0.031038350858261524),
('FE2', 0.030505454521039983),
('FE7', 0.02485302368170017),
('Adverse_Weather_Metric', 0.020498685385019644),
('FE3', 0.01986682945437444),
('FE6', 0.017958831123633105),
('Accident_Type_Code', 0.0142465304698986),
('Cabin_Temperature', 0.013514462462969376),
('Accident_Type_Code_3', 0.011186724714475023),
('Max_Elevation', 0.01086308943567474)]

These are the top features among the 26.

  1. Our new features are working well; some of the feature engineering techniques paid off
  2. The feature importances match the findings from EDA
  3. All our engineered features look good — now let's start modelling
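A feature-importance table like the one above can be produced with scikit-learn's RandomForestClassifier. This sketch uses synthetic data and made-up feature names in place of the real 26-feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training matrix
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
names = [f"f{i}" for i in range(8)]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Pair names with impurity-based importances and sort, as in the table above
ranked = sorted(zip(names, rf.feature_importances_),
                key=lambda t: t[1], reverse=True)
print(ranked[:3])
```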

Modelling :

KNN :

Hyperparameter tuning for KNN gives 5 neighbours with the Manhattan distance metric. KNN is highly overfitting, with a test score of 0.77. From the confusion matrix, class 1 is often classified as class 3, and there is a lot of confusion between classes 0 and 1.
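The tuning step can be sketched with GridSearchCV on toy data. The grid values are illustrative, though they include the winning k=5 / Manhattan combination:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Cross-validated search over k and the distance metric, scored by macro F1
grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7, 11], "metric": ["manhattan", "euclidean"]},
    scoring="f1_macro", cv=3,
).fit(X, y)
print(grid.best_params_)
```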

Logistic Regression:

Hyperparameter tuning found C=100 with L2 regularization. Logistic regression performs worse than KNN, and the confusion matrix shows a lot of confusion across all classes.

SVM:

After hyperparameter tuning we found the RBF kernel works well for our data. SVM's training time is high, but it is not overfitting and its performance is quite good; the confusion matrix still shows some confusion between classes 1–0 and 1–3.

Decision Tree :

After hyperparameter tuning, the max depth is 5 and the minimum samples per leaf is 5. The DT is not overfitting, its performance is very high, and the confusion between classes is reduced compared to SVM.
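The tuned tree can be sketched on synthetic data with the max_depth=5 and min_samples_leaf=5 settings found above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Shallow tree with a minimum leaf size — both settings limit overfitting
dt = DecisionTreeClassifier(max_depth=5, min_samples_leaf=5, random_state=0)
scores = cross_val_score(dt, X, y, cv=5, scoring="f1_macro")
print(round(scores.mean(), 3))
```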

Random Forrest:

After hyperparameter tuning, the number of trees is 1000. This model is highly overfitting — the train F1 score is 1 — but the test score is still pretty high, and the overlap between classes is reduced.

XgBoost:

After hyperparameter tuning, the number of estimators is 1000. This model is also highly overfitting — the train F1 score is 1 — but the test score is pretty high and the overlap between classes is reduced. For now, XGBoost is the best model, even though it is overfitting.

  1. LR: performance is bad
  2. SVM, RF, and XGBoost are overfitting, but their performance is good
  3. DT is not overfitting, with a 0.94 F1 score
  4. DT and XGBoost are the top performers

Model Interpretability (using SHAP) :

These five top features help predict the class Highly_Fatal_And_Damaging:

['Total_Safety_Complaints', 'Cabin_Temperature', 'FE5', 'Turbulence_In_gforces', 'FE2']

Upsampling Data :

Imbalance Data

after upsampling :

upsampling data

Training and testing the XGBoost model with the upsampled data shows only a slight change in performance; upsampling doesn't add much value here.
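One simple way to upsample, sketched with scikit-learn's resample on toy data (the post doesn't state the exact method used, so this random oversampling with replacement is an assumption; SMOTE or class weights would be alternatives):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Imbalanced toy set: 90 majority-class rows vs. 10 minority-class rows
X_maj, X_min = rng.normal(size=(90, 4)), rng.normal(size=(10, 4))

# Sample the minority class with replacement up to the majority count
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
print(X_bal.shape)
```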

HackerEarth competition result :

My test F1 score is 85.32502.

The top score is 87.42502.

Model Deployment :

I created a Streamlit application, integrated the model with it, and deployed it on an AWS EC2 instance.

How this application works:

  1. The user fills in the data features in the form
  2. Clicks on "Predict severity"
  3. The app returns the predicted class, along with the important features that drove that prediction

Video of the project :

Future work :

  1. Try more new feature engineering methods with the help of a domain expert
  2. Try deep learning models (MLPs) for feature engineering
  3. As the data is from US airlines, add data from other countries' airlines
  4. Decrease latency as much as possible using software engineering techniques
  5. With the help of domain experts, convert this problem into binary classification (fatal or not) and try cascading classifier models

References :

  1. https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/logistic-regression-analysis-r/practice-problems/machine-learning/how-severe-can-an-airplane-accident-be-03e7a3f1/
  2. https://medium.com/@mrmohit254/the-severity-of-airplane-accidents-predictions-using-ml-techniques-a2f25abc0f03
  3. https://medium.com/analytics-vidhya/the-severity-of-airplane-accidents-305136e495b8
  4. https://www.analyticsvidhya.com/blog/2021/05/detecting-and-treating-outliers-treating-the-odd-one-out/

My URLs :

LinkedIn

Github

