Airplane accident severity - Analysis and Prediction

Feb 17, 2022

The IAF helicopter was on its way from an Air Force base when it crashed near Coonoor in Tamil Nadu. (Photo: TNN)

Introduction:

Flying has been the go-to mode of travel for years now; it is time-saving, affordable, and extremely convenient. According to the FAA, 2,781,971 passengers flew every day in the US as of June 2019. Passengers reckon that flying is very safe, considering that strict inspections are conducted and security measures are taken to avoid or mitigate any mishap. However, a small chance of unfortunate incidents remains. A recent example is the IAF helicopter crash in Coonoor, Tamil Nadu, shown in the photo above, which killed Chief of Defence Staff General Bipin Rawat, his wife, and 11 others.

Business problem:

Description:

How severe can an airplane accident be? As noted in the introduction, flying is widely considered safe thanks to strict inspections and security measures, yet a small chance of unfortunate incidents remains.

Imagine we are given a project by a leading airline: build the best machine learning model we can to predict accident severity from past data. With such a model, airlines, and even the entire aviation industry, could predict the severity of airplane accidents caused by various factors and plan accordingly to minimize the associated risk.

So, the purpose of this self-case study is to find the model that best predicts the target variable, identify the variables that matter most within that model, and classify the severity of an accident (Minor_Damage_And_Injuries, Significant_Damage_And_Fatalities, etc.).

Source of dataset: https://www.kaggle.com/abilashcheruvathur/airplane-accident-severity-analysis-prediction

The dataset contains two files: Train.csv and Test.csv.
Train.csv has 10,000 rows and 12 columns, such as Accident ID, Severity, Safety Score, Violations, etc.
Test.csv also contains 12 columns, the same as Train.csv.

Column analysis:

Accident ID: unique id assigned to each row
Accident_Type_Code: the type of accident (factor, not numeric)
Cabin_Temperature: the last recorded temperature before the incident, measured in degrees Fahrenheit
Max_Elevation: the maximum altitude the plane reached during the event
Adverse_Weather_Metric: a measure of how adverse the weather was when the event occurred
Turbulence_In_gforces: the recorded/estimated turbulence experienced during the accident
Control_Metric: an estimation of how much control the pilot had during the incident given the factors at play
Total_Safety_Complaints: number of complaints from mechanics prior to the accident
Days_Since_Inspection: how long the plane went without inspection before the incident
Safety_Score: a measure of how safe the plane was deemed to be
Violations: number of violations that the aircraft received during inspections
Severity: a description (4-level factor) of the severity of the crash [Target]

We cannot say in advance which model will be good or bad; it all depends on the dataset. So, to understand how the dataset behaves, we first run exploratory data analysis (EDA) on it.

Real-world/Business objectives and constraints:

Interpretability is important.
Minimize multi-class error.
No strict latency concerns.
Incorrect classifications affect the analysis.

Performance Metrics:

For these ML models I consider two metrics for measuring performance.

1- Multi Class Log loss

2- Confusion Matrix
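
To make the two metrics concrete, here is a minimal, dependency-free sketch of how they can be computed. The function names and the toy labels and probabilities are made up for illustration and do not come from the case study; in practice scikit-learn's `log_loss` and `confusion_matrix` serve the same purpose.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log of the probability assigned to the true class.

    y_true: list of class indices; y_prob: one list of class probabilities per row.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Toy example with 3 classes (illustrative values, not from the dataset)
y_true = [0, 1, 2, 1]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
# Predicted class = index of the highest probability in each row
y_pred = [max(range(3), key=lambda k: p[k]) for p in y_prob]

loss = multiclass_log_loss(y_true, y_prob)
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print("log loss:", round(loss, 4))
print("confusion matrix:", cm)
```

A lower log loss means the model puts more probability on the correct class, while the confusion matrix shows which classes get mistaken for which.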

Data distribution:

1- Target Variable :

Number of data points per class:
Highly_Fatal_And_Damaging: 3049 (30.49%)
Significant_Damage_And_Serious_Injuries: 2729 (27.29%)
Minor_Damage_And_Injuries: 2527 (25.27%)
Significant_Damage_And_Fatalities: 1695 (16.95%)

As the pie plot above shows, the target variable is not biased toward a single class: the largest class holds about 30% of the rows and the smallest about 17%. So the dataset is reasonably balanced, and class imbalance is not a problem in this case study.
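
The class shares can be checked with a few lines of arithmetic. This sketch uses the class counts listed above; the variable names are illustrative, and the ratio between the largest and smallest class is just one rough way to judge balance.

```python
# Class counts as reported for Train.csv
counts = {
    "Highly_Fatal_And_Damaging": 3049,
    "Significant_Damage_And_Serious_Injuries": 2729,
    "Minor_Damage_And_Injuries": 2527,
    "Significant_Damage_And_Fatalities": 1695,
}

total = sum(counts.values())

# Round to two decimals so floating-point noise does not leak into the report
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

# Ratio between the most and least frequent class
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, share in shares.items():
    print(f"{name}: {counts[name]} ({share}%)")
print("largest / smallest class:", round(imbalance_ratio, 2))
```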

1- Accident Type Code:

As the plot above shows, Accident_Type_Code is a categorical feature. Type 4 occurs more often than any other type code, and type 5 occurs least often.

2- Days Since Inspection:

As the plot above shows, Days_Since_Inspection can be treated as a categorical feature. The value 13 occurs most often, and the value 1 occurs least often.

3- Violations:

As the plot above shows, Violations is also a categorical feature.

Let's see how the numerical features relate to accident severity.

1- Safety Score:

All the classes have different means.
'Significant_Damage_And_Fatalities' has a different shape, while the other classes look roughly Gaussian.
The data for all classes ranges from Safety_Score 0 to 120.

2- Total_Safety_Complaints:

When Total_Safety_Complaints is roughly between 0 and 1, Highly_Fatal_And_Damaging is at its maximum.
The accident rate then declines gradually as Total_Safety_Complaints increases.

3- Control_Metric:

No two classes have the same mean.
None of the classes follows a Gaussian distribution.
All classes overlap each other.
The data for all classes ranges from Control_Metric 0 to 100.

4- Turbulence_In_gforces:

Significant_Damage_And_Serious_Injuries, Significant_Damage_And_Fatalities, and Minor_Damage_And_Injuries have about the same mean.
Highly_Fatal_And_Damaging has a different mean.
All classes overlap heavily.
The data for all classes ranges from Turbulence_In_gforces 0 to 1.0.

5- Cabin_Temperature:

Significant_Damage_And_Serious_Injuries, Significant_Damage_And_Fatalities, and Minor_Damage_And_Injuries have about the same mean.
All classes overlap heavily.
The data for all classes ranges from Cabin_Temperature 75 to 100.

6- Max_Elevation:

Significant_Damage_And_Serious_Injuries and Significant_Damage_And_Fatalities have nearly the same mean.
Highly_Fatal_And_Damaging has a different mean.
All classes overlap each other.
The data for all classes ranges from Max_Elevation 0 to 70000.

7- Adverse_Weather_Metric:

When Adverse_Weather_Metric is around 0.1, the chances of Minor_Damage_And_Injuries are higher.
Highly_Fatal_And_Damaging does not follow a Gaussian distribution.
All classes overlap each other.
The data for all classes ranges from Adverse_Weather_Metric 0.0 to 2.5.

Now let's see how the features correlate with one another.

1- Turbulence_In_gforces vs Control_Metric:

The plot above shows a relationship between Turbulence_In_gforces and Control_Metric.
As Turbulence_In_gforces increases, Control_Metric decreases.
However, many points overlap in the graph, so it is hard to draw firm conclusions from the plot alone.

2- Adverse_Weather_Metric vs Max_Elevation:

On the left side of the plot, the highest Max_Elevation values occur when Adverse_Weather_Metric is close to 0.0.

3- Days_Since_Inspection vs Safety_Score:

The plot above shows a correlation between Safety_Score and Days_Since_Inspection.
As Days_Since_Inspection increases, Safety_Score decreases.

Let's plot the correlation matrix of the dataset:

Correlation values range from -1 to 1.
In the heatmap above, dark red indicates that two variables are strongly negatively correlated. For example, Safety_Score and Days_Since_Inspection are negatively correlated, which is why their square is dark red.
Dark green indicates that two variables are strongly positively correlated; the diagonal, where each variable is compared with itself, has a value of 1.
Adverse_Weather_Metric is strongly negatively correlated with Accident_Type_Code, so that box is also dark red, just like the one for Days_Since_Inspection and Safety_Score.
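
As a reminder of what each cell of a correlation matrix holds, here is a small sketch of the Pearson correlation coefficient, written without libraries. The sample values are made up to mimic a falling safety score as days since inspection grow; they are not taken from the dataset. With pandas, `DataFrame.corr()` would compute the same quantity for every pair of columns.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only (hypothetical, not from Train.csv)
days_since_inspection = [5, 8, 11, 14, 17, 20]
safety_score = [70.0, 62.0, 55.0, 41.0, 36.0, 22.0]

r = pearson(days_since_inspection, safety_score)
r_self = pearson(days_since_inspection, days_since_inspection)

print("days vs safety score:", round(r, 3))  # close to -1: strong negative correlation
print("days vs itself:", round(r_self, 3))   # a variable with itself gives 1
```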

From the EDA I concluded that the dataset is not linearly separable, so linear models such as logistic regression, which perform well on linearly separable data, are unlikely to do well here. We therefore turn to models such as XGBoost and Random Forest, which handle this type of data well. As an initial approach I selected XGBoost, an ensemble technique in which new models are added to correct the errors made by the existing models. Models are added sequentially until no further improvement can be made.

First I applied some feature engineering to the dataset, using statistical and domain-knowledge approaches, for example the mean and median of all numerical features and other statistical operations.

As I proceeded, I realized that the columns Accident_Type_Code, Days_Since_Inspection, and Violations are really categorical features: when a feature has only a small number of unique values, we can treat it as categorical rather than numerical. With several categorical features in the dataset, we can apply encoding techniques such as one-hot encoding, ordinal encoding, and others. Here we use one-hot encoding and response encoding on the categorical features and build several ML models on each representation: one set on (one-hot encoded features + numerical features) and another on (response-encoded features + numerical features). I got all of these ideas and concepts from my Applied AI mentor, who helped me a lot in completing this case study.

What is Response Coding?

Response coding is a technique for representing categorical data when solving a machine learning classification problem. For each category, we represent a data point by the probability of it belonging to each class given that category.
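
A minimal sketch of response coding follows. The function names, the Laplace smoothing parameter `alpha`, the uniform fallback for unseen categories, and the toy data are all assumptions made for illustration; this is not the notebook's own helper.

```python
from collections import Counter, defaultdict

def fit_response_coding(categories, labels, classes, alpha=1.0):
    """Return {category: [P(class | category) for each class]}.

    alpha is a Laplace smoothing term (an assumption of this sketch) that keeps
    probabilities away from exactly 0 for rare category/class pairs.
    """
    counts = defaultdict(Counter)
    for cat, label in zip(categories, labels):
        counts[cat][label] += 1
    table = {}
    for cat, label_counts in counts.items():
        total = sum(label_counts.values()) + alpha * len(classes)
        table[cat] = [(label_counts[c] + alpha) / total for c in classes]
    return table

def transform_response_coding(categories, table, n_classes):
    """Map each category to its probability vector; unseen categories get a uniform vector."""
    uniform = [1.0 / n_classes] * n_classes
    return [table.get(cat, uniform) for cat in categories]

# Toy training data: a categorical code and a 2-class target (hypothetical values)
classes = ["Minor", "Fatal"]
train_codes = [1, 1, 1, 2, 2]
train_labels = ["Minor", "Minor", "Fatal", "Fatal", "Fatal"]

table = fit_response_coding(train_codes, train_labels, classes)
encoded = transform_response_coding([1, 2, 7], table, n_classes=len(classes))
print(encoded)  # one probability vector per input row; code 7 was never seen in training
```

The table should be fitted on the training split only and then reused for the CV and test splits, so that target information from those splits does not leak into the features.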

One Hot Encoding:

One-hot encoding converts categorical variables into a form that machine learning and deep learning algorithms can consume, which in turn can improve the predictions and classification accuracy of a model.

Assume we have a sequence of labels with the values 'red' and 'green'. We can assign 'red' the integer value 0 and 'green' the integer value 1. As long as we always assign these numbers to these labels, this is called an integer encoding. One-hot encoding goes one step further and replaces each integer with a binary vector that has a 1 in the position of that category and 0 elsewhere, so 'red' becomes [1, 0] and 'green' becomes [0, 1].
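
The 'red'/'green' example can be sketched in a few lines. The helper name `one_hot_encode` is hypothetical, and the category order is fixed to match the integer assignment in the text ('red' = 0, 'green' = 1); scikit-learn's `OneHotEncoder` or pandas' `get_dummies` are the usual library choices.

```python
def one_hot_encode(values, categories):
    """Turn each value into a binary vector with a single 1 at its category's position."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        if value in index:  # categories not seen before stay all-zero
            row[index[value]] = 1
        rows.append(row)
    return rows

labels = ["red", "green", "red"]
categories = ["red", "green"]  # 'red' -> position 0, 'green' -> position 1

encoded = one_hot_encode(labels, categories)
print(encoded)
```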

Model Building & Implementation:

1- Response Encoding on Categorical Features + Standard Scaling on Numerical Features:

Response Encoding on Categorical Features:

response_values('Accident_Type_Code', X_train)

With the call above we apply response coding to the categorical features (Accident_Type_Code, Days_Since_Inspection, Violations).

Standard scaling on numerical features:

Standard scaling is applied to the numerical features. Cabin_Temperature is taken as the example here; the same scaling is applied to the other numerical features.
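
Standard scaling subtracts the training mean and divides by the training standard deviation. The sketch below does this by hand for a hypothetical handful of cabin temperatures; the function names and values are assumptions, and scikit-learn's `StandardScaler` performs the same computation with `fit` and `transform`.

```python
import math

def fit_standard_scaler(values):
    """Return the mean and population standard deviation of the training values."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std

def apply_standard_scaler(values, mean, std):
    """Scale values with statistics learned from the training data."""
    scale = std if std > 0 else 1.0  # guard against constant columns
    return [(v - mean) / scale for v in values]

# Hypothetical Cabin_Temperature readings in degrees Fahrenheit
train_temps = [78.0, 80.0, 82.0, 84.0, 86.0]
test_temps = [82.0, 90.0]

mean, std = fit_standard_scaler(train_temps)
train_scaled = apply_standard_scaler(train_temps, mean, std)
test_scaled = apply_standard_scaler(test_temps, mean, std)  # reuse the train statistics

print("mean:", mean, "std:", round(std, 4))
print("scaled test values:", [round(v, 4) for v in test_scaled])
```

Fitting on the training split and reusing the same mean and standard deviation for CV and test data keeps the three splits on one scale.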

Merging all categorical and numerical features:

Using hstack we merge all the categorical and numerical features, and on the merged data we build the machine learning models.
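
Horizontal stacking simply places the feature blocks side by side, row by row. The dependency-free sketch below shows the idea with a hypothetical helper `hstack_rows` and made-up values; with NumPy or SciPy the equivalent calls are `numpy.hstack` and `scipy.sparse.hstack`.

```python
def hstack_rows(*blocks):
    """Concatenate feature blocks column-wise; every block must have the same row count."""
    n_rows = len(blocks[0])
    if any(len(block) != n_rows for block in blocks):
        raise ValueError("all blocks must have the same number of rows")
    return [sum((list(block[i]) for block in blocks), []) for i in range(n_rows)]

# Two rows of response-coded categorical features and one scaled numerical feature
# (illustrative values only)
response_coded = [[0.6, 0.4], [0.25, 0.75]]
scaled_numeric = [[-1.0], [1.0]]

merged = hstack_rows(response_coded, scaled_numeric)
print(merged)  # each row now holds the categorical columns followed by the numerical one
```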

Here we create a total of 4 models:

1- KNN model

2- Logistic Model

3- Random Forest Model

4- XGBoost Model

Comparison of Models:

2- One Hot Encoding on Categorical Features + Standard Scaling on Numerical Features:

One-hot encoding is applied to the categorical features. Accident_Type_Code is taken as the example; the other categorical features are encoded the same way and merged with the numerical features as before, and the machine learning models are built on the merged data.

Here we create 5 models:

1- KNN model

2- Logistic Model

3- Random Forest Model

4- XGBoost Model

5- XGBoost with hyperparameter tuning

Comparison of Models:

From the comparison above, XGBoost performs very well on the CV and test sets and has fewer misclassified points than the other models, probably because the dataset is not linearly separable.
On the training set, Random Forest performs very well compared to the others.
The logistic model performs worst of all the models.

Selection of Model:

So we select XGBoost with one-hot encoding + standard scaling, because it gives the best results and performance.

Future work:

Instead of classical machine learning models, we could try deep learning models, which may give better performance here. For example, an LSTM could be used to try to improve the accuracy.

Thank you for reading my blog!

Special Thanks:

Special thanks to my Applied AI team (appliedaicourse.com), who helped me a lot in solving this case study.

Reference :

Some articles, reference blogs, and papers about the problem statement:

https://medium.com/analytics-vidhya/the-severity-of-airplane-accidents-305136e495b8
Predicting General Aviation Accidents Using Machine Learning Algorithms, Bradley S. Baugh
Aviation Accident Analysis: A Case Study, M. Shahriari and M.E. Aydin
The dataset, provided by Microsoft, contains data about airplane accidents.
Source: https://www.kaggle.com/abilashcheruvathur/airplane-accident-severity-analysis-prediction
https://www.youtube.com/watch?v=XI0619WGa_o

Contact :

My github notebook :: https://github.com/itzaamer/Airplane-Accident-Severity---Case-study--1
LinkedIn Profile :: https://www.linkedin.com/in/mohammad-amer-khan-a69901191/