Let’s Save People From Airplane Crashes - Pilot Errors :)

Marala Kirandeep
11 min read · Sep 11, 2021


Table of Contents

  1. Problem Statement and Why Solve This?
  2. Explaining the Data and Features.
  3. Connecting To Machine Learning Problem
  4. Performance Metric and Why This Only?
  5. Exploratory Data Analysis
  6. Feature Engineering
  7. Feature Selection
  8. Modelling
  9. Deployment
  10. Further Improvements
  11. References

1. Problem Statement and Why Solve This?

We all know that aviation is a key industry for globalization. In India alone, it accounts for nearly 30% of the transportation share every year. At the same time, a lot of aviation fatalities take place every year: since 1970, roughly 83,772 people have died in aviation accidents across the world. There are many causes, such as pilot error, mechanical failure, design defects, air traffic control failure, and defective runways. But here our main focus is on aviation accidents caused by pilot error, and on solutions to avoid them, so that accidents in the aviation industry can be reduced, saving people's lives and saving airlines money.

Most flight-related fatalities occur due to a loss of “airplane state awareness”: that is, ineffective attention management on the part of pilots who may be distracted, sleepy, or in other dangerous cognitive states.

So our main goal is to build a model that detects troubling events from aircrew physiological data and alerts the pilot to take the necessary preventive steps.

2. Explaining the Data and Features

refer: https://www.kaggle.com/c/reducing-commercial-aviation-fatalities

This competition was hosted on Kaggle by the consulting firm Booz Allen Hamilton, whose expertise in analytics and cyber has been serving business, government, and military leaders for over 100 years.

Features of the Dataset:

  • ID — A unique identifier present only in the test and sample-submission CSVs, used to identify each test row.
  • Crew — A unique identifier for a pair of pilots. There are 9 crews in the data.
  • Experiment — The experiment the pilot has undergone. It may be CA, DA, SS, or LOFT (Line Oriented Flight Training, which is not present in train.csv, only in the test data).
  • Time — Seconds into the experiment.
  • Seat — Left (0) or Right (1).
  • EEG — Electroencephalography (EEG) measures the electrical activity of the brain. Electrodes placed on the scalp measure the voltage resulting from ionic currents. The 20 EEG features in the dataset are the voltages recorded by sensors placed on different parts of the scalp.
  • ECG — The electrocardiogram measures the electrical activity of the heart. It is measured in microvolts. The sensor for this experiment had a resolution/bit of .012215 μV and a range of -100 mV to +100 mV. The data are provided in microvolts.
  • Respiration — This measures the rise and fall of the chest. The sensor had a resolution/bit of .2384186 μV and a range of -2.0 V to +2.0 V.
  • Galvanic Skin Response — This measures electrodermal activity, which reflects sweat-gland activity and in turn the intensity of emotional stress. The sensor had a resolution/bit of .2384186 μV and a range of -2.0 V to +2.0 V. The data are provided in microvolts.
  • Event — This indicates the state of the pilot at a particular time.

Channelized Attention (CA): This state occurs when the pilot cannot multitask and focuses on only one thing, without caring about what is happening in the surrounding environment.

Diverted Attention (DA): In simple terms, it is like noticing or thinking about something else instead of focusing on the real task. It can be thought of as multitasking, but the attention is given to various unnecessary tasks.

Startle/Surprise (SS): This state can be defined as a surprise or a sudden change in the scenario where attention gets disrupted abruptly.
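
Before diving in, here is a quick sketch of loading and inspecting the training file, assuming train.csv has been downloaded from the competition page (column names such as experiment, event, and the eeg_ prefix follow the competition's data description).

```python
# Load and take a first look at the competition training data.
import pandas as pd

train = pd.read_csv("train.csv")

print(train.shape)                       # rows x columns
print(train["experiment"].unique())      # CA, DA, SS
print(train["event"].unique())           # A (baseline), B (SS), C (CA), D (DA)
print(train.filter(like="eeg").columns)  # the 20 EEG channels
```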

3. Connecting To Machine Learning Problem

We want to predict the probability of the pilot being in each one of the 4 events, so that we can alert the pilot to take the necessary preventive steps. Since there are 4 events that we want to predict correctly, this is a multi-class classification problem.

4. Performance Metric and Why This Only?

refer: Multi-Class Loss

Here we are predicting the state of mind of the pilots, i.e. the probability that the pilot is falling into each of the states. While predicting, our model should not falsely say that the pilot is in the normal state while the pilot is actually in some dangerous state. We need to take care of these misclassifications and penalize even small errors in the predicted probabilities. For this situation, multi-class log-loss makes sense as a metric.
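
As a toy illustration (not on the competition data), here is how log-loss punishes a confidently wrong prediction much more heavily than a slightly uncertain but correct one; the probabilities below are made up.

```python
# Toy multi-class log-loss example with four pilot states A, B, C, D.
import numpy as np
from sklearn.metrics import log_loss

y_true = ["A", "B", "C", "D"]
good = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.7, 0.1, 0.1],
                 [0.1, 0.1, 0.7, 0.1],
                 [0.1, 0.1, 0.1, 0.7]])
bad = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.7, 0.1, 0.1, 0.1],   # dangerous state B predicted as normal A
                [0.1, 0.1, 0.7, 0.1],
                [0.1, 0.1, 0.1, 0.7]])

print(log_loss(y_true, good, labels=["A", "B", "C", "D"]))  # ~0.36
print(log_loss(y_true, bad, labels=["A", "B", "C", "D"]))   # ~0.84, much worse
```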

5. Exploratory Data Analysis

  1. Let’s Look at the data type of each Features
Looking for the datatype of each Feature
  • This says that most of the features are numeric, except seat, event, and experiment; since machine learning models mostly understand only numbers, it is good to encode these while passing them through the models.

2. Let’s Look for any Null values present in the dataset

Let’s Check whether any null values present in the dataset
  • From the above output, we can say that there are no missing values, so we can exclude mean/median/mode imputation methods here.
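
Both checks can be reproduced with a couple of pandas calls; a minimal sketch, assuming train.csv from the competition:

```python
import pandas as pd

train = pd.read_csv("train.csv")

print(train.dtypes)                  # event and experiment are strings, the rest numeric
print(train.isnull().sum().sum())    # 0 -> no missing values, no imputation needed
```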

3. Let’s Look at the Distribution of class labels

Distributions of Class Labels
Distribution of Class Labels

From the above plot we can clearly see that event A (baseline) accounts for 58.5% of the total events and occurs most of the time. This shows that most of the time the pilot is in control. The second most frequent event is CA, occurring 34% of the time, followed by DA at 4.8%, and SS is the rarest event, which is justified because it occurs only when something sudden or surprising happens. This is clearly imbalanced data, but the imbalance is justified here.
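
The proportions quoted above come straight from the label column; a quick sketch, assuming the train DataFrame from the earlier snippet:

```python
# Relative frequency of each event label (A = baseline, B = SS, C = CA, D = DA).
print(train["event"].value_counts(normalize=True).round(3))
```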

4. Let’s Compare Experiment and Seat Features

Seat and Experiment Comparison

From the above plots we can clearly see that in the SS experiment the SS state occurred only a small number of times, so SS is rare among the events overall. When the CA experiment is performed, the CA state occurs most of the time, and the DA state occurs fewer times, though more often than the SS state.

Distribution of Seat Between Events

Seat Distribution between Events

From the above plot it is clearly visible that the position of the seat has no effect on the state of the pilot. The distribution is present only at 0 and 1, with nothing in between. This feature probably has nothing to do with the outcome of the experiment, so we can remove it at feature-selection time.

5. Exploring Galvanic Skin Response (refer: GSR)

Analysis of GSR feature
GSR percentiles

The above plot depicts the PDF and CDF of the galvanic skin response. The range of GSR is between 0 and 1999.85 microvolts. We can see that nearly 50% of the data falls below roughly 770 microvolts. The distribution is not normal; it is somewhat skewed to the right.

Let’s see how GSR is behaving with Pilots State

Bivariate Analysis of GSR

There is a lot of overlap between the events and the distributions are almost the same here, which implies that we cannot simply place a threshold on GSR to separate them. So we can say that the GSR feature alone adds little value for classifying the events.

6. Exploring Respiration (refer: respiration)

Univariate analysis of Respiration
Respiration percentiles

All values of respiration lie between 482.05 and 840.18. Nearly 50% of the data lies below 743.44, and we can see the same thing here: the data is skewed to the right.

Let’s see how Respiration is behaving with Pilots State

Bivariate Analysis of Respiration

There is heavy overlap between the class labels, making them difficult to separate. The same thing happens with the respiration data: we cannot simply place a threshold on respiration to separate the class labels.

7. Exploring ECG

Analysis of ECG
Bivariate analysis of ECG

Here we can see that there is some difference between the events in the ECG distribution: DA has the highest values and CA the lowest. There seem to be some outliers present here, but we have to check whether they are really outliers or just rare cases.

8. Exploring Electroencephalography (refer: wikipedia)

Note: Electroencephalography is an electrophysiological monitoring method to record electrical activity of the brain. It is typically noninvasive, with the electrodes placed along the scalp, although invasive electrodes are sometimes used, as in electrocorticography, sometimes called intracranial EEG.

Take a Look at all EEG Features

PDF and CDF of EEG Features
Boxplots of EEG Features

We can see that most of the EEG features are centered around a mean of zero and look approximately normally distributed. Most of the feature values fluctuate within roughly ±500. We can also see that most of the EEG features overlap a lot, indicating that we cannot just put a threshold on them to separate the class labels. We can come up with some interesting features, like EEG potential differences and power features, in feature engineering.

9. Now let's examine how the ECG, respiration, and GSR features behave with time.

Distribution of Time with Events
Distribution of features across experiments and events

We can see that time ranges between 0 and nearly 350 seconds. Event B (SS) has both higher and lower time ranges compared with A and C. However, we cannot use time as an important feature, because it is independent of the experiment and the time in the test set is independent of the time in the train set.

Here, in the case of ECG, for instance, there should be roughly 20 beats in 15 seconds, and in respiration there should be at least 3-4 breaths in a nice sinusoidal form. We notice that all these features are highly noisy, and we should remove the noise so that our models perform better. We can use the Biosppy package to remove noise from these features.
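
As a rough sketch of what that looks like: Biosppy's ecg and resp modules return a band-pass filtered signal plus derived quantities such as heart rate. The 256 Hz sampling rate and the ecg / r column names are assumptions here.

```python
# Hedged sketch: clean the ECG and respiration signals of one pilot with Biosppy.
from biosppy.signals import ecg, resp

pilot = train[(train["crew"] == 1) & (train["seat"] == 0)].sort_values("time")

ecg_out = ecg.ecg(signal=pilot["ecg"].values, sampling_rate=256.0, show=False)
filtered_ecg = ecg_out["filtered"]      # band-pass filtered ECG
heart_rate = ecg_out["heart_rate"]      # estimated beats per minute over time

resp_out = resp.resp(signal=pilot["r"].values, sampling_rate=256.0, show=False)
filtered_resp = resp_out["filtered"]    # filtered respiration signal
```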

10. Scatter Plots Between Bio Features

Bivariate Analysis between Bio Features

We can see that all these features are weakly correlated with each other, and from the scatter plots alone it is hard to tell exactly how strongly they correlate.

6. Feature Engineering

(refer: https://www.kaggle.com/stuartbman/introduction-to-physiological-data)

1. Removing Noise in the Features:

Biological sensors are quite susceptible to noise from outside sources. This can include lights (flickering at 50/60Hz depending on your AC frequency), and other electrical equipment. Hopefully this would be consistent between recordings, but it does make analysis more challenging, since removing any noise will usually remove a bit of signal too.

Before and After Filtering ECG
Before and After Filtering of Respiration

After filtering, we can see that the noise in the ECG and respiration features is removed by smoothing them with the butter and filtfilt functions in the SciPy package. Now it makes complete sense that in 10 seconds we see roughly the number of heart beats we would expect. We should filter our data to get much more useful insights from it.
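
A sketch of that filtering step is below; the sampling rate and cutoff frequencies are assumptions rather than the exact values used in the original notebook.

```python
# Zero-phase Butterworth low-pass filtering with SciPy (butter + filtfilt).
from scipy.signal import butter, filtfilt

def lowpass(signal, cutoff_hz, fs=256.0, order=4):
    b, a = butter(order, cutoff_hz / (0.5 * fs), btype="low")
    return filtfilt(b, a, signal)

pilot = train[(train["crew"] == 1) & (train["seat"] == 0)].sort_values("time")
clean_ecg = lowpass(pilot["ecg"].values, cutoff_hz=40.0)   # keep the cardiac band
clean_resp = lowpass(pilot["r"].values, cutoff_hz=1.0)     # breathing is very slow
```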

2. Adding Potential Difference Features

T-sne plot with all potential Differences

T-sne plot
  • Here we can see that these EEG potential-difference features do help somewhat in distinguishing the class events, and there is visible grouping for events ‘A’ and ‘C’.
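
A sketch of how such potential-difference features can be constructed: subtract pairs of EEG channels. The electrode pairs below are only illustrative, and the eeg_-prefixed column names follow the competition's naming.

```python
# Illustrative EEG potential-difference features (differences between electrode pairs).
pairs = [("eeg_fp1", "eeg_f7"), ("eeg_fp2", "eeg_f8"),
         ("eeg_f7", "eeg_t3"), ("eeg_f8", "eeg_t4"),
         ("eeg_o1", "eeg_p3"), ("eeg_o2", "eeg_p4")]

for a, b in pairs:
    train[f"pd_{a}_{b}"] = train[a] - train[b]
```

A sample of the resulting columns can then be projected with sklearn.manifold.TSNE to produce a plot like the one above.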

3. Power Features

Note: Electroencephalography (EEG) power features represent the amount of activity in certain frequency bands of the signal, while coherence between different electrodes reflects the degree to which connections are present across brain regions.

Description:

Delta (<4Hz) : Slow wave sleep, continuous attention tasks
Theta (4-7Hz) : Drowsiness, repression of elicited responses
Alpha (8-15Hz) : Relaxed, eyes closed
Beta (16-31Hz) : Active thinking, focus, alert
Gamma (>32Hz) : Short term memory, cross sensory perception
PowerFeatures
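
One common way to compute such band powers is Welch's power spectral density estimate; the sketch below uses the band edges listed above, while the 256 Hz sampling rate and the windowing details are assumptions.

```python
# Band-power features via Welch's power spectral density estimate.
from scipy.signal import welch

BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 16),
         "beta": (16, 32), "gamma": (32, 45)}

def band_powers(signal, fs=256.0):
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), 512))
    df = freqs[1] - freqs[0]                 # frequency resolution
    return {name: psd[(freqs >= lo) & (freqs < hi)].sum() * df
            for name, (lo, hi) in BANDS.items()}

pilot = train[(train["crew"] == 1) & (train["seat"] == 0)].sort_values("time")
print(band_powers(pilot["eeg_fz"].values))   # one power value per frequency band
```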

4. Using Autoencoders for Feature Extractions

Auto Encoders on top of Bio Features
  • Here, we train our models with both the actual features and the encoded features, record all the performances, and check whether the encoded features help achieve even better performance than the derived features.
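
A minimal Keras autoencoder sketch, assuming X_train is the scaled feature matrix from the previous steps; the layer sizes are illustrative rather than the exact architecture used.

```python
# Hypothetical autoencoder: compress the features into a small bottleneck,
# then use the encoder's output as extra features for the classifiers.
from tensorflow import keras
from tensorflow.keras import layers

n_features = X_train.shape[1]             # X_train: scaled feature matrix (assumed)

inputs = keras.Input(shape=(n_features,))
x = layers.Dense(32, activation="relu")(inputs)
bottleneck = layers.Dense(8, activation="relu")(x)
x = layers.Dense(32, activation="relu")(bottleneck)
outputs = layers.Dense(n_features, activation="linear")(x)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, bottleneck)

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=10, batch_size=1024, validation_split=0.1)

encoded_features = encoder.predict(X_train)   # appended to the original features
```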

5. Applying MinMax() for All Features

MinMax() for All Features
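
The scaling step itself is a one-liner with scikit-learn; it is fitted on the training features only so that no test-set statistics leak in (X_train and X_test are the assumed feature matrices).

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on train only
X_test_scaled = scaler.transform(X_test)         # reuse the train min/max
```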

7. Feature Selection

Feature Selection using Decision Based Models
  • From the above, we get to know that the encoded features have higher importance and certainly help in classifying our events, and that bio features like GSR and respiration (r) also have high importance in predicting the class events.
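
A sketch of this importance-based selection, using a random forest as the decision-based model; X_train_scaled, y_train, and feature_names are assumed from the previous steps.

```python
# Rank features by tree-based importance and keep the strongest ones.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train_scaled, y_train)

importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(20))   # top candidates to keep
```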

8. Modelling

Random Model

Evaluation of the Random Model
Confusion Matrix of the Random Model
Precision Matrix of the Random Model
Recall Matrix of the Random Model

Here we can see that the random model gives a log-loss of 1.6449. Keeping this in mind, we should select models whose loss is a lot lower than this, and they should have high recall rates for the class events.
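
Such a random baseline can be sketched by giving every row random class probabilities and scoring them with log-loss on a held-out split (y_val below is an assumed validation label vector); the exact value depends on the random draw, but it lands in the same ballpark as the 1.6449 quoted above.

```python
# Random-probability baseline: the log-loss any useful model must beat.
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(42)
probs = rng.random((len(y_val), 4))
probs /= probs.sum(axis=1, keepdims=True)       # make each row a valid distribution

print(log_loss(y_val, probs, labels=["A", "B", "C", "D"]))
```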

Logistic Regression (refer: blog)

Hyperparameter Tuning Logistic Regression
Kaggle Score for Logistic Regression

Here, we can conclude that logistic regression performs better than the random model: we got a log-loss of nearly 1.3. Let's try other models and check their performance as well.
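
A sketch of this baseline with a simple grid search over the regularization strength C; the grid, the cross-validation settings, and the X_val_scaled hold-out split are illustrative assumptions.

```python
# Logistic regression tuned with grid search, scored directly on log-loss.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

lr = LogisticRegression(max_iter=1000)
search = GridSearchCV(lr, {"C": [0.001, 0.01, 0.1, 1, 10]},
                      scoring="neg_log_loss", cv=3, n_jobs=-1)
search.fit(X_train_scaled, y_train)

print(search.best_params_, -search.best_score_)   # best C and its CV log-loss
val_probs = search.predict_proba(X_val_scaled)
```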

Stacking Of Models (refer: blog)

Kaggle Score for Stacking

Here, the stacking model outperforms the random model and the logistic regression model: we got a better log-loss of nearly 0.57. Finally, let's try a LightGBM model; since the dimensionality of the features is low, tree-based models work well here.
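
One way to build such a stack is scikit-learn's StackingClassifier; the base learners and the meta learner below are illustrative choices, not necessarily the ones used in the original notebook.

```python
# Stacking sketch: base models' out-of-fold probabilities feed a logistic meta learner.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)),
        ("et", ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=3,
)
stack.fit(X_train_scaled, y_train)
val_probs = stack.predict_proba(X_val_scaled)
```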

LightGBM Model (refer: Blog)

Kaggle Score for LightGBM Model

Now we can clearly see that the score is 0.53, which is better than the other models (logistic regression and stacking). Let's make this our final model and do the deployment.
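
A sketch of the final LightGBM classifier; the hyperparameters below are reasonable defaults rather than the tuned values from the original notebook.

```python
# Final model sketch: LightGBM multi-class classifier producing per-event probabilities.
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(n_estimators=500, learning_rate=0.05,
                      num_leaves=63, random_state=42)
lgbm.fit(X_train_scaled, y_train)

test_probs = lgbm.predict_proba(X_test_scaled)   # columns follow lgbm.classes_ (A, B, C, D)
```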

9. Deployment

Please Check this Sample Demo Video: https://youtu.be/XHk6RP8Ad3E

Index Landing Page
Predictions Page

Here, we have used the simple Flask framework and basic HTML/CSS to deploy our model and make predictions for our test data points. With this, we have covered the project's complete end-to-end workflow.
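
A minimal Flask sketch of such an app; the model file name, form fields, and template names are all hypothetical and only illustrate the overall shape of the deployment.

```python
# Hypothetical Flask app: load the saved model and serve predictions from a form.
import joblib
import pandas as pd
from flask import Flask, render_template, request

app = Flask(__name__)
model = joblib.load("final_model.pkl")            # hypothetical saved LightGBM model

@app.route("/")
def index():
    return render_template("index.html")          # the landing page shown above

@app.route("/predict", methods=["POST"])
def predict():
    # Turn the submitted form values into a one-row feature frame.
    row = pd.DataFrame([{k: float(v) for k, v in request.form.items()}])
    probs = model.predict_proba(row)[0]
    events = ["A (baseline)", "B (SS)", "C (CA)", "D (DA)"]
    return render_template("predict.html",
                           predictions=dict(zip(events, probs.round(4))))

if __name__ == "__main__":
    app.run(debug=True)
```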

10. Further Improvements

  1. Doing more hyperparameter tuning with different values may help if we have more computing resources.
  2. We can try deep learning techniques to get better predictions.
  3. Since we have time-based features, models such as LSTM, bidirectional GRU, and CNN-LSTM may help make better predictions.

If you have time, please check my GitHub, and contact me through LinkedIn if you have any further doubts. By reading this, you will surely understand how an end-to-end machine learning project workflow goes. This is my first blog; if you feel you learnt something, please give it a clap.

11. References

  1. Introduction to physiological data: https://www.kaggle.com/stuartbman/introduction-to-physiological-data
  2. Biosppy: https://biosppy.readthedocs.io/en/stable/biosppy.html
  3. https://www.kaggle.com/shahaffind/reducing-commercial-aviation-fatalities-11th
  4. https://www.kaggle.com/c/reducing-commercial-aviation-fatalities
