Reducing Commercial Aviation Fatalities
An end-to-end Data Science, Machine Learning, Deep Learning project
Table of contents
- Business Problem
- A statistical approach to solving the problem
- Use of Machine Learning to solve the problem
- Dataset Overview
- Features Overview
- Performance metric
- Exploratory data analysis
- First cut approach
- Comparison of models
- Kaggle result
- Future work
Business Problem
Reducing Commercial Aviation Fatalities is a Kaggle competition held in 2019. Aviation accidents kill passengers and aircrew and often destroy the aircraft. Most flight-related fatalities stem from a loss of “airplane state awareness”: ineffective attention management by pilots who may be distracted, sleepy, or in other dangerous cognitive states.
A statistical approach to solving the problem
Treated as a classification problem, this can be approached with a statistical model such as Linear Discriminant Analysis (LDA).
- LDA assumes that each feature is normally distributed within each class and that all classes share the same variance for a given feature.
- With these assumptions, the LDA model estimates the mean of each feature for each class, and the shared variance, from the training data.
- The mean (mu) of a feature (x) for a class is the sum of that feature's values over the samples in the class, divided by the number of samples in the class.
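The estimates described above can be sketched in a few lines. This is a minimal illustration on made-up numbers, not the competition data; the helper name `lda_class_stats` and the toy arrays are assumptions introduced here.

```python
import numpy as np

def lda_class_stats(X, y):
    """Return per-class feature means and the pooled (shared) feature variances."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_samples, n_classes = len(y), len(classes)

    # Mean of each feature within each class: sum over the class's samples / class size.
    means = {c: X[y == c].mean(axis=0) for c in classes}

    # Pooled variance: squared deviations from each sample's own class mean,
    # summed over all classes and divided by (n_samples - n_classes).
    sq_dev = sum(((X[y == c] - means[c]) ** 2).sum(axis=0) for c in classes)
    pooled_var = sq_dev / (n_samples - n_classes)
    return means, pooled_var

# Toy data (hypothetical): one feature, two classes.
X = [[1.0], [3.0], [6.0], [10.0]]
y = ["A", "A", "C", "C"]
means, pooled_var = lda_class_stats(X, y)
print(means["A"], means["C"], pooled_var)
```

With these statistics, LDA scores a new point against each class mean under the shared variance and picks the most likely class.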
Use of Machine Learning to solve the problem
To solve this problem, data was acquired from actual pilots in test situations. The task is to build ML models that can run in real time, monitor the physiological observations, and predict the pilot's cognitive state. Pilots could then be alerted when they enter a troubling state, preventing accidents and saving lives.
Dataset Overview
The dataset consists of real physiological data from 18 pilots who were subjected to various distracting events. Kaggle provides 3 files, all in CSV format: train, test, and sample submission.
The training set consists of a set of controlled experiments collected in a non-flight environment, outside of a flight simulator.
The test set (abbreviated LOFT = Line Oriented Flight Training) consists of a full flight (take off, flight, and landing) in a flight simulator.
The sample submission file shows the format of the final file to be uploaded.
Pilots experienced distractions intended to induce one of the following 3 cognitive states:
Channelized attention: an engaged state of mind in which the pilot focuses on a single task to the exclusion of all others.
Diverted attention: a form of multitasking in which the pilot handles several things at once and cannot focus properly, which impairs decision making.
Startle/Surprise: a state triggered by a sudden, unexpected, and possibly dangerous event during flight, after which the pilot's attention shifts to handling that event.
Features Overview
- ID — (test set and sample submission file only) A unique identifier for a crew + time combination.
- crew — a unique id for a pair of pilots. There are 9 crews in the data.
- experiment — One of CA, DA, SS, or LOFT (only present in the test set).
- time — seconds into the experiment
- seat — is the pilot in the left (0) or right (1) seat
- EEG — Electroencephalography(EEG) is an electrophysiological monitoring method to record the electrical activity of the brain.
- ECG — An electrocardiogram is a test that measures the heart’s electrical activity. It’s also known as an ECG or EKG. Every heartbeat is triggered by an electrical signal that starts at the top of the heart and travels to the bottom. The sensor had a resolution/bit of .012215 µV and a range of -100mV to +100mV. The data are provided in microvolts.
- respiration(r) — Respiration, a measure of the rise and fall of the chest. The sensor had a resolution/bit of .2384186 µV and a range of -2.0V to +2.0V. The data are provided in microvolts.
- GSR — Galvanic Skin Response, a measure of electrodermal activity. The sensor had a resolution/bit of .2384186 µV and a range of -2.0V to +2.0V. The data are provided in microvolts.
- event — The state of the pilot at the given time. This is the class label.
Performance metric
This Kaggle problem is a 4-class classification problem, evaluated using Multi-Class Log Loss between the predicted probabilities and the observed target.
The 4 class labels to be predicted are:
- A = baseline or no event
- B = Startle/Surprised state
- C = Channelized attention
- D = Diverted attention
Why Multi-Class Log Loss as the evaluation metric?
- We are solving a classification problem in which the Machine Learning model has to predict a probability for each of the 4 classes.
- From a business perspective, misclassifications cannot be tolerated. Multi-Class Log Loss works on the predicted probabilities, so it penalizes even small deviations from the actual output and penalizes confident wrong predictions most heavily.
- When data is highly imbalanced, the ROC curve is not a good choice.
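To make the metric concrete, here is a minimal sketch of Multi-Class Log Loss: the negative mean of the log of the probability assigned to the true class. The function name, the clipping constant `eps`, and the toy predictions are assumptions for illustration; clipping and row renormalization are common safeguards against log(0), not something stated in this article.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, labels, eps=1e-15):
    """Negative mean log-probability assigned to the true class of each sample."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)  # make each row sum to 1
    column = {label: j for j, label in enumerate(labels)}
    idx = np.array([column[label] for label in y_true])
    picked = probs[np.arange(len(idx)), idx]  # probability given to the true class
    return float(-np.mean(np.log(picked)))

labels = ["A", "B", "C", "D"]
y_true = ["A", "C"]

# Hypothetical predictions: a uniform guess, and a model that always bets on 'A'.
uniform = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
always_a = [[0.97, 0.01, 0.01, 0.01], [0.97, 0.01, 0.01, 0.01]]

loss_uniform = multiclass_log_loss(y_true, uniform, labels)
loss_always_a = multiclass_log_loss(y_true, always_a, labels)
print(loss_uniform, loss_always_a)
```

The uniform guess scores ln(4), roughly 1.386. The model that always bets on the majority class scores worse here because its confident miss on the 'C' sample is penalized heavily.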
Exploratory Data Analysis
Distribution of class labels
The training dataset is clearly highly imbalanced. Event ‘A’ is by far the most frequent, which means that most of the time pilots are not distracted and no troubling event is occurring. Event ‘C’ is the most common distracted state, followed by event ‘D’. Event ‘B’ is the rarest, which means pilots seldom enter the startle/surprise state.
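One way to quantify this kind of imbalance is to count the labels and derive inverse-frequency weights, n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for `class_weight='balanced'`. The sketch below uses a made-up label list that only mirrors the ordering described above (A > C > D > B); it does not use the real competition counts, and the helper name is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count of each class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}

# Made-up labels for illustration only, NOT the real class counts.
events = ["A"] * 6 + ["C"] * 3 + ["D"] * 2 + ["B"] * 1

counts = Counter(events)
weights = balanced_class_weights(events)
print(counts.most_common())
print(weights)
```

Rare classes receive larger weights, so errors on them cost more during training.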
Physiological features analysis
- The raw ECG signal is noisy and hard to read, so it needs to be filtered.
- In the per-event ECG plots, there are potential outliers for events A, B, and C. Event D has no outliers.
- These outliers are assumed to be noise, which needs careful handling because ECG is an important feature.
- No two events share the same distribution or quartiles, although the medians for events A and B are quite close.
- The distributions clearly overlap, and none of the events is normally distributed.
- The data points show no symmetric shape, which also indicates the ECG feature is not normally distributed, so the values need to be normalized during feature engineering.
- The GSR data is clearly very noisy and hard to read. GSR data does not normally look like this.
- There are no outliers in the GSR feature, but the violin plot shows fewer values for events A, B, and C in the range 100 to 300, and fewer values for event D in the range 125 to 250.
- No two events have similar distributions and none is normally distributed, so it is a good idea to normalize the GSR values during feature engineering.
- The data is clearly not symmetric, so it is not normally distributed.
- There are no outliers according to the box plot.
- According to the violin plot, there seem to be no values in the range 100 to 350.
- In general, respiration rate increases during a stressful event. Respiration is also closely related to heart rate, which in this case affects the ECG. Respiration is measured as the rise and fall of the chest and abdomen area. In the dataset, a higher respiration rate suggests that the current event could be stressful.
- The plot shows that the respiration data is very noisy and hard to read. Respiration data normally looks quite similar to a sine wave.
- There are no outliers for any event. Events C and D have quite similar distributions, with nearly identical medians and interquartile ranges.
- The data is not symmetric, which means this feature is not normally distributed, and the box plot shows no potential outliers. Normal respiration rates for an adult at rest range from 12 to 16 breaths per minute, but the figures do not provide clear insight on this.
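The ECG, GSR, and respiration notes above all describe noisy raw signals. As a rough illustration of what smoothing does, the sketch below applies a centered moving average, one of the simplest low-pass smoothers, to a synthetic sine wave with added noise. It is a stand-in, not the filter used in this project; a band-pass filter would be the more usual choice for ECG. The 256 Hz sampling rate, the 0.25 Hz "breathing" frequency (15 breaths per minute), and the window length are all assumptions.

```python
import numpy as np

def moving_average(signal, window):
    """Smooth a 1-D signal with a centered moving average of `window` samples."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

rng = np.random.default_rng(0)
fs = 256                                    # assumed sampling rate in Hz
t = np.arange(0, 10, 1 / fs)                # 10 seconds of samples
clean = np.sin(2 * np.pi * 0.25 * t)        # idealized respiration-like sine wave
noisy = clean + rng.normal(scale=0.5, size=t.size)

smoothed = moving_average(noisy, window=fs // 4)  # quarter-second window

err_before = float(np.mean((noisy - clean) ** 2))
err_after = float(np.mean((smoothed - clean) ** 2))
print(err_before, err_after)
```

On this synthetic example the smoothed signal sits much closer to the underlying sine wave than the noisy one. Real recordings need a filter chosen for the frequency band of each sensor.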
Correlation matrix of physiological features
- The correlation map shows that all EEG features are highly correlated with each other, while ECG, GSR, and respiration are not. In general, though, all physiological features depend on each other to some extent.
- For example, if a threatening or stressful event occurs during flight, the pilot's respiration rate is likely to change, and the heart rate with it, so the ECG data will show higher peaks. The rhythm of the brain waves recorded in the EEG data is also likely to change as the pilot's state of mind shifts from alert to a mix of alertness and fear.
- All of this also affects dermal activity and the sweat glands: in a moment of fear the body secretes more sweat, which reflects the change in emotional state.
First Cut Approach
- The training dataset has a large number of data points and relatively few features.
- Neither the train nor the test set has null values.
- To handle the class imbalance, I will experiment with both SMOTE and a cost-sensitive method, and choose between them based on the final model results.
- SMOTE generates synthetic examples for the minority classes.
- The cost-sensitive method compensates for the imbalance by weighting classes through the parameter class_weight='balanced' in tree-based algorithms.
- Convert data types to reduce space and memory usage. The columns use 3 data types in total: int64, float64, and object.
Convert int64 to int32
Convert float64 to float32
Convert an object type to categorical
- Given the variation in the training dataset, column standardization needs to be performed.
- New EEG features are derived from the existing 20 EEG features by subtracting channels according to how the electrodes are connected, following general practice in the medical field. Of the 6 montages considered, two suit this problem best: “Longitudinal-traverse Bipolar” and “Circumferential Bipolar”.
- Remove noise from the physiological data.
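Three of the steps listed above (type conversion, a bipolar EEG derivation, and column standardization) can be sketched with pandas as follows. This is a minimal illustration on a toy frame, not the project's actual pipeline. The column names `eeg_fp1` and `eeg_f7`, the helper `downcast`, and all values are assumptions; Fp1-F7 is used only as one example pair from a longitudinal bipolar chain.

```python
import pandas as pd

def downcast(df):
    """Convert integer columns to int32, float columns to float32, text columns to category."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            out[col] = s.astype("int32")
        elif pd.api.types.is_float_dtype(s):
            out[col] = s.astype("float32")
        elif pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s):
            out[col] = s.astype("category")
    return out

# Toy frame with hypothetical values, NOT real competition rows.
raw = pd.DataFrame({
    "crew": [1, 1, 2, 2],
    "experiment": ["CA", "CA", "DA", "SS"],
    "eeg_fp1": [10.0, 12.0, 9.0, 11.0],
    "eeg_f7": [4.0, 5.0, 6.0, 3.0],
    "ecg": [-4500.0, -4520.0, 6700.0, 6600.0],
})

df = downcast(raw)

# Bipolar derivation: the difference between two connected electrodes.
df["fp1_f7"] = df["eeg_fp1"] - df["eeg_f7"]

# Column standardization: zero mean and unit variance per column.
numeric = ["eeg_fp1", "eeg_f7", "ecg", "fp1_f7"]
scaled = (df[numeric] - df[numeric].mean()) / df[numeric].std(ddof=0)

print(df.dtypes)
print(scaled.round(3))
```

On real data the standardization statistics should be computed on the training set only and then reused for the test set.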
Model Selection approach
- The choice of model should be based on computational complexity and the nature of the dataset.
- Given the imbalanced nature of the dataset, tree-based ensemble algorithms such as Random Forest and LightGBM are a good fit.
- Tune hyperparameters with randomized search cross-validation (RandomizedSearchCV). I am not using GridSearchCV because it would take longer to find tuned hyperparameters.
- Train the LightGBM classifier with the best hyperparameters found.
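The tuning step above can be sketched with scikit-learn's `RandomizedSearchCV`. To keep the example dependent on scikit-learn alone, it uses `RandomForestClassifier` as a stand-in estimator; LightGBM's `LGBMClassifier` follows the same estimator interface and could be passed instead. The synthetic binary data, the parameter ranges, and `n_iter=5` are assumptions for illustration, not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced toy data (hypothetical): roughly one positive in five.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(int)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions={
        "n_estimators": [20, 50, 100],
        "max_depth": [3, 5, 8, None],
        "min_samples_leaf": [1, 5, 10],
    },
    n_iter=5,                 # sample 5 of the 36 combinations instead of trying them all
    scoring="neg_log_loss",   # matches the log-loss competition metric
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_log_loss = -search.best_score_
print(best_params, best_log_loss)
```

Randomized search evaluates only a fixed number of sampled combinations, which is why it finishes faster than an exhaustive grid over the same ranges.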
Comparison of Models
I trained 5 models, of which LightGBM stands out as the best.
Kaggle Result and leaderboard rank
- Kaggle screenshot of the LightGBM classifier
- After manually checking the leaderboard (the competition is closed), my score would place me 43rd.
Future work
- Derive features from Respiration and ECG and use the derived features instead of the raw signals.
- Of the two suitable montage types, the Longitudinal-Bipolar connection was used; the Circumferential method can be tried next.
- Introduction to physiological data
Connect with me
Github repo link:
Deployed project testing video on Youtube: