Reducing Commercial Aviation Fatalities

Published in The Startup, 12 min read, Sep 26, 2020

What is the problem we are trying to solve?

Reducing commercial aviation fatalities is one of the complex problems that business, government, and military leaders have been trying to solve for many years. Most flight-related fatalities stem from a loss of “airplane state awareness”, that is, ineffective attention management on the part of pilots who may be distracted, sleepy, or in other dangerous cognitive states. This work focuses on predicting these cognitive states to help pilots manage flights effectively.

At present, growing congestion at airports is putting the relative stability of the industry under stress. Trends suggest that the number of accidents rises with the number of aircraft flying, that is, with the overall level of activity.

Further information about the problem:

Kaggle Competition

Research Paper

About the organizer of this challenge

Booz Allen Hamilton Holding Corporation (informally Booz Allen) is an American management and information technology consulting firm that provides consulting, analysis, and engineering services to public and private sector organizations and nonprofits.

Introduction

In this challenge, we are trying to build a model to detect troubling events from aircrew’s physiological data.

We will try to predict whether the pilots are experiencing one of the following three cognitive states, each induced by a deliberate distraction:

Channelized Attention (CA) is, roughly speaking, the state of being focused on one task to the exclusion of all others. This is induced in benchmarking by having the subjects play an engaging puzzle-based video game.
Diverted Attention (DA) is the state of having one’s attention diverted by actions or thought processes associated with a decision. This is induced by having the subjects perform a display monitoring task. Periodically, a math problem showed up which had to be solved before returning to the monitoring task.
Startle/Surprise (SS) is induced by having the subjects watch movie clips with jump scares.

“It is a multi-class classification problem.”

The goal of this competition is to predict the probability of each state for each time in the test set.

About the Dataset

In this dataset, we are provided with real physiological data from eighteen pilots who were subjected to various distracting events. The benchmark training set is comprised of a set of controlled experiments collected in a non-flight environment, outside of a flight simulator. The test set (abbreviated LOFT = Line Oriented Flight Training) consists of a full flight (take off, flight, and landing) in a flight simulator.

For each experiment, a pair of pilots (each with its own crew id) was recorded over time and subjected to the CA, DA, or SS cognitive states. The training set contains three experiments (one for each state) in which the pilots experienced just one of the states. For example, in the experiment = CA, the pilots were either in a baseline state (no event) or the CA state. The test set contains a full flight simulation during which the pilots could experience any of the states (but never more than one at a time).

The data source contains 3 files:

train.csv: Contains all the training data for the model.
test.csv: Contains the data on which the model will predict the output.
sample_submission.csv: It provides a format in which the final CSV will be submitted.

Features in the dataset

id - (test.csv and sample_submission.csv only) A unique identifier for a crew + time combination.
crew - unique id for a pair of pilots. There are 9 crews in the data.
experiment - One of CA, DA, SS or LOFT. The first three make up the training set; the last is the test set.
time - seconds into the experiment
seat - is the pilot in the left (0) or right (1) seat
eeg - An electroencephalogram (EEG) is a test used to evaluate the electrical activity in the brain. Brain cells communicate with each other through electrical impulses. An EEG tracks and records brain wave patterns and can be used to help detect potential problems associated with this activity.
ecg - An electrocardiogram (ECG or EKG) records the electrical signal from the heart and is used to check for different heart conditions. Here it is a 3-point electrocardiogram signal. The sensor had a resolution/bit of 0.012215 µV and a range of -100 mV to +100 mV. The data are provided in microvolts.
r - Respiration, a measure of the rise and fall of the chest. The sensor had a resolution/bit of .2384186 µV and a range of -2.0V to +2.0V. The data are provided in microvolts.
gsr - Galvanic Skin Response, a measure of electrodermal activity. The sensor had a resolution/bit of .2384186 µV and a range of -2.0V to +2.0V. The data are provided in microvolts.
event - The state of the pilot at the given time: one of A = baseline, B = SS, C = CA, D = DA

Each sensor operated at a sample rate of 256 Hz.

Exploratory Data Analysis

We can see that the training data is imbalanced: event A occurs far more often than any other event. In experiment CA, the pilots spend more time in the distracted state than in the baseline state, whereas in the other two experiments in the plot they spend more time in the baseline state than in the distracted DA or SS state.
Experiment CA accounts for most of the distracted states of the pilots in the dataset.
So experiment CA is more likely than the other experiments to put the pilot in a distracted state.
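The imbalance described above can be quantified with a per-experiment breakdown of events. The following is a minimal sketch on hypothetical toy rows; only the column names experiment and event come from the dataset description, and the counts are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for train.csv (values are invented).
toy = pd.DataFrame({
    "experiment": ["CA", "CA", "CA", "CA", "DA", "DA", "DA", "DA", "SS", "SS", "SS", "SS"],
    "event":      ["C",  "C",  "C",  "A",  "A",  "A",  "A",  "D",  "A",  "A",  "A",  "B"],
})

# Share of each event within each experiment (every row sums to 1).
event_share = pd.crosstab(toy["experiment"], toy["event"], normalize="index")
print(event_share.round(2))

# Overall class balance across all experiments.
overall = toy["event"].value_counts(normalize=True)
print(overall.round(2))
```

On the real data, the same two calls would be applied to the training DataFrame instead of the toy one.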

We can see some distinct patterns emerging from the time distribution. They are studied further in the time series analysis of the data and may be useful for prediction.
Event B occurs at the low and high ends of the range in all the distributions, and event D has an alternating distribution.

The distribution shows regions where an event has very few or no ECG values. Much of the range overlaps between pairs of events, but the spread of values varies widely, so ECG could be a good feature.
The values in the extremely low and high ranges cannot be classified as outliers, as their density is quite high.
This feature could be useful because, for some ranges of values, it helps to narrow down the state.
For events A and B, the distribution is most concentrated between -10k µV and 20k µV. For event D it is a bit more evenly distributed, and for event B the concentration is higher from 10k µV down to -20k µV.

We can clearly see the presence of some outliers in event A: the lower part of its range reaches 450, while for the other events it is similar, at around 50.
For the rest, the data overlap is clearly visible, and even the density distributions are quite similar.

The overlapping of the range is a little less as compared to other features.
This could be a very useful feature to separate events in some range of values.
Apart from that, the distribution is quite similar.

A positive correlation means that the value of one feature tends to increase as the value of the other increases; the closer the coefficient is to 1, the stronger the relationship.
From the above plot, we can see a few values above 0.7, which we treat as highly correlated features. In such cases, we can remove either feature of the correlated pair. Here seat and respiration are very highly correlated.
Apart from that, (ECG, GSR) and (respiration, GSR) also have a positive correlation, but the value is low.
The EEG features are highly correlated in many cases, which is understandable because they are all measured together in one setup and simply capture different aspects of the electrical activity of the brain.
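The 0.7 rule of thumb above can be turned into a small helper that lists the feature pairs exceeding the threshold. This is only a sketch on synthetic data; the function name and the toy features a, b, and c are hypothetical, and taking the absolute value (so strong negative correlations are flagged too) is my assumption rather than something stated in the analysis.

```python
import numpy as np

def highly_correlated_pairs(X, names, threshold=0.7):
    """Return (name_i, name_j, r) for feature pairs with |Pearson r| > threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns of X are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Hypothetical synthetic data: 'b' is a noisy copy of 'a', 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)
c = rng.normal(size=1000)
X = np.column_stack([a, b, c])

pairs = highly_correlated_pairs(X, ["a", "b", "c"])
print(pairs)
```

From each returned pair, one of the two features is a candidate for removal.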

The calculation of correlation coefficients between paired data variables is a standard tool of analysis for every data analyst. Pearson’s correlation coefficient is a de facto standard in most fields, but its construction only works for interval variables (sometimes called continuous variables). Pearson is unsuitable for data sets with mixed variable types, e.g. where some variables are ordinal or categorical.
While many correlation coefficients exist, each with different features, we have not been able to find a correlation coefficient with Pearson-like characteristics and a sound statistical interpretation that works for interval, ordinal, and categorical variable types alike.
Phi_k correlation matrix gives us a better understanding and more accurate values for correlation because we have a combination of numerical and categorical features.
Here we can observe that there are many pairs of features with very high as well as very low correlation.

Due to memory and computing constraints, we can only visualize a part of the data.
From this, we can easily see the dominant baseline state.
Beyond that, we cannot say anything conclusive, as this is only a projection of a small part of the data.

Time-series data analysis of features

Studying data for a single pilot (crew 1 and seat 0)

Event

Respiration

respiration for experiment 0(CA) with time

respiration for experiment 1(DA) with time

respiration for experiment 2(SS) with time

ECG

GSR

A much more in-depth analysis with more plots and observations can be found in my notebook here.

Note: This time series analysis is for one pilot only.

Metric used

Multi-class log loss is used for performance evaluation of the model.
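To make the metric concrete, multi-class log loss is the mean negative log of the probability the model assigns to the true class. The following is a minimal sketch; the function name and the example predictions are hypothetical, and the clipping constant is a common convention, not something taken from the competition rules.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of true class indices
    y_prob: list of per-class probability lists (each row sums to 1)
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total += math.log(p)
    return -total / len(y_true)

# Hypothetical predictions over the four events A, B, C, D (indices 0..3).
y_true = [0, 2, 3]
y_prob = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.10, 0.60, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
loss = multiclass_log_loss(y_true, y_prob)
print(round(loss, 4))
```

A confident wrong prediction is penalized heavily, while a uniform guess over four classes costs ln(4) per sample, so lower values are better.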

Building the model

All the approaches mentioned below used these models: Decision Tree, Logistic regression, Random forest, LightGBM

Approach 1 - First cut approach:

Encode all the categorical features.
Since the ranges of the features vary widely, we have to normalize the data. I used MinMaxScaler for this so that all the values lie between 0 and 1.
Trained all the above-mentioned models without hyperparameter tuning and selected the best performing among them as the baseline model.
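The normalization step above rescales every column with x' = (x - min) / (max - min), which is the formula scikit-learn's MinMaxScaler applies with its default range of 0 to 1. Below is a minimal NumPy sketch of that formula; the helper names and the toy values are hypothetical, and learning the minimum and maximum from the training data only is my assumption about how the scaler was fitted.

```python
import numpy as np

def min_max_fit(X_train):
    """Learn the per-column minimum and range from the training data only."""
    col_min = X_train.min(axis=0)
    col_range = X_train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid division by zero
    return col_min, col_range

def min_max_transform(X, col_min, col_range):
    """Apply x' = (x - min) / (max - min) column by column."""
    return (X - col_min) / col_range

# Hypothetical values: two features on very different scales.
X_train = np.array([[0.0, -2000.0], [5.0, 0.0], [10.0, 2000.0]])
X_test = np.array([[2.5, 1000.0], [12.0, -3000.0]])

col_min, col_range = min_max_fit(X_train)
X_train_scaled = min_max_transform(X_train, col_min, col_range)
X_test_scaled = min_max_transform(X_test, col_min, col_range)
print(X_train_scaled)
print(X_test_scaled)
```

With parameters taken from the training set, test values outside the training range land outside 0 to 1, which is expected behaviour rather than a bug.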

Approach 2 - Model with feature engineering:

Encode all the categorical features.
Add a new feature called pilot:

# Combine seat and crew into a single id per individual pilot
df_train['pilot'] = 100 * df_train['seat'] + df_train['crew']
df_test['pilot'] = 100 * df_test['seat'] + df_test['crew']

Normalize data
Add new features with EEG data:

The data comes from a fairly typical arrangement of 20 electrodes across the scalp. The letter in each lead signifies the part of the brain that the lead is nearest to (Temporal, Frontal, Parietal, etc.), with odd numbers on the left and even numbers on the right. In the clinic, one usually looks not at the electrical potential at each electrode but at the potential difference between pairs of electrodes. This gives us an idea of the electrical field in the brain region between these two points as a way to infer what the brain is doing in that region. We can choose any two of the 20 electrodes, which gives 190 possible pairs (20 choose 2), but not all of those potential differences are going to be useful.

The layout of electrode pairs chosen for comparing potential differences is called a montage. There are lots of different montage systems; the electrodes themselves are commonly placed according to the 10–20 system.

# Potential differences between neighbouring electrodes
df_train['fp1_f7'] = df_train['eeg_fp1'] - df_train['eeg_f7']
df_train['f7_t3'] = df_train['eeg_f7'] - df_train['eeg_t3']
df_train['t3_t5'] = df_train['eeg_t3'] - df_train['eeg_t5']
df_train['t5_o1'] = df_train['eeg_t5'] - df_train['eeg_o1']
df_train['fp1_f3'] = df_train['eeg_fp1'] - df_train['eeg_f3']
df_train['f3_c3'] = df_train['eeg_f3'] - df_train['eeg_c3']
df_train['c3_p3'] = df_train['eeg_c3'] - df_train['eeg_p3']
df_train['p3_o1'] = df_train['eeg_p3'] - df_train['eeg_o1']
df_train['fz_cz'] = df_train['eeg_fz'] - df_train['eeg_cz']
df_train['cz_pz'] = df_train['eeg_cz'] - df_train['eeg_pz']
df_train['pz_poz'] = df_train['eeg_pz'] - df_train['eeg_poz']
df_train['fp2_f8'] = df_train['eeg_fp2'] - df_train['eeg_f8']
df_train['f8_t4'] = df_train['eeg_f8'] - df_train['eeg_t4']
df_train['t4_t6'] = df_train['eeg_t4'] - df_train['eeg_t6']
df_train['t6_o2'] = df_train['eeg_t6'] - df_train['eeg_o2']
df_train['fp2_f4'] = df_train['eeg_fp2'] - df_train['eeg_f4']
df_train['f4_c4'] = df_train['eeg_f4'] - df_train['eeg_c4']
df_train['c4_p4'] = df_train['eeg_c4'] - df_train['eeg_p4']
df_train['p4_o2'] = df_train['eeg_p4'] - df_train['eeg_o2']

Train all the models with this data and compare results.
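The EEG difference features in this approach follow five chains of neighbouring electrodes, so they can also be generated from a list of chains instead of being written out one by one. The sketch below is one possible way to do that; the chain layout is inferred from the feature names, each feature a_b is computed as eeg_a minus eeg_b, and the helper names and toy readings are hypothetical.

```python
from math import comb

# Electrode chains inferred from the feature names used in this approach.
CHAINS = [
    ["fp1", "f7", "t3", "t5", "o1"],
    ["fp1", "f3", "c3", "p3", "o1"],
    ["fz", "cz", "pz", "poz"],
    ["fp2", "f8", "t4", "t6", "o2"],
    ["fp2", "f4", "c4", "p4", "o2"],
]

def bipolar_pairs(chains):
    """Adjacent electrode pairs along each chain, in order."""
    return [(a, b) for chain in chains for a, b in zip(chain, chain[1:])]

def add_bipolar_features(data, pairs):
    """Return a copy of `data` with one a_b = eeg_a - eeg_b entry per pair.

    `data` can be a dict of readings or a pandas DataFrame; both support
    .copy() and item access by column name.
    """
    out = data.copy()
    for a, b in pairs:
        out[f"{a}_{b}"] = data[f"eeg_{a}"] - data[f"eeg_{b}"]
    return out

pairs = bipolar_pairs(CHAINS)
electrodes = sorted({e for chain in CHAINS for e in chain})

# Hypothetical readings for a single time step (values are invented).
row = {f"eeg_{e}": float(i) for i, e in enumerate(electrodes)}
features = add_bipolar_features(row, pairs)

print(len(electrodes), "electrodes,", len(pairs), "chosen pairs,",
      comb(len(electrodes), 2), "possible unordered pairs")
```

Calling add_bipolar_features on both the training and the test DataFrame keeps the two feature sets in sync.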

Approach 3 - Using SMOTE:

In the EDA we observed some class imbalance; to overcome it, we can use a technique called SMOTE.

SMOTE (Synthetic Minority Over-sampling Technique): One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE.

This approach follows the same steps as the two approaches above, with the additional step of oversampling the data.
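The core idea of SMOTE, synthesizing a new minority example on the line segment between an existing example and one of its nearest minority neighbours, can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions, not the implementation used in the notebooks; the function name and toy points are hypothetical, and libraries such as imbalanced-learn provide a complete version.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class
    samples and one of their k nearest minority neighbours (needs > k rows)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)    # randomly chosen minority samples
    pick = rng.integers(0, k, size=n_new)    # which of the k neighbours to use
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))             # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Hypothetical minority-class points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_min, n_new=6)
print(synthetic)
```

The pairwise distance matrix makes this sketch suitable only for small arrays. Oversampling should also be applied to the training split only, so that the validation data keeps its original class frequencies.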

Selecting the best approach

After all the analysis, I found that approach 1 without SMOTE on the LightGBM model works best.

One reason SMOTE does not work well may be that the frequency of occurrence of an event also carries information. So if an event has a similar distribution in the train and test datasets, it is best not to use SMOTE.

Best Approach with hyperparameter tuning

I proceeded with approach 1 and LightGBM, as it performed best. LightGBM is a gradient boosting framework that uses tree-based learning algorithms.

I tried various combinations of hyperparameters until I found the ones that gave me the best public and private scores on Kaggle.

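import lightgbm as lgb

params = {
    'objective': 'multiclass',
    'metric': 'multi_error',
    'boosting': 'gbdt',
    'num_class': 4,
    'num_leaves': 30,
    'learning_rate': 0.06,
    'bagging_fraction': 0.9,
    'bagging_seed': 0,
    'num_threads': 4,
    'colsample_bytree': 0.4,
    'min_data_in_leaf': 100,
    'min_split_gain': 0.00015,
}

# lgbtrain and lgbtest are assumed to be lgb.Dataset objects for the training and validation splits.
# early_stopping_rounds and verbose_eval are lgb.train arguments in LightGBM releases before 4.0.
model_lgb2 = lgb.train(
    params,
    lgbtrain,
    2000,
    valid_sets=[lgbtrain, lgbtest],
    early_stopping_rounds=200,
    verbose_eval=100,
)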

About hyperparameters:

num_leaves: number of leaves in full tree, default: 31
learning_rate: This determines the impact of each tree on the final outcome. GBM works by starting with an initial estimate which is updated using the output of each tree. The learning parameter controls the magnitude of this change in the estimates.
bagging_fraction: Randomly select part of the data without resampling. It can be used to speed up training and to deal with over-fitting. Note that LightGBM only applies bagging when bagging_freq is also set to a non-zero value.
bagging_seed: Random seed for bagging
colsample_bytree: An alias for feature_fraction. If it is smaller than 1.0, LightGBM randomly selects that fraction of the features on each iteration (tree). For example, if you set it to 0.8, LightGBM will select 80% of the features before training each tree. It can be used to speed up training and to deal with over-fitting.
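The role of learning_rate described above can be illustrated with a toy boosting loop. This is a deliberately simplified sketch for squared-error regression, not the multi-class objective used in this article: each hypothetical "tree" is assumed to predict the current residual exactly, so only the shrinkage step is shown.

```python
def boost_with_ideal_trees(y, learning_rate, n_rounds):
    """Toy boosting loop where each 'tree' predicts the current residual
    exactly; the learning rate scales how much of it is added each round."""
    initial = sum(y) / len(y)              # initial estimate: the mean target
    preds = [initial] * len(y)
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, preds)]  # what the next tree fits
        preds = [p + learning_rate * r for p, r in zip(preds, residuals)]
    return preds

# Hypothetical targets.
y = [1.0, 2.0, 6.0]
slow = boost_with_ideal_trees(y, learning_rate=0.06, n_rounds=10)
fast = boost_with_ideal_trees(y, learning_rate=0.5, n_rounds=10)
print([round(p, 3) for p in slow])
print([round(p, 3) for p in fast])
```

In this idealized setting the remaining error shrinks by a factor of (1 - learning_rate) per round, which is why a small learning rate such as 0.06 is usually paired with many boosting rounds and early stopping.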

Results

Future Works

With a better understanding of domain knowledge, we can introduce some new features from the given ones which can be helpful in improving the performance.
We can research more about every feature individually as well and combine some of them to make new ones which can be helpful.
Due to computational limitations, I was not able to train an SVM model, as it is a computation-intensive algorithm. We can try this model as well.

References

I have uploaded all my notebooks to my GitHub repository. You can find a much more in-depth analysis of all the approaches in these notebooks.

This is my first Medium article, and I have tried to include as much information as possible. Please share your valuable feedback.

You can also reach me on LinkedIn.

Thanks for reading!!!