Wait, Ever Examined Your Pilot Before Boarding?

Neha Sikerwar

Published in

The Startup

14 min read · Aug 14, 2020

It can prevent your flight from heading for trouble.

Find the GitHub link of the project. LinkedIn profile.

Introduction:

The business problem here is reducing aircraft fatalities. It’s a Kaggle competition, and you can find the link here. We try to build a model that uses a pilot’s physiological data to detect whether the pilot is in a dangerous cognitive state, so that pilots can be alerted in time to prevent accidents. And we can save lives, yes.

Here we conduct experiments with real physiological data of pilots to detect the following cognitive states:

Channelized Attention (CA) is, roughly speaking, the state of being focused on one task to the exclusion of all others. This is induced in benchmarking by having the subjects play an engaging puzzle-based video game.
Diverted Attention (DA) is the state of having one’s attention diverted by actions or thought processes associated with a decision. This is induced by having the subjects perform a display monitoring task. Periodically, a math problem showed up which had to be solved before returning to the monitoring task.
Startle/Surprise (SS) is induced by having the subjects watch movie clips with jump scares.

A pilot can experience any of these states, but only one at a time; if the data is normal, the label is “no event” (baseline). So it’s a classification problem with four classes, and we need to predict the probability of each state for the pilots. We can use multi-class log loss as the performance metric. I have used machine learning as well as deep learning models for this case study.
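To make the metric concrete, here is a minimal sketch of multi-class log loss in plain NumPy. The labels and probabilities are made up for illustration and are not taken from the competition data.

```python
import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    y_true = np.asarray(y_true)
    # Clip so that log(0) can never occur, then renormalise each row
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    true_class_prob = y_prob[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_prob)))

# Hypothetical predictions for three rows over the four classes
y_true = [0, 2, 3]
y_prob = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]
loss = multiclass_log_loss(y_true, y_prob)
```

The loss only looks at the probability given to the correct class, so a confident wrong prediction is punished much harder than an unsure one.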

Also, each sensor operated at a sample rate of 256 Hz. Now let’s move towards data.

Data description:

Variables with the eeg prefix are electroencephalogram recordings: eeg_fp1, eeg_f7, eeg_f8, eeg_t4, eeg_t6, eeg_t5, eeg_t3, eeg_fp2, eeg_o1, eeg_p3, eeg_pz, eeg_f3, eeg_fz, eeg_f4, eeg_c4, eeg_p4, eeg_poz, eeg_c3, eeg_cz, eeg_o2.

So there are a total of 27 independent features and ‘event’ as a dependent feature with four classes.

Steps:

There are 6 steps in which I’m going to explain my project end to end:

Fetching the data.
Reading the data.
EDA
Feature Engineering
Modeling
Conclusion

Fetching the data:

I directly fetched the zipped data from Kaggle using CurlWget (CurlWget builds a command line for the ‘curl/wget’ tools to enable the download of data in a console-only session). Then I unzipped the folder and got two CSV files: train.csv (1.15 GB) and test.csv (4.46 GB).

Reading the data:

As the data is huge, I needed some memory optimization code, which I found here. It checks the data type of every column one by one. If a column is of object type, it is converted to category type. If a column is of int or float type, it is downcast according to its min and max values to the smallest type that fits: int8, int16, int32 or int64 for integers, and float32 or float64 for floats. In this way it returns the optimized data and we save memory.
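The downcasting idea can be sketched roughly like this. It is not the exact function from the link; the name reduce_mem_usage and the tiny demo frame are my own assumptions.

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """Return a copy of df with every column stored in a smaller dtype."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            # Pick the smallest signed integer type that holds min and max
            for dtype in (np.int8, np.int16, np.int32, np.int64):
                info = np.iinfo(dtype)
                if info.min <= s.min() and s.max() <= info.max:
                    out[col] = s.astype(dtype)
                    break
        elif pd.api.types.is_float_dtype(s):
            # float16 is skipped on purpose; only float32 or float64 are used
            info = np.finfo(np.float32)
            fits = info.min <= s.min() and s.max() <= info.max
            out[col] = s.astype(np.float32 if fits else np.float64)
        elif pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s):
            out[col] = s.astype("category")
    return out

# Hypothetical three-row frame with the same kinds of columns as train.csv
demo = pd.DataFrame({
    "crew": [1, 2, 13],
    "time": [0.01, 0.5, 360.2],
    "experiment": ["CA", "DA", "SS"],
})
small = reduce_mem_usage(demo)
```

Note that float32 keeps only about seven significant digits, so this trades a little precision for roughly half the memory on float columns.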

We have “4867421 rows × 28 columns” in the train dataset. And, “17965143 rows × 28 columns” in the test dataset.

In this section I also encoded the ‘experiment’ and ‘event’ features, in this way:

dict_1 = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
dict_2 = {'CA': 0, 'DA': 1, 'SS': 3, 'LOFT': 4}
train['event'] = train['event'].apply(lambda x: dict_1[x])
train['experiment'] = train['experiment'].apply(lambda x: dict_2[x])

I am not using the float16 dtype because it has some issues, as mentioned in the link: it gives nan while calculating the mean. So I’m passing only float32 and float64 as float dtypes in the memory optimization function.

EDA (Exploratory Data Analysis):

1. Skewness checking:

According to Wikipedia, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.

As we can see in the table, most of the features are highly skewed (skewness less than -1 or more than 1), some are moderately skewed (skewness between -1 and -0.5 or between 0.5 and 1) and some are approximately symmetric (skewness between -0.5 and 0.5).
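The three buckets can be computed directly from pandas’ skew(). Below is a small sketch on synthetic columns (not the sensor data); the helper name skew_category is just for illustration.

```python
import numpy as np
import pandas as pd

def skew_category(value):
    """Map a skewness value to the three buckets used in the table."""
    magnitude = abs(value)
    if magnitude > 1:
        return "highly skewed"
    if magnitude >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

# Synthetic stand-ins for real features
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(size=5000),
    "right_tailed": rng.exponential(size=5000),
})

skewness = demo.skew()                 # one skewness value per column
report = skewness.apply(skew_category)
```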

2. Missing values checking: We don’t have any missing values in the train dataset, so no need to worry here.

3. Univariate Analysis:

First I’ll show the histogram plots of ‘event’ and ‘experiment’ with code and observations.
Then I created a new column ‘pilot’ from the ‘crew’ and ‘seat’ columns. Every crew has a unique crew id, and each pilot in a crew sits in either seat 0 or seat 1. We have 9 unique crew ids, so there are 9 pilots sitting in seat 0 and 9 pilots sitting in seat 1, which gives 18 unique pilots in our dataset. We’ll see a histogram plot of pilots too.
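One simple way to build such a ‘pilot’ id is to combine the two columns into a single number. The formula below is an assumption for illustration (any one-to-one combination of crew and seat works), and the crew ids in the demo frame are made up.

```python
import pandas as pd

# Hypothetical rows: three crews, two seats each
demo = pd.DataFrame({
    "crew": [1, 1, 2, 2, 13, 13],
    "seat": [0, 1, 0, 1, 0, 1],
})

# Seat 0 keeps the crew id, seat 1 gets the crew id plus 100,
# so every (crew, seat) pair maps to a distinct pilot id
demo["pilot"] = 100 * demo["seat"] + demo["crew"]
n_pilots = demo["pilot"].nunique()
```

With 9 crews in the real data, the same line gives 18 distinct pilot ids.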

Now we’ll see pdfs for all 20 eeg features, and the observations, with code.

Observations: The data looks approximately normally distributed. The distributions of both train and test look centered at 0. The variance is higher in the test dataset: the pdf of every test feature is more spread out than the pdf of the same feature in train.

Pdfs of ‘ecg’, ‘r’, ‘gsr’ with code and observations:

Distribution plot for ‘time’ feature with code and observation:

Observation: We can see from the plot that the time range in test is much larger than in train, meaning the train experiments ran for a much shorter time. Still, we can’t use the time feature, as flight simulator time has nothing to do with the experiment time.

4. Bivariate Analysis:

Count plot with ‘seat’ and ‘event’:

Observation: The plot suggests that the seat column may not have any impact on the ‘event’ outcome, as the counts for seat 0 and seat 1 are the same for each event. We know event 0 occurred the most often, but we cannot tell whether that is because the pilot is sitting in seat 0 or seat 1, because both counts are equal. We will check whether this feature matters when we look at feature importance later.

The simplest bivariate plot is the scatter plot. We need to understand how variables interact with one another, and scatter plots also tell us about correlation. So we will plot the scatter plots and see what we can observe.

Observations:

Originally, we had 28 features in the train dataset and 28 features in the test dataset. We have ‘event’ as the output/dependent variable in the train dataset.
And, an ‘id’ feature in the test dataset which is not present in the train dataset. We’ll use it only for the submission, not in the model prediction.
We have huge train and test datasets, so we used a memory optimization function to change the datatypes of columns accordingly (int8, float32).
From EDA we know most of the features look roughly Gaussian but skewed. As decision trees and gradient boosting algorithms are not affected by skewness, we are not going to handle it.
We have imbalanced data, so we can either pass sample weights to balance the classes or set the ‘class_weight’ attribute to ‘balanced’ in the models.
The ‘experiment’ feature has ‘CA’, ‘DA’, ‘SS’ values in the train dataset, and ‘LOFT’ in the test dataset. So we are not going to consider this feature for our model.
We have 9 unique crews, with one pilot in seat 0 and one in seat 1, so we have 18 unique pilots. That is how we created the pilot column, and we will remove the crew and seat features. As we saw in the “count plot for seat w.r.t. event”, the seat feature is not helping with events.
We can’t use the “time” feature, as flight simulator time has nothing to do with the experiment time.
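For the imbalance point in the observations, this is roughly what ‘balanced’ weights look like when computed by hand. It follows the formula scikit-learn documents for class_weight='balanced', n_samples / (n_classes * count of the class); the label array is a toy example.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels: class 0 dominates, classes 2 and 3 are rare
y = np.array([0] * 6 + [1] * 2 + [2] * 1 + [3] * 1)
weights = balanced_class_weights(y)

# Per-row weights, usable as the sample_weight argument of many fit() methods
sample_weight = np.array([weights[label] for label in y])
```

Rare classes get weights above 1 and the majority class gets a weight below 1, so every class contributes the same total weight to the loss.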

Feature Engineering:

Adding new features:

Each row is a collection of sensor readings from the experiment conducted on one pilot. We are not given the pilot directly, but we can work it out from the crew and seat columns.
There are 9 unique crews. In each crew, there are two pilots, one on the left seat and one on the right seat. So a total of 18 pilots.
The train data is collected from experiments conducted on pilots in different situations. The test data is collected during a flight simulation.
Therefore we will not use the ‘experiment’, ‘crew’, ‘seat’ and ‘time’ features in the model, as they will not be useful for the prediction.
Now we will use the “biosppy” APIs to create new features; find the documentation here. And I’ll give credit to this, which introduced these APIs. Each function takes a biometric signal as input and returns the quantities derived for that signal type, e.g. ecg.ecg(), eda.eda(), eeg.eeg(), etc. We’ll see the code.
Then we will use cubic interpolation to give values for timestamps in between.
Some sensors have a reading of 0, i.e., some pilots have missing sensor data. It may be due to human error or noise. We will set those readings to nan.
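The last two steps can be sketched as follows. I am assuming here that a derived feature (a heart-rate-like value, for example) is only available at a few timestamps and has to be filled in for the rows in between; all numbers are invented, and SciPy’s interp1d stands in for whatever interpolation call you prefer.

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical derived feature, known only at a few timestamps (seconds)
known_times = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
known_values = np.array([60.0, 62.0, 65.0, 63.0, 61.0])

# Timestamps of the rows that sit in between and need a value
row_times = np.array([0.5, 1.5, 2.5, 3.5])

# Cubic interpolation passes through every known point
cubic = interp1d(known_times, known_values, kind="cubic")
row_values = cubic(row_times)

# Zero readings are treated as missing sensors and replaced with nan
readings = np.array([0.0, 812.5, 0.0, 815.0])
readings[readings == 0] = np.nan
```

Cubic interpolation needs at least four known points, and by default interp1d refuses to extrapolate outside the known time range.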

In this way, we now have 103 new features and 1 new ‘pilot’ feature. So we have a total of 131 features, from which I’ll drop ‘crew’, ‘seat’, ‘time’ and ‘experiment’. That leaves 127 features and ‘event’ as the output. Now we’ll look at feature importance and try to remove highly collinear features, so we can improve our model performance.

Multicollinearity or Feature Selection:

Here we are using 3 techniques:

Variance inflation factor
Permutation Importance
Recursive Feature Elimination

Then we can take features from any of the three techniques.

Variance Inflation Factor (VIF):

I referred to this link. And according to this:

VIF starts at 1 and has no upper limit
VIF = 1, no correlation between the independent variable and the other variables
VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others.

You can find the code below:

And I got very high values of VIF and the value of c is 67. I’m attaching some top values of VIF from the output.

So I decided on a threshold above which I’ll remove features recursively. Let’s take the threshold as 50. With the same code, I removed the feature with the highest VIF, one at a time, until the highest remaining VIF was less than 50. I saved the final features in a pickle file, in case I use them for our model.
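The recursive removal can be sketched like this. It is not my original code: here the VIFs are read off the diagonal of the inverse correlation matrix, which is equivalent to 1 / (1 - R²) when each feature is regressed on the others with an intercept. The function names and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd

def compute_vif(df):
    """VIF of every column, from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(df.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)

def drop_high_vif(df, threshold=50.0):
    """Drop the highest-VIF column, one at a time, until all are below threshold."""
    kept = df.copy()
    while kept.shape[1] > 1:
        vif = compute_vif(kept)
        if vif.max() < threshold:
            break
        kept = kept.drop(columns=vif.idxmax())
    return kept

# Synthetic data: c is almost exactly a + b, d is independent
rng = np.random.default_rng(0)
a = rng.normal(size=2000)
b = rng.normal(size=2000)
demo = pd.DataFrame({
    "a": a,
    "b": b,
    "c": a + b + 0.01 * rng.normal(size=2000),
    "d": rng.normal(size=2000),
})
kept = drop_high_vif(demo, threshold=50.0)
```

The VIFs are recomputed after every removal, because dropping one collinear feature usually lowers the VIF of the others.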

Permutation Importance:

There are some nice links I found on permutation importance: here and here. Permutation importance measures how much model performance degrades when the values of one column are randomly shuffled, one column at a time, so the model does not need to be retrained. Please find the code below. We can keep the features whose weight is more than zero, which gives a total of 120 features that we can use for our models.
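As a rough sketch of the idea, here is permutation importance with scikit-learn’s permutation_importance on synthetic data. This is a swapped-in tool for illustration, not necessarily the library used in the project, and the random forest is just a convenient stand-in model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first two of four columns drive the label
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure the score drop
result = permutation_importance(
    model, X_valid, y_valid,
    scoring="neg_log_loss", n_repeats=5, random_state=0,
)

# Keep the columns whose mean importance is above zero
selected = [i for i, w in enumerate(result.importances_mean) if w > 0]
```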

Recursive Feature Elimination (RFE)

You can find the documentation of RFE here. Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached. RFECV additionally uses cross-validation to find the optimal number of features. By dropping weak features, RFE can also remove some of the collinearity present in the data.

The selector exposes a feature ranking, ranking_, such that ranking_[i] corresponds to the ranked position of feature i. Selected features are assigned rank 1. Here a total of 118 features are selected. Please find the code below:
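A minimal RFECV sketch on synthetic data looks like this. The logistic regression estimator and the tiny dataset are assumptions chosen to keep the example fast; they are not the setup used on the real 127 features.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first two of six columns drive the label
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Remove one feature per step; cross-validation picks how many to keep
selector = RFECV(
    LogisticRegression(max_iter=1000),
    step=1, cv=3, scoring="neg_log_loss",
)
selector.fit(X, y)

# Selected features are the ones with rank 1
selected = np.flatnonzero(selector.ranking_ == 1)
```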

Now we’ll proceed to the modeling Section.

Modeling:

I’ve applied multiple models like logistic regression, SVM, ensemble models (lightGBM, xgboost, random forest, CatBoost, AdaBoost), and deep learning models (CNN1D, dense neural network). Let’s see them one by one.

LightGBM:

LightGBM is a gradient boosting framework. It’s fast, can handle large-scale data, and uses little memory, which matters in our case because we have huge data. I found a nice blog on it. I did hyperparameter tuning first and then the final training of the model. Find the code below:

I must say I got pretty good model performance with this first model. Also, the model is neither overfitting nor underfitting. Then I made predictions for the test set and checked the Kaggle score, which is 4.67354.

XGBoost:

XGBoost also uses boosting. It is designed to give accurate results efficiently through systems optimization, and I really got one of the best results with this model. The most important thing here is to tune the parameters, so go through this document to know more about the params. We have 2 forms of XGBoost:

Direct xgboost library (xgb)
sklearn wrapper for XGBoost (XGBClassifier)

First, we will see the code of sklearn XGBClassifier and its score.

I’m really impressed with its performance. Loss has decreased significantly. Now it’s time to see the Kaggle score.

CatBoost:

CatBoost uses gradient boosting on decision trees. It takes more time compared to xgboost and lightgbm. Catboost can give comparatively good results in case of very high dimensions of data. But in my case, I don’t have very high dimensions.

I tried with a maximum of 1000 iterations and the model kept decreasing the loss, but it took nearly 4 hours. Anyone without memory and time constraints can try more iterations. I think it’s possible to improve this model further. As its performance isn’t that good, I did not check the Kaggle score, because processing the test dataset takes a lot of memory.

Then I tried Random Forest, Logistic regression, and SVM (SGDClassifier), but none of these models gave better performance than xgboost. So I’ll move ahead.

Xgboost 2 (xgb.cv):

Here I tried xgb.cv from the direct xgboost library instead of the sklearn wrapper. Here is a nice blog I found on it. To tune the hyperparameters, I used xgb.cv, and I suggest going through the parameters in the document before setting their values.

In the first step I’ll tune ‘max_depth’ and ‘min_child_weight’, and then ‘subsample’ and ‘colsample_bytree’. I have created a function for the tuning of params:

I have attached the screenshots to show the outputs. Model performance is not that good with param-tuning with xgb.cv and learning rate 0.08. Maybe if we increase either the learning rate or n_estimators we get better performance.

XgBoost 3 (XGBClassifier):

I thought if I tune more parameters in xgboost, maybe I can get better performance. I found this nice article for param tuning in xgboost. So let’s see the code and results:

Final training of the model:

And this is the best model performance I’ve got yet. So we’ll see the Kaggle score.

Deep neural networks (CNN-1D):

We have sensor data as input here, so I thought a 1D CNN might work better, as it tends to work well for signal data such as sensor or audio signals. There’s a nice link for the CNN 1D architecture. Find the architecture of the model below:

I did not get better performance than xgboost. But the architecture can be improved further, and model performance may also improve by changing the learning rate and the number of epochs.

Dense Neural Network:

Out of curiosity, I applied a dense neural network, to check how dense networks will perform. I added five dense hidden layers and one dense output layer. I have not added any dropout layer. Maybe performance can be improved by adding some dropout layers and by changing the learning rate. Find the architecture below:

This model’s performance is worse than CNN-1D.

AdaBoostClassifier:

Here is an excerpt from this blog to explain AdaBoost. Adaptive Boosting is one of the ensemble boosting classifiers. AdaBoost is an iterative ensemble method: it builds a strong classifier by combining multiple poorly performing classifiers, so that you get a high-accuracy strong classifier. AdaBoost should meet two conditions:

The classifier should be trained iteratively on various weighted training examples.
In each iteration, it tries to provide an excellent fit for these examples by minimizing training error.
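As a quick illustration of the method, here is a minimal AdaBoostClassifier sketch on synthetic four-class data, scored with log loss. The hyperparameter values are placeholders, not the ones used in the project.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic four-class data built from the signs of two columns
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, random_state=0, stratify=y
)

# Each boosting round fits a weak learner (a depth-1 tree by default)
# on re-weighted training examples
model = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
model.fit(X_train, y_train)

valid_proba = model.predict_proba(X_valid)
valid_loss = log_loss(y_valid, valid_proba, labels=[0, 1, 2, 3])
```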

We’ll see the code now:

Model performance is the worst of all the models. It may be improved further with param-tuning, a lower learning rate and a larger number of estimators.

Conclusion:

Results:

Below are all models with the corresponding cross-validation loss I got. In my case, the sklearn XGBClassifier gave the best performance of all (loss: 0.000755). As the data is huge, a model can take nearly 3–4 hours to train. That’s why I chose a high learning rate, so it converges faster. If anyone wants to check the code, I’ve attached my GitHub link.

Further Improvements:

If you don’t have memory and time constraints:

→ Try with more parameters’ tuning,

→ And, with a lower learning rate and a larger number of estimators or iterations. It can really improve your model performance.

Try to balance your data before modeling. You can try the ‘SMOTE’ technique. Here is the reference for it. It can really help.
Also, play with the deep learning models’ architecture; it might improve the performance. People have better scores than mine, so yes, performance can be improved further.