“Reducing Commercial Aviation Fatalities” Dataset Pipeline

Camila Duarte de Souza
Published in Analytics Vidhya · 3 min read · Dec 2, 2019

About the Dataset

Our goal with this dataset is to build a model that detects troubling events from aircrew physiological data. Most flight-related fatalities stem from a loss of “airplane state awareness”: ineffective attention management on the part of pilots who may be distracted, sleepy, or in otherwise dangerous cognitive states.

We used data acquired from current pilots in test situations, and our model performs real-time calculations to monitor pilots’ cognitive states. With this, pilots could be alerted when they enter a troubling state, preventing accidents and saving lives. You can get the dataset from the Kaggle website here.

Pipeline Description

EDA (Exploratory Data Analysis)

First of all, we load the dataset with pandas and start the EDA (Exploratory Data Analysis). One important thing to check first is whether the dataset has missing data. This dataset has none, so we can move forward. Another indispensable check is whether the dataset is imbalanced, because if so, our results may be skewed. Data is said to be imbalanced when instances of one class outnumber the others by a large proportion.
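A minimal sketch of these first checks, assuming the training file is named train.csv and the target column is called event:

```python
import pandas as pd

# Load the training data (file name is an assumption)
df = pd.read_csv("train.csv")

# Check for missing values in every column
print(df.isnull().sum())

# Check the class distribution of the target column
print(df["event"].value_counts(normalize=True))
```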

Checking Imbalance

We can clearly see that there is an imbalance; I will explain how we address it below. Next, I checked the influence of some physical measurements on the final result.

Checking physical measurements’ influence on the final result

Feature Engineering

In this step of the pipeline, a new column called “pilot” was created by combining the seat and crew data.
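A minimal sketch of this step; the exact encoding below is an assumption, since any scheme that yields a unique value per (crew, seat) pair works:

```python
# Combine the crew and seat columns into a single "pilot" identifier
df["pilot"] = 100 * df["seat"] + df["crew"]
```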

Solving the Imbalance

To solve the imbalance problem we can use a technique called SMOTE (Synthetic Minority Over-sampling Technique). It generates synthetic data for the minority class by joining minority-class points with line segments and then placing artificial points along those lines.
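A sketch of the oversampling step using the SMOTE implementation from the imbalanced-learn package; SMOTE assumes all-numeric features, and the column names carry over from the EDA sketch above:

```python
from imblearn.over_sampling import SMOTE

# Separate features and target
X = df.drop(columns=["event"])
y = df["event"]

# SMOTE interpolates new minority-class points along line segments
# between existing minority-class neighbors
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
```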

Feature Importance

Because this is a very large dataset, you can speed up training and other steps, like SMOTE, by reducing the number of columns. One option is to keep only the most important features:
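One way to rank the columns is to fit a quick tree-based model and inspect its importances; a sketch, assuming the resampled data from the previous step:

```python
import lightgbm as lgb
import pandas as pd

# Fit a quick model just to rank the features (parameters are illustrative)
model = lgb.LGBMClassifier(n_estimators=100)
model.fit(X_resampled, y_resampled)

# List the most informative columns
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(15))
```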

Choosing the Model & Training

We tested several models, and the one that performed best was LightGBM. Given the nature of the problem, a tree-based model such as LightGBM, Random Forest, Gradient Boosting, or XGBoost was almost inevitable. LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks. You can see the training step below:
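A minimal sketch of a LightGBM training step along these lines; the hyperparameters are illustrative assumptions, not the exact values used:

```python
from sklearn.model_selection import train_test_split
import lightgbm as lgb

# Hold out a validation set to monitor the multiclass log loss
X_train, X_valid, y_train, y_valid = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42
)

# Hyperparameters here are illustrative, not tuned values
model = lgb.LGBMClassifier(
    objective="multiclass",
    n_estimators=500,
    learning_rate=0.1,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric="multi_logloss",
)
```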

Results

To compute the efficiency of the model we used the Logarithmic Loss metric, since this is the metric used in the Kaggle competition. Logarithmic Loss, also called Log Loss, penalizes false classifications, punishing confident wrong predictions most heavily. The log loss achieved was 0.173.
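A sketch of how the metric can be computed on a held-out validation set with scikit-learn, continuing the assumed names from the training sketch:

```python
from sklearn.metrics import log_loss

# Predicted class probabilities for the validation rows
proba = model.predict_proba(X_valid)

# Multiclass log loss: confident wrong predictions are penalized most
print(log_loss(y_valid, proba))
```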
