Can Driver Alertness Be Predicted?

Driver Alertness Analysis and Prediction using Python and Machine Learning Algorithms

Herambh Dakshinamoorthy
CodeX
6 min read · Jul 31, 2021



To view the complete code click here.

Each year, drowsy driving accounts for about 100,000 crashes, 71,000 injuries, and 1,550 fatalities, according to the National Safety Council (NSC). Drowsy driving contributes to an estimated 9.5% of all crashes, according to AAA.

We will design a classifier that predicts whether a driver is alert, using data acquired while driving, and examine the most important features affecting an individual's alertness at the wheel.

A classifier is a machine learning algorithm that assigns data points to one of a set of “classes.” These “classes” are the labels/categories we are targeting, which in this case is a simple “yes” or “no” indicating whether the driver is alert.

Method

This project uses Python and pandas, along with libraries such as scikit-learn, Matplotlib, and seaborn, to manipulate the data, visualize it, and make predictions. Three different classifiers have been used to compare and select the best-suited algorithm: the Decision Tree Classifier, the Random Forest Classifier, and the XGBoost Classifier.

As the name suggests, a Decision Tree Classifier is a tree containing a set of decisions, derived from the behaviour of the data, that are made step by step. At each step (known as a node) of a decision tree, we form a condition on the features that separates the “classes” contained in the dataset.

Though a Decision Tree Classifier may classify the labels well, a much more effective strategy is to combine the results of several decision trees trained with slightly different parameters. This algorithm is called a Random Forest Classifier. The idea behind it is that the errors made by the individual trees, upon averaging, tend to cancel out.

The XGBoost Classifier is a Gradient Boosting Classifier. The term “gradient” refers to the fact that the algorithm reduces the loss function iteration by iteration. “Boosting” is a learning technique that combines a group of weak learners into a single strong learner. The Gradient Boosting Classifier therefore fits each new predictor to the residual errors made by the previous predictor, improving the accuracy step by step.

Here are the steps involved:

  1. Downloading the data
  2. Data Analysis and Cleaning
  3. Feature Scaling
  4. Splitting into Training, Validation and Test Sets
  5. Training the Decision Tree Classifier
  6. Training the Random Forest Classifier
  7. Training the Gradient Boosting Model

Downloading the Data

The first step is to download the data we intend to use. The dataset is available at https://www.kaggle.com/c/stayalert/data?select=fordTrain.csv. Using the opendatasets library from Jovian, I downloaded the data from Kaggle, as sketched below.
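A minimal download sketch, assuming the opendatasets package is installed (pip install opendatasets) and that you have a Kaggle account; the library prompts for your Kaggle username and API key:

    import opendatasets as od

    # Download the Stay Alert! competition data from Kaggle;
    # the files land in a local folder named after the competition.
    dataset_url = 'https://www.kaggle.com/c/stayalert/data'
    od.download(dataset_url)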

Data Analysis and Cleaning

In order to make the data ready for use in a machine learning model, we must first analyze it and remove any anomalous data.


The first two columns appear to be identifying information about the driving trial and observation. The third column denotes whether the driver is alert: 1 indicates that he/she is alert, 0 indicates that they are not. The next 8 columns, with headers P1, P2, …, P8, represent physiological data. The next 11 columns, with headers E1, E2, …, E11, represent environmental data. The final 11 columns, with headers V1, V2, …, V11, represent vehicular data.
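A quick look at the data with pandas; the file path assumes opendatasets saved the competition files into a stayalert folder, and IsAlert is the name of the target column in this dataset:

    import pandas as pd

    # Load the training data downloaded from Kaggle
    raw_df = pd.read_csv('./stayalert/fordTrain.csv')

    print(raw_df.shape)                       # number of rows and columns
    print(raw_df['IsAlert'].value_counts())   # class balance: 1 = alert, 0 = not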

Feature Scaling

Since the values in each column span very different ranges, we need to scale them to values between 0 and 1, so that we can compare their respective weights and see how these parameters affect the alertness of the driver.
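A sketch of the scaling step using scikit-learn's MinMaxScaler, assuming the first three columns are identifiers and the target, and everything after them is a feature:

    from sklearn.preprocessing import MinMaxScaler

    # Everything after the first three columns (trial info and target)
    # is a feature: P1..P8, E1..E11, V1..V11
    feature_cols = raw_df.columns[3:]

    # Rescale every feature column to the range [0, 1]
    scaler = MinMaxScaler()
    raw_df[feature_cols] = scaler.fit_transform(raw_df[feature_cols])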

Training, Validation and Test Sets

We train our model on a training set. To see how it performs on previously unseen data, we evaluate it on a validation set. Finally, we look at the test set, which indicates how well our model would perform in real-life applications.

The Training Set has 75% and the Validation Set has 25% of the provided data. The Test Set has been separately provided.
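One way to create the 75/25 split with scikit-learn, continuing from the scaled dataframe above (the random_state value is arbitrary):

    from sklearn.model_selection import train_test_split

    # 75% training, 25% validation; the Kaggle test set is separate
    train_df, val_df = train_test_split(raw_df, test_size=0.25, random_state=42)

    train_inputs, train_targets = train_df[feature_cols], train_df['IsAlert']
    val_inputs, val_targets = val_df[feature_cols], val_df['IsAlert']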

Decision Tree Classifier

We can use DecisionTreeClassifier from sklearn.tree to train a decision tree.

To predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record and, based on the comparison, follows a branch and jumps to the next node, repeating the process until it reaches a leaf that assigns a class.
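A minimal training sketch, continuing with the inputs and targets from the split above:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Fit a decision tree on the training data
    tree = DecisionTreeClassifier(random_state=42)
    tree.fit(train_inputs, train_targets)

    # Evaluate on the validation set
    val_preds = tree.predict(val_inputs)
    print('Validation accuracy:', accuracy_score(val_targets, val_preds))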

Decision tree for the model, showing the features that most influence the alertness of a driver.

It appears that the vehicular and environmental factors have a more profound effect on the alertness of the driver than the physiological factors.

Hyperparameter Tuning

Often, the optimal model architecture is unknown to us, and thus we would like to explore a range of possibilities. Hyperparameters define the model architecture, and the process of searching for the ideal architecture is referred to as hyperparameter tuning.

In order to prevent overfitting the Training Set and to reduce the error on the Validation Set, we can tune various hyperparameters to make our model perform better.
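One simple tuning loop, sweeping the tree's max_depth and comparing training and validation accuracy (a sketch; the article does not list the exact hyperparameters that were tuned):

    from sklearn.tree import DecisionTreeClassifier

    # Deeper trees fit the training set better but overfit sooner
    for max_depth in [3, 5, 7, 10, 15, None]:
        model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
        model.fit(train_inputs, train_targets)
        print(max_depth,
              model.score(train_inputs, train_targets),   # training accuracy
              model.score(val_inputs, val_targets))       # validation accuracy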

For the Decision Tree Classifier we have an accuracy of 100% on the Training Set, 98.647% on the Validation Set, and 67.266% on the Test Set. The perfect training accuracy combined with the much lower test accuracy suggests the model is overfitting, so the Decision Tree Classifier is not the best model to predict our results.

Random Forest Classifier

A random forest works by averaging/combining the results of several decision trees. We’ll use the RandomForestClassifier class from sklearn.ensemble.
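A corresponding sketch (the n_estimators value is illustrative; n_jobs=-1 simply uses all CPU cores):

    from sklearn.ensemble import RandomForestClassifier

    # Train an ensemble of decision trees and average their votes
    forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    forest.fit(train_inputs, train_targets)

    print('Validation accuracy:', forest.score(val_inputs, val_targets))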

For the Random Forest Classifier we have an accuracy of 100% on the Training Set, 99.398% on the Validation Set and 80.198% on the Test Set. The Random Forest Classifier has definitely done a better job than the Decision Tree Classifier.

Gradient Boosting Model

The term “gradient” refers to the fact that each decision tree is trained with the purpose of reducing the loss from the previous iteration (similar to gradient descent). The term “boosting” refers to the general technique of training new models to improve the results of an existing model. To train a GBM for this classification task, we can use the XGBClassifier class from the XGBoost library.
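A sketch using XGBoost's scikit-learn-style wrapper (n_estimators and learning_rate are illustrative values, not necessarily the ones used in the notebook):

    from xgboost import XGBClassifier

    # Each new tree is fit to the residual errors of the ensemble so far
    xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
    xgb.fit(train_inputs, train_targets)

    print('Validation accuracy:', xgb.score(val_inputs, val_targets))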

For the XGBoost Classifier we have an accuracy of 100% on the Training Set, 99.569% on the Validation Set, and 86.954% on the Test Set. The Gradient Boosting Model seems to be the best option, as it produces the highest accuracy and is also computationally quick!

Summary and Final Thoughts

A driver’s alertness can be predicted with an accuracy of approximately 87% based on various physiological, vehicular, and environmental factors! This shows that it is, in fact, possible to predict the alertness of a driver using machine learning classifiers.

To see all of these steps in greater detail, check out the full project notebook here.


Herambh Dakshinamoorthy

Passionate Mechanical Engineering Undergrad merging Physics, Math, and AI to revolutionize the Automotive Industry for a cleaner, sustainable future.