Data Science: A Road to Safer Roads

Debdarsan Niyogi (PhD)
Rigging Real ‘Artificially’
8 min read · Sep 11, 2020

A treatise on Seattle’s car collision data

Hindsight is a wonderful thing but foresight is better, especially when it comes to saving life, or some pain!

— William Blake

Photo by Ricardo Gomez Angel on Unsplash

Prerequisite: Basic knowledge of data processing and machine learning using Python

Introduction

According to WHO statistics (7 Feb 2020):

Every year the lives of approximately 1.35 million people are cut short as a result of a road traffic crash. Between 20 and 50 million more people suffer non-fatal injuries, with many incurring a disability as a result of their injury.

According to the National Safety Council, traffic collisions cause more than 40,000 deaths and injure thousands of people every year across the United States. These are not traffic accidents, but entirely preventable tragedies.

Objective

Predicting the severity of car accidents from historical data is undoubtedly crucial, and machine learning is ideally suited for this purpose. In this article, we are going to explore, analyze and model real-world car collision data published by the Govt. of Seattle. In this journey, we will touch upon cleaning data, data imputation, feature selection and tackling extremely skewed data, and finally evaluate different machine learning models.

Interests

Beyond saving lives, severity prediction has several practical uses:

  • Safe route planning
  • Emergency vehicle allocation
  • Roadway design
  • Reduced property damage
  • Placement of additional signage (e.g. warnings for accident-prone areas)

Data Loading

The car collision data is obtained from Seattle Govt’s website (Time frame: 2004 to Present).

Upon loading the data into a Pandas DataFrame (df), we can immediately inspect its profile (using pandas-profiling).
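A minimal sketch of this step (the CSV file name is an assumption; the data is downloaded from the Seattle government's open-data portal):

    import pandas as pd
    from pandas_profiling import ProfileReport

    # Load the collision data (file name is an assumption).
    df = pd.read_csv('Collisions.csv', low_memory=False)

    # Generate an HTML profile of every column with pandas-profiling.
    profile = ProfileReport(df, title='Seattle Collisions Profile')
    profile.to_file('collisions_profile.html')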

The overview of the data profile report:

Data profile report using pandas-profiling

We see that the severity level of car accidents to be predicted (SEVERITYCODE) is listed against 39 independent variables (features). Some important features are summarized below to give a feeling of what we are dealing with (readers are encouraged to take a quick look at the full list published on the Govt. of Seattle's website):

List of important features

Exploratory Data Analysis (EDA)

In this phase, we concentrate on cleaning up missing data, value imputation, value re-grouping and data visualization.

A) Severity Code: Severity Code (SEVERITYCODE) is the target/dependent variable, so let us scrutinize it first. There are five codes: 0, 1, 2, 2b and 3. Since we are not going to predict an ‘Unknown’ severity (SEVERITYCODE = 0), those observations, along with the rows with missing values, can safely be deleted. The categorical values can also be remapped to a scale of 1 to 4, where 3 is reassigned to 4 and 2b to 3. The new mapping is as follows.

New severity code mapping
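A minimal sketch of this clean-up (whether the raw codes arrive as strings or mixed types depends on the file, so we cast to string first):

    # Drop rows with missing or 'Unknown' severity, then remap to 1-4.
    df = df[df['SEVERITYCODE'].notna()]
    df['SEVERITYCODE'] = df['SEVERITYCODE'].astype(str)
    df = df[df['SEVERITYCODE'] != '0']
    df['SEVERITYCODE'] = df['SEVERITYCODE'].map({'1': 1, '2': 2, '2b': 3, '3': 4})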

It is to be noted that the highest severity (4 = Fatal) accounts for only 0.18% of the observations; in other words, the distribution is extremely skewed.

Count of Severity Codes

B) Date Time:

Count of Accidents by Hour
Count of Accidents by Weekday (Day 0: Monday)
Count of Accidents by Month
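The plots above are derived from the incident timestamp. A sketch, assuming the raw timestamp column is named INCDTTM as in the published dataset:

    # Derive hour, weekday and month from the incident timestamp.
    df['INCDTTM'] = pd.to_datetime(df['INCDTTM'], errors='coerce')
    df['HOUR'] = df['INCDTTM'].dt.hour
    df['WEEKDAY'] = df['INCDTTM'].dt.weekday   # 0 = Monday, matching the plot above
    df['MONTH'] = df['INCDTTM'].dt.month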

C) Fixing missing values: We find that the features WEATHER, JUNCTIONTYPE, LIGHTCOND and ROADCOND share a common characteristic: they have both missing values and a pre-existing category named ‘Unknown’. We assign the missing values to ‘Unknown’.

The features ADDRTYPE, COLLISIONTYPE, PEDROWNOTGRNT, SPEEDING and INATTENTIONIND do not have a built-in ‘Unknown’ category, but we treat them the same way, as sketched below.
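A minimal sketch of this imputation:

    # Send all missing values in these categorical features to 'Unknown'.
    unknown_built_in = ['WEATHER', 'JUNCTIONTYPE', 'LIGHTCOND', 'ROADCOND']
    unknown_added = ['ADDRTYPE', 'COLLISIONTYPE', 'PEDROWNOTGRNT',
                     'SPEEDING', 'INATTENTIONIND']
    for col in unknown_built_in + unknown_added:
        df[col] = df[col].fillna('Unknown')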

D) Visualization: Here are some visualizations of different features that give us a first qualitative impression of how they are distributed.

Count of Accidents by Weather Condition
Count of Accidents by Junction Type
Count of Accidents by Light Condition
Count of Accidents by Road Condition Type
Count of Accidents by Address Type where the collision took place
Count of Accidents by Collision Type
Count of Accidents by Collision Code
Count of Accidents by flag describing if a parked car was hit

E) Under Influence Flag: The flag UNDERINFL, which indicates whether the driver was under the influence of alcohol, drugs etc., has ambiguous data (‘Y’, 1, ‘N’, 0 and missing values), but we can assume ‘N’ maps to ‘0’ and ‘Y’ maps to ‘1’.

Count of under-influence flag
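A sketch of this normalization; treating missing values as 0 (not under influence) is an assumption, not something stated in the data dictionary:

    # Normalize the mixed 'Y'/'N'/1/0/NaN values to clean 0/1 integers.
    df['UNDERINFL'] = (df['UNDERINFL']
                       .replace({'N': '0', 'Y': '1'})
                       .fillna('0')          # assumption: missing -> 0
                       .astype(int))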

F) Inspect Injuries, Serious Injuries and Fatalities Features: There are three variables: Injuries, Serious injuries and Fatalities. Let us see how the numbers are distributed among the four severity codes we have.

The matrix indicates a very strong correlation with severity.

Correlation of the injury variables

Apart from severity code 1 (“Property Damage Only Collision”), the severity code is assigned based on the injury level; the former is a direct reflection of the latter. If we used the injury features as predictors, they would overwhelm the other features, and the prediction would be based on the after-effects of a collision. Therefore, these three features will be ignored.

G) Grouping junction type and collision type with severity code:

H) Mode values of the features: It is interesting to see the mode (most frequent) value of each feature for each severity code (where S indicates the severity code).

Mode values of a few features for different severity codes

Feature Selection

Let us select the features of interest (ADDRTYPE, COLLISIONTYPE, JUNCTIONTYPE, INATTENTIONIND, UNDERINFL, WEATHER, ROADCOND, LIGHTCOND, PEDROWNOTGRNT, SPEEDING, ST_COLCODE, HITPARKEDCAR) and one-hot encode them.

Finally, all the columns having ‘Unknown’ in their headers (produced during one-hot encoding) are discarded, as they are both rare and uninformative.
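A sketch of the encoding and pruning steps:

    # One-hot encode the selected features, then drop the 'Unknown' dummies.
    features = ['ADDRTYPE', 'COLLISIONTYPE', 'JUNCTIONTYPE', 'INATTENTIONIND',
                'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
                'PEDROWNOTGRNT', 'SPEEDING', 'ST_COLCODE', 'HITPARKEDCAR']
    X = pd.get_dummies(df[features].astype(str))
    X = X.drop(columns=[c for c in X.columns if 'Unknown' in c])
    y = df['SEVERITYCODE']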

Classification Models for Multi-class, Skewed Distribution

This extremely skewed, multi-class data may not be amenable even to specialized classification models that deal with unbalanced data; nevertheless, we go ahead and get a taste of their performance. Here we have chosen the following classifiers, which are inherently capable of handling unbalanced data:

i) Bagging

ii) Balanced Bagging

iii) Balanced Random Forest and

iv) EasyEnsemble

The code snippet for training and predicting severity code
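The original snippet appears as an image; a minimal equivalent sketch using scikit-learn and imbalanced-learn (the split ratio and hyperparameters are assumptions) might look like this:

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import BaggingClassifier
    from sklearn.metrics import balanced_accuracy_score, confusion_matrix
    from imblearn.ensemble import (BalancedBaggingClassifier,
                                   BalancedRandomForestClassifier,
                                   EasyEnsembleClassifier)

    # Stratified split to preserve the skewed class proportions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    models = {
        'Bagging': BaggingClassifier(random_state=42),
        'Balanced Bagging': BalancedBaggingClassifier(random_state=42),
        'Balanced Random Forest': BalancedRandomForestClassifier(random_state=42),
        'EasyEnsemble': EasyEnsembleClassifier(random_state=42),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print(name, balanced_accuracy_score(y_test, y_pred))
        print(confusion_matrix(y_test, y_pred))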

The confusion matrices generated by the above code:

Confusion matrices along with balanced accuracy for different models

We see that in all four cases the balanced accuracy did not even cross 50%.

Multi-class to Two-class

Often, a skewed multi-class classification problem is converted into a two-class problem by pitting the minority class against all the remaining classes combined. In our situation, accidents with severity level 4 are fatal and the others are non-fatal. Therefore, we can focus on level-4 accidents and regroup the severity levels into level 4 versus all other levels. In this process, a new column ‘Severity 4’ is created.
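A one-line sketch of this regrouping:

    # Binary target: fatal (severity 4) vs. everything else.
    df['Severity 4'] = (df['SEVERITYCODE'] == 4).astype(int)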

Data Balancing

As seen above, severity 4 is extremely rare; in other words, the data is highly skewed. The main challenge with this type of data is that machine learning algorithms train to almost 100% accuracy yet fail to classify the minority class. This is intuitive: when the majority class makes up 99 per cent of the data, even a classifier hard-coded to always predict the majority class will achieve 99% accuracy.

We appreciate that a false negative, i.e. an actual severity code 4 that goes unpredicted, is very costly here. The situation is analogous to detecting fraudulent transactions or diagnosing diseases.

There are many ways to deal with this situation by synthetically balancing the data before training. We might (1) under-sample the majority class, (2) over-sample the minority class, or (3) combine (1) and (2), i.e. over- and under-sample simultaneously.

The combination of over- and under-sampling will be used: since the data is large enough, level 4 will be randomly over-sampled to 10,000 observations and the other levels will be randomly under-sampled to 10,000.
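A sketch of this combined resampling with imbalanced-learn (random_state is an assumption; X_train and y_train here come from a fresh split on the binary ‘Severity 4’ target):

    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    # Over-sample the fatal class (1) up to 10,000 observations...
    over = RandomOverSampler(sampling_strategy={1: 10000}, random_state=42)
    X_over, y_over = over.fit_resample(X_train, y_train)

    # ...then under-sample the non-fatal class (0) down to 10,000.
    under = RandomUnderSampler(sampling_strategy={0: 10000}, random_state=42)
    X_bal, y_bal = under.fit_resample(X_over, y_over)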

Correlation

Let us now get an idea of how the variables are correlated. The variable ST_COLCODE is excluded here, as its long list of categories would make the plot exceptionally tall.
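A sketch of how such a correlation plot might be produced with seaborn (an assumption; the article does not show its plotting code):

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Heatmap of the one-hot features, ST_COLCODE columns excluded.
    cols = [c for c in X_bal.columns if not c.startswith('ST_COLCODE')]
    plt.figure(figsize=(12, 10))
    sns.heatmap(X_bal[cols].corr(), cmap='coolwarm', center=0)
    plt.show()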

We can see that the variables are not highly correlated.

Classification Models (applied to balanced data)

We are going to consider the following models:

i) Logistic Regression

ii) k-Nearest Neighbors (kNN)

iii) Decision Tree

iv) Random Forest
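A minimal sketch of this comparison, trained on the balanced data and evaluated on the untouched test set (hyperparameters are assumptions):

    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    classifiers = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'kNN': KNeighborsClassifier(),
        'Decision Tree': DecisionTreeClassifier(random_state=42),
        'Random Forest': RandomForestClassifier(random_state=42),
    }
    for name, clf in classifiers.items():
        clf.fit(X_bal, y_bal)                   # train on the balanced data
        print(name, clf.score(X_test, y_test))  # evaluate on the original test set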

The results of the above four models are summarized below.

a) Accuracy: Accuracies achieved by different algorithms are shown here.

b) Confusion matrices:

c) Feature importance: Relative importance of features for

i) Decision Tree

ii) Random Forest

Inference

We can conclude that Random Forest is the best model in this scenario, with Decision Tree close behind and the other models performing almost the same. An interesting point to note here is that the top important features differ somewhat between the Random Forest and the Decision Tree models. Following the Random Forest model, we see that special attention needs to be given to pedestrians (the topmost feature), speeding, collisions with parked cars, rear-end collisions, and drivers under the influence of alcohol or drugs. This result is very much expected. The collision codes 50 (Struck Fixed Object), 32 (One Parked, One Moving), 10 (Entering at Angle) and 14 (From Same Direction, Both Going Straight, One Stopped, Rear End) are the major influencers.

The complete source code and the input data file can be found here.

Future Study

  • The relations between the key features and accident severity can be studied in further detail
  • Different data balancing techniques can be applied and evaluated
  • Development of a much more complex real-time accident risk prediction model
