Data Science: A Road to Safer Roads
A treatise on Seattle’s car collision data
Hindsight is a wonderful thing but foresight is better, especially when it comes to saving life, or some pain!
— William Blake
Prerequisite: Basic knowledge of data processing, machine learning using Python
Introduction
According to WHO statistics (7 Feb 2020):
Every year the lives of approximately 1.35 million people are cut short as a result of a road traffic crash. Between 20 and 50 million more people suffer non-fatal injuries, with many incurring a disability as a result of their injury.
According to the National Safety Council, traffic collisions cause more than 40,000 deaths and thousands of injuries every year across the United States. These are not traffic accidents but entirely preventable tragedies.
Objective
The ability to predict the severity of a car accident from historical data is undoubtedly crucial, and machine learning is ideally suited for this purpose. In this article, we explore, analyze and model real-world car collision data published by the Govt. of Seattle. Along the way, we touch upon data cleaning, data imputation, feature selection and the handling of extremely skewed data, and finally evaluate different model formulations.
Interests
The practical utilities of the prediction, besides saving lives:
- Safe route planning
- Emergency vehicle allocation
- Roadway design
- Reduce property damage
- Where to place additional signage (e.g. to warn for accident-prone areas)
Data Loading
The car collision data is obtained from Seattle Govt’s website (Time frame: 2004 to Present).
Upon loading the data into a Pandas dataframe (df), we can immediately inspect its profile (using pandas-profiling).
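As a sketch, loading and a quick first look might go as follows; the tiny inline sample (hypothetical rows, real column names) stands in for the full CSV downloaded from the Seattle portal.

```python
import io
import pandas as pd

# Stand-in for the downloaded collision CSV: hypothetical rows, real column names.
sample = io.StringIO(
    "SEVERITYCODE,ADDRTYPE,WEATHER,ROADCOND,LIGHTCOND\n"
    "1,Block,Raining,Wet,Dark - Street Lights On\n"
    "2,Intersection,Clear,Dry,Daylight\n"
)
df = pd.read_csv(sample)
print(df.shape)  # rows x columns of the sample

# Profiling the real frame (pandas-profiling is published today as ydata-profiling):
# from ydata_profiling import ProfileReport
# ProfileReport(df, minimal=True).to_file("collisions_profile.html")
```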
The overview of the data profile report:
We see that the severity level of car accidents to be predicted (SEVERITYCODE) is listed against 39 independent variables (features). Some important features are summarized below to give a feel for what we are dealing with (readers are urged to have a quick look at the entire list published on the Govt. of Seattle's website).
Exploratory Data Analysis (EDA)
In this phase, we concentrate on Cleaning up missing data, Value imputation, Value re-grouping and Data visualization.
A) Severity Code: Severity Code (SEVERITYCODE) is the target/dependent variable, so let us scrutinize that first. There are five codes: 0, 1, 2, 2b and 3. Since we are not going to predict an ‘Unknown’ severity (SEVERITYCODE = 0), these observations, along with the rows with missing values, can safely be deleted. The categorical values can then be remapped to a scale of 1 to 4, where 3 is assigned to 4 and 2b to 3. The new mapping is as follows.
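A minimal sketch of the filtering and remapping, on a hypothetical handful of rows:

```python
import pandas as pd

# Hypothetical rows covering every raw code plus a missing value.
df = pd.DataFrame({"SEVERITYCODE": ["0", "1", "2", "2b", "3", None]})

# Drop unknown severity ('0') and missing values, then remap to an ordinal 1-4 scale.
df = df[df["SEVERITYCODE"].notna() & (df["SEVERITYCODE"] != "0")]
severity_map = {"1": 1, "2": 2, "2b": 3, "3": 4}
df["SEVERITYCODE"] = df["SEVERITYCODE"].map(severity_map)
print(df["SEVERITYCODE"].tolist())  # [1, 2, 3, 4]
```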
It is to be noted that the highest severity (4 = Fatal) accounts for only 0.18% of the observations; in other words, the distribution is extremely skewed.
B) Date Time:
C) Fixing missing values: We find that the features WEATHER, JUNCTIONTYPE, LIGHTCOND and ROADCOND have common characteristics — they have both missing values and a pre-existing category named ‘Unknown’. We assign the missing values to ‘Unknown’.
The features ADDRTYPE, COLLISIONTYPE, PEDROWNOTGRNT, SPEEDING and INATTENTIONIND do not have a built-in ‘Unknown’ category, but we treat them in the same way.
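The imputation step can be sketched as follows (hypothetical rows; the real code would loop over all the listed columns):

```python
import pandas as pd

# Hypothetical rows: one column with a pre-existing 'Unknown' level, one flag column.
df = pd.DataFrame({
    "WEATHER": ["Raining", None, "Unknown"],
    "SPEEDING": ["Y", None, None],
})

# Assign missing values to 'Unknown' in both kinds of columns.
for col in ["WEATHER", "SPEEDING"]:
    df[col] = df[col].fillna("Unknown")

print(df["WEATHER"].tolist())   # ['Raining', 'Unknown', 'Unknown']
print(df["SPEEDING"].tolist())  # ['Y', 'Unknown', 'Unknown']
```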
D) Visualization: Here are some visualizations of different features that give us a first qualitative impression of how they are distributed.
E) Under Influence Flag: The flag UNDERINFL, which indicates whether the driver was under the influence of alcohol or drugs, has ambiguous data (‘Y’, 1, ‘N’, 0 and missing values), but we can assume ‘N’ maps to 0 and ‘Y’ maps to 1.
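A sketch of the clean-up; mapping the missing values to 0 (i.e. treating them as ‘N’) is an assumption on top of what the article states:

```python
import pandas as pd

# Hypothetical rows covering all the ambiguous encodings.
df = pd.DataFrame({"UNDERINFL": ["Y", "1", "N", "0", None]})

# Collapse the mixed encodings into a single 0/1 flag.
# Assumption: missing values are treated as 'N' (0).
df["UNDERINFL"] = (
    df["UNDERINFL"].map({"Y": 1, "1": 1, "N": 0, "0": 0}).fillna(0).astype(int)
)
print(df["UNDERINFL"].tolist())  # [1, 1, 0, 0, 0]
```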
F) Inspect Injuries, Serious Injuries and Fatalities Features: There are three variables: Injuries, Serious injuries and Fatalities. Let us see how the numbers are distributed among the four severity codes we have.
The matrix indicates a very strong correlation with severity. This is expected: apart from severity code 1 (“Property Damage Only Collision”), the severity code is assigned based on the injury level, so the former is a direct reflection of the latter. If we used the injury features as predictors, they would overwhelm the other features, and the prediction would be based on the after-effects of a collision. Therefore, these three features will be ignored.
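Dropping the after-effect columns is a one-liner (hypothetical mini-frame for illustration):

```python
import pandas as pd

# Hypothetical mini-frame with the three after-effect columns.
df = pd.DataFrame({"SEVERITYCODE": [1, 4], "INJURIES": [0, 2],
                   "SERIOUSINJURIES": [0, 1], "FATALITIES": [0, 1]})

# Drop the after-effect features so the model predicts from pre-collision conditions.
df = df.drop(columns=["INJURIES", "SERIOUSINJURIES", "FATALITIES"])
print(list(df.columns))  # ['SEVERITYCODE']
```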
G) Grouping junction type and collision type with severity code:
H) Mode values of the features: It is interesting to see the mode (highest frequency) values of each of the features concerning the severity codes (where S indicates severity code).
Feature Selection
Let us select the features of interest (ADDRTYPE, COLLISIONTYPE, JUNCTIONTYPE, INATTENTIONIND, UNDERINFL, WEATHER, ROADCOND, LIGHTCOND, PEDROWNOTGRNT, SPEEDING, ST_COLCODE, HITPARKEDCAR), and one-hot encode them.
Finally, all the columns having ‘Unknown’ in their headers (that were produced during one-hot encoding) are discarded as they are minorities as well as not informative.
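A sketch of the encoding and the subsequent clean-up, on a toy frame:

```python
import pandas as pd

# Toy frame with two of the selected categorical features.
df = pd.DataFrame({
    "WEATHER": ["Raining", "Clear", "Unknown"],
    "SPEEDING": ["Y", "N", "Unknown"],
})

features = pd.get_dummies(df)  # one-hot encode the categorical features
# Discard the uninformative 'Unknown' indicator columns produced by the encoding.
features = features.loc[:, ~features.columns.str.contains("Unknown")]
print(sorted(features.columns))
```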
Classification Models for Multi-class, Skewed Distribution
This extremely skewed, multi-class data may not be amenable even to the specialized classification models that deal with unbalanced data; nevertheless, we go ahead and get a taste of their performance. Here we have chosen the following classifiers, capable of handling unbalanced data inherently, for our study:
i) Bagging
ii) Balanced Bagging
iii) Balanced Random Forest and
iv) EasyEnsemble
The confusion matrices generated by the above code:
We see that in all four cases, balanced accuracy did not even cross 50%.
Multi-class to Two-class
Often, the skewed multi-class classification problem is converted to the two-class problem by taking the minority class versus the group of the rest of the classes. In our situation, the accidents with severity level 4 are fatal and others are non-fatal. Therefore, we can focus on level 4 accidents and regroup the levels of severity into level 4 versus other levels. In this process, a new column ‘Severity 4’ is created.
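The regrouping is a simple comparison (hypothetical rows):

```python
import pandas as pd

# Hypothetical remapped severity codes.
df = pd.DataFrame({"SEVERITYCODE": [1, 2, 3, 4, 2, 4]})

# Fatal (level 4) versus everything else.
df["Severity 4"] = (df["SEVERITYCODE"] == 4).astype(int)
print(df["Severity 4"].tolist())  # [0, 0, 0, 1, 0, 1]
```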
Data Balancing
As seen above, severity 4 is extremely rare; in other words, the data is highly skewed. The main challenge with this type of data is that machine learning algorithms train to almost 100% accuracy yet fail to classify the minority class. This is intuitive: when the majority class accounts for 99% of the observations, even a classifier hard-coded to always predict the majority class achieves 99% accuracy.
We appreciate that a false negative, i.e. an actual severity-4 accident that goes unpredicted, is very costly here. The situation is just like detecting fraudulent transactions or diagnosing diseases.
There are many ways to deal with this situation by balancing the data synthetically before training. We might (1) under-sample the majority class, (2) over-sample the minority class, or (3) combine (1) and (2), i.e. over- and under-sample simultaneously.
Since the data is large enough, the combination of over- and under-sampling will be used: level 4 will be randomly over-sampled to 10,000 observations and the other levels will be randomly under-sampled to 10,000.
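A sketch of the combined resampling with sklearn.utils.resample; the toy frame and the target of 100 rows per class stand in for the real data and the 10,000-row target:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical skewed data: 990 non-fatal rows, 10 fatal rows.
df = pd.DataFrame({"x": np.arange(1000),
                   "Severity 4": [0] * 990 + [1] * 10})

N = 100  # the article uses 10,000 per class on the full dataset
# Over-sample the minority class (with replacement) and under-sample the majority.
minority = resample(df[df["Severity 4"] == 1], replace=True,  n_samples=N, random_state=42)
majority = resample(df[df["Severity 4"] == 0], replace=False, n_samples=N, random_state=42)
balanced = pd.concat([minority, majority])
print(balanced["Severity 4"].value_counts().to_dict())  # {1: 100, 0: 100}
```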
Correlation
Let us now get an idea of how the variables are correlated. The variable ST_COLCODE is excluded here as this has a long list making the plot exceptionally tall.
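The correlation matrix itself is one call; a sketch on synthetic one-hot columns with hypothetical names (the article's plot is a heatmap of this matrix):

```python
import numpy as np
import pandas as pd

# Synthetic 0/1 columns standing in for the one-hot encoded features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(200, 4)),
                  columns=["SPEEDING_Y", "UNDERINFL", "PEDROWNOTGRNT_Y", "HITPARKEDCAR_Y"])

corr = df.corr()
# A heatmap of this matrix (e.g. seaborn.heatmap(corr)) gives the article's plot.
print(corr.round(2))
```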
We can see that the variables are not highly correlated.
Classification Models (applied to balanced data)
We are going to consider the models:
i) Logistic Regression
ii) k-Nearest Neighbors (kNN)
iii) Decision Tree
iv) Random Forest
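A sketch of the comparison on balanced synthetic data (the real run uses the resampled collision features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Balanced two-class stand-in for the resampled collision data.
X, y = make_classification(n_samples=2000, weights=[0.5, 0.5], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}
for name, clf in models.items():
    acc = accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
    print(f"{name}: accuracy = {acc:.3f}")
```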
The results of the above four models are summarized below.
a) Accuracy: Accuracies achieved by different algorithms are shown here.
b) Confusion matrices:
c) Feature importance: Relative importance of features for
i) Decision Tree
ii) Random Forest
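Feature importances come straight from the fitted tree models; a sketch with hypothetical feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with hypothetical feature names.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
cols = [f"feature_{i}" for i in range(6)]

rf = RandomForestClassifier(random_state=42).fit(X, y)
# Importances sum to 1; sorting gives the ranking plotted in the article.
importance = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(importance)
```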
Inference
We can conclude that Random Forest is the best model in this scenario, with Decision Tree close behind and the other models performing almost the same. An interesting point to note is that the top important features differ somewhat between the Random Forest and the Decision Tree models. Following the Random Forest model, we see that special attention needs to be given to pedestrians (the topmost important feature), speeding, collisions with parked cars, rear-end collisions, and drivers under the influence of alcohol or drugs. This result is very much expected. The collision codes 50 (Struck Fixed Object), 32 (One Parked — One Moving), 10 (Entering At Angle) and 14 (From Same Direction — Both Going Straight — One Stopped — Rear End) are the major influencers.
The complete source code and the input data file can be found here.
Future Study
- The relations between the key features and accident severity can be studied further in detail
- Different data balancing techniques can be applied and evaluated
- Development of a much more complex real-time accident risk prediction model