Car Accident Severity in Seattle, WA.

Augusto de Nevrezé
Published in The Startup
10 min read · Sep 10, 2020

Coursera Data Science Capstone Project

Introduction

In 2010, motor vehicle crashes in the United States killed 32,999 people, injured 3.9 million, and damaged 24 million vehicles. The economic costs of these crashes totaled $242 billion. Included in these losses are lost productivity, medical costs, legal and court costs, emergency service (EMS) costs, insurance administration costs, congestion costs, property damage, and workplace losses. This represents 1.6 percent of the $14.96 trillion real Gross Domestic Product for 2010.

Society as a whole (the accident victims and their families, their employers, insurance firms, emergency and health care personnel, and many others) is affected by motor vehicle crashes in many ways. It would be valuable to estimate how safe a trip is from real-time conditions. That way, a driver could decide beforehand, based on reliable information, whether to take the risk.

Data

The data used in the analysis is provided by the Traffic Records Group in the SDOT Traffic Management Division of Seattle, WA. It includes all collisions provided by the Seattle Police Department and recorded by Traffic Records, displayed at the intersection or mid-block of a segment, from 2004 to the present. The purpose of the project is to analyze and predict the severity of an accident based on a selected set of features.

Data cleaning

Many of the observations have incomplete information in the selected features, such as ‘NaN’ (Not a Number) values or badly formatted entries. At the same time, property-damage accidents are almost twice as frequent as those involving injuries. Remember also that the target variable, the one to be predicted, is SEVERITYCODE.

The data cleaning process must also balance the data, so that the two severities present in the dataset have the same number of entries. Severity 1 corresponds to collisions that involve only property damage, and severity 2 to collisions that involve personal injuries. This classification is based on the SDOT Traffic Management Division criteria.

Some features have categories such as “Unknown” or “Other”, which are not representative and do not add predictive information to the model. These categories, together with empty fields that have no valid entry, will also be dropped.

As mentioned before, the data needs to be balanced between the two categories in order to improve the accuracy of the predictive machine learning models (unless decision-tree-like models are trained). For this purpose, the imbalanced-learn library has been used, in particular its RandomUnderSampler class, to under-sample the dataframe. This strategy randomly removes the extra entries with severity 1 until both severities have the same number of entries.
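The cleaning and balancing steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy rows are invented, the helper names `clean` and `undersample` are hypothetical, and the under-sampling is done with plain pandas as a stand-in for imbalanced-learn's RandomUnderSampler, which performs the same kind of random majority-class reduction.

```python
import pandas as pd

# Toy stand-in for the collisions table; these rows are invented for illustration.
df = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 1, 1, 1, 1, 2, 2, 2, 1],
    "WEATHER": ["Clear", "Raining", "Unknown", "Clear", None,
                "Overcast", "Raining", "Clear", "Other", "Clear"],
    "ROADCOND": ["Dry", "Wet", "Dry", "Dry", "Wet",
                 "Dry", "Wet", "Dry", "Wet", "Unknown"],
})


def clean(frame, columns, uninformative=("Unknown", "Other")):
    """Drop rows with missing or uninformative values in the given columns."""
    frame = frame.dropna(subset=columns)
    bad = frame[columns].isin(list(uninformative)).any(axis=1)
    return frame[~bad]


def undersample(frame, target, seed=0):
    """Randomly drop rows of the larger classes until every class
    has as many rows as the smallest one."""
    n_min = frame[target].value_counts().min()
    parts = [group.sample(n=n_min, random_state=seed)
             for _, group in frame.groupby(target)]
    return pd.concat(parts)


cleaned = clean(df, ["WEATHER", "ROADCOND"])
balanced = undersample(cleaned, "SEVERITYCODE")
print(balanced["SEVERITYCODE"].value_counts())
```

Under-sampling discards data, so it is mostly reasonable when the majority class is large enough that losing part of it does not hurt, which seems to be the case here.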

Feature Selection

One of the most important questions before training the model is: do all the features add the same amount of information to the model? If not, which variables carry more weight? Several techniques can help select the important features, the ones that add the most information to our model. Keep in mind that both the inputs and the output are categorical; for this kind of variable there are two common strategies: Chi-Squared feature selection and Mutual Information feature selection. For this project the latter will be used, since it tends to outperform Chi-Squared feature selection.
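To make the idea concrete, here is a small sketch of mutual information feature selection with scikit-learn's `mutual_info_classif`. The toy data is invented so that COLLISIONTYPE fully determines the severity while WEATHER is unrelated to it; the real dataset is of course much noisier, and the exact setup in the notebook may differ.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import OrdinalEncoder

# Invented toy data: COLLISIONTYPE determines the label, WEATHER does not.
df = pd.DataFrame({
    "COLLISIONTYPE": ["Pedestrian", "Parked Car", "Pedestrian", "Parked Car"] * 10,
    "WEATHER": ["Clear", "Clear", "Raining", "Raining"] * 10,
    "SEVERITYCODE": [2, 1, 2, 1] * 10,
})

features = ["COLLISIONTYPE", "WEATHER"]

# Mutual information needs numeric inputs, so map each category to an integer code.
X = OrdinalEncoder().fit_transform(df[features]).astype(int)
y = df["SEVERITYCODE"]

# discrete_features=True tells scikit-learn to treat the codes as categories.
scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
ranking = pd.Series(scores, index=features).sort_values(ascending=False)
print(ranking)
```

The scores are in nats: a feature that says nothing about the target scores close to 0, and the higher the score, the more the feature tells us about the severity.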

The most important variables for determining collision severity are ADDRTYPE, COLLISIONTYPE and JUNCTIONTYPE. There is a clear winner: COLLISIONTYPE. The first and third features are related, so it is reasonable that the two add almost the same amount of information. We will cover this aspect in more depth in the next section.

Exploratory Analysis

Let’s explore the data to see what insights we can gather from it. It is also important to keep in mind that some of the chosen variables cannot be used to build a predictive model, since they are based on information collected after the accident has taken place.

Type of Collision

This feature takes different values depending on the area of impact and the parties involved: angles, parked car, rear end, right turn, sideswipe, head on, left turn, pedestrian and cycles. All these categories, their frequency and the collision severity can be found in the figure below. As can be observed, the split between the two severities is very unbalanced and varies strongly from one category to another, which is why this variable carries so much information.

Collision place

Looking at the following histogram, we can observe that severe accidents are more frequent at intersections than in the middle of blocks.

There are some less frequent categories that can be grouped into two meta-categories: intersection and mid-block incidents. Since these categories are relatively balanced, the overall classification does not change. Indeed, the categorical variable ADDRTYPE divides collisions into these two categories, as mentioned before. Not surprisingly, the two variables carry almost the same amount of information.

Weather, Light and Road conditions

Severe collisions occur more often during daylight, whereas at night with street lights on, accidents tend to be less severe. This may be because drivers are more cautious and alert at night. Dusk and dawn tend to be associated with more severe collisions, perhaps because visibility is reduced when drivers face the sun directly.

Regarding weather data, severe accidents are slightly more frequent in rainy weather and on wet roads.

However, the data is fairly balanced between both severity types across these conditions, so these variables do not add much information to our model.
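One way to check statements like this is to tabulate, for each category of a feature, the share of collisions that fall into each severity. The sketch below uses invented toy data and a hypothetical helper, `severity_share`: in the toy table the severity split changes a lot with COLLISIONTYPE and not at all with WEATHER.

```python
import pandas as pd

# Invented toy data, only meant to contrast an informative and an uninformative feature.
df = pd.DataFrame({
    "COLLISIONTYPE": ["Pedestrian"] * 10 + ["Parked Car"] * 10,
    "WEATHER": (["Clear"] * 5 + ["Raining"] * 5) * 2,
    "SEVERITYCODE": [2] * 9 + [1] + [1] * 9 + [2],
})


def severity_share(frame, feature, target="SEVERITYCODE"):
    """Return, for each category of `feature`, the fraction of rows per severity."""
    return pd.crosstab(frame[feature], frame[target], normalize="index")


by_collision = severity_share(df, "COLLISIONTYPE")
by_weather = severity_share(df, "WEATHER")
print(by_collision)
print(by_weather)
```

When every row of such a table looks the same, as with WEATHER here, knowing the category does not help to guess the severity.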

Clustering collisions in different geographic areas

One relevant question to explore is: where do most of the accidents occur? Are there hot zones where collisions are more likely? The following pictures show a map with clusters of incidents.

It is pretty clear that most of the accidents occur in Pioneer Square, Yesler Terrace, Downtown, Belltown and Pike/Pine. A closer view lets us build smaller clusters, which show further detail about these areas.

The last picture shows that Downtown and the north-eastern neighborhoods are the ones with the most events. These neighborhoods deserve more attention and further evaluation from the local government and the transportation division, in order to improve infrastructure and reduce the number of collisions. Clearly this is the accident hot spot of the metropolis, where drivers have to pay extra attention to their maneuvers.

Predictive Modeling

Given the type of data and the predictions we would like to make, we are going to use the following classification algorithms: K-Nearest Neighbors (KNN), Decision Tree (DT), Support Vector Machine (SVM), Logistic Regression (LR) and Random Forest (RF). The model with the best results will then be fine-tuned and compared against its default settings.
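A comparison of these five classifiers could look roughly like the sketch below. It runs on synthetic data from `make_classification` rather than the collision data, uses scikit-learn's default hyperparameters, and scores with F1; all of these are assumptions made for illustration, not the exact setup of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for the collision data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=0),
}

# Fit every model on the same split and keep its F1 score on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
print(scores)
print("best:", best)
```

Using the same train/test split for every model keeps the comparison fair; the best one can then be tuned separately.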

It is important to select variables that perform well in the classification process and, at the same time, are known before a collision happens, so that they have predictive use. For this reason, and given what was discussed in the feature selection section, the predictors are WEATHER, ROADCOND and LIGHTCOND.

Another important point is how to encode the categories in each predictor, since each variable has several categories. The best approach is to one-hot encode everything, which avoids the artificial ordering that label encoding introduces. With label encoding, the model may try to predict values that fall between two categories. Such values are not representative, because they do not actually exist. Finally, to avoid biases, it is important to normalize the observation matrix before training.
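The difference between the two encodings, and the normalization step, can be illustrated with a few invented rows. The category values below are only examples, and `StandardScaler` is one common choice of normalization, assumed here since the text does not name a specific one.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A few invented rows with two of the predictors.
df = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", "Snowing", "Clear"],
    "LIGHTCOND": ["Daylight", "Dark - Street Lights On", "Daylight", "Dusk"],
})

# Label encoding: each category becomes an integer, which implies an order
# (Clear < Raining < Snowing) that does not really exist.
label_codes = df["WEATHER"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category, with no implied order.
encoded = pd.get_dummies(df, columns=["WEATHER", "LIGHTCOND"], dtype=float)

# Normalize every column to zero mean and unit variance before training.
scaled = StandardScaler().fit_transform(encoded)

print(list(label_codes))
print(encoded.columns.tolist())
```

With label codes, a value such as 0.5 would sit "between" Clear and Raining, which is meaningless; with one-hot columns, every row simply has a 1 in exactly one column per original variable.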

Results

To keep things short, I will only show a table with the performance results for each classifier, plus an optimization of the RF algorithm, which looked the most promising.

Above you can find the confusion matrix for the best model approach.
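For readers who want to reproduce this kind of result, the sketch below tunes a Random Forest with a small grid search and then computes the confusion matrix on held-out data. The data is synthetic and the parameter grid is an arbitrary illustration, so the numbers it prints say nothing about the collision dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification problem standing in for the collision data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Small, arbitrary grid: every combination is evaluated with 3-fold cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="f1"
)
search.fit(X_train, y_train)

# Rows are the true classes, columns the predicted ones.
cm = confusion_matrix(y_test, search.predict(X_test))
print(search.best_params_)
print(cm)
```

The diagonal of the matrix counts the correct predictions for each class, and the off-diagonal cells show which kind of mistake the model makes more often.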

You can find more in-depth information in the report contained in my GitHub repository, together with the Jupyter notebook.

Discussion

Collisions that do not involve personal injuries are about twice as frequent as those that do. Much of the useful information for classification is embedded in the data collected after the collision. From this information, we learned that accidents involving cyclists or pedestrians tend to be severe and involve injuries. It is important to protect them, since they are highly vulnerable to traffic incidents. Some efforts are under way to mitigate those risks by providing interurban trails for cycling or walking. Downtown, which has the highest collision record, has implemented many protected bike lanes and multi-use lanes shared with pedestrians, which extend to nearby neighborhoods such as Queen Anne and Capitol Hill, among others.

The riskier car collisions are rear-end collisions. This finding can help automakers improve vehicle design to mitigate the effects of this kind of collision. At the same time, this type of collision is quite frequent, which puts it at the center of the debate. Some studies affirm that many of these accidents are caused by distracted driving, fatigue, aggressive driving (speeding) and drunk driving. Efforts are under way to mitigate these incidents with crash avoidance systems, which take control of the brakes if there is a risk of colliding with the car in front. It is highly recommended that car buyers choose this safety feature.

Safety at unsignalized intersections is a major concern. Intersection collisions are one of the most common types of crash, and in the United States, they account for nearly 2 million accidents and 6,700 fatalities every year. However, a fully signalized intersection can sometimes be hard to justify in rural areas, due to the cost of installation, maintenance, and added delays to traffic on the major through streets. The Intersection Collision Warning System (ICWS) project studied the effectiveness of an innovative and potentially less expensive approach to improving safety in these situations. This approach consists of two types of traffic-actuated warning signs linked to pavement loops and a traffic signal controller. Concerning Seattle in particular, a list of the most dangerous intersections, built from the same database, can be found on this page.

From the geographic clustering of collisions, Downtown and the north-eastern neighborhoods are the ones with the most events. These neighborhoods deserve more attention and further evaluation from the local government, and particularly the transportation division, in order to improve infrastructure and reduce the number of collisions. Clearly this is the accident hot spot of the metropolis, where drivers have to pay extra attention to their maneuvers.

Conclusion

Much of the data analyzed revealed important information about car accidents. For the riskier ones, which involve personal injuries, the focus should be on a few important factors: intersections, rear-end collisions, pedestrians and cyclists. Left turns are also risky maneuvers, which road users should avoid if they want to be safe.

Extremely dangerous weather and road conditions, such as snow and ice, do not account for a significant share of accidents. However, caution has to be taken in rainy weather and on wet roads, since these are the next most frequent conditions after clear days and dry roads.

Finally, the machine learning algorithms trained on predictors such as weather, road and light conditions give mediocre results. Other factors have to be considered to improve the prediction rate of the models.
