Road accidents are a leading cause of death in the United States, killing more than 38,000 people each year. According to the National Highway Traffic Safety Administration, these accidents also cost the country $871 billion annually. Given this substantial impact, it is worthwhile to better understand the causes of road accidents so that they can be prevented.
The data set that we will be using in this analysis was imported from Kaggle. It is a countrywide car accident data set that includes data collected from February 2016 to December 2019 using various data providers including MapQuest and Bing. The APIs used to collect data for this data set broadcast traffic events captured by numerous entities including the United States Department of Transportation, individual states’ departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors.
We note that there are 49 features in the data set, which are described in detail here. Some features of interest include TMC, a Traffic Message Channel code; Severity, a number from 1 to 4, where 1 indicates the least impact on traffic (a short delay, less damage, and fewer fatalities) and 4 indicates the most impact (a long delay, more damage, and more fatalities); Description, a natural language description of the accident; and Weather_Condition, natural language keywords describing the weather conditions at the time of the accident.
We will drop the Country and Turning_Loop features since each contains only a single value. Since we already have columns for City, Start_Time, Start_Lat, and Start_Lng, the columns Zipcode, End_Time, End_Lat, End_Lng, and Timezone add little new information for our analysis, so we drop them as well. We drop Wind_Direction because, in general, we want to focus on numerical weather data. Since we already know when each accident occurred, the features Sunrise_Sunset, Civil_Twilight, Nautical_Twilight, and Astronomical_Twilight can also be removed; they only add more specific information about the time of the accident. The feature Airport_Code, which identifies the airport-based weather station closest to the accident, is unlikely to help determine what causes more severe accidents, so we remove it too. Finally, we remove Weather_Timestamp, which records when the weather observation was made; for our analysis of what causes severe accidents, the actual observations in Weather_Condition and the other weather columns are what matter.
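In pandas, dropping this set of columns is a one-liner. The sketch below uses a tiny hypothetical slice of the data (only a few of the real columns) to keep it self-contained:

```python
import pandas as pd

# Hypothetical slice of the accidents data; column names match the real set
df = pd.DataFrame({
    "Severity": [2, 3],
    "Country": ["US", "US"],
    "Turning_Loop": [False, False],
    "Zipcode": ["10001", "94103"],
    "Wind_Direction": ["NW", "S"],
})

cols_to_drop = [
    "Country", "Turning_Loop", "Zipcode", "End_Time", "End_Lat", "End_Lng",
    "Timezone", "Wind_Direction", "Sunrise_Sunset", "Civil_Twilight",
    "Nautical_Twilight", "Astronomical_Twilight", "Airport_Code",
    "Weather_Timestamp",
]
# errors="ignore" lets the same list work even if a column is absent
df = df.drop(columns=cols_to_drop, errors="ignore")
```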
Exploratory Data Analysis
We now summarize the results of exploratory data analysis (EDA) on the data set. The first group of visualizations we show here are visualizations that depict how the number of accidents varies by the month, the time of day, the day of the week, and the year.
Next, we look further into the times when accidents occurred by looking to see when the most severe accidents occurred. For example, we look to answer questions like “At which hour in the day do the most severity 4 accidents occur?” and “On which day do the most severe accidents occur?”
We first plot the number of accidents with a fixed severity (from 1 to 4) by the day of the week where 0 means Monday and 6 means Sunday.
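The counts behind such a plot can be computed directly from the Start_Time column. A minimal sketch on hypothetical timestamps (the real data has millions of rows):

```python
import pandas as pd

# Hypothetical accident timestamps standing in for the Start_Time column
acc = pd.DataFrame({
    "Start_Time": pd.to_datetime([
        "2019-03-04 08:15", "2019-03-04 17:40", "2019-03-09 13:05",
    ]),
    "Severity": [4, 2, 4],
})
acc["DayOfWeek"] = acc["Start_Time"].dt.dayofweek  # 0 = Monday, 6 = Sunday

# Count severity-4 accidents per weekday; .plot(kind="bar") would chart this
sev4_by_day = acc.loc[acc["Severity"] == 4, "DayOfWeek"].value_counts().sort_index()
```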
In general, fewer severe accidents occur on weekends, and more severe accidents occur earlier in the week than later. The fact that more cars are on the road on weekdays than on weekends could be a key contributing factor to this disparity.
As the months progress, the number of accidents tends to increase. Interestingly, the winter months of January and February tend to have fewer accidents than the fall months of September, October, and November. This could be because not all areas of the US see harsh winters with significant snow and ice that could affect road conditions.
Finally, we plot the number of accidents with a fixed severity (from 1 to 4) by the hour in the day with 0 meaning 12 am to 1 am and 23 meaning 11 pm to 12 am.
The most severity 1, 2, and 3 accidents occur during morning commutes (7–9 am), but many severity 4 accidents occur in the late afternoon (3–5 pm).
We now look at how the severity of accidents is affected by location.
These plots show that the majority of accidents in the United States from 2016–2019 were moderate to severe accidents (levels 2 to 4). This observation aligns with the reported figure of over 38,000 road accident fatalities per year in the United States. Additionally, across all levels of severity, more accidents tend to occur along the coasts and near Chicago, which is expected, as these are among the most populated regions of the United States.
We now examine in which states and cities the most accidents occur.
California leads all US states by far and Houston leads all US cities. Surprisingly, New York City and Chicago are not even in the top 10, but this could be because the data set does not include enough data points of road accidents from these cities. An alternative explanation could be that due to the public transportation options available in these large cities, there are far fewer road accidents.
We now take a look at the number of accidents by weather condition. We plot the top 12 weather conditions mentioned in the data set versus the number of accidents in each category.
Most accidents seem to happen in clear weather. Let’s take a closer look at how specific weather features impact accident severity.
Across all levels of severity, most accidents happen under clear, cloudy, fair, or overcast conditions. These are likely simply the most frequent conditions overall in the United States, since the country's climate varies drastically from city to city and state to state. Light rain and light snow appear to be the top harsh weather conditions in terms of the number of accidents caused. Now, we take a look at weather conditions that we might anticipate to have an impact on severity, including those that impact visibility, like fog and haze, as well as conditions describing precipitation.
The proportion of severity 3 and 4 accidents is greater with precipitation than with fog and haze.
Finally, we look at the impact of road infrastructure (a crossing or a speed bump for example) and points of interest (a railway for instance) on accident severity.
More severe accidents tend to occur near junctions and at locations flagged as having no exit.
Building a Classifier Based on Severity
Exploratory data analysis has shown us that the time when an accident occurs, the location where it occurs, surrounding infrastructure / POIs, and the weather conditions all impact the severity of the accident. Motivated by this, we now aim to build a classifier to predict the severity of an accident. This classifier will be useful because, given a location in the US along with information about the surrounding infrastructure, the weather, and the time, it can predict whether an accident occurring there would be more severe or less severe. Such a classifier could also be extended to identify accident hotspots, predict car accidents, and study the impact of environmental stimuli on accident occurrence.
To build the classifier, we redefine severity as a binary label: severity 1 and 2 belong to a low severity class (0), and severity 3 and 4 belong to a high severity class (1). In our first approach, we consider all categories of features that could impact the severity of an accident, including weather data, the state where the accident occurred, POI and road information, and time information (hour in the day and day of the week). We drop all non-numeric columns as well as the TMC column, and use one-hot encoding for the weather condition and day of the week columns.
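The relabeling and one-hot encoding might look as follows; the rows here are hypothetical, but the column names mirror the ones described above:

```python
import pandas as pd

# Hypothetical rows mirroring the features kept for modeling
df = pd.DataFrame({
    "Severity": [1, 2, 3, 4],
    "Weather_Condition": ["Clear", "Rain", "Clear", "Snow"],
    "DayOfWeek": [0, 3, 5, 6],
})

# Collapse severity into a binary target: 1-2 -> low (0), 3-4 -> high (1)
df["Severity_Binary"] = (df["Severity"] >= 3).astype(int)

# One-hot encode the categorical columns
df = pd.get_dummies(df, columns=["Weather_Condition", "DayOfWeek"])
```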
We begin by building a Random Forest Classifier, an ensemble tree-based learning algorithm which generally performs well on data sets with many features. We use an 80/20 split of training to test data. After tuning the hyperparameters with a grid search, the results of the model are shown below.
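The training setup can be sketched with scikit-learn; the synthetic feature matrix below stands in for the real engineered features, and the tiny parameter grid is only illustrative of the grid search (the actual grid used is not specified in the text):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic stand-in for the engineered feature matrix and binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 80/20 split of training to test data, as in the analysis
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Grid search over a couple of hyperparameters (illustrative values only)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
)
grid.fit(X_train, y_train)
test_accuracy = grid.score(X_test, y_test)
```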
The confusion matrix shows that the model tends to choose the label 0 even when the actual label is 1.
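This kind of bias is easy to see in the confusion matrix and the resulting recall. A small worked example with hypothetical labels, where the "model" mostly predicts 0:

```python
from sklearn.metrics import confusion_matrix, f1_score, recall_score

# Hypothetical true and predicted labels from a model biased toward class 0
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted
recall = recall_score(y_true, y_pred)  # fraction of true 1s recovered
f1 = f1_score(y_true, y_pred)
```

Here two of the three true class-1 cases are missed, so recall is only 1/3 even though overall accuracy looks reasonable.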
Our next goal is to build a logistic regression model for our binary classification problem. There are 562,339 instances of the low severity class (0) and 220,113 instances of the high severity class (1) in the filtered training data set. Since the low severity class greatly outnumbers the high severity class, we downsample the majority class in order to balance the classes. The results of the logistic regression model are shown below.
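Downsampling the majority class before fitting can be done with pandas sampling; the tiny imbalanced frame below is hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced training frame: far more class-0 rows than class-1
train = pd.DataFrame({
    "x1": range(12),
    "label": [0] * 9 + [1] * 3,
})

# Downsample the majority class (0) to the size of the minority class (1)
minority = train[train["label"] == 1]
majority = train[train["label"] == 0].sample(n=len(minority), random_state=0)
balanced = pd.concat([majority, minority])

# Fit logistic regression on the balanced data
model = LogisticRegression()
model.fit(balanced[["x1"]], balanced["label"])
```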
Though the logistic regression model has a lower accuracy than the random forest model, it performs better in terms of recall and F1 score.
Finally, we build a Decision Tree Classifier for the problem. The results are summarized below.
The Decision Tree model performs the best in terms of recall.
We next attempted another approach to building an ML model which was to examine the importance of different feature categories to predict accident severity. The goal was to examine only one category of features (POIs or weather) and analyze their ability to predict the output class. The results were not great, as shown below, demonstrating that it is necessary to include multiple feature categories in the analysis.
Adding New Natural Language Features
Trying to predict accident severity solely from POI and infrastructure features, or solely from weather features, did not produce great results. Thus far, the best results came from a Random Forest Classifier on nearly the complete feature set. In building that classifier, we did not consider the TMC feature, since it is a numeric code for traffic information messages, or the Description feature, since it is non-numeric. Each TMC code maps to a natural language description of a traffic event, and the Description feature describes the scene of the accident in natural language, so we could add features based on keywords in the description and in the description associated with each TMC code. To do so, we must first obtain the descriptions for each TMC code, since the data set includes only the code numbers. The TMC codes in the data set, along with their respective descriptions extracted from this website via XPath, are shown below.
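The extraction itself amounts to selecting table cells by position. The sketch below runs on an inline HTML snippet rather than the live page (the snippet's rows and wording are hypothetical stand-ins), using the standard library's limited XPath support:

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the page listing TMC event codes; the real extraction
# would fetch the page and apply similar XPath expressions to its table
table = ET.fromstring("""
<table>
  <tr><td>200</td><td>multi vehicle accident</td></tr>
  <tr><td>206</td><td>spillage on the road</td></tr>
</table>""")

# First cell of each row is the code, second cell is its description
codes = [td.text for td in table.findall("./tr/td[1]")]
descriptions = [td.text for td in table.findall("./tr/td[2]")]
tmc_map = dict(zip(codes, descriptions))
```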
There are only 21 unique TMC codes in the data set. Through visual inspection of these TMC codes and their associated descriptions, we can add new features to the data set with the goal of adding more context about the cause of the accident. For example, TMC codes 200 and 203 indicate that a multi-vehicle accident occurred and TMC codes 206 and 336 indicate a spill occurred. We use these codes to create new features to indicate if a spill caused the accident or if a multi-vehicle accident occurred. We hypothesize that such factors could be indicative of more severe or less severe accidents.
Ultimately, the features added are Heavy_Traffic and Slow_Traffic to indicate traffic conditions at the time of the accident, Spill to indicate that a spill caused the accident, Multi-Vehicle to indicate that the accident involved multiple vehicles, and Roadwork to indicate that the accident occurred in a roadwork area.
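Turning TMC code groups into flag columns is a straightforward membership test; the code groupings below follow the examples discussed above, while the rows themselves are hypothetical:

```python
import pandas as pd

# Hypothetical rows; codes 200/203 signal multi-vehicle accidents and
# 206/336 signal spills, as discussed above (245 is an arbitrary other code)
df = pd.DataFrame({"TMC": [200, 203, 206, 245]})

df["Multi-Vehicle"] = df["TMC"].isin([200, 203]).astype(int)
df["Spill"] = df["TMC"].isin([206, 336]).astype(int)
```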
Next, we take a closer look at the Description feature. Using NLTK to filter out stopwords from the descriptions, tokenize them, and view the most common words, we see that blocked, exit, parkway / highway, and shoulder are all frequent keywords to base new features on. Thus, we now create new features for them by tokenizing the description column and searching it for these keywords. Thereafter, we train a new instance of our best performing model from before, the Random Forest Classifier, with our new natural language features. The results of this classifier are shown below.
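The keyword features can be built by tokenizing each description and checking for the terms of interest. In this self-contained sketch, a small manual stopword set stands in for NLTK's English stopword corpus, and the descriptions are hypothetical:

```python
import pandas as pd

# Manual stopword set standing in for NLTK's English stopword corpus
stopwords = {"the", "on", "at", "a", "is", "of"}
keywords = ["blocked", "exit", "shoulder", "highway"]

# Hypothetical accident descriptions
desc = pd.Series([
    "Right lane blocked on the highway",
    "Accident at exit 23",
])

def keyword_flags(text):
    # Tokenize, drop stopwords, and emit a 0/1 flag per keyword
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return {k: int(k in tokens) for k in keywords}

features = desc.apply(keyword_flags).apply(pd.Series)
```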
Adding the natural language features based on the descriptions of each TMC code, as well as on the accident descriptions themselves, allowed our model to improve to an 80.44% accuracy! There is still room for improvement with the precision, recall, and F1 score, but these metrics are also higher than they were for the original Random Forest Classifier.
This project presented a number of challenges. The data set was not analysis ready, so a good amount of data wrangling and cleaning was necessary before attempting to build classifiers. Additionally, choosing the best model presented its own unique challenges. It was necessary to conduct a substantial amount of background research on machine learning algorithms for classification in order to determine that a Random Forest model, a Logistic Regression model, and a Decision Tree Classifier would be among the best models to try out. Furthermore, deciding which features to visualize further, as well as in what ways to view them, required close examination of the data set. Lastly, interpreting the plethora of documentation for the many functions needed in this analysis was a difficult but important obstacle to overcome.
Conclusion and Future Work
All in all, through this project, we have gained deeper insights about road accidents in the United States based on data from 2016 to 2019. Through exploratory data analysis on the data set, many intriguing conclusions were drawn. For example, it is evident that more road accidents occur during morning commutes (7–9 am) and on weekdays compared to weekends. Additionally, road accidents in the United States tend to be moderate to severe in terms of the delays they cause. Moreover, from 2016–2019, California had the most road accidents of all US states, while Houston led all US cities. Finally, of all roadway infrastructure features, the highest proportion of high severity accidents occurs at junctions between multiple roads.
In terms of building a ML model, we took the approach of formalizing a binary classification problem where accident severity values of 1 and 2 were considered a low severity and accident severity values of 3 and 4 were considered a high severity. We experienced the most success with a Random Forest Model in which we considered all types of features including location data (start latitude and longitude of the accident), time data (hour in the day and day of the week for the accident), POI information, road infrastructure details, and weather data. We were able to further improve the performance of the model by adding features based on natural language processing of keywords in the accident descriptions in the data set as well as keywords in the descriptions of Traffic Message Channel codes.
In the future, there are a number of directions the project can take. For example, we can look into ways to mitigate the lack of precipitation and wind chill data in the overall data set by filling in the average precipitation and wind chill for a particular latitude and longitude at a particular time of year. Doing this would of course require an online database where it is easily possible to search up such information. Filling in the data set in this way would allow us to train a model on a greater number of data points, potentially improving the model’s performance. In addition, we could look into ways of stacking multiple model types to improve performance. Finally, we could explore different algorithms, such as Support-Vector Machines, that work well for classification problems with a large number of features.
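One simple way to realize the first idea, assuming we already had climate data keyed by location and time of year, is to fill missing values with the group mean for a binned latitude and month. The column and bin names below are hypothetical:

```python
import pandas as pd

# Hypothetical sketch: fill missing precipitation with the mean for the
# same (binned) latitude and month, approximating a climate lookup
df = pd.DataFrame({
    "lat_bin": [40, 40, 40, 34],
    "month": [1, 1, 1, 6],
    "Precipitation": [0.2, None, 0.4, 0.1],
})
df["Precipitation"] = df.groupby(["lat_bin", "month"])["Precipitation"] \
    .transform(lambda s: s.fillna(s.mean()))
```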
- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.
- Moosavi, Sobhan, et al. "Accident Risk Prediction Based on Heterogeneous Sparse Data." Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2019.