Reaper’s Roads

Who’s likely to die in a car crash?

Published in The Startup · 10 min read · Sep 30, 2019

Motor vehicle accidents occur every day. On average, 6 million car accidents occur in the U.S. each year, and more than 90 people die in car accidents every day [1]. We know these fatal accidents are happening, but I think it’s important to ask who the people dying in them are. What are their demographics? Where are they dying? And what can their deaths tell us about the likelihood that we’ll find ourselves in a fatal motor vehicle accident? These are the questions I set out to answer, and here is how I went about it.

Because of my background in Information Science, I started my search for answers where all good queries should begin: raw data. After scouring the web for relevant datasets, I stumbled upon a CDC dataset [2] on motor vehicle occupant death rates from 2012 and 2014, and a FARS repository [3] of accident reports dating as far back as 1975 (for consistency, I only analyzed the reports from 2012 and 2014). Unfortunately, as is always the case with raw data, my datasets needed some cleaning. Using good ole’ Jupyter Notebook and Python 3, I proceeded to make my data usable.

You can access my full Jupyter notebook here. [4]

TL;DR? No problem, just jump down to the Conclusion section.

Cleaning

Before doing anything else, I loaded the Python libraries I needed into my notebook using import library_name as xx. The primary library I used was pandas (particularly for its dataframe objects).

# import libraries for datawork
import pandas as pd

pd.options.display.max_columns = 100

Dataset 1: Motor Vehicle Occupant Death Rate, by Age and Gender, 2012 & 2014, All States [5]

# read in main dataset as dataframe
mvoDF = pd.read_csv("MVO_DRate_12-14.csv")

The primary problems with my first dataset were: (1.) the Location column contained the state name, latitude, and longitude values all in a single string and (2.) several values were missing from my dataset, appearing as NaN.

1. Using a list, a dataframe, and string functions, I grabbed every value in the Location column of my dataset and split it into three columns titled State, Latitude, and Longitude, which I then added to my dataset.

# split 'Location' column from dataframe into 'State', 'Latitude', and 'Longitude' columns
locList = mvoDF['Location']
locDF = pd.DataFrame(columns=['State','Latitude','Longitude'])
count = 0

for location in locList:
    location = location.replace('\n','').replace(', ',',').replace('(',',').replace(')','')
    location = location.split(',')
    state = location[0]
    latitude = location[1]
    longitude = location[2]
    locDF.loc[count] = [state,latitude,longitude]
    count += 1

# add 'State', 'Latitude', and 'Longitude' columns to dataframe
mvoDF.drop(columns=['Location'],inplace=True)
mvoDF.insert(0, 'Longitude',locDF['Longitude'],True)
mvoDF.insert(0, 'Latitude',locDF['Latitude'],True)
mvoDF.insert(0, 'State',locDF['State'],True)
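
The same split can also be done without an explicit loop. The sketch below is one possible alternative, not the code from my notebook: it assumes each Location value is a state name followed by a parenthesized "latitude, longitude" pair (the format the replace() chain above implies). The helper name split_location and the sample coordinates are made up for illustration.

```python
import pandas as pd

# Assumed layout of a Location value: "<state name>\n(<latitude>, <longitude>)"
LOCATION_PATTERN = (
    r'^(?P<State>[^(\n]+?)\s*'                      # state name
    r'\(\s*(?P<Latitude>-?\d+(?:\.\d+)?)\s*,'       # latitude
    r'\s*(?P<Longitude>-?\d+(?:\.\d+)?)\s*\)\s*$'   # longitude
)

def split_location(df):
    """Replace 'Location' with leading State, Latitude, Longitude columns."""
    parts = df['Location'].str.extract(LOCATION_PATTERN)
    # store the coordinates as numbers rather than strings
    parts[['Latitude', 'Longitude']] = parts[['Latitude', 'Longitude']].astype(float)
    return pd.concat([parts, df.drop(columns=['Location'])], axis=1)

# toy rows with illustrative coordinates, not values from the CDC file
demo = pd.DataFrame({
    'Location': ['Alabama\n(32.84, -86.63)', 'New York\n(42.83, -75.54)'],
    'Rate': [1.0, 2.0],
})
split_demo = split_location(demo)
```

Unlike the loop, this version stores the coordinates as floats, and a row that does not match the pattern ends up as NaN instead of raising an error.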

2. Realizing that the NaN values in my dataset were the result of suppressed reports (according to the CDC, “Fatality rates based on fewer than 20 deaths are suppressed.”), I concluded that the missing fatality rates in my dataset were negligible and could be replaced with zeros. To do this I used the pandas function df.fillna().

# replace every NaN value with 0
mvoDF.fillna(0,inplace=True)

mvoDF.head()
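
Once the NaN values are filled, a suppressed rate and a true zero look the same. As a hedged sketch (the helper fill_suppressed and the toy column names are hypothetical, not part of my notebook), one way to keep that information is to record where the NaN values were before filling them:

```python
import numpy as np
import pandas as pd

def fill_suppressed(df, fill_value=0):
    """Return a filled copy plus a boolean mask of the cells that were NaN."""
    suppressed = df.isna()
    return df.fillna(fill_value), suppressed

# toy data, not real CDC values
toy = pd.DataFrame({
    'State': ['A', 'B'],
    'Rate_2012': [3.1, np.nan],
    'Rate_2014': [np.nan, np.nan],
})
filled, suppressed = fill_suppressed(toy)
n_suppressed = int(suppressed.sum().sum())  # number of cells that were filled
```

The mask can later be used to exclude the filled cells from averages or to mark them in a chart.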

Dataset 2: Fatality Analysis Reporting System Data [6]

Note that this is marked as dataset 3 if you are following along in my Jupyter notebook.

The primary problems with my second dataset(s) were: (1.) there were a ton of irrelevant columns that I had no use for and (2.) locations like states and counties were recorded as FIPS numbers rather than by name.

1. To solve the first problem, I created a list of all the column names in my dataset, then a list of the columns I wanted to keep, and used list.remove() to take the columns I wanted to keep out of the list of all column names. I then used the pandas function df.drop(columns=list) to drop the leftover columns.

# drop unnecessary columns in dataframe
dropList = list(farsDF.columns)
keepList = ['STATE','ST_CASE','PER_NO','COUNTY','DAY','MONTH','HOUR','MINUTE','AGE','SEX']
for i in keepList:
    dropList.remove(i)
farsDF.drop(columns=dropList,inplace=True)
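
An alternative that avoids building a drop list is to tell pandas which columns to read in the first place. This is only a sketch: read_person_file is a hypothetical helper, and the one-row CSV is toy data standing in for a FARS PERSON file.

```python
import io
import pandas as pd

KEEP = ['STATE', 'ST_CASE', 'PER_NO', 'COUNTY', 'DAY', 'MONTH',
        'HOUR', 'MINUTE', 'AGE', 'SEX']

def read_person_file(source):
    """Read only the columns in KEEP, returned in KEEP order."""
    # usecols limits what is parsed; indexing with KEEP fixes the column order
    return pd.read_csv(source, usecols=KEEP)[KEEP]

# toy CSV with one extra column (VEH_NO) that should be ignored
toy_csv = io.StringIO(
    "STATE,ST_CASE,VEH_NO,PER_NO,COUNTY,DAY,MONTH,HOUR,MINUTE,AGE,SEX\n"
    "1,10001,1,1,81,5,1,14,30,24,1\n"
)
person_demo = read_person_file(toy_csv)
```

Reading fewer columns also reduces memory use on large files, and a missing column name fails at read time rather than later.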

2. Since I was using two sets of reports (one for 2012 and the other for 2014), I created a function to clean both, which also converts the FIPS numbers to names.

# create a dictionary of month names and corresponding numbers
MonthDict = {1:'January',2:'February',3:'March',4:'April',5:'May',6:'June',7:'July',8:'August',9:'September',10:'October',
             11:'November',12:'December'}

# create a dictionary of sex and its corresponding numbers
sexDict = {1: 'Male', 2: 'Female'}

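My function for cleaning FARS data (particularly the files labeled PERSON). It relies on two lookup dictionaries not shown here, stateDic and countyDict, which map FIPS numbers to state and county names: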

# create a function that handles cleaning for FARS datasets
def cleanFARS(filename,year):
    # read in dataset as dataframe
    farsDF = pd.read_csv(filename)
    # drop unnecessary columns in dataframe
    dropList = list(farsDF.columns)
    keepList = ['STATE','ST_CASE','PER_NO','COUNTY','DAY','MONTH','HOUR','MINUTE','AGE','SEX']
    for i in keepList:
        dropList.remove(i)
    farsDF.drop(columns=dropList,inplace=True)
    
    # convert state codes to state names
    farsDF['STATE'] = farsDF['STATE'].map(stateDic)
    # https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict
    
    # drop rows in dataframe if 'STATE' column contains NaN
    farsDF.dropna(axis=0,subset=['STATE'],inplace=True)
    farsDF.reset_index(drop=True,inplace=True)
    
    # convert county codes to county names
    farsDF['COUNTY'] = farsDF['COUNTY'].map(countyDict)
    # https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict
    
    # drop rows in dataframe if 'COUNTY' column contains NaN
    farsDF.dropna(axis=0,subset=['COUNTY'],inplace=True)
    farsDF.reset_index(drop=True,inplace=True)
    
    # replace month number with month name
    farsDF['MONTH'] = farsDF['MONTH'].map(MonthDict)
    
    # combine 'DAY' and 'MONTH' into 'DATE' column
    dateList = []
    
    for i, row in farsDF.iterrows():
        date = str(row['MONTH'])+' '+str(row['DAY'])+', '+str(year)
        dateList.append(date)
    
    farsDF.drop(columns=['DAY','MONTH'],inplace=True)
    farsDF.insert(4, 'DATE',dateList,True)
    
    # combine 'HOUR' and 'MINUTE' into 'TIME' column
    timeList = []
    
    for i, row in farsDF.iterrows():
        time = str(row['HOUR'])+':'+str(row['MINUTE'])
        timeList.append(time)
    
    farsDF.drop(columns=['HOUR','MINUTE'],inplace=True)
    farsDF.insert(5, 'TIME',timeList,True)
    
    # drop rows whose sex code is 8 or 9 (neither male nor female)
    farsDF = farsDF[~farsDF['SEX'].isin([8, 9])].copy()
    farsDF.reset_index(drop=True,inplace=True)
    
    # replace sex number with sex name
    farsDF['SEX'] = farsDF['SEX'].map(sexDict)
    
    return farsDF
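
The two iterrows() loops can also be written as column operations. The sketch below is a possible variation rather than what my notebook does. add_date_time is a hypothetical helper that expects MONTH to still be numeric (before the MonthDict mapping), stores DATE as a real datetime, zero-pads TIME, and assumes that hours outside 0–23 or minutes outside 0–59 are placeholder codes for unknown times.

```python
import pandas as pd

def add_date_time(df, year):
    """Build DATE and TIME columns from numeric DAY/MONTH/HOUR/MINUTE."""
    out = df.copy()

    # assemble a datetime from year/month/day; unparseable dates become NaT
    parts = pd.DataFrame({'year': year, 'month': out['MONTH'], 'day': out['DAY']})
    out['DATE'] = pd.to_datetime(parts, errors='coerce')

    # zero-padded "HH:MM"; assumed unknown-time codes are left as missing
    valid = out['HOUR'].between(0, 23) & out['MINUTE'].between(0, 59)
    hhmm = (out['HOUR'].astype(str).str.zfill(2) + ':' +
            out['MINUTE'].astype(str).str.zfill(2))
    out['TIME'] = hhmm.where(valid)

    return out.drop(columns=['DAY', 'MONTH', 'HOUR', 'MINUTE'])

# toy rows; the second one uses 99 as an assumed "unknown" time code
toy = pd.DataFrame({'MONTH': [1, 8], 'DAY': [10, 16],
                    'HOUR': [7, 99], 'MINUTE': [5, 99]})
dated = add_date_time(toy, 2012)
```

A datetime DATE column sorts chronologically and exposes the month or weekday directly, which a string like "January 10, 2012" does not.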

Afterwards, I ran both of my reports through my function and combined them using concatenation.

# read in 2012 dataset as dataframe
v12DF = cleanFARS('12-PERSON.csv',2012)

# read in 2014 dataset as dataframe
v14DF = cleanFARS('14-PERSON.csv',2014)

# combine my 2012 and 2014 dataframes
vDF = pd.concat([v12DF,v14DF])
vDF.reset_index(drop=True,inplace=True)
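
As a small variation on the concatenation step, the sketch below tags each report with its year and lets pd.concat renumber the rows in one call. combine_years and the YEAR column are hypothetical additions, and the frames are toy data.

```python
import pandas as pd

def combine_years(frames_by_year):
    """Stack per-year dataframes, adding a YEAR column to each."""
    tagged = [df.assign(YEAR=year) for year, df in frames_by_year.items()]
    # ignore_index=True renumbers rows, replacing a separate reset_index call
    return pd.concat(tagged, ignore_index=True)

# toy frames standing in for the cleaned 2012 and 2014 reports
toy12 = pd.DataFrame({'STATE': ['Texas', 'Vermont']})
toy14 = pd.DataFrame({'STATE': ['Texas']})
combined = combine_years({2012: toy12, 2014: toy14})
```

A separate YEAR column makes per-year grouping possible without parsing the DATE string.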

Finally, I exported my cleaned datasets using pandas’ df.to_excel() function for analysis and visualization in Tableau.

# dataframe 1
mvoDF.to_excel('mvoDF_12-14.xlsx',sheet_name='MVO_DR')

# dataframe 2
vDF.to_excel('vDF.xlsx',sheet_name='V_PbS')

Analysis

What States Have the Most and Least Motor Vehicle Fatalities?

Texas has the highest total number of motor vehicle fatalities with 7,916 in 2012 and 8,219 in 2014; followed by California with 6,828 in 2012 and 7,389 in 2014.

Rhode Island has the lowest total number of motor vehicle fatalities with 126 in 2012 and 95 in 2014; followed by Alaska (144 in 2012 and 165 in 2014) and Vermont (160 in 2012 and 110 in 2014).

This seems to suggest that you’re less likely to get into a fatal car crash somewhere like Rhode Island, Alaska, or Vermont than somewhere like Texas or California. The data doesn’t give any hints as to why, but my best guess is that this is due to population. Texas and California both have very large populations, while Rhode Island, Alaska, and Vermont all have small ones. A lower population means fewer people on the road, which means fewer accidents overall.
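
One way to check the population guess would be to compare fatalities per 100,000 residents instead of raw totals. The sketch below only shows the arithmetic: deaths_per_100k is a hypothetical helper, the state names and numbers are toy values, and real population figures would have to come from a separate source that my datasets don’t include.

```python
import pandas as pd

def deaths_per_100k(counts, population):
    """Fatalities per 100,000 residents, highest rate first."""
    # both Series are aligned on their index (the state name)
    rate = counts * 100_000 / population
    return rate.sort_values(ascending=False)

# toy numbers, chosen so the larger state has the lower rate
toy_counts = pd.Series({'State A': 800, 'State B': 100})
toy_population = pd.Series({'State A': 8_000_000, 'State B': 500_000})
rates = deaths_per_100k(toy_counts, toy_population)
```

In this toy case State A has eight times as many fatalities as State B but half the rate, which is the kind of reversal that raw totals can hide.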

In Which Months Do the Most and Least Motor Vehicle Fatalities Occur?

August has the highest number of motor vehicle fatalities in both 2012 and 2014, while February has the fewest in both years.

This seems to suggest that February is the safest month (in terms of the number of fatal accidents) while August is the deadliest month. (February is also the shortest month, which accounts for part of the gap.) It also seems that the number of motor vehicle fatalities increases as we move towards the summer months and decreases as we move towards the winter months.

This pattern seems to hold up when we look at the individual state level. Take the state of Colorado for instance. The number of motor vehicle fatalities increases as we get closer to August (summer) and decreases as we get closer to February (winter).
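
Since months differ in length, one way to compare them more fairly is fatalities per day. This is a minimal sketch, assuming the DATE strings have the "Month D, YYYY" form produced by cleanFARS above; monthly_rates is a hypothetical helper and the three rows are toy data.

```python
import pandas as pd

def monthly_rates(df):
    """Count rows per (year, month) and divide by the days in that month."""
    dates = pd.to_datetime(df['DATE'], format='%B %d, %Y', errors='coerce').dropna()
    table = (
        pd.DataFrame({'year': dates.dt.year,
                      'month': dates.dt.month,
                      'days': dates.dt.days_in_month})
        .groupby(['year', 'month'])
        .agg(count=('days', 'count'), days=('days', 'first'))
    )
    table['per_day'] = table['count'] / table['days']
    return table

# toy rows, not real FARS records
toy = pd.DataFrame({'DATE': ['February 1, 2012', 'February 2, 2012', 'August 1, 2012']})
monthly = monthly_rates(toy)
```

The result is indexed by (year, month), so a single state can be examined by filtering the dataframe on STATE before calling the function.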

In fact, if we look at each day of the year, the days with the fewest motor vehicle fatalities fell in winter months: January 10th in 2012 (115 fatalities) and February 18th in 2014 (91 fatalities). The day with the most fatalities in 2012, June 16th (370 fatalities), fell in a summer month. 2014 seems to be an outlier, though: its deadliest day was October 25th (356 fatalities), in autumn.

At What Hours Do the Most and Least Motor Vehicle Fatalities Occur?

The total number of motor vehicle fatalities is at its lowest at 4 AM, with ~1,500 fatalities, and at its highest at 6 PM, with ~4,600 fatalities. Fatalities begin to increase after 4 AM with a slight dip between 8 AM and 9 AM, followed by a steady rise until 6 PM and then a gradual decline until 4 AM, with a slight jump around 2 PM.

This seems to suggest that the safest hour of the day (for avoiding fatal motor vehicle accidents) is 4 AM while the deadliest hour of the day is 6 PM.

One can theorize that this is because most people are asleep around 4 AM while others are beginning to make their way home from work around 6 PM.
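
For readers who would rather stay in pandas than switch to Tableau, hourly totals like these could be tallied roughly as follows. It is a sketch under assumptions: fatalities_by_hour is a hypothetical helper, TIME is taken to be the "H:M" string built by cleanFARS, and hour values outside 0–23 are treated as codes for an unknown time and dropped.

```python
import pandas as pd

def fatalities_by_hour(df):
    """Count rows per hour of day (0-23), ignoring unknown hours."""
    hour = pd.to_numeric(df['TIME'].str.split(':').str[0], errors='coerce')
    hour = hour[hour.between(0, 23)].astype(int)
    return hour.value_counts().sort_index()

# toy rows; '99:99' stands in for an assumed unknown time
toy = pd.DataFrame({'TIME': ['18:30', '18:5', '4:0', '99:99']})
by_hour = fatalities_by_hour(toy)
peak_hour = by_hour.idxmax()
```

idxmax() and idxmin() on the returned Series give the hours with the most and fewest rows.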

What is the Typical Age For Someone Who Dies in a Fatal Motor Vehicle Accident?

Victims of fatal motor vehicle accidents are most heavily concentrated in their early 20s. This could be due to a number of factors the data doesn’t capture, like drunk driving, speeding, inexperience behind the wheel, texting while driving, etc.

The implication is that if you are at the extremes of the spectrum (very young or very old) you are less likely to be involved in a fatal motor vehicle accident.

This seems to hold when looking at individual states, with the highest motor vehicle occupant death rates among ages 21–34 and the lowest among ages 0–20.
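
To line the FARS records up with age groups like these, the AGE column can be binned. The sketch below is illustrative only: age_group_counts is a hypothetical helper, the 35-54 and 55+ groups are my assumption about the remaining bins, and ages above 120 are assumed to be codes for an unknown age.

```python
import pandas as pd

AGE_LABELS = ['0-20', '21-34', '35-54', '55+']

def age_group_counts(df):
    """Count rows per age group, ignoring implausible ages."""
    ages = df['AGE'].where(df['AGE'].between(0, 120))
    # right=False makes the bins [0, 21), [21, 35), [35, 55), [55, 121)
    groups = pd.cut(ages, bins=[0, 21, 35, 55, 121], labels=AGE_LABELS, right=False)
    return groups.value_counts().reindex(AGE_LABELS)

# toy ages; 999 stands in for an assumed unknown-age code
toy = pd.DataFrame({'AGE': [5, 20, 21, 34, 35, 70, 999]})
age_counts = age_group_counts(toy)
```

Counts per group are not rates. Comparing them with the CDC death rates would also require the population of each age group.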

What is the Typical Gender For Someone Who Dies in a Fatal Motor Vehicle Accident?

Across various states, there seems to be a consistently higher number of male victims of fatal motor vehicle accidents than female victims.

The only state in which the death rate of female victims of fatal motor vehicle accidents is higher than that of male victims is Oregon.

This is the case across ages…

…and time.

The data seems to suggest that women are less likely to be the victims of fatal motor vehicle accidents than men. What the implications of this are, I’m unsure. It could be that there are more men on the road, or that women are better drivers, or something else entirely. Unfortunately, my dataset is unable to give us a clear answer on that.

Conclusion

So, who’s the likeliest candidate to earn a seat in the reaper’s Cadillac? Well, according to my analysis, the likeliest victim of a fatal motor vehicle accident is John Doe: a male in his early 20s, in Texas during the summertime (probably August), around 6 PM. Conversely, the least likely victim of a fatal motor vehicle accident is Jane Doe: a female child (or at least someone aged 20 or younger), in Rhode Island during the wintertime (probably February), around 4 AM. Now that we’ve found the likeliest (and least likely) victim of a fatal motor vehicle accident, we need to ask ourselves why. What are the factors that make John Doe more likely to die in a motor vehicle accident than Jane Doe?

These questions unfortunately lie outside the scope of my datasets, but I think this lays a good foundation for further analysis. I believe that these are important questions that we should look into. After all, the reaper is always waiting and watching to take us away in his ghastly hearse.

Afterword

Before closing out this article, I would like to point out a flaw in my analysis. Because of my CDC dataset, I was limited to the years 2012 and 2014. While these are good starting points, my conclusions cannot be extended beyond them, as we have no conclusive evidence that the patterns I observed hold before or after these two years. That being the case, my hope is to eventually use the entirety of the FARS repository and observe these trends from 1975 to 2019.

Fine