Be Safe

What’s the safest mode of transportation for commuters?

Published in The Startup · 12 min read · Oct 30, 2019

Commuting to and from work, school, and other places has always been a fundamental part of many of our daily lives, particularly in urban areas with high population densities and robust transportation infrastructure. As a daily commuter myself, I typically default to the same routine every day and don’t often give much thought to whether my chosen means of transportation is the best option for me or my safety. As more and more of us find ourselves commuting for longer stretches of time [1], and as the number of fatal car accidents rises [2], I think it is more important now than ever for us to consider our personal safety more closely when commuting.

As I have a background in Information Science, I have chosen to approach this problem from a data-centered perspective. Using Python 3 and Jupyter Notebook, I based my analysis on several publicly available datasets published by the United States government.

You can access my full Jupyter notebook here.

TL;DR? No problem, just jump down to the Conclusion section.

Data Retrieval

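For this analysis, I made use of four primary datasets, each from a different government agency. These datasets are as follows:

Bureau of Transportation Statistics (BTS): Commuting to Work, 2013
Centers for Disease Control and Prevention (CDC): Motor Vehicle Occupant Death Rate, by Age and Gender, 2012 & 2014
United States Census Bureau: U.S. Census Annual Estimates of the Resident Population, 2013
National Highway Traffic Safety Administration (NHTSA): Fatality Analysis Reporting System (FARS), 2013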

Because all of my data came as CSV (comma-separated values) files, I used the .read_csv() function from a Python library called pandas to import each dataset into a pandas dataframe.

Here’s a bit of sample code:

import pandas as pd

# read in a csv file using a pandas function
df = pd.read_csv('file_name.csv')

Data Cleaning

After importing all datasets as dataframes, I set about cleaning my data in preparation for merging, analysis, and visualization.

Bureau of Transportation Statistics — Commuting to Work, 2013

The first dataframe I cleaned was commuting data from the BTS. I began by creating a list of states, which I later used to filter rows in my dataframe. I then removed unnecessary columns, renamed the columns that remained, and used my list to conditionally drop rows holding data on places not in my list of states.

# Dataset #1: Bureau of Transportation Statistics - Commuting to Work, 2013
btsDF.drop(columns='Unnamed: 9',inplace=True)
btsDF.columns = ['State','Number of Workers','Percent Automobiles (Alone)','Percent Automobiles (Carpooled)',
                 'Percent Public Transortation','Percent Walked','Percent Taxicab, Motorcycle, Bicycle, etc.',
                 'Percent Worked from Home','Average Travel Time to Work (Minutes)']

# remove non-states in 'State' column (ex. Puerto Rico)
for i in range(len(btsDF.index)):
    if btsDF['State'][i] not in stateList:
        btsDF.drop(index=i,inplace=True)

btsDF.reset_index(drop=True,inplace=True)
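The row-by-row loop above can also be written as a single boolean mask. The snippet below is a minimal sketch on a toy dataframe; `state_list` and `toy_df` are made-up stand-ins for `stateList` and `btsDF`, and the values are illustrative rather than the BTS data.

```python
import pandas as pd

# Hypothetical stand-ins for stateList and btsDF (illustrative values only)
state_list = ['Alaska', 'Texas', 'Vermont']
toy_df = pd.DataFrame({
    'State': ['Alaska', 'Puerto Rico', 'Texas', 'United States', 'Vermont'],
    'Number of Workers': [10, 20, 30, 40, 50],
})

# Keep only rows whose 'State' appears in the list, then renumber the index
filtered_df = toy_df[toy_df['State'].isin(state_list)].reset_index(drop=True)
print(filtered_df)
```

Because the mask builds a new dataframe instead of dropping rows in place, it does not depend on the index still being the default 0, 1, 2, ... numbering.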

Following this, I used pandas’ .apply() function to convert the values in the Number of Workers column to integers. Then I used it again to change the format of the percentage values from percents (ex. 12%) to decimals (ex. 0.12).

# convert the string values in Number of Workers to integers:
# strip the ',' thousands separators, then cast each string to an int
btsDF['Number of Workers'] = btsDF['Number of Workers'].apply(lambda x: int(x.replace(',','')))

# convert the percentage columns from percent format (12%) to decimal format (0.12),
# leaving the state, worker count, and travel time columns untouched
btsDF = btsDF.apply(lambda x : x / 100 if x.name not in ['State','Number of Workers','Average Travel Time to Work (Minutes)'] else x)
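The division above assumes the percentage columns are already numeric. If they were instead read in as strings such as '12%' (an assumption, since the raw file isn't shown here), a sketch like the following would strip the sign before rescaling. The dataframe `pct_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical column of percent strings (illustrative values only)
pct_df = pd.DataFrame({'Percent Walked': ['2.8%', '12%', '0.5%']})

# Strip the trailing '%', parse as float, and rescale to a 0-1 fraction
pct_df['Percent Walked'] = (
    pct_df['Percent Walked'].str.rstrip('%').astype(float) / 100
)
print(pct_df['Percent Walked'].tolist())
```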

Centers for Disease Control and Prevention — Motor Vehicle Occupant Death Rate, by Age and Gender, 2012 & 2014

The second dataframe I cleaned held motor vehicle occupant death rate data from the CDC. This dataframe is actually a combination of two separate datasets from two separate years (2012 and 2014), which I had already merged for a previous project.

I cleaned this dataframe’s rows and columns the same way as the previous one, though I also used pandas’ .mean() and .apply() functions to average the death rates across the two years and to rescale those rates by dividing by 100, the same percent-to-decimal conversion (ex. 12% to 0.12) I applied to the first dataframe.

# take the average death rate (between 2012 and 2014) for each state
cdcDF['Death Rate per 100,000 Population'] = cdcDF.mean(axis=1,numeric_only=True)

cdcDF.drop(columns=['2012','2014'],inplace=True)

# convert the percentage from percent format (12%) to decimal format (0.12)
cdcDF = cdcDF.apply(lambda x : x / 100 if x.name not in ['State'] else x)

United States Census Bureau — U.S. Census Annual Estimates of the Resident Population, 2013

The third dataframe I cleaned was population data from the US Census Bureau. I cleaned its rows and columns using the same methods as for dataframes #1 and #2. Beyond that, I used pandas’ .apply() and .astype() functions to convert the population values from strings (which they are by default, for whatever reason) into integers.

# convert the Population values from strings (which it is by default) to integers
uscbDF = uscbDF.apply(lambda x : x.astype('int32',copy=False) if x.name not in ['State'] else x)

National Highway Traffic Safety Administration — Fatality Analysis Reporting System (FARS), 2013

The final dataframe I cleaned held fatality reports from the NHTSA’s FARS. My process for cleaning it was very different from the ones before. My goal with this last dataframe was actually to generate another dataframe from it that would give me the number of motor vehicle accidents that occurred in each state in 2013.

I achieved this by building a dictionary, populating it with values from my final dataframe, and then converting that dictionary into a dataframe of its own using pandas’ .DataFrame() constructor.

# Dataset #4: National Highway Traffic Safety Administration - Fatality Analysis Reporting System (FARS), 2013

nhtsaDic = {'FIPS':[],'State':[],'Accidents':[]}

nhtsaDic['FIPS'] = fips['FIPS'].tolist()
nhtsaDic['State'] = fips['NAME'].tolist()

for state in nhtsaDic['FIPS']:
    tempDF = farsDF.loc[farsDF['STATE'] == state]
    size = len(tempDF.index)
    nhtsaDic['Accidents'].append(size)

nhtsaDF = pd.DataFrame(nhtsaDic)

# remove the FIPS column
nhtsaDF.drop(columns=['FIPS'],inplace=True)

# remove non-states in 'State' column (ex. Puerto Rico)
for i in range(len(nhtsaDF.index)):
    if nhtsaDF['State'][i] not in stateList:
        nhtsaDF.drop(index=i,inplace=True)

nhtsaDF.reset_index(drop=True,inplace=True)
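Counting rows per state code can also be done in one pass with value_counts(). This is a sketch on made-up stand-ins for `farsDF` and `fips`; the codes and rows are illustrative, and it assumes, as the loop above does, that each row of the FARS table represents one accident.

```python
import pandas as pd

# Hypothetical stand-ins: one row per accident, plus a FIPS lookup table
toy_fars = pd.DataFrame({'STATE': [2, 48, 48, 50, 48]})
toy_fips = pd.DataFrame({'FIPS': [2, 48, 50, 56],
                         'NAME': ['Alaska', 'Texas', 'Vermont', 'Wyoming']})

# Count rows per state code in a single pass
counts = toy_fars['STATE'].value_counts()

# Look up each code's count; codes with no rows get 0
toy_fips['Accidents'] = (
    toy_fips['FIPS'].map(counts).fillna(0).astype(int)
)
print(toy_fips[['NAME', 'Accidents']])
```

The loop version filters the full table once per state, while this version scans it once in total, which may matter on a table as large as a full year of FARS records.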

Merging Data

After cleaning my dataframes, I combined them into one master dataframe using pandas’ .merge() function. I then exported the master dataframe as an Excel file using pandas’ .to_excel() function for later visualization in Tableau.

# Dataset #1 + Dataset #2
mainDF = btsDF.merge(cdcDF,on='State')
...
...
...
# export my dataframe to excel for visualization in Tableau
mainDF.to_excel("CvVOD.xlsx", sheet_name='Data')
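When several dataframes share the same key column, the pairwise merges can be folded into one expression with functools.reduce. This is only a sketch with made-up frames, not the merge sequence from my notebook; the frame contents and column names are hypothetical.

```python
from functools import reduce

import pandas as pd

# Hypothetical cleaned frames that all share a 'State' column
frames = [
    pd.DataFrame({'State': ['Texas', 'Alaska'], 'Workers': [100, 10]}),
    pd.DataFrame({'State': ['Alaska', 'Texas'], 'Death Rate': [0.5, 0.9]}),
    pd.DataFrame({'State': ['Texas', 'Alaska'], 'Population': [1000, 50]}),
]

# Merge left to right on 'State'; the default inner join keeps only
# states present in every frame
merged = reduce(lambda left, right: left.merge(right, on='State'), frames)
print(merged)
```

Because the default join is an inner join, a state missing from any one frame silently drops out of the result, so it is worth comparing the row count before and after merging.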

Data Analysis

A Baseline: Motor Vehicle Accidents by State

According to the NHTSA data, Texas has the most motor vehicle related accidents (3,047) of any state, whereas Alaska has the fewest (49), with the average across all states being around 600 accidents a year. This makes sense given that Texas has a much larger population than Alaska, and thus more people who could get into motor vehicle related accidents.

This theme persists throughout the data, with high-population states like California and Florida each having large numbers (2,860 and 2,223) of motor vehicle related accidents, and low-population states like Rhode Island and Vermont each having small numbers (62 and 63).

There is an odd case, though, in New York. New York has a massive population (third largest in the US), yet has a fairly small number of motor vehicle related accidents (1,124) relative to its population. (We’ll be able to see this disparity more clearly in a later visualization.)

Another Baseline: Motor Vehicle Occupant Deaths by State

According to my CDC data, Texas once again has the most motor vehicle occupant deaths (2,460) of any state, whereas Rhode Island has the fewest (36), with the average across all states being a little over 400 deaths a year. This makes sense given that our previous data showed Texas had the most motor vehicle related accidents, and thus more chances for one of those accidents to be fatal.

This theme of a possible correlation between population and motor vehicle accidents persists throughout this data, with places like New York once more managing to keep the death toll low (580) despite its large population.

Standardizing for Population

Let’s better contextualize these death counts by looking at them in terms of each state’s population.

As the above graph shows, the states with the largest populations are:

California (38,332,521)
Texas (26,448,193)
New York (19,651,127)
Florida (19,552,860)

And the states with the lowest populations are:

Wyoming (582,658)
Vermont (626,630)
North Dakota (723,393)
Alaska (735,132)

Now let’s look at the motor vehicle occupant deaths in relation to these populations.

The above graph shows the motor vehicle occupant deaths divided by the population of each state, shown as a percentage value. Using this visualization, we can see that the states with the highest motor vehicle occupant deaths in relation to their populations are:

Wyoming (1.9%)
North Dakota (1.7%)
Mississippi (1.6%)
Montana (1.6%)

And the states with the lowest motor vehicle occupant deaths in relation to their population are:

New York (0.3%)
Massachusetts (0.3%)
Hawaii (0.33%)
Rhode Island (0.34%)

An interesting insight we can pull from this data is that you’re safer in New York (from a fatal motor vehicle accident, that is) than in Wyoming. So, while Wyoming might have fewer motor vehicle occupant deaths a year than New York, you are roughly six times as likely (1.9% versus 0.3%) to end up as one of these victims in Wyoming as you would be in New York.

Context: Commutes by State

Looking at the BTS dataset, we can see that, unsurprisingly, transportation by automobile (solo) is the dominant mode of transportation for commuters across all states. This is followed by automobile (carpool) and public transportation.

Let us now look to see if there exists any correlation between mode of transportation and fatal motor vehicle accident deaths.

Mode of Commute vs. Motor Vehicle Occupant Deaths

Insight: The number of MVO deaths (per 100,000 people) increases as the number of solo automobile commuters (per 100,000 people) increases.

Strong Correlation: With a p-value of <0.0001, we reject the null hypothesis, so the data suggests a strong correlation between the number of solo automobile commuters and MVO deaths.

Insight: The number of MVO deaths (per 100,000 people) increases as the number of carpool automobile commuters (per 100,000 people) increases.

Weak Correlation: With a p-value of 0.58, we fail to reject the null hypothesis, so the data suggests at most a weak correlation between the number of carpool automobile commuters and MVO deaths, and not a statistically significant one.

Insight: The number of MVO deaths (per 100,000 people) decreases as the number of public transportation commuters (per 100,000 people) increases.

Strong Correlation: With a p-value of <0.0001, we reject the null hypothesis, so the data suggests a strong correlation between the number of public transportation commuters and MVO deaths.

Insight: The number of MVO deaths (per 100,000 people) decreases as the number of alternative (taxi, motorcycle, bicycle, etc.) commuters (per 100,000 people) increases.

No Correlation: With a p-value of 0.1, we fail to reject the null hypothesis, so the data shows no statistically significant correlation between the number of alternative (taxi, motorcycle, bicycle, etc.) commuters and MVO deaths.

Insight: The number of MVO deaths (per 100,000 people) decreases as the number of walking commuters (per 100,000 people) increases.

Modest Correlation: With a p-value of 0.013, we reject the null hypothesis, so the data suggests a modest correlation between the number of walking commuters and MVO deaths.

Insight: The number of MVO deaths (per 100,000 people) decreases as the number of workers who worked from home increases.

Modest Correlation: With a p-value of 0.015, we reject the null hypothesis, so the data suggests a modest correlation between the number of workers who worked from home and MVO deaths.
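For readers who want to see where a p-value like the ones above comes from, here is a sketch of the Pearson correlation coefficient and its t statistic in plain Python. The data points are invented for illustration, the function names are hypothetical, and this is not necessarily the exact test behind the figures quoted above. The p-value is the two-tailed probability of a value at least as extreme as `t` under a Student's t distribution with n - 2 degrees of freedom, which a statistics library would supply.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t statistic for testing r against a null of zero correlation."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Invented example: share of transit commuters vs. deaths per 100,000
transit_share = [0.01, 0.02, 0.05, 0.10, 0.28]
deaths_per_100k = [14.0, 12.5, 9.0, 6.0, 3.0]

r = pearson_r(transit_share, deaths_per_100k)
t = t_statistic(r, len(transit_share))
print(f"r = {r:.3f}, t = {t:.2f}")
```

Note that the p-value speaks to how unlikely the observed relationship would be under the null hypothesis, while the size of `r` is what describes how strong the relationship is.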

Insight: The number of MVO deaths increases as the number of motor vehicle accidents increases and as the population increases. The number of accidents increases as the population increases.

Strong Correlation: With a p-value of <0.001 (in all cases), we reject the null hypothesis, so the data suggests a strong correlation in each pairing: between the number of motor vehicle accidents and MVO deaths, between the population size and MVO deaths, and between the population size and the number of accidents.

(These last three graphs are here just to show that deaths do go up as accidents and population go up, justifying the ‘per 100,000 people’ standardization I used in the graphs preceding them. ‘Per 100,000 people’ lets us account for the increases caused by higher populations and accident counts, so the data isn’t skewed unfairly towards one state or another.)
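As a worked example of the standardization, the sketch below converts the death counts and 2013 populations quoted earlier for Texas and New York into deaths per 100,000 people. The helper name `per_100k` is made up for this illustration.

```python
def per_100k(count, population):
    """Scale a raw count to a rate per 100,000 people."""
    return count / population * 100_000

# Death counts and 2013 populations quoted earlier in this article
states = {
    'Texas': (2460, 26_448_193),
    'New York': (580, 19_651_127),
}

for name, (deaths, population) in states.items():
    print(f"{name}: {per_100k(deaths, population):.1f} deaths per 100,000")
```

Dividing by population first is what lets a small state and a large state be compared on the same scale.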

Conclusion

So what does all this data tell us?

Well, it tells us that both the number of accidents and, by proxy, the number of motor vehicle occupant deaths increase as a state’s population increases. So, the more people there are in a given state, the greater the chance of accidents and of those accidents being fatal. This does not necessarily mean, though, that you yourself are more likely to end up as a victim of a fatal motor vehicle accident simply because your state has a higher population. In fact, for most of these high-population states, victims of fatal motor vehicle accidents make up a minuscule part of the population.

The data also tells us that certain modes of transportation, like automobile (solo) and public transportation, have a meaningful correlation with and, by extension, effect on the number of motor vehicle occupant deaths in a given state, whereas other modes, like carpooling and alternative transportation (taxi, motorcycle, bicycle, etc.), have little to no correlation with, or impact on, motor vehicle occupant deaths.

The most important (in my opinion) insight we can draw from this data is that states with higher numbers of commuters using public transportation experience a substantial statistical decrease in the number of motor vehicle occupant deaths. Meaning the more people using public transportation, the safer it is for all of us. Unfortunately, my data also shows that the majority of commuters commute solo by automobile. So, according to the data, if we want to keep our commuters safer, then we should embrace and support more public means of transportation and move away from driving to and from work.

So, in conclusion, what’s the safest means of transportation for a commuter? The bus…the train…good ol’ public transportation. According to my research, you are statistically less likely to get into a motor vehicle accident, and by proxy a fatal accident, if you choose public transportation. So, maybe think about giving it a chance. Regardless, I hope I’ve left you with something to think about. Be well, and be safe! Ciao!

Afterword

Before closing out this piece, I would like to point out some limitations in my analysis. Because all of my data originates from around 2013, it is all a bit dated at this point. Unfortunately, I was unable to find enough sources of data from a more contemporary time, so I acquiesced and decided to base my findings in the year 2013. Another limitation of my data is that motor vehicle occupant deaths don’t give us a holistic view of the actual dangers and safety risks across different modes of transportation. Since the data only looks at the deaths of motor vehicle occupants and not other victims that could have arisen from a fatal motor vehicle accident, it lacks quite a bit of useful information. Unfortunately, I was unable to find another adequate dataset that included this information, so I stuck with the motor vehicle occupant deaths data. Regardless, I think my analysis gives a good basis for thinking about commuter safety and, if nothing else, raises questions about the modes by which we choose to commute. Thanks for reading.