The Great Debate: Maryland vs. Virginia Drivers

mpothen
INST414: Data Science Techniques
4 min read · May 6, 2022

An Analysis on Motor Vehicle Accident Data in the DMV Area

Marilyn Pothen

American roads are built to prioritize efficiency, not safety. To address this problem, we have to focus on specific regions of the U.S. My approach was to start at home: the DMV. I wanted to find out which areas of Maryland and Virginia need increased highway safety measures. The National Highway Traffic Safety Administration (NHTSA) provides an API for developers interested in U.S. crash statistics. I used this API to gather data on motor vehicle accidents in Maryland and Virginia from 2019 onwards. This data was relevant because it provided the location, date, and number of fatalities, pedestrians, persons, and vehicles for over 2,500 cases. Analysis of this data could help government agencies, such as the NHTSA or the Federal Highway Administration, decide which areas of the DMV are “accident prone” and need increased funding and attention for roadway safety.

To retrieve this data, I imported the Python Requests library. The NHTSA provided a URL whose states and year parameters I changed to get my desired output in JSON format. I then passed this URL to the “.get()” function and stored the result in a JSON object.

r = requests.get("https://crashviewer.nhtsa.dot.gov/CrashAPI/crashes/GetCaseList?states=24,51&fromYear=2020&toYear=2021&minNumOfVehicles=1&maxNumOfVehicles=6&format=json")
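The same request can also be assembled from named parameters, which makes it easier to change the states or years later. This is only a sketch: the helper names `build_url` and `fetch_cases` are hypothetical, and the parameter names are copied from the URL above rather than from any other NHTSA documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://crashviewer.nhtsa.dot.gov/CrashAPI/crashes/GetCaseList"


def build_url(states, from_year, to_year, min_vehicles=1, max_vehicles=6):
    """Rebuild the query string used above from named arguments."""
    params = {
        "states": ",".join(str(s) for s in states),  # 24 = Maryland, 51 = Virginia (FIPS codes)
        "fromYear": from_year,
        "toYear": to_year,
        "minNumOfVehicles": min_vehicles,
        "maxNumOfVehicles": max_vehicles,
        "format": "json",
    }
    # safe="," keeps the comma between state codes unescaped, as in the original URL
    return BASE_URL + "?" + urlencode(params, safe=",")


def fetch_cases(url, timeout=30):
    """Send the GET request and return the decoded JSON payload."""
    import requests  # imported here so build_url works without Requests installed

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    return response.json()


url = build_url(states=[24, 51], from_year=2020, to_year=2021)
# data = fetch_cases(url)  # uncomment to call the live API
```

Calling `raise_for_status()` before `.json()` is a small safeguard: a failed request raises an error right away instead of surfacing later as a confusing parsing problem.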

The first step in my data analysis was to find which areas had the highest number of reported car accident cases.

Prince George’s County (MD) had the highest number of cases from 2019–2020 at 205 cases. The counties with the lowest number of cases, with the exception of Alleghany County, were all Virginia counties: Staunton, Covington, Waynesboro, Galax, Highland, and Colonial Heights.

Prince George’s County also had the highest counts of fatalities (221) and of pedestrians involved in crashes (79). It had the highest number of cases both in Maryland and in the DMV area overall. In Virginia, Fairfax County had the highest number of cases.
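A minimal sketch of how per-county counts like these could be computed from a list of case dictionaries. The field names (`StateName`, `CountyName`, `Fatals`, `Peds`) and the sample rows are assumptions for illustration, not values taken from the real dataset.

```python
from collections import defaultdict

# Hypothetical sample rows; the real data has one dictionary per crash case.
cases = [
    {"StateName": "Maryland", "CountyName": "PRINCE GEORGE'S (33)", "Fatals": 1, "Peds": 1},
    {"StateName": "Maryland", "CountyName": "PRINCE GEORGE'S (33)", "Fatals": 2, "Peds": 0},
    {"StateName": "Virginia", "CountyName": "FAIRFAX (59)", "Fatals": 1, "Peds": 0},
]


def totals_by_county(cases):
    """Collapse one-row-per-case data into one record per (state, county)."""
    totals = defaultdict(lambda: {"cases": 0, "fatalities": 0, "pedestrians": 0})
    for case in cases:
        key = (case["StateName"], case["CountyName"])
        totals[key]["cases"] += 1
        totals[key]["fatalities"] += case["Fatals"]
        totals[key]["pedestrians"] += case["Peds"]
    return dict(totals)


def rank_counties(totals, metric="cases"):
    """Return (county, record) pairs sorted from highest to lowest on one metric."""
    return sorted(totals.items(), key=lambda item: item[1][metric], reverse=True)


ranked = rank_counties(totals_by_county(cases))
```

Keying on the state as well as the county name avoids merging counties that share a name across the two states.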

Next, I created a correlation matrix to understand the correlation between fatalities and the numbers of persons, vehicles, and pedestrians involved in a crash. All variables were strongly correlated, especially fatalities and the number of vehicles involved in a crash. This analysis could serve as evidence that roadways need to be built to limit the number of vehicles in a certain area in order to reduce the number of fatalities.
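For readers unfamiliar with correlation matrices, here is a small sketch that computes pairwise Pearson correlations in plain Python (pandas' `DataFrame.corr()` computes the same coefficient by default). The column names and the four sample rows are made up for illustration, so the resulting numbers say nothing about the real data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a column with no variation would make this denominator zero.
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(rows, columns):
    """Build a nested dict: matrix[a][b] is the correlation of columns a and b."""
    series = {col: [row[col] for row in rows] for col in columns}
    return {a: {b: pearson(series[a], series[b]) for b in columns} for a in columns}


# Hypothetical per-county totals, not real results.
rows = [
    {"Fatals": 1, "Persons": 2, "TotalVehicles": 1, "Peds": 0},
    {"Fatals": 2, "Persons": 5, "TotalVehicles": 2, "Peds": 1},
    {"Fatals": 1, "Persons": 1, "TotalVehicles": 1, "Peds": 1},
    {"Fatals": 3, "Persons": 6, "TotalVehicles": 3, "Peds": 0},
]
matrix = correlation_matrix(rows, ["Fatals", "Persons", "TotalVehicles", "Peds"])
```

A value near 1 means two columns rise together. It does not by itself show that one causes the other, which is why the conclusion above is phrased as possible evidence.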

I also wanted to examine which times of year see the most accident cases in the DMV area. When I plotted the number of cases by date, I discovered that the number of car crashes increased from April to November in both 2019 and 2020. However, during the pandemic, from April 2020 to December 2020, the increase was much steeper. With the COVID-19 pandemic altering traffic patterns, fewer people were on the roads, and this may have emboldened drivers to take more risks, such as driving at lethal speeds. Thus, the number of pedestrians involved and the number of fatalities rose as well.
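One simple way to prepare data for a plot like this is to count cases per calendar month. The sketch below assumes the crash dates have already been converted to `datetime.date` objects; the helper name `cases_per_month` and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def cases_per_month(crash_dates):
    """Count cases per (year, month), returned in chronological order."""
    counts = Counter((d.year, d.month) for d in crash_dates)
    return dict(sorted(counts.items()))


# Hypothetical dates for illustration only.
sample_dates = [date(2019, 4, 3), date(2019, 4, 20), date(2020, 4, 7), date(2020, 11, 15)]
monthly = cases_per_month(sample_dates)
```

The keys of `monthly` can be used as the x-axis and the values as the y-axis, which makes the same months in 2019 and 2020 easy to compare side by side.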

The two major challenges I encountered were reformatting the data and interpreting the datetime variable. Originally, each row in the data frame contained information on a single case, which made it difficult to filter the data by location. Iterating over a data frame with for loops is also not recommended. Thus, I rearranged the data using a double for loop so that it was grouped by county (one row per county, instead of one row per case). I also converted the data frame to a list of dictionaries, since this format is easier to iterate through. The CrashDate variable was returned by the API in an Epoch time format. To change it to an easily interpretable format, I looped through the rows and extracted the date code using regex. Then I passed the extracted value to the “.fromtimestamp()” function from the datetime library to get a datetime object with the year, month, and day clearly shown.
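The date conversion described above might look roughly like the following. This sketch assumes the raw CrashDate value is a string of the form `/Date(1577854800000-0500)/`, meaning milliseconds since the Unix epoch followed by a UTC offset. The exact format and the helper name `parse_crash_date` are assumptions, so check them against the actual API output.

```python
import re
from datetime import datetime, timedelta, timezone

# Milliseconds since the epoch, optionally followed by a +HHMM or -HHMM offset.
DATE_PATTERN = re.compile(r"/Date\((-?\d+)(?:([+-])(\d{2})(\d{2}))?\)/")


def parse_crash_date(raw):
    """Convert an assumed '/Date(<millis><offset>)/' string to an aware datetime."""
    match = DATE_PATTERN.fullmatch(raw)
    if match is None:
        raise ValueError(f"unexpected CrashDate format: {raw!r}")
    millis, sign, hours, minutes = match.groups()
    if sign is None:
        tz = timezone.utc  # no offset given, so fall back to UTC
    else:
        offset = timedelta(hours=int(hours), minutes=int(minutes))
        tz = timezone(-offset if sign == "-" else offset)
    # fromtimestamp expects seconds, so divide the millisecond value by 1000
    return datetime.fromtimestamp(int(millis) / 1000, tz=tz)


parsed = parse_crash_date("/Date(1577854800000-0500)/")
```

Passing an explicit `tz` keeps the result independent of the time zone of the machine running the script, so a crash near midnight is not shifted to the wrong day.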

A limitation I faced was the lack of crash data from 2021. This data would have been useful for predicting what traffic accidents may look like for the rest of 2022. However, analyzing data from 2019 and 2020 allowed me to gain a unique perspective on traffic accidents during the pandemic and the months leading up to it. In the future, I want to reapproach this problem with new skills such as clustering and linear regression to get a stronger understanding of what exactly makes certain areas, such as Prince George’s County or Fairfax County, prone to accidents. I would also like to understand whether the population of a county has any effect on the number of crashes. Another limitation of the data was that it lacked information on the causes of crashes. Understanding what led to a crash would add another facet to the analysis and could help government agencies prevent specific events that increase the chances of accidents.

Links:

Data Analysis Python File: https://drive.google.com/drive/folders/1rJGuEpPz1vqy6f3nSCS-F-LxvkyKLfy5?usp=sharing
