Montgomery County Crash Reporting — Exploratory Analysis

Kimberly Escobar
INST414: Data Science Techniques
4 min read · Nov 25, 2023

For this analysis, I extracted and cleaned data providing information about motor vehicle incidents occurring on the local roadways of Montgomery County, Maryland.

The specific insight I planned to extract from this data was whether certain vehicle models are more prone to accidents, and whether there is a connection to the severity of those accidents. Gaining this perspective will inform the audience of which car models are statistically more dangerous on the road. It will also invite them to look at other factors that may contribute to what makes a car model “dangerous,” such as particular vehicle features and the locations where the car is distributed.

Data Collection and Cleaning

I collected this data from Data.gov through an API. I used the requests Python library to access the API endpoint, which returned the data in JSON format. To make the data easier to read and analyze, I then used the pandas library to convert the JSON records into a DataFrame.

One difficulty I encountered was choosing how much data to use for this analysis. When I first accessed the endpoint without setting a limit, I received only 1,000 rows (the most recent 1,000 reports as of Nov. 24, 2023), which is understandable since the SODA Consumer API returns 1,000 rows by default. However, I did not think a sample of that size would give an effective reading, so I set the limit to 5,100, which returns all reports from the past three months. The API is updated weekly and contains roughly 170,000 crash reports in total.

resp = requests.get('https://data.montgomerycountymd.gov/resource/mmzv-x632.json?$limit=5100')
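
For completeness, here is a minimal sketch of the full request-to-DataFrame step; passing the limit through params is equivalent to appending $limit to the URL, and url is just a local variable name I chose.

import requests
import pandas as pd

# Request the most recent ~3 months of reports; without $limit,
# the SODA API would return only its default of 1,000 rows.
url = 'https://data.montgomerycountymd.gov/resource/mmzv-x632.json'
resp = requests.get(url, params={'$limit': 5100})
resp.raise_for_status()  # fail early if the request was unsuccessful

# Convert the list of JSON records into a pandas DataFrame.
df = pd.DataFrame(resp.json())
print(df.shape)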

While reading the API documentation, I learned there were a total of 43 features in the dataset, but when I converted the data into a DataFrame there were 51 columns. This was because of additional keys/features named “:@computed_region_xxxx_xxx” that were not relevant to the topic. To remove these, I sorted the columns alphabetically (the extra features all start with “:@”, so they sort to the front) and then sliced the sorted DataFrame to exclude those first eight columns.

# Sort the columns alphabetically so the ':@computed_region' keys come first,
# then slice them off, keeping the 43 documented features.
sort_df = df[sorted(df)]
df = sort_df.iloc[:, 8:]
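
An alternative sketch, not the approach used above, would drop the same columns by name instead of by position, which avoids hard-coding the slice index:

# Keep only columns whose names do not start with ':@'
df = df.loc[:, ~df.columns.str.startswith(':@')]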

Exploration

First, I created a list of the 30 most frequent vehicle models in the DataFrame. I chose to analyze the top 30 models rather than a larger number because they account for the majority of the data, and given the scale of this assignment, 30 seemed like a reasonable cut-off for an exploratory analysis. A code sketch for this step appears after the list.

['HONDA CIVIC',
'TOYOTA CAMRY',
'TOYOTA COROLLA',
'HONDA ACCORD',
'TOYOTA RAV4',
'HONDA CRV',
'TOYT SD',
'NISSAN ALTIMA',
'TOYT 4S',
'FORD TK',
'NISSAN SENTRA',
'FORD F150',
'HONDA PILOT',
'FORD EXPLORER',
'HYUNDAI ELANTRA',
'HOND SD',
'NISSAN ROGUE',
'TOYT UT',
'DODGE CHARGER',
'TOYOTA SIENNA',
'TOYOTA TACOMA',
'HYUNDAI SONATA',
'TOYOTA PRIUS',
'TOYT TK',
'TOYOTA HIGHLANDER',
'FORD ESCAPE',
'THOMAS BUS',
'HOND 4S',
'HONDA ODYSSEY',
'TOYT COROLLA']
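
The list above can be produced with something along these lines. I am assuming here that the make and model live in columns named vehicle_make and vehicle_model and are combined into a single label; the exact field names should be checked against df.columns.

# Combine make and model into one label, count occurrences,
# and keep the 30 most frequent combinations.
models = df['vehicle_make'].str.strip() + ' ' + df['vehicle_model'].str.strip()
top_30 = models.value_counts().head(30).index.tolist()
print(top_30)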

Next, I filtered the original DataFrame to contain only the vehicle models from the list above. I then created three histograms for each vehicle model, one for each of the key features I chose to analyze further: vehicle damage extent, injury severity, and driver substance abuse. I chose these features because they help measure the danger and severity of each incident. Here are some example charts for reports involving a Honda Civic, Toyota RAV4, and Ford F-150; specifically, they show the vehicle damage extent recorded in each report.

These visualizations helped me understand the distribution of damage extent for each vehicle model, and the same approach applied to the other two features, injury severity and driver substance abuse.
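
As a rough sketch of the filtering and charting step, assuming the combined make/model label from the sketch above and a vehicle_damage_extent field (the same pattern applies to injury_severity and driver_substance_abuse):

import matplotlib.pyplot as plt

# Keep only reports from the 30 most frequent vehicle models.
df['make_model'] = models
filtered = df[df['make_model'].isin(top_30)]

# Example: distribution of vehicle damage extent for Honda Civic reports.
civic = filtered[filtered['make_model'] == 'HONDA CIVIC']
civic['vehicle_damage_extent'].value_counts().plot(kind='bar', title='HONDA CIVIC: vehicle damage extent')
plt.tight_layout()
plt.show()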

View all Visualizations in the GitHub Repository

Limitations

One way this analysis may be biased is the timeframe in which it was completed. The sample covers the holiday season, when car crashes are known to be more frequent, so it may not represent typical conditions on these roadways.

Another limitation of this analysis was the lack of variety in the data. Brands like Toyota and Honda are very popular choices among car owners, so their models appear frequently in the crash reports simply because more of them are on the road. This was one difficulty I faced when first trying to choose the specific insight I wanted to pursue, and it is why I chose to further analyze the state of each vehicle and driver, since that yielded a wider variety of findings. While the dataset does provide some information about the driver, such as whether they were driving distracted, I feel that more information about driver demographics would also improve this analysis.
