Finding Important Characteristics From Montgomery County Vehicle Incidents

Kimberly Escobar
INST414: Data Science Techniques
3 min read · Dec 9, 2023

After collecting data from an API providing information about Montgomery County vehicle incidents, I set out to expose patterns that are not immediately apparent. Specifically, I clustered the data by the location where each incident took place (road name) and then characterized the resulting clusters. This insight can help interested parties, such as public service officials, characterize certain Montgomery County, Maryland roads and discover other ways in which they may be similar. It also allows stakeholders to dig into individual incident reports to understand why this unsupervised learning technique grouped them together.

Data Collection and Cleaning

I obtained this data from Data.gov, which provides an API endpoint for collecting the information. I used the Python requests library to access the endpoint and converted the response into a pandas DataFrame for readability. While exploring the data, I noticed extra columns that are not mentioned in the API documentation, all similarly named “:@computed_region_xxxx_xxxx.” Assuming these were generated automatically, since this data was originally collected via the Maryland State Police’s Automated Crash Reporting System (ACRS), I removed them before completing my analysis.

Another fix was accounting for the NA values in the data. For example, some reports had NA for the road where the incident occurred, which reduced the amount of data I could actually work with. I extracted 5,100 rows of reports (covering the last three months), but clustered only 2,800+ of them.

import requests
import pandas as pd

#Pull up to 5,100 incident reports from the Montgomery County API
resp = requests.get('https://data.montgomerycountymd.gov/resource/mmzv-x632.json?$limit=5100')

resp_obj = resp.json() #Convert output to JSON

df = pd.DataFrame(resp_obj) #Convert JSON response to DataFrame

#Sort to pull additional "@" columns to front of df
sort_df = df[sorted(df)]

#Slice off the "@" columns (the first eight columns after sorting)
df = sort_df.iloc[:,8:]
df = df.fillna('')
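The drop from 5,100 to 2,800+ reports comes from reports with no recorded road name. The cleaning code above fills missing values with empty strings; a minimal sketch of how reports lacking a road name could be filtered out before clustering (report numbers here are made up for illustration):

```python
import pandas as pd

# Hypothetical sample mirroring the API's schema: some reports lack a road name
df = pd.DataFrame({
    "report_number": ["A1", "A2", "A3", "A4"],
    "road_name": ["ROCKVILLE PIKE", None, "GEORGIA AVE", None],
})

# Keep only reports that actually record a road name
clustered = df.dropna(subset=["road_name"])
print(len(df), "->", len(clustered))  # 4 -> 2
```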

Clustering the Data

I clustered the data on the road name recorded in each report. Since this feature implies a geographic location, it offers a good starting point for exploration and can guide which other features to analyze in the future.

#Initialize dictionary to hold road counts for each report
report_dict = {}

#Iterate through each row
for index, row in df.iterrows():
    road_count = {}

    #Get road name and set count to 1
    road_count[row['road_name']] = 1

    #Add report number to dictionary and set value to the road count
    report_dict[row['report_number']] = road_count

#Create dataframe with counts for each road
index = report_dict.keys()
rows = [report_dict[k] for k in index]
df1 = pd.DataFrame(rows, index=index)
df1 = df1.fillna(0)

from sklearn.cluster import KMeans

k = 7 #Set k (chosen with the Elbow Method)
cluster_model = KMeans(n_clusters=k)
cluster_model.fit(df1)
cluster_labels = cluster_model.predict(df1)
report_cluster_df = pd.DataFrame(cluster_labels, index=df1.index, columns=["cluster"])

I used KMeans to cluster the data, choosing k with the “Elbow Method,” which pointed to k = 7. After clustering, I interpret each cluster as grouping roads that are similar based on how many incidents are reported on each. For example, Rockville Pike appears in one of the clusters, so my next step would be to read the car crash reports at that location and see whether other roads share similar characteristics or features.
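The Elbow Method works by fitting KMeans over a range of k values and watching where the inertia (within-cluster sum of squares) stops dropping sharply. A minimal sketch on a toy one-hot road matrix standing in for df1 (the roads and data here are synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for df1: 200 reports spread over 5 hypothetical roads,
# one-hot encoded the same way as the report/road matrix above
rng = np.random.default_rng(0)
road_ids = pd.Series(rng.integers(0, 5, size=200), name="road")
df1 = pd.get_dummies(road_ids, dtype=float)

# Elbow Method: fit KMeans for each candidate k and record the inertia
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(df1)
    inertias[k] = model.inertia_

# Inertia falls as k grows; the "elbow" is where the drop levels off.
# Here the data has exactly 5 natural groups, so inertia hits zero at k = 5.
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```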

#Sample 5 reports from cluster
Cluster: 2 Size: 56
MCP3243001R ROCKVILLE PIKE
MCP3350001M ROCKVILLE PIKE
DD559500FC ROCKVILLE PIKE
DD5628005M ROCKVILLE PIKE
MCP31430050 ROCKVILLE PIKE
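A per-cluster sample like the one above can be produced by joining the cluster labels back to the road names and sampling a few reports. A sketch with hypothetical labels and report IDs (only the three Rockville Pike IDs are taken from the output above):

```python
import pandas as pd

# Hypothetical cluster labels indexed by report number, mirroring report_cluster_df
report_cluster_df = pd.DataFrame(
    {"cluster": [2, 2, 2, 0, 1]},
    index=["MCP3243001R", "MCP3350001M", "DD559500FC", "X1", "X2"],
)

# Hypothetical road names for the same reports
roads = pd.Series(
    ["ROCKVILLE PIKE", "ROCKVILLE PIKE", "ROCKVILLE PIKE",
     "GEORGIA AVE", "VEIRS MILL RD"],
    index=report_cluster_df.index, name="road_name",
)

# Join labels to road names, then sample up to 5 reports from one cluster
joined = report_cluster_df.join(roads)
cluster2 = joined[joined["cluster"] == 2]
print("Cluster: 2 Size:", len(cluster2))
print(cluster2.sample(n=min(5, len(cluster2)), random_state=0)["road_name"])
```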

Limitations

One limitation of this analysis is its lack of depth: I used only the road name feature from each report to cluster the data. This makes the counts for each report very small and less informative than if I were to group the reports by vehicle make and then cluster them by road name, which may be a next step for improving this technique. There is also a possibility of bias due to the scale of the data. Only 2,800+ reports were clustered, a little more than half of those originally collected from the API. Since the original 5,100 reports cover the past three months, less than two months' worth of reports were effectively clustered, which may not accurately reflect the typical environment on local Montgomery County roads.
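The "group by vehicle make, then cluster by road name" idea could start from a road-by-make count matrix. A sketch of building such features with pd.crosstab, assuming the API exposes a vehicle make field (the column name "vehicle_make" and the data below are illustrative, not confirmed from the dataset):

```python
import pandas as pd

# Hypothetical reports; "vehicle_make" is an assumed column name
df = pd.DataFrame({
    "road_name": ["ROCKVILLE PIKE", "ROCKVILLE PIKE",
                  "GEORGIA AVE", "GEORGIA AVE"],
    "vehicle_make": ["TOYOTA", "HONDA", "TOYOTA", "TOYOTA"],
})

# Road x make count matrix: one row per road, one column per make,
# giving each road a richer feature vector than a single indicator
features = pd.crosstab(df["road_name"], df["vehicle_make"])
print(features)
```

Feeding a matrix like this to KMeans would cluster roads by their incident/make profiles rather than by name alone.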

GitHub Repository
