Modeling Crime Rates in Maryland Counties

Kimberly Escobar
INST414: Data Science Techniques
9 min read · Dec 17, 2023

After finding a dataset containing information about violent and property crime rates in Maryland, I wanted to extract insights on latent trends in this topic. The data has a date/time structure, so I wanted to use this to my advantage and highlight changes in certain types of crimes committed throughout the state of Maryland. I also wanted to find which types of crimes were more prevalent in a given Maryland county (since the data is organized by county).

Creating these insights will help me solve the information problem of understanding crime occurrences in Maryland. I address this with visualizations of crime occurrences from 1975 to 2020. By understanding the likelihood of a type of crime occurring in a certain area, a person can make decisions tailored to preventing that sort of crime. Those most concerned with this kind of action are law enforcement and the policymakers who set budgets for this field.

Public service officials will always have to stay informed about criminal activity as long as we have a functioning government. Safety precautions have to be taken to prevent crime. To keep policing operations streamlined, analysis must be done to appropriately address criminal activity in the community. With more data models emerging to assist public service work, it is important to ensure these models are useful and insightful.

Data Collection

The source of my data is Data.gov. From there, I navigated to Maryland’s Open Data Portal, where I had access to the API endpoint for the dataset. That webpage also supplies documentation for the API, describing its background and the dataset’s characteristics. As the source describes, the data was compiled from the Maryland State Police’s annual Uniform Crime Reports. It has 38 columns/features, and each of its 1,104 rows represents one year for one Maryland county. The features combine statistical rates and total numbers of crimes committed. The violent and property crimes reported in the data are: murder, rape, robbery, aggravated assault, breaking and entering, larceny theft, and motor vehicle theft. The data also supplies the total population of each county for every year.
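For reference, here is a minimal sketch of how the dataset could be pulled into pandas from a Socrata-style endpoint like the one on the Open Data Portal. The resource ID below is a placeholder, not the actual dataset ID.

import pandas as pd

# Placeholder Socrata-style endpoint on the Maryland Open Data Portal;
# the resource ID is illustrative, not the real dataset ID
API_ENDPOINT = "https://opendata.maryland.gov/resource/xxxx-xxxx.json"

# Socrata pages results by default, so request enough rows to cover all 1,104 records
df = pd.read_json(API_ENDPOINT + "?$limit=2000")

print(df.shape)    # expect 1,104 rows and 38 columns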

Some other subject-matter expertise I needed to support my analysis was familiarity with certain terminology. For example, the data describes statistics for different types of crimes like larceny theft and breaking and entering. I wanted to be sure of what classifies a crime as such so I could measure them against each other. I did this by reviewing the contents of a Uniform Crime Report completed in 2020. In this document, I was able to find a definition for each crime. This helped me understand what specifically has to occur for a crime to be classified as robbery, aggravated assault, and so on.

After collecting the data, I made sure to clean it of any technical errors. The main issue I found was that no crime rate percent changes were reported for any county in 1975. Since the data starts in 1975, there is no 1974 baseline to compare against. I filled these null values with 0 to prevent complications in later calculations. I then normalized the crime totals for every county, making sure to exclude unrelated features.
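A minimal sketch of that cleaning step, assuming the percent-change columns share the naming pattern used later in the analysis and that the raw totals live in columns like 'murder' and 'robbery' (the '_normalized' column names here are my own):

# 1975 rows have no 1974 baseline, so their percent-change values are null;
# fill them with 0 to avoid problems in later calculations
percent_change_cols = [c for c in df.columns if 'percent_change' in c]
df[percent_change_cols] = df[percent_change_cols].fillna(0)

# normalize the raw crime totals by each county's population for that year
# (the raw-total and normalized column names are assumptions)
crime_total_cols = ['murder', 'rape', 'robbery', 'agg_assault',
                    'b_e', 'larceny_theft', 'm_v_theft']
for col in crime_total_cols:
    df[col + '_normalized'] = df[col] / df['population']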

Key Ideas

The methods I applied for this analysis are KMeans clustering, cosine similarity to compare counties to each other, and predictive modeling. Each method supports the insights I tried to form in a different way.

Here are all the libraries I utilized for my analysis:

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

KMeans Clustering

I decided to cluster the data to see how each county (by year) is grouped. I first approached the KMeans method by using the “Elbow Method” to determine a value for k. After using this method, it seemed best to choose a k value under 10.
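The elbow plot itself is not reproduced here, but the approach can be sketched roughly as follows: fit KMeans over a range of k values on the count matrix built in the code block below (df_matrix) and look for where the drop in inertia starts to level off.

# Elbow method: fit KMeans for several k values and plot the inertia
# (within-cluster sum of squares); df_matrix is built further down
inertias = []
k_values = range(1, 15)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(df_matrix)
    inertias.append(model.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow Method')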

I wanted to cluster the Maryland counties by the percent change in every type of crime. I did this by combining all the statistics into one column for easier pre-processing. I also used the cosine similarity metric to compare each county. After this, I chose a k value of 9 to continue the clustering process. Since I wanted to create a visualization representing all of the clusters, I had to perform dimensionality reduction using the TruncatedSVD method from sklearn’s decomposition module, because the original count matrix had too many dimensions. After doing so, I was able to plot the data and see the clear clusters formed by KMeans.

Here is the code supporting this process:

# create combined features column with all percent change rates
def combined_features(row):
    # concatenate every percent-change rate into a single string
    return (str(row['murder_rate_percent_change_per_100_000_people']) + " " +
            str(row['rape_rate_percent_change_per_100_000_people']) + ", " +
            str(row['robbery_rate_percent_change_per_100_000_people']) + ", " +
            str(row['agg_assault_rate_percent_change_per_100_000_people']) + ", " +
            str(row['b_e_rate_percent_change_per_100_000_people']) + ", " +
            str(row['larceny_theft_rate_percent_change_per_100_000_people']) + ", " +
            str(row['m_v_theft_rate_percent_change_per_100_000_people']))

df['all_change_rates'] = df.apply(combined_features, axis=1)

# vectorize the combined rate strings into a count matrix
cv = CountVectorizer()
count_matrix = cv.fit_transform(df['all_change_rates'])
df_matrix = pd.DataFrame(count_matrix.toarray(), columns=cv.get_feature_names_out())
print(df_matrix)

# set similarity metric
cosine_sim = cosine_similarity(count_matrix)

k=9

# apply KMeans clustering
cluster_model = KMeans(n_clusters=k)
cluster_model.fit(df_matrix)

cluster_labels = cluster_model.predict(df_matrix)
county_cluster_df = pd.DataFrame(cluster_labels, index=df_matrix.index, columns=["cluster"])

county_cluster_df["cluster"].value_counts()

df2 = df.reset_index()
for cluster, county in county_cluster_df.groupby("cluster"):
    print("Cluster:", cluster, "Size:", county.shape[0])

    # sample 5 items from each cluster
    for c_id in county.sample(5).index:
        print("\t", c_id, df2['jurisdiction'].iloc[c_id], df2['year'].iloc[c_id])

# reduce matrix and plot points
svd = TruncatedSVD(n_components=2)
svd.fit(df_matrix)
matrix_reduced = svd.transform(df_matrix)

plt.scatter(matrix_reduced[:,0], matrix_reduced[:,1], c=cluster_labels)

Given the graph, I was able to notice the clear groupings the data formed. This helped direct my next analysis, which was comparing counties and years to find similar ones.

Ranking Maryland Counties

To obtain insightful tables, I first generated the distances between each (county, year) element. I then chose three different query/target counties: Baltimore City (1982), Montgomery County (2005), and St. Mary’s County (2011). After doing so, I was given the following three tables, each displaying the top 10 most similar counties and years.


- Baltimore City 1982
Baltimore City 1982 1.0000000000000002
Baltimore County 1999 0.8728715609439696
Baltimore County 1982 0.8660254037844388
Baltimore City 1977 0.816496580927726
Baltimore County 1996 0.816496580927726
Montgomery County 1990 0.816496580927726
Charles County 1996 0.7071067811865477
Anne Arundel County 1987 0.6666666666666669
Anne Arundel County 2001 0.6666666666666669
Anne Arundel County 2019 0.6666666666666669
Baltimore City 1980 0.6666666666666669

- Montgomery County 2005
Montgomery County 2005 1.0000000000000002
Montgomery County 1990 0.816496580927726
Anne Arundel County 2019 0.6666666666666669
Baltimore City 1982 0.6666666666666669
Baltimore County 2018 0.6666666666666669
Prince George's County 2013 0.6666666666666669
Wicomico County 1987 0.6666666666666669
St. Mary's County 1982 0.6546536707079772
Harford County 2011 0.6405126152203486
Baltimore City 1993 0.5773502691896258
Baltimore City 1998 0.5773502691896258

- St. Mary's County 2011
St. Mary's County 2011 0.9999999999999997
Cecil County 2004 0.6546536707079772
Talbot County 1976 0.6546536707079772
Howard County 1995 0.5669467095138407
Cecil County 1978 0.50709255283711
Caroline County 1985 0.4629100498862757
Anne Arundel County 2015 0.4364357804719848
Carroll County 1990 0.4364357804719848
Montgomery County 1996 0.4364357804719848
Montgomery County 1980 0.4285714285714285
Talbot County 2005 0.4285714285714285

In these three tables, some patterns emerge. For example, Baltimore City in 1982 had percent changes in crime similar to Baltimore County’s percent changes in 1999 and 1996. While the two jurisdictions are very close in location, it is important to ask why there are not more “Baltimore County/City” elements in this chart. Instead, Anne Arundel County accounts for three of Baltimore City 1982’s top ten similar elements. Knowing the specific year for each “similar” county is important, as it helps direct further investigation toward understanding why the data is grouped this way. A possible next step could be researching events and/or legislation that may have led to certain percent changes increasing or decreasing. These tables also help compare counties and form insights about possible demographic relationships.

Here is the code supporting this process:

df1 = df.reset_index()
similarity_dict = {}

for index, row in df1.iterrows():
    # generate distances from target county to all other counties
    similar_counties = list(enumerate(cosine_sim[index]))

    # sort to get most similar counties first
    sorted_similar_counties = sorted(similar_counties, key=lambda x: x[1], reverse=True)

    similarity_dict[index] = sorted_similar_counties

# sample similarities for target county and year

# Baltimore City, 1982
target_similar_counties = similarity_dict[99]
print("\n-", df1['jurisdiction'].loc[99], df1['year'].loc[99])

# print top ten most similar counties
i = 0
for county in target_similar_counties:
    print(df1['jurisdiction'].loc[county[0]], df1['year'].loc[county[0]], county[1])
    i = i + 1
    if i > 10:
        break

# Montgomery County, 2005
target_similar_counties = similarity_dict[720]
print("\n-", df1['jurisdiction'].loc[720], df1['year'].loc[720])

# print top ten most similar counties
i = 0
for county in target_similar_counties:
    print(df1['jurisdiction'].loc[county[0]], df1['year'].loc[county[0]], county[1])
    i = i + 1
    if i > 10:
        break

# St. Mary's County, 2011
target_similar_counties = similarity_dict[910]
print("\n-", df1['jurisdiction'].loc[910], df1['year'].loc[910])

# print top ten most similar counties
i = 0
for county in target_similar_counties:
    print(df1['jurisdiction'].loc[county[0]], df1['year'].loc[county[0]], county[1])
    i = i + 1
    if i > 10:
        break

I also calculated and plotted the normalized crime totals for each Maryland county in 2020. I chose a horizontal bar chart to plot this data, as it helped me compare and highlight crime prevalence for each county.

Table:

Horizontal Bar Chart:
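A rough sketch of how that table and chart could be produced, assuming the normalized totals sit in columns ending with '_normalized' (as in the cleaning sketch above), that 'jurisdiction' and 'year' are available after resetting the index, and that 'year' is numeric:

# keep only the most recent year and the normalized crime columns
normalized_cols = [c for c in df.columns if c.endswith('_normalized')]
df_plot = df.reset_index()
df_2020 = df_plot[df_plot['year'] == 2020].set_index('jurisdiction')[normalized_cols]
print(df_2020)    # table of normalized totals per county

# horizontal bar chart comparing crime prevalence across counties
df_2020.plot.barh(stacked=True, figsize=(10, 12))
plt.xlabel('Crimes per resident (normalized totals)')
plt.title('Normalized crime totals by Maryland county, 2020')
plt.tight_layout()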

Population Predictive Modeling

To complement the unsupervised methods above with a predictive model, I structured and tested a model that predicts the total population of each county, using the following features:

'grand_total', 'violent_crime_total', 'violent_crime_percent', 
'property_crime_totals', 'property_crime_percent',
'overall_crime_rate_per_100_000_people',
'violent_crime_rate_per_100_000_people',
'property_crime_rate_per_100_000_people',
'murder_per_100_000_people', 'rape_per_100_000_people',
'robbery_per_100_000_people', 'agg_assault_per_100_000_people',
'b_e_per_100_000_people', 'larceny_theft_per_100_000_people',
'm_v_theft_per_100_000_people', 'percent_change',
'violent_crime_percent_change', 'property_crime_percent_change',
'overall_percent_change_per_100_000_people',
'violent_crime_rate_percent_change_per_100_000_people',
'property_crime_rate_percent_change_per_100_000_people',
'murder_rate_percent_change_per_100_000_people',
'rape_rate_percent_change_per_100_000_people',
'robbery_rate_percent_change_per_100_000_people',
'agg_assault_rate_percent_change_per_100_000_people',
'b_e_rate_percent_change_per_100_000_people',
'larceny_theft_rate_percent_change_per_100_000_people',
'm_v_theft_rate_percent_change_per_100_000_people', 'population'.

I decided not to include the normalized crime totals, as I wanted to test creating the model using only the statistics provided in the data. After splitting the data into training and testing sets with an 80/20 split, I used Jaccard similarity to form the predictions. My accuracy score came out to only 0.1583710407239819.
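The prediction code in the next block assumes training_ids and testing_ids already exist. A minimal sketch of that 80/20 split, assuming the selected features live in a DataFrame named df_population_rates indexed by (jurisdiction, year):

from sklearn.model_selection import train_test_split

# split the (jurisdiction, year) index into 80% training and 20% testing ids
training_ids, testing_ids = train_test_split(
    list(df_population_rates.index), test_size=0.2, random_state=42)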

Here is the code supporting this process:

k_nn = 10
prediction_rows = []

# find k most similar counties in the training set
for test_id in testing_ids:

    target_obj = set(df_population_rates.loc[test_id[0], test_id[1]].values)

    test_county_similarities = []

    for this_id in training_ids:
        this_obj = set(df_population_rates.loc[this_id[0], this_id[1]].values)

        this_intersect = this_obj.intersection(target_obj)
        this_union = this_obj.union(target_obj)

        jaccard = len(this_intersect) / len(this_union)

        test_county_similarities.append({
            "county": this_id,
            "jaccard": jaccard,
        })

    similarity_df = pd.DataFrame(test_county_similarities, columns=["county", "jaccard"])
    top_counties = similarity_df.sort_values(by="jaccard", ascending=False).head(k_nn)

    relevant_populations = []
    for this_id in top_counties["county"]:

        # add relevant populations
        relevant_populations.append(df['population'].loc[this_id[0], this_id[1]])

    # take average total population
    predicted_populations = np.mean(relevant_populations)

    # track predictions
    prediction_rows.append({
        "county": test_id,
        "population": predicted_populations,
    })

predicted_df = pd.DataFrame(prediction_rows).set_index("county")

correct_populations = []

# count number of accurate predictions
for county, row in predicted_df.iterrows():

    actual_population = df['population'].loc[county[0], county[1]]

    # mark correct if the prediction is within 40,000 of the actual total population
    correct_populations.append(1 if actual_population - 40000 <= row['population'] <= actual_population + 40000 else 0)

accuracy = sum(correct_populations) / len(correct_populations)

Some notes on this modeling:

  • I marked a prediction correct if it was within 40,000 of the actual total population. This is because I did not think such a model could produce exact population figures, so I allowed a range instead.
  • After reflection, this model’s accuracy could likely be improved by using different features to compare each county/year. I feel that including demographic data/features specifically would improve this model.

Project Outcome

After utilizing different methods to form meaningful and actionable insights, I was able to complete the following:

  • Exposing latent patterns like trends in crime rate changes
  • Clustering similar elements, relating counties by year to help direct future analysis
  • Presenting crime prevalence for each Maryland county in the most recent year

For the last item, this was only completed for the most recent year in the data, which was 2020, but the code could easily be updated to create tables/visualizations for every year in the data. This method of analysis (finding similarities), paired with KMeans clustering, helped support my goal of solving the information problem of understanding crime occurrences. Studying this data, especially at a county scale, provides insights that support different decisions than analyzing city-level data would. The data I collected supports law enforcement officials and county/state legislators who have a responsibility to enforce and implement policies that better the community. Analyzing both property and violent crime rates helps inform decisions to create preventative programs that ultimately improve public safety.

Limitations

While the data I collected provides important features like crime percentages and total population for each Maryland county, I do feel the accuracy and depth of my project would be greatly improved by including demographic data. At the start of this project, when I was first brainstorming ideas, there were some ethical implications I had to consider. This may be one reason I was not able to find suitable demographic data that also related to criminal activity in each Maryland county. It also made it a struggle to create a predictive model using the information I had. While I had some successful predictions according to the accuracy score, the feature values being numerical instead of categorical may also be a reason for the inaccuracy of my predictive model. My goal was to use historical crime rate data to predict population outcomes, but the data proved insufficient for that task.

GitHub Repository
