Classifying Types of Crime

Brandon Fung
INST414: Data Science Techniques
5 min read · Apr 28, 2024

Introduction

In the realm of urban safety and crime prevention, leveraging machine learning to predict crime types presents a transformative opportunity for law enforcement agencies. The core question guiding this project is: “Can we predict the type of crime in a given area based on location data and demographics?” This question is particularly pertinent for local police departments seeking to allocate resources more effectively and enhance proactive measures.

Dataset

The ideal dataset would be comprehensive, encompassing fields such as the time of the crime, its location, the type of crime, demographic data, and economic indicators for the area. Ground-truth labels (the type of crime) should be generated from verified police reports and categorized by crime analysts to ensure reliability. By accurately predicting crime types, this model would empower law enforcement officials to tailor their strategies, optimize patrol routes, and ultimately make informed decisions to curb crime rates effectively.

The actual data was collected from Socrata's SODA API. Specifically, I collected over 3 million rows of crime data for NYC. For my analysis, however, I used only 1 million rows so that my machine could handle the load in a reasonable amount of time. Features in this dataset include location information, responding officer details, and suspect and victim demographics. Ground-truth labels came directly from the API, which compiles crime reports from the New York Police Department. I am not entirely sure how the department classifies crimes, but I imagine each is categorized based on the set of laws that were broken.
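The SODA API supports `$limit`/`$offset` query parameters for paging through large datasets. Here is a minimal stdlib-only sketch of how the collection step could look; the dataset id in the URL is a placeholder, not the real endpoint.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder endpoint -- substitute the actual NYPD complaint dataset id.
SODA_URL = "https://data.cityofnewyork.us/resource/<dataset-id>.json"

def page_offsets(max_rows, page_size):
    """Offsets needed to page through the API with $limit/$offset."""
    return list(range(0, max_rows, page_size))

def fetch_crime_rows(max_rows=1_000_000, page_size=50_000):
    """Download up to max_rows records, one page at a time."""
    rows = []
    for offset in page_offsets(max_rows, page_size):
        query = urlencode({"$limit": page_size, "$offset": offset})
        with urlopen(f"{SODA_URL}?{query}") as resp:
            page = json.load(resp)
        if not page:  # past the last record
            break
        rows.extend(page)
    return rows
```

Paging is important here: a single request for 1 million rows would likely time out, while 50,000-row pages stay well within the API's comfort zone.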

Model

For my project, I will be classifying property and violent crimes due to their drastically different nature. Since these types of crime are so distinct, the model should be more accurate and reliable in telling them apart. I chose a classification model since the ground-truth label I am hoping to predict is categorical, not continuous; as specified earlier, that label is the type of crime committed. Finally, I chose a random forest classifier since it is simple to use, robust, and usually better at dealing with imbalanced data because it is an ensemble method.

Here are the specific features I used to train my model:

  - the type of crime committed (the ground-truth label)
  - the jurisdiction where the crime occurred
  - the x and y coordinates of the crime
  - the victim's and suspect's race, sex, and age group
  - the parsed date and time of the crime
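The training setup can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic stand-in data, not the actual feature matrix; the shapes and the toy label rule are assumptions made only so the snippet runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the one-hot-encoded crime features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 8))  # binary location/demographic features
y = X[:, 0] | X[:, 1]                  # toy label: 1 = "violent", 0 = "property"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```

With the real data, `X` would be the one-hot-encoded DataFrame and `y` the violent/property label column.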

After training my model, here are five instances that my model did not predict correctly:

  1. Crime 172573 — Property Crime, but predicted as Violent Crime
  2. Crime 105368 — Property Crime, but predicted as Violent Crime
  3. Crime 149225 — Property Crime, but predicted as Violent Crime
  4. Crime 138990 — Violent Crime, but predicted as Property Crime
  5. Crime 264229 — Violent Crime, but predicted as Property Crime

Looking at the value counts for misclassified crimes, the model clearly struggles more with property crimes than with violent crimes (8,083 property crimes falsely predicted as violent versus 4,080 violent crimes falsely predicted as property). My theory is that property crimes rely more heavily on location details than violent crimes do. Perhaps property crimes occur in more densely populated and wealthier parts of the city, while violent crimes may have more contributing factors not captured in the dataset.
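Those false-prediction tallies can be reproduced with a small helper that counts mispredictions by the class the model wrongly chose; the toy labels below are illustrative, not the real predictions.

```python
from collections import Counter

def false_prediction_counts(y_true, y_pred):
    """Tally mispredictions by the class they were wrongly predicted as."""
    return Counter(pred for true, pred in zip(y_true, y_pred) if true != pred)

# Toy example -- in practice y_true/y_pred come from the test set.
truth = ["Property", "Property", "Violent", "Property"]
guess = ["Violent", "Property", "Property", "Violent"]
counts = false_prediction_counts(truth, guess)
# Counter({'Violent': 2, 'Property': 1})
```

Comparing `counts["Violent"]` to `counts["Property"]` on the real predictions is what surfaces the 8,083-versus-4,080 imbalance described above.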

Conclusion

My model could be considered reliable from a strictly programming standpoint, boasting an F1 score of 0.77 for violent crimes and 0.83 for property crimes. Because of this, I would say my model answers my initial question: it can predict the type of crime based on location and demographic data. However, someone with more domain knowledge could give a better assessment of the model's strength.

Data Cleaning

I cleaned the data by dropping irrelevant features, rows containing empty data, and rows deemed outliers. A row was treated as an outlier if any of its values occurred in less than 1% of rows for that column. Then, I manually classified crimes as violent if they were described as assault or robbery, and as property crimes if they were described as grand larceny, theft, or mischief. I later dropped the offense description column so the model would make predictions strictly from location and demographics. Finally, I encoded all categorical variables using one-hot encoding.
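The rare-value filter and the manual labeling step could look something like the sketch below. The description strings and column names are assumptions; the real dataset's wording may differ.

```python
import pandas as pd

# Assumed offense descriptions -- the real dataset's strings may differ.
VIOLENT = {"ASSAULT", "ROBBERY"}
PROPERTY = {"GRAND LARCENY", "THEFT", "CRIMINAL MISCHIEF"}

def drop_rare_rows(df: pd.DataFrame, min_share: float = 0.01) -> pd.DataFrame:
    """Drop empty rows, then rows holding a categorical value seen in
    less than min_share of rows for that column."""
    df = df.dropna()
    for col in df.select_dtypes(include="object"):
        freq = df[col].value_counts(normalize=True)
        df = df[df[col].map(freq) >= min_share]
    return df

def label_crime(desc: str) -> str:
    """Map an offense description to the binary ground-truth label."""
    if desc in VIOLENT:
        return "Violent"
    if desc in PROPERTY:
        return "Property"
    return "Other"
```

After labeling, the description column itself would be dropped and the remaining categorical columns passed through `pd.get_dummies` for the one-hot encoding.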

Common Bugs

A common bug I encountered was being unable to parse components such as quarter, month, and day from the date feature. This was because the date variable was initially stored as a string; once I converted it to a datetime type, the parsing worked.
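In pandas terms, the fix is a `pd.to_datetime` call before using the `.dt` accessors; the example date strings and format below are illustrative, not the dataset's actual format.

```python
import pandas as pd

# Raw dates arrive as strings, so .dt accessors fail until converted.
dates = pd.Series(["04/28/2021 02:15:00 PM", "12/03/2021 09:40:00 AM"])
parsed = pd.to_datetime(dates, format="%m/%d/%Y %I:%M:%S %p")

quarters = parsed.dt.quarter  # 2, 4
months = parsed.dt.month      # 4, 12
days = parsed.dt.day          # 28, 3
```

Passing an explicit `format` string also makes the conversion faster and guards against pandas silently guessing the wrong day/month order.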

Limitations

The biggest limitation of my analysis is that it is restricted to only 1 million rows, which covers a little over a year of crime in NYC. To be more robust, it would be advantageous to use data spanning more years, but my machine could only handle so much. As a result, my model is biased toward the year 2021, since that is where most of the data comes from. Another limitation is that I intentionally tried to predict crime types with only location and demographic data. The fact of the matter is that crime is much more complex than that; various other factors matter, such as socioeconomic variables, and including them might have made the model better at predicting these types of crimes. Additionally, I used the raw x and y coordinates as location variables, which is definitely not best practice. With more time, I could transform the data so the model treats them as coordinates with geographic significance rather than just integers.
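One simple version of that transformation, assuming the x/y values are projected planar coordinates (NYC open data typically uses the State Plane system, in feet), is to replace the raw coordinates with distances to meaningful landmarks; the reference point below is hypothetical.

```python
import math

def planar_distance(x, y, ref_x, ref_y):
    """Euclidean distance between planar coordinates (e.g. State Plane feet).

    Distance to a landmark carries real geographic meaning, unlike the
    raw coordinate integers the model would otherwise see.
    """
    return math.hypot(x - ref_x, y - ref_y)

# e.g. distance from one crime to a fixed (hypothetical) reference point
d = planar_distance(1000.0, 2000.0, 400.0, 1200.0)  # 1000.0
```

Features like "distance to the precinct" or "distance to the city center" give the model a notion of spatial proximity that raw coordinates do not.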

You can find the code for my analysis here.
