Exploring NYC - Analysis of Crime Data in New York City

Using NYPD Complaint Data to analyze crime events in different boroughs.

Jeff, Lu Chia-Ching
Nov 7 · 9 min read
Photo by Mark Asthoff on Unsplash

Introduction

When I moved to New York City, I kept hearing the same advice from different people: watch out for your personal safety in this city. Even though NYC has historically become a much safer place than before (see the troll tourist guide to New York in the 1970s: ‘Welcome to Fear City’ - the inside story of New York’s civil war, 40 years on), the city still leaves a fun-filled but somewhat terrifying impression on locals and on people all around the world.
For example, here is the total crime by year from 1985 to 2014, showing the decrease in NYC crime since the 1980s:

data from UCR Statistics

Despite all this talk, I don’t think I have ever heard anyone, even native New Yorkers, bring up any data source or statistic to back up their overall impression of borough safety. Being an analytics student, I had an itch to dive into this topic and find out what the data actually says.

What is the distribution of Crime in each borough?

Since this is a broad topic that can be analyzed from many angles, I decided to focus on my most pressing question: do the boroughs differ significantly in crime level and crime type?

People in NYC always say that some boroughs are safer than others. However, the boroughs differ a lot in population (Staten Island, for instance, has far fewer residents than Brooklyn), so raw crime counts alone cannot tell us which borough is safer. With this in mind, I decided to dive into the question using several analytical techniques.

New York City has five boroughs: Manhattan, Bronx, Brooklyn, Queens & Staten Island. Photo by nycmap360.com

Dataset Used

For this project, I used the NYPD Complaint Data from NYC OpenData. It contains the historical NYC crime data reported by the NYPD through 2019, and I extracted the 2018 data from it to avoid any missing data in 2019. This exploratory analysis was conducted with Python and Tableau.

Data Pre-processing

First Look at the Data

  • number of observations: 452,997

Re-categorize Crime Description to Crime Type

After taking a good look at the data and removing NaN values, I realized that this dataset is not ideal for analysis, even though it contains all kinds of variables, such as the time of each crime and even data on the suspect. The main reason is that the crime type data is really confusing.

E.g., for a single event, the OFNS_DESC column could be “HARRASSMENT 2” while the PD_DESC column could be “HARASSMENT,SUBD 1,CIVILIAN”, which makes the data really hard to analyze.

The U.S. Department of Justice administers two statistical programs: the Uniform Crime Reporting (UCR) Program and the National Crime Victimization Survey (NCVS). However, the NYPD seems to have its own approach to recording crime events, so I can’t interpret the data with an established method. Additionally, multiple columns describe the same crime event with confusing descriptions.

Therefore, I decided to re-categorize the crime events based on these two systems and my own understanding. This reduced the number of unique crime types to 21 (the original dataset had at least 59) and made them more understandable.

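# new_type_cat: list of lists of original OFNS_DESC values, one inner list per new category
# new_type_name: list of new category names, in the same order as new_type_cat
# Flatten both into a single {original description: new category} lookup
new_dic = {old_desc: name for old_descs, name in zip(new_type_cat, new_type_name) for old_desc in old_descs}
df['new_category'] = df['OFNS_DESC'].map(new_dic)
df['new_category'].nunique()  # 21
# The number of unique categories has been narrowed down to 21
# The detailed methodology and approach can be found on my GitHub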
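
To make the re-categorization step concrete, here is a minimal, self-contained sketch of the same lookup idea in plain Python. The group contents and sample values are invented for illustration (the real groupings live in the GitHub repository), and `recategorize` is a hypothetical helper, not part of the original code.

```python
# Hypothetical groups of original OFNS_DESC values; the real lists are longer.
new_type_cat = [
    ["HARRASSMENT 2"],
    ["PETIT LARCENY", "GRAND LARCENY"],
]
# One new label per group, in the same order as the groups above.
new_type_name = ["Harassment", "Larceny_Theft"]

# Flatten the groups into a single {old description: new label} lookup.
new_dic = {
    old_desc: name
    for old_descs, name in zip(new_type_cat, new_type_name)
    for old_desc in old_descs
}


def recategorize(descriptions, lookup):
    """Return the new label for each description, or None when it is unmapped."""
    return [lookup.get(desc) for desc in descriptions]


sample = ["GRAND LARCENY", "HARRASSMENT 2", "SOMETHING ELSE"]
labels = recategorize(sample, new_dic)
# Descriptions missing from every group are worth listing before any analysis.
unmapped = [desc for desc, label in zip(sample, labels) if label is None]
print(labels)
print(unmapped)
```

With pandas, `df['OFNS_DESC'].map(new_dic)` behaves the same way and leaves NaN for any description missing from the lookup, so checking for NaN after mapping is a cheap way to catch categories that were overlooked.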

Exploratory Analysis

Total Number of Crime: Not really what I expected

According to this visualization, Brooklyn has the highest overall number of crime events, and the Bronx, often considered a rougher area, has fewer crime events than Manhattan. This was a surprising finding for me.

With the projected population of each borough (sourced from the Department of City Planning), I found that although Brooklyn has the most events, Manhattan and the Bronx both have more crime events per resident (the Bronx has the highest rate, 6,887 per 100k residents). Another interesting point is that Queens has the lowest rate (3,849 per 100k residents), even lower than Staten Island (4,269 per 100k residents).
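
The per-resident comparison above is a simple normalization. As a rough sketch, assuming one event count and one population figure per borough, it could look like the following. The borough names and numbers are placeholders for illustration only, not the 2018 counts or the Department of City Planning projections, and `rate_per_100k` is a hypothetical helper.

```python
def rate_per_100k(events, population):
    """Crime events per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return events * 100_000 / population


# Placeholder figures for illustration only (not real data).
example_counts = {"Borough A": 50_000, "Borough B": 30_000}
example_population = {"Borough A": 1_000_000, "Borough B": 400_000}

rates = {
    name: rate_per_100k(example_counts[name], example_population[name])
    for name in example_counts
}
# Rank from highest to lowest rate to see how it differs from the raw-count ranking.
ranking = sorted(rates, key=rates.get, reverse=True)
print(rates)
print(ranking)
```

In this toy case the borough with fewer events ends up with the higher rate, which is the same kind of reversal seen between Brooklyn and the Bronx.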

But this isn’t the end. I would still like to do some more exploratory visualizations to get more insight from the data.

1. Level of Crime: Understanding the Distribution

Going back to the total count of events, I used the same visualization to look in more detail at the level of crime in each borough (there are three levels of crime in New York State: Violation, Misdemeanor and Felony). From the graph below, I can tell that Misdemeanor, an offense for which a sentence of more than 15 days but not more than one year may be imposed, is the most common level of crime in every borough, and it makes up a similar share in each (about 52% to 57%; Manhattan has the highest at 57.2%). The second most common is Felony, a serious offense for which the sentence can exceed one year, and the third is Violation, a lesser offense for which the sentence cannot exceed 15 days.

This gives us the following interesting information:

  1. Staten Island has a substantially lower percentage of Felony than the other four boroughs (which are at about 31%). This could mean that Staten Island is a much more peaceful area, with not only fewer crime events overall but also less serious ones. (It also means that Violation-level crime makes up a much higher percentage in Staten Island than in the other boroughs.)
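
The within-borough percentages discussed above are row-wise shares of a borough by level table. Below is a minimal sketch of that calculation in plain Python, using toy records rather than the real data; `level_shares` is a hypothetical helper.

```python
from collections import Counter


def level_shares(records):
    """Percent of each crime level within each borough.

    `records` is an iterable of (borough, level) pairs.
    """
    counts = Counter(records)
    totals = Counter()
    for (borough, _level), n in counts.items():
        totals[borough] += n
    return {
        (borough, level): 100 * n / totals[borough]
        for (borough, level), n in counts.items()
    }


# Toy records for two made-up boroughs (not real data).
toy = (
    [("X", "MISDEMEANOR")] * 5
    + [("X", "FELONY")] * 3
    + [("X", "VIOLATION")] * 2
    + [("Y", "MISDEMEANOR")] * 3
    + [("Y", "VIOLATION")] * 1
)
shares = level_shares(toy)
print(shares[("X", "FELONY")])
print(shares[("Y", "MISDEMEANOR")])
```

With a pandas DataFrame, `pd.crosstab` on the borough and level columns with `normalize='index'` should give the same row-wise shares as fractions.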

Statistics: Does the frequency of crime at each level differ between boroughs?

To compare the distribution of crime levels across boroughs, a Chi-squared test of independence can be used, with the null hypothesis that “the borough in which a crime occurs is independent of its level-of-crime classification.”

The test result (p=0.00) shows a significant relationship between the two variables, meaning the boroughs have different distributions of crime level. The table of standardized residuals shows that most observed counts are distinctly different from their expected values.

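import scipy.stats as stats
import statsmodels.api as sm

# two_way_table2: borough x level-of-crime contingency table of counts (not shown here)
chi2, p, dof, expected = stats.chi2_contingency(observed=two_way_table2)
print('chi-square statistic:', chi2)
print('p-value:', p)
print('degrees of freedom:', dof)

# Standardized residuals show which cells deviate most from the expected counts
table2 = sm.stats.Table(two_way_table2)
print(table2.standardized_resids)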
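
To show what these two calls compute, here is a hand-rolled sketch of the Chi-squared statistic and the adjusted standardized residuals for a small table. It uses only the standard library and a toy 2x2 table (not the real borough data); `chi_square_independence` is a hypothetical helper, and the p-value lookup is left out.

```python
import math


def chi_square_independence(table):
    """Chi-squared statistic, degrees of freedom and adjusted standardized
    residuals for a two-way table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
            # Adjusted residual: Pearson residual scaled by the row and column margins.
            scale = (1 - row_totals[i] / total) * (1 - col_totals[j] / total)
            res_row.append((observed - expected) / math.sqrt(expected * scale))
        residuals.append(res_row)

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof, residuals


# Toy 2x2 table (not real data).
toy_table = [[10, 20], [20, 10]]
chi2_stat, dof, residuals = chi_square_independence(toy_table)
print(chi2_stat, dof, residuals[0][0])
```

One caveat when comparing against SciPy: `chi2_contingency` applies Yates' continuity correction by default when there is one degree of freedom, so on a 2x2 toy table like this its statistic will differ unless `correction=False` is passed. A table with five boroughs and three levels has more degrees of freedom and is not affected.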

Therefore, I could further confirm a few things:

  1. The Bronx does have a significantly lower share of felony-level crime events than Manhattan, Brooklyn and Queens. Most of its crime events are lesser offenses.

These scenarios will be broken down further in the crime type analysis.

2. Crime type: Recognizing prevalent Crime Type

I then looked further into the crime types in each borough. I listed the top 5 crime types per borough, and there seems to be no major difference among them: in every borough, the major types are Larceny_Theft, Harassment, Assault, Criminal_Mischief_Property and Offenses_against_Public_Order_Administration. The only difference is in Staten Island, which has a significantly lower proportion of Larceny_Theft and Criminal_Mischief_Property.
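
A top-5 listing like the one described above can be sketched with a per-borough counter. The records below are toy data, and `top_types` is a hypothetical helper rather than the code used for the article.

```python
from collections import Counter, defaultdict


def top_types(records, n=5):
    """Return the n most frequent crime types per borough.

    `records` is an iterable of (borough, crime_type) pairs.
    """
    per_borough = defaultdict(Counter)
    for borough, crime_type in records:
        per_borough[borough][crime_type] += 1
    return {
        borough: counter.most_common(n)
        for borough, counter in per_borough.items()
    }


# Toy records for two made-up boroughs (not real data).
toy = (
    [("X", "Larceny_Theft")] * 3
    + [("X", "Assault")] * 2
    + [("X", "Harassment")] * 1
    + [("Y", "Assault")] * 1
)
result = top_types(toy, n=2)
print(result)
```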

Statistics: Does the frequency of each crime type differ between boroughs?

The test result for crime category (p=0.00) shows a significant relationship between the variables, meaning the boroughs have different distributions of crime types.

I summarized the result with a heatmap and list only the significant categories below.
P.S. The threshold was a standardized residual above 2.0 or below -2.0, and I also put the most significant high or low category in each area in bold.

Brooklyn

  • Significantly high-frequency crimes: Forgery, Burglary, Weapon Problem, Gambling, Robbery, Sex Crime & Social and Commercial-related Crime (Social_Commercial_related_Crime)

Manhattan

  • Significantly high-frequency crimes: High-Value Theft, Forgery, Sex Crime & Frauds

Bronx

  • Significantly high-frequency crimes: Serious Assault (Aggravated_Assault), Assault, Drug Problem, Weapon Problem, Harassment, Offenses against Public (Offenses_against_Public_Order_Administration) & Robbery

Queens

  • Significantly high-frequency crimes: Traffic Laws Violations, Assault, Burglary, Property Mischief, Harassment & Motor Vehicle Theft

Staten Island

  • Significantly high-frequency crimes: Driving under the Influence, Property Mischief, Frauds, Harassment & Offenses against Public
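
The borough lists above come from filtering standardized residuals against the ±2.0 threshold. A minimal sketch of that filter, assuming the residuals are available as a mapping from (borough, crime type) to a value, could look like this. The values are invented for illustration and `flag_residuals` is a hypothetical helper.

```python
def flag_residuals(residuals, threshold=2.0):
    """Split cells into over- and under-represented ones.

    `residuals` maps (borough, crime_type) to a standardized residual.
    """
    high = sorted(key for key, value in residuals.items() if value > threshold)
    low = sorted(key for key, value in residuals.items() if value < -threshold)
    return high, low


# Invented residuals for illustration (not results from the article).
toy_residuals = {
    ("X", "Burglary"): 3.1,
    ("X", "Frauds"): -2.4,
    ("X", "Assault"): 0.7,
    ("Y", "Frauds"): 2.0,
}
high, low = flag_residuals(toy_residuals)
print(high)
print(low)
```

Cells sitting exactly on the threshold are not flagged here, which matches the "over 2.0 or below -2.0" wording.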

Through this analysis, I have found some interesting insights:

  • Despite having the most crime events, Brooklyn has a lower frequency of Frauds and High-Value Theft. This is also true of the Bronx, Queens and Staten Island.
Photo by Brandon Jacoby on Unsplash

Conclusion and Next Steps

From the exploratory analysis and the statistical tests, I can say that the boroughs do differ significantly from each other in terms of crime level and crime type.

Next, I am intrigued to look into more data, such as the time of each crime and the victim and suspect data. I will combine the above findings with this new data to look into more detailed topics (e.g. the correlation between the time of a crime and its type, or the people involved).

Congrats on making it this far, and thanks for reading! Feel free to check out my GitHub for the full code, or drop me a message at cl3883@columbia.edu

If this helped, clap as many times as you like :)

Written by Jeff, Lu Chia-Ching

MS in Columbia & Pursuing Data Practitioner @NYC! | LinkedIn: https://www.linkedin.com/in/jeff-chia-ching-lu/