Deciding Where to Move In NYC by Crime Density

Eric Andresen

10 min read · Oct 25, 2017

Visualizing 220,000+ incidents using Pandas and Gmaps.

After working at the UN in New York for a few months in early 2017, I fell in love with the city and decided it’s the perfect place to transition to a career in data science. But when I asked people where I should move, a lot of what I heard about which neighborhoods are good and bad seemed either vague or plain contradictory. Luckily, the NYPD recently started putting all of its incident reports online, so to figure out which neighborhoods are really safe I created a heatmap of crime density for the whole city.

Step 1 | Getting and Cleaning the Data

NYC has an absolutely gorgeous data set in its OpenData repository for all the police incidents over the past several years. New York and San Francisco are both putting a lot of effort into opening up their data, so if you haven’t started playing around with it yet, check it out!

It looks like there are almost 223,000 records and 24 features. From the first few rows of the data it’s clear that some of these aren’t going to be relevant for our analysis, and there are a few NaN values we need to worry about (a little bit later). Let’s use the codebook to narrow the features down. We’ll keep:

CMPLNT_FR_DT and CMPLNT_FR_TM: The date and time of the event. The dataset has both the date/time when the incident started and when it ended. There are more start values and they are more relevant, so we'll stick with those.
OFNS_DESC: The description of the incident's classification code, so we can filter by type.
LAW_CAT_CD: Level of offense (felony, misdemeanor, violation).
BORO_NM and ADDR_PCT_CD: Borough name and precinct number where the incident took place, to help narrow down the geography.
LOC_OF_OCCUR_DESC, PREM_TYP_DESC: Location of the incident and the premise type (residence, street, grocery store, etc.).
PARKS_NM: Name of the park the incident occurred in.
Latitude and Longitude, so we can map them.
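The column selection and datetime conversion can be sketched roughly like this. The tiny DataFrame is a made-up stand-in for the real NYPD export, and the MM/DD/YYYY date format is an assumption about how the file stores dates. `cols` and `DATE_TIME` are the names used in the rest of this post; `raw` is just a placeholder name.

```python
import pandas as pd

# Made-up rows standing in for the NYPD complaint export (illustration only).
raw = pd.DataFrame({
    "CMPLNT_FR_DT": ["01/15/2017", "12/31/2016", "not a date"],
    "CMPLNT_FR_TM": ["13:45:00", "23:10:00", "08:00:00"],
    "OFNS_DESC": ["ROBBERY", "HARRASSMENT 2", "PETIT LARCENY"],
    "LAW_CAT_CD": ["FELONY", "VIOLATION", "MISDEMEANOR"],
    "BORO_NM": ["BROOKLYN", "MANHATTAN", "QUEENS"],
    "ADDR_PCT_CD": [71, 14, 104],
    "LOC_OF_OCCUR_DESC": ["FRONT OF", "INSIDE", "INSIDE"],
    "PREM_TYP_DESC": ["STREET", "RESIDENCE", "GROCERY"],
    "PARKS_NM": [None, None, None],
    "Latitude": [40.66, 40.75, 40.70],
    "Longitude": [-73.95, -73.99, -73.90],
    "KY_CD": [105, 578, 341],  # an extra column we are not keeping
})

# Features kept after reading the codebook.
cols = [
    "CMPLNT_FR_DT", "CMPLNT_FR_TM", "OFNS_DESC", "LAW_CAT_CD",
    "BORO_NM", "ADDR_PCT_CD", "LOC_OF_OCCUR_DESC", "PREM_TYP_DESC",
    "PARKS_NM", "Latitude", "Longitude",
]
crime_data = raw[cols].copy()

# Combine start date and time into one datetime column.
# errors="coerce" turns unparseable values into NaT instead of raising.
crime_data["DATE_TIME"] = pd.to_datetime(
    crime_data["CMPLNT_FR_DT"] + " " + crime_data["CMPLNT_FR_TM"],
    format="%m/%d/%Y %H:%M:%S",
    errors="coerce",
)
```

Coercing rather than raising means one malformed date doesn't stop the whole conversion; the bad rows just show up as missing values that can be dropped later.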

Now that the dates are in datetime format we can explore the data a little more and make sure it’s evenly distributed over time and geography.

2017 has the most data by far, which makes sense given how recently the NYC OpenData initiative started hosting this data. Let’s filter out everything before 2016, as well as anything that was coerced to NaN in the conversion to datetime or that doesn’t have location data.
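A minimal sketch of that filter, using a few invented rows in place of the real data. The column names match the ones above; the mask names (`has_date`, `recent`, `has_location`) are just for illustration.

```python
import pandas as pd

# Invented rows: one too old, one with no date, one with no coordinates.
crime_data = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(
        ["2015-06-01", "2016-12-20", "2017-03-05", None, "2017-04-01"]
    ),
    "Latitude": [40.70, 40.75, 40.68, 40.80, None],
    "Longitude": [-73.95, -73.99, -73.96, -73.94, None],
})

has_date = crime_data["DATE_TIME"].notna()            # drop values coerced to NaT
recent = crime_data["DATE_TIME"].dt.year >= 2016      # drop anything before 2016
has_location = crime_data[["Latitude", "Longitude"]].notna().all(axis=1)

crime_data = crime_data[has_date & recent & has_location]
```

Keeping the three conditions as separate masks makes it easy to check how many rows each one removes on its own.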

It’s interesting that Brooklyn has more incidents than Manhattan, but given the difference in size I’m surprised that the difference isn’t bigger. Then again Manhattan is much denser, so who knows how this actually shakes out. As a side project it would be interesting to normalize this by population to see which has the higher incident rate. The most important thing is that there is plenty of data across boroughs for meaningful analysis.

There are so many fewer incidents in 2016 that it’s clear only a small share of that year’s incidents made it into the data. It’s also important to note that 2017 isn’t over, so this difference will be even greater by the end of the year.

This makes me wonder if there is a bias in the 2016 results. For example, maybe only more serious crimes were recorded in 2016, or maybe the data ramped up towards the end of the year. Let’s make a few charts to explore.
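Both questions come down to simple cross-tabulations. Here is a rough sketch on invented rows; `incidents`, `by_month_and_level`, and `level_share_by_year` are illustrative names, not the ones from my notebook.

```python
import pandas as pd

# Invented incidents, just to show the shape of the check.
incidents = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2016-03-10", "2016-12-01", "2016-12-15", "2016-12-20",
        "2017-01-05", "2017-02-11",
    ]),
    "LAW_CAT_CD": [
        "FELONY", "FELONY", "MISDEMEANOR", "VIOLATION",
        "MISDEMEANOR", "FELONY",
    ],
})

# Did reporting ramp up late in 2016? Count incidents per month and offense level.
crime_2016 = incidents[incidents["DATE_TIME"].dt.year == 2016]
by_month_and_level = pd.crosstab(
    crime_2016["DATE_TIME"].dt.month, crime_2016["LAW_CAT_CD"]
)

# Were serious crimes overrepresented? Compare each level's share within each year.
level_share_by_year = pd.crosstab(
    incidents["DATE_TIME"].dt.year, incidents["LAW_CAT_CD"], normalize="index"
)

# In a notebook, by_month_and_level.plot.bar(stacked=True) would draw the chart.
```

Normalizing by row (`normalize="index"`) makes the years comparable even though 2016 has far fewer records.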

It looks like felonies might be overrepresented in 2016, but more importantly it’s fairly obvious that more comprehensive reporting was phased in during the last few months of the year. In other words, December 2016 has far more data than the rest of the year and would totally overwhelm the other months. We’ll drop 2016 and just focus on the first six months that we have for this year.

This only leaves us with data for the first six months of 2017, which isn’t ideal, but the upside is it looks like the number of incidents has stayed fairly consistent across that time. We can plot some of the numbers from 2015 as a spot check to make sure crime rates are fairly stable across the year.

#plot 2015 crime by month normalized to percent to make scale of difference easier to judge
crime_2015 = df[cols][df["DATE_TIME"].dt.year == 2015]
by_month_percent_2015 = crime_2015["DATE_TIME"].dt.month.value_counts().sort_index() / crime_2015.shape[0]
by_month_percent_2015.plot.bar()
print(by_month_percent_2015.std())

0.020300981574879005

There are spikes at the beginning and end of the year, which don’t seem to be replicated in the 2017 data. But overall this is consistent enough to make me feel comfortable extrapolating out the data we have from 2017 so far.

#filter to only include 2017
crime_data = crime_data[crime_data["DATE_TIME"].dt.year == 2017]
crime_data.shape

(215484, 11)

Now it’s time to look at offenses. Not all offenses represent risk factors. Lower-level offenses like traffic citations aren’t predictors of danger and will keep the map from representing what we want it to. On the other hand, just using the level of offense isn’t enough. Harassment, for example, is considered a low-level offense, but there’s a lot that can be called harassment that is still a risk factor.

#list unique values in offense description
crime_data["OFNS_DESC"].sort_values().unique()

Most of these look like crimes that represent real risk. But there are two that might cloud the data:

‘VEHICLE AND TRAFFIC LAWS’ — Traffic citations will make crime seem higher where traffic citations are disproportionately high (e.g. Manhattan?), and intuitively they aren’t a strong correlate to crime, so we’ll drop.
‘ALCOHOLIC BEVERAGE CONTROL LAW’ — Dropped because selling alcohol without a license isn’t a risk factor.

The big debate for me was marijuana, which is nested in the Dangerous Drugs category. Marijuana is scientifically not dangerous and shouldn’t be illegal to start with, but for now I’ll keep it. I don’t have enough hard data on how these types of charges correlate to crime beyond my personal bias against criminality.

#filter out the above
bool_list = ~crime_data["OFNS_DESC"].isin(['VEHICLE AND TRAFFIC LAWS', 'ALCOHOLIC BEVERAGE CONTROL LAW'])
crime_data = crime_data[bool_list]
#verify
#crime_data["OFNS_DESC"].sort_values().unique()

Step 2 | Map it!

The goal is to see how crime is distributed throughout the city, so I’m using the Jupyter Gmaps module to plot these incidents as a heat map on top of an interactive map of New York. Green is low incident density and red is higher. The first map I made was a total mush of red at a high altitude and totally green when closer in (keep in mind visual density changes in proportion to perspective), so I also changed the intensity and the radius of each reported crime until the map was more comprehensible.

(Unfortunately GitHub won’t host these maps, so I’ve substituted them with pictures. If you fork this code and put in your own API key you can see these maps interactively. The analysis below is based on zooming into the map.)

import os
import gmaps
#API key hidden with environment variable
gmaps.configure(api_key=os.environ["GOOGLE_API_KEY"])
#set initial frame
new_york_coordinates = (40.75, -74.00)
fig = gmaps.figure(center=new_york_coordinates, zoom_level=12)
locations = crime_data[["Latitude","Longitude"]]
heatmap_layer = gmaps.heatmap_layer(locations)
#adjust red to make more understandable
heatmap_layer.max_intensity = 35
heatmap_layer.point_radius = 5
fig.add_layer(heatmap_layer)
fig

It’s important to remember that this map weights each incident equally, so red means crime is more dense, not more serious. It’s also important to remember that Manhattan is more dense, so this map doesn’t necessarily mean that crime per person is higher in Manhattan. But in this case we don’t care about crime per person; we care about the geographical density of crime, because regardless of how many people live around you, you want to know how dangerous it is to walk down the street.

If you zoom into Midtown you can see that there are hotspots around the touristy areas of town, specifically around Times Square and the Empire State Building. If you move further west into Hell’s Kitchen, crime drops, whereas if you go the other direction there’s mid-level density around second street and a cluster around Trump Tower.

Given the last year, it makes me wonder if this is a lot of low-level offenses from protests around that area. Moving further north it’s clear that the Upper East Side is fairly safe, but once you get into East Harlem, things start looking a lot more red. No real surprises there. Interestingly, the west side of 14th Street has considerably higher density than the east side, which runs into the overall quite safe East Village. Any ideas why that might be? (Message me.) There is an interesting hotspot just south of there in the box formed between E. Houston, Allen, and Essex Streets, which jibes with my experience from when I was living south of there, but it still makes me quite curious as to why it’s so concentrated there. If memory serves, that area and the blocks east of it are lower-income areas (something to look at in a future project).

There are similarly predictable patterns in Brooklyn with a few surprises mixed in. Williamsburg is fairly safe. It was surprising to me how safe Bushwick and Ridgewood are, but there is a definite cluster at the border between Williamsburg and Bushwick at Broadway Triangle. I don’t know this area too well, but it’s very interesting that there is a cluster of crime between two relatively peaceful areas rather than in the slightly more risky area directly to the south.

The area I’m really interested in is the Prospect Park region. My girlfriend lives there, and I’m looking at apartments in the area with my sister. The original inspiration for this visualization was all of the rumors I had heard about the area. Some of it was certainly true, but the rest of it seemed to be a mix of outdated information, anecdotal evidence, and racial stereotypes, which the University of Virginia’s (really cool) Racial Dot Map confirms are true, at least in the sense that Crown Heights is overwhelmingly black (compare this with Park Slope on the other side). In this respect, this map is fascinating. You can see Park Slope is quite peaceful, whereas you can follow Flatbush Ave. down the other side and see an almost perfect gradient to the stereotypically dangerous neighborhood at Prospect Park’s southeast corner. What’s especially interesting is that the north side of the Prospect Lefferts Gardens neighborhood is a relatively low-risk island between Brownsville to the northeast and East Flatbush to its southwest. This whole area has a rough reputation, but in reality crime is only clustering into specific areas.

Equally interesting is the north side of the park, which is just starting to gentrify. It’s almost as if Park Slope is creeping across the north side of the park, boxed in by Fulton to the north and slowly chipping away at New York Ave to the east. This is really where I wish I had better historical data so that we could see if incident density over the past few years supports this.

The major problem with these conclusions (besides the fact that I’m wildly extrapolating from a few months of crime data) is that this map treats all crime the same. In other words, disorderly conduct will have the same effect on our perception of density as a double homicide. So our next step is to make sure hotspots aren’t being drowned out by lower-level crime. I have a hunch we’ll see the same general pattern. There are several ways to test this. I could weight more serious crimes so that they have a higher intensity on the map (i.e. they make areas redder faster). But it would still be difficult to tell exactly what the differences are if they are small, especially with such a large area to cover. The opposite extreme is separating out the data completely so we can get a clean read, but this has the opposite problem of decontextualizing felonies from the overall pattern of crime. A good data visualization is all about making things as easy to grasp as possible, so I’m going to add another layer on top of the existing map to emphasize more serious crimes in a different color: black.
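For comparison, the weighting option I decided against might look something like this. The severity weights are arbitrary placeholders I picked for illustration, not anything derived from the data, and the rows are invented. `gmaps.heatmap_layer` accepts a `weights` argument, so the mapped values could be passed straight in.

```python
import pandas as pd

# Invented incidents standing in for crime_data.
crime_data = pd.DataFrame({
    "LAW_CAT_CD": ["FELONY", "MISDEMEANOR", "VIOLATION"],
    "Latitude": [40.751, 40.682, 40.661],
    "Longitude": [-73.989, -73.961, -73.972],
})

# Hypothetical weights: how much more a serious offense should count.
severity_weights = {"VIOLATION": 1.0, "MISDEMEANOR": 2.0, "FELONY": 5.0}

locations = crime_data[["Latitude", "Longitude"]]
weights = crime_data["LAW_CAT_CD"].map(severity_weights)

# In the notebook, with gmaps configured as above:
# weighted_layer = gmaps.heatmap_layer(locations, weights=weights)
# fig.add_layer(weighted_layer)
```

The catch is the one described above: with a single blended layer you can't tell whether a red patch comes from a few felonies or a lot of violations.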

#separate data into felony and non-felony offenses
felonies = crime_data[crime_data["LAW_CAT_CD"] == "FELONY"]
crime_data_wo_felonies = crime_data[crime_data["LAW_CAT_CD"] != "FELONY"]
gmaps.configure(api_key=os.environ["GOOGLE_API_KEY"])
new_york_coordinates = (40.75, -74.00)
fig = gmaps.figure(center=new_york_coordinates, zoom_level=12)
#map non-felonies
locations = crime_data_wo_felonies[["Latitude","Longitude"]]
heatmap_layer = gmaps.heatmap_layer(locations)
heatmap_layer.max_intensity = 25
heatmap_layer.point_radius = 5
fig.add_layer(heatmap_layer)
#map felonies
felony_locations = felonies[["Latitude","Longitude"]]
felony_layer = gmaps.heatmap_layer(felony_locations)
felony_layer.gradient = [
    (200, 200, 200, 0.0),
    (0, 0, 0, 1.0),
    (0, 0, 0, 1.0)
]
felony_layer.max_intensity = 30 #compare 20 and 25
felony_layer.point_radius = 5
fig.add_layer(felony_layer)
fig

It’s not too surprising that felonies occur in the same areas that crime in general does. But if you look closely a trend emerges. Felonies are most dense, that is to say most likely, in areas where overall crime is the densest. This raises interesting questions about whether lower-level crimes are predictive of more serious, violent crimes. If this is the case, then there are “gateway” crimes or multiplier effects of low-level crime that lead to more serious offenses. Intuitively this makes sense. For the most part, when we think of a “bad” part of town we think of it as holistically bad, not just bad in one respect. What we don’t think about is that this is in itself an odd thing to assume, and it represents a ghettoization, both literal and figurative, of dangerous areas. I’m sure there are studies on this, so if you know any I’d love for you to send them my way! (UPDATE: A friend of mine told me that this is known as the broken windows theory.)
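Eyeballing two overlaid heatmaps only goes so far. One rough way to put a number on the overlap, which I haven't run on the real data, is to bin incidents into small grid cells and correlate felony counts with non-felony counts per cell. The rows below are invented, and rounding coordinates to two decimals (cells roughly a kilometer across) is an arbitrary choice.

```python
import pandas as pd

# Invented incidents clustered in three spots.
rows = (
    [("FELONY", 40.751, -73.989)] * 3
    + [("MISDEMEANOR", 40.749, -73.991)] * 6
    + [("FELONY", 40.682, -73.961)] * 1
    + [("VIOLATION", 40.681, -73.959)] * 2
    + [("MISDEMEANOR", 40.661, -73.972)] * 1
)
crime_data = pd.DataFrame(rows, columns=["LAW_CAT_CD", "Latitude", "Longitude"])

# Snap each incident to a grid cell and flag felonies.
cells = crime_data.assign(
    lat_bin=crime_data["Latitude"].round(2),
    lon_bin=crime_data["Longitude"].round(2),
    is_felony=crime_data["LAW_CAT_CD"] == "FELONY",
)

# Count felonies and total incidents in every cell.
per_cell = cells.groupby(["lat_bin", "lon_bin"])["is_felony"].agg(
    felonies="sum", total="count"
)
per_cell["non_felonies"] = per_cell["total"] - per_cell["felonies"]

# Pearson correlation between the two counts across cells.
correlation = per_cell["felonies"].corr(per_cell["non_felonies"])
```

A high correlation would only say the two kinds of crime share geography. It wouldn't show that one leads to the other, and cells with zero incidents never appear in the table at all.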

Finally, it’s interesting to notice that my earlier assumption that crime in touristy areas isn’t serious doesn’t seem to hold. There is still a high density of felonies around Times Square and the Empire State Building, which makes me want to drill down further into the data to find out what they are. For now, though, the visualization has served its purpose in giving context about where the good and bad areas to look for a place to live are. Now all I have to do is add a few lines of code to add prospective places as markers on the map. Thanks for following along!
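Those few lines might look roughly like this. The places and coordinates are placeholders rather than real listings, and `add_place_markers` is a hypothetical helper name; it relies on `gmaps.marker_layer`, which takes a list of (latitude, longitude) pairs and optional hover text.

```python
# Placeholder shortlist; names and coordinates are made up.
prospective_places = [
    {"name": "Apartment A", "location": (40.6710, -73.9500)},
    {"name": "Apartment B", "location": (40.6600, -73.9600)},
]

marker_locations = [place["location"] for place in prospective_places]
marker_labels = [place["name"] for place in prospective_places]


def add_place_markers(fig, locations, labels):
    """Add one marker per prospective place to an existing gmaps figure."""
    import gmaps  # imported here so the data prep above runs without the extension

    fig.add_layer(gmaps.marker_layer(locations, hover_text=labels))
    return fig
```

In the notebook this would be called as `add_place_markers(fig, marker_locations, marker_labels)` after the heatmap layers are added, so the markers sit on top of the crime density.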

UPDATE: Moving In

Since I first made this map, I’ve used it to find a great little place in Crown Heights. I’ve also started talks with my small hometown to use similar techniques on their internal data to help them better understand geographical patterns of gang violence. In the meantime I’m looking for data-driven projects in NYC, so let me know if you know of anything!

Written by Eric Andresen