Predicting Crime in Portland, Oregon

Part 1: a novel approach to crime prediction - spatial and time data

Predicting future crime poses a particularly interesting data challenge because it has both geospatial and temporal dimensions and may be affected by many different types of features like weather, city infrastructure, population demographics, public events, government policy, etc.

In September 2016, the National Institute of Justice launched a Real-Time Crime Forecasting Challenge to predict crime hotspots in the city of Portland, Oregon. Our team (Maxime and I) made a submission to the challenge. Our goal was to use both geospatial and temporal data to understand underlying factors of crime and predict future hotspots. All of the data are open source, making the project fully reproducible. And in the end, we are very excited to have been announced as one of the winners of the challenge!

How did we do it? In a series of two blog posts, I will walk through our approach to the challenge, which was ultimately a combination of machine learning, time-series modeling, and geostatistics (a combination that was more effective at predicting future crime hotspots than any of these techniques by themselves). This first post will focus on the data we used, and the next post (coming soon) will delve into the analysis of that data.

In a nutshell, we started with the data released by the National Institute of Justice (NIJ) and enriched this data with a variety of public, open-source data sets, including police reports, the US census, data from Foursquare, OpenStreetMap, and the weather. Let’s take a deeper look.

What are we predicting?

Crime Data

Format: geopoint x time x call features

The first and maybe most important dataset is our target variable: the crime data. Released by the NIJ and freely available for download, the data contains 911 calls about street crimes, burglary, and motor vehicle theft that resulted in police intervention. We have the category of the crime, the case description, the date stamp, the location (latitude, longitude), and the census tract that contains the location.

There are approximately 1 million unique calls, ranging from March 2012 to March 2017.

It’s worth noting that calls to the police are only a proxy for actual crimes. However, evidence suggests that the number of calls in a neighborhood tracks relatively well with the number of actual crimes, and is a useful source of public police data. That said, it’s important to keep in mind any biases in this measure when interpreting our results: marginalized groups may feel less comfortable calling the police, and communities with “insider” crime, such as a local mafia, may try to resolve criminal activities through non-police channels. Moreover, calls for service won’t track crimes that come to police attention through other methods, such as during a patrol, and we can’t take into account any error on the part of the person who reported the crime or the 911 operator who recorded the call.

The NIJ asked us to predict crime hotspots for one week, two weeks, one month, two months, and three months from March 1st. Additionally, we were asked to make predictions for crime overall, plus specific predictions for each category of crime (car theft, street crimes, and burglary).
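To make the five forecast horizons concrete, here is a minimal sketch that turns them into date ranges using only the standard library. The year (2017) is an assumption based on the data release described here, and the helper name add_months and the half-open (start, end) convention are illustrative choices, not something specified by the NIJ.

```python
from datetime import date, timedelta


def add_months(d, months):
    """Shift a date forward by whole months, keeping the day of month.

    Only safe when the day exists in the target month; that always
    holds here because the start date falls on the 1st.
    """
    month_index = d.month - 1 + months
    return date(d.year + month_index // 12, month_index % 12 + 1, d.day)


# Assumption: the forecast period starts on March 1st, 2017.
START = date(2017, 3, 1)

# Each window is (start, end) with the end date excluded.
windows = {
    "1 week": (START, START + timedelta(weeks=1)),
    "2 weeks": (START, START + timedelta(weeks=2)),
    "1 month": (START, add_months(START, 1)),
    "2 months": (START, add_months(START, 2)),
    "3 months": (START, add_months(START, 3)),
}

for label, (start, end) in windows.items():
    print(f"{label}: {start} up to (not including) {end}")
```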

Here is what some of the raw data look like, from the Feb 2017 release, as a pandas dataframe:

And as a geomap, using Leaflet and OpenStreetMap:

Thinking about space

Following the initial NIJ dataset, we started pulling in a variety of open source data to consider geographical features (what makes one neighborhood different from another?) and temporal features (what makes one point in time different from another?).

Points of Interest

Format: geopoint x points of interest features

The first geographic feature we looked at was points of interest (POIs) from OpenStreetMap (OSM), an open-source, collaborative mapping project. POIs are features that occupy a particular point on the map, and they carry user-generated category tags such as “Amenity,” “Hotel,” “Shop,” “Restaurant,” “Mailbox,” etc. We found 11,000 labeled points of interest in Portland in the 2016 dataset.

POIs are tagged with the name of the business, the type of business (e.g., entertainment), and the latitude and longitude.

Since we want to learn about the geographical points, we treat the geopoint as the main identifier, or index, of the data. This lets us treat the data as a sparse map: specific geographic points with categorical labels.
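As a rough sketch of what "indexing by geopoint" means (not the code used for the challenge), the snippet below keys a plain dictionary on (latitude, longitude) pairs. The three records, their names, and their coordinates are made up for illustration; only the field layout (name, category, latitude, longitude) comes from the description above.

```python
from collections import defaultdict

# Made-up POI records, shaped like the fields described above.
pois = [
    {"name": "Example Cafe", "category": "Restaurant", "lat": 45.5231, "lon": -122.6765},
    {"name": "Example Inn", "category": "Hotel", "lat": 45.5190, "lon": -122.6793},
    {"name": "Corner Mailbox", "category": "Mailbox", "lat": 45.5231, "lon": -122.6765},
]


def build_poi_map(records):
    """Group category labels by (lat, lon) so the geopoint acts as the index."""
    poi_map = defaultdict(list)
    for record in records:
        poi_map[(record["lat"], record["lon"])].append(record["category"])
    return dict(poi_map)


poi_map = build_poi_map(pois)
for geopoint, categories in poi_map.items():
    print(geopoint, categories)
```

The map is sparse in the sense that only geopoints with at least one POI appear as keys; everywhere else in the city simply has no entry.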

FourSquare Check-Ins

Format: geopoint x check-in features

Beyond POIs, we also wanted to look into how people use different neighborhoods in Portland. For this we pulled data from the API of FourSquare, a social media platform where people can “check in” to different venues and activities, and got a dataset of 35,000 check-ins in 6,623 unique locations.

The FourSquare API lets you query a radius around a geopoint. For efficiency, we laid a geohash grid over our Portland map and used the grid cells to query the API. We did this in Python with PostGIS and got 17k hashes, which we used to find our check-ins.
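For readers unfamiliar with geohashes, here is a minimal pure-Python sketch of the idea. It is a stand-in for illustration, not the PostGIS workflow described above: geohash_encode implements the standard geohash encoding, and geohash_cells samples a bounding box densely enough to collect every cell that covers it. The bounding box, the precision of 6, and the sampling step are all assumptions.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def geohash_cells(lat_min, lat_max, lon_min, lon_max, precision=6, step=0.002):
    """Collect the distinct geohash cells touched by a dense sample of a box.

    The step must be smaller than the cell size at the chosen precision
    (about 0.0055 degrees of latitude at precision 6) so no cell is skipped.
    """
    cells = set()
    n_lat = int((lat_max - lat_min) / step) + 1
    n_lon = int((lon_max - lon_min) / step) + 1
    for i in range(n_lat):
        for j in range(n_lon):
            cells.add(geohash_encode(lat_min + i * step, lon_min + j * step, precision))
    return sorted(cells)


# Hypothetical bounding box around part of central Portland.
cells = geohash_cells(45.50, 45.54, -122.70, -122.64)
print(len(cells), "cells, for example", cells[:3])
```

Each cell could then serve as one query region for a radius-based API, which is the efficiency gain of gridding the map first rather than querying around every point of interest.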

This generated a dataset with the name of each business, the category of business (10 total: ‘Travel & Transport’, ‘Professional & Other Places’, ‘Outdoors & Recreation’, ‘Food’, ‘Shop & Service’, ‘Arts & Entertainment’, ‘Residence’, ‘Nightlife Spot’, ‘Event’, ‘College & University’), the latitude and longitude, the distance from the city center, the number of check-ins, the number of unique users, and the number of tips they left, in the following format:

As with the POI data, we re-indexed this data by geopoint to get a map.

Police Precincts

Format: geo-tiles with categorical labels.

Next, we grabbed Police Precincts from the NIJ website, as geo-tiles with categorical labels. Greater Portland has 60 different police precincts. Every part of Portland belongs to exactly one of these precincts, so we can think of these as spatial tiles. One of the easiest ways to represent tiles in space, which cover a lot of area (in contrast to a specific geopoint, which can be represented by just a latitude and longitude), is by using multipolygons.
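To illustrate what a multipolygon tile looks like in practice, here is a simplified sketch in plain Python. Each tile is a label plus a list of polygon rings, and a ray-casting test finds which tile contains a point. The tile labels and coordinates are invented, holes inside polygons are ignored, and in real work a geometry library such as shapely or geopandas would usually do this job.

```python
def point_in_ring(x, y, ring):
    """Ray-casting test: is the point (x, y) inside the closed ring of vertices?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_tile(x, y, tiles):
    """Return the label of the first tile whose multipolygon contains the point."""
    for label, multipolygon in tiles.items():
        if any(point_in_ring(x, y, ring) for ring in multipolygon):
            return label
    return None


# Invented tiles: "A" is made of two separate squares, "B" sits between them.
tiles = {
    "A": [[(0, 0), (1, 0), (1, 1), (0, 1)], [(2, 0), (3, 0), (3, 1), (2, 1)]],
    "B": [[(1, 0), (2, 0), (2, 1), (1, 1)]],
}

print(find_tile(0.5, 0.5, tiles))
print(find_tile(1.5, 0.5, tiles))
```

Tile "A" shows why a multipolygon is needed rather than a single polygon: one label can cover several disconnected pieces of the map.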


US Census

Format: geo-tile x year x census features

The US census collects more than 20,000 variables on the US population. That data is collected by neighborhood, or “census block,” each of which covers 600 to 3,000 people. The kind of information captured by the census can include powerful predictors of many different crime-related outcomes, but choosing which of those 20,000 features to use can be a challenge.

We selected only those features that were, on their own, reliable predictors of our target variable: calls for service. Independently for 2013, 2014, and 2015, we selected all of the features that were reliably correlated across neighborhoods with calls for service for that year (p<0.0001). For example, we could ask whether the average rental price of a one-bedroom apartment, or the average education level of men over 30, or the average time it takes to get to work, correlates with the amount of crime in a neighborhood. We used sklearn’s feature_selection module with the f_regression scoring function, which iterates through the set of potential predictors (in this case, census features), computes the correlation between each predictor and the target (here, number of crimes), and converts that correlation into an F score and then a p-value.
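Here is a minimal sketch of that univariate selection step, run on synthetic data rather than the census itself. The call to f_regression and the p<0.0001 cutoff come from the description above; the feature names, the random data, and the sample size are placeholders.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n_blocks = 200  # placeholder number of neighborhoods

# Synthetic stand-ins for two census features: one related to the target, one not.
related = rng.normal(size=n_blocks)
unrelated = rng.normal(size=n_blocks)
X = np.column_stack([related, unrelated])
feature_names = ["related_feature", "unrelated_feature"]

# Synthetic stand-in for calls for service per neighborhood.
y = 3.0 * related + rng.normal(scale=0.5, size=n_blocks)

# f_regression scores each column of X against y on its own,
# returning one F score and one p-value per feature.
f_scores, p_values = f_regression(X, y)

ALPHA = 1e-4  # the p < 0.0001 threshold from the text
selected = [name for name, p in zip(feature_names, p_values) if p < ALPHA]

for name, f, p in zip(feature_names, f_scores, p_values):
    print(f"{name}: F={f:.1f}, p={p:.3g}")
print("selected:", selected)
```

Because each feature is scored independently, this approach says nothing about redundancy between features; it only filters out those with no detectable linear relationship to the target.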

After getting all of the reliable features for each year independently, we then selected the features that were reliable indicators for all three years (intersection of 2013:2015). Altogether, this generated a dataset with ~2600 census blocks and ~1400 features.
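The intersection across years can be sketched with plain Python sets. The feature names below are invented placeholders; the only thing taken from the text is the rule that a feature must pass the filter in 2013, 2014, and 2015 to be kept.

```python
# Placeholder results of the per-year selection step.
selected_by_year = {
    2013: {"median_rent", "commute_time", "pct_renters"},
    2014: {"median_rent", "commute_time", "household_size"},
    2015: {"median_rent", "commute_time", "pct_renters"},
}

# Keep only the features that were selected in every year.
stable_features = set.intersection(*selected_by_year.values())

print(sorted(stable_features))
```

Requiring agreement across all three years is a simple guard against features that pass the threshold in a single year by chance.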

These features include information about the demographics (age, race, ethnicity, gender, primary language), economics (income, cost of rent, poverty status), employment (employment status, time to travel to work, means of transportation to work, school attendance), and living conditions (number of bedrooms, access to plumbing, heat), as well as many of the interactions of these features (e.g. number of women in the house with a high-school diploma who make less than $20,000/year).

Thinking about time

We found a lot of data that might explain variation in crime across space. What about time?

Weather

Format: date x weather features

The first type of data we used was weather data from the NOAA API. We had daily samples starting in 2012 from 14 weather stations around Portland. The weather features included variables like min and max daily temperature, max wind speed, amount of precipitation, categorical labels (cloudy, rainy, snowy), etc.

For example, here is the max temperature for each day, averaged across stations (green = mean; blue = std):
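The per-day mean and standard deviation across stations can be computed along these lines. This is a standard-library sketch on invented readings; the station names, dates, and temperatures are placeholders, and only the idea of collapsing several stations into one daily value comes from the text.

```python
from collections import defaultdict
from statistics import mean, stdev

# Invented (date, station, max temperature) readings.
readings = [
    ("2012-01-01", "station_a", 7.0),
    ("2012-01-01", "station_b", 8.0),
    ("2012-01-01", "station_c", 9.0),
    ("2012-01-02", "station_a", 5.0),
    ("2012-01-02", "station_b", 6.5),
    ("2012-01-02", "station_c", 8.0),
]


def daily_mean_and_std(rows):
    """Collapse station readings into one (mean, std) pair per day.

    Uses the sample standard deviation, so each day needs at least
    two station readings.
    """
    by_day = defaultdict(list)
    for day, _station, max_temp in rows:
        by_day[day].append(max_temp)
    return {day: (mean(vals), stdev(vals)) for day, vals in sorted(by_day.items())}


daily_stats = daily_mean_and_std(readings)
for day, (avg, spread) in daily_stats.items():
    print(day, round(avg, 2), round(spread, 2))
```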

Holidays and Events

Format: date x days x holiday x event features

What else might make some days different than others? We labeled each of our days with day of the week and week of the year.

Moreover, we pulled in a list of major holidays and political events that could have influenced the crime rates in the city. (How does Christmas impact crime rates? Saint Patrick’s Day? Election Day?)
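As a sketch of how such date features can be built, the snippet below labels a day with its day of the week, its week of the year, and any holiday or event that falls on it. The lookup tables are tiny illustrative samples rather than the list we actually used, and the function and field names are hypothetical.

```python
from datetime import date

# Illustrative samples only: holidays that recur on a fixed month and day,
# and one-off events tied to a specific date.
FIXED_HOLIDAYS = {(12, 25): "Christmas", (3, 17): "Saint Patrick's Day"}
DATED_EVENTS = {date(2016, 11, 8): "Election Day"}


def calendar_features(d):
    """Build a row of calendar features for a single date."""
    return {
        "date": d.isoformat(),
        "day_of_week": d.weekday(),         # Monday = 0, Sunday = 6
        "week_of_year": d.isocalendar()[1],  # ISO 8601 week number
        "holiday": FIXED_HOLIDAYS.get((d.month, d.day)),
        "event": DATED_EVENTS.get(d),
    }


christmas = calendar_features(date(2016, 12, 25))
election = calendar_features(date(2016, 11, 8))
print(christmas)
print(election)
```

Days with no holiday or event get None in those fields, which can later be turned into a categorical or binary feature.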

So that’s how and where we got our data for the challenge. Stay tuned for part two to discover the details behind our analysis!