Forecasting NYC Arrests During COVID-19 With Long Short-Term Memory Networks

Sam Black
Published in The Startup · 5 min read · Oct 20, 2020
Image source: The New York Times

Forecasting.

Everyone does it. No one does it the same way.

I’m hoping to solve this problem with a deep learning approach by borrowing some of the best-in-class methods across the industry and showing you how to do the same.

The data

I’m using data from NYC OpenData — specifically, all arrests made during 2020. This has been a strange and difficult time for many, including our brave officers at the NYPD. I thought it would be interesting to see the effect the pandemic had on overall arrest activity, in addition to solving a challenging forecasting problem.

To avoid any sensitivity, I’m ignoring the other features of the data, such as type of crime and location, even though those features would be of interest for crime prevention.

Data source details here

The process

1. Data prep and visualization

2. ARIMA

3. Deep Learning

It’s important to note that most classic forecasting methods (ARIMA, SARIMA, STL) are univariate in their standard form. To add additional information (variables) to your modeling, you’ll generally need either an exogenous-regressor variant (e.g. SARIMAX) or a neural-network-based approach.

I’ve included a link to the github gist, so you can run it yourself.

Unfortunately, I can’t go line-by-line here, so you’ll have to follow along in the article.

Data prep and visualization

I like to begin a forecast by looking at the data.

These data are squeaky clean. No dupes, no missing values. Nice.

When I visualize the data, I can see periodicity, seasonality, trend, and noise. This helps me determine the best approach: sometimes the data are stable enough that a moving average solves the problem; otherwise, we need to dig into the details.

Here, we can infer a few patterns based on the visualization alone:

1. There is a definite periodicity: we may be able to tease out a stationary signal every 7 days, with the peak occurring on Wednesdays. I’m not entirely sure why arrests peak on Wednesdays.

2. COVID-19 quarantines reduced activity from 03/15, thus reducing arrests.

3. We see an increase as quarantine restrictions were lifted slightly.

4. A spike occurs around the BLM protests following George Floyd’s death.

5. Arrest activity bottoms out, likely due to modifications in the NYPD’s arrest policy in response to the protests.

These data provide a good example of how macroeconomic factors and societal shocks can be modeled. They also illustrate a common issue with forecasting: each new regime leaves you with very little data to model it.

I also want to check the autocorrelation to confirm the periodicity.

Yes, I see the 7-day pattern occurring in our autocorrelation chart.

ARIMA

ARIMA (autoregressive integrated moving average) should be the starting point of any forecasting project. Eight times out of ten, it will get you the results you need. Let’s use this as our “benchmark” performance.

I first split the data into three periods:

  1. The normal period, which represents the pre-Covid lockdown
  2. The quarantine period, post 3/15
  3. The “somber” period, which immediately followed the George Floyd protests

For each of these periods, I created a small test set that I used to evaluate each ARIMA model.

I used the out-of-the-box ARIMA method from statsmodels. However, I did create a custom method that retrains the ARIMA with new observations and creates a new prediction [see code].

There are a few methods to determine the order of an ARIMA model, but the size of the data allows me to brute-force search and test which lag/order combination works best. Overall, I found that an ARIMA of order 5 with 1 lag worked best.

Summary of the ARIMA model

We also need to inspect the errors of the model to ensure that they are somewhat normally distributed, and that I’m not missing a trend that could be captured. The errors are normal-ish, though not perfect; still, they are not skewed in a way that leads me to reject the analysis.

I’ve included the longest period as an example; you can run the code yourself to see all the periods, which show similar accuracy.

Root mean squared error: ~4703

Deep Learning

Here, I am going to apply a single-layer and a multi-layer (stacked) LSTM to the learning problem. I won’t cover the basics of LSTMs here, but will refer you to this article, which is quite useful.

I begin with data prep. The data need to be structured as a 3-dimensional array that can be fed into the input layer of our LSTM.

The array’s shape is batch_size × window_size × n_features.

The associated y value (the value being predicted) is observation window_size + 1. In my example, I’m using a window_size of 3, so essentially I’m creating a sliding window that contains 3 observations and then trying to predict the 4th. Additionally, we need to min-max scale the data, so I created a function that reshapes our univariate array into a set of scaled Xs and Ys and iterates over any set [see code].
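The windowing function in the gist isn’t shown in the article; a minimal sketch of the same idea (the function name is mine, and the scaling here is a plain min-max over the whole array rather than scikit-learn’s `MinMaxScaler`):

```python
import numpy as np

def make_windows(values, window_size=3):
    """Min-max scale a univariate series and slice it into (X, y) pairs:
    X holds window_size consecutive observations, y is the observation
    immediately after the window."""
    values = np.asarray(values, dtype=float)
    scaled = (values - values.min()) / (values.max() - values.min())
    xs, ys = [], []
    for i in range(len(scaled) - window_size):
        xs.append(scaled[i:i + window_size])
        ys.append(scaled[i + window_size])
    # Reshape X to (batch_size, window_size, n_features) for the LSTM input layer
    return np.array(xs).reshape(-1, window_size, 1), np.array(ys)

X, y = make_windows([10, 20, 30, 40, 50, 60], window_size=3)
print(X.shape, y.shape)  # (3, 3, 1) (3,)
```

Note that for a real forecast you would fit the scaler on the training split only, then apply it to the test split, to avoid leaking future information.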

Once I have the data assembled, I’ve trained two networks

  1. A single-layer “vanilla” LSTM with a Dense output (16 neurons)
An example of the network. It’s so tiny!
  2. A stacked LSTM with Dropout and a Dense output (32 neurons in each layer)
Bigger is better, apparently.
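The two architectures might be sketched in Keras as below. The exact layer widths, dropout rate, and where the 16/32 neuron counts sit (LSTM layers vs. the Dense head) are my assumptions, since the article only describes the networks loosely:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

WINDOW_SIZE, N_FEATURES = 3, 1

# 1. "Vanilla" single-layer LSTM with a Dense head
#    (16 units is an assumption; the article doesn't pin down the width)
vanilla = keras.Sequential([
    layers.Input(shape=(WINDOW_SIZE, N_FEATURES)),
    layers.LSTM(16),
    layers.Dense(1),
])

# 2. Stacked LSTM with Dropout between layers and a Dense head
stacked = keras.Sequential([
    layers.Input(shape=(WINDOW_SIZE, N_FEATURES)),
    layers.LSTM(32, return_sequences=True),  # pass the full sequence onward
    layers.Dropout(0.2),
    layers.LSTM(32),
    layers.Dropout(0.2),
    layers.Dense(1),
])

for model in (vanilla, stacked):
    model.compile(optimizer="adam", loss="mse")

# Shape check on dummy windows: one scalar prediction per window
dummy = np.zeros((5, WINDOW_SIZE, N_FEATURES))
print(stacked.predict(dummy, verbose=0).shape)  # (5, 1)
```

Note the `return_sequences=True` on the first stacked layer: the second LSTM needs the full sequence of hidden states, not just the last one.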

In training, I found better performance with the stacked LSTM.

I’ve created a few methods that collate the predicted values from the trained model to use on “out-of-sample” data and then plotted the data in a similar graph for comparison.

Here is an example of the LSTM forecast. Our best-performing network produced an RMSE of 7193.92.

LSTM forecast

Summary

Overall, the ARIMA produced better results. This is typical when forecasting with univariate data. However, when you incorporate additional information into the forecast, or forecast with big data, deep neural networks tend to outperform.

This dataset posed a number of challenges, given the state of the world, but it serves as a nice, real-world example that highlights the challenges of forecasting.

Hope you enjoyed!
