Forecasting Uber Demand in NYC

Ankur Vishwakarma
Mar 23, 2018 · 7 min read
Image for post
A screenshot of the final dashboard created to forecast Uber demand in NYC neighborhoods. For more information on this dashboard, please scroll to the end of this post.

For my final project at Metis, I wanted to work on something that spanned the following interests of mine:

  • Urban transportation
  • Geographic visualizations
  • Timeseries forecasting

Therefore, I decided to see if I could forecast hourly Uber demand across NYC neighborhoods. In addition to time-lagged features (such as previous week’s demand), I added information specific to each neighborhood to improve my predictions. As a final result, I obtained relatively accurate unique forecasts for all neighborhoods in NYC.

All the code for the steps outlined below is on my Github.

Overview

Image for post
  • Uber, Lyft, and other ridesharing systems have become an important part of urban transportation.

Better forecasting can:

  • Decrease surge pricing events
  • Alert drivers of areas with upcoming demand
  • Improve overall customer satisfaction with service
  • Help with city and traffic planning
  • Allow forecasts to take into account changes in availability of other forms of transit, new development, etc.

Project Pipeline

Image for post
The overall process for this project. We’ll go into more detail about each step in the following paragraphs of this post.

Obtain Data

  • NYC’s Citibike data is provided on their website.
  • The NYC subway station locations can be downloaded from data.ny.gov.
  • The neighborhood geoJSON file was the same one used by Adrian Meyers’ excellent NYC taxi trips analysis.

Using Google Cloud & PySpark for Analysis

Image for post

This was a lot of data, so to process it quickly I relied on PySpark running on a Google Cloud Dataproc cluster. Fortunately, Google’s tools make it easy to get up and running quickly.

I used a mix of this tutorial from Google and this writeup by Charles Bochet to get Jupyter notebook running on a cluster with 1 master node and 3 worker nodes.

Finally, I uploaded all my data to a Google Cloud Storage bucket, which made it easy to access from multiple instances and avoided the headache of manually setting up a distributed filesystem (HDFS).

Uber Pickups Data

Image for post
Number of total Uber pickups plotted against time. Note the big gap in data between September 2014 and January 2015.

Because of the large gap in information, all further analysis was done with only 2015 data (January to June).
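
As a rough sketch of this filtering step (not the PySpark code actually used in the project), the snippet below keeps only 2015 pickups and counts them per hour. The function name `hourly_pickups` and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def hourly_pickups(pickup_times, year=2015):
    """Count pickups per hour, keeping only timestamps from the given year."""
    counts = Counter(
        ts.replace(minute=0, second=0, microsecond=0)  # truncate to the hour
        for ts in pickup_times
        if ts.year == year
    )
    return dict(sorted(counts.items()))


# Made-up pickup timestamps, for illustration only.
sample_pickups = [
    datetime(2014, 9, 30, 23, 59),  # dropped: not in 2015
    datetime(2015, 1, 1, 0, 5),
    datetime(2015, 1, 1, 0, 40),
    datetime(2015, 1, 1, 1, 15),
]
hourly = hourly_pickups(sample_pickups)
```

On the full dataset the same idea would be expressed as a distributed group-by, but the logic (truncate to the hour, then count) is unchanged.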

Trends in the Data

Image for post
One month of Uber pickups, showing a clear weekly trend

To zoom in further, we can pick a random day and look at how the Uber demand varies over each hour.

Image for post

Split Train/Test

Image for post

Looking only at the 2015 data, we still have a total of 14,264,110 pickups. Plotting that against time shows a general positive trend (demand is growing) and high variability (data is hourly). Of the six months, the first five are chosen for training with the last month of June designated as the test set.
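
A chronological split like the one described above might look like the following sketch. The cutoff of June 1, 2015 follows from the text; the helper name `split_train_test` and the sample counts are assumptions for illustration.

```python
from datetime import datetime

# Cutoff implied by the post: train on January-May 2015, test on June 2015.
TEST_START = datetime(2015, 6, 1)


def split_train_test(hourly_counts, test_start=TEST_START):
    """Split (timestamp, pickups) pairs chronologically at test_start."""
    ordered = sorted(hourly_counts)
    train = [row for row in ordered if row[0] < test_start]
    test = [row for row in ordered if row[0] >= test_start]
    return train, test


# Made-up hourly counts, for illustration only.
sample = [
    (datetime(2015, 5, 31, 23), 3100),
    (datetime(2015, 6, 1, 0), 2400),
    (datetime(2015, 1, 1, 0), 5200),
]
train, test = split_train_test(sample)
```

Splitting by time rather than at random keeps future observations out of the training set, which matters for any forecasting evaluation.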

Modeling

  1. Linear Regression
  2. Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors (SARIMAX)
  3. Facebook Prophet

Models were evaluated using root-mean-squared error (RMSE). This way, the error metric is easy to interpret as the “number of pickups” the forecast was off by.
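
For reference, RMSE can be computed in a few lines. This is a minimal sketch with made-up numbers, not the evaluation code from the project.

```python
import math


def rmse(actual, forecast):
    """Root-mean-squared error between two equal-length sequences."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up hourly pickups and forecasts, for illustration only.
actual = [100, 200, 300]
forecast = [110, 190, 300]
error = rmse(actual, forecast)  # in the same units as the data (pickups)
```

Because the errors are squared before averaging, a few large misses (such as unforecasted spikes) raise RMSE more than many small ones.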

Linear Regression

  • day of the week
  • weekend or not
  • hour of day
  • number of days since 1/1/2015
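
The four features listed above can all be derived from the timestamp alone. A minimal sketch, assuming Monday is encoded as 0 and Saturday/Sunday count as the weekend (the post does not specify its encoding):

```python
from datetime import datetime

# Reference date from the post: days are counted from 1/1/2015.
START_DATE = datetime(2015, 1, 1)


def time_features(ts):
    """Build the linear-regression features for one hourly timestamp."""
    day_of_week = ts.weekday()  # Monday=0 ... Sunday=6 (assumed encoding)
    return {
        "day_of_week": day_of_week,
        "is_weekend": int(day_of_week >= 5),
        "hour": ts.hour,
        "days_since_start": (ts - START_DATE).days,
    }


# Saturday, June 6, 2015 at 6 pm.
features = time_features(datetime(2015, 6, 6, 18))
```

In practice, day of week and hour are usually one-hot encoded before fitting a linear model, since demand does not change linearly with either value.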

The resulting forecast (plotted against test values from June) is shown below. The RMSE was 1,324.13. The forecast is certainly very good for such a simple model. However, it does not account for a lot of hour-by-hour variation and does not even try to address certain spikes.

Image for post

SARIMAX

As with any ARIMA model, we have to stationarize the target variable with respect to time, and then use autocorrelation plots to determine the number of important lags and whether it’s a moving average (MA) or autoregressive (AR) model. The four plots below show this for the Uber pickups data.

Image for post
  1. This chart shows the Uber pickups, a rolling mean, and a rolling standard deviation measure. While that data is not far from stationary, it can be better.
  2. Differencing the data with a lag of 168 (1 week in hours) really stationarizes the data.
  3. The autocorrelation plot shows important lags at 1 and 2, but also at 24 and at 48.
  4. The partial autocorrelation plots show important lags at 1 and 24.
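
To make steps 2 and 3 concrete, here is a small sketch of seasonal differencing and the sample autocorrelation at a given lag. The synthetic series is invented purely to show the effect; the helper names are not from the project.

```python
def seasonal_difference(series, lag=168):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance


# Synthetic data: a repeating weekly pattern plus a step up each week.
weeks = 4
series = [100 + (t % 168) + t // 168 for t in range(168 * weeks)]

# Differencing at one week (168 hours) removes the weekly pattern entirely,
# leaving only the constant week-over-week increase.
diffed = seasonal_difference(series, lag=168)
```

On real data the differenced series will not be constant, but a lag of 168 hours removes the repeating weekly shape in the same way, which is why it helps stationarize hourly demand.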

Putting those parameters in, training the model, and creating a forecast leads to the following result.

Image for post

The RMSE of 1,053.65 is quite a bit lower than the linear regression result. However, the forecast doesn’t show much day-to-day variance, because the best predictor of demand on a Saturday, for example, is the previous Saturday. SARIMAX is also extremely computationally expensive with this many lags, and it crashed my (relatively large) Google instance.

Facebook Prophet

Image for post

Not bad at all. It picked up on the weekly trend and hourly variance in each day. With an RMSE of 1,002.15, it’s already the best performing model. Since there is a lot of added customizability available on top of this, the following work was performed using Prophet.

Accuracy of Baseline Models

If we define accuracy as

Accuracy = 1 - (RMSE ÷ Mean Hourly Demand),

then this is how our baseline models compare.
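
The accuracy definition above is simple enough to express as a helper. The mean hourly demand in the example is a hypothetical value, since the post does not state the figure it used.

```python
def forecast_accuracy(rmse, mean_hourly_demand):
    """Accuracy = 1 - (RMSE / mean hourly demand)."""
    if mean_hourly_demand <= 0:
        raise ValueError("mean_hourly_demand must be positive")
    return 1 - rmse / mean_hourly_demand


# Hypothetical inputs, for illustration only (not the post's actual figures).
example = forecast_accuracy(rmse=1000.0, mean_hourly_demand=4000.0)
```

Normalizing by mean demand makes scores comparable between busy and quiet neighborhoods, where raw RMSE values differ widely in scale.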

Image for post

The choice to proceed with Prophet is clear.

Adding Geographic Features

  • Subway train availability, defined as number of subway stops multiplied by the estimated number of passengers using the transit system at that location for that given hour.
  • Bicycling activity, defined as number of Citibike docking stations multiplied by number of rides originating from those stations.
Image for post
On the left, the darker colors show higher subway train availability. On the right, the dots show the docking stations plotted on a map. In both slides, there is a line chart showing the general distribution of number of trains/bikes throughout the day.
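
Both features are simple products, as the definitions above describe. The sketch below computes them for one neighborhood; the function names and input numbers are hypothetical.

```python
def subway_availability(num_stops, est_passengers):
    """Number of subway stops times estimated riders at that hour."""
    return num_stops * est_passengers


def bicycling_activity(num_docking_stations, rides_started):
    """Number of docking stations times rides originating from them."""
    return num_docking_stations * rides_started


def neighborhood_features(num_stops, est_passengers,
                          num_docking_stations, rides_started):
    """Bundle both geographic features for one neighborhood."""
    return {
        "subway_availability": subway_availability(num_stops, est_passengers),
        "bicycling_activity": bicycling_activity(
            num_docking_stations, rides_started
        ),
    }


# Made-up values for a single neighborhood, for illustration only.
example_features = neighborhood_features(
    num_stops=4,
    est_passengers=1200,
    num_docking_stations=10,
    rides_started=35,
)
```

Each value can then be attached to the neighborhood's hourly demand series as an extra regressor alongside the time-lagged features.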

Increased Accuracy

Image for post

The increase in accuracy above indicates that there is signal in adding neighborhood-specific information to the timeseries forecasting model. We can zoom in to the neighborhoods themselves to see if all of them benefited from the new features.

Neighborhood-level Accuracy

Image for post

Interactive Dashboard

  • A map that can be used to select a neighborhood. It’s color coded to show the neighborhoods with the highest demands as a dark shade of blue.
  • Total volume of rides from January to May 2015 and the number of rides forecasted for the first week of June 2015.
  • Bar chart showing the historical weekday breakdown of rides.
  • Finally, an hourly forecast (in red) of one week of demand, overlaid on the held-out test data from June.

Below is a short video of the dashboard that shows it in action switching through various neighborhoods. You can access the full interactive dashboard yourself at this Tableau Public link.

Conclusions

  • Long Short-Term Memory (LSTM) neural networks perform rather well with timeseries forecasting and would be another tool to use in a future extension of this project.
  • Bringing in additional data (weather, real-time transit, sports events, population demographics, etc.) would also be very interesting, as I think there’s potential for a lot of signal in those datasets.

I really enjoyed working on this dataset and visualization. If you have any questions, thoughts, or suggestions, please feel free to reach out to me on Twitter or LinkedIn. Thanks!
