Forecasting Uber Demand in NYC
For my final project at Metis, I wanted to work on something that combined the following interests of mine:
- Urban transportation
- Geographic visualizations
- Timeseries forecasting
Therefore, I decided to see if I could forecast hourly Uber demand across NYC neighborhoods. In addition to time-lagged features (such as previous week’s demand), I added information specific to each neighborhood to improve my predictions. As a final result, I obtained relatively accurate unique forecasts for all neighborhoods in NYC.
All the code for the steps outlined below is on my GitHub.
- Uber, Lyft, and other ridesharing systems have become an important part of urban transportation.
Better forecasting can:
- Decrease surge pricing events
- Alert drivers of areas with upcoming demand
- Improve overall customer satisfaction with service
- Help with city and traffic planning
- Allow forecasts to take into account changes in availability of other forms of transit, new development, etc.
- The Uber data for this project came from FiveThirtyEight, who obtained the data from the NYC Taxi & Limousine Commission (TLC) by submitting a Freedom of Information Law request on July 20, 2015.
- NYC’s Citibike data is provided on their website.
- The NYC subway station locations can be downloaded from data.ny.gov.
- The neighborhood geoJSON file was the same one used by Adrian Meyers’ excellent NYC taxi trips analysis.
Using Google Cloud & PySpark for Analysis
This was a lot of data, so to process it quickly I used PySpark on a Google Cloud Dataproc cluster. Fortunately, Google’s tools make it easy to get up and running quickly.
Finally, I uploaded all my data to a Google Cloud Storage bucket, which made it easy to access from multiple instances and avoided the headache of manually setting up a distributed filesystem (HDFS).
Uber Pickups Data
The Uber data downloaded contained each “pickup” as a timestamped and geolocated row (not complete start/finish trip data). Below, all pickups are plotted against time.
Because of the large gap in information, all further analysis was done with only 2015 data (January to June).
Trends in the Data
We can look at just one month’s data (image below) and see a strongly cyclical effect. Each peak-and-valley pair spans one day, and the pattern of pickups seems to repeat weekly.
To zoom in further, we can pick a random day and look at how the Uber demand varies over each hour.
Looking only at the 2015 data, we still have a total of 14,264,110 pickups. Plotting that against time shows a general positive trend (demand is growing) and high variability (data is hourly). Of the six months, I used the first five for training and held out the last month, June, as the test set.
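A minimal sketch of that chronological split, assuming the pickups have already been counted per hour into a pandas Series. The synthetic counts and the names `hourly`, `train`, and `test` are stand-ins for illustration, not the project's actual data or code.

```python
import numpy as np
import pandas as pd

# Synthetic hourly counts for January through June 2015 (181 days).
# These are random stand-ins, not the real Uber pickups.
rng = np.random.default_rng(0)
index = pd.date_range("2015-01-01", periods=181 * 24, freq="h")
hourly = pd.Series(rng.poisson(3000, size=len(index)), index=index, name="pickups")

# Split by time, never at random: the test month must come after all training data.
cutoff = pd.Timestamp("2015-06-01")
train = hourly[hourly.index < cutoff]   # January to May
test = hourly[hourly.index >= cutoff]   # June

print(len(train), len(test))
```

Splitting on a date rather than shuffling keeps future information out of the training set, which matters for any model that uses lagged values.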
For baseline modeling, I created forecasts using only time-lagged features. I used the following three techniques, with the goal of progressing further with the most promising forecast.
- Linear Regression
- Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors (SARIMAX)
- Facebook Prophet
Models were evaluated using root-mean-squared error (RMSE). Because RMSE is in the same units as the target, it can be read directly as the “number of pickups” the forecast was off by.
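For reference, RMSE can be computed in a few lines. This is a generic sketch with made-up numbers; the function name `rmse` is an illustrative choice.

```python
import numpy as np

def rmse(actual, forecast):
    """Root-mean-squared error, in the same units as the inputs."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

# Toy values: the errors are -10, 10, and -30 pickups.
example = rmse([100, 200, 300], [110, 190, 330])
print(round(example, 2))
```

Squaring before averaging means a single badly missed spike weighs more than several small misses, which is worth keeping in mind when comparing the models below.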
It’s easy to make a simple linear regression model with features such as:
- day of the week
- weekend or not
- hour of day
- number of days since 1/1/2015
The resulting forecast (plotted against test values from June) is shown below. The RMSE was 1,324.13. The forecast is certainly very good for such a simple model. However, it misses much of the hour-by-hour variation and does not capture certain spikes.
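The four features listed above can be derived from the timestamp alone. The sketch below builds them with pandas and fits ordinary least squares with NumPy. It assumes the features are used as plain numeric columns (the original model may have encoded them differently, for example with one-hot columns), and the target here is a synthetic toy series.

```python
import numpy as np
import pandas as pd

def make_time_features(index, start="2015-01-01"):
    """Calendar features derived from an hourly DatetimeIndex."""
    features = pd.DataFrame(index=index)
    features["day_of_week"] = index.dayofweek                    # 0 = Monday
    features["is_weekend"] = (index.dayofweek >= 5).astype(int)  # Saturday or Sunday
    features["hour_of_day"] = index.hour
    features["days_since_start"] = (index - pd.Timestamp(start)).days
    return features

index = pd.date_range("2015-01-01", periods=14 * 24, freq="h")
X = make_time_features(index)

# Toy target (an assumption for illustration): a daily ramp plus slow growth.
y = 1000 + 20 * X["hour_of_day"] + 5 * X["days_since_start"]

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
coef, *_ = np.linalg.lstsq(A, y.to_numpy(dtype=float), rcond=None)
pred = A @ coef
```

Treating `hour_of_day` as a single numeric column forces a straight-line relationship across the day, which is one reason a model like this struggles with rush-hour peaks.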
If you are unfamiliar with ARIMA models for timeseries forecasting, here is a very good intro by Analytics Vidhya.
As with any ARIMA model, we have to make the target variable stationary with respect to time, and then use autocorrelation plots to determine which lags are important and whether the model needs moving average (MA) or autoregressive (AR) terms. The 4 plots below show that for the Uber pickups data.
- This chart shows the Uber pickups, a rolling mean, and a rolling standard deviation measure. While that data is not far from stationary, it can be improved.
- Differencing the data with a lag of 168 (1 week in hours) brings the data much closer to stationary.
- The autocorrelation plot shows important lags at 1 and 2, but also at 24 and at 48.
- The partial autocorrelation plots show important lags at 1 and 24.
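The diagnostics in the list above can be reproduced with pandas alone. The following is a sketch on a synthetic series with a trend and daily and weekly cycles, not the Uber data; the window of 168 hours for the rolling statistics is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series: upward trend, daily and weekly cycles, plus noise.
rng = np.random.default_rng(0)
index = pd.date_range("2015-01-01", periods=8 * 168, freq="h")
hours = np.arange(len(index))
series = pd.Series(
    3000
    + 2 * hours
    + 800 * np.sin(2 * np.pi * hours / 24)
    + 400 * np.sin(2 * np.pi * hours / 168)
    + rng.normal(0, 50, len(hours)),
    index=index,
)

# Rolling statistics over one week: roughly flat lines suggest stationarity.
rolling_mean = series.rolling(window=168).mean()
rolling_std = series.rolling(window=168).std()

# Seasonal differencing: subtract the value from the same hour one week earlier.
weekly_diff = series.diff(168).dropna()

# Sample autocorrelation of the differenced series at the lags discussed above.
acf = {lag: weekly_diff.autocorr(lag) for lag in (1, 2, 24, 48)}
print(acf)
```

For full ACF and PACF plots with confidence bands, `plot_acf` and `plot_pacf` from `statsmodels.graphics.tsaplots` are a common choice.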
Putting those parameters in, training the model, and creating a forecast leads to the following result.
The RMSE of 1,053.65 is quite a bit lower than the previous result with linear regression. However, there isn’t a lot of day-to-day variance in this forecast, because the best predictor of demand on a Saturday, for example, is the previous Saturday. SARIMAX is also extremely computationally expensive for so many lags, and it crashed my (relatively large) Google instance.
Prophet is a relatively new open-source forecasting package developed by Facebook. It’s easy to use and very customizable. Generating a forecast with Prophet is also quite easy — it’s mostly a matter of ensuring the data is in a specific format. So let’s see how it performed.
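The "specific format" is a DataFrame with a `ds` column of timestamps and a `y` column of values. The sketch below shows that reshaping plus a default fit and forecast. The helper names are hypothetical, the default `Prophet()` settings are an assumption rather than the configuration used in this project, and the import is deferred so the reshaping step runs even without Prophet installed.

```python
import numpy as np
import pandas as pd

def to_prophet_frame(hourly):
    """Reshape an hourly Series into the two columns Prophet expects: ds and y."""
    return pd.DataFrame({"ds": hourly.index, "y": hourly.to_numpy()})

def fit_and_forecast(train_df, horizon_hours):
    """Fit a default Prophet model and forecast `horizon_hours` past the training data."""
    # Imported here so the reshaping above has no hard dependency on Prophet.
    # Older releases of the package were published under the name `fbprophet`.
    from prophet import Prophet

    model = Prophet()
    model.fit(train_df)
    future = model.make_future_dataframe(periods=horizon_hours, freq="h")
    return model.predict(future)[["ds", "yhat"]]

# Synthetic stand-in for the hourly training data.
index = pd.date_range("2015-01-01", periods=7 * 24, freq="h")
hourly = pd.Series(np.arange(len(index), dtype=float), index=index)
train_df = to_prophet_frame(hourly)
print(train_df.head())
```

Calling `fit_and_forecast(train_df, 30 * 24)` would then produce an hourly forecast for a 30-day month, with `yhat` holding the predicted values.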
Not bad at all. It picked up on the weekly trend and hourly variance in each day. With an RMSE of 1,002.15, it’s already the best performing model. Since Prophet also offers a lot of additional customization, the remaining work was performed using Prophet.
Accuracy of Baseline Models
Using the following definition of accuracy,
Accuracy = 1 - (RMSE ÷ Mean Hourly Demand),
this is how our baseline models compare.
The choice to proceed with Prophet is clear.
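The accuracy definition above is a one-line calculation once the RMSE and the mean hourly demand are known. The numbers in this sketch are hypothetical and chosen for easy arithmetic; they are not results from this project.

```python
def accuracy(rmse, mean_hourly_demand):
    """Accuracy = 1 - (RMSE / mean hourly demand)."""
    if mean_hourly_demand <= 0:
        raise ValueError("mean hourly demand must be positive")
    return 1 - rmse / mean_hourly_demand

# Hypothetical inputs: an RMSE of 250 pickups against a mean of 1,000 pickups per hour.
example = accuracy(rmse=250.0, mean_hourly_demand=1000.0)
print(example)
```

Dividing by the mean makes the score comparable across neighborhoods with very different volumes, though it becomes unstable (and can go negative) where demand is close to zero.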
Adding Geographic Features
Prophet allows adding additional regressors during the modeling process. I suspected that transit availability would carry useful signal as another variable. Specifically:
- Subway train availability, defined as number of subway stops multiplied by the estimated number of passengers using the transit system at that location for that given hour.
- Bicycling activity, defined as number of Citibike docking stations multiplied by number of rides originating from those stations.
With the 2 geofeatures above, the Prophet model’s accuracy did increase to almost 78%.
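As a rough sketch of how the two geofeatures could be built and handed to Prophet: the column names, neighborhood rows, and numbers below are all hypothetical, and the actual project code may differ. `add_regressor` is Prophet's method for registering extra columns, and the import is deferred so the feature step runs without Prophet installed.

```python
import pandas as pd

# Hypothetical hourly activity per neighborhood.
activity = pd.DataFrame({
    "neighborhood": ["Chelsea", "Chelsea", "Astoria", "Astoria"],
    "ds": pd.to_datetime(["2015-05-01 08:00", "2015-05-01 09:00"] * 2),
    "subway_riders": [1200, 900, 400, 350],
    "citibike_rides": [80, 60, 0, 0],
})

# Hypothetical static counts per neighborhood.
static = pd.DataFrame({
    "neighborhood": ["Chelsea", "Astoria"],
    "subway_stops": [5, 3],
    "citibike_stations": [12, 0],
})

# Each feature is a station count multiplied by that hour's activity.
features = activity.merge(static, on="neighborhood", how="left")
features["subway_availability"] = features["subway_stops"] * features["subway_riders"]
features["bike_activity"] = features["citibike_stations"] * features["citibike_rides"]

def fit_with_regressors(train_df):
    """Fit Prophet with both geofeatures; train_df needs ds, y, and both feature columns."""
    from prophet import Prophet  # deferred import, see note above

    model = Prophet()
    model.add_regressor("subway_availability")
    model.add_regressor("bike_activity")
    model.fit(train_df)
    return model

print(features[["neighborhood", "ds", "subway_availability", "bike_activity"]])
```

One practical consequence of extra regressors is that their values must also be supplied for every future hour being forecast, so they need to be known or estimated ahead of time.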
The increase in accuracy above indicates that there is signal in adding neighborhood-specific information to the timeseries forecasting model. We can zoom in to the neighborhoods themselves to see if all of them benefited from the new features.
Each bar below represents one neighborhood of the 100 or so in this analysis. It’s clear that most neighborhoods see an increase in the forecast’s accuracy. However, there are some neighborhoods that see a decrease. Further analysis shows that these neighborhoods do not have a lot of Uber pickups within them, so the added features mostly add noise.
With the outputs from the model for each neighborhood, I created an interactive dashboard using Tableau that can be used to get historical insights into a neighborhood, and see its upcoming forecasted demand. This dashboard has:
- A map that can be used to select a neighborhood. It’s color coded to show the neighborhoods with the highest demands as a dark shade of blue.
- Total volume of rides from January to May 2015 and the number of rides forecasted for the first week of June 2015.
- Bar chart showing the historical weekday breakdown of rides.
- Finally, an hourly forecast (in red) of one week of demand overlaid against the test data held out from June.
Below is a short video of the dashboard that shows it in action switching through various neighborhoods. You can access the full interactive dashboard yourself at this Tableau Public link.
- Neighborhood-level data does help with time series forecasting, but it can also introduce additional noise.
- Long Short-Term Memory (LSTM) neural networks perform rather well with timeseries forecasting and would be another tool to use in a future extension of this project.
- Bringing in additional data (weather, real-time transit, sports events, population demographics, etc.) would also be very interesting, as I think there’s potential for a lot of signal in those datasets.