Predicting San Francisco Bikeshare availability with TensorFlow and LSTMs

Guy Needham
Apr 13, 2018 · 12 min read

Riding over Williamsburg Bridge, August 2017

Exploratory Data Analysis

It’s a good idea with any new dataset to have a look around. The data I’m using comes from two publicly available BigQuery datasets, bigquery-public-data.san_francisco.bikeshare_stations and bigquery-public-data.san_francisco.bikeshare_status. We can quickly explore these tables in the notebook:

Using Datalab to make a bar chart from BigQuery data is simple

Initial time series analysis

The time series table, bikeshare_status, logs the number of bikes available at each station at a point in time. Ideally, the data would cover a similar time window for every station in the scheme.

The status updates for station 91 only cover around a week of data
The number of status updates per day for the whole dataset shows that it covers a longer time frame
The status updates for station 6 have a similar distribution to the global dataset
Most stations have updates from a lot of distinct dates
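The coverage check described above can also be sketched in plain Python, outside BigQuery. This is a minimal illustration, not the notebook's actual code: the function name and the sample rows are made up, and it assumes each status update has been reduced to a (station_id, time) pair.

```python
from collections import defaultdict
from datetime import datetime


def distinct_dates_per_station(records):
    """Map each station_id to the number of distinct calendar dates it reported on."""
    dates = defaultdict(set)
    for station_id, timestamp in records:
        dates[station_id].add(timestamp.date())
    return {station_id: len(days) for station_id, days in dates.items()}


# Made-up sample rows: (station_id, time of the status update).
sample = [
    (6, datetime(2015, 1, 1, 8, 0)),
    (6, datetime(2015, 1, 1, 8, 1)),
    (6, datetime(2015, 1, 2, 8, 0)),
    (91, datetime(2015, 1, 1, 8, 0)),
]

coverage = distinct_dates_per_station(sample)
print(coverage)  # {6: 2, 91: 1}
```

A station with far fewer distinct dates than the rest, like station 91 here, is a candidate for exclusion before modelling.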

Data Seasonality

With most time series, the data is seasonal, by which I mean that it follows a pattern that repeats over a regular period, such as a day or a week.

Weekends have fewer low availability events
9 AM and 6 PM have more low availability events
%chart chart-type --data data-set more-options
{
"legend":{"position":"none"},
... more JSON options ...
}
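The two seasonal patterns in the captions (fewer low availability events at weekends, more around 9 AM and 6 PM) boil down to counting events by hour and by day type. Here is a rough sketch of that tally in plain Python. It assumes "low availability" means two bikes or fewer, the same cut-off used in the labelling query later in this post. The function name and sample rows are invented for illustration.

```python
from collections import Counter
from datetime import datetime

LOW_THRESHOLD = 2  # assumption: "low availability" means 2 bikes or fewer


def low_availability_profile(records, threshold=LOW_THRESHOLD):
    """Count low availability events by hour of day and by weekday/weekend."""
    by_hour = Counter()
    by_day_type = Counter()
    for timestamp, bikes_available in records:
        if bikes_available <= threshold:
            by_hour[timestamp.hour] += 1
            # datetime.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
            day_type = "weekend" if timestamp.weekday() >= 5 else "weekday"
            by_day_type[day_type] += 1
    return by_hour, by_day_type


# Made-up sample rows: (time, bikes_available).
sample = [
    (datetime(2015, 1, 5, 9, 0), 1),    # Monday morning, low
    (datetime(2015, 1, 5, 18, 0), 0),   # Monday evening, low
    (datetime(2015, 1, 5, 13, 0), 8),   # Monday lunchtime, not low
    (datetime(2015, 1, 3, 9, 0), 2),    # Saturday morning, low
    (datetime(2015, 1, 6, 9, 5), 2),    # Tuesday morning, low
]

by_hour, by_day_type = low_availability_profile(sample)
print(dict(by_hour), dict(by_day_type))
```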

Exploring geospatial data

These data also have a geospatial component, so I took the opportunity to plot it with folium, a fairly neat package for visualising geospatial data. I wanted to see if the location of a bikeshare station was important in determining its availability. I heat mapped the number of bikes available at each station between 18:00 and 19:00.

Making an interactive heatmap is pretty painless
It’s hard to interpret heatmaps when zoomed out
The heatmap rescales as we zoom in
And gets clearer when there are even fewer stations in shot

Docking station metadata and availability

Another property that should affect the availability of bikes is the size of the docking station. And yes, availability and docking station size are correlated, but not linearly: medium-sized docking stations still have a reasonable chance of running low on bikes.


Data Pipeline from BigQuery into the model

Now that I’m fairly comfortable with the data, it’s time to have a pop at training a model with the estimator API. I had a look at the number of records in the status table, and it’s fairly large: 107.5 million records. That’s more than will fit in Excel, which was one definition of BIG DATA I’ve come across.

  1. Query BigQuery to select that number of each class, randomly selecting rows
SELECT
...
FROM table
WHERE RAND() < 10/42
;
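One way to read the threshold in the query above is "rows wanted over rows available in the class". That is my interpretation of the 10/42 placeholder, not something the query states. Under that assumption, a small helper can build the condition for each class. The helper name is hypothetical.

```python
def sampling_clause(target_rows, class_rows):
    """Build a WHERE condition that keeps roughly target_rows out of class_rows rows.

    RAND() is evaluated per row, so the result is approximately, not exactly,
    target_rows rows.
    """
    if class_rows <= 0:
        raise ValueError("class_rows must be positive")
    # Never ask for more rows than the class contains.
    target_rows = min(target_rows, class_rows)
    return f"RAND() < {target_rows}/{class_rows}"


clause = sampling_clause(10, 42)
print(clause)  # RAND() < 10/42
```

Because the sample size is only approximate, the classes may still need trimming to equal length once the rows are in the notebook.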

Scaling up the Datalab instance

When pulling data from BigQuery into the Datalab instance, the Datalab VM started to struggle with more than a few thousand records in memory. When I checked on the VM in the Compute Engine section of the Google Cloud console, the CPU had been at 100% for a sustained period and the UI was sluggish. I had to resize the Datalab instance, pronto.

  1. Start up a new Datalab instance, specifying the machine type you require. This is achieved with the --machine-type parameter. I went for n1-highmem-4.
  2. Open up the notebook you were working on. It should appear in the UI.

TensorFlow model

I’m using the TensorFlow estimator API in this example to quickly get a model done. The key steps are to create input functions for the testing and training data, and to define the feature columns from the dataset. The estimator API abstracts away all of the more complicated linear algebra code.
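To make the idea of an input function concrete: an estimator input function supplies a dictionary of named feature columns together with the matching labels, usually in shuffled batches. The sketch below imitates that contract in plain Python. It is deliberately not TensorFlow code, and the column names (hour, dockcount, low_availability) and the helper name are assumptions made for illustration.

```python
import random


def make_input_fn(rows, feature_names, label_name, batch_size, shuffle, seed=0):
    """Return a function that yields (features, labels) batches from a list of dict rows."""

    def input_fn():
        ordered = list(rows)
        if shuffle:
            # Shuffle a copy so the caller's data keeps its order.
            random.Random(seed).shuffle(ordered)
        for start in range(0, len(ordered), batch_size):
            batch = ordered[start:start + batch_size]
            features = {name: [row[name] for row in batch] for name in feature_names}
            labels = [row[label_name] for row in batch]
            yield features, labels

    return input_fn


# Made-up rows with hypothetical column names.
rows = [
    {"hour": 9, "dockcount": 15, "low_availability": 1},
    {"hour": 13, "dockcount": 27, "low_availability": 0},
    {"hour": 18, "dockcount": 19, "low_availability": 1},
]

test_input_fn = make_input_fn(
    rows, ["hour", "dockcount"], "low_availability", batch_size=2, shuffle=False
)
batches = list(test_input_fn())
print(batches[0])
```

In this sketch the training input function would be built the same way with shuffle=True, while the testing one keeps the original order so results are repeatable.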


Time Series prediction with LSTM RNN

While the first model is interesting enough, it’s not the most fascinating of models. What would be more useful would be to predict whether a specific station is going to have capacity in n time steps. This model could power a service which would tell me the chances of a pair of bikes being available at the station when I get there, solving a commonly frustrating issue.

-- Mean time, in seconds, between consecutive status updates from the same station
SELECT AVG(sd) AS mean_td FROM (
  SELECT
    TIMESTAMP_DIFF(
      LEAD(time) -- window function to access the next record
        OVER(PARTITION BY station_id ORDER BY time ASC),
      time, SECOND) AS sd
  FROM `bigquery-public-data.san_francisco.bikeshare_status`
) a
;

-- Label each reading with the availability reported 30 updates later
SELECT
  bikes_available,
  IF(next_ba <= 2, 1, 0) AS low_availability,
  IF(next_ba > 2, 1, 0) AS high_availability
FROM (
  SELECT
    time,
    bikes_available,
    LEAD(bikes_available, 30)
      OVER(PARTITION BY station_id ORDER BY time ASC) AS next_ba
  FROM
    `bigquery-public-data.san_francisco.bikeshare_status`
  WHERE
    station_id = STATION_OF_INTEREST
  ORDER BY
    time ASC ) a
WHERE
  next_ba IS NOT NULL
;
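For readers less familiar with window functions, the labelling query can be mirrored in a few lines of Python. This is a sketch for a single station's readings, already sorted by time. The function name is hypothetical, and the defaults (a lead of 30 updates and a threshold of 2 bikes) are taken from the query.

```python
def label_with_lead(bikes_available, lead=30, threshold=2):
    """Pair each reading with labels derived from the reading `lead` steps later.

    The last `lead` readings have no future value (next_ba would be NULL),
    so they are dropped, as in the query's WHERE clause.
    """
    labelled = []
    for i in range(len(bikes_available) - lead):
        next_ba = bikes_available[i + lead]
        low = 1 if next_ba <= threshold else 0
        labelled.append((bikes_available[i], low, 1 - low))
    return labelled


# Made-up readings, with a short lead so the effect is easy to follow.
readings = [5, 4, 2, 1, 3]
labels = label_with_lead(readings, lead=2)
print(labels)  # [(5, 1, 0), (4, 1, 0), (2, 0, 1)]
```

Each tuple is (bikes_available, low_availability, high_availability). The two labels are always complementary, which is why the sketch derives the second from the first.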

Online prediction with the LSTMs

The next challenge was to simulate an online lookup for a station. This would represent a ping to our API to request a prediction for a particular station. In reality, we would select the n most recent status updates from a station, but in this setting I instead opted for selecting a random 15 minutes within the time range the station reported status updates for. This data was transformed into a numpy array, before loading up the model for that station and generating a prediction.
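The window selection step might look something like the sketch below. It rests on several assumptions: the station's readings are already in time order, the window is measured in number of updates rather than minutes, and the model expects input shaped (samples, time steps, features), which is a common layout for LSTMs but is not confirmed here. Nested lists stand in for the numpy array to keep the example dependency-free, and both function names are hypothetical.

```python
import random


def random_window(readings, window, rng):
    """Pick a random contiguous run of `window` readings from a time-ordered list."""
    if len(readings) < window:
        raise ValueError("not enough readings for the requested window")
    start = rng.randrange(len(readings) - window + 1)
    return readings[start:start + window]


def to_model_input(window_values):
    """Arrange a window as one sample with one feature per time step: shape (1, n, 1)."""
    return [[[float(value)] for value in window_values]]


# Made-up readings for one station: 100 consecutive status updates.
readings = list(range(100))
rng = random.Random(0)  # seeded so the simulated lookup is repeatable

window = random_window(readings, window=15, rng=rng)
batch = to_model_input(window)
print(len(batch), len(batch[0]), len(batch[0][0]))  # 1 15 1
```

In a live service, the random window would be replaced by the n most recent status updates for the requested station.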

Conclusion

The work covered here represents a three-day effort on my part. I did it while studying for the Google Cloud Engineering certification, to get some practice working on the platform. The key learning for me was that a Python notebook running on Datalab is powerful: it integrates well with many Google products and makes exploring large datasets easy. However, some parts of the process, such as the charting API, are not well documented.


Google Cloud Platform - Community

A collection of technical articles published or curated by Google Cloud Platform Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

