Those of you who know me in real life will be aware that I’ve moved to Sydney, Australia. Here I’m working for a consultancy called Servian, who I’m writing this post on behalf of. You can learn more about Servian on the website.
This post covers using Google Datalab, BigQuery, and TensorFlow to perform some machine learning.
I visited New York for the first time last year and had heaps of fun riding around on the city hire bikes. There’s a photo of my partner and me riding over the Williamsburg Bridge which captures how much fun we had. One frustrating aspect of the experience was finding a pair of bikes. Oftentimes I’d look at the map in the app and see two bikes close by, but by the time we got there one or both would be gone, which was pretty frustrating.
At Servian, I’ve been investing some time in learning the basics of the Google Cloud Platform. In this post, I’ll share how I’ve combined Cloud Datalab, which is essentially a hosted Python notebook environment, BigQuery for fast SQL access to large datasets, and TensorFlow for machine learning. All of this goes towards building a set of models which can be used to predict whether a bike hire station in San Francisco will have at least two bikes available during a given period of time.
There are two reasons why I’m working with data from San Francisco. Firstly, and most importantly, there’s a very large dataset publicly available on BigQuery which covers the state of the bike hire scheme over a period of years. Secondly, I hope to visit San Francisco before long, and maybe this can make my trip a little less frustrating.
Importantly, all code I wrote for this blog post is available publicly on GitHub, and the Python notebook runs on Google Cloud Datalab.
Exploratory Data Analysis
It’s a good idea with any new dataset to have a look around. The data I’m using comes from two publicly available BigQuery datasets, bigquery-public-data.san_francisco.bikeshare_stations and bigquery-public-data.san_francisco.bikeshare_status. We can quickly explore these tables in the notebook:
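If you haven’t used the BigQuery integration in Datalab before, a cell like the one below is enough to peek at a table. This is a minimal sketch assuming the google.datalab.bigquery module found in recent Datalab releases (older releases expose a similar datalab.bigquery module, with a slightly different API) and standard SQL:

import google.datalab.bigquery as bq

# Preview a handful of rows from the stations table to get a feel for the schema.
preview = bq.Query("""
SELECT *
FROM `bigquery-public-data.san_francisco.bikeshare_stations`
LIMIT 5
""").execute().result().to_dataframe()

preview.head()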
Initial time series analysis
There is a time series table, bikeshare_status, logging the number of bikes available at each station at a given point in time. Hopefully that data covers a similar time window for each of the stations in the scheme.
From here on, I won’t always show how the charts were created as it’s very similar to the two examples I have shared. All the code can be found on GitHub.
That’s not a very long time. How odd. Is this station an anomaly, or are there other stations with more data?
Yes: the distribution across all stations is quite different from this station’s, and another station has a distribution which matches the global one:
We can also see that almost all stations have a great many days represented:
With most time series, the data is seasonal, by which I mean
a pattern, variation, or fluctuation that is correlated with a season, day of the week, or other period of time
Both days of the week and hours of the day display a seasonal component:
A reasonably frustrating moment came when I wanted to drop the legend from these plots. The Datalab chart API is poorly documented, and it took me a while to work out how to pass the JSON options that remove the legend. Examples of this are visible in the notebook screengrabs above; this is the pattern I followed:
%%chart chart-type --data data-set more-options
... JSON options in the cell body ...
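For example, dropping the legend looks roughly like the cell below, where columns and station_counts stand in for whatever chart type and data set you’re plotting, and the cell body holds the standard Google Charts options:

%%chart columns --data station_counts
{
  "legend": "none"
}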
Exploring geospatial data
These data also have a geospatial component, so I took the opportunity to plot them with folium, a fairly neat package for visualising geospatial data. I wanted to see whether the location of a bikeshare station was important in determining its availability, so I heat-mapped the number of bikes available at each station between 18:00 and 19:00.
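For anyone who wants to reproduce this, a minimal heatmap with folium looks roughly like the following. Here station_df is a hypothetical DataFrame holding each station’s coordinates and its average evening bike count:

import folium
from folium.plugins import HeatMap

# Centre the map on San Francisco and overlay a weighted heatmap,
# one (lat, lon, weight) triple per station. station_df is assumed.
m = folium.Map(location=[37.78, -122.40], zoom_start=13)
HeatMap(
    station_df[["latitude", "longitude", "avg_bikes_available"]].values.tolist()
).add_to(m)
m  # Datalab/Jupyter renders the map inline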
Plotting the data on a map reminded me that heatmaps are really fun, and also showed that certain streets in the centre of San Francisco seem to have higher availability at peak times than others. Market Street looks to have fairly high availability, whereas 2nd Street down towards King Street has more limited availability.
Docking station metadata and availability
Another property which should impact the availability of bikes is the size of the docking station. And yes, availability and docking station size are correlated, but not linearly. The medium-sized docking stations have a reasonable chance of running short of bikes.
It’s also worth noting that the most common docking station sizes have a higher chance of running out of bikes. This indicates that the size of the docking station is, on its own, a potentially misleading feature.
Similarly, the locations with the most docking stations run out of bikes the most often.
Data Pipeline from BigQuery into the model
Now that I’m fairly comfortable with the data, it’s time to have a pop at training a model with the estimator API. I had a look at the number of records in the status table, and it’s fairly large: 107.5 million records. That’s more than will fit in Excel, which is one definition of BIG DATA I’ve come across.
I’ve chosen to frame this as a binary classification problem: if there are fewer than 2 bikes at the station, that signifies low availability to me, as my partner and I wouldn’t both be able to jump on a bike. Having a quick look at the balance of these classes, there are 2.5 million low availability events and nearly 105 million high availability events. Dealing with a class imbalance like this can be annoying, so I’m going to downsample the ‘0’ (high availability) class.
I chose to sample the data as follows:
- set the number of each class required
- query BigQuery to select that number of each class, randomly selecting rows
The RAND() function in BigQuery can be used to sample a data set. The strategy is to generate a number between 0 and 1 for each row, and select the row if that number is less than the fraction of the data we wish to keep. In practice this is fast and reduces the network traffic, since the sampling happens inside BigQuery. Say we want 10 rows and we have 42 records; the WHERE clause would look like this:
WHERE RAND() < 10/42
We use the BigQuery Datalab connector to pull data from BigQuery into a Pandas DataFrame. I’m using Pandas as it’s straightforward to connect a DataFrame to a TensorFlow model using the estimator API.
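Putting the two together, the pull into Pandas looked roughly like the cell below. This is a sketch rather than the exact notebook code: the per-class row counts, the column list, and the low-availability threshold are my own assumptions, and it uses the google.datalab.bigquery module found in recent Datalab releases:

import google.datalab.bigquery as bq

# Roughly how many rows of each class to pull, and the approximate class
# counts observed above -- both assumed for illustration.
n_per_class = 100000
low_events, high_events = 2.5e6, 105e6

sql = """
SELECT station_id, time, bikes_available, docks_available,
       IF(bikes_available < 2, 1, 0) AS low_availability
FROM `bigquery-public-data.san_francisco.bikeshare_status`
WHERE (bikes_available < 2 AND RAND() < {low_frac})
   OR (bikes_available >= 2 AND RAND() < {high_frac})
""".format(low_frac=n_per_class / low_events,
           high_frac=n_per_class / high_events)

df = bq.Query(sql).execute().result().to_dataframe()
df["low_availability"].value_counts()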
Scaling up the Datalab instance
When pulling data from BigQuery into the Datalab instance, the Datalab VM really started to struggle with more than a few thousand records in memory. When I checked on the VM in the Compute Engine section of the Google Cloud console, the CPU had been at 100% for a sustained period and the UI was really struggling. I had to resize the Datalab instance, pronto.
The process for doing this was as follows.
- Save your notebook. Datalab integrates with git and has a UI tool called ungit, which by default will back up your code to one repo for all your Datalab work in that project. Datalab also allows you to save your notebook down to Google Cloud Storage, in one bucket for all your Datalab work. Once your code is committed and pushed, as well as saved to Cloud Storage, you can safely exit the notebook, kill the Cloud Shell process, and stop or delete the VM. The VM will be named after your Datalab session.
- Start up a new Datalab instance, specifying the machine type you require. This is achieved with the --machine-type parameter. I went for n1-highmem-4.
- Open up the notebook you were working on; it should appear in the UI.
And there you have it — one of the major benefits of cloud computing is to be able to easily move to a different hardware configuration to meet your emerging requirements. You are also able to turn off the compute instance when you’re finished working for the time being, and all your work is saved and backed up in Cloud Storage and in git. This means you only pay for the VM instance while it is running, and nothing else.
I’m using the TensorFlow estimator API in this example to quickly get a model done. The key steps are to create input functions for the testing and training data, and to define the feature columns from the dataset. The estimator API abstracts away all of the more complicated linear algebra code.
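To give a flavour of what that looks like, here’s a minimal sketch using the TF 1.x estimator API. The feature names and the train_df/test_df split are assumptions standing in for whatever came back from BigQuery; the notebook on GitHub has the real thing:

import tensorflow as tf

# Assumed numeric feature names; the real notebook derives its features
# from the BigQuery pull. train_df / test_df are a simple random split
# of the sampled DataFrame.
FEATURES = ["docks_available", "hour_of_day", "day_of_week"]
LABEL = "low_availability"

feature_cols = [tf.feature_column.numeric_column(f) for f in FEATURES]

train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=train_df[FEATURES], y=train_df[LABEL],
    batch_size=128, num_epochs=None, shuffle=True)
eval_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=test_df[FEATURES], y=test_df[LABEL],
    num_epochs=1, shuffle=False)

model = tf.estimator.DNNClassifier(hidden_units=[32, 16],
                                   feature_columns=feature_cols)
model.train(input_fn=train_input_fn, steps=5000)
print(model.evaluate(input_fn=eval_input_fn))  # accuracy, AUC, loss, ...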
As you can see, we easily separate the DataFrame into test and train components, define a model, train it, and evaluate it. The model was quite good in the end, with 97% accuracy and an AUC of 0.6 to 0.8 depending on the sample pulled from BigQuery.
Time Series prediction with LSTM RNN
While the first model is interesting enough, it’s not the most fascinating of models. What would be more useful would be to predict whether a specific station is going to have capacity in n time steps. This model could power a service telling me the chances of a pair of bikes being available at the station when I get there, solving a common source of frustration.
Firstly, we need to explore how often the stations update their status on average:
SELECT avg(sd) as mean_td FROM (
  SELECT TIMESTAMP_DIFF(
    lead(time) --window function to access the next record
      over(partition by station_id
           order by time asc),
    time, SECOND) as sd
  FROM `bigquery-public-data.san_francisco.bikeshare_status`
)
It turns out that stations usually send a new status every minute, but it does vary a little.
What I want to do with the time series of bikeshare status updates is to predict, given a 15 minute window of status updates, if there will be availability in 15 minutes. Extracting the data from BigQuery isn’t too onerous:
SELECT
  bikes_available,
  IF(next_ba <= 2, 1, 0) AS low_availability,
  IF(next_ba > 2, 1, 0) AS high_availability
FROM (
  SELECT time, bikes_available,
    LEAD(bikes_available, 30) --the availability ~30 records ahead
      OVER(PARTITION BY station_id ORDER BY time ASC) AS next_ba
  FROM `bigquery-public-data.san_francisco.bikeshare_status`
  WHERE station_id = STATION_OF_INTEREST
  ORDER BY time ASC ) a
WHERE next_ba IS NOT NULL
Note a few aspects of this query. The next_ba column is generated from a record 30 steps in the future. Two binary flags are generated here, low_availability and high_availability, as the lower level TensorFlow libraries require each class to have its own column.
The data pulled from BigQuery needs to be “strided” which means taking a window of rows from the DataFrame and casting them into a list of records. This generates a list of records covering 15 time steps, and a pair of flags indicating the state in another 15 time steps. The DataFrame can now safely be shuffled as part of model training.
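A minimal sketch of the striding, assuming the columns from the query above; the function name, and the choice of attaching each window’s labels to its first row, are my own illustration rather than the exact notebook code:

import numpy as np

def stride_windows(df, window=15,
                   feature_cols=("bikes_available",),
                   label_cols=("low_availability", "high_availability")):
    """Cast consecutive rows into fixed-length windows.

    Each sample is `window` consecutive status records; its label is the
    pair of availability flags attached to the first record, which the
    query above generated from 30 records further on.
    """
    features = df[list(feature_cols)].values.astype(np.float32)
    labels = df[list(label_cols)].values.astype(np.float32)
    xs, ys = [], []
    for start in range(len(df) - window + 1):
        xs.append(features[start:start + window])
        ys.append(labels[start])
    return np.array(xs), np.array(ys)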
We no longer have an out of the box iterator to rely on, but it’s not so hard to roll our own:
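Something like this is enough, assuming the arrays produced by the striding step above:

import numpy as np

def batch_iterator(xs, ys, batch_size=64, shuffle=True):
    """Yield (features, labels) mini-batches, reshuffling the windows each pass."""
    idx = np.arange(len(xs))
    if shuffle:
        np.random.shuffle(idx)
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]
        yield xs[batch], ys[batch]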
Next up, I need a TensorFlow graph defining the movement of data through an LSTM network:
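The real graph is in the notebook; here is a stripped-down sketch of the shape of it, using the TF 1.x low-level API. The layer sizes, learning rate, and names are assumptions:

import tensorflow as tf

# Assumed sizes: 15 roughly one-minute steps per window, one feature per
# step, two output classes (low / high availability), 32 LSTM units.
N_STEPS, N_FEATURES, N_CLASSES, N_UNITS = 15, 1, 2, 32

graph = tf.Graph()
with graph.as_default():
    # A window of status records in, a pair of availability flags out.
    x = tf.placeholder(tf.float32, [None, N_STEPS, N_FEATURES], name="x")
    y = tf.placeholder(tf.float32, [None, N_CLASSES], name="y")

    cell = tf.nn.rnn_cell.BasicLSTMCell(N_UNITS)
    outputs, _ = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32)

    # Classify from the final LSTM output.
    logits = tf.layers.dense(outputs[:, -1, :], N_CLASSES)
    probs = tf.nn.softmax(logits, name="probs")

    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

    correct = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

    init = tf.global_variables_initializer()
    saver = tf.train.Saver()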
What happens next is that we pull data for each station, then create a TensorFlow Session for that station’s data, and learn the weights of the model. The model is then saved to disk, evaluated, and the session is closed. This means we can reuse the same graph for multiple entities in our data set. Once the models are saved to disk, when we wish to query a model for a specific station, we load up the state, and initialise a new Session with that state.
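In terms of the sketch above, the per-station loop looks roughly like the following. Here stride_windows, batch_iterator, graph and friends come from the earlier hypothetical snippets, and the epoch count and 80/20 evaluation split are illustrative rather than what the notebook actually uses:

def train_station(station_id, station_df, epochs=10, model_dir="models"):
    """Train the shared graph on one station's windows and save a checkpoint."""
    xs, ys = stride_windows(station_df)
    split = int(0.8 * len(xs))  # hold the tail of the series back for evaluation
    with tf.Session(graph=graph) as sess:
        sess.run(init)
        for _ in range(epochs):
            for bx, by in batch_iterator(xs[:split], ys[:split]):
                sess.run(train_op, feed_dict={x: bx, y: by})
        acc = sess.run(accuracy, feed_dict={x: xs[split:], y: ys[split:]})
        saver.save(sess, "%s/station_%s.ckpt" % (model_dir, station_id))
    return acc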
This was not trivial to get working; however, it was rewarding, as the models for each station achieve very high accuracy, between 96% and 100%.
Online prediction with the LSTMs
The next challenge was to simulate an online lookup for a station. This would represent a ping to our API requesting a prediction for a particular station. In reality, we would select the n most recent status updates from a station, but in this setting I instead opted to select a random 15 minutes within the time range the station reported status updates for. This data was transformed into a numpy array before loading up the model for that station and generating a prediction.
Training each model was parameterised by the station ID. When performing an online prediction in the code below, we load up the model for the query station ID and query it. This is a little slow, and in a real world setting I’d keep the sessions for each model loaded in a dictionary.
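Continuing the same sketch, the lookup amounts to restoring the station’s checkpoint and scoring a single window; again, the names follow the earlier hypothetical snippets rather than the notebook exactly:

import numpy as np

def predict_station(station_id, window, model_dir="models"):
    """Restore one station's checkpoint and score a single 15-step window."""
    with tf.Session(graph=graph) as sess:
        saver.restore(sess, "%s/station_%s.ckpt" % (model_dir, station_id))
        # window has shape (N_STEPS, N_FEATURES); add a batch dimension of 1.
        p = sess.run(probs, feed_dict={x: window[np.newaxis, :, :]})
    return {"low_availability": p[0, 0], "high_availability": p[0, 1]}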
The work covered here represents a three-day effort on my part. I did this work while studying for Google Cloud Engineering certification, to get some practice working on the platform. The key learning for me was that having a Python notebook running on Datalab is powerful: it integrates well with many Google products and makes exploring large datasets easy. However, some parts of the process, such as the charting API, are not well documented.
The estimator API makes it straightforward to generate new insights from a dataset, but it is somewhat restrictive; all that ease of use comes at a cost. The lower level API in TensorFlow is very frustrating to work with. I spent several hours hunting down the reason for an exception which stated:
ValueError: setting an array element with a sequence
After quite some time, I worked out that this meant my input data contained a null. This exception could be made more explanatory.
Overall, I feel that TensorFlow is quite hard to get your head around and be productive with unless you stick to the estimator API. Once TensorFlow is working it is quite fast, and it works well with other Google Cloud products such as CloudML. TensorFlow Serving and TensorBoard are also good components of the Google ML ecosystem. If you are tightly integrated with the Google Cloud ecosystem, TensorFlow is a good choice of machine learning library.
In terms of costs, all of the work involved with this post cost me US$10.63 of the free credit I was given for signing up to Google Cloud Platform. That’s very affordable. Most of the costs came from running a high memory instance for several days, so it could be even cheaper if you were happy to wait around for computations.
The techniques demonstrated in this post could be applied to many other real world settings, for example in predicting events based on streams of sensor data or in predicting the number of customers arriving at a location. Time series analysis is a hot topic at present, and as I’ve demonstrated here there is a lot that can be achieved with good data in a relatively short timeframe.
The code for this post can be found here: https://github.com/guyneedham/sanfran-bikeshare.