Lab Notes: Machine Learning and Capital Bikeshare

Joe Paolicelli
Mission Data Journal
4 min read · Aug 16, 2016

We are exploring uses for machine learning. We decided to start by expanding on our work with Capital Bikeshare data.

Earlier this year, we decided to take advantage of the large amount of trip data DC’s Capital Bikeshare has made available to the public. We began by creating a couple of interesting data visualizations which you can see in more detail here. Soon after we concluded that work, we started thinking about using the same data for a machine learning Labs project.

We set out with the simple goal of predicting the likelihood that a bikeshare station has a completely full or empty rack of bikes. Predicting that a station is full or empty would allow us to answer two important questions a rider might have: “Will there be any bikes at the rack I’m going to?” and “Will I be able to return this bike to the rack I’m going to?” This same type of machine learning system could be interesting for businesses that want to answer a question such as, “Can I service the demand at this time of day?” By applying machine learning and statistical methods to this historical bikeshare data, we were successful in answering those important questions.

How we did it

As with the previous data visualization project, we found that the raw data released by Capital Bikeshare wasn’t exactly what we needed. For that project we had to do some cleanup and transformation of the raw data. For this project we found that the source data was missing some critical information, specifically whether a station was full or empty at a given time of day. Luckily, we were able to find data collected by a third-party researcher, Capital Bikeshare Tracker, that captured this information historically.
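To illustrate the kind of labeling involved, here is a minimal sketch of deriving empty/full labels from station-status snapshots. The column names (`bikes_available`, `docks_available`) and station IDs are hypothetical, chosen for illustration rather than taken from the tracker’s actual schema:

```python
import pandas as pd

# Hypothetical station-status snapshot in the general shape of
# third-party tracker data: bikes and open docks per station
snapshot = pd.DataFrame({
    "station_id": [31000, 31001, 31002],
    "bikes_available": [0, 7, 15],
    "docks_available": [15, 8, 0],
})

# The labels the raw trip data lacked: is a station empty or full?
snapshot["empty"] = snapshot["bikes_available"] == 0
snapshot["full"] = snapshot["docks_available"] == 0
```

With labels like these attached to timestamps, the prediction task becomes a standard binary classification problem.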

To process the data, we again used the Python library Pandas. We used the Scikit-learn library for implementations of the statistical techniques. The techniques tested included logistic regression, stochastic gradient descent learning, random forest classification, and Bernoulli Restricted Boltzmann Machines (a type of neural network). These techniques were also tested in combination with one another to get better results.
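As a rough sketch of how those Scikit-learn estimators line up, including the RBM-plus-classifier combination, here is a toy example on synthetic data. The features and labels are placeholders, not the project’s actual inputs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Toy feature matrix in [0, 1] (RBMs expect binary-ish inputs)
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] > 0.5).astype(int)  # toy binary label: empty or not

classifiers = {
    "logistic": LogisticRegression(),
    "sgd": SGDClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
    # Combination: the RBM learns features, logistic regression
    # then classifies on those learned features
    "rbm+logistic": Pipeline([
        ("rbm", BernoulliRBM(n_components=8, random_state=0)),
        ("clf", LogisticRegression()),
    ]),
}

for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, round(clf.score(X, y), 3))
```

Because every estimator shares the `fit`/`score` interface, swapping techniques in and out for comparison is a one-line change.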

Each statistical technique produces a prediction over a discrete set of outcomes: for example, a 59% chance that a particular bike station is empty. We tested some techniques even if they didn’t seem like a good fit, so that we could compare the results to other approaches. To test the success of each technique, we used the holdout method with a 70/30 split. We trained the model on 70% of the data, and then tested for accuracy on the remaining 30%. We could then compare the accuracy of the various algorithms against each other. For example, it was useful to compare the results of using logistic regression by itself with the results of using logistic regression combined with Bernoulli Restricted Boltzmann Machines.
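The 70/30 holdout evaluation can be sketched in a few lines of Scikit-learn. This uses synthetic stand-in data, not the bikeshare dataset itself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and binary empty/not-empty labels
rng = np.random.default_rng(1)
X = rng.random((500, 3))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Holdout method: train on 70% of the data, test on the other 30%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"holdout accuracy: {accuracy:.3f}")

# Probability estimate for one example, e.g. "59% chance empty"
p_empty = model.predict_proba(X_test[:1])[0, 1]
```

Running the same split-and-score loop for each technique gives directly comparable accuracy numbers.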

At this point, the test results are showing fairly good empty/full predictions based only on the time of day. Before we detail the final algorithm and setup, we want to add weather data, and potentially event data, as additional inputs to the learning system. We believe doing this will increase the accuracy of the predictions and may also change which algorithm works best.

Challenges

A few of the challenges we ran into included dealing with the format and size of the data (3 years of data totaling 130 million records), and our limited familiarity with the languages, libraries, and statistical methods needed to properly process the data and return results. To deal with the format and large size of the data, we resampled it into consistent 15-minute intervals, giving us evenly spaced points to compare. We also processed the data in small chunks so that it didn’t use too much memory at once. So far we have been successful, as we are able to create models that are more accurate than just using historical averages.
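The resampling step can be sketched with Pandas. The timestamps and counts below are made up, and carrying the last observation forward is one reasonable choice for filling intervals with no snapshot, not necessarily the rule the project used:

```python
import pandas as pd

# Hypothetical raw status snapshots at irregular timestamps
status = pd.DataFrame(
    {"bikes_available": [0, 2, 5, 0, 11]},
    index=pd.to_datetime([
        "2016-06-01 08:03", "2016-06-01 08:09", "2016-06-01 08:31",
        "2016-06-01 08:47", "2016-06-01 09:02",
    ]),
)

# Resample to consistent 15-minute intervals, carrying the last
# observed count forward so every interval has a value
regular = status.resample("15min").last().ffill()
regular["empty"] = regular["bikes_available"] == 0
```

For the memory problem, the same transformation can be applied per chunk (for example, with the `chunksize` parameter of `pandas.read_csv`) so the full 130 million records never sit in memory at once.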

An important lesson to take from the data challenge we ran into is that you may not know today how you will use your data in the future. If you aren’t storing everything you collect, you may not have what you need later. There is always a trade-off between storing everything and predicting future needs.

Next Steps

Phase two of this project will include adding weather and potentially event data. We will then re-evaluate each of the methods we used in this phase and determine whether those extra data points give better results.

Have an idea for a machine learning project? Drop us a line.
