Determining which bikeshare docking stations will be empty or full

Ankur Vishwakarma
4 min read · Mar 12, 2018

Urban bikeshare systems are a wonderful way to get around a city. They are a great transit option for tourists and commuters alike. However, it can be a frustrating experience to arrive at the destination and discover that there are no empty docks left. It can be equally frustrating to find a totally empty bikeshare station right at the start of a trip.

We should be able to use historical occupancy, weather, and population data to predict when stations will be more full or more empty.

The ability to forecast this can help with:

  • More efficiently rebalancing bikes
  • Designing self-rebalancing incentives for customers

Data

Fortunately, there is a lot of data on the former SF Bay Area Bike Share system available on Kaggle in a convenient SQLite format.

I loaded the data into an Amazon EC2 instance since some of the analysis here would tax my laptop a bit too much.

Location of all the stations in the SF Bay Area Bike Share system.

Just for curiosity’s sake, we can look at the number of rides across the system and see when they’re happening. As expected, most of the rides happen during weekday commute hours:
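Here's a minimal sketch of that query, assuming the dump's `trip` table and its text-formatted `start_date` column (the `database.sqlite` file name comes from the Kaggle download):

```python
import sqlite3
import pandas as pd

# Open the Kaggle SQLite dump
conn = sqlite3.connect("database.sqlite")

# The trip table stores start times as text timestamps
trips = pd.read_sql_query("SELECT start_date FROM trip", conn)
trips["start_date"] = pd.to_datetime(trips["start_date"])

# Count rides by hour of day, split into weekday vs. weekend
trips["hour"] = trips["start_date"].dt.hour
trips["weekday"] = trips["start_date"].dt.dayofweek < 5
print(trips.groupby(["weekday", "hour"]).size())
```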

Prediction Evaluation Metric

This is what the confusion matrix looks like for this particular classification:

+---------------------+---------------------+----------------------+
|                     | Predicted More Full | Predicted More Empty |
+---------------------+---------------------+----------------------+
| Actually More Full  | True Positive (TP)  | False Negative (FN)  |
| Actually More Empty | False Positive (FP) | True Negative (TN)   |
+---------------------+---------------------+----------------------+

Both true positives (TPs) and true negatives (TNs) are equally important to get right. An FP, however, would be bad: a rebalancing van would arrive at a station to pick up bicycles and find it emptier than anticipated. An FN, on the other hand, would be even worse: the van would arrive to drop off bikes, find too few open docks, and spend extra time, money, and fuel driving to other bikeshare stations nearby. Therefore, we'll come up with an Expected Value (EV) metric that we can use to rate our models.

Recalling that an expected value is a probability-weighted sum of the payoff of each possible outcome,

    EV = Σ p(outcome) × value(outcome)

our EV metric will be:

    EV = (1·TP + 1·TN − 1·FP − 2·FN) / (TP + TN + FP + FN)

It simply awards $1 for each correct prediction, subtracts $1 for each FP, and subtracts $2 for each FN. That total is then divided by the number of predictions to get a normalized, unitless metric that ranges between −2 and 1.
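As a concrete sketch, here's one way to compute it with scikit-learn (the `ev_metric` helper name is mine, not from the original analysis):

```python
from sklearn.metrics import confusion_matrix

def ev_metric(y_true, y_pred):
    """EV score: +$1 per TP/TN, -$1 per FP, -$2 per FN, normalized.

    Labels: 1 = "more full" (positive class), 0 = "more empty".
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return (tp + tn - fp - 2 * fn) / (tp + tn + fp + fn)

# A perfect classifier scores 1.0; predicting all-FN scores -2.0
assert ev_metric([1, 1, 0, 0], [1, 1, 0, 0]) == 1.0
```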

Target Distribution

Each station’s status update included how many bikes and docks were available at that time. With that info:

  1. Calculate occupancy: bikes available ÷ total docks at the station.
  2. Label occupancy ≤ 0.5 as “more empty” and occupancy > 0.5 as “more full”.
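In pandas, that transformation might look like the following sketch; the column names follow the Kaggle dump's `status` table, and I'm treating bikes_available + docks_available as the station's total dock count:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("database.sqlite")

# Snapshots of bikes/docks available per station over time
status = pd.read_sql_query(
    "SELECT station_id, bikes_available, docks_available, time FROM status",
    conn)

# Step 1: occupancy = bikes / total docks at that moment
total_docks = status["bikes_available"] + status["docks_available"]
status["occupancy"] = status["bikes_available"] / total_docks

# Step 2: binary target, 1 = "more full" (> 0.5), 0 = "more empty"
status["more_full"] = (status["occupancy"] > 0.5).astype(int)
```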

Baseline Logistic Regression

To establish a starting point, I ran a simple logistic regression using only location and time as features.

The Receiver Operating Characteristic (ROC) curve shows how this model performed: not much better than random.

Its EV metric (the one defined earlier) is −0.238.
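A minimal version of that baseline might look like this, where `df` is a hypothetical frame combining station coordinates, time features, and the `more_full` target built above, and `ev_metric` is the helper from earlier:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Location and time only: the baseline's entire feature set
X = df[["lat", "long", "hour", "dayofweek"]]
y = df["more_full"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
print("EV:", ev_metric(y_test, baseline.predict(X_test)))
```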

Feature Engineering

To add more signal, I engineered 24 features in the following categories (a sketch of how they join together follows the list):

  • Typical dock availability during “training” years
  • Weather ☀️⛅️🌧🌪
  • Population (2010 census)
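Here's a rough sketch of those joins, assuming a `train` frame holding the training-years snapshots (with the `occupancy` column from above), a modeling frame `df` carrying station_id, hour, date, and zip_code, and the weather column names from the Kaggle dump:

```python
import pandas as pd

# 1) Typical availability: mean occupancy per (station, hour),
#    computed over the training years only
typical = (train.groupby(["station_id", "hour"])["occupancy"]
                .mean().rename("typical_occupancy").reset_index())

# 2) Daily weather (column names assumed from the Kaggle weather table)
weather = pd.read_sql_query(
    "SELECT date, zip_code, mean_temperature_f, precipitation_inches "
    "FROM weather", conn)
weather["date"] = pd.to_datetime(weather["date"])

# 3) Join onto the modeling frame; a 2010-census population-by-zip table
#    (hypothetical census_pop frame) would merge the same way
df = (df.merge(typical, on=["station_id", "hour"], how="left")
        .merge(weather, on=["date", "zip_code"], how="left"))
```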

Expanding the initial logistic regression model with many of these features does improve the EV metric, but the gains plateau relatively quickly.

Trees, Random Forest, and XGBoost

At this point, it made sense to try more sophisticated models to improve the classification. In all, I tried the following (a training-loop sketch follows the list):

  1. Logistic Regression
  2. Single Tree
  3. Random Forest
  4. XGBoost
  5. Light GBM
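A sketch of that comparison, reusing the train/test split and `ev_metric` from earlier; the hyperparameters shown are illustrative placeholders, not the tuned values:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Single Tree": DecisionTreeClassifier(max_depth=8),
    "Random Forest": RandomForestClassifier(n_estimators=300, n_jobs=-1),
    "XGBoost": XGBClassifier(n_estimators=300, max_depth=6),
    "LightGBM": LGBMClassifier(n_estimators=300),
}

# Fit each model and score it with the EV metric on the held-out set
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: EV = {ev_metric(y_test, model.predict(X_test)):.3f}")
```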

With XGBoost, it's easy to generate feature importances to see which features carry more signal than others. In this case, the top features are shown in the chart below:
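Producing such a chart is a one-liner with xgboost's plotting helper; I'm showing gain-based importance here as a judgment call, since split-count and gain rankings can differ:

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Plot the ten most important features of the fitted XGBoost model
xgb.plot_importance(models["XGBoost"], importance_type="gain",
                    max_num_features=10)
plt.tight_layout()
plt.show()
```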

Comparing Various Models

Comparing the EV metric across the various models, a random forest actually comes out as the best pick for this application.

Here’s the ROC curve for the final Random Forest model. The EV metric here was 0.656, a significant improvement over the initial negative value.

Conclusions

It is possible to make a good model for this binary classification problem. Models can be further improved by:

  • Further tuning XGBoost & LightGBM models
  • More feature engineering, such as holidays, commute hours, demographics, # of subscribers vs occasional users, etc.

Thanks for reading!
