Lab Notes: Improving Our Machine Learning Results

Joe Paolicelli
Mission Data Journal
Aug 30, 2016

We improved the accuracy of our machine learning results with Capital Bikeshare by adding weather and event data. Read on as we break down our process.

In the previous machine learning Labs project we used trip data from DC’s Capital Bikeshare system to make a simple prediction of whether a specific bike share station (i.e., rack) would be completely full or empty at a given time. We started the initial project assuming that temporal data alone might not be enough to make the best predictions. However, we couldn’t know for sure until we experimented: first using the temporal data alone, then augmenting it and testing the results.

As our initial project progressed we decided the goal for the next project would be to simply improve accuracy by adding more data points. We thought through the types of situations that might impact the bike share system and landed on weather and major events. Of those two, the weather data was easier to obtain, so we decided to concentrate on that.

We needed two types of weather data: historical data for training, and current or future forecasts for making predictions at a given time. The historical data came from the National Oceanic and Atmospheric Administration (NOAA), and the forecasts are provided by the Weather Underground API. From these sources we used hourly precipitation amount and hourly temperature as the two core data points.

How we did it

The first key step was to reformat the NOAA and Weather Underground data into a common structure. A consistent structure was essential to constructing our machine learning model.
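For illustration, a minimal Python sketch of what that common structure might look like follows below. The NOAA and Weather Underground field names in this sketch are hypothetical placeholders for the sake of the example, not the exact fields we used:

    from datetime import datetime

    def normalize_noaa(record):
        # Map a NOAA hourly observation to the common structure.
        # The field names here are hypothetical placeholders.
        return {
            "timestamp": datetime.strptime(record["DATE"], "%Y-%m-%dT%H:%M:%S"),
            "precip_inches": float(record["HourlyPrecipitation"] or 0.0),
            "temp_f": float(record["HourlyDryBulbTemperature"]),
        }

    def normalize_forecast(record):
        # Map a Weather Underground hourly forecast to the same structure.
        # Again, the field names are hypothetical placeholders.
        return {
            "timestamp": datetime.fromtimestamp(int(record["epoch"])),
            "precip_inches": float(record["hourly_qpf"] or 0.0),
            "temp_f": float(record["temp_f"]),
        }

With both sources flowing into one structure, the training and prediction code never needs to know which service a given hour of weather came from.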

Training the model was accomplished with the same tools as the initial project. To determine which algorithm was best at making the predictions, we tested the same algorithms used for the initial project and looked for variations between them using two methods: a scikit-learn Classification Report and Brier Score.

The scikit-learn Classification Report gave us an understanding of how accurate a set of predictions was. It consists of a number of metrics (precision, recall, etc.) that all treat each prediction as a binary outcome: 1 or 0. Because we were predicting between an empty station and a not empty station, the Classification Report provided a set of metrics summarizing our accuracy over 30,000 predictions.

The Classification Report provides accuracy in percentages that are easier for humans to understand. This is a major reason why we decided to use it. The report also helped us better understand results as we became more familiar with machine learning.

The Brier Score tells you how far your predictions are from being accurate. The Brier Score Loss is a number between 0 and 1, and because it measures how far off your prediction was, you want to have a number as close to zero as possible. The Brier Score is more nuanced than the Classification Report because it takes into account a confidence level. Clearly a prediction with a 50% confidence level is less meaningful to users who must decide if they will take an action than a prediction with a 95% confidence level. Thus the inclusion of a Brier Score is significant because it complements the calculation reflected in the Classification Report.
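Both measures are available in scikit-learn’s metrics module. Here is a minimal sketch using toy values (not our real data) to show how they are computed:

    from sklearn.metrics import classification_report, brier_score_loss

    # Toy values for illustration only: true labels, hard 0/1 predictions,
    # and the model's predicted probability of class 1 ("completely empty").
    y_true = [0, 0, 0, 1, 1, 0]
    y_pred = [0, 0, 1, 1, 0, 0]
    y_prob = [0.05, 0.10, 0.60, 0.90, 0.40, 0.20]

    # Precision, recall, and f1-score per class.
    print(classification_report(y_true, y_pred))

    # Mean squared error between the predicted probabilities and the true labels.
    print("Brier Score loss:", brier_score_loss(y_true, y_prob))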

We created an application that combined the output of the Classification Report and Brier Score together into two reports comparing the algorithms to each other. One report predicted a single station at a time, with and without the weather data, for being either not completely empty or completely empty. The second report was identical other than we predicted whether the station was completely full or not completely full. This application used the same 70/30 split as described in the last post, training on 70% of the data and testing the remaining 30% against the model. We used 100,000 data points total resulting in a training set of 70,000 data points and tests against 30,000 data points.
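As a rough sketch of how one of those comparison runs might look with scikit-learn (X_time, X_time_weather, and y are stand-ins for our actual feature matrices and labels, and random forest is just one of the algorithms we compared):

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, brier_score_loss

    def evaluate(X, y):
        # Train on 70% of the data, test against the remaining 30%.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.30, random_state=42)
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X_train, y_train)
        y_prob = model.predict_proba(X_test)[:, 1]  # probability of "completely empty"
        print("Brier Score loss:", brier_score_loss(y_test, y_prob))
        print(classification_report(y_test, model.predict(X_test)))

    # X_time: time-only features; X_time_weather: time plus hourly precipitation
    # and temperature; y: 1 if the station was completely empty, else 0.
    evaluate(X_time, y)          # baseline report (no weather)
    evaluate(X_time_weather, y)  # report with weather data added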

The abbreviated portion of the report presented below shows the results of the rack completely empty prediction model using the random forest algorithm and only time data as input (no weather):

Brier Score loss: 0.018248834757

Classification Report:
             precision    recall  f1-score   support

          0       0.99      0.99      0.99     29544
          1       0.12      0.09      0.11       456

avg / total       0.97      0.98      0.97     30000

You can see that the Brier Score doesn’t look too bad at 1.8%. In the Classification Report, the first row represents the state of the rack when it was not empty (0); the precision with which we predicted that state was 99%. The second row, which starts with a 1, represents the state of the rack when it was completely empty; here, when we predicted the rack was empty, we were correct only 12% of the time.

Clearly we’re better at predicting not completely empty than completely empty, because a rack being completely empty (i.e., zero bikes) is a rare event. The last row represents the two states in combination. Because the not completely empty state dominates this calculation, the 97% precision doesn’t look all that bad, but we really want the completely empty state at a higher number.

Now let’s consider the same type of abbreviated report for the same rack using the random forest algorithm and including weather data:

Brier Score loss: 0.00388922158247

Classification Report:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     29543
          1       0.85      0.80      0.83       457

avg / total       0.99      0.99      0.99     30000

Now the Brier Score is roughly 0.4% compared to 1.8%, a definite improvement. The two rows represent the same states as above. In the first row we now have 100% precision on not completely empty. In the second row, the completely empty state, we are now at 85% precision compared to 12% before. That is the type of improvement we were hoping to achieve by adding additional data points. Simply by adding weather as input, we increased our precision by 73 percentage points.

We ran the same type of comparison for the prediction model that determines whether a rack is completely full. The results with respect to the addition of weather data were very similar to the completely empty example above.

One last note: we used a 50% or greater probability to indicate a rack is in a given state or not (i.e., if we are testing that a rack is completely empty and the probability is 51% then it is said to be completely empty). This could lead to near misses where the model predicts a 51% chance of a rack being completely empty and it actually isn’t. This is where the Brier Score is important because it takes that potential into consideration. In both cases we were able to substantially decrease our Brier Score. In a real world application it would also be appropriate to expose the probability to the end users so that they could make their own final decision on whether to head to the rack to grab a bike.
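A sketch of that cutoff follows, where model and feature_row are placeholders for a trained classifier and one station/time feature vector:

    # Probability that the rack is completely empty for one station at one time.
    prob_empty = model.predict_proba([feature_row])[0, 1]

    # The 50% cutoff described above.
    is_empty = prob_empty >= 0.5

    # Exposing the raw probability lets riders judge the risk themselves.
    print("Completely empty: %s (probability %.0f%%)" % (is_empty, prob_empty * 100))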

Overall, the addition of weather data was enough to push the models to a point where adding any other data point is likely unnecessary. That said, we believe we could further increase prediction accuracy by incorporating data such as weather events that run for a longer period of time (e.g., long periods of rain). We could also refine the weather data to include the type of precipitation.

Challenges

One of our biggest challenges was putting the historical weather data from NOAA into a format similar to the Weather Underground API data. We were diligent in this work and took considerable time to format and scrub the NOAA data. After all, the data used by the models must not contain errors, or the predictions won’t be accurate.

Another major challenge was testing and comparing the six different algorithms we evaluated. We took time to define and build the testing system so that we could determine the best algorithm to use.

Next Steps

Our future plan is to take what we have learned in building the prediction model and turn that into an API that can be used by others who are interested in predicting bike availability. Once the API is created, we will likely leverage it ourselves to create both a web interface and potentially a Slack or even Alexa interface.

SXSW 2017 PanelPicker voting is live, and we would love for you to support our submission, “Innovation Labs: Our Journey Into Emerging Tech”. To vote, please click here — you can quickly create an account using your Facebook profile or by clicking ‘Create Account’ and entering your name, email address, and password. Voting ends September 2nd. We appreciate your time and support!
