The MOL Bubi Public Bike-sharing System Analytics Challenge

Balabit Unsupervised
Jan 28, 2016

One of the best things about working at Balabit is that I can come to work by bike. We have a separate bicycle storage room with a shower, so when the weather is warm enough for biking (oops, this was a spoiler) I start the day really fresh and healthy. Perhaps it is understandable why I became so excited when I heard about the MOL Bubi public bike-sharing system Analytics Challenge, where participants got five months of bike-sharing data to analyze.

The competition was organized by the “Big Data — Momentum” research group of the Hungarian Academy of Sciences (MTA SZTAKI) and the Centre for Budapest Transport (BKK), and there were three predictive modeling and analytics tasks, which I copy here from the official website of the competition.

Task 1: Busiest route prediction

Predict the busiest routes (pairs of source and destination docking stations) for the given evaluation days. You should submit the top 100 routes with the highest predicted frequencies as a ranked order of (source, destination) pairs. The top list will be evaluated by NDCG.
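NDCG rewards putting the truly busy routes near the top of the submitted list. The post does not spell out the exact gain and discount variant the organizers used, so the sketch below, which assumes the observed trip count as relevance and the usual logarithmic discount, is only illustrative:

```python
import numpy as np

def ndcg_at_k(ranked_routes, true_counts, k=100):
    """NDCG of a submitted route ranking against the observed trip counts.

    ranked_routes: list of (source, destination) pairs, best first
    true_counts:   dict mapping (source, destination) -> trips on the test day
    """
    # DCG of the submission: the gain of a route is its observed trip count
    gains = np.array([true_counts.get(r, 0) for r in ranked_routes[:k]], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float((gains * discounts).sum())

    # Ideal DCG: the same trip counts sorted in decreasing order
    ideal = np.array(sorted(true_counts.values(), reverse=True)[:k], dtype=float)
    idcg = float((ideal / np.log2(np.arange(2, len(ideal) + 2))).sum())

    return dcg / idcg if idcg > 0 else 0.0
```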

Task 2: Docking Station Demand prediction

We calculate the demand of a docking station at a given time by subtracting the number of bikes docked to the station from the number of bikes taken away. For each day, we reset the demand to 0 at midnight. For each day in the evaluation set, submit a predicted value of the daily maximum demand for each docking station. Note that the maximum demand is always non-negative. Predictions will be evaluated by root-mean-squared error.
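In other words, the demand is a running balance that starts from zero every midnight, goes up by one for every departure and down by one for every arrival. A minimal pandas sketch, assuming a hypothetical event log with one signed row per departure or arrival (column names are made up), could compute the submission target like this:

```python
import pandas as pd

def daily_max_demand(events):
    """events: DataFrame with columns 'station', 'time' (datetime) and
    'change' (+1 when a bike is taken away, -1 when a bike is docked)."""
    events = events.sort_values('time').copy()
    events['date'] = events['time'].dt.date
    # Demand is the running balance of the changes, reset to zero at midnight
    events['demand'] = events.groupby(['station', 'date'])['change'].cumsum()
    # The maximum never drops below zero because each day starts at a demand of 0
    daily_max = (events.groupby(['station', 'date'])['demand']
                       .max()
                       .clip(lower=0))
    return daily_max.reset_index(name='max_demand')
```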

Task 3: Open research task

In this task, you may participate by submitting a very brief research plan. Goals may include, but are not limited to, visualization or the use of additional external data. You are expected to keep in regular contact with the organizers to share and discuss your findings.

I have a couple of friends among current and former colleagues, and we sometimes work together on data mining competitions. We gathered a team for the Bubi challenge and started to work. We met once every week in our favorite pub, collected ideas, and did analysis right there as well as at home. Next time we will also analyze the relationship between the amount of beer consumed and our position on the leaderboard.

We worked on the first and second tasks, but due to a misunderstanding we defined the target variable of the second one incorrectly. This way we let others win the second task 😉, yet we still finished 10th on the final leaderboard even with a not exactly proper target variable. Nevertheless, we won the first task, and I roughly present our solution in this blog post.

Python, R, Matlab and IBM SPSS Modeler (thanks to Clementine Consulting) were used for the analysis; we always picked whichever seemed most convenient. Our first approach was to build a simple model as a baseline solution. We calculated the average traffic on the training data for every route in April and May. The top 100 routes with the highest average traffic were selected and submitted as the forecast for every test day. With this solution we jumped to third place on the leaderboard on 21 October. The two better solutions at that time had almost the same NDCG as ours and used similar methods. (By the way, the simple average alone would have been good enough for 8th place on the final leaderboard.) This initial result was really inspiring, so we started to work harder and build predictive models.
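A rough pandas sketch of this baseline, assuming a hypothetical trip log with one row per ride (the file and column names below are made up), could look like this:

```python
import pandas as pd

# Hypothetical trip log: one row per ride, with start/end stations and start time
trips = pd.read_csv('trips.csv', parse_dates=['start_time'])

# Restrict to April-May and count the trips of every (source, destination) route;
# dividing by the number of days would give the daily average, but the ranking
# (which is all that matters for the submission) stays the same
spring = trips[trips['start_time'].dt.month.isin([4, 5])]
route_counts = (spring.groupby(['start_station', 'end_station'])
                      .size()
                      .sort_values(ascending=False))

# The same 100 routes were submitted as the forecast for every test day
top100 = route_counts.head(100).index.tolist()
```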

The following chart shows the top 20 most frequently used routes in April — May 2015.

First of all, we created some derived variables from the given weather data and from the date. We used the date to identify the day of the week and whether it was a weekday or a holiday. From the weather data we created aggregated features for every day, such as the maximum, minimum and average temperature, wind speed, the number of rainy half hours, the change in air pressure, and so on. We also created aggregated features for the previous day and for intraday periods (early morning: 0–6, morning: 6–12, afternoon: 12–18, evening: 18–24).
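A condensed sketch of this feature engineering, assuming a hypothetical half-hourly weather table (the file and column names are made up, and a proper holiday flag would come from an external calendar):

```python
import pandas as pd

# Hypothetical half-hourly weather table
weather = pd.read_csv('weather.csv', parse_dates=['time'])
weather['date'] = weather['time'].dt.date

# Daily aggregates: temperature extremes, wind, rainy half hours, pressure change
daily = weather.groupby('date').agg(
    temp_max=('temp', 'max'),
    temp_min=('temp', 'min'),
    temp_mean=('temp', 'mean'),
    wind_mean=('wind', 'mean'),
    rainy_half_hours=('rain_mm', lambda x: int((x > 0).sum())),
    pressure_change=('pressure', lambda x: x.iloc[-1] - x.iloc[0]),
)

# Mean temperature in the four six-hour periods mentioned above
period = pd.cut(weather['time'].dt.hour, bins=[0, 6, 12, 18, 24], right=False,
                labels=['early_morning', 'morning', 'afternoon', 'evening'])
intraday = (weather.groupby(['date', period])['temp']
                   .mean()
                   .unstack()
                   .add_prefix('temp_mean_'))

# Calendar features derived from the date
features = daily.join(intraday)
features['day_of_week'] = pd.to_datetime(features.index).dayofweek
features['is_weekend'] = features['day_of_week'] >= 5
```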

Of course, there were some dead-end attempts. For example, we built time series models (we used IBM SPSS Modeler’s transfer function models, which form a very large class of models that includes univariate ARIMA as a special case) for every route, predicting each test day one by one. This means that we selected one specific route and predicted the first test day’s (02 April) traffic with a time series model, using the traffic on the training days up to that first test day, a few selected weather features and a flag indicating whether the date was a working day or a holiday. We then filled in the traffic data for this first test day with the predicted value and used it to predict the second test day’s (04 April) traffic, and so on. We repeated this method for every possible route. Unfortunately this method was slow (the script ran for a couple of days) and the result was worse than the April-May average solution described above.
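We cannot reproduce the SPSS transfer function models here, but as a rough stand-in, a rolling one-step-ahead ARIMA with exogenous weather and calendar regressors (statsmodels) would follow the same feed-the-prediction-back-in scheme:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def rolling_forecast(traffic, exog, test_days, order=(1, 0, 1)):
    """One-step-ahead forecasts for a single route, feeding each prediction
    back in as if it had been observed.

    traffic:   Series of daily trip counts on the training days (date index)
    exog:      DataFrame of weather / calendar features for all days
    test_days: evaluation dates in chronological order
    """
    history = traffic.sort_index().copy()
    predictions = {}
    for day in test_days:
        model = ARIMA(history.values, exog=exog.loc[history.index].values,
                      order=order)
        fitted = model.fit()
        pred = float(fitted.forecast(steps=1, exog=exog.loc[[day]].values)[0])
        predictions[day] = max(pred, 0.0)   # traffic cannot be negative
        history.loc[day] = pred             # pretend the prediction was observed
        history = history.sort_index()
    return pd.Series(predictions)
```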
We also tried to predict the order of the routes instead of the traffic, because NDCG does not take the predicted values into account, only the rank. But the performance was lower than expected.

The Bubilyze! team. (L-R) Márton Bíró (I-insight), Csilla Balogi (Clementine Consulting), Milán Badics (Corvinus University), Eszter Windhager-Pokol (Balabit), Árpád Fülöp (Balabit)

The winning solution was an ensemble model. The target in the models was the percentage deviation from the April-May average traffic, and the inputs were all the features derived from the date and the weather data. Only the data from mid-March onward was used, as the conditions (weather, social events, outdoor sporting events, etc.) in April-May are very different from those in January-February.
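Assuming that “percentage deviation” means the relative difference from the route’s April-May average (the frame and column names below are hypothetical), the target could be built roughly like this:

```python
import pandas as pd

def add_deviation_target(daily_traffic, route_avg):
    """daily_traffic: one row per (route, day) with the observed 'trips';
    route_avg: per-route April-May average traffic in a column 'avg_trips'."""
    data = daily_traffic.merge(route_avg, on=['start_station', 'end_station'])
    # Target: how much the day's traffic deviates from the route's average, in percent
    data['target_pct'] = (data['trips'] - data['avg_trips']) / data['avg_trips'] * 100.0
    return data
```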

The first model was a CRT decision tree, and the result was slightly better than the simple average solution. We tried to build a random forest in R, but it did not succeed because of the lack of memory (8 GB); however, we managed to build a random forest in IBM SPSS Analytic Server. Finally, we created an ensemble model from the CRT decision tree and the random forest results, and jumped to first place on the leaderboard. The most important features were the daily maximum temperature (remember the spoiler?), the average air pressure between 18–24 hours, the average humidity between 12–18 hours and the number of rainy half hours between 12–18 hours.
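As a loose scikit-learn analogue (SPSS’s CRT is a CART-style tree; the averaging of the two models, the hyperparameters and all variable names below are assumptions), the ensemble step might look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

def ensemble_predict(X_train, y_train, X_test, avg_trips_test):
    """Fit a single CART-style tree and a random forest on the deviation target,
    average their predictions and convert the result back into trip counts."""
    tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_train, y_train)
    forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)

    # A simple ensemble: the mean of the two models' predicted deviations
    pred_pct = (tree.predict(X_test) + forest.predict(X_test)) / 2.0

    # Predicted traffic = April-May average scaled by the predicted deviation
    return np.asarray(avg_trips_test) * (1.0 + pred_pct / 100.0)
```

The predicted trip counts can then be ranked per test day and the top 100 routes submitted, exactly as in the baseline.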

Bubi was launched in September 2014. As time goes on, more and more data comes in, so better and more sophisticated predictive models can be built. After a few years of operation of the bike-sharing system, the seasonal effects will also become quantifiable and the other characteristics of the usage will become clearer. The more data there is, the more robust the models that can be built. Hopefully the Centre for Budapest Transport will use the best solutions from this competition to further improve the bike-sharing system and its other services.

We enjoyed this Analytics Challenge, especially because the data was about the Budapest bike-sharing system and about places where we go every day. We look forward to the next data analysis competition.

Until then, we ride on.

Originally published at www.balabit.com on January 28, 2016 by Eszter Windhager-Pokol.
