Vélib in Paris — Part II — Predicting availability with Python and Flask

This is the second article in a series of blog posts about a project I started several weeks ago. Long story short: I want to predict the availability of the Vélib’ stations in Paris and make a cool website. You can read the first part here, and find the “menu” below.

🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲

  • Part I: Data retrieval and storage in AWS
  • Part II: Web App with Flask that uses a simple Python model to predict the availability of the stations
  • Part III: Improving the model with additional features and a better algorithm
  • Part IV: Setup of a chatbot in Slack or Messenger? (not sure about that part yet)

🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲 🚲

This second part deals with the actual machine learning model that predicts the number of spots available at a given Vélib’ station, one or two hours later. We will build a very simple model that works, and improve it incrementally in the next part!

Then we will build a web app with Flask to use the predictive model, and we will deploy it on Heroku.

Screenshot of the simple Vélib prediction App

1. Creating a clean dataframe with the previous update

In part one, we created a history of the station updates and stored it in a PSQL database. This is the raw data, with datetime the moment the line was written to the database, and response_api the dictionary that the Vélib’ API returns.

update_stations table — screenshot of DataGrip

Then, we wrote a script to convert this table into a cleaner one containing the different variables we need. For each station, we want the number, the address, the latitude and longitude, the number of available bikes at the station, as well as the moment of this update.

Moreover, we want to know the number of bikes at the station roughly one hour before. Indeed, the value we want to predict is the number of available bikes an hour later, given the number of bikes at the station and other variables.

Let’s create a function find_previous_update that adds to each row the number of available bikes one hour before, and apply this function to the whole dataframe.

import numpy as np
import pandas as pd
from datetime import timedelta

class AddPreviousVariables:
    def __init__(self, stations_df):
        self.stations_df = stations_df

    def find_previous_update(self, df_row):
        stations_df = self.stations_df
        number_station = df_row.number
        date_time = df_row.last_update
        # Look for an update of the same station one hour earlier (± 10 min)
        previous_date_time = date_time - timedelta(hours=1)
        dt_high = previous_date_time + timedelta(minutes=10)
        dt_low = previous_date_time - timedelta(minutes=10)
        previous_update_array = \
            stations_df[(stations_df.number == number_station)
                        & (stations_df.last_update < dt_high)
                        & (stations_df.last_update > dt_low)]
        if len(previous_update_array) == 0:
            # No update found around one hour before
            last_update_previous = np.nan
            available_bikes_previous = np.nan
        else:
            previous_row = previous_update_array.iloc[-1]
            last_update_previous = previous_row.last_update
            available_bikes_previous = previous_row.available_bikes

        previous_update = pd.Series(
            {'last_update_previous': last_update_previous,
             'available_bikes_previous': available_bikes_previous})
        return previous_update

Note that this approach is very inefficient and time-consuming, but it is the best solution I found in the short term. For instance, it took 40 minutes to process fewer than 140,000 rows. That's a bit annoying considering we'll eventually have to process over 15 million. I will try to improve this part in the next post or find an alternative, such as PSQL window functions.
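On the pandas side, one vectorised alternative I might try is pd.merge_asof, which performs the same nearest-timestamp lookup per station in a single pass instead of row by row. A rough sketch, using the same column names as above (untested against the full dataset):

```python
import pandas as pd
from datetime import timedelta

def add_previous_updates(stations_df):
    """For each update, find the update of the same station closest
    to one hour earlier (within ± 10 minutes), in one vectorised pass."""
    df = stations_df.copy()
    # The rows we look up, renamed to the "previous" columns
    lookup = df[['number', 'last_update', 'available_bikes']].rename(
        columns={'last_update': 'last_update_previous',
                 'available_bikes': 'available_bikes_previous'})
    df['target_time'] = df['last_update'] - timedelta(hours=1)
    merged = pd.merge_asof(
        df.sort_values('target_time'),
        lookup.sort_values('last_update_previous'),
        left_on='target_time', right_on='last_update_previous',
        by='number', direction='nearest',
        tolerance=timedelta(minutes=10))
    return merged.drop(columns='target_time')
```

Rows with no update around one hour before simply get NaN in the previous columns, just like the row-by-row version.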

Running time in seconds

We can now retrieve the information directly as a dataframe, a kind of table with columns of potentially different types (text, number, date). To limit the amount of data to process, I added a filter on the districts.

postal_code_list = ['75001', '75002', '75003', '75004', '75005', '75006', '75010', '75011', '75012']

The list is totally arbitrary: those are the ‘arrondissements’ where I spent most of my time in Paris. Here is the new table, almost ready for modeling!

update_stations_with_previous_variables table

2. Creating the model & testing the predictions

We set the column we are trying to predict (available bikes) and the columns that will be used to feed the model. Then we load the data, using the handy db.py package created by Yhat.

#train_model.py - extract

target_column = 'available_bikes'
columns_model_list = ['number', 'weekday', 'hour', 'minute',
                      'latitude', 'longitude',
                      'available_bikes_previous', 'weekday_previous',
                      'hour_previous', 'minute_previous',
                      'temperature', 'humidity', 'wind', 'precipitation']

You can see that there are additional columns, like ‘weekday_previous’, ‘hour_previous’ and ‘available_bikes_previous’. Indeed, knowing the number of bikes available at the station one hour before is precious information for determining the number of bikes available now.

Moreover, my intuition is that the use rate of the bikes differs quite a bit depending on the weather, which is why we add variables such as the temperature, the wind, or the amount of precipitation. I found this additional data on Weather Underground, which provides an API.

We add these new columns by calling the enrich_stations function, which also filters out rows missing weather data or the previous variables.

#station_enricher.py - extract
def enrich_stations(df):
    stations_df = df.copy()
    stations_df = add_date_variables(stations_df)

    # Load and add weather data
    path_weather_data = 'files/input/paris_temperature.csv'
    weather_data = load_weather_data(path_weather_data)
    stations_df = add_weather_data(stations_df, weather_data)
    # Filter out rows without weather data
    stations_df = FilterWeatherData(stations_df)

    # Add previous variables and filter out rows without them
    stations_df = add_previous_date_variables(stations_df)
    stations_df = FilterPreviousVariables(stations_df)
    stations_df = cast_df(stations_df)
    return stations_df
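The helpers load_weather_data and add_weather_data are not shown in the extract. As a hedged sketch of what the join could look like, assuming the weather CSV has one row per hour with a datetime column, we can truncate both timestamps to the hour and merge:

```python
import pandas as pd

def add_weather_data(stations_df, weather_data):
    """Sketch: join hourly weather rows onto station updates by
    truncating both timestamps to the hour (assumes hourly weather rows)."""
    df = stations_df.copy()
    df['hour_key'] = df['last_update'].dt.floor('H')
    weather = weather_data.copy()
    weather['hour_key'] = weather['datetime'].dt.floor('H')
    merged = df.merge(
        weather[['hour_key', 'temperature', 'humidity',
                 'wind', 'precipitation']],
        on='hour_key', how='left')
    return merged.drop(columns='hour_key')
```

Updates falling in an hour without weather data end up with NaN weather columns, which FilterWeatherData can then drop.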

We divide stations_df_enriched in two: the train set (80% of the data) and the test set. The train set is used to train the model, and we evaluate its performance on the test set by comparing the predicted values to the actual values. This gives us an idea of how well the model generalises to new, unseen data.

#data_loader.py - extract
# Get features and target, divided by train & test
logger.info("Split target and features")
features, target = SplitFeaturesTarget(df_enriched, target_column)
logger.info("Train/test split")
features_train, features_test, target_train, target_test = \
train_test_split(features, target, test_size=0.2, random_state=42)

Now, we are ready to launch the training! For this purpose, I will use a vanilla Random Forest Regressor from the scikit-learn library. Recall that we are only trying to build a predictor that kind of works, so we will not go any further for the moment.

Training a Random Forest is very time-consuming as well: over 77 minutes to train on 107,000 rows! We will try other algorithms and hyperparameters to speed up iterations.

Training the model — Screenshot of the logs
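The training step itself is only a few lines with scikit-learn. Here is a self-contained sketch on synthetic stand-in data (the real script feeds the features_train and target_train produced by the loader above; the hyperparameters are illustrative, not the ones from the logs):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the real features/target
rng = np.random.RandomState(42)
features = rng.rand(200, 4)
target = (10 * features[:, 0]).round()

features_train, features_test, target_train, target_test = \
    train_test_split(features, target, test_size=0.2, random_state=42)

# Vanilla Random Forest; n_jobs=-1 uses every available CPU core
model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(features_train, target_train)
predictions = model.predict(features_test)
```

Swapping the synthetic arrays for the real dataframe columns is all it takes to reproduce the training run.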

We evaluate the model on different metrics. within_two is the percentage of the time the error (the difference between our prediction and the real number of bikes) is less than two. It's 67%: quite good, but not enough for a reliable predictor.

Both MAE and RMSE express the average prediction error. Since large errors are particularly undesirable, we prefer the RMSE to the MAE, because it gives a relatively high weight to large errors. On average, our predictions are off by about 4 bikes from the true value. The performance is decent, but we didn't train the model on much data yet, so we shouldn't take these numbers too seriously for now.
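For completeness, these three metrics are straightforward to compute from the test-set predictions; the helper below is a sketch of how I compute them (the function name is mine, not from the repo):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate(target_test, predictions):
    # within_two: share of predictions off by at most 2 bikes
    errors = np.abs(np.asarray(target_test) - np.asarray(predictions))
    return {
        'mae': mean_absolute_error(target_test, predictions),
        'rmse': np.sqrt(mean_squared_error(target_test, predictions)),
        'within_two': np.mean(errors <= 2),
    }
```

Calling evaluate(target_test, model.predict(features_test)) returns the three numbers discussed above.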

We also computed the feature importances. As you can guess, the higher the value, the more important the feature. According to this table, ‘minute’ is the most important variable in the forest by a wide margin.

However, it shouldn't be, and I expect this behaviour to disappear once we use more data. If not, we will have to dig deeper ⛏ in the next post! The following variables are closer to the truth. The number of the station, the longitude, as well as the hour of the day should be really important features for predicting the number of bikes at a station.
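Extracting this ranking from a fitted forest is a one-liner on feature_importances_. A self-contained toy example (the data here is synthetic, not the real Vélib’ features):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data where the target depends solely on the 'hour' column
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(100, 3), columns=['hour', 'minute', 'number'])
y = X['hour'] * 10

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# Importances sum to 1; sort to get the ranking shown in the table
importances = pd.Series(forest.feature_importances_, index=X.columns) \
                .sort_values(ascending=False)
```

On this toy data, 'hour' unsurprisingly dominates the ranking, which is exactly the kind of sanity check worth running on the real model.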

ps: the MAPE is broken because the denominator is sometimes really close to zero (1e-6, to be precise). Ignore it!

3. Creating the web app with Flask

Ok, so we have an awesome and very efficient artificial intelligence that can tell the future, and we want to show it to the world!

Enter Flask. It is a “microframework for Python”, meaning it's a very simple tool for building websites with Python.

Indeed, on the future website, you will select a specific station and press Predict to know the number of bikes available at that station in one or two hours. Flask takes care of your request, loads the model, asks it to predict the number of bikes, and returns that number to the website.

from flask import Flask, request, jsonify, render_template
import pandas as pd

app = Flask(__name__)

# Load model
model = load_pickle("files/app_model/model.pkl")

# Load list of stations
list_stations = pd.read_csv('files/input/list_stations.csv', encoding='utf-8')

@app.route('/prediction', methods=['POST'])
def ask_prediction():
    number_station = request.form['number_station']
    time_prediction = request.form['time_prediction']
    prediction = predict_available_bikes(model, number_station, time_prediction)
    return jsonify({'prediction': prediction})

@app.route('/')
def index():
    return render_template('prediction.html',
                           list_stations=list_stations.values.tolist(),
                           number_station=4006)

I won’t go into too much detail here, because it's not the goal of this post, but when you load the page for the first time, you render the ‘prediction.html’ template with the list of stations. Then, when you press Predict, you trigger the ‘ask_prediction’ function with a given ‘number_station’ and ‘time_prediction’, and it returns the number of available bikes.

Behind the curtains, Javascript is doing several things:

  • Loading the Google Map and displaying every station
  • Triggering the POST request when you click the ‘Predict’ button and displaying the result
  • Handling user errors (even though I kept user input to a minimum to avoid problems)

Moreover, each time you request a prediction, we call the Vélib’ API to get the number of bikes currently available at the station. We also call the Weather Underground API to determine the current weather.

Test of predict_available_bikes function
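The actual implementation of predict_available_bikes isn't shown here, but the idea is simple: build a single feature row for the requested horizon and feed it to the model. A hypothetical sketch, where the current bike count and weather are passed in from the API calls (the real function also looks up the station's latitude/longitude, omitted here):

```python
import pandas as pd
from datetime import datetime, timedelta

def predict_available_bikes(model, number_station, time_prediction,
                            current_bikes, weather):
    """Hypothetical sketch. `current_bikes` comes from the Vélib' API,
    `weather` is a dict with temperature, humidity, wind and
    precipitation from the weather API."""
    now = datetime.now()
    target_time = now + timedelta(hours=int(time_prediction))
    row = {'number': int(number_station),
           'weekday': target_time.weekday(),
           'hour': target_time.hour,
           'minute': target_time.minute,
           'available_bikes_previous': current_bikes,
           'weekday_previous': now.weekday(),
           'hour_previous': now.hour,
           'minute_previous': now.minute,
           **weather}
    features = pd.DataFrame([row])
    # Round to a whole number of bikes before returning it to the page
    return int(round(model.predict(features)[0]))
```

Any object with a scikit-learn-style predict method can be plugged in, which also makes the function easy to test without the real model.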

Finally, there is a little HTML and CSS to make something visually decent, but the most important third-party tool I use is Bootstrap. It is “the most popular HTML, CSS, and JS framework for developing responsive, mobile first projects on the web”. It lets us organise the different elements really easily without the pain of raw CSS. I highly recommend it if you want to build a website, especially for its awesome grid system.

4. Deploying the app with Heroku

This last part always takes more time than expected, because new issues always arise. Instead of trying to explain it all myself, I will point you to the great tutorial below, which fits our situation perfectly: how to get a Flask app on Heroku, a cloud platform as a service.

Indeed, we have to push every piece of code and every file we have locally to the cloud ⛅️ in order to create a website accessible to everyone. Note that on Heroku the app must listen on the port given by the $PORT environment variable; Heroku's router then exposes it over standard HTTP. Finally, when it's deployment time, you also need to push your config files to Heroku. The easiest solution, in my opinion, is to create another folder on your computer with every file your app needs, and push everything to Git and Heroku.
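For reference, the key config file is the Procfile, which tells Heroku how to start the web process. This assumes the conventional setup: the Flask object is named app inside app.py, and gunicorn is listed in requirements.txt.

```shell
# Procfile (at the root of the repo)
web: gunicorn app:app --bind 0.0.0.0:$PORT
```

Heroku injects the port to listen on through the $PORT environment variable, so nothing is hardcoded.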


That’s it! You can click the link below to play with the app. As I will continue to improve the model and the layout of the website, it may not look like the picture at the beginning of the article by the time you click!

ps: if you don’t see the red stations on the app, keep refreshing the page, or try another browser…

You can find the code of the project on GitHub, if you’re keen to contribute or simply want to have a look. If you liked this article/tutorial, please don’t be shy, and press the little heart button 🙂

The next post will be about taking this little application to the next level by improving the model (by several orders of magnitude?!) as well as the design and the UX. Thanks for reading, folks!