Vélib in Paris — Part I — Discover AWS and PostgreSQL

Pierre Delarroqua
Published in Python Pandemonium · Feb 3, 2017

A few weeks ago, I was looking for a personal project in order to improve my expertise along the “value chain” of computer science. I wanted to code a project encompassing different skills and programming languages, from end to end.

Since I use the Vélib’ every day in Paris, I thought I could do something fun and useful with it. My idea was to forecast the number of available spots at each station, so that I could be sure of finding one when going home after my workday. I found out there is already an app for that, and it does a pretty good job.

My Vélib in front of Notre-Dame de Paris

But hey, it’s not the destination, it’s the journey! I can still learn a lot in the process. Moreover, they only offer predictions up to one hour ahead, and I want to be more ambitious.

I will divide this post into a series of articles, with the following plan (subject to change along the road):

Part I — Store data with EC2 & RDS in AWS

The first step is to gather enough data to train the model. Amazon offers a free tier account for one year, and it is a great way to learn about virtual machines and databases, for free. Moreover, there are a lot of tutorials to help us set up everything we need. Just create an account, or use your existing Amazon account, and you’re good to go.

AWS Free Tier Details

Note that most of the services under the free tier are indeed free, but you still have to pay a few cents when you transfer a fairly large amount of data. For instance, I paid €0.15 in data transfer fees to store over 10 million rows, which is not really a deal breaker.

My Spend Summary in AWS on 27/12/2016
  • Set up the database

With Amazon RDS (Relational Database Service) under the free tier, we can run an instance for up to 750 hours per month. Since a month has at most 744 hours, this means we can run no more than one database instance at a time. Or no more than two databases at the same time, for 15 days. Or no more than… well, you get the idea :)

Moreover, we can store up to 20 GB of data, which is plenty for a personal project. To give you an idea, a three-month history of the stations, updated every 10 minutes, takes about 10 GB of space.

In order to add new rows to the database, we need to communicate with it. There are different database management systems which interact with the user, other applications, and the database itself to capture and analyze data.

For my project, I chose PostgreSQL, and you can create a database in Amazon by following this tutorial. This is where we will store updates of every Vélib bike station in Paris (there are over 1,220 of them).
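The article does not show the table definition itself, but the scraping code below inserts into a table api.update_stations holding a timestamp and the raw API response. A minimal sketch of that schema (my assumption, with a JSONB column for the raw JSON) could look like this:

-- hypothetical schema matching the INSERT used later in scrape_stations.py
CREATE SCHEMA IF NOT EXISTS api;
CREATE TABLE IF NOT EXISTS api.update_stations (
    id           SERIAL PRIMARY KEY,   -- used at the end to fetch the latest records
    datetime     TIMESTAMP NOT NULL,   -- time of the scrape
    response_api JSONB NOT NULL        -- raw JSON returned by the Vélib API
);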

  • Retrieve station updates from the API and store them in the database
# api_utils.py
import json
import requests

def load_json(path):
    # read a JSON configuration file and return it as a dict
    with open(path) as f:
        return json.load(f)

config_velib = load_json("config/config_velib.json")
url = "https://api.jcdecaux.com/vls/v1/stations"

def get_stations_list():
    # query the JCDecaux API for the current state of every station
    response = requests.get(url, params=config_velib["params_api_velib"])
    return json.loads(response.content.decode("utf8"))
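The script above loads its API parameters from config/config_velib.json, a file not shown in the article. Assuming the standard JCDecaux query parameters (a contract name and an API key), it could look like this:

{
    "params_api_velib": {
        "contract": "Paris",
        "apiKey": "YOUR_JCDECAUX_API_KEY"
    }
}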

We create a function get_stations_list that returns every Vélib station with the following information:

Source: Vélib API
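For illustration, a single station record returned by the API looks roughly like this (field names from the JCDecaux documentation, values made up); note that last_update is a timestamp in milliseconds:

{
    "number": 16107,
    "name": "16107 - EXAMPLE STATION",
    "position": {"lat": 48.869, "lng": 2.239},
    "status": "OPEN",
    "bike_stands": 30,
    "available_bike_stands": 12,
    "available_bikes": 18,
    "last_update": 1485950160000
}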

After that, we create the connection to our PostgreSQL database, get the latest station updates, and insert them into the database. You should put the credentials of the database in a separate JSON file.
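For reference, a config/config_db.json matching the keys used in the script below could look like this (hypothetical values):

{
    "db": "velib",
    "user": "my_user",
    "host": "my-instance.xxxxxxxx.eu-west-1.rds.amazonaws.com",
    "password": "my_password"
}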

# scrape_stations.py
import json
import psycopg2
from datetime import datetime
from api_utils import get_stations_list, load_json

config_db = load_json("config/config_db.json")

# open the connection to the PostgreSQL database hosted on RDS
conn = psycopg2.connect(database=config_db['db'], user=config_db['user'],
                        host=config_db['host'], password=config_db['password'])
cur = conn.cursor()

def convert_timestamp(timestamp_to_convert):
    # the API returns timestamps in milliseconds since the epoch
    s = timestamp_to_convert / 1000.0
    return datetime.fromtimestamp(s).strftime('%Y-%m-%d %H:%M:%S')

def scrape_stations():
    stations_list = get_stations_list()
    for result in stations_list:
        result["last_update_clean"] = convert_timestamp(result["last_update"])
        query = "INSERT INTO api.update_stations (datetime, response_api) VALUES (%s, %s)"
        cur.execute(query, (datetime.now(), json.dumps(result)))
    conn.commit()

Finally, we write a script to launch the scraping and deal with possible errors.

# scrape_main.py
import time
import sys
import traceback
from scrape_stations import scrape_stations

if __name__ == "__main__":
    print("{}: Starting scrape cycle".format(time.ctime()))
    try:
        scrape_stations()
    except KeyboardInterrupt:
        print("Exiting....")
        sys.exit(1)
    except Exception:
        print("Error with the scraping:", sys.exc_info()[0])
        traceback.print_exc()
    else:
        print("{}: Successfully finished scraping".format(time.ctime()))
  • Create an S3 bucket and store the Python scripts

Now, we are going to create a bucket in S3 (Simple Storage Service), an Amazon web service used to store and retrieve any file. Under the free tier, we are limited to 5 GB of storage, but that is more than enough for our project.

Create a bucket by following this tutorial, and then transfer the Python scripts to the bucket through the terminal, following this tutorial.
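For example, with the AWS CLI, uploading the three Python scripts could look like this (assuming a bucket named name-bucket, as in the commands below):

aws s3 cp api_utils.py s3://name-bucket/
aws s3 cp scrape_stations.py s3://name-bucket/
aws s3 cp scrape_main.py s3://name-bucket/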

If you have installed the AWS Command Line Interface (CLI) correctly, you can check the content of your bucket with this command, directly in the shell.

aws s3 ls s3://name-bucket
  • Set up a Virtual Machine with EC2

We want to get an update of the stations every 10 minutes, non-stop, in order to build a full history. At the beginning, I was launching the script on my computer while I was at work. However, whenever I was away from my desk, sleeping, or doing something else, the script was not running.

Thus, I needed to set up a Virtual Machine (VM), which is an emulation of a computer, and run the script continuously on it. Again, we can set up a VM in Amazon for free, for up to 750 hours per month.

Amazon has a pretty clear tutorial on how to set this up, so I suggest you follow those steps.

  • Transfer the Python scripts from S3 to EC2

Previously, we created an S3 bucket with our Python scripts. This makes the transfer of our code from S3 to EC2 — from our bucket to the virtual machine — super easy.

If you created your EC2 instance correctly, and if you stored your KeyPair in the .ssh folder in your home directory, you can use the following command to connect to your VM.

ssh -i ~/.ssh/MyKeyPair.pem ec2-user@{IP_Address}

Then you can copy the Python scripts directly into the virtual machine.

aws s3 cp s3://name-bucket/scrape_main.py ./
aws s3 cp s3://name-bucket/scrape_stations.py ./
aws s3 cp s3://name-bucket/api_utils.py ./
ls
# check that every file is in the VM

Finally, run the main script in the VM to check that everything works.

python scrape_main.py
  • Automate the python script with Cron

The scraping works; let’s automate it so we don’t have to think about it any longer. Our EC2 VM will do all the work for us!

There are many different ways to do this, but a very simple yet very efficient one is to use Cron. It allows us to set up a job that will be executed automatically at a specified time.

In the VM, open the crontab file with the following command.

crontab -e

It will open an empty crontab for us, where we will set the parameters of our job. The comments below show how a crontab entry must be filled in. In our case, we want to run scrape_main.py every 10 minutes, so the first parameter is */10. The end of the line, “> velib_log 2>&1”, redirects the output of our program, together with any errors, to the velib_log file.

# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * command to be executed
*/10 * * * * python scrape_main.py > velib_log 2>&1
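One caveat: cron runs the job from the user’s home directory with a minimal environment. Since we copied the scripts there, the line above works as-is, but being explicit about the working directory (assuming the default ec2-user home) does not hurt:

*/10 * * * * cd /home/ec2-user && python scrape_main.py > velib_log 2>&1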
  • Check that the database is filling up properly

We are almost done, but we need to make sure that our script is working, and that we have a new update every 10 minutes.

First, install Homebrew if you don’t already have it on your computer. Homebrew is a package manager that makes it easy to install other tools, like PostgreSQL (psql). If you are using a Mac, you can install psql directly from the terminal.

brew install postgres

Then, still in the terminal, connect directly to your database with your credentials.

psql -h {host} -U {user} -d {database}
Password for user {user}: {password}
{database}=>

Finally, select the last 50 records (without forgetting the “;” to run the SQL query).

SELECT * FROM api.update_stations ORDER BY ID DESC LIMIT 50;
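To go one step further and verify the 10-minute frequency, a query like the following (counting rows per hour on the datetime column we insert) should show roughly 6 × 1,220 rows for every full hour:

SELECT date_trunc('hour', datetime) AS hour, COUNT(*) AS nb_rows
FROM api.update_stations
GROUP BY 1
ORDER BY 1 DESC
LIMIT 12;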

If everything worked, we now have a database that is filled continuously. We will use it to build our model to predict the number of free spots. That tutorial will be available in my next post — Part II!

Thank you very much for reading, and do not hesitate to comment with anything that could improve this article.
