Seq2Seq in Keras for Petrol Price Prediction using Italian Open Data

Claudio Stamile
Published in isiway-tech
Nov 22, 2018

As you may know, the price per liter of petrol in Italy is really high compared to the rest of Europe. In the following image you can see that Italy is in the Top 4.

Image from https://autotraveler.ru/en/spravka/fuel-price-in-europe.html#.W-xADuhKhPY 14/11/2018

Unfortunately, Italy is not in the Top 4 of European countries by average wage adjusted for purchasing power parity (PPP). So it is worth finding a way to help people save some of the money they burn in their cars every day.

In this article we show how to use Machine Learning algorithms, in particular Seq2Seq Neural Networks, to predict petrol price at self service stations in Italy.

This article is divided into two main parts. In the first, we produce some maps using Qlik Sense. In the second, we show how to use a Seq2Seq neural network to predict the petrol price.

Italian Open Data

In recent years, the Italian public administration has put a lot of effort into the Open Data initiative. A nice overview of the project is available on the official website, while a collection of links to the most interesting datasets is available here.

To write this post, we used the dataset containing the daily petrol prices for all Italian gas stations, created by the Ministry of Economic Development and available here. The dataset covers all the quarters (trimestri) from 2015 to 2018. For each quarter it is possible to download the corresponding compressed directory, which contains one .csv file per day of the quarter. Each .csv file contains the fuel prices for all the gas stations available in that period. The same page also provides the registry of the gas stations, with a description of each one. From this file we can extract two important pieces of information: 1) the municipality identifier and GPS coordinates and 2) the gasoline marketer brand of each gas station.

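To merge all the price data, we created a single large file using the following bash command, which skips the first line (the header) of each daily file and appends to every row the date taken from the file name.

awk 'FNR > 1 {print $0 ";" substr(FILENAME,15,8)}' prezzo*.csv >> all.csv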
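For readers who prefer Python, the same merge can be sketched with the standard library only. This is an illustration, not the code used for the article: the function name `merge_daily_files` is hypothetical, and the file naming `prezzo_alle_8-YYYYMMDD.csv` is an assumption inferred from the `substr(FILENAME,15,8)` call in the awk command.

```python
import csv
from pathlib import Path


def merge_daily_files(paths, out_path):
    """Write every data row of each daily file to out_path, adding the date."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter=";")
        for path in sorted(Path(p) for p in paths):
            # Assumed naming: prezzo_alle_8-YYYYMMDD.csv, so characters 15-22
            # (1-based, as in the awk substr call) hold the date.
            day = path.name[14:22]
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src, delimiter=";")
                next(reader, None)  # skip this file's header line
                for row in reader:
                    writer.writerow(row + [day])
```

Unlike the `>>` redirection, this sketch overwrites the output file, so running it twice does not duplicate rows.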

Now that we have all the data we need, we can start to perform our analysis.

Visualization using Qlik Sense Desktop

To obtain a nice geographical display of our data we used Qlik Sense Desktop. For each gas station, the dataset provides geographical information about its municipality. This information can be used to create a map that helps us understand the gas price in specific municipalities in Italy.

Unfortunately, Qlik Sense does not provide, at least for Italy, geographical areas at the “municipality” level, but only at the “region” level. To solve this problem we need to import KML data containing the municipality borders. Luckily for us, Rocco Giove created a set of KML files with the municipality borders, divided by region.
In this part of the article we focus our analysis on the Lazio region only, so, to draw our maps, we imported the KML file for that region alone.

Municipalities in Lazio

Since we want a global overview of the latest prices, we plot only the most recent prices of the gas stations in all the municipalities of the Lazio region. We imported the prices and the descriptions only for the last available day in the dataset: 2018-06-30.

In this analysis we are not looking at the price for the single gas station, we just aggregate at the municipality level. In another post we will perform a specific analysis at the gas station level of granularity.

All three datasets (municipality borders, price per gas station and gas station description) need to be joined. We used the following join keys:

KML File (Lazio.Name) — Anagrafica Impianti (Comune)
Anagrafica Impianti (idImpianto) — Prezzo Alle (idImpiato)
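Outside Qlik, the second join can be sketched with pandas. The rows below are made up for illustration, the variable names `registry`, `prices` and `stations` are hypothetical, and the sketch assumes the station identifier is called `idImpianto` in both tables.

```python
import pandas as pd

# Tiny made-up samples standing in for the two ministry files.
registry = pd.DataFrame({
    "idImpianto": [1, 2, 3],
    "Bandiera": ["Brand A", "Pompe Bianche", "Brand B"],
    "Comune": ["ROMA", "ROMA", "LATINA"],
})
prices = pd.DataFrame({
    "idImpianto": [1, 2, 3, 3],
    "descCarburante": ["Benzina", "Benzina", "Benzina", "Gasolio"],
    "prezzo": [1.65, 1.58, 1.62, 1.50],
    "isSelf": [1, 1, 1, 1],
})

# Each price row picks up the brand and municipality of its station.
# validate raises an error if a station appears twice in the registry.
stations = prices.merge(
    registry, on="idImpianto", how="left", validate="many_to_one"
)
```

The `Comune` column of the result is then the key that links each row to a municipality shape in the KML file.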

Link between the different data sources

Here we show some of the maps we produced, with the associated queries, to display different information about the petrol price in each municipality. Let’s start with the average self-service petrol price, using the following query

=Avg(if(isSelf > 0 and descCarburante='Benzina', prezzo))

and as result we get

We can also count the number of distinct gasoline marketers for each municipality

=Count(distinct(if(isSelf > 0 and descCarburante='Benzina', Bandiera)))

As expected, we see that Rome has 20 distinct gasoline marketers.

Number of distinct gasoline marketers

Now we check whether there are differences in price between branded (B) and not branded (NB) gasoline marketers.

=Avg(if(isSelf > 0 and not Bandiera = 'Pompe Bianche' and descCarburante='Benzina', prezzo))
=Avg(if(isSelf > 0 and Bandiera = 'Pompe Bianche' and descCarburante='Benzina', prezzo))
=Avg(if(isSelf > 0 and not Bandiera = 'Pompe Bianche' and descCarburante='Benzina', prezzo)) - Avg(if(isSelf > 0 and Bandiera = 'Pompe Bianche' and descCarburante='Benzina', prezzo))
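The Qlik expressions in this section can be approximated with pandas group-by operations. The sketch below uses made-up rows and hypothetical variable names (`stations`, `summary`). It only mirrors the logic of the queries (self-service petrol, grouped by municipality) and is not how the maps were produced.

```python
import pandas as pd

# Made-up rows: one per (station, fuel) price record, already joined
# with the registry so that Comune and Bandiera are available.
stations = pd.DataFrame({
    "Comune": ["ROMA", "ROMA", "ROMA", "LATINA", "LATINA"],
    "Bandiera": ["Brand A", "Pompe Bianche", "Brand B", "Brand A", "Pompe Bianche"],
    "descCarburante": ["Benzina"] * 5,
    "isSelf": [1, 1, 1, 1, 0],
    "prezzo": [1.70, 1.60, 1.68, 1.66, 1.75],
})

# Same filter as in the queries: self-service petrol only.
self_petrol = stations[
    (stations["isSelf"] > 0) & (stations["descCarburante"] == "Benzina")
]
by_town = self_petrol.groupby("Comune")

summary = pd.DataFrame({
    "avg_price": by_town["prezzo"].mean(),         # Avg(...)
    "n_brands": by_town["Bandiera"].nunique(),     # Count(distinct ...)
})

# Branded (B) versus not branded (NB, 'Pompe Bianche') averages.
is_white = self_petrol["Bandiera"] == "Pompe Bianche"
summary["avg_branded"] = self_petrol[~is_white].groupby("Comune")["prezzo"].mean()
summary["avg_not_branded"] = self_petrol[is_white].groupby("Comune")["prezzo"].mean()
summary["difference"] = summary["avg_branded"] - summary["avg_not_branded"]
```

A municipality with no self-service 'Pompe Bianche' station ends up with a missing value in the last two columns, which is the natural way to show that the comparison is undefined there.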

We get the following map

Average price of Branded (B) and Not Branded (NB) gasoline marketers and their differences

In this section we showed just a few examples of how to use Qlik with geographical data to explore the dataset. Of course we could add more analyses, but this post would become too long.

Seq2Seq Neural Network in Keras

To predict the petrol price for each gas station, we used Seq2Seq neural networks.

But why Seq2Seq neural networks and not classical ARIMA models for time series forecasting?

The subset of data we used for this analysis contains 27042 gas stations. In this case, applying ARIMA models is not trivial. Indeed, to predict the price for each station, we would need to fit a separate ARIMA model for each gas station and optimize its parameters. This can be difficult to do in a real scenario.

Seq2Seq neural networks, instead, can learn to reproduce different types of sequences. With this approach we can use a single Seq2Seq neural network to fit all the time series together.

Before going deep with the neural network we need to load the dataset.

Subset of dataset used for the analysis

We then create a table containing, for each gas station, the petrol price for each day in the dataset. Since some gas stations have NaN values (for instance, because a gas station was not present in that period), we fill those values with the mean price of all the gas stations for that day.
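A minimal pandas sketch of this pivot-and-fill step, on made-up data. The column name `date` and the variable names are assumptions, since the article does not show the loading code.

```python
import pandas as pd

# Made-up long-format data: one row per station and day.
long_prices = pd.DataFrame({
    "idImpianto": [1, 1, 2, 3, 3],
    "date": ["20180629", "20180630", "20180630", "20180629", "20180630"],
    "prezzo": [1.60, 1.62, 1.70, 1.64, 1.66],
})

# One row per gas station, one column per day.
wide = long_prices.pivot_table(
    index="idImpianto", columns="date", values="prezzo", aggfunc="mean"
)

# Replace each missing price with the mean of that day's column,
# i.e. the mean price over all the stations that reported on that day.
wide = wide.fillna(wide.mean(axis=0))
```

Here station 2 has no price on the first day, so it receives the mean of the other two stations for that day.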

Shape of the dataset after pivoting

The dataset is now ready and can be used in a Seq2Seq neural network.

In this article we used the Seq2Seq neural network described by Joseph Eddy and available here. We are not going to give specific implementation details, since everything is well described at that link. Just to give you an intuition, the convolutional filters take different points of the temporal sequence to predict the next point in time.
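To make the intuition concrete, here is a small NumPy illustration of a causal, dilated convolution, the kind of filter used in WaveNet-style networks. That the filters here are causal and dilated is an assumption made for the example, and the function `causal_dilated_conv` is hypothetical and is not part of the referenced implementation.

```python
import numpy as np


def causal_dilated_conv(series, weights, dilation):
    """out[t] = sum_k weights[k] * series[t - k * dilation], zero-padded on the left."""
    series = np.asarray(series, dtype=float)
    out = np.zeros_like(series)
    for k, w in enumerate(weights):
        shift = k * dilation
        if shift == 0:
            out += w * series
        elif shift < len(series):
            # The output at time t only sees the input at time t - shift.
            out[shift:] += w * series[:-shift]
    return out


series = [1, 2, 3, 4, 5, 6, 7, 8]
# Two stacked layers with dilation 1 and 2 (all filter weights set to 1).
h1 = causal_dilated_conv(series, [1.0, 1.0], dilation=1)
h2 = causal_dilated_conv(h1, [1.0, 1.0], dilation=2)
# The last value of h2 combines the last four inputs: 5 + 6 + 7 + 8.
```

Stacking layers with growing dilation lets the last output depend on a longer stretch of the past without looking at future values.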

Starting from this assumption, the network is trained to predict the value at time t+1 using the previous values.
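A model that predicts only one step ahead can still produce a longer forecast if each prediction is fed back as input. The sketch below shows this loop under that assumption. `roll_forecast` and `naive_last_value` are hypothetical names, and the stand-in predictor simply repeats the last value, where a trained network would go in practice.

```python
import numpy as np


def roll_forecast(predict_next, history, horizon):
    """Forecast `horizon` steps by feeding each prediction back as input."""
    window = list(history)
    forecast = []
    for _ in range(horizon):
        nxt = float(predict_next(np.asarray(window, dtype=float)))
        forecast.append(nxt)
        window.append(nxt)  # the prediction becomes part of the history
    return np.array(forecast)


# Hypothetical stand-in for the trained network: repeat the last known value.
def naive_last_value(values):
    return values[-1]


forecast = roll_forecast(naive_last_value, [1.60, 1.62], horizon=3)
```

Any callable that maps the values seen so far to the next value can be plugged in as `predict_next`.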

We used a 14-day prediction period. For each gas station, we split the dataset into two sequences: the “Encoding Interval”, used to “learn” the sequence, and the “Prediction Interval”, used to train the network to predict the next steps.

Since we need to know the previous time step to predict the next one, the Encoding and Prediction intervals have one entry in common. This entry is used as the starting point of the prediction.
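One possible reading of this split, sketched with NumPy on made-up data. The exact boundaries are an assumption: here the prediction interval holds the shared entry plus the 14 days to predict. The function `split_intervals` and the variable names are hypothetical.

```python
import numpy as np


def split_intervals(series, pred_steps=14):
    """Split an array of shape (n_stations, n_days) into the two intervals."""
    series = np.asarray(series, dtype=float)
    # The encoding interval ends on the last day before the forecast window.
    encoding = series[:, :-pred_steps]
    # The prediction interval starts on that same day, so the two
    # intervals share exactly one entry.
    prediction = series[:, -(pred_steps + 1):]
    return encoding, prediction


# Made-up data: 3 stations, 30 days of prices.
prices = np.linspace(1.5, 1.8, 90).reshape(3, 30)
encoding, prediction = split_intervals(prices)

# One-step-ahead training pairs inside the prediction interval:
# the value at day t is the input, the value at day t+1 is the target.
decoder_input = prediction[:, :-1]
decoder_target = prediction[:, 1:]
```

With this layout the first decoder input is the shared entry, which matches its role as the starting point of the prediction.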

Encoding and Prediction Interval used to train the neural network

After the split, the dataset is used as input to the Seq2Seq neural network.

Once the training was completed, we plotted the real and the predicted prices for different gas stations. Some examples are shown here.

Real and predicted results with the Seq2Seq neural network for 4 different gas stations

As you can see from the results, the neural network is quite good at predicting the petrol price for single gas stations.
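The plots give a visual impression only. A simple way to put a number on it would be the mean absolute error per gas station. The article does not report this metric, so the snippet below is just a sketch on made-up values with a hypothetical helper name.

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    """Mean absolute error along the last axis (one value per gas station)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted), axis=-1)


# Made-up real and predicted prices for 2 stations over 2 days.
errors = mean_absolute_error([[1.0, 2.0], [3.0, 4.0]], [[1.0, 3.0], [2.0, 4.0]])
```

Because the prices are in euros per liter, the error is in the same unit and can be read directly as an average price gap.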

Conclusions

In this article we showed a simple example of how to combine Qlik Sense and neural networks to explore one of the many Italian Open Datasets. We performed only some high-level steps to show the potential of this approach.

In the next article we will provide an in-depth description of the analysis pipeline, from the visualization to the prediction.

P.S. We’re hiring: if you like us, we also like you. If you are interested, check our job opportunities here.

Claudio Stamile
isiway-tech

Machine Learning Scientist | Double PhD | Software Engineer