A methodology to perform time series analysis — Part 2

Mouhamadou-Lamine Diop
Published in Axionable
Jun 18, 2018 · 10 min read

Authors: José Sanchez and Lamine Diop (equal contribution)

Following Part 1, we will now guide you through a real-world use case, the energy consumption in France, using our methodology for tackling data science problems.

For this purpose we collected data from RTE, France's electricity transmission system operator. RTE launched a challenge on www.datascience.net to forecast the daily power consumption in France. In the following paragraphs we will present the different steps required to implement a well-suited model. Note that we will also take advantage of this opportunity to highlight the many challenges we may face throughout a time series analysis.

1. Description of the data

The data is composed of a set of columns, ranging from the date and time at which each observation was collected to the breakdown of consumption across the different energy sources. Additionally, we can note that:

  • We downloaded the historical data from 2012 to 2016
  • The time slot between valid observations is 30 minutes

The French national and regional power consumption data is available from the RTE eco2mix portal: http://www.rte-france.com/fr/eco2mix/eco2mix-telechargement.

1.1. Performance score

We will use the Mean Absolute Percentage Error (MAPE) over the forecasted day to measure the accuracy of our predictions. It is defined by:

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{C_i - C_i^*}{C_i}\right|$$

where $C_i$ is the actual consumption, $C_i^*$ the forecast, and $n$ the number of forecasted points.
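For reference, a minimal Python sketch of this metric could look like the following (variable names are illustrative):

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error, expressed in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100
```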

1.2. Goal definition

As we intend to give a detailed approach to and understanding of the problem, we set ourselves the following objective: forecast one day of power consumption, namely 48 predictions ahead of the current data.

1.3. Technical information

1.3.1. Development platform

Axionable's team always works in a Jupyter notebook with Python 3 for development and testing purposes; production code can be written afterwards according to the client's technology and architecture.

We can see our working directory in the following image.

1.3.2. Hardware specifications

We initially started the development of this project, following our methodology, by performing the data exploration and data cleaning on a personal computer. However, once in the modeling and hyperparameter selection phase, we were quickly limited by long computation times due to the high granularity of the data, with more than 87,000 valid observations. Since we wanted to use the whole dataset to train our model, an enormous quantity of calculations had to be done in memory, which slowed down all processing.

Later, after getting more accustomed to the context, we arrived at a critical point where we started questioning the necessity of using the whole dataset. Is it really relevant to use this huge amount of information in order to forecast only one day of power consumption? That question will be answered later.

In this context, let us list the two hardware configurations that we used:

Personal computer:

  • Model: Macbook Pro
  • RAM: 8GB
  • Processing: 2 cores at 2.6 GHz

Cloud Computing:

  • VM: AWS m4.large
  • RAM: 64GB
  • Processing: 16 cores at 2.4 GHz

2. Data importing and data cleaning

The data source is stored in the “./data” folder and is composed of several Excel files, each containing the information of a whole year. Here is a glimpse of the operations that we are going to perform:

  • Load data into a Pandas DataFrame (dataframe from now)
  • Format the dataframe column names
  • Filter null data
  • Create a datetime index

2.1. Load data into dataframe

As we aim for our code to be production-ready, we implemented a loop to read all the Excel files contained in the “./data” folder. That way we do not have to worry about reading individual files or appending them into a single dataframe.
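As an illustration, here is a minimal sketch of such a loading loop, assuming the yearly files are stored directly under “./data” as Excel files (file names and read options are assumptions, not the exact notebook code):

```python
import glob
import pandas as pd

# Read every yearly eco2mix Excel file found in "./data" and stack them
# into a single dataframe.
frames = [pd.read_excel(path) for path in sorted(glob.glob("./data/*.xls*"))]
df = pd.concat(frames, ignore_index=True)
```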

2.2. Format dataframe column names

It is recommended to format the column names of an Excel table to remove spaces and special characters, which are frequent in a French document. Moreover, having no spaces or special characters allows us to use the Jupyter autocompletion.

We prefer the regex library over the standard Python string functions for this data cleaning step. Regular expressions are very powerful for formatting strings, even if they are sometimes slower than the standard functions.
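As an example, a hedged sketch of the renaming step could look like this, assuming df is the dataframe loaded above and that the raw French headers look like “Périmètre” or “Consommation”:

```python
import re
import unicodedata

def clean_column_name(name: str) -> str:
    # Strip accents, then replace any non-alphanumeric run with "_" and lower-case.
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")
    return name.lower()

df.columns = [clean_column_name(str(c)) for c in df.columns]
```

With this convention, “Périmètre” becomes perimetre and “Consommation” becomes consommation, which plays nicely with autocompletion.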

2.3. Filter null data

As explained in the introduction, we will only focus on univariate time series forecasting. Consequently, we are not considering the other variables such as fioul (fuel oil), charbon (coal), gaz (gas), etc. Besides, we notice that the dataframe actually follows a 15-minute time slot, with “NaN” values every 30 minutes (i.e. every other row/observation).

In this context, we will remove the null rows and keep only the France perimeter:
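A minimal sketch of this filtering step, assuming the cleaned column names perimetre and consommation from the renaming step above:

```python
# Keep only the national perimeter and drop the empty 15-minute slots.
df = df[df["perimetre"] == "France"]
df = df.dropna(subset=["consommation"])
```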

A nice trick to check whether there are NaN values in a dataframe is to use the equality operator: in a pandas Series self-comparison, we obtain False only where we are comparing null values.
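In code, the trick boils down to a self-comparison of the Series (again assuming the consommation column):

```python
# NaN is the only value that is not equal to itself, so the self-comparison
# is False exactly on the null entries.
missing_mask = ~(df["consommation"] == df["consommation"])
print("Remaining NaN values:", missing_mask.sum())
```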

2.4. Create a datetime index

In order to have a valid datetime at the minute granularity, we need to combine the “date” and “heures” columns appropriately. An index sort is also applied in order to avoid plotting problems.
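Here is a hedged sketch of that step, assuming cleaned date and heures columns formatted like “2016-01-01” and “00:30” (the exact raw formats may differ):

```python
import pandas as pd

# Combine date and time into a single datetime index, then sort it.
df["datetime"] = pd.to_datetime(df["date"].astype(str) + " " + df["heures"].astype(str))
df = df.set_index("datetime").sort_index()
```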

3. Data exploration

Once we consider our data as cleaned, we should proceed to the data exploration phase. In this phase, we need to answer at least the following questions:

  • Is there any trend and/or seasonality in the TS?
  • Do we need to do further data cleaning? Plotting the data often helps to detect other problems such as outliers, human errors in the data input, etc.
  • Do we need to aggregate the data?
  • Is the data coherent with reality? This is a crucial point in a data science problem: our analysis and results must conform with reality.

As you can imagine, we will plot the data at different time granularities in order to detect behaviours like trend or seasonality. At Axionable, we prefer to use the plotly library to make our plots interactive. In addition, let us mention that we are accustomed to creating functions for every automatable action, both to improve readability and to be able to reuse those resources in upcoming projects.

Below is the first such function, which plots time series data.
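As a rough sketch of what such a helper can look like with plotly's graph_objs API (the notebook version may expose more styling options):

```python
import plotly.graph_objs as go
from plotly.offline import iplot

def plot_time_series(series, title=""):
    # Plot a pandas Series indexed by datetime as an interactive line chart.
    trace = go.Scatter(x=series.index, y=series.values, mode="lines")
    layout = go.Layout(title=title,
                       xaxis=dict(title="Date"),
                       yaxis=dict(title="Consumption (MW)"))
    iplot(go.Figure(data=[trace], layout=layout))

# Example: plot the whole cleaned history.
plot_time_series(df["consommation"], title="French power consumption")
```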

3.1. Plot interactive time series using Plotly

Here, we will plot the whole historical data and hopefully determine at first glance key indicators for the upcoming modeling phase. In the notebook related to this project, we also plotted the TS at day and week granularities; you can read the shared notebook for further information.

Considering this plot, we can infer that:

  • There is a clear yearly seasonality combined with a trend in the data. Moreover, if we zoom in, we can also see that this tendency is coupled with weekly and daily seasonalities. Those remarks make up the main problem that we faced: the SARIMA model that we will use takes into account only one seasonality parameter, which raises a major question: which seasonality is the most suited for our objective?
  • It would be interesting to see whether there are common consumption behaviours at the week/year/day level and to corroborate them with reality. For instance, we should find lower energy consumption during the holiday season.

3.2. Plot seasonality and trend profiles at different granularities

How can we determine the evolution profile of a selected season in a time series?

Box plots are really useful for getting a fast insight into data trends at different levels of granularity.

We define below a function that directly plots the box plots from a dataframe, given the target and the labels of a set of box plots.
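A minimal sketch of such a helper (target is the numeric column, labels the column used to group the data):

```python
import plotly.graph_objs as go
from plotly.offline import iplot

def plot_box_profiles(df, target, labels, title=""):
    # One box per group defined by the "labels" column.
    traces = [go.Box(y=group[target], name=str(label))
              for label, group in df.groupby(labels)]
    iplot(go.Figure(data=traces, layout=go.Layout(title=title)))
```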

First of all, we will plot the weekly consumption and analyse it. Data aggregation at the day granularity is naturally required, which is why we group the data using pandas' groupby function.
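A possible version of that aggregation, assuming the consommation column and the plot_box_profiles helper sketched above:

```python
import pandas as pd

# Aggregate the half-hourly series to daily totals and label each day
# with its weekday, ordered Monday to Sunday.
daily = df["consommation"].resample("D").sum().to_frame()
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
daily["weekday"] = pd.Categorical(daily.index.day_name(),
                                  categories=order, ordered=True)

plot_box_profiles(daily, target="consommation", labels="weekday",
                  title="Daily consumption by day of the week")
```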

This plot can be read from two different perspectives. If we consider the whole figure, we can assume that the overall energy consumption is greater on working days than on weekends. On the other hand, the box plots taken independently show that daily consumption does not vary much from one week to another (less than 1% difference between the first and third quartiles).

In the same way, we proceed to plot the yearly and daily profile…

The plots above play a significant role in our understanding of the problem and helped us reach the following conclusions:

  • The power consumption drops during the summer holidays, and that tendency is confirmed over the 5 years of historical data.
  • The peaks in energy consumption of the daily profile are highly correlated with the French lunch and dinner times.
  • There seem to be stable weekly and daily profiles, which could have been very useful if we had had missing values in our dataset.

From now on, since this dataset is in line with reality, we may assert that no further data cleaning is needed. Also, considering the objective we defined in combination with the insights gained during exploration, we may assert that a daily seasonality is the most appropriate (explicitly, every 48 observations).

4. Modeling — Choosing our prediction model

Reassured by all the work done in the previous stage, particularly the box-plots which highlighted the seasonal effects as well as a clear trend, we decided to opt for a SARIMA model. However, we can only apply this model to a TS which is stationary.

The next paragraphs will discuss how we deal with non-stationarity in practice, as well as the core of the modeling.

To set the stage, let's keep in mind that the SARIMA model has 6 optimization parameters, (p, d, q) and (P, D, Q), plus the seasonal period S. We refer you back to our previous article (link) for a basic explanation of each of them and a refresher on how to choose the optimal ones in some cases.

4.1. Pre selection of the SARIMA parameters

In line with our explanations, we will first infer the differencing parameters d and D before starting the modeling.

In that regard, we can implement a function that, for given d and D values, plots the differenced TS and its corresponding ACF and PACF plots.
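A hedged sketch of such a function, using matplotlib and statsmodels for brevity (the notebook version is plotly-based) and assuming a 48-observation seasonal period:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

def plot_differenced(series, d=0, D=0, s=48, lags=100):
    # Apply d regular differences and D seasonal differences of period s,
    # then plot the resulting series with its ACF and PACF.
    ts = series.copy()
    for _ in range(d):
        ts = ts.diff()
    for _ in range(D):
        ts = ts.diff(s)
    ts = ts.dropna()

    fig, axes = plt.subplots(3, 1, figsize=(10, 8))
    ts.plot(ax=axes[0], title="Differenced series (d=%d, D=%d)" % (d, D))
    plot_acf(ts, lags=lags, ax=axes[1])
    plot_pacf(ts, lags=lags, ax=axes[2])
    plt.tight_layout()
    plt.show()
```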

We will then choose the order values that make the new TS stationary.

4.2. Hyperparameters

Now we can focus on tuning the other parameters (p, q, P, Q). Naturally, this implies creating grids of possible values for those parameters, fitting the corresponding models and evaluating them in order to choose the best one according to our metric (MAPE).

We do this using GridSearchCV from the Scikit-Learn module. It has the advantage of allowing us to run computations in parallel as well as to validate the results through cross-validation techniques.

Important note: SARIMA not being part of the Scikit-Learn module, we had to create a custom class wrapping our model in order for it to fit the standards of GridSearchCV. The same issue appears in the automation part of projects, where Scikit-Learn pipelines are often used for this purpose.
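To give an idea of what such a wrapper can look like, here is a minimal sketch (not the exact class we used) that exposes the SARIMA orders as scikit-learn hyperparameters; d and D are assumed to be fixed from the pre-selection step, and the MAPE scorer matches the metric defined in section 1.1:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from statsmodels.tsa.statespace.sarimax import SARIMAX

class SarimaEstimator(BaseEstimator, RegressorMixin):
    def __init__(self, p=1, d=1, q=1, P=1, D=1, Q=1, s=48):
        self.p, self.d, self.q = p, d, q
        self.P, self.D, self.Q, self.s = P, D, Q, s

    def fit(self, X, y):
        # y holds the past consumption values; X only matters for its length.
        self.result_ = SARIMAX(
            y,
            order=(self.p, self.d, self.q),
            seasonal_order=(self.P, self.D, self.Q, self.s),
        ).fit(disp=False)
        return self

    def predict(self, X):
        # Forecast as many steps as there are rows in X (48 for one day).
        return self.result_.forecast(steps=len(X))

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

grid = GridSearchCV(
    SarimaEstimator(),
    param_grid={"p": [0, 1, 2], "q": [0, 1, 2], "P": [0, 1], "Q": [0, 1]},
    scoring=make_scorer(mape, greater_is_better=False),
    cv=TimeSeriesSplit(n_splits=3),   # contiguous splits keep the time order
    n_jobs=-1,                        # run the search in parallel
)

y = df["consommation"].values
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature: only the ordering matters
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)
```

Because TimeSeriesSplit produces contiguous train/test splits, forecasting len(X_test) steps right after each training window lines up with the held-out values, so the cross-validated MAPE is meaningful.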

This marks the end of our modeling part; we can now turn to the evaluation of our model on real data. Inevitably, this involves comparing the predictions of our model to the observed values. Having trained our model on data from the year 2015, we will now perform a forecast for 2016. Finally, we plot the predictions against the real values.
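A minimal sketch of this evaluation step; the (p, d, q)(P, D, Q, s) orders below are placeholders, not the values actually selected by the grid search:

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Train on 2015 and forecast the first day of 2016 (48 half-hour steps).
train = df.loc["2015", "consommation"]
actual = df.loc["2016-01-01", "consommation"]

fit = SARIMAX(train.values, order=(1, 1, 1),
              seasonal_order=(1, 1, 1, 48)).fit(disp=False)
forecast = fit.forecast(steps=len(actual))

mape_value = np.mean(np.abs((actual.values - forecast) / actual.values)) * 100
print("MAPE: %.2f%%" % mape_value)
```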

Below are the results observed.

With the best model being …

5. Conclusions

5.1. Data Science on a production environment

One of the key recommendations we can make, beyond the modeling itself, is to always adapt our solutions to the production environment. In many cases, the best-performing model cannot be deployed because of processing/memory limitations or the time it would take to fit it.

In our experience at Axionable, in order to choose the most appropriate model one should answer these questions:

Scheduling

  • At what frequency should we run our code?

Production environment

  • Do we need to do parallel processing?
  • What are the memory and computing limitations?

Scalability

  • Will the input data grow over time?
  • Can our code be extended to parallel processing, if needed?
  • Does our code need to be implemented in a distributed way?

We then need to make a tradeoff between these constraints and choose the most suitable model for the target project.

5.2. Next steps

Part of modeling is researching alternatives to solve a particular problem. In that regard, we list below some steps that we could have taken, but which, for simplicity, we chose not to tackle for now.

  • Compare SARIMA with other univariate forecasting methods:
      • Exponential Smoothing
      • Recurrent Neural Networks (LSTM)
  • Extend the SARIMA forecast by adding exogenous variables, that is, representing it like a Generalised Linear Model.
  • As a longer-term goal, write an article on multivariate time series prediction.


P.S. 1: This post was adapted from the one in Axionable Foodtruck.

P.S. 2: If you want to know more about Axionable, our projects and careers please visit us or follow us on Twitter.
