A methodology to perform time series analysis — Part 1

Mouhamadou-Lamine Diop
Published in Axionable · Jun 11, 2018

Authors: José Sanchez and Lamine Diop (equal contribution)

How can we take advantage of chronological data to forecast, prevent or describe patterns that occur in our daily operations? This series of 2 articles will answer that question by presenting various concepts as well as an end-to-end methodology for time series analysis.

In Part 1, we will introduce the building blocks as well as Axionable’s methodology for data science projects. In the following article, we will then go through a real-world use case of time series prediction: the prediction of energy consumption in France.

1. Data science project management

Because they require expertise from different fields, data science projects can be broken down into many parts. In that regard, we present below our end-to-end methodology for tackling such problems.

Our methodology aims at dealing with all the challenges and changes that may occur in the lifecycle of a project. In most cases, 7 steps are involved:

  • Goals’ definition: in close collaboration with the client, we clearly define the goals and the scope of the project so that we can iterate on them
  • Data Acquisition and Cleaning: in this step, given the objective defined previously, we identify every data source that may be relevant and design extraction pipelines to retrieve the data needed. Then, because anomalies are inevitably present in any dataset, we clean our data so that it fits the analyses that will be done later
  • Data Exploration: the objective of this stage is to gain deeper insights into the dataset and to identify the operations we may perform to improve its relevance.
  • Modeling and Evaluation: this step may be considered the core of data science, in the sense that it consists of applying diverse machine learning techniques and evaluating all the contending models.
  • Documentation: part of our job is to make our work as understandable as possible. In that regard, we provide a detailed notebook with comments on the major steps that we followed.
  • Automation: machine learning models may need to be recoded to make them more efficient, using frameworks such as Spark for example.
  • Operations and optimization: this final step involves monitoring performance through well-chosen KPIs and retraining the models whenever changes occur.

Note that for this series we will follow those steps up to Modeling and Evaluation. If you need more details about any other part, you may contact us at datascience@axionable.com.

2. Time series analysis

2.1. Introduction to time series

Time series (TS from now on) can be seen as sequences of data points measured over successive intervals of time. Their main specificities, compared to most common machine learning settings, are the dependence on time and the seasonal behaviors that may appear in their evolution. Indeed, the observations are no longer independent, and sets of variations, along with increasing or decreasing trends, occur according to particular time frames.

In practice we distinguish 2 kinds of TS: we talk about univariate TS when only occurrences of a single variable are observed; when more than one variable is considered, the series is qualified as multivariate. Note that we will focus on univariate time series for simplicity. Most of the approaches presented can then be applied, with some adjustments, to multivariate TS.

2.2. Business Impact

Over the last few years, the field has received a lot of attention due to the multiplicity of its applications. In a context where business decisions are more and more guided by the insights we get from data analysis, TS enable companies to project themselves and perform 3 main tasks:

  • Forecast: They allow the prediction of the future based on past events
  • Prevention: They permit the control of the processes producing the series
  • Description: They help understand the inherent structure and the mechanism generating the series (overall trend, cyclic patterns, etc.)

2.3. Use Cases

In accordance with its main applications, TS analysis is used in many different sectors, giving rise to several use cases. Among those we may cite:

  • Meteorology: prediction of weather variables such as temperature, precipitation, wind, etc.
  • Economy & Finance: explanation and prediction of the economic factors, financial indexes, exchange rates, etc.
  • Marketing: keeping track of the key performance indicators of businesses such as sales, incomes/expenses, etc.
  • Telecommunications: forecasting of call data records, management of call center workforces, etc.
  • Industry: control of energy variables, efficiency logs, etc.
  • Web: web traffic sources, clicks and logs, sentiment and behavior analysis etc.

In practice, although most data is collected continuously, we usually work with discrete TS where consecutive observations are equally spaced in time. This is done by either keeping the values measured at a given time frame or aggregating the continuous measurements over a specified period, as in the sketch below.
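Here is a minimal sketch of both options using pandas; the file name (measurements.csv) and the column names (timestamp, value) are assumptions to adapt to your own data source.

```python
import pandas as pd

# Hypothetical file and column names: a continuous stream of measurements.
raw = pd.read_csv("measurements.csv", parse_dates=["timestamp"]).set_index("timestamp")

# Option 1: keep one value per fixed time frame (here, the last reading of each day).
daily_last = raw["value"].resample("D").last()

# Option 2: aggregate the continuous measurements over the period (here, a daily mean).
daily_mean = raw["value"].resample("D").mean()
```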

2.4. Description of the patterns

In order to make the definitions as concrete as possible, we will present the 3 main characteristics of TS (the trend, the seasonality and cyclical patterns) using a dataset that is freely available on the internet.

The data in question is downloaded from RTE, the French transmission system operator, and consists of records of power consumption in France from 2012 to 2016. The time interval between two measurements is 30 minutes, which gives us a fairly large dataset with more than 150,000 records.

Below is an interactive plot to let you visualize the data. Note that we generally use a Python package named plotly to generate such visualizations. Its main advantage lies in the fact that we may zoom in and out to get either a granular view or an overall one.
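As a reference, here is a minimal sketch of how such an interactive plot can be produced with plotly; the file name (eco2mix_consumption.csv) and column names (timestamp, consumption_mw) are assumptions to adapt to the actual RTE export.

```python
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import plot

# Hypothetical file and column names: adapt to the actual RTE export.
df = pd.read_csv("eco2mix_consumption.csv", parse_dates=["timestamp"])

# One trace holding the full half-hourly consumption series.
trace = go.Scatter(x=df["timestamp"], y=df["consumption_mw"],
                   mode="lines", name="Power consumption (MW)")

layout = go.Layout(title="Power consumption in France (2012-2016)",
                   xaxis=dict(title="Date"),
                   yaxis=dict(title="Consumption (MW)"))

# Writes an interactive HTML file in which you can zoom in and out.
plot(go.Figure(data=[trace], layout=layout), filename="consumption.html")
```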

Trend & seasonality:

If we consider the yearly observations, we can see a typical tendency repeating itself: a decreasing pattern in the first half of the year followed by an increasing one in the second half. That is what we call a trend, which can be informally defined as the long-term increase or decrease present in a dataset. In addition, we noticed that the trend seems to be dictated by each half of the year. That kind of fixed influence, tied to a defined timeline, is what is called seasonality.

Cyclical patterns:

Cyclical patterns, like seasonality, are sets of variations in the series; the only difference is that these patterns do not have a fixed length.
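One common way to visualize the trend and seasonal components, not detailed in this article, is a seasonal decomposition. Below is a minimal sketch with statsmodels, reusing the hypothetical file and column names from the plot above and assuming a daily resampling with a yearly period.

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical file/column names; daily mean of the half-hourly records.
y = (pd.read_csv("eco2mix_consumption.csv", parse_dates=["timestamp"])
       .set_index("timestamp")["consumption_mw"]
       .resample("D").mean()
       .interpolate())  # fill any gaps left by the resampling

# Additive decomposition into trend, seasonal and residual components (yearly period).
decomposition = seasonal_decompose(y, model="additive", period=365)
decomposition.plot()
plt.show()
```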

2.5. Technical notions

Before diving into the analysis, it is important to establish the core notions in order to understand the underlying mechanisms of time series. The reason is that modeling first requires gaining as many insights as possible into the situation, and typically, from one data science problem to another, models need to be optimized according to a certain set of values called hyperparameters. In fact, hyperparameter tuning is among the key differentiators between good models and state-of-the-art models.

Thus we will define stationarity, differencing and SARIMA, to give you a package of tools to build a well-suited model. Note that we do not aim at providing a theoretical course; we focus on giving a practical understanding so that you can deal with those parameters while modeling.

Stationarity & differencing:

Basically, stationarity is the property that the dependence between values does not come from the time at which they are observed but rather from the rule governing their realizations. That means that the immediate correlation between 2 values doesn’t depend on where they sit in time but rather on the lag between them. Its importance lies in the fact that the parameters of stationary models are stable over time. Naturally, that assumption implies that the mean and the variance should be constant regardless of the chosen period of time. Consequently, TS with a trend or a seasonality are not stationary: those factors immediately affect the overall mean and variance of the series.

As a result, some transformations need to be applied for the series to meet these theoretical requirements. We will do so by applying a common technique named differencing.

Differencing is no more than the computation of the differences between consecutive observations (or between observations separated by a seasonal lag in the seasonal case). In most cases this helps eliminate the trend and seasonality of a time series, as in the snippet below.
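As a sketch, with pandas this amounts to the diff method; the file and column names are the same assumptions as above, and the seasonal lag of 48 corresponds to one day of half-hourly records.

```python
import pandas as pd

# Hypothetical file/column names; the series is indexed by its timestamp.
y = (pd.read_csv("eco2mix_consumption.csv", parse_dates=["timestamp"])
       .set_index("timestamp")["consumption_mw"])

# First-order differencing: y_t - y_{t-1}, removes most of the trend.
y_diff = y.diff().dropna()

# Seasonal differencing with a daily season of m = 48 half-hourly steps: y_t - y_{t-48}.
y_seasonal_diff = y.diff(48).dropna()

# Both can be combined when the series shows both a trend and a seasonality.
y_combined = y.diff(48).diff().dropna()
```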

In that regard, we first need to define the notation used to represent differencing. Two notations are commonly used: the backshift notation and the linear notation. For the backshift notation, we define the operator B by B y_t = y_{t-1}, i.e. applying B shifts the series one step back in time.

For its compactness and readability, we recommend using the backshift notation. To illustrate it, below is the representation of a second-order differencing combined with a first-order seasonal differencing of m steps:
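Following the standard convention, this combination can be written in backshift notation as:

```latex
(1 - B)^2 (1 - B^m)\, y_t
  = y_t - 2y_{t-1} + y_{t-2} - y_{t-m} + 2y_{t-m-1} - y_{t-m-2}
```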

To infer which degree of differencing is most appropriate, we plot the autocorrelation function (ACF) or the partial autocorrelation function (PACF). They give us hints about the correlation between two observations at a given lag, respectively including and excluding the intermediary influences.

In practice, we look for the following patterns:

  • The ACF and the PACF plots should both decrease rapidly towards 0 (exponentially or in a sinusoidal manner)
  • A significant spike at a certain lag of the ACF, with no other significant spike after it (it gives us a hint about the value of the parameter q)
  • A significant spike at a certain lag of the PACF, with no other significant spike after it (it indicates an ideal value of the parameter p)
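Here is a minimal sketch of how those plots can be produced with statsmodels, applied for instance to the first-order differenced series from the previous snippet (same hypothetical file and column names).

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Hypothetical file/column names; first-order differenced consumption series.
y_diff = (pd.read_csv("eco2mix_consumption.csv", parse_dates=["timestamp"])
            .set_index("timestamp")["consumption_mw"]
            .diff().dropna())

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(y_diff, lags=50, ax=axes[0])   # spikes hint at the MA order q
plot_pacf(y_diff, lags=50, ax=axes[1])  # spikes hint at the AR order p
plt.tight_layout()
plt.show()
```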

SARIMA Models:

Time series analysis can be seen as the search for the closest characterization of an observed set of values. To perform such a task we rely on various mathematical models, among which are SARIMA models, which stands for Seasonal AutoRegressive Integrated Moving Average. In this section too, we limit ourselves to a basic description of the model parameters.

More precisely, here is a description of the main components of the SARIMA model, which can be broken down into two parts:

  • The ARIMA part, with parameters (p, d, q):
  • The autoregressive part considers past values of the time series, p being the AR order, i.e. the number of steps back in the past.
  • The integrated part is the transformation needed to make the time series stationary, d being its order of differencing.
  • The moving average part takes past errors into consideration instead of past values, q being its MA order.

  • The seasonal part, with parameters (P, D, Q, S), which deals with seasonal behaviors, each parameter having the same basic definition as its non-seasonal counterpart introduced above, and S being the length of the seasonality (a fitting example is sketched below).
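As an illustration, here is a minimal sketch of fitting such a model with statsmodels’ SARIMAX implementation. The orders (1, 1, 1) and (1, 1, 1, 7) are placeholders rather than tuned values, and the file, column names and daily resampling with a weekly season are the same assumptions as in the previous snippets.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical file/column names; daily mean consumption, with a weekly season (S = 7).
y = (pd.read_csv("eco2mix_consumption.csv", parse_dates=["timestamp"])
       .set_index("timestamp")["consumption_mw"]
       .resample("D").mean()
       .interpolate())  # fill any gaps left by the resampling

# Illustrative orders only: (p, d, q) and (P, D, Q, S) should be chosen from the
# ACF/PACF analysis and validated on the real data.
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
result = model.fit(disp=False)

print(result.summary())
forecast = result.forecast(steps=30)  # 30 days ahead
```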

Conclusion

Throughout this article we introduced the basic concepts which are essential to time series modeling. In the next part, we will tackle the practical aspects of TS analysis while giving you full access to our code, so that you have a starting point for further analysis.

References

“Forecasting: principles and practice | OTexts.” https://www.otexts.org/fpp

P.S. 1: This post was adapted from the one in Axionable Foodtruck.

P.S. 2: If you want to know more about Axionable, our projects and careers please visit us or follow us on Twitter.
