Multi-seasonal time series analysis: decomposition and forecasting with Python

Daniel J. TOTH · Published in Analytics Vidhya · Jul 21, 2021 · 10 min read

A practical example of analyzing a complex seasonal time series with 100,000+ data points using the Unobserved Components Model

Photo by Richard Horvath on Unsplash

Scope

Forecasting is a common statistical task in business. It is of great importance as it affects management decisions and how corporate resources are managed. The first statistical forecasting methods were developed about 100 years ago, and ARIMA models are still frequently used alongside modern machine learning and deep learning techniques. Despite their popularity, ARIMA models have some serious drawbacks:

  1. the model coefficients are not easy to interpret and need detailed explanation
  2. while efficient for small data sets, fitting becomes computationally expensive as the data grows
  3. stationarity of the data is assumed, otherwise the inputs must be transformed; consequently, forecasts refer to the transformed data and not to the original time series. Apart from hurting interpretability, this widens the confidence intervals relative to a series that is stationary without transformation

For a better interpretation of seasonal time series, other methods have been developed, such as the Unobserved Components Model (UCM). Being a so-called ‘state space’ model, UCM decomposes the original time series into its individual level, trend, cyclic and seasonal components and predicts future values by modeling these components and taking their sum.

This article focuses on practical aspects; for more information on state space models and specifically UCM, please follow the underlined links. The statsmodels user guide is also available here, but I apply the method to a larger and more illustrative data set.

The goal of the analysis is to forecast energy demand one year ahead using the available past values. Optionally, the model can be refined using exogenous variables. Some such variables were selected, collected and fed to the model.

Note: code is embedded step by step; the full code is available here. For presentation purposes, the outputs are occasionally embedded first, followed by the inputs. The outputs are intended to be comprehensible for a lay audience. Do you think they are? Is something missing? Opinions are welcome.

About the data set

The chosen data set is ‘Hourly energy consumption’ from Kaggle. The source contains hourly reported energy demand values in megawatts (MW) from several service areas across the United States. For demonstration purposes, I have chosen the American Electric Power (AEP) time series from 10.01.2004 01:00:00 to 03.08.2018 00:00:00.

Some dates are missing, some are duplicated. These issues have been addressed and presented below.

Methods

The data set is split into train and test subsets (the last 8766 hours form the test set). Brief summary of the applied objects and functions:

  1. Loading and cleaning data frame using pandas and Python’s datetime class
  2. Exploratory data analysis by plotting time series with matplotlib.pyplot customized by seaborn settings
  3. Decomposition of individual components manually by the seasonal_decompose() function of statsmodels.tsa.seasonal
  4. Approximation of individual components by scipy.optimize and numpy
  5. Decomposition of individual components automatically and performing residual diagnostics by UnobservedComponents class of statsmodels.tsa
  6. Model evaluation using sklearn.metrics (MAE — mean absolute error, RMSE — root mean squared error), and comparison of observed and forecast value integrals (yearly absolute energy demand in MWh)

Results

Loading and cleaning data frame

Libraries mentioned above are first loaded:
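The import block below is a minimal reconstruction covering everything used later in this post; the original notebook may group them differently.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.optimize import curve_fit
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.structural import UnobservedComponents
from sklearn.metrics import mean_absolute_error, mean_squared_error

sns.set()  # seaborn styling for all matplotlib plots
```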

The data frame is loaded from csv format as provided by Kaggle. It turns out the indices are not in order:
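A minimal loading step; the file and column names (AEP_hourly.csv, Datetime) follow the Kaggle download, but treat them as assumptions:

```python
# Load the raw csv with the timestamp column as index (still plain strings here)
df = pd.read_csv('AEP_hourly.csv', index_col='Datetime')

print(df.head())
print(df.index.is_monotonic_increasing)  # False: rows are not in chronological order
```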

A quick visualization reveals the complexity of the data set. A narrower time frame is also plotted for inspecting subtle patterns:

The indices are not in datetime format, which is required as input for the model classes, so they must be converted. The test checks the equality of a generated list of dates, with the same start and end dates as the data frame index, against the index itself:
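A sketch of that test, using the start and end timestamps quoted above:

```python
# Generate the complete hourly range and compare it with the raw string index
full_range = pd.date_range(start='2004-10-01 01:00:00',
                           end='2018-08-03 00:00:00', freq='H')
print(df.index.equals(full_range))
```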

The output is False. Further investigation reveals that duplicates are present and some values are missing, since:

No. of elements in set of indices < No. of indices < No. of elements in full list

After converting to datetime format, an algorithm reveals the indices that are duplicates or missing:
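Roughly, the check could look like this (variable names are mine):

```python
# Convert the index to datetime, then locate duplicated and missing timestamps
df.index = pd.to_datetime(df.index)

duplicated = df.index[df.index.duplicated()]
missing = full_range.difference(df.index)
print(f'{len(duplicated)} duplicated and {len(missing)} missing timestamps')
```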

Remedies for the anomalies are as follows:

  1. converting data frame indices by changing the string format of the extracted indices to datetime format
  2. filling index gaps with mean values
  3. dropping duplicates, keeping mean of duplicate values

Sorting the indices and testing equality against the generated list again is finally successful:

The frequency is set explicitly, although the modeling classes could infer it as long as the indices are consistent.
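A condensed sketch of the cleaning steps listed above; filling the gaps with the overall series mean is my assumption, the original code may use a more local fill:

```python
# Keep the mean of duplicated timestamps, reindex to the full hourly range,
# fill the resulting gaps and set an explicit hourly frequency
df = df.groupby(df.index).mean()      # drop duplicates, keep their mean value
df = df.reindex(full_range)           # missing hours appear as NaN
df = df.fillna(df.mean())             # fill index gaps with mean values
df = df.sort_index().asfreq('H')

print(df.index.equals(full_range))    # True
```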

Exploratory data analysis

Five random time windows are selected and plotted at different scales for inspecting any seasonal patterns present:

The patterns hint at daily, weekly and yearly seasonal components at least. For better resolution, the suspected seasonal components are plotted individually; this time, vertical dashed lines aid the eye:
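As an illustration, one such weekly window with dashed day boundaries could be drawn like this (the window dates and the AEP_MW column name are assumptions):

```python
# One week of demand with dashed vertical lines marking midnight of each day
window = df.loc['2017-07-03':'2017-07-09', 'AEP_MW']

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(window.index, window.values)
for day in pd.date_range('2017-07-03', '2017-07-10', freq='D'):
    ax.axvline(day, color='grey', linestyle='--', linewidth=0.8)
ax.set_ylabel('Demand [MW]')
ax.set_title('One week of hourly demand')
plt.show()
```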

Decomposition of individual components manually

The time series is split into train and test data. The last year (365.25 days, or 8766 hours) is reserved for testing. Decomposition is performed by the seasonal_decompose function using moving averages. The function accepts a single period argument, so it must be applied multiple times: first the daily component is extracted (period=24 for hourly data), then the weekly component (period=168), and finally the yearly component (period=8766 hours, i.e. 365.25 days to account for leap years).

The weekly decomposition should be applied after the daily component has been subtracted, and the yearly decomposition after all other seasonal components have been subtracted from the train data. Please see the flowchart figure for a brief summary of the process.

Swim lane flowchart of time series manual decomposition (source: Author)

The code for the process:
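A sketch along the lines of the flowchart above, with my own variable names:

```python
# Hold out the last 365.25 days (8766 hours) for testing
train = df['AEP_MW'].iloc[:-8766]
test = df['AEP_MW'].iloc[-8766:]

# 1) daily component (period = 24 hours)
daily = seasonal_decompose(train, period=24)

# 2) weekly component, extracted after subtracting the daily one (period = 168 hours)
weekly = seasonal_decompose(train - daily.seasonal, period=168)

# 3) yearly component, extracted after subtracting both (period = 8766 hours)
yearly = seasonal_decompose(train - daily.seasonal - weekly.seasonal, period=8766)

trend = yearly.trend        # long-term trend left after removing all seasonality
residual = yearly.resid
```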

The outputs are visualized with annotations added (code found below the image). As stated, the decomposition was made with the moving average method. Better methods for the process are available in the statsmodels library, so at this point the results depend on the considerations of the programmer. For demonstration purposes, I stick with this method.

In-sample predictions (over the train data time interval) are readily available by summing the individual components without the residual. However, out-of-sample predictions (forecasting) are not yet possible: the trend component inevitably needs modeling before future values can be predicted. As for the seasonal components, calendar effects have leaked into them, and the phase of the weekly and yearly components also varies for the same day of the month, because one year consists of 52.14 to 52.29 weeks (52.179 on average) or 8760 to 8784 hours (8766 on average), depending on leap years.

Approximation of individual components

Forecasting becomes possible after modeling the individual components. The trend component is modeled linearly across the decreasing trend region and also by a 3rd-degree polynomial across the whole train data set, using numpy's polynomial fitting functions.
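For the polynomial part, a numpy sketch (the linear fit over the decreasing region works the same way with deg=1):

```python
# Fit a 3rd-degree polynomial to the extracted trend over the whole train set
t = np.arange(len(trend))
mask = ~np.isnan(trend.values)   # seasonal_decompose leaves NaNs at the series edges

coeffs = np.polyfit(t[mask], trend.values[mask], deg=3)
trend_model = np.polyval(coeffs, t)
```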

The seasonal components are approximated by trigonometric functions. Finding the optimal approximation is an automatic process with scipy.optimize, but the functions themselves have to be constructed manually. A rough manual approximation is also indispensable before calling the scipy method, so that the optimizer finds the true minimum of the objective function. The individual approximations are presented below (outputs first):
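A single-harmonic sketch for the daily component with scipy's curve_fit; the actual functions in the post are more elaborate, and p0 stands for the manual initial guesses:

```python
def daily_wave(t, amp, phase, offset):
    """One sine term with a 24-hour period; more harmonics can be added as needed."""
    return amp * np.sin(2 * np.pi * t / 24 + phase) + offset

t = np.arange(len(daily.seasonal))
p0 = [daily.seasonal.std(), 0.0, 0.0]   # rough guesses for amplitude, phase, offset
params, _ = curve_fit(daily_wave, t, daily.seasonal.values, p0=p0)

daily_model = daily_wave(t, *params)
```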

Eventually, a forecast series can be constructed from the modeled components. The visualization of the models truly shows the complexity of the data set and the power of decomposition in providing interpretable components:

The model is evaluated by mean absolute error (MAE) and root mean squared error (RMSE). The former gives the average deviation from the observed values and is more intuitive, while the latter (if significantly greater than the MAE) indicates the presence of outliers and may hint that the distribution of residuals does not follow a normal distribution. A slight difference is observable:
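With sklearn, assuming the combined forecast of the modeled components is stored in a series called forecast aligned with test:

```python
mae = mean_absolute_error(test, forecast)
rmse = np.sqrt(mean_squared_error(test, forecast))
print(f'MAE:  {mae:.1f} MW')
print(f'RMSE: {rmse:.1f} MW')
```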

As you will see, the UnobservedComponents class has a diagnostics plotting function that provides additional insights.

The third evaluation metric is remarkably accurate. That means the modeled trend component is valid and the residuals vary across the series with a mean close to zero. As you will see, all the models describe the total yearly demand (TYD) accurately (within ±2.4%) and differ in how well they approximate the seasonal variations.
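The custom metric is just the integral of the hourly series over the test year; summing hourly MW values gives MWh:

```python
# Total yearly demand (TYD) and its relative forecast error
tyd_observed = test.sum()
tyd_forecast = forecast.sum()
print(f'TYD error: {100 * (tyd_forecast - tyd_observed) / tyd_observed:+.2f}%')
```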

Decomposition of individual components automatically and performing residual diagnostics

All of the above can be achieved via the UnobservedComponents class, whose results class offers readily available diagnostic functions and calculated metrics. Considering the insights gained from the exploratory analysis, a single cell with relatively few lines of code describes the necessary process and yields the following summary table and metrics:
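A minimal version of such a cell; the exact component specification and the number of harmonics are my assumptions, not necessarily those used for the published results:

```python
# Local linear trend plus daily, weekly and yearly trigonometric seasonality
ucm = UnobservedComponents(
    train,
    level='local linear trend',
    freq_seasonal=[{'period': 24,   'harmonics': 2},
                   {'period': 168,  'harmonics': 2},
                   {'period': 8766, 'harmonics': 2}],
)
ucm_res = ucm.fit(disp=False)
print(ucm_res.summary())

ucm_forecast = ucm_res.forecast(steps=8766)
print(mean_absolute_error(test, ucm_forecast),
      np.sqrt(mean_squared_error(test, ucm_forecast)))
```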

Considering the MAE and RMSE metrics, this model is essentially equivalent to the manual method presented above. Note that metrics for the in-sample predictions are also available. This gives the programmer the flexibility to use k-fold cross-validation and refine the model parameters for better results. All you have to do is split the original time series not only into two subsets (train and test), but further split the train set into k intervals (folds). The model is initialized with default hyperparameters and then optimized. In k-fold cross-validation, the initially trained model is refined after each fold by reinitializing it with the parameters obtained from the previous folds, and the metrics are calculated (using the append() method of the UnobservedComponentsResults class). The model is evaluated on the average of the chosen metric across the folds.

The results are plotted in the style of the results class plot_components function (some key information was missing from it, hence it is bypassed):

Finally, plot_diagnostics is called:
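The call itself is a one-liner on the fitted results object:

```python
ucm_res.plot_diagnostics(figsize=(12, 8))
plt.show()
```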

The deviation of the residual distribution from the normal distribution is small enough for a meaningful forecast confidence interval. Frankly, the confidence intervals skyrocket after a few months (I visualized only the point forecasts), but the viability of the model is shown by:

  1. the accuracy of the total yearly demand,
  2. the RMSE (which is basically the standard deviation of the residuals) being roughly 10% of the average demand level, and
  3. the point forecast one year ahead being remarkably accurate, 2.08% higher than the observed value.

Subsequent UC models differ from the previous one in utilizing exogenous variables when establishing the model, practically regressing the residuals and eliminating variance left unexplained by the seasonal model. The observed variance of the daily patterns (see EDA) may originate from temperature variance: air conditioning is commonly used in the summer, the daily pattern shows a single peak, and summer demand is relatively higher than, for example, spring demand. The temperature data was downloaded from the National Centers for Environmental Information (NCEI, see sources) and contained missing values. Filling the gaps would take another blog post to present, so please take it as given: the gaps were filled with ARIMA modeling. The data itself originates from San Angelo, which covers only a part of AEP's operations. Although Texas is assumed to be among the most populated and most industrialized of AEP's service areas, only limited explanatory power is expected; I still use it for demonstration purposes.

The model, which utilizes the temperature data as an additional input, with its metrics calculated:
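A sketch with temperature as an exogenous regressor; exog_train, exog_test and the 'temp' column are hypothetical names for the prepared San Angelo series:

```python
# Same structural components as before, plus temperature as exogenous input
ucm_temp = UnobservedComponents(
    train,
    exog=exog_train[['temp']],
    level='local linear trend',
    freq_seasonal=[{'period': 24,   'harmonics': 2},
                   {'period': 168,  'harmonics': 2},
                   {'period': 8766, 'harmonics': 2}],
)
ucm_temp_res = ucm_temp.fit(disp=False)

# Out-of-sample forecasts also need the future exogenous values
temp_forecast = ucm_temp_res.forecast(steps=8766, exog=exog_test[['temp']])
```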

Looking at the metrics, the improvement is obviously minor. However, it demonstrates the model's potential: the seesaw pattern observed in comparison with the first UCM shows that some of the variance has been associated with the temperature variation:

Residual diagnostics:

It is hardly observable, but the autocorrelation plot has shifted toward zero. A carefully selected exogenous variable would significantly reduce the autocorrelation of the residuals, leaving them as pure random noise.

To address the issue, more variables are added to the analysis: net solar flux and solar zenith angle for modeling illumination (intended to capture the extra energy demand for artificial lighting). The data is downloaded from the CERES database (see sources). In this particular case, no data cleaning had to be performed.

I prepared a pure regression analysis as a benchmark, using the same class:
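Such a benchmark can be expressed with the same class by dropping the structural components and keeping only the regression part; the exogenous column names are again hypothetical:

```python
# Pure regression: a fixed intercept, an irregular term and the exogenous variables
reg = UnobservedComponents(
    train,
    exog=exog_train[['temp', 'net_solar_flux', 'solar_zenith_angle']],
    level='fixed intercept',
    irregular=True,
)
reg_res = reg.fit(disp=False)
reg_forecast = reg_res.forecast(
    steps=8766,
    exog=exog_test[['temp', 'net_solar_flux', 'solar_zenith_angle']])
```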

As expected, modeling the seasonal patterns yields better results than pure regression with the given exogenous variables. Visualization of the in-sample predictions for this model:

The diagnostic plots show increased autocorrelation and a residual distribution that is less similar to the normal distribution:

Final model utilizes all the given variables and the trigonometric modeling:
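Roughly, the final specification combines the two previous sketches:

```python
# Trend, trigonometric seasonality and all three exogenous variables together
full = UnobservedComponents(
    train,
    exog=exog_train[['temp', 'net_solar_flux', 'solar_zenith_angle']],
    level='local linear trend',
    freq_seasonal=[{'period': 24,   'harmonics': 2},
                   {'period': 168,  'harmonics': 2},
                   {'period': 8766, 'harmonics': 2}],
)
full_res = full.fit(disp=False)
full_forecast = full_res.forecast(
    steps=8766,
    exog=exog_test[['temp', 'net_solar_flux', 'solar_zenith_angle']])
```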

And the residual analysis:

Model evaluation

As stated before, evaluation is based on three metrics:

  1. MAE (mean absolute error)
  2. RMSE (root mean squared error)
  3. total yearly energy demand (TYD), custom metric

All the models were accurate in forecasting the trend. As a result, the error of the predicted yearly energy demand in MWh is relatively close to zero (within ±2.4%). However, the better models were more capable of reproducing the seasonal variation, yielding a lower MAE. The RMSE decreased at the same rate, which means the outliers remained present in all the models (otherwise it would have decreased more rapidly than the MAE). A data frame and a figure were set up for the final comparison:

Important note: Hyndman & Athanasopoulos propose the use of AIC or AICc (corrected AIC) for evaluation of models.

Although the metrics were only slightly affected by tuning the UCM models, the presented results show that complex multi-seasonal time series can be addressed with UCM, even for long-term forecasting. To achieve precision and accuracy, one must carefully:

  1. select the decomposition method
  2. model individual components
  3. select the proper exogenous variables

Or, following the manual decomposition method, one can model the individual components by an arbitrarily complex function, given access to domain knowledge of the industry.

Please see the full code at my Jovian profile here.

Sources

Original data set
https://www.kaggle.com/robikscube/hourly-energy-consumption

Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 07.21.2021.

An Animated Guide©: Proc UCM (Unobserved Components Model)
Russ Lavery, Contractor for ASG, Inc.

A more detailed guide to UCM
Time Series Modeling with Unobserved Components, Rajesh Selukar, SAS Institute Inc., Cary, NC
https://forecasters.org/wp-content/uploads/gravity_forms/7-621289a708af3e7af65a7cd487aee6eb/2016/07/Selukar_Rajesh_ISF2016.pdf

Solar net flux and solar zenith angle data
Wielicki, B. A., B. R. Barkstrom, E. F. Harrison, R. B. Lee III, G. L. Smith, and J. E. Cooper, 1996: Clouds and the Earth’s Radiant Energy System (CERES): An Earth Observing System Experiment, Bull. Amer. Meteor. Soc., 77, 853–868. doi: 10.1175/1520-0477(1996)077<0853:CATERE>2.0.CO;2

Temperature data for San Angelo
https://www.ncei.noaa.gov/


Biochemical engineer, former environmental lab analyst with quality focus. I do my best to bring high quality content to data science learners/enthusiasts.