DATA SCIENCE THEORY | TIME SERIES ANALYSIS | KNIME ANALYTICS PLATFORM

Building a Time Series Analysis Application

Describing the process of accessing, pre-processing, and forecasting time series

Maarit Widmann
Low Code for Data Science


Co-authors: Daniele Tonini and Corey Weisinger


A time series analysis application starts from data access and pre-processing, proceeds with model training and evaluation, and finishes with deployment, like any other data science application. Yet for time series data, the techniques applied in these steps differ from those used for cross-sectional data, because time series data are collected by observing the same object multiple times, and the pre-processing and model training steps have to take this temporal structure of the data into account.

In this article, we introduce different sources of time series data, explain common pre-processing techniques, and compare different approaches to forecasting. Finally, we show how to execute a time series application in KNIME Analytics Platform.

Accessing Time Series

Time series data are collected by observing the same object over time. The sources of time series data, and thus the applications, are many: for example, daily sales data from a supermarket for demand prediction, yearly macroeconomic data from a country for long-term political planning, and sensor data from a smartwatch for health tracking. The different sources produce time series data at different levels of detail and regularity. As an example, we could have one GDP value per country per year and one sensor value per second from a smartwatch. On the other hand, the GDP values fluctuate less and are always available, while the smartwatch signals might be anomalous or missing entirely, for example when the battery is low. Especially if we are observing random events, such as disease infections or spontaneous customer visits, the aperiodicity of the raw data is expected by the very nature of the data collection.

Figure 1: A time series is collected by observing the same object over time. The observable objects vary from tiny single objects such as muscles in a human body to larger entities, such as countries.

Pre-processing Time Series

After accessing the time series data, the next step is to clean and transform them as required by, firstly, the original shape of the data and, secondly, the purpose of the analysis.

Below we introduce a few common pre-processing steps for time series data.

Sorting, partitioning, and aggregation

Sorting ensures that the time series is in chronological order. This chronological order should also be preserved when partitioning the data into training and test sets. Therefore, the appropriate partitioning strategy is called "take from the top": it takes the first x rows for training and uses the remaining rows for testing. Furthermore, if the data contain more than one record per timestamp, for example multiple orders per day, a required pre-processing step is to aggregate the recorded values by timestamp, which produces a time series with exactly one value per timestamp. Another purpose of aggregating time series data is to reduce their granularity. For example, to obtain monthly data from originally daily data, you can aggregate the daily values into monthly totals. Aggregation by date and time fields is called time aggregation.
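In KNIME these steps are handled by dedicated nodes and components, but for readers who think in code, here is a minimal pandas sketch of the same ideas: sorting, aggregating to one value per timestamp, monthly time aggregation, and a "take from the top" split. The column names, the toy data, and the 80/20 split ratio are illustrative assumptions.

```python
import pandas as pd

# Toy order data with several records per day (column names are assumptions)
df = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-02", "2017-01-01", "2017-01-01", "2017-01-03"]),
    "sales": [120.0, 35.5, 80.0, 42.0],
})

# Sort chronologically and aggregate to one value per timestamp (daily total sales)
daily = df.sort_values("date").groupby("date")["sales"].sum()

# Reduce the granularity further: monthly totals ("time aggregation")
monthly = daily.resample("MS").sum()

# "Take from the top" partitioning: the first 80% of rows for training, the rest for testing
split = int(len(daily) * 0.8)
train, test = daily.iloc[:split], daily.iloc[split:]
```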

Missing value handling

Missing value handling refers to, firstly, introducing any missing timestamps into the time series and, secondly, imputing the corresponding missing values. The result, a time series with a value for each timestamp at regular intervals, is called equally spaced.

Popular missing value handling strategies for time series data are based on a fixed value or a statistical measure, such as the moving average, previous/next value, and linear interpolation. The moving average is a good approximation of randomly missing values in irregular data. The replacement strategies based on a previous/next value and linear interpolation perform well for time series data with autocorrelation. If the missing values are not missing at random, due to holidays, for example, you can replace them with 0 or another known value. On the other hand, if the missing values occur far enough in the past, you can also ignore the older data.
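As a rough illustration of these strategies in code (not the KNIME nodes themselves), the pandas sketch below inserts the missing days into a toy daily series and applies a few of the imputation options; the series, the frequency, and the window size are assumptions.

```python
import pandas as pd

# Toy daily series with a missing day (2017-01-03) and a missing value
s = pd.Series(
    [10.0, 12.0, None, 11.0],
    index=pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-04", "2017-01-05"]),
)

# Introduce the missing timestamps so that the series is equally spaced
s = s.asfreq("D")

# Imputation options (pick one depending on the data):
filled_zero   = s.fillna(0)                      # known reason, e.g. holidays -> 0
filled_prev   = s.ffill()                        # previous value
filled_next   = s.bfill()                        # next value
filled_linear = s.interpolate(method="linear")   # linear interpolation
# Moving average of the neighboring values for randomly missing values
filled_ma = s.fillna(s.rolling(window=3, min_periods=1, center=True).mean())
```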

Smoothing and outlier detection

Smoothing reduces redundant details and noise in time series data, while outlier detection treats exceptional values that might otherwise bias the analysis. Popular smoothing techniques are the moving average and exponential smoothing. For outlier detection, a common approach is the box plot, which is based on quantile statistics. This approach removes or truncates the values that lie outside the whiskers of a box plot, that is, values that lie exceptionally far below the first or above the third quartile. However, keep in mind that a part of the exceptional fluctuation can be due to seasonality. Therefore, for time series data with seasonality, it's recommended to inspect the variation with a conditional box plot because it displays the outliers at each point of the seasonal cycle separately.
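As a code illustration of these two steps (again an illustrative pandas sketch, not the KNIME nodes), the snippet below smooths a toy series with a moving average and exponential smoothing, and flags or truncates the values outside the box-plot whiskers; the series and the window size are assumptions.

```python
import pandas as pd

# Toy equally spaced daily series with one exceptionally large value
s = pd.Series(
    [10.0, 12.0, 11.0, 150.0, 13.0, 12.5, 11.8, 12.1],
    index=pd.date_range("2017-01-01", periods=8, freq="D"),
)

# Smoothing: simple moving average and exponential smoothing
sma = s.rolling(window=3, center=True).mean()
exp = s.ewm(alpha=0.3).mean()

# Outlier detection with box-plot (IQR) statistics
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]          # flag the values outside the whiskers
truncated = s.clip(lower=lower, upper=upper)     # or truncate them to the whisker values
```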

Besides smoothing and outlier detection, you can also obtain a more regular time series by extracting a subset of it, for example by considering only the sales of one product instead of the sales of the whole supermarket, or by clustering the data.

Figure 2: Time aggregation, missing value handling, and outlier detection are examples of techniques for pre-processing time series data.

Exploring and Decomposing Time Series

After pre-processing, the next step is visual exploration and time series decomposition. Visual exploration reveals long- and short-term patterns and temporal relationships that we can use to better understand the dynamics of the time series. Decomposition separates the dynamics of the time series into regular and irregular components.

Below we introduce techniques for visual exploration and time series decomposition.

Visual exploration of time series

A widely used plot for exploring time series is the line plot, which shows all values in the time series and reveals a possible direction (trend), regular fluctuation (seasonality), outliers, gaps, or turning points.

For time series with a seasonal pattern, a seasonal plot shows the dynamics of subsequent seasonal cycles in parallel. For example, for sales data with yearly seasonality, the seasonal plot can show if a specific month, let’s say July, was a stronger sales month this year than last year.

A conditional box plot shows the variability of the data at different points of the seasonal cycle. With a conditional box plot, you can treat extreme values as normal/exceptional by comparing them to the median of comparable data points instead of all data. Furthermore, in a conditional box plot, you can also inspect how much the variability differs through the seasonal cycle.

A lag plot compares the current values to the past values at a selected lag in a scatter plot and reveals whether the time series is autocorrelated at that lag.

Finally, the autocorrelation function (ACF) plot shows the autocorrelation at each lag; the lag with the highest autocorrelation reveals the length of a possible seasonal cycle. For example, if you observe a peak at lag 7 in the ACF plot of daily data, there is weekly seasonality in the data. The ACF plot also indicates whether the time series is stationary. We will talk more about stationarity in the next section.

In Figure 3 below, we show examples of the visual exploration techniques introduced above:

Figure 3: Lag plot, conditional box plot, line plot, seasonal plot and autocorrelation plot are examples of visual exploration techniques of time series.
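If you want to reproduce such plots in code, the sketch below draws a line plot, a seasonal plot, a conditional box plot, a lag plot, and an ACF plot with pandas, matplotlib, and statsmodels; the toy monthly series with trend and yearly seasonality is an assumption for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import lag_plot
from statsmodels.graphics.tsaplots import plot_acf

# Toy monthly series with trend, yearly seasonality, and noise
t = np.arange(48)
monthly = pd.Series(
    100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + np.random.default_rng(0).normal(0, 2, 48),
    index=pd.date_range("2014-01-01", periods=48, freq="MS"),
)
frame = pd.DataFrame(
    {"value": monthly.values, "month": monthly.index.month, "year": monthly.index.year}
)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
monthly.plot(ax=axes[0, 0], title="Line plot")                  # trend, gaps, turning points
frame.pivot(index="month", columns="year", values="value").plot(
    ax=axes[0, 1], title="Seasonal plot")                       # one line per seasonal cycle
frame.boxplot(column="value", by="month", ax=axes[0, 2])        # conditional box plot
lag_plot(monthly, lag=12, ax=axes[1, 0])                        # current vs. values 12 months back
plot_acf(monthly, lags=24, ax=axes[1, 1])                       # spikes reveal the seasonal cycle
axes[1, 2].set_visible(False)
plt.tight_layout()
plt.show()
```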

Classical decomposition of time series

Classical decomposition means extracting the regular dynamics of the time series, called trend and seasonality, and its irregular part, called the residual, into separate components. The decomposition contributes to the final forecast in two ways. Firstly, forecasting with the regular dynamics alone provides a benchmark that more complex forecasting models should beat to be worth training. Secondly, decomposing the time series produces a stationary time series, which is required before training an ARIMA model. Notice that additional transformations of the residual series are sometimes required to make it stationary, such as first-order differencing to stabilize the mean or a log transformation to stabilize the variance.

To extract the trend component from the time series, we fit a regression model through the data. To extract the seasonal component, we difference the data at the lag where the major spike in the ACF plot occurs. To remove a possible second seasonality, we can difference the already differenced time series at the lag of the second seasonality. Differencing the data a third time is rarely required.
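A minimal numpy/pandas sketch of this idea (not the KNIME Decompose Signal component) could look as follows; the toy series and the linear trend model are assumptions.

```python
import numpy as np
import pandas as pd

# Toy monthly series with a linear trend and yearly seasonality
t = np.arange(48)
monthly = pd.Series(
    100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12),
    index=pd.date_range("2014-01-01", periods=48, freq="MS"),
)

# Trend: fit a linear regression on the time index and subtract it
slope, intercept = np.polyfit(t, monthly.values, deg=1)
detrended = monthly - (intercept + slope * t)

# Seasonality: difference the detrended series at the seasonal lag (major ACF spike at 12)
residual = detrended.diff(12).dropna()

# A second seasonality would be removed by differencing "residual" again at its own lag
```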

After decomposing the time series, we can do a final check with, for example, the Ljung-Box test, which indicates whether significant autocorrelation remains in the residual series. After that, we can move on to training the forecasting model.
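If you work in code, one way to run such a check is the Ljung-Box test from statsmodels, which reports whether significant autocorrelation remains at the tested lags; the random series below is only a stand-in for the residual series.

```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

# Stand-in for the residual series after decomposition (white noise in this toy example)
residual = np.random.default_rng(0).normal(size=100)

# Small p-values would indicate that significant autocorrelation remains
print(acorr_ljungbox(residual, lags=[12]))
```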

Training a forecasting model

The next step is to train the forecasting model. We can do this with an ARIMA model or a machine learning model. An ARIMA model can only be trained on the residual part of the time series, because that part is stationary, while a machine learning model does not have this restriction and can also be trained on the original time series. In addition, a seasonal ARIMA (SARIMA) model can be trained on the original, non-stationary time series data.

In the following sections, we summarize a few forecasting techniques from which you can select and compare candidates for your use case.

(S)ARIMA models

An ARIMA (AutoRegressive Integrated Moving Average) model is a linear regression model between the current and past values (AR part) and between the current and past forecast errors (MA part). If the model has a non-zero I part, the data are differenced to make them stationary. A SARIMA model is an ARIMA model with additional regressors at seasonal lags. Modeling data with an ARIMA model requires stationary data, and a stationary time series has no predictable patterns in the long term. Thus, the declining accuracy of long-term forecasts can be seen in the widening confidence intervals of the forecasts. Furthermore, more data are not always better for training an ARIMA model: large datasets increase the training time and might introduce noise that biases the algorithm.
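As an illustration of fitting such a model in code (the article itself uses the KNIME ARIMA components), the statsmodels sketch below fits an ARIMA model and inspects the forecasts and their confidence intervals; the stand-in residual series and the chosen orders are assumptions.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Stand-in for the stationary residual series obtained from the decomposition
residual = np.random.default_rng(1).normal(size=100)

# ARIMA(p, d, q): AR order, number of differencing steps (I part), MA order
fit = ARIMA(residual, order=(1, 0, 1)).fit()

# A SARIMA model would add regressors at seasonal lags, e.g.
# ARIMA(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))

# Point forecasts and confidence intervals; the intervals widen with the horizon
forecast = fit.get_forecast(steps=12)
print(forecast.predicted_mean)
print(forecast.conf_int())
```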

Machine Learning models

Machine learning models use the lagged values of the time series as predictor columns, and these lagged values are treated like any other predictor columns. Machine learning models can identify complex dynamics and long-term patterns in the data, provided that enough training data are available. In general, the more complex the dynamics of the time series, the more data are needed to train the forecasting model. When training a machine learning model on the original data instead of the residual, it's recommended to compare its forecasting performance to the benchmark obtained by classical decomposition, to make sure that the model also captures the irregular dynamics and is thus worth training.
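As a sketch of this idea, the code below builds lagged predictor columns for a toy series and trains a random forest regressor on them; the lags, the chosen model, and the split ratio are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy monthly series; in practice this would be the original (or residual) time series
monthly = pd.Series(
    10 * np.sin(np.arange(60) * 2 * np.pi / 12) + 0.5 * np.arange(60),
    index=pd.date_range("2013-01-01", periods=60, freq="MS"),
)

# Lagged values become ordinary predictor columns
lags = [1, 2, 3, 12]
X = pd.concat({f"lag_{k}": monthly.shift(k) for k in lags}, axis=1).dropna()
y = monthly.loc[X.index]

# "Take from the top" split: train on the earlier part, test on the most recent part
split = int(len(X) * 0.8)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X.iloc[:split], y.iloc[:split])
pred = model.predict(X.iloc[split:])
```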

Tips on model selection

Model selection is based on forecasting performance, but also on other aspects of the real-world use case. Below we highlight a few of them.

Forecastability. A time series can be difficult to forecast by nature because it describes a randomly evolving process. In such a case it often makes sense to go for a simpler model and not invest resources in modeling dynamics that are unpredictable.

Interpretability. If important decisions are based on the results of the model, the domain expert might prefer a model that is easier to interpret over a model that performs slightly better. For example, a neural network is difficult to understand without data science skills, while a model based on classical decomposition is more intuitive.

Available data. Not all data related to the dynamics of the time series are available or worth collecting. For example, adding explanatory variables, that is, predictors other than the past values of the time series itself, to the model might improve the forecast accuracy. However, the explanatory variables need to be forecast, too, which increases the complexity of the model.

Figure 4: The availability, complexity, and randomness of the data and the interpretability of the model are examples of aspects that determine the selection of the forecasting model. As an example, the line plot in the top left corner shows the forecasts by a neural network trained with too little training data. The line plots in the bottom left corner show a completely random process and a process with a change in its dynamics. The line plot on the right shows a time series that follows an ARIMA (2,1,1) process.

Model Evaluation

After training a model, the next step is to evaluate it. We can generate two kinds of evaluation metrics that report the in-sample and out-of-sample prediction performance of the model, respectively. The in-sample prediction performance reports the model's performance on the training data. The out-of-sample prediction performance reports the model's performance on the test data that follows the training data in time.

One recommended error metric for evaluating a forecasting model is the mean absolute percentage error (MAPE), which expresses the error on a universal scale, as a percentage of the actual value. However, if the actual values contain zeros, this metric is not defined, and other error metrics, such as the root mean squared error (RMSE), can be used instead. Notice that the R-squared metric should not be used to evaluate the prediction performance on non-stationary data because it reports a better performance when the variance increases.
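For reference, a minimal implementation of the two error metrics could look like this.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error; undefined if any actual value is zero."""
    actual, forecast = np.asarray(actual, dtype=float), np.asarray(forecast, dtype=float)
    return np.mean(np.abs((actual - forecast) / actual))

def rmse(actual, forecast):
    """Root mean squared error; defined even when the actual values contain zeros."""
    actual, forecast = np.asarray(actual, dtype=float), np.asarray(forecast, dtype=float)
    return np.sqrt(np.mean((actual - forecast) ** 2))

# A MAPE of 0.083 means the forecasts are off by 8.3% of the actual values on average
print(mape([100, 200, 400], [110, 180, 420]))   # ~0.083
print(rmse([100, 200, 400], [110, 180, 420]))   # ~17.3
```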

Forecasting and Reconstructing Time Series

We’re almost there! We introduce the last analytics steps, forecasting and reconstructing the signal, in the following sections.

Dynamic forecasting

In dynamic deployment, only one point in the future is forecast at a time, and the past data are updated with this forecast value to generate the next forecast (Figure 5). The purpose of dynamic deployment is to improve the out-of-sample prediction performance of the model.

Figure 5: In dynamic deployment, only one forecast is generated at a time, and this forecast is added to the past data to generate the next forecast one step further ahead in time.
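A minimal code sketch of dynamic deployment is shown below; refitting the ARIMA model on the extended history at each step is one simple way to implement the idea (alternatively, a fixed fitted model can be extended with new observations), and the toy series and the orders are assumptions.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Toy training series standing in for the residual series of the training period
train = list(np.cumsum(np.random.default_rng(2).normal(size=60)))

horizon = 12
history, forecasts = list(train), []

for _ in range(horizon):
    fit = ARIMA(history, order=(0, 1, 4)).fit()
    next_value = float(fit.forecast(steps=1)[0])   # forecast only one step ahead
    forecasts.append(next_value)
    history.append(next_value)                     # the forecast becomes part of the "past"
```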

Restoring trend and seasonalities

The final forecast value consists of the forecast residual value to which the trend and seasonality components have been restored. We start composing the final forecast by adding back the values one seasonal lag in the past. For example, if the daily series y has been differenced at lag 7 (weekly seasonality) to obtain the residual series y*, restoring this seasonality requires the following calculation:

$$\hat{y}_{t+h} = \hat{y}^{*}_{t+h} + y_{t+h-7}$$

where $\hat{y}^{*}_{t+h}$ is the forecast value of the residual series, $y_{t+h-7}$ is the value of the series one seasonal cycle (7 days) earlier (a known value for $h \le 7$ and a previously restored value for larger $h$), $t$ is the last time index in the training data, and $h$ is the forecast horizon.

After that, we restore the second seasonality by repeating the calculation described above at the lag of the second seasonality. Finally, we restore the trend component by adding back the values predicted by the regression model that represents the trend.
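The sketch below illustrates this reconstruction in code under assumed toy inputs: the forecast residuals, the detrended series used before the seasonal differencing, and a linear trend model.

```python
import numpy as np

# Assumed toy inputs
forecast_resid = np.array([0.5, -0.2, 0.1, 0.3])          # forecasts of the differenced residuals
detrended = list(np.sin(np.arange(48) * 2 * np.pi / 12))  # series before seasonal differencing
slope, intercept = 0.5, 100.0                              # fitted linear trend model
season_lag = 12

# Restore the seasonality: add back the value one seasonal cycle earlier,
# using previously restored forecasts once the horizon exceeds the lag
history, restored = list(detrended), []
for r in forecast_resid:
    value = r + history[-season_lag]
    history.append(value)
    restored.append(value)

# Restore the trend: add back the values predicted by the regression model
future_t = np.arange(len(detrended), len(detrended) + len(restored))
final_forecast = np.array(restored) + intercept + slope * future_t
```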

Completing a Time Series Application in KNIME Analytics Platform

Finally, we show how to execute a time series application using KNIME Analytics Platform. Figure 6 below shows an example workflow (available on the KNIME Hub). The workflow accesses, pre-processes, and visualizes the time series data; it then decomposes the series and trains an ARIMA model on it. In the workflow, we use time series components that perform dedicated time series analysis tasks, such as time aggregation and time series decomposition.

Figure 6: A workflow for accessing, pre-processing, visualizing and modeling time series. Download the workflow from the KNIME Hub.

The example workflow processes superstore sales data provided by Tableau. The Sample-Superstore.xls file is available on Kaggle. We analyze the orders of all products from 2014 to 2017, altogether 9994 rows.

After accessing the data, we perform the following pre-processing steps:

  1. We calculate the total sales per day with the GroupBy node so that we only have one value per day. At this point, though, the days when no orders were submitted are missing.
  2. Next, we introduce the missing days into the time series with the Timestamp Alignment component and replace the missing sales values with a fixed value 0 with the Missing Value node.
  3. After that, we perform time aggregation with the Aggregation Granularity components to obtain the average sales in each month and in each year, respectively, for visual exploration and forecasting.

After pre-processing, we perform visual exploration via the Line Plot and Conditional Box Plot nodes and the Decompose Signal component. Figure 7 below shows the results:

Figure 7: Exploring the time series visually via line plots and autocorrelation plots.

We find out that there's a turning point at the beginning of the year 2015, as the line plot on the top right in Figure 7 shows. The line plot on the left shows the values of the monthly sales with a yearly seasonality: there are two regular peaks at the end of each year and a lower peak at the beginning of each year. We detect the same yearly seasonality in the ACF plot in the bottom left corner, which shows a major spike at lag 12. The ACF plot in the bottom right corner shows no significant autocorrelation in the residual series after decomposing the time series. The decomposed parts of the time series, that is, trend, seasonality, and residual, are shown in the line plot in the middle.

Next, we model the residual series of the monthly average sales with an ARIMA model. We look for the best model with the Auto ARIMA Learner component, allowing a maximum order of 4 for the AR and MA parts and a maximum order of 1 for the I part. The best-performing model based on the Akaike information criterion (AIC) is ARIMA (0, 1, 4), and the resulting R-squared on the (stationary!) residual series is 0.073.
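For comparison, a rough code equivalent of such an AIC-based search (an illustration, not the component's implementation) is a small grid search with statsmodels; the stand-in residual series is an assumption.

```python
import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Stand-in for the residual series of the monthly average sales
residual = np.random.default_rng(3).normal(size=48)

# Grid search up to order 4 for the AR and MA parts and order 1 for the I part,
# keeping the model with the lowest Akaike information criterion (AIC)
best_order, best_aic = None, float("inf")
for p, d, q in itertools.product(range(5), range(2), range(5)):
    try:
        aic = ARIMA(residual, order=(p, d, q)).fit().aic
    except Exception:
        continue                                  # skip orders that fail to estimate
    if aic < best_aic:
        best_order, best_aic = (p, d, q), aic

print(best_order, best_aic)
```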

Finally, we perform out-of-sample forecasting in a separate workflow. The workflow is available on the KNIME Hub and is shown in Figure 8:

Figure 8: Workflow to forecast the monthly sales with an ARIMA (0,1,4) model using dynamic deployment. The workflow is available on the KNIME Hub.

It forecasts the monthly sales in 2017 using an ARIMA (0,1,4) model and the dynamic deployment approach. It also reconstructs the final forecasts by restoring the trend and yearly seasonality to the forecast residuals. Finally, it compares the actual and forecast sales values in a line plot and via scoring metrics. The MAPE value of the forecasts is 0.336, which means that the forecasts deviate from the actual sales values by 33.6% on average.

Summary

In this article, we have introduced you to the specific analytics techniques for time series data and completed an example time series analytics application in KNIME Analytics Platform.

Are you looking for a more comprehensive guide on time series analysis with KNIME?

Codeless Time Series Analysis with KNIME is being written by the time series specialists at KNIME, Corey Weisinger and Maarit Widmann, together with Professor Daniele Tonini from Bocconi University (Italy).

The book is already available for pre-order on Amazon.

Watch out for the release in August 2022 by Packt Publishing!



As first published on the KNIME Blog: https://www.knime.com/blog/building-a-time-series-analysis-application
