Sales Prediction: Multi-Step Forecasting from Classical Time Series Models to Machine Learning Models

Time series analysis is applicable in numerous industries, such as business, economics, finance, and even healthcare. Scientists have been researching this topic since the 19th century. As the name suggests, a time series is highly dependent on the point in time at which it is collected. Have you ever thought about what makes it a special kind of dataset, and what distinguishes it from a regression problem?

Hshan.T
Geek Culture
9 min read · Apr 20, 2021


A time series is a collection of ordered data points collected sequentially over a period of time, generally at regular intervals. It is special because the data points are not independent; we expect a certain degree of serial correlation. The absence of independent variables and the time dependence make it distinct from a regression problem. Time series forecasting is about predicting the future based on historical data by extracting useful statistics and characteristics from the data. We will go through time series description, analysis, and modeling in this piece of writing.

A time series mainly comprises three components:

  1. Trend — the general long-term direction in which the data is changing
  2. Seasonality — patterns that recur systematically over a specific period of time
  3. Residuals — short-term, random fluctuations

We will discuss how to identify the possible presence of these components in a time series and how to decompose it for visualization, to validate whether it justifies our hypotheses about the domain under investigation or the analysis goal. For example, a businessman expects an increasing trend in yearly revenue with a decreasing or constant budget allocation, while a doctor expects a rather consistent heart rate. Basically, time series forecasting begins with data visualization to get an overview and inspect the presence of each component, trend and seasonality. These components are then estimated and eliminated, leaving a stationary series. Models are built and optimized on the stationary series. Informally, a series is called stationary if its statistical properties, such as mean, variance, and covariance, are constant at all points in time. Formally speaking, let x(t) be a time series; we consider only weak stationarity, which states that a series is stationary if

  • The mean of x(t) does not depend on t
  • The variance of x(t) does not depend on t
  • The covariance of x(t) and x(t+h) does not depend on t; it depends on h only

However, the model's predicted output is not yet the answer to our question. It needs to be rescaled to the original scale by adding back the removed components. Of course, there are models for which stationarity is not a strict assumption; the process is much simplified by skipping component decomposition and rescaling. Some concepts and terms used in time series analysis will be discussed along the way.

Problem Statement

Assume you own a store and are currently working on stock planning for the coming year. Using monthly sales data from the past 11 years, you try to predict monthly stock consumption for the next 12 months, to ensure healthy cash flow and minimize wastage by avoiding oversupply.

Data Loading

Data Loading and Info.

Note the data type of the date column after loading; it is changed to datetime to allow extraction of date components if necessary.
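The loading step can be sketched as follows. The article's actual file and column names are not shown, so `date` and `sales` here are assumptions, with a small inline CSV standing in for the real file:

```python
from io import StringIO

import pandas as pd

# Hypothetical monthly sales data; in practice this would be
# pd.read_csv("sales.csv") with the real file.
csv = StringIO("date,sales\n2000-01-01,266\n2000-02-01,146\n2000-03-01,183\n")
df = pd.read_csv(csv)

# After loading, the date column has plain object (string) dtype;
# convert it to datetime so date components can be extracted.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

print(df.index.dtype)            # datetime64[ns]
print(df.index.month.tolist())   # [1, 2, 3]
```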

Data Description

Since this is time series data, it should not be shuffled and partitioned randomly. Otherwise, we might end up predicting the past based on future data, which does not make any sense. The order must be preserved, and we take the last 12 months as test data. There is no null data.

Train Dataset: Monthly sales record for a particular product in a store, from 2000 to the end of 2009.

Validation Dataset: Monthly sales for Jan 2010 — Dec 2010

Test Dataset: Monthly sales for Jan 2011 — Dec 2011

Size: 122 records for training, 12 for validation and 12 for testing
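A chronological split like this can be sketched with label-based slicing; the synthetic series below (with exactly 120 training months) stands in for the real sales data:

```python
import pandas as pd

# Toy monthly series covering 2000-01 through 2011-12.
idx = pd.date_range("2000-01-01", periods=144, freq="MS")
sales = pd.Series(range(144), index=idx)

# No shuffling: the last two 12-month blocks become validation and test.
train = sales.loc[:"2009-12"]            # 2000 through end of 2009
valid = sales.loc["2010-01":"2010-12"]   # next 12 months
test = sales.loc["2011-01":"2011-12"]    # final 12 months

print(len(train), len(valid), len(test))   # 120 12 12
```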

Exploratory Analysis

Data Summary and Boxplots

The statistics summary shows that the mean and standard deviation increase from year to year, suggesting the presence of an uptrend and a widening spread of the data, except in 2002. The spread of the data is visualized through a series of boxplots, aggregating sales by year for clearer inspection.

Time Series Plot.

The time series plot agrees with our hypothesis of an uptrend; a more careful observation exposes the presence of seasonality through the repetition of a pattern along the line plot, a similar shape with a different degree of spike or spread each cycle.

Data Transformation

We noticed the presence of trend and seasonality in the previous section. As most statistical analysis tools and models rely heavily on the stationarity assumption, eliminating those components is crucial before proceeding with modeling, so that the series to be modeled has constant statistical characteristics over time. There are numerous methods to verify the stationarity of a time series; here we use only direct inspection through visualization and the Dickey-Fuller test. The Dickey-Fuller test is a hypothesis test whose null hypothesis states that the series is non-stationary. You may also wish to test with a variogram; a stationary series is expected to reach a state of stability on the variogram.

Time Series Plot with Rolling Mean and Standard Deviation (window=12).

A red line representing the moving average with a window of 12, and an orange line showing the standard deviation of the rolling data with the same window, are plotted. The standard deviation of the rolling data is much lower compared to the data values. At this stage of the analysis we are almost certain there is an increasing trend, so component decomposition is done. The Python library applied here uses moving averages to decompose the time series. This kind of decomposition is usually not a straightforward way of telling you whether there is seasonality or trend; it is intended to break the series down for visualization after you have identified them. Typically, this step alone is insufficient to prove stationarity.

Components Decomposition.

The following illustrates how shifting, rolling mean, and rolling standard deviation work.

Shifting illustration.
Rolling mean illustration.

Rolling standard deviation works in a similar way to the rolling mean, but the calculation function is changed from mean to standard deviation.
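On a toy series, the three operations look like this:

```python
import pandas as pd

# A tiny series to make the mechanics visible.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

shifted = s.shift(1)                   # each value moves one step later
roll_mean = s.rolling(window=3).mean() # mean of each 3-value window
roll_std = s.rolling(window=3).std()   # std of each 3-value window

print(shifted.tolist())    # [nan, 1.0, 2.0, 3.0, 4.0]
print(roll_mean.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```

The leading NaNs appear because a shifted value or a full window is not available for the earliest points.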

Plot of Training Data — Rolling Mean (window=12).

A series of transformations and differencing is required to make the series stationary. After subtracting the rolling mean from the original series, note the increasing spread of the data and the rather constant rolling mean over time. Hence, a log transformation might be suitable, as it penalizes large values.

Time Series Plot of Log-transformed Data.
Plot of Log-transformed Data — Rolling Mean (window=12).

Although the range of values of the log-transformed series is different, the spread of the log-transformed series with the rolling mean removed does not show an obvious change over time, compared to the original series. Nonetheless, the rolling mean and standard deviation still fluctuate without any significant pattern. The overall scale is changed, but we do not expect the conclusion about the presence of each time series component to change.

Components Decomposition.
Differencing Code.

The argument of .shift() is adjusted accordingly to test different orders of differencing, d.
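The log-transform and shift-based differencing step can be sketched as follows, with a short hypothetical series in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly values standing in for the sales series.
idx = pd.date_range("2000-01-01", periods=6, freq="MS")
series = pd.Series([112.0, 118.0, 132.0, 129.0, 121.0, 135.0], index=idx)

log_series = np.log(series)

# First-order differencing via shift: x(t) - x(t-1).
# Change .shift(1) to .shift(k) to test other lags.
diff1 = log_series - log_series.shift(1)
# Equivalent built-in: log_series.diff(1)
print(diff1.dropna())
```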

Differenced Series.

Next, differencing is done on the log-transformed series. The rolling mean and standard deviation now fluctuate much less and can be said to be constant throughout the period. Stationarity is further supported by the Dickey-Fuller test, which reports a test statistic smaller than the critical values at all three significance levels, rejecting the null hypothesis. Modeling will be done on the stabilized series that underwent log transformation and differencing. Visualizing the time series with differencing at orders 1, 2, and 3 tells us that d=1 is the best, having the most consistent rolling mean and the lowest rolling standard deviation. What is the appropriate model? How do we judge?

Time Series Modeling

This step is rather less complicated than the previous part. We consider mainly four types of classical time series models: autoregressive (AR), moving average (MA), autoregressive moving average (ARMA, a combination of AR and MA), and finally autoregressive integrated moving average (ARIMA). Since we decided on differencing previously, ARIMA is chosen, so no extra step of creating a differenced series is needed. MA and AR models can be constructed by setting the order of ARIMA(p, d, q): setting p=0 gives MA, and setting q=0 gives AR. Choosing p and q involves analysis of the autocorrelation (ACF) and partial autocorrelation (PACF).

ACF and PACF Plots.

ACF measures the correlation coefficient between a time series and its lagged series at lag k; that is, we quantify the correlation between x(t) and x(t-k). PACF measures the partial correlation between a time series and its lagged values. The difference between ACF and PACF is that PACF at lag k removes the effect of correlations at shorter lags h (h<k). Some people like to think of this as a regression problem: taking w = ax + by + cz, the PACF reports the correlation between w and z that remains unexplained by x and y. Sharp drops in the ACF and PACF determine the values of q and p respectively. If both ACF and PACF decline gradually, ARIMA is chosen. AR and MA models were tested; their performance was not as good as the combined model, hence ARIMA is adopted here. From the resulting ACF and PACF plots in this case, three combinations of ARIMA order, [(2,1,1), (0,1,1), (2,1,2)], will be trained and validated. The code below can be modified accordingly by changing the 'order' argument of ARIMA().

Code.
Model Summary for ARIMA (2,1,2)
Performance Summary.

ARIMA(2,1,2) has the lowest AIC among all three models, and all its parameter estimates are statistically significant. One point worth noting is that ARIMA(2,1,2) has a much smaller difference between training and validation RMSE. We will go on with ARIMA(2,1,2).

The selected model is refit on the complete set of training and validation data. The forecast will be made from the last point in time onwards. As this is time series data, it is not valid to skip the validation set: if you did, predictions for the test data would be based on the history before the validation set, meaning you would be predicting 2011 sales using 2009 sales. This is a multi-step forecasting problem. The 12-month forecast is done recursively: the prediction for x(t+2) is based on the prediction for x(t+1), and so on.

Forecast for 01/2011–12/2011.

The predicted output is on the same scale as the log-transformed series used in modeling. Reverse the log transformation by taking the exponential of the model's predictions to convert them back to the original scale and obtain the final, valid forecasts.
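The inverse transform is a single exponential; the log-scale predictions below are hypothetical placeholders:

```python
import numpy as np
import pandas as pd

# Hypothetical log-scale model output.
log_preds = pd.Series([4.8, 4.9, 5.0])

# exp() undoes log(), returning forecasts on the original sales scale.
final_forecast = np.exp(log_preds)
print(final_forecast.round(1).tolist())
```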

The whole process involved rather subjective interpretation of plots, supported by statistical theory. The Python library pmdarima provides a method called auto_arima. It iterates over a range of values for each parameter and optimizes based on other specified arguments to find the best performing model.

Code.

Machine Learning Models

A time series problem can be rephrased as supervised learning in ML, such that the output of previous steps becomes the input for the next step. Each data point is then assumed to be independent, as the serial correlation between outputs is accounted for by creating new lag features for each record.

Classical time series models are still attractive and useful for solving our concerns after decades. Nevertheless, restructuring a time series forecasting task as a machine learning problem is thought-provoking, due to the possibility of including more factors suspected of contributing to changes in the dependent variable. Prediction no longer depends only on history; it can take into account other meaningful predictors from the surroundings. For example, grocery sales show noticeable spikes during festival seasons, so a boolean variable indicating whether people celebrate on a particular day is worth including as a predictor. An example of how a new data frame can be created from this dataset follows, and any regression model, such as linear regression, SVR, a neural network, etc., can be experimented with.

Example of DataFrame Creation for Supervised Learning
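The reframing can be sketched with shifted copies of the series as lag features; the tiny series and the two-lag choice here are illustrative:

```python
import pandas as pd

# Toy sales series; lagged values become regression inputs.
s = pd.Series([10, 12, 15, 14, 18, 20], name="sales")

df = pd.DataFrame({
    "lag_1": s.shift(1),  # previous month's sales
    "lag_2": s.shift(2),  # sales two months ago
    "y": s,               # target: current month's sales
}).dropna()               # drop rows where lags are unavailable
print(df)
```

Each row is now an (inputs, target) pair that any regressor can consume, though as noted above, splitting must still respect time order.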

Please note that the data splitting method remains a concern, as the sequence of events is important.
