Mastering Time Series Forecasting

Vansh Jatana · The Deep Hub · Feb 18, 2024 · 8 min read

Time series forecasting is an important part of data science and is used in scenarios ranging from economics to weather prediction. Forecasting is useful because it lets us make predictions about the future based on past data. But to make good predictions, we need to pick a suitable model, and it is sometimes tricky to know which one to use.

We’ll start by looking at some important tests that help us understand our data better:

Stationarity Test: This evaluates whether the statistical properties of the series, such as the mean and variance, remain consistent over time. A stationary series is easier to predict.

Correlation Tests, including:

Autocorrelation Test: This checks how the current values of the series are related to its past values, providing insight into the series’ own internal lagged relationships.

Partial Autocorrelation Test: This assesses the relationship between the series’ current and past values, controlling for the effects of the intervening values. It helps isolate the direct effect of previous observations on the current value.

Seasonal Test: Identifies patterns that repeat at regular intervals, such as daily, monthly, or yearly. Recognizing seasonality is essential for models that need to capture and forecast periodic fluctuations.

Based on the outcomes of these tests, we can make informed decisions on the best model to employ for our specific dataset, whether it be AR (Autoregressive), MA (Moving Average), ARMA (Autoregressive Moving Average), ARIMA (Autoregressive Integrated Moving Average), or SARIMA (Seasonal ARIMA). Each model has unique capabilities tailored to different types of time series data, and understanding these tests allows us to match our data with the most suitable modeling approach for effective forecasting.

Stationarity Test

The stationarity test is an important step in time series prediction. It checks whether the series’ statistical properties, such as the mean and variance, are constant over time.

The Augmented Dickey-Fuller (ADF) test is a widely used statistical test for determining whether a given time series is stationary. This test starts with the null hypothesis that the time series is non-stationary. It then calculates a test statistic, which is compared to critical values. If the test statistic is less than the critical values, we reject the null hypothesis and conclude that the time series is stationary. Otherwise, we do not reject the hypothesis, indicating that the series is non-stationary.

For example, consider the sales data of a well-established shop. Such data is generally stationary, meaning its statistical properties do not change significantly over time; the patterns observed last year would likely be similar to those in the current year. In contrast, economic data, like the GDP of a country, is often non-stationary. This type of data typically exhibits trends, such as consistent upward or downward movements, and its statistical properties change over time.

Let’s run the adfuller test with default parameters. If the p-value is greater than 0.05, the series is considered non-stationary; otherwise, it is stationary.

When we perform the ADF test, it returns the following values:

  1. ADF statistic: the value of the test statistic. A more negative value is stronger evidence that the series is stationary.
  2. p-value: the probability of observing the data under the null hypothesis, i.e., how likely it is that the series is non-stationary.
  3. usedlag: the number of lags used in the test.
  4. nobs: the number of observations used in the test.
  5. Critical values: the thresholds the ADF statistic is compared against to judge stationarity at different significance levels.
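As a quick illustration of how these map onto the tuple that adfuller returns, the result can be unpacked into named variables. This is a minimal sketch, assuming a Pandas Series time_series; the variable names are our own:

from statsmodels.tsa.stattools import adfuller

# Unpack the six values returned with the default autolag='AIC'
stat, p_value, usedlag, nobs, crit_values, icbest = adfuller(time_series)
print(f"ADF statistic: {stat:.4f}")
print(f"p-value: {p_value:.4g}")
print(f"Lags used: {usedlag}, observations: {nobs}")
print(f"Critical values: {crit_values}")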

Example 1: Sales data

import pandas as pd
from statsmodels.tsa.stattools import adfuller
df = pd.read_csv('/content/sales_data.csv')
time_series = df['Sales']
result = adfuller(time_series)
print(result)

(-6.4559, 1.48e-08, 8, 41, {'1%': -3.5715, '5%': -2.9226, '10%': -2.5993}, ...)

The ADF test statistic is -6.4559 with a p-value of 1.48e-08, strong evidence against the null hypothesis of non-stationarity, so we conclude the sales series is stationary. The test used 8 lagged values and 41 observations, with critical values at the 1%, 5%, and 10% significance levels.

Example 2: GDP data

import pandas as pd
from statsmodels.tsa.stattools import adfuller
df = pd.read_csv('/content/india_gdp.csv')
time_series = df['GDP']
result = adfuller(time_series)
print(result)

(-0.8733, 0.7966, 7, 53, {'1%': -3.5602, '5%': -2.9179, '10%': -2.5968}, ...)

The ADF test statistic is -0.8733 with a p-value of 0.7966, weak evidence against the null hypothesis, so we cannot reject non-stationarity: the GDP series is non-stationary. The test used 7 lagged values and 53 observations, with critical values of -3.5602, -2.9179, and -2.5968 at the 1%, 5%, and 10% significance levels.
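To avoid interpreting the raw tuple by hand each time, the decision rule can be wrapped in a small helper. This is a minimal sketch; the function name is_stationary and the 0.05 threshold are our own choices, not part of statsmodels:

from statsmodels.tsa.stattools import adfuller

def is_stationary(series, alpha=0.05):
    """Run the ADF test and apply the p-value decision rule."""
    stat, p_value, usedlag, nobs, crit_values, icbest = adfuller(series.dropna())
    print(f"ADF statistic: {stat:.4f}, p-value: {p_value:.4g}")
    # Reject the null hypothesis of non-stationarity when p < alpha
    return p_value < alpha

If a series fails the test, a common next step is to difference it once with series.diff().dropna() and re-test; automating that differencing is exactly what the “I” in ARIMA does.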

Correlation Test

In correlation analysis, we use two key functions to understand the relationships between observations at different time lags within a time series: the AutoCorrelation Function (ACF) and the Partial AutoCorrelation Function (PACF).

The AutoCorrelation Function (ACF) is used to understand the correlation between observations in a series and their past values at different time lags. It provides a measure of the linear relationship between the time series and a lagged version of itself. For example, ACF(x) represents the correlation between the series and itself lagged by ‘x’ time periods.

Partial AutoCorrelation Function (PACF), on the other hand, is used to ascertain the direct relationship between an observation at a specific time point and its lagged values, while effectively removing the influence of shorter lags. This means PACF(x) gives the correlation between the series and itself lagged by ‘x’ time periods, after accounting for the linear dependence on the series at shorter lags, thus isolating the direct effect of a specific lag.

The primary difference between ACF and PACF lies in their approach to correlation analysis. ACF considers the cumulative effect of shorter lags, without isolating the direct impact of any specific lag. Conversely, PACF isolates the direct effect of a specific lag by removing the influence of shorter lags, allowing for a clearer understanding of the relationships at different time intervals.

When analyzing correlation plots, we look for several key features:

  1. Positive and Negative Correlation: Positive correlations (appearing above the horizontal dashed line) indicate a positive autocorrelation at certain lags, suggesting that past values have a similar effect on future values. Negative correlations (below the horizontal dashed line) indicate a negative autocorrelation, suggesting that past values have an opposite effect on future values.
  2. Decay in Correlation: A common observation is the decay in correlations as lags increase, indicating that past observations have a diminishing impact on current values. This pattern suggests that the influence of past observations on future values decreases over time.
  3. Significant Peaks: Noticing significant peaks crossing the horizontal dashed line at certain lags can be indicative of seasonality or periodicity within the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf, pacf

df = pd.read_csv('/content/sales_data.csv')
time_series = df['Sales']

# Compute autocorrelations up to 20 lags and plot them with
# 95% confidence bands at ±1.96/sqrt(n)
lag_acf = acf(time_series, nlags=20)
plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.stem(range(len(lag_acf)), lag_acf)
plt.axhline(y=0, linestyle='--', color='gray')
plt.axhline(y=-1.96/np.sqrt(len(time_series)), linestyle='--', color='gray')
plt.axhline(y=1.96/np.sqrt(len(time_series)), linestyle='--', color='gray')
plt.title('Autocorrelation Function (ACF)')
plt.xlabel('Lag')
plt.ylabel('ACF')

This plot shows the autocorrelation of the sales data with its own lagged values up to 20 lags. The presence of autocorrelation at specific lags can be observed where the stems cross the confidence interval lines (dashed lines), indicating statistically significant correlations.

lag_pacf = pacf(time_series, nlags=20, method='ols')
plt.subplot(122)
plt.stem(range(len(lag_pacf)), lag_pacf)
plt.axhline(y=0, linestyle='--', color='gray')
plt.axhline(y=-1.96/np.sqrt(len(time_series)), linestyle='--', color='gray')
plt.axhline(y=1.96/np.sqrt(len(time_series)), linestyle='--', color='gray')
plt.title('Partial Autocorrelation Function (PACF)')
plt.xlabel('Lag')
plt.ylabel('PACF')
plt.tight_layout()
plt.show()


The PACF plot illustrates the partial autocorrelation, highlighting the direct relationship between the sales data and its lagged versions, controlling for the effects of intervening lags. Significant peaks within the confidence interval suggest lags that have a direct influence on the series.
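For quicker exploration, statsmodels also provides ready-made plotting helpers that draw the same stems and confidence bands in a single call each. A minimal sketch, reusing the time_series from above:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
plot_acf(time_series, lags=20, ax=ax1)   # ACF with confidence band
plot_pacf(time_series, lags=20, ax=ax2)  # PACF with confidence band
plt.tight_layout()
plt.show()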

Seasonal Test

The seasonality test is an important analysis in time series forecasting for understanding behaviors in the data that recur at regular intervals, such as daily, monthly, quarterly, or annually.

Seasonal decomposition is used to separate a series into trend, seasonal, and residual (irregular) components. statsmodels offers seasonal_decompose for a simple moving-average decomposition and STL (Seasonal and Trend decomposition using Loess) for a more robust alternative.

For example, consider the GDP data of a country, which often exhibits strong seasonal patterns, such as increased economic activity in certain quarters due to holidays or fiscal policies. By applying decomposition to the quarterly GDP data, we can separate the underlying trend from the seasonal effects.

When we call the seasonal_decompose function, it returns an object containing the following components:

  1. Trend: This component reflects the long-term progression of the series, highlighting how the data’s central tendency changes over time. It is useful for identifying upward or downward movements in the series over long periods.
  2. Seasonal: This component captures the repeating short-term cycle in the data. It represents the seasonal fluctuations that occur with a fixed and known frequency. For example, in the context of monthly data, this could highlight increased sales during the holiday season or increased energy consumption during winter and summer.
  3. Resid (Residual): The residual component consists of the remainder of the time series after the trend and seasonal components have been removed. It represents the irregular or stochastic part of the series that cannot be attributed to the trend or seasonality. This could include random fluctuations or noise in the data.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

df = pd.read_csv('/content/electricity_data.csv')

# Decompose the monthly consumption series into trend, seasonal,
# and residual components (period=12 for an annual cycle)
decomposition = seasonal_decompose(df['Consumption'], model='additive', period=12)
decomposition.plot()
plt.show()

  • Trend: This component captures the overall direction in which electricity consumption is moving over time. In our synthetic data, we introduced a linear increase to simulate growth in demand.
  • Seasonal: The seasonal component reveals the predictable pattern that repeats annually. As designed, our synthetic data exhibits higher consumption during winter and summer months, reflecting increased heating and cooling needs.
  • Residual: This captures the irregular fluctuations in the data after accounting for the trend and seasonal components. It represents the unexplained variance, which, in real-world scenarios, could be due to unexpected events or measurement errors.
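Since STL was mentioned above, here is a hedged sketch of the same decomposition using statsmodels’ STL class instead of seasonal_decompose, assuming the same monthly consumption series:

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import STL

# STL uses Loess smoothing and is more robust to outliers
# than the moving-average decomposition above
stl = STL(df['Consumption'], period=12, robust=True)
res = stl.fit()
res.plot()
plt.show()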

Model Selector

Once we have performed preliminary analyses such as the stationarity test, correlation analysis (ACF and PACF), and seasonality detection, the next step is to select an appropriate model for forecasting. This can often be a complex decision, influenced by the characteristics of the time series data. To streamline this process, we introduce an automated function that analyzes the given time series data and recommends whether to use AR (Autoregressive), MA (Moving Average), ARMA (Autoregressive Moving Average), ARIMA (Autoregressive Integrated Moving Average), or SARIMA (Seasonal ARIMA) models.

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, acf, pacf

def recommend_model(time_series, seasonality_threshold=0.3, acf_lag=20,
                    stationarity_threshold=0.05, ma_threshold=0.5, ar_threshold=0.5):
    """
    Analyzes the given time series data to recommend AR, MA, ARMA, ARIMA, or SARIMA.

    :param time_series: A Pandas Series with datetime index.
    :param seasonality_threshold: Threshold to decide significant seasonality.
    :param acf_lag: Number of lags for the autocorrelation and partial autocorrelation tests.
    :param stationarity_threshold: P-value threshold for the stationarity test.
    :param ma_threshold: Threshold for significant autocorrelation at lag 1.
    :param ar_threshold: Threshold for significant partial autocorrelation.
    :return: Recommendation string.
    """
    # Step 1: Seasonal decomposition
    result = seasonal_decompose(time_series.dropna(), model='additive', period=acf_lag)
    seasonal_std = result.seasonal.std()

    # Step 2: Test for stationarity
    dftest = adfuller(time_series.dropna(), autolag='AIC')
    p_value = dftest[1]

    # Step 3: Autocorrelation and partial autocorrelation analysis
    lag_acf = acf(time_series.dropna(), nlags=acf_lag)
    lag_pacf = pacf(time_series.dropna(), nlags=acf_lag)

    # Step 4: Recommendation
    if seasonal_std > seasonality_threshold or any(abs(lag_acf[1:]) > 0.5):
        # Strong seasonal component: use a seasonal model
        return "SARIMA"
    elif p_value < stationarity_threshold:
        # Stationary series: choose among AR, MA, and ARMA
        if all(abs(lag_pacf[2:]) < ar_threshold) and all(abs(lag_acf[2:]) < ma_threshold):
            if abs(lag_pacf[1]) > ar_threshold:
                return "AR"   # PACF cuts off after lag 1
            elif abs(lag_acf[1]) > ma_threshold:
                return "MA"   # ACF cuts off after lag 1
            else:
                return "ARMA"
        else:
            return "ARMA"     # Mixed structure at higher lags
    else:
        # Non-stationary series: differencing is needed
        return "ARIMA"


Vansh Jatana is a Data Scientist with a Computer Science degree from SRM Institute of Science and Technology, India, and is ranked among Kaggle’s Grandmasters.