Introduction to Time Series Analysis — Part 1.1: Stationarity

Dr. Vipul Joshi
Time Series Using Python
10 min read · Apr 18, 2024

In Part 1 of this series, we introduced what a time series is and why we perform Time Series Analysis (TSA). We covered the following:

  1. Chronology of Data Points: We emphasized that each data point in a time series is part of a sequence where the timing and order are paramount. This chronological order impacts how we interpret and analyze the data.
  2. Trends and Seasonality: Time series data often exhibit identifiable trends (patterns of increase or decrease over time) and seasonality (patterns that repeat over a regular interval). Recognizing these patterns is vital as they significantly influence the behavior of the series and our approach to both analysis and forecasting.

In this part, we will work toward understanding an important aspect of time series: STATIONARITY.

Stationarity is a fundamental concept in TSA. It refers to a time series whose statistical properties, such as mean, variance, and autocorrelation, are constant over time. When a time series is stationary, it becomes easier to model and predict because it is not influenced by time-dependent changes in the data, which leads to more reliable forecasts.

There can be two types of stationarity present in a time series:

  1. Strict Stationarity: the entire joint distribution of the series is unchanged by shifts in time.
  2. Weak (Covariance) Stationarity: only the mean, variance, and autocovariance must be constant over time. This is the form usually assumed in practice.
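To build some intuition before we get to the formal tests, here is a minimal illustrative sketch (a toy example, not part of the hiking data) contrasting a weakly stationary series (white noise) with a non-stationary one (a random walk). We compare the mean and variance of the first and second halves of each series:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
white_noise = rng.normal(0, 1, n)     # stationary: constant mean and variance
random_walk = np.cumsum(white_noise)  # non-stationary: variance grows with time

for name, series in [("white noise", white_noise), ("random walk", random_walk)]:
    first, second = series[: n // 2], series[n // 2 :]
    print(f"{name}: mean {first.mean():.2f} -> {second.mean():.2f}, "
          f"var {first.var():.2f} -> {second.var():.2f}")
```

For the white noise, the mean and variance of the two halves are nearly identical; for the random walk, they drift, which is exactly the behavior the tests below are designed to detect.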

How can we figure out whether a time series is stationary or not? We can perform the following tests:

  1. Augmented Dickey-Fuller (ADF) Test: The ADF test checks for a "unit root," a pattern where each value is the previous value plus a random shock, so that shocks never die out. The null hypothesis is that a unit root is present; if we cannot reject it, the series is likely non-stationary.
  2. Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test: Tests for stationarity around a deterministic trend. Here the null hypothesis is reversed: the series is assumed stationary, and a low p-value is evidence of non-stationarity, e.g. a trend or drift in the data. Because the two tests have opposite null hypotheses, running both is a useful cross-check.

Let’s revisit our example of the Elevation Gain as you hike up a mountain. In our example, the mountain had a fixed trend (the slope didn’t change with time) and a fixed seasonality (peaks and troughs were always 20 minutes apart). This means that, given this information, we can expect to gain almost the same elevation (linear height + peaks/troughs) every 20 minutes.

REMARK: So far, we have been considering Additive time series models only.

Revisiting Our Mountain Hike

# In the previous part, we took the example of hiking up a mountain and measured the elevation gained per minute.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller

minutes = np.arange(0, 100)
elevation_gain = 1000 + 3*minutes + 50 * np.sin(2 * np.pi * minutes / 20) + np.random.normal(0, 10, size=100)
df = pd.DataFrame(index = minutes, data = {"Elevation":elevation_gain})

# Plot the elevation gain
plt.figure(figsize=(10, 5))
plt.plot(minutes, elevation_gain, label='Elevation Gain', marker='o', linestyle='-')
plt.title('Elevation Gain as You Hike Up a Mountain')
plt.xlabel('Minutes')
plt.ylabel('Elevation (Feet)')
plt.legend()
plt.show()

Testing for Stationarity

Let’s quickly dive into using the statsmodels library to test for stationarity (!pip install --upgrade statsmodels).

# Use the statsmodels library for the stationarity tests

from statsmodels.tsa.stattools import adfuller, kpss

# Augmented Dickey-Fuller Test
def perform_adf_test(series):
    result = adfuller(series)
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'\t{key}: {value}')

# Kwiatkowski-Phillips-Schmidt-Shin Test
def perform_kpss_test(series):
    statistic, p_value, lags, critical_values = kpss(series, regression='c')
    print(f'KPSS Statistic: {statistic}')
    print(f'p-value: {p_value}')
    print('Critical Values:')
    for key, value in critical_values.items():
        print(f'\t{key}: {value}')

# Apply the tests to the elevation gain data
print("Augmented Dickey-Fuller Test Results:")
perform_adf_test(elevation_gain)
print("\nKwiatkowski-Phillips-Schmidt-Shin Test Results:")
perform_kpss_test(elevation_gain)
# Output :
Augmented Dickey-Fuller Test Results:
ADF Statistic: -0.2983961870566715
p-value: 0.9257311731708413
Critical Values:
1%: -3.5078527246648834
5%: -2.895382030636155
10%: -2.584823877658872

Kwiatkowski-Phillips-Schmidt-Shin Test Results:
KPSS Statistic: 1.5881056192520853
p-value: 0.01
Critical Values:
10%: 0.347
5%: 0.463
2.5%: 0.574
1%: 0.739
InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is smaller than the p-value returned.

Interpretation & Misinterpretation

  1. The ADF statistic is greater than all the critical values, and the high p-value suggests that we cannot reject the null hypothesis that there is a unit root present. This indicates that the series is likely non-stationary.
  2. The KPSS statistic exceeds all the critical values, and the low p-value indicates strong evidence against the null hypothesis of stationarity. This also supports the conclusion that the series is non-stationary, showing a trend or drift.

Both tests suggest that the time series is non-stationary. Diving into tests BLINDLY without carefully examining the data (perform EDA first) can result in misinterpretation. Is our Elevation Gain data non-stationary? Well, yes and no. It’s non-stationary (because the average elevation is changing) BUT the other components (seasonality and residuals) are stationary (we will test this out). Thus, if we remove the trend component, we will be left with the stationary components.

How can we do that? We perform DIFFERENCING, meaning we subtract the lagged values of the time series from itself: y(t) - y(t-1). How much to lag? UNTIL the data becomes stationary. Let’s first look at how the differenced data looks at various lags.


plt.figure(figsize = (15,8))
plt.subplot(411)
plt.plot(df.diff(), label = "Lag = 1")
plt.legend()
plt.xlim([0,100])

plt.subplot(412)
plt.plot(df.diff(5), label = "Lag = 5")
plt.legend()
plt.xlim([0,100])

plt.subplot(413)
plt.plot(df.diff(10), label = "Lag = 10")
plt.legend()
plt.xlim([0,100])

plt.subplot(414)
plt.plot(df.diff(20), label = "Lag = 20")
plt.legend()
plt.xlim([0,100])
import warnings
warnings.filterwarnings('ignore')  # suppress the KPSS InterpolationWarning seen above

# Dictionary to hold the results
results = {}

# List of lags for differencing
lags = [1, 5, 10, 15]

# Function to perform ADF and KPSS tests
def perform_tests(data):
    adf_test = adfuller(data.dropna())
    adf_p = adf_test[1]
    kpss_test = kpss(data.dropna(), regression='c')
    adf_result = "Stationary" if adf_p < 0.05 else "Non-stationary"
    kpss_stat = kpss_test[0]
    kpss_result = "Stationary" if kpss_stat < kpss_test[3]['5%'] else "Non-stationary"
    return {
        'ADF Statistic': adf_test[0],
        'ADF p-value': adf_test[1],
        'KPSS Statistic': kpss_test[0],
        'KPSS p-value': kpss_test[1],
        'ADF_Result': adf_result,
        'KPSS_Result': kpss_result
    }

# Apply differencing and perform tests
for lag in lags:
    lagged_data = df.diff(lag)
    results[f'Diff_{lag}'] = perform_tests(lagged_data)

# Let's see if the Time Series became stationary post differencing or not

results_df = pd.DataFrame(results).T
results_df

The above method (differencing) has rendered the time series STATIONARY. This shows that a deterministic TREND makes a time series NON-STATIONARY. Therefore, we need to make sure that the time series is DE-TRENDED BEFORE we test it for stationarity and before making predictions!

NOTE: While differencing with lag = 1 makes the data stationary, I would use lag = 5. Why? Notice that lag = 5 preserves the seasonality in the data, which will be critical for predictions. Thus, I always recommend VISUALISING the data before you choose your parameters.

Time Series Methods

While we now know some characteristics of the above time series, let’s work out some other basic statistics of the Time Series:

1. Mean: What is the average height that we climbed?

2. Standard Deviation: During our hike, what was the typical deviation in height we experienced?

3. Moving Average: At any point in time, what is our elevation gain? How is it different from a simple mean?

For basic stats, we can simply use pandas’ describe function to get the mean, std, and percentiles. The mean is the average height we climbed; the std deviation in this context tells us how much “up/down” from the mean we had to climb (a measure of the peaks and troughs).

df.describe()

Moving Averages

For time series, we are often more interested in statistics around a particular point in time. For example:

  1. Hey, what was the screen size of mobile phones in the ’90s?

— Here we would want to look at the average size of mobile phones in that decade and ignore data from the ’80s or 2000s.

2. In India, what was the life expectancy in the 1940s?

— Here we would be interested in the average life span during the ’40s only.

Therefore, moving averages help you understand the average value of a variable over a window of time.

# A Moving Average is a statistical technique used in time series analysis to estimate the underlying trend or pattern in the data.
# It smooths out short-term fluctuations by averaging the values of a certain number of preceding periods.
# Each smoothed point is the average of the values from the previous n periods. IMPORTANT TO CONSIDER: HOW MANY PREVIOUS POINTS SHOULD BE CONSIDERED?
# In the following, observe the effect of large rolling periods
plt.figure(figsize = (15,8))
df.Elevation.plot( label = "Elevation")
df.Elevation.rolling( 5).mean().plot( label = "MA Elevation, period = 5")
df.Elevation.rolling( 10 ).mean().plot( linestyle = "--", label = "MA Elevation, period = 10")
df.Elevation.rolling( 15 ).mean().plot( linestyle = "-.", label = "MA Elevation, period = 15")

plt.axhline(df.Elevation.mean(), linestyle = "--", label = "Average Elevation")

plt.title('Elevation Gain as You Hike Up a Mountain')
plt.xlabel('Minutes')
plt.ylabel('Elevation (Feet)')
plt.legend()

Using Moving Averages to get rid of Trend

Now that we have seen how the elevation gain changes and how moving averages smooth out the curve, let’s use MA for another purpose: getting RID OF THE TREND. Previously, we showed that DIFFERENCING helps de-trend the time series. We can also use Moving Averages to de-trend the time series by simply performing

# y_diff(t) = y(t) - MA(y, n)

here n is the number of time points (t, t-1, t-2, …, t-n+1) used to calculate the moving average.

This idea is worth remembering, though note that the “MA” in ARIMA models refers to a moving average of past forecast errors, which is related to but not the same as the rolling mean used here.

# In the below code, we have taken the Elevation and then subtracted the moving average from it.
# As a result, we are left with only the seasonal + residual components of the Elevation data
# But why does MA(5) start around 5 and MA(20) around 20 (refer to plot)?
# Say we are calculating MA(5) for [1,2,3,4,5,6,7,8,9]; then MA(5) = [null, null, null, null, 3, 4, 5, 6, 7]. The first non-null value occurs at index n-1 (0-based), i.e. at the nth point.

plt.figure(figsize = (15,8))
plt.subplot(411)
plt.plot(df.index, df.Elevation - df.Elevation.rolling(5).mean(), label = "MA(5)")
plt.legend()
plt.xlim([0,100])

plt.subplot(412)
plt.plot(df.index, df.Elevation - df.Elevation.rolling(10).mean(), label = "MA(10)", color = "r")
plt.legend()
plt.xlim([0,100])

plt.subplot(413)
plt.plot(df.index, df.Elevation - df.Elevation.rolling(15).mean(), label = "MA(15)")
plt.legend()
plt.xlim([0,100])

plt.subplot(414)
plt.plot(df.index, df.Elevation - df.Elevation.rolling(20).mean(), label = "MA(20)")
plt.legend()
plt.xlim([0,100])
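As a quick, runnable check of the comment above about where the first non-null value of a rolling mean appears (a standalone toy example, not from the original notebook):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
ma5 = s.rolling(5).mean()
print(ma5.tolist())             # [nan, nan, nan, nan, 3.0, 4.0, 5.0, 6.0, 7.0]
print(ma5.first_valid_index())  # 4 (0-based), i.e. the 5th point
```

This is why each MA(n) curve in the plot starts n points into the series, and why we call dropna() before passing de-trended data to the tests.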
# Dictionary to hold the results
results = {}

# List of lags for differencing
lags = [2, 5, 10, 15]

# Reuse the perform_tests function defined earlier

# Apply Moving Averages differencing and perform tests
for lag in lags:
    print(lag)
    detrended_data = df - df.rolling(lag).mean()
    results[f'Diff_{lag}'] = perform_tests(detrended_data)

# Output :
2
5
10
15
# Let's see if the Time Series became stationary after removing the moving average or not

results_df = pd.DataFrame(results).T
results_df

Summary of Part 1.1: Stationarity

In Part 1.1 of the series on Time Series Analysis (TSA), we have deepened our understanding of stationarity — a fundamental concept for analyzing time series data. Here’s a concise summary of what we covered:

  1. Introduction to Stationarity:
  • We defined stationarity as the property of a time series where its statistical characteristics such as mean, variance, and autocorrelation do not change over time.
  • We distinguished between strict and weak stationarity, emphasizing the practical importance of the latter for statistical modeling.

2. Significance of Stationarity:

  • Stationarity is crucial for the reliability of many statistical forecasting models because it implies consistent behavior over time, simplifying the modeling process.

3. Testing for Stationarity:

  • We implemented and interpreted results from key statistical tests:

a) Augmented Dickey-Fuller (ADF) Test: Tests for unit roots to indicate non-stationarity.

b) Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test: Tests for stationarity around a deterministic trend.

4. Practical Application:

  • We applied these concepts to a worked (simulated) example by analyzing the elevation gain during a mountain hike, illustrating how these statistical properties manifest in practice.

5. Addressing Non-Stationarity:

  • Techniques such as differencing were used to transform the original time series into a stationary series, demonstrating the process and its impact through visualizations.

6. Moving Averages:

  • We explored how moving averages can be used to smooth the series and help identify underlying trends, and how they can be employed to detrend the data, setting the stage for further analysis.

In the next part, we will go through forecasting models and validate the accuracy of their forecasts.


I will be posting more of these on my LinkedIn page: https://www.linkedin.com/in/vipul-joshi-4b249155/

https://colab.research.google.com/drive/1YIXuxVqFNPZXi23oGY5dgiwDCX_UbEvX?usp=sharing#scrollTo=B_gIftlA7ji1
