Mathematically determining patterns in Time Series with codes

Detecting Trend, Seasonality, Outliers & cycles

Mehul Gupta
Data Science in your pocket
8 min read · Mar 21, 2022


* Trend detection using the Mann-Kendall test

* Seasonality detection using Autocorrelation

* Outliers/anomaly detection using Modified Z-Score

* Cycle detection using CyDeTS algorithm

Time series is among the most confusing concepts in data science, whether you are forecasting or doing analysis. Recently, I have been building an alert system for BI tools (like Tableau, QlikView, etc.) that flags any pattern it recognizes in their usage. For eg: An unusual number of user logins, a spike in ad-hoc queries every Wednesday, etc.


As most of this data is time series, it's time to Google a couple of things:

What are the patterns we can observe in time series data?

Which statistical or mathematical methods can identify these patterns?

The first question is easy to answer.

  1. Trend: An increase or decrease observed over a significant period of time, starting at any point in the time series. For eg: The stock market saw a downward trend due to the Russia-Ukraine war
  2. Seasonality: A pattern that repeats at a fixed, regular interval. For eg: Summer clothes sales go up every summer
  3. Outliers/Anomalies: Values that are abnormal & out of sync with the rest of the data points. For eg: A sudden dip in user footfall after a release of software ‘X’, which can indicate issues with the new release.
  4. Cycles: A pattern that recurs across the time series, but at irregular intervals & possibly with variable length. For eg: Different Covid waves.

Before we move ahead, are cycles the same as seasonality?

Maybe. Seasonality is a pattern that repeats at a fixed frequency, while cycles recur at irregular intervals. That said, many sources describe seasonality as a special type of cycle, which is a fair way to look at it as well.

So we have answered the 1st question. What about the 2nd question?

This is slightly harder to answer because most of the methods described online rely on human intuition: you look at a graph & read off the different patterns based on your knowledge. What we want is a mathematical system or statistical test that gives a Boolean answer, Yes or No, & not a maybe. So let’s discuss what I found:

Trends using Mann Kendall test

The Mann-Kendall test is a hypothesis test that tells us 1) whether a trend exists in a time series and 2) whether it is upward or downward.

Before moving ahead, you might need to know about Hypothesis testing. Once done, let’s explore all the steps involved in Mann Kendall.

  1. Similar to hypothesis testing, we have a

Null Hypothesis: No monotonic trend is present.

Alternate hypothesis: Monotonic trend is present

What is the monotonic trend?

A monotonic trend is one that moves consistently in a single direction, either increasing or decreasing. In a perfectly monotonic increasing series, no element dips below any earlier value (ties are allowed); the test measures how strongly the data lean in one direction.

Before moving ahead, let’s assume a sample time series, listed from oldest to newest: 2, 4, 6, 6, 6, 8, 8

2. Rearrange data in ascending order of occurrence i.e. the sample with the oldest date is 1st & the most recent date is last. Determine sign(xⱼ-xₖ) in the time series for every pair possible where j>k in terms of occurrence i.e. j is more recent than k.

sign() can be defined as

sign(xⱼ-xₖ) =1 if xⱼ-xₖ>0

sign(xⱼ-xₖ) = -1 if xⱼ-xₖ<0

else sign(xⱼ-xₖ)=0.

3. Sum the value of sign(xⱼ-xₖ) over all possible pairs where j>k. In our case, the pairs are

j=2: (4-2),

j=3: (6-2),(6-4),

j=4: (6-2),(6-4),(6-6),

j=5: (6-2),(6-4),(6-6),(6-6)

j=6: (8-2),(8-4),(8-6),(8-6),(8-6)

j=7: (8-2),(8-4),(8-6),(8-6),(8-6),(8-8)

j=1 won't exist as there isn’t any element in the time series older than the first value, 2

Summing sign() for all the above pairs = 17. Let us call this sign_summation
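The pairwise sum from step 3 can be checked with a few lines of plain Python. This is a minimal sketch: the sample values are the ones implied by the pairs listed above, and the helper names sign and sign_summation are illustrative, not taken from any library.

```python
from itertools import combinations

def sign(value):
    """Return 1, -1 or 0 depending on the sign of value."""
    return (value > 0) - (value < 0)

def sign_summation(series):
    """Sum sign(x_j - x_k) over every pair where j is more recent than k."""
    # combinations() yields (older, newer) pairs in order of occurrence
    return sum(sign(newer - older) for older, newer in combinations(series, 2))

sample = [2, 4, 6, 6, 6, 8, 8]  # oldest first
print(sign_summation(sample))  # 17
```

A strictly decreasing series would give a negative sum instead, which is what later signals a downward trend.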

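4. Calculate Variance using the formula

Variance = (1/18) * [ n(n-1)(2n+5) - Σ tₑ(tₑ-1)(2tₑ+5) ], with the sum running over e = 1 … a

where

n = total elements in time series

a = total number of tied groups

e = index of a specific tied group

tₑ = number of elements in tied group e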

What is a tied group? A tied group is a set of equal values that occurs more than once in the series. In our example, we have 2 tied groups

  1. 6 as it occurs thrice. Hence, tₑ=3
  2. 8 as it occurs twice. Hence, tₑ=2
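Counting tied groups by hand gets tedious on longer series, so here is a minimal sketch of the variance calculation. It assumes the standard Mann-Kendall variance with a leading factor of 1/18 and a correction term for ties; the function name mk_variance is made up for this example.

```python
from collections import Counter

def mk_variance(series):
    """Variance of the Mann-Kendall sign sum, corrected for tied values."""
    n = len(series)
    # sizes of tied groups: values that occur more than once
    tie_sizes = [count for count in Counter(series).values() if count > 1]
    tie_term = sum(t * (t - 1) * (2 * t + 5) for t in tie_sizes)
    return (n * (n - 1) * (2 * n + 5) - tie_term) / 18

sample = [2, 4, 6, 6, 6, 8, 8]
print(round(mk_variance(sample), 2))  # 39.67
```

With no ties the correction term is zero and the result depends only on the length of the series.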

Plugging our sample (n=7) into the variance formula, with 1/18 as the leading factor:

(1/18) * [7(7-1)(2*7+5) - (3*(3-1)*(2*3+5) + 2*(2-1)*(2*2+5))]

(1/18) * [7*6*19 - (3*2*11 + 2*1*9)]

(1/18) * [798-84] = 714/18 ≈ 39.67

5. Once we have calculated Variance, it's time to calculate the MK test statistic (similar to the Z-stat or T-stat calculated in hypothesis testing) using the formula

(sign_summation-1)/√Variance if sign_summation>0

(sign_summation+1)/√Variance if sign_summation<0

else 0

In our case, MK Stat = (17-1)/√39.67 ≈ 2.54

If MK Stat is +ve, it indicates an upward trend & vice-versa. To complete the process, let's assume alpha=0.05 (i.e. 5%) for a two-sided test: if abs(MK Stat) > 1.96, the null hypothesis gets rejected & we conclude that a trend exists. As MK Stat=2.54 in our case, we can reject the null hypothesis & conclude that an upward trend exists.
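The final step can be sketched in plain Python as well. The snippet below takes the sign sum (17) and the variance computed with the exact 1/18 factor (714/18 ≈ 39.67), then derives the statistic, a two-sided p-value from the standard normal distribution, and a verdict. The function name mk_decision and its return format are assumptions made for illustration.

```python
import math

def mk_decision(sign_sum, variance, alpha=0.05):
    """Return (z, p_value, verdict) for a two-sided Mann-Kendall test."""
    if sign_sum > 0:
        z = (sign_sum - 1) / math.sqrt(variance)
    elif sign_sum < 0:
        z = (sign_sum + 1) / math.sqrt(variance)
    else:
        z = 0.0
    # two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    if p_value >= alpha:
        verdict = "no trend"
    else:
        verdict = "increasing" if z > 0 else "decreasing"
    return z, p_value, verdict

z, p_value, verdict = mk_decision(sign_sum=17, variance=714 / 18)
print(round(z, 2), verdict)  # 2.54 increasing
```

Comparing the p-value with alpha is equivalent to comparing abs(MK Stat) with the critical value, 1.96 for alpha=0.05.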

The Mann-Kendall test makes some assumptions that we should keep in mind before moving ahead:

Samples should be independent, data collection should be unbiased & the series should have no seasonality. If seasonality exists, we can use a variant, the seasonal MK test, which is available in the pymannkendall package (seasonal_test).

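How to do all this in Python? Follow the below code snippet for a basic MK test

import pymannkendall as mk
# series is a pandas Series object with date as index
# alpha is the significance level (0.1 here; the worked example above used 0.05)
mk_original = mk.original_test(series, alpha=0.1)
# mk_original.trend, .h, .p & .z hold the verdict, the reject flag, the p-value & the MK statistic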

Seasonality using Autocorrelation

I couldn’t find a single go-to statistical test for seasonality on the internet, the way Mann-Kendall is for trend. There are, however, a few methods we can follow to detect it. The one I will discuss uses autocorrelation.

In brief, autocorrelation is the correlation of a time series with lagged versions of itself. For example, values from March’22 may correlate with values from Jan’22. Similarly, values may correlate every 6 months or every year, which shows seasonal effects in the time series. Below are the steps we can follow to detect seasonality most of the time.

  1. Detrend the time series if a trend is present
  2. Calculate the autocorrelation of the detrended time series
  3. Pick the lag indexes where we observe significant autocorrelation (I chose a cut-off of 0.1), skipping the very first index as every series is 100% correlated with its 0-lagged version. Let’s call this list acf_index
  4. Calculate xⱼ-xₖ for each consecutive pair in acf_index (k=j-1) & store the results in the array acf_index_diff. For n indexes, we will have n-1 difference elements in acf_index_diff.
  5. If any difference element is repeated more than twice, potential seasonality can exist at that frequency. For eg: if data is at a monthly level & we get 4 counts for 2 in acf_index_diff, we can say seasonality exists at 2 months (potentially).

Below is a sample code block

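from collections import Counter
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf

# series_ is a pandas Series with a regular DatetimeIndex
# detrending time series (the trend is NaN at both edges, so those points are dropped)
trend = seasonal_decompose(series_).trend
series_ = (series_ - trend).dropna()
# calculating acf & choosing significant lagged versions (lag 0 is skipped)
data = [x for x, y in enumerate(acf(series_.rolling(window='30D').mean(), nlags=100)) if y > 0.1][1:]
# calculating index difference for each consecutive pair of lags, ignoring adjacent lags
index_diff = Counter([data[x + 1] - data[x] for x in range(len(data) - 1) if data[x + 1] - data[x] != 1])
threshold = 2
for lag, count in index_diff.items():
    if count > threshold:
        # 'months' assumes monthly data; the lag is in units of the sampling interval
        print('seasonality at {} months'.format(lag))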

This method can give wrong answers, but it works fine for most cases.
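One way the lag-difference idea can go wrong: the lags next to an autocorrelation peak tend to clear the threshold together, so the gaps between significant lags can come out shorter than the true period. A possible variant, sketched below in plain Python, keeps only the local peaks of the autocorrelation before taking differences. This is a swapped-in variation, not the method in the code above; the function names, the 0.1 cut-off and the synthetic data are all assumptions for illustration.

```python
import math
from collections import Counter

def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]

def candidate_periods(series, max_lag, acf_threshold=0.1, min_repeats=2):
    """Spacings between ACF peaks that repeat more than min_repeats times."""
    acf_values = autocorrelation(series, max_lag)
    # a peak is a lag whose autocorrelation beats both neighbours and the threshold
    peaks = [
        lag for lag in range(1, max_lag)
        if acf_values[lag] > acf_threshold
        and acf_values[lag] > acf_values[lag - 1]
        and acf_values[lag] > acf_values[lag + 1]
    ]
    spacing_counts = Counter(b - a for a, b in zip(peaks, peaks[1:]))
    return sorted(gap for gap, count in spacing_counts.items() if count > min_repeats)

# synthetic monthly data: a pure 12-month cycle over 10 years (no trend, no noise)
monthly = [math.sin(2 * math.pi * t / 12) for t in range(120)]
print(candidate_periods(monthly, max_lag=60))  # [12]
```

On real, noisy data the peaks will be less clean, so the result is still only a hint of where seasonality may lie.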

Outliers/Anomalies using modified z-score

Outlier detection is important, at least when you are designing some sort of alert system. I haven’t followed any anomaly detection method designed specifically for time series; I just used a modified z-score, because the regular Z-score relies on the mean (& standard deviation) & can therefore be skewed by the very outliers it is meant to find.

The steps are as follows:

  1. Calculate the time series median
  2. Calculate the median absolute deviation (MAD) of the data, i.e. center the series by subtracting the median & take the median of the absolute centered series
  3. Calculate modified z-score = 0.6745 * centered_series/MAD
  4. Any point whose absolute modified z-score exceeds a chosen threshold (3.5 is a common choice) is an anomaly

import numpy as np

def mod_zscore(col, thresh):
    # col is a pandas Series; thresh is the cut-off on the absolute modified z-score
    med_col = col.median()
    med_abs_dev = (np.abs(col - med_col)).median()
    # note: med_abs_dev is 0 when more than half of the values are identical
    mod_z = 0.6745 * ((col - med_col) / med_abs_dev)
    # keep only the points whose absolute score exceeds the threshold
    mod_z = mod_z[np.abs(mod_z) > thresh]
    return np.abs(mod_z)

The method can be made more specific to time series with a window strategy: if the modified z-score is calculated only from the values falling in each point's timeline/window, & anomalies are judged against that window, more effective results can be obtained
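The window strategy could look roughly like the sketch below, which scores each point against the median and MAD of a trailing window instead of the whole series. It is written in plain Python to keep the arithmetic visible; the function name, the trailing-window choice, the 3.5 default and the sample numbers are all assumptions for illustration.

```python
from statistics import median

def rolling_mod_zscore_anomalies(values, window, thresh=3.5):
    """Indices whose modified z-score, computed within a trailing window, exceeds thresh."""
    anomalies = []
    for i in range(window - 1, len(values)):
        recent = values[i - window + 1 : i + 1]  # trailing window ending at point i
        med = median(recent)
        mad = median(abs(v - med) for v in recent)
        if mad == 0:
            continue  # more than half the window is identical; score is undefined
        score = 0.6745 * (values[i] - med) / mad
        if abs(score) > thresh:
            anomalies.append(i)
    return anomalies

logins = [10, 11, 10, 12, 11, 50, 10, 11, 12, 10]  # made-up daily counts
print(rolling_mod_zscore_anomalies(logins, window=5))  # [5]
```

A short window reacts quickly to level shifts but gives a noisier MAD, so the window length is something to tune per metric.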

Cycles using CyDeTS algorithm

The last significant pattern to detect is cycles. This one is really confusing, as I could not find a good source on cycle detection in time series data. The only major thing I found is the CyDeTS Python package, which is easy to use. Let’s try to summarize the algorithm it uses:

  1. Normalize time series
  2. Find the timestamps of potential peaks & valleys.

A peak is a value which is higher than both its immediate preceding & succeeding value in the series

A valley is lower than both its immediate preceding & succeeding values in the series
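The two definitions above translate almost directly into code. The sketch below is only an illustration of those definitions on a toy series, not the CyDeTS implementation; the function name is made up.

```python
def find_peaks_and_valleys(series):
    """Return (peak_indices, valley_indices) using strict neighbour comparisons."""
    peaks, valleys = [], []
    # the first and last points have only one neighbour, so they are skipped
    for i in range(1, len(series) - 1):
        prev_value, value, next_value = series[i - 1], series[i], series[i + 1]
        if value > prev_value and value > next_value:
            peaks.append(i)
        elif value < prev_value and value < next_value:
            valleys.append(i)
    return peaks, valleys

series = [0, 1, 0, 0.5, 0, 1, 0, 0.5, 0, 1, 0]
print(find_peaks_and_valleys(series))  # ([1, 3, 5, 7, 9], [2, 4, 6, 8])
```

Because the comparisons are strict, flat stretches (repeated equal values) are neither peaks nor valleys in this sketch.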

3. Find potential start & endpoints of a cycle

Potential start point: Time at which a peak is greater than a specific succeeding peak

Potential end point: Time at which a peak is greater than a specific preceding peak.

4. Search for precycles. A precycle either

starts at a calculated start point & ends at the next corresponding peak, OR

starts at the preceding corresponding peak & ends at a calculated end point.

What is the calculated starting point?

As the documentation is missing, this is my assumption after reading the code: given the arrays potential_starting, potential_ending & the normalized series calculated in earlier steps, for any index x a precycle exists if

potential_starting[x]<series[x]

leading to precycle of the length : potential_starting[x] to series[x]

OR

potential_ending[x]>series[x]

leading to a precycle of the length : series[x] to potential_ending[x].

For each precycle, the following values are stored:

Start point, endpoint, minimum value & timestamp for minimum value falling in the precycle range

5. Detect cycles from precycles

This mainly involves removing overlapping precycles. Precycles that share the same timestamp for their minimum value are called overlapping precycles. Out of a set of overlapping precycles, the one with the latest starting point & earliest endpoint is chosen
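To make the overlap rule concrete, here is a small sketch based only on the description above, not on the CyDeTS source. The tuple layout (start, end, timestamp of minimum), the function name and the sample precycles are hypothetical; when "latest start" and "earliest end" disagree, this sketch lets the later start win, which is an assumption.

```python
def drop_overlapping(precycles):
    """Keep one precycle per minimum timestamp: latest start, then earliest end."""
    best = {}
    for start, end, t_min in precycles:
        current = best.get(t_min)
        # prefer a later start; break ties with an earlier end
        if current is None or (start, -end) > (current[0], -current[1]):
            best[t_min] = (start, end, t_min)
    return sorted(best.values())

# hypothetical precycles as (start, end, timestamp_of_minimum)
precycles = [(0, 10, 4), (2, 8, 4), (9, 15, 12)]
print(drop_overlapping(precycles))  # [(2, 8, 4), (9, 15, 12)]
```

The first two precycles share their minimum at timestamp 4, so only the tighter one survives.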

Note: As the algorithm can be tricky to understand, the source code can be read here

& we finally get our cycles !!

The python library CyDeTS is pretty easy to use

import pandas as pd
from cydets.algorithm import detect_cycles
# create sample data
series = pd.Series([0, 1, 0, 0.5, 0, 1, 0, 0.5, 0, 1, 0])
# detect cycles
cycles = detect_cycles(series)

The output is a table with one row per detected cycle.

Here t_start & t_end are the timestamps/indexes where the cycle starts & ends, t_minimum is the timestamp/index of the minimum value & duration is the cycle's length. doc is the depth of the cycle (a measure of its amplitude), which can be ignored for now.

With this, it's a wrap-up!!
