TLDR Finding surprising patterns in a time series database in linear time and space by Keogh et al. thru A Colyer

If you want to detect holidays by looking on some power demand time series — there’s a good algorithm for that =)

pavel trukhanov
some-tech-tldrs
2 min readDec 5, 2017

--

Note that this problem should not be confused with the relatively simple problem of outlier detection. Hawkins’ classic definition of an outlier is “… an observation that deviates so much from other observations as to arouse suspicion that it was generated from a different mechanism.” However, we are not interested in finding individually surprising datapoints, we are interesting in finding surprising patterns, i.e. combinations of data points whose structure and frequency somehow defies our expectations.

A pattern is “surprising” if its frequency substantially differs from expected by chance.

There is no need to specify what to ‘look for’.

We only need to have normal / non-surprising data.

Tarzan algorithm:

  • Turn the input time series X into a discretized time series over some fixed size alphabet.
  • The same with the reference time series R
  • Create suffix trees for the discretized time series of X and R, and use a Markov model to determine how likely the substrings represented by the nodes in the suffix tree of X are.
  • Move a sliding window over the discretized time series for X, and look-up the surprise score for each window using the pre-calculated scores in the previous step.
  • Flag as surprising any window with a score over some threshold.

--

--