TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Part 9: Time Series Consensus Motifs

Sean Law
7 min readDec 5, 2020

--

(Image by JJ Ying)

The Whole is Greater than the Sum of Its Parts

(Image by Author)

STUMPY is a powerful and scalable Python library for modern time series analysis and, at its core, efficiently computes something called a matrix profile. The goal of this multi-part series is to explain what the matrix profile is and how you can start leveraging STUMPY for all of your modern time series data mining tasks!

Note: These tutorials were originally featured in the STUMPY documentation.

Part 1: The Matrix Profile
Part 2: STUMPY Basics
Part 3: Time Series Chains
Part 4: Semantic Segmentation
Part 5: Fast Approximate Matrix Profiles with STUMPY
Part 6: Matrix Profiles for Streaming Time Series Data
Part 7: Fast Pattern Searching with STUMPY
Part 8: AB-Joins with STUMPY
Part 9: Time Series Consensus Motifs
Part 10: Discovering Multidimensional Time Series Motifs
Part 11: User-Guided Motif Search
Part 12: Matrix Profiles for Machine Learning

Consensus Motif Search

This tutorial utilizes the main takeaways from the Matrix Profile XV paper.

Matrix profiles can be used to find conserved patterns within a single time series (self-join) and across two time series (AB-join). In both cases, these conserved patterns are often called “motifs”. And, when considering a set of three or more time series, one common trick for identifying a conserved motif across the entire set is to:

  1. Append a np.nan to the end of each time series. This is used to identify the boundary between neighboring time series and ensures that any identified motif will not straddle multiple time series.
  2. Concatenate all of the time series into a single long time series
  3. Compute the matrix profile (self-join) on the aforementioned concatenated time series

However, this is not guaranteed to find patterns that are conserved across all of the time series within the set. This idea of a finding a conserved motif that is common to all of the time series in a set is referred to as a “consensus motif”. In this tutorial, we will introduce the “Ostinato” algorithm, which is an efficient way to find the consensus motif amongst a set of time series.

Getting started

Let’s import the packages that we’ll need to load, analyze, and plot the data.

%matplotlib inlineimport stumpy
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from itertools import cycle, combinations
from matplotlib.patches import Rectangle
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.special import comb
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 20
fig_size[1] = 6
plt.rcParams["figure.figsize"] = fig_size
plt.rcParams['xtick.direction'] = 'out'

Loading the Eye-tracking (EOG) Dataset

In the following dataset, a volunteer was asked to “spell” out different Japanese sentences by performing eye movements that represented writing strokes of individual Japanese characters. Their eye movements were recorded by an electrooculograph (EOG) and they were given one second to “visually trace” each Japanese character. For our purposes we’re only using the vertical eye positions and, conceptually, this basic example reproduced Figure 1 and Figure 2 of the Matrix Profile XV paper.

sentence_idx = [6, 7, 9, 10, 16, 24]
Ts = [None] * len(sentence_idx)
fs = 50 # eog signal was downsampled to 50 Hz
for i, s in enumerate(sentence_idx):
Ts[i] = pd.read_csv(f'https://zenodo.org/record/4288978/files/EOG_001_01_{s:03d}.csv?download=1').iloc[:, 0].values
# the literal sentences
sentences = pd.read_csv(f'https://zenodo.org/record/4288978/files/test_sent.jp.csv?download=1', index_col=0)

Visualizing the EOG Dataset

Below, we plotted six time series that each represent the vertical eye position while a person “wrote” Japanese sentences using their eyes. As you can see, some of the Japanese sentences are longer and contain more words while others are shorter. However, there is one common Japanese word (i.e., a “common motif”) that is contained in all six examples. Can you spot the one second long pattern that is common across these six time series?

def plot_vertical_eog():
fig, ax = plt.subplots(6, sharex=True, sharey=True)
prop_cycle = plt.rcParams['axes.prop_cycle']
colors = cycle(prop_cycle.by_key()['color'])
for i, e in enumerate(Ts):
ax[i].plot(np.arange(0, len(e)) / fs, e, color=next(colors))
ax[i].set_ylim((-330, 1900))
plt.subplots_adjust(hspace=0)
plt.xlabel('Time (s)')
return ax
plot_vertical_eog()
plt.suptitle('Vertical Eye Position While Writing Different Japanese Sentences', fontsize=14)
plt.show()
(Image by Author)

Consensus Motif Search

To find out, we can use the stumpy.ostinato function to help us discover the “consensus motif” by passing in the list of time series, Ts, along with the subsequence window size, m:

m = fs
radius, Ts_idx, subseq_idx = stumpy.ostinato(Ts, m)
print(f'Found Best Radius {np.round(radius, 2)} in time series {Ts_idx} starting at subsequence index location {subseq_idx}.')
Found Best Radius 0.87 in time series 4 starting at subsequence index location 1271.

Now, Let’s plot the individual subsequences from each time series that correspond to the matching consensus motif:

seed_motif = Ts[Ts_idx][subseq_idx : subseq_idx + m]
x = np.linspace(0,1,50)
nn = np.zeros(len(Ts), dtype=np.int64)
nn[Ts_idx] = subseq_idx
for i, e in enumerate(Ts):
if i != Ts_idx:
nn[i] = np.argmin(stumpy.core.mass(seed_motif, e))
lw = 1
label = None
else:
lw = 4
label = 'Seed Motif'
plt.plot(x, e[nn[i]:nn[i]+m], lw=lw, label=label)
plt.title('The Consensus Motif')
plt.xlabel('Time (s)')
plt.legend()
plt.show()
(Image by Author)

There is a striking similarity between the subsequences. The most central “seed motif” is plotted with a thicker purple line.

When we highlight the above subsequences in their original context (light blue boxes below), we can see that they occur at different times:

ax = plot_vertical_eog()
for i in range(len(Ts)):
y = ax[i].get_ylim()
r = Rectangle((nn[i] / fs, y[0]), 1, y[1]-y[0], alpha=0.3)
ax[i].add_patch(r)
plt.suptitle('Vertical Eye Position While Writing Different Japanese Sentences', fontsize=14)
plt.show()
(Image by Author)

The discovered conserved motif (light blue boxes) correspond to writing the Japanese character , which occurs at different times in different example sentences.

Phylogeny Using Mitochondrial DNA (mtDNA)

In this next example, we’ll reproduce Figure 9 from the Matrix Profile XV paper.

Mitochondrial DNA (mtDNA) has been successfully used to determine evolutionary relationships between organisms (phylogeny). Since DNAs are essentially ordered sequences of letters, we can loosely treat them as time series and use all of the available time series tools.

Loading the mtDNA Dataset

animals = ['python', 'hippo', 'red_flying_fox', 'alpaca']
data = {}
for animal in animals:
data[animal] = pd.read_csv(f"https://zenodo.org/record/4289120/files/{animal}.csv?download=1").iloc[:,0].values
colors = {'python': 'tab:blue', 'hippo': 'tab:green', 'red_flying_fox': 'tab:purple', 'alpaca': 'tab:red'}

Clustering Using Large mtDNA Sequences

Naively, using scipy.cluster.hierarchy we can cluster the mtDNAs based on the majority of the sequences. A correct clustering would place the two “artiodactyla”, hippo and alpaca, closest and, together with the red flying fox, we would expect them to form a cluster of “mammals”. Finally, the python, a “reptile”, should be furthest away from all of the “mammals”.

fig, ax = plt.subplots(ncols=2)truncate = 15000
for k, v in data.items():
ax[0].plot(v[:truncate], label=k, color=colors[k])
ax[0].legend()
ax[0].set_xlabel('Number of mtDNA Base Pairs')
ax[0].set_title('mtDNA Sequences')
truncate = 16000
dp = np.zeros(int(comb(4, 2)))
for i, a_c in enumerate(combinations(data.keys(), 2)):
dp[i] = stumpy.core.mass(data[a_c[0]][:truncate], data[a_c[1]][:truncate])
Z = linkage(dp, optimal_ordering=True)
dendrogram(Z, labels=[k for k in data.keys()], ax=ax[1])
ax[1].set_ylabel('Z-Normalized Euclidean Distance')
ax[1].set_title('Clustering')
plt.show()
(Image by Author)

Uh oh, the clustering is clearly wrong! Amongst other problems, the alpaca (a mammal) should not be most closely related to the python (a reptile).

Consensus Motif Search

In order to obtain the correct relationships, we need to identify and then compare the parts of the mtDNA that is the most conserved across the mtDNA sequences. In other words, we need to cluster based on their consensus motif. Let’s limit the subsequence window size to 1,000 base pairs and identify the consensus motif again using the stumpy.ostinato function:

m = 1000
bsf_radius, bsf_Ts_idx, bsf_subseq_idx = stumpy.ostinato(list(data.values()), m)
print(f'Found best radius {np.round(bsf_radius, 2)} in time series {bsf_Ts_idx} starting at subsequence index location {bsf_subseq_idx}.')
Found best radius 2.73 in time series 1 starting at subsequence index location 602.

Clustering Using the Consensus mtDNA Motif

Now, let’s perform the clustering again but, this time, using the consensus motif:

consensus_motifs = {}
best_motif = list(data.items())[bsf_Ts_idx][1][bsf_subseq_idx : bsf_subseq_idx + m]
for i, (k, v) in enumerate(data.items()):
if i == bsf_Ts_idx:
consensus_motifs[k] = best_motif
else:
idx = np.argmin(stumpy.core.mass(best_motif, v))
consensus_motifs[k] = v[idx : idx + m]
fig, ax = plt.subplots(ncols=2)
# plot the consensus motifs
for animal, motif in consensus_motifs.items():
ax[0].plot(motif, label=animal, color=colors[animal])
ax[0].legend()
# cluster consensus motifs
dp = np.zeros(int(comb(4, 2)))
for i, motif in enumerate(combinations(list(consensus_motifs.values()), 2)):
dp[i] = stumpy.core.mass(motif[0], motif[1])
Z = linkage(dp, optimal_ordering=True)
dendrogram(Z, labels=[k for k in consensus_motifs.keys()])
ax[0].set_title('Consensus mtDNA Motifs')
ax[0].set_xlabel('Number of mtDNA Base Pairs')
ax[1].set_title('Clustering Using the Consensus Motifs')
ax[1].set_ylabel('Z-normalized Euclidean Distance')
plt.show()
(Image by Author)

Now this looks much better! Hierarchically, the python is “far away” from the other mammals and, amongst the mammalia, the red flying fox (a bat) is less related to both the alpaca and the hippo which are the closest evolutionary relatives in this set of animals.

Summary

And that’s it! You have now learned how to search for a consensus motif amongst a set of times series using the awesome stumpy.ostinato function. You can now import this package and use it in your own projects. Happy coding!

Resources

Matrix Profile XV
STUMPY Matrix Profile Documentation
STUMPY Matrix Profile Github Code Repository

Part 8: AB-Joins with STUMPY | Part 10: Discovering Multidimensional Time Series Motifs

--

--

TDS Archive
TDS Archive

Published in TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Sean Law
Sean Law

Written by Sean Law

Principal Data Scientist at a Fortune 500 FinTech company. PyData Ann Arbor organizer. Creator of STUMPY for modern time series analysis. Twitter: @seanmylaw

No responses yet