Dynamic Time Warping for Time Series Classification

When, How and Why the Dynamic Time Warping algorithm can replace the common Euclidean distance to better classify your time series data

Annagiulia Tiozzo
Eni digiTALKS
9 min read · Nov 9, 2022


Source: https://www.pexels.com/it-it/foto/acqua-blu-3621556/

Classifying time series with machine learning algorithms requires some familiarity with techniques beyond standard classification.

The Time Series Classification (TSC) task is usually solved by supervised algorithms; it aims at building classifiers that map input time series to discrete variables (classes) describing one or more characteristics of the series themselves.

Interesting examples of Time Series Classification tasks can be found in speech recognition or gesture and movement recognition.

Figure 1 — Example of movement recognition (from [13])

Standard classification algorithms used on other types of data, e.g., tabular data, cannot be applied directly, as they treat each sample independently of the others.

With time series, you cannot ignore the temporal order of the data: you cannot consider each sample of a series in isolation from the others, but must preserve the temporal ordering.

For this reason, the literature offers several types of techniques for time series classification, which are briefly described in the next section.

Time Series Classification approaches

Figure 2 — Taxonomy of approaches for Time Series Classification algorithms

As a brief overview of the different types of approaches for TSC, let’s walk through Figure 2 to see how the temporal ordering can be preserved and exploited.

  • Interval-based approach: features and information of the time series are extracted from different intervals, and standard classifiers are applied to the features themselves. An example algorithm is the Time Series Forest Classifier.
  • Dictionary-based approach: features of the time series are transformed into words that are representative of the classes, and standard classifiers are applied to the distribution of the extracted words. An example algorithm is Bag of Patterns.
  • Frequency-based approach: features of the time series are extracted at the spectral level through frequency analysis, and standard classifiers are subsequently applied. An example algorithm is the Random Interval Spectral Ensemble.
  • Shapelet-based approach: shapelets are subsequences of the time series that are representative of a class. The k most characteristic shapelets of the time series are extracted, and then standard classifiers are used. An example algorithm is the Shapelet Transform Classifier.
  • Ensemble approach: very competitive on general problems, these methods combine several estimators, as in, for example, the HIVE-COTE algorithm.

Distance-based approach

In this article we’ll focus on the Distance-based approach.

It is a non-parametric approach that pairs a distance metric with a classifier to determine class membership. The classifier is usually a k-nearest neighbor (KNN) algorithm, used to check whether the time series you want to label is similar to some in the training dataset. Based on the neighborhood, the nearest class, or an aggregation of the nearest classes, is associated with the time series under analysis. Dynamic Time Warping (DTW) is an example of a distance-based approach.

Figure 3 — Distance-based approach

Distance Metrics

In Time Series Classification we need to compute the distance between two series while taking into account the temporal relations and dependencies between the samples within each series. The choice of the right metric is fundamental for this approach.

Euclidean Distance

Let’s start considering the common Euclidean distance.

For Time Series Classification, the Euclidean distance is not appropriate: even though it preserves the temporal order, it measures distance in a pointwise manner. The Euclidean similarity of two time series is computed by comparing their amplitudes sample by sample, without compensating for phase shifts, time shifts, or distortions.

Take the example in Figure 4. We have three time series: ts1, ts2 and ts3. We would like to detect that the two sinusoidal curves are similar to each other because they share the same shape and the same up-and-down trend, even if their phase and frequency are slightly different. However, if we compute the Euclidean distance, the straight line ts3 turns out to be closer to ts1.

Figure 4 — examples of time series to compare

This happens because the Euclidean distance compares the amplitudes of the curves without allowing any stretch in time.

Figure 5 — Euclidean Matching (https://commons.wikimedia.org/wiki/File:Euclidean_vs_DTW.jpg)
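
To make this concrete, here is a minimal NumPy sketch of the comparison; the three series are illustrative stand-ins for ts1, ts2 and ts3 in Figure 4:

```python
import numpy as np

t = np.linspace(0, 6 * np.pi, 100)
ts1 = np.sin(t)           # sinusoid
ts2 = np.sin(t + 2)       # same shape, shifted in phase
ts3 = np.zeros_like(t)    # straight line

def euclidean(a, b):
    """Pointwise (Euclidean) distance between two series of equal length."""
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean(ts1, ts2))  # ~11.9: the shifted sinusoid looks far away
print(euclidean(ts1, ts3))  # ~7.1: the flat line looks closer to ts1
```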

Dynamic Time Warping

Dynamic Time Warping was introduced to overcome this limitation of the Euclidean distance.

Historically, it was introduced for speech recognition. As you can see in Figure 6, with the same sentence repeated at different speeds, it was necessary to map the recordings to the same words while handling the different speaking rates.

Figure 6 — Speech recognition application of DTW (from [6])

DTW allows you to measure the similarity between two time series by identifying the best alignment between them, thus minimizing the effects of time distortion and shifting.

Similar shapes with different phases are matched with elastic warping in time.

Figure 7 — Dynamic Time Warping matching (https://commons.wikimedia.org/wiki/File:Euclidean_vs_DTW.jpg)

The algorithm

Let’s consider two time series X = (x₁, x₂, …, xₙ) and Y = (y₁, y₂, …, yₘ), sampled at equidistant points in time, with equal or different lengths.

Our aim is to find the minimum distance that aligns the time series.

Figure 8 — Example of Time series to align

A local cost matrix is defined, and it will be minimized to find the optimal alignment. The cost matrix C is an n × m matrix collecting the pairwise distances of all the time series points, e.g., the absolute difference:

C(i, j) = d(xᵢ, yⱼ) = |xᵢ − yⱼ|, for i = 1, …, n and j = 1, …, m

Figure 9 — Heatmap of the local cost matrix C
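
With NumPy, this local cost matrix can be built in one line via broadcasting (a minimal sketch, using the absolute difference as the pointwise distance):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 2.0])  # series X, n = 4
y = np.array([1.0, 3.0, 2.0])       # series Y, m = 3

# C[i, j] = |x_i - y_j| for every pair of points
C = np.abs(x[:, None] - y[None, :])
print(C.shape)  # (4, 3): one row per point of X, one column per point of Y
```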

The aim is to find a warping path on the local cost matrix that aligns the time series, by following the least expensive route.

A warping path p is a sequence of points on the local cost matrix, and therefore a sequence of pairs of indexes on the two time series:

p = (p₁, p₂, …, p_L), with pₗ = (iₗ, jₗ) for l = 1, …, L,

which must satisfy some conditions:

  • Boundary condition: p₁ = (1, 1) and p_L = (n, m), i.e., the starting and ending points of the warping path must be the first and the last points of the sequences.

  • Monotonicity condition: i₁ ≤ i₂ ≤ … ≤ i_L and j₁ ≤ j₂ ≤ … ≤ j_L, to preserve the time ordering of the points.

  • Step size condition: pₗ₊₁ − pₗ ∈ {(1, 0), (0, 1), (1, 1)} for l = 1, …, L−1, to limit long jumps and shifts in time while aligning the sequences.

Each warping path p has an associated cost, the sum of the local costs along the path:

cₚ(X, Y) = Σₗ₌₁ᴸ C(iₗ, jₗ)

Figure 10 — Example of warping path (not optimal)

The aim is to find the optimal warping path p*, the one with minimal cost:

DTW(X, Y) = cₚ*(X, Y) = min{ cₚ(X, Y) | p is a warping path }

DTW is solved with a recursive implementation that finds the warping path with minimum cost. Denoting by D(i, j) the accumulated cost of the best alignment between the first i points of X and the first j points of Y, the recursion is

D(i, j) = C(i, j) + min{ D(i−1, j), D(i, j−1), D(i−1, j−1) }

with D(1, 1) = C(1, 1); the optimal cost is D(n, m).

Figure 11 — Optimal Warping path

Once the optimal warping path is found, the associated optimal cost is computed, and it can be used as the DTW distance.

The recursive implementation reaches the optimum, but the computational cost is O(nm), where n and m are the lengths of the two time series.
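
As an illustration, here is a minimal NumPy implementation of the dynamic programming recursion above (a didactic sketch, not an optimized library routine):

```python
import numpy as np

def dtw_distance(x, y):
    """DTW distance between two 1-D series via dynamic programming."""
    n, m = len(x), len(y)
    # Accumulated cost matrix, padded with +inf to enforce the boundary condition
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])  # local cost C(i, j)
            # Step size condition: predecessors (i-1, j), (i, j-1), (i-1, j-1)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

t = np.linspace(0, 6 * np.pi, 100)
ts1, ts2, ts3 = np.sin(t), np.sin(t + 2), np.zeros_like(t)
print(dtw_distance(ts1, ts2))  # small: warping aligns the shifted sinusoids
print(dtw_distance(ts1, ts3))  # much larger: no warping fits a flat line
```

On the same three series used earlier, DTW now correctly ranks ts2 as the series closest to ts1, reversing the verdict of the Euclidean distance.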

k-nearest neighbor

Going back to the original problem of classifying the time series of interest, the distance metric can be coupled with a k-nearest neighbor (k-nn) algorithm: you compute the DTW distance between your time series and all the time series in the training dataset. Then you associate the class of the nearest neighbor with your time series, if you are using a 1-nn approach, or the most frequent class among the k nearest neighbors with a k-nn approach.

In this way you have reached your goal: a class is assigned to your time series with a distance-based approach that accounts for time shifts.
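
Libraries such as tslearn bundle DTW with a k-nearest-neighbor classifier. A minimal sketch (the random data here is a hypothetical stand-in for a real training set):

```python
import numpy as np
from tslearn.neighbors import KNeighborsTimeSeriesClassifier

# Toy training set: 20 univariate series of length 50 with binary labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 50, 1))
y_train = rng.integers(0, 2, size=20)

# 1-NN classifier using DTW as the distance between series
clf = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric="dtw")
clf.fit(X_train, y_train)

X_new = rng.normal(size=(1, 50, 1))
print(clf.predict(X_new))  # class of the nearest training series under DTW
```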

Alternative approaches to speed up classical DTW

Constraints

Constraints on the warping path have been proposed to speed up the algorithm and to prevent pathological alignments.

For example, a warping window T defines a maximum allowable distance between any pair of aligned indexes, as shown in Figure 12. With the Sakoe-Chiba band or the Itakura parallelogram, only the grey area is feasible for the warping path, meaning that the path cannot stray too far from the diagonal. In this way, pathological alignments that drift too far apart are avoided.

However, the constrained variant yields a sub-optimal solution (as every constrained optimization problem does with respect to its unconstrained counterpart), and that solution is often too far from the optimal one to be effective in real-world problems. A code sketch of the banded variant follows Figure 12.

Figure 12 — Constrained version of DTW
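
tslearn exposes the Sakoe-Chiba band as a global constraint on its DTW implementation; a minimal sketch (the radius value is illustrative):

```python
import numpy as np
from tslearn.metrics import dtw

t = np.linspace(0, 6 * np.pi, 100)
s1, s2 = np.sin(t), np.sin(t + 0.5)

# Unconstrained DTW vs. DTW restricted to a band of width 2*radius + 1
# around the diagonal of the cost matrix
d_full = dtw(s1, s2)
d_band = dtw(s1, s2, global_constraint="sakoe_chiba", sakoe_chiba_radius=5)
print(d_full, d_band)  # the banded distance is >= the unconstrained one
```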

Fast DTW

The FastDTW algorithm speeds up DTW with a multilevel approach.

It proceeds in three steps:

  • Coarsening: shrink the time series into a coarser one, reducing its size by averaging adjacent pairs of points.
  • Projection: find the minimum-distance warping path at the lower resolution and use it as the initial guess for the higher resolution’s warping path.
  • Refinement: refine the projected warping path from the lower resolution to the higher one through local adjustments. This step finds the optimal warping path in the neighborhood of the projected path, where a radius parameter r controls the size of the neighborhood.

Figure 13 — Fast DTW (from [8])

FastDTW is fast, with a complexity of O(n·r), and returns a good sub-optimal solution.
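
The fastdtw package from reference [7] implements this scheme. A minimal usage sketch:

```python
import numpy as np
from fastdtw import fastdtw

t = np.linspace(0, 6 * np.pi, 100)
x, y = np.sin(t), np.sin(t + 2)

# radius controls the neighborhood searched at each resolution:
# a larger radius gets closer to exact DTW, but runs more slowly
distance, path = fastdtw(x, y, radius=1)
print(distance)   # approximate DTW distance
print(path[:5])   # first matched index pairs of the warping path
```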

Multivariate variants

All the approaches presented so far address the univariate Time Series Classification problem.

Several variants for the Multivariate case have been presented:

  • Independent warping (DTWI):

Each dimension is treated independently with a standard univariate DTW, and the resulting DTW distance is the sum of the univariate DTW distances.

  • Dependent warping (DTWD):

The dependent warping assumes that the correct warping is shared across all the dimensions. The local cost is defined as the Euclidean distance between the two vectors that collect all the dimensions at a given pair of time points.

  • Adaptive warping (DTWA)

The adaptive approach requires a score function S, defined as the ratio between the distance of the nearest neighbor under the dependent strategy (DTWD) and under the independent strategy (DTWI); it therefore requires computing both DTWI and DTWD. The final class of the classification problem is then chosen by comparing S against a threshold value T, where T is trained to discriminate between the solution proposed by the independent strategy and the one proposed by the dependent strategy.
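
A minimal sketch of the independent and dependent strategies on top of tslearn’s DTW, which handles multivariate series natively (the random series are illustrative):

```python
import numpy as np
from tslearn.metrics import dtw

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))  # multivariate series: 50 time steps, 2 dimensions
Y = rng.normal(size=(50, 2))

# Independent warping (DTW_I): one warping path per dimension, distances summed
dtw_i = sum(dtw(X[:, d], Y[:, d]) for d in range(X.shape[1]))

# Dependent warping (DTW_D): a single warping path shared by all dimensions,
# with the distance between the full 2-D vectors as the local cost
dtw_d = dtw(X, Y)

print(dtw_i, dtw_d)
```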

Takeaways

  • Dynamic Time Warping is a good non-parametric algorithm: it is simple to use because it does not require any parameter tuning.
  • It is a robust benchmark for Time Series Classification problems.
  • It is computationally expensive.
  • When the characteristic time scale of a phenomenon is known (e.g., its periodicity) and there is a specific pattern to classify, Dynamic Time Warping is a good starting point for a classification task.

References

1. http://www.timeseriesclassification.com/

2. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Bagnall A. et al. 2017

3. The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Ruiz A.P. et al. 2020

4. https://en.wikipedia.org/wiki/Dynamic_time_warping

5. https://csdl.ics.hawaii.edu/techreports/2008/08-04/08-04.pdf

6. https://databricks.com/blog/2019/04/30/understanding-dynamic-time-warping.html

7. https://pypi.org/project/fastdtw/

8. Stan Salvador, and Philip Chan. “FastDTW: Toward accurate dynamic time warping in linear time and space.” Intelligent Data Analysis 11.5 (2007): 561–580.

9. https://rtavenar.github.io/blog/dtw.html#ref-vintsyuk1968speech

10. https://perun.pmf.uns.ac.rs/radovanovic/publications/2019-inista-dtw.pdf

11. F. Itakura, “Minimum prediction residual principle applied to speech recognition,” in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, no. 1, pp. 67–72, February 1975, doi: 10.1109/TASSP.1975.1162641.

12. Sakoe, H., Chiba, S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1978, 26, 43–49.

13. R. G. Andrzejak, K. Lehnertz, F. Mormann, et al. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Phys. Rev. E 64, 061907 (2001).
