Anomaly Detection with Multi-Dimensional Time Series Data

Naveen Kaushik
Northraine
Feb 26, 2019 · 6 min read
Artwork by Mariah Arvanitakis

Time series data is one of the most common types of data in today’s world. With the evolution of the IoT (Internet of Things), sensors have become even more abundant. Many devices around us in day-to-day life generate some sort of signal every second; it could be as simple as your Apple Watch capturing your heart rate and sending it to a server, generating logs every second. In some cases it becomes useful to detect whether there is any abnormality in the signals being generated.

At Northraine, we encountered one such client, who had an enormous dataset generated from train signals and wanted us to find anomalies in it. The dataset covered 40 trains, with 30,000 signals generated from each of the 8 cars every 10 seconds for over 7 years, which easily sums to over a billion rows. Finding patterns in such a dataset was going to be a daunting task; finding abnormalities was even more difficult.

We used a few approaches to tackle the problem, of which the key ones are mentioned below:

  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  • Correlation Anomaly Detection

Before explaining how these techniques were used, we first need to understand how they work.

Two of the most commonly used unsupervised clustering techniques are k-means and DBSCAN. K-means guarantees that each point will belong to one of the clusters, even if it is an outlier; that was not desirable in our case. The DBSCAN algorithm works differently, as explained below.
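To make the contrast concrete, here is a minimal sketch of the difference, assuming scikit-learn; the data and parameter values are purely illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Two tight clusters plus one obvious outlier.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.1, size=(20, 2)),
    rng.normal(5, 0.1, size=(20, 2)),
    [[10.0, 10.0]],  # the outlier
])

# k-means forces the outlier into the nearest cluster.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(kmeans_labels[-1])   # always 0 or 1: a cluster member

# DBSCAN labels points in low-density regions as noise (-1).
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(dbscan_labels[-1])   # -1: flagged as an outlier
```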

DBSCAN:

Density-based spatial clustering of applications with noise (DBSCAN) is a density-based clustering algorithm. It groups together points that are closely packed and marks points that lie alone in low-density regions as outliers. For ease of understanding, imagine a 2-dimensional dataset plotted on a graph. Intuitively, looking at the plot, you would be able to say whether a point is part of a high-density cluster, a low-density cluster, or a complete outlier. To determine whether a point belongs to a cluster, we check whether there are other points around it and how far apart those points are, i.e., the radius around the point that needs to be considered. These are the two parameters DBSCAN works on, formally described below:

  • eps: the maximum distance between two points for them to be considered neighbors. If the distance between two points is lower than or equal to this value (eps), the points are treated as neighbors.
  • min_samples: the minimum number of points required to form a cluster. For example, if this parameter is set to 8, then there must be at least 8 points to form a cluster.

Determining the values of these two parameters is very tricky. There is no automated way to do it; they need to be chosen based on the kind of data we are using.

  • eps: If the value of eps is chosen too small, there won’t be enough points within reach of one another to form a cluster, which makes most of the points outliers. If the value is too large, the majority of the points fall into one cluster and there are almost no outliers. So the value needs to be chosen wisely, and more often than not a smaller value is preferred for eps.
  • min_samples: The same goes for min_samples. If the value is too large, a cluster needs more points to form, leaving a major chunk of points as outliers. If it is too small, clusters form even around what should have been outliers. The short sketch after this list shows how sensitive the results are to these choices.
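For instance, on a toy dataset with two dense blobs and some scattered background points, varying eps alone changes the picture dramatically (a minimal sketch, assuming scikit-learn; all values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus sparse background points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.2, size=(100, 2)),
    rng.normal(5, 0.2, size=(100, 2)),
    rng.uniform(-2, 7, size=(20, 2)),  # scattered, low-density points
])

for eps in (0.05, 0.5, 5.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
# Too small an eps marks almost everything as noise; too large an eps
# swallows everything into one cluster with no outliers at all.
```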

The right values for these two parameters depend on the kind of data and on what we expect from the model. In our case, we had multiple time series to explore for abnormalities. The dataset can be pictured with one timestamp per row and one train signal per column. We aggregated a few months of data and compared various signals against each other.

To start with, we used the DBSCAN algorithm on individual signals to find outliers. Of course, it can be argued that this could have been done using the traditional approach of filtering out data beyond 1.5 IQR (interquartile range), but that approach only filters out values at either extreme and would not work on multi-dimensional time series data. For one-dimensional data, we were able to point out the data points that did not fit the overall pattern, as shown in the figure below:

Anomalies on an individual Time Series

As we can see here, the points marked as outliers are not only at the extremes; some of them are in the middle of the range as well. This works because the mid-range points occur infrequently and are therefore unable to form dense clusters given the eps and min_samples parameters.
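To illustrate the point, consider a signal that mostly sits at two levels with a couple of rare readings in between: the 1.5 IQR rule passes them, while DBSCAN flags them. This is a minimal sketch, assuming scikit-learn; all values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A signal that mostly sits at two levels, with two rare mid-range readings.
rng = np.random.default_rng(1)
series = np.concatenate([
    rng.normal(40, 0.5, size=200),
    rng.normal(60, 0.5, size=200),
    [50.0, 50.2],  # infrequent values in the middle of the range
])

# The 1.5*IQR rule only looks at the extremes, so the mid-range points pass.
q1, q3 = np.percentile(series, [25, 75])
iqr = q3 - q1
iqr_mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
print(series[iqr_mask])       # empty: nothing is flagged

# DBSCAN on the raw values flags them as noise, because too few
# neighbours fall within eps of them to form a dense cluster.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(series.reshape(-1, 1))
print(series[labels == -1])   # the 50.0 and 50.2 readings
```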

Now, when we come to examining multiple time series together, say n of them, one of the challenges is that DBSCAN calculates distances in n-dimensional space, and the value ranges of the individual series vary a lot from one another. When the Euclidean distance is calculated on such mixed scales, points can fail to form a cluster even when they are effectively close to each other. To tackle this, we normalised all the values to a range of 0–1 using min-max normalisation. This let the model compare distances on similar scales, although eps and min_samples still had to be tuned for the various time series we were comparing. A minimal sketch of this preprocessing step follows, and an example result is shown in the figure below it.
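The sketch assumes scikit-learn; the signal scales and parameter values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

# One row per timestamp, one column per signal, on very different scales.
rng = np.random.default_rng(2)
signals = np.column_stack([
    rng.normal(0.5, 0.05, size=1000),   # e.g. a ratio near 0-1
    rng.normal(3000, 100, size=1000),   # e.g. an RPM-scale reading
])

# Min-max normalisation puts every signal on a 0-1 scale, so no single
# signal dominates the Euclidean distance DBSCAN computes.
scaled = MinMaxScaler().fit_transform(signals)

labels = DBSCAN(eps=0.1, min_samples=10).fit_predict(scaled)
outlier_rows = np.flatnonzero(labels == -1)  # timestamps flagged as anomalous
print(len(outlier_rows))
```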

Anomalies on multiple signals

In this figure, we are comparing 4 time series together, and the red dots indicate the points the algorithm marked as outliers. It can be seen that the outliers occur simultaneously across the various series; put simply, there is a pattern of the anomalies happening together. This does not mean that one is caused by another, because these anomalies can have various causes, as the client’s engineering team explained. On a lighter note, here’s something funny on causation and correlation.

Source: https://xkcd.com/552/

On further exploration, we also clustered the outliers to determine the similarities among them. We used simple k-means clustering on the outliers, and clear clusters emerged, as shown in the image below. This indicates that there are similarities among the anomalies, which should help the engineers narrow down what is happening with the trains at various times.

Clustering of anomaly values
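In code, that clustering step might look like the following sketch (assuming scikit-learn; the outlier values, shapes, and cluster count are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Suppose `outliers` holds the normalised signal values of the rows
# DBSCAN flagged as anomalies; two recurring failure modes are simulated.
rng = np.random.default_rng(3)
outliers = np.vstack([
    rng.normal(0.9, 0.02, size=(30, 4)),
    rng.normal(0.1, 0.02, size=(30, 4)),
])

groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(outliers)

# Each group is a candidate "type" of anomaly for the engineers to
# investigate together.
for g in np.unique(groups):
    print(f"anomaly group {g}: {np.sum(groups == g)} points")
```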

Future Work:

Once we know what the anomalies look like, we can mark them and feed them into a semi-supervised learning algorithm; using a label propagation approach, we can then label the rest of the data points as well. As we accumulate more and more labelled data points, we can train supervised models and use them for predictive maintenance across the rail network.
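A rough sketch of what that could look like, assuming scikit-learn’s LabelPropagation; the data, label counts, and parameters are all illustrative:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# X: normalised signal rows. y: 1 for confirmed anomalies, 0 for confirmed
# normal rows, and -1 for unlabelled rows (the convention scikit-learn's
# implementation expects). All values here are simulated for illustration.
rng = np.random.default_rng(4)
X = np.vstack([
    rng.normal(0.5, 0.05, size=(480, 4)),  # mostly normal behaviour
    rng.normal(0.9, 0.05, size=(20, 4)),   # an anomalous region
])
y = np.full(500, -1)
y[:10] = 0      # a handful of hand-labelled normal rows
y[480:485] = 1  # a handful of hand-labelled anomalies

model = LabelPropagation(kernel="knn", n_neighbors=7)
model.fit(X, y)

# transduction_ holds the propagated label for every row, labelled or not.
print(np.bincount(model.transduction_))
```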

I hope this blog was useful in understanding how the DBSCAN algorithm works and how it can be applied to a particular dataset, with an eye to using it for predictive maintenance in the future.

At Northraine, we strongly believe in our motto, “Recondition the human condition”, and we undertake a variety of projects ranging from retail, manufacturing, and logistics to mental health.
