Discover Unusual Patterns in Time Series Data with Unsupervised Anomaly Detection and Isolation Forests

Phillip Wenig
10 min read · Dec 27, 2022



Anomaly Detection

Anomaly detection, also known as outlier detection, is the process of identifying unusual or unexpected observations in a dataset. These observations may be due to errors, abnormalities, or other factors that deviate from the norm and could potentially indicate a problem or issue.

In a mathematical sense, an anomaly can be thought of as an observation that is significantly different from the majority of the data. This can be measured using statistical metrics such as mean, median, and standard deviation. For example, an observation that is more than three standard deviations away from the mean could be considered an anomaly.
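As a minimal sketch, this three-sigma rule takes only a few lines of NumPy (the planted outlier value and the threshold of three standard deviations are illustrative choices, not universal rules):

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 500), [15.0]])  # 500 normal points plus one planted outlier

# Flag observations more than three standard deviations from the mean.
is_anomaly = np.abs(data - data.mean()) > 3 * data.std()
print(np.where(is_anomaly)[0])  # [500], the planted outlier (other seeds may also flag a few extreme normal points)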

There are several approaches to anomaly detection, including statistical methods, machine learning algorithms, and distance-based methods. Statistical methods rely on pre-defined thresholds to identify anomalies, while machine learning algorithms learn from the data and can adapt to changing patterns over time. Distance-based methods calculate the distance between each observation and its nearest neighbor(s) and identify observations that are significantly farther away as anomalies.
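For instance, a simple distance-based detector can score each point by its distance to its k-th nearest neighbor, here sketched with scikit-learn's NearestNeighbors on made-up data (k = 5 is an arbitrary choice):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X = rng.normal(0, 1, size=(200, 2))
X[0] = [8.0, 8.0]  # plant one obvious outlier

# Distance to the k-th nearest neighbor; isolated points are far from all of their neighbors.
k = 5
distances, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)  # +1: each point is its own nearest neighbor
scores = distances[:, -1]
print(scores.argmax())  # 0, the planted outlier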

Overall, anomaly detection is a useful tool for identifying unusual or unexpected observations in a dataset and can be applied to a wide range of problems and industries.

Anomaly Detection in Time Series

Anomaly detection in time series data is a critical task for identifying unusual events and potential problems. It can be used to detect fraudulent activity, equipment failures, healthcare issues, and security threats, among other applications. By detecting anomalies, organizations can prevent or mitigate problems, leading to cost savings and improved efficiency. In addition, anomaly detection can uncover hidden patterns and trends in time series data. There are several approaches to detecting anomalies in time series data, including unsupervised, semi-supervised, and supervised methods.

Unsupervised approaches to anomaly detection in time series data do not require any labeled data and instead rely on the characteristics of the data itself to identify anomalies. One common unsupervised approach is to slide a window over the series and compare the resulting windows with one another to find the anomalous ones.

Semi-supervised approaches to anomaly detection in time series data involve learning the behavior of normal data and using this knowledge to predict the next point in the series. The difference between the predicted and actual values can then be used to identify anomalies.
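As a rough sketch of this idea, the predictor below is just a moving average standing in for a proper model trained on normal data (for example, an ARIMA or neural forecaster); the absolute forecast error then serves as the anomaly score:

import numpy as np

def forecast_residual_scores(x: np.ndarray, w: int = 10) -> np.ndarray:
    # Predict each point as the mean of the previous w points ...
    preds = np.convolve(x, np.ones(w) / w, mode="valid")[:-1]
    # ... and score it by the absolute forecast error.
    return np.abs(x[w:] - preds)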

Supervised approaches to anomaly detection in time series data require labeled data, but this is often not possible because all possible anomalies are not known in advance. As a result, supervised approaches are less commonly used for anomaly detection in time series data.

Overall, there are several approaches to detecting anomalies in time series data, each with its own strengths and limitations. Choosing the right approach depends on the specific needs and characteristics of the data and the problem being solved.

In this article, we will explore the isolation forest anomaly detector and how it can be used to identify unusual patterns in time series data. We will also demonstrate how to implement a basic version of the algorithm in Python. By the end of this article, you will have a solid understanding of how to use the isolation forest algorithm for anomaly detection in time series data.

Isolation Forests

Isolation forests are an unsupervised machine learning algorithm that can be used for anomaly detection. They work by building multiple decision tree models, called isolation trees, each on a random subsample of the data; this subsampling lets the model process the data efficiently and reduces the risk of overfitting. To grow a tree, the algorithm selects a random feature, chooses a random split value between that feature’s minimum and maximum, and splits the data into two branches. This process is repeated recursively for each new branch until a point is isolated or a maximum depth is reached.

After the isolation trees are generated, the algorithm calculates, for each data point, the length of the path from the root of each tree to the leaf containing that point. Anomalies are easier to isolate and therefore tend to end up in shallow leaves, so a short average path length across all trees signals an anomaly. The anomaly score is accordingly defined as the inverse of the average path length: observations with a lower anomaly score are more “normal” and are less likely to be considered anomalies.
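For reference, the original paper (Liu et al., 2008) normalizes the average path length E(h(x)) by c(n), the average path length of an unsuccessful search in a binary search tree built on n points, and defines the score as s(x, n) = 2^(−E(h(x)) / c(n)) with c(n) = 2H(n−1) − 2(n−1)/n, where H(i) is the i-th harmonic number. Scores close to 1 then indicate anomalies, while scores well below 0.5 indicate normal points. The simplified implementation later in this article skips this normalization and uses the plain reciprocal of the average path length instead.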

Overall, isolation forests are a powerful tool for detecting anomalies in data by identifying observations that are isolated from the rest of the data. They are particularly effective for identifying anomalies in high-dimensional data and can be applied to a wide range of problems and industries.
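Before implementing the algorithm from scratch, it is worth knowing that scikit-learn ships a ready-made estimator. A minimal usage sketch on made-up data:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 3))
X[0] = [10.0, 10.0, 10.0]  # plant one obvious outlier

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = forest.predict(X)         # +1 for inliers, -1 for outliers
scores = forest.score_samples(X)   # the lower, the more anomalous
print(labels[0], scores.argmin())  # -1 0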

Applying Isolation Forests to Time Series Data

A sliding window is a common technique for analyzing time series data. It involves dividing the data into overlapping or non-overlapping chunks, or “windows,” and analyzing each window separately. This can be useful for identifying patterns or trends that may not be apparent when analyzing the entire dataset.

To use a sliding window with the isolation forest algorithm for anomaly detection, you can stack the windows as the rows of a data matrix and apply the algorithm to that matrix, treating each window as a single multi-dimensional observation.
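For example, numpy’s sliding_window_view (available since NumPy 1.20) turns a one-dimensional series into exactly such a matrix, whose rows are the overlapping windows:

import numpy as np

x = np.arange(6)  # [0 1 2 3 4 5]
windows = np.lib.stride_tricks.sliding_window_view(x, 3)
print(windows)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]
#  [3 4 5]]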

It is important to carefully choose the size and overlap of the sliding window based on the characteristics of your data and the specific goals of your analysis. Using a sliding window can be a useful technique for improving the performance of the isolation forest algorithm on time series data.

from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Optional

import numpy as np


@dataclass
class IsolationTree:
    X: np.ndarray        # data the (sub)tree is built on
    indices: np.ndarray  # original indices of the rows in X
    max_depth: int       # remaining depth budget
    left: Optional[IsolationTree] = None
    right: Optional[IsolationTree] = None
    split_feature: Optional[int] = None
    split_value: Optional[float] = None

    def __post_init__(self):
        # Stop splitting once a point is isolated or the depth budget is used up.
        if self.X.shape[0] <= 1 or self.max_depth <= 0:
            return

        # Pick a random feature and a random split value within its range.
        self.split_feature = np.random.randint(self.X.shape[1])
        self.split_value = np.random.uniform(
            self.X[:, self.split_feature].min(),
            self.X[:, self.split_feature].max(),
        )

        left_indices = self.X[:, self.split_feature] < self.split_value
        right_indices = self.X[:, self.split_feature] >= self.split_value

        self.left = IsolationTree(self.X[left_indices], self.indices[left_indices], self.max_depth - 1)
        self.right = IsolationTree(self.X[right_indices], self.indices[right_indices], self.max_depth - 1)

    def path_lengths(self) -> Dict[int, int]:
        # A leaf contributes a path length of 1 for every point it contains.
        if self.left is None and self.right is None:
            return {idx: 1 for idx in self.indices}

        # Otherwise, merge the children's path lengths and add this level.
        left_path_lengths = self.left.path_lengths()
        right_path_lengths = self.right.path_lengths()
        path_lengths = {**left_path_lengths, **right_path_lengths}
        return {idx: path_lengths[idx] + 1 for idx in self.indices}

The IsolationTree class represents a single isolation tree, which is a decision tree used by the isolation forest algorithm for anomaly detection. It has several attributes:

  • X: a numpy array of the data that the tree is built on
  • indices: a numpy array of indices corresponding to the rows of X
  • max_depth: an integer representing the maximum depth of the tree
  • left: an IsolationTree object representing the left child of the current tree, or None if the tree is a leaf
  • right: an IsolationTree object representing the right child of the current tree, or None if the tree is a leaf
  • split_feature: an integer representing the feature that the tree uses to split the data (or None if the tree is a leaf)
  • split_value: a float representing the value that the tree uses to split the data (or None if the tree is a leaf)

The __post_init__ method is called after the object is initialized and is used to build the tree. If the X array has fewer than 2 rows or the max_depth is less than 1, the data is not split further. Otherwise, it selects a random feature and value to use for splitting the data, and creates two new IsolationTree objects for the left and right branches using the data that falls below and above the split value, respectively.

The path_lengths method returns a dictionary that maps each index in the indices array to the length of the path from the root of the tree to the leaf containing the corresponding data point. If the tree is a leaf, it returns a dictionary with each index mapped to 1. Otherwise, it combines the dictionaries returned by the path_lengths method of the left and right child trees, and increments the value for each index by 1.

The isolation_forest_for_time_series function is a simple implementation of the isolation forest algorithm for anomaly detection on time series data. It takes as input a numpy array X of the time series data, an integer window_size representing the size of the sliding window to use, an integer n_trees representing the number of isolation trees to build, an integer max_depth representing the maximum depth of the trees, and a float sample_frac representing the fraction of the data to use when building each tree.

from collections import defaultdict

import numpy as np


def isolation_forest_for_time_series(X: np.ndarray, window_size: int, n_trees: int, max_depth: int, sample_frac: float) -> np.ndarray:
    # View the series as overlapping windows without copying the data.
    windows = np.lib.stride_tricks.sliding_window_view(X, window_size, axis=0)

    path_lengths_sum = defaultdict(int)
    path_lengths_counts = defaultdict(int)
    for _ in range(n_trees):
        # Subsample the windows without replacement, as in the original algorithm.
        data_sample = np.random.choice(windows.shape[0], int(windows.shape[0] * sample_frac), replace=False)
        tree = IsolationTree(windows[data_sample], data_sample, max_depth)
        for idx, path_length in tree.path_lengths().items():
            path_lengths_sum[idx] += path_length
            path_lengths_counts[idx] += 1

    # Average path length per window (assumes every window was sampled at least once).
    anomaly_scores = np.array([path_lengths_sum[idx] / path_lengths_counts[idx] for idx in range(windows.shape[0])])
    # Short average paths mean easy isolation, i.e. anomalies, so invert.
    return np.reciprocal(anomaly_scores)

The function first creates a sliding window view of the data using the numpy.lib.stride_tricks.sliding_window_view function, which allows it to efficiently process the data in windows without creating copies. It then initializes empty dictionaries to keep track of the sum and number of path lengths for each data point.

In a loop, the function builds n_trees isolation trees using a random subset of the data, and calculates the path lengths of each data point in the tree using the path_lengths method of the IsolationTree class. It updates the dictionaries with the sum and number of path lengths for each data point.

Finally, the function calculates the anomaly score for each data point by dividing the sum of its path lengths by their count and taking the reciprocal, so that points with short average paths (those that are easy to isolate) receive high scores. It returns the anomaly scores as a numpy array.

This function can be used to detect anomalies in time series data by applying the isolation forest algorithm in a sliding window fashion. You can pass in the time series data, along with the desired window size, number of trees, tree depth, and sample fraction, and the function will return the anomaly scores for each window of the data.

Applying the Isolation Forest on Example Data

In this section, we apply the isolation_forest_for_time_series function to a synthetic time series that consists of a sine wave with a plateau in the middle. The sine wave is chosen to illustrate the ability of the isolation forest algorithm to identify anomalies in periodic patterns, and the plateau is included to show how the algorithm handles a deviation from the normal behavior.

The code begins by importing the necessary modules and generating a synthetic time series with a sine wave and a plateau in the middle. It then calls the isolation_forest_for_time_series function with the time series data and several hyperparameters as arguments:

  • window_size: the size of the sliding window to use when applying the isolation forest algorithm to the time series.
  • n_trees: the number of isolation trees to build in the ensemble.
  • max_depth: the maximum depth of each tree in the ensemble.
  • sample_frac: the fraction of the data to use when building each tree.

This generates an array of anomaly scores for each window of the time series. Next, the code scales the anomaly scores using the MinMaxScaler from sklearn.preprocessing to bring them into the same range as the original time series. It then plots both the scaled anomaly scores and the original time series on the same graph using matplotlib.pyplot. Finally, it displays the graph using the show function.

The hyperparameters of the isolation forest algorithm are crucial for its performance and can be tuned to achieve better results on a particular dataset. In this code, the hyperparameters are chosen to illustrate the basic behavior of the algorithm, but in a real-world application, you may need to experiment with different values to find the optimal settings for your data. A good starting point for the window size is the period length of the dominant pattern or the expected length of the anomalies.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sine wave with a plateau anomaly in the middle.
X = np.sin(np.linspace(0, 200, 1000))
X[400:425] = 0

# window_size=25, n_trees=100, max_depth=10, sample_frac=0.5
anomaly_scores = isolation_forest_for_time_series(X, 25, 100, 10, 0.5)

# Scale the scores into [0, 1] so they can share a plot with the series.
scaled_anomaly_scores = MinMaxScaler().fit_transform(anomaly_scores.reshape(-1, 1)).flatten()

# Align each window's score with the center of its window (976 windows of size 25).
plt.plot(list(range(12, 988)), scaled_anomaly_scores, label="anomaly score")
plt.plot(X, label="original")
plt.legend()
plt.show()

One interesting result that can be observed in the graph is that the isolation forest algorithm tends to identify the transitions from normal to anomalous and from anomalous to normal as especially anomalous. This can be seen by the spikes in the anomaly scores at the beginning and end of the plateau in the time series.

When the time series transitions from a normal pattern to an anomalous one, or vice versa, this represents a sudden change in the data that is likely to be unusual or unexpected. As a result, the algorithm is more likely to consider these transitions as anomalies.

This result highlights the importance of considering the context of the data when using the isolation forest algorithm for anomaly detection. In some cases, it may be necessary to adjust the hyperparameters or use additional preprocessing techniques to smooth out these transitions and prevent them from being incorrectly identified as anomalies.
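One simple option, among many, is to smooth the anomaly scores with a short moving average before thresholding, which damps isolated spikes at the transitions (the kernel width k = 9 here is an arbitrary starting point):

import numpy as np

def smooth_scores(scores: np.ndarray, k: int = 9) -> np.ndarray:
    # Moving-average smoothing; mode="same" keeps the output length (with edge effects).
    return np.convolve(scores, np.ones(k) / k, mode="same")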

Figure: Anomaly score of the isolation forest algorithm on a time series

Conclusion

In conclusion, the isolation forest algorithm is a powerful tool for detecting anomalies in time series data. By building an ensemble of isolation trees and calculating the path lengths of each data point to the root of each tree, the algorithm is able to identify deviations from normal behavior and score them as anomalies.

In this article, we have explored how to use the isolation forest algorithm for time series anomaly detection, including how to implement a basic version of the algorithm in Python and how to apply it to a synthetic time series with a sine wave and a plateau. We have also discussed the hyperparameters of the algorithm and how they can be tuned to achieve better results.

Overall, the isolation forest algorithm is a useful tool for identifying unusual patterns in time series data, and can be applied to a wide range of applications, such as fraud detection, cybersecurity, and machine health monitoring. With its simplicity and effectiveness, it is a valuable addition to any data scientist’s toolkit.

References

Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. “Isolation Forest.” In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM ’08), 2008.

Disclaimer

Please note that this blog post was written with the assistance of the language processing tool ChatGPT. While we have made every effort to ensure the accuracy and thoroughness of the information presented, the content of this blog post should not be taken as professional advice or as a substitute for appropriate professional guidance. Please use your best judgement and consult with qualified professionals before making any decisions based on the information provided in this blog post.


Phillip Wenig

I am a PhD student drinking coffee and philosophizing about AI, algorithms, concepts, etc.