Anomaly Detection in Time Series using ChatGPT

How to explore and evaluate a data analysis topic with an automated conversational framework?

István Szatmári
15 min read · Mar 18, 2023

I guess you have already heard about ChatGPT, or rather, I'm sure you've heard of this chatbot: you can ask it questions interactively, and it answers them and suggests solutions to complex problems in conversational form.

In this post, I will write about how to use ChatGPT to learn about different techniques for anomaly detection in time series data.

Motivations

In the past, when I was looking for information about something, I typically started by searching for the topic using either a) Google Scholar or b) Google Image Search. (In the previous sentence, the links specifically searched for the term “anomalies in time series”.) I quickly got a couple of articles, and among the images I might have found something similar to my specific problem. Of course, it is also worth searching here on Medium and on Towards Data Science as well. Naturally, different search terms can give different and additional results (e.g. “anomaly detection in time series” might produce more relevant results). After spending many hours, I could develop a general picture of the topic and read about typical problems and various challenges. And if I was lucky, I even found a few examples that I could use for my own experiments. Why not use ChatGPT interactively for this?

So my motivations are:

  • General: as a data analysis team leader, I have to support our young colleagues with introductory materials, describe problems we deal with, discuss possible approaches, and evaluate their solutions and results.
  • Specific: Recently I wrote a post about change point detection; now I would like to clarify the concepts and the difference between them, as well as get some hands-on experience. I like to start with examples and dig deeper and deeper into the topic.
  • It’s fun to have conversations with ChatGPT and experiment with it.

What you will get:

  • What are anomalies in time series and how do they differ from change points?
  • What are the different types of anomalies?
  • What algorithms and techniques can we use to detect anomalies?
  • Some practical examples and tips that you can hopefully apply to your own problems.

Example

QUERY: Create an example to demonstrate anomaly detection in time series data. Use Python without any specific package.

Remark: When I asked the same question in different sessions, the answers were different each time; sometimes it used a specific package (e.g. Prophet), sometimes nothing beyond the NumPy and Pandas packages. That is why I specifically asked it not to use any.

So the answer was:

ChatGPT: … for this example, I’ll be using the Numenta Anomaly Benchmark (NAB) dataset, which contains a variety of real-world time series data with labeled anomalies. Specifically, I’ll be using the “machine_temperature_system_failure.csv” dataset, which consists of temperature readings from a machine in an industrial setting.

To perform anomaly detection on this dataset, I’ll be using a simple statistical method called the Moving Average (MA) technique. The MA technique calculates the moving average of the data over a specified time window and detects anomalies based on deviations from the moving average.

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/machine_temperature_system_failure.csv')

# Convert the timestamp column to a datetime object
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Calculate the moving average of the temperature readings
window_size = 200 # MODIFICATION, original was 50
ma = df['value'].rolling(window_size).mean()

# Calculate the deviation from the moving average
deviation = df['value'] - ma

# Calculate the standard deviation of the deviation
std_deviation = deviation.rolling(window_size).std()

# Calculate the threshold for anomaly detection
threshold = 3 * std_deviation

# Detect anomalies based on deviations from the moving average
anomalies = df[deviation.abs() > threshold]

# Plot the temperature readings and the anomalies
plt.subplots(figsize=(14, 10)) # MODIFICATION, inserted
plt.plot(df['timestamp'], df['value'], color='blue', label='Temperature Readings')
plt.scatter(anomalies['timestamp'], anomalies['value'], color='red', label='Anomalies')
plt.plot(df['timestamp'], ma, color='green', label='Moving Average')
plt.fill_between(df['timestamp'], ma-threshold, ma+threshold, color='gray', alpha=0.2, label='Threshold')
plt.legend()
plt.title('Machine Temperature Anomaly Detection')
plt.xlabel('Date')
plt.ylabel('Temperature (Celsius)')
plt.grid() # MODIFICATION, inserted
plt.show()

Remark: I made some simple modifications, changed the time window parameter, and inserted some lines to improve the plot.

Anomaly detection example

Even with this simple example, we got a lot:

  • We could access some real data using the NAB dataset. You can either ask ChatGPT: “What is the Numenta Anomaly Benchmark (NAB) dataset and how can I access it?” or use Google to get information about this dataset, find code examples and publications, and find it on Kaggle, …
  • We have seen a simple technique: a moving average calculated with a sliding window, with a threshold set around it based on the deviation of the data from that average.
  • We got some useful visualization tips, such as displaying the anomaly detection threshold as a gray-shaded area.
  • Anyway, it was really fast.

The results weren't very good, but we don't have to worry much about that at the moment.

We can ask about setting the parameters:

Asking about setting the parameters

OK, now we know (if we didn't know before) that setting the threshold at three standard deviations is a commonly used rule of thumb in statistics and data analysis for identifying extreme values and outliers. Let's look at some variations.

Z-score

For example, if we calculate the deviations from the mean and compare them to the standard deviation of the data values, then we are basically applying the Z-score. The Z-score tells us how many standard deviations a given value is from the mean. Modifying the code, we can use it as follows:

# Calculate the average of the temperature readings
mean = df['value'].mean()

# Calculate the deviation from the average
deviation = df['value'] - mean

# Calculate the standard deviation of the deviation
std_deviation = deviation.std() # same as we would calculate df['value'].std()

# Calculate Z-score - we don't use it directly, for reference only
z_score = deviation / std_deviation # z_score = (value-mean)/std

# Calculate the threshold for anomaly detection
threshold = 3 * std_deviation

# Detect anomalies based on deviations from the mean
anomalies = df[deviation.abs() > threshold]
Anomaly detection based on Z-score

Percentiles

We can set global upper and lower limits separately based on the percentage distribution of the data values. Loosely speaking, a percentile indicates that a certain percentage of the data points fall below it. For instance, the 50th percentile is the median. To set lower and upper bounds that enclose 99.7% of the data, we can use the following code block:

# calculate percentile limits
percent_limit = 0.3 # 0.3% of the data falls outside the bounds, i.e. 99.7% inside
upper_threshold = df['value'].quantile((100-percent_limit/2)/100)
lower_threshold = df['value'].quantile(percent_limit/2/100)

# Detect anomalies based on percentile limits
anomalies_upper = df[df['value'] > upper_threshold]
anomalies_lower = df[df['value'] < lower_threshold]
Anomaly detection based on percentiles

Now, it’s really time to clarify the concept of anomaly.

Anomaly detection

We start right away by asking about the difference between anomaly detection and change point detection:

Difference between anomaly detection and change point detection

The answer is a bit verbose; an extract of it is as follows:

Anomaly detection in time series:

  • Anomalies can indicate potential problems in the underlying system or process that generates the data.
  • It involves identifying data points or events that deviate significantly from the expected or normal behavior of the time series.

Change point detection in time series:

  • Change points can indicate significant transitions in the underlying system or process that generates the data.
  • It involves identifying points in time where the statistical properties of the time series change abruptly.

The main difference is that anomaly detection focuses on identifying individual data points or events that deviate from the normal (or expected) behavior, whereas change point detection focuses on identifying points in time where the statistical properties of the time series change.

A simple, easy-to-remember formulation for the distinction:

Persistent anomalies are change points.

Imagine a system whose measurements you monitor as a streaming time series. At first, you detect a change as an anomaly (which can be a single point or even a pattern). However, if the data points keep deviating from the system's usual behavior, then the system itself may have changed, so you should treat the anomaly as a change point.
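A minimal sketch of this idea in Python (the helper function and the min_run parameter are illustrative, not taken from any ChatGPT answer): given a boolean array of per-point anomaly flags, a deviation that persists for at least min_run consecutive samples gets promoted to a change point.

def promote_persistent_anomalies(anomaly_flags, min_run=30):
    # Toy illustration: if anomalies persist for at least `min_run` consecutive
    # samples, report the start of that run as a change point.
    change_points = []
    run_start, run_length = None, 0
    for i, is_anomaly in enumerate(anomaly_flags):
        if is_anomaly:
            if run_length == 0:
                run_start = i
            run_length += 1
            if run_length == min_run:
                change_points.append(run_start)
        else:
            run_length = 0
    return change_points

# Example usage: reuse the boolean condition from the moving-average example above
# flags = (deviation.abs() > threshold).to_numpy()
# print(promote_persistent_anomalies(flags, min_run=30))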

Types of anomalies

Another interesting question is about the types of anomalies. Typically we distinguish point and pattern anomalies; the latter is also known as a subsequence anomaly. In addition, we can distinguish contextual anomalies, where some observation(s) are considered abnormal only in context, for instance relative to neighboring values.

QUERY: What types of anomalies are in time series?

ChatGPT: In time series analysis, there are several types of anomalies that can occur, including:

1. Point anomalies: A point anomaly is a single data point that is significantly different from the rest of the data. It can be caused by measurement errors, system malfunctions, or other factors.

2. Contextual anomalies: Contextual anomalies are data points that are not unusual on their own, but become anomalous in a particular context. For example, a sudden spike in website traffic on a holiday is not unusual, but a similar spike on a regular weekday could be an anomaly.

3. Collective anomalies: Collective anomalies are groups of data points that are anomalous when considered together. They can be caused by changes in the underlying system or by external factors affecting the system.

4. Seasonal anomalies: Seasonal anomalies occur when the time series data show regular patterns or cycles, but a particular cycle deviates significantly from the expected pattern. For example, unusually high sales during a non-peak season would be a seasonal anomaly.

5. Cyclical anomalies: Cyclical anomalies occur when the time series data shows long-term cycles or trends that deviate significantly from the expected pattern. These anomalies are often caused by external factors such as economic recessions or natural disasters.

Remark: In the definition of collective anomaly, we can recognize the concept of pattern anomaly.

Now I think the concepts are clear, so I generated an example for visualization showing both anomalies and change points:

Time series example with change points and anomalies
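A comparable synthetic series, with a level-shift change point plus a few injected point anomalies and a short anomalous pattern, can be produced in a few lines; a minimal sketch (all values below are illustrative):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Baseline: two regimes with different means, i.e. a change point at index 500
values = np.concatenate([
    rng.normal(10, 0.5, 500),   # regime 1
    rng.normal(14, 0.5, 500),   # regime 2
])

# Inject two point anomalies and one short pattern (collective) anomaly
values[120] += 6
values[740] -= 5
values[300:320] += np.linspace(0, 4, 20)

series_demo = pd.Series(values, index=pd.date_range('2023-01-01', periods=1000, freq='5min'))

plt.figure(figsize=(14, 5))
plt.plot(series_demo.index, series_demo.values, label='Synthetic signal')
plt.axvline(series_demo.index[500], color='orange', linestyle='--', label='Change point')
plt.scatter(series_demo.index[[120, 740]], series_demo.iloc[[120, 740]], color='red', label='Point anomalies')
plt.legend()
plt.grid()
plt.show()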

Algorithms for anomaly detection

So far we have seen an example of applying a simple moving average technique, then we mentioned statistical methods. Let’s see what machine learning algorithms can be used for anomaly detection.

QUERY: What specific machine learning algorithms are for anomaly detection?

I combined various answers given in different sessions.

ChatGPT: There are several machine learning algorithms that can be used for anomaly detection in time series data.

1. Isolation Forest: it is an unsupervised machine learning algorithm that uses decision trees to isolate anomalies from the rest of the data. It works by randomly partitioning the data into subsets until each data point is isolated in its own partition. The fewer partitions a data point needs to be isolated, the more anomalous it is considered to be.

2. Local Outlier Factor (LOF): LOF is a density-based anomaly detection algorithm that measures the local density of a point with respect to its neighbors. Points with a lower density than their neighbors are considered to be outliers.

3. One-Class Support Vector Machine (SVM): One-Class SVM is a type of SVM that is trained on only one class of data. It learns a decision boundary that separates the training data from the origin, and any point on the other side of the boundary is considered an anomaly.

4. LSTM-based models: Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) that is well suited for modeling sequential data. LSTM-based models can be trained to detect anomalies in time series data by learning the patterns and relationships in the data over time.

5. Autoencoder: An autoencoder is a neural network that can be trained to reconstruct the input data. When an anomaly is detected, the reconstructed output will be significantly different from the input data.

6. ARIMA (AutoRegressive Integrated Moving Average): ARIMA is a time series forecasting method that can be used to detect anomalies. It models the time series as a combination of autoregressive (AR), integrated (I), and moving average (MA) components. Anomalies can be detected by comparing the predicted values with the actual values.

7. Seasonal Hybrid ESD (S-H-ESD): S-H-ESD is a variant of the ESD algorithm that can be used to detect anomalies in seasonal time series. It takes into account the seasonal patterns in the data and uses a hybrid approach that combines the ESD algorithm with a seasonal decomposition of the time series.

Isolation Forest

Isolation Forest is an unsupervised machine learning algorithm. It uses an ensemble of decision trees to isolate anomalous points in the data. We can ask ChatGPT for an explanation of how the algorithm works, and then for a code example:

QUERY: Create a python function that detects anomalies in time series by applying the Isolation Forest algorithm. The function input is a Pandas series.

import pandas as pd
from sklearn.ensemble import IsolationForest

def detect_anomalies_with_isolation_forest(series):
    # Convert the series to a 2D NumPy array
    data = series.values.reshape(-1, 1)

    # Create an instance of the IsolationForest class
    #model = IsolationForest(n_estimators=100, contamination='auto', random_state=42)
    model = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)

    # Fit the model to the data and predict anomalies
    model.fit(data)
    anomalies = model.predict(data)

    # Convert the predictions back to a Pandas series and return it
    anomalies_series = pd.Series(anomalies, index=series.index)
    return anomalies_series

Plot the result:

# Set the timestamp column as the index and convert to a series
series = df.set_index('timestamp')['value'].squeeze()

# Detect anomalies using the Isolation Forest algorithm
anomalies = detect_anomalies_with_isolation_forest(series)

# Plot the original series and the detected anomalies
plt.subplots(figsize=(14, 10))
plt.plot(df['timestamp'], df['value'], color='blue', label='Temperature Readings')
plt.scatter(anomalies[anomalies==-1].index, series[anomalies==-1].values, color='red', label='Anomalies')
plt.legend()
plt.title('Machine Temperature Anomaly Detection - Isolation Forest')
plt.xlabel('Date')
plt.ylabel('Temperature (Celsius)')
plt.grid()
plt.show()

The result (parameter contamination ‘auto‘ was replaced and set to 0.1):

Isolation Forest example

We can observe that, unlike with the Z-score method, the detected points match local deviations better.

We can ask for an explanation of Isolation Forest parameters:
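Roughly, the parameters most worth experimenting with in scikit-learn's IsolationForest are n_estimators, max_samples, and contamination; a minimal sketch (the values and the placeholder data below are illustrative):

import numpy as np
from sklearn.ensemble import IsolationForest

data = np.random.default_rng(42).normal(size=(1000, 1))  # placeholder data for illustration

model = IsolationForest(
    n_estimators=200,    # number of trees; more trees give more stable anomaly scores
    max_samples='auto',  # samples drawn to build each tree (min(256, n_samples) by default)
    contamination=0.05,  # expected fraction of anomalies; drives the decision threshold
    random_state=42,     # for reproducibility
)
model.fit(data)
scores = model.decision_function(data)  # lower scores are more anomalous
labels = model.predict(data)            # -1 = anomaly, 1 = normal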

Local Outlier Factor

Local Outlier Factor (LOF) is an unsupervised machine learning algorithm for anomaly detection that works by measuring the local density of each data point and comparing it to the densities of its neighbors. It is based on the idea that anomalous data points are often located in low-density regions of the feature space. Asking for a code example, as before, we get:

import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

def detect_anomalies_with_local_outlier(series):
    #lof = LocalOutlierFactor(n_neighbors=10, contamination='auto')
    lof = LocalOutlierFactor(n_neighbors=40, contamination=0.01)
    X = series.values.reshape(-1, 1)
    y_pred = lof.fit_predict(X)
    anomalies = X[y_pred == -1]
    return pd.Series(anomalies.flatten(), index=series.index[y_pred == -1])

# Detect anomalies using the Local Outlier Factor algorithm
anomalies = detect_anomalies_with_local_outlier(series)

# Plot the original series and the detected anomalies
plt.subplots(figsize=(14, 10))
plt.plot(df['timestamp'], df['value'], color='blue', label='Temperature Readings')
plt.scatter(anomalies.index, anomalies.values, color='red', label='Anomalies')
plt.legend()
plt.title('Machine Temperature Anomaly Detection - Local Outlier Factor')
plt.xlabel('Date')
plt.ylabel('Temperature (Celsius)')
plt.grid()
plt.show()
Local Outlier Factor example

The two main parameters are n_neighbors and contamination:
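As a minimal sketch (the values and the placeholder data below are illustrative): n_neighbors controls the neighborhood size used to estimate local density, and contamination sets the expected fraction of outliers, which determines the cut-off.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

data = np.random.default_rng(0).normal(size=(1000, 1))  # placeholder data for illustration

lof = LocalOutlierFactor(
    n_neighbors=40,      # larger neighborhoods smooth out local density fluctuations
    contamination=0.01,  # expected fraction of outliers
)
labels = lof.fit_predict(data)         # -1 = outlier, 1 = inlier
scores = lof.negative_outlier_factor_  # more negative means more anomalous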

Autoencoder algorithm

Autoencoders are unsupervised machine learning models based on neural networks. An autoencoder has two parts: a) an encoder and b) a decoder. The encoder takes an input and maps it to a lower-dimensional representation; the decoder takes this representation and tries to reconstruct the original input. We can also ask about the key parameters.

For anomaly detection, we need training data without anomalies; during training, the autoencoder is optimized to minimize the difference between the input and the reconstructed output.

Applying the trained model to test data, anomalies are then detected as data points with large reconstruction errors.

The code block generated by ChatGPT was this:

import numpy as np
import pandas as pd
from tensorflow import keras

def detect_anomalies_with_autoencoder(series, window_size=20, latent_dim=3, epochs=100):
    # Prepare the input data
    X = []
    for i in range(len(series) - window_size):
        X.append(series[i:i+window_size])
    X = np.array(X)

    # Define the autoencoder architecture
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(window_size,)),
        keras.layers.Dense(latent_dim, activation='relu'),
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dense(window_size, activation='linear')
    ])

    # Train the autoencoder
    model.compile(optimizer='adam', loss='mse')
    model.fit(X, X, epochs=epochs, verbose=0)

    # Use the trained autoencoder to detect anomalies
    X_pred = model.predict(X)
    mse = np.mean(np.power(X - X_pred, 2), axis=1)
    threshold = np.percentile(mse, 95)
    anomalies = series.iloc[window_size:][mse >= threshold]

    return anomalies

The main problem with this code is that it is not separated into training and test phases, so the model learns the entire data set together with the anomalies. There are also other aspects to consider related to the model architecture: the number of input neurons, the number and sizes of the encoding and decoding layers, regularization techniques, … However, this is rather our task 😊.
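A minimal sketch of how the two phases could be separated, reusing the series object built earlier; the slice boundaries, the helper function, and the threshold percentile are illustrative assumptions, not values from the original answer:

import numpy as np
from tensorflow import keras

def make_windows(s, window_size):
    # Build overlapping windows of length `window_size` from a Pandas series
    return np.array([s.values[i:i + window_size] for i in range(len(s) - window_size)])

window_size = 20
train_series = series.loc['2013-12-18':'2014-01-10']  # region assumed to be anomaly-free
X_train = make_windows(train_series, window_size)
X_all = make_windows(series, window_size)

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(window_size,)),
    keras.layers.Dense(3, activation='relu'),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(window_size, activation='linear'),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, X_train, epochs=100, verbose=0)  # train only on the (assumed) clean region

# Threshold taken from the training reconstruction error, then applied to the whole series
train_mse = np.mean(np.square(X_train - model.predict(X_train)), axis=1)
all_mse = np.mean(np.square(X_all - model.predict(X_all)), axis=1)
threshold = np.percentile(train_mse, 99)
anomalies = series.iloc[window_size:][all_mse >= threshold]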

If we are not familiar with the autoencoder model and architecture, we can ask for an explanation of the generated function.

Specifically, we can also ask about the choice of optimization.
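The generated function above already uses the Adam optimizer with a mean squared error reconstruction loss; spelled out more explicitly (the learning rate below is simply the Keras default, shown for clarity), the relevant part looks like this:

from tensorflow import keras

window_size = 20  # same as in the generated function above

# A minimal stand-in for the autoencoder defined above, just to show the compile call
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(window_size,)),
    keras.layers.Dense(3, activation='relu'),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(window_size, activation='linear'),
])

# Equivalent to optimizer='adam', loss='mse', with the optimizer made explicit
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss='mse')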

OK, then let's look at the results. Experimenting only with the window_size parameter, one of the experiments gave the following:

Autoencoder example

Now we might want to learn more about this data in order to choose regions for training. Looking for publications, I found information in one of them about the anomalies assigned to this time series.

Anomalies assigned to machine temperature values

“The first anomaly was a planned shutdown. The third anomaly was a catastrophic system failure. The second anomaly, a subtle but observable change in the behavior, indicated the actual onset of the problem that led to the eventual system failure. The anomalies were hand-labeled by an engineer working on the machine.” [1]

[1] S. Ahmad, A. Lavin, S. Purdy, and Z. Agha, “Unsupervised real-time anomaly detection for streaming data,” Neurocomputing, vol. 262, pp. 134–147, Nov. 2017, DOI: 10.1016/j.neucom.2017.04.070.

So we could select regions for training, change the structure, parameters, …

ARIMA

ARIMA (Autoregressive Integrated Moving Average) is a linear modelling technique that models time series data as a combination of autoregressive (AR), integrated/differencing (I), and moving average (MA) components, and is mainly used to make predictions on time series data. We can also ask ChatGPT for an explanation.

Regenerating ChatGPT's responses, we might get additional useful information for further research, such as the following:

An ARIMA model can be further extended with seasonal components resulting in the SARIMA model, which includes additional parameters for seasonal autoregression, seasonal integration, and seasonal moving average.
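In statsmodels this is available as the SARIMAX class; a minimal sketch (reusing the series object from before; the non-seasonal and seasonal orders are illustrative, not tuned for this dataset):

from statsmodels.tsa.statespace.sarimax import SARIMAX

# s = 288 corresponds to a daily cycle at 5-minute sampling; fitting with such a
# long seasonal period can be slow, so treat this purely as an illustration.
sarima = SARIMAX(series.values, order=(1, 1, 1), seasonal_order=(1, 0, 1, 288))
sarima_fit = sarima.fit(disp=False)
forecast = sarima_fit.forecast(steps=48)  # predict the next 48 samples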

However, we must be careful not to just “blindly” copy the code blocks generated by ChatGPT. The next block shows such a counterexample (with some corrections to avoid runtime errors (sic!) and a comment highlighting that only a single prediction value is used):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

# Split the data into training and testing sets
train_series = series.loc['2013-12-18':'2014-01-26']
test_series = series.loc['2014-01-22':] # some overlapping - intentionally

# ---------------------- CHATGPT ---------------------- :
# Fit an ARIMA model to the training data
model = ARIMA(train_series.values, order=(3,1,1))
model_fit = model.fit()

# Make predictions on the testing data
predictions = model_fit.forecast(steps=len(test_series))[0] # [0] keeps only the first forecast value, i.e. a single-step prediction

# Calculate the residuals between the predicted and actual values
residuals = test_series.values - predictions

# Calculate the mean and standard deviation of the residuals
residual_mean = np.mean(residuals)
residual_std = np.std(residuals)

# Determine the threshold for anomaly detection
threshold = residual_mean + 2.5 * residual_std

# Identify anomalies in the testing data
anomalies = test_series[abs(residuals) > threshold]
# ---------------------- CHATGPT ----------------------

As I said before, applying it “blindly”, one could think that we have found an almost “perfect” solution 😂. See some explanations in the code (you can find the link at the end of this post).

ARIMA example

Remark: the overlap between the training and test regions was intentional to “see” how the model behaves on the training data as well.

However, in order to use the ARIMA model properly, we would need more careful preparation: a preliminary examination of the statistical characteristics, checking whether the characteristics of the stochastic process change, applying trend and seasonality compensation accordingly, choosing the length of the predictions (single-step or multi-step), tuning the parameters of the model, …
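For example, a quick stationarity check on the training region could look like this minimal sketch, using the Augmented Dickey-Fuller test from statsmodels (a high p-value suggests differencing, the “I” in ARIMA, is needed):

from statsmodels.tsa.stattools import adfuller

result = adfuller(train_series.values)
print(f'ADF statistic: {result[0]:.3f}, p-value: {result[1]:.3f}')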

I think we now have an idea of how to use ChatGPT to explore a specific topic. Of course, these are only the first steps; I haven't covered a lot of other techniques (personally, I often use spectral analysis to extract features), and further searches and iterative experiments are needed, not only ChatGPT queries. Moreover, I recommend that you don't rely solely on this tool: as I wrote at the beginning of the post, there are several useful portals where you can do research. However, this tool can genuinely speed up the process.

Summary and Conclusion

  • I have tried to demonstrate how to carry out research on a particular data analysis topic with the help of ChatGPT.
  • We explored using ChatGPT to learn about anomaly detection in time series data.
  • We saw examples of different techniques: statistical methods and machine learning approaches.
  • We clarified the difference between anomaly detection and change point detection.
  • Finally, don't forget this is merely a tool (although a really powerful one), so use your critical thinking skills and avoid blindly accepting the results without checking them.

The source code can be found here. Thanks for reading.

István Szatmári

Data Analytics competence group leader at evopro Innovation Kft