End-to-End Anomaly Detection Model Evaluation Algorithms

Yann Hallouard
TotalEnergies Digital Factory
Mar 19, 2020

Time series are numerical series representing the evolution of non-static phenomena through time, such as in finance, weather, or industry.

Analyzing and making predictions on time series has become an important stake in data science. Anomaly detection classifies very rare events as being different from a ‘normal’ behavior. It has become a major problem in time series analysis and can be solved by supervised or unsupervised binary classification algorithms. Something to keep in mind: this is an unbalanced problem. Anomalies account for less than 1% of observations, and class unbalance has a deep impact on metrics. Indeed, a naïve algorithm that classifies every observation as normal will get 99% accuracy without being effective.
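A minimal R illustration of that pitfall, with made-up counts:

labels <- c(rep(0, 990), rep(1, 10))             # 1% of observations are anomalies
preds  <- rep(0, 1000)                           # naive model: everything is 'normal'
mean(preds == labels)                            # accuracy: 0.99
sum(preds[labels == 1] == 1) / sum(labels == 1)  # recall on anomalies: 0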

Anomaly detection should not be confused with anomaly forecasting. The first consists, as said above, in classifying observations of a phenomenon that has already happened. It enables us to react faster to a problem (ensuring the proper functioning of a production line, fraud detection, market anomalies, intrusion…) or to improve the quality of products. A forecasting problem means learning to recognize patterns just before the anomaly (see multivariate time series) in order to act before it happens. This can be used in predictive maintenance, for example.

In real life, anomalies are rarely labeled in time series. Data scientists cannot know when anomalies occurred just by looking at the data history.

And this raises two questions:

- How to label a time series dataset in order to transform the problem into supervised learning?

- How to evaluate an anomaly detection model’s performance, since time series have no labeled data to compare predictions against?

The second question can also be put this way: “How can I compare two anomaly detection models?”. It’s like listening to two people arguing without knowing who’s right. Both have great arguments, but you can’t decide between them.

The best way to appreciate a classification algorithm’s performance is the famous confusion matrix. Without ground truth, there is no confusion matrix. And without a confusion matrix, there are no classical performance metrics.

So, we need to label our time series. But labelling long time series by hand can be a really long and unpleasant task.

Everybody will agree that metrics aren’t great performance indicators without their confidence bounds! So, here is our objective: we need to generate several time series, labeled for the presence of anomalies, which behave like a specified use case, in order to compare models’ performance.

How can we label anomalies in real-world time series in order to:

- Let data scientists use supervised models for anomaly detection?

- Assess these models?

To answer these needs, we launched at Total a project called TSODA (for Time Series Outliers Detection Algorithm — Evaluation tool).

TSODA

The following paragraphs summarize our approach. Please note that this project is still in development, but an open-source release might come later.

TSODA is a tool divided into three blocks:

  • Generation: generating a set of time series similar to the business use case, with labeled outliers
  • Model: feeding a model with the set
  • Evaluation: evaluating model performance with the classified dataset given by the “model block” and reporting the results in a Jupyter notebook

This is how a data scientist uses TSODA:

  1. The data scientist provides TSODA with time series specific to their business case (no anomaly labels)
  2. TSODA generates similar time series, adds pollution and anomalies, and labels all the time series
  3. The data scientist uses their model to detect anomalies on the generated time series
  4. TSODA uses the predictions and the ground truth to evaluate the model’s performance and create a report
  5. The data scientist uses this report to draw conclusions

Now, let’s go deeper into each block.

GENERATION BLOCK

Figure 1 : Schema of the Generation Process

This part is coded in R.

First of all, the user provides the generation block with their use case.

Here is an example:

Figure 2 : Use case given by the user

The global idea is to use a SARIMA process to capture the time series characteristics, then generate new similar time series. Using a SARIMA process to find time series parameters can be very time-consuming due to the large number of calculations. To reduce this time, as described in the next parts, we reduce the search space and use Bayesian optimization.

What the process does:

  1. Smoothing the signal to remove most anomalies and Gaussian noise
  2. Applying a stationarity test
  3. Using a SARIMA process to get the time series characteristics
  4. Using these characteristics to build random time series close to the use case
  5. Adding pollution and anomalies, and labeling the dataset

Smoothing

Figure 3 : use case given by the user (black) and the smoothed time series (red)

We perform exponential smoothing to eliminate most of the anomalies. Then, we try to measure the seasonality (if there is one) in order to reduce the search space of the Bayesian optimization used in the SARIMA step (step 3).

This smoothing step is automatic and optimized: a balance is struck between reducing the reconstruction error and damping the high frequencies, using the following cost function:

where y(t) is the time series and w a weight.
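The exact form of this cost function is not reproduced above, so here is a minimal R sketch of the idea, assuming a quadratic reconstruction term, a quadratic penalty on the first differences of the smoothed series as the “high frequency” term, and base R’s optimize() for the tuning (all assumptions, not the actual TSODA code):

exp_smooth <- function(y, alpha) {
  s <- numeric(length(y))
  s[1] <- y[1]
  for (t in 2:length(y)) s[t] <- alpha * y[t] + (1 - alpha) * s[t - 1]
  s
}

# Cost balancing reconstruction error against remaining high frequencies,
# weighted by w as described above
smooth_cost <- function(alpha, y, w = 0.5) {
  s <- exp_smooth(y, alpha)
  reconstruction <- mean((y - s)^2)
  high_freq <- mean(diff(s)^2)
  (1 - w) * reconstruction + w * high_freq
}

# Automatic choice of the smoothing factor for the user's series df
best_alpha  <- optimize(smooth_cost, interval = c(0.01, 1), y = df)$minimum
df_smoothed <- exp_smooth(df, best_alpha)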

Stationarity Test

To reduce the search space, we also apply a stationarity test to the smoothed series to find the number of differencings needed, “d” (one of the SARIMA parameters).

library(tseries)  # adf.test

N_d  <- 0
Test <- 0

while (Test < 1){
  N_d <- N_d + 1

  # Differencing
  ddf <- diff(df_smoothed, differences = N_d)

  # Augmented Dickey-Fuller test
  Test_adf <- adf.test(ddf)
  Test <- 1 * (Test_adf$p.value <= 0.1)

  if (N_d > 10){break}
}

Finding the best SARIMA parameters

A SARIMA(p, d, q)(P, D, Q)(s) process is used to simulate the raw generated time series. We chose SARIMA for its simplicity and for its seasonal component, which helps simulate more complex behavior than ARIMA. Bayesian optimization is used to find the best SARIMA parameters over the search space [p, q, P, D, Q]. We chose this kind of optimization because of its relevance for costly, black-box objective functions (the AIC here).

This search space goes from seven dimensions down to five because we already found ‘s’ and ‘d’, respectively in the smoothing and in the stationarity steps.

“N_d” is the number of differencings obtained at the stationarity-test step and “s” is the frequency found in the “N_d”-times differenced series.

Parallelizing the processing speeds up the search by testing many more points in the same amount of time. In R, you define an object containing the objective function (the AIC here), the search space, and a boolean indicating whether to minimize or maximize the objective.

library(smoof)         # makeSingleObjectiveFunction
library(ParamHelpers)  # makeParamSet, makeIntegerParam
library(forecast)      # Arima, findfrequency

## Object needed for the Bayesian Optimization
# fn : the cost function (AIC of the fitted SARIMA)
# par.set : search space
obj.fun <- smoof::makeSingleObjectiveFunction(
  name = "xgb_cv_bayes",
  fn = function(x){
    cfg <- c(x["p"],
             N_d,
             x["q"],
             x["P"],
             x["D"],
             x["Q"],
             findfrequency(ddf))
    model <- try(Arima(ddf,
                       order = c(cfg[1], cfg[2], cfg[3]),
                       seasonal = list(order = c(cfg[4],
                                                 cfg[5],
                                                 cfg[6]),
                                       period = cfg[7])),
                 silent = TRUE)
    k <- as.integer(log(length(ddf)))
    temp <- try(AIC(model, k = k), silent = TRUE)
    result <- 100000          # penalty when the model fails to fit
    if (is.numeric(temp)){
      result <- temp
    }
    print(result)
    return(result)
  },
  par.set = makeParamSet(
    makeIntegerParam("p", lower = 0, upper = 4),
    makeIntegerParam("q", lower = 0, upper = 4),
    makeIntegerParam("P", lower = 0, upper = 4),
    makeIntegerParam("D", lower = 0, upper = 2),
    makeIntegerParam("Q", lower = 0, upper = 4)
  ),
  minimize = TRUE
)

Here is the search step:

library(mlrMBO)  # mbo, makeMBOControl, setMBOControl*
library(lhs)     # optimumLHS
library(dplyr)   # %>%, bind_cols

#------------------------
# Finding best parameters
#------------------------
start_time <- Sys.time()

# Initial design of 30 points over the search space
des <- generateDesign(n = 30,
                      par.set = getParamSet(obj.fun),
                      fun = optimumLHS)

control <- makeMBOControl()
control <- setMBOControlTermination(control, iter = 100)
control <- setMBOControlInfill(control,
                               crit = makeMBOInfillCritEI(),
                               filter.proposed.points = TRUE)

# Evaluate the cost function on the design des
y <- data.frame(find_initial_index(des, N_d, s, ddf))
colnames(y) <- c('y')
final_design <- des %>% bind_cols(y)

# Bayesian Optimisation
suppressWarnings({
  run <- try(mbo(fun = obj.fun,
                 design = final_design,
                 control = control,
                 show.info = TRUE), silent = TRUE)
})

end_time <- Sys.time() - start_time
end_time

# The best parameters vector
s <- c(run$x$p, N_d, run$x$q, run$x$P, run$x$D, run$x$Q, s)

With the parameters vector s resulting from the search step, we fit a SARIMA model to our differenced time series to get the coefficients.

model <- Arima(ddf,
               order = c(s[1], s[2], s[3]),
               seasonal = list(order = c(s[4], s[5], s[6]), period = s[7]))

A SARIMA(p, d, q)(P, D, Q)(s) model is defined this way:

𝚽(B^s)·𝛗(B)·(1 − B)^d·(1 − B^s)^D·x_t = 𝚯(B^s)·𝛉(B)·ε_t

𝚽, 𝛗, 𝚯 and 𝛉 are polynomials of order P, p, Q and q respectively. B is the lag operator, so that B·x_t = x_{t−1}.

After finding the best SARIMA parameters, we use the coefficients to simulate generated time series that follow the same model.

library(sarima)  # sim_sarima

# ar, ma, sar, sma are the (seasonal) AR/MA coefficients of the fitted model,
# iorder/siorder its differencing orders, nseason the seasonality and sigma2
# the residual variance
x <- sim_sarima(n = length(df) + 2000,
                model = list(ar = ar,
                             ma = ma,
                             sar = sar,
                             sma = sma,
                             iorder = iorder,
                             siorder = siorder,
                             nseasons = nseason,
                             sigma2 = sigma2))

Adding Anomalies and Choosing Pollution Parameters

Data scientists can decide to add features in order to recreate behaviors that can’t be simulated through a SARIMA process, such as the following (a short sketch of this pollution step is given after the figures below):

  • Noise — Adding white Gaussian noise.
  • Noise Seasonality — Adding a seasonality on the noise, to recreate a maintenance operation on a sensor (tare).
  • One Seasonality — Adding a seasonality.
  • Two Seasonalities — Adding two seasonalities (you can decide whether to add both or to multiply them before adding the result to the simulated time series).
  • Pollution rate — Percentage of outliers throughout the observations.
  • Shift (Structural Change) — Adding a structural change: a sharp drop/increase in the values at a given point that changes the mean of the time series after it happens.
  • Inferior threshold — Recreates industrial behavior, for instance to simulate daily production cycles with zero values during the night.

Figure 4 : Comparison between Use case and Generated time series

(black points are outliers added according to the pollution rate)

Figure 5 : Example of Generated Dataset and its associated Labels
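As mentioned above, here is a hedged R sketch of how such pollution could be injected and labeled; the function name, parameters and anomaly magnitudes are illustrative and not the actual TSODA API:

# x is a simulated series (e.g. the output of sim_sarima shown earlier)
add_pollution <- function(x, pollution_rate = 0.01, noise_sd = 0.1,
                          anomaly_scale = 5, shift_at = NULL, shift_size = 0) {
  n <- length(x)
  labels <- integer(n)

  # White Gaussian noise
  x <- x + rnorm(n, sd = noise_sd * sd(x))

  # Point anomalies according to the pollution rate, labeled 1
  idx <- sample(n, size = ceiling(pollution_rate * n))
  x[idx] <- x[idx] + anomaly_scale * sd(x) * sample(c(-1, 1), length(idx), replace = TRUE)
  labels[idx] <- 1

  # Optional structural change (shift of the mean after a given point)
  if (!is.null(shift_at)) x[shift_at:n] <- x[shift_at:n] + shift_size

  data.frame(value = x, label = labels)
}

generated <- add_pollution(x, pollution_rate = 0.01, shift_at = 1500, shift_size = 3)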

EVALUATION BLOCK

Once time series are generated, the data scientist submits them to their model to detect anomalies. The results are compared to the ground truth to create a report.

This report shows:

  • The process used to generate labeled time series
  • An indicator that measures the similarity between the time series provided by the data scientist and each generated time series
  • Indicators on the model performance (confusion matrix, …)

Figure 6 : View of the whole Report

Compute Model KPIs

Let’s take a look at the KPIs we chose to evaluate the model. As anomaly detection is a classification problem, the first thing we want to do is look at the confusion matrix.

Secondly, we want to check classic metrics, like recall, precision, F-measure and false positive rate. We chose to compute all these metrics so the data scientist can refer to the most appropriate one for their use case. Depending on the problem, one might want a model with a very low false negative rate (health, etc.) or with a very high recall (political or advertising problems, etc.).
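For reference, here is a small R helper that derives these metrics from 0/1 predictions and ground-truth labels, which is equivalent to reading them off the confusion matrix:

classic_metrics <- function(pred, truth) {
  TP <- sum(pred == 1 & truth == 1)
  FP <- sum(pred == 1 & truth == 0)
  FN <- sum(pred == 0 & truth == 1)
  TN <- sum(pred == 0 & truth == 0)
  c(recall    = TP / (TP + FN),
    precision = TP / (TP + FP),
    f_measure = 2 * TP / (2 * TP + FP + FN),
    fpr       = FP / (FP + TN))
}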

But anomaly detection is not just a classification problem; it is an unbalanced classification problem (anomalies may account for a very small share of the observations). This is why we decided to use a cost-based metric: TBC. This metric uses different weights to penalize false positives and false negatives, and ignores true negatives, which are often too numerous in this kind of problem.

  • Cpred : number of prediction clusters (here there are 4, due to the confusion matrix)
  • 𝜖i : number of elements in the i-th cluster
  • Cost(i) : arbitrary weight given to each cluster
  • Ptarget and Ntarget : number of positive and negative elements in the target
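The exact formula of this cost-based metric is not shown above, so the sketch below is only one plausible form consistent with the listed definitions: each of the four prediction clusters is counted, weighted by an arbitrary cost (illustrative values here), and normalized by the total number of target elements:

cost_metric <- function(pred, truth,
                        cost = c(TP = 0, FP = 1, FN = 5, TN = 0)) {
  eps <- c(TP = sum(pred == 1 & truth == 1),
           FP = sum(pred == 1 & truth == 0),
           FN = sum(pred == 0 & truth == 1),
           TN = sum(pred == 0 & truth == 0))
  P_target <- sum(truth == 1)
  N_target <- sum(truth == 0)
  sum(eps * cost) / (P_target + N_target)  # weighted cost over the 4 clusters
}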

As several time series have been generated by the previous block, the model can be evaluated on numerous series, so several measurements of each metric can be taken. This allows us to draw confidence intervals for each metric.
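For instance, assuming a hypothetical list metrics_per_series that holds the metric vector computed on each generated series, an empirical 95% interval can be obtained like this:

recalls <- sapply(metrics_per_series, function(m) m["recall"])
c(mean = mean(recalls), quantile(recalls, probs = c(0.025, 0.975)))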

And as we use dashboards, the KPIs are easier to read when plotted:

Measuring Similarity between generated time series and the real business case

KPIs are an important aspect of this algorithm: they allow the data scientist to see whether a model performs well. But it’s also important to know how similar the generated dataset is to the original time series. In fact, the data scientist must be confident in the data submitted to the model to compute the KPIs.

How similar is the generated dataset to my use case? We don’t need to recreate the same time series; we want another one with the same behavior. We decided to compute cosine similarity, using feature extraction and angle measurement in a vector space (a sketch follows the feature list below).

These are the features extracted:

  • The main frequency represented
  • Autocorrelation at lag 0
  • 25% quantile
  • 75% quantile
  • Trend
  • Maximum value of the absolute derivative
  • Number of peaks
  • Global maximum of the rolling std on a 30-day sliding window
  • Global minimum of the rolling std
  • Global mean of the rolling std
  • Global median of the rolling std
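Here is a sketch of this similarity measure, assuming the zoo and forecast packages; the lag-1 autocorrelation (lag 0 is identically 1) and the peak-counting rule are assumptions made for illustration:

library(zoo)       # rollapply
library(forecast)  # findfrequency

extract_features <- function(x, window = 30) {
  roll_sd <- zoo::rollapply(x, width = window, FUN = sd, fill = NA)
  c(main_freq     = forecast::findfrequency(x),
    acf_lag1      = acf(x, lag.max = 1, plot = FALSE)$acf[2],
    q25           = unname(quantile(x, 0.25)),
    q75           = unname(quantile(x, 0.75)),
    trend         = unname(coef(lm(x ~ seq_along(x)))[2]),
    max_abs_deriv = max(abs(diff(x))),
    n_peaks       = sum(diff(sign(diff(x))) == -2),
    sd_max        = max(roll_sd, na.rm = TRUE),
    sd_min        = min(roll_sd, na.rm = TRUE),
    sd_mean       = mean(roll_sd, na.rm = TRUE),
    sd_median     = median(roll_sd, na.rm = TRUE))
}

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity between the use case and one generated series
cosine_sim(extract_features(df), extract_features(generated$value))

Series selection then amounts to keeping the generated series with the highest values of this similarity.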

Let’s take a look at the dispersion of Real and Generated time series on each feature.

Figure 7: Boxplot of Real and Fake data dispersion on each feature

For this example, the real and generated series behave differently on the global median, the number of peaks, the maximum value of the absolute derivative and the minimum value of the sliding standard deviation. However, they have exactly the same behavior on the global max, global min, first frequency, std mean and std median.

Now let’s compute cosine similarities for each generated time series compared to the original one:

Figure 8 : Heatmap of cosine similarity between real time series (ordinates) and all generated series (abscissa)

As you can see, not all generated series are very similar to the real one. We chose to generate more series and perform a selection, using this similarity as a metric, to keep only the 100 most similar series. With this we get an average similarity of 50%.

Figure 9 : Heatmap of cosine similarity between real time series (ordinates) and all generated series (abscissa) after similarity selection process

Result on Energy Consumption in France

Figure 10 : Comparison between use case (black) and 3 generated time series (blue)
Figure 11 : Heatmap of cosine similarity between real time series (ordinates) and all generated series (abscissa) for Energy consumption

For this example, the simulations are more likely to be very similar to the original time series, but some are still below 20% similarity.

Conclusion

The generation block allows us to easily create time series with labeled anomalies that are relatively similar to the one provided by the data scientist. TSODA also measures this similarity and builds a report in order to assess the performance of a model.

However, I think we can improve the tool to generate time series even closer to the original one. This is why we are currently interested in GANs. Here are some examples of how a GAN performs at generating similar time series:

Figure 12 : Heatmap of cosine similarity between all real time series sequences (ordinates) and all generated series (abscissa)
Figure 13 : Heatmap of Dynamic Time Warping with euclidean distance between all real time series sequences (ordinates) and all generated series (abscissa)

The result seems very promising. The average similarity is rising, and we no longer need the generated-series selection process. Moreover, industrial time series are often hard to simulate with a SARIMA model: lots of structural changes, several seasonalities, plateaus… behaviors that GANs can learn and reproduce much more easily. We calculated the Euclidean distance between real and fake series by applying dynamic time warping. All fake series seem much closer to the real ones than the number 0 fake series (which is just a control linear series). But even though GANs seem to be a good option, some issues have to be overcome. Indeed, we need longer time series in order to cut them into periodic sequences, and as GANs try to reproduce the data distribution, the time series must be sequenced in a way that each sequence is quite similar to the others. To be continued…
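Here is a sketch of that comparison, assuming the dtw R package and two hypothetical sequences real_seq and fake_seq:

library(dtw)

# DTW alignment with a Euclidean local distance; $distance is the kind of
# value reported in the heatmap of Figure 13
alignment <- dtw(real_seq, fake_seq, dist.method = "Euclidean")
alignment$distance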

Interested in GANs? You should be ;)
Just to feed your curiosity: a GAN, or Generative Adversarial Network, is composed of two neural networks that compete with each other. The first one, called the discriminator, tries to separate true objects (pictures, time series, music…) from fake ones generated by the second network, called the generator. The generator, on the other hand, tries to fool the discriminator. Over the epochs, the discriminator improves and becomes better. Obviously, the generator has to get smarter too to keep challenging its counterpart.

This is a very brief introduction to how we aim to create a neural network that can create fake objects able to fool even humans. You might create pictures of people who never existed, music that you swear you had already heard while playing Super Mario… but you never had. Or even time series you swear you recognized from a different use case ;) …

I want to thank RODRIGUES Aline and FROT Nicolas for their participation in this project and this article! :)
