Overview Of Some Data Anomaly Detection Techniques With Artificial Intelligence And How They Can Be Applied In Industry

Guillem Miralles
Published in LatinXinAI
15 min read · Jan 11, 2023

In this article, we look at what we should take into account when designing a data-driven anomaly detection system that can be applied in industry.

Image Created with Stable Diffusion

Introduction:

Anomaly detection is the process of identifying unusual patterns or observations within data sets. It is often used in situations where it is difficult or infeasible to specify the exact characteristics of the anomalies to be detected ahead of time. Manual anomaly detection typically involves a human analyst manually inspecting the data and identifying any patterns or observations that appear to be unusual.

Artificial intelligence (AI) has the potential to greatly improve the efficiency and effectiveness of manual anomaly detection. By using machine learning algorithms, an AI model can learn to automatically identify anomalies within data sets. This can greatly reduce the amount of time and effort required to perform anomaly detection manually.

Anomaly Detection:

In general terms, an anomaly is something that differs from a norm: a deviation, an exception. Unlike other methods that store rules about strange cases, anomaly detection models store information about the pattern of normal behavior.

Then, since we have a pattern that we consider usual/normal, we label it as class 0, and we define the anomalous observations as class 1 (I use 1 for the anomaly because that is what we are trying to detect). We can solve this problem using Machine Learning.

The advantages that Machine Learning can offer over a manually defined anomaly detection system are:
- Better performance: higher hit rate.
- Adaptive system: System capable of adapting to possible changes in the data.

But we also have a number of issues to deal with:
- Data sets are needed: the fundamental requirement of any good machine learning model is data. Moreover, the data will have to be labeled so that we know in advance which observations are 0 and which are 1.
- Machine learning knowledge is needed to design these systems and to maintain them so that they keep operating correctly.

Types Of Learning:

Machine learning algorithms are generally classified as supervised or unsupervised.

- Supervised Learning:

Supervised: In this case, the model needs to have at hand a data set with some observations and the labels/classes of those observations. We have a set of inputs with which we try to predict a variable called the response variable. In this type of problem, the task is:

Classification: The response variable is a category. In our case, 0 or 1.

- Unsupervised Learning:

The model has at hand a dataset with some observations, without needing the labels/classes of those observations. Unsupervised learning studies how systems can infer a function that describes a hidden structure in unlabeled data. The system does not predict a correct result; instead, it explores the data and draws inferences that describe the hidden structures it finds.

For example, if the data are pictures of dogs and cats, in unsupervised learning the program sorts all dog pictures into one category and cat pictures into another. However, this classification is not given beforehand, unlike in supervised learning; the program finds the difference itself, for example by association (e.g., relating leashes to dogs), and thus groups the pictures correctly.

An important thing to keep in mind is that, in the case of unsupervised learning, the fact that the model does not need the 0/1 labels does not mean that we do not need a labeled database. As we will see shortly, we need labeled classes in order to find techniques that manage to separate the structure of the 0 and 1 observations, and obviously, we need to know which ones are 0 and which ones are 1 to know whether a good separation is being found or not.

Evaluation Metrics:

Evaluating how a system performs is a crucial part, and we must take several things into account depending on the problem to be solved. A system that predicts whether a person has cancer or not (predicting that they do not when in reality they do would be critical) is not the same as a system that predicts the genre of a movie (we want it to be as accurate as possible without worrying much about the individual errors).

Then, we must take into account what we want to achieve in our problem and how critical the failures are, in order to decide which metrics to use. Let's assume that failures are quite critical.

Let’s take a look at some of the most important evaluation metrics to keep in mind when solving one of these problems.

1. Confusion matrix:

A confusion matrix is a technique for summarizing the performance of a classification algorithm. Classification accuracy alone can be misleading if you have an unequal number of observations in each class (as is the case). Calculating a confusion matrix can give you a better idea of what the classification model is doing well and what types of errors it is making.

In our case, the positive class is what we seek to detect, i.e., the anomalies (class 1), and class 0 is the usual/normal data.

  • TP (True Positives): number of anomalies correctly detected.
  • FP (False Positives): number of normal data points predicted as anomalous.
  • TN (True Negatives): number of normal data points correctly detected.
  • FN (False Negatives): number of anomalies detected as normal data (VERY CRITICAL).
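As a minimal sketch (assuming we already have ground-truth labels y_true and model predictions y_pred, with 1 marking anomalies), the four counts can be extracted with scikit-learn:

```python
# Minimal sketch: extracting TP, FP, TN, FN with scikit-learn.
# y_true and y_pred are placeholder arrays; 1 marks an anomaly, 0 normal data.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])   # ground-truth labels
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])   # model predictions

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")
```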

2. Accuracy:

It is the proportion of correct predictions: number of hits / total number of samples.

3. Recall:

Anomalous cases (1) correctly identified out of all the anomalous cases we have. It is a measure of how many of the actual anomalous cases were correctly labeled as 1 by the model.

  • TP / (TP + FN).

4. F1-Score:

This is the harmonic mean of Precision and Recall, F1 = 2 × (Precision × Recall) / (Precision + Recall), and it gives a better measure of incorrectly classified cases than Accuracy. Precision is the proportion of anomalous cases (1) correctly identified out of all anomalous predictions made: TP / (TP + FP). High precision means that the model makes a high proportion of correct positive predictions.

What should we take into account?

In short, Accuracy is more interpretable than the F1-Score, but for a problem like this one, with unbalanced classes (and where false negatives are crucial), the F1-Score is a metric that deserves much more weight.

For example, if we have 1000 data as class 0 and 50 as class 1, and the model classifies all as 0, it would be a bad model, but it would get 95% accuracy, while F1 Score and Recall would be 0.

The problem with the F1-Score is that it gives the same importance to false negatives as to false positives. That is why we also use Recall: if it is 100%, it means that there are no observations predicted as 0 that are actually 1 (false negatives).

System Parameter Selection Strategy:

To create a system, as we will see later, we can always establish a series of parameters or thresholds that will allow us to make the system work as we wish, always taking into account the problem we are facing. As we have said, we are going to suppose that there are bad consequences if anomalous data is not detected correctly.

In this case, our objective is a 100% Recall strategy (applicable in industry): the Recall must be 100% (no false negatives), and then we maximize the F1-Score as far as possible. This is our objective when designing a system in which an undetected anomaly has critical consequences. Through a series of adjustable parameters, depending on the system, we seek to achieve maximum control so that no false negatives appear.
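As an illustration of this strategy, here is a small sketch (assuming the model outputs an anomaly score per sample, with higher meaning more anomalous): among all thresholds that keep Recall at 100%, we pick the one with the highest F1-Score.

```python
# Sketch of the 100% Recall strategy: among thresholds with no false negatives,
# keep the one that maximizes the F1-Score. `scores` and `y_true` are placeholders
# for anomaly scores and ground-truth labels on a labeled validation set.
import numpy as np
from sklearn.metrics import recall_score, f1_score

def pick_threshold(scores, y_true):
    best_thr, best_f1 = None, -1.0
    for thr in np.unique(scores):
        y_pred = (scores >= thr).astype(int)      # 1 = flagged as anomalous
        if recall_score(y_true, y_pred) == 1.0:   # no false negatives allowed
            f1 = f1_score(y_true, y_pred)
            if f1 > best_f1:
                best_thr, best_f1 = thr, f1
    return best_thr, best_f1
```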

TECHNIQUE 1 — SUPERVISED LEARNING:

As mentioned above, supervised models are those that learn from labeled examples to predict the class of each observation (0 or 1).

Following the documentation of the scikit-learn library, from which we can import many models, we can find the parameters of each model that we can modify to obtain the best results. This is where the decision strategy about what we want to achieve with these parameters comes in. As mentioned above, what we are looking for is 100% Recall (no false negatives) while also maximizing the F1-Score as much as possible.

Then we can try different models, such as a Random Forest, a neural network, or a simple and very flexible model such as a Support Vector Machine.
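As an example, here is a hedged sketch using scikit-learn (the synthetic data, the SVC kernel, and the class_weight setting are illustrative choices, not a recipe):

```python
# Sketch: a supervised classifier for an unbalanced 0/1 anomaly problem.
# The synthetic data stands in for a real labeled industrial dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)  # ~5% anomalies
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# class_weight="balanced" compensates for the rarity of class 1.
model = SVC(kernel="rbf", class_weight="balanced")
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```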

Depending on the industry we are working in, we will have to evaluate how to apply this type of models. For example, if we receive a patient with possible cancer, and we have a neural network that predicts whether he has cancer or not through an MRI, we can perform the MRI and then run this system. As we can see, in this case it does not require great immediacy, so once the model is trained, we can retrain it whenever we wish, and predict at the times we need to do so.

MRI Image

But let's suppose that in our industry we have data coming in constantly and we need to make the prediction instantly. One way to solve this would be to train the models with a certain amount of data at the beginning of the day, or during periods when we stop receiving data, so that while data is arriving the system only performs predictions.

TECHNIQUE 2 — LINEAR DISTANCE SYSTEMS (UNSUPERVISED LEARNING):

This type of technique is based on the calculation of distance metrics. The main objective of this kind of system is to reduce the computational cost since, by using these metrics, we only perform linear operations.

The idea consists of using the distance between a series of temporal data. If we know the distribution that the correct data normally follows, we can compare its mean with each new data point. We can then assume that, if abnormal data arrives, it will be very different from the mean of what we consider normal/usual, while data that is normal or habitual will be much more similar to that mean.

We can choose one, two or more metrics. The greater the number of metrics (Cosine, Manhattan, Euclidean, Chebyshev, etc.), the greater the control, but also the greater the number of thresholds we must define. The goal is to find a set of metrics that clearly separates the anomalous data from the mean while using as few metrics as possible.

Therefore, the first phase is an analysis stage: we compute a mean from some training data and define the thresholds that best separate normal and anomalous test data from this mean. In this example, using training data, the thresholds were set at a Euclidean distance > 4.7 and a Chebyshev distance > 17.

In the image example, we can see this test phase. Note that the test was carried out with more than 5,000 data points, so the points that separate from the cloud of correct data are very few and, for the most part, are the abnormal ones.

Test set confusion matrix using distances

Then, once the thresholds have been calculated in the previous phase, we can choose a number N of non-anomalous data points from which we will calculate the mean. The smaller this number, the more adaptable and flexible the system will be, but the greater the possibility of error.
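A minimal sketch of this idea follows (the thresholds 4.7 and 17 are the illustrative values mentioned above; in practice they come from your own analysis phase):

```python
# Sketch of the distance-based detector: compare each new sample with the mean
# of the last N samples accepted as normal. The thresholds are illustrative.
import numpy as np

EUCLIDEAN_THR = 4.7
CHEBYSHEV_THR = 17.0

def is_anomaly(new_sample, normal_window):
    """normal_window: array of shape (N, n_features) with recent normal data."""
    mean_vec = normal_window.mean(axis=0)     # mean of the N normal samples
    diff = new_sample - mean_vec
    euclidean = np.linalg.norm(diff)          # L2 distance to the mean
    chebyshev = np.max(np.abs(diff))          # L-infinity distance to the mean
    return euclidean > EUCLIDEAN_THR or chebyshev > CHEBYSHEV_THR
```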

Another possibility is to compare each new data point with every data point we have in the database. Remember that these calculations are computationally cheap, so this option can be considered, although it is more costly than the average-based approach mentioned above.

The main difficulty of this system is to establish a threshold for each distance according to the data, since it requires a previous study of the data and a selection of the best threshold. However, this is the usual problem in the creation of any anomaly detection system.

TECHNIQUE 3 — UNSUPERVISED LEARNING MODELS:

DBSCAN:

The unsupervised learning model chosen for this problem, due to its characteristics, is the DBSCAN algorithm. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a data clustering algorithm. As we have already mentioned, clustering algorithms are those that try to group the data according to its structure, that is, the similarity or the distance between observations.

The DBSCAN unsupervised learning model seems to be a good approach to the problem as it has the ability to discover clusters of different sizes from a large volume of data containing noise or outliers.

DBSCAN Algorithm (Wikipedia)

The DBSCAN algorithm is based on two parameters that we can adjust to modify the way it forms the clusters:

  • Epsilon (eps): defines the size and boundaries of each sample's neighborhood.
  • Minimum Points (min_samples): if a point has at least this minimum number of points in its neighborhood, a cluster is formed.

We consider that anomalous data can be defined as outliers with respect to the rest, but we also have to consider that anomalous data can form groups. Therefore, data that falls in a cluster containing some known anomalous data will be considered anomalous as well.
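A possible sketch of this logic with scikit-learn's DBSCAN (eps, min_samples, and the known_anomaly_mask argument are placeholders to be adapted to your data):

```python
# Sketch: DBSCAN-based detection. Points labeled -1 are noise (outliers); in
# addition, any cluster that contains a known anomaly is flagged entirely,
# following the reasoning above.
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_flags(X, known_anomaly_mask, eps=0.5, min_samples=5):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    flagged = labels == -1                          # noise points are anomalous
    for cluster_id in np.unique(labels[known_anomaly_mask]):
        if cluster_id != -1:                        # flag the whole cluster
            flagged |= labels == cluster_id
    return flagged.astype(int)                      # 1 = anomalous, 0 = normal
```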

Then, given some training data, we try to find the model parameters that best fit the desired system behavior. There are also other unsupervised learning techniques, such as K-Means, or more complex ones, such as self-organizing maps, that allow us to solve the problem in this way.

TECHNIQUE 4 — RECONSTRUCTION MODELS (UNSUPERVISED LEARNING):

These types of models are based on performing a dimensionality reduction and then trying to reconstruct the data back to its original dimensionality. This may not seem to make much sense, but it does, since what we are looking for is for these models to generate a certain error between the original data and the reconstruction. The idea, then, is that if we train the models only with normal data (0), the model will learn to reconstruct the class 0 data well, but for the anomalous data it will generate more error. We then set a threshold and classify as normal those samples whose reconstruction error is below it.

The threshold is decided based on the error generated by the reconstruction compared to the original data. To know the error generated, there are multiple metrics, but we will explain here the two most famous metrics:

  • Root Mean Square Error (RMSE): it is the square root of the average of squared differences between the prediction and the actual observation. It gives a relatively high weight to large errors.
  • Mean absolute error (MAE): It is the average over the absolute differences between the prediction and the actual observation. All errors have the same weight.
RMSE and MAE errors
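For reference, here is a per-sample version of both errors (the arrays are placeholders for the original data and its reconstruction), so that each sample gets its own error to compare against the threshold:

```python
# Per-sample reconstruction errors used to compare against the decision threshold.
# `original` and `reconstruction` are placeholder arrays of shape (n_samples, n_features).
import numpy as np

def rmse_per_sample(original, reconstruction):
    return np.sqrt(np.mean((original - reconstruction) ** 2, axis=1))

def mae_per_sample(original, reconstruction):
    return np.mean(np.abs(original - reconstruction), axis=1)
```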

We can use different techniques to reduce the dimensionality and then return to the original dimension. In this article, we will look at two:

1. Principal Component Analysis (PCA):

First of all, we perform an analysis phase where we train the PCA on the correct data and test it with some test data. One thing we also have to decide is the number of principal components to keep. There are multiple techniques for selecting the optimal number of principal components, such as the Elbow Curve. Once we see how it works, we select a threshold on the errors that separates the anomalous data as well as possible.

Then, in the test phase, we feed it both class 0 and class 1 data. If the error produced in the reconstruction is above the threshold, the sample is flagged as anomalous.
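A hedged sketch of this procedure with scikit-learn's PCA (the number of components and the threshold are placeholders chosen in the analysis phase):

```python
# Sketch: PCA reconstruction error. The PCA is fitted only on normal (class 0)
# training data; n_components and the threshold are placeholders.
import numpy as np
from sklearn.decomposition import PCA

def fit_pca(X_normal, n_components=5):
    pca = PCA(n_components=n_components)
    pca.fit(X_normal)
    return pca

def pca_anomaly_flags(pca, X_new, threshold):
    reconstruction = pca.inverse_transform(pca.transform(X_new))
    errors = np.sqrt(np.mean((X_new - reconstruction) ** 2, axis=1))  # per-sample RMSE
    return (errors > threshold).astype(int)      # 1 = anomalous
```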

Here we can observe the same thing that was shown with the distances. In the analysis phase, the thresholds for errors were decided and later the test was carried out. The important thing is that we see a successful split in the data.

Test set confusion matrix using PCA reconstruction

2. Autoencoder:

An autoencoder is a neural network that tries to compress the input information and reproduce it correctly with the reduced information in the output.

Autoencoder Structure
  • Encoder: The encoder tries to reduce the dimensionality of the data. That is, it tries to keep the most important information, but reducing the number of variables. As in any dimensionality reduction process, there is a loss of information.
  • Latent Space: In this space, we have the dimensionality reduced to a certain number of values.
  • Decoder: The decoder tries to return to the input dimension, that is, it tries to reproduce the original input. As we have said, reducing the dimensionality causes a loss of information; therefore, the output will have a certain error with respect to the original input.

So, given an analysis stage, we have to design the autoencoder that best reconstructs the correct data, that is, the autoencoder that minimizes the error on correct data. To do this, you must know the pattern that the correct data follows. This is achieved in a stage where we create an architecture and train this autoencoder.

For this, you must have knowledge of neural networks and how they work. There are many parameters that we have to adapt to the problem (Number of neurons, optimizers, activation functions, etc…). After this phase, as in all cases, we choose a threshold and test it with a test set.
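A minimal Keras sketch of such an autoencoder for tabular data (the layer sizes, optimizer, and epochs are placeholders, and X_normal / X_test / threshold are hypothetical names for scaled class-0 training data, test data, and the chosen threshold):

```python
# Minimal autoencoder sketch for tabular data; all hyperparameters are placeholders.
import numpy as np
from tensorflow.keras import layers, models

def build_autoencoder(n_features, latent_dim=8):
    inputs = layers.Input(shape=(n_features,))
    encoded = layers.Dense(32, activation="relu")(inputs)            # encoder
    latent = layers.Dense(latent_dim, activation="relu")(encoded)    # latent space
    decoded = layers.Dense(32, activation="relu")(latent)            # decoder
    outputs = layers.Dense(n_features, activation="linear")(decoded)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

# Hypothetical usage, assuming X_normal and X_test are already scaled NumPy arrays:
# autoencoder = build_autoencoder(n_features=X_normal.shape[1])
# autoencoder.fit(X_normal, X_normal, epochs=50, batch_size=64, validation_split=0.1)
# errors = np.mean(np.abs(X_test - autoencoder.predict(X_test)), axis=1)  # per-sample MAE
# anomalies = (errors > threshold).astype(int)  # threshold chosen in the analysis phase
```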

Test Set Autoencoder errors

We see that this model separates the data much better, achieving a clear division.

We see that we only have 7 false positives, that we maximize the recall to 100%, and that it has an accuracy of 99.8%. In this way, we have achieved the goal.

Things to Keep in Mind:

Adaptability:

In this article, we have talked about adaptability at some point. This can be important in an industry since data can change by many factors but still be correct. For this reason, it is very important to always give feedback to the system; that is, when the system has something that it classifies as “rare”, we assess whether it was really rare or not.

This can be interesting because, if we limit our database and always have the newest data, the system will be adaptable to changes and, when the data changes, at first it will ask for a lot of feedback but, once we teach it again what is correct, the system will keep working.

The less data we keep, the more adaptable to change the system can be, but the less control it provides. We must find a point where the system is balanced. Keep in mind that, normally, even if the normal data changes, it will still be more similar to what was previously normal than to what is anomalous, so the time spent giving the system feedback will not be much.

Timing:

One thing we will also have to evaluate is timing. Depending on the resources available, we will be able to apply more or less powerful systems. If we have a lot of resources, we can retrain an autoencoder almost constantly and with a lot of data; on the other hand, if we have few resources, we may only be able to train the autoencoder (which has a very high computational cost) occasionally, and afterwards simply run predictions, which are comparatively cheap.

Automatic Threshold Determination System:

Determining thresholds automatically is something that requires a much broader study depending on the problem. Being able to set these thresholds while maintaining a high degree of control so that false negatives do not appear requires a study with a lot of labeled data, since changes in the data can disrupt our system. But if nothing critical depends on this choice, we can let the system choose the thresholds that obtain the best results given the database.



Thank you :)


Hello! My name is Guillem and I am passionate about data science :) Contact: guillemmiralles1@gmail.com