Unsupervised Learning for Anomaly Detection

Oykuzun
Teknopar Akademi
Aug 29, 2023

Introduction

Anomaly detection, regardless of the context of the dataset, has become one of the prevailing use cases of Machine Learning (ML) across industries, as it often provides critical, actionable information. From cybersecurity, where unusual patterns in network traffic, user access, or system logs can reveal malicious attacks, to healthcare, where it can flag unusual patterns in patient data or irregularities in medical images (e.g. X-rays, MRIs), anomaly detection appears in numerous applications.

Outliers are data points that differ significantly from the rest of the data points in a given set [1]. Anomaly detection is the process of finding such points in order to extract important information about the system or process being analyzed. With applications in such a diverse set of fields, anomaly detection addresses critical challenges and provides valuable insights for purposes such as efficient resource utilization, system monitoring, predictive maintenance, and optimization of industrial processes. This post explores the different types of anomalies and the methods used to detect them, with an emphasis on unsupervised learning methods suited to unlabeled datasets.

Contents and Objective

The main objective of this article is to provide more context on anomaly detection, explore the various ML methods used for it, and apply this knowledge to a sample dataset obtained from a petrochemical plant, containing sensor data from numerous stages of production. Since this dataset came without labels, unsupervised learning methods are emphasized throughout the post. Furthermore, like most data collected from Internet of Things (IoT) systems, the sample data is a time series, which was a constraint considered when selecting the anomaly detection methods.

The information is presented in a way such that all the terminology and main concepts are broken down to make it easier for the reader to understand. The different types of anomalies are described in the first section, then various unsupervised learning methods for anomaly detection are explored in detail throughout the rest of this article. The six methods explored in order are:

  1. Isolation Forest (iForest)
  2. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  3. Local Outlier Factor (LOF)
  4. Autoencoders for anomaly detection
  5. Histogram Based Outlier Score (HBOS)
  6. Principal Component Analysis (PCA) for Anomaly Detection

Types of Anomalies/Outliers

Global Outliers

When specific data points fall outside the overall distribution of the dataset, these points are called global outliers. Since they indicate a significant deviation from the overall behaviour of the dataset, such points can often be classified as anomalies. To detect global outliers, techniques such as the z-score (a measure of how many standard deviations a data point lies from the mean [14]), the Mahalanobis distance (a multivariate distance metric that quantifies a point's deviation from the distribution [2]), the isolation forest algorithm, and one-class SVMs can be used. Some of these terms and techniques are described further in the following sections.

Global outlier
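As a quick illustration of the first two techniques, the short sketch below computes z-scores and Mahalanobis distances for a small synthetic 2-D dataset; the data, thresholds, and variable names are illustrative assumptions, not values from the dataset discussed in this post.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(42)
X = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # mostly "normal" points
X = np.vstack([X, [[6.0, 6.0]]])                     # one obvious global outlier

# z-score per feature: |x - mean| / std
z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
z_flags = (z_scores > 3).any(axis=1)                 # common rule of thumb: |z| > 3

# Mahalanobis distance of each point from the sample distribution
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
center = X.mean(axis=0)
m_dist = np.array([mahalanobis(x, center, cov_inv) for x in X])
m_flags = m_dist > 3                                 # illustrative threshold

print("z-score outliers:", np.where(z_flags)[0])
print("Mahalanobis outliers:", np.where(m_flags)[0])
```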

Contextual Outliers

If data points are not in the range we expect them to be in, given the behaviour forecasted for them and the context in which they occur, they are classified as contextual outliers [28]. It is important to note that contextual outliers take the external conditions and the time of occurrence into account. This means that contextual outliers are transitory: the same value may be normal at one time and anomalous at another. Context-aware ML methods ought to be preferred in order to achieve the best results; it is important to cluster the data or otherwise take the context into account when performing anomaly detection [1].

An example of a contextual outlier

Collective Outliers

Collective outliers are groups of data points that deviate from the overall behaviour collectively [28]. Although no unusual behaviour is observed when these points are analyzed individually, an anomaly becomes apparent when they are monitored as a group. Subspace-based methods, which extract features and reduce dimensionality so that patterns in high-dimensional data can be observed in a subspace where the significant information is preserved [3], can be used to detect collective outliers. Clustering algorithms can also be successful at detecting them, and are elaborated on in the following sections.

Collective outliers

Machine Learning Methods/Algorithms for Anomaly Detection

Isolation Forest

As the name suggests, isolation forest (iForest) is a tree-based algorithm which can detect anomalies by randomly partitioning data points. It is an unsupervised ML algorithm that identifies anomalies by isolating outliers in the data [4]. This method uses the assumption that the anomalies present in a given data set are limited in number and have distinct characteristics compared to regular data points.

The isolation forest algorithm uses binary decision trees to isolate outliers by making random splits in the dataset [4]. Each split is made at a random value picked between the minimum and the maximum of the feature being analyzed. The intuition behind this method is that a regular point is much harder to isolate than an anomaly as the dataset keeps being partitioned, so the points that are isolated early in the partitioning are more likely to be outliers.

Below is a visual representation of how these partitions are done and the difference in the number of partitions it takes to reach outliers in a dataset compared to an inlier.

Visualization of isolation forest algorithm — anomalies (xo) are more susceptible to isolation compared to regular points (xi)

By eliminating the cost of computing distances or densities, which most other anomaly detection methods require, iForest is able to detect outliers with linear time complexity, a low constant factor, and a low memory requirement [4]. To separate anomalies from regular points, the iForest algorithm computes an anomaly score between 0 and 1 for each sample in the dataset, where scores closer to 0 (generally < 0.5, though this threshold can be adjusted based on the application) are considered regular [4]. These scores are assigned based on how many partitions it takes for a given data point to be isolated. The anomaly score of an instance x from a dataset with n instances is

s(x, n) = 2^(-E(h(x)) / c(n))

Here, E(h(x)) is the average number of partitions (path length) it takes to isolate the data point across the collection of isolation trees built during training, and c(n) is the average path length of an unsuccessful search in a Binary Search Tree (BST) with n nodes, used to normalize h(x), since isolation trees have a structure similar to a BST [4].

It is important to note one drawback of the iForest algorithm: masking, the treatment of a cluster of anomalies as inliers. This happens when anomalies are too close to regular instances or occur in a dense batch, causing the algorithm to treat them as normal. A related issue is swamping, in which normal instances close to anomalies are wrongly labelled as anomalies. Both effects can be mitigated with subsampling, partitioning the dataset into smaller fragments and analyzing each separately for anomalies [5].

Isolation forest has proven to scale well to larger datasets and to produce good results in identifying anomalies in high-dimensional data. With the necessary adjustments to the original algorithm and careful tuning of its parameters, iForest can be a valuable tool for anomaly detection.
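As a minimal sketch of how this might look in practice with scikit-learn's IsolationForest; the synthetic data, the contamination value, and the variable names below are illustrative assumptions, not settings from the petrochemical dataset described earlier.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# synthetic "sensor" readings: a noisy baseline plus a few injected spikes
normal = rng.normal(loc=50.0, scale=2.0, size=(1000, 3))
spikes = rng.normal(loc=80.0, scale=5.0, size=(10, 3))
X = np.vstack([normal, spikes])

# contamination is the assumed fraction of outliers; tune it per application
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = model.fit_predict(X)          # -1 for outliers, +1 for inliers
scores = model.decision_function(X)    # lower scores mean more anomalous

print("flagged indices:", np.where(labels == -1)[0])
```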

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is an unsupervised clustering algorithm based on the assumption that clusters of high-density data points are separated by regions of lower density [7]. Unlike K-means clustering, which assumes well-defined, roughly spherical clusters and is more strongly affected by outliers, DBSCAN can identify clusters of varying size and shape while picking out the outliers efficiently, making it a more desirable option in many contexts.

Visualization of the difference in detecting outliers between DBSCAN and K-Means clustering

The algorithm’s main working principle is to draw a circle of fixed radius around each data point and to separate the dataset into three classes: core points, border points, and noise [7]. This classification is based on two parameters specified by the programmer: the radius of these circles (epsilon) and the minimum number of samples required inside a circle for its center to be classified as a core point (minpoints) [8]. Euclidean distance is typically used to measure the distance between data points. With x denoting the number of samples that fall within the circle of radius epsilon drawn around a point, the classification works as follows:

  • x ≥ minpoints: core point
  • x < minpoints, but the point lies within epsilon of a core point: border point
  • neither of the above: noise

Visualization of how the DBSCAN algorithm works

When determining the parameters for DBSCAN, it is common practice to set minpoints to twice the number of dimensions, although the best value depends on the specific application; at a minimum, it should be greater than or equal to the dimensionality of the dataset [6][7][8]. The epsilon value is then chosen at the point where the k-distance graph shows the greatest change in slope (the “elbow”). The k-distance graph is obtained by computing, for each sample, the average distance to its k nearest neighbors, with k set to the minpoints value chosen previously, and plotting these distances in ascending order [18].

Sample k-distance graph to show how the best epsilon value for DBSCAN is selected

DBSCAN can identify clusters in large datasets by observing the local density of the data points and, unlike K-means clustering, does not require the programmer to specify the number of clusters in advance.

Overall, DBSCAN is an efficient clustering algorithm for partitioning a dataset based on the density distribution of its points. It can be an excellent choice for detecting outliers in datasets where there is an evident difference between high- and low-density regions. On the other hand, the algorithm suffers when there are too many dimensions or when the clusters are similar in density, which makes it harder for DBSCAN to distinguish them [8]. For anomaly detection in a time series dataset, DBSCAN is effective at detecting global outliers with extreme values; for local anomalies that are less obvious in the dataset as a whole, the parameters may have to be adjusted to make the algorithm more sensitive to these mid-range values as well.
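A minimal sketch of this workflow with scikit-learn follows; the synthetic data, the choice of min_samples, and the epsilon value read off the k-distance curve are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
# two dense blobs plus a few scattered outliers
X = np.vstack([
    rng.normal([0, 0], 0.3, size=(200, 2)),
    rng.normal([5, 5], 0.3, size=(200, 2)),
    rng.uniform(-2, 7, size=(10, 2)),
])

# k-distance curve: sort each point's distance to its k-th nearest neighbor
min_samples = 4                      # rule of thumb: ~2 * number of dimensions
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])
# in practice, eps is read off the "elbow" of this curve; here it is hard-coded
eps = 0.5

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
print("points labelled as noise/outliers:", np.where(labels == -1)[0])
```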

Local Outlier Factor (LOF)

When the density distribution of the dataset is not uniform across the samples, the Local Outlier Factor (LOF) algorithm is able to detect points that deviate from their local neighborhoods [22]. It is based on the idea that an outlier sits in a region of much lower density than its neighbors do. With this in mind, LOF can be seen as using a concept similar to DBSCAN, relying on the density of neighboring points to decide whether a given point is an outlier.

To understand LOF in detail, it is important to have a good understanding of a few concepts: k-distance, k-neighbors, Reachability Distance (RD), and Local Reachability Density (LRD).

· K-distance: the distance from a point to its k-th nearest neighbor (i.e. the radius needed to reach the k nearest neighbors of that point).

· K-neighbors: the data points that fall within the k-distance radius around the point being observed.

· Reachability Distance (RD):

RD_k(a, b) = max{ k-distance(b), d(a, b) }

This formula implies that the reachability distance from a to b is the k-distance of b if a falls within the k-neighborhood of b, and is the regular Euclidean distance d(a, b) (or another distance measure, depending on the application) if a falls outside of that neighborhood.

Reachability distance of a point

· Local Reachability Density (LRD): the inverse of the average reachability distance of point a from its neighbors.

By intuition, a low LRD value implies that the nearest cluster is far from the point. The formula to compute the LRD of point A is

LRD_k(A) = 1 / ( ( Σ_{B ∈ N_k(A)} RD_k(A, B) ) / |N_k(A)| )

where N_k(A) is the set of k-neighbors of point A. The Local Outlier Factor (LOF) of A can then be defined as the average ratio of the LRDs of its k-nearest neighbors to its own LRD.

LOF provides a metric to compare the local density of a point to that of its neighbors. The formula to compute it is

LOF_k(A) = ( Σ_{B ∈ N_k(A)} LRD_k(B) / LRD_k(A) ) / |N_k(A)|

Higher LOF values indicate that the point lies in a less dense region than its neighbors and is likely to be an outlier. If the LOF of a point is smaller than or close to 1, the point has a density similar to that of its neighbors and is likely an inlier; LOF values much greater than 1 suggest that the point's density is much smaller than the densities of its neighbors, so it is likely an outlier.

· Inlier if LOF ~ 1 or LOF < 1

· Outlier if LOF >> 1

In high-dimensional time-series data, LOF becomes computationally more expensive and sparsity (data points becoming increasingly spread out as the number of dimensions grows) starts to become an issue. Increased sparsity is a challenge for LOF because the algorithm can no longer find meaningful neighborhoods for the data points, which is an essential step in computing the LRDs [8].

When the dataset analyzed is known to be dominated by local outliers, LOF can be preferred over other algorithms since it is more sensitive to points that differ even slightly from densely clustered points.
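The sketch below shows one way this could look with scikit-learn's LocalOutlierFactor; the synthetic data, the number of neighbors, and the contamination fraction are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
# a dense cluster, a sparser cluster, and two local outliers near the dense one
dense = rng.normal([0, 0], 0.2, size=(300, 2))
sparse = rng.normal([6, 6], 1.5, size=(100, 2))
local_outliers = np.array([[1.2, 1.2], [-1.3, 0.9]])
X = np.vstack([dense, sparse, local_outliers])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)                 # -1 for outliers, +1 for inliers
lof_scores = -lof.negative_outlier_factor_  # values well above 1 indicate outliers

print("flagged indices:", np.where(labels == -1)[0])
print("their LOF scores:", lof_scores[labels == -1])
```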

Autoencoders for Anomaly Detection

Autoencoders are similar to Principal Component Analysis (PCA) in the sense that they are also a compression technique. An autoencoder is a type of neural network trained in an unsupervised fashion, which compresses the raw input into a lower-dimensional representation, called the code, and uses this code to reconstruct the output [25]. Each autoencoder has three main components: the encoder, the code, and the decoder. A visual representation of this structure can be seen below:

Autoencoder components and representation of the dimensionality reduction process

As the image depicts, the encoder compresses the input into the code, and the decoder reconstructs the output from that code. Autoencoders differ from PCA in that they reconstruct the input through non-linear transformations rather than the single linear transformation that PCA applies. As a result, autoencoders can learn more accurate lower-dimensional representations of the input, since the network is trained to reconstruct the input through multiple non-linear transformations [25].

It is also important to note that autoencoders must be trained for their specific application, so they are data-specific [26]. They are also lossy, meaning that once the input is compressed, some of the information in the raw input is lost permanently [25].

The intuition behind using autoencoders for anomaly detection is that the network, trained to learn generalizations of the regular data, cannot accurately represent anomalies in the lower-dimensional latent space [25]. During reconstruction in the decoder, the output therefore will not resemble the original data, since the properties of outliers differ significantly from the regular data points used during training.

The difference between the input to the network and its reconstructed output is called the reconstruction error. This metric can be used to classify outliers, since outliers behave differently from regular data points and therefore yield higher reconstruction errors. When an outlier is fed to the autoencoder, the weights learned from regular data points during training do not reconstruct it well, leading to a high reconstruction error [26].

In the autoencoder architecture, the most important hyperparameter to tune is the size of the code, i.e. the bottleneck that holds the compressed version of the original input. The compression process also depends on how deep the network is and how many nodes each layer has. To illustrate, consider a simple autoencoder with three layers in total: an input layer with 4 nodes, one hidden (bottleneck) layer with 2 nodes, and an output layer with 4 nodes. The encoder must compress the 4-dimensional input into 2 dimensions, and the decoder must then reconstruct a 4-feature output that retains the essential information of the input.

A simple autoencoder with one layer for the input, bottleneck, and the output

The weights of the network are then adjusted via backpropagation so that the reconstruction error is minimized and the output resembles the input as closely as possible, since the autoencoder is trained on reliable (normal) data [24]. In other words, each weight is updated in proportion to how much it is responsible for the reconstruction error.

To summarize everything mentioned above, the whole process can be outlined step-by-step as:

- Step 1: Input received by the autoencoder after the necessary pre-processing of the data has been completed

- Step 2: Encode the input into a lower dimensional vector using an adequate activation function based on the application.

- Step 3: Obtain the code which is the latent representation of the original input including the essential information.

- Step 4: Recreate the input by decoding the lower dimensional vector from the encoder. This new vector will become the output which has to be of the same dimension as the input.

- Step 5: Calculate the reconstruction error L which is the difference between the input and the output vector.

- Step 6: Backpropagate through the neural network to update the nodes’ weights based on how much they are responsible for the error. As is the case for other neural networks, the learning rate determines by how much the weights are updated.

- Step 7: Repeat the above steps with the chosen optimization algorithm (stochastic gradient descent or one of its variants in most autoencoders) until the reconstruction error is minimized and the weights converge; a minimal training sketch follows this list.
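The following is a minimal sketch of these steps using Keras, assuming a small fully connected autoencoder and synthetic data standing in for the sensor readings; the layer sizes, threshold, and variable names are illustrative assumptions rather than the configuration used in this project.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(3)
X_train = rng.normal(0.0, 1.0, size=(2000, 8)).astype("float32")   # "normal" data only
X_test = np.vstack([X_train[:50], rng.normal(6.0, 1.0, size=(5, 8)).astype("float32")])

# encoder compresses 8 features down to a 2-dimensional code; decoder reconstructs them
autoencoder = keras.Sequential([
    keras.Input(shape=(8,)),
    layers.Dense(4, activation="relu"),
    layers.Dense(2, activation="relu"),     # bottleneck / code
    layers.Dense(4, activation="relu"),
    layers.Dense(8, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=20, batch_size=64, verbose=0)

# reconstruction error per sample; large errors suggest anomalies
recon = autoencoder.predict(X_test, verbose=0)
errors = np.mean((X_test - recon) ** 2, axis=1)
threshold = np.percentile(errors, 95)        # illustrative threshold choice
print("flagged indices:", np.where(errors > threshold)[0])
```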

One of the challenges with autoencoders is overfitting, since there are so many ways to represent the original input [25]. For this reason, sparse autoencoders are used, which add a sparsity penalty to the cost function (the function representing the error between the computed and expected output over the entire training set). This sparsity component acts as a regularization term that allows only a portion of the nodes in the network to activate at a time [26]. Sparsity effectively reduces the capacity of the network, preventing the autoencoder from learning redundant features that can lead to overfitting.

Histogram Based Outlier Score (HBOS)

HBOS is a statistical anomaly detection algorithm that calculates an outlier score for each data point from per-feature histograms, assuming the features to be independent. Although it is less precise than LOF and other local outlier detection algorithms, it is very efficient and can detect global outliers in linear time [28]. The HBOS algorithm builds a univariate histogram (a histogram over a single feature) for every dimension of the dataset in order to model the distribution of data points across “bins”. The original paper by Goldstein and Dengel distinguishes two types of features: categorical and numeric [28]. Categorical features take label values rather than numeric ones, for example colors (blue, yellow, green, etc.) or cities (New York, Istanbul, Paris, etc.). For categorical data, the occurrences of each category are counted and the relative frequency, which becomes the height of the histogram bar, is computed. For numerical features, two methods are used: static bin-width histograms and dynamic bin-width histograms [28].

Static bin-width histograms

Sample histogram with static bin-widths

The bin widths are determined by dividing the range of the numeric values in the dataset into a predetermined number of bins. The width of each bin is constant across the entire range, and the frequencies of the data points are plotted in a histogram; this is the typical histogram that comes to mind and is the most common form. A rule of thumb is to pick the number of bins k to be the square root of the number of samples [27]. Another commonly used mathematical method is Sturges’ formula, which assumes an approximately normal distribution:

k = 1 + log2(n)

where n is the sample size. The Freedman-Diaconis rule is another method used to determine the number of bins [27]. This rule also considers the spread of the sample by taking the interquartile range (IQR) into account; the bin width is calculated as

bin width = 2 * IQR(x) / n^(1/3)

where the numerator contains the interquartile range of the data and n is again the sample size. This bin width then gives the number of bins as

k = (max(x) - min(x)) / bin width

Finally, the number of bins can be selected via visual inspection with some knowledge of the domain.
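For reference, NumPy exposes these rules directly through np.histogram_bin_edges; the small sketch below, using an arbitrary synthetic sample, compares the bin counts the different rules produce.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1000)                     # arbitrary synthetic sample

for rule in ("sqrt", "sturges", "fd"):
    edges = np.histogram_bin_edges(x, bins=rule)
    print(f"{rule:8s} -> {len(edges) - 1} bins")
```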

Dynamic bin-width histograms

Sample histogram with dynamic bin-widths

Unlike traditional static bin-width histograms, the width of each bin is not constant. The widths are determined by the characteristics of the data: more densely populated value ranges get narrower bins, while sparser regions get wider bins [28]. Dynamic bin-width histograms thus avoid the oversmoothing that can hide detail in the behaviour of the data in traditional histograms.

After a histogram has been created for each feature, using either of the types described above, the height of each bin serves as a density estimate. These heights are then normalized so that the maximum height of each histogram equals 1. The normalized values are used to compute the Histogram-Based Outlier Score (HBOS) of a data point p with d features:

HBOS(p) = Σ_{i=1}^{d} log( 1 / hist_i(p) )

where hist_i(p) is the normalized height of the bin that p falls into in the histogram of the i-th feature. From this formula it can be seen that a higher HBOS hints that the point is an outlier: the score grows as the point’s feature values fall into low-density bins, i.e. far from where most of the values lie [28]. Based on the distribution of the computed scores, a threshold can be chosen so that points with scores above it are classified as outliers.
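A minimal NumPy sketch of this scoring (static bin widths, equal treatment of all features; the data and bin count are illustrative assumptions) might look like this:

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Histogram-Based Outlier Score with static bin widths (simplified sketch)."""
    n_samples, n_features = X.shape
    scores = np.zeros(n_samples)
    for i in range(n_features):
        counts, edges = np.histogram(X[:, i], bins=n_bins)
        heights = counts / counts.max()                 # normalize max height to 1
        heights = np.clip(heights, 1e-12, None)         # avoid log(0) for empty bins
        # find the bin each sample falls into (clip keeps the max value in the last bin)
        bin_idx = np.clip(np.digitize(X[:, i], edges[1:-1]), 0, n_bins - 1)
        scores += np.log(1.0 / heights[bin_idx])
    return scores

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(500, 3)), [[8.0, 8.0, 8.0]]])
scores = hbos_scores(X)
print("most anomalous index:", int(np.argmax(scores)))  # should flag the injected point
```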

Principal Component Analysis (PCA) for Anomaly Detection

Visualization of how PCA works

PCA is a statistical method used to reduce the dimensionality of large datasets while preserving the important relationships between the dimensions, making the data easier to analyze and visualize [14]. It is a linear transformation obtained by maximizing the variance retained in the data, so that as much statistical information as possible is preserved in the smallest number of dimensions [9]. Through a sequence of steps, explained further in Appendix A, PCA compresses the data and generates a new, smaller set of dimensions by identifying the principal components of the data [9]. Machine learning algorithms often benefit from this simplification, since it reduces the number of unnecessary variables and speeds up the overall process. It is important to note, however, the trade-off between dimensionality reduction and loss of information: as the principal components are computed and the number of dimensions is reduced, some information is inevitably lost [10].

To expand on the limitations of the method, three key factors about the dataset should be considered: the linearity of the data, the presence of outliers, and the tolerance for loss of information [10]. PCA assumes correlation between features, meaning that if the variables are not correlated, the principal components computed by the algorithm will not be good representations of the overall dataset. PCA also disregards components with low variance, as it assumes that only the high-variance components retain valuable information about the dataset [10]. Outliers in the dataset also affect the computed components, because the method minimizes quadratic norms when calculating them; as the number of outliers increases, these data points start to drive the principal components and degrade the accuracy of the results. Finally, the interpretability of the newly created components should be highlighted as a challenge, since it is usually difficult to draw meaningful conclusions from these derived dimensions.

Robust PCA

In order to address some of the challenges outlined above, numerous variations of traditional PCA have been proposed over time. One of them is Robust PCA, which was developed specifically for datasets with a lot of noise and outliers that can corrupt the Principal Components (PCs) extracted from the data. It works by decomposing the input matrix into a sparse component (the corrupted entries) and a low-rank component (the clean data); the low-rank component is then used as the basis for further analysis. This decomposition is carried out iteratively so as to best satisfy the optimization objective [11].

Sparse PCA

Another variation is Sparse PCA, which introduces sparsity constraints on the PCs, so that compared to standard PCA the component weights contain many zero and near-zero values. This makes the computed PCs easier to interpret, since not all dimensions are represented in the newly extracted features (only those with non-zero weights are), with the irrelevant and redundant ones being discarded [12]. The degree of sparsity can be adjusted in various ways, depending on how the feature weight selection is done; Python’s scikit-learn library, for instance, uses iterative soft thresholding with L1 regularization (as in Lasso regression) for feature weight selection, which controls the sparsity of the solution. In this way, Sparse PCA selects the most informative features, leading to more understandable visualizations of the data [13].

To detect anomalies using PCA, the reconstruction error, i.e. the distance between the original data point and its projection onto the lower-dimensional subspace, can be considered [12]. The reconstruction error will be high for anomalies, since they deviate significantly from the subspace spanned by the principal components or lie far from the regions where regular data points are clustered. By computing the reconstruction error for each point and setting a threshold for what counts as an outlier, anomalies can be separated from the dataset. Below is an image showing how PCA projects the original points (blue) onto the principal components found by maximizing the variance (red); a long distance from an original point to its projection suggests that the point may be an outlier.

An example of how PCA maps the original datapoints to the principal components to reduce dimensionality, 2D to 1D in this image
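A minimal sketch of this reconstruction-error approach with scikit-learn's PCA follows; the synthetic data, the number of components, and the percentile threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# correlated 3-D data lying near a plane, plus one point far off that plane
base = rng.normal(size=(500, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + base[:, 1] + rng.normal(0, 0.05, 500)])
X = np.vstack([X, [[0.0, 0.0, 8.0]]])

pca = PCA(n_components=2).fit(X)
X_proj = pca.inverse_transform(pca.transform(X))   # project onto the PC subspace and back
recon_error = np.linalg.norm(X - X_proj, axis=1)

threshold = np.percentile(recon_error, 99)         # illustrative threshold choice
print("flagged indices:", np.where(recon_error > threshold)[0])
```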

Works Cited

[1] R. Aggarwal, “Types of Outliers in Data Mining,” Geeks for Geeks, 4 May 2023. [Online]. Available: https://www.geeksforgeeks.org/types-of-outliers-in-data-mining/. [Accessed 24 August 2023].

[2] S. Prabhakaran, “Mahalanobis Distance — Understanding the math with examples,” machinelearningplus.com, [Online]. Available: https://www.machinelearningplus.com/statistics/mahalanobis-distance/. [Accessed 17 July 2023].

[3] L. C. J. Yen-Wei Chen, Subspace Methods for Pattern Recognition in Intelligent Environment, Berlin: Springer, 2014.

[4] F. T. Liu, K. M. Ting and Z.-H. Zhou, “Isolation Forest,” in 2008 Eighth IEEE International Conference on Data Mining, 2008.

[5] A. Mishra, “Swamping and Masking in Anomaly detection: How Subsampling in Isolation Forests helps mitigate this ?,” Walmart Global Tech Blog, 12 July 2019. [Online]. Available: https://medium.com/walmartglobaltech/swamping-and-masking-in-anomaly-detection-how-subsampling-in-isolation-forests-helps-mitigate-bb192a8f8dd5. [Accessed 24 August 2023].

[6] T. Mullin, “DBSCAN Parameter Estimation Using Python,” Medium.com, 10 July 2020. [Online]. Available: https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd. [Accessed 24 July 2023].

[7] D. Valeti, “DBSCAN Algorithm for Fraud Detection & Outlier Detection in a Data set,” Medium.com, 18 October 2021. [Online]. Available: https://medium.com/@dilip.voleti/dbscan-algorithm-for-fraud-detection-outlier-detection-in-a-data-set-60a10ad06ea8. [Accessed 24 July 2023].

[8] D. Dey, “DBSCAN Clustering in ML | Density based clustering,” Geeks for Geeks, 23 May 2023. [Online]. Available: https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/. [Accessed 24 July 2023].

[9] I. T. Jolliffe and J. Cadima, “Principal component analysis: a review and recent developments,” Philosophical Transactions: Physical Sciences and Engineering, 2016.

[10] J. Schork, “Advantages & Disadvantages of Principal Component Analysis (PCA),” Statistics Globe, [Online]. Available: https://statisticsglobe.com/advantages-disadvantages-pca. [Accessed 20 July 2023].

[11] E. J. Candès, X. Li, Y. Ma and J. Wright, “Robust Principal Component Analysis,” 17 December 2009. [Online]. Available: https://arxiv.org/pdf/0912.3599.pdf. [Accessed 24 July 2023].

[12] A. Sankar, “Principal Component Analysis Part 1: The Different Formulations.,” Towards Data Science, 29 September 2021. [Online]. Available: https://towardsdatascience.com/principal-component-analysis-part-1-the-different-formulations-6508f63a5553. [Accessed 24 July 2023].

[13] Z. Jaadi, “A Step-by-Step Explanation of Principal Component Analysis (PCA),” Built-In, 29 March 2023. [Online]. Available: https://builtin.com/data-science/step-step-explanation-principal-component-analysis. [Accessed 24 July 2023].

[14] S. Glen, “Z-Score: Definition, Formula and Calculation,” StatisticsHowTo.com, [Online]. Available: https://www.statisticshowto.com/probability-and-statistics/z-score/. [Accessed 19 July 2023].

[15] J. Huo, C. Lu, Y. Yang and H. Guo, “Quality Outlier Detection for Tobacco Based on Robust Sparse PCA: Advantages and Limitations,” in IEEE 13th International Conference on Software Engineering and Service Science (ICSESS), 2022.

[16] I. T. Jolliffe, Principal Component Analysis, New York, NY: Springer, 2010.

[17] N. Rahmah and I. S. Sitanggang, “Determination of Optimal Epsilon (Eps) Value on DBSCAN Algorithm to Clustering Data on Peatland Hotspots in Sumatra,” in IOP Conference Series: Earth and Environmental Science, Bogor, 2016.

[18] P. Jain, M. S. Bajpai and R. Parmula, “A Modified DBSCAN Algorithm for Anomaly Detection in Time-series Data with Seasonality,” The International Arab Journal of Information Technology, vol. 19, no. 1, 19 January 2021.

[19] N. M. Mutua and P. Matoušek, “Outlier Detection in Smart Grid Communication,” in 17th European Dependable Computing Conference (EDCC 2021), Munich, 2021.

[20] P. Wenig, “Local Outlier Factor for Anomaly Detection,” Towards Data Science, 6 December 2018. [Online]. Available: https://towardsdatascience.com/local-outlier-factor-for-anomaly-detection-cc0c770d2ebe. [Accessed 27 July 2023].

[21] V. Jayaswal, “Local Outlier Factor (LOF) — Algorithm for outlier identification,” Towards Data Science, 31 August 2020. [Online]. Available: https://towardsdatascience.com/local-outlier-factor-lof-algorithm-for-outlier-identification-8efb887d9843#:~:text=4.-,LOCAL%20REACHABILITY%20DENSITY%20(LRD),present%20around%20a%20particular%20point.. [Accessed 27 July 2023].

[22] S. Machiraju, “Detect Anomalies in Telemetry Data using Principal Component Analysis,” Towards Data Science, 26 July 2022. [Online]. Available: https://towardsdatascience.com/detect-anomalies-in-telemetry-data-using-principal-component-analysis-98d6dc4bf843. [Accessed 27 July 2023].

[23] A. Ali, “Auto Encoder with Practical Implementation,” The Art of DataScience, 26 May 2019. [Online]. Available: https://medium.com/machine-learning-researcher/auto-encoder-d942a29c9807. [Accessed 2023].

[24] N. Hubens, “Deep inside: Autoencoders,” Towards Data Science, 25 February 2018. [Online]. Available: https://towardsdatascience.com/deep-inside-autoencoders-7e41f319999f. [Accessed 2023].

[25] R. Agrawal, “Complete Guide to Anomaly Detection with AutoEncoders using Tensorflow,” Analytics Vidhya, 10 January 2022. [Online]. Available: https://www.analyticsvidhya.com/blog/2022/01/complete-guide-to-anomaly-detection-with-autoencoders-using-tensorflow/. [Accessed 2023].

[26] K. Muralidhar, “How to decide on the number of bins of a Histogram?,” Data Driven Investor, 16 February 2021. [Online]. Available: https://medium.datadriveninvestor.com/how-to-decide-on-the-number-of-bins-of-a-histogram-3c36dc5b1cd8. [Accessed 2023].

[27] M. Goldstein and A. Dengel, “Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Kaiserslautern, 2014.

[28] A. Kharwal, “Sparse PCA in Machine Learning,” The Clever Programmer, 9 May 2021. [Online]. Available: https://thecleverprogrammer.com/2021/05/09/sparse-pca-in-machine-learning/. [Accessed 24 July 2023].

[29] I. Cohen, “Outliers explained: a quick guide to the different types of outliers,” Towards Data Science, 5 October 2018. [Online]. Available: https://towardsdatascience.com/outliers-analysis-a-quick-guide-to-the-different-types-of-outliers-e41de37e6bf6. [Accessed 24 August 2023].

Appendix

Appendix A — Standard PCA Computational Steps

Step 1 — Normalization

This step ensures that each variable has a mean of 0 and a standard deviation of 1, which allows the contribution of each variable to be analyzed equally:

Z = (X - μ) / σ

Here, X represents the data point, μ (mu) is the mean, and σ (sigma) is the standard deviation of the variable.

Step 2 — Computation of the Covariance Matrix

The covariance matrix shows how the variables in the dataset vary with respect to each other. A positive covariance indicates that two variables move in the same direction, a negative covariance means that they are inversely related, and a covariance of zero shows that there is no linear relation between them.

Step 3 — Computation of the Eigen Vectors of the Covariance Matrix

Then, the eigenvectors of the covariance matrix C are computed using the relationship

C A = θ A

where A is an eigenvector and θ (theta) is the corresponding eigenvalue. The eigenvalues are sorted in descending order, along with their corresponding eigenvectors, and the projection matrix is assembled from these eigenvectors. The eigenvectors, which are orthogonal to each other since the covariance matrix is symmetric, are called the principal axes of the data, and the projections of the dataset onto these axes are the principal components [13].
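These steps can be reproduced directly with NumPy; the sketch below, on an arbitrary synthetic dataset, standardizes the data, computes the covariance matrix and its eigendecomposition, and projects the data onto the top principal axes.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))   # arbitrary correlated data

# Step 1: normalization (zero mean, unit standard deviation per feature)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data
C = np.cov(Z, rowvar=False)

# Step 3: eigendecomposition, sorted by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)          # eigh: covariance matrices are symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# keep the top 2 principal axes and project the data onto them
W = eigvecs[:, :2]                            # projection matrix
principal_components = Z @ W
print("explained variance ratio:", eigvals[:2] / eigvals.sum())
```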
