Harnessing Deep Learning and low-fidelity security insights to detect advanced persistent threats (Part 2 of 2): Introducing our approach

Yasmin Bokobza
Data Science at Microsoft
9 min read · Dec 7, 2021

By Yasmin Bokobza and Jonatan Zukerman

This is the second article of a series focusing on detecting advanced persistent threats formulated as multivariate anomaly detection using Deep Learning and security insights. In our first article, we discussed the use of UEBA in some Microsoft offerings and various methods for anomaly detection, with an emphasis on what led us to develop our approach of fusing security research alongside Deep Learning to detect advanced persistent threats. We also provided a list of some popular Python packages that represent a good starting point for performing anomaly detection.

In this article, we present the high-level architecture of our technique, specifically how we leverage security research knowledge and autoencoders to identify compromised accounts, including how we deployed the technique using Azure Synapse. We also share the results and validation for our business scenario within Microsoft, including a case study of a reported attack on a large Azure customer that resulted in compromised accounts.

Fusing security research alongside Deep Learning

In this section we introduce our approach to detecting compromised accounts by combining malicious indications generated by our UEBA engine, known attack vectors, and Deep Learning. We propose an unsupervised, explainable anomaly detection engine that uses autoencoders to identify compromised accounts based on their recent activities. The high-level architecture of our technique, depicted in Figure 1, can be divided into three stages: log filtration, engine enrichment, and anomaly detection.

Figure 1: High-level architecture of the existing UEBA engine and our anomaly detection engine.

The first stage aims to determine relevant logs. The security research team analyzes the raw data log types and filters for logs with security value that can be mapped to an attack vector and to the MITRE framework. The security researchers then map the raw data fields into a normalized schema: parsing the data to collect information such as user identifiers, timestamps, device information, and IP addresses is critical for extracting the right value from the data.
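As a concrete illustration, the sketch below shows what such a field mapping might look like in Python; the raw log type, the raw field names, and the normalized schema are hypothetical placeholders, not the schema actually used by the UEBA engine.

```python
# Minimal sketch of mapping raw log fields into a normalized schema.
# The raw log type, raw field names, and target schema are hypothetical examples.
NORMALIZED_FIELDS = ["user_id", "timestamp", "device_id", "ip_address", "activity_type"]

FIELD_MAP = {
    "signin_logs": {                     # hypothetical raw log type
        "UserPrincipalName": "user_id",
        "CreatedDateTime": "timestamp",
        "DeviceDetail": "device_id",
        "IPAddress": "ip_address",
    },
}

def normalize_record(log_type: str, raw: dict) -> dict:
    """Project a raw record onto the normalized schema, dropping unmapped fields."""
    mapping = FIELD_MAP.get(log_type, {})
    record = {target: raw.get(source) for source, target in mapping.items()}
    record["activity_type"] = log_type
    return record
```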

Next, the engine enrichment process is conducted. The security research team provides a list of enrichments that the UEBA engine adds to the filtered raw data. Those enrichments can be either contextual or behavioral.

  • Contextual enrichments provide more context to the analyst who is investigating an incident. Examples include mapping an IP address to a geographic location and Internet Service Provider (ISP), mapping an IP address to a device, adding high-value asset information, and adding relevant threat intelligence.
  • Behavioral enrichments provide information about a specific activity based on the entity context. These enrichments are designed to detect deviations from the entity’s normal behavior. While the behavioral enrichments can be used as a means of spotting specific behaviors, they can also serve as inputs to a multivariate anomaly detection engine, as illustrated in the sketch following this list.
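To make the two kinds of enrichment concrete, here is a minimal sketch of how a normalized record could be enriched; the geo_lookup table, the user_history profile, and the first_seen_isp flag are hypothetical stand-ins for the enrichment services and entity profiles the UEBA engine actually maintains.

```python
# Minimal sketch of contextual and behavioral enrichment on a normalized record.
# geo_lookup and user_history are hypothetical stand-ins for real enrichment sources.
def enrich(record: dict, geo_lookup: dict, user_history: dict) -> dict:
    ip = record.get("ip_address")

    # Contextual enrichment: attach geo location and ISP for the source IP.
    geo = geo_lookup.get(ip, {})
    record["geo_country"] = geo.get("country")
    record["isp"] = geo.get("isp")

    # Behavioral enrichment: flag a deviation from the entity's normal behavior,
    # for example an ISP this user has never connected from before.
    known_isps = user_history.get(record.get("user_id"), {}).get("known_isps", set())
    record["first_seen_isp"] = int(record["isp"] is not None and record["isp"] not in known_isps)
    return record
```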

Finally, during the anomaly detection stage the UEBA engine treats the enriched data generated in the engine enrichment stage as malicious indications that can be part of different types of activities. To detect anomalous combinations of these indications, the engine leverages Autoencoder models to provide a security-embedded ML anomaly detection approach. As mentioned in Part 1, the autoencoder approach was found to be superior to other anomaly detection methods because our problem is unsupervised multivariate anomaly detection: we are dealing with high-dimensional input data and a highly unbalanced data distribution that the model should explain. An Autoencoder is developed for each activity type; its input consists of the malicious indications identified per account and represents several aspects of possible attacks. Some of the indications are time series anomaly–based and others are distance-based anomalies. The Autoencoder output consists of reconstructed indications, and the basic assumption is that the network learns what normal indications look like, so abnormal indications are reconstructed with high error.
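For concreteness, here is a minimal PyTorch sketch of an autoencoder of the kind described above, reconstructing a per-account vector of malicious indications; the layer sizes, latent dimension, and use of mean squared error are illustrative assumptions rather than the production configuration.

```python
import torch
import torch.nn as nn

class IndicationAutoencoder(nn.Module):
    """Autoencoder over a per-account vector of malicious indications.
    Layer sizes are illustrative; real models are tuned per activity type."""
    def __init__(self, n_indications: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_indications, 32), nn.ReLU(),
            nn.Linear(32, latent_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, n_indications),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_error(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Per-sample mean squared reconstruction error: high error suggests
    an abnormal combination of indications."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)
```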

The account anomaly score, in a specific time window, is defined as the aggregation of the reconstruction errors of all the activities. The aggregation allows us to identify accounts with a high probability of being compromised. Namely, an alert for an account that may be compromised is generated when the aggregation of the trained Autoencoders’ reconstruction errors is above a predefined threshold.
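A minimal sketch of this aggregation, building on the autoencoder sketch above, might look as follows; summing the per-activity errors and the fixed threshold are illustrative assumptions, since the production engine may weight activities differently and calibrate the threshold empirically.

```python
def account_anomaly_score(account_vectors: dict, models: dict) -> float:
    """Aggregate reconstruction errors across activity types for one account.
    account_vectors maps activity type -> indication tensor for the time window;
    models maps activity type -> trained autoencoder for that activity type."""
    score = 0.0
    for activity, x in account_vectors.items():
        score += reconstruction_error(models[activity], x.unsqueeze(0)).item()
    return score

def is_alert(score: float, threshold: float) -> bool:
    """Raise an alert when the aggregated score exceeds the predefined threshold."""
    return score > threshold
```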

After calculating the account anomaly score, the engine uses the relative reconstruction error of each activity to explain to cybersecurity researchers the mechanism of a possible attack.
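One simple way to compute such relative errors is sketched below; normalizing each activity's reconstruction error by the account's total error is an illustrative choice, not necessarily the exact attribution scheme used by the engine.

```python
def explain_account(per_activity_errors: dict) -> list:
    """Rank activity types by their share of the aggregated reconstruction error,
    so researchers can see which activities drove a possible attack pattern.
    per_activity_errors maps activity type -> reconstruction error for the window."""
    total = sum(per_activity_errors.values()) or 1.0  # guard against an all-zero window
    relative = {activity: err / total for activity, err in per_activity_errors.items()}
    return sorted(relative.items(), key=lambda pair: -pair[1])
```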

Model deployment

Deploying the technique at scale is critical for practical use. Figure 2 illustrates the workflow for deploying our technique in our anomaly detection application. First, the relevant data is extracted from multiple Kusto databases. This data is stored in our blob storage using Microsoft Azure Data Factory (ADF), the Azure cloud Extract, Transform, Load (ETL) service for scaled-out serverless data integration and data transformation.

Figure 2: Scalable model deployment using Microsoft Azure.

As mentioned in the previous section, our approach involves developing Autoencoder models to provide a security-embedded ML anomaly detection approach. As part of the neural network model-building and execution stage, the framework should be chosen based on the use case requirements. These frameworks provide a choice of neural networks and tools for training and testing the selected network. Well-known frameworks that can assist with building neural nets include PyTorch, TensorFlow, and Caffe2. In our use case we chose PyTorch, which best suits our needs: it makes autoencoders straightforward to implement, is simple and easy to use, offers first-class Python support, builds dynamic computation graphs, and requires no special session interfaces.

After selecting the framework, we deploy the final architecture with separate training and scoring pipelines. The workload is distributed across a GPU cluster, which breaks the problem into smaller tasks that run on multiple GPU cores across multiple nodes. In addition, to train the massive number of models we use Azure Synapse, which allows us to process big data with serverless Spark pools using the latest Spark runtime. Azure Synapse also enables us to define specific cluster characteristics and automatically scale worker nodes. The trained Autoencoder models are then stored and used by our scoring pipeline. The output of the scoring pipeline is written back into our blob storage and then sent to the Kusto databases. Finally, the output is integrated into UEBA APIs for use by end users. Thanks to ADF and Azure Synapse, the model deployment pipeline can simply run at the desired frequency.
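As a simplified illustration of the separation between training and scoring, the sketch below persists a trained model at the end of the training pipeline and reloads it in the scoring pipeline; the storage path is a hypothetical mounted location standing in for the actual blob storage layout, and the model class and error function come from the earlier autoencoder sketch.

```python
import torch

MODEL_ROOT = "/mnt/models"  # hypothetical mount point backed by blob storage

def save_trained_model(model, activity_type: str) -> None:
    """Training pipeline: persist the trained autoencoder for one activity type."""
    torch.save(model.state_dict(), f"{MODEL_ROOT}/autoencoder_{activity_type}.pt")

def load_and_score(activity_type: str, x: torch.Tensor, n_indications: int) -> torch.Tensor:
    """Scoring pipeline: reload the trained model and score new indication vectors."""
    model = IndicationAutoencoder(n_indications)
    model.load_state_dict(torch.load(f"{MODEL_ROOT}/autoencoder_{activity_type}.pt"))
    model.eval()
    return reconstruction_error(model, x)
```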

Business scenario: Results and validation

Evaluating an unsupervised anomaly detection problem is challenging due to the lack of labeled data, so we split our evaluation process into two parts. First, we evaluated the Autoencoder models, which were trained on real customer activity, against an artificially generated dataset (described in this section). Then we evaluated the same trained models on actual attacks (described in the next section).

The Autoencoder models were trained on real activities among a subset of customers for 30 days. The artificial evaluation dataset contains normal activities, as well as anomalous and extremely anomalous activities that were generated according to known anomaly patterns. The ratio of the generated data was 1:1:1. Because an attacker would try to disguise malicious activities as much as possible, extreme anomalies are rare in real attacks. However, they are still important for evaluating the performance of our approach.

Table 1 presents the performance of the Autoencoders for four random types of activities, compared to PCA, which can be viewed as an early, linear predecessor of Autoencoders and serves as one of the benchmark models. Because the generated evaluation dataset was balanced, we used the area under the receiver operating characteristic curve (ROC AUC) to compare the accuracy of the Autoencoders and the PCA models. According to the results shown in Table 1, the Autoencoders outperform the benchmark models.
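For reference, a comparison of this kind can be computed with scikit-learn as sketched below, using the reconstruction error of a PCA projection as the baseline anomaly score; the input arrays, the number of components, and the variable names are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

def pca_reconstruction_error(X: np.ndarray, n_components: int = 8) -> np.ndarray:
    """Baseline anomaly score: error of reconstructing X from a PCA projection."""
    pca = PCA(n_components=n_components).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    return ((X - X_hat) ** 2).mean(axis=1)

# y_true holds 1 for anomalous (and extremely anomalous) activities, 0 for normal;
# ae_errors holds the autoencoder reconstruction errors for the same activities.
# auc_autoencoder = roc_auc_score(y_true, ae_errors)
# auc_pca = roc_auc_score(y_true, pca_reconstruction_error(X))
```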

Table 1: Comparison between the baseline model and the Autoencoder.

To evaluate the performance of our technique, we also analyze the classification of activities. Figure 3 presents the activity classification histograms generated by the Autoencoder models for the same four random types of activities shown in Table 1. From these results we can conclude that the models easily detected extreme anomalies and that the number of normal activities misclassified as anomalous is relatively low.

Figure 3: Activities classification histograms generated by the Autoencoder models.

To get a better understanding of the models’ accuracy, we examined their ability to distinguish among different types of activities. Namely, we compared the distributions of the activity classification histograms generated by the models. Figure 4 presents the classification histograms generated by the Autoencoder and PCA models for Type 3 activities as an example. The results indicate that the Autoencoder better separates normal from anomalous activities because there is less overlap between its histograms.

Figure 4: Type 3 activities classification distribution.

To evaluate the performance of our technique, it is also important to analyze the Autoencoder models’ ability to learn from known examples and generalize that knowledge to unseen examples. Figure 5 presents the Autoencoder training and validation learning curves for Type 1 and Type 4 activities as an example. The decrease in training and validation reconstruction errors (the anomaly scores), and their closeness at each epoch, indicate that the models are learning the training data well and generalizing to the validation data. Specifically, the models are not overfitting to the training set.

Figure 5: Autoencoders learning curve for two different types of activities.

As part of the evaluation, we also examined the distribution of the reconstruction errors of normal and anomalous activities in the artificial evaluation dataset and the test dataset. Figure 6 presents the distribution of reconstruction errors for Type 3 activities as an example. Looking at the error distribution in the test dataset, we found that the reconstruction error of more than 97 percent of the activities is very close to zero, while a few activities have higher values that reveal suspicious account activity. This observation matches our expectation that anomalous combinations of malicious indications are rare, because attackers do their best to disguise malicious activities.
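A small sketch of this check, assuming the per-activity reconstruction errors have already been collected into an array, is shown below; the near-zero cutoff is an arbitrary illustrative value.

```python
import numpy as np

def summarize_errors(errors: np.ndarray, near_zero: float = 1e-2) -> dict:
    """Share of activities whose reconstruction error is effectively zero,
    plus the size of the suspicious tail above that cutoff."""
    return {
        "share_near_zero": float((errors < near_zero).mean()),
        "suspicious_count": int((errors >= near_zero).sum()),
        "max_error": float(errors.max()),
    }
```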

Figure 6: Activities distribution in the artificial evaluation dataset and test set.

Case study: Real compromised accounts detection

As mentioned in the previous section, as part of the evaluation process we measured the accuracy of the trained Autoencoder models on reported attacks. Figure 7 describes the process of detecting real compromised accounts. First, the Autoencoder models are trained on 30 days of real activities, typically from a period before the compromised accounts were identified. We then rank the accounts according to their anomaly score. Our technique ranked approximately 50 percent of the compromised accounts in the top 95th percentile of that ranked list. Fifty percent recall is a meaningful number in the cybersecurity domain because the ability to detect a threat within a given time window is priceless. By applying our technique, we flag two percent of accounts as potentially compromised. In other words, we reduced the number of accounts for customers to handle manually and increased their chances of identifying compromised accounts from 0.06 percent to 2 percent of all accounts.
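The ranking and flagging step can be illustrated in a few lines of pandas; the column name and the two percent cut are assumptions for illustration, and the real cutoff is calibrated against the aggregated reconstruction errors described earlier.

```python
import pandas as pd

def flag_top_accounts(scores: pd.DataFrame, flag_fraction: float = 0.02) -> pd.DataFrame:
    """Rank accounts by aggregated anomaly score and flag the top fraction
    for investigation. Expects a column named 'anomaly_score' (illustrative)."""
    ranked = scores.sort_values("anomaly_score", ascending=False).reset_index(drop=True)
    n_flag = max(1, int(len(ranked) * flag_fraction))
    ranked["flagged"] = ranked.index < n_flag
    return ranked
```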

Figure 7: Real compromised accounts detection.

Conclusion

In this article we introduced our approach to detecting compromised accounts using a UEBA engine that leverages Deep Learning to strengthen security by detecting anomalies in the behavior patterns of users and other entities that could be indicative of a threat. By applying Machine Learning on top of enriched data, we can achieve better results in detecting anomalies and find cyberattacks in real time, while saving analysts a significant amount of time writing and modifying complex correlation rules.

We hope this article, and the series it is part of, helps you with your own business problems. Please leave a comment to share your anomaly detection scenarios and the techniques you are using today.

We’d like to thank the Microsoft Advanced AI School and Microsoft Research, especially James McCaffrey and Patrice Godefroid, for being great partners in the research design of this work. We also would like to thank Itay Argoety and Casey Doyle for helping review the work.
