Statistical vs Graph-based approaches for Anomaly Detection in company networks: what’s best?

Giacomo Zarbo
Data Reply IT | DataTech
Apr 29, 2022

In recent years, cyber attacks have grown tremendously. For businesses that work with Big Data, the issue is even more delicate: cyber criminals can access massive amounts of personal information using advanced technologies that are becoming very difficult to counter. The real challenge, then, has been to turn Big Data from a “threat” into an “opportunity.”

This article describes two different Big Data-based cybersecurity approaches, applied specifically to the detection of malicious entities.

The first approach uses statistical techniques to learn the “habits” of users within a corporate network and investigate any anomalies in their behavior, in order to find users infected by viruses. The second, through dynamic graphs, builds a scoring system to find malicious hosts with which company users have made connections.

Statistical approaches

In the first part, statistical approaches for detecting infected users are tested on a daily basis, looking for anomalies in user behavior on a given day.

All experiments in this part use a Moving Window algorithm: a time window is built over a set of days and used for the training phase, so that the algorithm can learn the usual behavior of each user over that period. The day immediately following the window is used as the test day. The experiment is then repeated several times, advancing the window (and the test day) by one day.
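The sliding split described above can be sketched in a few lines of Python. The day labels and window size below are illustrative assumptions, not values from the actual experiment.

```python
from typing import Iterator, Sequence, Tuple


def moving_windows(
    days: Sequence[str], window_size: int
) -> Iterator[Tuple[Sequence[str], str]]:
    """Yield (training_window, test_day) pairs, sliding one day at a time."""
    for start in range(len(days) - window_size):
        yield days[start : start + window_size], days[start + window_size]


# Hypothetical example: a 7-day training window over ten days of logs.
days = [f"2022-04-{d:02d}" for d in range(1, 11)]
for train, test in moving_windows(days, window_size=7):
    pass  # learn habits on `train`, look for anomalies on `test`
```

Each iteration trains on one window and tests on the single day that follows it, exactly mirroring the repetition described in the text.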

The Moving Window algorithm

In this first approach, Feature Engineering is also performed, in order to create new “quantitative” features from the fields of the dataset.

This procedure serves two purposes. First, it keeps only the relevant data from the initial dataset, discarding fields that are less useful for the experiment. Second, it captures the habits of individual users within the company, i.e. all the information that can establish behavioral trends.

This is done by taking the “raw” information in the dataset and aggregating it into “quantitative” features, as these better represent user habits and highlight anomalies more clearly.

For example, it is much easier to reason about a user’s habits not by tracking which sites the user usually visits (too expensive a computation if the number of users in the company is very high), but by counting how many sites they visit. Likewise, it is much easier to detect an anomaly in the number of sites visited during a day than to check for any site the user has never visited before.

The most important new features created are therefore:

  • cnt_domain which represents the number of distinct top-level domains of the URL belonging to the connections;
  • cnt_status_2xx which represents the number of connections with HTTP status code between 200 and 299;
  • cnt_status_4xx which represents the number of connections with HTTP status code between 400 and 499;
  • cnt_status_5xx which represents the number of connections with HTTP status code between 500 and 599;
  • cnt_payload which indicates the number of distinct payloads belonging to connections;
  • cnt_host which represents the number of distinct hosts the user connected to during the day, extracted as the number of distinct hostIDs;
  • cnt_URL which represents the number of distinct URLs the user connected to;
  • avg_payload which indicates the average (in Bytes) of payloads belonging to connections.
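A minimal sketch of this aggregation step, in plain Python: each raw log row is assumed (hypothetically — the article does not show the dataset schema) to carry `user`, `host_id`, `url`, `domain`, `status`, and `payload` fields, and the per-user features above are computed from them.

```python
from collections import defaultdict
from statistics import mean


def daily_user_features(connections):
    """Aggregate raw proxy-log rows into the per-user quantitative features.

    Each row is a dict with hypothetical keys: user, host_id, url, domain,
    status (int HTTP code), payload (size in bytes).
    """
    per_user = defaultdict(list)
    for row in connections:
        per_user[row["user"]].append(row)

    features = {}
    for user, rows in per_user.items():
        features[user] = {
            "cnt_domain": len({r["domain"] for r in rows}),
            "cnt_status_2xx": sum(200 <= r["status"] < 300 for r in rows),
            "cnt_status_4xx": sum(400 <= r["status"] < 500 for r in rows),
            "cnt_status_5xx": sum(500 <= r["status"] < 600 for r in rows),
            "cnt_payload": len({r["payload"] for r in rows}),
            "cnt_host": len({r["host_id"] for r in rows}),
            "cnt_URL": len({r["url"] for r in rows}),
            "avg_payload": mean(r["payload"] for r in rows),
        }
    return features
```

In a Big Data setting the same counts and distinct-counts would of course be expressed as grouped aggregations in Spark rather than in-memory Python, but the logic is the same.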

These features are collected for each user, in order to capture their personal habits. Several features being anomalous at the same time can then be indicative of an infected user.

Approaches to detect anomalies in the features themselves, such as Percentile-99, Z-Score, and Inter-Quartile Range, are then tested.

Boxplot and IQR: each value lower than Lmin or greater than Lmax can be considered as “anomalous”.
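The IQR rule from the figure can be sketched as follows: Lmin and Lmax are the usual boxplot limits Q1 − k·IQR and Q3 + k·IQR (k = 1.5 is the conventional choice; the article does not state which coefficient was used).

```python
def iqr_bounds(values, k=1.5):
    """Compute the boxplot limits Lmin = Q1 - k*IQR and Lmax = Q3 + k*IQR."""
    xs = sorted(values)

    def quantile(q):
        # Linear interpolation between closest ranks.
        pos = q * (len(xs) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(xs) - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_anomalous(value, history, k=1.5):
    """A day's feature value is anomalous if it falls outside [Lmin, Lmax]."""
    lmin, lmax = iqr_bounds(history, k)
    return value < lmin or value > lmax
```

In the experiment, `history` would be the values a feature took over the training window, and `value` the same feature on the test day.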

Reinforcement Learning

Since the Inter-Quartile Range gives the best results, three reinforcement approaches are then built on top of it.

The first uses a dual reinforcement mechanism: based on the previous day’s results, it increases or decreases a certain anomaly-discrimination coefficient, or increases or decreases the number of anomalous features required to classify a user as compromised.

The second approach, an enhancement of the first, can take into account only certain types of anomalies rather than others.

The third approach uses a daily dynamic system of weights associated with each feature in the dataset. It gives the best results and highlights which features are most prominent in the search for anomalies.

Three different levels of Reinforcement Learning

The best results in this experiment are obtained with the third approach: it adapts autonomously to the dataset under consideration and learns which features must be weighted more than others.
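One way the daily weight system could work is sketched below. The exact update rule is not given in the article, so the rule here (reinforce features whose anomalies lined up with confirmed detections, decay the others, renormalize) is an illustrative assumption.

```python
def update_weights(weights, feature_hits, confirmed, lr=0.1):
    """One daily step of a hypothetical weight-update rule.

    weights:      {feature: weight}, assumed to sum to 1.
    feature_hits: {feature: True} for features that flagged an anomaly
                  on the previous day.
    confirmed:    whether the previous day's detections were confirmed.
    """
    new = {}
    for feat, w in weights.items():
        # Reinforce features that contributed to a confirmed detection,
        # decay all the others (multiplicative update, floored above zero).
        reward = 1.0 if (feature_hits.get(feat) and confirmed) else -1.0
        new[feat] = max(w + lr * reward * w, 1e-6)
    total = sum(new.values())
    return {feat: w / total for feat, w in new.items()}


def user_score(anomalous, weights):
    """Weighted anomaly score: sum of the weights of the anomalous features."""
    return sum(weights[f] for f, is_anom in anomalous.items() if is_anom)
```

A user is then classified as compromised when `user_score` exceeds a threshold; over time, the weights concentrate on the features that historically mattered most.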

Graph-based approach

The second part of the experiment tests a graph-based approach on a daily basis, modeling the connections of business users to external hosts. Each node of the graph represents a user or a host, and each edge represents a connection that a certain user has made to a certain host. The resulting graph can be thought of as a daily “portrait” of the communications of company users.

Since it is rebuilt daily, with all the corresponding changes, this graph is treated as a dynamic graph and studied day by day.

The algorithm uses the Pregel API, offered by Spark GraphX, to detect, through a scoring system, new malicious hosts potentially harmful to the company that had never been discovered before (or had never been recognized as such).

Pregel IN and Pregel OUT

The purpose of the first use of the Pregel paradigm, called Pregel IN, is to accumulate scores on users. If a user has connected to a malicious host (whose maliciousness is known a priori), the user earns that host’s score. Each user, in the end, will have an accumulated score equal to the sum of the scores of all hosts they have connected to.

Pregel IN Phase: messages go from Hosts to Users
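Stripped of the Spark GraphX machinery, the Pregel IN superstep reduces to one round of host-to-user message passing; a plain-Python sketch (edge and score structures are hypothetical simplifications of the actual GraphX vertex/edge RDDs):

```python
def pregel_in(edges, host_scores):
    """One Pregel IN superstep: each host sends its score along every edge;
    each user accumulates the sum of the scores of the hosts it contacted.

    edges:       iterable of (user, host) pairs for the day's connections.
    host_scores: {host: score} for hosts known a priori to be malicious;
                 hosts absent from the dict contribute 0.
    """
    user_scores = {}
    for user, host in edges:
        user_scores[user] = user_scores.get(user, 0.0) + host_scores.get(host, 0.0)
    return user_scores
```

In GraphX this corresponds to hosts emitting their score as messages and users merging them with a sum, in a single superstep.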

The second use of Pregel, called Pregel OUT, takes the score accumulated by Pregel IN on the previous day and has the purpose of “downloading” that score onto the hosts contacted on the current day. A user who accumulated a score the previous day can be considered compromised, since that score could only have come from malicious hosts (or hosts the proxy server considered as such); the score is therefore downloaded onto the hosts the user connects to on the day under consideration, because these are most likely malicious as well. The hosts receiving the score are, by construction, hosts not seen by the user the previous day, so this mechanism identifies new malicious hosts never seen before, or never known as such.

Pregel OUT Phase: messages go from Users to Hosts
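The complementary Pregel OUT superstep, in the same simplified style (again a sketch of the logic, not the actual GraphX implementation):

```python
def pregel_out(edges_today, user_scores_yesterday, hosts_seen_yesterday):
    """One Pregel OUT superstep: users that accumulated a score yesterday
    redistribute ("download") it onto the hosts they contact today that
    were not seen yesterday, flagging candidate new malicious hosts.

    edges_today:           iterable of (user, host) pairs for today.
    user_scores_yesterday: {user: score} produced by the Pregel IN phase.
    hosts_seen_yesterday:  set of hosts already observed the previous day.
    """
    host_scores = {}
    for user, host in edges_today:
        score = user_scores_yesterday.get(user, 0.0)
        if score > 0 and host not in hosts_seen_yesterday:
            host_scores[host] = host_scores.get(host, 0.0) + score
    return host_scores
```

Hosts with a high downloaded score at the end of the day are the candidate new malicious hosts reported to the security team.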

Final results

In the last part of the experiment, the results of the statistical and graph-based approaches are compared.

The statistical approaches produced quite good results, highlighting their flexibility and ability to learn the habitual behavior of each individual, and detecting numerous potentially compromised users within the company.

The graph-based approach, on the other hand, is not based on learning specific past behavior but on a dynamic scoring system. It detected, in particular, many malicious hosts external to the corporate network to which users were connecting, hosts that had never been identified as such before.

Both approaches have proven very useful within the company, providing security teams with a precise trail of possible malicious entities and allowing them to conduct their analyses in a more precise and concrete way.

Conclusions

This article is intended to provide suggestions or approaches in Data-driven security domains.

It should not be forgotten that the quality of the solutions found also depends on the quality of the starting dataset. But the great strength of the solutions described in this article lies in their customizability and scalability.

For the statistical approach, for example, it is very easy to integrate new features in the Feature Engineering step if the starting dataset contains more information. For the graph approach, a more complex scoring system can be devised if necessary.

In addition, the proposed solutions can be easily integrated into other security environments beyond corporate networks, such as banking, finance, or insurance.
