What are we all about?


So actually, what are we doing between two blog posts?

Our Data Science team works in close collaboration with the developer team of Blindspotter. Our main task is to create the algorithms used in this Real-Time User Behaviour Analytics product for detecting rare, anomalous events in an enterprise IT system. As these irregularities are often the signs of malicious misuse or attacks, we make sure that the users of Blindspotter have the right tools for detecting risky actions, and also for exploring and mitigating such risks.

But how can we find attackers, misuse, and data breaches if examples to learn from are rare and, by the nature of the problem, never comprehensive? Unsupervised machine learning approaches (which explains the name of our blog) are the best way to find unknown intrusions and malicious activities. They can find anomalies in the logged data that represents the behavior of the users. To catch anomalies, we first have to define normal behavior. So in short, we implement algorithms that build a “normal profile”, called a baseline, from the data, and then measure the differences between this baseline and the incoming activities. As we firmly believe that there is no silver bullet for every use case, we create different methods dealing with different aspects of the data. These methods build different baselines and use them to evaluate the incoming activities.

First, let us look at the main ways we can handle data that contains events as records and the details of those events as fields.

The simplest way is to observe the data field by field. The so-called “one-dimensional” algorithms do exactly this. They are quite simple, or to be more precise, easily interpretable, algorithms.

Give us the “time of the login” field, and we can build a unique workload curve for every user; moreover, we can measure how unusual it is for a particular user to log in at a particular time. Although the implementation has to deal with some mathematical munging in order to produce a well-defined probabilistic measure of “unusualness”, this algorithm is easy to use, and its output is straightforward.
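
To make this concrete, here is a minimal sketch (not the actual Blindspotter implementation) of such a one-dimensional baseline: count a user’s historical login hours, smooth the counts into a probability for each hour, and treat a low-probability hour as unusual.

```python
# A toy per-user "login hour" baseline; the smoothing and scoring are illustrative.
from collections import Counter

def build_hourly_baseline(login_hours, smoothing=1.0):
    """Estimate P(hour) for one user from historical login hours (0-23)."""
    counts = Counter(login_hours)
    total = len(login_hours) + smoothing * 24
    return {h: (counts.get(h, 0) + smoothing) / total for h in range(24)}

def unusualness(baseline, hour):
    """Higher score means a more unusual login time for this user."""
    return 1.0 - baseline[hour]

# Usage: a user who normally logs in during office hours
history = [8, 9, 9, 10, 8, 9, 17, 9, 10, 8]
baseline = build_hourly_baseline(history)
print(unusualness(baseline, 9))   # low score: a typical hour
print(unusualness(baseline, 3))   # high score: a 3 AM login is unusual
```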

Given the hostname field, we can use the true magic of so-called recommendation engines. You can find the details in our previous post, but basically, if you have the user-host matrix, where the values represent whether a user has logged in to a host, you can define related and not-so-related hosts for a user. It happens just as easily as Amazon finds related products for a user based on the immensely big matrix of every user and every product. While Amazon focuses on the related products, we are much more interested in the unrelated hosts, as access to them can be a potential anomaly.
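
As an illustration of the idea, with hypothetical host names and a simple item-item cosine similarity standing in for whatever the product actually uses, a user-host matrix can be turned into a “relatedness” score like this:

```python
import numpy as np

hosts = ["web-01", "db-01", "db-02", "hr-01"]   # hypothetical host names
# Rows are users, columns are hosts; 1 means the user has logged in to that host.
M = np.array([
    [1, 1, 1, 0],   # user A
    [1, 1, 1, 0],   # user B
    [0, 0, 0, 1],   # user C
])

def host_similarity(M):
    """Cosine similarity between host columns (item-item collaborative filtering)."""
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    norms[norms == 0] = 1.0
    X = M / norms
    return X.T @ X

def relatedness(M, user, host, sim):
    """Average similarity of a host to the hosts this user already uses; low = suspicious."""
    used = M[user] > 0
    return sim[host][used].mean()

sim = host_similarity(M)
print(relatedness(M, user=0, host=hosts.index("db-02"), sim=sim))  # high: related host
print(relatedness(M, user=0, host=hosts.index("hr-01"), sim=sim))  # low: potential anomaly
```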

What if you have a feeling that not only the content of each activity record should be monitored, but the sheer amount of activity as well? That point of view also considers only a narrow aspect of the database, namely the count of activities over time. It differs from the methods mentioned earlier because it is an aggregated view.

After aggregating the data from the past, we have a clear view of what would be considered a normal number of events per hour for every user. If we also take into consideration whether the particular hour under investigation is typically busy or not, we can point out unusual bursts of activity over time.
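
A minimal sketch of this aggregated view, with made-up data and a simple z-score standing in for the real scoring, could look like this:

```python
import numpy as np
from collections import defaultdict

def hourly_baseline(events):
    """Build {hour: [historical counts]} from (day, hour) event records."""
    per_day_hour = defaultdict(int)
    for day_hour in events:
        per_day_hour[day_hour] += 1
    by_hour = defaultdict(list)
    for (day, hour), count in per_day_hour.items():
        by_hour[hour].append(count)
    return by_hour

def burst_score(by_hour, hour, current_count):
    """How far the current count is above what is normal for this hour of day."""
    counts = np.array(by_hour.get(hour, [0]))
    mu = counts.mean()
    sigma = counts.std() if counts.std() > 0 else 1.0
    return (current_count - mu) / sigma

# Usage: a user normally produces ~10 events at 14:00; today there are 80.
history = [(day, 14) for day in range(5) for _ in range(10)]
baseline = hourly_baseline(history)
print(burst_score(baseline, hour=14, current_count=80))   # large positive value: a burst
```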

Unfortunately, the world is much more complex than what can be observed in one dimension at a time.

To take the usual example: if a significant portion of your population is male, a significant portion of your population is pregnant, and you measure potentially anomalous entities only one dimension at a time, you will fail to point out any pregnant male in your database as an anomaly. The limitations are clear.
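
A toy example with invented frequencies shows the gap between the field-by-field view and the joint view:

```python
# Attribute frequencies here are made up purely for illustration.
records = ([("male", "not_pregnant")] * 49 + [("female", "pregnant")] * 30
           + [("female", "not_pregnant")] * 20 + [("male", "pregnant")] * 1)

n = len(records)
p_male = sum(1 for sex, _ in records if sex == "male") / n                 # 0.50: not rare
p_pregnant = sum(1 for _, status in records if status == "pregnant") / n  # 0.31: not rare
p_joint = sum(1 for r in records if r == ("male", "pregnant")) / n        # 0.01: rare

print(p_male, p_pregnant, p_joint)
# Field by field, neither "male" nor "pregnant" looks anomalous;
# only the joint view exposes the rare combination.
```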

So what methods do we use to build profiles of normality in a way that takes this multidimensionality into consideration?

We have some powerful tricks. One comes from retail again, and one from psychology.

For big retailers, the analysis of consumer baskets is a relatively common practice. What happens is that they want to distill information about trends and co-occurrences from a database which contains unordered sets of categorical items. Yes, that is the stuff you put on the conveyor belt at the checkout. The situation should be familiar by now: if profiles of the trends and structures of normal behavior can be built, they can be used to search for abnormality.

So that’s what we do. Treating every activity as a “basket” of items of different types of connection metadata (protocols, IP addresses, names, commands, clients), an algorithm can define patterns of co-occurrences and their relative strength in the history of the user. This approach can solve the “pregnant male” problem, as it views the data in a more holistic way.
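
A simple sketch of this basket view, with hypothetical activities and an ad-hoc scoring rule that merely stands in for the real pattern mining:

```python
from itertools import combinations
from collections import Counter

def cooccurrence_baseline(history):
    """Count how often each pair of items appeared together in past activities."""
    pairs = Counter()
    for basket in history:
        pairs.update(combinations(sorted(basket), 2))
    return pairs

def anomaly_score(pairs, basket):
    """Fraction of item pairs in the new activity never seen together before."""
    candidates = list(combinations(sorted(basket), 2))
    unseen = sum(1 for p in candidates if pairs[p] == 0)
    return unseen / len(candidates) if candidates else 0.0

# Hypothetical activities: each is a set of connection metadata items.
history = [{"ssh", "admin-client", "ls"}, {"ssh", "admin-client", "cat"}]
baseline = cooccurrence_baseline(history)
print(anomaly_score(baseline, {"ssh", "admin-client", "ls"}))   # 0.0: familiar combination
print(anomaly_score(baseline, {"rdp", "other-client", "ls"}))   # 1.0: unseen combination
```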

Given the problem of finding structure in a multidimensional dataset without any label or target variable, one would start to think about Principal Component Analysis (originating from the methods of psychologists). We had the same thoughts.

After dealing with the fact that we have mainly categorical variables, while PCA was originally used for analyzing numerical data, we can use this approach. PCA, as always, nicely creates the principal components that represent the multivariate structure of the analyzed data. Then, a measurement can be taken for every new data point (user activity) to see how good the fit is, and thus how usual (or unusual) the activity is. After all these careful modeling, deviation-measuring, and scoring steps, we can evaluate every user and every current activity. This means a score is calculated for every action taken, and we know that this score, called the anomaly score, is low if the action was close to normal and high if the action was far from normal. In the latter case, it is a good indication that the action is risky and worth investigating.
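
A minimal sketch of this pipeline, assuming scikit-learn (1.2+), a one-hot encoding of the categorical fields, and reconstruction error standing in for the actual deviation measure:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

# Hypothetical historical activities: [protocol, host, client].
history = [["ssh", "db-01", "admin"], ["ssh", "db-01", "admin"],
           ["ssh", "db-02", "admin"], ["rdp", "web-01", "office"]]

encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X = encoder.fit_transform(history)

pca = PCA(n_components=2)
pca.fit(X)

def anomaly_score(activity):
    """Reconstruction error: how poorly the learned components explain this activity."""
    x = encoder.transform(activity)
    x_hat = pca.inverse_transform(pca.transform(x))
    return float(np.linalg.norm(x - x_hat))

print(anomaly_score([["ssh", "db-01", "admin"]]))    # low: fits the learned structure
print(anomaly_score([["ssh", "web-01", "office"]]))  # higher: deviates from the baseline
```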

We have to constantly assure ourselves that this approach is effective, so we check our results by injecting fake abnormal events into our test databases and measuring how well our detectors can distinguish them from the original ones. As the goal of Blindspotter™ is to create a sorted list of the most risky events in a system, we use, among other measures, ROC AUC to see how good our detectors are. We have found that the injected samples of unknown abnormal activities get higher scores on average than the genuine ones.
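
In spirit, the check boils down to something like the following (labels and scores are made up; the AUC computation comes from scikit-learn):

```python
from sklearn.metrics import roc_auc_score

# 1 marks an injected fake anomaly, 0 a genuine activity.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
# Anomaly scores produced by a detector (higher = more anomalous).
scores = [0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.80, 0.55]

print(roc_auc_score(labels, scores))  # close to 1.0: the fakes rank near the top of the list
```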

Originally published at www.balabit.com on November 5, 2015 by László Kovács.
