If You’re Into Cybersecurity, Get Into Splunk and Machine Learning

The future will see the rise of the cybersecurity specialist who uses data science and machine learning in the way that we have used spreadsheets in the past. These specialists will be able to take data sets from system logs and from open source data sets, and then make predictions from the data. They will find the correlations and build new threat models. The rise of Splunk, too, seems almost endless, especially as it provides a great interface to machine learning models.

So let’s take an example for malware infection [download dataset]:

Image for post

The data set contains 98,944 records, which is rather a lot, so we will just analyse the first 50,000. The fields used are: receive_time, serial_number, session_id, src_ip, dst_ip, bytes_sent, bytes_received, packets_sent, packets_received, dest_port, src_port, used_by_malware, and has_known_vulnerability. First we will use logistic regression to predict “used_by_malware” from all the other fields:

Image for post

In machine learning, we normally use a 70/30 split, where we train on 70% of the data and then test against the remaining 30%. In this case we will use a simple logistic regression model with a 70/30 split:

Image for post
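
As a rough illustration of what a 70/30 split does, here is a minimal Python sketch. It is not how Splunk implements the split; the function name `train_test_split`, the fixed seed, and the use of row numbers as stand-in records are all assumptions made for the example.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the first 50,000 firewall records (row numbers only).
records = list(range(50000))
train, test = train_test_split(records)
print(len(train), len(test))
```

The model only ever sees the training portion, so the score on the held-back 30% gives a fairer estimate of how it would behave on new traffic.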

Next, in Splunk, we select the Train Model button, and wait for a minute or so. Our results are then:

Image for post

This creates a model named “example_malware”. Next we can view the confusion matrix with the following search (or by clicking on the Classification Results link):

| inputlookup firewall_traffic.csv | head 50000 | apply "example_malware" | `confusionmatrix("used_by_malware","predicted(used_by_malware)")`

This gives us:

Image for post

We can see that we were right 83.3% of the time in predicting that a device was not infected, and 74.1% of the time in predicting that it was infected. For our false positives, 16.7% of the devices without malware were predicted to have it, and for our false negatives, we failed to spot 25.9% of the devices that actually had malware. The overall quality of the model is defined with:

Image for post

Overall, we want to move these as near to 1 as possible. See the background section below for details on these metrics.
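
The confusion-matrix percentages quoted above are taken per row, so each row of actual outcomes adds up to 100%. The sketch below shows that calculation in Python. The counts are hypothetical and were chosen only so that the rows reproduce the same percentages; they are not the counts from the firewall data set, and `row_percentages` is an illustrative helper, not a Splunk function.

```python
def row_percentages(matrix):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for actual, row in matrix.items():
        total = sum(row.values())
        result[actual] = {
            predicted: round(100.0 * count / total, 1)
            for predicted, count in row.items()
        }
    return result

# Hypothetical counts: outer key = actual label, inner key = predicted label.
counts = {
    "not infected": {"not infected": 5, "infected": 1},
    "infected": {"not infected": 7, "infected": 20},
}

pct = row_percentages(counts)
for actual, row in pct.items():
    print(actual, row)
```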

Now let’s train with an SVM (Support Vector Machine), a machine learning model that tries to find the boundary which best separates our data into two classes:

Image for post

Now the model is much better. We correctly predict 91% of the devices that are not infected, and 99.2% of those that are infected:

Image for post

Our scores for precision, recall, accuracy and F1 are also much better, and getting much closer to 1:

Image for post

Now we will use a random forest classifier, an ensemble learning method for classification and regression that builds a number of decision trees and combines their predictions:

Image for post

In this case the results are almost perfect:

Image for post

This is shown by:

Image for post

Here we see that we are 98.3% accurate in predicting that a device does not have malware, and 98.8% accurate in predicting that it has.

So what?

We have a new Data & Cyber training programme coming up (supported by The Data Lab), and on 1 July 2020, you can find out about Machine Learning and Splunk. You can register here:

The course is presented over Zoom and is instructor-led. It will teach you about the different machine learning models and how to analyse your data. All the course content will be delivered online, with hands-on applications related to cybersecurity.

Background

In our data analysis for Cybersecurity we must often classify our data so that we can efficiently search for things, or use it to trigger alerts from rules. A rule might relate to the blocking of access to a remote site. Thus our classification might relate to building up a list of whitelisted IP addresses and blacklisted ones. For each access, we may have to assess the trustworthiness of an IP address, such as whether it is listed as malicious, or whether its domain name has existed for a specified amount of time. If we get it right, we have a successful classification; if it is wrong, then we have been unsuccessful. The success rate will then be defined using:

  • True-positive (TP). Something has been successfully classified as the thing that we want it classified as.
  • False-positive (FP). We have classified something as a match, but it is not. This is defined as a Type I error. An example might be where a system classifies an event as a hack because a user enters an incorrect password a number of times, but, on investigation, it is found that the valid user had just forgotten their password.
  • True-negative (TN). We rejected something, and it is not a match.
  • False-negative (FN). We have dismissed something, but, in fact, it is a match. This is defined as a miss and is a Type II error. With this, a hacker might try a number of passwords for a user, but the system does not create an alert for the intrusion.
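
The four outcomes above can be tallied by comparing each prediction with what actually happened. The following Python sketch assumes boolean labels where True means "malicious"; the function name `count_outcomes` and the six sample labels are made up for illustration.

```python
from collections import Counter

def count_outcomes(actual, predicted):
    """Tally TP, FP, TN and FN from paired boolean labels (True = malicious)."""
    tally = Counter()
    for is_malicious, flagged in zip(actual, predicted):
        if flagged and is_malicious:
            tally["TP"] += 1   # alert raised, and it was a real attack
        elif flagged and not is_malicious:
            tally["FP"] += 1   # alert raised, but nothing was wrong (Type I)
        elif not flagged and not is_malicious:
            tally["TN"] += 1   # no alert, and nothing was wrong
        else:
            tally["FN"] += 1   # no alert, but it was a real attack (Type II)
    return tally

# Illustrative labels only.
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

outcomes = count_outcomes(actual, predicted)
print(dict(outcomes))
```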

The Accuracy of the analysis could then be defined as:

Accuracy = (TP+TN)/total

The Sensitivity (or the True Positive Rate) is then the number of true positives (TP) against the number of actual positives (TP+FN), including those that we missed:

Sensitivity = TP/(TP+FN)

The Sensitivity is also known as Recall.

Within Cybersecurity analytics, it is often the Sensitivity Rate which is seen to be a strong measure of the trustworthiness of the system. Thus the higher the Sensitivity Rate, the higher the confidence that an analyst will have in the machine classification/search. If the Sensitivity Rate is low, the human analysts may lose trust in the machine to classify and find things correctly. A rate of 0.1 would mean that only one in ten of the real matches was actually found. But this metric does not measure the number of false positives, and these could waste an analyst's time in investigating something that is not a real match. So an improved measure might relate to the precision of the matches, which is the ratio of the true positive matches to the total number of positive matches. In this example, we would have a precision of:

Precision = TP/(TP+FP)

In this case the analyst would have to deal with one in three false positives. For the Specificity we define the True Negative Rate: when we do not have a match, how often do we predict it correctly? In our example, this would be:

Specificity = TN/(TN+FP)

And the False Positive Rate (FPR) is then 1 minus the Specificity. In our example, this would be:

False Positive Rate = FP/(FP+TN)

Prevalence then defines how often the condition that we are looking for actually occurs, as a ratio of all the data records that have been sampled. In our example, this would be:

Prevalence = (FN+TP)/Total
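
To pull the measures in this section together, here is a small Python sketch that computes them from the four counts. The counts passed in are hypothetical and are not taken from the article's data set. The False Positive Rate uses the standard definition FP/(FP+TN), which equals 1 minus the Specificity, and the F1 score (mentioned earlier but not defined in this section) is assumed to be the usual harmonic mean of Precision and Recall.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the measures described above from the four outcome counts.

    No guard against zero denominators is included, so every class must
    appear at least once in the counts.
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called Sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": recall,
        "precision": precision,
        "specificity": specificity,
        "false_positive_rate": fp / (fp + tn),
        "prevalence": (tp + fn) / total,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for 100 classified records.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```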

And so more …

Here’s more:

So what? Well, in partnership with The Data Lab, we are releasing a new MOOC on 3 August 2020, so watch this space. It will have a full Splunk learning environment, where you can analyse a whole lot of cybersecurity data sets and apply machine learning.

ASecuritySite: When Bob Met Alice

This publication brings together interesting articles…

Prof Bill Buchanan OBE

Written by

Professor of Cryptography. Serial innovator. Believer in fairness, justice & freedom. EU Citizen. Auld Reekie native. Old World Breaker. New World Creator.
