Machine Learning to Maximize Security

Machine learning (ML) uses algorithms and statistical techniques to make predictions about data. It “learns” from data, finds patterns and hence makes decisions with little human help. It is part of the larger field of artificial intelligence.

It is already widely used in marketing, finance and sales, and over the last few decades ML has made these industries more and more effective. Data analytics uses ML, and businesses use it to gain insights into specific topics so that they can then liaise with customers accordingly. Another area that seems likely to benefit substantially is cybersecurity.

The 2016 Verizon Data Breach Report showed that attackers can get into any company network within 2 minutes of trying and remain undetected for as long as 6 months. As you can imagine, a lot of damage can be done within that time, and a lot of information gathered by the attacker. Current perimeter-based tools, such as firewalls, anti-virus software and intrusion detection tools, cannot prevent such advanced attacks. Attackers’ tools are becoming more sophisticated by the day, rendering current tools useless. Something deeper that can contain and engage with the attack is needed.

Let’s now look at how data science has been used in cybersecurity since the 1990s. Please refer to Figure 1 for the evolution of security data science.

Data science has been used in cybersecurity since the 1990s to detect malware and malicious websites and domains. However, applying deep learning to security data is not as easy as one might think. To increase security, the data is often left unlabelled. This is currently being addressed by data holders, who are starting to introduce class labels. In the meantime, there are already petabytes (10^15 bytes) of existing unlabelled data. Such data generally has a very low signal-to-noise ratio. If the attacks within the data are few and far between, connecting the sequences of the attacks to build up a pattern is very hard.

Two kinds of anomaly detection system apply the principles of data science: rule-based and behaviour-based. A behaviour-based system builds a baseline for an entity, device or user in the network and compares it with actual realtime data, which should reveal any anomalies.

Behaviour-based detection is, however, 100 times less reliable than rule-based detection and gives rise to many false positives. Rule-based systems are written by experts and, as their name suggests, raise alarms based on specific rules, e.g. if the number of login attempts to a bank account is above a certain threshold, raise an alarm.
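The login-attempt rule mentioned above can be sketched in a few lines of Python. This is only an illustration: the class name `LoginRule`, the sliding time window and the threshold values are assumptions made for the example, not details of any particular product.

```python
from collections import defaultdict, deque


class LoginRule:
    """Hypothetical rule: alarm when one account sees too many login attempts in a window."""

    def __init__(self, max_attempts=5, window_seconds=60):
        self.max_attempts = max_attempts      # assumed threshold
        self.window_seconds = window_seconds  # assumed sliding window
        self._attempts = defaultdict(deque)   # account -> timestamps of recent attempts

    def record_attempt(self, account, timestamp):
        """Record one attempt and return True if the rule fires."""
        recent = self._attempts[account]
        recent.append(timestamp)
        # Drop attempts that have fallen out of the sliding window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_attempts


rule = LoginRule(max_attempts=3, window_seconds=60)
alarms = [rule.record_attempt("acct-1", t) for t in (0, 10, 20, 30, 40)]
print(alarms)  # the fourth and fifth attempts exceed the threshold of 3
```

The rule is easy to audit because an expert chose the threshold explicitly, but it only catches the behaviour it was written for.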

Behaviour-based anomaly detection systems look for deviations from normal host and network behaviour. The specific algorithms used include clustering, SVD, One-Class SVM, DBSCAN, Robust PCA and KDE.
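The baseline comparison described above can be illustrated with a deliberately simple stand-in. The sketch below uses a mean and standard deviation (a z-score test) rather than any of the algorithms listed, and the connection counts, function names and threshold of 3 standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarise an entity's normal behaviour as (mean, standard deviation)."""
    return mean(history), stdev(history)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical hourly outbound connection counts for one host.
history = [40, 42, 38, 41, 39, 43, 40, 37]
baseline = build_baseline(history)

# Compare new realtime observations with the baseline.
flags = {value: is_anomalous(value, baseline) for value in (41, 44, 95)}
print(flags)
```

Lowering the threshold catches subtler deviations but also flags more harmless variation, which is one way false positives arise in this kind of system.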

In the 2000s, as more and more data became available, a second phase emerged, and Security Information and Event Management (SIEM) tools were introduced. Data was kept in security data lakes, and the many data sources within these were correlated and mined for anomalies. However, in the Big Data era, it was realised that this approach was far too slow and needed another level of intelligence to be useful.

In 2005, Roger Mougalas of O’Reilly Media coined the term “Big Data” for data too unmanageable to analyse with traditional tools. A third phase, the ML technique of User and Entity Behaviour Analytics (UEBA), was born. The sea of data had become an ocean, and UEBA enabled people to “boil” the data in realtime, find anomalies, and raise the alarm. Hadoop/Spark and anomaly detection came out of this. However, even with this massive improvement in detection, the anomaly detection systems were still giving many false positives.
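One reason false positives persist is the base rate: when real attacks are rare, even an accurate detector raises mostly false alarms. The short calculation below illustrates this. Every number in it (event volume, attack count, detection rate and false positive rate) is hypothetical and chosen only to show the effect.

```python
def alert_precision(total_events, attack_events, detection_rate, false_positive_rate):
    """Return the fraction of raised alerts that correspond to real attacks."""
    benign_events = total_events - attack_events
    true_alerts = detection_rate * attack_events
    false_alerts = false_positive_rate * benign_events
    if true_alerts + false_alerts == 0:
        return 0.0  # no alerts were raised at all
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical figures: 100 attack events hidden in a million events.
precision = alert_precision(
    total_events=1_000_000,
    attack_events=100,
    detection_rate=0.99,
    false_positive_rate=0.01,
)
print(f"{precision:.2%} of alerts are real attacks")
```

Under these assumed figures, roughly 1 alert in 100 points at a real attack, even though the detector misses almost nothing.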

During this time, Endpoint Security also gained momentum; it is a deep learning technique used in realtime. Its purpose is to detect malware in the form of malicious scripts, DNS tunnels and application attacks, to name a few. It deals with any labelled threat from which information can be learned.

Figure 1 Data Science for Security Advancement (1)

The fourth phase is the current phase and uses Deception-Triggered Data Science. It combines data science and deception. The difference between this and previous methods is that it starts from a real attack and doesn’t require anomaly algorithms, so the method which has caused so many false positives is now unnecessary. The real attack has been identified from a deception event.

Information is then derived from this real attack and it is then possible to understand the behaviour of the cybercriminals. Their plans begin to unravel. This reduces the false positives and will save the cybersecurity world a lot of money. The paper by Almeshekah et al, 2014 (2), goes into more detail about this method.

Deception-Triggered Data Science is used in conjunction with other information tools which are perimeter-based, such as firewalls and anti-virus software, to create a robust barrier against attackers.

How does it work?

At each point where an attack might occur, thousands of “honeypots” of low and high interaction are mounted (see Figure 2). These are deception sensors which emulate applications, servers, networks, etc. When an attacker trips one of these sensors, the system is alerted, and information about the cybercriminal is gleaned from the attack. Data science then kicks in.
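The triggering idea can be sketched as follows. Because no legitimate user has a reason to touch a decoy, any interaction with one is treated as a real attack signal, with no anomaly scoring involved. The class names, fields and addresses below are hypothetical and do not describe any specific deception product.

```python
from dataclasses import dataclass, field


@dataclass
class DeceptionSensor:
    """A decoy that emulates a real service (hypothetical model)."""
    name: str
    emulates: str     # e.g. "database", "file server"
    interaction: str  # "low" or "high"


@dataclass
class DeceptionSystem:
    sensors: dict = field(default_factory=dict)
    alerts: list = field(default_factory=list)

    def deploy(self, sensor):
        self.sensors[sensor.name] = sensor

    def observe(self, source_ip, target, action):
        """Record an alert if the target is a decoy; return True when a sensor was tripped."""
        sensor = self.sensors.get(target)
        if sensor is None:
            return False  # a real asset: this sketch leaves it to other tools
        self.alerts.append({
            "source_ip": source_ip,
            "sensor": sensor.name,
            "emulates": sensor.emulates,
            "action": action,
        })
        return True


system = DeceptionSystem()
system.deploy(DeceptionSensor("decoy-db", emulates="database", interaction="high"))

system.observe("10.0.0.7", "real-web-server", "GET /")  # ignored here
system.observe("10.0.0.7", "decoy-db", "login attempt")  # trips the sensor
print(system.alerts)
```

Each alert carries details of the interaction, which is the information that later analysis starts from.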

The most likely path that the attacker may have taken is determined. The holes in the most likely path are then investigated. Data science can help to work out what the attacker is after and what has already been done.
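One way to picture the “most likely path” step is as a search over a graph of hosts. The sketch below is an assumption about how this could be done, not the method used by any particular tool: each hop carries a hypothetical likelihood, and Dijkstra’s algorithm over negative log-likelihoods finds the most probable route from the entry point to the tripped decoy. The host names and likelihood values are invented for the example.

```python
import heapq
import math


def most_likely_path(graph, start, goal):
    """Return (path, probability) of the most probable route from start to goal.

    graph maps node -> {neighbour: likelihood of the attacker taking that hop}.
    Maximising a product of likelihoods equals minimising the sum of -log(p),
    so ordinary Dijkstra applies.
    """
    queue = [(0.0, start, [start])]
    best = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path, math.exp(-cost)
        if node in best and best[node] <= cost:
            continue  # already reached this node by a more likely route
        best[node] = cost
        for neighbour, prob in graph.get(node, {}).items():
            if prob > 0:
                heapq.heappush(queue, (cost - math.log(prob), neighbour, path + [neighbour]))
    return None, 0.0


# Hypothetical network with assumed hop likelihoods.
graph = {
    "internet": {"web-server": 0.6, "vpn-gateway": 0.4},
    "web-server": {"app-server": 0.5, "decoy-db": 0.2},
    "vpn-gateway": {"workstation": 0.7},
    "workstation": {"decoy-db": 0.5},
    "app-server": {"decoy-db": 0.6},
}

path, probability = most_likely_path(graph, "internet", "decoy-db")
print(path, round(probability, 2))
```

The hosts on the returned path are then the natural places to look for the holes the attacker exploited.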

Figure 2 Schematic for Deception-Triggered Data Science (3)

So, there is no need for anomaly detection or to boil the ocean. This is much cleaner, faster and less prone to error.

Final thoughts

From ensuring cybersecurity to preventing data breaches, machine learning plays a vital role in today’s technologically advanced world. Businesses should therefore plan to implement a well-structured machine learning framework within their organization.
