Gravitational Waves Collide with Cybersecurity: Using Machine Learning Inspired by Astrophysics
By Ruslan Vaulin, senior data scientist at Sqrrl, member of the LIGO Scientific Collaboration
What do searching for signals from merging black holes some billion light years away and searching for cyber adversaries operating on your network have in common? More than you might have guessed…
But let’s start from the beginning. Last week (February 11, 2016), the National Science Foundation and the LIGO Scientific Collaboration announced the first confirmed detection of a gravitational-wave signal from the collision of two black holes. The collision happened more than a billion light years away, producing an outburst of gravitational-wave energy equivalent to the light of all stars in our galaxy. While very powerful, such radiation is extremely difficult to detect because gravity interacts very weakly with ordinary matter. It truly requires a Jedi’s power to sense such disturbances in the Force!
Gravitational waves were predicted by Albert Einstein in 1916. It took us 100 years and the work of a thousand scientists to develop the necessary technology and open a new window into what was, up until now, the “dark” side of the Universe. The LIGO detectors that serve as receivers of gravitational-wave signals here on Earth are giant laser interferometers (each arm is 2.5 miles long). They are marvels of technology and engineering, capable of making measurements precise to within 1/1000 of the size of a proton. But even with this technology, gravitational-wave (GW) signals from merging black holes are very rare, weak chirps lasting a fraction of a second, hidden in a huge volume of very noisy data.
To detect gravitational-wave signals robustly, LIGO’s data scientists developed a range of sophisticated algorithms designed for optimal search and extraction of such signals from the noise. The problem they faced can be compared to the classic problem of searching for a needle in a haystack.
And this is the point where cosmic and earthly interests coincide. The problem of searching for rare and weak signals hiding in noisy data arises in many data science applications. Cybersecurity is one of the most challenging of them all.
At Sqrrl we understand that uncovering hidden adversaries operating on a large network requires a truly scientific approach and rigor. Capitalizing on almost two decades of research by the LIGO Scientific Collaboration, we employ some of the formalisms and algorithms used to detect the very first gravitational-wave signal. We adopt and further innovate on these methods to tackle various challenges arising in cybersecurity threat hunting.
Extracting weak signals from the noise
One of the most basic problems arising in the analysis of cybersecurity or gravitational-wave data is the optimal processing of time series data. Specifically, both fields need algorithms that decompose a time series and detect transient signals (outliers) while accounting for the underlying background noise. Signal processing techniques used in LIGO’s analysis, such as matched filtering, whitening and seasonal decomposition, require further adaptation so that they can “learn” and adjust to varying noise and baseline characteristics.
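To make the idea of matched filtering concrete, here is a minimal sketch in Python. It is an illustration rather than LIGO’s or Sqrrl’s actual pipeline: it assumes the noise is already white (uncorrelated, with a constant level), and the function name, the toy chirp template and the injected amplitude are all made up for this example.

```python
import math
import random
import statistics


def matched_filter_snr(data, template):
    """Slide a known template over the data and return one SNR-like score per offset.

    Simplifying assumption: the noise is white with a constant level. Real
    pipelines first whiten the data using an estimated noise spectrum.
    """
    # Robust noise level: median absolute deviation, rescaled to a Gaussian sigma.
    center = statistics.median(data)
    mad = statistics.median(abs(x - center) for x in data)
    sigma = 1.4826 * mad
    # Unit-norm template, so that pure noise gives scores with unit variance.
    norm = math.sqrt(sum(t * t for t in template))
    unit = [t / norm for t in template]
    scores = []
    for start in range(len(data) - len(template) + 1):
        window = data[start:start + len(template)]
        scores.append(sum((x - center) * u for x, u in zip(window, unit)) / sigma)
    return scores


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical "chirp": a sine wave whose frequency rises with time.
    template = [math.sin(0.02 * k * k) for k in range(40)]
    data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    for k, t in enumerate(template):
        data[1200 + k] += 3.0 * t  # bury the chirp in the noise at offset 1200
    scores = matched_filter_snr(data, template)
    peak = max(range(len(scores)), key=scores.__getitem__)
    print("loudest offset:", peak, "score:", round(scores[peak], 1))
```

The score at each offset measures how well the data there resembles the template, in units of the noise level. When the noise level drifts over time, the single global estimate used here would have to be replaced by a running one, which is the kind of adaptation described above.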
For example, to find events of data staging and exfiltration, one has to search for transient excesses in a time series of transferred bytes, scanning all available channels of communication. Each channel (e.g. a network connection between specific computer hosts) has its own baseline and noise characteristics representing normal user activity. Given the sheer number of channels that can be used for data staging and exfiltration, it is essential to use algorithms optimized for the detection of transient outliers, much as was done for the signal from colliding black holes discovered by LIGO.
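A per-channel search of this kind might look like the sketch below. It is a toy illustration under stated assumptions, not Sqrrl’s implementation: the record layout, the function name, the threshold and the minimum history length are all hypothetical, and the baseline is a simple median with a robust spread estimate instead of a full seasonal model.

```python
import statistics
from collections import defaultdict


def find_transient_excesses(records, threshold=6.0, min_history=10):
    """Flag transfers that stand far above their own channel's baseline.

    records: time-ordered iterable of (channel, timestamp, bytes_transferred).
    Returns a list of (channel, timestamp, bytes_transferred, score).
    """
    history = defaultdict(list)  # bytes seen so far, per channel
    flagged = []
    for channel, timestamp, n_bytes in records:
        past = history[channel]
        if len(past) >= min_history:
            # Baseline and noise level are learned separately for each channel.
            baseline = statistics.median(past)
            mad = statistics.median(abs(x - baseline) for x in past)
            scale = 1.4826 * mad or 1.0  # fall back to 1.0 on perfectly flat channels
            score = (n_bytes - baseline) / scale
            if score > threshold:
                flagged.append((channel, timestamp, n_bytes, score))
        past.append(n_bytes)
    return flagged
```

Because every channel is scored against its own history, a 5 KB transfer can be an outlier on a normally quiet connection while a 1 MB transfer on a busy one is not. The median-based statistics keep a single large transfer from distorting the baseline it is compared against.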
Fighting false positives
False positives are inevitable. Even the most optimal signal processing algorithm will produce them. The key to mitigating them is to a) use additional information and context to perform refined classification of detected outliers, and b) measure the rate of false positives in real data and use algorithms that account for it and adapt to its changes.
It was realized early on that the outliers found in LIGO data required further classification. A number of novel multivariate analysis and machine learning algorithms have been developed for that purpose. In addition, a large effort was dedicated to a very robust and precise characterization of the “background” (false positives). It is because of this work that we were able to uncover the GW signal and establish with a very high degree of confidence (a chance of less than 1 in ten million that it is a false positive) that it is a genuine signal.
When dealing with cybersecurity data, we face even harsher conditions in terms of the rate of false positives and ever-changing user behaviors. Using contextual, enriched information that allows one to separate the activities of normal users from those of intruders is key to developing robust cybersecurity analytics.
Sqrrl’s unique ability to collect such information and represent it as a graph facilitates the application of multivariate statistical analysis and machine learning, which we use to classify outliers (potentially malicious activities) as benign or malicious. For this purpose we use a combination of Bayesian multivariate statistics, machine learning and graph algorithms.
Borrowing from the experience and methods of LIGO searches, we also measure our “background”, estimate rates of false positives, and assign statistical confidence to detected events. This makes our predictions robust and our algorithms adaptable to changing data.
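One simple way to turn a measured background into a statistical confidence is an empirical false-alarm probability: the fraction of known-benign events that score at least as high as the candidate. The sketch below illustrates that idea only. The function name is hypothetical, and it assumes a sample of scores from activity known (or assumed) to be benign is available.

```python
from bisect import bisect_left


def false_alarm_probability(background_scores, event_score):
    """Estimate the chance that background alone produces a score >= event_score.

    background_scores: scores measured on data assumed to contain no real signal.
    """
    ranked = sorted(background_scores)
    # Number of background events at least as "loud" as the candidate.
    n_louder = len(ranked) - bisect_left(ranked, event_score)
    # Add-one correction: a finite sample can never justify a probability of zero.
    return (n_louder + 1) / (len(ranked) + 1)
```

With N background samples, the smallest probability this estimate can report is 1/(N+1), so the confidence one can claim is limited by how much background has been measured. Re-measuring the background regularly lets the same threshold keep its meaning as user behavior changes.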
Network of detectors
Perhaps the greatest challenge in hunting for adversaries on the network is that their tactics and methods are not known ahead of time. Moreover, they can actively change them in order to avoid being detected. The latter makes cybersecurity analysis in some ways even harder than the search for GW signals. After all, the Universe is not malicious and is not actively trying to avoid being probed by us.
In order to improve our chances of finding adversaries, we build a network of detectors. Each detector searches for various signs of malicious activity. This increases our chances of detecting adversaries, but it also increases the rate of false positives. If not handled correctly, the advantage can become a curse.
The problem is that something is always happening somewhere on the network. Having many detectors means that cybersecurity analysts might be swamped with false positives.
Although not to the same extent, LIGO also faces the problem of handling multiple detectors operating as a single network. The approach that proved successful in LIGO was to develop methods that combine the outputs from each detector, accounting for their underlying noise (e.g. rate of false positives) and for the coherence of the signal’s characteristics, which must be similar in each detector.
Sqrrl’s hunting platform takes this approach to the next level by connecting detectors via a contextual graph and combining their predictions using Bayesian statistics and graph algorithms. This approach allows us to “add up” sensitivities of different detectors without losing control over false positives. The result is a far superior detection system that increases efficiency and decreases the rate of false positives. Additionally, it offers security analysts a far more complete view of the attack, facilitating more efficient incident response.
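As a rough illustration of how Bayesian statistics can combine detector outputs, the sketch below updates the odds that an entity is malicious using each detector’s true-positive and false-positive rates. It is a simplification rather than a description of Sqrrl’s platform: it ignores the contextual graph entirely, it assumes the detectors are independent of one another, and the detector names and rates in the usage note are invented for the example.

```python
import math


def posterior_malicious(prior, detectors, fired):
    """Combine detector outputs into a posterior probability of malicious activity.

    prior:     probability that the entity is malicious before any evidence.
    detectors: {name: (true_positive_rate, false_positive_rate)}, with both
               rates measured (or assumed) per detector and strictly in (0, 1).
    fired:     set of detector names that raised an alert for this entity.
    Assumes detectors are conditionally independent (a naive Bayes model).
    """
    log_odds = math.log(prior / (1.0 - prior))
    for name, (tpr, fpr) in detectors.items():
        if name in fired:
            # An alert counts for more when the detector rarely fires on benign activity.
            log_odds += math.log(tpr / fpr)
        else:
            # Silence from a sensitive detector is evidence of benign activity.
            log_odds += math.log((1.0 - tpr) / (1.0 - fpr))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

For example, with hypothetical detectors `{"beaconing": (0.6, 0.01), "exfiltration": (0.5, 0.02), "lateral_movement": (0.4, 0.05)}` and a prior of 0.001, a host that trips two detectors ends up with a much higher posterior than a host that trips one. Each detector’s false-positive rate enters the result directly, so adding a noisy detector does not automatically flood analysts with alerts.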
LIGO’s discovery of gravitational waves is a truly remarkable achievement for humanity. It is equally amazing that it led to new developments in data science that have such unexpected connections to the field of cybersecurity.
In some ways, we at Sqrrl feel very much like LIGO data analysts in the very early years of that project. And now LIGO’s breakthrough gives us a new hope in the fight against the dark side. The gravity force is with us!
- LIGO Discovery Paper: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.116.061102
- Overview of LIGO analysis methods: http://arxiv.org/abs/1208.3491
- Bayesian multivariate analysis for LIGO: http://arxiv.org/abs/1201.2959
- Handling multiple detectors/analysis pipelines: http://arxiv.org/abs/1201.2964
If you’re interested in learning more about how our data science and analytics work, check out our white paper on linked data.