This past week I found a job listing for “Expert Threat Data Scientist” at a prestigious company based out of Oregon. The job title immediately intrigued me because I had never seen those two specialties put together.
A data scientist working in cyber security?
So I immediately did some research — partly to be able to apply to the job (the Pacific Northwest is my favorite region in the country), and partly to see where this rabbit hole took me. Applying data science principles to protecting the data itself seemed like such an interesting topic, so I explored deeper. We should set the scene so we can see how these two worlds collide.
Data is being created at an unprecedented rate. By one widely cited estimate, the last two years account for 90% of all data ever created, which suggests that data creation is growing at an exponential rate. This includes Instagram photos, TikTok videos, browsing histories, shopping carts, and more. Whatever form it takes, the companies that hold and deal in this data need to make sure it is secure, or they will soon find themselves on the wrong end of a horde of angry users.
But before I dive into how data science can now be applied to cybersecurity, let’s take a look at what each of those two disciplines does on its own.
Data science is an overwhelmingly broad field. It touches so many bases that it is hard to describe exactly what data scientists do in a reasonable amount of time. So instead I am going to focus on what they produce when their work is all said and done.
Data science provides products like predictions, forecasts, anomaly detection, classification, statistical analysis, and pattern finding. Most of these rely on machine learning. Machine learning is a subfield of artificial intelligence that allows a machine to learn from past experience and use that knowledge to make choices about things it hasn't seen before.
The image above is the go-to image when explaining the data science process. It is the combination of using the data to inform the decisions and evaluation based on the business needs that makes machine learning such a reliable tool. And the best part is it happens over and over to the point where it gets continually better with each new iteration.
With machine learning techniques like these, a data scientist can automate many tasks that are strenuous or time-consuming for a person to do by hand. That is an incredibly powerful capability in the ever-changing domain of cybersecurity.
Cybersecurity is the field of protecting data and systems. This is a very important task, and it only gets more difficult because of the limitations of traditional cybersecurity techniques.
In cybersecurity, the job is to defend against attacks from hackers and protect systems. But a strategy of responding to attacks after they happen means that hackers always have the upper hand. Reactive measures leave cybersecurity slower than the threats it faces.
A web application firewall (WAF) inspects the traffic sent to a web application, detects malicious code, and determines the best course of action. Two examples are rule-based and signature-based WAFs.
As you can see from the picture above, the two systems are rigid and must be pre-programmed in order to identify new attacks. Signature-based detection looks for indicators that may foreshadow an attack. These signatures need to be collected ahead of time, so the technique is severely hampered by attacks that have not yet been seen. A signature-based detector has to cycle through each preloaded signature looking for a match. This can cause slow response times and may result in false positives.
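To make the signature-based idea concrete, here is a minimal Python sketch. It is not how any particular WAF is implemented: the three signatures and the `match_signatures` helper are hypothetical names invented for illustration, and a real signature set would be far larger. It only shows the core loop of checking a request against each preloaded pattern.

```python
import re
from typing import List

# Hypothetical signatures, for illustration only. Each one is a pattern
# collected ahead of time from an attack that has already been seen.
SIGNATURES = {
    "sql_injection_union": re.compile(r"union\s+select", re.IGNORECASE),
    "xss_script_tag": re.compile(r"<script\b", re.IGNORECASE),
    "path_traversal": re.compile(r"\.\./"),
}


def match_signatures(payload: str) -> List[str]:
    """Return the name of every known signature found in the payload."""
    # The detector cycles through each preloaded signature in turn.
    return [name for name, pattern in SIGNATURES.items() if pattern.search(payload)]


if __name__ == "__main__":
    print(match_signatures("id=1 UNION SELECT password FROM users"))
    print(match_signatures("q=hello world"))
    # A payload that matches no stored pattern passes through unnoticed.
    print(match_signatures("id=1 OR 1=1"))
```

The last call shows the weakness described above: a payload that matches none of the stored patterns comes back clean, whether or not it is actually harmless.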
Rule-based detection applies a different strategy. Instead of checking signatures one by one, it first looks at the effects of the hack. When we talk about “rules”, we are just talking about suspicious behavior that a hack may attempt and clean code would not. This strategy works faster because it doesn’t need to look through each signature and instead eliminates options based on the code’s effect.
But this still requires examples of malicious code to look at. And that reactive posture shapes the field's overarching strategy. Everywhere you go online, you are going to run into the acronym FUD, which stands for ‘fear, uncertainty, and doubt.’ From my research, many in the cybersecurity field are tired of this principle being their guiding light. It boils down to security teams working in the dark and swinging blindly at their opponents.
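A rule-based check can be sketched in the same spirit. Everything here is an assumption made for illustration: the `Request` fields, the three rules, and their thresholds are hypothetical and not taken from any real product. The point is only that each rule describes suspicious behavior rather than a specific piece of code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Request:
    """Hypothetical summary of what a request is trying to do."""
    path: str
    method: str
    requests_last_minute: int
    body_bytes: int


# Each rule pairs a name with a test for behavior that clean traffic
# would rarely show. The thresholds are made up for this sketch.
RULES: List[Tuple[str, Callable[[Request], bool]]] = [
    ("too_many_requests", lambda r: r.requests_last_minute > 100),
    ("write_to_admin_path", lambda r: r.path.startswith("/admin")
        and r.method in {"POST", "PUT", "DELETE"}),
    ("oversized_body", lambda r: r.body_bytes > 1_000_000),
]


def violated_rules(request: Request) -> List[str]:
    """Return the name of every rule the request breaks."""
    return [name for name, check in RULES if check(request)]


if __name__ == "__main__":
    normal = Request("/products", "GET", 12, 0)
    suspicious = Request("/admin/users", "DELETE", 450, 200)
    print(violated_rules(normal))
    print(violated_rules(suspicious))
```

Someone still has to write these rules by hand, which is why this approach remains tied to attacks that are already understood.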
This is where data science comes in.
Data Science meets Cybersecurity
When these two disciplines collide, cybersecurity gains an invaluable weapon against intrusions.
Data science becomes the eyes to cybersecurity’s sword.
Cybersecurity data science (CSDS) offers a scientific approach to identifying hostile attacks on digital infrastructures. It takes a data-focused approach that applies machine learning techniques in order to identify threats.
Anomaly detection is a major feature that machine learning brings to cybersecurity. Attacks are often committed by code that is different from the norm or code that does tasks that are considered anomalous. Creating a machine learning model to detect an anomaly is a great way to use data science techniques to help cybersecurity interests.
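As a rough sketch of what anomaly detection means in practice, the Python below learns a baseline from traffic assumed to be normal and flags values that sit far from it. It uses a plain z-score, which is a deliberately simple stand-in chosen for this example; production systems generally use richer models and many more features. The function names, the sample numbers, and the threshold of 3 standard deviations are all assumptions.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def fit_baseline(normal_values: Sequence[float]) -> Tuple[float, float]:
    """Learn the mean and sample standard deviation of normal traffic."""
    return mean(normal_values), stdev(normal_values)


def is_anomalous(value: float, baseline: Tuple[float, float],
                 threshold: float = 3.0) -> bool:
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # No variation was seen during training, so any change is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical requests-per-minute counts from a period assumed to be clean.
    normal_traffic = [48, 52, 50, 47, 53, 49, 51, 50]
    baseline = fit_baseline(normal_traffic)
    for observed in (51, 57, 400):
        print(observed, is_anomalous(observed, baseline))
```

Unlike the signature and rule approaches, nothing here describes what an attack looks like. The model only learns what normal looks like, so it can flag behavior that nobody has catalogued yet.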
Another way to use machine learning is for penetration testing. Automation, combined with the way machine learning adapts from past experience, makes it a prime tester for the firewalls protecting data and data structures.
As an end product, data scientists give cybersecurity workers information that better equips them to counter attacks.
I’ve fallen down this rabbit hole and am trying to follow it to the end. This represents an exciting application of data science and the implications of it are astounding. I know that I am far from being an expert but give it time.
Unfortunately, by the time I finished this article, the posting had closed and the company wasn’t accepting any more applications for “Expert Threat Data Scientist.” But I am on this path now and I am more than curious as to where I will end up. Who knows? Maybe this becomes a major focus of mine. All I know is that using machine learning to tackle important real-world issues will only help me in becoming a great data scientist.
If you would like to talk more about CSDS, connect with me on LinkedIn.
You can check out my projects on GitHub where there will shortly be some repositories about cybersecurity.
I am also on Twitter where I share my projects, data puns, and thoughts on cool uses for data in contemporary ways.