Security Alert Analysis Using GPUs

Rachel Allen
RAPIDS AI
Jan 27, 2020

By: Rachel Allen and Bianca Rhodes US

Security teams frequently receive more alerts than they can process, and it is common to disable high-volume logs if they don’t show immediate value. The problem is that these high-volume logs are rarely re-evaluated later, so valuable information can be dropped or become difficult to access. This blog shows how over 300,000 raw alerts can be processed in under 3 seconds using a single Tesla V100, including data movement to and from the SIEM as well as multiple analytics. Enabling security teams to ingest and analyze more log types and larger log volumes increases visibility across the network and enables real-time, complex meta-alerting.

Why Alerts?

Traditional Security Operations (SecOps) teams rely on pre-made alerts to inform them when something abnormal happens on their networks. Nearly every appliance today can be configured to alert on various inputs, and these alerts are funneled to a Security Information and Event Management (SIEM) system for use in triage, correlation, and forensic investigations. But alerts aren’t perfect. Far from it: some alerts are so frequent and annoying that security professionals ignore them. According to a McAfee report, 31.9% of security professionals say they ignore alerts due to the large volume of false positives [1]. The same study showed that 25.9% of security professionals receive more alerts than they can investigate. Cybersecurity leaders deal with over 10 alerts per day with a 50% false positive rate, with each alert taking more than 10 minutes to fully investigate [2]. And there is still a shortage of qualified security professionals: today, only 40% of security professionals actually analyze and remediate security threats (i.e., triage and respond to alerts), compared to 70% a year ago [3]. Rather than focus solely on generating new alerts for security analysts, providing additional context for existing alerts and helping to reduce the false positive rate is a key way that SecOps teams can and should make better use of the alerts they’re already receiving.

We Can’t Scale with People

Throwing more people at the problem just doesn’t work. A 2019 survey of IT professionals from eight countries revealed that 82% of employers report a shortage of cybersecurity skills, with the most acute needs for highly skilled technical staff [4]. Continuing to rely on a broken model of addressing cybersecurity threats will continue to fail, especially when over a million new malware samples surface every day [5]. It’s natural to want to rely on prebuilt alerts to reduce this workload, but with more than 1,200 cybersecurity vendors in the market [6], the potential for an onslaught of even more alerts is very real. A previous blog post introduced CLX, a collection of RAPIDS applications that specifically target these kinds of pressing issues in the cybersecurity and information security communities. Real-time analysis has historically been difficult at large ingest scale: typical workflows require time-consuming ingest and parsing before analytics can be applied to the data. In addition, the volume of data coming into a SOC is so large that teams often disable log types and dial down sensing to make log ingest more manageable. By utilizing GPU acceleration [7], we can provide real-time analysis of these ever-growing alerts while also widening the ingestion aperture to capture even more data.

Providing Additional Context to Alerts

The number of alerts generated from a single rule or appliance can vary over time, so giving analysts insight into the frequency of the different types of alerts they receive is important. This CLX use case demonstrates multiple ways to use alert data already being collected to give more context to the security professionals tasked with analyzing it.

  1. Visualize: Analysts can use CLX to visualize trends and see correlations between different alert types.
  2. Frequency of Anomalies: In addition to visual inspection, analysts want to know if the frequency of alerting is out of the ordinary. By using a rolling Z-score metric with a threshold, we can define “out-of-the-ordinary” and direct analysts’ attention to these alerts at specific time points.
  3. Co-occurrence Calculations: Co-occurrence calculations can be used to find entities, like IP or MAC addresses, that regularly generate alerts at around the same time.

Visualizing Alert Trends and Anomalies

The number of times a network triggers a certain alert, like geographically improbable access, can vary greatly depending on the time of day and day of the week. The alerts are aggregated by type and day, normalized by the total number of alerts of each type, and plotted as a heatmap so analysts can quickly see trends in their alerts and compare alert types with each other.

Figure 1: Visualization of alerts over time
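The aggregation behind a heatmap like this can be sketched with the pandas API, which cuDF mirrors on the GPU. The alert types and timestamps below are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical alert log: one row per alert, with a type and a timestamp.
alerts = pd.DataFrame({
    "alert_type": ["geo_improbable", "geo_improbable",
                   "brute_force", "brute_force"],
    "time": pd.to_datetime([
        "2020-01-01 09:00", "2020-01-02 10:00",
        "2020-01-01 11:00", "2020-01-01 12:00",
    ]),
})

# Count alerts per type per day.
counts = (alerts
          .groupby(["alert_type", alerts["time"].dt.date])
          .size()
          .unstack(fill_value=0))

# Normalize each row by that type's total, so alert types with very
# different volumes can be compared on the same color scale.
normalized = counts.div(counts.sum(axis=1), axis=0)
```

The `normalized` frame can then be handed straight to a heatmap plotting function such as `seaborn.heatmap`.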

Frequency of Anomalies

While visual exploration can be informative, we can also provide a measure of how anomalous the alerting frequency is. A rolling z-score with a seven-day window accounts for the periodicity of the data (fewer alerts on weekends) and can draw an analyst’s attention to potentially overlooked alerts with large pattern changes.

Figure 2: Using a rolling z-score to automatically flag anomalies
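A minimal sketch of the rolling z-score, again using the pandas API that cuDF mirrors; the counts, seven-day window, and 2.0 threshold here are illustrative (the workflow’s `threshold` parameter plays the same role):

```python
import pandas as pd

def rolling_zscore(daily_counts: pd.Series, window: int = 7) -> pd.Series:
    """Z-score of each day's alert count against the prior `window` days."""
    # shift(1) excludes the current day from its own baseline.
    mean = daily_counts.shift(1).rolling(window).mean()
    std = daily_counts.shift(1).rolling(window).std()
    return (daily_counts - mean) / std

# Nine quiet days followed by a spike on the tenth.
counts = pd.Series([10, 12, 11, 10, 13, 11, 12, 10, 11, 60])
z = rolling_zscore(counts)

# Flag days whose z-score exceeds the chosen threshold.
anomalies = z[z > 2.0]
```

Only the final day exceeds the threshold, so it is the one surfaced to the analyst.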

High Dimensional Co-Occurrences of Network Entities

After an analyst investigates and identifies a malicious IP or MAC address on their network, they might wonder if similar entities are present in other unanalyzed alerts. We wanted to provide more context about the IP or MAC address in an alert by clustering entities together based on their alert activity. We filtered our data to only include alerts from IPs and MAC addresses that regularly generate alerts (more than one). We then grouped alerts into 30-minute intervals and embedded co-occurring IP and MAC addresses into a 52-dimensional vector space using a shallow two-layer neural network, similar to the process of generating a custom word2vec embedding. Analysts can query the model to return the top n closest, or most similar, entities to an entity of interest. Using this “guilty by association” method, analysts can direct their efforts toward investigating alerts, IPs, and MAC addresses similar to events they know are malicious.
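The windowing step that produces training data for the embedding can be sketched as follows. The entities and timestamps are hypothetical, and the resulting per-window “sentences” would then be fed to a word2vec-style model (with a 52-dimensional output, as above) rather than trained here:

```python
import pandas as pd

# Hypothetical alert stream: an entity (IP or MAC) and an alert timestamp.
alerts = pd.DataFrame({
    "entity": ["10.0.0.1", "10.0.0.2", "aa:bb:cc:00:11:22",
               "10.0.0.1", "10.0.0.2"],
    "time": pd.to_datetime([
        "2020-01-01 00:05", "2020-01-01 00:20", "2020-01-01 00:25",
        "2020-01-01 01:10", "2020-01-01 01:15",
    ]),
})

# Keep only entities that alert more than once.
repeat = alerts["entity"].value_counts()
alerts = alerts[alerts["entity"].isin(repeat[repeat > 1].index)]

# Bucket alerts into 30-minute windows; each window's entity list is one
# "sentence" of co-occurring entities for the embedding model.
windows = alerts.set_index("time").resample("30min")["entity"]
sentences = [list(group) for _, group in windows if len(group)]
```

Entities that repeatedly land in the same windows end up close together in the embedding space, which is what makes the “guilty by association” queries possible.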

Scaling with RAPIDS and CLX

In order to scale the alert analysis workstream, the common, reusable components were identified and optimized with RAPIDS. This was important not only to scale this particular alert analysis use case but also to let CLX tackle other challenging and complex use cases. With that, CLX has evolved to provide a level of portability and modularization that allows users to easily implement log parsing and analytic workflow functionality within their own SIEM environment.

Notable Alert Parsing

The common components of the alert analysis workstream are the alert log parsing and the analytics workstream. If you have ever been consumed by the intricacies of regex parsing for specific log events, CLX has a useful module that parses common log formats using RAPIDS. The regex patterns for alert parsing are stored in a YAML file, and the Splunk notable parser reads that file and executes the parsing. See below for a quick demonstration of how to use the Splunk notable parser.

import cudf
from clx.parsers.splunk_notable_parser import SplunkNotableParser

snp = SplunkNotableParser()
test_input_df = cudf.DataFrame()
raw_colname = "_raw"
TEST_DATA = '1566345812.924, search_name="Test Search Name", orig_time="1566345812.924", info_max_time="1566346500.000000000", info_min_time="1566345300.000000000", info_search_time="1566305689.361160000", message.description="Test Message Description", message.hostname="msg.test.hostname", message.ip="100.100.100.123", message.user_name="user@test.com", severity="info", urgency="medium"'
test_input_df[raw_colname] = [TEST_DATA]
test_output_df = snp.parse(test_input_df, raw_colname)

Output:

time: 1566345812.924
search_name: Test Search Name
orig_time: 1566345812.924
urgency: medium
user:
owner:
security_domain:
severity: info
src_ip:
src_ip2:
src_mac:
src_port:
dest_ip:
dest_ip2:
dest_mac:
dest_port:
dest_priority:
device_name:
event_name:
event_type:
id:
ip_address:
message_ip: 100.100.100.123
message_hostname: msg.test.hostname
message_username: user@test.com
message_description: Test Message Description

Notable Alert Workflow

Another focus is the ability to bundle the alert analysis workflow, including both parsing and analytics, into something compact, deployable, and shareable. To assist with that, we use a CLX Workflow, which handles the I/O as well as the data parsing and analytics. Each workflow requires an input and an output location, a workflow name, and any parameters specific to the workflow. See below for an example of how to instantiate a new Splunk alert workflow. For a new developer, this is a great way to get started, since the Splunk Alert Analysis workflow discussed in this blog post is already fully developed and packaged within CLX.

from clx.workflow.splunk_alert_workflow import SplunkAlertWorkflow

source = {
    "type": "fs",
    "input_format": "csv",
    "input_path": "/path/to/input.csv",
    "schema": ["raw"],
    "delimiter": ",",
    "usecols": ["raw"],
    "dtype": ["str"],
    "header": 0,
}
destination = {
    "type": "fs",
    "output_format": "csv",
    "output_path": "/path/to/output.csv",
    "index": False,
}
workflow = SplunkAlertWorkflow(
    name="splunk_workflow", source=source, destination=destination,
    threshold=2.0, raw_data_col_name="raw")
workflow.run_workflow()

Integrating the CLX Workflow into an environment in a way that scales is important, which is why both Kafka and Dask were introduced as part of the CLX Workflow I/O. The Workflow I/O functionality provides the flexibility to interact with data from Kafka, Dask, or a file. I/O parameters are specified in a Python dictionary, which is then read and processed by the workflow. Below is an example of how to use the Splunk alert workflow to read data from and publish data to Kafka.

from clx.workflow.splunk_alert_workflow import SplunkAlertWorkflow

source = {
    "type": "kafka",
    "kafka_brokers": "kafka:9092",
    "group_id": "mygroupid",
    "batch_size": 24,
    "consumer_kafka_topics": ["mytopic_input"],
    "time_window": 5,
}
dest = {
    "type": "kafka",
    "kafka_brokers": "kafka:9092",
    "batch_size": 24,
    "publisher_kafka_topic": "mytopic_output",
    "output_delimiter": ",",
}
workflow = SplunkAlertWorkflow(
    name="splunk_workflow", source=source, destination=dest,
    threshold=2.0, raw_data_col_name="Raw")
workflow.run_workflow()

The workflow configurations can be declared within a yaml file instead of in-line using Python, making it easy for operations teams that deploy the same workflow to different environments.

# /etc/clx/splunk_workflow/workflow.yaml
source:
  type: kafka
  kafka_brokers: kafka:9092
  group_id: mygroupid
  batch_size: 24
  consumer_kafka_topics:
    - mytopic_input
  time_window: 5
destination:
  type: kafka
  kafka_brokers: kafka:9092
  batch_size: 24
  publisher_kafka_topic: mytopic_output
  output_delimiter: ","
name: splunk_workflow
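As a sanity check, a YAML configuration like this parses (here via PyYAML’s `safe_load`) into the same `source` and `destination` dictionaries that were passed to `SplunkAlertWorkflow` in-line:

```python
import yaml  # PyYAML

# The same workflow configuration as the YAML file above, inlined
# here as a string for illustration.
config_text = """
source:
  type: kafka
  kafka_brokers: kafka:9092
  group_id: mygroupid
  batch_size: 24
  consumer_kafka_topics:
    - mytopic_input
  time_window: 5
destination:
  type: kafka
  kafka_brokers: kafka:9092
  batch_size: 24
  publisher_kafka_topic: mytopic_output
  output_delimiter: ","
name: splunk_workflow
"""

config = yaml.safe_load(config_text)
source, dest = config["source"], config["destination"]
# These dictionaries can now be passed to SplunkAlertWorkflow exactly
# as in the in-line Python example.
```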

Full documentation on Workflow I/O is available.

Integrating with the SIEM

Because of the flexibility of the Workflow I/O functionality, CLX can be extended to directly integrate with the SIEM, allowing for ease of use and visibility by a security analyst. Currently, this support includes splunk2kafka, enabling data integration between Splunk and CLX.

This blog post details how to integrate with Splunk. There are two options for extracting raw alert data from Splunk: using the Splunk REST API, or using the Splunk2Kafka app.
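The REST option can be sketched as building an export search against Splunk’s documented `/services/search/jobs/export` endpoint. The host, port, and query below are hypothetical placeholders, and only the request construction is shown; the actual HTTP call and credentials are left to the deployment:

```python
# Sketch of the Splunk REST option: POST a search to the
# /services/search/jobs/export endpoint and stream results back.
# Host, query, and time range are hypothetical placeholders.

def build_export_request(host: str, query: str, earliest: str, latest: str):
    """Build the URL and form data for a Splunk export search."""
    url = f"https://{host}:8089/services/search/jobs/export"
    data = {
        "search": f"search {query}",
        "earliest_time": earliest,
        "latest_time": latest,
        "output_mode": "csv",  # CSV output feeds straight into cudf.read_csv
    }
    return url, data

url, data = build_export_request(
    "splunk.example.com", "index=notable", "-24h", "now")

# The actual call would then be something like:
#   requests.post(url, data=data, auth=(user, password), stream=True)
```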

The Splunk2Kafka app is useful in that it can plug directly into a CLX Workflow using Kafka and Kafka topics. Full details on how to get started with the Splunk2Kafka app can be found in the CLX SIEM Integrations README.

Alert Analysis Benchmarks

Because CLX is built using RAPIDS, it can analyze a large number of alerts very quickly. Using a single Tesla V100, over 300,000 alerts can be processed in under 3 seconds, including moving data to and from Splunk.

Figure 3: Alert analysis benchmarks using a single GPU

What’s Next

There are other ways to investigate alerts in connection with each other. For example, constructing a heterogeneous alert graph and updating it in real time could be beneficial. Embedding alerts as a property graph affords all the benefits of graph interpretation, making new logical connections, and pivoting, and graph operations with cuGraph are fast. But alerts are only one type of data. This workflow is our first step into a larger class of security problems that involve high-dimensional co-occurrence of events. This is a classic problem with the potential to explode in both computation and storage. To overcome this, embedding the data into a high-dimensional vector space may be useful. Our word2vec embeddings were a proof of concept, and we plan to expand on this (including visualization using UMAP) and apply additional techniques. The structure of the CLX Workflow allows us to easily adapt and scale other use cases and integrate with the SIEM. Stay tuned for more examples of how to build these other cyber use cases with CLX, or read our documentation and notebooks to get started on your own.

References

[1] https://www.skyhighnetworks.com/cloud-security-blog/alert-fatigue-31-9-of-it-security-professionals-ignore-alerts/

[2] https://healthitsecurity.com/news/cybersecurity-leaders-increasingly-face-alert-fatigue-tune-out-apps

[3] https://enterprisetalk.com/featured/cybersecurity-professionals-face-alert-fatigue/

[4] https://www.csis.org/analysis/cybersecurity-workforce-gap

[5] https://securityboulevard.com/2019/10/why-the-cybersecurity-skills-shortage-is-a-real-nightmare/

[6] https://www.mcafee.com/blogs/enterprise/with-more-than-1200-cybersecurity-vendors-in-the-industry-how-do-you-stand-out/

[7] https://docs.rapids.ai/overview/latest.pdf
