Cyber Log Accelerators (CLX)

Bartley Richardson
RAPIDS AI
Nov 7, 2019

If you’re a data scientist, engineer, or frequently work with large amounts of data, you’re likely familiar with the typical data science workflow: the process of iterating through ETL, data exploration, feature engineering, modeling, testing/validating, and visualization. As the amount of data organizations generate and store increases, this workflow benefits from acceleration. The community has seen numerous ways to accelerate this workflow, including moving end-to-end use cases and pipelines on NVIDIA GPU compute platforms with RAPIDS.

Taking a hard look at workflows for senior cybersecurity analysts, forensic investigators, and threat hunters reveals many parallels with the data science workflow. Security analysts frequently deal with massive amounts of data, generated by sensors across their environment, alerting mechanisms, and open source intelligence feeds (OSINT). They have a large number of tools used during an investigation, and they often want to integrate them for a larger picture of the security incident. They rely on iterations to perform analysis, often pivoting between data feeds to enrich their models and understanding. Rather than relying solely on heuristics for alerting, security operations (SecOps) teams create behavioral models that provide more flexible alerting. When it’s time to brief a senior decision maker on an incident, they rely on facts and visualizations to help tell the story.

The cybersecurity workflow really is the data science workflow. However, there is a shortage of cybersecurity talent nearing 3 million people (1). Asking SecOps to acquire intermediate or expert data science skills, while certainly beneficial, is frequently not a priority. Another issue facing many teams is the inability to use their SIEM for everything they want. Ingesting new alerts and research-based feeds isn’t typically done, yet it is valuable to evaluate that data not in a silo but in the same context as other datasets (i.e., in the SIEM). Putting all of these pieces together is complex, and doing so while maintaining incredibly fast processing and response is even more challenging.

To help address these issues, the CLX (“clicks”) repository of examples is now available. Demonstrating RAPIDS for cyber analysis, CLX aims to do four things:

  1. Provide SIEM integration with GPU compute environments via RAPIDS and effectively extend the SIEM environment,
  2. Deliver pre-built example use cases, ready to use in a Security Operations Center (SOC), that demonstrate CLX and RAPIDS functionality,
  3. Teach cyber data scientists and SecOps teams how to generate workflows, using cyber-specific GPU-accelerated primitives and methods, that let them interact with code using security language, and
  4. Provide a foundation for accelerated log parsing in a flexible, non-regex method using techniques like cyBERT.

Extending the SIEM Environment

Extending the SIEM environment with CLX and RAPIDS

Modern SIEMs are critical to most SOC operations and provide a great deal of functionality. However, more and more SecOps teams want to take advantage of tools, techniques, and compute that are not readily accessible inside their SIEM. Extending the SIEM environment with CLX enables SOCs to continue using their SIEM as they normally would while also passing data and results to and from a RAPIDS environment. This has numerous benefits. For example, consider the case where a cyber data scientist wants to create a new model or pipeline. Previously, they would:

  • Copy the data from their production SIEM environment to a separate development cluster, typically using one-off batch transfers,
  • Perform their typical data science workflow of ETL, feature engineering, modeling, evaluation, and visualization,
  • Manually iterate with security staff (typically in meetings and email) on results and changes to the data science process, and
  • When both parties are satisfied, work with SecOps and IT to integrate model inference into a production environment, which is typically built using a different software stack.

CLX shows how this process can be formalized and repeatable. Data in the SIEM can be fused with data residing on disk or enriched with OSINT feeds, and the data scientist has the freedom to integrate existing software they’re familiar with (e.g., PyTorch, Numpy, Chainer) into their cyber workflow. There is currently interoperability with Splunk, and new software interoperability is coming in the future.
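To make the data-fusion idea concrete, here is a minimal sketch (with hypothetical data and column names, not taken from CLX itself) of enriching SIEM alert exports with an OSINT feed. pandas is used because cuDF mirrors its API; on a GPU, swapping the import for cuDF runs the same logic accelerated.

```python
# Illustrative sketch (hypothetical data and column names): fusing SIEM alert
# exports with an OSINT reputation feed. pandas is used here because cuDF
# mirrors its API; on a GPU, `import cudf as pd` accelerates the same code.
import pandas as pd

# Alerts exported from the SIEM (e.g., via a batch pull or Kafka topic)
alerts = pd.DataFrame({
    "src_ip": ["10.0.0.5", "203.0.113.7", "198.51.100.2"],
    "alert": ["port_scan", "beaconing", "dns_tunnel"],
})

# An OSINT reputation feed residing on disk
osint = pd.DataFrame({
    "src_ip": ["203.0.113.7", "198.51.100.2"],
    "reputation": ["malicious", "suspicious"],
})

# Enrich alerts with OSINT context; unmatched rows keep a NaN reputation
enriched = alerts.merge(osint, on="src_ip", how="left")
print(enriched)
```

A left merge keeps every alert, so hits with no OSINT match remain visible to the analyst rather than being silently dropped.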

Pre-Built Example Workflows

There’s a lot of functionality built into CLX and RAPIDS, so part of CLX is providing pre-built, end-to-end example workflows for specific cybersecurity applications. Essentially, these are recipes that expose CLX functionality. They won’t cover every scenario or every use case, but they should provide an overall picture of how to structure workflows and architect cybersecurity applications. Developed in close collaboration with several stakeholders, the first set of use cases in the CLX repository includes Jupyter notebooks for:

  • Dynamic network mapping,
  • Security alert analysis,
  • DGA detection, and
  • Cybersecurity log parsing (including cyBERT)

For example, consider dynamic network mapping. Although the overall goal is to provide a map of the network using passively collected logs, the type of logs available and the need for visualization (or not) will make individual use cases different. This example in CLX shows how to create this map using Windows Event Logs, relying on some port heuristics as well as graph analytics in the RAPIDS graph library, cuGraph.
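The port-heuristic idea can be sketched in a few lines (synthetic records and hypothetical field names below, not the CLX notebook’s actual code): hosts repeatedly contacted on well-known service ports are likely servers, and the resulting edge list is what a graph library such as cuGraph would consume for further analytics.

```python
# Minimal sketch (synthetic data, hypothetical field names) of the port
# heuristic behind network mapping: destinations contacted on well-known
# service ports are likely servers. The edge list could then be handed to a
# graph library (cuGraph in CLX) for components, PageRank, etc.
from collections import Counter

SERVICE_PORTS = {53: "dns", 88: "kerberos", 389: "ldap", 445: "smb"}

events = [  # (src_host, dst_host, dst_port) pulled from parsed logs
    ("wks01", "dc01", 88),
    ("wks02", "dc01", 389),
    ("wks01", "fs01", 445),
    ("wks03", "dc01", 88),
]

# Edge list for the network graph
edges = [(src, dst) for src, dst, _ in events]

# Heuristic role labels: count service-port connections per destination
roles = Counter()
for _, dst, port in events:
    if port in SERVICE_PORTS:
        roles[(dst, SERVICE_PORTS[port])] += 1

print(edges)
print(roles.most_common())  # dc01 surfaces as a likely domain controller
```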

Sample output of dynamic network mapping, visualized using Graphistry

These notebooks can be run on sample data or on data in your environment, be it in the cloud or on premises.
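As a flavor of the kind of feature engineering these notebooks use, here is a classic DGA-style signal (illustrative only, not the CLX model itself): the Shannon entropy of a domain label, which tends to be higher for algorithmically generated names than for human-chosen ones.

```python
# Illustrative DGA-style feature (not the CLX model itself): Shannon entropy
# of a domain's second-level label. Algorithmically generated names tend to
# score higher than human-chosen ones.
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("google"))        # lower: repeated, common letters
print(shannon_entropy("xjw9q2kfhz8a"))  # higher: DGA-like randomness
```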

Workflow Generation, CLX Modules, and Primitives

One of the benefits of developing in RAPIDS is that development code is production code. In a security production environment, analysts may not want to manage a large collection of files and code created during data science exploration. Once a use case is explored and results are verified, CLX shows how easy it is to wrap these use cases, including any necessary I/O, with workflow and I/O modules. Another sample use case notebook is SIEM alert analysis. Taking the code written by a cyber data scientist and putting it into a CLX workflow makes it easy to configure and run.

source = {
    "type": "kafka",
    "kafka_brokers": "kafka:9092",
    "group_id": "gtcdc",
    "batch_size": 23,
    "consumer_kafka_topics": ["gtcdemo_raw"],
    "time_window": 5,
}
dest = {
    "type": "kafka",
    "kafka_brokers": "kafka:9092",
    "group_id": "gtcdc",
    "batch_size": 24,
    "publisher_kafka_topic": "gtcdemo_enriched",
    "output_delimiter": ",",
}
workflow = SplunkAlertWorkflow(name="my-splunk-alert-workflow", source=source, destination=dest)
workflow.run_workflow()

The workflow continues to run, producing new results until stopped. It can be instantiated in regular Python files or even in a Jupyter Notebook if preferred.
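The pattern at work can be sketched in a self-contained way (class and method names below are illustrative, not the CLX API): a base class owns source and destination I/O, while a subclass supplies only the per-batch transform, keeping the analyst-authored code small.

```python
# Self-contained sketch of the workflow pattern (names are illustrative, not
# the CLX API): the base class owns source/destination I/O; a subclass
# supplies only the per-batch transform.
class Workflow:
    def __init__(self, name, source, destination):
        self.name = name
        self.source = source            # e.g., a Kafka config dict
        self.destination = destination

    def read(self):
        # Stand-in for pulling batches from the configured source
        yield [{"alert": "failed_login", "count": 7}]

    def write(self, batch):
        # Stand-in for publishing to the configured destination
        print(f"[{self.name}] -> {self.destination['type']}: {batch}")

    def workflow(self, batch):
        raise NotImplementedError

    def run_workflow(self):
        for batch in self.read():
            self.write(self.workflow(batch))

class AlertSeverityWorkflow(Workflow):
    def workflow(self, batch):
        # The analyst-authored piece: enrich each record
        for rec in batch:
            rec["severity"] = "high" if rec["count"] > 5 else "low"
        return batch

wf = AlertSeverityWorkflow("demo", {"type": "kafka"}, {"type": "kafka"})
wf.run_workflow()
```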

CLX also includes specific cybersecurity primitives and methods that make interacting with cyber data easier and faster. For example, CLX includes accelerated IPv4 methods, and RAPIDS integrates IPv4 as a datatype in cuDF. Performing IPv4 methods (e.g., unique, isin, is_private) using CLX is faster than the equivalent operations on a traditional CPU. DNS data is also supported, with DNS parsing functionality available. Check out some examples of how to use TLD and SLD extraction in the DGA detection use case.
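For reference, here is a CPU-side illustration using the Python standard library of the kinds of IPv4 operations CLX accelerates on GPU; the CLX versions operate over whole cuDF columns at once.

```python
# CPU-side illustration (Python stdlib) of IPv4 operations that CLX
# accelerates on GPU: is_private checks and watchlist membership tests
# over a column of addresses.
import ipaddress

ips = ["10.1.2.3", "8.8.8.8", "192.168.0.5", "1.1.1.1"]

# is_private: RFC 1918 ranges like 10/8 and 192.168/16 flag as True
private = [ipaddress.ip_address(ip).is_private for ip in ips]

# isin-style membership test against a watchlist
watchlist = {"8.8.8.8", "9.9.9.9"}
isin = [ip in watchlist for ip in ips]

print(private)  # [True, False, True, False]
print(isin)     # [False, True, False, False]
```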

IPv4 based methods using CLX (green) vs. traditional CPU functions

If you want to include OSINT data as part of your workflow, CLX provides integration points for several leading sources: WhoIs, Virus Total, and FarsightDB. Statistics functionality that is useful for cybersecurity alerting is also shown in CLX as part of the mlstats module. Need to alert via a rolling z-score for volumetric changes? clx.mlstats.rzscore is a great example of how to define a custom function.
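Conceptually, a rolling z-score scores each point by its distance from a trailing window’s mean, in units of that window’s standard deviation. Here is a sketch of that idea (not the clx.mlstats.rzscore source) in pandas, whose rolling API cuDF mirrors on GPU.

```python
# Sketch of a rolling z-score for volumetric alerting, the idea behind
# clx.mlstats.rzscore (this is not its actual source): score each point by
# its distance from a trailing window's mean, in units of that window's
# standard deviation. cuDF exposes the same rolling API on GPU.
import pandas as pd

def rolling_zscore(series: pd.Series, window: int) -> pd.Series:
    roll = series.rolling(window=window)
    return (series - roll.mean()) / roll.std()

counts = pd.Series([100, 102, 98, 101, 99, 500])  # sudden traffic spike
z = rolling_zscore(counts, window=5)
print(z)  # the spike at the end scores far above the steady baseline
```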

Flexible Log Parsing

One of the most arduous tasks of any security operation (and equally time-consuming for a data scientist) is ETL and parsing (2). Cybersecurity logs vary widely in their formatting, purpose, and source (sensor) location. Some logs are nicely structured while others are dumps of English text, and this doesn’t begin to include bespoke internal applications that emit highly specific logs. Regex is typically used to parse these logs, and this process is easily broken by changes to the log format and by corrupted data.

Further, SecOps may want to parse raw data differently than they originally did, whether to evaluate a new model, to support a forensic investigation, or to provide tweaked input for downstream processing/analytics. CLX includes cyBERT, a flexible log parser based on the BERT language model. cyBERT has been tested against Windows Event (WinEVT) logs as well as multi-vendor DNS logs, showing high micro- and macro-F1 scores (>0.999) with minimal validation loss (<0.0048).

Preliminary cyBERT results when parsing raw Windows Event Logs

Parsing heterogeneous logs without the need to write explicit parsers while simultaneously recovering from corrupted data and providing parsed values to downstream analytics is a force multiplier in a typical SOC that is understaffed. It also could provide flexibility to organizations that want to evaluate new security appliances or customize their logs, allowing them to test these new appliances/techniques in their own environment without time-consuming setup. And if you’re not ready to try cyBERT yet, CLX includes traditional parsers for some common cyber log formats, including Zeek (flow) and WinEVT logs.

There’s so much more to say about cyBERT that an entire blog post dedicated to the technique and its development will be available soon on this same Medium channel.

Next Steps

CLX has some of the basics to get started today, and more is coming in the future. An example direct-query integration with Splunk (CLX query) is already working as a proof of concept. It allows a security analyst to integrate data that doesn’t exist in the SIEM (e.g., results from an ML/DL workflow) with indexed data.

A sample CLX query

Integration with additional SIEM vendors is an active area of investigation.

Support for more cyber data types, including hexadecimal for IPv6 and MAC addresses, will be added, and additional use cases and example workflows will land in the CLX repo as they become available. Continuing to refine and build upon the existing notebooks is also planned, and further experimentation with and expansion of cyBERT is a priority. CLX is open source and part of RAPIDS, and GitHub issues are always welcome. Everyone can help! Please give feedback, contribute code, and submit PRs; more content and functionality is coming. You can also join the RAPIDS Slack channel and interact with the entire RAPIDS team there.

References:

  1. “Cybersecurity Skills Shortage Soars, Nearing 3 Million”, blog posted by ISC Management, 18 October 2018 at 09:11 AM, Cybersecurity Workforce, https://blog.isc2.org/isc2_blog/2018/10/cybersecurity-skills-shortage-soars-nearing-3-million.html.
  2. Zhu, J. et al., “Tools and Benchmarks for Automated Log Parsing”, International Conference on Software Engineering 2019, Montréal, QC, Canada, 25 May 2019, https://arxiv.org/pdf/1811.03509.pdf.
