GPU Accelerated Cyber Log Parsing with RAPIDS

Published in RAPIDS AI · May 7, 2019

By: Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker

Security Operations (SecOps) and IT departments are collecting, managing, and attempting to analyze more data than ever before. Employees are likely to connect their own devices to corporate networks, further widening an already heterogeneous attack surface. With the average time to detect a data breach at 196 days and the cost of a data breach to a US company at $7.91M, it is essential to collect, ingest, and make salient cybersecurity logs and data feeds available to SecOps teams. Parsing that raw data quickly and staging it so that it is readily available not only to human operators but also to machine learning models is a key factor in a strong, layered security model.

We introduce a RAPIDS use case to address this issue. Our workflow focuses on two RAPIDS libraries: cuDF and dask-cudf. cuDF (the GPU DataFrame library within RAPIDS) is modeled after the Pandas API, allowing the user to leverage GPUs almost seamlessly by editing a Python import statement. dask-cudf lets us execute tasks on a dataframe partitioned across GPUs, making it easy to scale data processing to multiple GPUs. In this post, we show how using RAPIDS to parse Windows Event Logs (WinEVT logs) provides a speed increase and affords immediate benefits of integration with machine learning techniques. Benchmarks for general end-to-end data processing and machine learning using RAPIDS can be found at RAPIDS.AI.

By the end of this tutorial, we’ll be able to parse raw Windows Event Logs containing authorization data.

Using RAPIDS for Log Parsing

Let’s start by importing the necessary RAPIDS libraries, as sketched below.

  • cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
  • dask-cudf is a library used to create partitioned, GPU-backed dataframes using Dask.
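
A minimal sketch of those imports (the exact import list in the original notebook may differ slightly):

    import cudf       # GPU DataFrame library modeled on the Pandas API
    import dask_cudf  # Dask-partitioned, GPU-backed dataframes for multi-GPU scaling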

Raw Windows Event Logs

Below are sample Windows event records of type 4624 and 4625, provided by Los Alamos National Laboratory (LANL). Event code 4624 represents a successful logon event; event code 4625 represents a failed logon event. While the LANL data is provided already parsed in JSON format, we created raw WinEVT data from this parsed format in order to test the workflow (shown in the “Raw” column). This data closely matches how a raw WinEVT log would appear after creation. The script used to create this raw representation of the data can be found here.

Importing the Data into a Dataframe

Let’s take these sample records and begin the parsing process. We can import the data easily from a CSV file. Follow along by saving the above data into a CSV file called “sample.csv”.
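
A sketch of the import step, reading the file just saved into a GPU dataframe (the variable name gdf is used throughout the rest of the sketches):

    import cudf

    # Read the sample records into a GPU dataframe.
    gdf = cudf.read_csv("sample.csv")
    print(gdf.head())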

Data Preprocessing

The data must be preprocessed to remove non-printable characters (e.g., newlines and tabs), replacing each newline with a “|”. We do this by creating a function called preprocess_logs that accepts a dataframe and preprocesses the “Raw” column.
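
A minimal sketch of what preprocess_logs might look like, using cuDF string methods on the “Raw” column; the exact replacement rules in the original notebook may differ:

    def preprocess_logs(logs_gdf):
        """Flatten each raw record onto a single line."""
        # Replace newlines with "|" and drop other non-printable characters such as tabs.
        logs_gdf["Raw"] = (
            logs_gdf["Raw"]
            .str.replace("\r", "")
            .str.replace("\n", "|")
            .str.replace("\t", " ")
        )
        return logs_gdf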

Then we run this function on our dataframe:
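
Continuing the sketch above (gdf is the hypothetical dataframe loaded from sample.csv):

    gdf = preprocess_logs(gdf)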

A sample output of the first record after preprocessing is shown below:

Applying Regular Expressions to Capture Key-Value Pairs

After preprocessing is complete, we use regular expression (regex) dictionaries to extract all of the key-value attribute pairs. Because Windows Event Logs with different event codes contain different key-value pairs, we must apply a different set of regexes to each log type. To help with this, we created two functions: filter_by_pattern, which filters the data by event code, and process_log_type, which parses the filtered records.
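
A possible sketch of filter_by_pattern, assuming the event code appears in the raw text of each record (the original implementation may filter on a dedicated column instead):

    def filter_by_pattern(logs_gdf, pattern):
        """Keep only the records whose "Raw" text contains the given event-code pattern."""
        return logs_gdf[logs_gdf["Raw"].str.contains(pattern)]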

The function process_log_type below performs several operations needed for processing. It:

  1. Creates an empty dataframe with the superset of output columns to hold the parsed data,
  2. Retrieves the list of regex pattern keys for a given event code, and
  3. Iterates over each key to use the regex extract operation on raw records to pull an attribute’s corresponding value.
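
A minimal sketch of process_log_type following those three steps. The regex_dict argument (a hypothetical mapping of event code to {column: regex with one capture group}) and output_cols (the superset of parsed columns) stand in for the configuration described later in this post:

    def process_log_type(logs_gdf, event_code, regex_dict, output_cols):
        """Parse the raw records of one event code into structured columns."""
        # 1. Start from an empty GPU dataframe that will hold the parsed attributes.
        parsed_gdf = cudf.DataFrame()

        # 2. Retrieve the regex patterns registered for this event code.
        patterns = regex_dict[event_code]

        # 3. Run a regex extract per attribute, keeping the first capture group.
        for col, regex in patterns.items():
            parsed_gdf[col] = logs_gdf["Raw"].str.extract(regex)[0]

        # Fill in any columns this event code never emits, so frames for different
        # event codes share the same schema and can be concatenated later.
        for col in output_cols:
            if col not in parsed_gdf.columns:
                parsed_gdf[col] = ""

        # Return the columns in a consistent order.
        return parsed_gdf[output_cols]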

We execute this on our dataframe for WinEVT code 4624.
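
For example (hypothetical variable names, continuing the sketches above):

    parsed_4624 = process_log_type(
        filter_by_pattern(gdf, "4624"), "4624", regex_dict, output_cols
    )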

These steps can be repeated for other event codes. The parsed output for event code 4624 is shown below.

Bringing It All Together

Let’s now tie all of these steps together with a pipeline function. This function preprocesses the logs, then applies filtering and regex extraction to each log based on its event code.
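
A sketch of how such a pipeline function could look, reusing the helpers sketched above:

    def pipeline(logs_gdf, event_codes, regex_dict, output_cols):
        """Preprocess raw logs, then parse each event code of interest and
        concatenate the results into a single GPU dataframe."""
        logs_gdf = preprocess_logs(logs_gdf)
        parsed_frames = []
        for event_code in event_codes:
            filtered = filter_by_pattern(logs_gdf, event_code)
            parsed_frames.append(
                process_log_type(filtered, event_code, regex_dict, output_cols)
            )
        return cudf.concat(parsed_frames)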

Optimizing the Pipeline for Large Datasets

For ease of use, we created additional functions to read in regex configurations from yaml files. This creates a well-formatted regex dictionary that we apply to a larger dataset with various event codes. A sample yaml file can be found here.
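
One way such a loader could be written, assuming one yaml file per event code containing a column-to-regex mapping (the actual file layout in the repository may differ):

    import glob
    import os
    import yaml

    def load_regex_configs(conf_dir):
        """Build a {event_code: {column: regex}} dictionary from the yaml files in conf_dir."""
        regex_dict = {}
        for path in glob.glob(os.path.join(conf_dir, "*.yaml")):
            event_code = os.path.splitext(os.path.basename(path))[0]
            with open(path) as f:
                regex_dict[event_code] = yaml.safe_load(f)
        return regex_dict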

Before we run our pipeline on a large dataset, we first define our input and output columns. As of this post, cuDF cannot concatenate dataframes that have different column names. For the time being, we pre-define the output columns to ensure each new dataframe has the same columns.
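
For illustration, assuming regex_dict has already been built with the loader sketched above, the superset of output columns could be derived from it (the notebook may instead define the column list explicitly):

    # Union of every column that any event code of interest can produce.
    output_cols = sorted({col for patterns in regex_dict.values() for col in patterns})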

We incorporate dask_cudf to execute our pipeline function, as shown in the sketch after the parameter list below. The following parameters are configurable and may be modified to suit your needs.

  • AUTH_INPUT_PATH: path to the input file containing LANL data.
  • AUTH_REGEX_CONF_PATH: path to the regex config files.
  • AUTH_EVENT_CODES_OF_INTEREST: an array of the event codes we are interested in.
  • AUTH_REQUIRED_COLS: the columns from the input CSV file needed for parsing and analytics.
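
A sketch of the multi-GPU execution using those parameters, reusing the regex_dict and output_cols from above. Note that Dask may require an explicit meta argument to map_partitions if it cannot infer the output schema; that detail is omitted here:

    import dask_cudf

    # Read the large LANL file into a partitioned, GPU-backed dataframe.
    auth_ddf = dask_cudf.read_csv(AUTH_INPUT_PATH, usecols=AUTH_REQUIRED_COLS)

    # Apply the parsing pipeline to every partition in parallel, then gather the result.
    auth_gdf = auth_ddf.map_partitions(
        pipeline, AUTH_EVENT_CODES_OF_INTEREST, regex_dict, output_cols
    ).compute()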

A sample output of auth_gdf is shown below when the pipeline is run on a larger dataset.

Conclusion

As demonstrated, we can use RAPIDS to parse event log data on the GPU. The first step of any analytic is to parse the data, and we will use this parsed data to create a network map in a future post. View the completed Jupyter notebook, which includes the network mapping, and execute it in your own environment. Input data matching the data used in this post is provided. Instructions for configuring a Docker environment are available in the README of the GitHub repository.

The processing time for the Jupyter notebook referenced above can be found in the table below.

We invite you to contribute to RAPIDS and submit issues or feature requests as we continue development.
