cyBERT 2.0 — Streaming GPU log parsing with RAPIDS

New GPU subword tokenizer and integration with cuStreamz

Rachel Allen
RAPIDS AI
Nov 18, 2020


Authors: Rachel Allen and Bianca Rhodes

The cyBERT experiment continues! We’ve made some big improvements to the pipeline since our initial post, and we’re calling the update cyBERT 2.0. It adds a GPU subword tokenizer and integration with the Python streaming library Streamz. In response to community feedback, we’ve also added support for DistilBERT and ELECTRA. Finally, we’re sharing cyBERT models trained on publicly available data in the HuggingFace model repository, so you can skip training and start using cyBERT right now.

Using cyBERT and Streamz

Security teams have the burden of interpreting gigabytes of streaming logs per day, scanning for abnormal or malicious events, and taking immediate action. As data volumes grow, this only gets harder: the pre-processing workload grows with them, and the pipeline has to scale to keep up. By combining cyBERT and Streamz, we can build such a pipeline, seamlessly tying together Dask, RAPIDS, and cyBERT for GPU-accelerated parsing of log data. To make cyBERT easy to drop into this pipeline, we’ve also added a cyBERT implementation to CLX, a repository where we maintain a collection of RAPIDS-focused examples and integrations for security analysts, data scientists, and engineers. Getting started with cyBERT is as simple as a Python import.
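As a taste of what that looks like, here is a minimal, non-streaming sketch. The file paths and example log line are placeholders, and the exact load_model arguments and inference return values may vary across CLX releases:

```python
# Minimal non-streaming sketch of the CLX cyBERT class.
# Paths and the sample log line are placeholders; method signatures
# may differ slightly between CLX releases.
import cudf
from clx.analytics.cybert import Cybert

cybert = Cybert()
cybert.load_model("/path/to/pytorch_model.bin", "/path/to/labels.json")

# Raw, unparsed log lines held in a cuDF series
logs = cudf.Series([
    "Oct 10 12:34:56 host sshd[1234]: Failed password for invalid user admin",
])

# inference() is assumed to return the parsed fields plus per-field
# confidence scores as two cuDF DataFrames
parsed_df, confidence_df = cybert.inference(logs)
print(parsed_df.head())
```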

The cyBERT streaming workflow (Figure 1) is designed to:

  1. Read in data from Kafka in batches,
  2. Push that data to a Dask cluster for GPU-accelerated tokenization and cyBERT parsing, and
  3. Publish the parsed data back to Kafka.

Figure 1: cyBERT streaming workflow

3 Steps to Create the cyBERT Streamz Workflow

To create the cyBERT Streamz workflow, there are three major steps:

  1. initializing each Dask worker with the cyBERT model,
  2. defining the Streamz source, and
  3. creating the Streamz pipeline.

Initializing Dask Workers with cyBERT Model

First, we initialize our Dask cluster and load a pre-trained cyBERT model onto each worker by supplying the path to the model and its corresponding label file. To do this, we use the CLX cyBERT class. Each Dask worker then references this model when parsing the streaming log data.
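Below is a sketch of how that initialization might look, assuming a Dask-CUDA LocalCUDACluster and an illustrative worker_init helper (the helper name and file paths are ours, not part of CLX):

```python
# Sketch: give every Dask-CUDA worker its own cyBERT model instance.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, get_worker

MODEL_PATH = "/path/to/pytorch_model.bin"   # placeholder paths
LABEL_PATH = "/path/to/labels.json"

def worker_init():
    # Runs once on each worker: load the model and stash it on the
    # worker object so every incoming batch can reuse it.
    from clx.analytics.cybert import Cybert
    cy = Cybert()
    cy.load_model(MODEL_PATH, LABEL_PATH)
    get_worker().data["cybert"] = cy

cluster = LocalCUDACluster()
client = Client(cluster)
client.run(worker_init)  # initialize the model on all workers
```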

Define the Streamz Source

Next, we define our Streamz source. In this example, we ingest data from Kafka in batches: very few applications in cybersecurity are truly streaming, and micro-batching is a common approach to moving data. To learn more about the Kafka consumer configurations shown below, visit the Apache Kafka documentation. We also indicate that we will be using Dask in our stream by setting dask=True.
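A sketch of such a source follows; the topic name and consumer settings are placeholders, and the exact keyword arguments may differ with your Streamz version:

```python
# Sketch: a batched Kafka source that scatters each batch to Dask.
from streamz import Stream

consumer_conf = {
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "cybert",
    "session.timeout.ms": "60000",
}

source = Stream.from_kafka_batched(
    "raw_logs",            # input topic (placeholder)
    consumer_conf,
    poll_interval="1s",    # micro-batch cadence
    npartitions=5,         # e.g. one partition per GPU worker
    asynchronous=True,
    dask=True,             # hand each batch to the Dask cluster
)
```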

Create the Streamz Workflow

We then create the Streamz workflow itself. For the inference step, we reference the CLX cyBERT class directly; its inference method accepts a RAPIDS cuDF series containing raw log data. We conclude the pipeline by sinking the parsed data back to Kafka.
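Putting the pieces together, the full pipeline might look like the sketch below. The process_batch and sink_to_kafka helpers, topic names, and producer settings are illustrative, and the code assumes the worker initialization and source sketches above:

```python
# Sketch: run cyBERT inference on the workers, then publish the parsed
# logs back to Kafka.
import confluent_kafka as ck
import cudf
from dask.distributed import get_worker

producer_conf = {"bootstrap.servers": "localhost:9092"}  # placeholder

def process_batch(messages):
    # Runs on a Dask worker: build a cuDF series from the raw Kafka
    # messages and parse it with the model loaded in worker_init().
    cybert = get_worker().data["cybert"]
    logs = cudf.Series([msg.decode("utf-8") for msg in messages])
    parsed_df, confidence_df = cybert.inference(logs)
    return parsed_df

def sink_to_kafka(parsed_df):
    # Publish each parsed batch to the output topic as JSON records.
    producer = ck.Producer(producer_conf)
    records = parsed_df.to_pandas().to_json(orient="records", lines=True)
    for record in records.splitlines():
        producer.produce("parsed_logs", record)  # output topic (placeholder)
    producer.flush()

# Map each micro-batch through inference on the cluster, gather the
# results back locally, and sink them to Kafka.
source.map(process_batch).gather().sink(sink_to_kafka)
source.start()
```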

With cyBERT 2.0, we’ve built a streaming deep learning pipeline that can handle text inputs of any size. Because logs are no longer truncated, speed comparisons to cyBERT 1.0 are difficult. Even so, in our testing we achieve over 1,000 raw logs per second using only five Tesla V100 GPUs. This is an end-to-end pipeline, from raw to parsed logs, without a single line of regex. For a complete example of this workflow, visit our documentation.

Support for More Than Just BERT Models

Since the start of cyBERT, we have added support for two additional transformer architectures: DistilBERT and ELECTRA. Both also take advantage of the GPU-accelerated subword tokenizer. On an NVIDIA Tesla V100, inference with DistilBERT and ELECTRA is 3x and 6x faster than with traditional BERT, respectively. This speed comes at some cost in accuracy: in our experiments, the average F1 score decreases by 0.05 for DistilBERT and 0.01 for ELECTRA. Depending on your logs and use cases, the trade-off could be worth it.

Additionally, we pre-trained all three supported model types on a mixture of English Wikipedia and a large publicly available log dataset. All of these models can be downloaded from the HuggingFace model repository, and you can fine-tune them to parse your specific logs by following our training notebook examples.

Next Steps

Although the majority of our pipeline now resides on the GPU, a few post-processing steps still require Pandas to accommodate nested types in columns. There is an open feature request for this in cuDF, which we continue to follow, and we are also exploring ways to post-process the data without nested types. However, this is not a substantial bottleneck for cyBERT. In addition, cyBERT depends heavily on both RAPIDS and PyTorch, and their competing memory pool allocators prevent us from using GPU memory as efficiently and effectively as we could. The RAPIDS team is exploring ways to work with the community to address this issue.

Conclusion

Now that we have shared log-specific language models and a streaming log pipeline, we hope the community can use them for other downstream tasks in addition to log parsing. We’d love to hear how you are using and modifying cyBERT for your own needs; potential future use cases include sequence classification of anomalous traffic and identifying important messages in error logs. As always, we welcome your input about your own cyber log pain points, any feature requests you might have, and any bugs you encounter. Head over to our GitHub repo to file an issue!

