Bartley Richardson
Published Dec 5, 2019


Neural network, that’s the tech; To free your staff from, bad regex

Authors: Rachel Allen, Bartley Richardson

Introduction to Logs

Since the dawn of time, humans have been struggling with and overcoming their problems with logs. Tools to fell trees that first built simple lean-to structures were inefficient for growing populations, and civilizations invented new ways to harvest logs, mill them, and erect larger and more complex buildings, outpacing traditional log cabins. Humans discovered new ways to use logs as fuel, combining them with a spark to maintain fires that provided warmth and power. It comes as no surprise that with the rise of computers and network communication, a different type of log became important and, unfortunately, more difficult to manage than ever before.

Cybersecurity logs are generated across an organization and cover endpoints (e.g., computers, laptops, servers), network communications, and perimeter devices (e.g., VPN nodes, firewalls). Using a conservative estimate for a company with 1,000 employee devices, a small organization can expect to generate over 100 GB/day of log traffic, with a peak of over 22,000 events per second (EPS)¹. Some of these logs are generated by users and activity on the network; others are generated by network and security appliances deployed throughout the environment.

Why Log Parsing?

It began simply enough. Logs needed to be kept for important events. Bank transactions needed to be recorded for verification and auditing purposes. As communications systems became more complex, additional logs were kept to ensure systems were reliable and robust. The Internet ushered in a new age of communication, commerce, and information exchange. Valuable information was being exchanged, and logs were needed to verify that communications were authentic and permitted. As the price of storage fell, security professionals urged their organizations to collect more logs, to collect more data. And they were successful.

Today, organizations collect, store, and (attempt to) analyze more data than ever before. Logs are heterogeneous in source, format, and time. In order to analyze the data, it first needs to be parsed: actionable fields must be extracted from raw logs. Today, logs are parsed using complex heuristics and regular expressions. These heuristics are inflexible and prone to failure if a log deviates at all from its expected format. Consider the situations below.

  • What happens when a new sensor/application is introduced, and with it a new log format? Even if current log parsers can handle data similar to this new format, a new parser must be written (regex is too rigid to generalize).
  • What happens with degraded logs? Does the entire pipeline fail, or do we lose that log entirely? A SIEM company or an ISV may write parsers for their own logs, but your internal staff has to write parsers for internally created applications that don’t adhere to a widely used format.
  • What if a security operations (secops) team wants to ingest these logs and evaluate what information from them is actionable and required? Today this can require multiple iterations, taxing an already understaffed secops team to evaluate the quality of the parsed log.
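As a contrived illustration of this rigidity, consider a regex written for one hypothetical log format; a trivial, semantically harmless reordering of fields defeats it entirely (the format and field names here are invented for the example):

```python
import re

# Hypothetical parser for a fixed log format: "user=<name> src=<ip>"
PATTERN = re.compile(r"user=(\w+) src=(\d+\.\d+\.\d+\.\d+)")

match = PATTERN.search("user=alice src=10.0.0.1")
print(match.groups())  # ('alice', '10.0.0.1')

# The same information with the fields swapped produces no match at all.
print(PATTERN.search("src=10.0.0.1 user=alice"))  # None
```

A human reading both lines extracts identical information; the regex extracts it from only one.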

Put simply, there has to be a more flexible, resilient way of parsing logs. Let’s look at what’s available. Most organizations keep a long history of logs, and it’s straightforward to keep both raw logs and parsed versions of those logs. Access to that many examples sounds like a problem well-suited to deep learning, specifically a deep neural network (DNN). But there are so many to choose from, so where do we start?

What’s NLP and Why Should We Use It?

There are many ways to process logs and cybersecurity data. In this case, we focus on parsing logs that are typically defined by humans to record machine-to-machine exchanges. Turning to a technique like Natural Language Processing (NLP) is worth exploring. NLP is traditionally used for applications such as text translation, interactive chatbots, and virtual assistants. The first step in modern NLP techniques is transforming text or speech into a mathematical representation. These representations can be as straightforward as a look-up that converts characters to numbers, or they can be much more complex, such as the output of a previously trained neural network (e.g., Word2vec, GloVe, BERT, GPT-2). These neural-network representations learn relationships between words in an unsupervised manner, based on their co-occurrences with other words in a very large training corpus, such as all of English Wikipedia. Machine learning models are then built on top of these representations to achieve the desired output, such as clustering or classification. Previous² work³ shows that viewing cybersecurity data as a type of natural language can be successful.
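To make the simplest kind of representation concrete, a character-to-number look-up can be nothing more than mapping each character to an integer (a toy example, not the encoding any particular model uses):

```python
# Toy illustration: the simplest text representation is a look-up
# that converts each character to a number. Here we use Unicode
# code points; real models use learned vocabularies instead.
def char_lookup(text):
    """Map each character to its Unicode code point."""
    return [ord(c) for c in text]

print(char_lookup("GET"))  # [71, 69, 84]
```

Representations like Word2vec or BERT replace this trivial mapping with dense vectors learned from a large corpus.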


Given their utility, there is no shortage of pre-trained word representations created for NLP. Older neural-network word representations like Word2vec are context-free: they create a single word embedding for each word in the vocabulary and are unable to distinguish words with multiple meanings (e.g., a file on disk vs. a single-file line). More recent models (e.g., ULMFiT and ELMo) have multiple representations for words based on context, which they achieve by using the word plus the preceding words in the sentence to create the representations.

BERT (Bidirectional Encoder Representations from Transformers) also creates contextual representations, but it takes into account the surrounding context in both directions, both before and after a word. Encoding this contextual information is important for understanding cyber logs because of their ordered nature. For example, across multiple log types a source address occurs before a destination address. An additional challenge of applying a natural language model to cyber logs is that many “words” in a cyber log are not English words; they include things like file paths, hexadecimal values, and IP addresses. Other language models return an “out-of-dictionary” entry when faced with an unknown word, but BERT breaks the words in our cyber logs down into in-dictionary WordPieces. For example, ProcessID becomes two in-dictionary WordPieces: Process and ##ID. Additionally, BERT is an attractive model for our use case because it was open sourced by Google in late 2018, and the HuggingFace transformers library contains an easy-to-use pre-trained model implemented in PyTorch. The library makes it easy to add fine-tuning layers on top of the representation layers for our specific downstream classification task of Named Entity Recognition (NER). A final benefit of selecting BERT for cyber log parsing is that we can take advantage of the epic portmanteau: cyBERT.
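The WordPiece idea can be sketched as a greedy longest-match over a vocabulary. The tiny vocabulary below is hypothetical and exists only to make the ProcessID example runnable; the real BERT tokenizer ships with a vocabulary of roughly 30,000 entries learned from its training corpus:

```python
# Minimal sketch of greedy longest-match WordPiece tokenization.
# TOY_VOCAB is a hypothetical stand-in for BERT's real ~30k-entry
# vocabulary; continuation pieces carry a "##" prefix.
TOY_VOCAB = {"Process", "ID", "##ID", "Account", "##Domain"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it appears in the
        # vocabulary (with "##" prepended after the first piece).
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                pieces.append(sub)
                break
            end -= 1
        if end == start:  # nothing matched: the word is unknown
            return [unk]
        start = end
    return pieces

print(wordpiece("ProcessID"))  # ['Process', '##ID']
```

Because unknown words decompose into smaller in-vocabulary pieces, the model rarely has to fall back to a single opaque unknown token.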

The cyBERT Experiment

cyBERT is an ongoing experiment to train and optimize transformer networks for the task of flexibly and robustly parsing logs of heterogeneous cybersecurity data. It’s part of CLX (read our overview blog about CLX), a set of cyber-specific applications built using RAPIDS. Since BERT was designed for natural human language and more traditional NLP tasks like question answering, we have had to overcome several challenges in our implementation. Unlike the flexible sentence structure of human language, the rigid order of some cyber logs can cause our model to learn the absolute positions of fields rather than their relative positions. Another challenge is that many of our logs exceed the maximum of 512 tokens (also called WordPieces) that can be input to BERT as one sequence. Additionally, longer sequences are disproportionately expensive because the runtime of the network’s attention mechanism is quadratic in the sequence length. To achieve more robustness and flexibility, we fine-tuned our model on log pieces of varying lengths and starting positions. Before inference, we split the logs into overlapping pieces to accommodate the input size of the model; labeled logs are recombined in post-processing. Thus far, we’ve experimented with input sequences of varying lengths, training data sizes, numbers of log types, and numbers of training epochs.
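The overlapping-split step described above might look roughly like this. The 50% overlap (stride of half the window) is an illustrative assumption, not necessarily the exact value used in CLX:

```python
def split_with_overlap(tokens, window=256, stride=128):
    """Split a token sequence into overlapping windows so each piece
    fits the model's maximum input size. A 50% overlap
    (stride = window // 2) is an illustrative assumption."""
    if len(tokens) <= window:
        return [tokens]
    pieces, start = [], 0
    while start + window < len(tokens):
        pieces.append(tokens[start:start + window])
        start += stride
    pieces.append(tokens[-window:])  # final window, flush with the end
    return pieces

# A 512-token log split into 256-token windows yields 3 pieces.
print(len(split_with_overlap(list(range(512)))))  # 3
```

The overlap ensures every token eventually appears in the interior of some window, where the model has context on both sides.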

For example, inference for a single 512-token WordPiece sequence takes 20.3 ms. However, this does not tell the entire story. To parse the same log with a WordPiece sequence size of 256, more than two pieces must be fed into the model, to account for the overlap between log pieces. To achieve the same effect as parsing a log with one 512-length WordPiece sequence, it is necessary to run three sequences through a 256-WordPiece model. Figure 1 illustrates the performance characteristics (lines) and timings (bars) across various WordPiece sequence sizes when parsing an entire log.

Figure 1: Model inference performance vs. Sequence size

For logs with an average token count over 512, it makes sense to use the largest possible WordPiece size. This gives not only the fastest performance but also near-top performance on all evaluation metrics. However, in the real world, a Security Operations Center (SOC) may not actually see token counts this large in its logs. In that case, a balance can be struck between the maximum number of tokens and the performance criteria.

Consider a WordPiece size of 64. While parsing an entire log requires 15 sequences in our experiment (compared with a single sequence at 512), the time required increases by only ~5 ms. If logs are typically smaller, though, inference on a single sequence of 64 tokens takes 18.9 ms. Even with the reduced number of tokens, performance across all metrics remains high. What all of this means is that there isn’t a single off-the-shelf way to implement cyBERT that will work for every organization; attention must be paid to the types of logs and their general composition. Our code for cyBERT, with the parameters that worked best for our data, can be found in the CLX repo.
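The sequence counts above follow from simple arithmetic if we assume a 50% overlap between pieces: covering a log of L tokens with windows of size w and stride w/2 takes ceil((L - w) / (w/2)) + 1 windows. A quick check of the numbers quoted in this section:

```python
def num_windows(total, window):
    """Number of overlapping windows needed to cover `total` tokens,
    assuming a 50% overlap (stride = window // 2), which is an
    illustrative assumption rather than the exact CLX setting."""
    if total <= window:
        return 1
    stride = window // 2
    # Ceiling division so a final partial window is still counted.
    return -(-(total - window) // stride) + 1

# Counts for a 512-token log at the WordPiece sizes discussed above.
for w in (512, 256, 64):
    print(w, num_windows(512, w))
```

This reproduces the 1, 3, and 15 sequences quoted for WordPiece sizes of 512, 256, and 64 respectively.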


Fine-tuning the pre-trained base BERT model to label the entries of cyber logs with their field names is quite powerful. We initially trained and tested our model on whole logs that were all small enough to fit in one input sequence and achieved a micro-F1 score of 0.9995. However, this model cannot parse logs larger than the maximum model input sequence, and its performance suffered when the logs from the same testing set were changed to have variable starting positions (micro-F1: 0.9634) or were cut into smaller pieces (micro-F1: 0.9456). To stop the model from learning the absolute positions of fields, we moved to training on log pieces. Training this way yields accuracy similar to that of the fixed-starting-position model and performs well on log pieces with variable starting positions (micro-F1: 0.9938).

We achieve the best results when we train our model on log pieces and measure our testing accuracy by splitting each log before inference into overlapping log pieces, then recombining and keeping the predictions from the middle half of each piece. This gives the model the most context in both directions for inference. One of the most exciting features of cyBERT is its ability to parse log types outside the training set. When trained on just 1,000 examples of each of nine different Windows event log types, it can accurately (micro-F1: 0.9645, see Figure 2) parse a never-before-seen Windows event log type.
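The recombination step, keeping predictions from the middle half of each overlapping piece, might be sketched like this. It assumes consecutive windows overlap by 50%, so the kept middle halves tile the full log exactly; the real CLX post-processing may differ in its details:

```python
def stitch_predictions(pieces, window=256):
    """Recombine per-window label predictions into one sequence,
    keeping only the middle half of each interior window so every
    token's label comes from a prediction with context on both sides.
    Assumes a 50% overlap between consecutive windows (a sketch;
    CLX's actual stitching logic may differ)."""
    quarter = window // 4
    last = len(pieces) - 1
    labels = []
    for i, piece in enumerate(pieces):
        lo = 0 if i == 0 else quarter            # keep the head of the first piece
        hi = len(piece) if i == last else window - quarter  # and the tail of the last
        labels.extend(piece[lo:hi])
    return labels
```

With a 50% overlap, the middle halves of consecutive windows abut exactly, so each token's label is taken from precisely one window.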

Figure 2: Performance for Tests Including/Excluding Unseen Fields in Training

Next Steps

After an encouraging start with the high accuracy of the base BERT model, our next steps aim to make cyBERT more robust and flexible. The current model is trained only on Windows event logs; we plan to collect a more diverse set of logs for training, including additional Windows event logs and Apache web logs. The language of cyber logs is not the same as the English-language corpus the BERT tokenizer and neural network were trained on. We believe our model will improve in both speed and accuracy if we move to a custom tokenizer and representation trained from scratch on a large corpus of cyber logs. For example, the current BERT WordPiece tokenizer breaks AccountDomain down into A ##cco ##unt ##D ##oma ##in, which we believe is more granular than the meaningful WordPieces of AccountDomain in the cyber log language. Our parser also needs to move at network speed to keep up with the high volume of generated logs. In the future we will move all preprocessing, tokenization, and post-processing to the GPU for faster parsing without the need to communicate back and forth with host memory.


cyBERT is off to a promising start in the long-standing battle of man versus logs. In this post, we’ve shown how interpreting synthetic cybersecurity logs as a natural language has the potential to render traditional, regex-based parsing mechanisms obsolete and to introduce a new level of flexibility and resilience to typical log-parsing architectures. Parsing logs efficiently and correctly is critical to any security operations center, and cyBERT allows users to accomplish this without developing extensive regex libraries. Further, as we increase the speed of pre- and post-processing with cyBERT, it will become possible to replay archived logs through new parsers, allowing security analysts to quickly extract new information from older logs as needed. We’re excited about the future of cyBERT and about sharing our work with the larger cybersecurity community!


  1. Brad Hale, “Estimating Log Generation for Security Information Event and Log Management.”
  2. Richardson, B., Radford, B., et al., “Anomaly Detection in Cyber Network Data Using a Cyber Language Approach,” 2018.
  3. Radford, B., Richardson, B., Davis, S., “Sequence Aggregation Rules for Anomaly Detection in Computer Network Traffic,” 2018.

About the Authors

Rachel Allen is a Senior InfoSec Data Scientist in the AI Infrastructure team at NVIDIA. Rachel’s focus at NVIDIA is the research and application of GPU-accelerated methods to help solve information security and cybersecurity challenges. Her primary research interests involve the application of NLP and Bayesian statistical modeling to cybersecurity challenges, with cyBERT being her latest contribution. Prior to joining NVIDIA, Rachel was a lead data scientist at Booz Allen Hamilton where she designed a variety of capabilities for advanced threat hunting and network defense with their commercial cybersecurity group, DarkLabs. She is a former fellow and instructor at The Data Incubator, a data science training program. Rachel holds a bachelor’s degree in cognitive science and a PhD in neuroscience from the University of Virginia.

Bartley Richardson is an AI Infrastructure manager and Senior Data Scientist at NVIDIA. His focus at NVIDIA is the research and application of GPU-accelerated methods and GPU architectures that can help solve today’s information security and cybersecurity challenges. Prior to joining NVIDIA, Bartley was a technical lead and performer on multiple DARPA research projects, where he applied data science and machine learning algorithms at scale to solve large cybersecurity problems. He was also the principal investigator of an Internet of Things research project focused on applying machine and deep learning techniques to large amounts of IoT data. His primary research areas involve NLP and sequence-based methods applied to cyber network datasets, as well as cross-domain applications of machine and deep learning solutions to tackle the growing number of cybersecurity threats. Bartley holds a PhD in Computer Science and Engineering from the University of Cincinnati with a focus on loosely structured and unstructured query optimization. His BS is in Computer Engineering with a focus on software design and AI.