Preprocess Your Data at Lightspeed with Our GPU-based Tokenizer for BERT Language Models
Introduction
We’ve previously blogged about CLX, a collection of RAPIDS-based applications for cybersecurity, and cyBERT, a part of CLX that seeks to use large NLP models to parse raw cybersecurity logs into common formats. To get up to speed on what we’ve built to date, please refer to the previously published CLX blog and the cyBERT blog. That initial proof-of-concept demonstrated that this type of approach is both possible and a viable improvement over the heuristic-based regex parsers that dominate the industry.
This blog focuses on a portion of the cyBERT pipeline that was unnecessarily slow: tokenization. For our purposes, tokenization is the process of transforming long text strings into smaller, meaningful pieces to be fed into a language model for processing. The GPU-based Subword Tokenizer (GST) performs a task similar to that of the WordPiece tokenizer or the SentencePiece tokenizer with the BPE algorithm, applied as a pre-processing step before input into a BERT language model. The GST described below is up to 270 times faster than current CPU-based implementations.
The Problems
While the inferencing performance through the BERT model used in cyBERT is fast, the pre-processing and post-processing were slowing down the entire pipeline. In fact, the majority of the end-to-end inferencing time was spent in tokenization. Figure 1 shows the overall speed for cyBERT v1; over 70% of the time was spent in pre- and post-processing.
In order to run an inferencing task (parsing the raw logs, in the case of cyBERT), it’s necessary to tokenize the raw logs to prepare them for parsing. In Figure 2, this is represented by the “Tokenize” step (column highlighted in blue).
There are a wide variety of tokenization methods available. Some of the simplest are dictionary tokenizers, where each individual token is given a numeric representation. While straightforward, this presents issues with large volumes of text, since the vocabulary size quickly explodes. Attempting to constrain the vocabulary size often results in frequent out-of-dictionary lookups, slowing the task even further. Instead, it is common to use a WordPiece-style tokenizer for BERT pre-processing (referred to from here on as a BERT tokenizer). One of the primary advantages of this BERT tokenizer is that it keeps the vocabulary size small by splitting more complex words into smaller, in-dictionary pieces. This has the extra benefit of eliminating out-of-dictionary words in the corpus. The tokenizer favors longer subword pieces, with a de facto character-level model to fall back on because every character is part of the vocabulary as a possible subword. Consider the sentence:
“This example is 14 words and 19 tokens. AccountDomain: foo”
After WordPiece tokenization, this sentence gets broken into the tokens:
‘This’, ‘example’, ‘is’, ‘14’, ‘words’, ‘and’, ‘19’, ‘token’, ‘##s’, ‘.’, ‘A’, ‘##cco’, ‘##unt’, ‘##D’, ‘##oma’, ‘##in’, ‘:’, ‘f’, ‘##oo’
Notice how some words (AccountDomain, tokens) are split into multiple tokens.
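For readers who want to reproduce a WordPiece split like this on the CPU, here is a minimal sketch using the Hugging Face transformers library. The exact subword splits depend on the vocabulary of the pre-trained model used, so the output may differ slightly from the list above.

```python
# Minimal CPU-side illustration of WordPiece tokenization with Hugging Face
# transformers; the exact splits depend on the vocabulary of the chosen model.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("This example is 14 words and 19 tokens. AccountDomain: foo"))
# Words not in the vocabulary (e.g. "AccountDomain") are split into
# in-vocabulary subwords, with "##" marking continuation pieces.
```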
However, this type of tokenization still presents multiple issues.
- First, no complete on-GPU WordPiece-style tokenization implementation existed that we could use. One of the primary ways to accelerate the cyBERT workflow is to migrate more components to the GPU to take advantage of the high degree of parallelization it offers.
- Second, even the existing BERT tokenizers relied on the truncation of sentences during pre-processing. Truncation is acceptable for some tasks, but when attempting to process raw cybersecurity logs that can be highly variable in length, truncation comes at the expense of accuracy. One of the ways the BERT model parses logs is by learning the context of the tokens. It is useful to consider tokens in their original context, without truncation.
- Third, inferencing using a BERT model typically takes the form of sentence completion or question answering. In those use cases, once an answer has been formed, the rest of the input sentence can be disregarded, and the original sentences never need to be reassembled. Log parsing breaks both of these assumptions: the entire log must be considered, and split sentences must be reconstructed as a post-processing step in order to provide the original words and context to security analysts.
These problems motivated us to create the first GPU-accelerated BERT tokenizer that does not truncate sentences, retains information necessary to reconstruct original words from splits, and is fully compatible with RAPIDS.
The GPU Subword Tokenizer (GST)
For the most part, the GPU Subword Tokenizer (GST) is similar to other WordPiece-style tokenizers found in common NLP libraries; the overall goal remains the same. The tokenizer accepts a raw input string and generates a tensor and attention mask that are ready for BERT inference via a deep learning package (e.g., PyTorch). However, this tokenizer also produces an additional metadata output. The metadata, along with an overlap mask generated from the attention mask, is used to reconstruct split sentences in the post-processing step. The GST is written in C++/CUDA with Cython wrappings that enable usage in Python. It is fully compatible with BERT-based language models in Hugging Face, and the output is ready for pipelining into inference, training, and fine-tuning tasks with PyTorch.
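As a rough sketch of that hand-off, the snippet below feeds a token tensor and attention mask into a Hugging Face token-classification model with PyTorch. The tensor contents, shapes, and number of labels here are arbitrary placeholders standing in for real GST output, not cyBERT's actual configuration.

```python
# Sketch of feeding tokenizer output into a Hugging Face BERT model via PyTorch.
# "token_tensor" and "attention_mask" are placeholders standing in for the
# tensors the GST produces; num_labels=10 is an arbitrary choice for illustration.
import torch
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=10)
model = model.cuda().eval()

token_tensor = torch.zeros((8, 64), dtype=torch.long, device="cuda")   # placeholder token ids
attention_mask = torch.ones((8, 64), dtype=torch.long, device="cuda")  # placeholder mask

with torch.no_grad():
    logits = model(input_ids=token_tensor, attention_mask=attention_mask)[0]
labels = logits.argmax(dim=-1)  # one predicted label per token
```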
A Motivating Example
The best way to describe the tokenizer is with an example. Below we walk through a simple example that illustrates how the tokenizer works and how the metadata is used to reconstruct the logs from the parsed output. Assume there are two logs (kept relatively short for illustrative purposes).
The tokenizer’s parameters are set to a max sequence length of 10 and a stride of 6. It splits the token list and creates a tensor and an attention mask as output. These look like:
Unique to this tokenizer is the creation of metadata: for every row in the tensor, a triple storing (logID, start, stop).
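To make the splitting step concrete, here is a simplified, pure-Python sketch of windowing a single log's token ids with a max length of 10 and a stride of 6. It illustrates the idea only; it is not the GST's CUDA implementation, and its metadata simply records the window span per row rather than the GST's exact layout.

```python
# Simplified pure-Python illustration of the splitting step (not the GST itself):
# consecutive rows advance by `stride` tokens, so they overlap by
# max_seq_len - stride = 10 - 6 = 4 tokens. Metadata here records
# (log_id, first original token position, last original token position) per row.
def split_with_stride(token_ids, log_id, max_seq_len=10, stride=6):
    rows, masks, metadata = [], [], []
    start = 0
    while True:
        window = token_ids[start:start + max_seq_len]
        pad = max_seq_len - len(window)
        rows.append(window + [0] * pad)              # zero-pad the last window
        masks.append([1] * len(window) + [0] * pad)  # attention mask
        metadata.append((log_id, start, start + len(window) - 1))
        if start + max_seq_len >= len(token_ids):
            break
        start += stride
    return rows, masks, metadata

rows, masks, meta = split_with_stride(list(range(100, 123)), log_id=0)
# meta -> [(0, 0, 9), (0, 6, 15), (0, 12, 21), (0, 18, 22)]
```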
The tensor and attention mask are fed to the BERT model for either fine-tuning or inference. The model predicted output is shown below. Areas highlighted in yellow represent the overlap between pieces of the same log.
It’s here that the metadata is used to stitch the pieces of the logs back together. Since the stride is set to 6 and the max length is 10, consecutive rows overlap by 4 tokens (2 tokens on the end of most rows, excluding the start of a new row). Grouping by the label (predicted output) gives:
Decoding yields:
Finally, cleaning up space and splits gives:
In the example above, the columns correspond (left to right) to the date/time, the action performed, domain name, and username. Even though the two logs were somewhat different in their raw forms, cyBERT is able to parse the logs into similar key/value items.
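For readers who want to follow the stitching logic in code, below is a rough continuation of the simplified sketch above, shown for a single log: per-row predictions are mapped back to their original token positions via the metadata, overlapping predictions are de-duplicated, and tokens are grouped by predicted label and re-joined at the “##” splits. This is an illustration only, not cyBERT's actual post-processing code.

```python
# Rough illustration of the post-processing step, continuing the simplified
# sketch above (single log; metadata = (log_id, start, stop) in original
# token positions). Not the actual cyBERT/GST post-processing code.
from collections import defaultdict

def stitch(row_labels, metadata, original_tokens):
    position_label = {}  # original token position -> predicted label
    for labels, (_log_id, start, stop) in zip(row_labels, metadata):
        for offset in range(stop - start + 1):
            # Overlapping rows predict the same positions; keep the first prediction.
            position_label.setdefault(start + offset, labels[offset])
    fields = defaultdict(list)
    for pos in sorted(position_label):
        fields[position_label[pos]].append(original_tokens[pos])
    # Re-join WordPiece pieces and drop the "##" continuation markers.
    return {label: " ".join(toks).replace(" ##", "") for label, toks in fields.items()}
```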
Speed Comparisons
Previously, the pipeline used the WordPiece tokenizer available in the Hugging Face transformers repo. Since that time, Hugging Face has released a new version of the WordPiece tokenizer written in Rust (versus Python). As part of the speed comparisons, the GPU tokenizer was compared to both the WordPiece tokenizer in the Hugging Face transformers library as well as the newer Rust-based tokenizer (Figure 3).
As shown in the chart, the GST is up to 271x faster than the Python-based Hugging Face tokenizer. It is slightly slower than both versions for small numbers of records (up to about 100), which is expected for smaller datasets: for GPU-based workloads, speedups are typically realized only once the GPU cores are fully saturated. The GST is also up to 6.2x faster than the Rust-based version of the tokenizer. In addition, the tokenizer supports non-truncation of input sequences as well as the output of the metadata required to reconstruct individual sequences (logs, in this use case). Neither of these is supported by either of the Hugging Face tokenizers.
Code Example
Calling the tokenizer is fairly straightforward. The tokenizer takes a RAPIDS cuDF series as input, along with the following additional parameters:
- hash_file = the dictionary used in tokenization
- max_sequence_length = the maximum length of a sequence after splitting
- do_lower = convert the sequence to all lowercase (or not)
- do_truncate = truncate the sequences to a maximum length of max_sequence_length
- max_num_sentences = maximum number of sequences input
- max_num_chars = maximum number of characters (total) across the input
- max_rows_tensor = maximum number of rows in the resulting tensor
The code below calls the tokenizer for an example set of three sentences.
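The original post embeds that snippet as an image; as a stand-in, here is a hypothetical sketch of such a call. The module path, function name, example sentences, and file name are illustrative assumptions and may not match the CLX repo exactly, but the parameters are the ones listed above.

```python
import cudf

# Hypothetical import path and function name; the exact API in the CLX repo may differ.
from clx.analytics import tokenizer

sentences = cudf.Series([
    "example log line one",    # illustrative stand-ins for real raw logs
    "example log line two",
    "example log line three",
])

tokens, attention_mask, metadata = tokenizer.tokenize(
    sentences,
    hash_file="bert_hash_table.txt",  # hashed vocabulary file (assumed name)
    max_sequence_length=64,
    do_lower=False,                   # keep original casing (illustrative choice)
    do_truncate=False,                # keep full logs; do not truncate
    max_num_sentences=3,
    max_num_chars=10_000,
    max_rows_tensor=500,
)
```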
The parameters can be tuned to your use cases. For log parsing, we keep truncation off. Experimentation with the other parameters is ongoing and often depends, at least partially, on the logs being parsed.
Applications beyond CLX
We’ve focused this post on using the GST with cyBERT and the vocabulary from the BERT-base-cased pre-trained model. The GST can use the hashed vocab from any pre-trained BERT model (e.g., BERT-base-multilingual, BERT-base-chinese) or a custom vocab from a model trained elsewhere. While the GST has additional functionality for handling arbitrarily long documents, setting do_truncate to True lets it serve as a drop-in substitute for the CPU-based tokenizer in nearly all BERT preprocessing pipelines.
Next Steps
Currently, the first version of this subword tokenizer is available in the CLX GitHub repo. There is a feature request in RAPIDS cuDF to migrate the tokenizer code to the cuDF repo, which will allow even tighter integration with cuDF and easier-to-code preprocessing pipelines for those creating BERT-based workflows. Looking ahead:
- There are some memory leaks in the current implementation that can cause issues at higher volumes. These have been identified and are being fixed in the next version.
- Python bindings are provided by a Cython layer on top of the C++/CUDA code. This will be refactored and simplified in the future.
- The first version did not include the RAPIDS Memory Manager (RMM) to optimize the use of GPU memory. As of this writing, a pull request is in progress to integrate RMM into the GST.
- There is no way to know a priori how large the tensor needs to be to accommodate a given number of raw logs. Due to the variable length of cybersecurity logs, we typically set the maximum tensor size to something large. This is not memory efficient, and there are methods (e.g., historical statistics on log sizes, multiple input tensors of various sizes, dynamically releasing/reallocating memory) that could further optimize memory usage.
Authors
Bartley Richardson is an AI Infrastructure Manager and Senior Cybersecurity Data Scientist at NVIDIA. His focus at NVIDIA is the research and application of GPU-accelerated methods and GPU architectures that can help solve today’s information security and cybersecurity challenges. Prior to joining NVIDIA, Bartley was a technical lead and performer on multiple DARPA research projects, where he applied data science and machine learning algorithms at-scale to solve large cybersecurity problems. Bartley holds a Ph.D. in Computer Science and Engineering from the University of Cincinnati with a focus on loosely- and un-structured query optimization. His BS is in Computer Engineering with a focus on software design and AI.
Rachel Allen is a Senior InfoSec Data Scientist in the AI Infrastructure team at NVIDIA. Rachel’s focus at NVIDIA is the research and application of GPU-accelerated methods to help solve information security and cybersecurity challenges. Prior to joining NVIDIA, Rachel was a lead data scientist at Booz Allen Hamilton where she designed a variety of capabilities for advanced threat hunting and network defense with their commercial cybersecurity group, DarkLabs. She is a former fellow and instructor at The Data Incubator, a data science training program. Rachel holds a bachelor’s degree in cognitive science and a Ph.D. in neuroscience from the University of Virginia.
Moises Hernandez is an AI DevTech Engineer at NVIDIA. He works on accelerating NLP applications on GPUs. Before joining NVIDIA, he conducted research into brain connectivity, optimizing the analysis of diffusion MRI using GPUs. Moises received a Ph.D. in Neurosciences from Oxford University.