Detecting Malicious IoT Network Traffic using RAPIDS Forest Inference Library (FIL) and cuStreamz

Published in RAPIDS AI · 4 min read · Dec 16, 2020

Authors: Bianca Rhodes US and Rachel Allen

Decision tree algorithms are an increasingly popular approach to cybersecurity use cases that have labeled training datasets, such as intrusion detection, network attack classification, and malware classification. Gradient-boosted decision trees like XGBoost and LightGBM are especially appealing for these scenarios because their predictions are easy to interpret, they are straightforward to train, and their accuracy often outperforms that of alternative models.

The training and inference of these tree-based models already benefit from GPU acceleration, but inference speed remains a bottleneck for many cybersecurity applications where data volume and velocity are high. Last year, RAPIDS created the Forest Inference Library (FIL) to further accelerate GPU inference for Random Forest, XGBoost, and LightGBM models. In this blog, we dive into the details of how you can incorporate FIL into your streaming pipelines and show benchmarks of FIL inference on GPU vs. inference on CPU (spoiler alert: it’s really fast).

Creating an XGBoost Model to Detect Malicious IoT Network Traffic

Training a model that targets FIL for inference is the same process data scientists regularly employ to train any decision tree algorithm. For our public XGBoost model, we trained using labeled Zeek conn logs from the Aposemat IoT-23 dataset collected in the Stratosphere Research Laboratory. After analyzing the payload and behavior of each flow, logs that captured attempts to exploit a vulnerable service in the laboratory were labeled as attacks. We trained an XGBoost model on the more generalizable features of this dataset, deliberately avoiding IP addresses and connection duration. Our final model achieved an F1-score of 0.9564. The saved model from this training is ready for deployment using FIL and cuStreamz.
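A minimal training sketch is shown below. The file path, feature list, and hyperparameters are illustrative rather than the exact ones used for the published model, and it assumes an XGBoost build with GPU and cuDF support.

```python
# Hypothetical sketch: train an XGBoost binary classifier on numeric
# Zeek conn-log features and save the booster for later loading by FIL.
import cudf
import xgboost as xgb

# Illustrative feature set; note that IP addresses and duration are excluded.
FEATURES = ["orig_bytes", "resp_bytes", "orig_pkts", "resp_pkts",
            "orig_ip_bytes", "resp_ip_bytes", "missed_bytes"]

df = cudf.read_csv("iot23_conn_labeled.csv")   # hypothetical file path
X = df[FEATURES].astype("float32")
y = df["label"].astype("int32")                # 1 = attack, 0 = benign

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",     # GPU-accelerated training
    "max_depth": 8,
    "eval_metric": "logloss",
}
bst = xgb.train(params, dtrain, num_boost_round=100)
bst.save_model("xgboost.model")    # this saved file is what FIL loads
```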

FIL and cuStreamz

By utilizing cuStreamz and Dask, we construct a FIL prediction pipeline to optimize the processing of streaming IoT network data. cuStreamz provides the flexibility to poll log data directly from Apache Kafka into a pipeline that processes that data with RAPIDS. In addition, we scale this pipeline to a multi-GPU environment using Dask. We’ll walk through the simple steps to create an end-to-end cuStreamz pipeline that reads network logs from Kafka and publishes the prediction outcomes (i.e., the results of inference) back to Kafka. The same IoT dataset is used for simplicity.

Figure 1: Overall architecture/pipeline for FIL inference with cuStreamz

First, we initialize our Dask cluster. We use the worker_init function to define what each Dask worker should do at initialization. We take this opportunity to load our FIL model as well as define our expected data columns and data types.
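A sketch of this initialization step is below, assuming the model was saved as xgboost.model as in the training sketch above. The column list mirrors the training features, and stashing the model in worker.data is one convenient convention, not a required API.

```python
# Sketch: start a Dask-CUDA cluster and load the FIL model on every worker.
from dask.distributed import Client, get_worker
from dask_cuda import LocalCUDACluster
from cuml import ForestInference

# Illustrative column schema for the incoming Zeek conn logs.
COLS = ["orig_bytes", "resp_bytes", "orig_pkts", "resp_pkts",
        "orig_ip_bytes", "resp_ip_bytes", "missed_bytes"]

def worker_init():
    # Runs once on each Dask worker: load the saved XGBoost model into FIL.
    worker = get_worker()
    worker.data["fil_model"] = ForestInference.load(
        filename="xgboost.model",
        model_type="xgboost",
        output_class=True,   # return class labels rather than raw scores
        threshold=0.5,
    )

cluster = LocalCUDACluster()
client = Client(cluster)
client.run(worker_init)      # execute the initialization on every worker
```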

Next, we define our input source. In this example, we’ll be receiving our log data directly from Kafka. To achieve GPU-accelerated reading from Kafka, we need to specify the engine as `cudf`.
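A sketch of the source definition follows; the broker address, topic name, and consumer group are placeholders.

```python
from streamz import Stream

# Placeholder Kafka consumer configuration.
consumer_conf = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "fil-demo",
    "session.timeout.ms": "60000",
}

source = Stream.from_kafka_batched(
    "conn_logs",            # input topic (placeholder name)
    consumer_conf,
    poll_interval="5s",     # batch up messages every 5 seconds
    npartitions=1,
    asynchronous=True,
    dask=True,              # scatter batches across the Dask workers
    engine="cudf",          # cuStreamz GPU-accelerated Kafka reader
    start=False,            # start the stream explicitly later
)
```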

Finally, we construct our cuStreamz pipeline using the two functions shown below. The predict function will receive the raw log data from Kafka and use our previously loaded FIL model to generate predictions.
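Continuing the sketch (the COLS list comes from the initialization step above; the output topic and producer settings are placeholders), with the engine set to `cudf` each batch arrives as a cuDF DataFrame:

```python
import json
import confluent_kafka as ck
from dask.distributed import get_worker

producer_conf = {"bootstrap.servers": "localhost:9092"}  # placeholder

def predict(gdf):
    # Runs on a Dask worker: score a batch of conn logs with the FIL model
    # loaded during worker_init.
    fil_model = get_worker().data["fil_model"]
    gdf["prediction"] = fil_model.predict(gdf[COLS].astype("float32"))
    return gdf

def sink_to_kafka(gdf):
    # Publish the labeled results back to Kafka (placeholder topic name).
    producer = ck.Producer(producer_conf)
    for record in gdf.to_pandas().to_dict(orient="records"):
        producer.produce("predictions", json.dumps(record))
    producer.flush()

source.map(predict).gather().sink(sink_to_kafka)
source.start()
```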

This end-to-end example is also available in the CLX repo, a repository where we maintain examples of RAPIDS applied to cybersecurity use cases. Additional information is available in the CLX documentation.

Figure 2 captures the prediction time of the above cuStreamz pipeline. To follow an example of capturing this benchmark, visit the corresponding CLX notebooks for FIL cuStreamz and XGBoost Streamz.

Figure 2: Performance results of FIL+cuStreamz (GPU) vs. XGBoost+Streamz (CPU)

We see an 11x speed-up by switching our streaming inference pipeline from Streamz with XGBoost to GPU-accelerated cuStreamz and FIL. We can also compare a CPU system with a single GPU (Figure 3). In this case, we observe that a single Tesla V100 GPU can outperform a dual-CPU system by over 1.5x. Note that the GPU used only one worker (vs. 10 workers on CPU), and the GPU workflow used a larger polling window (5s vs. 1s on CPU). Even with these limits placed on the GPU workflow, the result is still a faster end-to-end inference pipeline.

Figure 3: Performance results comparing single GPU to dual CPU

By processing nearly one million streaming conn logs per second, a system can flag malicious traffic and devices in near-real-time, potentially mitigating an attack before the network is further compromised. XGBoost and LightGBM continue to gain popularity as effective machine learning models for intrusion detection [1,2,3,4]. With RAPIDS, FIL, Dask, and cuStreamz, it’s easy to migrate all of your existing decision tree models and workflows to the GPU for a serious speed-up!
