cuStreamz: More Event Stream Processing for Less with NVIDIA GPUs and RAPIDS Software

Published in RAPIDS AI · 6 min read · Aug 6, 2020

By: Satish Varma Dandu, Chinmay Chandak, Jarod Maupin, Jeremy Dyer

Photo: Paul Mahler at Rocky Mountain National Park

Introduction

The stream processing market is experiencing exponential growth with businesses relying heavily on real-time analytics, inferencing, monitoring, and more. Services built on streaming are now core components of daily business. Structured telemetry events and unstructured logs are growing at a rate of over 5x year-over-year. Big Data streaming at this scale is becoming extremely complex and difficult to do efficiently in the modern business environment, where reliable, cost-effective streaming is paramount.

NVIDIA is a heavy user of stream processing. Two questions we had as we examined our use of streaming were:

  1. How can we dramatically scale the processing of streaming data while reducing infrastructure costs as our business needs evolve?
  2. How can we move toward a unified GPU-accelerated data platform that will handle all of our data processing needs?

When we tried to answer these questions for our workloads at NVIDIA, our first instinct was, “Let’s auto-scale!” But even auto-scaling hits cost-efficiency limits on CPUs. So we set out to build a streaming library on GPUs.

This is our journey, written in a series of blog posts, on how our team at NVIDIA built RAPIDS cuStreamz, all the way from a skunkworks project to deploying in production. This post will introduce the cuStreamz technology, mention some early results, give a classic “Getting Started with Streaming” example, and finish with a high-level road map of what readers should expect in the coming posts.

What is cuStreamz?

cuStreamz is the first GPU-accelerated streaming data processing library. Written in Python, it is built on top of RAPIDS, the GPU-accelerated suite of data science libraries. The goal of cuStreamz is to increase stream processing throughput and lower the total cost of ownership (TCO). End-to-end GPU acceleration is quickly becoming the standard, as shown by Flink adding GPU support, and we are excited to be part of this trend.

cuStreamz is built on:

  1. Streamz, a Python-based streaming library,
  2. Dask, a robust & reliable scheduler to parallelize streaming workloads, and
  3. RAPIDS, a GPU-accelerated suite of libraries leveraged for streaming computations.

Let’s look at the cuStreamz architecture, and then show you an example of cuStreamz in action.

cuStreamz Architecture

Streamz is an open-source Python library that helps build pipelines to manage streams of continuous data.

Typical Streamz workflow
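
To give a feel for the API, here is a minimal, CPU-only sketch of a Streamz pipeline; the doubling computation is just a stand-in for real work:

from streamz import Stream

source = Stream()
source.map(lambda x: 2 * x).sink(print)  # build the pipeline: transform, then consume

for i in range(3):
    source.emit(i)  # push events into the stream; prints 0, 2, 4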

cuStreamz accelerates Streamz by using RAPIDS cuDF under the hood to perform computations on streaming data on GPUs. cuStreamz also benefits from cuDF’s accelerated JSON, Parquet, and CSV readers and writers. Our team also built an accelerated Kafka datasource connector that reads data from Kafka directly into cuDF DataFrames at high speed, which gives a considerable boost to end-to-end performance. Streaming pipelines can then be parallelized with Dask to run in distributed mode for better performance at scale.

Distributed cuStreamz workflow using Dask
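
Here is a minimal sketch of that pattern using Streamz’s built-in Dask support; it assumes a local dask.distributed cluster, and the toy computation stands in for real batch processing:

from dask.distributed import Client
from streamz import Stream

client = Client()  # local Dask cluster; in production the workers would be GPU nodes

source = Stream()
(source.scatter()             # move each event to a Dask worker
       .map(lambda x: x + 1)  # the computation runs on the workers in parallel
       .buffer(8)             # allow up to eight events in flight at once
       .gather()              # collect results back into a local stream
       .sink(print))

for i in range(4):
    source.emit(i)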

The cuStreamz architecture is summarized at a high level in the diagram below. cuStreamz is a bridge that connects Python stream processing with GPUs and adds sophisticated, reliable streaming features such as checkpointing and state management. It provides the building blocks you need to write streaming jobs that run reliably on GPUs with better performance at a lower cost.

Early Success

We consider two major types of streaming problems: stateless and stateful. Stateless use cases do not need state management across batches of streaming data and may involve simple or complex aggregations. Stateful use cases, on the other hand, must maintain state over short or long time windows.
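
As a rough illustration of the difference (a toy sketch, not one of the benchmarks below; the column name and window size are arbitrary):

import cudf
from streamz import Stream
from streamz.dataframe import DataFrame

source = Stream()
batches = source.map(cudf.DataFrame)  # each event becomes a cuDF DataFrame

# Stateless: each batch is aggregated on its own, nothing carried over
batches.map(lambda df: int(df['value'].sum())).sink(print)

# Stateful: a window over the last three rows seen, so state spans batches
sdf = DataFrame(batches, example=cudf.DataFrame({'value': [0]}))
sdf.window(n=3).value.sum().stream.sink(print)

for batch in ({'value': [1, 2]}, {'value': [3]}, {'value': [4, 5]}):
    source.emit(batch)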

A common idea in business computing is the total cost of ownership (TCO) of different solutions. In our case, “total cost of rent” might be more accurate, as we compared the price of running three different types of workloads on AWS instances. The configurations varied between tasks, but essentially the systems either did or did not have an NVIDIA T4 GPU. We calculated how much the same amount of work would cost on CPU and GPU systems. Below are the early benchmarks measuring the improvement in TCO.

The following are the AWS EC2 specs used in each benchmark:

  1. Simple Aggregations: m5.4xlarge (CPU): 32 vCPU, 128 GB RAM; g4dn.4xlarge (GPU): 1 T4 GPU, 32 vCPU, 128 GB RAM
  2. Complex Aggregations: r5.xlarge (CPU): 4 vCPU, 128 GB RAM, 15 machines used; g4dn.xlarge (GPU): 1 T4 GPU, 4 vCPU, 16 GB RAM, 2 machines used (13-machine reduction)
  3. Short Time Window (stateful): i3.2xlarge (CPU): 8 vCPU, 61 GB RAM, SSD-optimized, 6 machines used; g4dn.2xlarge (GPU): 1 T4 GPU, 8 vCPU, 64 GB RAM, 2 machines used (4-machine reduction)

We’ve seen improvements in every case so far, but the most pronounced has been the complex stateless streaming use case, where 1,500+ aggregations must be performed on each streaming batch. We are bringing several streaming pipelines here at NVIDIA into production, and we project savings on the order of hundreds of thousands of dollars a year. As we move more workloads to GPUs, we expect to save more. These results are preliminary, and we expect them to improve, particularly for more stateful workloads such as time-windowed dashboards. We will add the TCO improvement benchmark for the long time window use case in a follow-up post.

Getting Started

If you have an NVIDIA GPU with the Pascal architecture or newer, it’s easy to get started. cuStreamz is available via conda. There’s also helpful documentation at https://rapids.ai/start.html, including a tool that generates the conda command for your platform (like the example below), as well as options for containers and building from source.

conda install -c rapidsai -c rapidsai-nightly -c nvidia -c conda-forge -c defaults custreamz python=3.7 cudatoolkit=10.1

We are continuously rolling out new features to make cuStreamz more robust, efficient, and reliable. The command above gets the nightly version of cuStreamz, which has new features available as they are developed.

Word Count Example

The classic getting started example in streaming is Word Count. So let’s get started!

In the example below, the count for every word is updated with each batch that arrives on the stream. To maintain cumulative state across batches, we use the Streamz DataFrame (SDF) data structure. We also compute the local, per-batch word count, which does not require the state management that SDFs provide. Note that all the computations are performed on the GPU using cuDF, the RAPIDS DataFrame library.
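
The sketch below captures the idea; the batch format (a list of text lines), the helper name to_words, and the sample batches are illustrative:

import cudf
from streamz import Stream
from streamz.dataframe import DataFrame

def to_words(lines):
    # Split a batch of text lines into a cuDF DataFrame of (word, 1) rows
    words = [w for line in lines for w in line.split()]
    return cudf.DataFrame({'word': words, 'n': [1] * len(words)})

source = Stream()
word_batches = source.map(to_words)  # everything downstream runs on the GPU via cuDF

# Local word count: each batch is aggregated independently, no state needed
word_batches.map(lambda df: df.groupby('word').sum()).sink(print)

# Cumulative word count: a Streamz DataFrame (SDF) carries state across batches
sdf = DataFrame(word_batches, example=cudf.DataFrame({'word': ['example'], 'n': [1]}))
sdf.groupby(sdf.word).n.sum().stream.sink(print)

for batch in (['hello gpu world', 'hello streamz'], ['gpu streams at scale']):
    source.emit(batch)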

We can easily swap the source of the stream to Apache Kafka, as shown in the example below. Kafka is a typical source for production streaming pipelines, and cuStreamz has robust Kafka integration and checkpointing, which enables restarting from the point of a job failure. Refer to https://kafka.apache.org/quickstart to start a local Kafka cluster, and to https://streamz.readthedocs.io/en/latest/api.html for more details on the Streamz API.
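
Here is a sketch of a Kafka-backed source; the broker address, topic name, and consumer group are placeholders, and engine="cudf" selects the accelerated Kafka reader described earlier:

from streamz import Stream

consumer_conf = {
    'bootstrap.servers': 'localhost:9092',  # placeholder broker address
    'group.id': 'custreamz-wordcount',      # placeholder consumer group
    'session.timeout.ms': '60000',
}

# Poll Kafka in batches; with engine="cudf" each batch arrives as a cuDF DataFrame
source = Stream.from_kafka_batched(
    'word-counts',        # placeholder topic name
    consumer_conf,
    poll_interval='2s',
    npartitions=4,
    engine='cudf',
    start=False,
)

source.sink(print)  # downstream logic is the same as in the word count above
source.start()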

Next Steps

In the coming weeks, we will release more in-depth posts about key cuStreamz features: how to run cuStreamz at scale, our journey deploying cuStreamz in production, and more detailed performance benchmarks.

Both Streamz and RAPIDS cuDF are open-source projects, and we encourage everyone to contribute. https://rapids.ai provides many resources to get started. As you try out cuStreamz, we encourage you to let us know of any features you’d like or bugs you encounter on GitHub.
