The Significance of In-Broker Data Transformations in Streaming Data

How WebAssembly-powered data transformations are changing the data scrubbing story for streaming data platforms.

Dunith Danushka
Tributary Data
6 min read · Aug 28, 2023


Illustration by Christina Lin

Data scrubbing, or massaging, is a critical aspect of data engineering, especially in streaming data pipelines, where data must be cleaned, filtered, and reshaped in real time with the help of stream processing engines or ad hoc Python jobs.

However, that is about to change. Streaming data platforms have started supporting in-broker data transformations powered by WebAssembly (Wasm) to reduce the dependency on third-party systems for simple data processing tasks.

In this post, I will discuss data scrubbing, how it’s traditionally done and its challenges, and how Wasm-based transforms are changing the landscape. I will also shed some light on how Redpanda supports in-broker data transformations and what benefits you can reap.

Data scrubbing is becoming expensive

With the ever-increasing volume, velocity, and variety of data flowing through a business, we can’t always expect the data to be in the right format, ready to be consumed by downstream consumers.

In the realm of streaming data, where data moves fast as it is generated, it must be corrected in real time before consumption. For example:

  • Scrubbing/redacting/normalizing sensitive data fields to hide them from downstream consumers (filtering for privacy, GDPR, etc.; see the sketch after this list)
  • Reducing large records into a smaller subset of important fields
  • Computing simple enrichments like Geo-IP translations
  • Transcoding records (for example, converting JSON records to Avro).
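
To make the first of those concrete, here is a minimal sketch (in Go, with purely illustrative field names) of the kind of scrubbing logic a pipeline might apply to each record:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scrub removes sensitive fields from a JSON record before it reaches
// downstream consumers. The field names here are illustrative.
func scrub(raw []byte) ([]byte, error) {
	var record map[string]any
	if err := json.Unmarshal(raw, &record); err != nil {
		return nil, err
	}
	delete(record, "email")       // PII: drop entirely
	delete(record, "card_number") // GDPR/PCI-sensitive field
	return json.Marshal(record)
}

func main() {
	in := []byte(`{"user_id":42,"email":"jane@example.com","card_number":"4111-0000-0000-0000","amount":9.99}`)
	out, err := scrub(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"amount":9.99,"user_id":42}
}
```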

In the context of a streaming data platform like Apache Kafka, this is usually done by sending the raw data stream into a separate stream processing environment, or by pushing the work onto consuming applications (just a Python job somewhere) that do the dirty work of prepping the data for consumers, only for it to come right back into the broker for delivery.

That creates an unnecessary “data ping pong” between systems.

The resulting data is written to a different topic, typically matching the partition structure of the source topic.
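
Concretely, the round trip usually looks like the consume-transform-produce loop below, sketched with the franz-go Kafka client (topic names, group name, and the scrub stand-in are illustrative); every record leaves the broker over the network only to be produced straight back:

```go
package main

import (
	"context"

	"github.com/twmb/franz-go/pkg/kgo"
)

// scrub stands in for the cleaning logic from the earlier sketch.
func scrub(v []byte) []byte { return v }

func main() {
	// A typical out-of-broker scrubbing job: consume the raw topic,
	// clean each record, and produce it right back to the same cluster
	// on a "clean" topic. That's two network hops for every record.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumeTopics("orders-raw"),
		kgo.ConsumerGroup("scrubber"),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	ctx := context.Background()
	for {
		fetches := cl.PollFetches(ctx) // error handling elided for brevity
		fetches.EachRecord(func(r *kgo.Record) {
			cl.Produce(ctx, &kgo.Record{
				Topic: "orders-clean",
				Key:   r.Key,
				Value: scrub(r.Value),
			}, nil) // nil promise: fire and forget
		})
	}
}
```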

Alex described this situation perfectly in his recent blog.

Yet the baseline complexity to stand up something “simple” often involves 3 or 4 distributed systems, multiple nights reading configurations and man-pages, and a few too many shots of espresso to start seeing the future. And once you are done, you end up ping-ponging the data back and forth between storage and computing when all you had to do was to remove a field out of a JSON object. To the data engineer, it feels like an endless game of system whack-a-mole just to start to do the interesting work of actually understanding the data.

In-broker data transformations

Instead of relying on a third-party system to scrub the data, what if we do that within the broker?

What if we let developers write their data transformation logic and ship it to the broker, allowing the broker to handle the execution?

Well, that certainly looks elegant on paper, even though we are trying to blur the boundary between storage and computing here. Have you ever seen a water purifier installed within the water tank?

However, eliminating the “data ping-pong” between compute and storage reduces cost, processing latency, and overhead, and those benefits outweigh that objection.

The real challenge lies in the implementation: shipping the data processing logic to a broker and having it execute that code quickly, reliably, and securely.

Let’s dig into more details.

What is WebAssembly (Wasm)?

WebAssembly, or Wasm for short, is a new type of code that can be run in modern web browsers and provides new features and major gains in performance. It is not primarily intended to be written by hand; rather, it is designed to be an effective compilation target for source languages like C, C++, Rust, etc.

That means developers can code web client applications in a programming language of their choice, compile them down to Wasm, and run them inside the browser at near-native speed. Additionally, Wasm brings other advantages, such as portability, security (via sandboxed execution in the browser), and debuggability.

Use Wasm for server-side processing?

What if we use Wasm to ship code to server-side applications, especially to brokers, in the same way it ships code to a user's browser?

Returning to our original context, using Wasm to code the data transformation logic and shipping it to the broker provides several benefits.

Flexibility for developers: Wasm allows developers to write their transformations in any supported language (C, C++, Rust, JS, and Go as of today). It also gives them the advantage of importing any library of their liking.

Near-native performance: Wasm has a small runtime footprint and low overhead, and it runs embedded within the host process. Codifying business practices like GDPR compliance or custom on-disk encryption mechanics can be done with near-native performance at runtime.

Security: Data transformations are executed in sandboxes, which are isolated execution environments, providing improved security.

Portability: Wasm execution environments are standardized to promote portability, enabling code to be shipped across many platforms.
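
To make that portability concrete, here is roughly what “compiling down to Wasm” looks like with Go as the source language (the build commands in the comments assume TinyGo or Go 1.21+, and the file names are illustrative):

```go
// hello.go: an ordinary Go program that compiles to a portable Wasm
// module, e.g. with TinyGo:  tinygo build -target=wasi -o hello.wasm .
// or with Go 1.21+:          GOOS=wasip1 GOARCH=wasm go build -o hello.wasm .
// The resulting hello.wasm runs unchanged in any WASI-capable runtime,
// for example: wasmtime hello.wasm
package main

import "fmt"

func main() {
	fmt.Println("hello from Wasm")
}
```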

Redpanda Data Transforms — in-broker data transformations at scale

Several attempts have been made in the streaming data space to bring Wasm to inline data transformations, especially those executed at the broker. However, it is unclear how successful they have been in terms of production deployments at scale and developer adoption.

Redpanda, the Kafka-API-compatible streaming data platform, recently released the Redpanda Data Transforms Sandbox, powered by Wasm. That brings many simple processing tasks in-broker, allowing streaming engineers to deliver 80%+ of their data processing tasks without any “data ping-pong” and do it all within a single platform — with Redpanda’s trademark simplicity, performance, and reliability.

Redpanda Data Transforms reduces the complexity of delivering clean, relevant, and properly formatted data to a variety of downstream consumers

Redpanda data transforms provide a framework to create, build, and deploy inline data transformations on data written to Redpanda topics. You can develop custom data functions, which run asynchronously using a WebAssembly (Wasm) engine inside a Redpanda broker. A transform function processes every message produced to an input topic and returns one or more messages that are then produced to an output topic.
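
To give a feel for the programming model, here is a minimal transform written against the Go SDK. This is a sketch based on the SDK's OnRecordWritten hook; exact signatures may differ between the Technical Preview and later releases, and the scrub stand-in is illustrative:

```go
package main

import (
	"github.com/redpanda-data/redpanda/src/transform-sdk/go/transform"
)

func main() {
	// Register the callback the broker invokes for every record
	// written to the transform's input topic.
	transform.OnRecordWritten(scrubTransform)
}

// scrubTransform passes each record through with its value scrubbed;
// the result is produced to the configured output topic.
func scrubTransform(e transform.WriteEvent, w transform.RecordWriter) error {
	r := e.Record()
	return w.Write(transform.Record{
		Key:   r.Key,
		Value: scrub(r.Value), // your cleaning logic goes here
	})
}

// scrub stands in for the actual scrubbing logic.
func scrub(v []byte) []byte { return v }
```

From there, rpk handles the rest of the lifecycle: scaffolding the project, compiling it to Wasm, and deploying the module to the broker.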

The sandbox contains a Go-based SDK seamlessly integrated within rpk, Redpanda’s CLI experience, and a single Redpanda broker (Docker container) that can deploy and run your transform functions on topics you create inside the fully-contained sandbox. It is currently in Technical Preview.

There will be more function types, more language support, and other features over time to help you build more complex functions, seamlessly interact with different data formats, and produce richer, more valuable data products within a familiar, GitOps-friendly toolchain. In the meantime, I highly encourage you to try it out and give the team some feedback.

Will Wasm make stream processors obsolete?

Well, certainly not.

Wasm and stream processors are different. They have unique strengths and weaknesses, although some of their functionalities overlap.

While stream processors boast a mature community, code base, and vibrant ecosystem, Wasm is still a growing space with a relatively young track record. Stream processors excel at maintaining resilient state, fast recovery, and integrating with ecosystem components to build data pipelines.

The true power of Wasm lies in its flexible programming model, small computational footprint, near-native performance, and the ability to write more expressive logic. That will open more doors in the future, especially for writing applications that execute at the edge of the network, performing edge analytics, and empowering the storage layer with the necessary computational power.

So, if you already have a solid streaming data pipeline, don't rip it out and replace it with Wasm. Consider utilizing Wasm for “net new streaming scenarios” to accomplish mundane data scrubbing tasks within the broker with low latency. Start with a simple use case that doesn't require investing heavily in a stream processor and see how it goes.

Summary

I hope you learned something new today. This article is meant to be an introductory post, helping you understand the what, why, and how of Wasm in the streaming data space. I will follow up with a deep dive into Redpanda's Wasm implementation, along with a hands-on example, in a future post.
