The Live Demo is a staple of tech talks and presentations, and for good reason. Seeing software working, in real time, as a presenter talks and interacts with it is so much more compelling than slides and bullet point lists. When we started thinking about how to sell Apache Beam running on Google Cloud Dataflow as an exciting technology to unify batch and streaming workloads, demos were the obvious choice!
The streaming demo we wanted to build processed a publicly available, synthetic dataset of retail transactions to drive an IoT-style dashboard with sparklines, counts and alerts. This post will introduce some streaming concepts and talk through some of the challenges we faced that were peculiar to creating a compelling, relatable demo of streaming tech.
Beam’s capabilities focus on production workloads. In the batch case you want to process a bunch of events (retail transactions, in our case) as efficiently as possible, and you’re only interested in the final results rather than the state of the system as it’s processing. A streaming demo is different: the intermediate state is the whole point, and that brings some challenges you don’t face in the batch case.
The metrics we wanted to track were aggregations (counts and sums) representing volumes and values of transactions, which means we needed to consider windows in time. That windowing aspect added some extra challenges in terms of making a demo compelling and relatable.
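To make the windowing idea concrete, here is a minimal sketch in plain Python rather than Beam. It assumes each event is a hypothetical `(timestamp_seconds, amount)` pair and groups events into fixed, non-overlapping windows, producing a count and a sum per window. The function name and the sample data are illustrative only.

```python
from collections import defaultdict


def fixed_window_aggregates(events, window_seconds):
    """Group (timestamp_seconds, amount) pairs into fixed windows.

    Returns a dict mapping window start -> (count, total).
    """
    totals = defaultdict(lambda: [0, 0.0])
    for ts, amount in events:
        # Every timestamp belongs to exactly one window.
        window_start = (ts // window_seconds) * window_seconds
        totals[window_start][0] += 1
        totals[window_start][1] += amount
    return {start: (count, total) for start, (count, total) in sorted(totals.items())}


# Illustrative data: five transactions spread across three one-minute windows.
events = [(0, 5.0), (30, 2.5), (61, 10.0), (119, 1.0), (120, 4.0)]
result = fixed_window_aggregates(events, 60)
```

In a real pipeline the windowing and aggregation would be done by Beam itself; the sketch only shows what "counts and sums per window" means for the data.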
When a Batch isn’t a Batch
We started with pipelines that read from large, compressed chunks of historical event data in Google Cloud Storage. The standard Beam IOs for processing data stored this way don’t provide any way to control the order in which the keys are read, so the events weren’t read in occurrence time order. That makes sense, as we’re effectively processing in batch mode, so the system figures out how to get things done efficiently.
In a real-time streaming situation, events are constantly arriving and should be processed as quickly as possible. There’s an inherent, rough ordering in time. For example, events arriving now happened more recently than events arriving ten minutes ago. Treating the data like a batch makes sense to process a real-life workload quickly, but it doesn’t make for a good streaming demo!
Another complication for streaming systems is completeness. How do you know you’ve seen all the data when the data’s arriving all the time, and some might be delayed (for example, because a system was running slowly, or a network issue caused events to queue up for a while)? By default, Beam won’t emit a window’s results until the window is complete, that is, until the watermark it tracks passes the end of the window. Our solution needed to ensure that event time didn’t go backwards!
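A toy model shows why out-of-order data hurts here. The sketch below is plain Python and assumes a deliberately simplistic watermark (the largest event time seen so far). Beam's real watermark estimation depends on the source and is more sophisticated, and Beam can be configured to handle late data, so treat this as an illustration of the idea only. All names are hypothetical.

```python
def run_with_watermark(event_times, window_seconds):
    """Simulate fixed windows that are emitted once a watermark passes their end.

    Assumption: the watermark is simply the largest event time seen so far.
    Returns (emitted, late), where emitted is a list of (window_start, count)
    and late is the list of event times that arrived after their window closed.
    """
    watermark = 0
    open_counts = {}
    emitted = []
    late = []
    for t in event_times:
        start = (t // window_seconds) * window_seconds
        if start + window_seconds <= watermark:
            # The window this event belongs to has already been emitted.
            late.append(t)
            continue
        open_counts[start] = open_counts.get(start, 0) + 1
        watermark = max(watermark, t)
        # Emit every open window whose end the watermark has now passed.
        for s in sorted(open_counts):
            if s + window_seconds <= watermark:
                emitted.append((s, open_counts.pop(s)))
    return emitted, late


# The same five events, first in time order, then shuffled.
in_order = run_with_watermark([1, 5, 12, 18, 25], 10)
shuffled = run_with_watermark([12, 1, 25, 5, 18], 10)
```

With the events in time order nothing is late. With the shuffled input, three of the five events arrive after their windows have closed, which is the kind of behaviour that makes a dashboard look wrong.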
To deal with the ordering problem, we took control of how the events were “replayed” into Beam. We’d written the first version of the demo as two pipelines. The first read the stored data from Cloud Storage, parsed it, and wrote it to Google Cloud Pub/Sub. The second did the more interesting work: reading from Pub/Sub, assigning timestamps, putting events into time windows and computing the metrics we wanted.
Instead, we used the first pipeline to parse the raw data and write it to BigQuery in batch mode. We could then query for the events we were interested in, order the events by time and export the results. We built a simple application in Scala to stand in for the first pipeline, emitting events onto Pub/Sub in the kind of order you’d expect if the events really were streaming in.
We set Beam’s event timestamps to the time the events were emitted from the replay tool. That made things simple, as we weren’t trying to run a pipeline as if it really was 2006, and we could stop, start and restart the replay at any point without the watermark causing problems.
Preserving Patterns in Time
The next problem to solve was preserving the structure of the data in time. For example, if you have transactions for a high street store, you expect to see patterns in time. Maybe it’s busy in the morning through to lunchtime, then tails off through the afternoon into the evening. There won’t be any transactions when the store is closed. Just replaying the transactions as fast as possible ignores the original timestamps and so loses those recognisable patterns.
Instead of just replaying the event data, we had the replay tool look at the original timestamps, and pause for an appropriate length of time between events to preserve the original temporal structure.
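The pausing logic can be sketched in a few lines. This is a Python illustration, not the Scala replay app itself. It assumes events are already sorted by time and arrive as hypothetical `(timestamp_seconds, payload)` pairs, and it takes the `emit` and `sleep` functions as parameters so the behaviour can be checked without actually waiting.

```python
import time


def replay(events, emit, sleep=time.sleep):
    """Emit each payload, pausing for the original gap between events.

    events must be sorted by timestamp (seconds).
    """
    previous_ts = None
    for ts, payload in events:
        if previous_ts is not None:
            # Wait as long as the original gap between the two events.
            sleep(ts - previous_ts)
        emit(payload)
        previous_ts = ts


# Record the pauses instead of really sleeping, to show what would happen.
pauses = []
sent = []
replay([(100, "a"), (103, "b"), (110, "c")], emit=sent.append, sleep=pauses.append)
```

Sleeping for relative gaps like this lets small delays accumulate over a long replay. Scheduling each event against an absolute target time is one way to avoid that drift.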
The Need for Speed
In our dataset, the recognisable patterns don’t show up over seconds or minutes. They emerge over the hours of the day, which is no good for a demo! We needed a “fast-forward” button.
Adding a scaling factor to the Scala app was fairly straightforward. The logic in the app transforms the original event times, offsetting so that the first event happens at the time the demo starts, and scaling down the intervals between each event according to a predefined factor. Now, our demo replays an hour’s events every five seconds, giving people a taste of what might be possible even with a dataset that doesn’t really work on short timescales.
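As a worked example of that transformation, replaying an hour of events every five seconds corresponds to a scaling factor of 3600 / 5 = 720. The sketch below is in Python rather than Scala, and the function and variable names are hypothetical. It maps each original event time to the time at which it should be emitted in the demo.

```python
def scaled_time(original_ts, first_ts, demo_start, factor):
    """Map an original event time to its replay time.

    The first event lands at demo_start, and every interval after it is
    divided by factor. All times are in seconds.
    """
    return demo_start + (original_ts - first_ts) / factor


# One hour of data replayed in five seconds.
FACTOR = 3600 / 5  # 720.0

# Original event times in seconds since midnight: 09:00, 09:30 and 10:00.
original = [9 * 3600, 9 * 3600 + 1800, 10 * 3600]
demo_start = 1000.0  # illustrative start time for the demo, in seconds

replay_times = [scaled_time(ts, original[0], demo_start, FACTOR) for ts in original]
```

Here the 09:30 event is emitted 2.5 seconds into the demo and the 10:00 event 5 seconds in, so the shape of the day is kept while the clock runs 720 times faster.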
In production use, your data would likely be arriving in roughly time-order (and you’d deal with early/late-arriving data as appropriate for your goals), or you’d be processing batch and so you’d only care about the final results being produced quickly and efficiently. Any delays in your data being produced or arriving would happen naturally and you wouldn’t be trying to speed up the passage of time!
Challenges can arise when you’re trying to do live demos of real-time dashboards:
- processing the data in a meaningful order
- preserving the temporal structure of the data
- projecting long timeframes onto shorter ones that work as demos
If you’re thinking about using this kind of technology and you’re heading down the demo route, we hope our lessons learned will help you get the results you want quickly!
If you’re interested in our replay app, let us know in the comments. We may consider open-sourcing it if there’s interest. The replay-stream npm package is already out there and may be sufficient for your replay, if you don’t need the scaling element.