Building Live Streaming Demos

Paul Brabban
Oct 11, 2018 · 5 min read
[Image: Part of an Apache Beam/Dataflow Pipeline]

The Live Demo is a staple of tech talks and presentations, and for good reason. Seeing software working, in real time, as a presenter talks and interacts with it is so much more compelling than slides and bullet point lists. When we started thinking about how to sell Apache Beam running on Dataflow as an exciting technology to unify batch and streaming workloads, demos were the obvious choice!

The streaming demo we wanted to build processed a stream of retail transactions to drive an IoT-style dashboard with sparklines, counts and alerts. This post will introduce some streaming concepts and talk through some of the challenges we faced that were peculiar to creating a compelling, relatable demo of streaming tech.

The Challenges

Beam’s capabilities focus on production workloads — in the batch case you want to process a bunch of events (retail transactions, in our case) as efficiently as possible. You’re only interested in the final results, rather than the state of the system as it’s processing. There are some challenges to overcome in putting together a compelling streaming demo that you don’t face in the batch case.

The metrics we wanted to track were aggregations (counts and sums) representing volumes and values of transactions, which means we needed to consider windows in time. That windowing aspect added some extra challenges in terms of making a demo compelling and relatable.
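To make that concrete, here's a minimal sketch of the kind of per-window aggregation we mean, written in Scala against the Beam Java SDK. The project, topic and CSV message format are illustrative stand-ins, not our actual pipeline code:

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.{Count, DoFn, ParDo, Sum}
import org.apache.beam.sdk.transforms.windowing.{FixedWindows, Window}
import org.apache.beam.sdk.values.KV
import org.joda.time.Duration

// Parse a raw message into (storeId, transaction value); the CSV layout is made up.
class ToStoreValue extends DoFn[String, KV[String, java.lang.Double]] {
  @ProcessElement
  def process(ctx: DoFn[String, KV[String, java.lang.Double]]#ProcessContext): Unit = {
    val parts = ctx.element().split(",", 2)
    ctx.output(KV.of(parts(0), java.lang.Double.valueOf(parts(1))))
  }
}

object WindowedMetrics {
  def main(args: Array[String]): Unit = {
    val p = Pipeline.create(PipelineOptionsFactory.fromArgs(args: _*).create())

    val keyed = p
      .apply("ReadTransactions",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/transactions"))
      .apply("ParseCsv", ParDo.of(new ToStoreValue))
      // Slice the unbounded stream into fixed one-minute windows in event time.
      .apply("OneMinuteWindows",
        Window.into[KV[String, java.lang.Double]](FixedWindows.of(Duration.standardMinutes(1))))

    // Per-store transaction count, emitted once per window.
    val counts = keyed.apply("CountPerStore", Count.perKey[String, java.lang.Double]())
    // Per-store transaction value, emitted once per window.
    val values = keyed.apply("SumPerStore", Sum.doublesPerKey[String]())
    // ... write counts and values wherever the dashboard reads from (omitted).

    p.run()
  }
}
```

Aggregates like these are emitted once per window rather than continuously, which is exactly why the windowing behaviour matters when you're trying to show something live.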

When a Batch isn’t a Batch

We started with pipelines that read from large, compressed chunks of historical event data in cloud storage. The connectors for processing data stored this way don't provide any way to control the order in which the keys are read, so the events weren't read in occurrence-time order. That makes sense, as we're effectively processing in batch mode, so the system figures out how to get things done efficiently.


In a real-time streaming situation, events are constantly arriving and should be processed as quickly as possible. There’s an inherent, rough ordering in time. For example, events arriving now happened more recently than events arriving ten minutes ago. Treating the data like a batch makes sense to process a real-life workload quickly, but it doesn’t make for a good streaming demo!

Another complication for streaming systems is completeness. How do you know you've seen all the data when the data's arriving all the time, and some might be delayed (for example, because a system was running slowly, or a network issue caused events to queue up for a while)? Beam won't emit a window's results until it considers the window complete, that is, when a watermark that it tracks passes the end of the window. Our solution needs to ensure that event time doesn't go backwards!
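Spelled out explicitly, that behaviour looks roughly like the sketch below; the durations are illustrative, not our demo's settings:

```scala
import org.apache.beam.sdk.transforms.windowing.{AfterWatermark, FixedWindows, Window}
import org.joda.time.Duration

object WatermarkWindowing {
  // Emit each one-minute window once, when the watermark passes its end,
  // and drop anything that arrives after that.
  val oneMinuteWindows = Window
    .into[String](FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterWatermark.pastEndOfWindow())
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes()
}
```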

Replay

To deal with the ordering problem, we took control of how the events were "replayed" into Beam. We'd already split the work into two pipelines: the first read the stored data from cloud storage, parsed it, and wrote it to Pubsub. The second pipeline did the more interesting work, reading from Pubsub, assigning timestamps, putting events into time windows and computing the metrics we wanted.
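A rough sketch of that first pipeline, with the bucket, topic and parsing step as placeholders rather than our real code:

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO
import org.apache.beam.sdk.options.PipelineOptionsFactory

object ReplayIngest {
  def main(args: Array[String]): Unit = {
    val p = Pipeline.create(PipelineOptionsFactory.fromArgs(args: _*).create())

    p.apply("ReadHistoricalEvents",
        // TextIO decompresses gzipped objects automatically based on their extension.
        TextIO.read().from("gs://my-bucket/transactions/*.gz"))
      // ... parsing / filtering would go here ...
      .apply("PublishToPubsub",
        PubsubIO.writeStrings().to("projects/my-project/topics/transactions"))

    p.run()
  }
}
```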

[Image: If the data was out of order, would you still recognise it?]

We used the first pipeline to parse and write the raw data in batch mode into BigQuery. We could then query for the events we were interested in, order them by time and export the results. We built a simple application in Scala to stand in for the first pipeline, emitting events onto Pubsub in the kind of order you'd expect if the events really were streaming in.

We assigned event timestamps based on the time the events were emitted from the replay tool. That made things simple, as we weren't trying to run a pipeline as if it really was 2006, and we could stop, start and restart the replay at any point without the watermark causing problems.
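This works because, unless a timestamp attribute is configured, Beam's PubsubIO assigns each element the Pub/Sub publish time as its event timestamp. A minimal illustration, with a made-up subscription name:

```scala
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO

object DemoSource {
  // No withTimestampAttribute(...) here, so each element's event timestamp is
  // its Pub/Sub publish time and the watermark advances with the wall clock.
  val transactions = PubsubIO.readStrings()
    .fromSubscription("projects/my-project/subscriptions/transactions-demo")
}
```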

Preserving Patterns in Time

[Image: It's not a race.]

The next problem to solve was preserving the structure of the data in time. For example, if you have transactions for a high street store, you expect to see patterns in time. Maybe it’s busy in the morning through to lunchtime, then tails off through the afternoon into the evening. There won’t be any transactions when the store is closed. Just replaying the transactions as fast as possible ignores the original timestamps and so loses those recognisable patterns.

Instead of just replaying the event data, we had the replay tool look at the original timestamps, and pause for an appropriate length of time between events to preserve the original temporal structure.
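A minimal sketch of that pacing logic, with the Event type and the publish callback standing in for our real message type and the Pub/Sub publisher:

```scala
object Replay {
  final case class Event(timestampMillis: Long, payload: String)

  /** Replay events in timestamp order, sleeping between consecutive events so
    * the gaps match the original inter-event gaps. */
  def replay(events: Seq[Event])(publish: Event => Unit): Unit = {
    val ordered = events.sortBy(_.timestampMillis)
    ordered.headOption.foreach(publish)
    ordered.zip(ordered.drop(1)).foreach { case (previous, current) =>
      Thread.sleep(math.max(0L, current.timestampMillis - previous.timestampMillis))
      publish(current)
    }
  }
}
```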

The Need for Speed

[Image: Like that, but faster.]

In our dataset, the recognisable patterns don't present themselves over seconds or minutes; they emerge over the hours of the day. No good for a demo! We needed a "fast-forward" button.

Adding a scaling factor to the Scala app was fairly straightforward. The logic in the app transforms the original event times, offsetting so that the first event happens at the time the demo starts, and scaling down the intervals between each event according to a predefined factor. Now, our demo replays an hour’s events every five seconds, giving people a taste of what might be possible even with a dataset that doesn’t really work on short timescales.
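A sketch of that fast-forward transform, again with illustrative names; a speed-up factor of 720 turns an hour of original data into five seconds of demo:

```scala
object FastForward {
  final case class Event(timestampMillis: Long, payload: String)

  /** Map an original event time onto the demo's timeline: the first event lands
    * at demoStartMillis and every interval is divided by speedUp. */
  def demoTime(originalMillis: Long, firstEventMillis: Long,
               demoStartMillis: Long, speedUp: Double): Long =
    demoStartMillis + ((originalMillis - firstEventMillis) / speedUp).toLong

  /** Replay events against the compressed timeline, e.g. speedUp = 720
    * replays an hour of data in five seconds. */
  def replayScaled(events: Seq[Event], speedUp: Double)(publish: Event => Unit): Unit = {
    val ordered = events.sortBy(_.timestampMillis)
    ordered.headOption.foreach { first =>
      val demoStart = System.currentTimeMillis()
      ordered.foreach { event =>
        val wait = demoTime(event.timestampMillis, first.timestampMillis, demoStart, speedUp) -
          System.currentTimeMillis()
        if (wait > 0) Thread.sleep(wait)
        publish(event)
      }
    }
  }
}
```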

In Summary…

In production use, your data would likely be arriving in roughly time-order (and you'd deal with early/late-arriving data as appropriate for your goals), or you'd be processing in batch mode and so you'd only care about the final results being produced quickly and efficiently. Any delays in your data being produced or arriving would happen naturally, and you wouldn't be trying to speed up the passage of time!

Challenges can arise when you’re trying to do live demos of real-time dashboards:

  • processing the data in a meaningful order
  • preserving the temporal structure of the data
  • projecting long timeframes onto shorter ones that work as demos

If you’re thinking about using this kind of technology and you’re heading down the demo route, we hope our lessons learned will help you get the results you want quickly!

If you’re interested in our replay app, let us know in the comments. We may consider open-sourcing it if there’s interest. The is already out there and may be sufficient for your replay, if you don’t need the scaling element.

dunnhumby Data Science & Engineering

dunnhumby uses machine learning and data science to improve customer understanding and help drive our clients' growth.

Written by Paul Brabban

Consultant/Contractor, Software Development and Data Engineering. Functional Programming Advocate. More at https://tempered.works
