Three Ways to Think About Streaming

Molly Sions
Published in Capital One Tech
Jan 9, 2018

Thanks to YouTube, Netflix, WatchESPN and a host of others, streaming is a relatively intuitive concept in our personal lives. When we try to apply that understanding to data, though, things tend to get murky in a hurry. As a senior data analyst at Capital One, I cut my teeth on typical static data, loaded in batch and contained in a data warehouse. Just as I was getting comfortable, things changed. I was asked to build an application that feasted on a whole different beast: streaming data. In the process, I had to rebuild a lot of my assumptions about data to account for the constant movement that streaming entails, and I’m here to pass along three ways of thinking about streaming that I adopted out of necessity. Each is a useful conceptualization in its own way.

As a Continuous Flow

This is probably the most intuitive way to think about streaming, but at the same time the hardest to work with. It’s intuitive for the comparisons it brings in — the name “streaming” already implies the constant movement of rushing water, and all of us have both consumed and even produced enough live streams to understand what’s going on. Substitute pixels for rows and you’re there.

This, though, is a tough mode of thinking to step into from a coding standpoint. The key insight that arises is that stream data has to be caught as it comes in. Imagine trying to chase a fish down a river vs. putting a net out to get the fish as it passes by your boat. The question is, how do you build a net for data?

One answer is a DStream, which brings us to our second method.

As a Series of Tiny Tables

Weirdly, I just pictured a bunch of little baby tables crawling around and audibly said, “Aww.”

Anyway. This mode of thinking is the opposite of the previous one — really useful to start coding, but sort of strange to wrap your head around. To return to our water analogy, imagine your faucet occasionally spits out a rock, and you want to prove this to your landlord by catching one of the rocks.

You turn on the faucet and place a glass underneath it to catch any potential rocks. The glass takes about 30 seconds to fill, and when it reaches the brim you quickly slide another glass underneath the faucet. While the second glass is filling, you check the first one for rocks, then dump it out. When the second glass fills up, you swap it with the first and repeat the process.

This is one way you can start to think about processing stream data. You treat it the way you would treat a static table, but with the knowledge that tables just like it will keep appearing.
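To make the glass-swapping concrete, here is a minimal sketch in plain Python. It is not Spark code. The `(timestamp, payload)` record shape, the `tiny_tables` and `contains_rock` names, and the sample data are all made up for illustration, and the sketch assumes events arrive in time order.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record shape: (timestamp in seconds, payload).
Event = Tuple[float, str]


def tiny_tables(events: Iterable[Event], batch_seconds: float = 30.0) -> Iterator[List[Event]]:
    """Cut a time-ordered stream into one small "table" (a list) per window."""
    batch: List[Event] = []
    window_end = None
    for ts, payload in events:
        if window_end is None:
            # The first event starts the first window (the first glass).
            window_end = ts + batch_seconds
        # The glass is full: hand it off and slide in empty ones
        # until the current event fits in the window.
        while ts >= window_end:
            yield batch
            batch = []
            window_end += batch_seconds
        batch.append((ts, payload))
    if batch:
        # Hand off the last, partially filled glass.
        yield batch


def contains_rock(table: List[Event]) -> bool:
    """The 'static table' check: does this one glass hold a rock?"""
    return any(payload == "rock" for _, payload in table)


# Made-up faucet output: mostly water, one rock at second 31.
stream = [(0, "water"), (12, "water"), (31, "rock"), (45, "water"), (95, "water")]
results = [contains_rock(table) for table in tiny_tables(stream)]
print(results)
```

Note that `contains_rock` knows nothing about streaming. It only ever sees one ordinary list at a time, which is the whole appeal of this way of thinking. A quiet window still produces a table here; it is simply empty.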

Incidentally, this is one way that Apache Spark allows you to code: you write a script to be run on X seconds’ worth of data, and then you connect to the stream and let it run over and over again. When you hear the phrase “a DStreams approach vs. a structured streaming approach,” this way of thinking aligns more closely with DStreams.

As a Static Object that Keeps Extending

The water analogy is useful for capturing the ephemeral nature of stream data — running water is there for a second, and then it’s gone — but there’s a more static side to streaming’s bold personality. To capture this, we’re going to switch metaphors again. Now picture a scarf.

Why a scarf? For one, I’m posting this in the fall, so it’s very seasonally appropriate. More than that, though, it’s easy to conjure up the image of a scarf getting longer and longer as someone knits it. As length is added onto the end, the rest of the scarf does not change.

This is how a Kafka topic works: Data continuously comes in, but once that data lands in the topic, it stays relatively static. The implications of this insight are starting to play out in how newer consumers work.

Structured streaming is a particularly elegant one. Instead of constantly receiving data, a structured streaming app keeps track of time. If the app is refreshing every thirty seconds, it will ask the topic for the last thirty seconds’ worth of data, then send that chunk off to be processed. Thirty seconds later, it will query the topic again.
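The scarf picture can be sketched in a few lines of plain Python. This is a toy model of the description above, not the Kafka or Spark API: `AppendOnlyLog` and `PollingConsumer` are hypothetical names, and the timestamps are made up. Real consumers generally track their position in the topic by offset, so treat the time-based bookkeeping here as a simplification.

```python
from bisect import bisect_left
from typing import List, Tuple


class AppendOnlyLog:
    """Toy stand-in for a topic: records are only ever added to the end."""

    def __init__(self) -> None:
        self._timestamps: List[float] = []
        self._values: List[str] = []

    def append(self, ts: float, value: str) -> None:
        # The scarf only grows at the end; earlier stitches never change.
        if self._timestamps and ts < self._timestamps[-1]:
            raise ValueError("records must be appended in time order")
        self._timestamps.append(ts)
        self._values.append(value)

    def read_between(self, start: float, end: float) -> List[Tuple[float, str]]:
        """Return every record with start <= timestamp < end."""
        lo = bisect_left(self._timestamps, start)
        hi = bisect_left(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


class PollingConsumer:
    """Remembers when it last asked, and requests only what arrived since."""

    def __init__(self, log: AppendOnlyLog, start_at: float = 0.0) -> None:
        self.log = log
        self.last_polled = start_at

    def poll(self, now: float) -> List[Tuple[float, str]]:
        chunk = self.log.read_between(self.last_polled, now)
        self.last_polled = now
        return chunk


log = AppendOnlyLog()
log.append(5, "a")
log.append(20, "b")
log.append(40, "c")

consumer = PollingConsumer(log)
first_chunk = consumer.poll(now=30)   # everything from second 0 up to second 30
log.append(55, "d")                   # the scarf keeps growing
second_chunk = consumer.poll(now=60)  # everything from second 30 up to second 60
print(first_chunk, second_chunk)
```

Because old records never change, the consumer can ask for any past stretch of the log and get the same answer every time. That is what lets it pull data on its own schedule instead of having to catch each record as it flies by.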

Conclusion

When architects decide on the locations of windows in a building, they think of the sun as moving around the earth. This is not literally true, of course, but it creates a useful frame of reference for their purpose. Streaming is an incredibly powerful concept, and none of these three modes of thinking captures the entirety of what’s happening. They are, however, three frames of reference that can be useful in formulating an approach.

Put another way, if the fish keep out-running you, consider knitting a scarf. And someone should probably go check on those baby tables.

DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2018 Capital One.

Molly Sions
Capital One Tech

Half analyst, half python engineer, three-quarters product manager, all blogger