Streaming — options, options, which to choose?

Say you have a data source that changes several times per day and you want an updated view of that data every ten minutes — that’s a streaming use case. Likewise, if you want to see updated statistics from sensors every 60 seconds, that is also a streaming use case. Often, streaming is defined in a very narrow window — very fast moving data that requires instant action. In reality, however, most streaming use cases are between 60 seconds and 10 minutes because that is as fast as a person or process will act on the changing data. Our streaming technologies are designed to meet customers and use cases where they are, rather than pushing them to process data faster than they can use it.

It’s not just the definition of _streaming_ that varies; there are also different ways to work with streaming data inside of Snowflake. Having multiple tools for the job is by design: since streaming needs and use cases differ, so should the options we make available inside of Snowflake. This post unpacks three key decisions for choosing how to _process_ streaming data, with Streams & Tasks or with Dynamic Tables.

1: Declarative or imperative?

Do you want to specify how things get done, or what the outcome should be? This distinction is one of the biggest differences between Dynamic Tables and Streams & Tasks.

What do I mean by declarative and imperative? As a very high-level summary:

In a declarative model, you specify what the outcome should be. In the case of a Dynamic Table, you write a query that defines the end result: the table itself. You declare what you want the output to look like and let Snowflake figure out how to produce it.

In an imperative model, you provide the control flow and logic for each step of the execution. The focus of this model is on how the work gets done, so you guide Snowflake step by step to the end result you want.
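To make the contrast concrete, here is a minimal sketch of the same result built both ways, using hypothetical raw_orders and order_totals tables (the exact syntax may differ slightly; current Snowflake documentation spells the lag setting TARGET_LAG):

```sql
-- Declarative: describe the result and let Snowflake maintain it.
CREATE OR REPLACE DYNAMIC TABLE order_totals
  TARGET_LAG = '10 minutes'
  WAREHOUSE  = my_wh
AS
  SELECT customer_id, SUM(amount) AS total_amount
  FROM raw_orders
  GROUP BY customer_id;

-- Imperative: track changes yourself and spell out each processing step.
-- (Here order_totals would be a regular table you create and maintain;
--  the stream is assumed to see only inserts, to keep the MERGE simple.)
CREATE OR REPLACE STREAM raw_orders_stream ON TABLE raw_orders;

CREATE OR REPLACE TASK update_order_totals
  WAREHOUSE = my_wh
  SCHEDULE  = '10 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW_ORDERS_STREAM')
AS
  MERGE INTO order_totals t
  USING (
    SELECT customer_id, SUM(amount) AS delta_amount
    FROM raw_orders_stream
    GROUP BY customer_id
  ) s
  ON t.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET t.total_amount = t.total_amount + s.delta_amount
  WHEN NOT MATCHED THEN INSERT (customer_id, total_amount)
    VALUES (s.customer_id, s.delta_amount);

-- Tasks are created suspended; start the schedule explicitly.
ALTER TASK update_order_totals RESUME;
```

Both paths end at a similar table; the difference is who is responsible for the how.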

When you are choosing how to build a pipeline, the decision between declarative and imperative is a big one. If you have complex application requirements, or want detailed control over how things get done, then imperative makes sense. While this model offers fine-grained control, it comes with a cost: a higher level of effort is needed to create and maintain imperative applications. Because everything must be specified upfront, the cost to build the pipeline is higher. If you are focused on the outcome you want and are less opinionated about how things get done, then a declarative model makes more sense, and the upfront costs to develop and maintain the pipeline can be lower.

There are other considerations when choosing between these two that might not be immediately obvious. For example, an imperative pipeline may need more documentation so that anyone who did not create it can understand how it works and what is going on. With a declarative pipeline, on the other hand, almost anyone can look at the declared result to understand what the pipeline does, because Snowflake handles the execution.

Want an example? Think about dinner. In an imperative model, you will not only think of a dish to make, but you will also want to have a hand in how the dish gets made at each step — how much to add, when to heat and cool, and what ingredients to use. In a declarative model, you choose what you want from a menu and let the kitchen handle the cooking for you.

2: Automatic or manual?

After you choose the programming model that makes sense for your use case, the next big question to ask is — how should resources be used to run that program? Do you want manual control over how the pipeline runs, or would automatic execution be more useful?

With Streams & Tasks you have a spectrum of control between fully manual and somewhat automatic. Manual control can be useful in many cases, such as when you want to precisely control the resources used to run applications, have to navigate complex use cases or application states, or are wary of systems that provide more automation. Streams & Tasks put you in the driver’s seat to choose the right level of automation, which gives you options but also requires more work on your end.
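As a rough sketch of that spectrum (names are hypothetical), the same task can run on a warehouse you size and manage yourself, or as a serverless task where Snowflake manages the compute:

```sql
-- Manual end of the spectrum: you pick, size, and pay attention to the warehouse.
CREATE OR REPLACE TASK refresh_report_manual
  WAREHOUSE = reporting_wh                 -- warehouse you manage
  SCHEDULE  = '10 MINUTE'
AS
  CALL refresh_report();                   -- hypothetical stored procedure

-- More automatic: a serverless task, where Snowflake manages the compute.
CREATE OR REPLACE TASK refresh_report_serverless
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
  SCHEDULE  = '10 MINUTE'
AS
  CALL refresh_report();
```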

One big goal of Dynamic Tables is to automate as much of the execution as possible, and to do so as efficiently as possible. For example, with Dynamic Tables you only need to specify a lag parameter as your objective for data freshness, and Snowflake figures out how frequently to refresh to meet that objective. As the sketch below shows (and as we will come back to later), there is also a way to have upstream tables refresh only when the tables downstream of them need fresh data, which makes life super easy. Refreshes, wherever possible, are also performed incrementally.
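Here is a rough sketch of that chaining, with hypothetical table names (in current documentation the setting is TARGET_LAG, and DOWNSTREAM tells a table to refresh only when the tables that depend on it need fresh data):

```sql
-- Intermediate table: refreshes only when a downstream consumer needs it.
CREATE OR REPLACE DYNAMIC TABLE cleaned_orders
  TARGET_LAG = DOWNSTREAM
  WAREHOUSE  = my_wh
AS
  SELECT order_id, customer_id, amount
  FROM raw_orders
  WHERE amount IS NOT NULL;

-- Final table: the only place a freshness objective is stated.
CREATE OR REPLACE DYNAMIC TABLE customer_totals
  TARGET_LAG = '10 minutes'
  WAREHOUSE  = my_wh
AS
  SELECT customer_id, SUM(amount) AS total_amount
  FROM cleaned_orders
  GROUP BY customer_id;
```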

What does this mean? Well, a table can be refreshed in one of two ways: you can recompute the entire table every time you want it updated, or you can incrementally refresh only the parts that changed. Incremental refreshes are more cost- and effort-efficient because work is only done on the data that has changed since the table was last refreshed. Dynamic Tables are designed to perform incremental updates, even when complex logic such as window functions is used.

What does this all mean? Generally speaking, unless you need to control both the flow (imperative) and how resources are used to run it, automatic resource management can make a lot more sense. Automatic execution can save not only compute resources, but also your time and attention: instead of shaping the resources used to run a pipeline (manual), you can start working on your next pipeline because everything is automatic.

As with the programming model, there are also some interesting outcomes when automatic execution is chosen. For instance, consider error handling. Dynamic Tables, with their automatic execution, are designed to be fault resistant and provide rich data about how each refresh happened, things that would otherwise be your responsibility with manual execution.
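As a sketch of what that refresh data looks like, Snowflake exposes a DYNAMIC_TABLE_REFRESH_HISTORY table function in the information schema; the exact arguments and columns may vary, but a query along these lines shows whether each refresh succeeded and whether it was incremental or a full recompute:

```sql
-- Inspect recent refreshes of a Dynamic Table (table name is hypothetical).
SELECT name, state, refresh_action, refresh_start_time, refresh_end_time
FROM TABLE(
  INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(NAME => 'CUSTOMER_TOTALS')
)
ORDER BY refresh_start_time DESC
LIMIT 10;
```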

3: How will this pipeline change over time?

Thinking about error handling and fault resiliency leads us to an important third question: how will a pipeline change over time? Pipelines are rarely static; they change over time. This is true even if the control flow of the pipeline itself does not change, because factors outside of the pipeline can change, for example:

  • Volume of data being fed into the pipeline
  • Business needs around the pipeline, requiring faster or slower processing
  • Shape or type of data being put into the pipeline
  • Institutional knowledge about the pipeline

If you know that a pipeline and the elements surrounding it will be static (for example, the data type and volume will not change, and there are no evolving business use cases for the data), then a relatively static pipeline might make sense.

Take an example, however, where things change.

Say you build a pipeline to process raw transactional data from a point of sale (PoS) system your company uses. When you first launch the pipeline, perhaps everyone agrees it makes the most sense to refresh the data once every hour, because the data won’t be acted on faster than that. Perhaps this is a pipeline that leads to a report used to balance stock between multiple stores in metropolitan areas. During the next holiday, however, maybe the inventory team wants to get new results every five minutes because they plan on actively managing inventory across stores in the same day.

If you build a pipeline where you manage not only the pipeline logic but also how it gets executed (e.g., tasks or orchestration), you now need, at a minimum, to adjust those mechanisms. The complexity of the problem scales with the complexity of the pipeline: more pieces means more parts of the pipeline to rework. If your pipeline has ten tasks, you need to examine all of them and ensure each one is properly readjusted by hand.
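For example, with tasks, each schedule change is a hands-on edit repeated per task (a sketch with hypothetical task names; a task generally needs to be suspended before its properties are changed):

```sql
-- Repeat for every task in the pipeline that needs the faster cadence.
ALTER TASK load_pos_raw SUSPEND;
ALTER TASK load_pos_raw SET SCHEDULE = '5 MINUTE';
ALTER TASK load_pos_raw RESUME;

ALTER TASK build_stock_report SUSPEND;
ALTER TASK build_stock_report SET SCHEDULE = '5 MINUTE';
ALTER TASK build_stock_report RESUME;
-- ...and so on for the remaining tasks.
```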

With Dynamic Tables, you only need to change the lag parameter of the final table; in the worst case, if you did not use lag = downstream, you may also need to adjust upstream tables to refresh faster (by adjusting their lag parameters). What does the downstream bit mean? Dynamic Tables are so automatic that you can set intermediate and upstream tables to refresh whenever the tables downstream of them need fresh data. These changes take a few minutes at most, and the complexity of changing the execution is handled for you.
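The equivalent change for a Dynamic Table pipeline is a single statement per table whose objective changed (hypothetical name; current documentation spells the setting TARGET_LAG):

```sql
-- Tighten the freshness objective on the final table; upstream tables set to
-- DOWNSTREAM pick up the new cadence automatically.
ALTER DYNAMIC TABLE stock_balance_report SET TARGET_LAG = '5 minutes';
```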

As another example, suppose the volume of data radically changes — in this PoS use case, say the PoS data increases 10x during an especially big holiday. Unless you designed your pipeline to account for such a massive shift (which may also lead to inefficiency), the changing data volume will require pipeline rework. Dynamic Tables, on the other hand, can scale much more easily to changes in data volume across the entire pipeline.

What next?

Getting started depends on which tools work best for your specific use case.

Dynamic Tables will enter Public Preview very soon. If you are attending Snowflake Summit, there will be several sessions covering Dynamic Tables, including:

Stay tuned to the Snowflake Blog for announcements and links to Dynamic Tables materials when it enters Public Preview.

Streams and Tasks are both generally available in Snowflake today. There are a number of resources to learn more about Streams and Tasks, including:
