The Duality of Streams and Tables - Why It Matters?

How do you turn a stream of events into a table, and vice versa? And how does that lead to materialized views?

Dunith Danushka
Tributary Data
5 min read · Jul 18, 2021

Photo by James Balensiefen on Unsplash

Events capture state changes

An event is an immutable record of a fact that happened in the past. Essentially, an event has a key, value, and timestamp. For example:

Structure of a typical event.
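
As a rough sketch of that structure, an event can be modeled as an immutable record with three fields. The class name `Event`, the field names, and the sample values below are assumptions made for illustration; real platforms define their own record formats.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen=True blocks reassignment of fields after creation
class Event:
    key: str             # identifies what the event is about
    value: dict          # the payload describing what happened
    timestamp: datetime  # when it happened


# Hypothetical event, based on the sensor example in this article.
reading = Event(
    key="room-234",
    value={"temperature_celsius": 27},
    timestamp=datetime(2021, 7, 18, 3, 45, tzinfo=timezone.utc),
)
```

Trying to assign `reading.key = "room-1"` raises an error, which mirrors the idea that a recorded fact is never changed after it has happened.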

Other examples include:

  • An item was sold.
  • A row was updated in a database table.
  • A sensor records a temperature of 27C in room 234.
  • A money transfer of $100 was made from Alice’s account to Bob’s account on July 18th, 2021, at 03:45 AM.

An event streaming platform captures incoming events into streams. A stream stores related events together, in the order in which they arrived.

Tables to derive the current state

Events provide an excellent chronological record of the state changes made to a system. But at some point, business needs require us to know the “current state” of the system.

For instance, we may need to answer these types of questions:

  • What is the status of order number 12345?
  • What is the current sales volume at retail store #23?
  • What percentage of machines are actively working at plant #45?

The simplest way to answer these questions is to write the event stream into a database table and then query it as needed.

For example, the ORDERS table records the incoming events related to orders. It already has a record for the order with ID 12345 with status set to CREATED. The next time an order event comes with the same ID and a different state, that record will be updated to reflect the change in the state.

A table only keeps the latest value for a given key

By querying the ORDERS table with ID 12345, we can get its latest status.
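
Here is a minimal sketch of that update-in-place behavior, with a plain dictionary standing in for the ORDERS table. The function name `apply_order_event` and the statuses PAID and SHIPPED are assumptions for illustration; only CREATED comes from the example above.

```python
# The ORDERS table, modeled as a dict keyed by order ID.
orders = {"12345": "CREATED"}


def apply_order_event(table, order_id, status):
    """Upsert: add the order if it is new, otherwise overwrite its status."""
    table[order_id] = status


# Hypothetical follow-up events for the same order, in arrival order.
apply_order_event(orders, "12345", "PAID")
apply_order_event(orders, "12345", "SHIPPED")

# Querying by ID returns only the latest status; earlier ones are gone.
latest_status = orders["12345"]
```

After both events are applied, the table still holds a single row for order 12345, and `latest_status` is "SHIPPED".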

Duality of streams and tables

An interesting observation is that there is a close relationship between streams and tables, known as the stream-table duality. Essentially, this duality means that a stream can be viewed as a table, and a table can be viewed as a stream.

Stream as a table

We can turn a stream into a table by grouping the events in the stream by key and then performing aggregate operations like COUNT() or SUM() on the event values. For each key, the table keeps only the latest aggregated value.

For example, a stream of events with key=username and value=location can be aggregated into a continuously updated table that tracks the number of visited locations per username.
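
The aggregation described above can be sketched in a few lines. The usernames and locations are made-up sample data, and a plain loop stands in for what a stream processor would do continuously as events arrive.

```python
from collections import defaultdict

# Illustrative stream of (username, location) events, in arrival order.
stream = [
    ("alice", "Paris"),
    ("bob", "Sydney"),
    ("alice", "Rome"),
    ("bob", "Lima"),
    ("alice", "Berlin"),
]

visits = defaultdict(int)  # the table: username -> number of visited locations

for username, _location in stream:
    visits[username] += 1  # COUNT() grouped by key, updated on every event

table = dict(visits)
```

Once all five events are processed, `table` holds one row per user: 3 for alice and 2 for bob.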

Table as a stream

A table is a snapshot of the latest value for each key in a stream at any given time. We can turn a table into a stream by capturing the changes made to the table (inserts, updates, and deletes) into a “changelog stream.” This process is often called change data capture, or CDC for short.

To understand this better, consider a table that tracks the total number of page views per user. As each new page view event is processed, the state of the table is updated accordingly. Every time the table is updated, the resulting state change is emitted to the changelog stream.

The stream-table duality says that if you replay that changelog stream, you’ll get the original table (rightmost column).
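
To make the replay claim concrete, here is a small sketch that builds a page view table, emits a changelog record on every update, and then rebuilds the table from the changelog alone. The usernames are sample data, and the changelog is modeled as a simple list of (key, new value) pairs, which is an assumption rather than any platform's actual format.

```python
# Illustrative page view events, one username per view.
pageviews = ["alice", "charlie", "alice", "bob"]

table = {}      # username -> total page views
changelog = []  # (username, new_total), emitted on every table update

for user in pageviews:
    table[user] = table.get(user, 0) + 1
    changelog.append((user, table[user]))

# Replay: each changelog record overwrites the previous value for its key.
rebuilt = {}
for user, total in changelog:
    rebuilt[user] = total
```

Here `rebuilt` ends up equal to `table`, even though it was constructed without ever seeing the original page view events.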

If you are not convinced yet, the simplest way to understand this is to think in database terms. A stream is a sequence of changes interpreted purely as INSERT statements (since no record replaces any existing record). In contrast, a table is a stream whose changes are interpreted as UPDATEs (since any existing row with the same key is overwritten).
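
The same comparison can be tried with Python's built-in sqlite3 module. This is only a sketch: the table names `stream_view` and `table_view` and the sample changes are invented, and SQLite's `INSERT OR REPLACE` is used here as a convenient stand-in for "update the row with this key, or create it".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key: every change becomes its own row.
conn.execute("CREATE TABLE stream_view (key TEXT, value INTEGER)")
# Primary key on `key`: at most one row per key.
conn.execute("CREATE TABLE table_view (key TEXT PRIMARY KEY, value INTEGER)")

changes = [("alice", 1), ("bob", 1), ("alice", 2)]

for key, value in changes:
    # Stream interpretation: append, never overwrite.
    conn.execute("INSERT INTO stream_view VALUES (?, ?)", (key, value))
    # Table interpretation: overwrite the existing row with the same key.
    conn.execute("INSERT OR REPLACE INTO table_view VALUES (?, ?)", (key, value))

stream_rows = conn.execute(
    "SELECT key, value FROM stream_view ORDER BY rowid"
).fetchall()
table_rows = conn.execute(
    "SELECT key, value FROM table_view ORDER BY key"
).fetchall()
```

The same three changes leave three rows in `stream_view` but only two in `table_view`, where alice's value has been overwritten with 2.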

Queryable state with materialized views

This duality of streams and tables allows us to use tables as high-performance materialized views. A materialized view is a table that holds a pre-computed answer to a query in a ready-to-serve form.

The materialized view subscribes to an incoming stream of events to refresh itself in an event-driven manner. For example, an order-summary view updates itself by listening to events coming from the orders topic.
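
As a sketch of what such a view might look like inside an application, the class below keeps two pre-computed answers up to date as order events arrive. The class name `OrderSummaryView`, its methods, and the sample events are all hypothetical; a real view would be fed by a consumer subscribed to the orders topic rather than by a loop over a list.

```python
from collections import Counter


class OrderSummaryView:
    """Keeps pre-computed answers about orders, refreshed on every event."""

    def __init__(self):
        self._status_by_order = {}         # order_id -> latest status
        self._count_by_status = Counter()  # status -> number of orders in it

    def on_event(self, order_id, status):
        """Called once per incoming order event."""
        previous = self._status_by_order.get(order_id)
        if previous is not None:
            self._count_by_status[previous] -= 1  # order leaves its old status
        self._status_by_order[order_id] = status
        self._count_by_status[status] += 1

    def status_of(self, order_id):
        return self._status_by_order.get(order_id)

    def count(self, status):
        return self._count_by_status[status]


view = OrderSummaryView()
# Hypothetical events from the orders topic, in arrival order.
for order_id, status in [("12345", "CREATED"), ("12346", "CREATED"), ("12345", "SHIPPED")]:
    view.on_event(order_id, status)
```

Queries such as `view.status_of("12345")` or `view.count("CREATED")` are then simple lookups, because the work was already done when each event arrived.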

Materialized views enable us to query the current state of a system. For example, they answer questions like:

  • What is the average temperature of room 213 within the last hour?
  • What is the status of order X?
  • How many packages have been delivered during the last week?

Where to store the materialized view?

Now that we know what materialized views are and what they are capable of, the next question for an application developer is: where do you materialize your application’s state?

There are two ways:

  1. Using an internal state store: The application stores the current state in the local VM/container/process.
  2. Using an external state store: The application uses external storage like a database, search index, or a file system to store the state globally.

Internal state stores come in handy when the event processing application needs to perform low-latency aggregations on the stream. Examples include SUM and COUNT operations. Stateful event processing frameworks such as Kafka Streams and ksqlDB advocate for fault-tolerant and scalable internal state stores based on RocksDB.
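
The two options can be sketched behind one small interface, so the same aggregation logic runs against either kind of store. Everything here is an assumption for illustration: the `StateStore` protocol and both classes are invented, the in-memory dictionary is not RocksDB, and SQLite merely stands in for whatever shared database or index an application would actually use.

```python
import sqlite3
from typing import Optional, Protocol


class StateStore(Protocol):
    def put(self, key: str, value: int) -> None: ...
    def get(self, key: str) -> Optional[int]: ...


class InMemoryStore:
    """Internal store: state lives inside the application process."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class SqliteStore:
    """Stand-in for an external store that lives outside the application."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value INTEGER)"
        )

    def put(self, key, value):
        self._conn.execute("INSERT OR REPLACE INTO state VALUES (?, ?)", (key, value))
        self._conn.commit()

    def get(self, key):
        row = self._conn.execute(
            "SELECT value FROM state WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None


def count_events(keys, store: StateStore) -> None:
    """A COUNT aggregation that works with either kind of store."""
    for key in keys:
        store.put(key, (store.get(key) or 0) + 1)


internal = InMemoryStore()
external = SqliteStore(":memory:")  # in-memory database, only to keep the sketch runnable
count_events(["alice", "bob", "alice"], internal)
count_events(["alice", "bob", "alice"], external)
```

Both stores end up with the same counts. The difference is operational: lookups in the internal store never leave the process, while the external store costs a round trip per read and write in exchange for state that other applications can reach.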

Takeaways

Streams alone can’t answer questions related to the current state of a system. A stream has to be materialized into a table to derive the current state.

A stream can be turned into a table by aggregating its events by key. Similarly, a table can be turned into a stream by emitting each state change made to the table into a changelog stream.

This duality enables an event processing application to maintain a materialized view out of incoming events. This view can be built local to the application itself or stored externally based on the use case.
