Hey, Materialize: what’s streaming on Twitch?

Marta Paes
7 min read · Nov 4, 2021


In this blogpost, we’ll use Materialize to tap into what’s happening on Twitch right now, and show how far you can push standard SQL to (literally) explore streaming data. Want to follow along? Head over to this GitHub repository for the demo!

Our end goal: a Metabase dashboard with real-time Twitch stats, powered by Materialize.

👋 Twitch

Twitch makes a lot of data available to the public through its Helix API (and, more recently, EventSub). To get things started, we’ll have a Kafka producer polling the get-streams endpoint and grabbing all active broadcasts into a Kafka topic named twitch-streams. The events landing in this topic are JSON-encoded and roughly look like:

{
  "id": "44032144189",
  "user_id": "14297727",
  "user_login": "adriianeut",
  "user_name": "Adriianeut",
  "game_id": "33214",
  "game_name": "Fortnite",
  "type": "live",
  "title": "🔴 LATE NIGHT: Cotorreo y Fortnite (?)🔥 | PC",
  "viewer_count": 27,
  "started_at": "2021-10-11T06:38:04",
  "language": "es",
  "thumbnail_url": "https://static-cdn.jtvnw.net/previews-ttv/live_user_adriianeut.jpg",
  "tag_ids": [
    "d4bb9c58-2141-4881-bcdc-3fe0505457d1",
    "d4461e94-570b-4973-b6a1-3afe026c12ca"
  ],
  "is_mature": false
}

Because we’re dealing with real-world, ugly data™️, there are some quirks to be aware of: 1) we’ll get multiple events per key (id), as viewers join and leave broadcasts, and 2) each broadcast can have multiple tags (tag_ids). We’ll use Postgres to keep up with the official list of tags, so we can enrich those cryptic UUIDs with human-readable descriptions (more on that later!).

An overview of the components in our Twitch analytics pipeline.

Materialize

Now that we have a feel for what the source data looks like and the different moving parts in our analytics pipeline, we can focus on Materialize and how to efficiently combine and transform all this data in real time using SQL.

Let’s walk through it.

1. Creating the sources

The first step is to tell Materialize where to find and how to connect to our data sources: Kafka and Postgres. If you’re coming from database land, the closest concept to a source in Materialize is a foreign table: data that lives outside the database, but that can be queried like any other table once you pin down its external location.

Kafka

The Kafka source will ingest events from the twitch-streams topic as raw bytes (FORMAT BYTES), which need to be cast to JSON (we’ll skip going into detail in this step, but you can find the conversion here).

What’s with the ENVELOPE UPSERT bit? As there can be multiple events per id, but we’re only ever interested in keeping the most recent one, we need to instruct the source to work in upsert mode, so that it can retract old values and keep things fresh.
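Here’s a rough sketch of what that looks like, based on the Materialize syntax available at the time of writing; the broker address and the source and view names are placeholders, and the demo repository has the exact definitions:

CREATE SOURCE kafka_twitch_stream
FROM KAFKA BROKER 'kafka:9092' TOPIC 'twitch-streams'
FORMAT BYTES
ENVELOPE UPSERT;

-- Hypothetical conversion view: decode the raw bytes and extract the JSON
-- fields used later on (the demo repository has the full version).
CREATE VIEW v_twitch_stream AS
SELECT
  (data->>'id') AS id,
  (data->>'user_name') AS user_name,
  (data->>'game_id') AS game_id,
  (data->>'game_name') AS game_name,
  (data->>'viewer_count')::int AS viewer_count,
  (data->>'started_at')::timestamp AS started_at,
  data->'tag_ids' AS tag_ids
FROM (
  SELECT convert_from(data, 'utf8')::jsonb AS data
  FROM kafka_twitch_stream
) AS jsonified;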

Postgres

In the real world, the list of available tags is maintained by Twitch. With this in mind, we’d probably want to replicate any edits to this list — say, a new tag is added or a description is updated — into Materialize instantly. One way to do this is to use the Postgres source to keep track of changes in the upstream database:
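As a rough sketch (the connection details, publication name and table name are placeholders; the demo repository has the exact setup):

-- Replicate the upstream Postgres database via its replication stream.
CREATE MATERIALIZED SOURCE mz_source FROM POSTGRES
  CONNECTION 'host=postgres port=5432 user=postgres dbname=postgres'
  PUBLICATION 'mz_source';

-- Break the replicated publication out into views, one per table; here we
-- assume a "tags" table holding each tag id and its description.
CREATE VIEWS FROM SOURCE mz_source (tags);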

Note: We could achieve the same behaviour using Kafka+Debezium; this is the Kafka-less alternative for Postgres CDC in Materialize.

With the sources defined, Materialize now knows where to locate and how to access the data we need, when we need it. We can start thinking about some interesting questions to throw at it!

2. Asking questions!

If you think in SQL, I’m afraid you’ve already figured out what we need to do next: write some SQL queries. Then, we just need to trust Materialize to keep things up to date for us.

That’s it.

The rule of thumb is that anything you want to see answered repeatedly (but fast!) as the underlying data changes should be modelled as a materialized view.

Note: Understanding the difference between a traditional materialized view and how these work in Materialize is key. We’ll get there, but this blogpost has a great overview if you want to dig into it a bit more.

What are the most popular games on Twitch?

Let’s start by figuring out what the most popular games on Twitch are, based on how many viewers are currently watching the broadcasts. Putting our SQL brain to work, this can be translated into a simple aggregation over the Kafka source (after conversion), based on the game_id and game_name:
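A sketch of what that materialized view could look like, using the (assumed) column names from the conversion view above:

CREATE MATERIALIZED VIEW mv_agg_stream_game AS
SELECT game_id,
       game_name,
       COUNT(id) AS agg_stream_cnt,
       SUM(viewer_count) AS agg_viewer_cnt
FROM v_twitch_stream
WHERE game_id IS NOT NULL
GROUP BY game_id, game_name;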

Whenever we use mv_agg_stream_game to find out the Top10 games being played, Materialize will have fresh and correct results at hand:
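For example, with something along these lines:

SELECT game_name, agg_viewer_cnt
FROM mv_agg_stream_game
ORDER BY agg_viewer_cnt DESC
LIMIT 10;

-- Assumed definition of a reusable (non-materialized) Top10 view that we'll
-- come back to later on.
CREATE VIEW v_stream_game_top10 AS
SELECT game_id, game_name, agg_viewer_cnt
FROM mv_agg_stream_game
ORDER BY agg_viewer_cnt DESC
LIMIT 10;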

If we reissue this query over and over again, we’ll see the results change as new data streams in (hint: try to follow the agg_viewer_cnt). But how?

Instead of re-reading the source data and recomputing the results, Materialize keeps the state of mv_agg_stream_game indexed in memory and incrementally updates it as the underlying data changes. This is what makes it fast, and also what makes it different to other database systems that support “traditional” materialized views.

Next, let’s explore a couple of patterns that can be expressed in SQL, but only become exciting when applied to streaming.

What gaming streams started in the last 15 minutes?

The thing with streaming is that time keeps ticking along, unlike what you may be used to with batch, where your data is sort of frozen at query time. This means that “in the last 15 minutes” is a moving target that changes as time progresses. How do you express this in SQL?

In Materialize, you can use temporal filters to define a sliding window that progressively expires events falling outside this 15-minute window and picks up new ones (with the bonus of not having to learn any new syntax!). The key here is mz_logical_timestamp(), which keeps track of the logical time for your query:
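A sketch of the sliding window, keeping only streams that started within the last 15 minutes (the window is expressed in milliseconds of logical time):

CREATE MATERIALIZED VIEW mv_stream_15_min AS
SELECT *
FROM v_twitch_stream
WHERE mz_logical_timestamp() >= extract(epoch FROM started_at) * 1000
  AND mz_logical_timestamp() < extract(epoch FROM started_at) * 1000 + (15 * 60 * 1000);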

Is Materialize doing its job? If we inspect the MIN(started_at) date for the mv_stream_15_min materialized view, we can see it moving forward as time ticks along. As expected, records are being expired and new, eligible ones are brought into the view.
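Something as simple as the following query works for spot-checking:

SELECT MIN(started_at) FROM mv_stream_15_min;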

What are the most used tags?

Another interesting (and somewhat feared) pattern is defining joins between multiple streaming sources. This is one of the key features in Materialize, and another example of how it thinks more like a database, and less like existing streaming systems.

In this case, we need to enrich events flowing through Kafka with reference data from Postgres to map the tag_ids to a matching, readable description. We can unnest the list of tags and pre-aggregate the results to avoid doing more work than necessary.
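A sketch of that enrichment, assuming tag_ids was kept as a jsonb array in the conversion view and that the replicated Postgres table is named tags, with tag_id and description columns:

CREATE MATERIALIZED VIEW mv_agg_stream_tag AS
WITH stream_tags AS (
    -- Unnest the array of tag ids, one row per (stream, tag) pair.
    SELECT id, jsonb_array_elements_text(tag_ids) AS tag_id
    FROM v_twitch_stream
),
tag_cnt AS (
    -- Pre-aggregate before joining, so the join has less work to do.
    SELECT tag_id, COUNT(id) AS cnt_tag
    FROM stream_tags
    GROUP BY tag_id
)
SELECT t.description AS tag,
       tc.cnt_tag
FROM tag_cnt tc
JOIN tags t ON tc.tag_id = t.tag_id;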

Remember how the Postgres source is supposed to be listening to changes upstream? If we update the description of a tag (for example, setting “Playing with Viewers” to “Playing with ALIENS”), this change is immediately reflected in the results.
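To try this out, you could run an update like the following against the upstream Postgres database (again, the table and column names are assumptions):

UPDATE tags
SET description = 'Playing with ALIENS'
WHERE description = 'Playing with Viewers';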

Who are the most popular streamers for each of the Top10 games?

The last pattern we’ll dive into is event-driven queries. Now, this might sound a bit complex to grasp at first, but stay with me.

Let’s say that we’d like to know who the most popular streamer is for each of the Top10 games streaming on Twitch. If you remember, we wrote a query to yank that out of mv_agg_stream_game a few steps back. What we can do is use the results of this query (that, by the way, might change over time) and pass each game in the Top10 as a parameter to the body of a lateral join. Think of it as a for loop for SQL:
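A sketch of the lateral join, reusing the v_stream_game_top10 view from earlier (the materialized view name is made up): for each game in the Top10, the subquery fetches the single streamer with the highest current viewer count.

CREATE MATERIALIZED VIEW mv_top_streamer_per_game AS
SELECT top10.game_name,
       streamer.user_name,
       streamer.viewer_count
FROM v_stream_game_top10 top10,
     LATERAL (
        SELECT user_name, viewer_count
        FROM v_twitch_stream
        WHERE game_id = top10.game_id
        ORDER BY viewer_count DESC
        LIMIT 1
     ) streamer;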

In practice, we can have an arbitrarily complex SQL query within the lateral join, and rest assured that it will only ever be executed for the subset of inputs being passed in from v_stream_game_top10.

Double-checking that we’re getting fresh data from Twitch. 💦

Materialize will keep exactly 10 records around in memory: the streamer with the highest current viewer count for each of the Top10 games. It will also only do work (that is, run the subquery) when a new game enters the Top10 and old results need to be retracted.

Note: Lateral joins are not some weird new invention: they exist in PostgreSQL, but may be considered exotic 🍹 even to Postgres users. This blogpost is a great follow-up read on how lateral joins work in Materialize.

Metabase

To finish it off, we can keep track of the (live) answers to all these questions using a BI tool like Metabase. Because Materialize is wire-compatible with Postgres, it looks and feels like your regular database, including when it comes to integrating with other tools in the ecosystem.

Here’s what Twitch looked like for a few seconds today (again!):

And we’re through! We’ve built an end-to-end streaming analytics pipeline using nothing but SQL and our bare hands.

If you’re looking to get started with Materialize, you can get a much simpler streaming setup going in a couple of minutes by installing it locally or signing up for Materialize Cloud. To learn more about how Materialize works, check out the documentation and come chat with us on Slack!
