Rise of the Streaming Databases — Episode 1: Materialize
How Materialize disrupts the stream processing landscape by enabling developers to build real-time applications with standard SQL.
Businesses that can extract correct and timely insights from data will become the industry leaders of tomorrow; those that cannot will eventually fade away. Processing data at rest alongside data in motion helps organizations make informed, actionable decisions.
Historical data analysis has been an established practice for a long time. But industry adoption of real-time analytics was limited by factors like long development cycles, the need for specialized skills, and poor tooling options.
Today, products like Materialize overcome these barriers by enabling developers to build real-time data processing applications with standard SQL. This article introduces Materialize, discusses how it disrupts stream processing application development, and walks through the basics with an example.
Let’s take an example
Imagine you are reading a stream of page view events from a Kafka topic with the following format.
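The exact schema isn't prescribed here; the fields below are assumptions consistent with the queries used later in the article.

```json
{
  "user_id": 1234,
  "channel": "social",
  "ts": "2021-06-01T10:15:30Z"
}
```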
Your goal is to process these events to find the channels that have contributed the most page views.
Traditionally, you would write the page view events to a database table first and then query that from a BI dashboard like this.
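Such a dashboard query might look like this (a sketch, assuming the events land in a table named pageviews):

```sql
SELECT channel,
       count(*) AS pageview_count
FROM pageviews
GROUP BY channel
ORDER BY pageview_count DESC;
```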
The problem with the above is that the database performs that expensive aggregation every time it receives the query. As more records are added to the table, the query latency will increase.
You can use a materialized view to overcome that. A materialized view precomputes a query's results and stores them for fast read access. In our case, the BI application reads a materialized view that already has the pre-aggregated page views cached in the database.
Materialized views are refreshed periodically, so there's a chance of stale data in the query result. Classic materialized views are not ideal when query latency and data consistency are top priorities.
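In classic PostgreSQL, for example, the precomputation and the refresh are two separate steps (a sketch, reusing the hypothetical pageviews table):

```sql
-- Precompute the aggregation once...
CREATE MATERIALIZED VIEW page_views_by_channel AS
SELECT channel, count(*) AS pageview_count
FROM pageviews
GROUP BY channel;

-- ...then re-run it periodically; results are stale in between refreshes.
REFRESH MATERIALIZED VIEW page_views_by_channel;
```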
Materialize solves this problem by allowing you to define a materialized view on top of incoming data and then incrementally updating it as new data comes in. So, rather than recalculating the answer each time it's asked, Materialize continually updates the answer and gives you its current state from memory — even in the presence of complex joins and arbitrary inserts, updates, or deletes in the input streams.
What is Materialize?
Materialize is a streaming database for real-time applications. It allows you to work with streaming data from multiple external sources using nothing but standard SQL.
You write arbitrarily complex queries; Materialize takes care of maintaining the results automatically up to date as new data arrives.
Your applications can query Materialize to get blazing-fast results, often with sub-second latency.
Taking it out for a spin
As a developer, you work with Materialize as if it were a PostgreSQL database. Materialize is wire-compatible with Postgres, allowing you to use a CLI tool like mzcli (a wrapper around psql) to issue ANSI-92 compatible SQL statements against it.
You define your workflow by creating a SOURCE. Sources represent connections to resources outside Materialize that it can read data from. Materialize currently supports a number of source types, including the Kafka source used in this example.
In our earlier example, we can create a source to read JSON-formatted page view events from a Kafka topic like this.
CREATE SOURCE raw_pageviews FROM KAFKA BROKER 'kafka:9092' TOPIC 'page_views' FORMAT BYTES;
The raw_pageviews source produces data as a single column containing JSON-encoded bytes. To extract the JSON fields of each page view event, you can use the built-in jsonb operators:
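A sketch of such a view, following the decoding pattern from Materialize's documentation (the field names are assumptions):

```sql
CREATE VIEW pageviews AS
SELECT
    (pageview->>'user_id')::int      AS user_id,
    pageview->>'channel'             AS channel,
    (pageview->>'ts')::timestamp     AS ts
FROM (
    -- FORMAT BYTES delivers raw bytes; decode them into a jsonb value first
    SELECT convert_from(data, 'utf8')::jsonb AS pageview
    FROM raw_pageviews
);
```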
The above is a non-materialized view, which doesn’t store the query results but simply provides an alias for the embedded SELECT statement. We can now use this view as a base to create a materialized view that computes the pages with the highest traffic.
CREATE MATERIALIZED VIEW page_views_by_channel AS
SELECT channel,
       count(*) AS pageview_count
FROM pageviews  -- assumes the parsed view is named pageviews
GROUP BY channel;
Dataflows, arrangements and indexes
The page_views_by_channel view embeds a query, just like a traditional SQL view. In contrast, though, it computes and incrementally updates the results of the embedded query as new data streams in. That lets users read from the view and receive fresh answers with incredibly low latency.
The moment you create a materialized view, Materialize creates a dataflow. You can think of a dataflow as a topology of ongoing transformations that tells Materialize how to derive the final query output. Once executed, the dataflow computes and stores the result of the SQL query in memory, polls the source for updates, and incrementally updates the query results when new data arrives.
An index is a component that actually “materializes” a view by storing its results in memory. Each materialized view contains at least one index that maintains the embedded query’s result in memory; the continually updated indexes are known as “arrangements” within Materialize’s dataflows.
In the simplest case, the arrangement is the last operator and simply stores the query’s output in memory. In more complex cases, arrangements let Materialize perform more sophisticated operations, such as JOINs, more quickly.
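For instance, you can materialize an existing non-materialized view yourself by creating an index on it (a sketch; syntax per Materialize's documentation at the time of writing):

```sql
-- Builds an in-memory arrangement of the view's results,
-- effectively "materializing" the plain pageviews view
CREATE DEFAULT INDEX ON pageviews;
```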
For a deep dive on these concepts, see API Components.
Reading the output of a materialized view
When reading from a materialized view, Materialize simply returns the dataflow’s current result set. In our example:
SELECT * FROM page_views_by_channel;
If you re-run the SELECT statement at different points in time, you can see the updated results based on the latest data.
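Instead of polling with repeated SELECTs, you can also stream every change as it happens using TAIL, Materialize's non-standard extension for subscribing to view updates (a sketch):

```sql
-- Streams each change to the view's result set as it occurs
COPY (TAIL page_views_by_channel) TO STDOUT;
```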
You can use a sink to stream data out of Materialize, from either sources or views. Sinks are the inverse of sources: they represent a connection to an external stream into which Materialize writes data. When a user defines a sink over a materialized view or source, Materialize automatically generates the required schema and writes the resulting stream of changes to the external system. In effect, Materialize sinks act as change data capture (CDC) producers for the given source or view.
As of now, only the Kafka sink is supported.
If we need to stream the output of our materialized view to a Kafka topic, the resulting sink definition will look like this.
CREATE SINK dashboard_sink
FROM page_views_by_channel
INTO KAFKA BROKER 'localhost:9092' TOPIC 'pageviews-sink'
FORMAT AVRO USING
CONFLUENT SCHEMA REGISTRY 'http://localhost:8081';
Joining multiple data sources
Materialize supports all types of SQL joins, just like a traditional relational database, including complicated multiple-way (e.g., 6-way, 12-way) joins across disparate data sources.
We can join our pageviews view with a users source to enrich the final result. The resulting query would look like this.
CREATE MATERIALIZED VIEW pageviews_by_user_segment AS
SELECT users.segment,  -- assumes a users(id, segment) source
       pageviews.channel,
       date_trunc('hour', pageviews.ts) AS ts_hour,
       count(*) AS pageview_count
FROM users
JOIN pageviews ON pageviews.user_id = users.id
GROUP BY 1, 2, 3;
Materialize has disrupted the stream processing market by giving developers a simplified environment to build their real-time applications with standard SQL.
That eliminates the need to build and maintain a tangled web of microservices to perform complex operations on streaming data. All you need is familiarity with SQL.
Materialize lets you perform complicated 12-way joins across disparate data sources. That saves your data team from building and maintaining expensive ETL pipelines for data denormalization. You stream data into Materialize in its natural form and join it as you see fit.
All in all, Materialize is an ideal choice for performing low-latency OLAP queries over ever-changing streams of data. You can also explore additional use cases where Materialize has proven its capabilities.