Building CQRS Views with Debezium, Kafka, Materialize, and Apache Pinot — Part 1
How do you build an incrementally updated materialized view that serves queries quickly and at scale?
Microservices architecture promotes a decentralized data management practice, where each service keeps its data private and exposes it only via well-defined API interfaces. Although that autonomy is for the greater good, it makes queries that span multiple service boundaries challenging to implement.
A Microservice often has to contact several dependent services to fulfill a read request. For example, the ShippingService queries the CustomerService to retrieve the customer’s address. These synchronous invocations add latency and brittleness to the entire architecture.
Alternatively, a Microservice can keep a local copy of the dependent data it frequently needs to answer queries. Materialized views are one way to achieve that, and we can keep them incrementally updated as the data changes in dependent services.
Incrementally updated materialized views are attractive for Microservices because:
- They can eliminate inter-service communication while fulfilling a read request (query).
- They can keep materialized views in sync when paired with the right stream processing technology.
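To make the idea concrete, here is a minimal sketch in plain SQL, assuming hypothetical orders and customers tables: a pre-joined view that lets the ShippingService from the earlier example answer a read without a remote call.
-- A minimal sketch with hypothetical table and column names: a local,
-- pre-joined view the ShippingService could read instead of calling
-- the CustomerService.
CREATE MATERIALIZED VIEW shipping_orders AS
SELECT o.order_id,
       o.placed_at,
       c.street,
       c.city,
       c.postal_code
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
A traditional database materialized view like this must be refreshed explicitly to pick up new data; the rest of this series is about keeping such a view incrementally updated as changes stream in.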
This two-part article series walks you through building a streaming ETL pipeline to maintain a read-optimized CQRS view that can answer queries faster. We will use Debezium to ingest dependent data into Kafka, Materialize to build and maintain materialized views, and Apache Pinot to serve enriched data at scale.
This post explores the problem space, designs a pragmatic solution to address it, and introduces the tool stack we will use to build the solution in Part 2.
Use case: building an online pizza order tracker
I was inspired to write this post after ordering a pizza from a local pizza joint and anxiously waiting for the order to arrive because I was hungry :)
My local pizza joint has a web interface similar to the following, allowing real-time tracking of a pizza order’s status.
While waiting for my pizza, I took a piece of paper and started sketching how I could design and build such an order tracking system to serve thousands of hungry customers like me.
The UI should display the following information.
- Order details, including the order number, timestamp, and total amount.
- Ordered items along with their quantities.
- Current status of the order, along with a history of order status changes.
Assuming we use a Microservices-based architecture for the implementation (duh!), we need at least three Microservices to populate the UI.
- OrderService — Provides information related to the order and its items.
- KitchenService — Provides order status changes.
- DeliveryService — Provides order delivery updates.
With this in mind, let’s consider different architectural styles for building the status tracking UI.
Option 1 — Using service orchestration
The lowest-hanging solution is to have services communicate with each other to assemble the order status information. When the UI contacts the OrderService, it will invoke the KitchenService and DeliveryService to pull in the necessary information.
That kind of service orchestration often leaves the OrderService waiting on, unmarshalling, and joining responses from dependent services, which adds latency and brittleness to the architecture. The scalability and reliability of the solution are capped by its lowest-performing dependent service.
The solution will be Internet-facing, demanding more scalability and performance as traffic grows. So we will eliminate this option.
Option 2 — Using a batch data pipeline
Instead of pulling information dynamically, what if we pre-compute the results for the UI?
Well, that could be an option too. We can write a batch ETL job that extracts information from all services and joins it to form a denormalized table, keyed by order ID. That enables the OrderService to quickly look up an order summary with a single call, using only the order ID.
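As a rough sketch, assuming hypothetical table and column names, the batch job would periodically rebuild something like this:
-- Hypothetical batch output: a denormalized summary table keyed by order ID,
-- rebuilt on a schedule from data extracted from each service's database.
CREATE TABLE order_summary AS
SELECT o.order_id,
       o.created_at,
       o.total_amount,
       k.kitchen_status,
       d.delivery_status
FROM orders o
LEFT JOIN kitchen_updates k ON o.order_id = k.order_id
LEFT JOIN delivery_updates d ON o.order_id = d.order_id;

-- The OrderService then answers the UI with a single primary-key lookup:
SELECT * FROM order_summary WHERE order_id = 1001;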
Lack of data freshness is the main problem here. Increasing the frequency of the batch job produces more recent results on the UI, but it also incurs additional computation costs.
Option 3 — Using incrementally updated materialized views
Although data freshness was a concern, Option 2 brought up a decent idea: pre-computing the order status UI for every order.
What if the OrderService maintains the data that belongs to its downstream services, the KitchenService and DeliveryService? That allows the OrderService to serve queries faster by looking up only local data. Moreover, it can keep that data consistent by subscribing to the state changes of the downstream services and updating the data accordingly.
That is what we are going to do in Option 3. We will create a read-optimized materialized view inside the OrderService that pre-computes the UI for each order. The view includes attributes from all dependent services and is keyed by the order ID, enabling it to serve UI queries with a fast primary-key lookup.
How do we keep the materialized view up to date? We will have the OrderService listen to state changes in downstream services and update the materialized view accordingly.
We will use Debezium and Kafka to capture state changes from all Microservices as event streams and use Materialize to build a materialized view by joining those streams together. Materialize will keep the view up to date as new events arrive. We will stream the view’s changes back to Kafka and have Apache Pinot ingest them to serve the UI queries.
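As a preview of what Part 2 builds, the whole pipeline is expressed in SQL. This sketch follows the syntax of the older self-hosted Materialize releases, and the broker address, topic, source, and column names are all hypothetical:
-- Hypothetical sketch: ingest a Debezium changelog topic as a Materialize source.
CREATE SOURCE orders
FROM KAFKA BROKER 'kafka:9092' TOPIC 'mysql.pizzashop.orders'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
ENVELOPE DEBEZIUM;

-- (A second source for kitchen status changes would be created the same way.)

-- Join the change streams; Materialize keeps the view incrementally updated
-- as new change events arrive.
CREATE MATERIALIZED VIEW order_summary AS
SELECT o.order_id, o.total_amount, k.status AS kitchen_status
FROM orders o
JOIN order_status_changes k ON o.order_id = k.order_id;

-- Stream the view's changes back out to Kafka for Pinot to ingest.
CREATE SINK order_summary_sink
FROM order_summary
INTO KAFKA BROKER 'kafka:9092' TOPIC 'order_summary'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';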
Tutorial — Building an incrementally updated materialized view
Now let’s see how to implement a read-optimized view as discussed in Option 3. The good news is that the complete solution is already available. All you need to do is clone it, run it, and apply the necessary configurations to get it working.
You can find the completed solution in the following Git repository, inside the cqrs-views directory.
https://github.com/dunithd/edu-samples
The solution comes as a Docker Compose project. So, make sure you have Docker Desktop installed, with at least 8GB of RAM and 6 CPU cores allocated to the Docker daemon so that everything works properly.
Clone the Git repository as follows or download an archive.
git clone git@github.com:dunithd/edu-samples.git
Type the following in a terminal to create and start all application containers.
cd edu-samples/cqrs-views
docker compose up -d
The above command will bring up the following containers.
- mysql — Contains orders and order items
- zookeeper — Required for Kafka and Pinot
- kafka — Apache Kafka
- schema-registry — Confluent Schema Registry for Debezium and Materialize
- debezium — Kafka Connect image with the Debezium MySQL connector pre-installed
- materialize — Materialize
- pinot-* — Pinot controller, broker, and server
After that, verify the overall health of the Docker stack by typing:
docker compose ps
There are three essential steps to this solution.
- Event-enabling Microservices data: Have Microservices stream their data into Kafka, using Debezium.
- Build and maintain a materialized view: Join the data streams with Materialize to build an incrementally updated materialized view.
- Serve the materialized view: Capture the changes in the view into an Apache Pinot table and use that to serve UI queries (a sample lookup query is sketched after this list).
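For a taste of that final step (the table and column names here are hypothetical), the UI query against Pinot ends up as a simple key lookup:
-- Hypothetical UI query served by Pinot: a fast lookup by order ID.
SELECT order_id, created_at, total_amount, order_status
FROM order_summary
WHERE order_id = 1001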
Part 2 of this blog explains these steps in detail while walking you through the necessary configurations to get the expected outcome.
Head over to part 2