Operational Use Case Patterns for Apache Kafka and Flink — Part 1

This is the first post in a series on building operational use cases with Apache Kafka and Apache Flink.

Dunith Danushka
Tributary Data
Jan 3, 2023



Apache Kafka is a distributed streaming data platform. It is designed for high-throughput, low-latency streaming workloads, where scalable real-time data ingestion and fault-tolerant storage are critical.

Apache Flink is an open-source, unified stream-processing and batch-processing framework capable of executing arbitrary dataflow programs on data streams.

Kafka and Flink complement each other nicely, allowing Kafka to act as an upstream data source for Flink. The Flink Kafka Connector enables reading data from and writing data to Kafka topics with exactly-once guarantees. This combination has been very popular in the industry for building real-time analytics use cases, including streaming ETL, streaming analytics, operational monitoring, and anomaly detection.

In this article, I will walk you through a few patterns for building operational use cases with Kafka and Flink. Operational use cases differ from typical analytical use cases in that they directly impact business operations. Therefore, the message ordering, low latency, and exactly-once delivery guarantees offered by both technologies are crucial here.

The patterns are discussed at a high level, with the goal of conveying the core idea behind each.

Pattern 1: CQRS/read-optimized views

The CQRS pattern, or read-write segregation, is a common pattern in microservices data management. The write path (command handling) of a microservice is often very different from its read path (query handling). The write path tries to ensure data consistency, so it enforces validation rules and performs transactional writes to multiple normalized tables. The read path, in contrast, focuses on searching for and reading data fast at scale, where multi-table joins become a source of friction.

To solve this mismatch, developers resort to polyglot persistence: different data stores for reads and writes, with commands and queries segregated between them. For example, a relational database can handle all the write operations for a service, while a key-value store or a document database handles the queries. That allows a service’s read and write sides to evolve without depending on a single data model.

But how do you synchronize writes with the read side? Let’s find out.

Pattern implementation

This pattern uses a relational database for writes and a NoSQL database for reads. Changes in the write database are captured in real time with Debezium, a Change Data Capture (CDC) framework built on Kafka Connect. The captured changes are written to a Kafka topic, allowing Flink to perform optional data massaging, such as joins and enrichment, and write the results back to another Kafka topic. Finally, Kafka Connect synchronizes the processed output with the destination database, which can be a NoSQL database or a real-time OLAP database like Apache Pinot.

Kafka and Flink together ensure exactly-once semantics end to end.
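
As a rough sketch of the Flink leg of this pipeline: the job below reads Debezium change events from a topic, applies a placeholder transformation, and writes the result to an output topic with an exactly-once Kafka sink. The topic names (orders.cdc, orders.enriched), the broker address, and the trivial map step are illustrative assumptions, not part of the pattern itself.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CdcToReadViewJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // checkpointing is required for the exactly-once sink

        // Debezium writes change events here via Kafka Connect (assumed topic name)
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("orders.cdc")
                .setGroupId("read-view-builder")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Processed output, consumed downstream by a Kafka Connect sink connector
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("orders.enriched")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("read-view-builder")
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "cdc-source")
           .map(change -> change.trim()) // placeholder for joins, enrichment, reshaping, etc.
           .sinkTo(sink);

        env.execute("CQRS read-view builder");
    }
}
```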

Challenges

There can be a slight replication lag between the write and read sides, so the read side may briefly serve stale data. For more on this trade-off, see Gunnar Morling’s talk at Current 22.

Pattern 2: Async task reply

Assume that your web/mobile application UI accepts a task request from a user, such as uploading an image or creating a profile. The backend takes considerable time to process the task, so it can’t be completed within the scope of a single HTTP request.

The typical pattern is that the backend accepts the request, validates it, and returns an HTTP 202 status to the UI, denoting that the task has been accepted for processing. The UI then periodically polls the backend for the processing status and updates itself accordingly.

What if the backend sends the task processing status to the UI asynchronously over a protocol like WebSockets or Server-sent Events (SSE)?

Pattern implementation

In this pattern, we put an API in front to receive the task from the UI. Once the task is received, the API or a microservice places it on the tasks.new Kafka topic for processing. The UI then establishes a WebSocket connection to receive the task processing status asynchronously.
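
As a minimal sketch of the ingestion step, the producer below is what the API or microservice might use after validating a request. The tasks.new topic name comes from the pattern; keying by user ID, the broker address, and the surrounding HTTP handler are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TaskIngestionService {
    private final KafkaProducer<String, String> producer;

    public TaskIngestionService() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("enable.idempotence", "true"); // avoid duplicate tasks on producer retries
        this.producer = new KafkaProducer<>(props);
    }

    /** Called by the HTTP handler after validation; the handler then returns 202 Accepted. */
    public void submitTask(String userId, String taskJson) {
        // Keying by user id keeps one user's tasks ordered within a partition
        // and lets the WebSocket bridge route results back to the right session.
        producer.send(new ProducerRecord<>("tasks.new", userId, taskJson));
    }
}
```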

A group of Flink applications arranged as a Kafka consumer group consumes tasks from the tasks.new topic, validates and processes them, and places the results on one of two Kafka topics, tasks.processed or tasks.failed, based on the processing outcome.
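
Here is a minimal sketch of such a Flink job, using a side output to split results between the two topics. The topic names come from the pattern; the string-based validation check is a stand-in for real business logic.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class TaskProcessorJob {
    // Failed tasks go to a side output instead of the main stream.
    static final OutputTag<String> FAILED = new OutputTag<String>("failed") {};

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("tasks.new")
                .setGroupId("task-processors") // parallel subtasks share this consumer group
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        SingleOutputStreamOperator<String> processed = env
                .fromSource(source, WatermarkStrategy.noWatermarks(), "tasks")
                .process(new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String task, Context ctx, Collector<String> out) {
                        if (task.contains("\"valid\":true")) { // stand-in for real validation
                            out.collect(task);
                        } else {
                            ctx.output(FAILED, task);
                        }
                    }
                });

        processed.sinkTo(topicSink("tasks.processed"));
        processed.getSideOutput(FAILED).sinkTo(topicSink("tasks.failed"));

        env.execute("async task processor");
    }

    private static KafkaSink<String> topicSink(String topic) {
        return KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic(topic)
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();
    }
}
```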

A Kafka-to-WebSocket bridge is another service that consumes responses from both topics and relays them to the WebSocket channel the UI is connected to. When the UI receives the task status (success or failure), it shows it as a notification.
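
The bridge itself can be a small, plain Kafka consumer. The sketch below assumes records are keyed by user ID and that a WebSocket endpoint (omitted here) registers open sessions in a shared map when clients connect; everything beyond the topic names is an assumption.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import jakarta.websocket.Session;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaWebSocketBridge {
    // userId -> open WebSocket session, populated by the (omitted) @OnOpen handler
    static final Map<String, Session> SESSIONS = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "ws-bridge");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("tasks.processed", "tasks.failed"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    Session session = SESSIONS.get(rec.key()); // route by user id
                    if (session != null && session.isOpen()) {
                        // The topic name tells the UI whether the task succeeded or failed.
                        session.getAsyncRemote().sendText(
                                "{\"topic\":\"" + rec.topic() + "\",\"task\":" + rec.value() + "}");
                    }
                }
            }
        }
    }
}
```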

Challenges

Bridging the security models of WebSockets and Kafka (SASL) can be tricky. Think carefully before implementing fine-grained authorization all the way back to Kafka when additional layers, such as APIs and microservices, sit in the middle.

Pattern 3: Backend rate limiter

Suppose a use case requires you to write an application that consumes records from Kafka and writes them to an external system, such as a cloud database (AWS RDS, for example). If the message arrival rate in Kafka is higher than the write throughput provisioned for the database, the database will soon run out of IOPS and halt write operations.

Assuming that database write operations are not latency-sensitive, how would you design the Kafka consumer to rate-limit the incoming messages so that the database doesn’t get exhausted?

Pattern implementation

You can implement a Flink application as the Kafka consumer here, consuming messages from a Kafka topic and writing to the database via the Flink JDBC Connector. Use windowing in Flink to buffer messages into small batches and perform batch writes.

For example, you can buffer messages in a one-minute window and then flush the contents of the buffer through the connector.
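
A minimal sketch of that consumer follows. Rather than an explicit one-minute window, this version leans on the JDBC connector's built-in batching, flushing when the buffer hits a size limit or once a minute, whichever comes first, which gives the same buffered, rate-limited writes; the topic name, table, and connection details are illustrative assumptions.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RateLimitedDbWriterJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("events")          // assumed source topic
                .setGroupId("db-writer")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "events")
           .addSink(JdbcSink.<String>sink(
                   "INSERT INTO events (payload) VALUES (?)", // assumed target table
                   (stmt, payload) -> stmt.setString(1, payload),
                   JdbcExecutionOptions.builder()
                           .withBatchSize(500)          // flush when 500 rows are buffered...
                           .withBatchIntervalMs(60_000) // ...or once a minute, whichever comes first
                           .withMaxRetries(3)
                           .build(),
                   new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                           .withUrl("jdbc:postgresql://my-rds-host:5432/appdb") // assumed RDS endpoint
                           .withDriverName("org.postgresql.Driver")
                           .withUsername("app")
                           .withPassword("secret")
                           .build()));

        env.execute("rate-limited database writer");
    }
}
```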

Taking things to the next level with Redpanda and Decodable

There’s no doubt that Kafka and Flink are two of the great technologies of the past decade. But sometimes we wish they were simpler, more performant, more cost-effective, and closer to most developers.

Redpanda and Decodable are two technologies that offer better performance, lower TCO, and more simplicity than Kafka and Flink while maintaining API compatibility with them. In other words, you can swap Kafka for Redpanda and Flink for Decodable in these patterns.

Summary

This post discussed how you can leverage Kafka and Flink to build three operational use cases. I’ve covered the how-tos and gotchas for each pattern, so take care when building them in production.

You can also try these patterns with a different streaming data platform and stream processor, provided they support the basic features discussed here (e.g., Pulsar with Flink).

Of course, the list doesn’t end here. I will continue to explore more patterns in future posts in this series. Feel free to contribute your own patterns and ideas so that we can make this a great series.


Dunith Danushka
Tributary Data

Editor of Tributary Data. Technologist, Writer, Senior Developer Advocate at Redpanda. Opinions are my own.