Kafka is a beast to learn. Although the core of Kafka remains fairly stable over time, the frameworks around Kafka move at the speed of light.
A few years ago, Kafka was really simple to reason about: producers and consumers. Now Kafka Connect, Kafka Streams and KSQL have been added to the mix. Do they replace the Producer and Consumer APIs, or complement them?
Let’s make sense of it all!
Using the right Kafka API
I identify 5 types of workloads in Apache Kafka, and in my opinion each corresponds to a specific API:
- Kafka Producer API: Applications directly producing data (ex: clickstream, logs, IoT).
- Kafka Connect Source API: Applications bridging between a datastore we don’t control and Kafka (ex: CDC, Postgres, MongoDB, Twitter, REST API).
- Kafka Streams API / KSQL: Applications that consume from Kafka and produce back into Kafka, also called stream processing. Use KSQL if you can express your real-time job in SQL-like statements; use the Kafka Streams API if you need to write complex logic for your job.
- Kafka Consumer API: Read a stream and perform real-time actions on it (e.g. send an email…)
- Kafka Connect Sink API: Read a stream and store it into a target store (ex: Kafka to S3, Kafka to HDFS, Kafka to PostgreSQL, Kafka to MongoDB, etc.)
You may want to do things differently, and it’s possible you will make it work. For example, Kafka Consumer and Kafka Connect Sink API are quite interchangeable, if you’re willing to write a lot of custom code for your needs.
Overall, the guidelines above should help you achieve the most efficient workflows with the least amount of code and frustration.
Kafka Producer API
The Kafka Producer API is extremely simple to use: you send data asynchronously and get a callback. This is perfectly suited for applications directly emitting streams of data, such as logs, clickstreams and IoT events.
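To make the "send data, get a callback" model concrete, here is a minimal sketch using the Java client. The broker address (`localhost:9092`), the `clickstream` topic and the payload are assumptions for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClickstreamProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: a local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("clickstream", "user-123", "{\"page\": \"/home\"}");
            // send() is asynchronous; the callback fires once the broker acknowledges (or fails)
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("written to %s-%d@%d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any buffered records
    }
}
```

Note that `send()` returns immediately; records are batched in the background, which is what makes the producer fast.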
It is very common to use this kind of API in combination with a proxy (such as the Confluent REST Proxy), so that clients that cannot speak the Kafka protocol can still produce data over HTTP.
The Kafka Producer API can be extended and built upon to do a lot more, but that requires engineers to write a lot of added logic. The biggest mistake I see is people trying to perform ETL between a database and Kafka using the Producer API. Here are a few things that are not easy to do:
- How to track the source offsets? (i.e. how to properly resume your producer if it was stopped)
- How to distribute the load for your ETL across many producers?
For these, we’re much better off using the Kafka Connect Source API.
Kafka Connect Source API
The Kafka Connect Source API is a whole framework built on top of the Producer API. It was built so that developers would get a nicer API made for 1) distributing producer tasks for parallel processing, and 2) an easy mechanism to resume your producers. The final goodie is the bustling variety of available connectors you can leverage today to onboard data from most of your sources, without writing a single line of code.
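"Without a single line of code" means a connector is deployed as a JSON configuration posted to the Connect REST API. As an illustration, here is a sketch of a config for the Confluent JDBC source connector streaming a Postgres table into Kafka; the connector name, database URL and column name are assumptions:

```json
{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "3",
    "connection.url": "jdbc:postgresql://localhost:5432/shop",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "postgres-"
  }
}
```

`tasks.max` is the parallelism knob, and Connect tracks source offsets (here, the last `id` seen) for you — exactly the two hard problems listed above.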
If you can’t find an available source connector for one of your sources, perhaps because you’re using a very proprietary system in your environment, then you will have to write your own source connector. Writing your own source connector is actually very enjoyable; debugging it, much less so.
Kafka Consumer API
The Kafka Consumer API is dead-simple and works using consumer groups, so that your topics can be consumed in parallel. You need to be careful about a few things, such as offset management and commits, as well as rebalances and idempotence constraints, but consumers are really easy to write. For any stateless kind of workload, they will be perfect. Think notifications!
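A minimal sketch of such a stateless consumer, showing the offset-management care mentioned above. The broker address, `orders` topic, group name and `sendEmail` helper are all assumptions for illustration:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class NotificationConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: a local broker
        props.put("group.id", "notification-service");     // consumer group: run several instances in parallel
        props.put("enable.auto.commit", "false");          // commit manually, after processing
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    sendEmail(record.value()); // make this idempotent: a rebalance can cause redelivery
                }
                consumer.commitSync(); // commit offsets only once processing has succeeded
            }
        }
    }

    private static void sendEmail(String payload) { /* hypothetical helper */ }
}
```

Committing after processing gives at-least-once delivery, which is why the action itself should be idempotent.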
When you perform some kind of ETL, Kafka Connect Sinks are better suited, as they save you from writing complicated logic against an external data source.
Kafka Connect Sink API
Similarly to the Kafka Connect Source API, the Kafka Connect Sink API allows you to leverage the ecosystem of existing Kafka connectors to perform your streaming ETL without writing a single line of code. The Kafka Connect Sink API is built on top of the Consumer API, but does not look that different from it.
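As with sources, a sink is just a JSON configuration. Here is a sketch for the Confluent S3 sink connector, one of the "Kafka to S3" examples above; the bucket, region and topic names are assumptions:

```json
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "2",
    "topics": "clickstream",
    "s3.bucket.name": "my-kafka-archive",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}
```

Connect handles the consumer group, offset commits and retries; you only choose the destination format and batching (`flush.size`).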
If the data sink you’re writing to does not have an available connector (yet), you will have to write a Kafka Connect Sink (or consumer, if you prefer), and the debugging process might be a bit more complicated.
Kafka Streams API
If you want to get into the stream processing world, meaning reading data from Kafka in real-time and, after processing it, writing it back to Kafka, you would most likely pull your hair out using the Kafka Consumer API chained with the Kafka Producer API. Thankfully, the Kafka project now ships with the Kafka Streams API (available for Java and Scala), which lets you write either a high-level DSL (resembling a functional programming / Apache Spark style of program) or the low-level Processor API (closer to Apache Storm). The Kafka Streams API does require you to code, but completely hides the complexity of maintaining producers and consumers, allowing you to focus on the logic of your stream processors. It also comes with joins, aggregations and exactly-once processing capabilities!
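The high-level DSL can be sketched with the classic word-count example below. The broker address and the `sentences-input` / `word-counts-output` topic names are assumptions; the DSL calls (`flatMapValues`, `groupBy`, `count`) are the real Kafka Streams API:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "word-count");           // names the consumer group & state stores
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumption: a local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("sentences-input");
        KTable<String, Long> counts = lines
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
            .groupBy((key, word) -> word)
            .count(); // stateful: backed by a state store and a changelog topic
        counts.toStream().to("word-counts-output", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Notice there is no explicit producer or consumer anywhere: the topology declares the dataflow, and the library runs it.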
You will still have to write some code, and it can become quite messy and complicated. Until recently it was hard to unit test Kafka Streams applications, but this can now be achieved using the test-utils library. Finally, although Kafka Streams looks simple, it is actually quite a beast under the hood: it will create state stores, most likely backed by Kafka changelog topics. This means that, depending on how complicated your topology is, your Kafka cluster may have to process a lot more messages — as an added benefit, though, you get “stateless” and fully resilient applications.
KSQL
KSQL is not directly part of the Kafka API but a wrapper on top of Kafka Streams; it is still very worth mentioning here. While Kafka Streams allows you to write complex topologies, it requires substantial programming knowledge and can be harder to read, especially for newcomers. KSQL abstracts that complexity away by providing you with SQL semantics (not ANSI SQL) close to what you already know. I have to admit it is extremely tempting to use and makes your stream processors a breeze to write. Remember, though, that this is not batch SQL but streaming SQL, so a few caveats will appear.
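To give a flavour of the SQL semantics, here is a sketch of a continuously running KSQL query; the `pageviews` topic and its JSON schema are assumptions:

```sql
-- Register an existing Kafka topic as a stream (assuming JSON-encoded events)
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

-- A streaming aggregation: counts update continuously as new events arrive,
-- and results are materialized into a backing Kafka topic
CREATE TABLE page_counts AS
  SELECT page, COUNT(*) AS views
  FROM pageviews
  GROUP BY page;
```

Unlike batch SQL, this `SELECT` never terminates — that is the kind of caveat to keep in mind.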
If you want to perform complex transformations, explode arrays, or need a feature that’s not yet available, you’ll sometimes have to fall back to Kafka Streams. The library is, however, developing at the speed of light, so I expect these feature gaps to be filled quickly.
I hope this article helped you understand which Kafka API is appropriate for your use case, and why. If you want to learn Kafka (…and Kafka Connect, Streams & more), check out my courses at https://kafka-tutorials.com !
If you liked this article, don’t forget to clap and share!