Taager’s Foray into Messaging, Part 1: Apache Kafka vs Apache Pulsar

Published in Taager Tech Blog · 7 min read · Jan 17, 2022

Apache Kafka has been the product of choice for messaging (pub/sub) systems in the tech industry for a while. A large ecosystem of associated tools, a vibrant and growing community, and extensive enterprise-level adoption across the industry have been its major calling cards. However, Apache Pulsar, a relatively new entrant (although only by a couple of years) that was initially developed as a queuing system, has lately been broadened to provide features that Apache Kafka does not have, according to streamnative.io.

Taager recently faced a scenario where our growing backend infrastructure (multiple microservices) required a messaging solution to help us scale out. Kafka, being the more famous name, was the first thing that came to mind, but we decided to give Pulsar a run as well, and it seems that Kafka might not hold the de facto position of preferred messaging solution in the tech industry for long.

Architecture:

Kafka and Pulsar share a similar core architecture: producers publish messages to a broker, and consumers read them from it. According to the documentation of both, the broker is stateless (although that is not entirely accurate, as the Brokers section explains). To coordinate cluster state, both use Apache ZooKeeper. For Kafka this is the legacy mode; the latest release can run without ZooKeeper.

Pulsar also depends on another distributed component, Apache BookKeeper. Pulsar uses BookKeeper nodes (bookies) to store the actual messages and cursor positions, and ZooKeeper for metadata storage. Although BookKeeper uses RocksDB as an embedded database, RocksDB is not an independently managed component.

Pulsar’s additional moving parts give Kafka an edge in ease of maintenance and scalability. Scalability was already one of Kafka’s strong suits, and removing the ZooKeeper dependency has only improved it.

Brokers:

Technically, the Kafka broker is not stateless. Each broker stores the log of the partitions it hosts. Hence, when we add a new broker to a Kafka cluster, it takes time for partition data to sync to it.

Pulsar, on the other hand, does provide stateless brokers, because the distributed log is stored in BookKeeper rather than on the broker. Hence, users can easily add new brokers to scale up with higher demand.
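
To make the difference concrete, here is a minimal toy model in Python. It is only a sketch under simplifying assumptions: the class names, method names, and broker names are hypothetical and do not correspond to any Kafka or Pulsar API. It only counts how many stored messages have to be copied when work moves to a new broker.

```python
from dataclasses import dataclass, field


@dataclass
class StatefulCluster:
    """Toy Kafka-like cluster: each broker holds the log of its partitions."""
    logs: dict = field(default_factory=dict)  # broker -> {partition: [messages]}

    def move_partition(self, partition, source, target):
        # Moving ownership means copying every stored message to the target.
        data = self.logs[source].pop(partition)
        self.logs.setdefault(target, {})[partition] = list(data)
        return len(data)  # number of messages that had to be copied


@dataclass
class StatelessCluster:
    """Toy Pulsar-like cluster: brokers own topics, a shared store holds data."""
    store: dict = field(default_factory=dict)  # topic -> [messages] (the "bookies")
    owner: dict = field(default_factory=dict)  # topic -> broker

    def move_topic(self, topic, target):
        # Only the ownership pointer changes; no message is copied.
        self.owner[topic] = target
        return 0


kafka_like = StatefulCluster(logs={"broker-1": {"p0": ["m1", "m2", "m3"]}})
pulsar_like = StatelessCluster(store={"t0": ["m1", "m2", "m3"]},
                               owner={"t0": "broker-1"})

copied_stateful = kafka_like.move_partition("p0", "broker-1", "broker-2")
copied_stateless = pulsar_like.move_topic("t0", "broker-2")
print(copied_stateful, copied_stateless)
```

In the stateful model the cost of adding a broker grows with the amount of stored data, while in the stateless model it stays constant.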

Partitions:

Kafka works on the model of partitions and consumer groups. By spreading data among several partitions and groups, we can achieve more throughput and lower latency. However, working out the right partition and group breakdown can be complex for simple use cases.
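
As an illustration of how data is spread over partitions, the following sketch routes keyed messages to partitions by hashing the key. The function names and sample keys are hypothetical, and CRC32 is used here only because it is deterministic across runs; it is an assumption for the example, not the hash function of Kafka’s default partitioner.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Same key always maps to the same partition (for a fixed partition count).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Append each (key, value) message to the partition chosen by its key."""
    partitions = {p: [] for p in range(num_partitions)}
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


orders = [
    ("merchant-a", "created"),
    ("merchant-b", "created"),
    ("merchant-a", "paid"),
    ("merchant-a", "shipped"),
]
partitions = route(orders, num_partitions=3)
for partition, contents in partitions.items():
    print(partition, contents)
```

Because every message for `merchant-a` lands in one partition, its events keep their relative order there, while different keys can be processed in parallel on other partitions.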

In Pulsar, we also have the concept of partitioned topics, which are distributed among multiple brokers to increase throughput. However, for a more straightforward use case in Pulsar, we are not required to specify the number of partitions or think about how many consumers the topic might have. Instead, we can add as many consumers as we want on a topic, with Pulsar keeping track of it all.

Another point to consider is that, in a running cluster, the number of partitions can only be increased, not decreased; this holds for both Kafka and Pulsar. In Pulsar, moreover, it can only be done if the topic is not global.

Streaming or Queuing:

Apache Pulsar can act as a message queuing system in addition to serving the real-time streaming use cases it shares with Apache Kafka. This is due to its distributed ledger (BookKeeper), which provides persistent message storage, combined with automatic and custom load balancing across consumers for the messages on a topic. So technically, Pulsar can be used for both real-time streaming and message queuing. These features give it a slight advantage over Kafka as a queuing solution.
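
The queue-style load balancing described above can be pictured with a small round-robin dispatcher, where each message is handed to exactly one consumer. This is a toy model with hypothetical names (`dispatch_shared`, `worker-1`, `worker-2`), not Pulsar’s dispatcher.

```python
from itertools import cycle


def dispatch_shared(messages, consumers):
    """Hand each message to exactly one consumer, in round-robin order."""
    received = {name: [] for name in consumers}
    for message, name in zip(messages, cycle(consumers)):
        received[name].append(message)
    return received


received = dispatch_shared(["m1", "m2", "m3", "m4", "m5"],
                           ["worker-1", "worker-2"])
print(received)
```

Adding a third worker to the list would spread the same messages over three consumers without any change to the topic itself.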

When it comes to stream processing, Kafka is much more mature: Kafka Streams and ksqlDB are good enough to stand in for Apache Spark and Apache Flink. Pulsar, on the other hand, offers Pulsar Functions, which is more straightforward and basic than Kafka’s solution.

Note: Kafka can also work as a queue. Consumer groups allow Kafka to behave like a queue, since each consumer instance in a group processes data from a non-overlapping set of partitions within a Kafka topic.
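
A simplified picture of that non-overlapping assignment is sketched below. The function and consumer names are hypothetical, and the plain round-robin strategy is an assumption for the example; Kafka’s real assignment strategies are more involved.

```python
def assign_partitions(partitions, consumers):
    """Give each consumer in the group a non-overlapping set of partitions."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


assignment = assign_partitions([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3", "c4"])
print(assignment)

# With more consumers than partitions, the extra consumer gets nothing to read.
crowded = assign_partitions([0, 1], ["c1", "c2", "c3"])
print(crowded)
```

The second call shows why the partition count caps the useful size of a group in this model: a consumer without a partition sits idle.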

Push vs Pull:

Kafka is pull-based: consumers pull messages from the server. Long polling, which reduces the downside of polling, lets consumers pick up new messages almost as soon as they arrive. It also makes scaling easier, as new consumers can join a consumer group and start consuming right away.

Pulsar is push-based, relying on the pub-sub pattern. Producers publish messages to the server while subscribed consumers receive messages pushed by the broker.
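
The two delivery styles can be contrasted in a few lines. In this sketch, `ToyBroker` and its methods are hypothetical names, and a standard-library queue stands in for the broker’s log; `poll` blocks until a message arrives or the timeout expires (the long-polling idea), while `subscribe` registers a callback that the broker invokes on every publish (the push idea).

```python
import queue


class ToyBroker:
    """Toy broker offering both a pull (long-poll) and a push interface."""

    def __init__(self):
        self._log = queue.Queue()
        self._listeners = []

    def subscribe(self, callback):
        # Push style: the broker calls the consumer when a message arrives.
        self._listeners.append(callback)

    def publish(self, message):
        self._log.put(message)
        for callback in self._listeners:
            callback(message)

    def poll(self, timeout_s):
        # Pull style with long polling: wait up to timeout_s for a message.
        try:
            return self._log.get(timeout=timeout_s)
        except queue.Empty:
            return None


toy_broker = ToyBroker()
pushed = []
toy_broker.subscribe(pushed.append)

toy_broker.publish("order-created")
pulled = toy_broker.poll(timeout_s=0.1)      # returns the waiting message
nothing_yet = toy_broker.poll(timeout_s=0.1)  # times out, returns None
print(pushed, pulled, nothing_yet)
```

With pull, the consumer decides when to ask and how long to wait; with push, the broker drives delivery and the consumer only reacts.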

Storage:

Both Kafka and Pulsar can store data and serve as long-term storage solutions. However, their underlying storage mechanisms are quite different: Kafka stores its distributed logs across multiple brokers, while Pulsar uses Apache BookKeeper, a distributed log storage solution with RocksDB as an embedded database.

Message:

Delivery Guarantee:

Kafka provides an at-least-once guarantee of message delivery, and exactly-once when used with idempotent producers and transactions or with the Spark direct connector. Pulsar, on the other hand, offers at-least-once, at-most-once, and effectively-once messaging guarantees.
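
One common way to get from at-least-once to effectively-once is for the broker to drop retries it has already stored. The sketch below is a toy model of that idea, assuming each producer numbers its messages with increasing sequence IDs; the class and producer names are hypothetical, and this is not the implementation of either system.

```python
class DedupBroker:
    """Toy broker that ignores retries of messages it has already stored."""

    def __init__(self):
        self.log = []
        self._last_seq = {}  # producer name -> highest sequence id stored

    def publish(self, producer, seq, payload):
        if seq <= self._last_seq.get(producer, -1):
            return False  # duplicate caused by a retry; do not store it again
        self._last_seq[producer] = seq
        self.log.append(payload)
        return True


dedup_broker = DedupBroker()

# At-least-once producer: message 1 is sent twice because its first
# acknowledgement was lost and the producer retried.
attempts = [(0, "a"), (1, "b"), (1, "b"), (2, "c")]
results = [dedup_broker.publish("producer-1", seq, payload)
           for seq, payload in attempts]
print(results, dedup_broker.log)
```

The producer still sends the retry, so delivery is at-least-once on the wire, but the stored log contains each message once.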

Ordering Guarantee:

In Kafka, message order is guaranteed within a partition. In Pulsar, it depends on how the topic is partitioned and whether keys are used: ordering can be guaranteed per key or per producer.

Message Retention:

By default, Pulsar comes with message retention disabled; users can enable it through configuration. The retention limit is bounded by your storage capacity (or by the provider’s limits on a managed cloud solution). Kafka enables retention by default and has no built-in maximum (it is likewise limited by your own or your cloud provider’s storage).
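
Retention is usually expressed as a time limit, a size limit, or both. The following sketch shows what such a policy does to a log; the function and its parameters (`max_age_s`, `max_bytes`) are hypothetical names chosen for the example, not configuration keys of Kafka or Pulsar.

```python
def apply_retention(messages, now, max_age_s=None, max_bytes=None):
    """Return the messages kept under a time limit and/or a size limit.

    messages: list of (timestamp_s, payload_bytes) tuples, oldest first.
    A limit of None means "no limit".
    """
    kept = list(messages)
    if max_age_s is not None:
        kept = [m for m in kept if now - m[0] <= max_age_s]
    if max_bytes is not None:
        # Drop the oldest messages until the total size fits the limit.
        while kept and sum(len(payload) for _, payload in kept) > max_bytes:
            kept.pop(0)
    return kept


log = [(0, b"aaaa"), (50, b"bbbb"), (90, b"cccc")]

by_age = apply_retention(log, now=100, max_age_s=60)
by_size = apply_retention(log, now=100, max_bytes=4)
unlimited = apply_retention(log, now=100)
print(by_age, by_size, unlimited)
```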

Message Size:

Pulsar offers decent leeway for message sizes. By default, a message can be up to five megabytes in Pulsar. For larger payloads, users can enable chunking (available only for persistent topics), which lets the producer split a message into smaller chunks automatically. The responsibility for putting these chunks back together lies with the consumer.
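
The split-and-reassemble idea behind chunking can be sketched as follows. The chunk layout (`chunk_id`, `total_chunks`, `data`) and the function names are assumptions made for this example and are not Pulsar’s wire format.

```python
def split_into_chunks(payload: bytes, max_chunk_size: int):
    """Producer side: cut a large payload into numbered chunks."""
    total = (len(payload) + max_chunk_size - 1) // max_chunk_size
    return [
        {
            "chunk_id": i,
            "total_chunks": total,
            "data": payload[i * max_chunk_size:(i + 1) * max_chunk_size],
        }
        for i in range(total)
    ]


def reassemble(chunks):
    """Consumer side: put the chunks back together in order."""
    ordered = sorted(chunks, key=lambda chunk: chunk["chunk_id"])
    if not ordered or len(ordered) != ordered[0]["total_chunks"]:
        raise ValueError("cannot reassemble: chunks are missing")
    return b"".join(chunk["data"] for chunk in ordered)


payload = b"x" * 12
chunks = split_into_chunks(payload, max_chunk_size=5)

# Chunks may arrive in any order; the consumer sorts them before joining.
restored = reassemble(list(reversed(chunks)))
print(len(chunks), restored == payload)
```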

Kafka works best with smaller messages, around 1 KB. However, configuration allows larger messages, and on managed cloud solutions such as Confluent’s the limit can go up to 8 MB.

Message Storage and Querying:

Both Kafka and Pulsar have SQL engines. The KSQL engine for Kafka is a standalone product by Confluent and is not part of Apache Kafka itself; it is licensed under the Confluent Community License. Apache Pulsar, on the other hand, uses the Presto SQL engine.

A differentiating factor for Apache Pulsar is its use of a unifying schema, which it stores in its built-in schema registry; Apache Kafka itself does not ship with one. The Presto SQL engine uses the schema registry to query messages, which must be ingested first and then queried, whereas KSQL processes the data as it streams, in the same way a streaming API application would continuously run and apply the queries.
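
The difference between the two query models can be shown without either engine. In this sketch, `query_stored` runs once over messages that are already stored, while `continuous_query` evaluates the same condition on each message as it arrives. All names and sample orders are hypothetical.

```python
def query_stored(stored_messages, predicate):
    """Query-after-ingest: run once over messages that are already stored."""
    return [message for message in stored_messages if predicate(message)]


def continuous_query(stream, predicate):
    """Streaming query: evaluate the predicate on each message as it arrives."""
    for message in stream:
        if predicate(message):
            yield message


def is_large(order):
    return order["amount"] > 100


stored = [{"id": 1, "amount": 50}, {"id": 2, "amount": 150}]
snapshot = query_stored(stored, is_large)


def incoming():
    # Stands in for an unbounded stream of new messages.
    yield {"id": 3, "amount": 500}
    yield {"id": 4, "amount": 20}
    yield {"id": 5, "amount": 101}


live = [order["id"] for order in continuous_query(incoming(), is_large)]
print(snapshot, live)
```

The first form answers a question about the data at a point in time; the second keeps producing results for as long as messages keep coming.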

Performance:

Performance is where both Apache Kafka and Apache Pulsar lock horns competitively. Multiple benchmarks and published results give somewhat varying results.

Throughput:

As per streamnative.io, Apache Pulsar beats Kafka in throughput. With the same durability guarantee as Kafka, Pulsar achieves 605 MB/s publish and end-to-end throughput (the same as Kafka) and 3.5 GB/s catch-up read throughput (3.5 times higher than Kafka). Furthermore, increasing the number of partitions or changing durability levels has no impact on Pulsar’s throughput, whereas either change severely impacts Kafka’s. The confluent.io benchmark, on the other hand, shows otherwise. However, streamnative cites two open-source benchmarks to corroborate its claims.

Latency:

Similarly, for latency, we have conflicting claims: the confluent.io benchmark shows Kafka outperforming Pulsar. However, streamnative disputes the Confluent benchmark, citing inconsistent configuration of Apache Pulsar and Kafka, and shows in its own benchmark that Pulsar outperforms Kafka.

The streamnative benchmark looks more believable to us, given its openly available datasets; it shows Apache Pulsar having lower latency at higher throughputs.

Note: streamnative.io is an event streaming platform built on Apache Pulsar, while confluent.io provides managed deployments of Kafka.

Language Support:

Both have a decent amount of language support. Kafka, being more mature, has client libraries for virtually all major languages. Pulsar also has good support for major languages such as Java, Golang, Python, C#, and Node.js.

Geo-replication and Multi-tenancy:

Geo-replication is a built-in feature of Pulsar rather than an add-on, which makes it relatively easy to set up globally distributed applications. Kafka’s geo-replication mechanism (MirrorMaker), on the other hand, has a notorious reputation for being troublesome.

Pulsar allows you to have multiple tenants, and namespaces help you keep things organized and simple. Kafka, unfortunately, does not have native multi-tenancy capabilities with complete isolation of tenants.
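
Tenants and namespaces show up directly in fully qualified Pulsar topic names, which take the form `persistent://tenant/namespace/topic` (or `non-persistent://...`). The small parser below illustrates that hierarchy; the parser itself and the `taager`, `orders`, and `order-created` names are hypothetical examples, not part of any Pulsar client.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split a fully qualified topic name into its hierarchy levels."""
    domain, separator, rest = name.partition("://")
    parts = rest.split("/")
    if (not separator
            or domain not in ("persistent", "non-persistent")
            or len(parts) != 3
            or not all(parts)):
        raise ValueError(f"not a fully qualified topic name: {name!r}")
    return TopicName(domain, *parts)


order_topic = parse_topic("persistent://taager/orders/order-created")
print(order_topic.tenant, order_topic.namespace, order_topic.topic)
```

Two teams can use the same topic name under different tenants or namespaces without colliding, which is what makes the hierarchy useful for isolation.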

Documentation and Support:

Kafka has a significant advantage over Pulsar in this respect due to its greater penetration in the tech industry. It has a larger and more active community with more extensive documentation and support. The documentation for Pulsar is also good, and one advantage of Pulsar’s smaller community is that its core maintainers are very active. Furthermore, Pulsar is fully open source with no components under any commercial license, as far as we know.

Summary:

We have in no way concluded that Pulsar is better than Kafka or vice versa. However, in our relatively short journey with messaging solutions, Pulsar seems to have solved many of the hurdles limiting Kafka. So, if you can get over the apparent community size and maturity issues and are looking to adopt a unified messaging and event streaming platform, Pulsar might be the right solution for you.
