Why Apache Pulsar? A Gentle Comparison with Kafka

Published in SFU Professional Computer Science, Feb 4, 2020

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/pmp.

Authors: Danlin Chen, Yichen Ding, Wenxi Hu

What is Apache Pulsar?

Apache Pulsar is an open-source distributed publish-subscribe messaging system originally created at Yahoo and now part of the Apache Software Foundation.

So what is a messaging system? Sending an email to a colleague to confirm the status of some work, calling a friend on a mobile phone, or ordering a product online and receiving the package can all be seen as forms of messaging, but they differ from the kind of messaging system that Apache Pulsar is. What is the difference? The answer lies in the concept of subscription.

The word “subscription” usually brings to mind subscribing to newspapers and magazines. Let’s review how that process works:

Readers subscribe to the newspapers / magazines at the post office.
The newspapers / magazines are delivered to the post office after publication.
The post office delivers the newspapers / magazines to the readers based on their subscriptions.

Although there are many types of newspapers and magazines, the number of publishers and subscribers is still limited and fairly stable. Compared with the complexity of Internet products, this is only a small case. Take YouTube as an example:

Each user can publish videos and follow (i.e., subscribe to) other users, and can be followed by other users at the same time. Because the Internet removes geographical restrictions, the number of users is enormous. Together, these factors weave “subscriptions” into an intricate network. Such a network must handle a huge volume of messages while still delivering them in near real time. This is why we need a distributed messaging system such as Apache Pulsar.

With the newspaper and YouTube subscription processes in mind, let’s take a closer look at Pulsar.

Concepts of Pulsar

Like the post office in the newspaper example above, Pulsar receives data sent by producers and then delivers that data to consumers. Producers feed messages into Pulsar, and consumers (subscribers) consume messages from Pulsar.

Now, let’s look at the internal structure of Pulsar, which has four main parts: property, namespace, topic, and subscription.

A property represents a tenant in the system (Pulsar 2.0 and later use the name “tenant” directly). A namespace is the basic administrative unit; it is where you set permissions, manage replication across clusters, control message expiration, and perform other critical operations. Together, properties and namespaces make Pulsar’s multi-tenancy possible. The topic is the core resource: producers append messages to it and consumers pull messages from it. Topics under the same namespace share the same settings, which makes it possible to configure many topics at once. A subscription is designed to receive and retain all the messages published on the topic.

To make this easier to understand, think of a Pulsar cluster as a company that supports various applications. A property represents a product line in the company, so a Pulsar cluster contains one or more properties. A namespace is like one use case of the product, and each property can have several namespaces. A topic is one process in the use case, and a namespace can contain any number of topics. Each topic can have multiple subscriptions as well. The relationships among producers, consumers, and Pulsar (property, namespace, and topic) are shown in the following image:

Figure 1. Relationship between Producer, Consumer, and Pulsar
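The property / namespace / topic hierarchy can also be sketched in a few lines of Python. This is only an illustration: the class names, the helper function, and the example company are hypothetical, and the `persistent://property/namespace/topic` naming pattern is assumed from the convention used by Pulsar 2.0 and later (older releases also included the cluster name in the path).

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    name: str
    topics: list = field(default_factory=list)  # topic names in this namespace


@dataclass
class Property:
    name: str
    namespaces: dict = field(default_factory=dict)  # namespace name -> Namespace


def topic_url(prop: str, namespace: str, topic: str) -> str:
    # Assumed naming pattern: persistent://<property>/<namespace>/<topic>
    return f"persistent://{prop}/{namespace}/{topic}"


# Hypothetical company: one product line, one use case, two processes.
shop = Property("online-shop")
shop.namespaces["checkout"] = Namespace("checkout", ["orders", "payments"])

urls = [
    topic_url(shop.name, ns.name, t)
    for ns in shop.namespaces.values()
    for t in ns.topics
]
print(urls)
```

Because every topic name carries its property and namespace, settings applied at the namespace level can be resolved for any topic from its name alone.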

Additionally, a namespace has one of two scopes: local or global. A local namespace is visible only in the Pulsar cluster in which it is defined. A global namespace can be used across multiple Pulsar clusters, which do not have to be in the same data center or location. Both local and global namespaces can be shared under proper settings.

Architecture

The volume of messages fed into a topic can vary widely, and so can the number of consumers reading from it, so a topic needs to sustain its throughput in very different situations. To solve this balancing problem, Pulsar lets you split the messages of a topic and store them on multiple machines. This design is called partitioning.

When dealing with a large amount of data across several nodes, partitioning is a common way to achieve high throughput. By default, Pulsar topics are created as non-partitioned, but a partitioned topic can be created with simple CLI commands or API calls by specifying the number of partitions. Pulsar lets you control how messages are distributed across partitions by selecting a routing policy, such as single partitioning, round-robin partitioning, or hash partitioning.
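To make the three routing policies more concrete, here is a minimal simulation in plain Python. It does not use the Pulsar client; the function names are hypothetical, the single-partition policy is pinned to partition 0 to keep the sketch deterministic, and `zlib.crc32` is only a stand-in for whatever hash function Pulsar actually applies to message keys.

```python
import zlib
from itertools import count


def single_partition(num_partitions: int):
    # Every message from this producer goes to one fixed partition
    # (pinned to 0 here so that the sketch is deterministic).
    chosen = 0
    return lambda key: chosen


def round_robin(num_partitions: int):
    # Messages are spread over the partitions in turn, ignoring the key.
    counter = count()
    return lambda key: next(counter) % num_partitions


def key_hash(num_partitions: int):
    # Messages with the same key always land on the same partition.
    # crc32 is a stand-in for the real hash function.
    return lambda key: zlib.crc32(key.encode()) % num_partitions


keys = ["user-1", "user-2", "user-1", "user-3", "user-1"]
for name, make_router in [("single", single_partition),
                          ("round-robin", round_robin),
                          ("hash", key_hash)]:
    router = make_router(4)
    print(name, [router(k) for k in keys])
```

Round-robin gives the most even spread, while hashing trades some balance for the guarantee that all messages with the same key stay in one partition and therefore keep their relative order.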

Pulsar partitions the messages automatically, so producers and consumers do not need to know anything about partitioning. An application written against a single non-partitioned topic still works after the topic is partitioned, with no code changes. Hence, partitioning is purely an administrative-level decision.

A Pulsar cluster consists of two fundamental layers: a set of brokers as the serving layer, and a set of bookie nodes as the persistent storage layer. The brokers are stateless components that handle the partitions of topics: they store received messages in the cluster, retrieve messages from the cluster, and send them to consumers on demand. The physical storage of the messages is handled by “bookie” nodes, which are the storage nodes of Apache BookKeeper, the system Pulsar uses for persistence. Since the broker layer and the bookie layer are separated, each layer can be scaled independently of the other.

Advantageous Features

Segment-Centric Storage

One advantage of Apache Pulsar is segment-centric storage. Like other large-scale data processing systems, Pulsar automatically partitions the messages and stores them on the bookie nodes. However, instead of storing a complete partition on a single cluster node and replicating it to additional nodes, Pulsar breaks the partition further into segments and distributes them across the cluster. In this way, the capacity of a partition is no longer limited by the capacity of the smallest node in the cluster, and can grow up to the total capacity of the whole cluster. Since a cluster can be scaled up easily by adding more nodes, the segment-centric design lets Pulsar store streaming data for long periods of time more efficiently.
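The capacity argument can be illustrated with a small Python sketch. The node sizes, bookie names, and function names below are made up for the example, the capacity figure for the segment-centric case is an upper bound that ignores replication, and the rotating placement is a simplification rather than BookKeeper's actual placement logic.

```python
def partition_centric_limit(node_capacities_gb):
    # A whole partition must fit on every node that holds a copy,
    # so the smallest node sets the limit.
    return min(node_capacities_gb)


def segment_centric_limit(node_capacities_gb):
    # Segments can be spread over all nodes, so the upper bound is the
    # total capacity of the cluster (replication is ignored here).
    return sum(node_capacities_gb)


def place_segments(num_segments, bookies, replicas):
    # Assign each segment to `replicas` distinct bookies, rotating the
    # starting bookie so that segments spread evenly over the cluster.
    # Assumes replicas <= len(bookies).
    return {
        seg: [bookies[(seg + i) % len(bookies)] for i in range(replicas)]
        for seg in range(num_segments)
    }


capacities = [500, 1000, 2000]  # hypothetical node sizes in GB
print(partition_centric_limit(capacities))
print(segment_centric_limit(capacities))
print(place_segments(4, ["b1", "b2", "b3"], replicas=2))
```

With these example numbers, a partition-centric layout is capped by the 500 GB node, while a segment-centric layout can draw on all 3500 GB of the cluster.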

Segment-centric storage also enables seamless cluster expansion and seamless recovery from node failures.

Seamless Cluster Expansion

As Figure 3 shows, Bookies A and B are added to the cluster while Broker 1 is writing segment N of topic-1 partition 1. Broker 1 discovers the two new nodes immediately and starts storing the subsequent segments of partition 1 on Bookies A and B. From then on, more and more segments can be written to the new nodes without any existing data being recopied. Furthermore, Apache BookKeeper offers policies to keep the load on the storage nodes balanced.
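The expansion scenario can be simulated as follows. This is a sketch under simplifying assumptions: the bookie names are hypothetical, and "pick the least-loaded bookies" is a stand-in for BookKeeper's real placement policies. The point it demonstrates is that segments written before the expansion stay where they are, while new segments go to the new nodes.

```python
def write_segment(placement, segment_id, bookies, replicas=2):
    # Pick the least-loaded bookies for the new segment (a simple stand-in
    # for BookKeeper's placement policies). Existing segments never move.
    load = {b: 0 for b in bookies}
    for holders in placement.values():
        for b in holders:
            load[b] += 1
    placement[segment_id] = sorted(bookies, key=lambda b: (load[b], b))[:replicas]


placement = {}
bookies = ["bookie-1", "bookie-2"]
for seg in range(3):                      # segments 0..2 before expansion
    write_segment(placement, seg, bookies)

before = {seg: list(holders) for seg, holders in placement.items()}

bookies += ["bookie-A", "bookie-B"]       # cluster expansion
for seg in range(3, 6):                   # segments 3..5 after expansion
    write_segment(placement, seg, bookies)

print(placement)
```

No copy step appears anywhere in the simulation: the load evens out simply because new writes prefer the emptier nodes.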

Seamless Node Failure Recovery

As Figure 4 shows, a failure corrupts segment 3 on Bookie 2. Apache BookKeeper detects the failure and schedules a replica repair at once. During the repair, Apache BookKeeper retrieves the data of segment 3 from replicas on other bookies and creates a new copy of segment 3 on Bookie 1 (an active node that does not yet contain a replica of segment 3). This is much more efficient than recovery in partition-centric storage, which has to replicate the whole partition, and it happens in the background, so brokers can keep writing to the bookie nodes without interruption.
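A rough model of this repair process is sketched below. The placement table and the `repair` function are hypothetical, and the sketch treats the failure as the loss of everything stored on one bookie; it assumes that at least one replica of each affected segment survives and that some live bookie does not yet hold it.

```python
def repair(placement, failed_bookie, live_bookies):
    # Re-replicate only the segments that lost a copy on the failed bookie.
    # Returns the (segment, source, target) copies that were made.
    copies = []
    for seg, holders in placement.items():
        if failed_bookie not in holders:
            continue
        survivors = [b for b in holders if b != failed_bookie]
        # Choose a live bookie that does not already hold this segment.
        target = next(b for b in live_bookies if b not in survivors)
        placement[seg] = survivors + [target]
        copies.append((seg, survivors[0], target))
    return copies


placement = {
    1: ["bookie-1", "bookie-3"],
    2: ["bookie-1", "bookie-4"],
    3: ["bookie-2", "bookie-3"],
    4: ["bookie-3", "bookie-4"],
}
copies = repair(placement, "bookie-2", ["bookie-1", "bookie-3", "bookie-4"])
print(copies)
print(f"segments copied: {len(copies)} of {len(placement)}")
```

In this example only one of the four segments is copied, whereas a partition-centric system would have to copy all four to rebuild the lost replica.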

Subscription

There are three types of subscriptions provided to enhance the flexibility of the application:

Exclusive subscriptions (streaming messaging): only a single consumer can be attached at any given time.
Failover subscriptions (streaming messaging): multiple consumers can attach to the same subscription, but only one of them receives messages at any given time. Another consumer starts receiving messages only when the currently receiving consumer fails.
Shared subscriptions (queuing messaging): multiple consumers can attach to the same subscription, and each consumer receives a fraction of the messages.

Each subscription on a topic can choose its own subscription type, so two or more subscription types can coexist on the same topic, which greatly improves messaging flexibility.
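The difference between the three subscription types can be shown with a small simulation. This is not the Pulsar client API: `dispatch` is a hypothetical function, failures are modeled as consumers that are already down before delivery starts, and "the first live consumer in the list" stands in for however Pulsar actually selects the active consumer of a failover subscription.

```python
from itertools import cycle


def dispatch(messages, consumers, sub_type, failed=()):
    # Return {consumer: [messages]} for one subscription.
    alive = [c for c in consumers if c not in failed]
    received = {c: [] for c in consumers}
    if sub_type == "exclusive":
        if len(consumers) > 1:
            raise ValueError("an exclusive subscription allows one consumer")
        targets = cycle(alive)
    elif sub_type == "failover":
        targets = cycle(alive[:1])   # the first live consumer gets everything
    elif sub_type == "shared":
        targets = cycle(alive)       # round-robin over the live consumers
    else:
        raise ValueError(f"unknown subscription type: {sub_type}")
    for msg, consumer in zip(messages, targets):
        received[consumer].append(msg)
    return received


messages = ["m0", "m1", "m2", "m3"]
print(dispatch(messages, ["c1", "c2"], "failover"))
print(dispatch(messages, ["c1", "c2"], "failover", failed=("c1",)))
print(dispatch(messages, ["c1", "c2"], "shared"))
```

Calling `dispatch` once per subscription with the same `messages` list mirrors how several subscriptions of different types can coexist on one topic, each receiving every message.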

The three subscription types above are based on the two main messaging models in real-time streaming architectures. The first is the queuing model, which is unordered and point-to-point. The second is the streaming model, which delivers messages in strict order. Pulsar provides a unified messaging model (combining producer, topic, subscription, and consumer) in which the same topic can be consumed with either queuing or streaming semantics, depending on the subscription type.

Message Acknowledgement

In a messaging system, failures may occur while a message is being transmitted or processed. Message acknowledgment tells the system which messages have been consumed, which gives it a recovery point when a failure occurs: acknowledged messages are not redelivered, and unacknowledged messages are not lost. Pulsar provides both cumulative acknowledgment and individual acknowledgment.

If messages are acknowledged cumulatively, the acknowledged message and every message before it will not be redelivered or consumed again. If messages are acknowledged individually, only the messages marked as acknowledged are exempt from redelivery after a failure. Exclusive and failover subscriptions can use both cumulative and individual acknowledgment, but shared subscriptions can use only individual acknowledgment.
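The two acknowledgment modes can be modeled with a small tracker class. The class, its method names, and the integer message ids are all hypothetical simplifications; the sketch only reproduces the rules stated above, including the restriction that shared subscriptions cannot acknowledge cumulatively.

```python
class AckTracker:
    """Tracks which message ids of one subscription are acknowledged."""

    def __init__(self, sub_type):
        self.sub_type = sub_type
        self.acked_up_to = -1      # every id <= acked_up_to is acknowledged
        self.individual = set()    # acknowledged ids above acked_up_to

    def ack_cumulative(self, msg_id):
        if self.sub_type == "shared":
            raise ValueError("shared subscriptions need individual acks")
        self.acked_up_to = max(self.acked_up_to, msg_id)
        self.individual = {i for i in self.individual if i > self.acked_up_to}

    def ack_individual(self, msg_id):
        if msg_id > self.acked_up_to:
            self.individual.add(msg_id)

    def to_redeliver(self, last_sent):
        # Ids in 0..last_sent that would be delivered again after a failure.
        return [i for i in range(self.acked_up_to + 1, last_sent + 1)
                if i not in self.individual]


cumulative = AckTracker("exclusive")
cumulative.ack_cumulative(2)
print(cumulative.to_redeliver(last_sent=4))   # [3, 4]

selective = AckTracker("shared")
selective.ack_individual(1)
selective.ack_individual(3)
print(selective.to_redeliver(last_sent=4))    # [0, 2, 4]
```

One plausible reading of the shared-subscription restriction is that messages are interleaved across consumers there, so a single consumer has no contiguous "everything before this point" that it could safely acknowledge.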

Comparison with Apache Kafka

Excellent and sophisticated messaging systems such as Apache Kafka already exist, so why do we still need Apache Pulsar? The following sections compare Apache Kafka and Apache Pulsar in several respects.

Segment-Centric vs. Partition-Centric

Apache Kafka is a partition-centric system: each replica of a partition is stored in its entirety on a single node. As a result, the capacity of a partition cannot exceed the capacity of the smallest node in the cluster. This makes Kafka extremely inefficient for cluster expansion, since rebalancing and recopying whole partitions is expensive and error-prone. Additionally, the recopied partition is only available after the recopying process is completed. For example, assume a partition has 4 replicas: if you lose just one replica, you need to recopy the whole partition to recover it.

On the other hand, Apache Pulsar is segment-centric, as described earlier: a partition is further divided into smaller segments that are distributed across the cluster. No data rebalancing or recopying is required upon cluster expansion, because new segments are automatically written to the new nodes. The figure below illustrates this advantage of Pulsar over Kafka.

Unified Messaging Model vs. Streaming Messaging Model

Apache Kafka is a streaming system focused on large-scale message processing, and it does not support unordered, shared messaging. When working with a large amount of data, Pulsar can work as effectively as Kafka. At the same time, for the many low-throughput use cases, Pulsar provides a shared messaging feature that Kafka does not support.

Selective Acknowledgment vs. Cumulative Acknowledgment

Kafka uses an offset, which is a simple integer, to record the current position of a consumer. As a result, Kafka cannot mark a message as acknowledged while leaving earlier messages unacknowledged. In contrast, Pulsar uses a cursor, which allows selective acknowledgment and redelivery of individual messages. Hence, Pulsar can track the positions of acknowledged messages with more flexibility.
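The practical consequence shows up when one message in the middle of a batch fails. The sketch below is a deliberately simplified model rather than either system's real API: the function names are hypothetical, message ids are plain integers starting at 0, and a consumer is assumed to restart from whatever position was recorded.

```python
def offset_commit(processed):
    # An offset can only cover a gap-free prefix: return the next offset
    # to read, i.e. the first id that has not been processed.
    offset = 0
    while offset in processed:
        offset += 1
    return offset


def redelivered_with_offset(processed, last_sent):
    # Everything from the committed offset onward is read again.
    return list(range(offset_commit(processed), last_sent + 1))


def redelivered_with_cursor(processed, last_sent):
    # Only the messages that were never acknowledged are delivered again.
    return [i for i in range(last_sent + 1) if i not in processed]


processed = {0, 1, 3, 4}   # message 2 failed, later ones succeeded
print(redelivered_with_offset(processed, last_sent=4))   # [2, 3, 4]
print(redelivered_with_cursor(processed, last_sent=4))   # [2]
```

With a single integer position, messages 3 and 4 are processed twice even though they succeeded the first time; a cursor that remembers individual acknowledgments redelivers only message 2.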

Conclusion

In this blog post, we have provided an overview of the architecture and messaging model of Apache Pulsar. We have introduced several unique design points of Apache Pulsar and the advantages over Apache Kafka. We hope this blog post will give you a better understanding of Apache Pulsar from different perspectives.