Why Nutanix Beam Chose Apache Pulsar over Apache Kafka
In Nutanix Beam (a SaaS product) we crunch a lot of data to find insights about cloud spend as well as cloud security. Nutanix Beam is built on a microservices and service mesh architecture using Consul, Nomad, Vault, Envoy, and Docker for synchronous RPC-style requests. Although we are not going to discuss our microservice architecture here (that’s for a different day and a different post :) ), we will focus on a key technology that supports this architecture: asynchronous communication between teams and microservices.
We use Disque & Conductor for batch processing. Both are queue-based systems. We wanted to add pub/sub capabilities to our platform, which led us to look for a streaming platform in which we can reliably store events and replay them when required. Because some of our team members have previous experience with Kafka and have operated Kafka at a moderate scale, it would have been easy for us to choose Kafka. But before deploying a technology we strongly believe it is important to look at the current landscape and validate our technology choices, so that we don’t give in to bias toward familiar technology that may not be the best fit for a project. It was also important to make this decision carefully because the streaming platform would become a critical component of this SaaS product.
We first learned of Apache Pulsar when we saw it mentioned in a GitHub issue and decided to start evaluating it for our streaming platform use cases. Considering that Pulsar is under the Apache umbrella and has already graduated from Apache’s incubation process to become a top-level project, we could be confident that it was an established technology. Apache Pulsar has been running in production at Yahoo for many years and has been adopted by many companies, as mentioned here. We collected our list of use cases that require a streaming platform, and we started a deep-dive analysis of Apache Pulsar’s architecture: its coordination, persistence, reliability, high availability, fault tolerance, and client ecosystem.
Introduction to Apache Pulsar
Apache Pulsar is an open-source distributed pub-sub messaging system originally created at Yahoo and now part of the Apache Software Foundation.
Apache Pulsar uses an architecture that decouples message processing, serving, and storage. The storage layer embeds Apache BookKeeper for data persistence, replication and high availability. Message serving is handled by a set of stateless brokers that make up the serving layer, and message processing is handled by Pulsar Functions in their own layer of this architecture. Each layer of this architecture can be scaled independently — we can independently scale BookKeeper (storage layer) to increase storage capacity and throughput, Pulsar brokers (serving layer) to increase message serving throughput, and function workers (processing layer) to handle more data processing.
To learn more about Apache Pulsar, see the Apache Pulsar website and documentation.
Benefits for Development:
One interesting thing we found when we collected use cases is that we frequently want to consume messages using both message-queue and pub-sub models. However, we can’t use Kafka as a queuing system, because the maximum number of active workers in a consumer group is limited by the number of partitions, and because there is no way to acknowledge messages at the individual message level. You would need to maintain a record of individual message acknowledgments in your own data store and manually commit offsets in Kafka, which adds a lot of extra overhead; too much overhead, in my opinion.
With Pulsar you don’t need to worry about which model you are going to need for consuming data. In Pulsar you can consume and commit at the offset level as well as consume and acknowledge at the message level. For more details about how that works, the Streamlio team has written a nice blog here. Also worth noting is how Apache Pulsar handles message delivery in the case of a consumer crash: Pulsar will publish the message again to another consumer after a delay defined at the subscription level.
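As a concrete sketch of what queue-style, message-level acknowledgment looks like, here is a minimal consumer using Pulsar’s Python client (`pip install pulsar-client`). The broker URL, topic, and subscription names are placeholders, not our production setup:

```python
def consume_tasks(service_url="pulsar://localhost:6650",
                  topic="persistent://public/default/tasks",
                  subscription="task-workers",
                  max_messages=10,
                  process=print):
    """Queue-style consumption: many workers share one subscription,
    and each message is acknowledged individually."""
    import pulsar  # pip install pulsar-client

    client = pulsar.Client(service_url)
    consumer = client.subscribe(
        topic,
        subscription_name=subscription,
        # Shared subscription: the number of workers is not capped
        # by the number of partitions, unlike a Kafka consumer group.
        consumer_type=pulsar.ConsumerType.Shared,
        # Messages not acked within 30s are redelivered to another worker.
        unacked_messages_timeout_ms=30000,
    )
    try:
        for _ in range(max_messages):
            msg = consumer.receive()
            process(msg.data())          # business logic goes here
            consumer.acknowledge(msg)    # ack this one message only
    finally:
        client.close()
```

Running several copies of this worker against the same subscription gives you a work queue; pointing a second subscription at the same topic gives you pub/sub, with no change to the producer.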
Benefits for Operations:
This is where things get interesting. Since we have operated Kafka at a moderate scale we already understood several of its limitations. We saw that Pulsar could address these limitations and better meet our needs.
Event Replay and Lagging Consumers:
In Apache Kafka, when consumers are lagging behind, producer throughput falls off a cliff because lagging consumers introduce random reads. Kafka’s architecture is designed in such a way that high throughput depends on accessing data sequentially in an append-only manner.
Apache Pulsar solves this problem using the combination of a segment-based architecture and isolation between reads and writes. You can read more about it in this blog. In a nutshell, your hot reads are handled by the brokers’ in-memory cache and hot writes are handled by the journal. Event replay/lagging consumer reads access data from ledgers, which are on a separate disk from the journal.
For example, suppose a topic has 5 consumers, all of them reading happily without any lag. In this case, reads are served from the broker cache, while a background process periodically moves data from the BookKeeper journal to the ledgers. Now imagine that our machine learning engineer suddenly decides to consume from the beginning of the topic to train/retrain a model after tweaking its sigma & mu :). Because lagging consumption/event replay is served from the ledgers, which sit on a separate disk from the journal, the replay does not impact the performance of the consumers reading the most recent data in the topic.
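From the client’s point of view, that replay is just a new subscription that starts at the earliest retained message. A minimal sketch with the Python client, again with placeholder URL and names:

```python
def replay_from_beginning(service_url="pulsar://localhost:6650",
                          topic="persistent://public/default/events",
                          subscription="ml-replay"):
    """Attach a fresh subscription at the start of the topic,
    e.g. to re-read history for model training."""
    import pulsar  # pip install pulsar-client

    client = pulsar.Client(service_url)
    consumer = client.subscribe(
        topic,
        subscription_name=subscription,
        # Start from the oldest retained message instead of the tail,
        # so this subscription replays the whole topic without touching
        # the hot path of the up-to-date consumers.
        initial_position=pulsar.InitialPosition.Earliest,
    )
    # An existing subscription can also be rewound in place:
    #   consumer.seek(pulsar.MessageId.earliest)
    return client, consumer
```

The existing tail consumers keep their own subscriptions and cursors; the replay consumer’s reads are served from ledger storage, as described above.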
Data Retention without Exploding Our Cloud Bill
In Kafka, all retained data lives on the brokers’ local disks, so retaining events for a long time means paying for ever-growing broker storage. Apache Pulsar solves this with its storage tiering feature, which can offload older data to scalable storage systems like Amazon S3, Google Cloud Storage, MinIO, etc. We can configure an offloading policy based on time or size: once the data in a topic reaches the specified threshold, older data segments are moved to object storage. The best part is that when a consumer requests data that has been offloaded, Pulsar transparently serves it to the consumer from object storage; the consumer does not know whether the data came from disk or from object storage. In this way, your topic partition size is not limited to the size of storage on a single broker.
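As a rough sketch of the moving parts, S3 offload is configured on the broker and then driven by a per-namespace threshold; the bucket, region, and size below are placeholders, and the exact keys may vary by Pulsar version:

```
# broker.conf — tiered storage driver (illustrative values)
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=my-pulsar-offload-bucket
s3ManagedLedgerOffloadRegion=us-west-2
```

A threshold such as `pulsar-admin namespaces set-offload-threshold --size 10G my-tenant/my-namespace` then tells Pulsar to offload older segments of any topic in that namespace once it holds more than 10 GB on the bookies.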
Cluster Expansion Without Costly Rebalancing:
Adding nodes to an Apache Kafka cluster doesn’t necessarily solve performance problems related to overloaded nodes. Why not? New nodes added to a Kafka cluster will be used for new topics created after they are added, but they won’t automatically take over some of the load from the overloaded nodes. To use them to spread the load, you need to do a manual, costly rebalancing of data to migrate some of the topic partitions from the old nodes to the new ones.
In Pulsar, storage is handled by a storage layer provided by Apache BookKeeper. Pulsar uses a segment-based architecture where the messages in a topic partition are collected into segments, which are then persisted. As a result, there is no one-to-one mapping between topic partition and nodes as there is in Kafka. When a new storage node is added, some of the new segments are stored on the new node, reducing the load on the previously existing nodes immediately.
There is a nice diagram in this blog by Jack Vanlightly illustrating how this works.
Millions of Topics:
Another important requirement for our use cases is the ability to support millions of topics. For example, suppose we want to replay data for one customer. As long as we create one topic per customer, it is easy to create a subscription on that customer’s topic and consume the data for that customer. Deleting customer data becomes as simple as deleting the topic.
However, in Apache Kafka, scaling the number of topics is a real architectural problem: Kafka creates multiple files (resources) per topic, so the number of topics we can create is limited. If we were instead to put data from many or even all customers into a single topic, then when we want to replay data for a single customer we would be forced to replay the whole topic and discard every message except those belonging to the customer we are interested in. Even if we were to use a partitioned topic with customer ID as the partition key, so that we can select just the partition where a single customer’s data is stored, a good amount of resources would still be wasted on the broker and consumer side.
Pulsar does not create files per topic. Since it uses Apache BookKeeper’s segment-based storage architecture, the number of segments created is not determined by the number of topics. We like this feature a lot because it lets us go ahead and create one topic per customer, a simpler and more efficient design.
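With that, topic-per-customer reduces to a naming convention plus ordinary subscribe and delete operations. A minimal sketch (the tenant/namespace layout here is illustrative, not our exact production layout):

```python
def customer_topic(customer_id):
    """Map a customer to a dedicated persistent topic (illustrative layout)."""
    return "persistent://beam/customers/events-%s" % customer_id

def replay_customer(customer_id, service_url="pulsar://localhost:6650"):
    """Replay a single customer's data by subscribing to just their topic."""
    import pulsar  # pip install pulsar-client

    client = pulsar.Client(service_url)
    consumer = client.subscribe(
        customer_topic(customer_id),
        subscription_name="replay-%s" % customer_id,
        # Read the customer's history from the beginning of their topic.
        initial_position=pulsar.InitialPosition.Earliest,
    )
    return client, consumer

# Deleting a customer's data is then a single admin call, e.g.:
#   bin/pulsar-admin topics delete persistent://beam/customers/events-42
```

No filtering, no wasted reads of other customers’ data: the topic itself is the isolation boundary.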
Pulsar has been running in our production environment for the last six months, supporting use cases such as pub/sub, queuing, and schema registry.
Overall, our take on Pulsar is that it greatly simplified the design of asynchronous communication in Nutanix Beam. Its scalability and design allow us to make decisions about how to consume data on the consumer side based on business use cases, rather than forcing us to make those decisions during the ingestion of data. That flexibility helps us better support ever-changing business requirements.
To understand more about how Apache Pulsar works, you can read this awesome article written by Jack Vanlightly. The Streamlio team also has nice blog posts that cover each part of Pulsar’s architecture in depth. Apache Pulsar has an awesome community that has been more than willing to help us get things right and very prompt in merging our pull requests.
PS: This article is written keeping in mind that every technology has its place in this world. To highlight some of the architectural differences, we have compared Pulsar with Kafka. We have used Kafka in the past and we are thankful to Kafka and its community.
Let me know your thoughts in the comments section. You can follow me on Twitter @SkyRocknRoll