12 Reasons Why Every Global Enterprise Should Use Apache Pulsar Today

Ever since Yahoo contributed the cloud-native messaging and streaming platform Apache Pulsar to open source in 2017, the distributed platform has exploded in popularity. From 10,000 stars and over 450 contributors on GitHub to a Slack community with 5,000+ members, Pulsar has seen phenomenal adoption by enterprises like Verizon Media, Yahoo! Japan, Tencent, Comcast, Overstock, and, of course, DataStax.

Enterprises are flocking to Apache Pulsar as the platform of choice to manage hundreds of billions of events every day. Why is Pulsar garnering all this sudden interest?

As a 15-year veteran of the streaming and messaging space, I’ve seen trends come and go. When I founded the cloud messaging service Kesque in 2019 (later acquired by DataStax), I had done my research on Pulsar and was excited about its potential.

My goal at Kesque was to build a messaging-as-a-service platform based on Apache Pulsar, and that was exactly what we did. We eventually expanded the cloud service, making it available in AWS, Azure, and GCP. Now, I’m part of the DataStax family leading the streaming engineering team, contributing to the upstream Pulsar project and building our Astra Streaming service, which, like Kesque, is powered by Apache Pulsar. I’m also the author of the O’Reilly ebook: Apache Pulsar Versus Apache Kafka: Choosing a Messaging Platform.

In this post, I will share 12 main reasons why there’s a high level of interest and momentum in Pulsar, and why every enterprise should be using it.

But first, what is Pulsar?

Pulsar is a cloud-native, distributed, open-source, pub-sub messaging and streaming platform. It originated as an event broker at Yahoo! in 2015 and was contributed to the Apache Software Foundation (ASF), becoming a top-level project in 2017. A horizontally scalable distributed system running on commodity hardware that reliably streams messages without losing data, Pulsar was originally designed to support Internet-scale applications such as Yahoo! Mail and Yahoo! Finance. Since it’s highly scalable, it can handle the most demanding data movement use cases out there.

Let’s get into the 12 reasons why every enterprise should be using Pulsar today.

Reason 1: Messaging, streaming, and queuing, all-in-one

Figure 1. Streaming (left) and Queuing (right) with Apache Pulsar.

How does Pulsar compare to a leading pub-sub messaging system like Apache Kafka, or a message queuing system like RabbitMQ or ActiveMQ?

Kafka is the de facto standard for streaming use cases, and for good reason: it’s great at streaming and pub-sub and at delivering messages to multiple consumers. On the left side of Figure 1, multiple publishers are publishing to a topic, and the same message is delivered to multiple consumers.

Meanwhile, RabbitMQ and ActiveMQ are great at queuing messages and at competing-consumer use cases. On the right side of Figure 1, publishers send messages into the topic, but each message is consumed by only one consumer. It’s trickier to accomplish competing-consumer use cases in Kafka, because it works at the partition level and you can end up with extra partitions you don’t need or consumers that don’t consume any messages.

Pulsar combines the best features of a traditional messaging system like RabbitMQ with those of a pub-sub system like Kafka. You get the best of both worlds in a high performance, cloud-native package.
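To make the contrast concrete, here’s a minimal sketch using the Pulsar Java client. The broker URL and the topic name orders are illustrative: a Failover subscription gives you the streaming behavior on the left of Figure 1, while a Shared subscription gives you the competing-consumer queuing behavior on the right.

```java
import org.apache.pulsar.client.api.*;

public class SubscriptionModes {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // Streaming style: a Failover (or Exclusive) subscription delivers every
        // message to a single active consumer, preserving order.
        Consumer<byte[]> streamConsumer = client.newConsumer()
                .topic("orders")
                .subscriptionName("analytics")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();

        // Queuing style: a Shared subscription spreads messages across many
        // competing consumers; each message goes to only one of them.
        Consumer<byte[]> worker = client.newConsumer()
                .topic("orders")
                .subscriptionName("order-workers")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        Producer<byte[]> producer = client.newProducer().topic("orders").create();
        producer.send("order-123".getBytes());

        producer.close();
        streamConsumer.close();
        worker.close();
        client.close();
    }
}
```

Both subscriptions live on the same topic at the same time, which is exactly the "all-in-one" point: you don’t pick a streaming broker or a queuing broker up front.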

Reason 2: Performance

Performance can be a touchy subject because anyone can make benchmarks look good.

But Pulsar was actually designed for high performance. Back in 2015, the developers of Pulsar took inspiration from Kafka and designed Pulsar to achieve hundreds of thousands or even millions of messages per second.

Pulsar supports millions of topics, something that Kafka struggles with. A key consideration for Pulsar’s original design was that producer latency needed to stay under 10 milliseconds: when you publish a message, you consistently receive an acknowledgement within 10 milliseconds. Pulsar’s architecture is optimized for high throughput with low, consistent latency.

In a third-party, vendor-neutral analysis published by SoftwareMill in July 2021, Pulsar was rated as a high-performance messaging platform. It’s considerably faster than traditional messaging systems and can hold its own with the pub-sub crowd.

Reason 3: Modernize legacy applications

Pulsar’s flexibility makes it easy to modernize legacy applications. Because Pulsar can handle message queuing exchange patterns, you can move older enterprise applications written for RabbitMQ or the Java Message Service (JMS) to Pulsar without rewriting them.

If you have existing legacy JMS applications, you can do a drop-in replacement by switching the broker in your application to Pulsar using DataStax’s Starlight for JMS, turning your Pulsar cluster into a JMS 2.0-compliant broker.
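As a rough sketch of what that drop-in looks like, the snippet below uses the standard JMS 2.0 API; only the connection factory comes from the Starlight for JMS library, and the configuration key, service URL, and topic name here are illustrative, so check the Starlight for JMS documentation for the exact property names your version expects.

```java
import com.datastax.oss.pulsar.jms.PulsarConnectionFactory; // Starlight for JMS
import javax.jms.JMSContext;
import javax.jms.Message;
import javax.jms.Queue;
import java.util.HashMap;
import java.util.Map;

public class JmsOnPulsar {
    public static void main(String[] args) throws Exception {
        // Illustrative configuration: point the factory at the Pulsar broker.
        Map<String, Object> config = new HashMap<>();
        config.put("brokerServiceUrl", "pulsar://localhost:6650");

        PulsarConnectionFactory factory = new PulsarConnectionFactory(config);
        try (JMSContext context = factory.createContext()) {
            // The application code below is plain JMS 2.0 and stays unchanged.
            Queue queue = context.createQueue("persistent://public/default/orders");
            context.createProducer().send(queue, "hello from JMS");

            Message msg = context.createConsumer(queue).receive(5000);
            if (msg != null) {
                System.out.println(msg.getBody(String.class));
            }
        }
        factory.close();
    }
}
```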

If you have legacy RabbitMQ applications, you can use DataStax’s Starlight for RabbitMQ to turn your Pulsar cluster into a RabbitMQ-compatible broker.

This saves costs by consolidating brokers into a single, large, horizontally scalable Pulsar cluster. You can then write new applications using event-driven architectures and more modern techniques that live together on the same platform.

Enjoy the best of both worlds: keep your old applications, and take advantage of Pulsar features like message retention and replay.

Reason 4: Multi-tenancy to support different teams

Once you have a high-performance, scalable messaging system in place, you’ll want to share it between different teams and groups within your organization. It doesn’t make sense to duplicate the system just to keep teams from impacting each other, or to build a complex overlay system to simulate multi-tenancy.

Multi-tenancy is the ability for different user groups to use the same underlying resources in a fair way. Pulsar can limit the resources each tenant and namespace has access to: you can set the maximum number of producers and consumers and limit how much storage each namespace or topic can use. Unlike Kafka, you don’t have to build an entire multi-tenancy overlay manually or spin up a whole new cluster for a new user group.
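Here’s a hedged sketch of what those limits look like with the Pulsar Java admin client. It assumes a tenant payments and namespace payments/prod already exist and the admin endpoint is reachable at http://localhost:8080; the numbers are arbitrary examples.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class TenantLimits {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // assumed admin/REST endpoint
                .build();

        // Cap how many producers and consumers any topic in this namespace may have.
        admin.namespaces().setMaxProducersPerTopic("payments/prod", 10);
        admin.namespaces().setMaxConsumersPerTopic("payments/prod", 100);

        // Bound storage per topic: keep at most 7 days or 5 GiB of acknowledged data.
        admin.namespaces().setRetention("payments/prod",
                new RetentionPolicies(7 * 24 * 60 /* minutes */, 5 * 1024 /* MB */));

        admin.close();
    }
}
```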

Reason 5: Geo-replication for data recovery

Another feature built into Pulsar is geo-replication, which you can easily manage through the Pulsar CLI or REST API without installing another package on top. Geo-replication is key to recovering your data after a disaster and to improving the performance of your application.

Pulsar supports multiple topologies that replicate data from the active data center to the standby data center. If the active one fails, you can reconnect to the standby data center. You can also have a more complex topology where you publish a message in one data center in North America and consume it in a different data center in Europe. Basically, you can have an entire global message bus by using the geo-replication feature.

Figure 2. Geo-replication on multiple topologies.

Pulsar’s global configuration store also allows you to standardize policies and namespaces across data centers, store them in a central location, and propagate them to all data centers automatically.

Another built-in feature is replicated subscriptions. This is ideal for disaster recovery scenarios. In case of failure, a consumer can restart consuming from the failure point in a different cluster.
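The sketch below shows roughly how both of these look with the Java admin and client APIs. It assumes two clusters named us-east and eu-west are already registered with the Pulsar instance and that a payments/prod namespace exists; the names and URLs are illustrative.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import java.util.Set;

public class GeoReplication {
    public static void main(String[] args) throws Exception {
        // Replicate every topic in the namespace between the two clusters.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();
        admin.namespaces().setNamespaceReplicationClusters(
                "payments/prod", Set.of("us-east", "eu-west"));
        admin.close();

        // Replicated subscription: the consumer's position is mirrored to the
        // remote cluster, so after a failover it can resume near the failure point.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://payments/prod/transactions")
                .subscriptionName("fraud-check")
                .replicateSubscriptionState(true)
                .subscribe();
        consumer.close();
        client.close();
    }
}
```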

Reason 6: Kubernetes and cloud-ready architecture

Figure 3. Pulsar’s separation of BookKeeper layer and broker layer.

Because Apache Pulsar uses a multi-layer approach that separates the layer clients connect to (brokers) from the storage layer (BookKeeper), it fits very well into cloud infrastructures, which also separate these two concerns. Because you don’t have to expand storage and compute at the same time, you won’t be paying for compute or storage you don’t need.

Apache Pulsar works naturally in Kubernetes, supporting rolling upgrades, rollbacks, and horizontal scaling. When coupled with persistent volumes backed by cloud storage with configurable performance dimensions, Pulsar is a highly durable and highly flexible messaging system that can scale from small test deployments to large production deployments with ease.

Pulsar also has a proxy component that solves some of the Kubernetes networking challenges you can run into with systems like Kafka. With the growing popularity of Kubernetes, Pulsar continues to evolve alongside these innovations.

Reason 7: Easy scaling, up and down

With Pulsar, it’s easy for enterprises to scale their cluster up and down. In other systems, when you scale up you have to work to redistribute and rebalance the load between brokers. Pulsar, by contrast, actively monitors broker resources (CPU, memory, and network) and automatically redistributes the load when a broker is overloaded. Plus, you can scale down brokers just as easily, letting Pulsar automatically redistribute the load. Need more storage? Just add more BookKeeper nodes.

Whether you need to add more storage or need more throughput, Pulsar handles everything automatically for you. No manual partition rebalancing or long maintenance periods required!

Reason 8: Tiered storage

Tiered storage is another great feature in Pulsar. You can take older messages stored in Apache BookKeeper on high-speed, high-performance SSDs and move them into lower-cost storage options like Amazon S3, Google Cloud Storage, and Azure Blob Storage.

Pulsar does this automatically, without any action on your part, and it’s all transparent to the client. Tiered storage is really helpful if you want to store a lot of events in Pulsar at little cost. Instead of spending premium dollars on storage, you can save a significant amount of money on long-term data with Pulsar.
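For example, offloading can be driven by a size threshold on a namespace. The sketch below uses an illustrative namespace and threshold and assumes an offload driver (such as S3) has already been configured on the brokers; once a topic’s BookKeeper footprint passes roughly 10 GiB, older ledger segments are moved to the cloud store.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class TieredStorage {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // assumed admin endpoint
                .build();

        // Offload ledger segments to the configured cloud store once a topic in
        // this namespace holds more than ~10 GiB in BookKeeper.
        admin.namespaces().setOffloadThreshold("payments/prod", 10L * 1024 * 1024 * 1024);

        admin.close();
    }
}
```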

Reason 9: Lower total cost of ownership

Figure 4. GigaOm report on Pulsar.

Pulsar has a lower total cost of ownership when you combine its features (such as geo-replication and tiered storage), performance, and operational simplicity. It’s a great cost-saving option for dealing with complex scenarios and high data volumes.

In fact, when Splunk replaced Kafka with Pulsar, they found that their CapEx costs for servers and storage dropped by 1.5–2x and their OpEx costs decreased by 2–3x. GigaOm also reported in 2021 that Pulsar offers up to 81% lower cost over three years compared to Kafka.

Reason 10: Completely open-source, forever

Another great aspect of Pulsar is that nearly everything is available for free under the Apache Software Foundation, without vendor-proprietary code or projects.

With software controlled by a vendor, even if it’s open source, there’s a risk that the licensing terms will change in the future, adding new restrictions on use. But because Pulsar is governed by the Apache Software Foundation, the licensing terms won’t become more restrictive. Pulsar is open source today, and it will be open source tomorrow.

Reason 11: Pulsar Functions and IO connectors for easy connections

Pulsar has a framework for lightweight stream processing. You can use Pulsar Functions, which are fully integrated with the Pulsar CLI and API, to clean and enrich your data and route events, writing your functions in Java, Python, or Go.
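As a small illustration, a Pulsar Function in Java is just a class implementing the Function interface. The cleanup logic below is a made-up example; you’d deploy it with the Pulsar CLI, pointing it at an input topic and an output topic of your choosing.

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// A minimal enrichment/cleanup function: trims and upper-cases each incoming event.
public class CleanupFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        if (input == null || input.isBlank()) {
            return null; // returning null drops the event instead of forwarding it
        }
        return input.trim().toUpperCase();
    }
}
```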

You can also get data into (source) and out of (sink) Pulsar using IO connectors without writing any code. Just configure the connectors, and Pulsar starts sourcing and sinking data using built-in connectors for systems such as Elasticsearch, MySQL, Postgres, and RabbitMQ.

Reason 12: Schema registry to prevent data incompatibility

When sending messages between producers and consumers that are decoupled, which is always the case in a messaging or streaming platform, you want to make sure they agree on the format of the data. If they don’t, they can’t communicate with each other.

This becomes very important if you’re running tens or hundreds of microservices. Pulsar’s built-in schema registry supports Avro and JSON schemas and allows producers and consumers to register or look up the schema of the data sent on a topic, preventing data compatibility issues. You can also enforce particular schemas on specific topics or evolve your schema over time while making sure it stays backward- and forward-compatible.
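Here’s a minimal sketch of schema usage with the Java client, using an illustrative Order POJO and topic: the producer and consumer both declare an Avro schema derived from the class, and the broker checks compatibility when they connect.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class SchemaExample {
    // Illustrative POJO; its fields define the Avro schema registered for the topic.
    public static class Order {
        public String id;
        public double amount;
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // The producer registers (or is validated against) the topic's schema,
        // so incompatible producers are rejected up front.
        Producer<Order> producer = client.newProducer(Schema.AVRO(Order.class))
                .topic("orders")
                .create();
        Order order = new Order();
        order.id = "o-1";
        order.amount = 9.99;
        producer.send(order);

        // The consumer receives typed objects, decoded with the same schema.
        Consumer<Order> consumer = client.newConsumer(Schema.AVRO(Order.class))
                .topic("orders")
                .subscriptionName("billing")
                .subscribe();

        producer.close();
        consumer.close();
        client.close();
    }
}
```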

Bonus reason: protocol compatibility with Kafka

I promised you 12 reasons, but here’s a bonus: you can connect Kafka clients directly to Pulsar without changing anything. Multiple projects are working towards protocol-level compatibility with Kafka, including DataStax’s Starlight for Kafka project.

These projects add features like multi-tenancy, geo-replication, and Kafka-to-Pulsar interworking. Enjoy Pulsar benefits like easier operations and scaling, even if you’re using Kafka!
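As an illustrative sketch, an unmodified Kafka producer only needs its bootstrap.servers pointed at brokers that speak the Kafka protocol; the hostname and port below are assumptions that depend on how the protocol handler’s listeners are configured in your cluster.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class KafkaClientOnPulsar {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The only change: point the client at the Pulsar brokers running the
        // Kafka protocol handler (address and port are illustrative).
        props.put("bootstrap.servers", "pulsar-broker.example.com:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "o-1", "hello from a Kafka client"));
            producer.flush();
        }
    }
}
```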

Conclusion

Intrigued enough to get started with Pulsar? The easiest way is to use Astra Streaming, DataStax’s streaming-as-a-service offering powered by Pulsar. It’s the most complete Pulsar-based cloud service, it’s integrated with the DataStax Astra DB service, and you can get started for free!

If you’re not ready for a cloud service but want to run Pulsar on-premises or as a self-managed setup, simply use DataStax Luna Streaming, a production-ready distribution of Pulsar.

Sign up and find out more about Pulsar, Astra DB, and Luna Streaming on our website and DataStax Medium. If you want to read more about Pulsar, check out this article by our CTO, Jonathan Ellis, and this article by me. Lastly, I’m always happy to answer any questions you may have about Pulsar on LinkedIn!

Resources

  1. Apache Pulsar
  2. Apache Pulsar GitHub repository
  3. Apache Pulsar Slack community
  4. DataStax Expert Support
  5. Apache Kafka
  6. DataStax Fast JMS for Apache Pulsar
  7. Evaluating persistent, replicated message queues, Softwaremill
  8. A Report on the Cost Savings of Replacing Kafka with Pulsar
  9. DataStax Astra Streaming
  10. DataStax Astra DB service
  11. DataStax Luna Streaming
  12. Four Reasons Why Apache Pulsar is Essential to the Modern Data Stack
  13. 7 Reasons to Choose Apache Pulsar over Apache Kafka
