#Tech in 5 — Apache Pulsar
An Introductory Technical Assessment of Apache Pulsar and an Example of How to Make it Work for You
As an avid user of both Apache Spark and Apache Kafka, I thought I knew a lot about streaming. Boy, was I wrong. This kind of thinking is dangerous for any developer; it stunts growth, prevents true mastery, and stalls your potential. When I feel like this, I find it essential to reach out to others, read some blog posts, and broaden my personal stack. I decided to look into a different streaming technology where I could do a quick-and-dirty dev session, so I chose Apache Pulsar.
High-Level Technical Discussion
Apache Pulsar is built on the publish-subscribe pattern, much like other streaming technologies such as Kafka. However, Apache Pulsar has some distinct advantages over other streaming technologies in the big data ecosystem. One of these advantages is that Apache Pulsar separates the serving layer (brokers) from the storage layer (Apache BookKeeper nodes, called bookies). Here is a quick visual rundown of the Pulsar architecture.
(Diagram from the Apache Pulsar Documentation)
So what is the big deal? Why does separating the storage and serving layers matter? It matters a great deal because each layer can now scale on its own. Need more storage because you are building a historian with a slow Change Data Capture feed? You can scale back the compute dedicated to serving the stream and add more storage. What if you need more processing and less storage? You can dedicate more nodes to serving the higher-velocity streaming data and fewer to storage.
Traditionally, messaging systems such as Kafka co-locate serving and storage on the same nodes. This is not the case with Apache Pulsar: because the serving and persistence layers are separated, Pulsar can scale serving independently of storage. This makes for a much better capacity planning model, one that can be more cost-efficient. The fundamental idea can be distilled into the following statement:
When you need to support more consumers or producers, you can simply add more brokers.
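To make the capacity planning point concrete, here is a rough back-of-the-envelope sketch. It is not an official sizing formula: the per-node capacities (`broker_mb_per_s`, `bookie_disk_gb`) and the two workloads are made-up numbers, and it deliberately ignores bookie write throughput. The point is only that broker count follows traffic while bookie count follows retained data.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    ingress_mb_per_s: float  # data published per second
    egress_mb_per_s: float   # data delivered to consumers per second
    retention_hours: float   # how long messages are kept
    replicas: int            # copies of each entry kept in storage


def brokers_needed(w: Workload, broker_mb_per_s: float) -> int:
    """Serving layer: sized by traffic (hypothetical per-broker throughput)."""
    traffic = w.ingress_mb_per_s + w.egress_mb_per_s
    return max(1, math.ceil(traffic / broker_mb_per_s))


def bookies_needed(w: Workload, bookie_disk_gb: float) -> int:
    """Storage layer: sized by retained data (hypothetical per-bookie disk)."""
    stored_gb = w.ingress_mb_per_s * 3600 * w.retention_hours * w.replicas / 1024
    # Need at least one bookie per replica, however little data there is.
    return max(w.replicas, math.ceil(stored_gb / bookie_disk_gb))


# Assumed workloads, for illustration only.
historian = Workload(ingress_mb_per_s=2, egress_mb_per_s=2,
                     retention_hours=720, replicas=2)
firehose = Workload(ingress_mb_per_s=400, egress_mb_per_s=800,
                    retention_hours=1, replicas=2)

for name, w in [("historian", historian), ("firehose", firehose)]:
    print(name,
          "brokers:", brokers_needed(w, broker_mb_per_s=100),
          "bookies:", bookies_needed(w, bookie_disk_gb=1000))
```

With these assumed numbers, the slow historian wants a single broker but many bookies, while the firehose wants many brokers and only a few bookies. On a system that couples serving and storage, both workloads would force you to buy the larger of the two.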
Another advantage of Pulsar is zero rebalancing time. Thanks to Pulsar's layered architecture and the stateless nature of its brokers, a node is available for reads and writes as soon as it joins the cluster, and newly arriving data automatically starts being written to new bookies. A bookie also acknowledges an entry only once it has been persisted to its journal file on disk. Even if all servers went down at once, and even with some disk failures, an acknowledged message would not be lost as long as one replica's journal survives. Kafka, by contrast, does not by default flush each message to disk before acknowledging it, so it could lose a recent message that was still only in RAM on every replica.
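The "persist to the journal first, acknowledge second" idea can be sketched in a few lines of plain Python. This is a toy illustration of the general write-ahead technique, not BookKeeper's actual implementation; the `Journal` class, its length-prefixed format, and `read_entries` are all hypothetical names invented for this example.

```python
import os
from typing import List


class Journal:
    """Toy append-only journal: an entry is acknowledged only after fsync."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "ab")

    def append(self, entry: bytes) -> bool:
        # Length-prefix the entry so it can be read back unambiguously.
        self._file.write(len(entry).to_bytes(4, "big") + entry)
        self._file.flush()             # push Python's buffer to the OS
        os.fsync(self._file.fileno())  # ask the OS to push it to the disk
        return True                    # only now is it safe to acknowledge

    def close(self) -> None:
        self._file.close()


def read_entries(path: str) -> List[bytes]:
    """Replay the journal, as a recovering node might after a restart."""
    entries = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            entries.append(f.read(int.from_bytes(header, "big")))
    return entries
```

The trade-off is latency: every `append` waits for the disk. Skipping the `os.fsync` call would be faster, but an acknowledged entry could then vanish if the machine lost power before the OS wrote it out.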
If you would like a deeper dive into the architecture, here is a post I found useful in understanding the nuts and bolts of Pulsar.
Understanding How Apache Pulsar Works — Jack Vanlightly
Alright, now the fun stuff: how do we use the thing? Well, first, we need to spin up an instance in Docker. These steps are relatively simple, as they are outlined in the documentation mentioned above. But I will go through step-by-step instructions to make sure everyone is on the same page.
Ahh, but before I begin, what data should I consume? I need something big, something useless, for a toy example. Ah-ha! Twitter's API looks tasty; I think I'll take a chance and consume that data using my Docker setup. It's close to real-time and would be an excellent test for a demo.
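For reference, the standalone Docker instructions in the Pulsar documentation boil down to a single `docker run` command that exposes port 6650 (the Pulsar binary protocol) and port 8080 (the HTTP admin API). Here is a small sketch that builds that command in Python so it can be reused from a script. The `latest` image tag is my assumption; pin whatever version you actually want. The helper names are hypothetical.

```python
import subprocess
from typing import List


def pulsar_standalone_command(image: str = "apachepulsar/pulsar:latest") -> List[str]:
    """Build the `docker run` argv for a single-node Pulsar instance."""
    return [
        "docker", "run", "-it",
        "-p", "6650:6650",   # Pulsar binary protocol (clients connect here)
        "-p", "8080:8080",   # HTTP admin / REST API
        image,
        "bin/pulsar", "standalone",
    ]


def launch() -> None:
    """Run Pulsar in the foreground; requires Docker and a terminal."""
    subprocess.run(pulsar_standalone_command(), check=True)


# Print the equivalent shell command instead of running it.
print(" ".join(pulsar_standalone_command()))
```

Calling `launch()` from a terminal starts the container in the foreground; stop it with Ctrl+C. Once the startup logs settle down, clients can connect at `pulsar://localhost:6650`.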
Now, let's consume Twitter data through their API and run it through Pulsar. This is more complicated, as I am trying to consume live data from one of the biggest data generators around. If I need to, I can filter the stream down to the tweets I actually want to see. Here I am filtering on Hashmap, Snowflake, and Data Engineering.
We can successfully listen to Twitter data and filter on specific tags. I would say that is a win! This is a standalone Docker container running on my local machine, and the performance was respectable. I suspect that a cluster would perform even better than this, given the architecture of Pulsar.
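Here is a minimal sketch of the publishing side, under a few assumptions: tweets arrive as dictionaries with a `text` field, the topic is called `tweets` (a name I made up), and Pulsar is listening on `pulsar://localhost:6650`. It filters in Python for clarity, which is not necessarily how the original demo filtered; the Twitter streaming API can also filter on its side. The Pulsar calls (`pulsar.Client`, `create_producer`, `send`, `close`) come from the `pulsar-client` package and are imported lazily so the filtering logic can be tried without a broker.

```python
import json
from typing import Iterable, List

KEYWORDS = ["hashmap", "snowflake", "data engineering"]
TOPIC = "tweets"  # hypothetical topic name


def matches(tweet: dict, keywords: List[str]) -> bool:
    """True if the tweet text mentions any keyword (case-insensitive)."""
    text = tweet.get("text", "").lower()
    return any(keyword in text for keyword in keywords)


def publish_matching(tweets: Iterable[dict], producer, keywords: List[str]) -> int:
    """Send each matching tweet as UTF-8 JSON; return how many were sent.

    `producer` is anything with a `send(bytes)` method, such as the object
    returned by `pulsar.Client.create_producer`.
    """
    sent = 0
    for tweet in tweets:
        if matches(tweet, keywords):
            producer.send(json.dumps(tweet).encode("utf-8"))
            sent += 1
    return sent


def main(tweets: Iterable[dict]) -> None:
    import pulsar  # pip install pulsar-client

    client = pulsar.Client("pulsar://localhost:6650")
    producer = client.create_producer(TOPIC)
    try:
        publish_matching(tweets, producer, KEYWORDS)
    finally:
        client.close()
```

In a real run, `tweets` would be whatever iterable your Twitter client hands you; feeding `main` a short list of hand-written dictionaries is an easy way to smoke-test the Pulsar side first.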
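The listening side can be sketched the same way. This assumes messages are UTF-8 JSON tweets on a topic named `tweets`, with a subscription named `tweet-printer` (both names are mine, for illustration). `subscribe`, `receive`, `data`, and `acknowledge` are the `pulsar-client` calls; the `drain` helper is hypothetical.

```python
import json
from typing import Callable


def drain(consumer, handle: Callable[[dict], None], max_messages: int) -> int:
    """Receive up to `max_messages`, pass each to `handle`, then acknowledge.

    `consumer` is anything with `receive()` and `acknowledge(msg)`, such as
    the object returned by `pulsar.Client.subscribe`.
    """
    handled = 0
    for _ in range(max_messages):
        msg = consumer.receive()         # blocks until a message arrives
        handle(json.loads(msg.data()))   # msg.data() is the raw payload bytes
        consumer.acknowledge(msg)        # tell the broker we are done with it
        handled += 1
    return handled


def main() -> None:
    import pulsar  # pip install pulsar-client

    client = pulsar.Client("pulsar://localhost:6650")
    consumer = client.subscribe("tweets", "tweet-printer")
    try:
        drain(consumer, lambda tweet: print(tweet.get("text")), max_messages=10)
    finally:
        client.close()
```

Acknowledging only after `handle` returns means a crash mid-processing leaves the message unacknowledged, so the broker can deliver it again rather than silently dropping it.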
Unfortunately, as of writing this post, I am not aware of any managed services for Apache Pulsar. It therefore requires more hands-on knowledge to develop with than, say, a Confluent Kafka implementation or a Databricks implementation of Spark. Still, for those who do not mind coding and managing their own streaming services, Pulsar is for you!
Next week, we will look at a new technology: Dask! Stay tuned and subscribe to Hashmap to stay up to date with my blog posts. New Tech in 5 Minutes is designed to bring someone up to speed on a technology in around 5 minutes.
Borrowed Twitter Python code from here: http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here.
Kieran Healey is a Cloud and Data Engineer with Hashmap providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.