Apache Pulsar: A Gentle Intro to Apache’s Newest Pub-Sub Messaging Platform

Karthikeyan Palanivelu
Published in Capital One Tech
Jul 17, 2019

Apache Pulsar is an open source pub-sub messaging platform backed by durable storage (Apache BookKeeper) with the following cool features:

  • Geo-Replication
  • Multi-Tenancy
  • Zero Data Loss
  • Zero Rebalancing Time
  • Unified Queuing and Streaming Model
  • Highly Scalable
  • High Throughput
  • Pulsar Proxy
  • Functions

The Pulsar documentation explains every feature in detail; this blog is written from the perspective of a Pulsar user and covers at a high level what you need to know before getting started with Pulsar.

Terminology

  • Apache ZooKeeper: Stores metadata about Pulsar clusters.
  • Broker: A stateless component that exposes REST and native endpoints to administer message transfer and storage.
  • Bookie: An instance of Apache BookKeeper that stores the messages. Bookies are the persistent store for Pulsar clusters.

Architecture

Pulsar — Architecture. Diagram taken from the Apache Pulsar Documentation.

Many existing messaging systems co-locate data processing and data storage on the same cluster nodes or instances. That design choice offers a simpler infrastructure and some possible performance benefits from moving less data over the network, but at the cost of scalability, resiliency, and ease of operations. Apache Pulsar takes a cloud-friendly approach by separating the serving and storage layers.

Pulsar has a layered architecture with data served by stateless “broker” nodes, while data storage is handled by “bookie” nodes. This architecture provides the following benefits:

  • Brokers scale independently.
  • Bookies scale independently.
  • ZooKeeper, brokers, and bookies can each be containerized.
  • ZooKeeper provides the configuration and state of the cluster.

Pulsar Architecture: how it works. Diagram taken from the Apache Pulsar Documentation.

Here are the highlights from the above diagram which I find most exciting:

  • Load Balancer: Pulsar has a built-in load balancer that distributes load across all brokers internally.
  • Service Discovery: Pulsar has built-in service discovery, so clients can find out where and how to connect to the brokers through a single endpoint.
  • Global Replicators: Replicate data across any number of brokers, as configured for the namespace.
  • Global ZK: The global ZooKeeper enables geo-replication.

Geo-Replication

Geo-replication is a typical mechanism used to provide disaster recovery. Generally, any database or message bus solution replicates data between two data centers. Pulsar supports multi-datacenter replication (n-mesh) with the following strategies:

  • Asynchronous Replication
  • Synchronous Replication

Global clusters can be configured at the namespace level so that data is replicated across any number of clusters. In the example below, Datacenter C does not have a consumer, but its messages can still be consumed in Datacenters A or B, depending on the subscription model.

Geo-Replication
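
The asynchronous strategy can be illustrated with a small toy model. This is a sketch in plain Python, not Pulsar's API: the class name `GeoReplicatedNamespace`, its methods, and the message value are all hypothetical, and the single `replicate()` step stands in for the background replicators.

```python
class GeoReplicatedNamespace:
    """Toy model of a namespace replicated across several clusters."""

    def __init__(self, clusters):
        # One independent message log per cluster (hypothetical structure).
        self.logs = {cluster: [] for cluster in clusters}
        # Messages written locally but not yet copied to remote clusters.
        self.pending = []

    def publish(self, origin, message):
        """Write to the local cluster first; remote copies happen later."""
        if origin not in self.logs:
            raise ValueError(f"unknown cluster: {origin}")
        self.logs[origin].append(message)
        self.pending.append((origin, message))

    def replicate(self):
        """Copy every pending message to all clusters except its origin."""
        for origin, message in self.pending:
            for cluster, log in self.logs.items():
                if cluster != origin:
                    log.append(message)
        self.pending.clear()

    def consume(self, cluster):
        """Return the messages currently readable in the given cluster."""
        return list(self.logs[cluster])


ns = GeoReplicatedNamespace(["A", "B", "C"])
ns.publish("C", "order-1")      # produced in Datacenter C
before = ns.consume("A")        # not yet visible in A
ns.replicate()
after = ns.consume("A")         # now visible in A (and in B)
```

In this model a message produced in Datacenter C becomes readable in A and B only after the replication step runs, which is the trade-off of asynchronous replication: local writes are fast, remote visibility lags slightly.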

Multi-Tenancy

Within an organization, the multi-tenancy feature lets a single Pulsar cluster serve the whole enterprise while isolating each tenant’s data storage. This built-in feature significantly reduces infrastructure and operational costs for the organization.
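
One place where tenancy becomes visible to a Pulsar user is the topic name, which has the form `persistent://tenant/namespace/topic` (or `non-persistent://...`). The sketch below is a plain-Python helper, not part of any Pulsar client; `TopicName`, `parse_topic`, and the example tenants `payments` and `fraud` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str      # the unit of isolation within the cluster
    namespace: str   # administrative grouping inside a tenant
    topic: str       # local topic name


def parse_topic(name: str) -> TopicName:
    """Split a fully qualified topic name into its four parts."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unexpected topic domain in {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


orders = parse_topic("persistent://payments/prod/orders")
alerts = parse_topic("persistent://fraud/prod/alerts")
# Same cluster and same namespace name, but different tenants.
shares_tenant = orders.tenant == alerts.tenant
```

Two teams can therefore use the same namespace and topic names on one cluster without colliding, because the tenant is part of every topic's identity.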

Stay tuned for a detailed blog on establishing the multi-tenancy feature in Pulsar.

Zero Rebalancing Time

Pulsar’s layered architecture and the brokers’ stateless nature make zero rebalancing time possible. If a new broker is added to the cluster, it’s immediately available for writes and reads and does not spend any time rebalancing data across the cluster.

From the perspective of bookies, when a new bookie is added to the cluster it is immediately ready for any writes due to its underlying distributed log architecture and read/write isolation. Rebalancing of the data based on the segment replication configuration takes place behind the scenes, without any impact on the cluster.
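
To make the bookie behavior concrete, here is a toy model of segment placement. It is only a sketch: the `ToyCluster` class and its "least-loaded bookie first" rule are assumptions made for illustration and are not BookKeeper's actual placement policy. It also leaves out the background rebalancing described above.

```python
class ToyCluster:
    """Toy model: each new segment is placed on the least-loaded bookies."""

    def __init__(self, bookies, replicas=2):
        self.bookies = list(bookies)
        self.replicas = replicas
        self.segments = []  # one tuple of bookie names per written segment

    def add_bookie(self, name):
        # No data is moved here; the bookie simply becomes eligible.
        self.bookies.append(name)

    def write_segment(self):
        # Count how many segment copies each bookie already holds.
        load = {bookie: 0 for bookie in self.bookies}
        for placement in self.segments:
            for bookie in placement:
                load[bookie] += 1
        # Stable sort: ties are broken by the order bookies joined.
        ranked = sorted(self.bookies, key=lambda b: load[b])
        chosen = tuple(ranked[: self.replicas])
        self.segments.append(chosen)
        return chosen


cluster = ToyCluster(["bookie-1", "bookie-2"])
cluster.write_segment()
cluster.write_segment()
old_segments = list(cluster.segments)

cluster.add_bookie("bookie-3")
new_segment = cluster.write_segment()  # bookie-3 is used right away
```

The existing segments stay where they are, and the very next write already lands on the new bookie, so adding capacity does not require a pause to shuffle old data.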

Unified Queuing and Streaming Model

Pulsar supports both streaming and queuing semantics in one model, known as the subscription model. Consumers subscribe to a topic using one of the following three subscription models:

Different types of Pulsar subscriptions

  1. Exclusive: Supports streaming semantics. There can be only one consumer at any given time.
  2. Failover: Supports streaming semantics. Multiple consumers are allowed to connect to a topic, but only one consumer receives messages at any given time. The other consumers start receiving messages only when the currently receiving consumer fails.
  3. Shared: Supports queuing semantics. Multiple consumers can attach to the same topic, and each consumer receives a fraction of the messages.
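
The difference between the three models can be sketched with a toy dispatcher. This is plain Python rather than the Pulsar client API: the `dispatch` function and consumer names are hypothetical, the first consumer in the list is assumed to be the active one for failover, and round-robin is used here as a simple way to give each shared consumer a fraction of the messages.

```python
def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} under a toy subscription model."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    received = {name: [] for name in consumers}

    if mode == "exclusive":
        # Only one consumer may be attached at any given time.
        if len(consumers) != 1:
            raise ValueError("exclusive allows exactly one consumer")
        received[consumers[0]] = list(messages)
    elif mode == "failover":
        # Everyone may connect, but only the active consumer receives.
        received[consumers[0]] = list(messages)
    elif mode == "shared":
        # Each consumer gets a fraction of the messages (round-robin here).
        for index, message in enumerate(messages):
            received[consumers[index % len(consumers)]].append(message)
    else:
        raise ValueError(f"unknown subscription mode: {mode}")
    return received


msgs = ["m1", "m2", "m3", "m4"]
shared = dispatch(msgs, ["c1", "c2"], "shared")
failover = dispatch(msgs, ["c1", "c2"], "failover")
# Simulate c1 failing: c2 is now first in the list and takes over.
after_failure = dispatch(["m5", "m6"], ["c2"], "failover")
```

Exclusive and failover keep all messages on a single consumer, which preserves order (streaming), while shared spreads the work across consumers (queuing).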

Functions

Functions are localized listeners that can live within or outside of Pulsar. From a usage perspective, functions can be used for content-based routing, which helps enterprise applications receive only the messages intended for them.
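
Content-based routing can be sketched as follows. This is a minimal illustration in plain Python and does not use the Pulsar Functions SDK; the `ROUTES` table, the topic names, and the `type` field on each message are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from a message's "type" field to an output topic.
ROUTES = {
    "payment": "payments-topic",
    "refund": "refunds-topic",
}


def route(message, routes=ROUTES, default="unrouted-topic"):
    """Pick the destination topic by inspecting the message content."""
    return routes.get(message.get("type"), default)


def process(messages):
    """Group incoming messages by the topic they would be published to."""
    outbox = defaultdict(list)
    for message in messages:
        outbox[route(message)].append(message)
    return dict(outbox)


incoming = [
    {"type": "payment", "amount": 25},
    {"type": "refund", "amount": 5},
    {"type": "audit"},
]
routed = process(incoming)
```

An application that subscribes only to `payments-topic` would then see payment messages and nothing else, without having to filter the full stream itself.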

Proxy

The proxy is needed to expose the brokers to the outside world when Pulsar is deployed in the cloud or on Kubernetes. The proxy can provide authentication and authorization by itself, and it connects seamlessly to the brokers with or without TLS. It also has a built-in feature to pass authorization tokens to the broker for namespace permission validation.

Conclusion

Apache Pulsar is a powerful pub-sub messaging platform built on a layered architecture that comes out of the box with geo-replication, multi-tenancy, zero rebalancing time, unified queuing and streaming, TLS-based authentication/authorization, a proxy, and durability. It is worth learning more about Apache Pulsar if you are working on a streaming platform, big data pipeline, or pub-sub message bus.

Happy Pulsaring!

Related Articles in This Series

Apache Pulsar — One Cluster for the Entire Enterprise Using Multi Tenancy
Apache Pulsar — Geo-Replication and Hybrid Deployment Model to Achieve Synchronous Replication

DISCLOSURE STATEMENT: © 2019 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.
