Apache Pulsar — Gentle Introduction

Karthikeyan Palanivelu
Jul 11, 2018

Apache Pulsar is a pub-sub messaging platform backed by durable storage (it uses Apache BookKeeper), with the following cool features:

  • Geo-Replication
  • Multi-Tenant
  • Zero Data Loss
  • Zero Rebalancing time
  • Unified Queuing and Streaming Model
  • Highly Scalable
  • High Throughput
  • Pulsar Proxy
  • Functions

The Pulsar documentation explains every feature in detail. This blog is written from the perspective of a customer using Pulsar.

Architecture

Pulsar — Architecture

Pulsar has a layered architecture that isolates the storage layer from the brokers. This architecture gives me the following benefits:

  1. Scale brokers independently.
  2. Scale bookies independently.
  3. Containerize ZooKeeper, brokers and bookies.
  4. ZooKeeper provides the configuration and state of the cluster.

Pulsar Architecture: how it works

Below are the highlights from the above diagram that led us to choose Pulsar:

  1. Load Balancer : Pulsar has a built-in load balancer that distributes load across all brokers internally.
  2. Service Discovery : Pulsar has built-in service discovery, so clients can find where and how to connect to the brokers through a single endpoint.
  3. Global Replicators : Replicate data between n brokers, as configured for the namespace.
  4. Global ZK : The global ZooKeeper enables geo-replication.

Geo-Replication

Geo-replication is an out-of-the-box solution in Pulsar. Global clusters can be configured at the namespace level so that data is replicated across any number of clusters (an n-way mesh). In the example, datacenter C does not have a consumer, yet messages are still consumed in datacenter A or B, depending on the subscription model.
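To make the n-way mesh idea concrete, here is a minimal toy model in plain Python. It is not the Pulsar API: `Namespace`, `ToyGeoMesh` and the cluster names `dc-a`, `dc-b`, `dc-c` are hypothetical names made up for this sketch. It only illustrates that a message published in one cluster becomes available in every cluster configured for the namespace.

```python
from collections import defaultdict


class Namespace:
    """Toy namespace that lists the clusters it replicates to (hypothetical)."""

    def __init__(self, name, replication_clusters):
        self.name = name
        self.replication_clusters = set(replication_clusters)


class ToyGeoMesh:
    """Toy n-way mesh: every configured cluster receives every message."""

    def __init__(self, namespace):
        self.namespace = namespace
        # cluster name -> messages stored in that cluster
        self.storage = defaultdict(list)

    def publish(self, origin_cluster, message):
        clusters = self.namespace.replication_clusters
        if origin_cluster not in clusters:
            raise ValueError(f"{origin_cluster} is not configured for {self.namespace.name}")
        # Write locally first, then copy to every other configured cluster.
        self.storage[origin_cluster].append(message)
        for cluster in sorted(clusters - {origin_cluster}):
            self.storage[cluster].append(message)


mesh = ToyGeoMesh(Namespace("my-tenant/my-namespace", ["dc-a", "dc-b", "dc-c"]))
mesh.publish("dc-c", "order-created")  # dc-c has a producer but no consumer

# A consumer attached in dc-a or dc-b can now read the message.
print(mesh.storage["dc-a"], mesh.storage["dc-b"])
```

In a real deployment the replication is asynchronous and handled by the brokers; the synchronous loop here is only a simplification.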

Multi-tenant

Within an organization, the multi-tenancy feature makes it possible to stand up a single cluster for the whole enterprise while still isolating each tenant's data storage. This built-in feature can drastically bring down the organization's infrastructure and operational costs.

** I am planning to publish a detailed blog on establishing this feature in Pulsar.
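One place where the isolation becomes visible is the topic name: in current Pulsar versions a topic is addressed as `persistent://tenant/namespace/topic`. The sketch below parses such a name to show which part belongs to which tenant. `TopicName`, `parse_topic` and the example tenants are hypothetical helpers written for this post, not part of the Pulsar client.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str
    tenant: str
    namespace: str
    topic: str


def parse_topic(full_name: str) -> TopicName:
    """Split 'persistent://tenant/namespace/topic' into its parts."""
    domain, sep, rest = full_name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unsupported topic name: {full_name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {full_name!r}")
    return TopicName(domain, *parts)


payments = parse_topic("persistent://payments/prod/orders")
marketing = parse_topic("persistent://marketing/prod/orders")

# Same namespace and topic name, different tenants: two separate topics.
print(payments.tenant, marketing.tenant)
```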

Zero Rebalancing Time

Pulsar’s layered architecture and the stateless nature of its brokers make zero rebalancing time possible. When a new broker is added to the cluster, it is immediately available for writes and reads; it does not spend any time rebalancing data across the cluster.

From the perspective of bookies: when a new bookie is added to the cluster, it is immediately ready for writes because of the underlying distributed log architecture (read/write isolation). Any rebalancing of data required by the segment replication configuration takes place behind the scenes, without affecting the bookie's availability in the cluster.
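The bookie behaviour can be pictured with a small toy model. This is a sketch under simplifying assumptions, not BookKeeper's real placement logic: `ToyBookieCluster` is a hypothetical class, and picking the least-loaded bookies is an assumption chosen to keep the example deterministic. The point is that existing segments stay where they are, while a newly added bookie is eligible for the very next segment.

```python
class ToyBookieCluster:
    """Toy model: each new segment is written to a small set of bookies."""

    def __init__(self, bookies, ensemble_size=2):
        self.segment_counts = {bookie: 0 for bookie in bookies}
        self.ensemble_size = ensemble_size
        self.segments = []  # one tuple of bookie names per segment

    def add_bookie(self, name):
        # No data is copied: the new bookie simply joins with zero segments.
        self.segment_counts[name] = 0

    def open_segment(self):
        # Assumption: prefer the least-loaded bookies, ties broken by name.
        ranked = sorted(self.segment_counts, key=lambda b: (self.segment_counts[b], b))
        ensemble = tuple(ranked[: self.ensemble_size])
        for bookie in ensemble:
            self.segment_counts[bookie] += 1
        self.segments.append(ensemble)
        return ensemble


cluster = ToyBookieCluster(["bookie-1", "bookie-2"])
cluster.open_segment()
cluster.open_segment()

cluster.add_bookie("bookie-3")          # joins without moving old segments
new_segment = cluster.open_segment()    # and is used for the next segment
print(cluster.segments[:2], new_segment)
```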

Unified Queuing and Streaming Model

Pulsar supports both streaming and queuing semantics in one model. This is achieved through the subscription model. Consumers subscribe to a topic using one of the following subscription modes:

  1. Exclusive : Supports streaming semantics
  2. Failover : Supports streaming semantics
  3. Shared : Supports queuing semantics
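The difference between the three modes is easiest to see in how messages are handed to consumers. Below is a minimal simulation in plain Python; `dispatch` is a hypothetical helper for illustration and does not use the Pulsar client. Exclusive and failover deliver the whole ordered stream to a single active consumer, while shared spreads messages across consumers in round-robin fashion.

```python
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} for a toy subscription."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    delivered = {consumer: [] for consumer in consumers}
    if mode == "exclusive":
        # Only one consumer may attach to an exclusive subscription.
        if len(consumers) > 1:
            raise ValueError("exclusive subscription allows a single consumer")
        delivered[consumers[0]] = list(messages)
    elif mode == "failover":
        # Several consumers may attach, but only one is active at a time;
        # the others are standbys that take over if the active one fails.
        delivered[consumers[0]] = list(messages)
    elif mode == "shared":
        # Messages are distributed round-robin, so ordering is not preserved
        # per consumer: this is the queuing behaviour.
        for message, consumer in zip(messages, cycle(consumers)):
            delivered[consumer].append(message)
    else:
        raise ValueError(f"unknown subscription mode: {mode}")
    return delivered


print(dispatch([1, 2, 3, 4], ["a", "b"], "failover"))
print(dispatch([1, 2, 3, 4], ["a", "b"], "shared"))
```

The sketch ignores acknowledgements and redelivery, which matter in a real shared subscription.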

Functions

Functions are localized listeners that can live within or outside of Pulsar. In terms of usage, Functions can be used for content-based routing, which helps enterprise applications receive only the messages intended for them (or all of them).
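Content-based routing boils down to a small decision: look at the message, pick the destination topic. The sketch below shows only that decision as a plain Python function, so it runs without a cluster. The `ROUTES` table, the topic names and the `type` field are assumptions made up for this example; in a real Pulsar Function the chosen topic would then be published to through the Functions SDK.

```python
import json

# Hypothetical routing table: message type -> destination topic.
ROUTES = {
    "order": "persistent://my-tenant/my-namespace/orders",
    "refund": "persistent://my-tenant/my-namespace/refunds",
}
DEFAULT_TOPIC = "persistent://my-tenant/my-namespace/unrouted"


def route(raw_message: str) -> str:
    """Pick a destination topic from the message content."""
    try:
        payload = json.loads(raw_message)
    except json.JSONDecodeError:
        return DEFAULT_TOPIC
    if not isinstance(payload, dict):
        return DEFAULT_TOPIC
    kind = payload.get("type")
    if not isinstance(kind, str):
        return DEFAULT_TOPIC
    return ROUTES.get(kind, DEFAULT_TOPIC)


print(route('{"type": "order", "id": 42}'))
print(route("not json"))
```

Each application then subscribes only to the topic it cares about, instead of filtering a combined stream itself.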

Proxy

A proxy is needed to expose the brokers to the outside world when Pulsar is deployed in the cloud or on Kubernetes. The proxy itself can provide authentication and authorization, and it connects to the broker seamlessly with or without TLS. It also has a built-in feature to pass the authorization token to the broker for namespace permission validation.

Conclusion

Apache Pulsar is a powerful pub-sub platform built on a layered architecture. Out of the box it comes with Geo-Replication, Multi-Tenancy, Zero Rebalancing Time, Unified Queuing and Streaming, TLS-based Authentication/Authorization, Proxy and Durability.

Happy Pulsaring!
