Geo-replication of Apache Pulsar in AKS

Anmol Arora · Published in WeTheITGuys · Feb 13, 2024

In today’s data-driven world, where every decision relies heavily on data, the importance of efficient data storage and streaming cannot be overstated. Apache Pulsar is one of the platforms widely used for this today. However, despite its growing popularity, there is limited documentation on geo-replicating Pulsar clusters across multiple regions. So, why replicate? While Pulsar allows messages to be replicated at various levels, including the broker, namespace, and topic, the need for geo-replication arises for several reasons.

Firstly, speed is paramount. With clients scattered across the globe, reducing message latency significantly enhances consumption and production processes. Moreover, geo-replication ensures robust disaster recovery measures. In the event of a cluster failure, geo-replication enables swift recovery without significant downtime, restoring normal operations seamlessly. Additionally, it facilitates hassle-free upgrades of AKS clusters, Pulsar versions, and traffic management while ensuring uninterrupted service for production-level customers.

Setting up geo-replication across multiple regions involves configuring individual Apache Pulsar instances in each region. Initially deployed in standalone mode, these instances need to be transitioned to cluster mode. Achieving this requires establishing network connectivity between the clusters. Azure’s Global Virtual Network Peering offers a secure, private communication channel between clusters, ensuring data integrity without exposure to the public internet or the need for additional encryption. Alternatively, a transit hub streamlines connectivity in scenarios involving numerous clusters, following the classic hub-and-spoke model. In corporate environments, this step takes priority because of security concerns. Make sure to whitelist the required IPs in the Network Security Groups (NSGs) of the resource group for a seamless experience.
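
As a rough sketch, the peering itself can be created with the Azure CLI; the resource group, VNet, and peering names below are placeholders, and a peering has to be created in each direction.

# Peer the centralus AKS VNet with the eastus2 AKS VNet (names are illustrative)
az network vnet peering create \
--name centralus-to-eastus2 \
--resource-group rg-pulsar-centralus \
--vnet-name vnet-aks-centralus \
--remote-vnet /subscriptions/<SUB-ID>/resourceGroups/rg-pulsar-eastus2/providers/Microsoft.Network/virtualNetworks/vnet-aks-eastus2 \
--allow-vnet-access

# Repeat in the opposite direction so traffic flows both ways
az network vnet peering create \
--name eastus2-to-centralus \
--resource-group rg-pulsar-eastus2 \
--vnet-name vnet-aks-eastus2 \
--remote-vnet /subscriptions/<SUB-ID>/resourceGroups/rg-pulsar-centralus/providers/Microsoft.Network/virtualNetworks/vnet-aks-centralus \
--allow-vnet-access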

Geo-replication in Apache Pulsar

With the network infrastructure in place, configuring geo-replication within the clusters becomes the next focus. While Pulsar’s traditional setup involves a quorum of ZooKeeper nodes managing cluster-specific coordination, geo-replication mandates a global metadata store to oversee distributed synchronization and replication across clusters. This global metadata store can either reuse existing ZooKeeper instances or be deployed separately, offering better scalability and reduced complexity. I would recommend having a separate global ZooKeeper (the config store) for handling all global-level operations.

This config store is another StatefulSet, very similar to the ZooKeeper one but with comparatively fewer configurations. A similar set of StatefulSet, ConfigMap, Secret, and Service resources can be created for the config store. Since it is responsible for communication between clusters, its IPs need to be exposed, unlike the local ZooKeeper, and leader election is not required for it. The config store makes sure replication stays in sync across all the clusters, and if the connection breaks, replication resumes from where it broke last time.
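
To tie the brokers to this global store, each broker (and the toolset) is pointed at the config store in addition to the local ZooKeeper. A hedged sketch of the relevant broker.conf lines, using illustrative hostnames and the config-store client port shown later; note that newer Pulsar releases name these settings metadataStoreUrl and configurationMetadataStoreUrl instead.

# broker.conf (illustrative): local metadata store plus the global config store
zookeeperServers=zk1.centralus.example.com:2181,zk2.centralus.example.com:2181,zk3.centralus.example.com:2181
configurationStoreServers=zk1.centralus.example.com:2184,zk1.eastus2.example.com:2184,zk1.eastus.example.com:2184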

On the configuration side, the first step is to make the Pulsar instances aware of one another; geo-replication itself is enabled on a per-tenant basis. It is managed and configured at the namespace level, so Pulsar gives you the flexibility to enable geo-replication for one namespace and disable it for others within the same tenant. With the network setup and the config store in place, the next step is to configure each cluster to connect to the others. The pulsar-admin tool can be used for this: from the toolset, go into the bin directory and run pulsar-admin. For example, to make the centralus cluster aware of the eastus2 cluster, the following command can be used.

bin/pulsar-admin clusters create \
--broker-url pulsar://<DNS-OF-EASTUS2>:<PORT> \
--url http://<DNS-OF-EASTUS2>:<PORT> \
eastus2
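
A similar command needs to be executed in the eastus2 cluster as well, this time registering centralus. A hedged sketch, with placeholder DNS and port values, would look like this:

bin/pulsar-admin clusters create \
--broker-url pulsar://<DNS-OF-CENTRALUS>:<PORT> \
--url http://<DNS-OF-CENTRALUS>:<PORT> \
centralus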

Once these Pulsar instances know about each other, asynchronous geo-replication can take place. Brokers are responsible for handling it, and the ZooKeepers also hold the metadata regarding cursors, so if anything goes wrong, replication resumes from where the connection broke last time. To replicate to a cluster, the tenant needs permission to use that cluster. Permissions can be granted to the tenant at creation time by using the following command.

Specify all the intended clusters when you create a tenant:

bin/pulsar-admin tenants create my-tenant \
--admin-roles my-admin-role \
--allowed-clusters eastus2,eastus,centralus
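
Since replication is ultimately controlled per namespace, the replication clusters also need to be listed on each namespace that should be geo-replicated. A minimal sketch, assuming a namespace named my-namespace under my-tenant:

bin/pulsar-admin namespaces set-clusters my-tenant/my-namespace \
--clusters eastus2,eastus,centralus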

The last, and one of the most important, steps is to configure the config store. Every config-store node needs to know about the other config-store nodes and their port numbers. These are the consistent, highly available sources of truth that can tolerate failures. It is recommended to spread the config store across at least 3 regions so that if one fails, the majority of the quorum stays active and data stays consistent. For example, we can have a quorum of 7 voting servers spread across 3 clusters that are responsible for decision-making, while the rest work as observers.

clientPort=2184
server.1=zk1.centralus.example.com:2185:2186
server.2=zk2.centralus.example.com:2185:2186
server.3=zk3.centralus.example.com:2185:2186
server.4=zk1.eastus2.example.com:2185:2186
server.5=zk2.eastus2.example.com:2185:2186
server.6=zk3.eastus2.example.com:2185:2186:observer
server.7=zk1.eastus.example.com:2185:2186
server.8=zk2.eastus.example.com:2185:2186
server.9=zk3.eastus.example.com:2185:2186:observer

As noted above, the config store is critical to this setup and must be distributed across multiple regions to withstand failures effectively. Load balancers or traffic managers can be employed to optimize traffic distribution among the clusters, ensuring efficient load handling. Once everything is set up, geo-replication can be enabled at the subscription level as well. The benefit of enabling it is that the cursor stays synced across all the Pulsar instances: if you ever have to switch Pulsar instances mid-consumption, consumption can continue from the same place where it left off. The subscription sync interval can be customized; for example, it can be set to 1 second so the cursor is synced every second.
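
The exact setting names vary slightly between Pulsar versions, but as an illustrative sketch the broker-side knobs are typically enableReplicatedSubscriptions and replicatedSubscriptionsSnapshotFrequencyMillis; a snapshot frequency of 1000 ms corresponds to the one-second cursor sync mentioned above.

# broker.conf (illustrative): replicated-subscription settings
enableReplicatedSubscriptions=true
replicatedSubscriptionsSnapshotFrequencyMillis=1000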

To make sure subscription replication is working, check the Pulsar stats of the topic in question. In most client languages the flag can be set to true from the consumer code; in other cases it needs to be handled at the administrative level. If subscription replication is working, the topic stats will show the replicationEnabled flag set to true for those subscriptions.
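
For example, assuming the tenant, namespace, and topic names used earlier, the stats can be pulled with pulsar-admin and the subscription entries inspected:

# Look at the subscriptions section of the JSON output for the replication flag
bin/pulsar-admin topics stats persistent://my-tenant/my-namespace/my-topic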

In conclusion, while setting up geo-replication for Apache Pulsar across Azure Kubernetes Service (AKS) clusters may seem daunting, the benefits are immense. Enhanced speed, robust disaster recovery, and seamless upgrades make it a worthwhile endeavor, ensuring uninterrupted data streaming operations in a globally distributed environment.

Implementing these steps may require time and effort, but the resulting benefits make it a worthwhile investment.
