Working with Apache Pulsar in a One-Day Hackathon

Hila Elkayam
Published in Israeli Tech Radar
4 min read · Feb 22, 2020

The hackathon's focus was to use two new technologies from the “try” section of the tech radar. We were a team of five; Keren Finkelstein and I chose to take the Apache Pulsar part.

“Apache Pulsar is an open-source distributed pub-sub messaging system originally created at Yahoo and now part of the Apache Software Foundation.”

Our team wanted to solve the problem of having no visibility into data pipeline processing by creating a heat diagram of the pipeline: the busier a topic is, the redder it gets.

We chose to start with a simple example that explains how to install Pulsar and covers some basic usage. We followed this article.

The Pulsar Docker installation was very easy. We chose to work with Python, so we installed “pulsar-client”, took the subscriber and publisher from the article, and expected it all to run.

It didn’t. We encountered an error when trying to run the subscriber.

We googled the error and found all kinds of things that didn’t help, and frankly we didn’t know the source of the problem. Was it Docker? Was it something in the Python code?

We tried different solutions, none of which worked. Finally, we found a blog post explaining that after installing pulsar-client you can open a Python interpreter and type “import pulsar”; if there is no error, pulsar-client is installed correctly. We tried that and got the same error we had seen when running the subscriber.
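The same check can be scripted so it is easy to rerun after each install attempt. This is a minimal sketch, not something we used on the day; the helper name `client_is_importable` is made up for illustration, and it only relies on the standard library.

```python
import importlib


def client_is_importable(module_name="pulsar"):
    """Return True if the module imports cleanly, False otherwise.

    Hypothetical helper: it does the same thing as typing
    `import pulsar` in the interpreter, but reports the error
    instead of stopping on it.
    """
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        # Broad on purpose: a broken native extension can raise
        # errors other than ImportError.
        print(f"import {module_name} failed: {exc!r}")
        return False
    return True


if __name__ == "__main__":
    print("pulsar importable:", client_is_importable())
```

If this prints a failure, the problem is in the client installation itself and not in the subscriber code or in Docker.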

Well, at least we now know where the error is coming from.

There was no other choice but to downgrade pulsar-client to version 2.4.1.post1. Version 2.4.2 produced the same error (we used macOS Mojave).

Great, we have Pulsar up and running, and we have a producer and a subscriber, but we need a pipeline.
To build one, we need to use Pulsar Functions.

Pulsar Functions are lightweight compute processes that:

1. consume messages from one or more Pulsar topics,

2. apply a user-supplied processing logic to each message,

3. publish the results of the computation to another topic.

And so we did.

We created our first pipeline, a very simple one, just to see how it works: a producer creates a message; the Pulsar function listens on that topic and is supposed to receive the message, do something with it, and send another message, which the subscriber is waiting for.
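For illustration, here is a minimal sketch of what such a function can look like. It assumes Pulsar's "native" Python function style, in which a plain `process(input)` function is deployed and needs no imports; it is not the code from our repository, and the exclamation-mark logic is only a placeholder for "do something".

```python
def process(input):
    """Called once per message consumed from the input topic.

    `input` is the message payload; whatever is returned is
    published to the output topic. The parameter name shadows a
    Python builtin, but it follows the convention of the Pulsar
    documentation examples.
    """
    return "{}!".format(input)
```

Because the function is plain Python, it can be tried locally before deploying, for example by calling `process("hello")` in an interpreter.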

We searched the Pulsar documentation for how to deploy the function so that Pulsar would recognise it, and found some examples. It goes like this:

bin/pulsar-admin functions create --broker-service-url pulsar://localhost:6650 --py <path-to-function-code> --classname <class-name> --inputs <input-topics> --output <output-topics>

All the examples use Pulsar’s admin client, and none mentioned running Pulsar in Docker, but that part was easy: just use “docker exec”.
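As a rough sketch, the deploy command above can be wrapped in “docker exec” from Python. The container name `pulsar-standalone` and the helper `build_create_command` are assumptions made for illustration; the sketch also assumes that `bin/pulsar-admin` resolves from the container's working directory and that the function file has already been copied or mounted into the container, since `--py` is resolved inside it.

```python
import shlex


def build_create_command(container, py_path, classname, inputs, output,
                         broker_url="pulsar://localhost:6650"):
    """Build the argv list for deploying a function via `docker exec`.

    Hypothetical helper: `container` is whatever name or id your
    Pulsar container has, and `py_path` must be a path that exists
    inside that container.
    """
    return [
        "docker", "exec", container,
        "bin/pulsar-admin", "functions", "create",
        "--broker-service-url", broker_url,
        "--py", py_path,
        "--classname", classname,
        "--inputs", ",".join(inputs),  # several topics are comma-separated
        "--output", output,
    ]


if __name__ == "__main__":
    cmd = build_create_command(
        container="pulsar-standalone",      # assumed container name
        py_path="/pulsar/my_function.py",   # assumed path in the container
        classname="<class-name>",
        inputs=["<input-topic>"],
        output="<output-topic>",
    )
    # Print the command for review; pass `cmd` to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps the quoting explicit, which helps when topic names contain characters the shell would otherwise interpret.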

If you want to check that Pulsar recognises the function, use:

$ curl http://localhost:8080/admin/functions/<tenant>/<namespace>/<class-name>

If you didn’t create a tenant and namespace, use the defaults:

tenant=public

namespace=default

We crossed our fingers and ran the pipeline. Nothing happened.

:-(

We started by looking at the code again, but it seemed OK, and we had no logs... what should we do now?

Did I mention we’re at a hackathon and there are two hours until presentation time?! Ahhhh.

We decided to move on and fake the pipeline at that point (and maybe return to it later) as we are on a tight schedule and have a lot of integration work ahead, which we didn’t yet know how to do.

A reminder: the goal was to create a heat diagram, which means we need to read stats for each topic and check how many messages it receives.

The Pulsar API provides exactly that:

http://localhost:8080/admin/v2/persistent/<tenant>/<namespace>/<topic-name>/stats
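Reading that endpoint from Python could look roughly like the sketch below. The helper names are hypothetical, the defaults assume the `public`/`default` tenant and namespace, and using the `msgRateIn` field of the response as the measure of how busy a topic is reflects a choice made for this example.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def stats_url(topic, tenant="public", namespace="default",
              base="http://localhost:8080"):
    """Build the stats URL for a persistent topic."""
    parts = (quote(p, safe="") for p in (tenant, namespace, topic))
    return f"{base}/admin/v2/persistent/{'/'.join(parts)}/stats"


def fetch_stats(topic, **kwargs):
    """Fetch and decode the stats JSON (needs a running Pulsar)."""
    with urlopen(stats_url(topic, **kwargs), timeout=5) as resp:
        return json.load(resp)


def inbound_rate(stats):
    """Messages per second arriving on the topic, 0.0 if absent."""
    return float(stats.get("msgRateIn", 0.0))


if __name__ == "__main__":
    # "my-topic" is a placeholder topic name.
    print(inbound_rate(fetch_stats("my-topic")))
```

Keeping the URL building and the parsing separate from the network call makes both easy to check without a broker running.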

You can take a look at the Pulsar API here.

The stats were zeros all over. At first we thought this was because we should have used persistent topics and hadn’t, but it turns out that topics are persistent by default, so no problem there…

We played with it a bit and realised that we could see non-zero stats for a topic only while messages were actually flowing through it.

15 minutes to presentation: we have a pipeline, we have numbers from the stats, and we can create a heat diagram according to those numbers.
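One simple way to turn a stats number into a "heat" colour is a linear fade from white to red. This is only a sketch of the idea, not necessarily how the repository does it; `heat_color` and the choice of `max_rate` as the scale are assumptions.

```python
def heat_color(rate, max_rate):
    """Map a message rate to a hex colour between white and red.

    A rate of 0 gives white (#ffffff); a rate at or above
    `max_rate` gives pure red (#ff0000).
    """
    if max_rate <= 0:
        return "#ffffff"
    # Clamp to [0, 1] so out-of-range rates still give a valid colour.
    ratio = min(max(rate / max_rate, 0.0), 1.0)
    # Green and blue channels fade out as the topic gets busier.
    fade = round(255 * (1.0 - ratio))
    return f"#ff{fade:02x}{fade:02x}"
```

Here `max_rate` could be the highest rate seen across all topics, so the busiest topic is always the reddest.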

Mission accomplished!

https://github.com/tikalk/pulsar-pipeline-monitor ( ⭐️ star us! ⭐️)

Conclusion:

We had only a day to set up the pipeline using Apache Pulsar, and it was not very easy. We encountered several problems along the way, and when we googled for answers there were not many resources to look at.

Apache Pulsar comes with extensive documentation, but if you don’t understand something in it, there are few other places or examples to turn to.
