A feather in their caps

Kev Jackson
May 3 · 8 min read

We have undertaken a large integration project in the Warehouse Management Systems (WMS) team at THG that has required a significant amount of research into message queue technologies. This article describes the options we reviewed, our choice, and how the choice we made led to contributing to an open source project.

Once upon a time, choosing an open source message queue was basically a choice between the two main contenders: Apache ActiveMQ or RabbitMQ.

As an organisation, we have experience using ActiveMQ and RabbitMQ for different workloads elsewhere in THG. Both of these technologies have proven to be reliable and safe within a single datacenter (DC).

Our project needs to be hosted in multiple DCs around the world to support our global operations. We need the elasticity to allow us to deploy both larger and smaller configurations depending on the requirements of the particular warehouse environment.

The two traditional platforms didn’t seem to fit the bill here. RabbitMQ couldn’t guarantee that it would never send duplicate messages, which would be anathema to our application, as it relies on the messaging infrastructure guaranteeing at-most-once delivery.
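To make the difference between the delivery guarantees concrete, here is a minimal in-memory sketch. It is a toy model only, not how RabbitMQ or any other broker is implemented, and every name in it (`DeliverySemantics`, `Handler`, the two `deliver…` methods) is hypothetical. It assumes the handler returns `false` when its acknowledgement is lost, even though the message itself was processed.

```java
import java.util.Deque;

/** Toy model of two delivery guarantees; all names here are hypothetical. */
final class DeliverySemantics {

    /** Consumer callback: returns false to simulate a lost acknowledgement. */
    interface Handler {
        boolean handle(String message);
    }

    /**
     * At-most-once: the message is removed from the queue before the handler
     * runs and is never redelivered, so a duplicate is impossible (but a
     * message can be lost if the handler fails).
     */
    static void deliverAtMostOnce(Deque<String> queue, Handler handler) {
        while (!queue.isEmpty()) {
            String message = queue.poll();
            handler.handle(message); // outcome deliberately ignored
        }
    }

    /**
     * At-least-once: the message stays at the head of the queue until the
     * handler acknowledges it (or maxAttempts is reached), so a lost
     * acknowledgement causes the same message to be handled again.
     */
    static void deliverAtLeastOnce(Deque<String> queue, Handler handler, int maxAttempts) {
        while (!queue.isEmpty()) {
            String message = queue.peek();
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                if (handler.handle(message)) {
                    break; // acknowledged, stop redelivering
                }
            }
            queue.poll();
        }
    }
}
```

In this model, a handler whose first acknowledgement for one message is lost sees that message twice under `deliverAtLeastOnce` and exactly once under `deliverAtMostOnce`. That duplicate is the behaviour our application cannot tolerate.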

ActiveMQ’s network-of-brokers clustering can create situations where a message is not delivered until a broker is restarted, and to avoid this behavioural quirk you need to define storage that spans at least two DCs. Our experience of using GlusterFS in this configuration hadn’t convinced us that it had the reliability we required for our application. The master-slave cluster configuration does provide synchronous commits in our two DCs, but doesn’t allow multiple brokers to be running simultaneously.

What’s the story?

So we did some digging into the current state of messaging systems — the landscape of which has expanded with the rise of cloud infrastructure and distributed systems coming to the fore:

The current set of streaming and messaging technologies listed by the CNCF

For the project we had a set of functional and non-functional requirements:

  • Open source with permissive license
  • Persistent queue & durable topic support
  • Java client library
  • At-most-once message delivery semantics
  • Large message support
  • Clustered broker to support expansion and multi-DC
  • Low operational burden and complexity
  • Ability to handle hundreds of thousands of messages per topic with reliable and predictable latency

A sample of a typical set of messages; note the increase around Black Friday.

We immediately discounted commercial options, as “open source” was one of our key selection criteria. Apache Kafka, despite being a solid pubsub system, was not compatible with our non-functional requirements around message durability. Apache Spark and Storm relied on HDFS infrastructure, which comes with a considerable operational burden for the team.

None of the Go-based production-ready technologies had the persistence or durability characteristics we required, so out went NATS and NSQ (and I couldn’t convince the team to rewrite the entire stack to use ZeroMQ 😒).

Amongst the other contenders:

  • Apache Heron is still incubating (and is focused on analytics streams)
  • Apache NiFi looks like it could be very powerful, but it would require re-thinking our entire architecture to convert from an event-driven system to a flow-based model
  • Apache Flink focuses on analytics; there is support for event-driven applications, but the use cases stated in the documentation were not a good fit for what we want
  • OpenMessaging is a specification, not a product we could build upon
  • Apache Beam is a processing framework that simplifies writing batch processing or stream processing on top of a distributed processing backend — so although interesting in itself, it didn’t solve our immediate use case of messaging for an event-driven system

This left us with essentially Apache Pulsar or Apache RocketMQ. Both claimed to meet our non-functional requirements and were developed at large corporations before being open-sourced. We quickly tested Pulsar and found it simple enough to validate some example scenarios in a day of effort. This doesn’t mean that RocketMQ wasn’t suitable, just that we decided that Pulsar would be a good fit for our use cases.

Next steps

After settling on our next-gen messaging platform technology, we had to start integrating our microservices with the Pulsar client.

Like so many enterprise Java development teams, the WMS team had chosen to integrate with message queues via the EIP library Apache Camel. This allows us to have an abstraction between the application code in the microservice and the underlying messaging technology — a very useful abstraction layer when you need to move messaging providers…

There was just one problem — Apache Camel didn’t support Apache Pulsar, which meant we had two choices:

  1. Remove the Camel library from our codebase and integrate directly with Pulsar through the Java client library or
  2. Write the Camel integration component so we could retain our technology abstraction layer.

Removing Camel from the project would involve rewriting the common integration library we had layered on top of the Camel semantics along with re-implementing the Camel routing semantics (essentially moving that logic out of Camel and into our library code).

This didn’t seem like a positive use of our time compared to creating a camel-pulsar component that we could then simply include as a dependency.
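The trade-off can be pictured with a plain-Java analogy. The sketch below is not Camel’s API: `Transport`, `EndpointRegistry` and `OrderService` are hypothetical names, and the endpoint strings are illustrative only. It shows the general idea of resolving a transport from the scheme of an endpoint URI, so that supporting a new messaging technology means registering one new scheme while the application code stays untouched.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical transport interface, standing in for a messaging technology. */
interface Transport {
    void send(String destination, String body);
}

/** Toy registry that picks a transport from the scheme of an endpoint URI. */
final class EndpointRegistry {
    private final Map<String, Transport> transports = new HashMap<>();

    /** Adding support for a new technology is a single registration. */
    void register(String scheme, Transport transport) {
        transports.put(scheme, transport);
    }

    void send(String endpointUri, String body) {
        int colon = endpointUri.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("No scheme in endpoint: " + endpointUri);
        }
        String scheme = endpointUri.substring(0, colon);
        Transport transport = transports.get(scheme);
        if (transport == null) {
            throw new IllegalArgumentException("No component for scheme: " + scheme);
        }
        // Everything after the first colon is passed through to the transport.
        transport.send(endpointUri.substring(colon + 1), body);
    }
}

/** Application code: knows only an endpoint string, not the technology behind it. */
final class OrderService {
    private final EndpointRegistry registry;
    private final String ordersEndpoint;

    OrderService(EndpointRegistry registry, String ordersEndpoint) {
        this.registry = registry;
        this.ordersEndpoint = ordersEndpoint;
    }

    void placeOrder(String orderId) {
        registry.send(ordersEndpoint, "order:" + orderId);
    }
}
```

Under these assumptions, moving `OrderService` from one provider to another is a configuration change to the endpoint string, provided a transport for the new scheme has been registered. In our case, that missing registration was the camel-pulsar component.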

At THG, like most Java-focused development organisations, we are heavy consumers of open source software, particularly Apache Java libraries and products, and it’s rare that we have the opportunity within a project to contribute back to the open source community as an organisation rather than as individual developers.

As part of the evaluation of the messaging platforms, I had submitted a minor change to assess how open the Pulsar community was to external developers. As expected, the community was welcoming and helpful.

With projects that originate at a single large company, there is a danger that the core contributors are homogeneous and the project is “open-source in name only”. The ASF attempts to mitigate this danger by requiring that all projects (regardless of provenance) go through an “incubation” period where the community of contributors around the project has to expand and be sustainable beyond a single organisation.

Having my small change marshalled through code review, together with the good interactions with the Pulsar core team, made it easier to recommend the platform to the WMS team.

The process of contributing

Each open source project has different norms or methods of working with, and contributing to, the code. The Linux kernel development process is different from contributing to a GitHub-based project such as Kubernetes.

Having previously been heavily involved in the Apache open source ecosystem, I reached out to the Apache Camel team to discuss how the team members assigned the task of writing the new Camel component would work with the Camel codebase:

Initial contact with the extremely helpful Apache Camel team on gitter

Open source development can be split into two eras, pre-git and post-git. Before Linus wrote git, many open source projects used either CVS or SVN for source control, with the Linux kernel famously using BitKeeper.

A central-server approach, such as SVN, led to restrictions on who could ‘commit’ to the server and helped to define the process of code reviews etc. that all software engineering projects should have.

Apache projects had a ‘commit-then-review’ process where a commit to a project trunk was deemed to be good unless another committer reviewed it and asked for a change. This process was restricted by who could commit. To be able to interact with the SVN repository with both read and write privileges, a developer had to have ‘committer’ status. Having commit status on one project didn’t automatically mean you would get that same status on another project.

This model, although effective, didn’t lead to engaging huge numbers of developers who wanted to get a little fix into the codebase just for their specific bug. Instead it led to having a smaller number of committers who spent much more time shepherding patches from developers into trunk.

Along comes Git and suddenly centralised source control is no longer the de-facto model. Decentralisation brought huge benefits to the Linux kernel team, allowing much easier patch management and comparing source trees between developers. Apache was fairly wedded to using SVN for all the project repositories, but it became clear that Git was a superior technology and the Apache Infra team investigated how to adjust both process and tech to retain the committer / developer division.

The real explosion in open source development came when a team with good UX skills built a web service based on Git and made using Git significantly easier for groups of developers who wanted to collaborate on a project without having to first spend time setting up the source control infrastructure — GitHub.

Just recently, the ASF has moved to GitHub as a primary host for all ASF projects while retaining backup git mirrors internally. This is a big endorsement of GitHub for open source communities.

In some ways, writing the initial code to achieve your goal is the easiest part of the entire process. Two of our team, Chuks and Richard, started working on what was to become the camel-pulsar component.

We already used GitHub for our project, however we couldn’t use our commercial project space for open source contributions, so we used the THG OpenSource GitHub organisation for this work.

After getting the core functionality completed in a branch taken from Camel, we opened a Pull Request and then began the normal flow of review: addressing comments, altering code to meet the reviewers’ requirements, and so on.

There was one additional complication with the work. We are not currently using the most recent version of Camel, so our integration code was based on the version we were using and could easily test with.

This led to some back and forth with the Camel developers, who required that our changes be compatible with both the 2.x stream and master (3.x). This didn’t take too long to sort out, and throughout the whole process the Camel developers provided excellent feedback on code style, possible refactorings and missing code.

Merged

Finally we got the message we had been working towards:

Yay!

Camel 3 and Camel 2.24 will feature support for Apache Pulsar thanks to the following:

…and the enormously helpful Apache Camel contributors and community.

We’re recruiting

Find out about the exciting opportunities at THG here:

THG Tech Blog

THG is one of the world’s fastest growing and largest online retailers. With a world-class business, a proprietary technology platform, and disruptive business model, our ambition is to be the global digital leader.

Kev Jackson

Written by

Principal Software Engineer @ THG, We’re recruiting — thg.com/careers