From In-House Managed RabbitMQ to GCP PubSub

Liron Kreiss
Cybereason
Nov 18, 2019

Lessons learned from our journey in the world of queue technologies.

At Cybereason, we ran RabbitMQ in production as the message broker between two of our services for over a year. We had a Kubernetes cluster consisting of RabbitMQ containers and an autoscaler pod. The deployment was single-tenant, with a cluster for each of our customers, and our DevOps team managed the entire system in-house.

Are we speaking your language? Check out our careers page for open roles.

In most cases the system worked as expected; at scale, however, things got messy. We faced cluster issues with persistence and partitioning. Working at scale also brought costs: more clusters, more pods per cluster, and more DevOps engineers to manage the system. The ongoing deployment issues, combined with the cost of the system, forced us to redesign. The first decision we made was to outsource the queuing system, since we recognized it would reduce our costs significantly.

We started to explore our options. Though RabbitMQ is one of the best queueing technologies on the market, we chose to replace it. The candidates for a successor were AWS SQS and GCP PubSub. The title already gives away that we went with GCP PubSub, but here is why:

  1. Most of our production environments are deployed in GCP. We anticipated that staying within GCP would make it easier to satisfy our security requirements and would enable a smoother deployment.
  2. PubSub supports attaching many subscriptions to the same topic, which matched our architecture’s future plans; with SQS, you would have to put SNS in front to get the same fan-out. (A short sketch of this follows.)
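To make the fan-out point concrete, here is a minimal sketch using the google-cloud-pubsub Java admin clients. The project, topic, and subscription names are placeholders, and error handling is omitted.

```java
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.cloud.pubsub.v1.TopicAdminClient;
import com.google.pubsub.v1.PushConfig;
import com.google.pubsub.v1.SubscriptionName;
import com.google.pubsub.v1.TopicName;

public class FanOutSketch {
  public static void main(String[] args) throws Exception {
    String project = "my-gcp-project"; // placeholder project id

    try (TopicAdminClient topicAdmin = TopicAdminClient.create();
         SubscriptionAdminClient subscriptionAdmin = SubscriptionAdminClient.create()) {

      // One topic...
      TopicName topic = TopicName.of(project, "events");
      topicAdmin.createTopic(topic);

      // ...and as many subscriptions as there are consumers.
      // Every subscription receives its own copy of each published message.
      for (String consumer : new String[] {"billing", "analytics"}) {
        subscriptionAdmin.createSubscription(
            SubscriptionName.of(project, consumer + "-sub"),
            topic,
            PushConfig.getDefaultInstance(), // empty push config = pull subscription
            60 /* ack deadline in seconds */);
      }
    }
  }
}
```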

Proof of Concept Phase

Our team was largely unfamiliar with GCP PubSub, so we started with a proof of concept. We wanted to get familiar with the product’s capabilities and limitations before deployment. During the POC, we learned about two major parts of GCP PubSub that were crucial for us:

  1. Dependency Issues: PubSub only supports specific protobuf and gRPC versions, which conflict with the legacy versions of those packages in our monolith.
  2. Deployment Process: We learned what the deployment requires in terms of provisioning service accounts, managing permissions, etc. (see the credentials sketch below).
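On the deployment side, the Java client can either rely on Application Default Credentials or be handed an explicit service account key. The snippet below is a minimal sketch of the explicit route; the key path, project, and topic names are placeholders.

```java
import com.google.api.gax.core.FixedCredentialsProvider;
import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.pubsub.v1.TopicName;

import java.io.FileInputStream;

public class PublisherWithServiceAccount {
  public static Publisher create() throws Exception {
    // Load the service account key provisioned for this service.
    // The simpler alternative inside GCP is Application Default Credentials
    // (GOOGLE_APPLICATION_CREDENTIALS or the node's default service account).
    GoogleCredentials credentials =
        GoogleCredentials.fromStream(new FileInputStream("/secrets/pubsub-sa.json"));

    return Publisher.newBuilder(TopicName.of("my-gcp-project", "events"))
        .setCredentialsProvider(FixedCredentialsProvider.create(credentials))
        .build();
  }
}
```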

The POC’s definition of done was an end-to-end environment with PubSub as a replacement for RabbitMQ. No design discussions took place during that time. Reflecting back on the process, we realize that although most proofs of concept solely prove a concept, one that takes two engineers several days to implement might be better spent getting closer to the actual implementation. In this project, given both the resources and the time it consumed, that would definitely have been the wiser choice.

Implementation Phase

The implementation phase was divided into two parts: handling the dependency issues, and implementing GCP PubSub in our monolith and microservice.

Part 1: Handling Dependency Issues

A huge part of implementing GCP PubSub was ensuring the smallest possible impact on the monolith’s other domains. We decided to shade the legacy protobuf and gRPC versions. Shading solved the dependency conflicts and, more importantly, saved us from a potentially major refactoring to upgrade the legacy versions across the monolith.
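For reference, the relocation looks roughly like the snippet below with the maven-shade-plugin (assuming a Maven build; Gradle’s shadow plugin offers an equivalent mechanism). The shaded package prefix is arbitrary, and which side of the conflict you relocate depends on your setup.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Move the conflicting protobuf/gRPC classes into their own namespace
               so they can coexist with the other version on the classpath. -->
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>legacy.com.google.protobuf</shadedPattern>
          </relocation>
          <relocation>
            <pattern>io.grpc</pattern>
            <shadedPattern>legacy.io.grpc</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```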

Part 2: GCP PubSub Implementation in our Monolith and Microservice

We had to make several design decisions after the POC ended, which resulted in a major refactoring of the existing system. We also faced a couple of challenges that affected the implementation timeline.

  • Feature Flag: We decided to keep RabbitMQ behind a feature flag. Designing the system to support both queues was challenging and constrained the PubSub design.
  • Production Bugs: Early in the implementation phase, we identified a bug that required refactoring the original RabbitMQ system and raised further architecture questions.
  • Altering the Retry Process: With GCP PubSub, you pay per request. To optimize costs, we altered our queue’s retry process, as shown in the diagram below.
[Diagram: our new queue retry process after moving to GCP PubSub.]
  • Wrapping the PubSub Client: We debated whether to wrap the PubSub client or consume it directly as a library. The upside of wrapping is that it prevents code duplication and makes PubSub more accessible to other teams in our organization. On the other hand, PubSub already ships a strong Java client, and wrapping it could limit us in the future. We chose to move forward with the wrapper (a rough sketch follows this list), though there is a chance we will discard it later.
  • Independent Services: One of our goals was to make our services as independent as possible by letting them create their own subscriptions. PubSub, however, only supports this if the service has permissions for every topic and subscription in the project. Since we currently run in single-tenant mode, this didn’t meet our security requirements, so our services are not (yet!) fully independent.
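To illustrate the wrapper decision, here is a rough sketch of the kind of thin abstraction we have in mind. The interface and class names are hypothetical, and a real wrapper would also cover subscribing, error handling, and shutdown; the blocking publish is only there to keep the example small.

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;

/** Minimal queue abstraction other teams can code against (hypothetical name). */
interface QueuePublisher {
  /** Publishes the payload and returns the broker-assigned message id. */
  String publish(String payload) throws Exception;
}

/** PubSub-backed implementation that hides the client types from callers. */
class PubSubQueuePublisher implements QueuePublisher {
  private final Publisher publisher;

  PubSubQueuePublisher(Publisher publisher) {
    this.publisher = publisher;
  }

  @Override
  public String publish(String payload) throws Exception {
    PubsubMessage message = PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8(payload))
        .build();
    // Blocks until PubSub acknowledges the publish and returns the message id.
    return publisher.publish(message).get();
  }
}
```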

Summary

We will be rolling out GCP PubSub to production shortly, and we will undoubtedly learn more about the impact of replacing RabbitMQ as our queue system. The lessons described here raise some open questions about this project and its future. There is never a definite “right” or “wrong” in projects like these, but documenting the process and reflecting back on it is always the best way to improve iteratively.

Interested in joining our team and solving these kinds of problems? Check out our careers page for open roles.
