Beating Conway’s Law: Achieving Distributed Ownership with Guice, RabbitMQ and Kubernetes

Tzach Zohar
skai engineering blog
11 min read · Aug 2, 2022
Image by Tom Fisk: https://www.pexels.com/photo/industrial-buildings-connected-with-pipes-at-factory-6060188/

If you’re writing software or managing teams that do, you might be familiar with the (in)famous Conway’s Law:

Any organization that designs a system […] will produce a design whose structure is a copy of the organization’s communication structure

Or with its inverse, which states that an organization’s structure will eventually reflect the system’s existing architecture.

Both of these make sense due to the inevitable connection between software and how we create it: we work in teams, and the boundaries between the teams’ responsibilities are easier to manage if they align with the physical boundaries (repositories, servers) between software components. Conversely, software components are easier to maintain if they have a clear owner — one team per component — so that tasks don’t fall through the cracks or require a high level of coordination.

The problem arises when there’s a conflict between an optimal architecture — one that would be maintainable, simple and performant — and any reasonable organization structure that would align with the business goals and allow teams to develop expertise over their domains.

In this post, I’ll describe our attempt at solving a specific example of this issue. The solution was made possible thanks to the use of a “Plugin Architecture” (to be defined below), which in turn was enabled by using the right set of tools and technologies. We’ve since successfully replicated this solution across several services within Skai, and hope it might be of use to others.

The problem: conflicting vectors of separation

Before we present the case at hand — a bit of background: at Skai, we build an omnichannel marketing platform for brands to manage and optimize their online presence. “Omnichannel” means, among other things, that we connect to over a dozen “publishers” (platforms capable of running ads — such as Google, Amazon, Facebook etc.), and help our customers manage ads via these connections using a single platform. These “connections”, for example, allow fetching daily reports from each publisher, for each customer. Data from these reports will inform every decision (automated or manual) taken by or for our users. The reports collection is just one of many per-publisher tasks we perform: we pull and push various types of data to and from each publisher.

This creates a “matrix” of processes — with one axis being the publisher, and the other being the task at hand:

A matrix of processes. In reality this matrix has ~180 cells

Each cell in this table represents an actual component that requires coding and maintenance. There’s obviously a lot of commonality across each row — for example, fetching reports from any publisher would probably contain similar concepts and mechanisms (persisting data, handling failures etc.), even though each publisher’s API is different. Similarly — there are commonalities across each column — no matter what I do with publisher X, authentication and authorization to its API will be similar, for example.

Architecturally, this immediately presents us with a few options (see diagram below):

  • Option 1 — stand-alone service for each combination (cell): provides the best isolation, but generates huge complexity and may create a high degree of duplication (e.g. when many different services end up implementing the same persistence logic for fetched reports)
  • Option 2 — cross-publisher service per task (row): we’ll have a handful of services, each one doing one thing (fetch reports; push ad changes; etc.) for all publishers. The plus side is rather clear — it makes it very easy to align all publishers into a single, consistent pipeline, handling the many cross-cutting considerations in one place.
  • Option 3 — service per publisher (column): violates the “do one thing” principle, as each per-publisher service ends up servicing many different use cases (fetch reports, pull ad changes etc.)
  • Option 4 — monolith: do everything in one big monolith (yikes!). We’ve been there before, and it isn’t great! No isolation, no clear boundaries, a spaghetti of dependencies — a clear no-go.
The 4 theoretical “process-to-service” architectural layouts

Rather quickly, we decided to adopt Option 2. The downside: that’s exactly where the Inverse Conway’s Law rears its head. If we structure our architecture around these “rows” (tasks), we should also structure our teams that way, while our business constraints clearly point to preferring publisher-aligned teams that can develop expertise and knowledge of how each publisher behaves. A team structure aligned with these per-task services means teams will have to maintain code pertaining to over a dozen different publishers — each with their own idiosyncrasies, special cases, and cadence of API changes. It also means that the task of “supporting a new publisher” — which happens often enough as we continuously expand — requires the coordination of multiple teams. Which is, to say the least… bad.

These are our “conflicting vectors of separation”: we want to structure our services around tasks but our teams around publishers, supposedly violating Conway’s Law — which means we’ll eventually regret one of these decisions. Can we have it both ways?

The Solution: Publisher Plugins

In an effort to “square the circle”, we came up with the following suggestion:

  • We’ll build a service per “task” (row), with all the common behaviors of that task
  • Each such service will support publisher plugins — small implementations of a lean, predefined SPI which can be written and maintained independently from the rest of the service
  • Per-publisher teams can build these plugins, while other teams will maintain the services themselves
Selected option: per-publisher plugins used by per-task services

None of this is big news yet; many code components are structured this way — with multiple implementations of a single interface called from some central, common code. The challenge is to achieve this while preserving what we like about our services and the way they’re built and deployed (read more here):

  • Full CI/CD flows with full test coverage and immediate, automated deployments
  • Effective monitoring, with alerts sent to the right team
  • Runtime isolation — making sure that one team’s service (or plugin!) can’t affect the behavior of another team’s component

In other words, even if we have a common service for pulling reports (and we do — it’s aptly named Publisher Reports Fetcher, or PRF, for short), we’ll want separate environments to run its Google plugin and its Amazon plugin, and we’ll want the team owning the Google plugin to be completely independent of the team owning the Amazon plugin, for example.

So — how do we implement these “plugins”?

Step 1: Module Separation

The first step is to structure the code into independent modules, so that each module is compiled and unit-tested separately. These modules (managed by the build tool at hand — we’re using Gradle) reside within a single repository to reduce complexity and simplify deployment flows (we’ll discuss the downsides later).

Here’s how a typical cross-publisher plugin-based service, such as PRF, is structured:

A plugin-based project module structure

Let’s explain what happens here:

  • The Main module contains the service itself — in our case a Dropwizard or Spring Boot web server
  • It depends on the plugin modules at runtime only — meaning the plugin code is not accessible to the Main module during compilation (and vice-versa). Plugins are also completely independent of one another (see the Gradle sketch after this list)
  • The Common module contains the common SPI implemented by the plugins and called by Main. All other modules depend on it, but it contains only high-level definitions and a few utilities, and doesn’t depend on any other module
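In Gradle terms, this separation might look roughly like the sketch below; the module layout and names are illustrative, not the actual PRF build files:

```groovy
// settings.gradle — one repository, multiple independently-compiled modules
include 'common', 'main', 'plugins:google', 'plugins:amazon'

// main/build.gradle — Main compiles only against Common; plugins are runtime-only,
// so their code is invisible to Main at compile time
dependencies {
    implementation project(':common')
    runtimeOnly project(':plugins:google')
    runtimeOnly project(':plugins:amazon')
}

// plugins/google/build.gradle — a plugin depends only on Common
// (never on Main or on other plugins)
dependencies {
    implementation project(':common')
}
```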

You might be asking yourself: if Main module only depends on the plugin modules at runtime, how can it call a specific module (e.g. Google) when needed? How does it know which plugins even exist? And how does it call the right one for each incoming request?

So glad you asked! We’ll build out the answer in the next few sections.

Step 2: Dynamic Module Loading with Guice

In this example (PRF), we’re using Dropwizard as our web server. Dropwizard has a good integration with Guice — Google’s lightweight dependency injection framework — so we use its ability to dynamically and declaratively load dependencies in order to load each plugin module’s implementation of the common SPI into Main at runtime. Note that a very similar approach can be easily implemented using Spring.

First, let’s look at an example of such an SPI. In PRF, this “SPI” is mainly a single Scala trait, defined in the Common module.
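In simplified form, it looks roughly like this; apart from Publisher, PublisherType and ReportJobProcessor themselves, the helper types and signatures below are assumptions made for the sketch:

```scala
// Common module — the SPI and its supporting types (simplified sketch)

// Identifies a publisher. Each plugin supplies its own value, so Common never
// enumerates all publishers anywhere (representation assumed).
case class PublisherType(name: String)

// A single report-fetching request (fields assumed)
case class GetReportJob(publisherType: PublisherType, accountId: Long)

// Does the per-publisher work: processes a "get report" job and returns the raw
// report data, which the common code in Main later persists (return type assumed)
trait ReportJobProcessor {
  def process(job: GetReportJob): Iterator[Array[Byte]]
}

// The SPI every publisher plugin implements
trait Publisher {
  def publisherType: PublisherType
  def reportJobProcessor: ReportJobProcessor
}
```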

The trait’s methods return the PublisherType (e.g. Google) and a ReportJobProcessor which does the actual work required from each plugin (process a GetReport job and return a stream of data, which is later persisted by the common code in Main). Each publisher plugin will have an implementation of this trait, e.g. GooglePublisher.

Each plugin would also contain an implementation of Guice’s AbstractModule, which defines instances that can be wired into each other at runtime. The main trick here is to “register” the implementation of the Publisher trait into a Set of publishers:
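For the Google plugin, for example, such a module might look roughly like this (package and class names are illustrative, and the actual fetching logic is elided):

```scala
import com.google.inject.AbstractModule
import com.google.inject.multibindings.Multibinder

// The Google plugin's implementation of the SPI (Google-specific logic elided)
class GooglePublisher extends Publisher {
  override def publisherType: PublisherType = PublisherType("GOOGLE")
  override def reportJobProcessor: ReportJobProcessor = ??? // the Google-specific processor
}

// The Google plugin's Guice module: registers GooglePublisher into the
// application-wide Set[Publisher]. Every plugin module does the same, so Main can
// inject the full set without knowing which plugins exist at compile time.
class GoogleModule extends AbstractModule {
  override def configure(): Unit = {
    Multibinder
      .newSetBinder(binder(), classOf[Publisher])
      .addBinding()
      .to(classOf[GooglePublisher])
  }
}
```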

This means that a Set of Publishers can be injected into any class, and that set will contain all the publishers registered by any module whose configure method was called. Therefore, the Main module can contain a simple lookup service for finding the right Publisher implementation for a given type:
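A sketch of what that lookup could look like; since the raw Guice Multibinder is used above, the injected set is a java.util.Set, and the class name here is illustrative:

```scala
import com.google.inject.Inject
import scala.jdk.CollectionConverters._

// In the Main module: Guice injects every Publisher that the loaded plugin modules
// registered, and we index them by type once at construction time.
class PublisherRegistry @Inject() (publishers: java.util.Set[Publisher]) {

  private val byType: Map[PublisherType, Publisher] =
    publishers.asScala.map(p => p.publisherType -> p).toMap

  // Find the plugin responsible for a given publisher type (e.g. from an incoming job)
  def forType(publisherType: PublisherType): Option[Publisher] =
    byType.get(publisherType)
}
```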

The PublisherType values are themselves registered dynamically, so there needn’t be any class in the application that’s familiar with all publishers (which means adding a publisher requires no changes to common code!).

Lastly, we can dynamically decide which modules to load via configuration. By default, we set our system to load all modules via the following configuration entry:
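In a Dropwizard-style YAML configuration with environment-variable substitution enabled, such an entry could look roughly like this; the key name and module class names are illustrative:

```yaml
# config.yaml (abbreviated): which Guice modules to install at startup.
# The default lists every plugin module; setting GUICE_MODULES overrides it per deployment.
guiceModules: ${GUICE_MODULES:-com.skai.prf.plugins.google.GoogleModule,com.skai.prf.plugins.amazon.AmazonModule}
```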

That means we can use an environment variable named GUICE_MODULES to override which modules are actually loaded, when starting the service.

The result is fully-decoupled code that is packaged into one Docker image (with all plugin modules available) and a configurable subset of modules available at runtime. In the next section, we’ll see how this variable is used to ensure decoupling in production.

Step 3: Separating Deployments with Kubernetes

Now that we have a single repository, built into a single Docker image, which can be started with a configurable set of plugin modules — we can use Kubernetes to define separate deployments for the different publishers. Busier publishers can get a larger deployment (more pods); smaller ones can get minimal deployments, or even share a deployment if we don’t mind publishers X and Y sharing the same runtime.

Doing this with Kubernetes is trivial — we define multiple deployments using the same Docker image with different settings, and with different values for the GUICE_MODULES variable we use in our configuration. For example, here’s an (abbreviated) configuration for two deployments:
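A rough sketch of such a pair of deployments; the names, image tag, and replica counts are illustrative:

```yaml
# prf-google: the busier publisher gets more pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prf-google
spec:
  replicas: 6
  selector:
    matchLabels: { app: prf, publisher: google }
  template:
    metadata:
      labels: { app: prf, publisher: google }
    spec:
      containers:
        - name: prf
          image: skai/publisher-reports-fetcher:1.2.3   # same image for every deployment
          env:
            - name: GUICE_MODULES
              value: "com.skai.prf.plugins.google.GoogleModule"   # load only the Google plugin
---
# prf-amazon: same image, fewer pods, different plugin
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prf-amazon
spec:
  replicas: 2
  selector:
    matchLabels: { app: prf, publisher: amazon }
  template:
    metadata:
      labels: { app: prf, publisher: amazon }
    spec:
      containers:
        - name: prf
          image: skai/publisher-reports-fetcher:1.2.3
          env:
            - name: GUICE_MODULES
              value: "com.skai.prf.plugins.amazon.AmazonModule"
```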

You can see the deployments differ in the number of replicas (in this example, Google needs more pods), as well as the GUICE_MODULES variable which selects only one module for each deployment. Other per-plugin configurations can also be set.

Step 4: Routing Jobs with RabbitMQ

We now have multiple Kubernetes deployments running, each one capable of handling just one publisher — which gives us the isolation we were looking for while we can still manage one unified service and reuse its CI/CD pipeline and code. The only remaining question is — how do we route requests for Google reports to the Google deployment, and requests for Amazon reports to the Amazon deployment?

Before implementing this architecture, incoming requests (which are HTTP POSTs) were all enqueued into a (single) RabbitMQ queue, and consumed by all pods in the cluster. That setup provided (approximately) exactly-once delivery semantics and a trivial retry mechanism (if a pod crashes, another one will handle the messages it failed to handle). We can enhance it to also provide the routing we’re after:

  • We’ll have multiple queues — one per publisher
  • Each pod will only consume messages from queues of the publishers it has loaded
  • A separate deployment will load none of the plugins, and only handle incoming requests, enqueuing them into the right queue based on the publisher type (which is part of the request)

Easy!
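Here’s a sketch of how that routing could be wired with the RabbitMQ Java client; the queue-naming scheme and the consumer setup are assumptions for illustration, not the service’s actual messaging code:

```scala
import com.rabbitmq.client.{CancelCallback, ConnectionFactory, DeliverCallback, Delivery}

object ReportJobRouting {

  // One queue per publisher, e.g. "prf.reports.google" (naming scheme assumed)
  private def queueNameFor(publisherType: PublisherType): String =
    s"prf.reports.${publisherType.name.toLowerCase}"

  private val connection = new ConnectionFactory().newConnection() // connection settings elided

  // The "router" deployment (no plugins loaded): enqueue each incoming request
  // into the queue of the publisher named in the request.
  def enqueue(job: GetReportJob, payload: Array[Byte]): Unit = {
    val channel = connection.createChannel()
    val queue = queueNameFor(job.publisherType)
    // Declaration is idempotent: this creates the queue if it doesn't exist yet
    channel.queueDeclare(queue, /* durable = */ true, false, false, null)
    channel.basicPublish("", queue, null, payload)
  }

  // A plugin-carrying deployment: consume only from the queues of the publishers
  // it actually has loaded (i.e. those present in the injected Set[Publisher]).
  def consume(loadedPublishers: Set[Publisher], handle: Array[Byte] => Unit): Unit = {
    val channel = connection.createChannel()
    loadedPublishers.foreach { publisher =>
      val queue = queueNameFor(publisher.publisherType)
      channel.queueDeclare(queue, true, false, false, null)
      val onDeliver: DeliverCallback = (_: String, delivery: Delivery) => {
        handle(delivery.getBody)
        // Ack only after successful handling, so unprocessed messages are redelivered
        // if a pod crashes (the retry behavior described above)
        channel.basicAck(delivery.getEnvelope.getDeliveryTag, false)
      }
      val onCancel: CancelCallback = (_: String) => ()
      channel.basicConsume(queue, /* autoAck = */ false, onDeliver, onCancel)
    }
  }
}
```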

The end result is described by this diagram:

Routing messages to the right per-publisher deployment via queues

It’s worth noting that thanks to how RabbitMQ works — the queues need not be created manually or via automation; queue declaration is idempotent, so each consumer (or producer) simply declares the queue it needs before using it, and the broker creates it on the first such declaration. So, as soon as a new deployment comes online — its queue is created automatically.

Tying it all together

The resulting system is easy to evolve and scale over multiple vectors:

  • It’s easy to add publishers without the potential of affecting others: a team can write their own plugin (implementing a simple SPI), add a new deployment, and we’re live — without changing any common code or interrupting other teams’ deployments
  • It’s easy to scale an existing publisher by configuring its stand-alone deployment or improving its code
  • It’s also easy to add features to the service (e.g. better retry support, new output formats etc.) as long as the SPIs stay stable (or are enhanced in a backward-compatible way, thus not forcing all plugin owners to adapt)

Each one of these common tasks can be fully owned by a single team, which makes these teams highly independent and effective. Our initial goal (“we want to structure our services around tasks but our teams around publishers”) was fully achieved with rather minimal friction.

Future Enhancements

Of course, no design is perfect and nothing is just easy. This approach works well for us, and was replicated into multiple cross-publisher services, but it does have some downsides or costs that can be further optimized. Some examples:

  • Per-task services share a single CI/CD pipeline. That means that tests for all plugin modules are executed on every release of the service, which might mean an unstable test managed by one team can block another team’s work. That’s currently our #1 pain point, and we’re looking into solutions — such as identifying the changed plugin(s) in each commit and only running tests for that plugin, as well as a minimal set of service tests. That should be rather effective, as almost no commits affect more than one plugin. It’s also safe to assume changes made to one plugin will not affect other plugins — they’re independent and won’t even run in the same environment!
  • Another outcome of this shared CI/CD pipeline is that releases become slower as more and more publishers are added, due to the increasing amount of tests. That too can be handled by being selective with which tests are executed based on the changes in each release.
  • Alerts are delivered to the team owning the specific deployment (e.g. the team that owns the Amazon publisher). When there’s an issue affecting multiple deployments (e.g. an issue with some shared resource), all teams will get the same alerts, creating unnecessary confusion and disruption. Ideally, such alerts would be identified and routed to the team that owns the Main module rather than the teams owning the plugins.
  • This example uses Scala, Dropwizard, and Guice. It’s important to note that the concepts here are replicable to other libraries — and indeed we have other services designed similarly using Java, Spring Boot, and Spring.

Some of our cross-publisher services already support a double-digit number of plugins, which means these downsides are becoming more noticeable. Still, compared to other solutions we’ve tried or used elsewhere (monoliths, per-publisher services etc.) — this one seems to be a clear winner, balancing our organizational needs with a good, scalable architecture we can maintain.

Tzach Zohar
skai engineering blog

System Architect. Functional Programming, Continuous Delivery and Clean Code make me happier. Mutable state and for loops make me sad.