Layers vs Silos, a tale of 2 microservice architectures

Emmanuel Joubaud
JobTeaser Engineering
12 min read · Apr 7, 2023

When it comes to the communication between microservices, there are 2 possible extremes:

  • All-sync: whenever a service needs data from another service, it fetches it via a synchronous API call (REST, gRPC, GraphQL). Service calls service calls service… which tends to evolve into layers of APIs, where each layer has dependencies on the next.
  • All-async: no sync calls between services, all communication is event-driven or in the form of data replication. This tends to form silos: independent services that are fed all the data they need by an async mechanism and can function independently of any other service’s availability.

Of course real-world systems rarely fit either of these 2 extremes perfectly. But whether you plan for it or not, you’ll probably end up with a dominant mode or, worse, a random mix.

Imagining a spectrum between these 2 extremes can offer a helpful mental model to design a more deliberate architecture, where you control the set of circumstances under which you tend towards one or the other, and the associated trade-offs.

Layers

An example of a Layers architecture, for a Career Service Management System

Business layer

So you’ve started building a bunch of business services for your microservice application. They implement business logic and store data. For isolation, you chose to go database-per-service: each service stores a bunch of business objects for which it is the Source of Truth.

If you opt for sync-only communications between your services, that means any time one service needs to read another’s data, it has to make a synchronous API request to the Source of Truth service, introducing a dependency between them.
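To make that concrete, here is a minimal sketch of such a synchronous read dependency, in Python with the requests library. The service URL, the job-application example and the local `db` helper are hypothetical, just there to illustrate the shape of the call:

```python
import requests

USER_SERVICE_URL = "http://user-service.internal"  # hypothetical internal URL


def get_job_application(application_id: int, db) -> dict:
    # The job service owns the application record in its own database...
    application = db.fetch_application(application_id)

    # ...but the applicant's profile belongs to the user service (its Source of
    # Truth), so we have to call it synchronously: if it's down, so are we.
    resp = requests.get(
        f"{USER_SERVICE_URL}/users/{application['user_id']}", timeout=2
    )
    resp.raise_for_status()
    application["applicant"] = resp.json()
    return application
```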

And if you map those dependencies as a graph, you’ll likely notice that software systems tend to contain a few business objects that are depended on way more than others. They’re the core of the dependency graph. Often they’re your users, your reference data (e.g. categories), the items of your e-commerce site, etc.

So even within your “business layer”, your APIs will naturally tend to structure themselves into sub-layers, with, for instance, some “core dependency” services being depended on by many of the others.

Composition layer

To display data to your users, the frontend will need to query some of your backend services. And it will often need to join the data of multiple services. Doing so in the client would incur a performance penalty: you’d need to make a bunch of client-server round-trips, one to each service…

So the natural evolution from there is to introduce one or more Composition services in front of your system to merge the data from several backend services into a single payload for your frontend.

Maybe you’re using a few Backend-for-Frontend services, maybe it’s a single GraphQL federation gateway, maybe you just call these services “frontend services”. In any case, you end up with what we will call here a “Composition layer”, where data integration between your services happens upstream of the end user’s API request, before it hits your business services.
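As a rough sketch of what such a composition endpoint might look like, here is a hypothetical page handler that fans out to two imaginary backend services in parallel and merges their responses into one payload (the URLs and response shapes are illustrative assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

JOB_SERVICE = "http://job-service.internal"    # hypothetical URLs
USER_SERVICE = "http://user-service.internal"


def job_board_page(user_id: int) -> dict:
    # Fan out to the business services in parallel to limit added latency...
    with ThreadPoolExecutor() as pool:
        jobs_future = pool.submit(
            requests.get, f"{JOB_SERVICE}/jobs", params={"limit": 20}, timeout=2
        )
        user_future = pool.submit(
            requests.get, f"{USER_SERVICE}/users/{user_id}", timeout=2
        )
        jobs = jobs_future.result().json()
        user = user_future.result().json()

    # ...then merge everything into a single payload for the frontend.
    return {"viewer": user, "jobs": jobs}
```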

Pitfalls of the Layers architecture

The Layers architecture is simple: it closely resembles how we structure regular software in a monolithic application, with API calls instead of function calls. But let’s take a step back and consider some of its weaknesses.

Resilience

A major selling point of the microservice architecture is its resilience benefits: if your job board module and your career appointment module live in the same monolithic application, an outage of the monolith means that both modules are down. If they’re independent backend services, the users of your job board module can still access it even if your career appointment booking system is down.

But if both your services actually depend on a third “core dependency” service, for instance the user service, then when that service is down, both services stop working.

The “core dependency” services tend to become Single Points of Failure, just like your old monolith was.

Fat composition layer, thin DB wrappers

Now that you have a composition layer, exposing a new endpoint might actually force you to implement 2 endpoints: one in the business service that owns the data, and one in the composition layer.

A typical approach to mitigate the problem is then to make your business service APIs as generic as possible, so they can easily accommodate a wide range of use-cases, and you only need to implement new specific use-cases in endpoints of the composition layer.

Designing generic APIs is hard. You need to accommodate not only the use-cases you have at hand, but all the possible use-cases the future might hold in store for you.

A common trick is to provide a set of very simple primitives, typically basic CRUD operations for data. Then you can combine those simple building blocks together in a higher-level layer to assemble more complex and powerful features: access control, sending emails, orchestration, sagas…

For instance, depending on whether it’s initiated by an admin or a basic user, an object creation may involve different ACL checks, input validations or different follow-up actions like sending emails to different people. A simple fix is to only handle the basic insertion of the object in the domain service, with perhaps a few invariant validations, and let the composition layer handle the use-case-specific ACL checks, validations and email-sending.
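Here is a sketch of that split, with hypothetical services and endpoints: the domain service only exposes a generic “create” operation, while the composition layer carries the use-case-specific checks and follow-ups (the URLs, the role model and the email payload are all assumptions for illustration):

```python
import requests

APPOINTMENT_SERVICE = "http://appointment-service.internal"  # hypothetical
EMAIL_SERVICE = "http://email-service.internal"              # hypothetical


def create_appointment_as_admin(admin: dict, payload: dict) -> dict:
    # Use-case-specific ACL check and validation live in the composition layer...
    if "admin" not in admin["roles"]:
        raise PermissionError("only admins can book on behalf of a student")

    # ...the business service just performs the generic insertion...
    resp = requests.post(
        f"{APPOINTMENT_SERVICE}/appointments", json=payload, timeout=2
    )
    resp.raise_for_status()
    appointment = resp.json()

    # ...and the use-case-specific follow-up (notify the student) also lives here.
    requests.post(
        f"{EMAIL_SERVICE}/emails",
        json={"to": payload["student_email"],
              "template": "appointment_booked_by_admin"},
        timeout=2,
    )
    return appointment
```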

You can see how that creates a gravitational pull towards managing more business logic inside the composition layer, and turning business services into “anemic” database wrappers providing simple CRUD operations.

Joins

Performing and optimizing joins over API requests is typically much harder than with a SQL query within a single database. What if you need to join and paginate the results from 2 different sources while filtering on attributes of both? Or more complex queries involving 5 or 6 data sources? Those queries can be really hard to write and harder yet to optimize and cache correctly.
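To illustrate the problem, here is a deliberately naive sketch of a paginated API-level join, filtering on an attribute owned by another (hypothetical) service. The endpoint shapes, cursor field and university filter are assumptions; the point is the N+1 requests and over-fetching you end up with when neither side can filter on the other’s attributes:

```python
import requests

USER_SERVICE = "http://user-service.internal"                # hypothetical
APPOINTMENT_SERVICE = "http://appointment-service.internal"  # hypothetical


def appointments_for_university(university_id: int, page_size: int = 20) -> list:
    results, cursor = [], None
    # Without a joint index, we can only page through one side and probe the
    # other, issuing one request per appointment until the page is full.
    while len(results) < page_size:
        page = requests.get(
            f"{APPOINTMENT_SERVICE}/appointments",
            params={"limit": 100, "cursor": cursor}, timeout=2,
        ).json()
        for appt in page["items"]:
            user = requests.get(
                f"{USER_SERVICE}/users/{appt['user_id']}", timeout=2
            ).json()
            if user["university_id"] == university_id:
                results.append({**appt, "user": user})
        cursor = page.get("next_cursor")
        if cursor is None:
            break
    return results[:page_size]
```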

The distributed monolith

In the end that fat composition layer fetching data from various services via API calls ends up looking a lot like a good old monolith fetching data from its database through ORM calls.

Your business services may be simple and have well-defined APIs, but most of the actual business logic is stitched together in the fat composition layer, and you don’t get much benefit in terms of resilience.

But you’ve also bought all the complexities of operating a distributed system: increased latency and network unreliability, complex API joins, distributed transactions… Observability is complicated by cascading errors, and you need tracing to find the source of incidents. You can’t really run your stack in your local environment anymore without running all the services or resorting to complex mocking…

So is that complexity really avoidable? How can you truly decouple services? It can feel pretty hopeless. That’s where the “Silos” architecture comes in.

Silos

An example of a Silos architecture, for a similar Career Service Management System

Data replication

Now let’s try to imagine what the opposite of the “Layers” architecture would look like. Say we don’t allow any sync calls between services. Each service then needs to contain all the data it needs to perform its function without depending on any other service being available. I call such a service a “Silo”.

But without sync calls, how can your service access another’s data? One way or another you’ll need to dispatch the data across the services using some form of asynchronous data replication.

A typical implementation will use some sort of pub/sub or message queue, for instance Kafka, where all your data, either raw or in the guise of events, can be pushed and dispatched to “consumer services” interested in using it, so they can keep their own readonly copy of the data up-to-date in their local database. In this model, write operations typically go through a request to a “Source of Truth” service that handles the relevant validations and produces the data onto the message queue.

For instance, when a user is created or updated, the user service will push a message to the “user” topic in Kafka, containing all the attributes of that user. Then if another service needs read access to the user data, it can subscribe to the “user” topic and replicate a readonly copy of the users into its own database, so that it can read and join the user data via regular SQL queries.
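Here is a simplified sketch of that flow, using the kafka-python client and SQLite as the consumer’s local store. Both are illustrative choices, not a prescription, and the topic name, broker address and message shape are assumptions:

```python
import json
import sqlite3

from kafka import KafkaConsumer, KafkaProducer

# --- In the user service (Source of Truth): publish the full user state ---
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)


def on_user_saved(user: dict) -> None:
    # Key by user id so every version of a given user lands on the same partition.
    producer.send("user", key=str(user["id"]).encode(), value=user)


# --- In a consumer service: maintain a readonly local copy of the users ---
db = sqlite3.connect("consumer.db")
db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, payload TEXT)")

consumer = KafkaConsumer(
    "user",
    bootstrap_servers="kafka:9092",
    group_id="appointment-service",
    value_deserializer=lambda v: json.loads(v.decode()),
)

for message in consumer:
    user = message.value
    # Upsert so replaying or re-delivering a message is harmless.
    db.execute(
        "INSERT INTO users (id, payload) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload",
        (user["id"], json.dumps(user)),
    )
    db.commit()
```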

The main perk of this approach is that even if the “Source of Truth” service is down — or even the message queue — the “consumer” services can continue to work. Their data might be slightly outdated but in most cases it’s better than an outage.

Having their own copy of the data also gives your consumers a lot of flexibility in how they store and query that data. Performing joins and complex queries becomes as easy as if it were the service’s own data.

Inner-layers

As you may have noticed from the schema, our silos have layers too, but they’re inner-layers, inside each service. They’re reminiscent of the “layers” architecture:

  • the replicated data, consumed via the pub/sub and kept readonly, is akin to the “core dependencies” layer
  • the service’s own data (the data for which it is the Source of Truth and receives write operations) can depend on the replicated data
  • the service’s business logic depends on both replicated and own data
  • some endpoints may expose or write only the service’s own data, executing the service’s business logic
  • finally some composition endpoints might expose a join of replicated and own data, possibly by calling the service’s own endpoints as a composition service would

Pitfalls of the Silos architecture

Replication costs

The biggest drawback of an architecture based on data replication is that implementing a replication mechanism is costlier and more complex than just querying an endpoint.

To consume a new table, you need to update the database schema of the consumer service, implement the replication logic, backfill the historical data and ensure every subsequent change is propagated.

You need to minimize, detect and fix inconsistencies, handle deletion events, and ensure idempotence and resilience to ordering issues. You also need to think carefully about the design, scope and size of your replication channels or topics: what data should go into each stream (a problem analogous to API design and aggregate design in Domain-Driven Design).
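As a sketch of what the consumer side of those concerns can look like, here is a hypothetical handler that assumes each message carries the entity id, a monotonically increasing version and either the full payload or None as a tombstone for deletions (that contract is an assumption, not a given):

```python
import json
import sqlite3
from typing import Optional

db = sqlite3.connect("consumer.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS users "
    "(id INTEGER PRIMARY KEY, version INTEGER, payload TEXT)"
)


def handle_user_message(user_id: int, version: int, payload: Optional[dict]) -> None:
    row = db.execute("SELECT version FROM users WHERE id = ?", (user_id,)).fetchone()
    if row and row[0] >= version:
        # Stale or duplicate message: ignoring it keeps the handler idempotent
        # and resilient to out-of-order delivery.
        return

    if payload is None:
        # Tombstone: the entity was deleted in the Source of Truth service.
        db.execute("DELETE FROM users WHERE id = ?", (user_id,))
    else:
        db.execute(
            "INSERT INTO users (id, version, payload) VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET version = excluded.version, "
            "payload = excluded.payload",
            (user_id, version, json.dumps(payload)),
        )
    db.commit()
```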

Tooling and automation can help but, depending on the level of flexibility and robustness you’re after, you might have to implement your own. Built-in database replication mechanisms like Postgres logical replication only work if all your services use the same DB engine. And off-the-shelf CDC tools like Debezium tend to work at the database schema level rather than through a versioned API, so they can’t help you without introducing strong coupling between the DB schemas of the source service and the consumers.

We will dive into ways of handling those challenges in a subsequent article.

Resource usage

Replicating data across services can take up a non-trivial amount of DB storage space, network bandwidth and computing power.

Unless you operate at a massive scale, your pub/sub is unlikely to be the bottleneck here: tools like Kafka can scale to deliver enormous amounts of data to a lot of consumers very fast. But the regular flow of upserts can take a toll on the database of smaller consumer services.

Limitations to service granularity

Because of replication costs and resource usage, this approach alone doesn’t lend itself well to splitting your system into a large number of very small microservices. You would end up replicating a lot of data in each service — and often the same tables in services within related domains.

One solution is to accept coarse-grained modular services, where each physical service contains several submodules that depend on the same parts of the data model. This is more akin to a Service-Oriented Architecture than true microservices, but that’s fine. SOA may actually be the right granularity for many companies that have yet to reach the bigger scales where real microservices shine.

You can even let data cohesion guide you to find the right seams to break your system down into services: if 2 modules depend on a lot of the same data, they might be better off as two submodules of the same service, which replicates the data they both need into its local database. If they have very different data requirements, it’s a good sign they actually tackle different subdomains and belong in different services.

Or another approach might be to mix the two: define a bunch of domains, each with a core service that contains the most central entities and replicates data asynchronously, surrounded by a few satellite services that contain the logic of satellite submodules, integrating with the core through synchronous APIs.

Eventual consistency

Replicating data means you’re entering the wonderful world of distributed systems, and the CAP theorem tells you that, if you want to contain failures to single subsystems, your data cannot always be up-to-date across all your services. Your system needs to become tolerant to replication lag, and be designed accordingly.

For specific use-cases where strong consistency is required, you might have to fall back to synchronous API calls.

Of course, choosing to go with microservices and separate databases means that you’re working with a distributed system anyway and need to give up on strong transactional consistency across your whole data model, even in a “layers” architecture.

Combining layers and silos

As you may have picked up by now, at JobTeaser we’ve chosen the “Silos” architecture as our default stance for business services. We mostly avoid synchronous calls between backend services and rely on data replication as much as possible instead.

But layers and synchronous calls are so important to software architecture that it’s not desirable to avoid them altogether. That’s why we also sprinkle bits of “Layers” where appropriate:

  • Avoid business logic duplication: One downside of replicating raw data is that it doesn’t transmit the associated business logic to manipulate that data. In a lot of cases that’s not a big deal, because we only need simple display or querying logic that actually belongs in the consumer service. Other times we can side-step the issue by producing pre-calculated values into the Kafka topic (typically calculations that only involve the state of the business object, with no other input parameters). Other times yet, when we need to reuse some presentation logic across different parts of the app, we can just expose the raw data in the client-facing endpoints and leave its manipulation to the frontend’s monorepo (e.g. i18n). And sometimes the logic is so trivial and non-vital that it’s not a big deal to duplicate it in a couple of places. Yet, when we need to DRY-up business logic and none of those workarounds make sense, we do resort to synchronous API calls between business services.
  • Strong consistency requirements: For some use-cases, dealing with data that’s not quite up-to-date is not an option. In those cases, we sometimes fallback to synchronous communications and have to accept outages when the upstream dependency is down.
  • Tech bricks: Besides our business services, which hold business logic and data, we also have a bunch of technical services that hold little to no business logic and serve as a sort of library or interface between our infrastructure and clients or external services. We call those “tech bricks”. We mostly interact with tech bricks through sync API calls, and they tend to cluster into layers upstream and downstream of our business services. For instance, our apigw and session management service live in an “access layer” that depends on our business services, and our email service — used to send transactional emails — lives in an outer layer that our business services depend on. Our “Silos” architecture has layers too, just not layers of business services depending on each other.
  • Mobile BFF: Our mobile apps require especially stable APIs because, unlike browser SPAs, we can’t roll out upgrades instantly to all clients. We need stable APIs to support older versions of the mobile apps. That’s why our mobile team chose to create a Backend For Frontend service that maintains a specific API for the iOS and Android apps. It doesn’t have its own database and performs sync API calls on many backend services, in a typical “composition layer” fashion.

Summary

The “layers” architecture, based on synchronous calls between services, is simpler and cheaper to implement, but it tends to evolve into a distributed monolith and won’t unlock all the resilience benefits that a service-oriented architecture can provide.

The “silos” architecture, based on asynchronous integration, tends to be costlier and more complex to build, and to nudge you towards bigger services, but it gives you more resilience and isolation between services.
