Your Distributed Monoliths are secretly plotting against you


What important truth do very few people agree with you on? — Peter Thiel

Before writing this, I spent a lot of time applying this question to a topic that’s very trendy nowadays: microservices. I believe I have found some interesting insights, some based on reflection and others on real experience, that I will share with you today.

Most implementations of microservices are nothing more than distributed monoliths.

The Monolith era

Every system starts as a monolithic application. I’m not going to dwell on that, as plenty of people have already written on the subject. However, the vast majority of the content about monoliths focuses on aspects such as developer productivity and scalability, leaving aside the most valuable asset of every Internet-based company: data.

A typical architecture of a monolith application
One virtue of monoliths worth highlighting: they are consistent when it comes to data (assuming you’re using an appropriate database for your use case).

The Distributed Monolith Era

Let’s say your company is doing well and your application needs to evolve. You are getting bigger and bigger clients, and your Billing and Reporting requirements have changed in terms of feature set and volume.

A generic System’s architecture after decoupling Billing and Reporting from the core monolith app
On the surface, everything looks great:
  • CI/CD pipelines are working like a charm;
  • Your Kubernetes cluster is healthy, and your engineers feel productive and happy.
A generic example of an ETL analytics system (at Unbabel we called it Automatic Translation Analytics)
This approach seemed appealing at first:
  • It requires no big infrastructure changes (just adding a new microservice);
  • We were able to answer our business requirements in a short period of time.

1. Data changes

One of the biggest advantages of microservices is encapsulation. The internal representation of the data can change without affecting the system’s clients, because they communicate via an external API. However, our strategy required direct access to the internal representation of the data, which meant that every time a team made a change to the way data was represented (e.g. renaming a field or changing a type from text to uuid), we had to change and deploy our ETL service.
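To make the coupling concrete, here is a minimal sketch of an ETL step that reads another service’s private table directly. The table and column names (`billing_invoices`, `invoice_id`, `amount_cents`) are made up for illustration; any rename by the owning team silently breaks this extractor.

```python
# Hypothetical sketch: an ETL step coupled to another service's INTERNAL schema.
import sqlite3

def extract_billing_rows(conn):
    # If the Billing team renames "amount_cents" or changes "invoice_id"
    # from text to uuid, this query breaks and the ETL service must be
    # changed and redeployed in lockstep with Billing.
    return conn.execute(
        "SELECT invoice_id, amount_cents FROM billing_invoices"
    ).fetchall()

# Stand-in for Billing's private database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE billing_invoices (invoice_id TEXT, amount_cents INTEGER)")
conn.execute("INSERT INTO billing_invoices VALUES ('inv-1', 1250)")
print(extract_billing_rows(conn))  # [('inv-1', 1250)]
```

The fragility is not in the SQL itself but in the dependency: the contract is the other team’s internal schema, which they have every right to change.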

2. Many different data schemas to handle

As the number of systems we had to connect to increased, we started dealing with a lot of heterogeneous ways of representing data. It became obvious that managing all those schemas, relationships and representations was not going to scale for us.

The root of all evil

In order to get a complete view of what happens in the system, we ended up following an approach similar to a monolith. The only difference is that instead of a single system and a single database, there are dozens of them, each with its own representation of the data, and sometimes even with the same data replicated across several of them.

Linkedin’s data flow spaghetti mess circa 2011 — source

Breaking the Distributed Monolith with Event Sourcing

Much like the rest of the world, Internet systems are driven by actions. A request to an API can lead to a record being inserted in a database. Most of the time we don’t really think about this, because we only care about the resulting database state. But the state update is a causal consequence of an event (in this case, an API request) that happened. The concept of an event is simple, and yet it’s a very powerful tool we can use to break the Distributed Monolith.

“Why does having microservices emitting events help me with the distributed monolith problem?”

When you have systems emitting events you can have a log of facts that:

  • Is immutable: once an event is emitted it cannot be changed;
  • Is reproducible: a state of the system, at a given point in time, can be reproduced by replaying the log of events.
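The two properties above can be sketched in a few lines. This is a deliberately minimal model, using a plain Python list as the append-only log and made-up event names; a real log would live in something like Kafka.

```python
# An append-only log of facts: events are only ever added, never changed.
events = [
    {"actor": "alice", "verb": "like",   "object": "photo:1"},
    {"actor": "bob",   "verb": "like",   "object": "photo:1"},
    {"actor": "alice", "verb": "unlike", "object": "photo:1"},
]

def replay(log):
    """Rebuild current state (who likes each photo) by replaying the log."""
    likes = {}
    for e in log:
        fans = likes.setdefault(e["object"], set())
        if e["verb"] == "like":
            fans.add(e["actor"])
        elif e["verb"] == "unlike":
            fans.discard(e["actor"])
    return likes

# State is always derivable from the log; replaying a prefix of the log
# gives you the state of the system at that point in time.
print(replay(events))  # {'photo:1': {'bob'}}
```

Replaying only the first two events would instead yield both alice and bob as fans, which is exactly the "state at a given point in time" property.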

1. One source of truth

Instead of connecting to N data sources, possibly backed by many different types of databases, in this new design your source of truth is just one: the event log.

2. Universal data format

In the previous design, we had to deal with many data representations because we were coupled with the database directly. In this new one, we can express ourselves with a lot more flexibility.

An event following the AVO (Actor, Verb, Object) approach that models the fact of a user liking a photo
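Serialized as JSON, such an AVO event might look like the sketch below. The actor/verb/object triple is from the approach above; the envelope fields (`id`, `timestamp`) are assumptions about what a production event would also carry.

```python
# A hedged sketch of an AVO event as it might travel over the wire.
import json
import uuid
import datetime

event = {
    "id": str(uuid.uuid4()),  # unique event id (assumed envelope field)
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "actor": "user:42",       # who did it
    "verb": "like",           # what they did
    "object": "photo:314",    # what they did it to
}
print(json.dumps(event, indent=2))
```

Because every event shares this shape, a consumer can parse any producer’s events with one schema instead of one per upstream database.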

3. Decoupling between producers and consumers

Last but not least, one of the greatest advantages of having events is the effective decoupling of data producers and consumers. This not only allows systems to scale more easily, but it also reduces the dependencies between them. The only contract between systems becomes the event schema.
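The decoupling can be illustrated with a toy in-memory event bus. In production this role would be played by a broker such as Kafka, but the contract is the same: the producer knows nothing about its consumers, the consumers know nothing about the producer’s internals, and both agree only on the event schema.

```python
# Illustrative in-memory event bus; names and handlers are made up.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, verb, handler):
        # Consumers register interest in a kind of fact, not in a producer.
        self.subscribers[verb].append(handler)

    def publish(self, event):
        # Producers emit facts without knowing who, if anyone, is listening.
        for handler in self.subscribers[event["verb"]]:
            handler(event)

bus = EventBus()
liked = []
# The analytics consumer never touches the producer's database.
bus.subscribe("like", lambda e: liked.append(e["object"]))
# The producer just states the fact that happened.
bus.publish({"actor": "user:42", "verb": "like", "object": "photo:314"})
print(liked)  # ['photo:314']
```

Adding a second consumer (say, a notifications service) requires no change to the producer, which is exactly the property the ETL-over-databases design lacked.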


Thanks to Rui Santos

Written by João Vazao Vasques
Data @Unbabel | Sandboxer @Sandbox | Past: Talkdesker @talkdesk | Founder of Wazza.io | Software Eng @Uniplaces. Taekwondo black belt - 2nd place Worlds '12

Unbabel R&D: a collection of articles from the Unbabel Research & Development Team.
