Laws of Distributed Architecture & How to build scalable applications
If you’re interested in Event Sourcing, CQRS, Microservices and distributed architecture with gRPC & Protobuf, then this guide will hopefully help you get started on the right path. Given the buzzwords mentioned and their inherent cool factor among engineers, failing to truly understand such systems is a recipe for disaster.
This guide assumes you know the difference between Monolith & Distributed Architecture (microservices).
** Spread this article like a distributed system to your friends, colleagues, bosses and even your kids **
As a technical investor, I’ve met a lot of startups that create a vast amount of technical debt from the get-go and design their systems for failure. At the same time, as a technical founder, I realize engineers often lack the knowledge and clarity needed to work on such systems effectively or efficiently. The resulting confusion and wrong decisions constantly lead to frustration among team members, to the detriment of the project.
This guide is meant as just a starting point and it is up to you to continue building your own knowledge base by constantly researching, reading, watching and experimenting with as many cases as you can in order to become successful at it.
Before going into the “Laws of Distributed Architecture”, we need to agree on the order of subjects and how you should approach starting a distributed architecture. You should not write a single line of code until you’ve done the following, in this exact order:
1. Event Sourcing
   - Based on domain logic, industry, business model and value props
   - Event storming with commands, topics and documents
   - Objective: To build the flow of events based on what actually happens around the business (NOT ABOUT THE SYSTEM!)
2. Defining actions & triggers
   - Commands that trigger certain actions or result in other events
   - Aggregating events of similar nature into boundaries that can be packaged into the same service
3. Defining messages
   - The event topics and basics of what the payloads need to be
4. Defining services (what they consume and produce)
   - Messages are what services consume and produce
   - Services are the API endpoints available from each microservice
   - Objective: Before any coding, let’s agree on what each service needs to do and what the payloads need to include
5. Go back to #1 and repeat until everybody has bought in! (Including the top management)
   - If everybody agrees, then development can begin with services being assigned to different teams
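To make the outcome of the "defining messages" and "defining services" steps more concrete, here is a minimal sketch of what the team could agree on before writing any service code. Everything in it is hypothetical: the topic names, payload fields, service names and the `undefined_topics` helper are illustrations only. In a real project the same agreement would more likely live in Protobuf definitions; plain Python dataclasses stand in for them here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageSpec:
    """One message agreed on during event storming (hypothetical fields)."""
    topic: str             # event topic on the message bus
    payload_fields: tuple  # minimum fields the payload must carry


@dataclass(frozen=True)
class ServiceSpec:
    """What a service consumes and produces, agreed on before coding."""
    name: str
    consumes: tuple
    produces: tuple


# Hypothetical result of event storming a sign-up flow.
MESSAGES = {
    spec.topic: spec
    for spec in (
        MessageSpec("user.signup.requested", ("email", "plan")),
        MessageSpec("user.account.created", ("user_id", "email", "plan")),
        MessageSpec("billing.invoice.issued", ("user_id", "amount_cents")),
    )
}

SERVICES = (
    ServiceSpec("accounts",
                consumes=("user.signup.requested",),
                produces=("user.account.created",)),
    ServiceSpec("billing",
                consumes=("user.account.created",),
                produces=("billing.invoice.issued",)),
)


def undefined_topics(services, messages):
    """Return topics a service refers to that nobody has defined as a message."""
    used = {topic for s in services for topic in s.consumes + s.produces}
    return sorted(used - set(messages))


print(undefined_topics(SERVICES, MESSAGES))  # []
```

A check like `undefined_topics` is one cheap way to notice, during the "go back to #1" loop, that a service depends on a message nobody has specified yet.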
Laws of Distributed Architecture
A) Single source of truth
Unlike monolithic applications, where a central DB is the source of truth and all functions write to and read from that same DB, in Distributed Architecture the events stored in the message bus are the only source of truth. The “Event Store” is the layer of the message bus where topics and messages are stored in a time-series DB or file.
Recommended systems: Kafka & NATS Streaming
There are 2 ways of subscribing to the message bus:
- Fanout Subscription: every subscriber receives its own copy of each event, once.
- Queue Subscription: only 1 subscriber from the pool of subscribers receives the message. If it doesn’t acknowledge the message, the message bus tries another subscriber until it receives an acknowledgement.
It’s extremely important to understand the difference in subscription logic.
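The difference is easier to see side by side. Below is a minimal in-memory sketch, not the Kafka or NATS Streaming API: `InMemoryBus`, its method names and the worker functions are all made up for illustration. One simplification to note: this toy bus tries queue members in registration order, while a real bus picks a member by its own balancing rules.

```python
from collections import defaultdict


class InMemoryBus:
    """Toy stand-in for a message bus (hypothetical, for illustration only)."""

    def __init__(self):
        self.fanout = defaultdict(list)  # topic -> list of handlers
        self.queues = defaultdict(list)  # (topic, group) -> list of handlers

    def subscribe(self, topic, handler):
        """Fanout: this handler gets its own copy of every event on the topic."""
        self.fanout[topic].append(handler)

    def queue_subscribe(self, topic, group, handler):
        """Queue: this handler shares the topic with the rest of its group."""
        self.queues[(topic, group)].append(handler)

    def publish(self, topic, event):
        for handler in list(self.fanout[topic]):
            handler(event)
        for (queue_topic, _group), members in list(self.queues.items()):
            if queue_topic != topic:
                continue
            # Try group members one by one until one acknowledges (returns True).
            for handler in members:
                if handler(event):
                    break


bus = InMemoryBus()
fanout_log, queue_log = [], []

for name in ("a", "b", "c"):
    bus.subscribe("order.created", lambda event, n=name: fanout_log.append(n))


def failing_worker(event):
    queue_log.append("w1 failed")
    return False  # no acknowledgement, so the bus moves on


def healthy_worker(event):
    queue_log.append("w2 acked")
    return True


def spare_worker(event):
    queue_log.append("w3 acked")
    return True


for worker in (failing_worker, healthy_worker, spare_worker):
    bus.queue_subscribe("order.created", "workers", worker)

bus.publish("order.created", {"order_id": 1})

print(fanout_log)  # ['a', 'b', 'c']
print(queue_log)   # ['w1 failed', 'w2 acked']
```

All three fanout subscribers saw the event, while in the queue group the event stopped at the first worker that acknowledged it, so the spare worker never heard about it.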
B) Each node/service must be decoupled, independent and stateless
Meaning, each microservice should have its own DB and should only interact directly with the message bus, the single source of truth.
If a node needs to be added, upgraded or replaced, it shouldn’t affect any other part of the system. It also shouldn’t matter what tech stack (programming language, DB type, …) each node is using.
Recommendations: gRPC + Protobuf for service definition and message modelling, with any programming language gRPC works with, e.g. the Go programming language.
C) Nodes should NOT connect or talk to each other
Each node should be as lonely and isolated as possible. Nodes should only be able to discover the message bus and NOT each other!
If you feel the need to connect to another node, e.g. for a query, then you’re probably doing it wrong. Only the client or the API interface service should connect to a microservice to perform a task, not other nodes.
Recommendation: Kubernetes
D) Always assume existence of several nodes of same service
In Distributed Architecture, scalability is achieved by horizontally increasing the number of nodes performing the same service behind a load balancer. The question becomes:
How do you make sure the DB of each node is nearly identical?
The answer is the single source of truth! The node that publishes an event to the message bus should NOT immediately record it in its own DB. Instead, the published message becomes the source of truth: all nodes subscribed to that event topic receive it (fanout subscription) and record it in their DB, including the node that originally produced it.
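A minimal sketch of this "publish first, record on receipt" rule, using a toy in-memory bus. The `Bus` and `AccountNode` classes, the event shape and the explicit `deliver_all` step are assumptions made for illustration; the delivery step only exists so the moment between publishing and recording becomes visible.

```python
from collections import deque


class Bus:
    """Toy fanout-only bus that delivers in publish order (an assumption)."""

    def __init__(self):
        self.subscribers = []
        self.pending = deque()

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event):
        self.pending.append(event)  # stored on the bus, not yet delivered

    def deliver_all(self):
        while self.pending:
            event = self.pending.popleft()
            for handler in self.subscribers:
                handler(event)


class AccountNode:
    def __init__(self, bus):
        self.bus = bus
        self.db = {}  # this node's private, disposable DB
        bus.subscribe(self.on_event)

    def create_account(self, user_id, email):
        # Publish only. Deliberately no write to self.db here.
        self.bus.publish(
            {"type": "account.created", "user_id": user_id, "email": email}
        )

    def on_event(self, event):
        # The only place the DB is written: on receipt from the bus.
        self.db[event["user_id"]] = event["email"]


bus = Bus()
nodes = [AccountNode(bus) for _ in range(3)]

nodes[0].create_account("u1", "ann@example.com")
db_before_delivery = dict(nodes[0].db)  # still empty: nothing recorded yet

bus.deliver_all()

print(db_before_delivery)     # {}
print([n.db for n in nodes])  # three identical DBs, publisher included
```

Because the publishing node learns about its own event the same way every other replica does, there is no separate write path that could drift out of sync.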
Recommendation: Kubernetes
E) Events that result in other events should be a part of Queue Subscription
If an event triggers logic inside the microservice that produces another event to be published to the message bus, then that first event should only be received through a Queue Subscription.
Queue Subscription means the message bus will send an event to only 1 of the subscribers. Thus, NONE of the other subscribers know about the existence of that event, and they shouldn’t care either. If the receiving node successfully performs the task, another event will be published, and all the subscribers will receive a copy of it to record. However, if the receiving node fails to perform the task, the message bus will try another queue subscriber node in a quest for acknowledgement.
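The flow above can be sketched with a toy bus: the triggering event goes to a queue group, one node handles it and publishes a result, and that result is recorded by every node through fanout. The `Bus` and `PaymentNode` classes, the topic names and the `healthy` flag are all hypothetical, and the toy bus tries queue members in registration order rather than balancing between them.

```python
class Bus:
    """Toy bus: one queue group and a set of fanout subscribers per topic."""

    def __init__(self):
        self.queue_members = {}   # topic -> handlers returning True on ack
        self.fanout_members = {}  # topic -> handlers
        self.published = []       # every (topic, event) pair, in order

    def queue_subscribe(self, topic, handler):
        self.queue_members.setdefault(topic, []).append(handler)

    def subscribe(self, topic, handler):
        self.fanout_members.setdefault(topic, []).append(handler)

    def publish(self, topic, event):
        self.published.append((topic, event))
        for handler in self.fanout_members.get(topic, []):
            handler(event)
        for handler in self.queue_members.get(topic, []):
            if handler(event):  # stop at the first acknowledgement
                break


class PaymentNode:
    def __init__(self, name, bus, healthy=True):
        self.name, self.bus, self.healthy = name, bus, healthy
        self.db = []
        # The triggering event arrives through the queue subscription...
        bus.queue_subscribe("payment.requested", self.handle_request)
        # ...and the resulting event is recorded through fanout.
        bus.subscribe("payment.completed", self.record)

    def handle_request(self, event):
        if not self.healthy:
            return False  # no ack, so the bus tries another node
        self.bus.publish(
            "payment.completed", {"order_id": event["order_id"], "by": self.name}
        )
        return True

    def record(self, event):
        self.db.append(event)


bus = Bus()
nodes = [
    PaymentNode("n1", bus, healthy=False),
    PaymentNode("n2", bus),
    PaymentNode("n3", bus),
]
bus.publish("payment.requested", {"order_id": 42})

completed = [e for topic, e in bus.published if topic == "payment.completed"]
print(completed)  # [{'order_id': 42, 'by': 'n2'}]
```

Exactly one `payment.completed` event exists even though three nodes run the same service, and all three recorded it.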
Which leads us to next law:
F) Fanout Subscriptions should NOT produce other events
An event received from the message bus through a fanout subscription should only be written into the DB, if desired, and should NOT produce another event.
Imagine you have 100 nodes of the user account service and they’ve all received the same event. If wrongly architected, they will each publish the exact same event back to the message bus. 100x the same event!
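The arithmetic of that mistake, as a tiny simulation. The function name, the `user.updated` topic and the `produce_on_fanout` switch are made up for illustration; the function simply counts how many follow-up events one fanout delivery causes.

```python
def follow_up_events(node_count, produce_on_fanout):
    """Deliver one event to `node_count` nodes via fanout; count what they publish."""
    databases = [[] for _ in range(node_count)]
    published = 0
    for db in databases:           # fanout: every node gets its own copy
        db.append("user.updated")  # recording the event is fine
        if produce_on_fanout:      # the mistake described above
            published += 1         # each node publishes the same follow-up event
    return published


duplicates = follow_up_events(100, produce_on_fanout=True)
correct = follow_up_events(100, produce_on_fanout=False)
print(duplicates, correct)  # 100 0
```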
Recommendation: NATS Streaming, as already recommended above.
G) The DB of each node should be worthless
Storage is cheap these days so replicas do not matter. Only the data in the single source of truth, Event Store, matters. Nodes and their DB are dispensable. It should not matter if you’d like to destroy X nodes in a pool of Y identical nodes or all Y nodes to start from scratch.
You should never be afraid to destroy nodes!
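One way to convince yourself of this is to treat a node's DB as nothing more than the result of folding over the event log. The sketch below assumes a hypothetical list of account events and a made-up `apply_event` function; the point is only that throwing the DB away and rebuilding it gives the same state.

```python
EVENT_STORE = [  # hypothetical events, oldest first
    {"type": "account.created", "user_id": "u1", "email": "ann@example.com"},
    {"type": "account.created", "user_id": "u2", "email": "bob@example.com"},
    {"type": "email.changed", "user_id": "u1", "email": "ann@new.example.com"},
    {"type": "account.deleted", "user_id": "u2"},
]


def apply_event(db, event):
    """Apply one event to a node's DB (here: a dict of user_id -> email)."""
    if event["type"] in ("account.created", "email.changed"):
        db[event["user_id"]] = event["email"]
    elif event["type"] == "account.deleted":
        db.pop(event["user_id"], None)


def rebuild(events):
    """Build a fresh DB from nothing but the event log."""
    db = {}
    for event in events:
        apply_event(db, event)
    return db


node_db = rebuild(EVENT_STORE)
state_before = dict(node_db)

node_db.clear()                 # "destroy" the node
fresh_db = rebuild(EVENT_STORE)

print(fresh_db)                 # {'u1': 'ann@new.example.com'}
print(fresh_db == state_before) # True
```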
H) Newly added nodes need time to catch up
When a new node is added to a Distributed Architecture, it connects to the message bus and requests all events from the start of time, or from some specified time or sequence position. The catch-up process, AKA warm-up, is the period the microservice spends populating its DB in order to arrive at nearly the same state as other nodes of the same kind.
Warm-ups can take a long time, depending on the number and size of events, so keep that in mind if you need rapid scaling during peak hours.
I) A newly added node in warm-up mode should not be a part of any subscription pool, yet!
When a newly added node is consuming all previous events in history to populate its DB, it should not be subscribing to any fanout or queue channels until it has fully caught up!
Once ready, it should first unsubscribe from the topics it used for warm-up, and then announce its readiness to handle new events by subscribing again.
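A rough sketch of the warm-up hand-over, under a few stated assumptions: the event store is a toy append-only list with 1-based sequence numbers, replay is a plain read rather than a subscription, and `EventStore`, `Node` and `warm_up` are hypothetical names. The sequence-number guard is there so an event seen during replay is never applied twice after the node goes live; a real system would also have to deal with events arriving between the end of replay and the start of the live subscription.

```python
class EventStore:
    """Toy append-only log with 1-based sequence numbers (an assumption)."""

    def __init__(self):
        self.log = []
        self.live_subscribers = []

    def append(self, event):
        seq = len(self.log) + 1
        self.log.append((seq, event))
        for handler in self.live_subscribers:
            handler(seq, event)

    def replay(self, after_seq=0):
        """Return a snapshot of stored events with seq > after_seq."""
        return [(seq, e) for seq, e in self.log if seq > after_seq]


class Node:
    def __init__(self, store):
        self.store = store
        self.db = {}
        self.last_seq = 0
        self.ready = False

    def apply(self, seq, event):
        if seq <= self.last_seq:
            return  # already applied during warm-up
        self.db[event["key"]] = event["value"]
        self.last_seq = seq

    def warm_up(self):
        # Phase 1: replay history while NOT part of any live subscription.
        for seq, event in self.store.replay():
            self.apply(seq, event)
        # Phase 2: only now announce readiness by joining live traffic.
        self.store.live_subscribers.append(self.apply)
        self.ready = True


store = EventStore()
old_node = Node(store)
old_node.warm_up()  # empty history, so it joins immediately
store.append({"key": "u1", "value": "ann"})
store.append({"key": "u2", "value": "bob"})

new_node = Node(store)
live_before_warm_up = len(store.live_subscribers)  # 1: only old_node
new_node.warm_up()
store.append({"key": "u1", "value": "ann-renamed"})

print(new_node.db == old_node.db)  # True
print(new_node.last_seq)           # 3
```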
J) Store Everything and then some!
Microservices only store what they need to get their jobs done, and everything they store is dispensable. However, far more data is being passed around and generated that could become invaluable in the near future. Thus, the storage type of the message bus, the Event Store, is extremely important. Back up, back up, and back up some more!
The whole system could go down or be replaced, but the data in the Event Store is pure gold. In most cases the data may seem like regular dust until another service pops up that turns it into gold. You never know when!
Recommendation: NATS Streaming can persist to file, SQL or memory, while Kafka persists to its own log files on disk. I strongly recommend a SQL backup with multiple file backups.
K) Multiple message buses
We talked about “single source of truth” but that doesn’t mean we can only have a single message bus to handle it all. It is great practice to separate message buses based on many principles.
The core of the application and all major services can revolve around a single message bus, but analytics, logs, and other mass-generated events that are not as high-priority as core events can be assigned to another message bus with a different event storage system. This way noise becomes manageable and backups are easier to handle.
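One simple way to express such a split is a routing table from topic prefix to bus. The sketch below is only an illustration: the `Bus` class, the two bus instances, the `analytics.` and `logs.` prefixes and the `bus_for` helper are all hypothetical, and a real setup would hold client connections to two separate clusters instead of in-memory lists.

```python
class Bus:
    """Toy bus that just remembers what was published to it."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def publish(self, topic, event):
        self.events.append((topic, event))


CORE = Bus("core")            # core business events, backed up aggressively
TELEMETRY = Bus("telemetry")  # high-volume, lower-priority events

# Topic prefix -> bus. Anything not listed falls through to the core bus.
ROUTES = (
    ("analytics.", TELEMETRY),
    ("logs.", TELEMETRY),
)


def bus_for(topic):
    """Pick the bus responsible for a topic based on its prefix."""
    for prefix, bus in ROUTES:
        if topic.startswith(prefix):
            return bus
    return CORE


def publish(topic, event):
    bus_for(topic).publish(topic, event)


publish("order.created", {"order_id": 1})
publish("analytics.page_view", {"path": "/pricing"})
publish("logs.request", {"status": 200})

print(len(CORE.events), len(TELEMETRY.events))  # 1 2
```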
This guide is what I use when offering consulting and technical support to startups and enterprises. Unlike most consultants, I tend to stay involved with the head team (CTO, CEO and product owners) throughout the project until it has gained momentum with all the team members and engineers.
This article will be under active edits and additions. If you have suggestions please let me know in the comments so I can address them.
I’m always looking to hire engineers for the above type of operation in my own company as well as the ones I advise or have invested in. If you’re interested message me directly.