Cloud Native is lean because it enables self-sufficient, full-stack teams to experiment and deliver fast, often performing many deployments a day. The team’s product is continuously evolving towards the optimal solution. At this pace the risk of making a mistake increases significantly. We are all human and it is impossible to completely eliminate honest human error, especially when we are moving fast. But we are moving fast on purpose and we don’t want to be afraid of making mistakes, because then we will slow down and the evolution of the product will suffer. Instead, we need to minimize the impact of these mistakes and recover quickly.
To mitigate the risk of moving fast, cloud-native systems are composed of autonomous services. These autonomous services are architected with bulkheads to limit the blast radius when something does go wrong. These bulkheads protect your service from other services when they have issues. The bulkheads also protect other services when your service is in trouble. Thus the objective is for autonomous services to continue working independently even when other services are broken. As the saying goes, “no harm, no foul”. Your team can continue to work at pace with the confidence that you can recover quickly if something goes wrong, without significantly impacting other services. But how does this work?
Autonomous services are implemented using an event-first architecture and mindset. Let’s start by contrasting this with its opposite, an API-first architecture. Traditional microservices communicate using synchronous calls, which create tight coupling and complex call chains. As the system grows more and more complicated, the natural result is what is referred to as a microservices death star, where virtually every service is interdependent and the failure of any one service can significantly impact the entire system. The typical fixes, such as circuit breakers and service meshes, just make the system more complicated and expensive.
The event-first alternative is straightforward and more tolerant of failures, because autonomous services have no direct dependencies on other services. All inter-service communication is performed asynchronously via event streams. Each autonomous service stores its state in its own fully managed, highly available, cloud-native databases and produces events as its state changes. All synchronous communication is restricted to intra-service calls to this internal data. Any data that is needed from other autonomous services is consumed via events and cached in these internal data stores.
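The shape of such a service can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the names (`CustomerCache`, `handle_event`, the event shape) are hypothetical, and the in-memory dictionary stands in for a fully managed, highly available cloud-native database.

```python
# Sketch of an autonomous service that consumes upstream events
# asynchronously and caches the data it needs in its own internal store.
# All names and the event shape are hypothetical.

class CustomerCache:
    """Internal data store holding this service's materialized view."""

    def __init__(self):
        self._rows = {}

    def put(self, customer_id, data):
        self._rows[customer_id] = data

    def get(self, customer_id):
        # Synchronous reads are restricted to this internal data.
        return self._rows.get(customer_id)


def handle_event(event, cache):
    """Consume an upstream event and update the internal cache."""
    if event["type"] == "customer-updated":
        cache.put(event["customer_id"], event["data"])


cache = CustomerCache()
handle_event({"type": "customer-updated",
              "customer_id": "c1",
              "data": {"name": "Ada"}}, cache)

# The service answers its own queries from the cache, with no
# synchronous call to the upstream service.
print(cache.get("c1"))  # {'name': 'Ada'}
```

The key point is that the only synchronous call is `cache.get`, against data the service already owns.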
With an event-first mindset we focus on the events we will consume and produce. This focus naturally results in the creation of bulkheads that provide for the stability of the system. Upstream services can produce events without concern for the availability of the downstream services. Downstream services continue to use their cached data even if an upstream service is unavailable. The system naturally becomes consistent again once a broken service recovers.
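This eventual consistency can be sketched with a toy producer and consumer. Everything here is a hypothetical illustration, assuming a durable stream and consumer checkpointing: the list stands in for a managed event stream, and `produce`/`consume` are made-up names.

```python
# Sketch of the bulkhead behavior: the upstream service keeps producing
# events while the downstream consumer is broken, and the system converges
# once the consumer recovers and replays the stream from its checkpoint.

stream = []  # stands in for a durable, managed event stream

def produce(stream, sku, price):
    # Upstream publishes without concern for downstream availability.
    stream.append({"sku": sku, "price": price})

cache = {}       # the downstream service's internal cached view
checkpoint = 0   # position of the last event this consumer processed

def consume(stream, cache, checkpoint):
    # On recovery, resume from the checkpoint and catch up.
    for event in stream[checkpoint:]:
        cache[event["sku"]] = event["price"]
    return len(stream)

produce(stream, "widget", 9.99)
checkpoint = consume(stream, cache, checkpoint)

# The consumer goes down, but the upstream service keeps working.
produce(stream, "widget", 10.99)
produce(stream, "gadget", 4.50)

# Once the consumer recovers it replays the missed events and the
# system becomes consistent again.
checkpoint = consume(stream, cache, checkpoint)
print(cache)  # {'widget': 10.99, 'gadget': 4.5}
```

Note that neither side ever blocks on the other; the stream is the only point of contact between them.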
If you are thinking this sounds like Event Sourcing and CQRS, then you are absolutely right, but with some twists. I will discuss some of these twists in future posts.
I would like to point out that some of the terminology has evolved since my books were published. In the Event Sourcing pattern I discussed two variants that I called event-first and database-first. Since then, event-first has taken on a new meaning, as an architecture and a mindset, which I discuss in this post. Going forward, I have renamed the Event Sourcing pattern variants stream-first and database-first. I also find these terms more pleasing because they are congruent.