Dr. Servicelove — or how I learned to stop worrying and love the Distribulith

These days it has become fashionable to bash anything that isn’t a microservice. You’re building a single large application? What an antiquated, terrible idea! That’ll never work! You need microservices, and lots of them, and you need a pile of third-party tools and complicated infrastructure to support them. Without microservices, you’re doing it wrong. As with most debates, I think there is more nuance to the issue.

Today a friend and I were having a conversation about two of the main ways applications deal with fault tolerance: internally or externally.

Internal fault tolerance is what I think of when I see applications written in Elixir using OTP. The virtual machine and the language design itself are built around the actor pattern and the “let it crash” philosophy. Individual processes within the VM are able to recover from failure using supervisor strategies and many other techniques. Multiple BEAMs can talk to each other and form a distributed grid, allowing for further resiliency and scale. Every aspect of an application written this way is explicitly aware of how it will handle failure and when. The process recovery strategy is built directly into the code of each individual process.
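To make that concrete, here is a minimal sketch of internal fault tolerance in Elixir/OTP. The module names (`PriceCache` and its supervisor) are hypothetical, invented for illustration; the point is that the recovery strategy lives in the code itself:

```elixir
defmodule PriceCache do
  use GenServer

  def start_link(_opts),
    do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  def init(state), do: {:ok, state}

  # If this process crashes on a malformed message, it does not try to
  # defend itself — recovery is the supervisor's responsibility.
  def handle_call({:get, key}, _from, state),
    do: {:reply, Map.get(state, key), state}
end

defmodule PriceCache.Supervisor do
  use Supervisor

  def start_link(_opts), do: Supervisor.start_link(__MODULE__, :ok)

  def init(:ok) do
    # :one_for_one — when a child dies, restart only that child.
    # Other strategies (:one_for_all, :rest_for_one) express different
    # recovery policies, all declared here in application code.
    Supervisor.init([PriceCache], strategy: :one_for_one)
  end
end
```

The supervision strategy is an explicit, versioned part of the program, which is exactly what distinguishes this style from platform-managed recovery.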

External fault tolerance is the notion that application code is deliberately unaware of how failure will be handled. People write individual processes in the so-called “cloud native” fashion (e.g. “12-factor”) and rely solely upon underlying infrastructure to provide health checks, process monitoring, horizontal process scaling, and recovery/restart. The canonical example I think of when dealing with external fault tolerance is an application built with Docker containers scheduled and managed by some infrastructure like Kubernetes, Amazon ECS, Azure, etc.
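A sketch of what that looks like in practice, assuming a hypothetical service that exposes HTTP health endpoints: the application knows nothing about restarts or scaling; Kubernetes owns both. The names, image, and ports below are invented for illustration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api            # hypothetical service name
spec:
  replicas: 3                 # horizontal scaling decided by the platform
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          image: example/orders-api:1.0   # hypothetical image
          livenessProbe:                  # platform restarts the container
            httpGet:                      # when this check fails
              path: /healthz
              port: 8080
            periodSeconds: 10
          readinessProbe:                 # platform routes traffic away
            httpGet:                      # from unready instances
              path: /ready
              port: 8080
```

The recovery policy lives entirely in this manifest, outside the codebase, which is the defining trait of the external model.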

Lately it seems like a lot of people are so hung up on the idea that to be cloud native, you have to be building standalone, RESTful HTTP microservices that they assume anything other than this is a bad practice. Certain types of microservice zealots would take one look at the Elixir/OTP internal fault tolerance pattern and dismiss it as old school and therefore bad and wrong.

I think people are missing the spirit of microservices, which is to create resilient, scalable implementations of the single responsibility principle that can be deployed, scaled, and managed independently. This flexibility decreases time to market and allows multiple teams to work on different features on their own schedules without creating the dreaded “stop the world” synchronized release event that used to be the norm in classic enterprise monolithic development.

The internal fault tolerance and scaling model I discussed earlier is something that I like to call the Distribulith, or distributed monolith. Despite sounding old school, there’s nothing inherently wrong with this — provided it’s done properly.

Using languages and frameworks like Elixir and OTP, you can compose a single entity that is distributed and inherently scalable. It is composed of individual processes that adhere to the single responsibility principle, can be deployed independently, and have their own supervisory and recovery strategies. The boundary of an Elixir/OTP application can be thought of as roughly analogous to the edges of a microservice ecosystem in modern “cloud native” jargon.

In a Distribulith, the code itself dictates the resilience and scaling strategy. In traditional microservices, the code defers these requirements to an underlying scheduler or platform (Mesos, Marathon, Kubernetes, ECS, GCP, and so on…).

Neither of these approaches is inherently good or bad — they’re simply different ways to solve problems. The Distribulith is particularly well suited to solving problems like real-time stock trading systems or ultra-low-latency backend processing systems for games and other environments.

“Traditional” microservices have their own niches. These are especially well suited to solving problems where you have a number of disparate data sources and areas of request/response functionality that need to support a rapidly growing or changing set of consumers. I have heard these ecosystems referred to as “wrap and conquer” or “isolate and collaborate.”

The reason I felt compelled to write this post was to hopefully help people see that monoliths aren’t necessarily a bad thing. Sure, there are countless examples of why monoliths are bad, but in our fervor to become cloud native and adopt microservices, I fear that many of us have become RESTful endpoint snobs and we’ve embraced the letter of the microservices law and forgotten the spirit of the movement.

In some cases, building a Distribulith may end up being far less complex than building the infrastructure required to support a massive microservices ecosystem. Sometimes we end up going down a rabbit hole of having to add third-party tool after third-party tool to support all of the various non-functional requirements of our system when we could have done it all inside a distributed monolith.

Obviously your mileage may vary, and there is no way in a single post to tell you when to use a Distribulith versus when to create an ecosystem of microservices (or when to use a hybrid of the two). What I hope you get from this post is a desire to think critically about your architecture and evaluate all possibilities, and not just assume that a sprawl of bootstrapped HTTP servers is the only way to build a cloud native application.

So, in closing, go forth and think critically. Evaluate all of your options, and make evidence-based decisions that are right for your problem domain and not just based on how trending the hashtag might be.