Checklist: Building a “resilient” micro service

3 min read · May 4, 2018

As we go through a paradigm shift in how we design and architect our applications, this is my attempt to compile a list of essentials for building a resilient micro service for our systems.

Where our applications used to be one big monolithic chunk of code, they are now split into smaller, individually deployable parts. The resulting systems are more robust and independent, can be made more resilient, and can absorb constant change at a much higher rate. As a result, overall development cycles that used to take months or years have come down to days or weeks.

For a brief period, our design discussions revolved around questions like:

Does the service support timeouts?
Can the service retry after unexpected failures?
Does the system have a properly defined failure fallback?
Does the application keep running in the unlikely event that one of its parts goes down, and can it prevent the failure from cascading through the entire system?
Can it fail fast and avoid the additional overhead that indefinite timeouts or retries would put on the system?
And, can the system operate asynchronously?

Well, when I said “for a brief period” (of our design discussions), I meant the period in which we answered these questions individually rather than simply asking,

“Is your service ‘resilient’?”

Before moving ahead and trying to list some important features, let’s try to understand the meaning of the word “resilient” itself, which is

“the power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity.”

The definition is generic, applying in a broad ecological, social and technological sense, but it conveys exactly how we want our micro services to be designed and to behave.

The most important features we look for in such a micro service are well-defined mechanisms for:

Timeouts: they have a mechanism to define timeouts.
Retry: they have a defined retry mechanism.
Fallback: they have proper fallback execution built in.
Bulkhead: when one part of the system fails, the other parts of the application remain perfectly capable of functioning, individually and/or in conjunction.
Circuit breaker: they are capable of failing fast and avoiding additional overhead on the entire system.
Asynchronous: they are asynchronous, or at least are written to be essentially asynchronous underneath even if they appear synchronous to the application using them.
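To make the checklist a little more concrete, here is a minimal sketch in Python (standard library only) that combines a timeout, bounded retries with backoff, a fallback and a very simple circuit breaker around an asynchronous call. The names `CircuitBreaker` and `call_resiliently`, and every default value, are assumptions made up for this illustration; none of them comes from a particular library, and a production service would more likely rely on one.

```python
import asyncio
import time


class CircuitBreaker:
    """Hypothetical helper: opens after `max_failures` consecutive errors
    and rejects calls for `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Once the cool-down has passed, let trial calls through again.
        return time.monotonic() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()


async def call_resiliently(operation, fallback, breaker,
                           timeout=1.0, retries=2, backoff=0.1):
    """Await `operation()` with a timeout, bounded retries and a fallback."""
    for attempt in range(retries + 1):
        if not breaker.allow():
            break  # circuit is open: fail fast instead of calling again
        try:
            result = await asyncio.wait_for(operation(), timeout)
            breaker.record_success()
            return result
        except (asyncio.TimeoutError, ConnectionError):
            breaker.record_failure()
            if attempt < retries:
                # Exponential backoff before the next attempt.
                await asyncio.sleep(backoff * 2 ** attempt)
    # Retries exhausted or circuit open: degrade gracefully.
    return fallback()
```

A caller would pass a coroutine function for the remote call and a plain function that returns a cached or default value, for example `await call_resiliently(fetch_price, lambda: last_known_price, breaker)`, where `fetch_price` and `last_known_price` are again placeholders. Treating only timeouts and `ConnectionError` as retryable is a choice made for this sketch; which errors deserve a retry depends on the service.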

Before we close, it is worth noting how sharply this differs from the past, when a single non-functional piece of code in one of our large monolithic applications could make the entire system bear the impact. In contrast, designs now aim to keep our systems available at all times, even while some parts of the application, significant or insignificant, are unavailable.
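One way to picture this isolation is the bulkhead item from the checklist: each dependency gets a fixed number of concurrent slots, so a slow or failing part cannot tie up the rest of the application. The sketch below uses Python's `asyncio.Semaphore`; the `Bulkhead` class and its parameters are hypothetical names chosen for illustration, not an API from any library.

```python
import asyncio


class Bulkhead:
    """Hypothetical helper: caps concurrent calls to one dependency and
    rejects the excess immediately instead of queueing it."""

    def __init__(self, max_concurrent):
        self._slots = asyncio.Semaphore(max_concurrent)

    async def run(self, operation, fallback):
        if self._slots.locked():
            # Compartment is full: degrade rather than pile up waiting callers.
            return fallback()
        async with self._slots:
            return await operation()
```

With `Bulkhead(max_concurrent=2)`, two calls to a slow dependency proceed while any further concurrent callers receive the fallback straight away, so the rest of the application keeps responding. Rejecting instead of queueing is a design choice in this sketch; a bounded wait queue is an equally reasonable variant.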

This may be just a beginner’s blog post in the advanced world of micro service implementation, but it is an attempt to prepare a simple yet important checklist for building micro services for our systems.

Now, for those who are looking for more, there are a lot of third-party libraries (e.g. Hystrix, Failsafe, Phystrix, etc.) already available that help you achieve these quickly and efficiently.


Written by Pulkit Swarup