Microservice Reliability Engineering in Practice: Top 25 Easy Questions to Ask Your DevOps Teams

Jonathan Tronson
cloud native: the gathering
4 min read · Jun 28, 2019

All large cloud systems fail at some point in their lifetime, either at the sub-component level or catastrophically across the board. If you don't believe that, or don't take it seriously, the rest of this article won't really be your thing.

Running complex distributed systems usually means depending on a multitude of things: services, gateways, load balancers, hypervisors, security rules, networks, container schedulers, microservices of many flavors, virtual hardware, physical hardware, data centers, DNS systems, datastores, in-memory caches, object storage, block storage, autoscalers, and so on; the list could go on for a while. Most of these systems will have hiccups, if not outright failures, and it's up to the system's designers and authors to figure out how to continue servicing business needs while teams are putting out fires.
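
To make "continue servicing business needs" concrete: one common tactic is to call a dependency with a short timeout and fall back to the last known-good response when the call fails. Below is a minimal Go sketch of that pattern; the pricing endpoint, names, and timeout are hypothetical placeholders, and questions 8 through 10 in the list below return to this idea.

```go
package fallback

import (
	"context"
	"errors"
	"io"
	"net/http"
	"sync"
	"time"
)

var (
	mu       sync.RWMutex
	lastGood []byte // most recent successful payload
)

// fetchPrices calls a (hypothetical) downstream pricing service with a
// short timeout. On failure it serves the last known-good payload
// instead of surfacing an error to upstream clients.
func fetchPrices(ctx context.Context) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, 300*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://pricing.internal/v1/prices", nil) // hypothetical endpoint
	if err != nil {
		return nil, err
	}

	resp, err := http.DefaultClient.Do(req)
	if err == nil {
		defer resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			if body, readErr := io.ReadAll(resp.Body); readErr == nil {
				mu.Lock()
				lastGood = body // refresh the cache on every success
				mu.Unlock()
				return body, nil
			}
		}
	}

	// Degraded path: answer with the cached copy if we have one.
	mu.RLock()
	defer mu.RUnlock()
	if lastGood != nil {
		return lastGood, nil
	}
	return nil, errors.New("pricing unavailable and no cached response")
}
```

The trade-off is staleness: the cached payload may be minutes old, which is exactly what question 10 below asks your upstream clients to sign off on.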

Reliability Engineering in the Real World: Questions to Ask Your DevOps Teams

Large distributed systems with an expansive microservice catalog usually end up with large DevOps teams. DevOps in this case means development teams that own their services for their whole lifecycle and interact with traditional Operations teams.

Rallying these large teams around reliability can be a daunting task. What concrete steps can you take to actually improve microservice reliability, rather than simply talking about it when issues arise and your blood pressure is skyrocketing? Here are the top 25 easy questions to ask your teams (and yes, I'm sure we're missing some other really good questions, but this is a great start to get the juices flowing):

  1. How do your teams track and judge reliability/uptime/outages for your sub-systems, microservices and overall product?
  2. If your teams take action on reliability engineering, how do you track your progress so that this work can be socialized to leadership and the greater project?
  3. Do you have appropriate dashboards that monitor KPIs and SLOs both in real time and historically for trends and analysis?
  4. How much telemetry and log data can your monitoring systems store in order to reliably track trending issues?
  5. How quickly are the right teams notified when there is an issue anywhere in the system?
  6. Can responding DevOps teams access these dashboards and interpret them correctly, quickly and easily?
  7. Can these dashboards isolate offending microservices so that the noise of functioning services doesn't cloud the picture?
  8. Will your component(s) continue to service requests even when one or more of their downstream components have failed at some level?
  9. Can your service provide a response of some sort, rather than an error, using either a static or cached response (as in the fallback sketch above)?
  10. Will your upstream clients be OK with the degraded response you provide?
  11. Can you quickly identify which downstream systems have failed?
  12. Do you know what normal KPIs look like for your component(s) so that when a failure occurs, you can identify the abnormality quickly and easily?
  13. How do you share KPIs and SLOs across teams and monitoring systems?
  14. How are KPIs published and maintained?
  15. How are microservice owner contacts published and kept up-to-date over time?
  16. If your component fails, are you able to articulate what will happen to upstream clients?
  17. Can you identify normal traffic versus abnormal traffic patterns to your service?
  18. Do you receive appropriate alerts (to the right escalation team) when an abnormality occurs — and is the alert appropriately classified?
  19. Can First Responders get to your logs quickly and easily without complex queries or lengthy sifting of data?
  20. Are your components protected against sudden spikes in traffic that might cause your system to act in an abnormal way? (See the rate-limiting sketch after this list.)
  21. Is your system protected against a single microservice consuming too much CPU, memory, or network bandwidth?
  22. Can you perform maintenance on your system while handling peak traffic?
  23. How do you deprecate services that are no longer in use so that they do not clog system resources?
  24. Do your teams run Game Day failure scenarios often to validate that reliability features, such as circuit breakers and alerts, actually work? (A minimal circuit-breaker sketch follows this list.)
  25. Do you have a published, easy-to-read library of runbooks, covering each microservice, that details the following?
    - Description of service for the lay person
    - Service Contacts/Owners
    - Normal KPIs at peak/off-peak (what is a normal error %, TPS, latency, # of pods?)
    - Client Retry Expectations by Response Code
    - Customer Experience Impact
    - List of Alerts and Definitions
    - Troubleshooting Steps following an Alert (Runbooks)
    - Dashboard Links
    - Logging Links
    - How to Smoke Test / Synthetic Test your service
    - Game Day History
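
Question 24 mentions circuit breakers. For teams that have not yet adopted a library (such as sony/gobreaker for Go or resilience4j for Java), here is a deliberately tiny sketch of the idea; the thresholds, the half-open probe, and the concurrency handling are all simplified for illustration:

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is refusing calls.
var ErrOpen = errors.New("circuit breaker open")

// Breaker is a deliberately tiny circuit breaker: after maxFailures
// consecutive failures it rejects calls for the cooldown period, then
// lets a trial call through to probe the dependency ("half-open").
type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
}

func New(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn through the breaker.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen // fail fast instead of hammering a sick dependency
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // (re)open and restart the cooldown
		}
		return err
	}
	b.failures = 0 // a success closes the breaker
	return nil
}
```

A Game Day is exactly the place to prove that a breaker like this opens under real failure conditions, and that the corresponding alert fires when it does.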

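Question 20 asks about protection from traffic spikes. One common guard is a token-bucket limiter in front of your handlers; this sketch uses Go's golang.org/x/time/rate package, and the limits are placeholder values that should really come from load testing:

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

// shedLoad wraps a handler with a token-bucket limiter so that a sudden
// traffic spike gets a fast 429 instead of melting the service behind it.
// The numbers are placeholders; real limits should come from load tests.
func shedLoad(next http.Handler) http.Handler {
	limiter := rate.NewLimiter(rate.Limit(100), 200) // ~100 req/s, burst of 200
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", shedLoad(mux)))
}
```
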
There are many more questions worth asking of any team involved in the development, operations, customer-incident escalation, or architecture of microservice-based systems. But this list is a good start for opening the conversation with teams: building a focus on Resiliency Engineering, and hardening the system's ability to withstand the failures that are guaranteed to occur at the worst time of the day or week, and at the most important point in your application's lifetime.
