The Sith Lord school of system administration on RCAs

Anthony Hobbs
2 min read · Jul 28, 2018

I teach the Sith Lord methods of system administration. If a server or service fails me, I allow it one chance to redeem itself (enact automated repairs). But if it still fails, I terminate the server and replace it with a new one. I also practice chaos engineering and terminate servers on a whim.

You have failed me for the last time, computer!
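The one-chance rule can be sketched in a few lines of Python. This is only an illustration, not my production tooling: the `repair` and `replace` hooks are hypothetical stand-ins for whatever your platform provides, and the sketch assumes a strike never expires until the server is replaced.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Remediator:
    """One chance to redeem itself, then termination (illustrative sketch)."""
    repair: Callable[[str], None]   # hypothetical hook: run automated repairs
    replace: Callable[[str], str]   # hypothetical hook: terminate, return new server id
    strikes: Dict[str, int] = field(default_factory=dict)
    log: List[Tuple[str, str]] = field(default_factory=list)

    def handle_failure(self, server_id: str) -> str:
        """Return the id of the server that should be serving afterwards."""
        if self.strikes.get(server_id, 0) == 0:
            # First failure: allow one automated repair attempt.
            self.strikes[server_id] = 1
            self.repair(server_id)
            self.log.append(("repair", server_id))
            return server_id
        # It failed again: terminate it and bring up a replacement.
        self.strikes.pop(server_id, None)
        new_id = self.replace(server_id)
        self.log.append(("replace", server_id))
        return new_id
```

The `log` list is there so that every remediation leaves a record that can be counted later.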

This level of fault tolerance has serious architectural ramifications: every service must be able to lose a server or two without visible impact to customers. Have you heard of DevOps, that partnership between operations and development? This is one of those cases where operations imposes an architecture requirement in order to maximize automation. Being able to lose two servers and still serve full load is commonly referred to as having N+2 capacity (see the Google SRE Book for a more detailed explanation).
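To make the N+2 arithmetic concrete, here is a minimal sketch. It assumes identical servers and a known peak load; the function names and the numbers in the usage note are made up for illustration.

```python
import math


def servers_needed(peak_load: int, per_server_capacity: int, spares: int = 2) -> int:
    """N servers to carry peak load, plus `spares` that can be lost."""
    if peak_load < 0 or per_server_capacity <= 0 or spares < 0:
        raise ValueError("load and spares must be >= 0, capacity must be > 0")
    n = math.ceil(peak_load / per_server_capacity)
    return n + spares


def survives_losses(fleet_size: int, peak_load: int,
                    per_server_capacity: int, losses: int = 2) -> bool:
    """True if the fleet still covers peak load after losing `losses` servers."""
    return (fleet_size - losses) * per_server_capacity >= peak_load
```

For example, with a peak of 1000 requests per second and servers that each handle 300, N is 4, so `servers_needed(1000, 300)` asks for 6. A fleet of 5 would not pass `survives_losses`, because the 3 remaining servers only cover 900.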

What about performing root cause analysis? There are plenty of idealists who say you must perform an RCA on every single failure. However, organizational best practices would have you limit the time spent on operational issues to leave room for development. How can those two opinions be reconciled? It's quite simple: they can't.

Instead, I have alerts that let me know if a service is in a restart loop, and I run a report at the end of each week that tells me how many times the automation performed remediation and how often it failed. That lets me identify chronic issues and spend my limited operational time on those, instead of trying to find every needle in every haystack.
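Both checks are simple counting over a remediation log. The sketch below assumes each remediation is recorded as a timestamp, a service name, and whether it succeeded. The record layout, the 10-minute window, and the threshold of 3 are arbitrary choices for illustration, not values from my setup.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple, Tuple


class Remediation(NamedTuple):
    when: datetime
    service: str
    succeeded: bool


def in_restart_loop(events: Iterable[Remediation], service: str, now: datetime,
                    window: timedelta = timedelta(minutes=10),
                    threshold: int = 3) -> bool:
    """Alert condition: too many remediations of one service in a short window."""
    recent = [e for e in events
              if e.service == service and now - window <= e.when <= now]
    return len(recent) >= threshold


def weekly_report(events: Iterable[Remediation],
                  now: datetime) -> List[Tuple[str, int, int]]:
    """Rows of (service, attempts, failures) for the last 7 days, worst first."""
    week_ago = now - timedelta(days=7)
    attempts: Counter = Counter()
    failures: Counter = Counter()
    for e in events:
        if week_ago <= e.when <= now:
            attempts[e.service] += 1
            if not e.succeeded:
                failures[e.service] += 1
    rows = [(s, attempts[s], failures[s]) for s in attempts]
    # Most-remediated services first; ties broken by name for stable output.
    return sorted(rows, key=lambda r: (-r[1], r[0]))
```

The services at the top of the report are the chronic offenders, and those are the ones worth an RCA.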

As a result, server failures rarely interrupt my team's workflows, and we have great visibility into chronic issues, so we can focus our investigative time on those.

I hope you liked the article. If you did, please like it. You have the chance to make my day 😍
