Is Kubernetes Enough for Service Reliability?

Francis Lee
AI+ Enterprise Engineering
Nov 3, 2020 · 6 min read

TL;DR: In consulting engagements with my enterprise clients, a common challenge keeps cropping up: building reliable services, whether from monolithic, n-tiered, containerized, or microservice designs. While many best practices have been published on designing fault-tolerant, reliable infrastructure, there are few articles on software architectures that promote such behaviour. Infrastructure and software architecture need to work in tandem to create the optimum fault-tolerant solution, within the business budget for the service; there is a reason businesses set RTO/RPO metrics. Also, contrary to the instinct for multi-site designs, a single Tier-3 data-centre has good enough reliability for most enterprise use-cases and still meets the necessary reliability metrics. The culprit behind most service failures is software bugs, not infrastructure.

In the following fictitious dialogue between two colleagues, process supervision trees are introduced as one approach to building software architectures that promote fault tolerance and reliability.

J: “Hey Francis, I remember you once said that application/service up-time and reliability depend on the software architecture, and that reliability should be designed into the application?”

F: “Yes, you need to cater for it in your application design; otherwise you won’t get the required service availability.”

J: “Well, I’ve packed my stuff into Docker containers and have Kubernetes manage the pods. I’ve also got them load-balanced over two availability zones, and I’ve set up the standard health checks so that every time a pod fails its liveness probe, Kubernetes restarts it automatically (roughly the setup sketched below). While it’s restarting and ‘not ready’, all the traffic is served by the other load-balanced pods. So why is my service availability still not what I’m expecting?”
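For reference, the probe setup J describes looks roughly like this in a container spec; the endpoint paths, port, and timings below are hypothetical, not taken from an actual deployment.

```yaml
# Hypothetical liveness/readiness probes on a container.
# A failing liveness probe makes Kubernetes restart the container;
# a failing readiness probe only removes the pod from load-balancing.
livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8080
  initialDelaySeconds: 5    # grace period after startup
  periodSeconds: 10         # probe every 10 seconds
  failureThreshold: 3       # restart after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /ready            # assumed readiness endpoint
    port: 8080
  periodSeconds: 5
```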

F: “What seems to be the problem?”

J: “Well, I’ve noticed that even when the application looks alive to Kubernetes, internally it may have drifted into an erroneous state due to a bug. According to the Kubernetes probes the application is still healthy, but one or two internal processes may hit some buggy code and start giving wrong responses, or the system starts to slow down. Over time the errors build up until we get a catastrophic situation, and only then will Kubernetes restart the pod. Until that restart happens, I can’t be sure whether I’m getting correct answers from the application service.”

F: “Ahhh, you need to understand that Kubernetes’ control granularity is at the pod level. It sounds like you need some management control over your application’s processes.”

J: “What do you mean? Control over the application processes?”

F: “Let me give you an example. You know about high-performance fault-tolerant telco switches and how they provide five nines (99.999%) of service reliability? Ericsson, a telco switch vendor, created a software language and framework called Erlang/OTP. To achieve that high availability, the creators of Erlang/OTP mandated that software processes share no state, and the platform provides a sophisticated error-recovery model for when there are issues: the problematic processes are restarted, not the whole switch. Done correctly, only the offending process gets restarted while the rest of the switch’s processes carry on (a toy illustration follows).”
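As a toy illustration (in Erlang, and emphatically not Ericsson’s actual switch code) of the isolation F is describing: each Erlang process has its own heap and shares no state, so a process crashing on a bug leaves its neighbours running untouched.

```erlang
%% Toy demo: a buggy process crashes on a badmatch,
%% while an unrelated process keeps running untouched.
-module(isolation_demo).
-export([run/0]).

run() ->
    Healthy = spawn(fun() -> loop(0) end),
    Buggy   = spawn(fun() -> 1 = 2 end),  % badmatch: only this process dies
    timer:sleep(100),
    io:format("healthy alive: ~p, buggy alive: ~p~n",
              [is_process_alive(Healthy), is_process_alive(Buggy)]).
    %% prints: healthy alive: true, buggy alive: false

loop(N) ->
    receive
    after 10 -> loop(N + 1)
    end.
```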

J: “Are you trying to say that the Ericsson switch is like a Kubernetes pod?”

F: “In an abstract manner, they are similar. Both have compute, memory, software, networking, and so on; the switch probably has more specialised hardware for network I/O, but it’s still a computer running processes. Let’s look at it this way: good enterprise data-centres are usually Tier-3 data-centres, which are designed for 99.982% availability, with N+1 fault tolerance and no more than about 1.6 hours of downtime per year (the arithmetic is below). The infrastructure of a Tier-3 data-centre would be hard-pressed to fail in normal circumstances. So what usually causes the application service to fail? It’s software bugs. The data-centre infrastructure seldom fails.”
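The 1.6-hour figure falls straight out of the availability number: a year has 8,760 hours, and the downtime budget is the 0.018% left over:

\[
(1 - 0.99982) \times 8760\,\text{h} = 0.00018 \times 8760\,\text{h} \approx 1.58\,\text{h per year}
\]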

F: “And you can’t prevent bugs from being created; we are humans, after all. To achieve high availability, the system needs to minimise the fallout when it encounters a software bug. Ericsson’s switch implemented features like process supervision trees to contain the damage when a bug surfaces in any one process. Software bugs are inevitable, but recognising this, and then implementing a software framework, architecture, and mechanics to deal with them, is what made the telco switch so reliable. WhatsApp used the same technique for its messaging platform while employing only about 50 engineers company-wide, and RabbitMQ uses the same techniques too.”

J: “So are you saying that Kubernetes isn’t effective for reliability?”

F: “No, Kubernetes is great for coordinating and orchestrating containers and pods. What I’m saying is that you need another layer to manage your application processes. Let me give you another example. You run Linux servers, don’t you? Say one day your print-spooler process hangs and print jobs fail. What you’d do is look up the process ID of the spooler, issue a ‘kill’ command to that process, and then restart the print daemon. You don’t reboot the entire Linux machine just because the print spooler died; other processes are running fine and still doing their work. In effect, restarting the whole machine is what Kubernetes does: it restarts the entire pod, not just the offending process, killing the other functioning processes at the same time. It works, but it’s a brute-force way of doing things. And, as you mentioned, the Kubernetes probes can’t tell whether the pod is actually behaving correctly.”

J: “I think I understand now. We need another layer to manage our software processes. Like … what do you call it again?”

F: “Process supervision trees. There are programming platforms, like Erlang/OTP, that implement them natively, and there are also services and libraries for Go, JavaScript, Java, Python, and others that provide similar process-management capabilities. A supervisor process monitors its child processes, restarts any that fail, and follows a configurable restart strategy that governs how they come back (a minimal sketch follows).”
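Here is a minimal sketch of such a supervision tree in Erlang/OTP. The worker module my_worker and its start_link/0 function are assumed to exist; the one_for_one strategy restarts only the crashed child and leaves its siblings untouched.

```erlang
%% Minimal one_for_one supervisor sketch (assumes a my_worker module).
-module(my_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy  => one_for_one, % restart only the failed child
                 intensity => 5,           % at most 5 restarts...
                 period    => 10},         % ...within any 10-second window
    Children = [#{id      => my_worker,
                  start   => {my_worker, start_link, []},
                  restart => permanent}],  % always restart this child
    {ok, {SupFlags, Children}}.
```

If restarts exceed the intensity/period budget, the supervisor itself terminates and escalates the failure to its own supervisor, which is what makes the structure a tree rather than a flat list of watchdogs.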

J: “Ah … that’s why you’ve been preaching about software architectures being important for reliable systems. Does this mean I don’t need to architect redundant multi-site infrastructure if I adopt such supervision-tree patterns?”

F: “Multi-site infrastructure gives you another layer of redundancy, but your application needs to be able to leverage it. Application load-balancers can only do so much if you have ‘state’ in your programming, which is why I always advocate following the 12-Factor-App mantra if you want to develop scalable, reliable systems. Addressing ‘state’ in a software architecture is another level of difficulty if you’re not experienced with it, and going into a microservice architecture means dealing with distributed-systems issues and ‘state’ consistency. But that’s another discussion topic by itself.”

F: “What I’m alluding to is that a single Tier-3 site with 99.982% availability has good-enough infrastructure reliability for the majority of the enterprise applications we build. Sure, you can add a layer of site redundancy, and Kubernetes will help manage the pod cluster across multiple sites and zones, but that’s still only pod-level granularity. You need to go one level deeper into the software architecture and include application-process management instead of just restarting pods. A multi-zone distributed solution is complex and expensive, and often simply isn’t called for.”

J: “OK, I think I’m aligned now. Kubernetes gives me container- and pod-level infrastructure management, but I still need to architect the application with its own process-management capability, such as supervision trees. And this matters even more now that we’re moving into microservices.”

J: “Hey, I remember too that you’ve been touting functional programming and avoiding side-effects, and how that will help us developers reduce bugs?”

F: “Ahh, yes, that’s another passion topic of mine. Let’s grab some coffee before we go down that path.”

Additional Links:

1. Kubernetes Health Check Status Probes — https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-setting-up-health-checks-with-readiness-and-liveness-probes

2. 12-Factor-App — https://12factor.net/

3. Data Center Tier Classification Standard — https://uptimeinstitute.com/tiers

4. Ericsson and Erlang — https://en.wikipedia.org/wiki/Erlang_(programming_language)

5. Erlang/OTP Supervision Trees — http://erlang.org/documentation/doc-4.9.1/doc/design_principles/sup_princ.html

6. WhatsApp 50 Engineers for 900M users — https://www.wired.com/2015/09/whatsapp-serves-900-million-users-50-engineers/

7. RabbitMQ (AMQP and STOMP) Message Broker — https://en.wikipedia.org/wiki/RabbitMQ

8. Process Supervision Libraries — https://en.wikipedia.org/wiki/Process_supervision
