Elevating your Kubernetes Cluster’s Resilience: Pod Disruption Budget

Arie Bregman
7 min read · Mar 19, 2023


One of the challenges in managing Kubernetes clusters is ensuring high availability and fault tolerance. While the concept of replicas is useful for making sure multiple instances of your application exist, it doesn’t guarantee your application will always run without disruptions. This is where the Pod Disruption Budget (PDB) comes into play. A PDB is a Kubernetes feature that helps maintain the stability of an application by setting rules about how many disruptions your application can tolerate. In this article, we’ll deep-dive into PDB and learn what it is exactly, when you should use it, and why. The practical aspect of this topic is covered in a separate article.


What is a “Pod Disruption”?

A Pod disruption describes any scenario that impacts the running state of a Pod and causes it to be unavailable. It’s commonly divided into two types of disruptions:

  1. Involuntary Disruptions: any unplanned scenario over which we have limited or no control. For example, hardware failures and kernel panics.
  2. Voluntary Disruptions: planned actions performed by the cluster admin or application owner. For example, draining a node or manually upgrading a cluster.

What is a “Pod Disruption Budget”?

So now that you know what a Pod disruption is, let’s discuss the mechanism that is supposed to help us manage it. Simply put, a Pod Disruption Budget, or PDB, allows you to manage the number of replicas that should be available at any given point in time. When defining a PDB for an app, you specify one of the following:

  • The minimum number of replicas of a Pod that must be available at all times. Also known as min available.
  • The maximum number of replicas that can be unavailable. Also known as max unavailable.

Practically, this means that if, for example, your application has 6 replicas but you set a PDB with a minimum of 3 available replicas, the PDB won’t have any effect on your application as long as at least three replicas are running. Only when an operation would drop you below 3 replicas does the PDB kick in and block it. For example, a scale-down of your cluster will be stopped if fewer than 3 replicas would exist as a result.
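To make this concrete, here is a minimal sketch of such a PDB manifest. The names and the `app: my-app` label are hypothetical; the selector must match the labels of the Pods you want to protect:

```yaml
# Sketch of a PDB requiring at least 3 replicas of the
# (hypothetical) app labeled "app: my-app" to stay available.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: my-app
```

Applying it with `kubectl apply -f my-app-pdb.yaml` is enough; there is nothing to attach to the Deployment itself, since the PDB finds its Pods through the label selector.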

The following output is of kubectl get pdb, which lists the PDBs defined in our cluster. Assuming app-1 and app-2 each have one replica, the two PDBs achieve the very same thing: one uses “Min Available” (the minimum number of replicas that must be available) and the other “Max Unavailable” (the maximum number of replicas that can be unavailable).

$ kubectl get pdb

NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
app-1-pdb   N/A             0                 0                     3h53m
app-2-pdb   1               N/A               0                     3h53m

Is PDB the optimal solution for ensuring your application is always running?

In short, no. PDB isn’t your new all-can-do solution for making sure your application runs 100% of the time. Not only is it technically not bulletproof, but sometimes other methods should be applied to make sure your application keeps running properly. Let’s discuss some examples.

Let’s start with a simple example: say you have a Pod called my-app with one replica and a PDB applied with minAvailable=1, which means there must always be one replica running, so no disruptions to the Pod are allowed. What do you think will happen if you run kubectl delete po my-app? If your answer was “it will be deleted”, you are correct. PDB won’t prevent the Pod from being deleted, even though this violates the disruption budget, for the simple reason that directly deleting a Pod with kubectl is considered an administrative operation performed by the admin, rather than an operation managed through Kubernetes’ eviction mechanism, so PDB has no effect when the admin directly removes a Pod.

Let’s take another example: a node that gets drained in order to be upgraded. You have more than one node in your cluster and your application is highly available, but every time the cluster goes through an upgrade, all nodes get “drained” at the same time, so your application goes down. Should PDB be your default way to manage this? Not exactly. There are several more “subtle” solutions to try before applying PDB in this case. One example would be to make sure each replica runs on a separate node (using Pod anti-affinity rules) and to set an upgrade strategy where nodes are upgraded one at a time.
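As a sketch of that anti-affinity alternative (all names and the image are illustrative), a Deployment can require that no two of its replicas are scheduled onto the same node:

```yaml
# Illustrative snippet: spread replicas across distinct nodes
# with required Pod anti-affinity on the hostname topology key.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: my-app
          image: nginx:1.25  # placeholder image
```

With this in place, a one-node-at-a-time upgrade strategy only ever takes down a single replica, without involving a PDB at all.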

Do not be tempted to immediately go and apply PDB. Understand your environment and conditions so you can find the best-fit solution. PDB has pitfalls and hidden costs that one should be aware of before choosing it to manage any Pod disruption.

The pitfalls of PDB

Let’s go over what it actually costs to use PDB in your cluster:

  • PDB will block certain operations from being fully completed. Let’s say you would like to drain a certain node of all of its workloads. Under certain conditions, having a PDB will block this operation from completing, which results in a node with applications that can’t be evicted due to the PDB. It’s true that sometimes this is exactly what you want to happen, because that’s the reason you set the PDB in the first place: to make sure your application keeps running. But if not planned accordingly, PDB may interrupt existing processes in your environment.
  • While some operations will not be completely blocked by PDB, it will delay them. Remember the node drain example? Well, one of the processes that uses node drain is Kubernetes version upgrades. In GCP (Google Cloud Platform), for example, when upgrading the version of a GKE node pool, the nodes are cordoned and drained. While the PDB will at first prevent the upgrade from draining your nodes, it will happen eventually: GKE forces the operation after one hour, even with a PDB set. So in this case PDB not only won’t prevent your applications from going down, it will actually delay the whole process of upgrading the node pool.
  • PDB can affect your cluster’s ability to scale down. Imagine two nodes, each with completely different applications, so if one node is down, some of the applications will be impacted. But what if all the applications could run on the same node, and Kubernetes would scale down your cluster if it could? The problem: doing so would cause a disruption that impacts your application’s availability, so the PDB will prevent it from happening. You will then pay more for your cluster as the price of your application’s reliability.

PDB use-cases

Now that we’ve clarified that PDB isn’t the answer to every Pod disruption scenario and that using it should be thoroughly considered, let’s mention some of the use cases where you would actually want to use PDB:

  • Your application can’t be highly available for different reasons (e.g. cost considerations) but you would like to have autoscaling enabled. In this case, you need to make sure that the node that scales down isn’t the one where your non-highly-available application is running. PDB is a good option to deal with this scenario.
  • Your application is highly available, but some operations you run on your cluster nodes bring down all replicas of your application at the same time and cause disruption. PDB can help you ensure these operations happen gradually so each node is treated separately in its own turn, in a way that doesn’t impact your application availability.
  • You have an application with multiple replicas, but when you update the application, it takes down all the replicas at the same time, impacting the application’s availability. You can enforce rolling updates by using PDB, stating that only a specific number of replicas can be down at any given point in time (or, the other way around, how many must be available at minimum).
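For that last use case, the same budget can be expressed with maxUnavailable instead of minAvailable. Again, a hedged sketch with a hypothetical app name and label:

```yaml
# Sketch: allow at most 1 replica of the hypothetical
# "my-app" workload to be disrupted at any given time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
```

Note that a single PDB must use either minAvailable or maxUnavailable, not both; maxUnavailable also accepts a percentage (e.g. "25%"), which scales with the replica count.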

Prevent or won’t prevent?

So far we’ve mentioned a lot of theoretical conditions and workflows in which PDB can help. To summarize the topic, let’s clearly go over some scenarios in Kubernetes and whether PDB will be helpful in each (if you would like to self-check your knowledge on the topic, I recommend trying to answer before reading the content of each scenario).

Direct removal of a Pod with “kubectl delete pods”

Remember, directly removing a Pod is considered an “administrative operation” that doesn’t go through the eviction workflows of the Kubernetes API; hence, PDB won’t prevent the Pod from being deleted.

Eviction of Pods due to node memory pressure

When a node in Kubernetes reaches high memory utilization, it’s under “node memory pressure”, and to make sure Pods don’t start getting OOM-killed (killed due to out of memory), the kubelet starts evicting them from the node. Note that this node-pressure eviction is performed directly by the kubelet and doesn’t go through the eviction API, so it does not respect PDB: the budget won’t prevent these evictions.

Complete Removal of the Node

When a node is completely removed for whatever reason, all the workloads on it will be removed as well and PDB won’t stop the removal of the node.

Kubernetes Node Pool Upgrade

The way a node pool is usually upgraded is by applying a cordon & drain to each node. Since a drain evicts Pods through the eviction API, PDB will block (or delay) the drain if evicting the Pods would violate the budget, preventing disruption to your Pods.

Summary

In this article, we covered PDB, the Kubernetes feature for managing your application’s resiliency in case of disruptions. We started by explaining what a Pod disruption even means, then deep-dived into what PDB is and when you should use it (or not). Ready to get your hands dirty by experimenting with PDB? Jump to the practical guide.
