How Can I Isolate, Do Maintenance On, and Debug an ALB/Ingress Controller? IBM IKS Cheat Sheet #2

This post targets use cases where I have multiple IBM Kubernetes Service ALBs / Ingress Controllers working in parallel (for example in a multizone setup) and I want to do maintenance or debugging on one of them. I want to remove one ALB IP from DNS (and thereby remove production traffic from it), but still keep it operational without affecting existing connections or the other ALB(s).

When would I do this?

There could be a number of reasons. One use case is when I want to do maintenance in one zone and take that ALB out of production: reboots, re-creating worker pools, or other bigger changes that could affect my traffic or capacity.

There are other ways to remove an Ingress controller; disabling the ALB would also work, but that takes the ALB offline: disabling takes the ALB pods down immediately and they stop responding to traffic, while it still takes some time until that ALB IP is removed from DNS. Clients whose resolvers still return that IP would hit a dead ALB, so it is not a good idea in production.

Another use case would be if I have 3 ALBs and for some reason one of them returns error codes or behaves oddly, so I want to do some debugging only on that one.

Summary: the goal is to remove one ALB IP from the production host name I want to work on, while the ALB is still up and running, so I can wait until traffic has drained before I start any disruptive work on it.

Important note: Although the IKS ALB host names are set with a 30-second DNS TTL, in reality many DNS resolvers and applications cache DNS records for longer. It is generally suggested to wait two hours before considering the IP actually removed from production traffic. (You can also monitor your traffic by checking the specific ALB's logs.)
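
If you want to see the TTL your resolver is handing out, dig prints the remaining TTL as the second column of the answer section (using the example cluster host name introduced later in this post):

$ dig +noall +answer arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud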

How do I do this?

Simply by changing the health check response for that ALB IP from “healthy” to “not found”. The health checker will then automatically remove the IP from DNS.
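
Before breaking anything, you can check what the health checker sees: a healthy ALB answers the albhealth host name with healthy. (This uses the example ALB IP introduced below; it is the same check we use in step 8 to verify the restore.)

$ curl -X GET http://169.62.196.238/ -H "Host: albhealth.arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud"
healthy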

Before we start

Make sure you went through the basics explained in my previous post; several commands from that post are needed to make this work.

For further details on how the IKS ALB/Ingress Controller works in singlezone and multizone environments, please visit the official documentation.

Example environment

My example cluster is in the us-south region of IBM Cloud, and I chose to run in two zones, dal10 and dal13.

$ ibmcloud ks albs --cluster arpad-ipvs-test-aug14
$ host arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud

I am going to remove ALB1 with the 169.62.196.238 IP from DNS, while ALB2 with the 169.62.196.222 IP takes all the production traffic; both will still be up and reachable.


Removing an ALB IP from DNS

1.) Check which ALB is serving the IP 169.62.196.238:

$  ibmcloud ks albs --cluster arpad-ipvs-test-aug14 |grep 169.62.196.238

It is: public-cr24a9f2caf6554648836337d240064935-alb1 in my cluster.


2.) Important: You need to disable the health check on all ALB1 pods (I have two), so you have to run steps 3, 4, and 5 for every ALB1 pod in the following list:

$ kubectl get pods -n kube-system |grep public-cr24a9f2caf6554648836337d240064935-alb1

Hint: Note the pod names in the output above, ending in -8rvtq and -trqxc.
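
Optional: if you would rather not copy the pod names around, a minimal helper loop like the one below runs the same command in every ALB1 pod. It assumes the ALB pods carry an app=<ALB-ID> label, which you can verify with kubectl get pods -n kube-system --show-labels; swap the grep for the sed and reload commands from steps 4 and 5 as needed:

$ for POD in $(kubectl get pods -n kube-system -l app=public-cr24a9f2caf6554648836337d240064935-alb1 -o jsonpath='{.items[*].metadata.name}'); do kubectl exec "$POD" -n kube-system -c nginx-ingress -- grep server_name /etc/nginx/conf.d/kube-system-alb-health.conf; done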


3.) Check the state of the server_name line in the nginx.conf:

POD A

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-8rvtq -n kube-system -c nginx-ingress -- grep server_name /etc/nginx/conf.d/kube-system-alb-health.conf

POD B

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-trqxc -n kube-system -c nginx-ingress -- grep server_name /etc/nginx/conf.d/kube-system-alb-health.conf

This looks good, both server_names are present.
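
For reference, on a healthy ALB pod the grep should return an uncommented line roughly like the following; the albhealth host name matches the one we curl against in step 6 (the exact layout of the conf file may differ):

server_name albhealth.arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud;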


4.) Let's insert a # in front of the server_name directive; this breaks the health check, which then removes the IP from DNS:

POD A

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-8rvtq -n kube-system -c nginx-ingress -- sed -i -e 's*server_name*#server_name*g' /etc/nginx/conf.d/kube-system-alb-health.conf

POD B

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-trqxc -n kube-system -c nginx-ingress -- sed -i -e 's*server_name*#server_name*g' /etc/nginx/conf.d/kube-system-alb-health.conf

Note: The command has no output if successfully executed.

Check that the change happened:

POD A

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-8rvtq -n kube-system -c nginx-ingress -- grep server_name /etc/nginx/conf.d/kube-system-alb-health.conf

POD B

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-trqxc -n kube-system -c nginx-ingress -- grep server_name /etc/nginx/conf.d/kube-system-alb-health.conf

Good, the # is there in front of server_name.


5.) Reload the nginx config for the ALB:

POD A

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-8rvtq -n kube-system -c nginx-ingress -- nginx -s reload

POD B

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-trqxc -n kube-system -c nginx-ingress -- nginx -s reload

Note: The command has no output if successfully executed.


6.) Test whether the albhealth host name still responds with healthy, which at this point it should not (hint: run it a few times to cycle through the multiple ALB1 pods):

$ curl -X GET http://169.62.196.238/ -H "Host: albhealth.arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud"

Good, we got a 404; this means the health check will fail. (The health checker expects “healthy” as the response.)


7.) Give it a minute or two and you should see the IP removed from the DNS response:

$ host arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud

Good. My 169.62.196.238 IP is no longer in the DNS response. You can also ask the authoritative Cloudflare name server directly, if you are worried that your DNS resolver is not honoring TTLs:

$ host arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud ada.ns.cloudflare.com

Alright, I have removed the 169.62.196.238 ALB from DNS; what can I do now? I can go ahead and run my debug tests against the application, talking to the ALB with the following command, while new production traffic goes to the healthy ALB (169.62.196.222) and .238 is draining:

$ curl -X GET --resolve my-app.arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud:443:169.62.196.238 https://my-app.arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud/

Note: I resolve the host name specifically to the IP 169.62.196.238 and use port 443.
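
If you only care about the HTTP status code (handy when scripting checks during debugging), a variant of the same curl could look like this; my-app is the example host name from above:

$ curl -s -o /dev/null -w "%{http_code}\n" --resolve my-app.arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud:443:169.62.196.238 https://my-app.arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud/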

Also, if this service was in production: once my 169.62.196.238 ALB1 has drained completely (judged from my application's usage pattern), I can disable it or do anything else that impacts the pods' availability. The suggested wait time is a minimum of 2 hours after the IP is removed from DNS.
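
One way to actually watch the traffic drain is to tail the access logs of the ALB1 pods (assuming the nginx-ingress container writes its access log to stdout, which is the usual default), here for POD A:

$ kubectl logs -f public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-8rvtq -n kube-system -c nginx-ingress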


Restoring an ALB IP to DNS

8.) Restoring the health check:

POD A

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-8rvtq -n kube-system -c nginx-ingress -- sed -i -e 's*#server_name*server_name*g' /etc/nginx/conf.d/kube-system-alb-health.conf

POD B

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-trqxc -n kube-system -c nginx-ingress -- sed -i -e 's*#server_name*server_name*g' /etc/nginx/conf.d/kube-system-alb-health.conf

(Optionally check if # is removed as described in step 3.)


9.) Reload nginx so the change takes effect (this is basically the same as step 5):

POD A

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-8rvtq -n kube-system -c nginx-ingress -- nginx -s reload

POD B

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb1-7f78686c9d-trqxc -n kube-system -c nginx-ingress -- nginx -s reload

Note: The command has no output if successfully executed.

Health check is back to healthy:

$ curl -X GET http://169.62.196.238/ -H "Host: albhealth.arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud" 
healthy

The health check will pick it up, and the IP will be registered in DNS again after a few minutes:

$ host arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud
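
If you would rather poll than re-run the command by hand, a simple shell loop does the job (stop it with Ctrl-C):

$ while true; do host arpad-ipvs-test-aug14.us-south.containers.appdomain.cloud; sleep 30; done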

Note: We are working on automating this for all ALBs and LoadBalancer services via a single API call, which will greatly reduce the number of steps you have to go through. In the meantime, please use these steps. Thank you for reading. :) Happy to hear feedback on our Slack channel.

Debugging IKS ALB / Ingress? Check out the official documentation page.

Additional articles:
- Useful commands on the IKS Ingress/ALB Cheat sheets.
- IKS ALB/Ingress Controller Timeouts, Dropped Websocket Connections