IKS ALB/Ingress Controller Timeouts, Dropped Websocket Connections; IBM IKS Cheat Sheet #3

I have issues with timeouts: the IKS ALB drops connections prematurely. How can I adjust the timeouts? I see my websocket connection being dropped after ~60 seconds…

Both keep-alive and Websocket connections are supported on the IKS ALB / Ingress Controller. If you see premature disconnects, there can be a lot of different reasons for it, but typically you hit a timeout somewhere in the stack.

When you are debugging, make sure that you:

  • use an Internet connection without proxies or firewalls that cut off long-lived connections,
  • run your tests from multiple locations (home ISP, phone tethering, etc.) to check whether the behavior is consistent.
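
To compare locations, it helps to measure how long an idle connection actually survives. The following is a minimal sketch, not an official tool: it performs a WebSocket handshake with curl, sends nothing afterwards, and prints the number of seconds until the connection is closed. The function name measure_idle_seconds, the placeholder URL and the 600-second cap are assumptions made for illustration; the Sec-WebSocket-Key value is the sample key from RFC 6455.

```shell
# Hypothetical helper: print how many seconds an idle WebSocket connection
# stays open before some hop on the path (or the application) closes it.
measure_idle_seconds() {
  url="$1"   # e.g. https://<YOUR INGRESS HOST>/<YOUR WS PATH>
  start=$(date +%s)
  # --http1.1 keeps curl from negotiating HTTP/2, which has no Upgrade header.
  # --max-time caps the run so it cannot hang forever (600s is an arbitrary choice).
  curl --silent --output /dev/null --no-buffer --http1.1 --max-time 600 \
    -H "Connection: Upgrade" \
    -H "Upgrade: websocket" \
    -H "Sec-WebSocket-Version: 13" \
    -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
    "$url"
  end=$(date +%s)
  echo $((end - start))
}

# Usage (run it from each location and compare the numbers):
#   measure_idle_seconds "https://<YOUR INGRESS HOST>/<YOUR WS PATH>"
```

If the number is roughly the same from every location, the timeout is most likely on the ALB side. If it differs between locations, something between you and the ALB is a more likely suspect. Keep in mind that an application that sends data or closes the connection on its own will skew the result.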

Timeouts

When you are connecting to your applications via the ALB / Ingress Controller, there are two locations where timeouts are configured:

  1. [Client] ===(*client-side-timeout*)=== [ALB] -- Default is 8 seconds. It applies only to HTTP keep-alive connections, so it typically does not interfere with WS, but you can test it. Once the connection is established between the [Client] and the [ALB], the [ALB] opens a new connection to the [POD] that runs your application, so there is also a second timeout:
  2. [ALB] ===(*upstream-side-timeout*)=== [Application POD] -- Default is 60 seconds

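Typically the issue with Websocket (WS) connections (when the [Application POD] is running WS) is the *upstream-side-timeout*: if the [Application POD] does not transmit any data for 60 seconds, the [ALB] closes the connection. The best way to confirm this is to check the ALB logs.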
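
One way to do that check is sketched below. It assumes that the ALB's nginx error output is reachable through kubectl logs for the nginx-ingress container (the container name used in the verification commands in this post), and it relies on nginx's standard "upstream timed out" wording for an expired read timeout. The function names are made up for illustration.

```shell
# Keep only the nginx error lines that report an upstream-side timeout.
filter_upstream_timeouts() {
  grep -i "upstream timed out"
}

# Hypothetical wrapper: scan the logs of one ALB pod in kube-system.
alb_upstream_timeouts() {
  alb_pod="$1"   # your ALB pod name
  kubectl logs "$alb_pod" -n kube-system -c nginx-ingress 2>&1 | filter_upstream_timeouts
}

# Usage:
#   alb_upstream_timeouts <YOUR ALB POD NAME>
```

Matching lines whose timestamps line up with your disconnects would be consistent with the upstream-side timeout being the cause.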

You have two options: enable heartbeats (Websocket) or increase the timeouts.


Enable Heartbeats (Websocket)

The easiest and safest way is to set some sort of heartbeat in your Websocket application so that the TCP connection is kept alive. Long-lived TCP connections with no apparent traffic are generally dropped by whichever timeout on the path is the most aggressive, and that can be not just the ALB but also a transparent proxy or a firewall that sits in front of the client. With frameworks such as WAMP you can enable heartbeats. The suggested interval is 58 seconds or less, to stay safely under the 60-second upstream default.


Increase Timeouts

The other option is to increase timeouts, which is only recommended if you know there is nothing on the path in the network that would operate with a lower timeout now or in the future (transparent proxy, firewall, etc.).

How do I change the client-side-timeout?

Related official documentation is found here under: Increasing the keepalive connection time:

1.) Edit the Ingress ConfigMap

$ kubectl edit cm ibm-cloud-provider-ingress-cm -n kube-system

2.) Add or change the keep-alive value to the timeout you want. (The default is 8s, and by default this line is not present in the ConfigMap.)

The keep-alive: setting in the ConfigMap will change the keepalive_timeout setting in the ALB nginx configuration.

apiVersion: v1
data:
  keep-alive: "300s"
kind: ConfigMap
metadata:
  name: ibm-cloud-provider-ingress-cm
  namespace: kube-system
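
If you would rather not open an editor (for example in a script), the same change can likely be applied with kubectl patch. This is a sketch and is not taken from the IBM documentation: the helper names are hypothetical, and 300s is just the example value used above.

```shell
# Build a JSON merge patch that sets only the keep-alive key.
keepalive_patch() {
  printf '{"data":{"keep-alive":"%s"}}' "$1"
}

# Hypothetical wrapper: apply the patch to the Ingress ConfigMap.
set_alb_keepalive() {
  kubectl patch cm ibm-cloud-provider-ingress-cm -n kube-system \
    --type merge -p "$(keepalive_patch "$1")"
}

# Usage:
#   set_alb_keepalive 300s
```

A merge patch only touches the key it names, so other entries in the ConfigMap's data section are left as they are.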

3.) Save and Verify:

$ kubectl get cm ibm-cloud-provider-ingress-cm -n kube-system -o yaml

Another way to verify is to check the nginx configuration inside the ALB pod. Replace the ALB pod name below with yours (to find out what your ALB pod name is, read my earlier post here or see the official documentation):

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb2-5cbb674fd5-tff54 -n kube-system -c nginx-ingress -- grep -H -R keepalive_timeout /etc/nginx/nginx.conf
/etc/nginx/nginx.conf:  keepalive_timeout 300s; # <-- Changed, good

How do I change the upstream-side-timeout?

Related documentation is found here, where you can also learn how to apply annotations.

1.) Add the following annotation to your ingress resource, to increase the 60 seconds default to 300 seconds:

ingress.bluemix.net/proxy-read-timeout: "serviceName=<YOUR SERVICE NAME> timeout=300s"

The proxy-read-timeout annotation will change the proxy_read_timeout setting in the ALB nginx configuration.
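
As an alternative to editing the Ingress manifest by hand, the annotation could also be set from the command line. This is a minimal sketch that assumes the usual kubectl annotate KEY=VALUE syntax; the function names and the namespace argument are hypothetical, and the value format is the one shown above.

```shell
# Build the annotation value in the format shown above.
proxy_read_timeout_value() {
  printf 'serviceName=%s timeout=%s' "$1" "$2"
}

# Hypothetical wrapper: set (or overwrite) the annotation on an Ingress resource.
set_proxy_read_timeout() {
  ingress="$1"; namespace="$2"; service="$3"; timeout="$4"
  kubectl annotate ingress "$ingress" -n "$namespace" --overwrite \
    "ingress.bluemix.net/proxy-read-timeout=$(proxy_read_timeout_value "$service" "$timeout")"
}

# Usage:
#   set_proxy_read_timeout <YOUR INGRESS NAME> <YOUR NAMESPACE> <YOUR SERVICE NAME> 300s
```

If the Ingress resource is kept in version control, add the annotation to the manifest as well, so that the next kubectl apply does not silently revert it.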

2.) How do you check if the configuration has taken effect?

$ kubectl exec -ti public-cr24a9f2caf6554648836337d240064935-alb2-5cbb674fd5-tff54 -n kube-system -c nginx-ingress -- grep -H -R timeout /etc/nginx/conf.d/
/etc/nginx/conf.d/default-source-ip-ingress.conf:              proxy_connect_timeout 60s;  # <-- This stayed 60s, good and expected
/etc/nginx/conf.d/default-source-ip-ingress.conf: proxy_read_timeout 300s; # <-- Changed, we are good.

Further useful articles:
- Useful commands on the IKS Ingress/ALB Cheat sheets.
- How Can I Isolate, do Maintenance and Debug an ALB/Ingress Controller