Request Load Distribution in Kubernetes and AWS

Vinod Canumalla
Expedia Group Technology
10 min read · Jul 11, 2018

Kubernetes (k8s) facilitates deployment automation and efficient sharing of compute resources on cloud infrastructure such as AWS, allowing developers and Ops teams to easily deploy and scale components. However, some useful features go unnoticed or, worse, are used in the wrong way. One such feature is uniform request distribution, or load balancing, across multiple pods in a k8s cluster.

In physical infrastructure this is usually done by network appliances such as Citrix NetScaler, but in k8s and AWS infrastructure layer 7 load balancing is somewhat lacking. A load balancer appliance works at layer 7, which provides rich features such as connection pooling, connection persistence, compression, and a variety of load balancing algorithms for distributing requests evenly across back-end instances. In AWS and k8s, by default you get layer 4 load balancing using TCP listeners, which provides none of those features.

AWS ELB load balances back-end instances with either request load balancing for HTTP listeners or connection load balancing for TCP listeners. The AWS Classic Load Balancer (ELBv1) provides only a round-robin load balancing algorithm for TCP listeners. With the AWS Application Load Balancer (ELBv2) you get limited connection pooling, path-based routing, and the least-outstanding-requests routing algorithm for HTTP and HTTPS listeners. In k8s, a Service (the Kubernetes object that provides layer 4 load balancing for multiple pods) uses kube-proxy iptables rules to load balance client connections to the back-end pods with a random algorithm.

The request load imbalance across pods in a k8s cluster on AWS is quite evident, particularly when migrating a component stack that was optimised for physical infrastructure to cloud infrastructure. Uneven load balancing causes hot spots on certain pods, making it difficult to plan the number of replicas needed for a deployment and degrading the user experience as k8s frequently evicts pods and creates new ones. Migrating components from physical infrastructure to k8s on AWS has highlighted the following areas that contribute to request load imbalance:

  1. DNS Caching within client components
  2. AWS ELBv1 Layer 4 load balancing and its limitations
  3. k8s Service “externalTrafficPolicy”
  4. k8s pod scheduling, rolling-updates and cluster maintenance
  5. Connection persistence within k8s cluster

1) DNS Caching within client components

To explain the DNS caching issue, I first need to explain the way AWS configures ELBs based on the number of availability zones (AZs).

An AWS ELB has a host name and multiple IP addresses (multiple A records for a CNAME) based on the number of availability zones configured. Usually your web application has a website name, a Route53 entry, which points to an ELB host name. The ELB host name is a random name because the ELBs are managed by k8s, so they can be destroyed and recreated during k8s cluster deployments. For example, your web application’s Route53 entry is “myApp.nginx-ingress.k8s.aws.hcom.cloud”, which points to the ELB host name “internal-a93fc8835ad9911e79aa80274518d752-1486378487.us-west-2.elb.amazonaws.com”. If you resolve these names against a DNS server, you see the following:

$ host internal-a93fc8835ad9911e79aa80274518d752-1486378487.us-west-2.elb.amazonaws.com
internal-a93fc8835ad9911e79aa80274518d752-1486378487.us-west-2.elb.amazonaws.com has address 10.28.18.180
internal-a93fc8835ad9911e79aa80274518d752-1486378487.us-west-2.elb.amazonaws.com has address 10.28.37.246
internal-a93fc8835ad9911e79aa80274518d752-1486378487.us-west-2.elb.amazonaws.com has address 10.28.55.112
$ host myApp.nginx-ingress.k8s.aws.hcom.cloud
myApp.nginx-ingress.k8s.aws.hcom.cloud has address 10.28.18.180
myApp.nginx-ingress.k8s.aws.hcom.cloud has address 10.28.37.246
myApp.nginx-ingress.k8s.aws.hcom.cloud has address 10.28.55.112

From the above it is clear that the Route53 entry and the ELB host name resolve to the same set of IP addresses, and there are 3 IP addresses because 3 availability zones are configured.

Now a client that sends requests to the Route53 entry will resolve the host name to one of the 3 IP addresses each time it queries the DNS server. If the client caches that IP address for a long period, then all of its requests go to a single IP and availability zone. Java clients in particular may cache DNS responses indefinitely by default, which is bad for request load distribution across multiple AZs.

The root cause is that AWS ELB instances are load balanced via DNS records. To work around this, clients must be aware of the multiple IP addresses behind the ELB/Route53 entries and cycle through all of them so that request load is distributed evenly to the ELB instances in all AZs. This can be achieved either by lowering the DNS response cache TTL or by completely disabling DNS response caching within the client components.

For example, a Java client’s JVM may cache resolved host IP details forever, so the JVM DNS cache TTL must be set to a low value (for example, 0) to avoid uneven request load distribution. The JVM flag below can be used to disable the JVM DNS cache.

-Dsun.net.inetaddr.ttl=0   # per-JVM system property (Java 8 or earlier)

# If the global property "networkaddress.cache.ttl" is enabled in the
# "$JAVA_HOME/lib/security/java.security" file, the per-JVM override above is
# ignored, because the "java.security" settings apply to every JVM running on
# the host. Using "networkaddress.cache.ttl" in "java.security" is preferable.
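For reference, here is a minimal sketch of the preferred approach; the property names are standard JDK security properties, but the TTL values are illustrative rather than recommendations:

# In $JAVA_HOME/lib/security/java.security (Java 8; the file lives under
# $JAVA_HOME/conf/security in later releases).
# Cache successful DNS lookups for 5 seconds, and do not cache failed lookups.
networkaddress.cache.ttl=5
networkaddress.cache.negative.ttl=0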

The graph below compares client DNS caching enabled vs disabled. The 3 lines are the connection counts (which closely track request counts) to the 3 ELB instances in the 3 AZs.

2) AWS ELBv1 Layer 4 load balancing and its limitations

AWS ELBv1 and ELBv2 both support cross-zone load balancing, which enables an ELB to send connections/requests equally to all available worker nodes behind the load balancer. However, with ELBv1, cross-zone load balancing is disabled by default when the ELB is provisioned through the API (it is enabled when provisioned in the AWS Console), and, because its TCP listeners work at layer 4, only connections (not individual requests) are distributed across the worker nodes within an AZ.

Once a connection is established between the client and a back-end instance, all requests on that connection go to the same instance, irrespective of how many requests each connection carries, which can leave worker nodes serving very different numbers of requests.

It is essential to understand how this behaviour of ELBv1 and TCP listeners contributes to request load imbalance. So once you have lowered the client DNS caching, as explained above, the next step is to ensure cross-zone load balancing is enabled on the ELBs (see the sketch below) and, if using a TCP listener, to ensure the component running on the back-end instances recycles its connections frequently.
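As a sketch (the load balancer name is a placeholder), cross-zone load balancing can be enabled on a Classic ELB with the AWS CLI:

aws elb modify-load-balancer-attributes \
    --load-balancer-name <your-elb-name> \
    --load-balancer-attributes "{\"CrossZoneLoadBalancing\":{\"Enabled\":true}}"

For ELBs that k8s creates for a Service of type LoadBalancer, depending on your k8s version and cloud-provider integration, the same can usually be done declaratively with the Service annotation service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true", so the setting survives ELB re-creation.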

3) k8s Service “externalTrafficPolicy”

For some parts of your component stack (e.g. front-end apps) you may want to expose a Service on an external (outside the k8s cluster) load balancer such as an AWS ELB. This is achieved by creating a Service of type LoadBalancer. One of the properties of this Service is “externalTrafficPolicy”, which defines how connection/request routing is performed from an AWS ELB to the k8s pods running behind the Service on worker nodes. “externalTrafficPolicy” has two configuration values, “Cluster” and “Local”:

A “Cluster” value means the ELB is configured with all worker nodes as “InService”, and on each worker node the kube-proxy iptables rules are set up to route the traffic to the relevant pods running anywhere in the cluster.

A “Local” value means the ELB is still configured with all worker nodes, but only the worker nodes that run the relevant pods are set to “InService”; all others are “OutOfService”. This is because the kube-proxy iptables rules are set up only on the worker nodes with the relevant pods, and they route traffic only to pods local to that worker node.

By default, k8s sets externalTrafficPolicy to Cluster, which means a request can be routed to a relevant pod running on any worker node within the k8s cluster. This default configuration can add an extra hop for requests and responses: when the ELB sends a request to a worker node with no relevant pod running, the kube-proxy iptables rules redirect the connection/request to a pod on another worker node. By setting externalTrafficPolicy to Local, the ELB sends requests directly to the worker nodes that run the relevant pods, and each request is served by a pod local to that worker node. This avoids the additional hop for both request and response, which sounds great. Unfortunately, the “Local” externalTrafficPolicy is prone to uneven load distribution under certain conditions, such as k8s placing an unequal number of pods on each worker node. (The Service manifest sketch below shows where this policy is set.)
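For illustration, here is a minimal Service manifest of type LoadBalancer; the name, labels and ports are placeholders, and externalTrafficPolicy defaults to Cluster if omitted:

apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster   # or "Local" to avoid the extra hop (see the caveats below)
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 8080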

As shown in this diagram, the number of pods on each worker node and the number of worker nodes in each AZ differ, and hence each worker node, and each of the 8 pods of the “myApp” component, receives a different percentage of the workload.

An ELB is only aware of worker nodes; it is not aware of the number of pods running on each one. So the ELB sends connections/requests to the “InService” worker nodes uniformly in a round-robin fashion, and the pods on a worker node share whatever connections/requests arrive at that node. This results in an unequal request load per pod whenever the number of pods per worker node is unbalanced.

You can use the Local externalTrafficPolicy as long as an equal number of worker nodes is provisioned per AZ and k8s places an equal number of pods on each worker node; then load distribution is uniform and the extra hop is avoided. But when pods are placed non-uniformly across worker nodes, it is advisable to set externalTrafficPolicy to “Cluster” for uniform load distribution across pods. The extra hop is negligible compared with the impact of the load imbalance.

4) k8s pod scheduling, rolling-updates and cluster maintenance

A single k8s cluster can span multiple AZs, giving a single cluster per AWS region. But k8s is only partially aware of AZs and does not enforce AZ affinity when scheduling a deployment.

If a cluster is provisioned with an equal number of worker nodes in each AZ and the worker nodes have enough resources, then during the initial deployment of your component k8s will try to place the pods in a round-robin fashion across AZs and worker nodes. So there is a good chance of a uniform distribution of component replicas across worker nodes and AZs, but it is not guaranteed.

Also, k8s cluster rolling-updates, any maintenance activity, or an AZ failure will cause the pod distribution across worker nodes and AZs to drift, because k8s spins up the downed pods on other worker nodes to maintain the requested number of replicas.

There is no automatic redistribution of pods after the maintenance activity is completed or a failed AZ comes back online. So this pod imbalance can contribute to an uneven request load distribution when cross-zone load balancing is disabled or externalTrafficPolicy is set to Local, as explained above.
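One way to nudge the scheduler towards an even spread is a preferred pod anti-affinity rule in the deployment's pod template. This is only a sketch (the label and weights are illustrative), and a "preferred" rule is best-effort rather than a guarantee:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: myapp
        topologyKey: failure-domain.beta.kubernetes.io/zone   # prefer spreading replicas across AZs
    - weight: 50
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: myapp
        topologyKey: kubernetes.io/hostname                   # prefer spreading replicas across worker nodes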

5) Connection persistence within k8s cluster

In a k8s cluster, when you deploy your component with multiple replicas, a Service needs to be created to load balance across all of the replicas. The default ServiceType is ClusterIP, which exposes the Service on a cluster-internal IP. This makes the Service reachable only from within the cluster, which is useful for internal communication between components in the k8s cluster. Front-end components that talk to dependency service components can use this ClusterIP (which also has a DNS name of the form <component_name>.<namespace>) to send requests across all pods of that dependency service. ClusterIP Service load balancing works at the TCP level (layer 4), so it load balances only connections, not requests.
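As a minimal sketch (the names are illustrative), a ClusterIP Service for a dependency component might look like this; clients inside the cluster would reach it at my-dependency.my-namespace (or the fully qualified my-dependency.my-namespace.svc.cluster.local):

apiVersion: v1
kind: Service
metadata:
  name: my-dependency
  namespace: my-namespace
spec:
  type: ClusterIP          # the default ServiceType
  selector:
    app: my-dependency
  ports:
  - port: 8080
    targetPort: 8080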

Usually in a physical environment where layer 7 load balancing is used, connection persistence is enabled between front-end components and dependency service components for performance reasons. When these dependency service components are migrated to a k8s cluster, that connection persistence causes a request load imbalance because of the layer 4 load balancing of the k8s ClusterIP Service. Even when the connections are spread evenly across all pods of the dependency service, the ClusterIP Service cannot control the number of requests flowing through each connection. Also, when a pod is evicted, its existing connections can be re-assigned randomly to the remaining pods, and any new pod created by k8s may not receive any request load until new connections are established.

The following are some possible options to overcome the ClusterIP load balancing limitation within a k8s cluster.

  • Disable connection persistence (either at the client or by forcing it at the server), although this is not advisable for performance reasons.
  • Use layer 7 load balancing via an nginx ingress.
  • Wait until better load balancing support is implemented in Kubernetes.
  • Recycle connections frequently (on Tomcat, reduce the maxKeepAliveRequests value; see the sketch below). This only smooths out the request load distribution somewhat.

You will need to tune maxKeepAliveRequests (or the equivalent setting) based on the rate of requests your component receives, balancing uniform request distribution against connection flooding.

Too high a maxKeepAliveRequests value leads to less frequent connection recycling, and request distribution can become more imbalanced.

Too low a value causes frequent connection recycling and a more balanced request distribution. However, be careful when lowering it: recycling connections too often can cause connection flooding and latency spikes due to connection setup overhead.
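As a sketch of where this is tuned on Tomcat (the values are illustrative, not recommendations), the HTTP Connector in server.xml accepts a maxKeepAliveRequests attribute:

<!-- conf/server.xml: lower maxKeepAliveRequests to recycle connections more often -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           keepAliveTimeout="15000"
           maxKeepAliveRequests="100" />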

With persistent connections, the request rate distribution is all over the place, as the image below shows:

Recycling connections frequently made the request rate distribution much more uniform, which also helped reduce the number of replica pods required.

Conclusion

  • There is no automatic uniform request load distribution across a k8s cluster on AWS cloud infrastructure.
  • It is essential to manage request distribution evenly across pods within the k8s cluster and across AWS worker nodes and AZs.
  • Identify all clients that generate requests and ensure they are aware of the multiple IP addresses of dependency service endpoints and react quickly to changing IP addresses.
  • Understand AWS ELB and k8s load balancing features and enable the appropriate policies when integrating AWS and k8s components.
  • Understand the affinity and anti-affinity features of k8s and manage the cluster effectively to re-balance a component’s pods evenly across worker nodes and Availability Zones.
  • Tune your components appropriately based on the infrastructure features and the intermediary components that handle connections and requests.
