How session stickiness disrupts auto scaling in k8s

Asterisk ✳️ · Published in Nerd For Tech · 6 min read · Oct 9, 2021

In a desperate attempt to scale a mammoth monolith, I stumbled upon an interesting limitation of the AWS Application Load Balancer (ALB). Autoscaling and load balancing are two completely different concerns, but their coordination is crucial in forming efficient, scalable subsystems. This is the story of sticky sessions and how they ruthlessly affect the relationship between load balancing and scaling.

This is going to be a long, yet adventurous read. Grab your chips and keep scrolling.

Getting started

A nice-to-have prerequisite

A fair understanding of AWS load balancing and Kubernetes (k8s) horizontal pod autoscaling (HPA) will make the rest of this read much easier.

Even if you only have a lingering idea of these, let me breeze through the concepts used in this article.

Kubernetes HPA

Consider an application that is deployed on AWS and orchestrated by Kubernetes. To account for the fluctuating load on the application servers, the k8s HPA scales the pods out or in based on a configured criterion, say, average CPU utilization over all the running pods. This way, you only use the AWS resources you need, minimising under-utilisation and paying just for what you use.
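To make the scaling maths concrete, here is a minimal sketch of the replica calculation the HPA performs; the formula comes from the Kubernetes docs, while the numbers are made up for illustration:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    # Kubernetes HPA core formula:
    # desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
    return math.ceil(current_replicas * current_metric / target_metric)

# Example: 4 pods averaging 80% CPU against a 50% target scale out to 7 pods.
print(desired_replicas(current_replicas=4, current_metric=80, target_metric=50))  # 7
```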

AWS Application Load Balancer (ALB)

Say you configure the AWS ALB for your load balancing needs. Now that pods are frequently scaling in and out, the ALB has to distribute the load effectively to the underlying pods. It’s crucial for the load balancing algorithm to be optimal and to distribute the load uniformly.

Sticky sessions

Often used interchangeably with session affinity, stickiness is a feature of load balancers (here, the ALB) where the load balancing algorithm, while distributing the load, considers the correlation between requests and a server. In other words, the requests show “affinity” to a server, as guided by the LB algorithm.

Note: In the context of this article, consider a session to be a series of ‘n’ requests from a user session. Also, pods and servers are used interchangeably.

[Figure: with stickiness, all the requests from a specific client are routed to a single pod]
[Figure: without stickiness, requests from the same client are distributed across multiple pods]

Sticky sessions are helpful where maintaining the session state on the server optimizes performance, for example with an in-memory session cache.

Stickiness in AWS-ALB

To track stickiness, the ALB issues a cookie named AWSALB, which is relayed to and from the client on every request. For a new session with no AWSALB cookie, a server is chosen based on the configured LB algorithm. The AWS documentation on target group stickiness covers the details.
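As a rough sketch, duration-based (lb_cookie) stickiness can be enabled through the target group attributes, here using boto3, the AWS SDK for Python; the target group ARN is just a placeholder:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Enable duration-based stickiness (the AWSALB cookie) on the ALB's target group.
# The target group ARN below is a placeholder.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/my-app/abc123",
    Attributes=[
        {"Key": "stickiness.enabled", "Value": "true"},
        {"Key": "stickiness.type", "Value": "lb_cookie"},
        {"Key": "stickiness.lb_cookie.duration_seconds", "Value": "86400"},  # 1 day
    ],
)
```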

Now that you are comfortable with the basics, let’s get into the interesting stuff.

Problem-1: Balancing of sticky sessions

Our application has a similar setup, with the AWS ALB in front and k8s orchestrating the scaling needs. The application also reaps substantial benefits from caching users’ session state in memory, so it is naturally inclined to opt for session stickiness on the ALB.

When I experimented with load testing scenarios, the difference between non-sticky and sticky sessions, in the context of HPA, was stark.

Non-sticky scenario: load distribution to scaled pods

For non-sticky, stateless requests, when the HPA scaled out the pods, the newly created pods shared the load.

Sticky scenario: load distribution to scaled pods

For sticky requests, when the HPA scaled out the pods because of increased load on the existing pods, the new pods weren’t allocated any load, nullifying the scale-out.
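Here is a tiny, made-up simulation of that behaviour: when every incoming request belongs to an already-pinned session, a freshly scaled-out pod receives nothing.

```python
import random
from collections import Counter

pods = ["pod-1", "pod-2"]   # pods that existed before the scale-out
stickiness = {}             # session id -> pinned pod

def route(session_id: str) -> str:
    # Honour stickiness if the session is already pinned; otherwise pick a pod and pin it.
    if session_id not in stickiness:
        stickiness[session_id] = random.choice(pods)
    return stickiness[session_id]

# 100 long-lived sessions get pinned while only pod-1 and pod-2 exist.
sessions = [f"session-{i}" for i in range(100)]
for s in sessions:
    route(s)

pods.append("pod-3")        # the HPA scales out because the existing pods are busy

# A burst of traffic from the *existing* sessions after the scale-out:
load = Counter(route(s) for s in sessions for _ in range(10))
print(load)                 # pod-3 never appears: it receives zero requests
```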

Intensity of the problem

What if the sessions attached to a pod are idle for a while, and then a burst of requests from those sessions suddenly becomes active at the same time?

The load balancer silently does its job of routing these requests to the “sticky” server, even though there are other servers that could comfortably take the load, culminating in the eventual demise of that pod.

The load scenario is slightly artificial, despite simulating our real-world application usage patterns. It assumes that most sessions are long-lived and that new sessions are created comparatively rarely.

Further exploration led to another interesting problem.

Problem-2: Inefficient Load balancing algorithm

AWS ALB, by default, uses the Round-Robin algorithm to distribute requests. It naively cycles the requests through the available servers, ignoring the load on each of them.
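For reference, Round-Robin boils down to something like this toy sketch:

```python
from itertools import cycle

servers = ["server-1", "server-2", "server-3"]
rotation = cycle(servers)

def round_robin_route() -> str:
    # Hand out servers in a fixed rotation, blind to how busy each one is.
    return next(rotation)

print([round_robin_route() for _ in range(6)])
# ['server-1', 'server-2', 'server-3', 'server-1', 'server-2', 'server-3']
```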

For a while, let’s consider a more “real” situation where there is a balance between old sessions and the creation of new ones. Also, the load on the servers is not homogeneous.

For obvious reasons, Round-Robin fails here: it keeps handing new sessions to servers that are already busy.

LOR saves the day

Instead of the default Round-Robin strategy, the AWS ALB can be configured to use the Least Outstanding Requests (LOR) strategy. As the name says, the algorithm considers the number of “active” requests a server is handling while routing the traffic. Theoretically, LOR is more efficient than RR in cases where servers are handling uneven loads.
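Switching the algorithm is, roughly, one more target group attribute; again via boto3 and with a placeholder ARN:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Switch the target group from the default round_robin to least_outstanding_requests.
# The target group ARN below is a placeholder.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/my-app/abc123",
    Attributes=[
        {"Key": "load_balancing.algorithm.type", "Value": "least_outstanding_requests"},
    ],
)
```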


But stickiness spoils the party — yet again

Consider a hypothetical scenario.

I have 3 backend servers (with sticky sessions): Server1, Server2, and Server3. Each server has a maximum capacity of 200 active requests.

Consider the active requests that they are serving (and the load) at a particular point in time as:

  • Server1: 150 active requests (75%)
  • Server2: 200 active requests (100%)
  • Server3: 180 active requests (90%)

Note: Not all sticky sessions attached to corresponding servers have active requests at this point, i.e. Server2 may have multiple other attached sessions whose requests are not currently being served.

When a request from a fresh session, session-X, arrives, it is routed to Server1, based on the configured LOR algorithm, as Server1 has the fewest active requests.

When a request from a sticky session bound to Server2, say session-N, arrives, it is still served by Server2, the most loaded server, making LOR ineffective.
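Putting the same hypothetical numbers into code: LOR would send a fresh session to Server1, yet a sticky request still lands on the fully loaded Server2.

```python
active_requests = {"Server1": 150, "Server2": 200, "Server3": 180}
stickiness = {"session-N": "Server2"}   # an existing session pinned to Server2

def route(session_id: str) -> str:
    # A sticky session bypasses the algorithm entirely...
    if session_id in stickiness:
        return stickiness[session_id]
    # ...only new sessions benefit from least-outstanding-requests.
    return min(active_requests, key=active_requests.get)

print(route("session-X"))  # Server1: the fewest active requests
print(route("session-N"))  # Server2: already at 100% capacity
```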

Stickiness beats load balancing and nullifies HPA

So, what am I looking for?

An intelligent algorithm that, while routing the traffic, considers the number of sticky sessions associated with a server, not just the ones that are currently active.

TL;DR

  1. The LB should route the incoming requests to the server with the “least overall sessions attached”.
  2. The LB should re-route a sticky request to an optimal server when its current server is overloaded, capitalizing on the HPA (see the sketch after this list).
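To make that wish list concrete, here is a sketch of the hypothetical balancer. Every name and number in it is made up, and it conveniently glosses over the hard part, namely migrating the session state when a request is re-routed:

```python
MAX_ACTIVE = 200  # hypothetical per-server capacity

servers = {
    # server -> attached sessions (active or not) and currently active requests
    "Server1": {"sessions": set(), "active": 0},
    "Server2": {"sessions": set(), "active": 0},
    "Server3": {"sessions": set(), "active": 0},
}

def route(session_id: str) -> str:
    # 1) Stick to the pinned server if one exists; otherwise pick the server
    #    with the fewest *attached* sessions, active or not (TL;DR item 1).
    pinned = next((s for s, state in servers.items() if session_id in state["sessions"]), None)
    if pinned is None:
        pinned = min(servers, key=lambda s: len(servers[s]["sessions"]))
    # 2) If the pinned server is overloaded, re-pin the session to the least
    #    loaded server (TL;DR item 2); in reality the session state must migrate too.
    if servers[pinned]["active"] >= MAX_ACTIVE:
        servers[pinned]["sessions"].discard(session_id)
        pinned = min(servers, key=lambda s: servers[s]["active"])
    servers[pinned]["sessions"].add(session_id)
    servers[pinned]["active"] += 1  # a real LB would decrement this when the request completes
    return pinned
```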

If you haven’t guessed it already, here ends my quest to find this hypothetical load balancer.

On a side note: we brainstormed several ideas, like rejecting sessions at the application level and rate-limiting, but none of them were practical.

A few months later… I found this.

I came to know that the Gloo gateway’s Hash Ring algorithm supports session affinity with the added benefit of load redistribution (requirement #2 above) in the case of scaling events.

This comes with an asterisk: under the hood, the session state needs to be migrated to the newly chosen server, which opens a gateway to its own set of challenges. But that’s a story for another day.
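For intuition only, here is the general consistent-hashing idea behind ring-hash style affinity (this is not Gloo’s actual API): sessions hash onto a ring of virtual server nodes, so adding a server re-maps only the sessions that fall into its slice of the ring instead of keeping everything pinned forever.

```python
import bisect
import hashlib

def point(key: str) -> int:
    # Map any string to a position on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (position, server) virtual nodes
        for s in servers:
            self.add(s)

    def add(self, server: str) -> None:
        for i in range(self.replicas):
            bisect.insort(self.ring, (point(f"{server}#{i}"), server))

    def route(self, session_id: str) -> str:
        # A session is served by the first virtual node clockwise from its hash.
        positions = [p for p, _ in self.ring]
        idx = bisect.bisect(positions, point(session_id)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["pod-1", "pod-2"])
before = {f"session-{i}": ring.route(f"session-{i}") for i in range(1000)}
ring.add("pod-3")  # a scale-out event
moved = sum(1 for s, srv in before.items() if ring.route(s) != srv)
print(f"{moved} of 1000 sessions re-mapped")  # roughly a third, not all and not none
```

The sessions that do move are exactly the ones whose state would need to migrate to the new pod, which is where those challenges begin.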

Key takeaways

  • Stickiness is more of a “feature” than a limitation. It just doesn’t align with the principles of auto scaling.
  • There is no silver bullet to solve all of our scalability needs. It’s all about agreeing and living with the trade-offs.
  • Scaling a monolith is a herculean task.

Up next: I’ll be discussing yet another interesting scalability limitation, this time around JVM applications in containerized environments.
