Solving AWS Aurora RDS load balancing problem with pgPool-II

Sumeet Shukla · talabat Tech · Sep 30, 2020

DNS-based load balancing has been around for a long time, and AWS Aurora relies on it to balance read traffic across the read replicas behind the reader endpoint. Recently, here at talabat, we observed incidents during sudden traffic spikes where the load was not balanced properly across all the reader nodes, leaving some nodes over-utilized while others sat under-utilized.

So, we took a deep dive into how load balancing works for the writer and reader endpoints. AWS uses DNS round robin to route traffic to the reader endpoint, with a TTL of five seconds. This is when we realized that the underlying algorithm may be fine for a consistent traffic pattern but is not efficient for traffic spikes, which are very common in our workload: split-second traffic peaks can result in an inconsistent load distribution across the reader nodes.

We decided to do a little experiment by querying the reader endpoint of one of our Aurora clusters, which has three replicas. Below are the results:

First Test

for ((i=0;i<100;i++)); do dig CNAME rds.cluster-ro-abcdefghij.eu-west-1.rds.amazonaws.com +short >> output.txt; done; cat output.txt | sort | uniq -c ; rm output.txt;

47 rds-1.abcdefghij.eu-west-1.rds.amazonaws.com.
53 rds-4.abcdefghij.eu-west-1.rds.amazonaws.com.

Second Test

for ((i=0;i<100;i++)); do dig CNAME rds.cluster-ro-abcdefghij.eu-west-1.rds.amazonaws.com +short >> output.txt; done; cat output.txt | sort | uniq -c ; rm output.txt;

47 rds-1.abcdefghij.eu-west-1.rds.amazonaws.com.
53 rds-4.abcdefghij.eu-west-1.rds.amazonaws.com.

Third Test

for ((i=0;i<100;i++)); do dig CNAME rds.cluster-ro-abcdefghij.eu-west-1.rds.amazonaws.com +short >> output.txt; done; cat output.txt | sort | uniq -c ; rm output.txt;

28 rds-3.abcdefghij.eu-west-1.rds.amazonaws.com.
72 rds-4.abcdefghij.eu-west-1.rds.amazonaws.com.

In all three tests, the DNS server returned at most two distinct endpoints instead of three. The last test also confirmed our suspicion that the load balancing across the reader endpoints is inconsistent.

The recently announced RDS Proxy service could have been a life-saver for us, but it wasn't suitable for our use case: RDS Proxy only handles connections through the writer endpoint and doesn't support connection pooling for the reader endpoints.

“In an Aurora cluster, all of the connections in the connection pool are handled by the Aurora writer instance. To perform load balancing for read-intensive workloads, you still use the reader endpoint directly for the Aurora cluster.”
Reference: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/rds-proxy.html

Now, to absorb the spikes and make use of newly autoscaled reader nodes, we had to find a way to distribute the load evenly across all the RDS reader nodes.

We already knew that pgPool-II (a reverse proxy server sitting between clients and PostgreSQL) is a very promising tool that provides connection pooling on top of load balancing, but it has the following shortcomings:

  • It only supports static configuration of backends, which doesn't work for Aurora with autoscaling enabled.
  • It only supports auto-healing of PostgreSQL backends when replication status is enabled (which is not the case in Aurora).
  • It does not natively support container/cloud environments. Our applications run on Kubernetes, and we wanted to use Kubernetes for the pgPool-II deployment as well.

So we decided to write a Golang wrapper on top of stock pgPool-II to address the shortcomings above while still leveraging pgPool-II's load balancing and connection pooling capabilities. Special thanks to Oluwaseun Obajobi (Oba) for writing and implementing the Golang wrapper.
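The wrapper itself isn't reproduced here, but to give an idea of the approach, below is a minimal sketch assuming the AWS SDK for Go (v1): it asks the RDS API for the current cluster members, renders one pgPool-II backend entry per reader instance, and asks pgPool-II to reload its configuration. The RDS_CLUSTER_ID variable, the renderBackends helper and the file paths are illustrative assumptions, not the actual implementation; the backend_hostnameN/backend_portN/backend_weightN parameters and the reload command are standard pgPool-II.

discover.go (illustrative sketch)

package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/rds"
)

// renderBackends asks the RDS API for the cluster members and renders one
// pgPool-II backend entry (backend_hostnameN/portN/weightN) per reader instance.
func renderBackends(svc *rds.RDS, clusterID string) (string, error) {
	out, err := svc.DescribeDBClusters(&rds.DescribeDBClustersInput{
		DBClusterIdentifier: aws.String(clusterID),
	})
	if err != nil {
		return "", err
	}
	if len(out.DBClusters) == 0 {
		return "", fmt.Errorf("cluster %q not found", clusterID)
	}

	var b strings.Builder
	n := 0
	for _, m := range out.DBClusters[0].DBClusterMembers {
		if aws.BoolValue(m.IsClusterWriter) {
			continue // only the readers are load balanced here
		}
		inst, err := svc.DescribeDBInstances(&rds.DescribeDBInstancesInput{
			DBInstanceIdentifier: m.DBInstanceIdentifier,
		})
		if err != nil {
			return "", err
		}
		if len(inst.DBInstances) == 0 || inst.DBInstances[0].Endpoint == nil {
			continue // instance still provisioning, skip it this round
		}
		ep := inst.DBInstances[0].Endpoint
		fmt.Fprintf(&b, "backend_hostname%d = '%s'\n", n, aws.StringValue(ep.Address))
		fmt.Fprintf(&b, "backend_port%d = %d\n", n, aws.Int64Value(ep.Port))
		fmt.Fprintf(&b, "backend_weight%d = 1\n", n)
		n++
	}
	return b.String(), nil
}

func main() {
	svc := rds.New(session.Must(session.NewSession()))
	backends, err := renderBackends(svc, os.Getenv("RDS_CLUSTER_ID"))
	if err != nil {
		log.Fatal(err)
	}
	// Append the generated backends to a static base config and reload pgPool-II
	// so it picks up newly autoscaled (or removed) replicas without a restart.
	base, err := os.ReadFile("/etc/pgpool/pgpool.conf.base")
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("/etc/pgpool/pgpool.conf", append(base, backends...), 0o644); err != nil {
		log.Fatal(err)
	}
	if err := exec.Command("pgpool", "reload").Run(); err != nil {
		log.Fatal(err)
	}
}

In the real wrapper this discovery runs continuously rather than once, but the shape of the solution is the same: keep pgPool-II's static backend list in sync with whatever Aurora autoscaling is doing.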

With our minds set on building this application to be cloud-native, we packaged it as a Docker image and deployed it to our Kubernetes cluster with a Helm chart. Running containers in production Kubernetes, especially as middleware between a service and a database, demands extreme reliability and has to consider scenarios like:

  • Accepting traffic only when the container is healthy.
  • Draining database connections before shutdown.

Accepting traffic only when the container is healthy

Pgpool-II doesn’t expose any endpoint for health-check but we can actually verify the availability by running a simple PSQL command

PGCONNECT_TIMEOUT=15 PGPASSWORD=pass psql -U postgres -d postgres -h 127.0.0.1 -tA -c "SELECT 1"

The above command was wrapped into a simple subcommand, which is used as the Kubernetes readinessProbe as below:

deployment.yaml
---
readinessProbe:
  exec:
    command: ["rds-pgpool", "check"]

With the above, we can guarantee that traffic is only accepted when pgPool-II is actually up and ready to serve it.
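For reference, here is a minimal sketch of what such a check subcommand could look like in Go, using database/sql with the lib/pq driver against the local pgPool-II listener. The port, user and database name mirror the psql command above and are assumptions about the local setup; the password is picked up from PGPASSWORD, which lib/pq honours.

check.go (illustrative sketch)

package main

import (
	"context"
	"database/sql"
	"log"
	"os"
	"time"

	_ "github.com/lib/pq" // PostgreSQL driver; reads PGPASSWORD from the environment
)

func main() {
	// Connect to the local pgPool-II listener, not to RDS directly, so the probe
	// only succeeds when pgPool-II itself can serve a query.
	db, err := sql.Open("postgres",
		"host=127.0.0.1 port=5432 user=postgres dbname=postgres sslmode=disable connect_timeout=15")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	var one int
	if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
		log.Println("pgpool not ready:", err)
		os.Exit(1) // non-zero exit marks the pod as not ready
	}
}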

Draining database connections before shutdown

pgPool-II offers a smart and a fast shutdown mode. According to the documentation, if the smart option is specified, pgPool-II will wait until all clients have disconnected before shutting down.

pgpool -m smart stop

This was the perfect option for us to shut down the containers safely without impacting the application clients. To implement the graceful shutdown in Kubernetes, we used a preStop hook as described below:

deployment.yaml
---
lifecycle:
  preStop:
    exec:
      command: ["rds-pgpool", "stop"]

The implementation resulted in evenly balanced database traffic across our reader nodes, with some of the positive outcomes described below.

We ran a load test against the pgPool-II setup deployed on Kubernetes via the Helm chart, and we could clearly see the difference in how smoothly CPU utilization scaled up. The load was almost equally balanced across all the nodes, including the newly autoscaled ones.

After autoscaling:

The same goes for the number of connections and sessions.

With this implementation, at talabat, we've been able to distribute database connections evenly across our replica endpoints with close to 100% efficiency. We are still actively looking into improving read replica scaling, which currently takes about eight minutes.

Subscribe to this blog and don’t miss our next article about how we solved the slow autoscaling with predictive scaling!!
