Saving inter-zone transfer costs in EC2 with HAProxy

By Ayush Goyal

TL;DR: We saved tens of thousands of dollars in EC2 inter-zone transfer costs by routing traffic zone-locally with HAProxy.

At Helpshift, our infrastructure is hosted in EC2. Like any sensible infrastructure in EC2, we use multiple Availability Zones for high availability of services. A caveat of this setup is the high cost of inter-zone network transfers.

A couple of years back, we realized that inter-zone transfer made up a significant chunk of our monthly AWS costs. Upon further digging, we found that most of this transfer came from our Redis cache cluster.

The Redis cluster had a few slaves distributed across multiple availability zones. Clients accessed one of the healthy slaves, discovered via Route53 health-checked DNS. The clients were not zone-aware and often made calls to slaves in other zones. We needed a solution in which clients would call healthy slaves within the same zone to avoid inter-zone transfer costs, but could still reach slaves in another zone if none were available in their own.

One idea was to have clients do this through HAProxy, a service we have come to rely on for nifty solutions. A quick look at the HAProxy documentation pointed us to the backup parameter for the server line. The documentation for this parameter reads:

When “backup” is present on a server line, the server is only used in load balancing when all other non-backup servers are unavailable. Requests coming with a persistence cookie referencing the server will always be served though. By default, only the first operational backup server is used, unless the “allbackups” option is set in the backend.

Another important setting to look out for when using the backup parameter is option allbackups for the backend. The HAProxy documentation explains it as follows:

By default, the first operational backup server gets all traffic when normal servers are all down. Sometimes, it may be preferred to use multiple backups at once, because one will not be enough. When “option allbackups” is enabled, the load balancing will be performed among all backup servers when all normal ones are unavailable.
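The two quoted rules can be modelled in a few lines of Python. This is only a sketch of the selection behaviour as the documentation describes it, not HAProxy's actual implementation; the Server class and eligible_servers function are hypothetical names, and persistence cookies are ignored.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    up: bool              # result of the health check
    backup: bool = False  # True if the server line carries "backup"


def eligible_servers(servers, allbackups=False):
    """Return the servers that would receive traffic under the quoted rules."""
    # Healthy non-backup servers always take precedence.
    active = [s for s in servers if s.up and not s.backup]
    if active:
        return active
    # All normal servers are down: fall back to the backups.
    backups = [s for s in servers if s.up and s.backup]
    # Without "option allbackups", only the first operational backup is used.
    return backups if allbackups else backups[:1]
```

With every non-backup server marked down, eligible_servers returns a single backup by default and all healthy backups when allbackups=True, which is why the option matters when one remote slave cannot absorb the whole load.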

All we had to do was set up HAProxy on the clients to access the Redis slaves, with a configuration that takes the availability zones of the client and the Redis slaves into account and marks slaves in other availability zones with the backup parameter. This was easily achieved with an Ansible template for the HAProxy configuration that used our EC2 inventory. For a client node hosted in AZ1, Ansible would generate the following configuration:

frontend redisslave
    mode tcp
    option tcp-smart-accept
    option splice-request
    option splice-response
    timeout client-fin 5s
    default_backend redis-slave-servers

backend redis-slave-servers
    mode tcp
    option tcp-smart-connect
    option splice-request
    option splice-response
    option allbackups
    timeout tunnel 70s
    server S1 check # S1 is in AZ1
    server S2 check backup # S2 is in AZ2
    server S3 check # S3 is in AZ1
    server S4 check backup # S4 is in AZ2
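The per-client generation step can be illustrated with a small Python stand-in for the Ansible template, which is not shown here. Everything in it is an assumption made for illustration: the render_server_lines function, the shape of the slaves mapping, and the IP addresses are made up, and 6379 is simply the default Redis port.

```python
def render_server_lines(client_zone, slaves, port=6379):
    """Build HAProxy server lines, marking slaves outside client_zone as backup.

    slaves maps a server name to an (address, zone) tuple; this shape is a
    hypothetical stand-in for whatever the EC2 inventory provides.
    """
    lines = []
    for name, (address, zone) in sorted(slaves.items()):
        line = f"    server {name} {address}:{port} check"
        if zone != client_zone:
            # Slaves in another zone are used only when local ones are down.
            line += " backup"
        lines.append(f"{line}  # {name} is in {zone}")
    return "\n".join(lines)


# Hypothetical inventory: placeholder addresses, zones as in the listing above.
slaves = {
    "S1": ("10.0.1.11", "AZ1"),
    "S2": ("10.0.2.11", "AZ2"),
    "S3": ("10.0.1.12", "AZ1"),
    "S4": ("10.0.2.12", "AZ2"),
}

print(render_server_lines("AZ1", slaves))
```

Calling the same function with "AZ2" flips the backup markers, so each client node gets a configuration that prefers the slaves in its own zone.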

With this config, the client connects through HAProxy to the Redis slaves in its own availability zone and switches to slaves in the other zone only if all slaves in its own zone are down. This solution was tested and deployed in production within a couple of days, and we observed a significant reduction in the next month's bill.

This also let us add or remove slaves in the Redis cluster without restarting the client service: each time the set of slaves changed, all we had to do was update the HAProxy configuration on the client nodes via Ansible. This solution has been working flawlessly in production for the last couple of years.