How we cut Kafka Data Transfer costs by 90%

Kafka 2.4 rack-awareness helped us to drastically reduce our AWS Data Transfer costs

Ronny Roeller
NEXT Engineering
4 min read · Aug 21, 2020

We have been using Kafka, and later AWS’s hosted version “MSK”, for a couple of years — and we love it. Kafka has proven to be a reliable backbone of our microservice architecture.

Why are our Data Transfer costs exploding?

Around September 2019 we suddenly observed a drastic increase in our AWS Data Transfer bill: costs doubled within just 2 months. We suspected that this was caused by our MSK usage. We consulted CloudWatch to better understand what was going on and how we could counter the increased Data Transfer. Yet the MSK traffic shown in CloudWatch explained less than 1% of the Data Transfer on our AWS bill. Time for AWS support.

By May 2020, six months of emails and calls with AWS Support had passed, and there was still no resolution. In the meantime, the Data Transfer costs kept increasing by ~10% per month, making the issue more and more pressing.

Thanks to our AWS account manager, we finally got the right technical people to look into our issue. It turned out that the raw BytesOut count exposed in CloudWatch has to be multiplied by the number of consumers (microservices) to arrive at the data transfer volume on the bill. For us, every 1 MB of BytesOut translates to 100+ MB of data transfer!
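To make that multiplication concrete, here is a minimal sketch of the arithmetic. The function name and the sample numbers are illustrative assumptions, not figures taken from our bill, and the sketch assumes that every consumer reads its own full copy of the data.

```typescript
// Estimate the data transfer volume behind one CloudWatch BytesOut reading.
// Assumption: every consumer (microservice) fetches its own copy of the data.
function estimateTransferBytes(bytesOut: number, consumerCount: number): number {
  if (bytesOut < 0 || consumerCount < 0) {
    throw new RangeError("bytesOut and consumerCount must be non-negative");
  }
  return bytesOut * consumerCount;
}

const MB = 1024 * 1024;

// Illustrative numbers: 1 MB of BytesOut read by 100 consumers.
const transferred = estimateTransferBytes(1 * MB, 100);
console.log(`${transferred / MB} MB`); // 100 MB
```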

At the same time, our services are distributed over 3 availability zones. Up to Kafka 2.3, a client always fetched from the leader of a partition, which was most often in a different zone than the client. So although AWS does not charge for traffic within an availability zone, most of our traffic ended up on our bill as Data Transfer between availability zones.
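A back-of-the-envelope sketch shows why "most often" is to be expected. It assumes that partition leaders and clients are spread evenly and independently over the zones, which is an idealisation rather than a measurement from our cluster; the function name is made up for this example.

```typescript
// Share of leader fetches that cross a zone boundary, assuming leaders and
// clients are spread evenly and independently over `zoneCount` zones.
function crossZoneFraction(zoneCount: number): number {
  if (!Number.isInteger(zoneCount) || zoneCount < 1) {
    throw new RangeError("zoneCount must be a positive integer");
  }
  // A client shares a zone with the leader with probability 1 / zoneCount.
  return 1 - 1 / zoneCount;
}

console.log(crossZoneFraction(3).toFixed(2)); // 0.67
```

Under these assumptions, roughly two out of three fetches leave the client's zone when there are 3 zones.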

We concluded that there wasn’t any unnecessary traffic in our setup. The only solution was to route the traffic more smartly: Have each microservice read from the MSK broker in the same availability zone, thereby avoiding data transfer between availability zones.

Rack awareness to the rescue

Luckily, Kafka’s 2.4 release had introduced exactly what we needed: rack awareness. With this feature, Kafka clients know which rack (for us, which availability zone) they run in and can fetch from the closest broker.
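The idea can be illustrated with a small sketch. This is a simplified model, not Kafka's actual implementation: the Replica type and the selectReadReplica function are hypothetical names for this example, and the sketch ignores details such as whether a replica is in sync.

```typescript
interface Replica {
  brokerId: number;
  rack: string; // for us: the availability zone of the broker
  isLeader: boolean;
}

// Prefer a replica in the client's rack; otherwise fall back to the leader.
function selectReadReplica(replicas: Replica[], clientRack?: string): Replica {
  const leader = replicas.find((r) => r.isLeader);
  if (!leader) {
    throw new Error("partition has no leader");
  }
  // Without a known rack, or with a local leader, read from the leader.
  if (!clientRack || leader.rack === clientRack) {
    return leader;
  }
  return replicas.find((r) => r.rack === clientRack) ?? leader;
}

const replicas: Replica[] = [
  { brokerId: 1, rack: "az1", isLeader: true },
  { brokerId: 2, rack: "az2", isLeader: false },
  { brokerId: 3, rack: "az3", isLeader: false },
];

console.log(selectReadReplica(replicas, "az2").brokerId); // 2
console.log(selectReadReplica(replicas, "az4").brokerId); // 1 (no local replica, use the leader)
```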

The first part of adopting rack awareness proved trivial: we upgraded MSK to Kafka 2.4. The second part turned out to be much trickier. As our microservices are written in NodeJS, we don’t use the standard Java Kafka client but build on KafkaJS. KafkaJS itself is awesome but, naturally, new features tend to arrive a bit later than in the Kafka Java client. Rack awareness was one of them.

The power of Open Source

The beauty of Open Source: If you care enough, you can just do it yourself!

We took some inspiration from an earlier attempt by Figma, and set out to add rack support to KafkaJS. We made rack awareness work in our own KafkaJS fork, and then collaborated closely with the amazing KafkaJS maintainer Tommy Brunn to merge the Pull Requests (1, 2, 3, 4) into the official repository.

The last piece of the puzzle was to make our services themselves “rack-aware”. For MSK, this means setting the client's rack ID (the rackId option in KafkaJS, client.rack in the Java client) to the Availability Zone ID. We consider this an operational problem: the services shouldn’t know how to do that. Instead, the environment that deploys the services provides the availability zone to the service through environment variables.
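On the service side, this can be as small as the following sketch. The environment variable name KAFKA_RACK_ID, the helper buildConsumerOptions, and the sample zone ID are assumptions made for this example; rackId is the consumer option that KafkaJS uses for rack awareness. The KafkaJS call itself is only shown in a comment so that the snippet runs without the library installed.

```typescript
// The subset of KafkaJS consumer options this sketch cares about.
interface ConsumerOptions {
  groupId: string;
  rackId?: string;
}

// Build consumer options from the environment provided by the deployment.
// KAFKA_RACK_ID is a hypothetical variable name, not a KafkaJS convention.
function buildConsumerOptions(
  groupId: string,
  env: Record<string, string | undefined>
): ConsumerOptions {
  const rackId = env["KAFKA_RACK_ID"]?.trim();
  // Without a rack ID, omit the option so the consumer keeps fetching from the leader.
  return rackId ? { groupId, rackId } : { groupId };
}

// Usage with KafkaJS (not executed here):
//   const consumer = kafka.consumer(buildConsumerOptions("my-service", process.env));
console.log(buildConsumerOptions("my-service", { KAFKA_RACK_ID: "euw1-az1" }));
```

The service only reads a value; how that value is determined stays with the deployment environment.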

For our Kubernetes (EKS) infrastructure we wrote another service that watches for new pods. For each new pod, the service adds annotations for the availability zone name and ID based on the annotations of the node the pod got scheduled onto. These annotations are then referenced in the pod template.
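The core of such a watcher is a mapping from node metadata to pod annotations, roughly as in this sketch. All annotation keys and the function name are hypothetical placeholders (the real keys depend on the cluster setup), and the Kubernetes API calls for watching pods and patching them are left out.

```typescript
type Metadata = Readonly<Record<string, string | undefined>>;

// Hypothetical annotation keys, chosen for this example only.
const NODE_ZONE_NAME_KEY = "example.com/node-zone-name";
const NODE_ZONE_ID_KEY = "example.com/node-zone-id";
const POD_ZONE_NAME_KEY = "example.com/zone-name";
const POD_ZONE_ID_KEY = "example.com/zone-id";

// Derive the annotations to add to a pod from the node it was scheduled onto.
function zoneAnnotationsForPod(nodeAnnotations: Metadata): Record<string, string> {
  const zoneName = nodeAnnotations[NODE_ZONE_NAME_KEY];
  const zoneId = nodeAnnotations[NODE_ZONE_ID_KEY];
  if (zoneName === undefined || zoneId === undefined) {
    throw new Error("node is missing availability zone annotations");
  }
  return {
    [POD_ZONE_NAME_KEY]: zoneName,
    [POD_ZONE_ID_KEY]: zoneId,
  };
}

// Sample node metadata with made-up values.
console.log(
  zoneAnnotationsForPod({
    [NODE_ZONE_NAME_KEY]: "eu-west-1a",
    [NODE_ZONE_ID_KEY]: "euw1-az1",
  })
);
```

A pod template can then expose these annotations to the container as environment variables, which is where the service picks up its rack ID.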

Slashing the AWS bill

Let’s come to the most important question: Did it help?

The following graph shows the dramatic impact when we rolled out rack awareness to our environments between July 14 and July 15: a whopping 90% drop in Data Transfer between availability zones!

[Figure: AWS transfer costs]

A big thank you to all the people who helped us along the journey. Without your experience, critical feedback, and support, this wouldn’t have been possible!

Happy coding!

Ronny Roeller
NEXT Engineering

CTO at nextapp.co # Product discovery platform for high performing teams that bring their customers into every decision