The hidden cross AZ cost: how we reduced AWS Data Transfer cost by 80%

Devora Roth Goldshmidt
Oct 26, 2023 · 9 min read


As expected, AWS services are not free. And we, as a B2B SaaS company, are not free either, so we definitely try to minimize infrastructure cost as much as possible to increase the gross profit margin.

When I started my journey as a Cloud Software Architect a few years ago, I heard many stories about huge unnecessary instances kept up and running for days, or about leveraging Spot Instances for savings (which I'm not sure are still that economical these days, but that's a subject for another article). Everything sounded pretty straightforward: mainly track resource consumption and automate the shutdown of unused resources.

But nothing prepared me for the surprise in our AWS production billing report: the most expensive item, the one that takes 25% of the cost, is cross-AZ data transfer! Not the compute of running a 100-node EKS cluster, not the various databases and managed services (some of them pretty "managed and expensive", like AppStream, DynamoDB, and Elasticsearch), not our MSK Kafka cluster, but the amount of data transferred between AZs within the region…

After talking to our AWS account manager, our assigned solutions architect, and an AWS billing expert, I realized we are not alone. Many companies suffer from this hidden cost, whether by not estimating it correctly or by struggling to investigate and optimize it. I can think of several reasons why it is so common:

  • Highly available distributed systems — we are all encouraged and educated to use multiple AZs to increase service availability. Multiply that by the number of microservices talking to each other, the databases, the Kafka topics, and all sorts of data that must be replicated, and cross-AZ traffic grows. It doesn't mean that every hop or API call between microservices has to cross an AZ boundary, but it increases the probability, especially if you are not aware of it.
  • Hard debugging — even with the right tooling (like VPC Flow Logs, which is indeed very useful), it's pretty complicated to figure out where the traffic is coming from and where it's going. This becomes even more challenging when running EKS (Kubernetes) across multiple availability zones, with all its internal components, including the control and data planes.
  • Bi-directional pricing — without debating whether the AWS documentation is misleading, when you read "cross-AZ data transfer within a region costs $0.01/GB", don't be confused: effectively it's $0.02 per GB, since each transferred gigabyte is counted as 2 GB on the bill, once for sending and once for receiving. If you have 2 EC2 instances in different AZs and 1 GB is transferred between them, you'll pay $0.02 under two different operations in Cost & Usage terms: InterZone-In and InterZone-Out. On the other hand, it also means that once you optimize away x transferred GBs, you get back 2x cents :)

We got the challenge. Now, what can we do about it?

The first step is obvious but may be misleading: monitor the different components across AZs, but focus on the right resources and the right metrics.
Luckily, AWS Cost & Usage Reports support resource-level granularity, so I could easily aggregate the amount of data transferred out of or into a specific instance per hour. This way I noticed that there were actually 7 problematic instances, all of them EKS (Kubernetes) nodes, responsible for 70% of the traffic, while all the other 73 nodes accounted for only 30%. It was also pretty clear that this wasn't a general Kubernetes issue, but rather something about the specific pods running there… This helped me focus on the right resources.

A few resources are responsible for most of the traffic
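If you want to slice it the same way, here is a minimal sketch of that aggregation using the Cost Explorer API via boto3 (an alternative to querying the Cost & Usage Report directly); the dates, the operation values, and the top-10 cut are illustrative and may need adjusting for your account:

```python
# Rough sketch: sum hourly inter-AZ usage per resource and print the top talkers.
# Assumes the InterZone-In / InterZone-Out operations mentioned above; exact
# operation names and available granularity may differ per account/region.
from collections import defaultdict

import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage_with_resources(
    TimePeriod={"Start": "2023-10-25", "End": "2023-10-26"},
    Granularity="HOURLY",
    Metrics=["UsageQuantity"],  # GB transferred
    Filter={
        "Dimensions": {
            "Key": "OPERATION",
            "Values": ["InterZone-In", "InterZone-Out"],
        }
    },
    GroupBy=[{"Type": "DIMENSION", "Key": "RESOURCE_ID"}],
)

# Aggregate across all hours to see which instances dominate the transfer.
usage_per_resource = defaultdict(float)
for result in response["ResultsByTime"]:
    for group in result["Groups"]:
        resource_id = group["Keys"][0]
        usage_per_resource[resource_id] += float(group["Metrics"]["UsageQuantity"]["Amount"])

for resource_id, gb in sorted(usage_per_resource.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{resource_id}: {gb:.1f} GB")
```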

When plotting the kubernetes.network.tx_bytes metric for every pod located on any of those nodes, it was clear that we had two interesting leads: the blue line, which consistently transmitted 100 MiB/s, and the purple one, with many spikes, which transmitted 20 MiB/s on average. Remember that traffic transmitted out of a pod doesn't necessarily leave the node (and cross-node traffic doesn't necessarily cross AZs); it may be pod-to-pod communication within the node.
For that, I added the aws.ec2.network_out metric, which is at the node level, and verified that summing up those pods' traffic aligned with the traffic that left the 7 nodes mentioned above. So it looked like I was focusing on the right metrics, which gave me an idea of which services to blame…

Istio certificate refresh issue

The blue line represented the Istio pods. For those less familiar with it, Istio is a service mesh that can be used for traffic management, observability, and security. In our case, it's used mainly for role-based access, authentication, and authorization across Kubernetes services. The interesting point was that the continuous traffic was coming from the istiod pods, which are part of the control plane and are mainly used for certificate management. So why did they keep working so hard, even when there were no new deployments in the cluster and nothing had really changed?

Digging into the Istio logs, it looked like istiod (Pilot) kept triggering full pushes of certificates to all pods, even though they hadn't changed at all! Notice the full=true at the end of each log message. With an average certificate size of 100–300 KB multiplied by ~900 pods (roughly 90–270 MB per full push), you can imagine how much traffic was going out of istiod…

2023-09-27T11:51:33.844081Z info ads Push debounce stable[59230] 3 for config ServiceEntry/prod-0-flink-jobs/prod-0-594133-p-con-flink-mgmt-s1-196a81c4-e-rest.prod-0-flink-jobs.svc.cluster.local and 1 more configs: 100.052469ms since last change, 102.888669ms since last push, full=true
2023-09-27T11:53:04.862282Z info ads Push debounce stable[59250] 3 for config ServiceEntry/prod-0-flink-jobs/prod-0-594133-p-con-flink-mgmt-s1-6c2fac16-8-rest.prod-0-flink-jobs.svc.cluster.local and 1 more configs: 111.327819ms since last change, 111.332977ms since last push, full=true
2023-09-27T11:53:49.735887Z info ads Push debounce stable[59267] 3 for config ServiceEntry/prod-0-flink-jobs/prod-0-594133-p-con-flink-mgmt-s1-10509001-3.prod-0-flink-jobs.svc.cluster.local and 1 more configs: 104.05515ms since last change, 104.06017ms since last push, full=true

When trying to understand where the full pushes were coming from, we noticed that we had a single Kubernetes deployment in a CrashLoopBackOff state, and together with an Istio dead-loop bug, it caused a constant certificate refresh for all pods in the cluster…
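For anyone who wants a quick way to catch such workloads, here is a minimal sketch using the official Kubernetes Python client (in practice you would feed this into an alert rather than print it):

```python
# Minimal sketch: flag containers stuck in CrashLoopBackOff across all namespaces.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    for status in pod.status.container_statuses or []:
        waiting = status.state.waiting
        if waiting and waiting.reason == "CrashLoopBackOff":
            print(
                f"{pod.metadata.namespace}/{pod.metadata.name} "
                f"container={status.name} restarts={status.restart_count}"
            )
```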

Two important takeaways:

  • Keep monitoring your cluster health — don't ignore any deployment in an unhealthy state. You never know how costly it may be…
  • Debugging cost issues is a great way to discover functional bugs! If something costs too much, it may simply be implemented wrong.

Once we fixed the issue by removing the CrashLoopBackOff deployment, the daily cross-AZ data transfer dropped by 95%, which cut our daily AWS cost by 23%. Crazy.

Data Shuffling between stream operators

But the story didn't end there… After a few more days, I realized the cost had risen again. Not to the original high value, but still:

The same production traffic and data volumes, the same services, what’s going on?

Then I went back to the original kubernetes.network.tx_bytes metric graph and noticed that the purple line had not dropped, meaning the service kept transmitting 20 MiB/s. But I needed to answer two questions: (1) why was the production cost not stable if the service behaved like this during the entire period? (2) why did the service behave like this at all? 20 MiB/s is very high!

So the purple line represented a service whose job is to consume all the data on our event bus, meaning every topic, and persist it as Parquet files stored in S3. When this service was initially designed, the decision was that S3 objects would be created at the level of tenant, event type, and time bucket. For example, a file per [tenant-123]-[event-type-A]-[26.10.2023-10:50] contains all events of type A for tenant-123 consumed during that minute. The service is implemented, like all our stream processing, as a Flink job with multiple operators. The first operators consume from Kafka and decrypt the messages, which are then sent to a keyBy operator that logically partitions the stream into disjoint partitions, so that all records with the same key are assigned to the same partition and end up in the same file.

But here is the point: this operator is pretty expensive! The fact that all data flowing through Kafka has to be regrouped logically, crossing the boundaries of the Kafka partitioning strategy, can create huge data re-shuffling between the operator instances!
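To make the shuffle concrete, here is a PyFlink-flavored sketch of the original topology (our actual job is more involved, and the field names here are hypothetical): keying by tenant, event type, and minute forces every record to be routed to whichever subtask owns that key, i.e. a network shuffle across task managers:

```python
# PyFlink-flavored sketch of the expensive step: key_by redistributes records
# across all parallel subtasks so that each (tenant, type, minute) combination
# lands on exactly one subtask. Field names and values are hypothetical.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# In the real job this source is Kafka plus decryption; a small collection
# keeps the sketch self-contained.
events = env.from_collection([
    {"tenant": "tenant-123", "type": "event-type-A", "minute": "26.10.2023-10:50", "payload": "..."},
    {"tenant": "tenant-456", "type": "event-type-B", "minute": "26.10.2023-10:50", "payload": "..."},
])

# The shuffle: records with the same key must end up on the same subtask,
# regardless of which Kafka partition (and which subtask) they arrived on.
keyed = events.key_by(lambda e: f"{e['tenant']}|{e['type']}|{e['minute']}")

keyed.print()
env.execute("shuffle-sketch")
```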

Just to reflect the current behavior visually: if we take the kubernetes.network.tx_bytes of one consumer (actually of the Flink task manager that executes this operator) and overlay the kubernetes.network.rx_bytes of a second consumer, the two lines are almost identical, which means that almost all the data is shuffled between the two!

Graph representing the data shuffling

So what can we do about it? Very simple: we could decide to create a file per [tenant-123]-[event-type-A]-[26.10.2023-10:50]-[consumer-id]. And no client really cares, since they list the objects under an S3 bucket path anyway and know how to handle multiple files. Once we add the consumer level to the key, the data grouping becomes internal to each operator instance and doesn't require physical data shuffling.
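Sketched in the same PyFlink-flavored style as above (class and field names are again hypothetical), the change boils down to dropping the cluster-wide keyBy and letting each parallel subtask stamp its own index into the object key, so the grouping stays local:

```python
# Sketch of the fix: no key_by. Each subtask groups the records it already
# consumed and writes objects whose key ends with its own subtask index
# (the [consumer-id] part). Names are hypothetical.
from pyflink.datastream.functions import MapFunction, RuntimeContext


class ToLocalObjectKey(MapFunction):
    def open(self, runtime_context: RuntimeContext):
        # Used as the [consumer-id] suffix, so grouping never crosses subtasks.
        self.subtask_id = runtime_context.get_index_of_this_subtask()

    def map(self, event):
        object_key = f"{event['tenant']}-{event['type']}-{event['minute']}-{self.subtask_id}"
        return object_key, event["payload"]


# events.map(ToLocalObjectKey())  # ...then batch per object_key locally and write to S3
```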

Coming back to the first question: why was the production cost not stable? Now we can understand it: because this is traffic between two pods, we are very dependent on where Kubernetes schedules them. If by chance they are co-located on the same node, or on different nodes in the same AZ, there is no cost. If they land on different nodes in different AZs, we'll pay for about 1 TB of data transfer daily…

When I added additional Kubernetes metric tags (host and AZ), I realized that this is exactly what had happened in production: the cost was directly driven by where those pods were placed.
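If you don't have those tags in your metrics pipeline, the same pod-to-AZ mapping can be pulled ad hoc from the Kubernetes API via the standard topology.kubernetes.io/zone node label; a minimal sketch (the namespace and label selector are placeholders):

```python
# Sketch: map each pod of a service to the AZ of the node it is scheduled on.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Node name -> AZ, taken from the well-known topology label.
zone_of_node = {
    node.metadata.name: node.metadata.labels.get("topology.kubernetes.io/zone", "unknown")
    for node in v1.list_node().items
}

pods = v1.list_namespaced_pod("my-namespace", label_selector="app=my-service")  # placeholders
for pod in pods.items:
    node = pod.spec.node_name
    print(f"{pod.metadata.name} -> node={node} az={zone_of_node.get(node)}")
```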

Two important takeaways:

  • Even if you think the issue is fixed, keep monitoring your cost! It may change even with the same code base and the same data volumes…
  • Any simple design decision can dramatically impact the cost! It's obvious when you choose a database technology, but even something like the exact granularity of the S3 objects you store (which no one really cares about, it's just a decision) may impact the cost significantly…

Rack-aware Kafka consumers

When it comes to Kafka consumers, one of the features introduced in Apache Kafka 2.4 is allowing consumers to fetch from the closest replica, which can definitely reduce cross-AZ traffic in a multi-AZ Kafka cluster. The idea is very simple: if the consumer knows which rack (AZ) it is located in, the broker can let it consume from an in-sync replica of the partition located in the same AZ, and not necessarily from the partition leader.

As long as replicas #1 and #2 are kept in sync, each consumer consumes the data from its closest replica.

We have not implemented this change yet, but by looking at the aws.kafka.bytes_out_per_sec metric, which represents the number of bytes sent to Kafka consumers, we could reduce the daily data transfer by an additional ~30 GB. That sounds pretty low, but it grows proportionally with the input data volumes.
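For reference, the consumer-side part of that change is just one configuration property; here is a sketch with confluent-kafka-python (any client built on librdkafka, or the Java client, supports it). The broker side additionally needs replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector, which on MSK goes into the cluster configuration. Broker, group, and topic names below are placeholders:

```python
# Sketch: a rack-aware consumer. Setting client.rack to this consumer's AZ ID
# lets the broker serve fetches from the in-sync replica in the same AZ
# (Kafka >= 2.4 with RackAwareReplicaSelector enabled on the brokers).
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "b-1.my-msk-cluster:9092",  # placeholder
    "group.id": "events-persister",                  # placeholder
    "client.rack": "use1-az1",                       # the AZ ID this consumer runs in
    "auto.offset.reset": "earliest",
})

consumer.subscribe(["my-topic"])  # placeholder
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        raise RuntimeError(msg.error())
    # ... process msg.value() ...
```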

Conclusions (for now :))

  • Tracking cross-AZ data transfer cost may be pretty challenging, but if you go with a data-driven approach based on the "correct data" (in this case, mainly metrics), you'll probably find a lot of room for optimization.
  • As mentioned earlier, investigating cost issues is not only about reducing your expenses. It's also a great way to find bugs, which I'm sure everyone has.
  • And finally, the main idea here is not to eliminate cross-AZ data transfer, but to reduce the hidden part of it: knowing where it comes from, whether it makes sense, and whether you are willing to pay for it…


Devora Roth Goldshmidt

Senior Software Architect (SaaS), passionate about solving complex system problems and continuously considering trade-offs!