A true story of a deep dive into NAT Gateway cost analysis

Ben Hoffman
4 min read · Feb 16, 2023


Introduction

Our SaaS application is deployed on AWS in several regions. It is built from micro-services running on Kubernetes (Amazon EKS). In addition to many AWS services, we also use managed services like Confluent Kafka and Snowflake.

The micro-services are located in private subnets and connect to the Internet through a NAT Gateway.

We have a lot of internal communication: between pods, and to S3 via a VPC endpoint. We also have external communication with some of the managed services mentioned above and with third parties we integrate with.

This is a high-level diagram of the application (only the parts relevant to this story):

What is NAT Gateway?

NAT Gateway is a highly available AWS-managed service that makes it easy to connect to the Internet from instances within a private subnet in an Amazon Virtual Private Cloud (Amazon VPC).

Taken from: What is NAT Gateway

What is the cost model of AWS NAT Gateway?

When using a NAT Gateway, you pay an hourly charge for each gateway and a charge per GB of data processed through it.

https://aws.amazon.com/vpc/pricing/
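To put the pricing model into numbers, here is a rough back-of-the-envelope sketch. The rates are the us-east-1 figures at the time of writing (check the pricing page for your region), and the gateway count and traffic volume are made-up inputs, not our actual numbers.

```python
# Rough NAT Gateway cost estimate (illustrative rates; verify against
# https://aws.amazon.com/vpc/pricing/ for your region).
HOURLY_RATE_USD = 0.045   # per NAT Gateway per hour (us-east-1, assumed)
PER_GB_RATE_USD = 0.045   # per GB of data processed (us-east-1, assumed)

def nat_gateway_monthly_cost(gateways: int, gb_processed: float, hours: int = 730) -> float:
    """Return an approximate monthly NAT Gateway bill in USD."""
    hourly = gateways * hours * HOURLY_RATE_USD
    data = gb_processed * PER_GB_RATE_USD
    return hourly + data

# Example: 3 gateways (one per AZ) pushing 50 TB a month through NAT.
print(nat_gateway_monthly_cost(gateways=3, gb_processed=50_000))  # ~2,348.55 USD
```

Even at these modest rates, the per-GB component dominates quickly, which is why chatty traffic through the NAT Gateway shows up so clearly on the bill.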

What was the issue we identified?

We monitor our cost and usage very closely. At some stage, we observed that the billing for NAT Gateway traffic increased and reached thousands of dollars per month. After this increase, the bill stayed stable for a couple of months.

Even without a deep analysis, the cost was clearly not proportional to the bill of the other AWS services we use. With all of our computing on AWS, it didn’t make sense that the NAT Gateway bill was about 50% of the compute bill.

How did we analyze the traffic in our account?

The first thing that needs to be done is to enable VPC Flow Logs in the account.
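For reference, flow logs can be enabled with a few lines of boto3. This is a minimal sketch, not our exact setup; the VPC id and S3 bucket are placeholders.

```python
# Minimal sketch: enable VPC Flow Logs delivered to S3, so they can later be
# queried with Athena. The VPC id and bucket below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],               # your VPC id
    ResourceType="VPC",
    TrafficType="ALL",                                    # capture accepted and rejected traffic
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-flow-logs-bucket",    # placeholder bucket
    MaxAggregationInterval=60,                            # 1-minute aggregation for finer analysis
)
print(response["FlowLogIds"])
```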

We used this great article from AWS to analyze the traffic: Analyze NAT Gateway Traffic.

Since we really love Athena, we chose that option to visualize the data that was collected.

The outcome of this exercise was a very long list of external IPs and the amount of traffic sent to each of them. It was very clear to us which were the “heavy” IPs, but we encountered an interesting challenge.
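A query along these lines surfaces the heaviest destination IPs. It is a hedged sketch: the table and column names assume the schema from the AWS article (a vpc_flow_logs table with srcaddr, dstaddr and bytes fields) and a private 10.x address space; adjust both to your own setup.

```python
# Hedged sketch: rank external destination IPs by traffic volume using Athena.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT dstaddr,
       SUM(bytes) / 1024.0 / 1024.0 / 1024.0 AS total_gb
FROM vpc_flow_logs
WHERE srcaddr LIKE '10.%'          -- traffic originating from our private subnets (assumption)
GROUP BY dstaddr
ORDER BY total_gb DESC
LIMIT 50;
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},                      # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
```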

How do we map IP to service?

When using managed services (SaaS), you decide how to use them: you can choose the cloud provider and the region where they will be deployed.

This introduces a challenge: when we looked up the “heavy” IPs, all we got was the unhelpful fact that they belong to AWS. But which service is actually sitting behind them?
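The published AWS IP ranges can confirm the "it belongs to AWS" part (region and service such as EC2), but they cannot tell you which SaaS vendor runs on top of that address. A minimal lookup sketch, with a placeholder IP:

```python
# Sketch: check whether an IP falls inside AWS's published ranges.
# This only answers "is it AWS, and in which region/service" - it cannot
# identify the SaaS vendor behind the address.
import ipaddress
import json
import urllib.request

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

def lookup_aws_ranges(ip: str):
    with urllib.request.urlopen(AWS_RANGES_URL) as resp:
        prefixes = json.load(resp)["prefixes"]
    addr = ipaddress.ip_address(ip)
    return [
        (p["ip_prefix"], p["region"], p["service"])
        for p in prefixes
        if addr in ipaddress.ip_network(p["ip_prefix"])
    ]

print(lookup_aws_ranges("3.124.0.10"))   # random example IP, not one of ours
```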

The way to move forward was to open support tickets with AWS and with our other providers (we focused on Confluent). We quickly got helpful confirmation that the IPs belong to our Kafka clusters.

How did we nail the issue?

We thought that all our microservices that communicate with Kafka did so in the same way, but eventually we understood that this assumption was incorrect. In order to progress fast, we took one of the test environments and started disabling services that were candidates for causing the issue, focusing on the services that communicate with Kafka.

We disabled a few services and looked at the differences in the traffic in our Athena table. Very quickly, we identified the problematic service.
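In practice, that elimination loop can be as simple as the sketch below. It assumes the candidate services run as Kubernetes Deployments and that kubectl has access to the test cluster; the service names and namespace are hypothetical, not ours.

```python
# Hedged sketch of the elimination loop: scale one Kafka-consuming service to
# zero, wait for flow-log data to accumulate, then compare traffic in Athena.
import subprocess
import time

CANDIDATE_SERVICES = ["service-a", "service-b", "service-c"]   # hypothetical names
NAMESPACE = "test-env"                                         # hypothetical namespace

for service in CANDIDATE_SERVICES:
    subprocess.run(
        ["kubectl", "-n", NAMESPACE, "scale", f"deployment/{service}", "--replicas=0"],
        check=True,
    )
    print(f"{service} disabled; re-run the Athena query later and compare traffic")
    time.sleep(30 * 60)   # let enough flow-log data accumulate before comparing
```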

Our next step was to understand precisely what this colossal traffic leaving the service was. To analyze it, we installed Wireshark on the service’s pods.

This is what we saw:

Endless calls from our pod to Kafka (port 9092).

500,000 packets/min.

Empty TCP PSH, ACK packets (with no application payload), thousands each second.

We understood, from the pattern and data in Wireshark, that we had misconfigured something that was causing our service to call its Kafka topics endlessly.
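For anyone who wants to reproduce this kind of capture, here is a minimal sketch. It assumes tcpdump is available inside the pod (for example via a debug container) with enough privileges to capture packets, and it writes a pcap file that can then be opened in Wireshark; 9092 is the default Kafka plaintext port.

```python
# Minimal sketch: capture Kafka traffic from inside a pod for inspection in
# Wireshark. Requires tcpdump in the container and capture privileges.
import subprocess

subprocess.run(
    [
        "tcpdump",
        "-i", "any",              # capture on all interfaces in the pod
        "-w", "/tmp/kafka.pcap",  # write a pcap that Wireshark can open
        "-c", "100000",           # stop after 100k packets
        "tcp", "port", "9092",    # only Kafka traffic
    ],
    check=True,
)
```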

What was the solution?

The solution is very technical and not important :). We will focus on what we learnt from this exercise.

It was a small, incorrect conversion in a YAML file that impacted our cost heavily.

Lessons learned:

  1. Proportions — when looking at cost, traffic, sizes etc… try to understand if the proportion is right. Usually, there is a good correlation between the compute sizes, cost, and traffic. Anomalies should be examined carefully.
  2. Focus and ownership — when owners are defined, missions will be completed. In order to drive results in areas inside the organization, a leader/owner should be defined. That leader will make sure the mission is progressing.
  3. Elimination to understand the problem — it is super important to have lower environments that can be played with. Eliminating services was one of the keys to solving our problem. The cost was that one test environment (out of many) wasn’t functional for a few days.
  4. Don’t make assumptions — when fighting challenging issues, assumptions will hold you back. Assumptions can lead you in the wrong direction. Try to start from a blank page and make fact-based progress.
  5. Cost should be discussed in features design — cost analysis is an integral part of planning. Understanding cost will assist us in selecting the right tools and technologies and will assist us in understanding how much a feature will cost.
  6. Cost alerting should be configured to detect cost anomalies and avoid unpleasant surprises at the end of the month (a minimal example follows this list).
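As a starting point for the last lesson, here is a minimal sketch of a billing alarm in CloudWatch. It is a plain threshold alarm rather than true anomaly detection (AWS Cost Anomaly Detection covers that); the threshold and SNS topic are placeholders, and the AWS/Billing metrics require billing alerts to be enabled and live in us-east-1.

```python
# Minimal sketch: alarm when the estimated monthly charges cross a threshold.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="monthly-bill-above-expected",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=6 * 60 * 60,            # billing metric updates a few times a day
    EvaluationPeriods=1,
    Threshold=10_000,              # placeholder: your expected monthly spend in USD
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder topic
)
```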
