How We Cut 40% in NAT Gateway Costs

Eran Levy
Published in Zesty Engineering
Jul 28, 2024

Managing costs is a critical part of running efficient cloud infrastructure. At Zesty we continuously monitor our cloud costs and usage, and we noticed a surge in our NAT Gateway costs. After analyzing it, we managed to reduce those costs by more than 40%. We are still working on reducing inbound traffic as well, which should lower our NAT Gateway costs significantly, but we found the results so far interesting enough to share. Here is a detailed walkthrough of our approach and findings for cutting our outbound traffic.

I would like to thank Mark Serzde & Aviram Alter for their participation and contribution to this achievement!

Initial Cost and Usage Analysis

We began by analyzing our NAT Gateways’ total cost and usage. At Zesty, we built a lakehouse that hosts all our data, and we are also our own customers, so our AWS Cost and Usage Report (CUR) data is ingested into our platform.

We queried our lakehouse to identify the costly NAT Gateways. Here is the query we used:

select line_item_resource_id, line_item_usage_account_id,       
sum("line_item_net_unblended_cost") as "Net Unblended Cost"
from "zesty_bronze"
where line_item_usage_type like '%NatGateway-Bytes%'
and year='2024'
and month='5'
and org_id = '3863721556980'
group by 1,2
order by 3 desc;

The query results revealed the top NAT Gateway IDs:

NAT Gateway IDs

We confirmed that they all reside in our AWS production account. If you use AWS “Cost Explorer”, you can get the same view by filtering “Usage Type” to all types containing “NatGateway-Bytes” and grouping by “Linked account”.

Next, we needed to determine whether the traffic was primarily “Internet” traffic or traffic to other AWS services. The easiest way to check is to filter the “EC2-Instances” service by “Usage Type”, selecting all “DataTransfer-In-Bytes” and “DataTransfer-Out-Bytes” types.

The result indicated that our total “Data Transfer” In/Out bytes were significantly lower than our total NAT Gateway usage in GB, as shown in the next screenshot:

Since internet-bound traffic would show up under those data transfer usage types, this indicated that “Internet” traffic was not the primary cost driver.
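To make the comparison concrete, here is a minimal Python sketch of the arithmetic. The figures are made-up placeholders, not our real numbers, and the function name is hypothetical. The idea is simply that if internet data transfer is only a small fraction of what the NAT Gateways processed, most of the NAT traffic must be going somewhere else.

```python
def internet_share(transfer_in_gb: float, transfer_out_gb: float,
                   nat_processed_gb: float) -> float:
    """Upper bound on the fraction of NAT Gateway traffic that could be internet traffic."""
    if nat_processed_gb <= 0:
        raise ValueError("nat_processed_gb must be positive")
    # Cap at 1.0: internet transfer may also include traffic that bypasses the NAT Gateway.
    return min(1.0, (transfer_in_gb + transfer_out_gb) / nat_processed_gb)


# Hypothetical monthly figures, for illustration only.
share = internet_share(transfer_in_gb=300.0, transfer_out_gb=200.0,
                       nat_processed_gb=5000.0)
print(f"At most {share:.0%} of NAT Gateway traffic could be internet traffic")
```

A low share like this is what points the investigation toward traffic between AWS services rather than the public internet.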

Enabling and Using VPC Flow Logs

To dive deeper into the traffic patterns, we enabled VPC Flow Logs. This step involved configuring VPC Flow Logs to capture detailed network traffic information, including fields like `pkt-src-aws-service`. For more details, refer to the AWS VPC Flow Logs documentation.
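As a rough sketch of what such a configuration could look like, the Python snippet below builds the arguments for the EC2 `CreateFlowLogs` API with a custom log format. The field list, the VPC ID, and the bucket ARN are assumptions for illustration, not our actual setup; the field names themselves are documented VPC Flow Logs fields.

```python
# Fields to record for each flow. Which ones to include is an assumption
# made for this kind of analysis; adjust to your own needs.
FLOW_LOG_FIELDS = [
    "interface-id", "srcaddr", "dstaddr", "pkt-srcaddr", "pkt-dstaddr",
    "pkt-src-aws-service", "pkt-dst-aws-service", "flow-direction",
    "bytes", "start", "end",
]


def build_flow_log_request(vpc_id: str, bucket_arn: str) -> dict:
    """Build keyword arguments for the EC2 CreateFlowLogs API."""
    # The custom format is a space-separated list of ${field} placeholders.
    log_format = " ".join("${" + field + "}" for field in FLOW_LOG_FIELDS)
    return {
        "ResourceIds": [vpc_id],
        "ResourceType": "VPC",
        "TrafficType": "ALL",
        "LogDestinationType": "s3",
        "LogDestination": bucket_arn,
        "LogFormat": log_format,
    }


# Placeholder identifiers, not real resources.
request = build_flow_log_request(
    "vpc-0123456789abcdef0", "arn:aws:s3:::example-flow-logs-bucket"
)
print(request["LogFormat"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("ec2").create_flow_logs(**request)`. Delivering to S3 is one option that makes the logs queryable from Athena.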

Using Athena, we queried the VPC Flow Logs table to understand the traffic through the top NAT Gateways.

The following query provided a high-level view of the traffic:

select interface_id, sum(bytes) as total_bytes,
       sum(bytes) / power(1024, 3) as gb_usage -- power() returns a double, avoiding integer division
from vpcflowlogs_tbl
group by 1 order by 3 desc;

We manually cross-checked the top “ENI” IDs with the relevant NAT Gateways and analyzed the traffic for specific ENIs:

select interface_id, sum(bytes) as total_bytes
from vpcflowlogs_tbl
where day='28' AND month='05' AND interface_id='eni-xyz'
group by 1;
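The cross-check between ENI IDs and NAT Gateways does not have to be manual. The sketch below is one possible way to do it, not the process we followed: it extracts the ENI IDs from a response shaped like the output of the EC2 `DescribeNatGateways` API. The function name and the sample response are hypothetical.

```python
def nat_gateway_enis(response: dict) -> dict:
    """Map each NAT Gateway ID to the ENI IDs listed in a DescribeNatGateways response."""
    mapping = {}
    for gateway in response.get("NatGateways", []):
        mapping[gateway["NatGatewayId"]] = [
            address["NetworkInterfaceId"]
            for address in gateway.get("NatGatewayAddresses", [])
            if "NetworkInterfaceId" in address
        ]
    return mapping


# Trimmed, made-up response in the shape returned by describe_nat_gateways.
sample_response = {
    "NatGateways": [
        {
            "NatGatewayId": "nat-0aaa1111bbbb2222c",
            "NatGatewayAddresses": [
                {"NetworkInterfaceId": "eni-0123456789abcdef0",
                 "PrivateIp": "10.1.102.25"}
            ],
        }
    ]
}
eni_by_gateway = nat_gateway_enis(sample_response)
print(eni_by_gateway)
```

In a real account the response would come from `boto3.client("ec2").describe_nat_gateways()`, and the resulting ENI IDs can be used as the `interface_id` filter in the queries above.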

Drilling Down Further

For deeper analysis, we focused on traffic leaving the NAT Gateway’s subnet, that is, flows whose source address is inside the subnet’s configured range and whose destination is outside it:

select dstaddr, sum(bytes) as total_bytes
from vpcflowlogs_tbl
where day='28' AND month='05' AND interface_id='eni-xyz'
  AND srcaddr like '10.1.102.%' AND dstaddr not like '10.1.102.%'
group by 1
order by 2 desc; -- largest destinations first
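Destination IP addresses alone can be hard to attribute. If the flow log format includes `pkt-dst-aws-service`, grouping by that field gives a per-service view. The sketch below runs such a query against an in-memory SQLite table standing in for the Athena table. The column name `pkt_dst_aws_service` and the sample rows are assumptions for illustration, not our real schema or data.

```python
import sqlite3

# In-memory stand-in for the Athena table; rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table vpcflowlogs_tbl "
    "(interface_id text, srcaddr text, dstaddr text, "
    "pkt_dst_aws_service text, bytes integer)"
)
conn.executemany(
    "insert into vpcflowlogs_tbl values (?, ?, ?, ?, ?)",
    [
        ("eni-xyz", "10.1.102.25", "52.216.0.10", "S3", 700),
        ("eni-xyz", "10.1.102.25", "52.216.0.11", "S3", 300),
        ("eni-xyz", "10.1.102.25", "52.94.0.20", "DYNAMODB", 400),
        ("eni-xyz", "10.1.102.25", "203.0.113.7", "-", 100),
    ],
)

# Total bytes per destination AWS service for one ENI, largest first.
query = """
select pkt_dst_aws_service, sum(bytes) as total_bytes
from vpcflowlogs_tbl
where interface_id = 'eni-xyz'
group by 1
order by 2 desc
"""
rows = conn.execute(query).fetchall()
for service, total_bytes in rows:
    print(service, total_bytes)
```

Services that dominate such a result are natural candidates for a closer look at how traffic reaches them.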

Findings and Actions

The analysis revealed significant data transfers between services in different VPCs, as well as misconfigured VPC endpoints for AWS services such as S3 and DynamoDB. After correcting these configurations and updating our routing tables, we observed a substantial reduction in NAT Gateway usage.
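As an illustration only, and not the exact change described above, one common way to keep S3 and DynamoDB traffic off a NAT Gateway is a gateway VPC endpoint attached to the relevant route tables. The sketch below builds arguments for the EC2 `CreateVpcEndpoint` API. The function name, VPC ID, region, and route table ID are all hypothetical.

```python
def build_gateway_endpoint_requests(vpc_id, region, route_table_ids,
                                    services=("s3", "dynamodb")):
    """Build CreateVpcEndpoint keyword arguments, one per gateway-type service."""
    return [
        {
            "VpcEndpointType": "Gateway",
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{service}",
            # Route tables that should send this service's traffic via the endpoint.
            "RouteTableIds": list(route_table_ids),
        }
        for service in services
    ]


# Placeholder identifiers, not real resources.
endpoint_requests = build_gateway_endpoint_requests(
    "vpc-0123456789abcdef0", "us-east-1", ["rtb-0123456789abcdef0"]
)
for endpoint_request in endpoint_requests:
    print(endpoint_request["ServiceName"])
```

Each dictionary could be passed as `boto3.client("ec2").create_vpc_endpoint(**endpoint_request)`. Traffic from subnets whose route tables are listed then reaches the service through the endpoint instead of the NAT Gateway.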

Results

The impact of these changes is clearly visible in the following graph, showing a dramatic drop in transfer rates:
