Cutting AWS Inter-Zone Transfer Costs by 80%

Reeve Lau
4 min read · Jan 13, 2023


We run an e-commerce system on AWS with average annual infrastructure spending of over $150K. After one full year in production, we discovered significant spending on inter-zone network transfer. The unexpected cost turned up in AWS Cost Explorer during our annual cost review.

Although data transfer within the same Availability Zone is free, data transferred between two different Availability Zones is billed at $0.01/GB. In Cost Explorer we recorded roughly $1,000/month of inter-zone transfer charges, the equivalent of about 80 TB of data crossing AZ boundaries every month.

Our system consists of a web service, a cron job, a MySQL RDS instance, an ElastiCache (Redis) cluster and an Elasticsearch instance, distributed across 3 Availability Zones within the same AWS Region. With AWS VPC Flow Logs, we could quickly find the IP addresses responsible for the most bytes flowing in and out within the VPC, and from there we identified that most of the transfer came from the application's READ commands against the ElastiCache service.
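
The article does not show how the flow logs were actually queried, but one lightweight way to find the top talkers is to aggregate bytes per source/destination pair from an exported log file. Below is a minimal sketch, assuming the default version-2 flow-log format and a placeholder file name (not taken from our setup):

```python
# Minimal sketch: rank the chattiest (src, dst) pairs from exported VPC Flow Logs.
# Assumes the default version-2 record format, one record per line, already
# downloaded to a local file; field positions follow the AWS documentation.
from collections import Counter

FLOW_LOG_FILE = "flow-logs.txt"  # placeholder path, not from the article

bytes_per_pair = Counter()

with open(FLOW_LOG_FILE) as fh:
    for line in fh:
        fields = line.split()
        if len(fields) < 14 or fields[0] == "version":
            continue  # skip headers / malformed records
        srcaddr, dstaddr, byte_count = fields[3], fields[4], fields[9]
        if byte_count == "-":
            continue  # NODATA / SKIPDATA records carry no byte count
        bytes_per_pair[(srcaddr, dstaddr)] += int(byte_count)

# The top pairs point straight at the services generating the cross-AZ traffic.
for (src, dst), total in bytes_per_pair.most_common(10):
    print(f"{src} -> {dst}: {total / 1e9:.1f} GB")
```

Mapping the top IP addresses back to the ENIs of each service is what pointed us at the ElastiCache READ traffic.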

The original architecture had a few EC2 instances distributed evenly across the region's 3 AZs to provide resilience against zone-level failure. Applications running on the EC2 instances accessed ElastiCache via the primary endpoint, as shown in the picture below. A replica was set up to fulfil the redundancy requirement, but it sat idle most of the time while the system was running normally.

Architecture before the improvement

Besides the extra cost incurred, there was also a big problem when the e-commerce site reached peak customer volume: the single primary ElastiCache endpoint became the most significant source of delay in the system. Customer requests timed out because of the long queuing delay in establishing a connection with the primary endpoint.

Identified Problems

  1. Unreasonable inter-zone transfer cost, as high as $1,000/month, caused by Redis READ commands
  2. ElastiCache primary endpoint congestion

Solution

After some effort reviewing the application, we found that the Redis client library already supported read/write segregation; it just needed the correct settings. So we adjusted the application configuration and added an extra ElastiCache read replica so that every AZ has a Redis endpoint available. The new architecture is shown below.

  1. Adding an extra read replica in AZ C, which costs us $1,590/year
  2. The application writes to the primary Redis endpoint and reads from the Redis endpoint in its own Availability Zone. Depending on the AZ, that local endpoint may be a read replica or the primary node (see the sketch after the diagram below).
Architecture after the improvement
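
The article does not name the Redis client library or the application language, so the following is only a minimal sketch of the read/write split, written in Python with redis-py; the endpoint hostnames are placeholders, not our real endpoints. The essential point is that every READ goes to an endpoint inside the caller's own AZ, so those bytes never cross a zone boundary:

```python
# Minimal sketch of client-side read/write segregation with redis-py.
# Hostnames below are placeholders, not the endpoints from the article.
import redis

# All writes go to the cluster's primary endpoint (cross-AZ when necessary).
writer = redis.Redis(
    host="my-cache.xxxxxx.ng.0001.use1.cache.amazonaws.com",  # primary endpoint
    port=6379,
    socket_timeout=1.0,
)

# Reads go to an endpoint in the application's own AZ
# (a read replica, or the primary if the app happens to share its AZ).
reader = redis.Redis(
    host="my-cache-replica-az-a.xxxxxx.cache.amazonaws.com",  # same-AZ endpoint
    port=6379,
    socket_timeout=1.0,
)

def put_session(session_id: str, payload: str) -> None:
    # WRITE commands always hit the primary.
    writer.setex(f"session:{session_id}", 3600, payload)

def get_session(session_id: str) -> str | None:
    # READ commands stay inside the AZ, so no inter-zone transfer is billed.
    value = reader.get(f"session:{session_id}")
    return value.decode() if value is not None else None
```

In our deployment the "reader" address is not hard-coded per AZ in the application; it points at the local Envoy proxy described in the next section.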

The new design seemingly loses the redundancy of the old design: since there is no backup for the read replica within an AZ, that replica becomes a single point of failure for the application in the same zone. We covered the resilience requirement by employing a feature of Envoy proxy: load balancing priority levels.

Envoy proxy provides a fallback for replica failure via health checking

The Envoy proxy health-checks two Redis endpoints: the Redis endpoint located in the same zone and the Reader Endpoint of the ElastiCache cluster. The same-zone Redis endpoint takes priority as long as it passes the health check; otherwise Envoy fails the application-to-ElastiCache connection over to the Reader Endpoint. More on Envoy proxy will come in another story.
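
Our exact Envoy configuration is not reproduced here, but a cluster with two priority levels looks roughly like the sketch below (hostnames and timeouts are placeholders, and the listener that points at this cluster is omitted). Priority 0 is the same-zone Redis endpoint; Envoy only shifts traffic to the priority-1 Reader Endpoint when the priority-0 host fails its health checks:

```yaml
# Sketch of an Envoy cluster using priority levels for same-AZ-first reads.
# Hostnames are placeholders; real endpoints come from the ElastiCache console.
static_resources:
  clusters:
  - name: redis_read
    type: STRICT_DNS
    connect_timeout: 1s
    lb_policy: ROUND_ROBIN
    health_checks:
    - timeout: 1s
      interval: 5s
      unhealthy_threshold: 3
      healthy_threshold: 2
      tcp_health_check: {}        # connect-only check is enough to detect a dead node
    load_assignment:
      cluster_name: redis_read
      endpoints:
      - priority: 0               # preferred: the Redis node in this instance's AZ
        lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: my-cache-replica-az-a.xxxxxx.cache.amazonaws.com
                port_value: 6379
      - priority: 1               # fallback: the cluster-wide Reader Endpoint
        lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: my-cache-ro.xxxxxx.cache.amazonaws.com
                port_value: 6379
```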

Result

In AWS Cost Explorer, the daily spending on “EC2: Data Transfer — Inter AZ” dropped significantly from the first day we deployed the solution to production. As the graph below shows, the cost fell from about $25/day to around $3/day, an almost 88% reduction. Overall, we spend $1,590/year to save roughly $800/month, which works out to about an 80% net cost reduction annually.

The improvement was deployed on 10 Dec; AWS Cost Explorer shows the cost dropping by about 80% on the same day

Our APM tool also showed a 60% reduction in median response time.

Before

After

Conclusion

Redis, as a NoSQL data store, does solve some unique use cases. Fully utilizing its potential, however, takes effort across multiple layers: the application, the third-party client libraries and the hosting platform. Unlike a traditional RDBMS, which provides a rigid and well-understood performance model, users of a NoSQL system need much more effort to optimize the integration between components. Having a seasoned architect on the team to provide guidance and coordination brings sea-change effects.
