We run an e-commerce system on AWS, spending more than $150K a year on infrastructure on average. After one full year in production, we discovered significant spending on inter-zone network transfer. The unexpected cost surfaced in AWS Cost Explorer during our annual cost review.
Although data transfer within the same Availability Zone is free, data transfer between two different Availability Zones costs $0.01/GB. Cost Explorer showed about $1,000/month spent on inter-zone transfer. At $0.01/GB, that is equivalent to roughly 100 TB of data crossing AZs every month.
Our system consists of a web service, a cron job, a MySQL RDS instance, an ElastiCache instance, and an Elasticsearch instance. The components are distributed across 3 Availability Zones within the same AWS Region. With AWS VPC Flow Logs, we could quickly find the IP addresses of the services with the most bytes in and out of the VPC. We then identified that most of the data transfer came from READ commands sent by the application to the ElastiCache service.
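To illustrate the kind of analysis involved, here is a minimal sketch that sums bytes per source/destination pair from flow log records. It assumes the default (version 2) space-separated VPC Flow Logs format; the sample records, IP addresses, and function names are made up for illustration and are not from our real logs.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Logs record.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def bytes_by_pair(lines):
    """Sum the bytes field for every (srcaddr, dstaddr) pair."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip malformed lines
        record = dict(zip(FIELDS, parts))
        if not record["bytes"].isdigit():
            continue  # header rows and NODATA/SKIPDATA records carry no byte count
        totals[(record["srcaddr"], record["dstaddr"])] += int(record["bytes"])
    return totals


def top_talkers(lines, n=5):
    """Return the n address pairs that moved the most bytes."""
    return bytes_by_pair(lines).most_common(n)


# Made-up sample: an app server (10.0.1.10) reading from Redis (10.0.2.20:6379).
SAMPLE = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.10 10.0.2.20 49152 6379 6 100 500000 1600000000 1600000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.2.20 10.0.1.10 6379 49152 6 900 9000000 1600000000 1600000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.2.20 10.0.1.10 6379 49153 6 100 1000000 1600000060 1600000120 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.10 10.0.3.30 49200 3306 6 20 20000 1600000000 1600000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d - - - - - - - 1600000000 1600000060 - NODATA",
]

if __name__ == "__main__":
    for (src, dst), total in top_talkers(SAMPLE, 3):
        print(f"{src} -> {dst}: {total} bytes")
```

In this sample the heaviest pair is traffic from source port 6379, the default Redis port, back to the application, which is the same pattern that pointed us to the cache reads.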
The architecture was a few EC2 instances distributed evenly across the 3 AZs of the region to provide resilience to zone-level failure. The application running on the EC2 instances accessed ElastiCache via the primary endpoint, as shown in the picture below. A replica was set up to fulfil the redundancy requirement, but it was left idle most of the time while the system was running normally.
Besides the extra cost, there was also a big problem when the e-commerce site reached peak customer volume. The single primary ElastiCache endpoint became the most significant source of delay in the system. All customer requests timed out because of the huge queuing delay in establishing a connection with the primary endpoint.
Identified Problems
- Unreasonable inter-zone transfer cost, as high as $1,000/month, caused by Redis READ commands
- ElastiCache primary endpoint congestion
Solution
After some effort reviewing the application, we found that the Redis library did support read/write segregation. It just needed the correct settings. So we adjusted the application settings and added an extra ElastiCache read replica to ensure every AZ has a Redis endpoint available. The new architecture is shown below.
- Adding an extra read replica in AZ C, which cost us $1,590/year
- The application writes to the primary Redis endpoint and reads from the Redis endpoint in its own Availability Zone. Depending on the AZ, that endpoint may be a read replica or the primary node.
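The routing rule in the list above can be sketched roughly as follows. This is only an illustration of the idea, not our actual library settings: the host names, the AZ labels, and the small set of read commands are all hypothetical, and a real Redis client library would handle this through its own read/write splitting configuration.

```python
from dataclasses import dataclass

# Hypothetical node endpoints, one Redis node per Availability Zone.
PRIMARY = "redis-primary.example.internal"
NODE_BY_AZ = {
    "az-a": PRIMARY,  # the primary node lives in AZ A in this sketch
    "az-b": "redis-replica-b.example.internal",
    "az-c": "redis-replica-c.example.internal",
}

# A small, non-exhaustive set of read-only commands for illustration.
READ_COMMANDS = {"GET", "MGET", "HGET", "HGETALL", "EXISTS", "TTL"}


@dataclass(frozen=True)
class RedisEndpoints:
    write_host: str
    read_host: str


def endpoints_for(az: str) -> RedisEndpoints:
    """Writes always go to the primary; reads stay in the caller's AZ."""
    # Fall back to the primary if the AZ has no local node configured.
    read_host = NODE_BY_AZ.get(az, PRIMARY)
    return RedisEndpoints(write_host=PRIMARY, read_host=read_host)


def host_for(command: str, az: str) -> str:
    """Pick the host a single command should be sent to."""
    endpoints = endpoints_for(az)
    if command.upper() in READ_COMMANDS:
        return endpoints.read_host
    return endpoints.write_host


if __name__ == "__main__":
    print(host_for("GET", "az-c"))  # same-zone replica
    print(host_for("SET", "az-c"))  # primary
```

With this rule, only writes (and reads from an AZ without a local node) cross zone boundaries, which is where the transfer savings come from.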
The new design seemingly loses the redundancy of the old one. Since there is no backup for the read replica within an AZ, the replica becomes a single point of failure for the application in the same zone. We cover the resilience requirement with a feature of the Envoy proxy: load balancing priority levels.
The Envoy proxy health-checks two Redis endpoints: the Redis endpoint located in the same zone and the Reader Endpoint of the ElastiCache instance. The same-zone endpoint takes priority as long as it passes the health check; otherwise, Envoy fails the application's ElastiCache connection over to the Reader Endpoint. More on Envoy proxy will follow in another story.
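The failover behaviour described above can be modelled in a few lines. This is a simplified sketch of the decision only, not Envoy configuration and not how Envoy is implemented: Envoy shifts traffic between priority levels based on the share of healthy hosts, which with a single host per level reduces to the all-or-nothing choice shown here. The host names and function names are hypothetical.

```python
import socket
from typing import Callable, Optional, Sequence


def tcp_health_check(host: str, port: int = 6379, timeout: float = 0.5) -> bool:
    """Treat an endpoint as healthy if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_by_priority(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool] = tcp_health_check,
) -> Optional[str]:
    """Return the first healthy endpoint; earlier entries have higher priority."""
    for host in endpoints:
        if is_healthy(host):
            return host
    return None  # nothing is healthy


if __name__ == "__main__":
    # Priority 0: same-zone node. Priority 1: the ElastiCache Reader Endpoint.
    candidates = [
        "redis-replica-c.example.internal",
        "redis-reader.example.internal",
    ]
    # Simulate the same-zone replica being down.
    down = {"redis-replica-c.example.internal"}
    print(pick_by_priority(candidates, lambda host: host not in down))
```

Passing the health check in as a function keeps the selection logic separate from the probing, so the failover rule can be exercised without any network access.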
Result
In AWS Cost Explorer, the daily spending on “EC2: Data Transfer - Inter AZ” dropped significantly from the first day we deployed the solution to production. As the graph below shows, the cost went from about $25/day to around $3/day, a reduction of almost 88%. Overall, we spend $1,590/year on the extra replica to save about $800/month in transfer cost, an 80% reduction of the $1,000/month inter-zone bill and a net saving of roughly $8,000/year.
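The arithmetic behind these figures is simple enough to check in a few lines. The sketch below uses only the numbers quoted in this section; the variable and function names are ours.

```python
def pct_reduction(before, after):
    """Percentage drop from before to after."""
    return (before - after) * 100 / before


# Figures quoted in this section, in USD.
DAILY_BEFORE = 25             # inter-AZ transfer cost per day, before
DAILY_AFTER = 3               # inter-AZ transfer cost per day, after
MONTHLY_SAVING = 800          # transfer cost saved per month
REPLICA_COST_PER_YEAR = 1590  # extra read replica in AZ C

daily_drop = pct_reduction(DAILY_BEFORE, DAILY_AFTER)
gross_annual_saving = MONTHLY_SAVING * 12
net_annual_saving = gross_annual_saving - REPLICA_COST_PER_YEAR

if __name__ == "__main__":
    print(f"Daily transfer cost reduction: {daily_drop:.0f}%")
    print(f"Gross annual saving: ${gross_annual_saving}")
    print(f"Net annual saving after the replica: ${net_annual_saving}")
```

At $800/month saved, the extra replica pays for itself in about two months.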
In our APM tool, the measured median response time also dropped by 60%.
Before
After
Conclusion
Redis as a NoSQL data store does solve some unique use cases. Realizing its full potential takes effort from multiple parties, including the application, the third-party libraries, and the hosting platform. Unlike a traditional RDBMS, which provides a rigid and solid performance model, a NoSQL system demands much more effort from its users to optimize the integration between components. A seasoned architect on the team who provides guidance and coordination can bring about a sea change.