Elastic Beanstalk vs APIG, Part 1 in an AWS Showdown

We recently had our first customer-facing Amazon Web Services (AWS) infrastructure rollout for the main production websites here at Zappos. For reasons that are outside the scope of this blog post, every product page at zappos.com and 6pm.com now makes an AJAX call to a new service (hosted in AWS) to check for any last-minute availability and pricing changes for the product.

Of course, since this call is made on each and every one of our product page visits, we need to make sure it happens as quickly as possible. Latency is of key importance, as we’ve seen a direct correlation between slower speeds and lower sales. In general, a sustained speed difference of 100ms will result in a loss of revenue of about $10,000,000 for the year. I don’t know if this applies to other websites, but it’s definitely something we’ve seen at Zappos.

This latency-to-revenue correlation makes latency one of our most important metrics to track when we release a new feature. We were introducing a call that had the potential to negatively affect our bottom line, so we needed to make sure it was as efficient as we could make it.

The most obvious and most effective performance boost you can make for your website is to move the content closer to the users by distributing it through a Content Delivery Network. This works great for static content, where it’s easy to make copies all over the United States, but it falls short when you are calling a RESTful API whose results differ depending on when you call the service.

We needed a way to distribute the installation of the web service and the underlying database to different regions of the United States.

Enter AWS and its separate regions.

One of the benefits of being owned by Amazon is that we are encouraged to use AWS as much as we need to. Every engineer has access to whatever AWS resources they would like to use. This is true for the simplest development purposes as well as the largest production needs. The red tape that normally exists in companies for allocating resources is minimized at Amazon for engineers needing AWS resources, which leaves our engineers free to focus on the best solution.

When I came to the project, my peer had already created a DynamoDB table that contained all the pricing and availability information I needed to serve to the product pages. The database was located in the AWS us-west-2 (Oregon) region.

My first inclination for any solution is to be as simple as possible. Start with a simple infrastructure and see if it fits your needs, then expand as you need to. A more complicated infrastructure has significant costs in both reliability and maintenance, so it’s important to make sure you are getting the proper performance gains to justify the added complexity.

In the case of our new pricing and availability service, we had a DynamoDB table in Oregon and we wanted to see what distributing a service to different US regions did for performance.

AWS had announced in June a new API Gateway (APIG) service built for creating APIs, and I decided to give it a try for our new service. The APIG would serve as a standard API interface, while the actual code logic would live in an AWS Lambda function running Node.js. AWS did its job in making the configuration and deployment of such a setup very straightforward.

So I built identical APIG services in the us-west-2 (Oregon) and us-east-1 (Virginia) AWS regions. I then rolled out the same Node.js code to both regions, with one little exception. For simplicity’s sake, I changed the Lambda code in us-east-1 to access the DynamoDB table in us-west-2. We were going to see what splitting the server code across regions, while leaving the database in a single shared location, would do to performance.
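
To make a comparison like this repeatable, a small timing harness helps. The following is only a minimal sketch in Python (the service itself ran on Node.js), and every name in it is hypothetical: `measure_latency_ms` times any zero-argument callable, and `median_gap_ms` reports how much slower one callable is than another. In a real test the callables would issue HTTP requests to the two regional endpoints; here they are stand-ins that simply sleep.

```python
import statistics
import time
from typing import Callable, List


def measure_latency_ms(call: Callable[[], object], samples: int = 20) -> List[float]:
    """Run `call` `samples` times and return each duration in milliseconds."""
    durations = []
    for _ in range(samples):
        start = time.perf_counter()
        call()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def median_gap_ms(
    baseline: Callable[[], object],
    candidate: Callable[[], object],
    samples: int = 20,
) -> float:
    """Median latency of `candidate` minus median latency of `baseline`."""
    baseline_median = statistics.median(measure_latency_ms(baseline, samples))
    candidate_median = statistics.median(measure_latency_ms(candidate, samples))
    return candidate_median - baseline_median


# Hypothetical stand-ins for real requests to the two regional services.
def fake_west_call() -> None:
    time.sleep(0.002)


def fake_east_call() -> None:
    time.sleep(0.03)


if __name__ == "__main__":
    gap = median_gap_ms(fake_west_call, fake_east_call, samples=5)
    print(f"east is slower than west by about {gap:.0f} ms")
```

Using the median rather than the mean keeps a single slow request from skewing the comparison.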

As you would expect, hitting the APIG service in us-east-1 was about 200ms slower than hitting the APIG service in us-west-2, no matter where you were in the US. Users in the eastern US states saw only negligible speedups from having the APIG and Lambda services located closer to them, because every request still had to cross the country to reach the database. This translates into a possible $10,000,000 to $20,000,000 in lost revenue for the year. It wasn’t good enough to have the services closer to the end users; we also needed to have the data closer.
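
As a rough illustration of where a number like that comes from, here is a back-of-the-envelope sketch in Python. It assumes the loss scales linearly with latency, starting from the figure of about $10,000,000 per year per sustained 100ms quoted earlier; that linearity is my simplification, not a measured relationship, and the function name is hypothetical.

```python
# Dollars of revenue lost per year for each sustained 100 ms of added latency,
# taken from the rule of thumb quoted earlier in the post.
ANNUAL_LOSS_PER_100_MS = 10_000_000


def estimated_annual_loss(
    extra_latency_ms: float,
    loss_per_100_ms: float = ANNUAL_LOSS_PER_100_MS,
) -> float:
    """Estimate yearly revenue loss, assuming it grows linearly with latency."""
    if extra_latency_ms < 0:
        raise ValueError("extra_latency_ms must be non-negative")
    return extra_latency_ms / 100.0 * loss_per_100_ms


if __name__ == "__main__":
    for ms in (100, 200):
        print(f"{ms} ms slower: about ${estimated_annual_loss(ms):,.0f} per year")
```

Under this linear assumption, a 200ms penalty lands at the upper end of the range above; the lower end allows for the relationship being less than linear.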

Luckily, AWS provides a cross-region replication solution using DynamoDB Streams and an Elastic Beanstalk Worker Environment. Running through the walkthrough quickly got us a replicated database that would remain synchronized between the east and west regions.

  • NOTE: I actually had to run the walkthrough twice. The first time, it replicated the database just fine, but failed to keep the two copies synchronized. Upon further investigation, I realized I had set up the Elastic Beanstalk environment with instance types that were too small (t1.micro), so I had to recreate the replication.

Changing the Lambda code in the us-east-1 region so that it read from the local database instead of the one across the country gave us the kind of performance gains we were expecting for the states in the eastern United States.
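
The change itself amounts to choosing the DynamoDB region based on where the function is running. The sketch below shows one way that choice could look. It is written in Python for illustration (our Lambda code was Node.js), the function and constant names are hypothetical, and it assumes the code reads the `AWS_REGION` environment variable that the Lambda runtime sets. Regions without a replica fall back to the original table in us-west-2.

```python
import os
from typing import Mapping, Optional

# Regions that hold a copy of the table (assumption: the two from this post).
REPLICA_REGIONS = ("us-west-2", "us-east-1")
# Region of the original table, used when no local replica exists.
PRIMARY_REGION = "us-west-2"


def pick_table_region(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the local region if it has a replica, else the primary region."""
    env = os.environ if env is None else env
    local_region = env.get("AWS_REGION", "")
    return local_region if local_region in REPLICA_REGIONS else PRIMARY_REGION


def dynamodb_endpoint(region: str) -> str:
    """Build the regional DynamoDB endpoint URL for `region`."""
    return f"https://dynamodb.{region}.amazonaws.com"


if __name__ == "__main__":
    region = pick_table_region()
    print(region, dynamodb_endpoint(region))
```

Passing the environment in as a parameter keeps the selection logic easy to test without deploying anything.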

Having just saved the company a potential $10,000,000 in revenue, I needed to see if there was more that could be done.

In part 2, I’ll explore the performance difference in our RESTful API between the HTTP and HTTPS protocols and how we worked around it.

Then, finally, in part 3 I’ll explain how we let the end-user know which region they should be using for the service call.