Elastic Beanstalk vs APIG part 2: HTTP vs HTTPS

In part 1 of this showdown, I discussed how we, at Zappos, replicated a key website service to different AWS regions for performance purposes. In this part, I’ll be discussing how and why we decided to not use AWS API Gateway (APIG) and instead opted to use our own Node.js application running on Elastic Beanstalk.

We first implemented our service with an APIG front end backed by an AWS Lambda function that did the processing. The APIG was very simple to set up, and with some minimal configuration we had an API that did what we needed it to do (which was read a record from a database, do some minimal calculations, and return a JSON object to the caller).

We ran a few tests to confirm we were getting the performance gains we expected when hitting the east instance from eastern states and the west instance from western states. Our tests showed we were saving about 80 milliseconds of round-trip time by hitting the correct regional service.

It was at this point that a seasoned and respected engineer here asked about switching from HTTPS to HTTP. The data we serve is product specific, with no customer data or information at all. As such, if we could move to HTTP, it might save us a few more milliseconds.

EFF’s Web Privacy Campaign

Unfortunately, the APIG will only do HTTPS. To me, this made sense, because there is a strong push by web giants to have all communications on the Internet be encrypted. I know that one of the initial concerns about HTTPS back in the day was the increased load on your servers: the handshake requires some heavy public/private key (asymmetric) cryptography, followed by lighter symmetric encryption for the rest of the session. In the early web days, these extra calculations had a noticeable impact on your web server, so it made sense to encrypt only what absolutely needed to be encrypted.

But it’s been years since that was true. With the increase in processing power and special instructions on modern processors dedicated to encryption (such as AES-NI), the computational overhead of an HTTPS connection is negligible.

So when this respected engineer advised me to see what we could do about HTTP versus HTTPS, my initial reaction was to question the need. Surely, if Google and other internet giants are pushing for all communication to be HTTPS, then any performance hit must be negligible, right?

But I respected this engineer as someone more experienced than I on the subject. I decided I needed to do a side-by-side comparison to see if it was worth having a more complicated architecture just to get an HTTP performance boost.

At this point, my respected engineering colleague suggested I engage the big guns. By big guns, I mean Zappos Tech Operational Engineering. In my 25+ year career, I’ve never worked with a more impressive team of TECH OPS. Not as a consultant for many Fortune 500 companies. Not even as an engineer at a large tech SaaS company.

Not only are these guys good, but they are genuinely helpful and want to make a big difference. In the past, I’ve worked with teams of really competent engineers who take the attitude of “not invented by me,” so they tend to just shoot down any idea that may cause them more work. In contrast, I realized our TECH OPS team was different when, taking a “forgiveness over permission” approach, a small hack-a-thon team I was on was drilling holes in the wall and running power over Ethernet with some homemade cables. One of the OPS engineers came by and asked what we were doing. We explained our hack-a-thon project (a monitor that detected when the bathroom stalls were occupied). When we were done explaining it, the engineer said, “Cool. When you’re done with your hack-a-thon and you want the setup to be more professional, just give me a call.”

Not what I was expecting. That was early in my tenure here at Zappos, and since then the exploits of these guys have become legendary. If I could tell you how they dealt with a DoS attack so massive it almost had our ISP cutting our connection, you would realize why I don’t think it’s a stretch to call these guys superheroes.

So, for my AWS experiment, I raised the bat signal and called in TECH OPS. I knew they’d have some good insights into what I was trying to accomplish and how I should be accomplishing it.

They didn’t disappoint. They confirmed that the APIG indeed did not have HTTP capabilities, and that if I were going to try out HTTP, I should set up an Elastic Beanstalk environment to pass the API requests on to the Lambda service I had already created.

They then took it a step further and asked about our plans for having end users hit the proper regional instance based on their own geolocation. We had already designed a client-based solution (which I’ll explain in more detail in part 3), and the TECH OPS guys pointed out that AWS’s DNS service, Route 53, has a built-in latency-based routing policy that lets you specify different DNS CNAME records for different AWS regions. When the client does a DNS lookup, it gets the proper host based on AWS’s latency measurements for its location.
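For illustration, latency-based routing boils down to creating the same record name once per region, distinguished by a set identifier. A hypothetical Route 53 change batch (the domain and Elastic Beanstalk hostnames below are placeholders) might look like:

```json
{
  "Comment": "Latency-based routing: same name, one CNAME per region",
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "CNAME",
        "SetIdentifier": "us-east-1",
        "Region": "us-east-1",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "east-env.us-east-1.elasticbeanstalk.com" }]
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "CNAME",
        "SetIdentifier": "us-west-2",
        "Region": "us-west-2",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "west-env.us-west-2.elasticbeanstalk.com" }]
      }
    }
  ]
}
```

Clients everywhere look up the same name, and Route 53 answers with whichever region it measures as closest in latency.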

So, I went forth and implemented the Elastic Beanstalk environment with an HTTP interface to get around the HTTPS requirement for APIG and then ran some tests. On average, we saw a performance improvement of about 30–70 milliseconds per product page load when using the HTTP solution instead of the APIG HTTPS solution. As I mentioned in part 1, that would mean a potential $5,000,000 in revenue, which was well worth the added complexity.

So it seems that for our service, with its small payload, the overhead of the HTTPS handshake costs too much to justify. HTTP/2 might help a bit with this, but probably not much, since the biggest gains from HTTP/2 come from keeping connections open to reduce that overhead. I’ll have to run some tests when HTTP/2 is ready to see whether the header compression and binary format make much of a difference, but I doubt it.

In the final part, I’ll discuss our client-based solution for determining how to assign the end users to the correct AWS region based on existing traffic.